# Sorting text regions as columns or as rows

**Note**: This tutorial follows and depends on the tutorial on [Reading PageXML files from archives](./Demo-reading-pagexml-files-from-archive.ipynb). It assumes you have downloaded the PageXML archives and derived line format files from them. 

Text in documents can have different reading orders and a single document can have multiple valid reading orders. Depending on the reading order you want, you can sort the `TextRegion`s and `TextLine`s of a document as columns (columns from left to right, lines within columns from top to bottom) or as rows (horizontally overlapping lines from left to right, and sets of horizontally overlapping lines from top to bottom).

The `pagexml_helper` module supports both types of sorting.
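To make the difference between the two reading orders concrete, here is a minimal sketch on toy data. It does not use `pagexml` at all: the `Box` class, the `column_split` parameter and the `row_tolerance` parameter are hypothetical names introduced for illustration, and grouping rows by the distance between line tops is a simplification of checking whether lines overlap.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a line's bounding box (not a pagexml class)."""
    text: str
    left: int
    right: int
    top: int
    bottom: int


def sort_as_columns(boxes, column_split):
    """Column order: left column first, each column read top to bottom.

    Assumes a single, known x-coordinate separates two columns.
    """
    return sorted(boxes, key=lambda b: (b.left >= column_split, b.top))


def sort_as_rows(boxes, row_tolerance):
    """Row order: group lines by similar top, read each group left to right."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.top):
        # Compare against the first (topmost) line of the current row
        if rows and box.top - rows[-1][0].top <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b.left)]


boxes = [
    Box('left 1', 688, 2321, 418, 544),
    Box('right 1', 2882, 4503, 427, 542),
    Box('left 2', 685, 2329, 509, 615),
    Box('right 2', 2892, 4508, 507, 612),
]

print([b.text for b in sort_as_columns(boxes, column_split=2600)])
# ['left 1', 'left 2', 'right 1', 'right 2']
print([b.text for b in sort_as_rows(boxes, row_tolerance=40)])
# ['left 1', 'right 1', 'left 2', 'right 2']
```

The same four lines come out in a different order depending on whether the page is read as two columns or as two rows.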

In [1]:
%load_ext autoreload
%autoreload 2


We start by accessing the PageXML files in the zip archive downloaded in the tutorial on [Reading PageXML files from archives](./Demo-reading-pagexml-files-from-archive.ipynb).

In [5]:
from pagexml.analysis.stats import get_doc_stats
from pagexml.parser import parse_pagexml_files_from_archive

na_archive_file = '../data/HTR results 1.05.06 PAGE.zip'

scans = parse_pagexml_files_from_archive(na_archive_file)

Below, we pick two scans from this archive: one with two columns of text, the other containing a table with many columns and rows. The two functions below turn the scan IDs into URLs so that we can view the scan images.

In [68]:
# https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1/file/NL-HaNA_1.05.06_1_0001?eadID=1.05.06&unitID=1&query=1.05.06
def parse_doc_id(doc_id):
    return {
        'archive_id': doc_id.split('_')[1],
        'inv_num': doc_id.split('_')[2],
        'scan_num': int(doc_id.split('_')[3].replace('.jpg', ''))        
    }


def make_iiif_url_from_doc_id(doc_id):
    base_url = 'https://www.nationaalarchief.nl/onderzoeken/archief'
    doc_info = parse_doc_id(doc_id)
    return f"{base_url}/{doc_info['archive_id']}/invnr/{doc_info['inv_num']}/file/{doc_id[:-4]}?eadId={doc_info['archive_id']}&unitID={doc_info['inv_num']}"


two_column_scan_id = 'NL-HaNA_1.05.06_1005_0013.jpg'
table_scan_id = 'NL-HaNA_1.05.06_103_0001.jpg'

print(make_iiif_url_from_doc_id(two_column_scan_id))
print(make_iiif_url_from_doc_id(table_scan_id))


https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1005/file/NL-HaNA_1.05.06_1005_0013?eadId=1.05.06&unitID=1005
https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/103/file/NL-HaNA_1.05.06_103_0001?eadId=1.05.06&unitID=103


Next, we extract the scans from the archive.

In [74]:
scans = parse_pagexml_files_from_archive(na_archive_file)

two_column_scan_id = 'NL-HaNA_1.05.06_1005_0013.jpg'
# Use a different table scan than the one whose URL was printed above
table_scan_id = 'NL-HaNA_1.05.06_114_0003.jpg'

two_column_scan = None
table_scan = None

for scan in scans:
    if scan.id == two_column_scan_id:
        two_column_scan = scan
    if scan.id == table_scan_id:
        table_scan = scan
    if table_scan is not None and two_column_scan is not None:
        break

NL-HaNA_1.05.06_103_0001.jpg {'lines': 410, 'words': 329, 'text_regions': 161, 'columns': 0, 'extra': 0, 'pages': 0}
https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/103/file/NL-HaNA_1.05.06_103_0001?eadId=1.05.06&unitID=103
NL-HaNA_1.05.06_114_0002.jpg {'lines': 517, 'words': 753, 'text_regions': 15, 'columns': 0, 'extra': 0, 'pages': 0}
https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/114/file/NL-HaNA_1.05.06_114_0002?eadId=1.05.06&unitID=114
NL-HaNA_1.05.06_114_0003.jpg {'lines': 560, 'words': 750, 'text_regions': 29, 'columns': 0, 'extra': 0, 'pages': 0}
https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/114/file/NL-HaNA_1.05.06_114_0003?eadId=1.05.06&unitID=114


## Pretty printing

In [29]:
from pagexml.helper.pagexml_helper import pretty_print_textregion

pretty_print_textregion(two_column_scan)

                          22                                                                       23
kunnen brengen. Dit is best te zien, wanneer men de on-                   geld en victualie, des maands 2165 guldens, dus in 27 maan-
kosten, die de Oostindische Compagnie moet doen, om de                   den 25564 guldens; by gevolg kosten de dertig schepen
Waaren uit Europa na Batavia, en weer anderen van daar                   aan maandgeld en victualie in 21 maanden „E1366950
na Europa te zenden, vergelykt met de onkosten, die de                     zoo menrekend dat ieder tregat-schip gl. boooc
Westindische Compagnie volgens dit Project zou behoeven                   kost, en de Intrest van dit Capitaal rekend tegen
te doen, om, als gezegd is, haare Waaren uit Europa over                   3:½ per cent in ’t jaar, zou ieder schip jaarlyks aan
de Kust van Magellaan na ’t Land van 5. Esprit, en van daar                   Interest kosten gl. 2100, en dus in 21 maanden
weer anderen 

This clearly shows the two columns side by side.

Now for the scan with the table:

In [81]:
pretty_print_textregion(table_scan)

                                                                 Gedestineert
                                                 bale
                                                                                                                                                             Gedestineerd naar
                                        Vate             Bale                                                                                   ate balen
 datum
                                            Ort     Vate                                                                         vate Ortbale
         Naander schoepNaam der schip                                           waar datum Naamder scheepsNaamder schip per
                                        ZuykZuikCoffenCosfiCatoe                                                                   ZuykeJunykeCoffy Coffy Oatoe
                                                                                             Aan

 1787
  

Hmm. This needs a different approach to make pretty printing work.

## Sorting by column or by row

It is possible to sort the `TextLine`s as columns or as rows.

### Sorting by column

We start with sorting by column:


In [24]:
from pagexml.helper.pagexml_helper import sort_lines_in_column_reading_order

for line in sort_lines_in_column_reading_order(two_column_scan):
    print(f'{line.coords.left: >4}-{line.coords.right}\t{line.coords.top: >4}-{line.coords.bottom}\t{line.text}')

1440-1583	 369-462	22
 688-2321	 418-544	kunnen brengen. Dit is best te zien, wanneer men de on-
 685-2329	 509-615	kosten, die de Oostindische Compagnie moet doen, om de
 688-2313	 577-682	Waaren uit Europa na Batavia, en weer anderen van daar
 683-2320	 650-755	na Europa te zenden, vergelykt met de onkosten, die de
 691-2332	 720-821	Westindische Compagnie volgens dit Project zou behoeven
 694-2328	 791-893	te doen, om, als gezegd is, haare Waaren uit Europa over
 695-2331	 862-955	de Kust van Magellaan na ’t Land van 5. Esprit, en van daar
 700-2317	 932-1034	weer anderen na Europate zenden: want, wanneer men, by
 704-2329	 994-1095	voorheeld, verondersteld, dat die beide Compagnien jaar-
 686-2332	1057-1163	lyks dertig scheepslaadingen na de voorschreeve Plaatsen
 689-2330	1134-1228	uitzenden, en weer t'huisbekomen, zoo blykt, dat de onkos-
 695-2327	1206-1302	ten, die de Westindische Compagnie ten dien einde zou be-
 694-2334	1278-1367	hoeven aan te wenden, verscheide Ponnen Gonds

In [82]:
for line in sort_lines_in_column_reading_order(table_scan):
    print(f'{line.coords.left: >4}-{line.coords.right}\t{line.coords.top: >4}-{line.coords.bottom}\t{line.text}')

  94-277	 132-247	datum
 100-229	 356-418	1787
  95-282	 411-505	Maart
 236-283	 510-576	21
 228-298	 588-665	31
  78-206	 645-728	Apri
 241-283	 665-735	10
 243-293	 737-822	2
 240-310	 919-977	24
  75-193	 966-1057	Mai
  73-211	1184-1290	Juni
 240-296	1399-1470	10
 231-306	1460-1546	28
  63-169	1535-1620	July
 246-293	1720-1793	11
 236-321	1795-1868	23
 248-311	1952-2017	27
 125-234	2016-2090	1788
  64-322	2071-2168	Maart 26
  63-236	2230-2310	April
 248-311	2408-2481	10
 253-316	2551-2621	17
 257-311	2731-2809	21
  63-193	2779-2857	May
 272-325	2805-2867	5
 254-324	2854-2947	13
 264-328	2930-2998	14
 264-307	2996-3059	11
 248-311	3115-3167	21
  62-176	3144-3224	Iuny
 257-323	3164-3211	28
  70-169	3204-3288	Rui
 271-316	3211-3274	15
 264-328	3272-3323	28
 262-318	3318-3378	30
  68-168	3333-3409	sept
 248-318	3440-3496	19
 327-712	 159-256	Naander schoep
 320-615	 377-503	decourier
 316-648	 496-591	de zee Compas
 322-694	 576-666	Esseg: Societeit
 320-600	 646-743	de Vreede
 339-589	

### Sorting by row

In [25]:
from pagexml.helper.pagexml_helper import sort_lines_in_row_reading_order

for line in sort_lines_in_row_reading_order(two_column_scan):
    print(f'{line.coords.left: >4}-{line.coords.right}\t{line.coords.top: >4}-{line.coords.bottom}\t{line.text}')

1440-1583	 369-462	22
3628-3723	 366-465	23
 688-2321	 418-544	kunnen brengen. Dit is best te zien, wanneer men de on-
2882-4503	 427-542	geld en victualie, des maands 2165 guldens, dus in 27 maan-
 685-2329	 509-615	kosten, die de Oostindische Compagnie moet doen, om de
2892-4508	 507-612	den 25564 guldens; by gevolg kosten de dertig schepen
 688-2313	 577-682	Waaren uit Europa na Batavia, en weer anderen van daar
2881-4037	 584-676	aan maandgeld en victualie in 21 maanden
4068-4158	 582-661	„
4167-4255	 588-663	E
4253-4504	 589-671	1366950
 683-2320	 650-755	na Europa te zenden, vergelykt met de onkosten, die de
2942-4221	 651-753	zoo menrekend dat ieder tregat-schip gl. boooc
 691-2332	 720-821	Westindische Compagnie volgens dit Project zou behoeven
2880-4224	 719-820	kost, en de Intrest van dit Capitaal rekend tegen
 694-2328	 791-893	te doen, om, als gezegd is, haare Waaren uit Europa over
2890-4247	 784-888	3:½ per cent in ’t jaar, zou ieder schip jaarlyks aan
 695-2331	 862-955	

In [80]:
prev_line = None
for line in sort_lines_in_row_reading_order(table_scan):
    # A line that starts further left than the previous one begins a new row,
    # so print a blank line to separate the rows visually
    if prev_line and prev_line.coords.left > line.coords.left:
        print()
    print(f'{line.coords.left: >4}-{line.coords.right}\t{line.coords.top: >4}-{line.coords.bottom}\t{line.text}')
    prev_line = line

1827-2259	  85-262	Gedestineert

1383-1472	 115-224	bale
4283-4834	 117-267	Gedestineerd naar

1144-1254	 128-217	Vate
1624-1732	 119-215	Bale
3967-4049	 126-211	ate
4095-4202	 123-210	balen

  94-277	 132-247	datum
1262-1348	 136-225	Ort
1507-1592	 133-227	Vate
3575-3678	 141-220	vate
3717-3796	 141-224	Ort
3822-3932	 141-213	bale

 327-712	 159-256	Naander schoep
 721-1110	 168-257	Naam der schip
2263-2448	 173-250	waar
2501-2733	 163-259	datum
2771-3165	 167-265	Naamder scheeps
3171-3561	 172-272	Naamder schip per

1139-1247	 209-277	Zuyk
1254-1364	 224-289	Zuik
1375-1484	 217-299	Coffen
1501-1609	 221-300	Cosfi
1623-1735	 210-289	Catoe
3547-3702	 206-294	Zuyke
3721-3808	 211-302	Junyke
3826-3920	 212-301	Coffy
3949-4063	 203-296	Coffy
4091-4190	 204-289	Oatoe

2556-2752	 323-469	Aan

 100-229	 356-418	1787
 744-1112	 367-587	W: g den Boet

 320-615	 377-503	decourier
2248-2454	 383-482	Amstf
3167-3696	 383-482	nrestaande Maertans 2294

2775-3160	 387-572	: Werdenbende

2759-3157	 3

### Splitting documents by columns

When there is a horizontal gap between adjacent lines, and that gap recurs at the same position in each row of lines, it marks the boundary between two columns. This can be used to identify columns of text lines.
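The idea of finding columns from recurring gaps can be sketched on toy data as follows. This is not the implementation used by `pagexml`: the function `find_column_ranges` is a hypothetical helper, and its `gap_threshold` is simply a pixel distance here, which may differ from how the library interprets its parameter of the same name. The sketch merges the horizontal extents of all lines and starts a new column wherever no line covers a stretch wider than the threshold.

```python
def find_column_ranges(intervals, gap_threshold):
    """Merge (left, right) line extents into column ranges.

    A new column starts whenever the next line begins more than
    gap_threshold pixels to the right of everything merged so far.
    """
    ranges = []
    for left, right in sorted(intervals):
        if ranges and left - ranges[-1]['end'] <= gap_threshold:
            # Line overlaps or nearly touches the current column: extend it
            ranges[-1]['end'] = max(ranges[-1]['end'], right)
        else:
            ranges.append({'start': left, 'end': right})
    return ranges


# Horizontal extents (left, right) of a few lines, across all rows
line_intervals = [(94, 277), (236, 283), (327, 712), (2501, 2733), (2771, 3165)]

print(find_column_ranges(line_intervals, gap_threshold=50))
# [{'start': 94, 'end': 712}, {'start': 2501, 'end': 3165}]
```

Because the extents of all rows are merged together, only gaps that stay empty in every row survive as column boundaries.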

In [88]:
from pagexml.column_parser import split_lines_on_column_gaps


split_lines_on_column_gaps(table_scan, gap_threshold=2, debug=True)

COLUMN RANGES: [{'start': 62, 'end': 2477}, {'start': 2481, 'end': 4869}]


[PageXMLColumn(
 	id=NL-HaNA_1.05.06_114_0003.jpg-column-62-85-2415-3525, 
 	type=['pagexml_doc', 'text_region', 'column', 'scan'], 
 	stats={"lines": 344, "words": 479, "text_regions": 0}
 ),
 PageXMLColumn(
 	id=NL-HaNA_1.05.06_114_0003.jpg-column-2481-117-2388-3173, 
 	type=['pagexml_doc', 'text_region', 'column', 'scan'], 
 	stats={"lines": 216, "words": 271, "text_regions": 0}
 )]

In [87]:
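# NOTE: table_scans is not defined in the cells shown above; it is assumed to be
# a list of parsed table scans collected earlier in the notebook session
for scan in table_scans: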
    columns = split_lines_on_column_gaps(scan, gap_threshold=2, debug=True)
    print(len(columns))

COLUMN RANGES: [{'start': 45, 'end': 3176}, {'start': 3192, 'end': 3386}, {'start': 3390, 'end': 3502}, {'start': 3556, 'end': 3880}, {'start': 3884, 'end': 4212}, {'start': 4239, 'end': 4741}, {'start': 4746, 'end': 4928}, {'start': 4942, 'end': 5422}, {'start': 5467, 'end': 6050}, {'start': 6053, 'end': 6381}, {'start': 6398, 'end': 7188}, {'start': 7281, 'end': 7638}, {'start': 7866, 'end': 8228}, {'start': 8270, 'end': 8531}]
14
COLUMN RANGES: [{'start': 43, 'end': 2447}, {'start': 2459, 'end': 4859}]
2
COLUMN RANGES: [{'start': 62, 'end': 2477}, {'start': 2481, 'end': 4869}]
2
