# Restructuring PageXML Documents

It can be useful to restructure the content of a PageXML document, that is, to group `TextRegion`s or `TextLine`s into new `TextRegion`s or `Column`s. Layout analysis typically only records which text lines belong together in a text region. If you know more about the kinds of scans in a dataset and how they are structured, you may want to use that knowledge to make the PageXML document reflect that structure.

There can be many reasons to restructure the text content of a PageXML document:

- Many scans are based on openings of books, in which case they represent two pages. Splitting a scan into pages allows you to analyse, classify and process individual pages instead of combinations of pages.
- Text regions that are horizontally aligned can be grouped into columns, which allows you to process a document column by column.
- A scan can have main text elements and elements that are part of the margin, e.g. headers, footers, marginalia, etc. Separating them allows focusing the analysis on only the main text or only the margins. Moreover, when reconstructing running text across multiple scans or pages, having access to the main text elements as columns makes it easier to form paragraphs.
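
As an illustration of the first point, here is a minimal sketch of how text regions on a two-page opening could be assigned to a left or right page. It is not part of the `pagexml` library: the function `split_regions_by_page`, the bounding boxes and the scan width are all hypothetical, and it assumes the book gutter sits at the horizontal middle of the scan.

```python
from typing import Dict, List, Tuple

# A region is reduced to its bounding box: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


def split_regions_by_page(regions: Dict[str, Box], scan_width: int) -> Dict[str, List[str]]:
    """Assign each region to the left or right page based on its horizontal centre."""
    middle = scan_width / 2  # assumption: the gutter is at the middle of the scan
    pages: Dict[str, List[str]] = {"left": [], "right": []}
    for region_id, (x, _y, width, _height) in regions.items():
        centre = x + width / 2
        pages["left" if centre < middle else "right"].append(region_id)
    return pages


# Hypothetical bounding boxes for a two-page opening that is 5000 pixels wide.
regions = {"r1": (588, 318, 1853, 3380), "r2": (2975, 349, 1841, 3271)}
pages = split_regions_by_page(regions, scan_width=5000)
print(pages)
```

In real scans the gutter is often off-centre, so a fixed midpoint is only a starting point.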

As an example of a text archive, this tutorial uses a dataset provided by the [National Archives of the Netherlands](https://www.nationaalarchief.nl/en) (NA) via their HTR repository on [Zenodo](https://zenodo.org/): https://zenodo.org/record/6414086#.Y8Elk-zMIUo. The repository contains many other HTR PageXML datasets that NA made available.

The dataset contains HTR output in [PageXML](https://www.primaresearch.org/tools/PAGELibraries) format of scans from the following archive: 
- (medium) _Verspreide West-Indische stukken, 1614-1875, 1.05.06, 1-1413_ ([EAD](https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/%401?query=1.05.06&search-type=inventory)). This is an archive maintained by the [Nationaal Archief](https://www.nationaalarchief.nl/en). 

You can download the dataset via the following URL:
- https://zenodo.org/record/6414086/files/HTR%20results%201.05.06%20PAGE.zip?download=1


In [1]:
%reload_ext autoreload
%autoreload 2


## Extracting PageXML files from Zip/Tar files

The example zip file consists of smaller zip files, each of which contains a number of PageXML files.
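
To make the nested structure concrete, here is a minimal sketch that walks a zip file containing other zip files, using only Python's standard `zipfile` module. The helper `iter_nested_xml` is hypothetical and is not how the `pagexml` library is known to read archives. It builds a tiny archive in memory so that it runs without the dataset.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_nested_xml(archive: zipfile.ZipFile, prefix: str = "") -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) for every .xml member, descending into inner zip files."""
    for name in archive.namelist():
        if name.endswith("/"):
            continue  # skip directory entries
        data = archive.read(name)
        if name.lower().endswith(".zip"):
            # Open the inner archive from memory instead of unpacking it to disk.
            with zipfile.ZipFile(io.BytesIO(data)) as inner:
                yield from iter_nested_xml(inner, prefix=f"{prefix}{name}/")
        elif name.lower().endswith(".xml"):
            yield f"{prefix}{name}", data


# Build a tiny nested archive in memory (hypothetical file names).
inner_buffer = io.BytesIO()
with zipfile.ZipFile(inner_buffer, "w") as inner_zip:
    inner_zip.writestr("page_0001.xml", "<PcGts/>")
outer_buffer = io.BytesIO()
with zipfile.ZipFile(outer_buffer, "w") as outer_zip:
    outer_zip.writestr("inventory_1.zip", inner_buffer.getvalue())

with zipfile.ZipFile(outer_buffer) as outer:
    found = list(iter_nested_xml(outer))
print([path for path, _ in found])
```

Reading inner archives from memory avoids writing temporary files, at the cost of holding each inner zip in RAM while it is processed.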

In [3]:
import json

import matplotlib.pyplot as plt

from pagexml.analysis.stats import get_doc_stats
from pagexml.helper.file_helper import read_page_archive_files
from pagexml.parser import parse_pagexml_files_from_archive

da_archive_file = '../data/HTR results DA 0114.11 PAGE.zip'
na_archive_file = '../data/HTR results 1.05.06 PAGE.zip'

for scan in parse_pagexml_files_from_archive(na_archive_file):
    print(scan.stats)
    break


{'lines': 33, 'words': 165, 'text_regions': 1, 'columns': 0, 'extra': 0, 'pages': 0}


The `pagexml.column_parser` module has a function called `split_lines_on_column_gaps`, which takes all lines from a `TextRegion` object and sorts them into lists of horizontally overlapping lines. Then, it measures the horizontal gap (in number of pixels) between each adjacent pair of lists. If the pixel gap surpasses the threshold, the two lists are considered to be part of different columns. If not, the lists are part of the same column. 
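
The idea described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: each line is reduced to a hypothetical `(left, right)` pixel interval, and the helper names `group_overlapping` and `split_on_gaps` are made up for this example.

```python
from typing import List, Tuple

# Each line is reduced to its horizontal extent: (left, right) in pixels.
Interval = Tuple[int, int]


def group_overlapping(intervals: List[Interval]) -> List[List[Interval]]:
    """Sort intervals by left edge and group those that overlap horizontally."""
    groups: List[List[Interval]] = []
    group_right = None
    for left, right in sorted(intervals):
        if groups and left <= group_right:
            groups[-1].append((left, right))
            group_right = max(group_right, right)
        else:
            groups.append([(left, right)])
            group_right = right
    return groups


def split_on_gaps(intervals: List[Interval], gap_threshold: int) -> List[List[Interval]]:
    """Merge adjacent groups unless the pixel gap between them exceeds the threshold."""
    columns: List[List[Interval]] = []
    column_right = None
    for group in group_overlapping(intervals):
        group_left = min(left for left, _ in group)
        group_right = max(right for _, right in group)
        if columns and group_left - column_right <= gap_threshold:
            columns[-1].extend(group)  # gap is small: same column
            column_right = max(column_right, group_right)
        else:
            columns.append(group)  # gap surpasses the threshold: new column
            column_right = group_right
    return columns


# Hypothetical line extents: two wide blocks and one short line just right of the first.
lines = [(100, 900), (120, 880), (905, 950), (1400, 2200), (1420, 2100)]
columns = split_on_gaps(lines, gap_threshold=10)
print([len(column) for column in columns])
```

With these made-up extents, the short line at 905 lies 5 pixels from the first block and joins it, while the block starting at 1400 lies far beyond the threshold and forms its own column. A larger `gap_threshold` merges more, a smaller one splits more.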

Below, the first 10 scans of the archive are analysed and split into columns. As you can see, in these first scans, most of the larger text regions (those containing more than one line) represent columns. These are probably scans of double pages with one main text column each.

In [11]:
from pagexml import column_parser

for si, scan in enumerate(parse_pagexml_files_from_archive(na_archive_file)):
    columns = column_parser.split_lines_on_column_gaps(scan, gap_threshold=10)
    print(f"scan: {scan.id}\ttext regions: {scan.stats['text_regions']: >4}\tcolumns: {len(columns): >4}")
    for tr in scan.text_regions:
        print(f"\ttext region: {tr.id}\t{tr.stats}")
    for column in columns:
        print(f"\tcolumn: {column.id}\t{column.stats}")
    print()
    if (si+1) >= 10:
        break

scan: NL-HaNA_1.05.06_1_0001.jpg	text regions:    1	columns:    1
	text region: r1	{'lines': 33, 'words': 165, 'text_regions': 0}
	column: NL-HaNA_1.05.06_1_0001.jpg-column-109-136-2356-3546	{'lines': 33, 'words': 165, 'text_regions': 0}

scan: NL-HaNA_1.05.06_1_0002.jpg	text regions:    2	columns:    2
	text region: r1	{'lines': 29, 'words': 140, 'text_regions': 0}
	text region: r2	{'lines': 30, 'words': 146, 'text_regions': 0}
	column: NL-HaNA_1.05.06_1_0002.jpg-column-588-318-1853-3380	{'lines': 29, 'words': 140, 'text_regions': 0}
	column: NL-HaNA_1.05.06_1_0002.jpg-column-2975-349-1841-3271	{'lines': 30, 'words': 146, 'text_regions': 0}

scan: NL-HaNA_1.05.06_1_0003.jpg	text regions:    2	columns:    2
	text region: r1	{'lines': 30, 'words': 148, 'text_regions': 0}
	text region: r2	{'lines': 24, 'words': 115, 'text_regions': 0}
	column: NL-HaNA_1.05.06_1_0003.jpg-column-600-340-1829-3329	{'lines': 30, 'words': 148, 'text_regions': 0}
	column: NL-HaNA_1.05.06_1_0003.jpg-column-3003

One type of page where identifying columns is potentially very useful is a page containing a table or multiple lists, especially where the scan consists of a large number of small text regions.

We can look for scans with at least 10 text regions and see what column splitting does with them.

In [14]:
from pagexml import column_parser

for si, scan in enumerate(parse_pagexml_files_from_archive(na_archive_file)):
    if scan.stats['text_regions'] < 10:
        continue
    columns = column_parser.split_lines_on_column_gaps(scan, gap_threshold=10)
    print(f"scan: {scan.id}\ttext regions: {scan.stats['text_regions']: >4}\tcolumns: {len(columns): >4}")

    for column in columns:
        print(f"\tcolumn: {column.id}\t{column.stats}")
    print()
    if (si+1) >= 10:
        break

scan: NL-HaNA_1.05.06_103_0001.jpg	text regions:  161	columns:   10
	column: NL-HaNA_1.05.06_103_0001.jpg-column-45-223-3131-6267	{'lines': 159, 'words': 153, 'text_regions': 0}
	column: NL-HaNA_1.05.06_103_0001.jpg-column-3192-467-310-4879	{'lines': 6, 'words': 5, 'text_regions': 0}
	column: NL-HaNA_1.05.06_103_0001.jpg-column-3556-423-656-5220	{'lines': 40, 'words': 26, 'text_regions': 0}
	column: NL-HaNA_1.05.06_103_0001.jpg-column-4239-241-689-6195	{'lines': 31, 'words': 22, 'text_regions': 0}
	column: NL-HaNA_1.05.06_103_0001.jpg-column-4942-392-480-5160	{'lines': 16, 'words': 12, 'text_regions': 0}
	column: NL-HaNA_1.05.06_103_0001.jpg-column-5467-306-914-5279	{'lines': 44, 'words': 26, 'text_regions': 0}
	column: NL-HaNA_1.05.06_103_0001.jpg-column-6398-343-790-5152	{'lines': 42, 'words': 30, 'text_regions': 0}
	column: NL-HaNA_1.05.06_103_0001.jpg-column-7281-249-357-3875	{'lines': 10, 'words': 4, 'text_regions': 0}
	column: NL-HaNA_1.05.06_103_0001.jpg-column-7866-386-362-5207

In [18]:
from pagexml.helper.pagexml_helper import pretty_print_textregion

for column in columns:
    pretty_print_textregion(column)
    print('\n------------------------------------\n')

                           N                 1      H
                                                       5
                                         ee
                                                       400
                       e    8                        2
                                   130                  5
                                 dC              N
                                         9 d —
                  d
                                 PpC v9      o5 o65
         E                  235
                                   152
                          5          7 15   13 5     5
  Kaassen in Poort    1                                  3
                                                          N
                                          1
  Rogie Meet.       1    d       1—       E     D.
                     7
  Jarwe J Meel.
                       D
                           1                          2
                     N
   Zout.
  Stokvis i