# Reading and analysing archives with many PageXML files

This tutorial demonstrates how to read larger sets of [PageXML](https://www.primaresearch.org/tools/PAGELibraries) files that are combined in zip or tar archives. 

As an example of a zipped archive, this tutorial uses one of the datasets provided by the [National Archives of the Netherlands](https://www.nationaalarchief.nl/en) (NA) via their HTR repository on [Zenodo](https://zenodo.org/): https://zenodo.org/record/6414086#.Y8Elk-zMIUo. The repository contains many other HTR PageXML datasets that NA made available.

The dataset contains HTR output in [PageXML](https://www.primaresearch.org/tools/PAGELibraries) format of scans from the following archives:
- _Notaris mr. D.A.M.de Fremery te Assen, 1899-1915, 114.11, 1_ ([EAD](https://www.drentsarchief.nl/onderzoeken/archiefstukken?mivast=34&mizig=210&miadt=34&micode=0114.11&miview=inv2)). This is an archive maintained by the [Drents Archief](https://www.drentsarchief.nl).
- _Verspreide West-Indische stukken, 1614-1875, 1.05.06, 1-1413_ ([EAD](https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/%401?query=1.05.06&search-type=inventory)). This is an archive maintained by the [Nationaal Archief](https://www.nationaalarchief.nl/en). 

You can download the datasets via the following URLs:
- https://zenodo.org/record/6414086/files/HTR%20results%20DA%200114.11%20PAGE.zip?download=1
- https://zenodo.org/record/6414086/files/HTR%20results%201.05.06%20PAGE.zip?download=1


In [1]:
%reload_ext autoreload
%autoreload 2


## Extracting PageXML files from Zip/Tar files

The example zip files consist of smaller zip files, each of which contains a number of PageXML files.

In [2]:
from pagexml.helper.file_helper import read_page_archive_files

da_archive_file = '../data/HTR results DA 0114.11 PAGE.zip'
na_archive_file = '../data/HTR results 1.05.06 PAGE.zip'

read_page_archive_files(da_archive_file)


<generator object read_page_archive_files at 0x11236c740>

In [3]:
page_fileinfo, page_data = next(read_page_archive_files(da_archive_file))

# file information of the first PageXML file in the zip file
page_fileinfo


{'source_file': ['../data/HTR results DA 0114.11 PAGE.zip', '1.zip'],
 'archived_filename': 'NL-AsnDA_0114.11_1_0001.xml',
 'archived_filepath': 'NL-AsnDA_0114.11_1_0001.xml'}

The fileinfo contains three fields:

1. `source_file`: the name of the zip/tar archive file that the PageXML file was extracted from. If the archive file itself consists of nested zip/tar archive files, this property is a list of the hierarchy of zip/tar files that the PageXML file was extracted from.
2. `archived_filename`: the name of the PageXML file
3. `archived_filepath`: the full path of the PageXML in the zip/tar archive.
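
To make the three fields concrete, here is a minimal sketch that walks a nested zip archive using only the Python standard library and yields a similar fileinfo dictionary for each XML file. The helper names `iter_nested_zip` and `make_zip` are hypothetical and the archive is a tiny in-memory example; this is not how `read_page_archive_files` is implemented, only an illustration of how the `source_file` hierarchy can arise.

```python
import io
import zipfile


def iter_nested_zip(data, source_files):
    """Yield (fileinfo, content) for each XML file in a possibly nested zip.

    Hypothetical helper for illustration, not part of the pagexml package.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if name.endswith('/'):
                # skip directory entries
                continue
            content = zf.read(name)
            if name.lower().endswith('.zip'):
                # descend into the inner archive and extend the hierarchy
                yield from iter_nested_zip(content, source_files + [name])
            elif name.lower().endswith('.xml'):
                fileinfo = {
                    'source_file': list(source_files),
                    'archived_filename': name.rsplit('/', 1)[-1],
                    'archived_filepath': name,
                }
                yield fileinfo, content


def make_zip(members):
    """Build an in-memory zip file from a {name: bytes} dictionary."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buffer.getvalue()


# a toy outer archive that contains one inner zip with two XML files
inner = make_zip({'demo_0001.xml': b'<PcGts/>', 'demo_0002.xml': b'<PcGts/>'})
outer = make_zip({'1.zip': inner})

demo_files = list(iter_nested_zip(outer, ['outer.zip']))
for fileinfo, content in demo_files:
    print(fileinfo)
```

Each yielded dictionary has the same three keys as the fileinfo shown above, with `source_file` growing by one entry per level of nesting.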



The file data contains the PageXML content as a byte string:

In [4]:
# the content of the first PageXML file in the zip file
page_data[:1000]

b'<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">\n  <Metadata>\n    <Creator>prov=READ-COOP:name=PyLaia@Transkribus:model_id=21087:version=0.0.1\nprov=University of Rostock/Institute of Mathematics/CITlab|PLANET AI GmbH/Tobias Gruening/tobias.gruening@planet-ai.de:name=de.uros.citlab.module.baseline2polygon.B2PSeamMultiOriented:v=2.6.6\nprov=University of Rostock/Institute of Mathematics/CITlab|PLANET AI GmbH/Tobias Gruening/tobias.gruening@planet-ai.de:name=/net_tf/LA73_249_0mod360.pb:de.uros.citlab.segmentation.CITlab_LA_ML:v=2.6.6\nTranskribus\n</Creator>\n    <Created>2020-09-16T15:29:40.548+02:00</Created>\n    <LastChange>2020-09-16T15:29:51.557+02:00</LastChange>\n  </Metadata>\n  <Page imageFilename="NL-AsnDA_0114.11_1_0001.jpg" i

In [5]:
from pagexml.parser import parse_pagexml_file

scan = parse_pagexml_file(pagexml_file=page_fileinfo['archived_filename'], pagexml_data=page_data)
scan

PageXMLScan(
	id=NL-AsnDA_0114.11_1_0001.jpg, 
	type=['pagexml_doc', 'text_region', 'scan'], 
	stats={"lines": 49, "words": 306, "text_regions": 3, "columns": 0, "extra": 0, "pages": 0}
)

The same steps apply to the West-Indische Stukken archive:

In [6]:
page_fileinfo, page_data = next(read_page_archive_files(na_archive_file))
page_fileinfo


{'source_file': ['../data/HTR results 1.05.06 PAGE.zip', '1.zip'],
 'archived_filename': 'NL-HaNA_1.05.06_1_0001.xml',
 'archived_filepath': 'NL-HaNA_1.05.06_1_0001.xml'}

In [7]:
scan = parse_pagexml_file(pagexml_file=page_fileinfo['archived_filename'], pagexml_data=page_data)

scan

PageXMLScan(
	id=NL-HaNA_1.05.06_1_0001.jpg, 
	type=['pagexml_doc', 'text_region', 'scan'], 
	stats={"lines": 33, "words": 165, "text_regions": 1, "columns": 0, "extra": 0, "pages": 0}
)

In [8]:
scan.stats

{'lines': 33,
 'words': 165,
 'text_regions': 1,
 'columns': 0,
 'extra': 0,
 'pages': 0}

The `.stats` property returns a dictionary with counts of the text regions, lines and words (elements that are part of the PageXML specification), as well as the three elements `pages`, `columns` and `extra`. The latter three are sub-classes of `PageXMLTextRegion` that are not part of the PageXML spec but are part of the physical structure domain.
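
To illustrate which elements come from the PageXML specification itself, the following sketch counts `TextRegion`, `TextLine` and `Word` elements in a raw PageXML byte string with the standard library. The sample document and the helper `count_page_elements` are made up for this example, and the counts are not necessarily computed the same way as the package's `stats`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# a tiny, hand-written PageXML-like document (illustrative only)
sample_xml = b'''<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="demo.jpg">
    <TextRegion id="r1">
      <TextLine id="r1l1"><Word id="w1"/><Word id="w2"/></TextLine>
      <TextLine id="r1l2"><Word id="w3"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>'''


def count_page_elements(xml_bytes):
    """Count TextRegion, TextLine and Word elements in a PageXML byte string."""
    root = ET.fromstring(xml_bytes)
    counts = Counter()
    for element in root.iter():
        # tags look like '{namespace}TextLine'; keep only the local name
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name in ('TextRegion', 'TextLine', 'Word'):
            counts[local_name] += 1
    return counts


element_counts = count_page_elements(sample_xml)
print(dict(element_counts))
```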

Often scan images represent an opening of a book, consisting of two physical pages (odd and even numbered, left and right, ...). If you know where the page split is, you can split the scan and create two PageXMLPage documents. Pages can have `PageXMLColumn`s as well as extra `PageXMLTextRegion`s such as headers, footers and marginalia. 
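
As a rough sketch of the idea of splitting a scan at a known page boundary, the code below assigns text regions to a left or right page depending on where their horizontal centre falls. The regions are plain dictionaries with made-up coordinates and `assign_to_pages` is a hypothetical helper; the package's own page and column classes are not used here.

```python
def assign_to_pages(regions, page_split_x):
    """Group region ids by the side of the page split their centre falls on."""
    pages = {'left': [], 'right': []}
    for region in regions:
        centre_x = region['x'] + region['w'] / 2
        side = 'left' if centre_x < page_split_x else 'right'
        pages[side].append(region['id'])
    return pages


# hypothetical regions: x is the left edge, w the width, both in pixels
demo_regions = [
    {'id': 'r1', 'x': 100, 'w': 1000},
    {'id': 'r2', 'x': 1400, 'w': 1000},
    {'id': 'r3', 'x': 150, 'w': 900},
]

demo_pages = assign_to_pages(demo_regions, page_split_x=1250)
print(demo_pages)
```

Using the centre rather than the left edge keeps a region on the page where most of it lies, which is one reasonable choice among several.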


In [9]:
from pagexml.helper.pagexml_helper import pretty_print_textregion

pretty_print_textregion(scan)

         Jieem — Nederland
 kopij om te leggen nevens de minuut.

11 Octob. 1614.

               Die Staten Generaal der Vereenichden

           Nederlanden, Allen den ghenen die dese
           jegenwoordige sullen werden gethoont,
          doen te weten: Alsoo Gerrit Jacobs
           Witssen, Oud Borgermeester der stadt
           Amstelredam, Jonas Witssen, Simon
           Monisen, reeders van den schepe ge„
          naempt het Vosken, daer schipper op is
          geweest Jan de With, Hans Hongers, Pau„
          lus Pelgrom, Lambrecht van Tweenhuijsen
          reeders van de twee seepen genaempt
  ƒ88488   den Tijger ende de Fortuijn, daer schip
 Raem.
           pers op zijn geweest Adriaen Bloek
          ende Henrick Corstiaenssen, Arnoult
          van Lijbergen, Wessel Schenck, Hans
          Claessen ende Berent Sweertssen, ree„
          ders van het schip genaempt den
          Nachtegael, daer schipper op is geweest
          Thijs Volkertssen, coopluijden der voor

## Examining the structure of each archive file

In these example datasets, the zip file contains one or more sub-archives of HTR data, each sub-archive contained in a smaller zip file. From the hierarchy of `source_file`s, it is possible to group the PageXML files per sub-archive.

PageXML objects can be big, and archives can contain tens or hundreds of thousands of PageXML files, so extracting them all can take a significant amount of time. It can therefore be handy to first get an overview of the structure of these zip files and see how many scans there are.

You can pass `filenames_only=True` to the `read_page_archive_files` function to extract only the filenames. 

In [10]:
from collections import defaultdict

scans = defaultdict(lambda: defaultdict(set))

archives = [
    {'id': 'NL-AsnDA_0114.11', 'file': '../data/HTR results DA 0114.11 PAGE.zip'},
    {'id': 'NL-HaNA_1.05.06', 'file': '../data/HTR results 1.05.06 PAGE.zip'}    
]

In [10]:
for archive in archives:
    for page_fileinfo, _ in read_page_archive_files(archive['file'],
                                                    filenames_only=True,
                                                    show_progress=True):
        # the last entry in source_file is the sub-archive, e.g. '1.zip'
        volume = f"{archive['id']}_{page_fileinfo['source_file'][-1].replace('.zip', '')}"
        scans[archive['id']][volume].add(page_fileinfo['archived_filename'])



extracting PageXML files from ../data/HTR results DA 0114.11 PAGE.zip: 630it [00:00, 2832.57it/s]
extracting PageXML files from ../data/HTR results 1.05.06 PAGE.zip: 15857it [00:03, 4085.46it/s]


In [11]:
for archive_id in scans:
    print(f'\narchive {archive_id}')
    for volume in scans[archive_id]:
        print(f'\tvolume: {volume}\tnumber of scans: {len(scans[archive_id][volume])}')


archive NL-AsnDA_0114.11
	volume: NL-AsnDA_0114.11_1	number of scans: 630

archive NL-HaNA_1.05.06
	volume: NL-HaNA_1.05.06_1	number of scans: 5
	volume: NL-HaNA_1.05.06_10	number of scans: 53
	volume: NL-HaNA_1.05.06_100	number of scans: 3
	volume: NL-HaNA_1.05.06_1005	number of scans: 18
	volume: NL-HaNA_1.05.06_1007	number of scans: 7
	volume: NL-HaNA_1.05.06_101	number of scans: 5
	volume: NL-HaNA_1.05.06_102	number of scans: 39
	volume: NL-HaNA_1.05.06_103	number of scans: 4
	volume: NL-HaNA_1.05.06_1038	number of scans: 45
	volume: NL-HaNA_1.05.06_1039	number of scans: 55
	volume: NL-HaNA_1.05.06_104	number of scans: 158
	volume: NL-HaNA_1.05.06_105	number of scans: 17
	volume: NL-HaNA_1.05.06_1050	number of scans: 8
	volume: NL-HaNA_1.05.06_1051	number of scans: 5
	volume: NL-HaNA_1.05.06_1053	number of scans: 9
	volume: NL-HaNA_1.05.06_1055	number of scans: 3
	volume: NL-HaNA_1.05.06_106	number of scans: 24
	volume: NL-HaNA_1.05.06_107	number of scans: 5
	volume: NL-HaNA_1.05.

It turns out that the notarial archive zip file only contains a single zip file with a volume of 630 PageXML files, while the West-Indische Stukken archive has many smaller volumes.
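
In the listing above, the volumes appear in string order (1, 10, 100, 1005, ...). If you prefer them ordered by volume number, a small sort key can help. This sketch assumes that every volume name ends in an underscore followed by an integer, as in the names above; `volume_sort_key` is a hypothetical helper and would raise an error for names with a non-numeric suffix.

```python
def volume_sort_key(volume):
    """Split 'NL-HaNA_1.05.06_10' into ('NL-HaNA_1.05.06', 10) for sorting."""
    prefix, _, number = volume.rpartition('_')
    return prefix, int(number)


demo_volumes = [
    'NL-HaNA_1.05.06_1',
    'NL-HaNA_1.05.06_10',
    'NL-HaNA_1.05.06_100',
    'NL-HaNA_1.05.06_1005',
    'NL-HaNA_1.05.06_101',
    'NL-HaNA_1.05.06_102',
]

sorted_volumes = sorted(demo_volumes, key=volume_sort_key)
for volume in sorted_volumes:
    print(volume)
```

The same key can be passed when iterating over the volumes of an archive, for example `sorted(scans[archive_id], key=volume_sort_key)`.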

In [12]:
archive_id = 'NL-AsnDA_0114.11'
archive_file = '../data/HTR results DA 0114.11 PAGE.zip'
archive_stats = defaultdict(int)

for page_fileinfo, page_data in read_page_archive_files(archive_file, 
                                                        show_progress=True):
    scan = parse_pagexml_file(pagexml_file=page_fileinfo['archived_filename'], pagexml_data=page_data)
    scan_stats = scan.stats
    for field in scan_stats:
        archive_stats[field] += scan_stats[field]

archive_stats

extracting PageXML files from ../data/HTR results DA 0114.11 PAGE.zip: 630it [00:19, 32.18it/s]


defaultdict(int,
            {'lines': 56700,
             'words': 238926,
             'text_regions': 2484,
             'columns': 0,
             'extra': 0,
             'pages': 0})

The 630 PageXML files in the dataset of the notarial archive of D.A.M. de Fremery contain close to 240,000 words.
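
The totals can be turned into per-scan averages with a few lines of arithmetic. The numbers below are copied from the output above; the variable names are only for illustration.

```python
# totals copied from the archive_stats output for the notarial archive
archive_totals = {'lines': 56700, 'words': 238926, 'text_regions': 2484}
num_scans = 630

# average per scan, rounded to one decimal
averages = {field: round(total / num_scans, 1)
            for field, total in archive_totals.items()}
print(averages)
# {'lines': 90.0, 'words': 379.2, 'text_regions': 3.9}
```

So a scan in this archive has on average 90 lines and roughly 380 words.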

## Transforming XML into plain text line format files

If you want to work on the text of the PageXML files, it can be useful to make a plain text format that is faster to process and uses less memory. The PageXML tools package contains functionality to extract the text lines from PageXML files into a simple tab-separated value format.

In [11]:
from pagexml.helper.text_helper import make_line_format_file, make_page_extractor


for archive in archives:
    line_tsv_file = f"../data/line_format-{archive['id']}.tsv.gz"
    page_extractor = make_page_extractor(archive['file'])
    make_line_format_file(page_extractor, line_tsv_file, add_bounding_box=True)


In [12]:
from pagexml.helper.text_helper import LineReader

line_tsv_file = '../data/line_format-NL-HaNA_1.05.06.tsv.gz'

reader = LineReader(pagexml_line_files=line_tsv_file)
for line in reader:
    print(line)
    break

{'doc_id': 'NL-HaNA_1.05.06_1_0001.jpg', 'textregion_id': 'r1', 'line_id': 'r1l1', 'text': 'Jieem — Nederland', 'doc_box': '0,0,2522,3957', 'textregion_box': '120,152,2333,3493', 'line_box': '586,136,1108,113'}
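
The box fields are comma-separated strings, so they need to be parsed before you can do any arithmetic with them. The sketch below converts them to integers. It assumes the four values are `x, y, width, height`, which fits the `doc_box` starting at `0,0` but is not confirmed here; `parse_box` is a hypothetical helper.

```python
def parse_box(box_string):
    """Turn a string like '586,136,1108,113' into a tuple of four integers."""
    return tuple(int(value) for value in box_string.split(','))


# the first line of the line format file, copied from the output above
demo_line = {
    'doc_id': 'NL-HaNA_1.05.06_1_0001.jpg',
    'textregion_id': 'r1',
    'line_id': 'r1l1',
    'text': 'Jieem — Nederland',
    'doc_box': '0,0,2522,3957',
    'textregion_box': '120,152,2333,3493',
    'line_box': '586,136,1108,113',
}

# assumption: the order of the values is x, y, width, height
x, y, width, height = parse_box(demo_line['line_box'])
line_right_edge = x + width
print(x, y, width, height, line_right_edge)
```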
