# Processing PageXML with Tables

In [1]:
%reload_ext autoreload
%autoreload 2


## Reading PageXML files with Tables

Reading a PageXML file that contains a table is no different from reading any other PageXML file:

In [2]:
from pagexml.parser import parse_pagexml_file

page_file = '../data/PageXML-with-Tables-TypoScript/4421891/1206/page/0001_NL-HaNA_0.00.00_1206_0338_contrast.xml'
scan = parse_pagexml_file(page_file)
scan

PageXMLScan(
	id=0001_NL-HaNA_0.00.00_1206_0338_contrast.png, 
	type=['structure_doc', 'physical_structure_doc', 'text_region', 'pagexml_doc', 'scan'], 
	stats={"lines": 58, "words": 314, "text_regions": 0, "table_regions": 1, "columns": 0, "extra": 0, "pages": 0}
)

Note that in the `stats` dictionary, the number of `table_regions` is shown.

## Accessing and interacting with tables

The `.table_regions` property gives access to tables that are direct children of the `scan` object:

In [3]:
table = scan.table_regions[0]
table

PageXMLTableRegion(
	id=t1, 
	type=['structure_doc', 'physical_structure_doc', 'table_region', 'pagexml_doc'], 
	stats={"rows": 23, "cells": 46, "lines": 58, "words": 314}
)

The `PageXMLTableRegion` has the same `stats` property as other `PageXML` objects. You can get the shape of the table via the `.shape` property:

In [4]:
table.shape

(23, 2)

You can access any row by its index in the `rows` property:

In [5]:
table.rows[0]

PageXMLTableRow(
	id=0, 
	type=table_row, 
	stats={"cells": 2, "lines": 1, "words": 6}
)

You can also access a row by index directly on the table object:

In [6]:
table[0]

PageXMLTableRow(
	id=0, 
	type=table_row, 
	stats={"cells": 2, "lines": 1, "words": 6}
)

The same applies to accessing a cell in a row:

In [7]:
cell = table.rows[1].cells[1]  # via properties
cell = table[1][1]             # shorthand
cell

PageXMLTableCell(
	id=t1c4, 
	type=table_cell, 
	row=1, col=1
	stats={"lines": 2, "words": 19}
)

You can iterate over rows and cells using the same shorthand. Each cell has a `row` and a `col` property, which correspond to the row index in the table and the cell index in the row, respectively:

In [13]:
for row in table:
    for cell in row:
        print(cell.row, cell.col, cell.id)

0 0 t1c1
0 1 t1c2
1 0 t1c3
1 1 t1c4
2 0 t1c5
2 1 t1c6
3 0 t1c7
3 1 t1c8
4 0 t1c9
4 1 t1c10
5 0 t1c11
5 1 t1c12
6 0 t1c13
6 1 t1c14
7 0 t1c15
7 1 t1c16
8 0 t1c17
8 1 t1c18
9 0 t1c19
9 1 t1c20
10 0 t1c21
10 1 t1c22
11 0 t1c23
11 1 t1c24
12 0 t1c25
12 1 t1c26
13 0 t1c27
13 1 t1c28
14 0 t1c29
14 1 t1c30
15 0 t1c31
15 1 t1c32
16 0 t1c33
16 1 t1c34
17 0 t1c35
17 1 t1c36
18 0 t1c37
18 1 t1c38
19 0 t1c39
19 1 t1c40
20 0 t1c41
20 1 t1c42
21 0 t1c43
21 1 t1c44
22 0 t1c45
22 1 t1c46
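
The shorthand works because the table and row objects behave like sequences. The following is a minimal stand-in sketch of that pattern, not the library's actual implementation: the `Cell`, `Row` and `Table` classes are hypothetical and only mirror the `rows`, `cells`, `row` and `col` names shown above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Cell:  # hypothetical stand-in for PageXMLTableCell
    id: str
    row: int
    col: int


@dataclass
class Row:  # hypothetical stand-in for PageXMLTableRow
    cells: List[Cell] = field(default_factory=list)

    def __getitem__(self, index: int) -> Cell:
        return self.cells[index]

    def __iter__(self) -> Iterator[Cell]:
        return iter(self.cells)


@dataclass
class Table:  # hypothetical stand-in for PageXMLTableRegion
    rows: List[Row] = field(default_factory=list)

    def __getitem__(self, index: int) -> Row:
        return self.rows[index]

    def __iter__(self) -> Iterator[Row]:
        return iter(self.rows)


# Build a 2x2 table whose cell ids follow the t1c<N> pattern from the output above.
demo_table = Table(rows=[
    Row(cells=[Cell(id=f't1c{r * 2 + c + 1}', row=r, col=c) for c in range(2)])
    for r in range(2)
])

# Both access styles reach the same cell object.
assert demo_table.rows[1].cells[1] is demo_table[1][1]

for demo_row in demo_table:
    for demo_cell in demo_row:
        print(demo_cell.row, demo_cell.col, demo_cell.id)
```

Defining `__getitem__` and `__iter__` on the container is what lets `table[1][1]` and `for row in table` work without going through the `rows` and `cells` properties.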


### Cell Values

You can access the cell content in two ways: via its `value` property or via the `PageXML` elements contained in the `TableCell` object.

**Note**: In the current implementation, it is assumed that cells contain `TextLine`s.

In [14]:
cell = table[1][1]
[line for line in cell.lines]

[PageXMLTextLine(
 	id=t1c4_tl_1, 
 	type=['structure_doc', 'physical_structure_doc', 'line', 'pagexml_doc'], 
 	text="Een brief van den gouverneur Rijk Tulbagh en den raad aan" 
 	conf=None
 ),
 PageXMLTextLine(
 	id=t1c4_tl_2, 
 	type=['structure_doc', 'physical_structure_doc', 'line', 'pagexml_doc'], 
 	text="de Kamer Amsterdam in dato 26 September 1763." 
 	conf=None
 )]

For easy access, the texts of the lines are concatenated in the `value` property:

In [15]:
cell.value

'Een brief van den gouverneur Rijk Tulbagh en den raad aan de Kamer Amsterdam in dato 26 September 1763.'
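
The output suggests that the line texts are joined with a single space. The sketch below reproduces that result in plain Python. It is an illustration under that assumption, not the library's code, and the helper `join_line_texts` is a hypothetical name; skipping empty lines is likewise an assumption.

```python
# Texts of the two lines in cell t1c4, copied from the output above.
line_texts = [
    'Een brief van den gouverneur Rijk Tulbagh en den raad aan',
    'de Kamer Amsterdam in dato 26 September 1763.',
]


def join_line_texts(texts):
    """Join line texts with a single space, skipping empty or missing lines."""
    return ' '.join(text for text in texts if text)


cell_value = join_line_texts(line_texts)
print(cell_value)
```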

Similarly, the `values` property of a row returns a list of the values of all its cells:

In [16]:
row = table[1]
row.values

['1-2',
 'Een brief van den gouverneur Rijk Tulbagh en den raad aan de Kamer Amsterdam in dato 26 September 1763.']

In the same vein, the `values` property of the table returns the values of all rows as a list of lists:

In [17]:
table.values

[['', "Per't engelsch snauw scheepje the Mercury."],
 ['1-2',
  'Een brief van den gouverneur Rijk Tulbagh en den raad aan de Kamer Amsterdam in dato 26 September 1763.'],
 ['3-8',
  "Een dito aan de vergadering van 17en in dato als even. Per't fransche oorlogschip le Comte d'Argenson."],
 ['9-12', 'Een dito aan de vergadering van 17en in dato 22 October 1763.'],
 ['', "Per de fransche scheepen le Comte d'Artois, le Conde en le Massiac."],
 ['13-15', 'Een dito aan de vergadering van 17en in dato 12 November 1763.'],
 ['16-17',
  'Een dito aan de Kamer Amsterdam in dato als boven. Perütoengelsch schip The Royal George.'],
 ['18-23', 'Een dito aan de vergadering van 17en in dato 2 January 1764.'],
 ['', "Per't vroegschip Baarsande."],
 ['24-26', 'Register der papieren.'],
 ['27-38',
  'Origineele missive van den gouverneur Rijk Tulbagh en der raad aan de vergadering van 17en in dato 7 January 1764.'],
 ['39-41',
  'Origineele missive van den gouverneur en raad aan de Ka¬ mer Amsterdam in

This can then easily be loaded into e.g. `pandas` for richer interaction:

In [18]:
import pandas as pd

pd.DataFrame(table.values)

| | 0 | 1 |
|---|---|---|
| 0 | | Per't engelsch snauw scheepje the Mercury. |
| 1 | 1-2 | Een brief van den gouverneur Rijk Tulbagh en d... |
| 2 | 3-8 | Een dito aan de vergadering van 17en in dato a... |
| 3 | 9-12 | Een dito aan de vergadering van 17en in dato 2... |
| 4 | | Per de fransche scheepen le Comte d'Artois, le... |
| 5 | 13-15 | Een dito aan de vergadering van 17en in dato 1... |
| 6 | 16-17 | Een dito aan de Kamer Amsterdam in dato als bo... |
| 7 | 18-23 | Een dito aan de vergadering van 17en in dato 2... |
| 8 | | Per't vroegschip Baarsande. |
| 9 | 24-26 | Register der papieren. |
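
Because `table.values` is a plain list of lists, it can also be exported without `pandas`. The sketch below writes a few of the rows shown above to CSV with the standard library's `csv` module. It uses an in-memory buffer so it runs without touching disk; the file name `table.csv` in the comment is only a hypothetical example.

```python
import csv
import io

# A few rows of `table.values`, copied from the output above.
values = [
    ['', "Per't engelsch snauw scheepje the Mercury."],
    ['1-2', 'Een brief van den gouverneur Rijk Tulbagh en den raad aan de Kamer Amsterdam in dato 26 September 1763.'],
    ['9-12', 'Een dito aan de vergadering van 17en in dato 22 October 1763.'],
]

# Swap the buffer for open('table.csv', 'w', newline='') to write a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(values)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the same list of lists.
round_trip = list(csv.reader(io.StringIO(csv_text)))
assert round_trip == values
```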


## Tables with headers and more complex tables

In [19]:
page_file = '../data/PageXML-VOC-tabel/1150820/Losse_tabel/page/0001_NL-HaNA_1.04.02_2466_1795.xml'

scan = parse_pagexml_file(page_file)
table = scan.table_regions[0]
table.shape

(18, 11)

In this case, the table has 11 columns but the first row only has two cells:

In [20]:
table[0]

PageXMLTableRow(
	id=0, 
	type=table_row, 
	stats={"cells": 2, "lines": 2, "words": 5}
)

The number of columns is based on the row with the maximum number of cells. Rows with fewer cells specify for each cell which column it belongs to:

In [21]:
for cell in table[0]:
    print(cell.id, cell.row, cell.col)

TableCell_1649063529743_4183 0 0
None 0 1
None 0 2
None 0 3
None 0 4
TableCell_1649063529743_4181 0 5
None 0 6
None 0 7
None 0 8
None 0 9
None 0 10


The first cell is part of the zeroth column and the second cell is part of the fifth column. The other columns have no cells. When you iterate over the cells of a row, empty cells are generated on the fly for those columns.

In [22]:
for cell in table[1].cells:
    print(cell.id, cell.row, cell.col)

TableCell_1649061546469_2318 1 0
TableCell_1649061549881_2426 1 1
TableCell_1649061552522_2534 1 2
TableCell_1649061555573_2642 1 3
TableCell_1649063529975_4189 1 4
TableCell_1649063529975_4187 1 5
TableCell_1649061562262_2858 1 6
TableCell_1649061564976_2966 1 7
TableCell_1649061567469_3074 1 8
TableCell_1649061580962_3182 1 9
TableCell_1649061580962_3180 1 10


The second row (at index 1) has a cell in each column.

If you iterate over the rows and print the `values` of each row, you'll see that for non-existing cells, the value `''` is used:

In [23]:
for row in table:
    print(row.values)

['Eerste halve maand', '', '', '', '', 'Laaste halvemaand', '', '', '', '', '']
['Ingebl:', 'Ingek:', 'uijtgeg:', 'overled:', 'blijven', '', 'Ingek:', 'uijtgeg:', 'overl:', 'blijven', '']
['14', '1', '1', '1', '13', '', '1', '2', '1', '11.', 'Hooftwagt']
['2', '„', '„', '„', '2', '', '1', '„', '„', '3', 'Waterpoort']
['1', '„', '„', '„', '1', '', '1', '2', ',', '„', 'd’E E: agtb: wagt']
['2', '„', '„', '„', '2', '', '„', '„', '1', '1', 'P:t Bonij']
['1', '1', ',,', '„', '2', '', '1', '„', '2', '1', '„ Bouton']
['1', '1', '„', '„', '2', '', ',,', '„', '„', '2', '„ mandersaha']
['1', '„', '„', '„', '1', '', '„', '„', '„', '1', '„ amboina']
['3', '„', '„', '„', '3', '', '„', '„', '„', '3', '„ . batsiam']
['1', '„', '1', '„', '„', '', '„', '„', '„', '„', 'T Ravelijn']
['2', '„', '„', '„', '2', '', '„', '„', '1', '1.', 'Redout']
['1', '„', '„', '„', '1', '', '„', '„', '„', '1', 'de Pagger Voorsorg']
['7', '„', '1', '„', '6', '', '„', '1', '„', '5', '„ ambagts gesellen']
['14', '1', '3', '1'
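
The padding of missing cells can be pictured as placing each existing cell at its column index in a row of empty strings. The sketch below does this for the first row of the table. It is a plain-Python illustration of the idea, not the library's implementation, and `pad_row` is a hypothetical helper.

```python
# (column index, text) pairs for the two cells that exist in the first row,
# taken from the outputs above.
sparse_row = [(0, 'Eerste halve maand'), (5, 'Laaste halvemaand')]
num_columns = 11  # the second element of table.shape for this table


def pad_row(cells, width, filler=''):
    """Place each cell value at its column index and fill the gaps."""
    dense = [filler] * width
    for col, value in cells:
        dense[col] = value
    return dense


dense_row = pad_row(sparse_row, num_columns)
print(dense_row)
```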

Again, you can load this into `pandas` to interact with the table:

In [24]:
pd.DataFrame(table.values)

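| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Eerste halve maand | | | | | Laaste halvemaand | | | | | |
| 1 | Ingebl: | Ingek: | uijtgeg: | overled: | blijven | | Ingek: | uijtgeg: | overl: | blijven | |
| 2 | 14 | 1 | 1 | 1 | 13 | | 1 | 2 | 1 | 11. | Hooftwagt |
| 3 | 2 | „ | „ | „ | 2 | | 1 | „ | „ | 3 | Waterpoort |
| 4 | 1 | „ | „ | „ | 1 | | 1 | 2 | , | „ | d’E E: agtb: wagt |
| 5 | 2 | „ | „ | „ | 2 | | „ | „ | 1 | 1 | P:t Bonij |
| 6 | 1 | 1 | ,, | „ | 2 | | 1 | „ | 2 | 1 | „ Bouton |
| 7 | 1 | 1 | „ | „ | 2 | | ,, | „ | „ | 2 | „ mandersaha |
| 8 | 1 | „ | „ | „ | 1 | | „ | „ | „ | 1 | „ amboina |
| 9 | 3 | „ | „ | „ | 3 | | „ | „ | „ | 3 | „ . batsiam |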
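
The first two rows of this table look like a two-level header. One possible way to work with that, sketched here in plain Python, is to carry each top-level label forward over the empty cells that follow it and combine it with the sub-header below. That the top-level labels span the following columns is an assumption about this particular table, and `combine_headers` and the `col_<n>` fallback names are hypothetical.

```python
# Header rows and the first two data rows, copied from the output above.
values = [
    ['Eerste halve maand', '', '', '', '', 'Laaste halvemaand', '', '', '', '', ''],
    ['Ingebl:', 'Ingek:', 'uijtgeg:', 'overled:', 'blijven', '', 'Ingek:', 'uijtgeg:', 'overl:', 'blijven', ''],
    ['14', '1', '1', '1', '13', '', '1', '2', '1', '11.', 'Hooftwagt'],
    ['2', '„', '„', '„', '2', '', '1', '„', '„', '3', 'Waterpoort'],
]


def combine_headers(top, sub):
    """Combine a spanning top header row with a sub-header row."""
    labels = []
    current = ''
    for index, (group, name) in enumerate(zip(top, sub)):
        if group:
            current = group  # a new top-level label starts here
        if name:
            labels.append(f'{current} / {name}')
        else:
            labels.append(f'col_{index}')  # no sub-header: fall back to the index
    return labels


headers = combine_headers(values[0], values[1])
records = [dict(zip(headers, row)) for row in values[2:]]

print(headers)
print(records[0])
```

Columns without a sub-header, such as the last one that holds the location names, keep an index-based name so that every key in a record stays unique.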
