# Turning text lines into running text elements

**Note**: This tutorial follows and depends on the tutorial on [Reading PageXML files from archives](./Demo-reading-pagexml-files-from-archive.ipynb). It assumes you have downloaded the PageXML archives and derived line format files from them. 

PageXML documents contain text lines as represented in the **physical document**, but these lines do not correspond one-to-one with sentences and paragraphs of the **logical structure** of the document.

Turning lines of text into running paragraphs requires taking into account paragraph boundaries and word-break characters such as soft hyphens (`-`).

As an example of a text archive, this tutorial uses a dataset provided by the [National Archives of the Netherlands](https://www.nationaalarchief.nl/en) (NA) via their HTR repository on [Zenodo](https://zenodo.org/): https://zenodo.org/record/6414086#.Y8Elk-zMIUo. The repository contains many other HTR PageXML datasets that NA made available.

The dataset contains HTR output in [PageXML](https://www.primaresearch.org/tools/PAGELibraries) format of scans from the following archive: 
- (medium) _Verspreide West-Indische stukken, 1614-1875, 1.05.06, 1-1413_ ([EAD](https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/%401?query=1.05.06&search-type=inventory)). This is an archive maintained by the [Nationaal Archief](https://www.nationaalarchief.nl/en). 

You can download the dataset via the following URL:
- https://zenodo.org/record/6414086/files/HTR%20results%201.05.06%20PAGE.zip?download=1


## Detecting word break characters

This tutorial shows how to check whether lines end in a word-break character, that is, whether a word is split across the end of one line and the start of the next, usually with a soft hyphen.

In [1]:
%reload_ext autoreload
%autoreload 2


In [2]:
import pagexml.helper.text_helper as text_helper

da_line_file = '../data/line_format-NL-AsnDA_0114.11.tsv.gz'
na_line_file = '../data/line_format-NL-HaNA_1.05.06.tsv.gz'

# reading from line file is much faster 
# than reading from archived PageXML files
line_reader = text_helper.LineReader(pagexml_line_files=na_line_file, add_bounding_box=True)


The `text_stats` module contains classes for analysing character and word frequencies and for detecting word-break characters at the _end_ and the _beginning_ of text lines.

We start with analysing character frequencies in four different categories:

- `all`: the overall frequency of characters
- `start`: the frequency of characters that appear as the first character of a line
- `end`: the frequency of characters that appear as the last character of a line
- `mid`: the frequency of characters that appear between the first and last characters of a line

This allows us to calculate the relative frequencies of characters at the start, in the middle and at the end of lines.
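To make the four categories concrete, here is a minimal sketch in plain Python. It is not the `LineCharAnalyser` implementation; the function names `char_position_stats` and `relative_fraction` are hypothetical. It assumes the relative fraction is the fraction of a character within a position category divided by its overall fraction, which is consistent with the `*_rel_frac` columns in the output further down (for `e`: 0.145274 / 0.16522 ≈ 0.879). How one-character lines should be counted is also an assumption: the sketch counts them as both start and end.

```python
from collections import Counter


def char_position_stats(lines):
    """Count characters overall and at the start, middle and end of each line."""
    counts = {'all': Counter(), 'start': Counter(), 'mid': Counter(), 'end': Counter()}
    for text in lines:
        text = text.strip()
        if not text:
            continue
        counts['all'].update(text)
        counts['start'][text[0]] += 1
        counts['end'][text[-1]] += 1
        # everything strictly between the first and last character
        counts['mid'].update(text[1:-1])
    return counts


def relative_fraction(counts, char, position):
    """Fraction of `char` within `position`, divided by its overall fraction."""
    pos_total = sum(counts[position].values())
    all_total = sum(counts['all'].values())
    if pos_total == 0 or counts['all'][char] == 0:
        return 0.0
    pos_frac = counts[position][char] / pos_total
    all_frac = counts['all'][char] / all_total
    return pos_frac / all_frac


# Toy input: three lines taken from the archive used in this tutorial
toy_lines = ['reeders van den schepe ge„', 'naempt het Vosken', 'ders van het schip']
stats = char_position_stats(toy_lines)
print(relative_fraction(stats, '„', 'end'))
```

A value well above 1 means the character is over-represented in that position; a value near 0 means it rarely occurs there.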

The `LineCharAnalyser` takes as argument an iterable of text lines, which can be a list of strings, a list of `PageXMLTextLine`s or a list of dictionaries with a `text` property (e.g. from line-format files).

In [3]:
from pagexml.analysis.text_stats import LineCharAnalyser

lca = LineCharAnalyser(line_reader)


In [4]:
import pandas as pd

char_df = pd.DataFrame(lca.get_stats())
char_df.head(5)

| | token_type | all_freq | all_frac | start_freq | start_frac | start_rel_frac | mid_freq | mid_frac | mid_rel_frac | end_freq | end_frac | end_rel_frac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e | 5630529 | 0.16522 | 30123 | 0.030213 | 0.182865 | 5462615 | 0.169998 | 1.028917 | 137791 | 0.145274 | 0.879274 |
| 1 | | 4849265 | 0.142295 | 0 | 0.0 | 0.0 | 4849265 | 0.15091 | 1.060545 | 0 | 0.0 | 0.0 |
| 2 | n | 2894479 | 0.084935 | 21729 | 0.021794 | 0.256597 | 2700582 | 0.084043 | 0.9895 | 172168 | 0.181518 | 2.137149 |
| 3 | a | 2026303 | 0.059459 | 27859 | 0.027942 | 0.46994 | 1986422 | 0.061818 | 1.039672 | 12022 | 0.012675 | 0.213169 |
| 4 | o | 1710173 | 0.050183 | 25457 | 0.025533 | 0.508802 | 1669379 | 0.051951 | 1.035247 | 15337 | 0.01617 | 0.32222 |


In [5]:
min_freq = char_df.all_freq.sum() / 1e4


char_df[char_df.all_freq > min_freq].sort_values('end_rel_frac', ascending=False).head(5)

| | token_type | all_freq | all_frac | start_freq | start_frac | start_rel_frac | mid_freq | mid_frac | mid_rel_frac | end_freq | end_frac | end_rel_frac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | - | 122764 | 0.003602 | 3588 | 0.003599 | 0.998994 | 36817 | 0.001146 | 0.318058 | 82359 | 0.086832 | 24.104202 |
| 31 | „ | 97019 | 0.002847 | 38243 | 0.038357 | 13.473387 | 14227 | 0.000443 | 0.15552 | 44549 | 0.046968 | 16.498098 |
| 24 | . | 182205 | 0.005347 | 1056 | 0.001059 | 0.1981 | 99568 | 0.003099 | 0.579547 | 81581 | 0.086011 | 16.087237 |
| 75 | ½ | 3578 | 0.000105 | 125 | 0.000125 | 1.194129 | 2014 | 6.3e-05 | 0.596964 | 1439 | 0.001517 | 14.450177 |
| 62 | — | 23118 | 0.000678 | 8739 | 0.008765 | 12.92091 | 7103 | 0.000221 | 0.325852 | 7276 | 0.007671 | 11.308244 |


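The hyphen `-`, low quotes `„` and period `.` are strongly over-represented as the last character of a line: their `end_rel_frac` is around 24, 16 and 16 respectively, compared with at most about 2 for frequent letters like `e`, `n`, `a` and `o`.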

Both the hyphen and low quotes are typical word-break characters in hand-written text.

In [6]:
prev_line = None
for li, line in enumerate(line_reader):
    # skip ahead until there is a previous line with text to inspect
    if prev_line is None or not prev_line['text']:
        prev_line = line
        continue
    # show the line pair if the previous line ends in a candidate word-break character
    if prev_line['text'][-1] in {'-', '„'}:
        print(f"{prev_line['text']}   {line['text']}\n")
    # only inspect the first 30 lines
    if (li+1) >= 30:
        break
    prev_line = line


Monisen, reeders van den schepe ge„   naempt het Vosken, daer schipper op is

geweest Jan de With, Hans Hongers, Pau„   lus Pelgrom, Lambrecht van Tweenhuijsen

Claessen ende Berent Sweertssen, ree„   ders van het schip genaempt den



In [7]:
prev_line = None
for li, line in enumerate(line_reader):
    # skip ahead until there is a previous line with text to inspect
    if prev_line is None or not prev_line['text']:
        prev_line = line
        continue
    # this time only look at hyphens, and print the line index as well
    if prev_line['text'][-1] in {'-'}:
        print(f"{li}\t{prev_line['text']}   {line['text']}\n")
    # only inspect the first 5000 lines
    if (li+1) >= 5000:
        break
    prev_line = line


1510	een reijs van 16/m het eene door het andere d:o ƒ 24000-   aan Equipagien proncen uijt onthuijs komende maandgelden

3148	de en Horissante staat te brengen, zouden hoofdzaakelyk be-   staan:

3151	van Magellaan, om aldaar eene algemeene Laad- en Los-   plaats voor de schepen te hebben, en van de Producten

3161	Ten derden, in de Nieuwe Carolynsche Pilanden, mids-   Laders de Goud- en Parel-Filanden,

3164	Mirlofte buldekt zyn, en alle andere Pilanden, in de Zuid-   zoe gelegen, op eene veilige wyze, op te zoeken, en te

3169	met de Vrugten, Planten en Gewasschen, die deeze Lan-   den geeven, of aldaar met goed succes zouden kunnen ge-

3170	den geeven, of aldaar met goed succes zouden kunnen ge-   zaaid en geplant worden; midsgaders met der zelver andere

3172	Produelen, een voordeeligen Handel, zoo in Europa, als el-   ders, te dryven; en daar tegen weer Europische en andere

3176	Manschap konnen bevaaren worden, en weinig geld aan in-   koop en onderhoud kosten, tot het transpor

Together, these examples make it clear that low quotes and, in most cases, hyphens are used as word-break characters. A first, naive approach is to make paragraphs of running text by merging lines at every line-final word-break character.
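The naive approach can be sketched in a few lines of plain Python. This is an illustration only, not how `make_text_region_text` is implemented; the function name `naive_merge_lines` is hypothetical, and it assumes the lines are plain strings in reading order.

```python
def naive_merge_lines(lines, word_break_chars='-„'):
    """Join lines into running text, merging at every line-final word-break character."""
    text = ''
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text and text[-1] in word_break_chars:
            # drop the break character and glue the next line on directly
            text = text[:-1] + line
        elif text:
            # no word break: separate the lines with a single space
            text = text + ' ' + line
        else:
            text = line
    return text


lines = [
    'Monisen, reeders van den schepe ge„',
    'naempt het Vosken, daer schipper op is',
    'geweest Jan de With, Hans Hongers, Pau„',
    'lus Pelgrom, Lambrecht van Tweenhuijsen',
]
print(naive_merge_lines(lines))
```

This joins `ge„` + `naempt` into `genaempt` and `Pau„` + `lus` into `Paulus`. The weakness of the naive rule is that it also merges where the final character is not a word break: `ƒ 24000-` followed by `aan` becomes `ƒ 24000aan`.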

In [8]:
import pagexml.parser as parser
import pagexml.helper.pagexml_helper as pagexml_helper

word_break_chars = '-„'

scans = text_helper.read_pagexml_docs_from_line_file(na_line_file)
for scan in scans:
    for tr in scan.text_regions:
        print(tr.id, tr.stats)
        for line in tr.lines:
            print(f'\t{line.text}')
        text, line_ranges = pagexml_helper.make_text_region_text(tr.lines, word_break_chars=word_break_chars)
        print(text)
    break

r1 {'lines': 33, 'words': 165, 'text_regions': 0}
	Jieem — Nederland
	kopij om te leggen nevens de minuut.
	11 Octob. 1614.
	Die Staten Generaal der Vereenichden
	Nederlanden, Allen den ghenen die dese
	jegenwoordige sullen werden gethoont,
	doen te weten: Alsoo Gerrit Jacobs
	Witssen, Oud Borgermeester der stadt
	Amstelredam, Jonas Witssen, Simon
	Monisen, reeders van den schepe ge„
	naempt het Vosken, daer schipper op is
	geweest Jan de With, Hans Hongers, Pau„
	lus Pelgrom, Lambrecht van Tweenhuijsen
	reeders van de twee seepen genaempt
	ƒ88488
	den Tijger ende de Fortuijn, daer schip
	Raem.
	pers op zijn geweest Adriaen Bloek
	ende Henrick Corstiaenssen, Arnoult
	van Lijbergen, Wessel Schenck, Hans
	Claessen ende Berent Sweertssen, ree„
	ders van het schip genaempt den
	Nachtegael, daer schipper op is geweest
	Thijs Volkertssen, coopluijden der voors:
	stadt Amstelredam, -en de Pieter Clements
	Brouwer, Jan Clementssen Kies en de
	Cornelis Volckertsen, coopluijden der
	stede Hoorn,

## Deriving a word-break detector from corpus statistics

The hyphen is sometimes used as a soft hyphen, but it also occurs in other roles, e.g. after monetary amounts (as in the line ending in `ƒ 24000-` above). So it should not always be treated as a soft hyphen.

To do better, we can build a word-break detector that gathers corpus statistics on words and their frequencies in the middle, at the start and at the end of lines.
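The idea behind a statistics-based decision can be sketched as follows. This is a simplified, hypothetical rule, not the algorithm used by `WordBreakDetector`: the function names, the `min_freq` parameter and the toy corpus are all assumptions made for illustration. The rule merges a line-final fragment with the next line's first word only if the joined word is attested often enough in the middle of lines elsewhere in the corpus.

```python
from collections import Counter


def build_mid_vocabulary(lines):
    """Count words that occur strictly inside lines (neither first nor last)."""
    vocab = Counter()
    for line in lines:
        words = line.split()
        vocab.update(words[1:-1])
    return vocab


def should_merge(end_word, start_word, mid_vocab, word_break_chars='-„', min_freq=2):
    """Hypothetical rule: merge if the joined word is attested mid-line at least
    `min_freq` times and at least as often as the line-final stem on its own."""
    if not end_word or end_word[-1] not in word_break_chars:
        return False
    stem = end_word[:-1]
    merged = stem + start_word
    return mid_vocab[merged] >= min_freq and mid_vocab[merged] >= mid_vocab[stem]


# Toy corpus (made up for this sketch, loosely based on lines in the archive)
corpus = [
    'ders van het schip genaempt den',
    'het schip genaempt het Vosken',
    'een reijs van ƒ 24000 aan maandgelden',
]
mid_vocab = build_mid_vocabulary(corpus)

print(should_merge('ge„', 'naempt', mid_vocab))   # 'genaempt' occurs mid-line twice
print(should_merge('24000-', 'aan', mid_vocab))   # '24000aan' never occurs
```

Only mid-line words are counted because words at line starts and ends may themselves be fragments of broken words, which would pollute the vocabulary.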

In [9]:
from pagexml.analysis.text_stats import WordBreakDetector

word_break_chars = '-„'

wbd = WordBreakDetector(min_bigram_word_freq=5, 
                        word_break_chars=word_break_chars,
                        lines=line_reader)


Step 1: setting unigram counters
1007175 lines processed	all:   409555 types	 6658439 tokens
Step 2: setting bigram counter
1007175 lines processed	all: 588086 bigrams
Step 3: setting common merge counters
Step 4: setting typical line ends and starts
Step 5: setting common line ends and starts


In [10]:
wbd.print_counter_stats()

number of lines: 997021
number of words per line: 6.678333756259899
all:            409555 types	   6658439 tokens
start:          132129 types	    997021 tokens
mid:            255368 types	   4815847 tokens
end:            146299 types	    997021 tokens
mid bigrams:    588086 types	   2947515 tokens
Number of typical merge line ends: 140
Number of typical merge line starts: 142
Number of common merge line ends: 37
Number of common merge line starts: 23


In [11]:
word_break_chars = '-„'

scans = text_helper.read_pagexml_docs_from_line_file(na_line_file, add_bounding_box=True)
for scan in scans:
    for tr in scan.text_regions:
        print(tr.id, tr.stats)
        for line in tr.lines:
            print(f'\t{line.text}')
        text, line_ranges = pagexml_helper.make_text_region_text(tr.lines, word_break_chars=word_break_chars)
        print(text)
    break

r1 {'lines': 33, 'words': 165, 'text_regions': 0}
	Jieem — Nederland
	kopij om te leggen nevens de minuut.
	11 Octob. 1614.
	Die Staten Generaal der Vereenichden
	Nederlanden, Allen den ghenen die dese
	jegenwoordige sullen werden gethoont,
	doen te weten: Alsoo Gerrit Jacobs
	Witssen, Oud Borgermeester der stadt
	Amstelredam, Jonas Witssen, Simon
	Monisen, reeders van den schepe ge„
	naempt het Vosken, daer schipper op is
	geweest Jan de With, Hans Hongers, Pau„
	lus Pelgrom, Lambrecht van Tweenhuijsen
	reeders van de twee seepen genaempt
	ƒ88488
	den Tijger ende de Fortuijn, daer schip
	Raem.
	pers op zijn geweest Adriaen Bloek
	ende Henrick Corstiaenssen, Arnoult
	van Lijbergen, Wessel Schenck, Hans
	Claessen ende Berent Sweertssen, ree„
	ders van het schip genaempt den
	Nachtegael, daer schipper op is geweest
	Thijs Volkertssen, coopluijden der voors:
	stadt Amstelredam, -en de Pieter Clements
	Brouwer, Jan Clementssen Kies en de
	Cornelis Volckertsen, coopluijden der
	stede Hoorn,

In [12]:
hyphen_scan = None
for scan in scans:
    for tr in scan.text_regions:
        hyphen_lines = [line for line in tr.lines if line.text and line.text[-1] == '-']
        if len(hyphen_lines) > 4:
            hyphen_scan = scan
            break
    if hyphen_scan is not None:
        break



In [13]:
for tr in hyphen_scan.text_regions:
    for line in tr.lines:
        print(line.text)
    text, line_ranges = pagexml_helper.make_text_region_text(tr.lines, word_break_chars=word_break_chars)
    print(text)
    text, line_ranges = pagexml_helper.make_text_region_text(tr.lines, wbd=wbd)
    print(text)
        

v mor
rnn
nanarm
marr
v mor rnn nanarm marr
v mor rnn nanarm marr

None
None
171
171
171
31
P. J.
eMiddelen, waar van men zich, naar onze gedagten,
8
zou behooren te bedienen, om de Nederlandsche
—
Geoctroyeerde Westindische Compagnie in een goe
de en Horissante staat te brengen, zouden hoofdzaakelyk be-
staan:
Vooreerst, in een Volkplanting aan te leggen op de Kust
van Magellaan, om aldaar eene algemeene Laad- en Los-
plaats voor de schepen te hebben, en van de Producten
van die Kust te protiteeren, zoo om daar mee een voordee
tigen Handel te dryven, als om de schepen met de zelve te
victualieeren, &c.
Ten twoeden, in de Walvischvangst het heele jaar door te
pleegen op differente plaatsen, daar die Visch zich in me
nigte ophoudt, en dat met zoo weinig onkosten, en daar
omtrent zoodanig een schikking te maaken, als nooit voor
deezen is gepractiseert geweest
Ten derden, in de Nieuwe Carolynsche Pilanden, mids-
Laders de Goud- en Parel-Filanden,
die door Jean Michel
Mirlofte buldekt 