# Analysing scan characteristics: Checking quality

**Note**: This tutorial follows and depends on the tutorial on [Reading PageXML files from archives](./Demo-reading-pagexml-files-from-archive.ipynb). It assumes you have downloaded the PageXML archives and derived line format files from them. 

This tutorial shows how you can analyse the characteristics of subsets of PageXML files and check whether they meet expectations. This gives you several ways to qualitatively check the quality of the HTR/OCR output.

Checks:
- _language identification_ of scans,
- _outliers_ in text characteristics, based on word frequency distributions,
- _outliers_ in layout characteristics


As an example of a zipped archive, this tutorial uses one of the datasets provided by the [National Archives of the Netherlands](https://www.nationaalarchief.nl/en) (NA) via their HTR repository on [Zenodo](https://zenodo.org/): https://zenodo.org/record/6414086#.Y8Elk-zMIUo. The repository contains many other HTR PageXML datasets that NA has made available.

The dataset contains HTR output in [PageXML](https://www.primaresearch.org/tools/PAGELibraries) format of scans from the following archive: 
- _Verspreide West-Indische stukken, 1614-1875, 1.05.06, 1-1413_ ([EAD](https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/%401?query=1.05.06&search-type=inventory)). This is an archive maintained by the [Nationaal Archief](https://www.nationaalarchief.nl/en). 

You can download the dataset via the following URL:
- https://zenodo.org/record/6414086/files/HTR%20results%201.05.06%20PAGE.zip?download=1


In [1]:
%reload_ext autoreload
%autoreload 2


This tutorial uses the `langid` package for identifying the language of a piece of text.

In [2]:
# pip install -U langid
from langid.langid import LanguageIdentifier, model

import pagexml.helper.text_helper as text_helper


identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

na_line_file = '../data/line_format-NL-HaNA_1.05.06.tsv.gz'

# reading from line file is much faster 
# than reading from archived PageXML files
line_reader = text_helper.LineReader(pagexml_line_files=na_line_file, add_bounding_box=True)


In [14]:
import re
from collections import Counter
from collections import defaultdict

import pagexml.helper.pagexml_helper as pagexml_helper

def make_clickable(val):
    return f'<a target="_blank" href="{val}">{val}</a>'


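def read_paragraphs(line_file, word_break_chars):
    # Yield the result of make_text_region_text for each text region;
    # the loops that call this generator unpack it as (text, line_ranges).
    scans = text_helper.read_pagexml_docs_from_line_file(line_file)
    for scan in scans:
        for tr in scan.text_regions:
            yield pagexml_helper.make_text_region_text(tr.lines, word_break_chars=word_break_chars)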
            

lang_word_freq = defaultdict(Counter)
word_break_chars = '-„'

for text, line_ranges in read_paragraphs(na_line_file, word_break_chars):
    # skip empty or very short paragraphs, which are hard to classify
    if text is None or len(text) < 10:
        continue
    lang, score = identifier.classify(text)
    # lowercase the words and drop empty strings and pure digit tokens
    words = [word.lower() for word in re.split(r'\W+', text) if word != '' and not word.isdigit()]
    lang_word_freq[lang].update(words)
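
To see what the tokenisation step in the loop above does to a single paragraph, here is a minimal stand-alone sketch. The `tokenize` helper and the sample sentence are made up for illustration and are not part of `pagexml` or the dataset; the splitting and filtering logic mirrors the list comprehension used above.

```python
import re
from collections import Counter


def tokenize(text):
    """Split on non-word characters, lowercase, and drop empty strings and pure digits.

    Hypothetical helper that mirrors the list comprehension in the loop above.
    """
    return [
        word.lower()
        for word in re.split(r'\W+', text)
        if word != '' and not word.isdigit()
    ]


# Made-up sample paragraph, not taken from the dataset
sample = "Copia van de Resolutie, genomen den 10 April 1780 van de Raad."

word_freq = Counter(tokenize(sample))
print(word_freq.most_common(3))
```

Because the split is on `\W+`, punctuation such as `,` and `„` separates words and never ends up inside a token, while tokens consisting only of digits (dates, page numbers) are left out of the frequency counts.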
    


In [15]:
for lang in lang_word_freq:
    for word, freq in lang_word_freq[lang].most_common(10):
        print(f"{lang}\t{word: <25}{freq: >8}")

nl	de                         250900
nl	van                        220802
nl	en                         165333
nl	te                         103227
nl	het                         89926
nl	in                          84122
nl	dat                         78923
nl	den                         63818
nl	op                          50695
nl	aan                         47605
en	the                          1124
en	of                            926
en	to                            521
en	and                           496
en	in                            393
en	a                             277
en	r                             271
en	o                             252
en	or                            247
en	m                             230
de	der                           114
de	idem                           92
de	den                            78
de	die                            73
de	in                             66
de	van                            59
de	als                            54
d

In [9]:
import pagexml.parser as parser
import pagexml.helper.pagexml_helper as pagexml_helper

lang_score_freq = Counter()

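# si is assumed to count scans, so the loop stops after the first 1000 scans
scans = text_helper.read_pagexml_docs_from_line_file(na_line_file)
for si, scan in enumerate(scans):
    for tr in scan.text_regions:
        text, line_ranges = pagexml_helper.make_text_region_text(tr.lines, word_break_chars=word_break_chars)
        if text is None or len(text) < 3:
            continue
        lang, score = identifier.classify(text)
        lang_score_freq.update([(lang, score)])
        # print every paragraph that is not classified as Dutch
        if lang != 'nl':
            print(lang, score, text)
    if (si+1) >= 1000:
        break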
        
        
lang_score_freq.most_common()

en 0.16946150595865334 3:e
fi 0.49236192111457683 Vaan A
en 0.16946150595865334 lqne
bs 0.6121068279841232 uijtger
en 0.16946150595865334 Copia H60 31
de 0.5925873083142155 v mor rnn nanarm marr
en 0.16946150595865334 171
sv 0.24219228203921914 ¶ Lawoat
en 0.16946150595865334 N=o 2
en 0.16946150595865334 Copie
en 0.16946150595865334 Copie
en 0.16946150595865334 Copie
en 0.16946150595865334 7 N
en 0.16946150595865334 1 1 D
en 0.16946150595865334 8 130 dC pC P 152 235 d
en 0.16946150595865334 — 2 3
en 0.16946150595865334 D 5
en 0.7337638832247866 ee 9 d 15 1
en 0.16946150595865334 No —2 5 3 20
en 0.16946150595865334 1 6
en 0.16946150595865334 5 5 3
en 0.16946150595865334 1 o5 13
en 0.16946150595865334 — 2 1
en 0.16946150595865334 o 5
en 0.16946150595865334 „400
en 0.16946150595865334 5 400 2 5 5
en 0.16946150595865334 D. 2
en 0.16946150595865334 5 3 N
en 0.16946150595865334 5. 3 P
en 0.16946150595865334 VE 6 C
en 0.16946150595865334 5 5
en 0.16946150595865334 5 — 2 — 5
en 0.1694615059586

en 0.16946150595865334 A„o 1732
en 0.6533591149519777 All ven 1788
lb 0.9999677430965175 Petitie van eenige Materialen en Gereedschap pen in d’ Edele Compagnies Smits benoodigt Als volgt 50 Staaven staal ½ Duijm breet ¼ d=m dik Jeder Alaaf. 24 Staaven slat Eyser 3 Duum breed½ d=m dik d„o ½ d=m d„o 3¼ d„o d„o 50 d„o d„o ¾ d=m do d„o — 1½ d„o 50. d„o d„o 1¼ d=m d„o d„o„ - - - d„o 50 d„o d„o - - - 3¼ d„o - - - - d„o ¼ d=m d„o 50 d„o 23 - - - 3 d=o - - - - d„o ¼ d=m do d„o 50 d„o - - d„o ¼ d=m d„o 2½ d„o - - - do —50 d„o Rond Eyser d„o - - - - d„o 1 d=m d„o 50 d„o over syn Diameter. rik 50 d„o - - - Rond Eyser ¾ duym oversyn Diameeter — - -d„o — d„o ½ p=m dik over syn Diameter 50 do 100 d„o - - -0 „ . . . d„o 3 /8 d=m dik over syn Diametes 40o Htaaven 4 kant Yser 31/8 d=m dik oversyn diameter 2 Blaade Rood kooper ieder dik /8 Duijm 300 Stuks Rood Koopere Klink Nagels Lang 2 duym yder en dik ƒ88 Duym 1 Blad geel kooper 1/8 duyn dik 6 Nieuwe hand schroeven of vul kloben 6 Pond vyn Eyser bind

af 0.9816041411913222 36 1 1 E P 24 1od roon Hesl Anv ƒpa verled ei aiinolod veds sis berst an ae ƒa363 I ghisemols rob Jan 114 92 36 doo deer e
es 0.7565793051523442 Rio Essequebo den 10.' April 1780
en 0.16946150595865334 den
en 0.9065293493965573 25 July 1814. N=o 47.
en 0.16946150595865334 Dutft 1 Juny 1814
en 0.16946150595865334 4 omgans
en 0.16946150595865334 Coll:
en 0.16946150595865334 R. 5lert 1814
en 0.16946150595865334 Coln:
af 0.887290601239196 geplaatst by kooph: en Colonia geplaatst te Leerwaarde en geniet Persioe geniet perscoen van ƒ2400 geplaatst by koophe Colonie al bove als boven geplaatst by de Marine by kooptColonie by de friance by de staats secretari by de Marmna als Pereepten van Nieuwe of onde Amstel by 1 toloniael Kantoor te ansterda by den Marine als boven 6y koopt eColonie e boven zoo ver ik wees ry en Bodenb de manie geplecht by '1 Colonicel Kantore te Amste
en 0.16946150595865334 Art.1. _ Art 2 Ant. 3. Art. 4. Art: 5 art: 6
en 0.16946150595865334 10. 10:
e

ro 0.4676907209713214 Vate Cosfi - - - - - 2 – — -.
en 0.16946150595865334 -- - 8 — 47
en 0.16946150595865334 1 —
tl 0.566101660226197 Gedestineert Geen Lading 3 balen Cacao 236xh. Limpensag 3022. Rum „ Geen Lading idem. Geen Lading — -- Geen Ladin - — —
de 0.9580167852448463 waar Amstf idem Midalber iden Amst idem „ Middelh: Amst. middelb: Amstsr middels idem Rotterdam Amst. iden idem iden middelb idem Vlissinge Amstvr: idem Lissingen Amm middelb: Amsts idem idem middelb Amstr idem middelb. idem Amst. middelo Amst. iden middelb Amst Roerda Amsterdam
mt 0.5330710321265705 datum Aan 1792 October C 1789 Januari 21 Febrwana Maart 13 20 20 93 April 28 2 Junij
pl 0.3159538043209843 Ort Junyke
en 0.16946150595865334 6 71
en 0.16946150595865334 ate Coffy 45 45
en 0.16946150595865334 — — 10 1 20 5
an 0.3953533576258429 ƒ. 33= Liet Rapst 9 maert 1773. f:o 342 Rappt: 1 April f:o 29. Ziet Not: 8. April 1773. f:o 43. Vide Not: f: 5 April 1773. f:o 32=o en 33.
fr 0.5975321831827071 Noorder quartier

de 0.6996083987098517 Over de reinheid den straten
no 0.9819912071252366 over het Pierim feett der Ioden.
en 0.16946150595865334 102
en 0.16946150595865334 110
en 0.16946150595865334 112
en 0.16946150595865334 118
en 0.16946150595865334 120
en 0.16946150595865334 122
en 0.16946150595865334 124
en 0.16946150595865334 126
en 0.16946150595865334 128
en 0.16946150595865334 130
en 0.16946150595865334 132
en 0.16946150595865334 Hoog
en 0.16946150595865334 134
en 0.16946150595865334 138
en 0.16946150595865334 144
en 0.16946150595865334 146
en 0.16946150595865334 148
en 0.16946150595865334 154
en 0.16946150595865334 156
en 0.16946150595865334 159
en 0.16946150595865334 160
en 0.16946150595865334 162
en 0.16946150595865334 166
en 0.4178766419210845 Decrpline Ver slaven
en 0.16946150595865334 173
en 0.16946150595865334 13 57
ja 0.5194690884182106 ƒo 67
en 0.16946150595865334 71 86
la 0.5038881062212179 fo 89 „ tot 150:ƒ.3:— siet f:o 233 verto nader 90 verso: 9: 90 102 103 107
en 0.16946150595865

en 0.16946150595865334 116 10.
de 1.0 (26 emplangen werden: Aus dieser considerauon haben Wis einheffig in Ratt pelchlossen; Weil die Hollander selbigen Ort wegen des Kriegs verlassen, solchen einzunehmen und post zulassen, weil erschr favorabel vor Insere Negotie is, in deme rund umb viele Oorster und Negersleyn, wesche die Kausteute zy unser grossen avanlage anbringen werden. Als have den 5 Febr: Anno 168. einen fehurich, Getreyten und sects gemeine Knechte, mitdreg dreyptundigen eysernen stucken, lunttizig Cranaten, und zur defension gehorigen Ammunition, dat in nach Jaccarary geschicker, umt S: C: F: D. mndderoselben Br: Atr: Compag: Slagge allda zu pstantzen und waten zu lassen, auch gleich van den Negers undsoldaten emnekking Redouse misPalsiladen umbleizet, machen zu lassen beschlen. Der Her Fiscal Daniel Gerhard Regnerman is depuurt, hebst dem tehurich du Montsolches werck wol in stand zu bringen Dieles obenstekende is nach Resssiger und wollbedachtlicher berahischlagung von On

fr 1.0 ƒ187 31 Droigue te 8. 9. au tusdit kerit contienne, dink que d'in res endroits, plusseurs positions quine concerent pas t'affaire en question, &amp; qui en elles memeslont odieules &amp; offen1 gantes, ta Compagnie nesy arretera pas, tes regardant comme des luites de Pembaras ou se trouvoit t'Auteur, ou 2un tropgrand zele d'apuyer son sentiment, la Compagnie erant d'avisque l'on doit exiter toute occasson d'aipreur &amp; de re proches. Passant done la Resutation du P. to. du susdit lcrit, onn't trouve quune pelitio principi par laquelle on vent soutenis ce que ta Compagnie aretute par tant de railons, aquoion te rapporte pour eviter les redites. De raitonnement par sequel on rache d'apuier cequelon Contient dans ledit &amp.; ne merite pas la moindre refsexion, car ta Compagnie Occidentale lait fort bien quaucune Nation ne peut en obliger une autre 2 quoique ce loit, ja moins quelle Yait droit parles Traites ou par le droit naturel, zinst puitque la Compagnie Occidentale ac-devan

fr 1.0 136 rant pour le Commerce quepour la Police &amp; les bornes des Posselhons, il netois pas susceprible de terme, dont il nesf effectivement parle que dans les artieles ou lon traite de la cessation des hostilitez. De lail luit naturestement, que toutesles stipulations relatives du Commerce, a la correspondance, aux limites des possessions &amp;c. aurquelles en na pas deroge directement, ou indirectement dans les Traitez posserieurs, lont lenlees devoir sublister. Cela est Avras, que dans les Traitez posterieurs on na rien stidule lur les limites &amp; sur l'etat des possessions respectives. Mr. Avocat Pajuge rinss, puilque en plusieurs endroits il cite cesstipulations pour apuier ce quil avance en faveur des pretensions de la Compagnie= Pounous concluons quil nous permettra de taire lameme ulage de ze Traite, apres ce prelude necessaire pour prevenis toute disputepassons l' Art. lV. du Praice de 1661. le voici1 tera libre aux Sujets des Liberum quoque Piederalis BeProvinces- Oni

en 0.16946150595865334 16154.
fr 1.0 1521 n Cestrois articles etant demontrez, comme nous croyons l'avoir lait au jngement de toutes personnes impartiales, la consequence que nous en avons tiree eff toute naturelle. Donc la Compagnie de son propre aven a vidleles Traitez E vo8 e a commis contre notre Nation des violences E des hostilitez dom te Roia ete en droit de demander satisfaction, E ne pouvant tob 1 tenir; duser des represailles. 671 5 80 1 „L2 compagnie le plaint que nous luiavous en levez un Vrissean 's Gravelandt, Capitaine Adrien de Parre, que nous avons en leve587 ksclaves du Vaisseau le Sonnestein &amp; que nous avons mal traitez denx autres Vaissenur la Juffrouw Maria Jacoba &amp; te Zemgater. 117. Pvocat pretend (a) quenous sommes fort embarassez de jastitier notre conduite, &amp; il me fait dire assez pitoyablement, que nous n'avons en d'autre pretexie de les traiter ginst, que par ce qus les Capitaines de la Compagnie ont manque de civilité &amp; de politesse enversles

en 0.16946150595865334 1883
en 0.16946150595865334 34 21 3
en 0.16946150595865334 189. 82.
de 0.8539995237632337 Advis redres west-indische Compagnie.
en 0.16946150595865334 195 84
en 0.16946150595865334 190 83.
en 0.16946150595865334 19685
en 0.16946150595865334 19e 86
en 0.16946150595865334 1957.
en 0.16946150595865334 14988.
en 0.16946150595865334 200
it 0.7470778946617124 dito Accoordi 411. 2. Dito accoord Ar. 4 dito Accoord 411.5.
en 0.16946150595865334 20190.
en 0.16946150595865334 202. 95
en 0.16946150595865334 zog. 93.
en 0.16946150595865334 203. 92.
en 0.16946150595865334 205 94.
en 0.16946150595865334 206: 95.
en 0.16946150595865334 208 97.
it 0.4994946589782139 Nader Accoord AV. 3 dito Accoord Ar. 22.
en 0.16946150595865334 2498.
en 0.16946150595865334 2799.
en 0.16946150595865334 2100.
en 0.16946150595865334 212. 105.
en 0.16946150595865334 102. 213
en 0.16946150595865334 104 214
an 0.6009251592556437 Dito Accoorde 411.7. Dito Accoordt 411.8. dito Accoorde 411.2.
it 0.96660

[(('nl', 1.0), 1493),
 (('en', 0.16946150595865334), 410),
 (('fr', 1.0), 67),
 (('en', 1.0), 12),
 (('de', 1.0), 7),
 (('pl', 0.20867229353658717), 5),
 (('ja', 0.5194690884182106), 5),
 (('nl', 0.9999999999999996), 4),
 (('ja', 0.24692265417222928), 4),
 (('es', 0.6445395518320378), 4),
 (('fi', 0.9549374742514781), 3),
 (('es', 0.37234385553831917), 2),
 (('de', 0.5962513848629446), 2),
 (('ur', 0.5784885130887354), 2),
 (('nl', 0.3201928436806311), 2),
 (('it', 0.8831344561708748), 2),
 (('it', 0.37336173696489644), 2),
 (('nl', 0.9999999999999993), 2),
 (('ur', 0.5783619873000427), 2),
 (('en', 0.5367134438022823), 2),
 (('nl', 0.8878704301283589), 2),
 (('zh', 0.6284123156100075), 2),
 (('nl', 0.8101485325400807), 2),
 (('fi', 0.49236192111457683), 1),
 (('bs', 0.6121068279841232), 1),
 (('nl', 0.9999999987180854), 1),
 (('nl', 0.9999999989479904), 1),
 (('nl', 0.9999840818854093), 1),
 (('de', 0.5925873083142155), 1),
 (('sv', 0.24219228203921914), 1),
 (('nl', 0.999999999999969
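
In the output above, many very short or purely numeric fragments (e.g. `171`, `Copie`, `5 — 2 — 5`) are labelled `en` with the same low score of about 0.169, which suggests the classifier has too little to go on for such texts. One way to reduce this noise is to only classify paragraphs that contain enough letters. The sketch below is an assumption, not part of `pagexml` or `langid`: the function `is_classifiable` and its thresholds (`min_letters`, `min_letter_ratio`) are hypothetical and would need tuning for your data.

```python
def letter_count(text):
    """Count the alphabetic characters in a text."""
    return sum(1 for char in text if char.isalpha())


def is_classifiable(text, min_letters=10, min_letter_ratio=0.5):
    """Hypothetical filter: require enough letters, both in absolute number
    and relative to all non-whitespace characters."""
    if text is None:
        return False
    non_space = [char for char in text if not char.isspace()]
    if not non_space:
        return False
    letters = letter_count(text)
    return letters >= min_letters and letters / len(non_space) >= min_letter_ratio


# Fragments copied from the output above
examples = ["171", "Copie", "5 — 2 — 5", "Over de reinheid den straten"]
for text in examples:
    print(f"{text!r}: {is_classifiable(text)}")
```

A check like this could replace the `len(text) < 3` test in the loop, so that only paragraphs with a reasonable amount of alphabetic text are passed to `identifier.classify`.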

## Looking for failed word segmentation

In some cases, text recognition fails to identify individual words and concatenates a sequence of words, resulting in very long 'words'. Using the word length statistics, it is possible to zoom in on scans with high numbers of long words, which might signal that word segmentation failed. 
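
As a rough illustration of the idea, the sketch below counts whitespace-separated tokens of 20 or more characters in a few lines of text. The helper `find_long_words`, the threshold, and the sample lines are made up for this example; how the `num_words_length_*` columns in the statistics file are actually computed is not shown in this tutorial, so treat this as an approximation of the principle rather than the method behind those columns.

```python
def find_long_words(text, min_length=20):
    """Return whitespace-separated tokens with at least min_length characters.

    Hypothetical helper; the threshold of 20 is an assumption for illustration.
    """
    return [word for word in text.split() if len(word) >= min_length]


# Made-up lines: the second one simulates a failed word segmentation
sample_lines = [
    "geplaatst by de Marine",
    "endatdeselvenietmeerzoudenkomen in de Colonie",
]

long_words = [word for line in sample_lines for word in find_long_words(line)]
print(len(long_words), long_words)
```

A scan where such tokens are frequent is a candidate for manual inspection, since genuine words of this length are rare.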

In [8]:
import pandas as pd
import seaborn as sns

sns.set_theme()

stats_file = '../data/na_scan_stats.tsv.gz'

df = pd.read_csv(stats_file, sep='\t', compression='gzip', index_col=False)

In [18]:
df.columns

Index(['doc_id', 'doc_num', 'lines', 'words', 'text_regions', 'columns',
       'extra', 'pages', 'num_words', 'num_number_words', 'num_title_words',
       'num_non_title_words', 'num_stop_words', 'num_punctuation_words',
       'num_oversized_words', 'num_words_length_5', 'num_words_length_10',
       'num_words_length_15', 'num_words_length_20', 'num_words_length_25',
       'num_words_length_30', 'line_width_range_0-600',
       'line_width_range_600-800', 'line_width_range_800-1300',
       'line_width_range_1300-1600', 'line_width_range_1600-', 'archive_id',
       'inv_num', 'scan_num', 'iiif_url'],
      dtype='object')

In [17]:
wl_cols = [col for col in df.columns if 'num_words_length_' in col]
df[wl_cols].head(5)

Unnamed: 0,num_words_length_5,num_words_length_10,num_words_length_15,num_words_length_20,num_words_length_25,num_words_length_30
0,89,59,17,0,0,0
1,135,134,17,0,0,0
2,134,103,26,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0


In [13]:
df[wl_cols].sort_values('num_words_length_20')

Unnamed: 0,num_words_length_5,num_words_length_10,num_words_length_15,num_words_length_20,num_words_length_25,num_words_length_30
0,89,59,17,0,0,0
9434,102,48,12,0,0,0
9435,209,110,24,0,0,0
9442,152,97,21,0,0,0
9446,25,20,4,0,0,0
...,...,...,...,...,...,...
1743,556,251,77,20,0,0
1742,559,238,84,21,3,0
1044,597,379,94,22,0,0
1051,521,309,72,22,1,0


In [21]:
wl_long = [col for col in wl_cols if int(col.split('_')[-1]) >= 20] + ['num_oversized_words']
df[wl_long]

Unnamed: 0,num_words_length_20,num_words_length_25,num_words_length_30,num_oversized_words
0,0,0,0,0
1,0,0,0,0
2,0,0,0,0
3,0,0,0,0
4,0,0,0,0
...,...,...,...,...
15852,0,0,0,0
15853,0,0,0,0
15854,0,0,0,0
15855,0,0,0,0


In [27]:
df['long_words'] = df[wl_long].sum(axis=1)
df[['doc_id', 'long_words', 'iiif_url']].sort_values('long_words', ascending=False).head(10).style.format({'iiif_url': make_clickable})

Unnamed: 0,doc_id,long_words,iiif_url
1742,NL-HaNA_1.05.06_1167_0319.jpg,24,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1167/file/NL-HaNA_1.05.06_1167_0319?eadId=1.05.06&unitID=1167
1026,NL-HaNA_1.05.06_1166_0010.jpg,24,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1166/file/NL-HaNA_1.05.06_1166_0010?eadId=1.05.06&unitID=1166
1051,NL-HaNA_1.05.06_1166_0035.jpg,23,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1166/file/NL-HaNA_1.05.06_1166_0035?eadId=1.05.06&unitID=1166
1044,NL-HaNA_1.05.06_1166_0029.jpg,22,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1166/file/NL-HaNA_1.05.06_1166_0029?eadId=1.05.06&unitID=1166
1743,NL-HaNA_1.05.06_1167_0320.jpg,20,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1167/file/NL-HaNA_1.05.06_1167_0320?eadId=1.05.06&unitID=1167
1040,NL-HaNA_1.05.06_1166_0024.jpg,18,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1166/file/NL-HaNA_1.05.06_1166_0024?eadId=1.05.06&unitID=1166
970,NL-HaNA_1.05.06_1159_0012.jpg,18,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1159/file/NL-HaNA_1.05.06_1159_0012?eadId=1.05.06&unitID=1159
1740,NL-HaNA_1.05.06_1167_0317.jpg,18,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1167/file/NL-HaNA_1.05.06_1167_0317?eadId=1.05.06&unitID=1167
1049,NL-HaNA_1.05.06_1166_0033.jpg,18,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1166/file/NL-HaNA_1.05.06_1166_0033?eadId=1.05.06&unitID=1166
1028,NL-HaNA_1.05.06_1166_0011.jpg,17,https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/1166/file/NL-HaNA_1.05.06_1166_0011?eadId=1.05.06&unitID=1166
