Harm has added a file with corrections for HTR errors.

Let's read this file:

In [1]:
import csv

file = 'htr_verbeterd_1.tsv'
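# newline='' is what the csv module documentation recommends for files passed to csv.reader
with open(file, newline='') as f:
    reader = csv.reader(f, delimiter='\t')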
    data = [tuple(row) for row in reader]
data

[('tgeene', '13974', "'t geene"),
 ('tgemeene', '968', "'t gemeene"),
 ('tselve', '16031', "'t selve"),
 ('Cnomene', '62', '(nomine'),
 ('Cnomine', '279', '(nomine'),
 ('Cnomne', '49', '(nomine'),
 ('AClaesz', '185', 'A. Claesz'),
 ('Abommelin', '66', 'A. Commelin'),
 ('Ahuijlman', '85', 'A. Huijlman'),
 ('Ahuijsman', '70', 'A. Huijlman'),
 ('Abintestato', '100', 'ab intestato'),
 ('abintestato', '561', 'ab intestato'),
 ('aberbauel', '74', 'Aberbanel'),
 ('aboaf', '67', 'aboas'),
 ('AAbraham', '114', 'Abraham'),
 ('Abaham', '57', 'Abraham'),
 ('Abrahan', '161', 'Abraham'),
 ('Abrahem', '77', 'Abraham'),
 ('Abrahom', '59', 'Abraham'),
 ('Abrahum', '58', 'Abraham'),
 ('Abrakam', '94', 'Abraham'),
 ('Abralam', '85', 'Abraham'),
 ('Abranam', '72', 'Abraham'),
 ('Abraram', '130', 'Abraham'),
 ('Abrasam', '188', 'Abraham'),
 ('Abroham', '76', 'Abraham'),
 ('Araham', '145', 'Abraham'),
 ('Abrahamde', '169', 'Abraham de'),
 ('ablolut', '48', 'absolut'),
 ('alsolute', '72', 'absolute'),
 ('acc

There are superfluous empty lines at the end of the file; we will ignore these.
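
A minimal sketch of how such lines could be skipped while reading, rather than filtered afterwards. The helper name `read_corrections` and the in-memory sample are hypothetical, and the rule "keep only rows with exactly three fields and a non-empty first field" is an assumption about what counts as a valid row:

```python
import csv
import io

def read_corrections(lines):
    """Yield (original, count, replacement) tuples from tab-separated lines.

    Rows that do not have exactly three fields (for example the empty
    trailing lines) or that have an empty first field are skipped.
    """
    reader = csv.reader(lines, delimiter='\t')
    for row in reader:
        if len(row) == 3 and row[0]:
            yield tuple(row)

# Small in-memory stand-in for the real file, with two empty trailing lines.
sample = io.StringIO("tgeene\t13974\t't geene\nAbaham\t57\tAbraham\n\n\n")
rows = list(read_corrections(sample))
print(rows)
```

In the notebook, the same function could be given the open file object instead of the `io.StringIO` sample.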

Also, in some lines, the original word and the replacement word are identical:

In [2]:
[t for t in data if t and t[0] == t[2]]

[('andersints', '18206', 'andersints'),
 ('attestor', '49080', 'attestor'),
 ('Bernand', '49', 'Bernand'),
 ('Capitein', '975', 'Capitein'),
 ('Clenodien', '171', 'Clenodien'),
 ('Theodonis', '240', 'Theodonis')]

So, to get the replacements we need, we create a dictionary mapping the original word(s) to their replacement, filtering out the empty lines and the lines with identical original word and replacement:

In [3]:
replacements = {t[0]:t[2] for t in data if t and t[0] != t[2]}
replacements

{'tgeene': "'t geene",
 'tgemeene': "'t gemeene",
 'tselve': "'t selve",
 'Cnomene': '(nomine',
 'Cnomine': '(nomine',
 'Cnomne': '(nomine',
 'AClaesz': 'A. Claesz',
 'Abommelin': 'A. Commelin',
 'Ahuijlman': 'A. Huijlman',
 'Ahuijsman': 'A. Huijlman',
 'Abintestato': 'ab intestato',
 'abintestato': 'ab intestato',
 'aberbauel': 'Aberbanel',
 'aboaf': 'aboas',
 'AAbraham': 'Abraham',
 'Abaham': 'Abraham',
 'Abrahan': 'Abraham',
 'Abrahem': 'Abraham',
 'Abrahom': 'Abraham',
 'Abrahum': 'Abraham',
 'Abrakam': 'Abraham',
 'Abralam': 'Abraham',
 'Abranam': 'Abraham',
 'Abraram': 'Abraham',
 'Abrasam': 'Abraham',
 'Abroham': 'Abraham',
 'Araham': 'Abraham',
 'Abrahamde': 'Abraham de',
 'ablolut': 'absolut',
 'alsolute': 'absolute',
 'accentatie': 'acceptatie',
 'acceplatie': 'acceptatie',
 'accepratie': 'acceptatie',
 'accertatie': 'acceptatie',
 'accesstatie': 'acceptatie',
 'accestatie': 'acceptatie',
 'accextatie': 'acceptatie',
 'accoptatie': 'acceptatie',
 'aceeptatie': 'acceptatie',
 

We'll use the nltk package to tokenize the HTR text:

In [4]:
import nltk.tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/bramb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Now we read in the combined text of archive `4271_NOTD02659`, which we prepared earlier by running the script `extract-text.py` followed by `dehyphenize.py` on all `*.xml` files in `4271_NOTD02659`:

In [5]:
file = '../pagexml/4271_NOTD02659.txt'
with open(file) as f:
    text = f.read()
print(text)

„5
ƒ8r:-:
Notaricie Archieven
Amsterdam
1a 2332
ƒ ƒ -62
D
640X4 9r
xorOot
Contiuerende tweehondert Drie eartachentign
pdtien
Uijtgegeven aan den Notaris
Dirck vander Groe
Op den o Ocober. Ao 160
Mdr: Backt
„
:11
eromne
11:
1951
Asruns seses osesters
404:1
4
de manten gesz. open Huijden den Eersten decembr: 1693 Compareerden voor mij
zegel van twaef
f
Diick van der Groe noth etc. in presentie vande nabesz: get. Sr. Aron moreno Henriques ende
st
Abraham zuzarte beijde van Competeren ouderdom woonende binnen deser steede ende hebben
bij ware woorden in plaetse ende onder presentatie van eede solemneel ten versoecke van Juffr
Rachel marhorre wede van Sabataij schen en moses machorie beijde kinderen van ghana na
varro in haer lepen wede: van Jacob machorre Cmnsz get. veclaert ende geadtest. hoe waer
dat sij gec leer wel gekent hebben Abraham naparro dewelcke (soo sij get.vstaen
hebben) tot bombai in oostindien sonder eenige gelijffs erven na telaten deser werelt
is Comen te overlijden ende 

Next, we tokenize this text:

In [6]:
words = nltk.word_tokenize(text)
words

['„',
 '5',
 'ƒ8r',
 ':',
 '-',
 ':',
 'Notaricie',
 'Archieven',
 'Amsterdam',
 '1a',
 '2332',
 'ƒ',
 'ƒ',
 '-62',
 'D',
 '640X4',
 '9r',
 'xorOot',
 'Contiuerende',
 'tweehondert',
 'Drie',
 'eartachentign',
 'pdtien',
 'Uijtgegeven',
 'aan',
 'den',
 'Notaris',
 'Dirck',
 'vander',
 'Groe',
 'Op',
 'den',
 'o',
 'Ocober',
 '.',
 'Ao',
 '160',
 'Mdr',
 ':',
 'Backt',
 '„',
 ':11',
 'eromne',
 '11',
 ':',
 '1951',
 'Asruns',
 'seses',
 'osesters',
 '404:1',
 '4',
 'de',
 'manten',
 'gesz',
 '.',
 'open',
 'Huijden',
 'den',
 'Eersten',
 'decembr',
 ':',
 '1693',
 'Compareerden',
 'voor',
 'mij',
 'zegel',
 'van',
 'twaef',
 'f',
 'Diick',
 'van',
 'der',
 'Groe',
 'noth',
 'etc',
 '.',
 'in',
 'presentie',
 'vande',
 'nabesz',
 ':',
 'get',
 '.',
 'Sr.',
 'Aron',
 'moreno',
 'Henriques',
 'ende',
 'st',
 'Abraham',
 'zuzarte',
 'beijde',
 'van',
 'Competeren',
 'ouderdom',
 'woonende',
 'binnen',
 'deser',
 'steede',
 'ende',
 'hebben',
 'bij',
 'ware',
 'woorden',
 'in',
 'plaetse',


Let's check which words we can replace:

In [7]:
[(w, replacements[w]) for w in words if w in replacements]

[('tgeene', "'t geene"),
 ('aberbauel', 'Aberbanel'),
 ('amsterdam', 'Amsterdam'),
 ('bijsbert', 'Gijsbert'),
 ('acteptatie', 'acceptatie'),
 ('betalf', 'betalinge'),
 ('becalen', 'betalen'),
 ('adlites', 'ad lites'),
 ('becaelt', 'betaelt'),
 ('anwoorde', 'antwoorde'),
 ('adlites', 'ad lites'),
 ('vand', 'van de'),
 ('amsterdam', 'Amsterdam'),
 ('tselve', "'t selve"),
 ('bijsbert', 'Gijsbert'),
 ('authonij', 'anthonij'),
 ('aenden', 'aen den'),
 ('tselve', "'t selve"),
 ('tgeene', "'t geene"),
 ('tselve', "'t selve"),
 ('becaelt', 'betaelt'),
 ('amsterdam', 'Amsterdam'),
 ("d'hr", "d' hr"),
 ('becaelt', 'betaelt'),
 ("d'hr", "d' hr"),
 ("d'eene", "d' eene"),
 ('aente', 'aen te'),
 ("d'Hr", "d' Hr"),
 ('boeckhonder', 'boeckhouder'),
 ('becaelt', 'betaelt'),
 ('authori', 'Anthoni'),
 ('adlites', 'ad lites'),
 ("d'hr", "d' hr"),
 ('anwoorde', 'antwoorde'),
 ('tgeene', "'t geene"),
 ('alsborge', 'als borge'),
 ('beprachter', 'bevrachter'),
 ('Beprachter', 'bevrachter'),
 ('beprachter', 'b

Note that this only works for replacement keys that the nltk tokenizer does not split into several tokens.
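
For the keys that do survive tokenization intact, the replacement step itself might look like the sketch below. The function name `apply_replacements` and the tiny sample inputs are hypothetical; in the notebook, `words` and `replacements` would take their place:

```python
from collections import Counter

def apply_replacements(tokens, replacements):
    """Return the corrected token list and a count of each replaced token."""
    counts = Counter()
    corrected = []
    for token in tokens:
        if token in replacements:
            # Known HTR error: substitute the corrected form.
            counts[token] += 1
            corrected.append(replacements[token])
        else:
            corrected.append(token)
    return corrected, counts

# Illustrative inputs only; two pairs taken from the output above.
sample_replacements = {'tgeene': "'t geene", 'becaelt': 'betaelt'}
sample_tokens = ['dat', 'tgeene', 'is', 'becaelt', 'ende', 'becaelt']
corrected, counts = apply_replacements(sample_tokens, sample_replacements)
print(corrected)
print(counts)
```

Some replacements contain a space (for example `'t geene`), so the corrected list holds strings that are no longer single tokens.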

Which replacement keys would be tokenized into more than one token?

In [8]:
for original in replacements:
    tokens = nltk.word_tokenize(original)
    if len(tokens) > 1:
        print(f'{original} -> {tokens}')

Cathari na -> ['Cathari', 'na']
Catharin a -> ['Catharin', 'a']
d'C -> ['d', "'", 'C']
d'd -> ['d', "'d"]
d'D -> ['d', "'D"]
d'e -> ['d', "'", 'e']
d'E -> ['d', "'", 'E']
d'h -> ['d', "'", 'h']
d'H -> ['d', "'", 'H']
de biteur -> ['de', 'biteur']
de biteuren -> ['de', 'biteuren']
