# Parsing charter books

The charter books of Holland and Zeeland have been scanned and OCR'ed and made available at the page level. This Jupyter Notebook contains code to:

- make the text available at the charter level, such that users can jump to specific charters instead of specific pages,
- identify which places are attested in a charter,
- identify in which year or period a charter was written.

### Exploitable structure

- pages have headers with charter number and year or year and charter number
- charters are numbered in ascending order
- charters have dates in ascending order
- line numbers switch sides on alternating pages

### Issues of preprocessing

- lines are not properly split
- several words are recognised as lists of individual characters separated by whitespace
- several words are split over lines, with the first part followed by a line break
- margins contain line numbers 
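
As a minimal sketch of how words split over lines could be rejoined, the function below assumes that the first part of a split word ends in a hyphen at the end of the line. That assumption, the function name `join_split_words` and the sample lines are illustrative and not taken from the charter books.

```python
import re

def join_split_words(lines):
    """Merge a word split over two lines, assuming the first part ends in a hyphen."""
    merged = []
    carry = ""
    for line in lines:
        line = carry + line.strip()
        carry = ""
        match = re.search(r"(\w+)-$", line)
        if match:
            # Hold back the word fragment and prepend it to the next line
            carry = match.group(1)
            line = line[:match.start()].rstrip()
        merged.append(line)
    if carry:
        # A fragment on the very last line has nothing to join with: restore it
        merged[-1] = (merged[-1] + " " + carry + "-").strip()
    return merged

# Invented sample lines, for illustration only
sample_lines = ["in loco qui dicitur Epter-", "nacum super fluvium"]
print(join_split_words(sample_lines))
```

Note that this also joins genuinely hyphenated compounds that happen to break at the hyphen, so the result would need checking against a word list or the place name index.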


### Tasks

- identify charter titles
- identify charter years
- identify place names mentioned in charters
- identify language of charter text paragraphs (Latin, Middle Dutch, Modern Dutch, French, ...)

### Decisions

Initially, the OCR'ed text on the [resources.huygens.knaw.nl](http://resources.huygens.knaw.nl/retroboeken/ohz/#page=0&accessor=toc&view=homePane) website was used, but it has whitespace problems within words: many words are split into individual characters separated by whitespace. This is difficult to resolve for sequences of words, because the boundaries between words are lost, e.g. "l o c u m E p t e r n a c u m" instead of "locum Epternacum" (OHZ part 1, page 5).

Instead, the hOCR (HTML-formatted) output of a new OCR run with Tesseract was used. Besides solving the whitespace issue, it seems to recognize numbers correctly slightly more often, which is important for identifying charter numbers and dates.
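
The difficulty with the character-spaced text of the original OCR can be illustrated with a small sketch. The function `collapse_spaced_characters` is a hypothetical helper, not part of the notebook. It joins runs of single characters, but it cannot recover where one word ends and the next begins.

```python
import re

def collapse_spaced_characters(text):
    """Join runs of three or more single characters separated by single spaces."""
    pattern = r"\b(?:\w ){2,}\w\b"
    return re.sub(pattern, lambda m: m.group(0).replace(" ", ""), text)

spaced = "l o c u m E p t e r n a c u m"
# The two words are merged into one, because the word boundary is not marked
print(collapse_spaced_characters(spaced))
```

Splitting the merged string again would require a dictionary or heuristics such as capitalisation, which is why the Tesseract output was preferred.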





In [214]:
import json
import os
import re
import copy

from elasticsearch import Elasticsearch
from openpyxl import load_workbook
from collections import defaultdict
from collections import Counter
from bs4 import BeautifulSoup as bsoup

import parse_hocr_files # local module
from fuzzy_matcher import FuzzyMatcher # local module

es = Elasticsearch()





## Extracting Charter Page Structure from hOCR files

Read charter books per page and extract:

- page headers, footers, line numbers, paragraphs
- charter number, charter title
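
The general rule for spotting a charter title can be sketched as follows: the expected charter number appears at least 20 characters into the line, preceded by two or more spaces and followed by either more spaces or the end of the line. This is a simplified illustration with a hypothetical function name (`find_title_number`) and sample lines modelled on the OCR output. The code in the next cell handles many more exceptions.

```python
import re

def find_title_number(line, expected_number):
    """Check whether the expected charter number sits in the title position of a line."""
    pattern = r".{20,} {2,}%d(?: {2,}|$)" % expected_number
    return re.search(pattern, line) is not None

# Sample lines modelled on the hOCR output: date, charter number, place
title_line = "[719-739] mei 12" + " " * 20 + "3" + " " * 31 + "Trier"
year_line = "928" + " " * 33 + "29" + " " * 24 + "Maastricht"

print(find_title_number(title_line, 3))   # number in title position
print(find_title_number(year_line, 29))   # number in title position
print(find_title_number(year_line, 9))    # digit 9 only occurs inside other numbers
```

Because charters are numbered in ascending order, only the next expected number has to be tested, which keeps digits inside dates from being mistaken for charter numbers.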

In [2]:
def deconfuse(text):
    # Quick and dirty deconfusion for charter numbers and dates
    text = re.sub("i", "1", text, flags=re.IGNORECASE)
    text = re.sub("z", "2", text, flags=re.IGNORECASE)
    text = re.sub("g", "9", text, flags=re.IGNORECASE)
    text = re.sub("o", "0", text, flags=re.IGNORECASE)
    text = re.sub(r"u(\d+)", r"11\1", text, flags=re.IGNORECASE)
    return text

def prune_paragraph(paragraph):
    # Efficiency stuff: prune paragraphs with the wrong characteristics
    # 1. Charter titles can have multiple lines, but no more than 3
    if len(paragraph["line_texts"]) > 3:
        return True
    # 2. Charter titles never have full text lines with over 60 characters
    most_chars_per_line = max([len(line.replace(" ","")) for line in paragraph["line_texts"]])
    if most_chars_per_line > 60:
        return True
    return False

def determine_lookup_number(next_number):
    # Map charters with known OCR issues to OCR'ed representation.
    # Some charter numbers are recognized as numbers with fewer digits.
    # Easiest solution is to just list them and map to the recognized number
    if next_number in known_ocr_errors: 
        return known_ocr_errors[next_number]
    else:
        return next_number

def check_paragraph(paragraph, next_number):
    # Check if this paragraph is a charter title. If so, extract charter number.
    lookup_number = determine_lookup_number(next_number)
    candidate = False
    if prune_paragraph(paragraph):
        return False
    for line in paragraph["line_texts"]:
        # Generally: 
        # Charter numbers are almost always at least 20 characters/whitespaces from the start.
        # Some special rules:
        # Numbers above 2000 probably don't refer to a year,
        # so if line contains right number, it's probably the charter number
        if lookup_number > 2000 and re.search(r".{20,}  %s " % lookup_number, line):
            candidate = next_number
            return candidate
        # Exception:
        # Charters 2535 and 2536 are on a single line, 2535 is left-column-centred so close to left edge
        # Only three whitespaces between date and number
        if lookup_number > 2000 and re.search(r".{10,} {3,}%s {4,}" % lookup_number, line):
            candidate = next_number
            return candidate
        # Another exception:
        # Two charter numbers separated by whitespace, e.g. charters 3075 and 3076
        if re.search(r".{20,} %s %s {2,}" % (lookup_number-1, lookup_number), line):
            candidate = next_number
            return candidate
        # Another exception:
        # Two titles with dates, where first title number occurs early, centered above left column,
        # so less than 20 characters after start of line, but with plenty of white space around it.
        # E.g. charters 82 and 83 on page 150 in book 1, charters 849 and 850 on page 189 in book 2.
        if re.search(r".{10,} {4,}%s {4,}" % lookup_number, line):
            candidate = next_number
            return candidate
        # Generally, if the expected number appears surrounded by several whitespaces, 
        # at least 20 characters from the start, it should be the charter number
        if re.search(r".{20,} {2,}%s {2,}" % lookup_number, line):
            candidate = next_number
            return candidate
        # If the expected number is at the end of the line (e.g. no geographical indication at the right side)
        if re.search(r".{20,} {2,}%s$" % lookup_number, line):
            candidate = next_number
            return candidate
        # There are some exceptions where a range of charters are grouped, without
        # spelling out each charter number, e.g. 2599-2608 in part 5, page 50.
        # Solution: capture start and end number and return them as a tuple
        if re.search(r".{20,} {2,}%s-\d+" % lookup_number, line):
            candidate = next_number
            m = re.search(r" {2,}%s-(\d+)" % lookup_number, line)
            print("MATCH:", m.group(1))
            sequence_end = int(m.group(1))
            return (candidate, sequence_end)
        # There is at least one charter title with three charter numbers separated by hyphens.
        # The next two checks cover this exception.
        if re.search(r".{20,} {2,}(\d+-)+%s" % lookup_number, line):
            candidate = next_number
            return candidate
        if re.search(r".{20,} {2,}\d+-%s-\d+" % lookup_number, line):
            candidate = next_number
            return candidate
        # If above exceptions don't hold, look for the charter number at the centre of the string.
        center_string = line[34:48]
        if len(center_string) == 0: # if it's empty, it doesn't have the charter number.
            continue
        # Determine how many text characters the charter number should take up
        max_center_length = len(str(lookup_number)) + 1
        if len(center_string.replace(" ","")) > max_center_length: 
            # if it takes up at least two more characters than expected,
            # it's not the charter number.
            continue
        center_text = center_string.strip()
        center_text = deconfuse(center_text)
        if center_text.isdigit():
            # If the center text is a number, it should be the expected number
            candidate = int(center_text)
            return candidate
    return candidate

def has_charter_title(paragraph, numbers):
    is_sequence = False
    # The titles of some charters are missing because of missing pages in the hOCR output.
    # If the title for a charter number is known to be missing, skip to the next charter number
    if numbers["next_charter"] in missing_numbers:
        numbers["current_charter"] = [numbers["next_charter"]]
        numbers["next_charter"] += 1
    # check if paragraph has a candidate title number
    candidate = check_paragraph(paragraph, numbers["next_charter"])
    # candidate is normally an integer for a single charter number. If it's a tuple, the charter title
    # contains a range of charter numbers
    if isinstance(candidate, tuple):
        is_sequence = True
        sequence_end = candidate[1]
        sequence_start = candidate[0]
        candidate = candidate[0]
    # Some charter numbers are missing a digit in the OCR, making them hard to match. Use the mapping table to 
    # determine which number to look for in the OCR
    if numbers["next_charter"] in known_ocr_errors and candidate == known_ocr_errors[numbers["next_charter"]]:
        candidate = numbers["next_charter"]
    while candidate == numbers["next_charter"]:
        paragraph["type"] = "charter_title"
        numbers["current_charter"] = [numbers["next_charter"]]
        if "charter_number" not in paragraph:
            paragraph["charter_number"] = []
        if is_sequence:
            for candidate in range(sequence_start, sequence_end+1):
                paragraph["charter_number"] += [numbers["next_charter"]]
                print("CANDIDATE", candidate, "for charter", numbers["next_charter"], "on page", numbers["page_num"], paragraph)
                print("\tFOUND:", candidate)
                numbers["next_charter"] += 1
                is_sequence = False
        else:
            paragraph["charter_number"] += [numbers["next_charter"]]
            print("CANDIDATE", candidate, "for charter", numbers["next_charter"], "on page", numbers["page_num"], paragraph)
            print("\tFOUND:", candidate)
            numbers["next_charter"] += 1
        while numbers["next_charter"] in missing_numbers:
            numbers["next_charter"] += 1
        candidate = check_paragraph(paragraph, numbers["next_charter"])
        if isinstance(candidate, tuple):
            is_sequence = True
            sequence_end = candidate[1]
            sequence_start = candidate[0]
            candidate = candidate[0]

def process_charter_page(hocr_page, numbers):
    for index, paragraph in enumerate(hocr_page.paragraphs):
        is_charter_title = False
        if numbers["page_num"] > 1 and index == 0: # skip page headers, except on page 1 which has no header
            paragraph["type"] = "header"
        elif index == len(hocr_page.paragraphs) - 1: # skip page footer which contains only page number
            paragraph["type"] = "footer"
        else:
            # Keep track of paragraph type, e.g. main_text, charter_title, header or footer
            # By default, paragraph is main_text
            paragraph["type"] = "main_text"
            # Next, check if paragraph is a charter title with the expected charter number
            has_charter_title(paragraph, numbers)
            # If so, its type has been changed to charter_title
            if paragraph["type"] == "charter_title":
                # keep track of current charter number(s)
                numbers["current_charter"] = paragraph["charter_number"]
            # If not, make sure subsequent paragraphs have the same charter number
            # as the current charter title
            if paragraph["type"] == "main_text":
                # pass on charter number(s) from title paragraph to later paragraphs of same charter
                paragraph["charter_number"] = numbers["current_charter"]

def index_page(hocr_page, book_num):
    doc = {
        "book_num": book_num,
        "page_num": hocr_page.page_num,
        "lines": hocr_page.lines,
        "paragraphs": hocr_page.paragraphs,
    }
    doc_id = "OHZ-{b}-page-{p}".format(b=book_num, p=hocr_page.page_num)
    es.index(index="retroboeken", doc_type="ohz-page", id=doc_id, body=doc)
    
def index_paragraphs(hocr_page, book_num):
    for paragraph in hocr_page.paragraphs:
        doc_id = "OHZ-{b}-page-{page}-paragraph-{par}".format(b=book_num, page=hocr_page.page_num, par=paragraph["paragraph_num"])
        paragraph["book_num"] = book_num
        paragraph["doc_id"] = doc_id
        es.index(index="retroboeken", doc_type="ohz-paragraph", id=doc_id, body=paragraph)
    
def get_hocr_files(hocr_dir):
    for root, dirs, files in os.walk(hocr_dir):
        for fname in sorted(files):
            filepath = os.path.join(root, fname)
            parts = fname.replace(".hocr","").split("_")
            try:
                page_num = int(parts[-1])
                yield filepath, page_num
            except ValueError:
                continue
    
def process_book(book_num, first_charter_num):
    numbers = {
        "current_charter": [0],
        "next_charter": first_charter_num
    }
    next_number = first_charter_num
    curr_number = [0]
    book_name = "OHZ" + str(book_num)
    hocr_dir = "hOCR/" + book_name
    page_index = {}
    place_name_index = {}
    non_places = {}
    fuzzy_places = {}
    print("Processing book {name} starting with charter number {number}".format(name=hocr_dir, number=first_charter_num))
    for filepath, page_num in get_hocr_files(hocr_dir):
        if page_num > last_page[book_name]:
            # Stop before processing pages beyond the last body matter page
            break
        numbers["page_num"] = page_num
        hocr_page = parse_hocr_files.make_hocr_page(filepath, page_num, remove_line_numbers=True)
        process_charter_page(hocr_page, numbers)
        #process_place_names(hocr_page, non_places)
        #index_paragraphs(hocr_page, book_num)



## Index hOCR Pages of Charter Books

In [7]:
#if es.indices.exists(index='retroboeken'):
#    es.indices.delete(index='retroboeken')

next_number = 2
for book_num in range(1,2):
    book_name = "OHZ{b}".format(b=book_num)
    process_book(book_num, starting_charter_numbers[book_name])

Processing book hOCR/OHZ1 starting with charter number 1
CANDIDATE 2 for charter 2 on page 2 {'type': 'charter_title', 'line_texts': ['[726 okt. 21-727 mei 13]            2'], 'line_numbers': [4], 'page_num': 2, 'paragraph_num': 2, 'merged_text': '[726 okt. 21-727 mei 13]            2 ', 'charter_number': [2]}
	FOUND: 2
CANDIDATE 3 for charter 3 on page 6 {'type': 'charter_title', 'line_texts': ['[719-739] mei 12                    3                               Trier'], 'line_numbers': [8], 'page_num': 6, 'paragraph_num': 3, 'merged_text': '[719-739] mei 12                    3                               Trier ', 'charter_number': [3]}
	FOUND: 3
CANDIDATE 4 for charter 4 on page 7 {'type': 'charter_title', 'line_texts': ['    [719-739 nov. 7]                    4'], 'line_numbers': [16], 'page_num': 7, 'paragraph_num': 2, 'merged_text': '[719-739 nov. 7]                    4 ', 'charter_number': [4]}
	FOUND: 4
CANDIDATE 5 for charter 5 on page 9 {'type': 'charter_title', 'line_tex

CANDIDATE 29 for charter 29 on page 50 {'type': 'charter_title', 'line_texts': ['928                                 29                        Maastricht'], 'line_numbers': [4], 'page_num': 50, 'paragraph_num': 2, 'merged_text': '928                                 29                        Maastricht ', 'charter_number': [29]}
	FOUND: 29


KeyboardInterrupt: 

## Keeping Track of OHZ Data Issues

Things to keep track of:

- starting charter number of each book
- last page of main body matter
- missing pages in hOCR output
- missing charter titles because of missing pages
- hard to interpret charter numbers because of OCR recognizing fewer characters than are printed
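
One way to keep this bookkeeping consistent is to store each missing charter title once, as a record, and derive any other view from those records. The sketch below uses a few of the records from this notebook. The variable and function names (`missing_records`, `charter_numbers`, `charters_by_page`) are hypothetical.

```python
from collections import defaultdict

# A few records: charter number plus the book and page where its title is missing
missing_records = [
    {"charter_num": 1, "book_num": 1, "page_num": 1},
    {"charter_num": 1258, "book_num": 3, "page_num": 265},
    {"charter_num": 2629, "book_num": 5, "page_num": 57},
    {"charter_num": 2630, "book_num": 5, "page_num": 57},
]

def charter_numbers(records):
    """Flat, sorted list of charter numbers whose titles cannot be found."""
    return sorted(record["charter_num"] for record in records)

def charters_by_page(records):
    """Map (book, page) to the charter numbers lost on that page."""
    by_page = defaultdict(list)
    for record in records:
        by_page[(record["book_num"], record["page_num"])].append(record["charter_num"])
    return dict(by_page)

print(charter_numbers(missing_records))
print(charters_by_page(missing_records))
```

Deriving the flat list this way avoids maintaining two parallel lists by hand.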


In [6]:
# Identify the number of the first charter in each book.
starting_charter_numbers = {
    "OHZ1": 1,
    "OHZ2": 424,
    "OHZ3": 1085,
    "OHZ4": 1826,
    "OHZ5": 2574,
}

# Identify the page number of the last page containing charter text.
# Later pages contain literature references and other non-charter material.
last_page = {
    "OHZ1": 608,
    "OHZ2": 776,
    "OHZ3": 961,
    "OHZ4": 989,
    "OHZ5": 1137,
}

# The following charter numbers have been incorrectly OCR'ed. 
# When looking for charter titles, use the OCR'ed number.
known_ocr_errors = {
    1111: 111,
    1112: 112,
    1113: 113,
    1114: 114,
    1116: 116,
    1118: 118,
    1121: 121,
    1131: 131,
    1149: 149,
    1153: 153,
    1164: 164,
    1169: 169,
    1175: 175,
    1190: 190,
    1194: 194,
}


# Some pages have not been properly OCR'ed (yet) so cannot be parsed.
# Identify which charter numbers cannot be found because of these missing pages.
missing_numbers = [
    1,    # OHZ part 1, missing page 1
    1258, # OHZ part 3, missing page 265
    1270, # OHZ part 3, missing page 277
    1471, # OHZ part 3, missing page 514
    2052, # OHZ part 4, missing page 264
    2581, # OHZ part 5, missing page 8
    2584, # OHZ part 5, missing page 18
    2598, # OHZ part 5, missing page 49
    2599, # OHZ part 5, missing page 50
    2600, # OHZ part 5, missing page 50
    2601, # OHZ part 5, missing page 50
    2602, # OHZ part 5, missing page 50
    2603, # OHZ part 5, missing page 50
    2604, # OHZ part 5, missing page 50
    2605, # OHZ part 5, missing page 50
    2606, # OHZ part 5, missing page 50
    2607, # OHZ part 5, missing page 50
    2608, # OHZ part 5, missing page 50
    2629, # OHZ part 5, missing page 57
    2630, # OHZ part 5, missing page 57
    2639, # OHZ part 5, missing page 65
    2640, # OHZ part 5, missing page 65
    2644, # OHZ part 5, missing page 75
    2654, # OHZ part 5, missing page 84
    2655, # OHZ part 5, missing page 84
    2660, # OHZ part 5, missing page 89
    2670, # OHZ part 5, missing page 101
    2676, # OHZ part 5, missing page 108
    2694, # OHZ part 5, missing page 147
    2741, # OHZ part 5, missing page 222
    2773, # OHZ part 5, missing page 265
    2776, # OHZ part 5, missing page 267
    2805, # OHZ part 5, missing page 302
    2874, # OHZ part 5, missing page 377
    2889, # OHZ part 5, missing page 393
    2898, # OHZ part 5, missing page 400
    2929, # OHZ part 5, missing page 438
    2953, # OHZ part 5, missing page 463
    2981, # OHZ part 5, missing page 494
    2983, # OHZ part 5, missing page 496
    2984, # OHZ part 5, missing page 496
    3001, # OHZ part 5, missing page 514
    3004, # OHZ part 5, missing page 517
    3008, # OHZ part 5, missing page 521
    3157, # OHZ part 5, missing page 674
    3170, # OHZ part 5, missing page 688
    3172, # OHZ part 5, missing page 690
    3237, # OHZ part 5, missing page 748
    3293, # OHZ part 5, missing page 808
    3365, # OHZ part 5, missing page 888
    3366, # OHZ part 5, missing page 888
    3378, # OHZ part 5, missing page 898
    3436, # OHZ part 5, missing page 979
    3437, # OHZ part 5, missing page 979
    3443, # OHZ part 5, missing page 988
    3444, # OHZ part 5, missing page 990
    3453, # OHZ part 5, missing page 1012
    3460, # OHZ part 5, missing page 1023
    3484, # OHZ part 5, missing page 1050
    3521, # OHZ part 5, missing page 1098
    3529, # OHZ part 5, missing page 1113
    3536, # OHZ part 5, missing page 1137
]

missing_data = [
    {"charter_num": 1, "book_num": 1, "page_num": 1},
    {"charter_num": 1258, "book_num": 3, "page_num": 265},
    {"charter_num": 1270, "book_num": 3, "page_num": 277},
    {"charter_num": 1471, "book_num": 3, "page_num": 514},
    {"charter_num": 2052, "book_num": 4, "page_num": 264},
    {"charter_num": 2581, "book_num": 5, "page_num": 8},
    {"charter_num": 2584, "book_num": 5, "page_num": 18},
    {"charter_num": 2598, "book_num": 5, "page_num": 49},
    {"charter_num": 2599, "book_num": 5, "page_num": 50},
    {"charter_num": 2600, "book_num": 5, "page_num": 50},
    {"charter_num": 2601, "book_num": 5, "page_num": 50},
    {"charter_num": 2602, "book_num": 5, "page_num": 50},
    {"charter_num": 2603, "book_num": 5, "page_num": 50},
    {"charter_num": 2604, "book_num": 5, "page_num": 50},
    {"charter_num": 2605, "book_num": 5, "page_num": 50},
    {"charter_num": 2606, "book_num": 5, "page_num": 50},
    {"charter_num": 2607, "book_num": 5, "page_num": 50},
    {"charter_num": 2608, "book_num": 5, "page_num": 50},
    {"charter_num": 2629, "book_num": 5, "page_num": 57},
    {"charter_num": 2630, "book_num": 5, "page_num": 57},
    {"charter_num": 2639, "book_num": 5, "page_num": 65},
    {"charter_num": 2640, "book_num": 5, "page_num": 65},
    {"charter_num": 2644, "book_num": 5, "page_num": 75},
    {"charter_num": 2654, "book_num": 5, "page_num": 84},
    {"charter_num": 2655, "book_num": 5, "page_num": 84},
    {"charter_num": 2660, "book_num": 5, "page_num": 89},
    {"charter_num": 2670, "book_num": 5, "page_num": 101},
    {"charter_num": 2676, "book_num": 5, "page_num": 108},
    {"charter_num": 2694, "book_num": 5, "page_num": 147},
    {"charter_num": 2741, "book_num": 5, "page_num": 222},
    {"charter_num": 2773, "book_num": 5, "page_num": 265},
    {"charter_num": 2776, "book_num": 5, "page_num": 267},
    {"charter_num": 2805, "book_num": 5, "page_num": 302},
    {"charter_num": 2874, "book_num": 5, "page_num": 377},
    {"charter_num": 2889, "book_num": 5, "page_num": 393},
    {"charter_num": 2898, "book_num": 5, "page_num": 400},
    {"charter_num": 2929, "book_num": 5, "page_num": 438},
    {"charter_num": 2953, "book_num": 5, "page_num": 463},
    {"charter_num": 2981, "book_num": 5, "page_num": 494},
    {"charter_num": 2983, "book_num": 5, "page_num": 496},
    {"charter_num": 2984, "book_num": 5, "page_num": 496},
    {"charter_num": 3001, "book_num": 5, "page_num": 514},
    {"charter_num": 3004, "book_num": 5, "page_num": 517},
    {"charter_num": 3008, "book_num": 5, "page_num": 521},
    {"charter_num": 3157, "book_num": 5, "page_num": 674},
    {"charter_num": 3170, "book_num": 5, "page_num": 688},
    {"charter_num": 3172, "book_num": 5, "page_num": 690},
    {"charter_num": 3237, "book_num": 5, "page_num": 748},
    {"charter_num": 3293, "book_num": 5, "page_num": 808},
    {"charter_num": 3365, "book_num": 5, "page_num": 888},
    {"charter_num": 3366, "book_num": 5, "page_num": 888},
    {"charter_num": 3378, "book_num": 5, "page_num": 898},
    {"charter_num": 3436, "book_num": 5, "page_num": 979},
    {"charter_num": 3437, "book_num": 5, "page_num": 979},
    {"charter_num": 3443, "book_num": 5, "page_num": 988},
    {"charter_num": 3444, "book_num": 5, "page_num": 990},
    {"charter_num": 3453, "book_num": 5, "page_num": 1012},
    {"charter_num": 3460, "book_num": 5, "page_num": 1023},
    {"charter_num": 3484, "book_num": 5, "page_num": 1050},
    {"charter_num": 3521, "book_num": 5, "page_num": 1098},
    {"charter_num": 3529, "book_num": 5, "page_num": 1113},
    {"charter_num": 3536, "book_num": 5, "page_num": 1137},
]

# Keep track of pages missing in the hOCR output, so that OHZ index references to missing pages can be identified
missing_pages = defaultdict(list)

for book_num in range(1,6):
    prev_page_num = 0
    book_name = "OHZ" + str(book_num)
    hocr_dir = "hOCR/" + book_name
    for page_path, page_num in parse_hocr_files.get_hocr_files(hocr_dir):
        while prev_page_num != page_num - 1:
            missing_pages[book_num] += [prev_page_num+1]
            prev_page_num += 1
        #print(prev_page_num, page_num)
        prev_page_num = page_num

print(missing_pages)

defaultdict(<class 'list'>, {1: [1], 3: [265, 277, 514], 4: [264], 5: [8, 15, 18, 49, 50, 57, 65, 75, 84, 89, 101, 108, 120, 129, 147, 222, 265, 267, 290, 302, 341, 377, 393, 400, 437, 438, 463, 494, 496, 512, 514, 517, 521, 545, 590, 674, 688, 690, 748, 808, 847, 886, 888, 898, 971, 979, 988, 990, 1012, 1022, 1023, 1031, 1050, 1098, 1113, 1115, 1125, 1133, 1137, 1149]})


## Reading the Place Name List

The list of place names is based on the OHZ Index, which contains place names as well as person names. The Excel sheet has a column with sets of place names, where sets are separated by an empty cell.

- The first name in the set tends to be the modern name/spelling (although for some places there is no modern name/spelling) 
- The other names in the set are variant names

There are additional columns:

- one column with country code labels indicating in which modern-day country the place is located,
- one column with comments 
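
The grouping of names into sets can be sketched without the spreadsheet itself. The function below takes the values of the name column as a plain list, in which `None` or an empty string stands for an empty cell. The function name `group_placenames` and the sample names are invented for illustration and do not come from the OHZ Index.

```python
def group_placenames(names):
    """Split a column of names into sets; an empty cell ends the current set."""
    groups = []
    current = []
    for name in names:
        if not name or not name.strip():
            if current:
                groups.append(current)
            current = []
        else:
            current.append(name.strip())
    if current:
        groups.append(current)
    return groups

# Invented column values: the first name of each set is treated as the preferred name
column = ["Echternach", "Epternacum", "Epternaco", None, "Hamburg", "Hammaburg"]
groups = group_placenames(column)
variants_of = {group[0]: group[1:] for group in groups}
print(variants_of)
```

The code in the next cell follows the same idea row by row, resetting the preferred name whenever it encounters an empty cell or the column header.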

In [161]:
from openpyxl import load_workbook
from collections import defaultdict

class PlaceIndex(object):
    
    def __init__(self):
        self.has_variants = defaultdict(dict)
        self.is_variant_of = defaultdict(dict)
        self.placename_ngram_index = defaultdict(dict)
        self.placename_info = {}
        self.non_placename = {}
        
    def add_placename(self, placename, placename_id, country, note):
        self.placename_info[placename] = {
            "placename": placename,
            "placename_id": placename_id,
            "country": country,
            "note": note
        }
        
    def add_variant(self, pref_name, variant_name, variant_id):
        self.has_variants[pref_name][variant_name] = variant_id
        self.is_variant_of[variant_name][pref_name] = variant_id

    def index_placename_ngrams(self, place_name, ngram_size=3):
        for ngram in get_ngrams(place_name, ngram_size):
            self.placename_ngram_index[ngram][place_name] = True

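    def is_ambiguous(self, placename_string, is_variant_of=None):
        # A name is unambiguous only if it maps to exactly one preferred place name.
        # The is_variant_of argument is unused and kept for backward compatibility.
        return not isinstance(self.get_preferred_placename(placename_string), str)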
        
    def get_preferred_placename(self, variant_placename):
        if variant_placename not in self.is_variant_of:
            return None
        preferred_placenames = list(self.is_variant_of[variant_placename].keys())
        if len(preferred_placenames) == 1:
            return preferred_placenames[0]
        else:
            return preferred_placenames

    def add_non_placename(self, non_placename):
        self.non_placename[non_placename] = 1
        
    def remove_non_placename(self, non_placename):
        del self.non_placename[non_placename]

    def is_non_placename(self, non_placename):
        return non_placename in self.non_placename
        
    def is_placename(self, non_placename):
        return non_placename in self.is_variant_of
        
def get_ngrams(term, ngram_size=3):
    for start_index in range(0, len(term) - (ngram_size-1)):
        yield term.lower()[start_index:start_index+ngram_size]

def index_place_ngrams(place_name, ngram_size=3):
    #print("Place name:", place_name)
    for ngram in get_ngrams(place_name, ngram_size):
        place_ngram_index[ngram][place_name] = True

def index_row(row, placename_index, preferred_placename):
    # Key cells by column letter; in openpyxl >= 2.6, cell.column is an integer
    row = {cell.column_letter: cell.value for cell in row}
    if row['B'] == "Plaatsnaam" or not row['B']:
        preferred_placename = None
    else:
        placename_id = row['A']
        placename_string = str(row['B']).strip()
        if not preferred_placename:
            preferred_placename = placename_string
            country = row['C']
            note = row['D']
            placename_index.add_placename(preferred_placename, placename_id, country, note)
        placename_index.add_variant(preferred_placename, placename_string, placename_id)
        if len(placename_string) >= 4:
            placename_index.index_placename_ngrams(placename_string)
    return preferred_placename
        
def create_placename_index(placename_excel_file):
    wb2 = load_workbook(placename_excel_file)
    ws = wb2.active
    placename_index = PlaceIndex()
    preferred_placename = None
    for row in ws:
        preferred_placename = index_row(row, placename_index, preferred_placename)
    return placename_index


## Fuzzy Matching of Place Names

Fuzzy matching on multiple criteria is an effective way to deal with OCR mistakes. Depending on the quality of the OCR, the thresholds for the various criteria can be high or low. With lower thresholds, more candidates are found, with more uncertainty. The advantage of using multiple criteria is that the uncertainty introduced by individual criteria is alleviated by the complementary signal of the other criteria. 
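
How several criteria complement each other can be shown with a self-contained sketch. It is not the implementation of the local `FuzzyMatcher` module, whose internals are not shown here. It only assumes three plausible criteria (shared characters, shared bigrams, Levenshtein similarity) and reuses the threshold values set later in this notebook as defaults. All function names are hypothetical.

```python
from collections import Counter

def char_overlap(a, b):
    """Share of characters (as a multiset) that both strings have in common."""
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b), 1)

def ngram_overlap(a, b, size=2):
    """Share of character n-grams that both strings have in common."""
    grams_a = Counter(a[i:i + size] for i in range(len(a) - size + 1))
    grams_b = Counter(b[i:i + size] for i in range(len(b) - size + 1))
    shared = sum((grams_a & grams_b).values())
    return shared / max(sum(grams_a.values()), sum(grams_b.values()), 1)

def levenshtein_similarity(a, b):
    """One minus the edit distance divided by the length of the longer string."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return 1 - previous[-1] / max(len(a), len(b), 1)

def is_fuzzy_match(candidate, name, char_min=0.7, ngram_min=0.5, lev_min=0.8):
    """Accept a candidate only if it passes all three criteria."""
    a, b = candidate.lower(), name.lower()
    return (char_overlap(a, b) >= char_min
            and ngram_overlap(a, b) >= ngram_min
            and levenshtein_similarity(a, b) >= lev_min)

print(is_fuzzy_match("Hanburg", "Hamburg"))   # one substituted character
print(is_fuzzy_match("Hanburg", "Hannover"))  # same initial, but too few shared characters
```

Requiring all criteria to pass means that a lenient threshold on one criterion is compensated by the others, at the cost of missing candidates that fail only one of them.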

In [256]:
def select_best_ranked_candidates(candidate_name, ranked_candidates):
    # Multiple variants of the same place may be fuzzy matches.
    # In that case, select only the best matching one. 
    ranked_preferred_placenames = []
    selected_candidates = []
    for ranked_candidate in ranked_candidates:
        if ranked_candidate["total"] < 2: #len(candidate_name):
            continue
        preferred_placenames = placename_index.is_variant_of[ranked_candidate["candidate"]].keys()
        new_placenames = [placename for placename in preferred_placenames if placename not in ranked_preferred_placenames]
        if len(new_placenames) == 0:
            continue
        selected_candidates += [ranked_candidate["candidate"]] 
        ranked_preferred_placenames += new_placenames
    return selected_candidates

def is_candidate_ngram_placename_match(candidate_name, placename):
    # Pruning step:
    # - discard candidates that are much longer or shorter strings
    if abs(len(placename) - len(candidate_name)) >= 3:
        return False
    # - discard candidates that start with a different initial (crude but effective)
    if placename[0].lower() != candidate_name[0].lower():
        return False
    return True

def get_ngram_placenames(candidate_name, placename_index, ngram):
    placenames = placename_index.placename_ngram_index[ngram].keys()
    return [placename for placename in placenames if is_candidate_ngram_placename_match(candidate_name, placename)]

def select_ngram_candidate_placenames(candidate_name, placename_index, fuzzy_matcher):
    ngrams = [ngram for ngram in get_ngrams(candidate_name) if ngram in placename_index.placename_ngram_index]
    placenames = [placename for ngram in ngrams for placename in get_ngram_placenames(candidate_name, placename_index, ngram)]
    placenames = fuzzy_matcher.filter_candidates(list(set(placenames)), candidate_name, 2)
    #print("Ngram candidates:", placenames)
    return [placename for placename in placenames if is_candidate_ngram_placename_match(candidate_name, placename)]

def find_fuzzy_placename_candidates(candidate_name, placename_index, fuzzy_matcher):
    candidate_places = Counter()
    ngram_candidate_placenames = select_ngram_candidate_placenames(candidate_name, placename_index, fuzzy_matcher)
    ranked_candidates = fuzzy_matcher.rank_candidates(ngram_candidate_placenames, candidate_name, ngram_size=2)
    #print("candidate_name", candidate_name)
    #print("ranked_candidates:", ranked_candidates)
    #print()
    selected_candidates = select_best_ranked_candidates(candidate_name, ranked_candidates)
    return selected_candidates

def is_fuzzy_candidate(candidate, placename_index):
    if placename_index.is_non_placename(candidate):
        # If candidate was established earlier to be not a placename, skip it.
        # This is an efficiency/pruning step to avoid checking the same 
        # non-matches over and over.
        return False
    if placename_index.is_placename(candidate):
        # Exact matches are dealt with elsewhere, skip here.
        return False
    return True

def filter_fuzzy_candidates(candidates, placename_index):
    return [candidate for candidate in set(candidates) if is_fuzzy_candidate(candidate, placename_index)]
    
def find_fuzzy_placename_matches(text, placename_index, fuzzy_matcher):
    candidates = re.findall(r"\b[A-Z]\w+", text)
    candidates += [match[0] for match in re.findall(r"\b([A-Z]\w+([ -][A-Z]\w+)*)", text) if match[0] not in candidates]
    fuzzy_candidates = filter_fuzzy_candidates(candidates, placename_index)
    fuzzy_place_matches = []
    for fuzzy_candidate in fuzzy_candidates:
        if len(fuzzy_candidate) < 4:
            continue
        candidate_places = find_fuzzy_placename_candidates(fuzzy_candidate, placename_index, fuzzy_matcher)
        if len(candidate_places) == 0:
            # candidate doesn't match with any known placename, so register as non placename
            # for later pruning.
            placename_index.add_non_placename(fuzzy_candidate)
        else:
            candidate_places = [{"placename_variant": placename, "preferred_placename": placename_index.get_preferred_placename(placename)} for placename in candidate_places]
            fuzzy_place_matches += [{"fuzzy_place_string": fuzzy_candidate, "candidate_places": candidate_places}]
    return fuzzy_place_matches

char_match_threshold=0.7
ngram_threshold=0.5
levenshtein_threshold=0.8
fuzzy_matcher = FuzzyMatcher(char_match_threshold=char_match_threshold, ngram_threshold=ngram_threshold, levenshtein_threshold=levenshtein_threshold)
placename_index.non_placename = {}
test_text = "Origineel niet voorhanden. Afschriften: B (begin 14e e.) Staatsarchief Hannover, Kopiar II-4r van het domkapittel Hanburg (later Bremen), fol. 62 1, nr. 84, ad 1706, verbrand in het bombardement op Hannover van 1943 okt. 8/9. ? C (eind 15e e.) Ibidem, Kopiar 1-43 van het zelfde kapittel, fol. 84 v, nr. 92, op dezelfde wijze verbrand. ? D (eind 16e e.) Erpold Lindenbrog, Privilegia archiecclesie Hammaburgensis, hs. verbrand in de stadsbrand van Hamburg van 1842. "

fuzzy_matches = find_fuzzy_placename_matches(test_text, placename_index, fuzzy_matcher)
print(fuzzy_matches)

test_text = "Drukken van DE: a. Schannat, Corpus Fuld., p. 316, nr. 70, naar E. ? b. Miraeus-Foppens, Opera dipl., HI, p. 8, nr. 5, $ 70, v??r c. 800, naar a. ? c. Dronke, Traditiones Fuldenses, D. 44, cap. 7, nr. 23, naar D; p. 51, nr. 124, naar E. ? d. Van den Bergh, Handboek, p. 266, nr. 70, ad [2e helft 8e e.], naar ab. ? e. Van den Bergh, OHZ, L, nr. 9, p. Gen 10, $ 23 en 124, c. eind 8e e., maar c. ? JS. Friedl?nder, Ostfriesisches UB,    p. 787, in nr. 3, naar D; b. 792, in nr. 9, naar E. ? g. Bunte, Untersuchungen, p. 31 en 40, nrs. 23 en 124, naar c. " 

fuzzy_matches = find_fuzzy_placename_matches(test_text, placename_index, fuzzy_matcher)
print(fuzzy_matches)

test_text = "Waterlandie Prenominati Aribone Traiectensem Westfrisie"

fuzzy_matches = find_fuzzy_placename_matches(test_text, placename_index, fuzzy_matcher)
print(fuzzy_matches)



[{'candidate_places': [{'placename_variant': 'Hamburg',
    'preferred_placename': 'Hamburg'}],
  'fuzzy_place_string': 'Hanburg'}]
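
The candidate extraction and filtering above depend on the `FuzzyMatcher` class, whose internals are not shown in this part of the notebook. As a rough illustration of the idea only, the sketch below extracts capitalised words with the same regular expression and scores them against known placenames using the overlap of character bigrams (Dice coefficient). The names `char_ngrams`, `bigram_overlap` and `sketch_fuzzy_matches`, and the threshold of 0.5, are assumptions made for this example and are not part of the actual matcher.

```python
import re

def char_ngrams(term, size=2):
    """Return the character n-grams of a lowercased term."""
    term = term.lower()
    return [term[i:i + size] for i in range(len(term) - size + 1)]

def bigram_overlap(term1, term2):
    """Dice coefficient over the sets of character bigrams of two terms."""
    ngrams1, ngrams2 = set(char_ngrams(term1)), set(char_ngrams(term2))
    if not ngrams1 or not ngrams2:
        return 0.0
    return 2 * len(ngrams1 & ngrams2) / (len(ngrams1) + len(ngrams2))

def sketch_fuzzy_matches(text, known_placenames, threshold=0.5):
    """Hypothetical stand-in for the fuzzy lookup: exact matches are skipped."""
    candidates = set(re.findall(r"\b[A-Z]\w+", text))
    matches = []
    for candidate in sorted(candidates):
        # Skip very short candidates and exact matches, as in the code above.
        if len(candidate) < 4 or candidate in known_placenames:
            continue
        similar = [name for name in known_placenames
                   if bigram_overlap(candidate, name) >= threshold]
        if similar:
            matches.append({"fuzzy_place_string": candidate,
                            "candidate_places": sorted(similar)})
    return matches

print(sketch_fuzzy_matches("domkapittel Hanburg (later Bremen)", ["Hamburg", "Bremen"]))
```

With these two known placenames, "Hanburg" shares four of its six bigrams with "Hamburg" and is therefore reported as a fuzzy match, while "Bremen" is skipped because it is an exact match.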

# Index charters

Gather paragraphs by charter, identify place name mentions and index at charter level.
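
To make the intended data shape concrete, here is a minimal in-memory sketch of the grouping step that does not need Elasticsearch. It assumes that each paragraph carries a list of charter numbers in `charter_number` (the indexing loop in the cell below iterates over that field), plus `page_num` and `merged_text`. The function name `group_paragraphs_by_charter` is hypothetical.

```python
from collections import defaultdict

def group_paragraphs_by_charter(paragraphs):
    """Group paragraph dicts by charter number (assumed to be a list per paragraph)."""
    charters = defaultdict(lambda: {"page_numbers": [], "text": []})
    for paragraph in paragraphs:
        for charter_number in paragraph["charter_number"]:
            charter = charters[charter_number]
            if paragraph["page_num"] not in charter["page_numbers"]:
                charter["page_numbers"].append(paragraph["page_num"])
            charter["text"].append(paragraph["merged_text"])
    # Return plain dicts with sorted page numbers and one merged text per charter.
    return {
        number: {"page_numbers": sorted(charter["page_numbers"]),
                 "text": " ".join(charter["text"])}
        for number, charter in charters.items()
    }

example_paragraphs = [
    {"charter_number": [1], "page_num": 3, "merged_text": "first part"},
    {"charter_number": [1], "page_num": 4, "merged_text": "second part"},
    {"charter_number": [2], "page_num": 4, "merged_text": "another charter"},
]
print(group_paragraphs_by_charter(example_paragraphs))
```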

In [257]:
import json

def get_paragraphs_by_charter(charter_num):
    query = {"query": {"match": {"charter_number": charter_num}}, "size": 10000}
    response = es.search(index="retroboeken", doc_type="ohz-paragraph", body=query)
    return [hit["_source"] for hit in response["hits"]["hits"]]

def parse_paragraphs(paragraphs):
    charter = {
        "paragraphs": [],
        "exact_place_matches": [],
        "fuzzy_place_matches": [],
        "page_numbers": [],
    }
    for paragraph in paragraphs:
        if paragraph["page_num"] not in charter["page_numbers"]:
            charter["page_numbers"] += [paragraph["page_num"]]
        charter["charter_number"] = paragraph["charter_number"]
        charter["book_number"] = paragraph["book_num"]
        charter["paragraphs"] += [
            {
                "lines": paragraph["line_texts"],
                "line_numbers": paragraph["line_numbers"],
                "book_number": paragraph["book_num"],
                "page_number": paragraph["page_num"],
                "paragraph_number": paragraph["paragraph_num"],
                "merged_text": paragraph["merged_text"],
                "type": paragraph["type"],
            }
        ]
    charter["page_numbers"].sort()
    return charter

def is_ambiguous(placename_string, is_variant_of):
    # A placename string is unambiguous only if it is a variant of exactly
    # one preferred placename (the original test was inverted).
    return len(is_variant_of[placename_string]) != 1

def create_placename_match(placename_string, charter_num, paragraph, match):
    placename_pref = list(is_variant_of[placename_string].keys())
    placename_ambiguity = "ambiguous"
    if len(placename_pref) == 1:
        placename_pref = placename_pref[0]
        placename_ambiguity = "unambiguous"
    placename_match = {
        "placename_string": placename_string,
        "placename_pref": placename_pref,
        "charter": charter_num,
        "book_number": paragraph["book_number"],
        "page_number": paragraph["page_number"], 
        "paragraph_number": paragraph["paragraph_number"], 
        "line_number": match["line_number"],
        "match_type": match["match_type"],
        "placename_ambiguity": is_ambiguous(placename_string, is_variant_of),
    }
    if match["match_type"] == "cross_line":
        placename_match["match_parts"] = match["match_parts"]
    return placename_match

def lookup_placename_in_paragraph(charter_num, paragraph, placename_string):
    placename_matches = []
    for line_index, line in enumerate(paragraph["lines"]):
        if len(line) == 0:
            continue
        match_info = lookup_placename_in_single_line(placename_string, paragraph, line_index)
        if not match_info and line[-1] == "-" and len(paragraph["lines"]) > line_index+1:
            match_info = lookup_placename_in_merged_line(placename_string, paragraph, line_index)
        if match_info:
            placename_match = create_placename_match(placename_string, charter_num, paragraph, match_info)
            placename_matches += [placename_match]
    return placename_matches

def fuzzy_lookup_place_in_paragraph(charter_num, paragraph, place_string):
    fuzzy_matches = []
    for line_index, line in enumerate(paragraph["lines"]):
        if len(line) == 0:
            continue
        for fuzzy_match in find_fuzzy_placename_matches(line, placename_index, fuzzy_matcher):
            fuzzy_match["charter"] = charter_num
            fuzzy_match["book_number"] = paragraph["book_number"]
            fuzzy_match["page_number"] = paragraph["page_number"]
            fuzzy_match["paragraph_number"] = paragraph["paragraph_number"]
            fuzzy_match["match_type"] = "single_line"
            fuzzy_match["line_number"] = paragraph["line_numbers"][line_index]
            fuzzy_matches += [fuzzy_match]
    return fuzzy_matches

def term_in_text(term, text):
    if term not in text:
        return False
    # Accept the term only when it is delimited by non-word characters or by
    # the start or end of the text, so it is not part of a larger word.
    # This also covers a text that consists of the term alone.
    pattern = r"(?:^|\W)" + to_regex_string(term) + r"(?:\W|$)"
    return re.search(pattern, text) is not None
    
def to_regex_string(string):
    return string.replace("[", r"\[").replace("]", r"\]").replace("(", r"\(").replace(")", r"\)").replace("*", r"\*")

def lookup_placename_in_single_line(place_string, paragraph, line_index):
    if term_in_text(place_string, paragraph["lines"][line_index]):
        return {
            "match_type": "single_line", 
            "line_number": paragraph["line_numbers"][line_index], 
        }
    else:
        return None

def lookup_placename_in_merged_line(place_string, paragraph, line_index):
    if term_in_text(place_string, paragraph["lines"][line_index+1]):
        return None
    merged_line = paragraph["lines"][line_index][:-1] + paragraph["lines"][line_index+1].strip()
    if term_in_text(place_string, merged_line):
        offset = merged_line.index(place_string)
        curr_line_part = paragraph["lines"][line_index][offset:]
        next_line_part = place_string[(len(curr_line_part)-1):]
        return {
            "match_type": "cross_line", 
            "line_number": [paragraph["line_numbers"][line_index], paragraph["line_numbers"][line_index+1]], 
            "match_parts": [curr_line_part, next_line_part]
        }
    else:
        return None

def add_placename_matches(charter, charter_text, is_variant_of):
    # Iterate over all placenames in the list and scan the charter for each of these.
    for place_string in is_variant_of:
        # Skip the small number of very short placenames, like A and Le, as they
        # result in enormous numbers of mostly incorrect matches.
        if len(place_string) < 3:
            continue
        # Check if the placename occurs in the overall charter text as string.
        # If not, skip further analysis. If so, check that it occurs as single
        # or multiterm mention and not as part of a larger term.
        if place_string in charter_text:
            for paragraph in charter["paragraphs"]:
                exact_matches = lookup_placename_in_paragraph(charter["charter_number"], paragraph, place_string)
                charter["exact_place_matches"] += exact_matches
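    for paragraph in charter["paragraphs"]:
        # The fuzzy lookup scans all capitalised words and ignores its third argument,
        # so pass None instead of the leftover loop variable place_string.
        fuzzy_matches = fuzzy_lookup_place_in_paragraph(charter["charter_number"], paragraph, None)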
        charter["fuzzy_place_matches"] += fuzzy_matches

# Some placename mentions are incomplete, where bits of the text are missing.
# In these cases, the missing part is replaced by [...] with the number of
# dots indicating the estimated number of missing characters.
is_incomplete_variant_of = {place: is_variant_of[place] for place in is_variant_of if re.search(r"\[.*\]",place)}
# Other placenames have no square brackets, suggesting they are complete.
is_complete_variant_of = {place: is_variant_of[place] for place in is_variant_of if not re.search(r"\[.*\]",place)}


# remove index to get rid of incorrect charters from previous iterations
if es.indices.exists(index='ohz-test'):
    es.indices.delete(index='ohz-test')

for charter_num in range(1,3538):
    paragraphs = get_paragraphs_by_charter(charter_num)
    if len(paragraphs) == 0:
        print("No paragraphs for charter", charter_num)
        continue
    charter = parse_paragraphs(paragraphs)
    charter_text = " ".join([paragraph["merged_text"] for paragraph in charter["paragraphs"]])
    # First, look for incomplete placenames in original text
    add_placename_matches(charter, charter_text, is_incomplete_variant_of)
    # Remove all square brackets from the charter text, which indicate uncertainty or missing text.
    charter_text = charter_text.replace("[","").replace("]","")
    # Second, look for complete placenames in modified charter text.
    add_placename_matches(charter, charter_text, is_complete_variant_of)
    for charter_number in charter["charter_number"]:
        es.index(index="ohz-test", doc_type="ohz-charter", id=charter_number, body=charter)
    if charter_num % 50 == 0:
        print(charter_num)
        #print(json.dumps(charter, indent=4))


50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950
1000
1050
1100
1150
1200
1250
No paragraphs for charter 1258
No paragraphs for charter 1270
1300
1350
1400
1450
No paragraphs for charter 1471
1500
1550
1600
1650
1700
1750
1800
1850
1900
1950
2000
2050
No paragraphs for charter 2052
2100
2150
2200
2250
2300
2350
2400
2450
2500
2550
No paragraphs for charter 2581
No paragraphs for charter 2584
No paragraphs for charter 2598
No paragraphs for charter 2599
No paragraphs for charter 2600
No paragraphs for charter 2601
No paragraphs for charter 2602
No paragraphs for charter 2603
No paragraphs for charter 2604
No paragraphs for charter 2605
No paragraphs for charter 2606
No paragraphs for charter 2607
No paragraphs for charter 2608
No paragraphs for charter 2629
No paragraphs for charter 2630
No paragraphs for charter 2639
No paragraphs for charter 2640
No paragraphs for charter 2644
2650
No paragraphs for charter 2654
No paragraphs for charter 2655
No paragraphs fo

## Explicating Charter Dates

Charter titles contain a date, a charter number and optionally a location. Charter numbers have already been identified. The next step is to interpret the date strings so that they can be queried numerically.

Issues to deal with:

- Certain dates are ranges, e.g. "726 okt. 21 - 727 mei 13" (charter 2 OHZ 1 p. 2), so they have a start date and an end date
- Certain dates are expressed non-numerically, e.g. "Eind 8e eeuw? - 817" (charter 7, OHZ 1 p. 12)
- Certain dates are estimates, e.g. "c. 2de helft 9e eeuw" (charter 23, OHZ 1 p. ) or "Waarschijnlijk 941 nov. 28-dec.24"
- Certain dates are conjectures, indicated by square brackets ("Tussen rechte haken staan de dateringselementen die conjecturaal zijn.", i.e. conjectural date elements are placed between square brackets, OHZ 1, p. XV).
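
As a small illustration of the first and last issue, the sketch below parses a single date or a full date range into a start and end date and flags square brackets as uncertainty. It is a simplified stand-in, not the parser defined in the cell below: the name `parse_simple_date` and the returned keys are assumptions for this example, and it deliberately ignores estimates, centuries and OCR confusions.

```python
import re
from datetime import date

MONTHS = ["jan.", "feb.", "mrt.", "apr.", "mei", "juni", "juli",
          "aug.", "sept.", "okt.", "nov.", "dec."]
MONTH_PATTERN = "|".join(re.escape(month) for month in MONTHS)
# year, month abbreviation, day (three capture groups)
DATE_PATTERN = r"(\d{3,4}) (" + MONTH_PATTERN + r") (\d{1,2})"

def parse_simple_date(title_date):
    """Parse 'year month day' or 'year month day - year month day'."""
    # Square brackets mark conjectural elements, so flag and remove them.
    certainty = "uncertain" if "[" in title_date or "]" in title_date else "certain"
    cleaned = title_date.replace("[", "").replace("]", "").lower().strip()
    match = re.fullmatch(DATE_PATTERN + r"(?: ?- ?" + DATE_PATTERN + r")?", cleaned)
    if not match:
        return None
    start = date(int(match.group(1)), MONTHS.index(match.group(2)) + 1, int(match.group(3)))
    end = start
    if match.group(4):
        end = date(int(match.group(4)), MONTHS.index(match.group(5)) + 1, int(match.group(6)))
    return {"start_date": start, "end_date": end, "date_certainty": certainty}

print(parse_simple_date("726 okt. 21 - 727 mei 13"))
print(parse_simple_date("98[5] juni 26"))
```

A single date yields identical start and end dates, which keeps range queries uniform across all charters.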

Choices to make:

- Select a specific century for century-based estimates?
- Select a sort date (e.g. a single date on which all charters can be sorted)?
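
One possible answer to both questions, shown here only as a sketch and not as the decision taken in this notebook, is to map a century-based estimate to a year range and to sort on the first year of that range. The function name `century_year_range` and the convention that the 9th century runs from 801 to 900 are assumptions of this example. Qualifiers such as "Eind" are not handled.

```python
import re

def century_year_range(date_string):
    """Map e.g. '2de helft 9e eeuw' to a start, end and sort year."""
    match = re.search(r"(?:(1e|1ste|2e|2de) helft )?(\d{1,2})e eeuw", date_string.lower())
    if not match:
        return None
    half, century = match.group(1), int(match.group(2))
    # Assumption: the n-th century runs from (n-1)*100 + 1 up to n*100.
    start_year, end_year = (century - 1) * 100 + 1, century * 100
    if half in ("1e", "1ste"):
        end_year = start_year + 49
    elif half in ("2e", "2de"):
        start_year = start_year + 50
    # Assumption: sort on the earliest possible year.
    return {"start_year": start_year, "end_year": end_year, "sort_year": start_year}

print(century_year_range("c. 2de helft 9e eeuw"))
```

Sorting on the start year places estimated charters before exactly dated charters from the same period. Sorting on the midpoint or the end year would be equally defensible.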


In [35]:
# check assumptions:
# - charter date and charter number are always separated by at least 4 whitespaces
# - exceptions:
#   - charters 82 and 83
#   - charters 139 and 140
#   - charters 849 and 850
#   - charters 2535 and 2536
#   - charter 1906 (printed together with 1903 even though 1903 is separately listed earlier, see OHZ 4 p. 84 and p. 89)

# Check data:
# - charter date string is never more than 36 characters per line
# - year constraints: between 719 and 1299
#   - part 1: between 719 and 1222
#   - part 2: between 1222 and 1256
#   - part 3: between 1256 and 1278
#   - part 4: between 1278 and 1291
#   - part 5: between 1291 and 1299

from datetime import date as to_date

year_constraints = {
    1: {"min": 719, "max": 1222},
    2: {"min": 1222, "max": 1256},
    3: {"min": 1256, "max": 1278},
    4: {"min": 1278, "max": 1291},
    5: {"min": 1291, "max": 1299},
}

# There are a few charters that have complex titles that deviate 
# strongly from the typical charter title format.
# Instead of writing a large chunk of code to capture these exceptions, 
# I've chosen to write out the proper date information for them.
exceptions = {
    82: {
        "sort_year": 1057, "start_year": 1057, "end_year": 1057, "year_type": "exact",
        "sort_date": "1057-10-30", "start_date": "1057-10-30", "end_date": "1057-10-30",
        "date_certainty": "certain", "date_specifity": "specific_date",
    },
    83: {
        "sort_year": 1058, "start_year": 1058, "end_year": 1058, "year_type": "exact",
        "sort_date": "1058-06-25", "start_date": "1058-06-25", "end_date": "1058-06-25",
        "date_certainty": "certain", "date_specifity": "specific_date",
    },
    139: {
        "sort_year": 1156, "start_year": 1156, "end_year": 1156, "year_type": "exact",
        "sort_date": "1156-01-01", "start_date": "1156-01-01", "end_date": "1156-01-01",
        "date_certainty": "certain", "date_specifity": "specific_year",
    },
    140: {
        "sort_year": 1156, "start_year": 1156, "end_year": 1156, "year_type": "exact",
        "sort_date": "1156-01-01", "start_date": "1156-01-01", "end_date": "1156-01-01",
        "date_certainty": "certain", "date_specifity": "specific_year_month",
    },
    473: {
        "sort_year": 1226, "start_year": 1226, "end_year": 1227, "year_type": "exact",
        "sort_date": "1226-01-01", "start_date": "1226-06-19", "end_date": "1226-01-01",
        "date_certainty": "uncertain", "date_specifity": "circa_date",
    },
    849: {
        "sort_year": 1250, "start_year": 1250, "end_year": 1250, "year_type": "exact",
        "sort_date": "1250-05-19", "start_date": "1250-05-19", "end_date": "1250-05-19",
        "date_certainty": "certain", "date_specifity": "specific_date",
    },
    850: {
        "sort_year": 1250, "start_year": 1250, "end_year": 1250, "year_type": "exact",
        "sort_date": "1250-05-19", "start_date": "1250-05-19", "end_date": "1250-05-19",
        "date_certainty": "certain", "date_specifity": "specific_date",
    },
    1906: {
        "sort_year": 1280, "start_year": 1280, "end_year": 1280, "year_type": "exact",
        "sort_date": "1280-05-13", "start_date": "1280-05-13", "end_date": "1280-05-13",
        "date_certainty": "certain", "date_specifity": "specific_date",
    },
    2535: {
        "sort_year": 1290, "start_year": 1290, "end_year": 1290, "year_type": "exact",
        "sort_date": "1290-11-13", "start_date": "1290-11-13", "end_date": "1290-11-13",
        "date_certainty": "false", "date_specifity": "specific_date",
    },
    2536: {
        "sort_year": 1290, "start_year": 1290, "end_year": 1290, "year_type": "exact",
        "sort_date": "1290-11-13", "start_date": "1290-11-13", "end_date": "1290-11-13",
        "date_certainty": "false", "date_specifity": "specific_date",
    },
}

full_date_pattern = r"\b(\d{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) (\d+)\b"

def get_charter(charter_num):
    if es.exists(index="ohz", doc_type="ohz-charter", id=charter_num):
        response = es.get(index="ohz", doc_type="ohz-charter", id=charter_num)
        return response["_source"]
    else:
        return None
    
# Make sure that month references always have the same format.
def normalize_months(date_string):
    date_string = date_string.replace("januari", "jan")
    date_string = date_string.replace("februari", "feb")
    date_string = date_string.replace("febr", "feb")
    date_string = date_string.replace("maart", "mrt")
    date_string = date_string.replace("april", "apr")
    date_string = date_string.replace("augustus", "aug")
    date_string = date_string.replace("september", "sept")
    date_string = date_string.replace("oktober", "okt")
    date_string = date_string.replace("november", "nov")
    date_string = date_string.replace("december", "dec")
    date_string = re.sub(r"\b(jan|feb|mrt|apr|aug|sept|okt|nov|dec):", r"\1.", date_string)
    date_string = re.sub(r"\b(jan|feb|mrt|apr|aug|sept|okt|nov|dec) ", r"\1. ", date_string)
    date_string = re.sub(r"\b(jan|feb|mrt|apr|aug|sept|okt|nov|dec)$", r"\1.", date_string)
    return date_string

# Map month name to month number
def map_month(month_string):
    months = ["jan.", "feb.", "mrt.", "apr.", "mei", "juni", "juli", "aug.", "sept.", "okt.", "nov.", "dec."]
    if month_string == "febr.":
        month_string = "feb."
    return months.index(month_string) + 1

# Dates examples:
# - [Waarschijnlijk 983] jan. 9
# - 98[5] juni 26
# - 985 aug. 25
# - [981-985] okt. 2 
def is_specific_date(date_string):
    match = re.match(r"^([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) ([0-9gioz]{1,2})$", date_string)
    if not match:
        return False
    year = int(deconfuse(match.group(1)))
    month = map_month(match.group(2))
    day = int(deconfuse(match.group(3)))
    if year < MINIMUM_YEAR or year > MAXIMUM_YEAR:
        print("DATE ERROR year:", [year, month, day], date_string)
        return False
    if day > 31:
        print("DATE ERROR day:", [year, month, day], date_string)
        return False
    return (year, month, day)
    
def has_specific_year_and_date_range(date_string):
    match = re.match(r"^([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) ([0-9gioz]{1,2})(-|\?|-\?| \?|\? | \? )(jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) ([0-9gioz]{1,2})$", date_string)
    if not match:
        return False
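    # The pattern allows OCR confusions (g, i, o, z) in numbers, so deconfuse before int().
    start_year = int(deconfuse(match.group(1)))
    start_month = map_month(match.group(2))
    start_day = int(deconfuse(match.group(3)))
    end_year = int(deconfuse(match.group(1)))
    end_month = map_month(match.group(5))
    end_day = int(deconfuse(match.group(6)))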
    if start_year < MINIMUM_YEAR or start_year > MAXIMUM_YEAR:
        print("DATE ERROR start_year:", [start_year], date_string)
        return False
    if end_year < MINIMUM_YEAR or end_year > MAXIMUM_YEAR:
        print("DATE ERROR end_year:", [end_year], date_string)
        return False
    if start_day > 31:
        print("DATE ERROR start_day:", [start_year, start_month, start_day], date_string)
        return False
    if end_day > 31:
        print("DATE ERROR end_day:", [end_year, end_month, end_day], date_string)
        return False
    return (start_year, start_month, start_day, end_year, end_month, end_day)
    
def has_specific_year_month_and_day_range(date_string):
    match = re.match(r"^([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) ([0-9gioz]{1,2})(-|\?|-\?| \?|\? | \? )([0-9gioz]{1,2})$", date_string)
    if not match:
        return False
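    # The pattern allows OCR confusions (g, i, o, z) in numbers, so deconfuse before int().
    start_year = int(deconfuse(match.group(1)))
    start_month = map_month(match.group(2))
    start_day = int(deconfuse(match.group(3)))
    end_year = int(deconfuse(match.group(1)))
    end_month = map_month(match.group(2))
    end_day = int(deconfuse(match.group(5)))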
    if start_year < MINIMUM_YEAR or start_year > MAXIMUM_YEAR:
        print("DATE ERROR start_year:", [start_year], date_string)
        return False
    if end_year < MINIMUM_YEAR or end_year > MAXIMUM_YEAR:
        print("DATE ERROR end_year:", [end_year], date_string)
        return False
    if start_day > 31:
        print("DATE ERROR start_day:", [start_year, start_month, start_day], date_string)
        return False
    if end_day > 31:
        print("DATE ERROR end_day:", [end_year, end_month, end_day], date_string)
        return False
    return (start_year, start_month, start_day, end_year, end_month, end_day)

def has_year_range_and_specific_date(date_string):
    match = re.match(r"^([0-9gioz]{3,4})(-|\?|-\?| \?|\? | \? )([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) ([0-9gioz]{1,2})$", date_string)
    if not match:
        return False
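    # The pattern allows OCR confusions (g, i, o, z) in numbers, so deconfuse before int().
    start_year = int(deconfuse(match.group(1)))
    start_month = map_month(match.group(4))
    start_day = int(deconfuse(match.group(5)))
    end_year = int(deconfuse(match.group(3)))
    end_month = map_month(match.group(4))
    end_day = int(deconfuse(match.group(5)))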
    if start_year < MINIMUM_YEAR or start_year > MAXIMUM_YEAR:
        print("DATE ERROR start_year:", [start_year], date_string)
        return False
    if end_year < MINIMUM_YEAR or end_year > MAXIMUM_YEAR:
        print("DATE ERROR end_year:", [end_year], date_string)
        return False
    if start_day > 31:
        print("DATE ERROR start_day:", [start_year, start_month, start_day], date_string)
        return False
    if end_day > 31:
        print("DATE ERROR end_day:", [end_year, end_month, end_day], date_string)
        return False
    return (start_year, start_month, start_day, end_year, end_month, end_day)
    
def has_full_date_range(date_string):
    match = re.match(r"^([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) ([0-9gioz]{1,2})(-|\?|-\?| \?|\? | \? )([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) ([0-9gioz]{1,2})$", date_string)
    if not match:
        return False
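    # The pattern allows OCR confusions (g, i, o, z) in numbers, so deconfuse before int().
    start_year = int(deconfuse(match.group(1)))
    start_month = map_month(match.group(2))
    start_day = int(deconfuse(match.group(3)))
    end_year = int(deconfuse(match.group(5)))
    end_month = map_month(match.group(6))
    end_day = int(deconfuse(match.group(7)))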
    if start_year < MINIMUM_YEAR or start_year > MAXIMUM_YEAR:
        print("DATE ERROR start_year:", [start_year], date_string)
        return False
    if end_year < MINIMUM_YEAR or end_year > MAXIMUM_YEAR:
        print("DATE ERROR end_year:", [end_year], date_string)
        return False
    if start_day > 31:
        print("DATE ERROR start_day:", [start_year, start_month, start_day], date_string)
        return False
    if end_day > 31:
        print("DATE ERROR end_day:", [end_year, end_month, end_day], date_string)
        return False
    return (start_year, start_month, start_day, end_year, end_month, end_day)

def has_year_month_and_full_date(date_string):
    match = re.match(r"^([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.)(-|\?|-\?| \?|\? | \? )([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) ([0-9gioz]{1,2})$", date_string)
    if not match:
        return False
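    # The pattern allows OCR confusions (g, i, o, z) in numbers, so deconfuse before int().
    start_year = int(deconfuse(match.group(1)))
    start_month = map_month(match.group(2))
    end_year = int(deconfuse(match.group(4)))
    end_month = map_month(match.group(5))
    end_day = int(deconfuse(match.group(6)))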
    if start_year < MINIMUM_YEAR or start_year > MAXIMUM_YEAR:
        print("DATE ERROR start_year:", [start_year], date_string)
        return False
    if end_year < MINIMUM_YEAR or end_year > MAXIMUM_YEAR:
        print("DATE ERROR end_year:", [end_year], date_string)
        return False
    if end_day > 31:
        print("DATE ERROR end_day:", [end_year, end_month, end_day], date_string)
        return False
    return (start_year, start_month, end_year, end_month, end_day)

def has_year_month_year_month(date_string):
    match = re.match(r"^([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.)(-|\?|-\?| \?|\? | \? )([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.)$", date_string)
    if not match:
        return False
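    # The pattern allows OCR confusions (g, i, o, z) in numbers, so deconfuse before int().
    start_year = int(deconfuse(match.group(1)))
    start_month = map_month(match.group(2))
    end_year = int(deconfuse(match.group(4)))
    end_month = map_month(match.group(5))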
    if start_year < MINIMUM_YEAR or start_year > MAXIMUM_YEAR:
        print("DATE ERROR start_year:", [start_year], date_string)
        return False
    if end_year < MINIMUM_YEAR or end_year > MAXIMUM_YEAR:
        print("DATE ERROR end_year:", [end_year], date_string)
        return False
    return (start_year, start_month, end_year, end_month)

def has_year_month_month_day(date_string):
    match = re.match(r"^([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.)(-|\?|-\?| \?|\? | \? )(jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.) ([0-9gioz]{1,2})$", date_string)
    if not match:
        return False
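    # The pattern allows OCR confusions (g, i, o, z) in numbers, so deconfuse before int().
    start_year = int(deconfuse(match.group(1)))
    start_month = map_month(match.group(2))
    end_year = start_year
    end_month = map_month(match.group(4))
    end_day = int(deconfuse(match.group(5)))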
    if start_year < MINIMUM_YEAR or start_year > MAXIMUM_YEAR:
        print("DATE ERROR start_year:", [start_year], date_string)
        return False
    if end_year < MINIMUM_YEAR or end_year > MAXIMUM_YEAR:
        print("DATE ERROR end_year:", [end_year], date_string)
        return False
    if end_day > 31:
        print("DATE ERROR end_day:", [end_year, end_month, end_day], date_string)
        return False
    return (start_year, start_month, end_year, end_month, end_day)

# year circa indicator month day

def has_year_month(date_string):
    match = re.match(r"^([0-9gioz]{3,4}) (jan\.|feb\.|febr\.|mrt\.|apr\.|mei|juni|juli|aug\.|sept\.|okt\.|nov\.|dec\.)$", date_string)
    if not match:
        return False
    # The pattern allows OCR confusions (g, i, o, z) in numbers, so deconfuse before int().
    year = int(deconfuse(match.group(1)))
    month = map_month(match.group(2))
    if year < MINIMUM_YEAR or year > MAXIMUM_YEAR:
        print("DATE ERROR year:", [year, month], date_string)
        return False
    return (year, month)
    
def has_specific_year(date_string):
    match = re.match(r"^([0-9gioz]{3,4})$", deconfuse(date_string))
    if not match:
        return False
    year = int(match.group(1))
    if year < MINIMUM_YEAR or year > MAXIMUM_YEAR:
        print("DATE ERROR year:", [year], date_string)
        return False
    return year
    
def has_year_range(date_string):
    match = re.match(r"^([0-9gioz]{3,4})-([0-9gioz]{3,4})$", deconfuse(date_string))
    if not match:
        return False
    start_year = int(match.group(1))
    end_year = int(match.group(2))
    if start_year < MINIMUM_YEAR or start_year > MAXIMUM_YEAR:
        print("DATE ERROR start_year:", [start_year], date_string)
        return False
    if end_year < MINIMUM_YEAR or end_year > MAXIMUM_YEAR:
        print("DATE ERROR end_year:", [end_year], date_string)
        return False
    return start_year, end_year
    
def make_proper_date(date_tuple):
    """
    date_tuple should have three integers for year, month and day.
    If there's only a year and month, assume the first day of the month.
    If there's only a year, assume the first day of January.
    """
    if len(date_tuple) == 1:
        return to_date(date_tuple[0], 1, 1)
    if len(date_tuple) == 2:
        return to_date(date_tuple[0], date_tuple[1], 1)
    else:
        return to_date(date_tuple[0], date_tuple[1], date_tuple[2])

def make_year_info(sort_year, end_year=None):
    if not end_year:
        end_year = sort_year
    return {"sort_year": sort_year, "start_year": sort_year, "end_year": end_year}

def parse_date_string(date_string, charter_num):
    date_string = date_string.lower()
    date_type = "exact"
    date_certainty = "certain"
    date_specificity = "specific"
    date_modifier = ""
    date_string = date_string.replace(";", "]")
    date_string = date_string.replace("v??r", "voor")
    # remove initial single digit, e.g. "1   1299 okt. 12"
    if re.match(r"[0-9gioz] {2,}[0-9gioz]", date_string):
        date_string = re.sub(r"^[0-9gioz] {2,}", "", date_string)
    # remove comma after year, e.g. "1299, "
    if re.search(r"[0-9gioz]{3,4},", date_string):
        date_string = re.sub(r"([0-9gioz]{3,4}),", r"\1", date_string)
    # remove falseness indicator
    if "<" in date_string and ">" in date_string:
        date_string = date_string.replace("<","").replace(">","")
        date_certainty = "unreliable"
    # remove uncertainty indicator
    if "[" in date_string and "]" in date_string:
        date_string = date_string.replace("[","").replace("]","")
        date_certainty = "uncertain"
    # remove uncertainty indicator
    if date_string[0] == "[" and date_string[-1] == "]":
        date_certainty = "uncertain"
        date_string = date_string[1:-1]
    if date_string[0] == "[":
        date_string = date_string[1:]
        date_certainty = "uncertain"
    # remove uncertainty terms, e.g. "vermoedelijk|verm.|waarschijnlijk|ws."
    # The dots in the abbreviations are escaped so they only match a literal full stop.
    if re.search(r"\b(vermoedelijk|verm\.|waarschijnlijk|ws\.) ", date_string):
        date_string = re.sub(r"\b(vermoedelijk|verm\.|waarschijnlijk|ws\.) ", "", date_string)
        date_certainty = "uncertain"
    if re.search(r" (vermoedelijk|verm\.|waarschijnlijk|ws\.)\b", date_string):
        date_string = re.sub(r" (vermoedelijk|verm\.|waarschijnlijk|ws\.)\b", "", date_string)
        date_certainty = "uncertain"
    # remove circa indicators, e.g. "c.|ca.|circa"
    if re.search(r"\b(c\.|ca\.|circa) ", date_string):
        date_string = re.sub(r"\b(c\.|ca\.|circa) ", "", date_string)
        date_modifier = "circa"
    # remove earlier/later indicators
    if re.search(r"\b(enige tijd voor|kort voor|uiterlijk|voor) ", date_string):
        date_string = re.sub(r"\b(enige tijd voor|kort voor|uiterlijk|voor) ", "", date_string)
        date_modifier = "earlier"
    if re.search(r"\b(kort na|na) ", date_string):
        date_string = re.sub(r"\b(kort na|na) ", "", date_string)
        date_modifier = "later"
    if re.search(r" of kort daarna", date_string):
        date_string = re.sub(r" of kort daarna", "", date_string)
        date_modifier = "or_later"
    if re.search(r" of kort daarvoor", date_string):
        date_string = re.sub(r" of kort daarvoor", "", date_string)
        date_modifier = "or_earlier"
    # replace or indicator with start-end indicator
    if re.search(r" of ", date_string):
        date_string = re.sub(r" of ", "-", date_string)
        if len(date_modifier) > 0:
            date_modifier += "_or_dates"
        else:
            date_modifier = "or_dates"
    date_string = normalize_months(date_string)
    if is_specific_date(date_string):
        date_tuple = is_specific_date(date_string)
        year_info = make_year_info(date_tuple[0])
        year_info["sort_date"] = make_proper_date(date_tuple)
        year_info["start_date"] = make_proper_date(date_tuple)
        year_info["end_date"] = make_proper_date(date_tuple)
        year_info["date_specificity"] = "specific_date"
    elif has_full_date_range(date_string):
        date_tuple = has_full_date_range(date_string)
        year_info = make_year_info(date_tuple[0], date_tuple[3])
        year_info["sort_date"] = make_proper_date(date_tuple[:3])
        year_info["start_date"] = make_proper_date(date_tuple[:3])
        year_info["end_date"] = make_proper_date(date_tuple[3:])
        year_info["date_specificity"] = "range_date"
    elif has_specific_year_and_date_range(date_string):
        date_tuple = has_specific_year_and_date_range(date_string)
        year_info = make_year_info(date_tuple[0], date_tuple[3])
        year_info["sort_date"] = make_proper_date(date_tuple[:3])
        year_info["start_date"] = make_proper_date(date_tuple[:3])
        year_info["end_date"] = make_proper_date(date_tuple[3:])
        year_info["date_specificity"] = "specific_year_range_month_day"
    elif has_specific_year_month_and_day_range(date_string):
        date_tuple = has_specific_year_month_and_day_range(date_string)
        year_info = make_year_info(date_tuple[0], date_tuple[3])
        year_info["sort_date"] = make_proper_date(date_tuple[:3])
        year_info["start_date"] = make_proper_date(date_tuple[:3])
        year_info["end_date"] = make_proper_date(date_tuple[3:])
        year_info["date_specificity"] = "specific_year_month_range_day"
    elif has_year_range_and_specific_date(date_string):
        date_tuple = has_year_range_and_specific_date(date_string)
        year_info = make_year_info(date_tuple[0], date_tuple[3])
        year_info["sort_date"] = make_proper_date(date_tuple[:3])
        year_info["start_date"] = make_proper_date(date_tuple[:3])
        year_info["end_date"] = make_proper_date(date_tuple[3:])
        year_info["date_specificity"] = "range_year_specific_month_day"
    # year month - year month day
    elif has_year_month_and_full_date(date_string):
        date_tuple = has_year_month_and_full_date(date_string)
        year_info = make_year_info(date_tuple[0], date_tuple[2])
        year_info["sort_date"] = make_proper_date(date_tuple[:2])
        year_info["start_date"] = make_proper_date(date_tuple[:2])
        year_info["end_date"] = make_proper_date(date_tuple[2:])
        year_info["date_specificity"] = "specific_year_month_range_date"
    # year month - year month
    elif has_year_month_year_month(date_string):
        date_tuple = has_year_month_year_month(date_string)
        year_info = make_year_info(date_tuple[0], date_tuple[2])
        year_info["sort_date"] = make_proper_date(date_tuple[:2])
        year_info["start_date"] = make_proper_date(date_tuple[:2])
        year_info["end_date"] = make_proper_date(date_tuple[2:])
        year_info["date_specificity"] = "range_year_month"
    # year month - month day
    elif has_year_month_month_day(date_string):
        date_tuple = has_year_month_month_day(date_string)
        year_info = make_year_info(date_tuple[0], date_tuple[2])
        year_info["sort_date"] = make_proper_date(date_tuple[:2])
        year_info["start_date"] = make_proper_date(date_tuple[:2])
        year_info["end_date"] = make_proper_date(date_tuple[2:])
        year_info["date_specificity"] = "specific_year_range_month_day"
    elif has_year_month(date_string):
        date_tuple = has_year_month(date_string)
        year_info = make_year_info(date_tuple[0])
        year_info["sort_date"] = make_proper_date(date_tuple[:2])
        year_info["start_date"] = make_proper_date(date_tuple[:2])
        year_info["end_date"] = make_proper_date(date_tuple[:2])
        year_info["date_specificity"] = "specific_year_month"
    elif has_specific_year(date_string):
        year = has_specific_year(date_string)
        year_info = make_year_info(year)
        year_info["sort_date"] = make_proper_date([year])
        year_info["start_date"] = make_proper_date([year])
        year_info["end_date"] = make_proper_date([year])
        year_info["date_specificity"] = "specific_year"
    elif has_year_range(date_string):
        start_year, end_year = has_year_range(date_string)
        year_info = make_year_info(start_year, end_year)
        year_info["sort_date"] = make_proper_date([start_year])
        year_info["start_date"] = make_proper_date([start_year])
        year_info["end_date"] = make_proper_date([end_year])
        year_info["date_specificity"] = "range_year"
    else:
        print("\t", charter_num, "\t", date_string)
        return None
    year_info["date_certainty"] = date_certainty
    if "date_specificity" not in year_info:
        # every branch above should have set date_specificity
        raise KeyError("Year info has no date_specificity:", year_info, date_string)
    year_info["date_string"] = date_string
    year_info["date_modifier"] = date_modifier
    return year_info

def make_unknown_date_info(date_string):
    return {
        "sort_year": None, "start_year": None, "end_year": None,
        "sort_date": None, "start_date": None, "end_date": None,
        "date_specificity": "unknown", "date_certainty": "unknown", 
        "date_modifier": "unknown", "date_string": date_string,
    }

def get_title_paragraph(charter):
    for paragraph in charter["paragraphs"]:
        if paragraph["type"] == "charter_title":
            return paragraph
    return None

def get_date_string(charter):
    # the date of the charter is in the title paragraph
    title_paragraph = get_title_paragraph(charter)
    if not title_paragraph:
        return ""
    first_line = title_paragraph["lines"][0]
    parts = re.split(" {4,}", first_line.strip())
    date_string = parts[0]
    if len(title_paragraph["lines"]) > 1:
        #print("before: #{d}#".format(d=date_string))
        for line in title_paragraph["lines"][1:]:
            # remove anything that's not floating left
            # e.g. "                          van Saint-Aubert te Kamerijk" in charter 3406
            extra_date_string = re.sub(r" {6,60}.*", "", line)
            if len(extra_date_string) > 0:
                date_string += " " + extra_date_string[:40].strip()
                date_string = date_string.strip()
    return date_string

MINIMUM_YEAR = 719
MAXIMUM_YEAR = 1299

date_found = 0
date_not_found = 0
date_skipped = 0
charter_date = {}
for charter_num in range(1,3537):
    charter = get_charter(charter_num)
    if not charter:
        charter_date[charter_num] = make_unknown_date_info("")
        date_skipped += 1
        continue
    #print("parsing charter", charter_num)
    date_string = get_date_string(charter)
    if charter_num in exceptions:
        charter["date_info"] = exceptions[charter_num]
        title_paragraph = get_title_paragraph(charter)
        charter["date_info"]["date_string"] = "\n".join(title_paragraph["lines"])
        date_found += 1
    elif charter_num in missing_numbers:
        charter["date_info"] = make_unknown_date_info(date_string)
        charter["date_info"]["date_string"] = ""
        date_not_found += 1
        print("MISSING CHARTER:", charter_num, date_string)
    else:
        charter["date_info"] = parse_date_string(copy.copy(date_string), charter_num)
        if charter["date_info"]:
            charter["date_info"]["date_string"] = date_string
            date_found += 1
        else:
            charter["date_info"] = make_unknown_date_info(date_string)
            date_not_found += 1
    es.index(index="ohz", doc_type="ohz-charter", id=charter_num, body=charter)
    charter_date[charter_num] = charter["date_info"]
    if charter_num % 100 == 0:
        print("charter:", charter_num, "\tdates found:", date_found, "\tdates not found:", date_not_found, "\tdates skipped:", date_skipped)

print("charter:", charter_num, "\tdates found:", date_found, "\tdates not found:", date_not_found, "\tdates skipped:", date_skipped)

MISSING CHARTER: 1 
	 7 	 eind 8e eeuw ?-? 817
	 8 	 eind 8e eeuw? ? 817
	 16 	 822-ec. 825
	 17 	 825 ? 842
	 18 	 855 nov. 7 en 10
	 21 	 889 aug. 7 4
	 23 	 ze helft ge eeuw
	 24 	 ze helft ge eeuw
	 25 	 2e helft ge eeuw
	 39 	 967? mrt. 28
	 45 	 965-(977?) 979 okt. 2-6
	 48 	 980 mei-aug.
	 52 	 misschien 981 982 mrt. 5
	 65 	 995-1012 eind 10e e. sept. 30
	 77 	 1037 aug. 5-kort voordien
	 98 	 tiii apr. 13-1123 jan. 29 1113
charter: 100 	dates found: 83 	dates not found: 17 	dates skipped: 0
	 121 	 1143 okt. 7-later
	 122 	 1122 (1130 sept. 7?)- begin 1145
	 125 	 1147 , nov. 30-1148 feb. 17
	 126 	 1140 juni 2-begin 1150
	 128 	 1142-1147 ?. mei 15-1148-1152 feb.?
	 137 	 1156 mrt. 9-begin mei
	 150 	 1161 junig
	 164 	 1169 begin-1172 eind
	 174 	 1176 midden dec.-1177 begin mrt.
	 191 	 1178 apr. 9-kort nadien
	 193 	 ?. 1172-1178
	 195 	 handeling: 1176 juli 29 beoorkonding: 1179 apr.
	 197 	 1162-1179 aug.
charter: 200 	dates found: 170 	dates not found: 30 	dates skipped

	 3183 	 1296 aug. eind-sept. 20
	 3184 	 1296 aug. einde-sept. 20
charter: 3200 	dates found: 2973 	dates not found: 181 	dates skipped: 46
	 3265 	 1297 mrt. $
	 3272 	 1297 jan. eind-apr. 15
	 3282 	 1297 apr. 15-16-mei 20
charter: 3300 	dates found: 3068 	dates not found: 184 	dates skipped: 48
	 3338 	 1297 sept. eind-okt. 18
	 3339 	 1297 sept. eind-okt. 18
	 3350 	 enige tijd 1297 okt. 18
	 3367 	 1297 mogelijk okt.-nov.
	 3368 	 1297 mogelijk okt.-nov.
	 3376 	 1298 mrt. 3-kort voordien
	 3377 	 1w  1298 mrt. 3
charter: 3400 	dates found: 3158 	dates not found: 191 	dates skipped: 51
	 3431 	 1296 juni 28-1299 nov. 10 1298 nov. 11
	 3492 	 1299 juni 16-aug. 1 juli 29
	 3493 	 1299 juni 16-aug. 1 mogelijk juli 29
charter: 3500 	dates found: 3248 	dates not found: 194 	dates skipped: 58
	 3505 	 1296 juni 28-1299 nov. 10 1299 sept. 15
charter: 3536 	dates found: 3280 	dates not found: 195 	dates skipped: 61
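
The modifier handling at the start of the date parsing can be illustrated in isolation. The following is a minimal sketch, not the notebook's own function: `strip_date_modifiers` is a hypothetical helper, the default certainty value `"certain"` is an assumption, and the example strings are invented. It only covers the leading uncertainty, circa and earlier/later markers, with the abbreviation dots escaped.

```python
import re

# Patterns for leading modifiers; abbreviation dots are escaped to match a literal ".".
UNCERTAIN = re.compile(r"\b(vermoedelijk|verm\.|waarschijnlijk|ws\.) ")
CIRCA = re.compile(r"\b(circa|ca\.|c\.) ")
EARLIER = re.compile(r"\b(enige tijd voor|kort voor|uiterlijk|voor) ")
LATER = re.compile(r"\b(kort na|na) ")


def strip_date_modifiers(date_string):
    """Hypothetical helper: return (cleaned date string, certainty, modifier)."""
    certainty = "certain"  # assumed default, not taken from the notebook
    modifier = ""
    if UNCERTAIN.search(date_string):
        date_string = UNCERTAIN.sub("", date_string)
        certainty = "uncertain"
    if CIRCA.search(date_string):
        date_string = CIRCA.sub("", date_string)
        modifier = "circa"
    if EARLIER.search(date_string):
        date_string = EARLIER.sub("", date_string)
        modifier = "earlier"
    if LATER.search(date_string):
        date_string = LATER.sub("", date_string)
        modifier = "later"
    return date_string, certainty, modifier


# Invented example strings in the style of the charter titles.
for example in ["ca. 1143 okt. 7", "verm. 1156 mrt. 9", "kort na 1178 apr. 9"]:
    print(strip_date_modifiers(example))
```

What remains after stripping is the bare date expression, which is what the pattern checks for specific dates and date ranges operate on.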


In [36]:
# Write the interpreted date information to an Excel file for manual data curation.

import datetime

headers = [
    "sort_year", "start_year", "end_year", 
    "sort_date", "start_date", "end_date", 
    "date_specificity", "date_certainty", "date_modifier", "date_string"
]


from openpyxl import Workbook
wb = Workbook()

# grab the active worksheet
ws = wb.active

# Write the header row
ws.append(["charter_number"] + headers)


for charter_num in charter_date:
    for header in headers:
        if header not in charter_date[charter_num]:
            charter_date[charter_num][header] = None
    row = [charter_num]
    for header in headers:
        value = charter_date[charter_num][header]
        if isinstance(value, datetime.date):
            value = value.strftime("%Y-%m-%d")
        row += [value]
    ws.append(row)
        
    
# Save the file
wb.save("ohz-charter-dates.xlsx")



## Filtering Place Names based on the Charter Indices

The charter indices list place names with their spelling variants and give the charter book, page number and line number of each mention. This makes it possible to separate place names mentioned in the charter text from place names mentioned in the commentary.
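
As a minimal sketch of this idea, assuming a reference is a dictionary with `book`, `page` and `line`, and a paragraph carries `book_number`, `page_number`, `line_numbers` and `lines`: the paragraph that contains a referenced line can be found by comparing these fields, and its type then tells whether the mention is in charter text or in commentary. The function name `find_paragraph`, the paragraph types `charter_text` and `commentary`, and all values in the toy charter are hypothetical.

```python
def find_paragraph(reference, charter):
    """Return the paragraph that contains the referenced line, or None."""
    for paragraph in charter["paragraphs"]:
        if (paragraph["book_number"] == reference["book"]
                and paragraph["page_number"] == reference["page"]
                and reference["line"] in paragraph["line_numbers"]):
            return paragraph
    return None


# Toy charter with invented content; the type labels are assumptions.
charter = {
    "charter_number": 1,
    "paragraphs": [
        {
            "type": "commentary",
            "book_number": 1, "page_number": 5,
            "line_numbers": [1, 2, 3],
            "lines": ["Origineel niet voorhanden.", "Afschrift A.", "Uitgave a."],
        },
        {
            "type": "charter_text",
            "book_number": 1, "page_number": 5,
            "line_numbers": [11, 12, 13],
            "lines": ["In nomine Domini.", "tradidi ad locum Epternacum", "omnia que habui."],
        },
    ],
}

# An index reference: book 1, page 5, line 12.
reference = {"book": 1, "page": 5, "line": 12}
paragraph = find_paragraph(reference, charter)
line = paragraph["lines"][paragraph["line_numbers"].index(reference["line"])]
print(paragraph["type"], "|", line)
```

A reference whose line number does not occur in any paragraph yields `None`, which corresponds to the unresolved references that are collected separately.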

In [208]:
def make_reference_query(reference):
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"book_number": reference["book"]}},
                    {"match": {"paragraphs.page_number": reference["page"]}},
                    {"match": {"paragraphs.line_numbers": reference["line"]}},
                ]
            }
        },
        "size": 1000
    }

def get_paragraph_by_reference(reference, charter_data):
    for paragraph in charter_data["paragraphs"]:
        if paragraph["book_number"] != reference["book"]:
            continue
        if paragraph["page_number"] != reference["page"]:
            continue
        if reference["line"] not in paragraph["line_numbers"]:
            continue
        return paragraph
    return False

def get_charter_by_reference(reference):
    ref_query = make_reference_query(reference)
    try:
        response = es.search(index="ohz", doc_type="ohz-charter", body=ref_query)
    except Exception:
        print("ERROR in querying charter by reference:", reference)
        print(ref_query)
        raise
    if response['hits']['total'] == 0:
        # missing charter text
        return []
    return [hit['_source'] for hit in response['hits']['hits'] if get_paragraph_by_reference(reference, hit['_source'])]

def get_neighbourhood(reference, charter):
    paragraph = get_paragraph_by_reference(reference, charter)
    try:
        line_index = paragraph["line_numbers"].index(reference["line"])
    except TypeError:
        print("charter:", charter)
        print("paragraph:", paragraph)
        print(reference)
        raise
    neighbourhood = {
        "reference_line": paragraph["lines"][line_index]
    }
    if line_index > 0 and not paragraph["lines"][line_index-1] == '':
        neighbourhood["previous_line"] = paragraph["lines"][line_index-1]
    elif line_index > 1 and paragraph["lines"][line_index-1] == '':
        neighbourhood["previous_line"] = paragraph["lines"][line_index-2]
    else:
        # the previous line is in another paragraph on the same page;
        # separate names keep paragraph and line_index intact for the next_line lookup
        for other in charter["paragraphs"]:
            if other["page_number"] == reference["page"] and reference["line"] - 1 in other["line_numbers"]:
                other_index = other["line_numbers"].index(reference["line"]-1)
                neighbourhood["previous_line"] = other["lines"][other_index]
    if len(paragraph["lines"]) > line_index+1 and not paragraph["lines"][line_index+1] == '':
        neighbourhood["next_line"] = paragraph["lines"][line_index+1]
    elif len(paragraph["lines"]) > line_index+2 and paragraph["lines"][line_index+1] == '':
        neighbourhood["next_line"] = paragraph["lines"][line_index+2]
    else:
        # the next line is in another paragraph on the same page
        for other in charter["paragraphs"]:
            if other["page_number"] == reference["page"] and reference["line"] + 1 in other["line_numbers"]:
                other_index = other["line_numbers"].index(reference["line"]+1)
                neighbourhood["next_line"] = other["lines"][other_index]
    try:
        if "previous_line" in neighbourhood and neighbourhood["previous_line"][-1] == "-":
            neighbourhood["previous_line_extended"] = neighbourhood["previous_line"][:-1] + neighbourhood["reference_line"].strip()
    except IndexError:
        print(neighbourhood)
        raise
    # endswith avoids an IndexError on an empty reference line
    if "next_line" in neighbourhood and neighbourhood["reference_line"].endswith("-"):
        neighbourhood["reference_line_extended"] = neighbourhood["reference_line"][:-1] + neighbourhood["next_line"].strip()
    return neighbourhood

def map_book_name(book_name):
    if book_name == "I" or book_name == "I 1":
        return 1
    if book_name == "II" or book_name == "II 2":
        return 2
    if book_name == "III" or book_name == "III 3":
        return 3
    if book_name == "IV" or book_name == "IV 4":
        return 4
    if book_name == "V" or book_name == "V 5":
        return 5
    raise ValueError("Invalid book name: {b}".format(b=book_name))

def make_empty_reference():
    return {
        "book": None,
        "page": None,
        "line": None,
    }

def parse_additional_index_text(item_soup, index_string):
    for font_item in item_soup.find_all("font"):
        if len(font_item.attrs) > 0:
            continue
        index_string += font_item.text.strip("\n")
        if index_string.count("(") == index_string.count(")"):
            return index_string
    return index_string
            

def parse_index_name(item_soup, prev_name):
    name_span_soup = item_soup.find("span", class_="name")
    index_string = name_span_soup.text
    name_pref = index_string.strip()
    name_variants = []
    if " (" in index_string:
        if index_string.count("(") > index_string.count(")"):
            index_string = parse_additional_index_text(item_soup, index_string)
        #print("parsing name:")
        match = re.search(r" \((.*)\)", index_string)
        name_pref = re.sub(r" \((.*)\).*", "", index_string)
        try:
            name_variants = match.group(1).split(", ")
            for name_index, name_variant in enumerate(name_variants):
                if "(?)" in name_variant:
                    name_variant = name_variant.replace(" (?)","")
                    name_variants[name_index] = name_variant
                    unsure_index_variant[name_pref] += [name_variant]
        except AttributeError:
            print("index_string:\n\t", index_string)
            raise
        #print(reference)
    elif index_string.startswith("{p}, ".format(p=prev_name)):
        name_pref = prev_name
    return name_pref, name_variants

def parse_reference_span(item_soup, name_pref, unresolvable_references):
    name_references = []
    ref_span = item_soup.find("span", class_="references")
    if not ref_span:
        return name_references
    reference = make_empty_reference()
    for child in ref_span.children:
        if child.name == "b":
            try:
                reference["book"] = map_book_name(child.text.strip())
            except ValueError:
                #print("Adding entry_text:")
                #print(item_soup)
                #reference["entry_text"] = item_soup.text
                #invalid_references[name_pref] += [reference]
                continue
        if child.name == "a":
            reference["page"] = int(child.text.strip())
        if child.name == "font" and 'class' in child.attrs and "pageLine" in child.attrs['class']:
            reference["line"] = int(child.text.strip())
            if not reference["book"] or not reference["page"]:
                print("Missing reference data!")
                #print(name_pref)
                #print(item_soup)
                # store a copy, so that later updates to reference do not alter the stored problem entry
                problem_reference = copy.copy(reference)
                problem_reference["entry_text"] = item_soup.text.replace("\t", " ").replace("\n", " ")
                problem_reference["problem_type"] = "unparseable_index_entry"
                unresolvable_references[name_pref] += [problem_reference]
                continue
            if "entry_text" in reference:
                print(name_pref)
                print(item_soup)
                raise KeyError("This reference should not have an entry_text property")
            name_references += [reference]
            reference = copy.copy(reference)
            if "entry_text" in reference:
                del reference["entry_text"]
    return name_references

def lookup_reference_exact(name_pref, variants, reference, ref_line, charter):
    for name_variant in [name_pref] + variants[name_pref]:
        if name_variant not in ref_line:
            continue
        reference['place_string'] = name_variant
        reference['place_pref'] = name_pref
        reference["match_type"] = "exact_match"
        return True
    return False

def lookup_reference_fuzzy(name_pref, variants, reference, ref_line, charter):
    max_score = 0
    best_match = None
    best_score = None
    for name_variant in [name_pref] + variants[name_pref]:
        try:
            term_matches = fuzzy_matcher.find_term_matches(ref_line, name_variant, max_length_variance=0)
            term_matches += fuzzy_matcher.find_term_matches(ref_line, name_variant, max_length_variance=1)
            term_matches += fuzzy_matcher.find_term_matches(ref_line, name_variant, max_length_variance=2)
        except Exception:
            print(ref_line, name_variant)
            raise
        #print("Term matches:", name_variant, term_matches)
        candidate_scores = fuzzy_matcher.filter_candidates(term_matches, name_variant)
        candidate_scores = fuzzy_matcher.rank_candidates(term_matches, name_variant)
        #candidates = fuzzy_matcher.find_candidates(ref_line, name_variant)
        for candidate_score in candidate_scores:
            if candidate_score["total"] > max_score:
                max_score = candidate_score["total"]
                best_match = {"name_variant": name_variant, "text_variant": candidate_score["candidate"], "score": max_score}
                best_score = candidate_score
    #print("BEST SCORE:", best_score)
    return best_match

def lookup_reference(name_pref, variants, reference, charter):
    order = ["reference_line", "reference_line_extended", "previous_line", "previous_line_extended", "next_line"]
    paragraph = get_paragraph_by_reference(reference, charter)
    neighbourhood = get_neighbourhood(reference, charter)
    overall_best_match = None
    overall_best_score = 0
    overall_best_line = None
    for line_type in order:
        if line_type not in neighbourhood:
            continue
        if lookup_reference_exact(name_pref, variants, reference, neighbourhood[line_type], charter):
            reference["match_line"] = line_type
            return True
        best_match = lookup_reference_fuzzy(name_pref, variants, reference, neighbourhood[line_type], charter)
        if best_match and best_match["score"] > overall_best_score:
            overall_best_score = best_match["score"]
            overall_best_match = best_match
            overall_best_line = line_type
    if overall_best_match and overall_best_match["score"] > 5:
        #print("Best match:", overall_best_match)
        reference['place_variant'] = overall_best_match["name_variant"]
        reference['place_string'] = overall_best_match["text_variant"]
        reference['place_pref'] = name_pref
        reference["match_type"] = "best_match"
        reference["match_line"] = overall_best_line
        return True
    return False

def is_unresolvable_reference(name_pref, reference, charters):
    if is_reference_to_missing_page(name_pref, reference):
        return True
    elif len(charters) == 0:
        if is_addenda_reference(name_pref, reference):
            return True
        else:
            is_other_error_reference(name_pref, reference)
            return True
    return False
        
def is_addenda_reference(name_pref, reference):
    if reference["page"] > last_page["OHZ{b}".format(b=reference["book"])]:
        reference["problem_type"] = "addenda_reference"
        return True
    else:
        return False
    
def is_reference_to_missing_page(name_pref, reference):
    if reference["page"] in missing_pages[reference["book"]]:
        reference["problem_type"] = "unavailable_page_reference"
        return True
    else:
        return False

def is_other_error_reference(name_pref, reference):
    reference["problem_type"] = "unresolved_reference"

def line_has_reference(name_pref, variants, reference, charters):
    reference["charter"] = charters[0]["charter_number"]
    if lookup_reference(name_pref, variants, reference, charters[0]):
        references[name_pref] += [reference]
        return True
    else:
        return False

def prev_line_has_reference(name_pref, variants, reference, charters):
    prev_line_reference = copy.copy(reference)
    prev_line_reference["line"] -= 1
    prev_line_charters = get_charter_by_reference(prev_line_reference)
    if len(prev_line_charters) == 0 or prev_line_charters[0] in charters:
        return False
    prev_line_reference["charter"] = prev_line_charters[0]["charter_number"]
    return line_has_reference(name_pref, variants, prev_line_reference, prev_line_charters)
    
def next_line_has_reference(name_pref, variants, reference, charters):
    next_line_reference = copy.copy(reference)
    next_line_reference["line"] += 1
    next_line_charters = get_charter_by_reference(next_line_reference)
    if len(next_line_charters) == 0 or next_line_charters[0] in charters:
        return False
    next_line_reference["charter"] = next_line_charters[0]["charter_number"]
    return line_has_reference(name_pref, variants, next_line_reference, next_line_charters)
    
def parse_index_item(item_soup, variants, prev_name, references, unresolvable_references):
    name_pref, name_variants = parse_index_name(item_soup, prev_name)
    if name_pref not in has_variants:
        return None
    if name_pref not in variants:
        print("Resolving index entries for", name_pref)
    variants[name_pref] += name_variants
    name_references = parse_reference_span(item_soup, name_pref, unresolvable_references)
    for reference in name_references:
        charters = get_charter_by_reference(reference)
        if is_unresolvable_reference(name_pref, reference, charters):
            unresolvable_references[name_pref] += [reference]
            continue
        if line_has_reference(name_pref, variants, reference, charters):
            continue
        if prev_line_has_reference(name_pref, variants, reference, charters):
            continue
        if next_line_has_reference(name_pref, variants, reference, charters):
            continue
        reference["problem_type"] = "unresolved_reference"
        unresolvable_references[name_pref] += [reference]
    return name_pref
    
def char_range(c1, c2):
    """Generates the characters from `c1` to `c2`, inclusive."""
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)
        
fuzzy_matcher = FuzzyMatcher(char_match_threshold=0.7, ngram_threshold=0.7, levenshtein_threshold=0.7)

found = 0 # for references that have an exact or best match
not_found = 0 # for references that cannot be found in the text (either word is not in OCR or reference is incorrect)
missing_ref_text = 0 # for references to pages that are not in the hOCR output
addenda_ref = 0 # for references to addenda pages at the end of charter books

references = defaultdict(list)
unresolvable_references = defaultdict(list)
unsure_index_variant = defaultdict(list)
variants = defaultdict(list)

for index_char in char_range("A", "Z"):
    prev_name = None
    index_file = "ohz/ohz_index/ohz_index-{i}.html".format(i=index_char)
    with open(index_file, 'rt') as fh:
        index_soup = bsoup(fh, "lxml")

    items_soup = index_soup.find_all("div", class_="item")
    for index, item_soup in enumerate(items_soup):
        prev_name = parse_index_item(item_soup, variants, prev_name, references, unresolvable_references)
        
print("Index entries:", len(references.keys()))
print("Found:", found, "\tNot found:", not_found, "\tMissing refs:", missing_ref_text, "\tAddenda refs:", addenda_ref)


Resolving index entries for A
Resolving index entries for A
Resolving index entries for A
Resolving index entries for A
Resolving index entries for Aagtdorp
Resolving index entries for Aagtdorp
Resolving index entries for Aagtdorp
Resolving index entries for Aalbertsberg
Resolving index entries for Aalburg
Resolving index entries for Aalburg
Resolving index entries for Aalburg
Resolving index entries for Aalburg
Resolving index entries for Aalburg
Resolving index entries for Aalburg
Resolving index entries for Aalburg
Resolving index entries for Aalburg
Resolving index entries for Aalburg
Resolving index entries for Aalsmeer
Resolving index entries for Aalsmeer
Resolving index entries for Aalsmeer
Resolving index entries for Aalsmeer
Resolving index entries for Aalsmeer
Resolving index entries for Aalsmeer
Resolving index entries for Aalsmeer
Resolving index entries for Aalsmeer
Resolving index entries for Aalsmeer
Resolving index entries for Aalsmeer
Resolving index entries for Aalsme

Resolving index entries for Amerongen
Resolving index entries for Amersfoort
Resolving index entries for Amersfoort
Resolving index entries for Amersfoort
Resolving index entries for Amersfoort
Resolving index entries for Amersfoort
Resolving index entries for Amersfoort
Resolving index entries for Amersfoort
Resolving index entries for Amiens
Resolving index entries for Amiens
Resolving index entries for Amiens
Resolving index entries for Amiens
Resolving index entries for Ammers
Resolving index entries for Ammers
Resolving index entries for Ammers
Resolving index entries for Ammers
Resolving index entries for Ammers
Resolving index entries for Ammers
Resolving index entries for Amstel
Resolving index entries for Amstel
Resolving index entries for Amstel
Resolving index entries for Amstel
Resolving index entries for Amstel
Resolving index entries for Amstel
Resolving index entries for Amstel
Resolving index entries for Amstel
Resolving index entries for Amstel
Resolving index entries 

Resolving index entries for Baardegem
Resolving index entries for Baardwijk
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarland
Resolving index entries for Baarle
Resolving index entries for Baarle
Resolving index entries for Baarle
Resolving index entries for Baarle
Resolving index entries for Baarle
Resolving index entries for Baarle
Resolving index entries for Baarle
Resolving index entries for Baarlo
Resolving index entries for Baarsdorp
Resolving index entries for Babiloniënbroek
Resolvi

Resolving index entries for Betuwe
Resolving index entries for Betuwe
Resolving index entries for Betuwe
Resolving index entries for Betuwe
Resolving index entries for Betuwe
Resolving index entries for Betuwe
Resolving index entries for Betuwe
Resolving index entries for Betuwe
Resolving index entries for Beukelsdijk
Resolving index entries for Beukelsdijk
Resolving index entries for Beukelsdijk
Resolving index entries for Beuron
Resolving index entries for Beveland
Resolving index entries for Beveland
Resolving index entries for Beveland
Resolving index entries for Beveland
Resolving index entries for Beveland
Resolving index entries for Beveren-Waas
Resolving index entries for Beveren-Waas
Resolving index entries for Beveren-Waas
Resolving index entries for Beveren-Waas
Resolving index entries for Beverley
Resolving index entries for Beverwijk
Resolving index entries for Beverwijk
Resolving index entries for Beverwijk
Resolving index entries for Beverwijk
Resolving index entries for

Resolving index entries for Boshuizen
Resolving index entries for Boshuizen
Resolving index entries for Boskoop
Resolving index entries for Boskoop
Resolving index entries for Boskoop
Resolving index entries for Boskoop
Resolving index entries for Boston
Resolving index entries for Boston
Resolving index entries for Boston
Resolving index entries for Boston
Resolving index entries for Boston
Resolving index entries for Botersloot
Resolving index entries for Botervleete
Resolving index entries for Boterzande
Resolving index entries for Boterzande
Resolving index entries for Boterzande
Resolving index entries for Botmelosa Kine
Resolving index entries for Bottel
Resolving index entries for Boudeliche
Resolving index entries for Boudelo
Resolving index entries for Boudelo
Resolving index entries for Boudelo
Resolving index entries for Boudelo
Resolving index entries for Boudelo
Resolving index entries for Boudewijnskerke
Resolving index entries for Boulichbergh
Resolving index entries for

Resolving index entries for Burg
Resolving index entries for Burgerscede
Resolving index entries for Burgersdijk
Resolving index entries for Burgersdijk
Resolving index entries for Burgersdijk
Resolving index entries for Burgersdijk
Resolving index entries for Burgsteinfurt
Resolving index entries for Burgsteinfurt
Resolving index entries for Burgsteinfurt
Resolving index entries for Burgsteinfurt
Resolving index entries for Bury Saint Edmunds
Resolving index entries for Bury Saint Edmunds
Resolving index entries for Bury Saint Edmunds
Resolving index entries for Butsegem
Resolving index entries for Buttinge
Resolving index entries for Buttinge
Resolving index entries for Buttinge
Resolving index entries for Buttinge
Resolving index entries for Buurmalsen
Resolving index entries for Buurmalsen
Resolving index entries for Buurmalsen
Resolving index entries for Buurmalsen
Resolving index entries for Bijlmerbos
Resolving index entries for Bijlmerpolder
Resolving index entries for Ceneda
R

Resolving index entries for Delft
Resolving index entries for Delft
Resolving index entries for Delft
Resolving index entries for Delft
Resolving index entries for Delft
Resolving index entries for Delft
Resolving index entries for Delft
Resolving index entries for Denain
Resolving index entries for Dendermonde
Resolving index entries for Dendermonde
Resolving index entries for Dendermonde
Resolving index entries for Destelbergen
Resolving index entries for Destelbergen
Resolving index entries for Destelbergen
Resolving index entries for Destelbergen
Resolving index entries for Deûle
Resolving index entries for Deurlechtervenne
Resolving index entries for Deutz
Resolving index entries for Devel
Resolving index entries for Devel
Resolving index entries for Devel
Resolving index entries for Develpolder
Resolving index entries for Deventer
Resolving index entries for Deventer
Resolving index entries for Deventer
Resolving index entries for Deventer
Resolving index entries for Deventer
Res

Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Egmond
Resolving index entries for Ehrenbreitstein
Resolving index entries for Eichem
Resolving index entries for Eichem
Resolving index entries for Eichstätt
Resolving index entries for Eikenduinen
Resolving index entries for Eikenduinen
Resolving index entries for Eikenduinen
Resolving index entries for Eikenduinen
Resolving index entries for Eikenduinen
Resolving index entries for Eindhoven
Resolving index entries for Eindhoven
Resolving index entries for Eiteren
Resolving index entries for Eiteren
Resolving index entries for Eiteren
Resolving index entries for Eiteren
Resolving index entries for Eiteren
Resolvi

Resolving index entries for Foreest
Resolving index entries for Forfar
Resolving index entries for Forfar
Resolving index entries for Forfar
Resolving index entries for Forismarische
Resolving index entries for Forli
Resolving index entries for Forlimpopoli
Resolving index entries for Forres
Resolving index entries for Forres
Resolving index entries for Forres
Resolving index entries for Fortrapa
Resolving index entries for fossatum Mengeri
Resolving index entries for Foswert
Resolving index entries for Foswert
Resolving index entries for Foswert
Resolving index entries for Foucarmont
Resolving index entries for Fountains
Resolving index entries for Fountains
Resolving index entries for Franeker
Resolving index entries for Franeker
Resolving index entries for Franeker
Resolving index entries for Frankeland
Resolving index entries for Frankendijk
Resolving index entries for Frankendijk
Resolving index entries for Frankendijk
Resolving index entries for Frankendijk
Resolving index entrie

Resolving index entries for Goch
Resolving index entries for Godekineshofstat
Resolving index entries for Godolfhem
Resolving index entries for Goeree
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goes
Resolving index entries for Goetlosenhusen
Resolving index entries for Golberdingen
Resolving index entries for Gooi
Resolving index entries for Gooi
Resolving index entries for Gooik
Resolving index entries for Gooik
Resolving index entries for Goor
Resolving index entries for Goor
Resolving index entries for Goor
Resolving index entries for Goor
Resolving index entries for Goor
Resolving index entries for Goor
Resolving index entries for Goslar
Re

Resolving index entries for Haarlemmerhout
Resolving index entries for Haarlemmerhout
Resolving index entries for Haarlemmerhout
Resolving index entries for Haarlemmermeer
Resolving index entries for Haarlemmermeer
Resolving index entries for Haarlemmermeer
Resolving index entries for Haarlemmerwoude
Resolving index entries for Haastrecht
Resolving index entries for Haastrecht
Resolving index entries for Haastrecht
Resolving index entries for Haastrecht
Resolving index entries for Haastrecht
Resolving index entries for Haastrecht
Resolving index entries for Haastrecht
Resolving index entries for Habsburg
Resolving index entries for Haddenhausen
Resolving index entries for Hadinchaart
Resolving index entries for Hage
Resolving index entries for Hage
Resolving index entries for Hage
Resolving index entries for Hage
Resolving index entries for Hagestein
Resolving index entries for Hagestein
Resolving index entries for Hagestein
Resolving index entries for Hagestein
Resolving index entries

Resolving index entries for Herdeby
Resolving index entries for Herderen
Resolving index entries for Hereford
Resolving index entries for Hereford
Resolving index entries for Hereford
Resolving index entries for Hereford
Resolving index entries for Heren Heyen Land
Resolving index entries for Herentals
Resolving index entries for Heriuurde
Resolving index entries for Herkenrode
Resolving index entries for Herlaar
Resolving index entries for Herlaar
Resolving index entries for Herlaar
Resolving index entries for Hernet
Resolving index entries for Heroldeshusen
Resolving index entries for Herpt
Resolving index entries for Herpt
Resolving index entries for Herpt
Resolving index entries for Herpt
Resolving index entries for Herpterweide
Resolving index entries for Herradeskerke
Resolving index entries for Herradeskerke
Resolving index entries for Herradeskerke
Resolving index entries for Hersfeld
Resolving index entries for Herstal
Resolving index entries for Herstal
Resolving index entrie

Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for Holland
Resolving index entries for 

Resolving index entries for Jabbeke
Resolving index entries for Jedburgh
Resolving index entries for Jedburgh
Resolving index entries for Jedburgh
Resolving index entries for Jeruzalem
Resolving index entries for Jeruzalem
Resolving index entries for Jeruzalem
Resolving index entries for Jeruzalem
Resolving index entries for Jeruzalem
Resolving index entries for Jeruzalem
Resolving index entries for Jeruzalem
Resolving index entries for Jeruzalem
Resolving index entries for Jeruzalem
Resolving index entries for Jodogne
Resolving index entries for Jonkhere
Resolving index entries for Joux
Resolving index entries for Joux
Resolving index entries for Jutte sloot
Resolving index entries for Kaifenheim
Resolving index entries for Kalden
Resolving index entries for Kalmthout
Resolving index entries for Kamp
Resolving index entries for Kamp
Resolving index entries for Kamp
Resolving index entries for Kamp
Resolving index entries for Kampen
Resolving index entries for Kampen
Resolving index en

Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Krabbendijke
Resolving index entries for Kralingen
Resolving index entries for Kralingen
Resolving index entries for Kralingen
Resolving index entries for Kralingen
Resolving index entries for Kralingen
Resolving index entries for Kralingen
Resolving index entries for Kralingen
Resolving index entries for Kralingen
Resolving index entries for Kralingen
Resolving index entries for Kralingen
Resolving index entries for Kranendonk
Resolving index entries for Krimpen
Resolving index entries for Krimpen
Resolving index e

Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Leuven
Resolving index entries for Lewa
Resolving index entries for Lianne
Resolving index entries for Lichfield
Resolving index entries for Lichfield
Resolving index entries for Lichfield
Resolving index entries for Lichfield
Resolving index entries for Liede
Resolving i

Resolving index entries for Lyon
Resolving index entries for Lyon
Resolving index entries for Lysekine
Resolving index entries for Lysekine
Resolving index entries for Le Quesnoy
Resolving index entries for La Roche
Resolving index entries for Le Roeulx
Resolving index entries for Le Roeulx
Resolving index entries for Le Roeulx
Resolving index entries for La Villeneuve
Resolving index entries for La Villeneuve
Resolving index entries for Maalstede
Resolving index entries for Maalstede
Resolving index entries for Maalstede
Resolving index entries for Maalstede
Resolving index entries for Maarheze
Resolving index entries for Maarheze
Resolving index entries for Maarheze
Resolving index entries for Maarheze
Resolving index entries for Maarland
Resolving index entries for Maarland
Resolving index entries for Maarland
Resolving index entries for Maarland
Resolving index entries for Maarland
Resolving index entries for Maarn
Resolving index entries for Maarschalkerweerd
Resolving index entri

Resolving index entries for Meliskerke
Resolving index entries for Meliskerke
Resolving index entries for Meliskerke
Resolving index entries for Meliskerke
Resolving index entries for Melton
Resolving index entries for Melton
Resolving index entries for Melun
Resolving index entries for Melun
Resolving index entries for Melun
Resolving index entries for Melun
Resolving index entries for Mempsegouw
Resolving index entries for Menemara
Resolving index entries for Menkenesdreth
Resolving index entries for Merano
Resolving index entries for Merbetta
Resolving index entries for Mere
Resolving index entries for Mere
Resolving index entries for Mereveld
Resolving index entries for Merinidorbe
Resolving index entries for Merkenich
Resolving index entries for Merksem
Resolving index entries for Mersade
Resolving index entries for Merseburg
Resolving index entries for Merum
Resolving index entries for Merum
Resolving index entries for Merum
Resolving index entries for Merwede
Resolving index ent

Resolving index entries for Morlodenisse
Resolving index entries for Morlodenisse
Resolving index entries for Morlodeoord
Resolving index entries for Morlodeoord
Resolving index entries for Morlodeoord
Resolving index entries for Mortagne
Resolving index entries for Mortagne
Resolving index entries for Mortagne
Resolving index entries for Mühlhausen
Resolving index entries for Muiden
Resolving index entries for Muiden
Resolving index entries for Muiden
Resolving index entries for Muiden
Resolving index entries for Muiden
Resolving index entries for Muiden
Resolving index entries for Muiden
Resolving index entries for Muiderberg
Resolving index entries for Muiderberg
Resolving index entries for Muilkerk
Resolving index entries for Muilkerk
Resolving index entries for Muilkerk
Resolving index entries for Muilvliet
Resolving index entries for Murbach
Resolving index entries for Murtssloet
Resolving index entries for Mutford
Resolving index entries for Mutford
Resolving index entries for M

Resolving index entries for Noorddijk
Resolving index entries for Noorddijk
Resolving index entries for Noorddijk
Resolving index entries for Noorddijk
Resolving index entries for Noorddijk
Resolving index entries for Noorddijk
Resolving index entries for Noorddijk
Resolving index entries for Noordeinde
Resolving index entries for Noordeloos
Resolving index entries for Noordeloos
Resolving index entries for Noordeloos
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordholland
Resolving index entries for Noordmonster
Reso

Resolving index entries for Oostkapelle
Resolving index entries for Oostkapelle
Resolving index entries for Oostkapelle
Resolving index entries for Oostkapelle
Resolving index entries for Oostkapelle
Resolving index entries for Oostkapelle
Resolving index entries for Oostkapelle
Resolving index entries for Oostkapelle
Resolving index entries for Oostkerke
Resolving index entries for Oostkerke
Resolving index entries for Oostkerke
Resolving index entries for Oostkerke
Resolving index entries for Oostkerke
Resolving index entries for Oostkerke
Resolving index entries for Oostkerke
Resolving index entries for Ophemert
Resolving index entries for Ophemert
Resolving index entries for Ophemert
Resolving index entries for Ophemert
Resolving index entries for Oppenheim
Resolving index entries for Oppenheim
Resolving index entries for Oppenheim
Resolving index entries for Opperlotharingen
Resolving index entries for Opperlotharingen
Resolving index entries for Opperlotharingen
Resolving index e

Resolving index entries for Passau
Resolving index entries for Patauspolra
Resolving index entries for Pauluspolder
Resolving index entries for Paveien
Resolving index entries for Pavia
Resolving index entries for Pavia
Resolving index entries for Pavia
Resolving index entries for Pavia
Resolving index entries for Pembroke
Resolving index entries for Pendice
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pendrecht
Resolving index entries for Pennincx Camp
Resolving index entries for Perbome
Resolving index entries for Perbome
Resolving index entries for Pernis
Resolving index entries for Pernis
Resolving index entries for Pernis
Re

Resolving index entries for Reimerswaal
Resolving index entries for Reimerswaal
Resolving index entries for Reimerswaal
Resolving index entries for Reimerswaal
Resolving index entries for Reimerswaal
Resolving index entries for Reimerswaal
Resolving index entries for Reimerswaal
Resolving index entries for Reims
Resolving index entries for Reims
Resolving index entries for Reims
Resolving index entries for Reims
Resolving index entries for Reken
Resolving index entries for Rekere
Resolving index entries for Rekeredam
Resolving index entries for Rekeredam
Resolving index entries for Rekeredam
Resolving index entries for Renesse
Resolving index entries for Renesse
Resolving index entries for Renesse
Resolving index entries for Renesse
Resolving index entries for Renesse
Resolving index entries for Renesse
Resolving index entries for Renty
Resolving index entries for Ressegem
Resolving index entries for Rethymnon
Resolving index entries for Rettel
Resolving index entries for Reymers Wael 

Resolving index entries for sHeer Boudenspolre
Resolving index entries for sHeer Boudenspolre
Resolving index entries for sHeer Boudenspolre
Resolving index entries for sHeeren weide
Resolving index entries for Saaftinge
Resolving index entries for Saaftinge
Resolving index entries for Saaftinge
Resolving index entries for Saaftinge
Resolving index entries for Saaftinge
Resolving index entries for Saarbrücken
Resolving index entries for Sabbinge
Resolving index entries for Sabbinge
Resolving index entries for Sabbinge
Resolving index entries for Sabbinge
Resolving index entries for Sabina
Resolving index entries for Sailly
Resolving index entries for Saint Albans
Resolving index entries for Saint Albans
Resolving index entries for Saint Albans
Resolving index entries for Saint-Amand-les-Eaux
Resolving index entries for Saint Andrews
Resolving index entries for Saint Andrews
Resolving index entries for Saint Andrews
Resolving index entries for Saint-Dié
Resolving index entries for Saint

Resolving index entries for Schoorl
Resolving index entries for Schoorl
Resolving index entries for Schoorl
Resolving index entries for Schoorl
Resolving index entries for Schoorl
Resolving index entries for Schoorl
Resolving index entries for Schoorl
Resolving index entries for Schore
Resolving index entries for Schore
Resolving index entries for Schore
Resolving index entries for Schore
Resolving index entries for Schore
Resolving index entries for Schore
Resolving index entries for Schore
Resolving index entries for Schore
Resolving index entries for Schorisse
Resolving index entries for Schoten
Resolving index entries for Schoten
Resolving index entries for Schoten
Resolving index entries for Schoten
Resolving index entries for Schoten
Resolving index entries for Schoten
Resolving index entries for Schotland
Resolving index entries for Schotland
Resolving index entries for Schotland
Resolving index entries for Schotland
Resolving index entries for Schotland
Resolving index entries 

Resolving index entries for Soeburg
Resolving index entries for Soeburg
Resolving index entries for Soeburg
Resolving index entries for Soeburg
Resolving index entries for Soeburg
Resolving index entries for Soest
Resolving index entries for Soest
Resolving index entries for Soest
Resolving index entries for Soest
Resolving index entries for Soest
Resolving index entries for Soest
Resolving index entries for Soiron
Resolving index entries for Soissons
Resolving index entries for Sommerschenburg
Resolving index entries for Soumagne
Resolving index entries for Southampton
Resolving index entries for Southampton
Resolving index entries for Southampton
Resolving index entries for Southampton
Resolving index entries for Southampton
Resolving index entries for Spaarne
Resolving index entries for Spaarne
Resolving index entries for Spaarne
Resolving index entries for Spaarne
Resolving index entries for Spaarne
Resolving index entries for Spaarne
Resolving index entries for Spaarne
Resolving i

Resolving index entries for Ten Duinen
Resolving index entries for Ten Duinen
Resolving index entries for Ten Duinen
Resolving index entries for Ten Duinen
Resolving index entries for Ten Hamer
Resolving index entries for Ten Hamer
Resolving index entries for Ten Hamer
Resolving index entries for Ter Horst
Resolving index entries for Ter Horst
Resolving index entries for Ter Horst
Resolving index entries for Ter Horst
Resolving index entries for Ter Horst
Resolving index entries for tCalf
Resolving index entries for Ter Kameren
Resolving index entries for tLanghe land
Resolving index entries for tOestende
Resolving index entries for Tecklenburg
Resolving index entries for Tecklenburg
Resolving index entries for Tecklenburg
Resolving index entries for Tecklenburg
Resolving index entries for Tedingerbroek
Resolving index entries for Tedingerbroek
Resolving index entries for Teilingen
Resolving index entries for Teilingen
Resolving index entries for Teilingen
Resolving index entries for T

Resolving index entries for Urbino
Resolving index entries for Ursem
Resolving index entries for Utrecht
Resolving index entries for Utrecht
Resolving index entries for Utrecht
Resolving index entries for Utrecht
Resolving index entries for Utrecht
Resolving index entries for Utrecht
Resolving index entries for Utrecht
Resolving index entries for Utrecht
Resolving index entries for Utrecht
Resolving index entries for Utrecht
Resolving index entries for UUanbeke
Resolving index entries for UUigle
Resolving index entries for Uundelgers
Resolving index entries for Valduc
Resolving index entries for Valenciennes
Resolving index entries for Valenciennes
Resolving index entries for Valenciennes
Resolving index entries for Valenciennes
Resolving index entries for Valenciennes
Resolving index entries for Valenciennes
Resolving index entries for Valenciennes
Resolving index entries for Valenciennes
Resolving index entries for Valenciennes
Resolving index entries for Valenciennes
Resolving index

Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries for Vlaanderen
Resolving index entries f

Resolving index entries for Waalwijk
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waarde
Resolving index entries for Waardenburg
Resolving index entries for Waardenburg
Resolving index entries for Waardenburg
Resolving index entries for Waardenburg
Resolving index entries for Waarder
Resolving index entries for Waarder
Resolving index entries for Waarschoot
Resolving index entries for Waarschoot
Resolving index entries for Waasland
Resolving index entri

Resolving index entries for Westergo
Resolving index entries for Westergo
Resolving index entries for Westergo
Resolving index entries for Westerhout
Resolving index entries for Westeringen
Resolving index entries for Westeringen
Resolving index entries for Westerkinloson
Resolving index entries for Westerlike
Resolving index entries for Westerlo
Resolving index entries for Westfalen
Resolving index entries for Westfalen
Resolving index entries for Westfalen
Resolving index entries for Westfalen
Resolving index entries for Westflinge
Resolving index entries for Westfrancië
Resolving index entries for Westfrancië
Resolving index entries for Westfrancië
Resolving index entries for Westfriesland
Resolving index entries for Westfriesland
Resolving index entries for Westfriesland
Resolving index entries for Westfriesland
Resolving index entries for Westfriesland
Resolving index entries for Westfriesland
Resolving index entries for Westfriesland
Resolving index entries for Westfriesland
Reso

Resolving index entries for Word
Resolving index entries for Word
Resolving index entries for Wordt
Resolving index entries for Wormer
Resolving index entries for Worms
Resolving index entries for Worms
Resolving index entries for Worms
Resolving index entries for Worms
Resolving index entries for Worms
Resolving index entries for Worms
Resolving index entries for Worms
Resolving index entries for Worms
Resolving index entries for Woubrugge
Resolving index entries for Woubrugge
Resolving index entries for Woubrugge
Resolving index entries for Woudrichem
Resolving index entries for Woudrichem
Resolving index entries for Woudrichem
Resolving index entries for Woudrichem
Resolving index entries for Woudrichem
Resolving index entries for Woudrichem
Resolving index entries for Woudrichem
Resolving index entries for Woudrichem
Resolving index entries for Woudrichem
Resolving index entries for Woudrichem
Resolving index entries for Woudrichemer Waard
Resolving index entries for Wrange
Resolvi

Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeeland bewesten Schelde
Resolving index entries for Zeevang
Resolving index entries for Zeevang
Resolving index entries for Zeevang
Resolving index entries for Zeevang
Resolving index entries for Zeggelis
Resolving index entries for Zegwaard
Resolving index entries for Zegwaard
Resolving index entries for Zegwaard
Resolving index entries for Zegwaard
Resolving index entries for Zegwaard
Resolving index entries for Zeist
Resolving index entries for Zeist
Resolving index entries for Zeist
Re

In [209]:
from openpyxl import Workbook

headers = ["charter", "book", "page", "line", "place_pref", "place_variant", "place_string", "match_type", "match_line"]

wb = Workbook()

# grab the active worksheet
ws = wb.active

# the first row holds the column headers
ws.append(headers)

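for name_pref in references:
    for reference in references[name_pref]:
        # report keys that have no matching column in the spreadsheet
        for header in reference.keys():
            if header not in headers:
                print("unexpected key", header)
                print(reference)
        # fall back to the literal place string when no variant is given
        if "place_variant" not in reference:
            reference["place_variant"] = reference["place_string"]
        row = []
        for header in headers:
            if header not in reference:
                print("missing header", header)
                print(reference)
            # use .get() so a missing key yields an empty cell instead of a KeyError
            value = reference.get(header)
            if header == "charter" and isinstance(value, list):
                # join the list of charter numbers with hyphens
                row += ["-".join([str(charter) for charter in value])]
            else:
                row += [value]
        ws.append(row)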
    
# Save the file
wb.save("ohz-placename-resolved-references.xlsx")



In [200]:
from openpyxl import Workbook

headers = ['problem_type', 'name_pref', 'book', 'page', 'line', 'charter', 'entry_text']

wb = Workbook()

# grab the active worksheet
ws = wb.active

ws.append(headers)
for name_pref in unresolvable_references:
    for reference in unresolvable_references[name_pref]:
        reference["name_pref"] = name_pref
        #reference["charter"] = None
        #reference["entry_text"] = None
        row = []
        for header in headers:
            if header not in reference:
                reference[header] = None
            if header == "charter" and isinstance(reference[header], list):
                row += ["-".join([str(charter) for charter in reference[header]])]
            elif header == "entry_text" and isinstance(reference[header], str):
                row += [reference[header].replace("\n", " ")]
            else:
                row += [reference[header]]
        try:
            ws.append(row)
        except Exception:
            # show the offending row and reference before re-raising
            print(row)
            print(reference)
            raise

# Save the file
wb.save("ohz-placename-unresolved-references.xlsx")
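
The row-building logic of the export can be isolated in a small helper, which makes it easy to check on a single reference before writing a whole spreadsheet. The following is a minimal sketch using only the standard library; the helper name `reference_to_row` and the sample values are hypothetical and not part of the notebook.

```python
def reference_to_row(reference, headers):
    """Flatten one reference dict into a list of cell values, one per header."""
    row = []
    for header in headers:
        value = reference.get(header)  # missing keys become empty cells (None)
        if header == "charter" and isinstance(value, list):
            # a list of charter numbers is written as one hyphen-joined string
            value = "-".join(str(charter) for charter in value)
        elif header == "entry_text" and isinstance(value, str):
            # keep the index entry on a single line in the spreadsheet cell
            value = value.replace("\n", " ")
        row.append(value)
    return row


# Hypothetical sample reference, for illustration only
headers = ["problem_type", "name_pref", "book", "page", "line", "charter", "entry_text"]
sample = {
    "name_pref": "Example",
    "book": 1,
    "page": 5,
    "charter": [12, 14],
    "entry_text": "first line\nsecond line",
}
row = reference_to_row(sample, headers)
print(row)
```

Each returned list can be passed directly to `ws.append(row)`.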


## Language Detection

The text of the charter books is a mix of original charter text and editorial commentary. Place names mentioned in the commentary should be ignored. One distinguishing feature is that commentary is always in modern Dutch, while charter text can be Middle Dutch, French or Latin. Another is that commentary contains many formulaic phrases, names of editors and numbers.
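
The second feature (formulaic phrases and numbers) can be turned into a rough score without any language model. The sketch below is an illustration only, not the approach taken in this notebook: the marker list, the function names and the threshold of 0.2 are all assumptions and would need tuning on real paragraphs.

```python
import re

# Hypothetical marker list: abbreviations and words typical of the commentary
COMMENTARY_MARKERS = {"p.", "nr.", "vgl.", "naar", "kol.", "noot"}


def commentary_score(paragraph):
    """Return the fraction of tokens that contain a digit or are a commentary marker."""
    tokens = paragraph.lower().split()
    if not tokens:
        return 0.0
    hits = 0
    for token in tokens:
        if token in COMMENTARY_MARKERS or re.search(r"\d", token):
            hits += 1
    return hits / len(tokens)


def looks_like_commentary(paragraph, threshold=0.2):
    """Guess whether a paragraph is commentary; the threshold is an arbitrary assumption."""
    return commentary_score(paragraph) >= threshold


# Short fragments taken from the sample texts used in this notebook
commentary = "Van den Bergh, OHZ, L, p. 2, nr. 3, naar j en een afschrift naar B."
charter = "quos terra vivos deglutivit"

print(commentary_score(commentary), looks_like_commentary(commentary))
print(commentary_score(charter), looks_like_commentary(charter))
```

A score like this only separates dense bibliographic commentary from running text; discursive commentary with few numbers would still need language detection.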

The following step is an attempt to detect the language of each paragraph, in order to decide whether the paragraph is commentary or original charter text.

As the previous step, recognizing place name references from the existing index, already gives good results, we skip language detection for now. We can revisit it later when we want to identify, for example, references to the [Digitale Charterbank](https://www.huygens.knaw.nl/digitale-charterbank-nederland/).


In [459]:
from langdetect import detect, detect_langs

latin_text = "noluerit, inprimitus $? iram Dei omnipotentis incurrat et sanctorum angelorum, et a [limi]nibus   ecclesiarum *? vel consortio christianorum efficiatur extraneus, et habeat partem cum Tuda qui Dominum tradidit ?? et cum Dathan et Abyron w? quos terra vivos deglutivit 2?, et insuper inferat ? una cum socio fisco auri libras X, argenti pondo <L)> +? coactus exsolvat +, et quod repetit evindicare non valeat. "
modern_dutch_text = "Drukken (alle, indien gedateerd, ad 726, tenzij anders vermeld): a. C. Scribani, Origines Antverpiensium, Antwerpiae 1610, p. 59-62. ? b. Miraeus, Cod. donationum, p. 31-35, nr. 8, naar a. ? c. Miraeus, Notitia, p. 23-24, fragmenten naar b. ? d. W. Bosschaerts, AtotpiBou de primis veteris Frisiae apostolis, Mechliniae 1650, p. 493-498, naar ab. ? e. Vredius, Hist. comitum Flandriae, IT, p. 317-319, naar b. ? f. C. Le Cointe, Annales ecclesiastici Francorum, IV, Parisiis 1670, p. 743-744. ? 8. J. Le Roy, Notitia marchionatus sacri Romani imperii hoc est urbis et agri Antverpiensis, Amstelaedami 1678, p. 67-68, naar B.?h. J. B. Gramaye, Antiquitates Bredanae, p. 6-7, in: Antiquitates illustrissimi ducatus Brabantia, Lovanii-Bruxellis 1708, onvolledig. ? i. Van Heussen, Batavia sacra, p. 40-41. ? j. Miraeus-Foppens, Opera dipl., T, p. rr, codex nr. 8, naar b. ? k. Van Loon, Hollandsche histori, p. 325-326, in noot, naar b. ? L. J. Bertholet, Histoire eccl?stastique et civile du duch? de Luxembourg et comt? de Chiny, IL, Luxembourg 1742, pi?ces justificatives p. 33-34. ? m. Van Goor, Beschrijving Breda, p. 400402, nr. 2. ? n. Dom Calmet, Histoire eccl?siastique et civile de la Lorraine, IL, Nancy 1748, preuves kol. xcii-xciv. ? o. Hontheim, Hist. Trevirensis, T, p. 115116, nr. 41, naar j. ? p. J.C. Diercxsens, Antverpia Christo nascens et crescens, I, Antverpiae 1773, p. 39-40, naar g. ? q. Gallia Christiana, XTII, instrumenta kol. 296-297, instr. eccl. Trevirensis nr. 10. ? r. L. G. O. de Br?quignyen F. J. G. La Porte du Theil, Diplomata, chartae, epistolae et alia documenta ad res Francicas spectantia, pars prima, I, Parisiis 179, p. 451-452. ?   J. M. Pardessus en L. G. O. de Br?quigny, Diplomata, chartae, epistolae, leges aliaque instrumenta ad res Gallo-Francicas spectantia, II, Parisiis 1649, p. 349-350, nr. 540, naar T. ? t. Migne, Patrologia Latina, LXXXIX, kol. 554-556, dipl. ad S. Willibrordum vel ab eo collata nr. 17. ? u. 
Van den Bergh, OHZ, L, p. 2, nr. 3, naar j en een afschrift naar B. ? v. F.-X.  Wurth-Paquet,  Table analytique des chartes et documents concernant la ville d Echternach et ses ?tablissements, I, Luxembourg "
modern_dutch_text = "De tekstoverlevering van deze oorkonde is niet bijzonder goed. Poncelet (a.w., p. 166) neemt aan, dat de drukken alle direct of indirect zijn afgeleid van het oudst bekende afschrift B. Zeer waarschijnlijk staat ook geen der overige afschriften los van B. Enkele corrupte plaatsen in B zijn reeds door anderen ge?mendeerd. Het ge?mendeerde plaatsen wij tussen rechte haken. Dit staat los van de vraag naar de echtheid van deze oorkonde. Zij ts voor het eerst gesteld door J. Mabillon (Acta Sanctorum ordinis sancti Benedicti, saec. IIL-t, Lut. Paris. 1672, p. 629). Mabillon zelf heeft haar onbeantwoord gelaten. Geen belang heeft thans nog de kritiek op deze oorkonde uitgebracht door P. P. M. Alberdingk Thijm (De H. Willibrordus apostel der Nederlanden, Amsterdam-Leuven 1661, p. 255-257; vgl. de vertaling door L. Tross, Der heilige Willibrord, M?nster[W. 1863, p. 180, noot 2). Later is deze kritiek uitgewerkt door L. van der Essen in: Geschiedkundige Bladen, I-2, Amsterdam 1905, p. 378-382, waar Diederik van Echternach wordt voorgesteld als de vervaardiger circa 11Q0 van deze oorkonde. Kort nadien evenwel heeft Van der Essen deze hypothese weer ingetrokken, daartoe genoopt door de kritiek die hij had ontmoet bij A. Poncelet (a.w., p. 163-174; vgl. W. Levison, in: MGH, SS rer. Merov., VILT, p. gr, noot 9). Poncelet resumeerde zijn oordeel over de oorkonde, het testamentum van Willibrord, aldus:  Tout compte fait, s?il n'est peut-?tre pas sage de se prononcer r?solument pour Dauthenticit? du ?testament de S. Willibrord, il semble infiniment moins prudent encore de le ranger parmi les documents apocryphes.                                  | Er is geen reden om er aan te twijfelen dat Willibrord in het zesde jaar van koning Diederik TV een aantal bij missionering verworven bezittingen met een oorkonde zou hebben geschonken aan het klooster te Echternach, en dat de tekst van deze oorkonde bewaard is gebleven. 
Dit neemt niet weg, dat deze tekst in de enige ons bewaard gebleven versie bij nadere beschouwing een gewichtige interpolatie bevat, namelijk de passage waaruit moet blijken dat Willibrord, behalve bezittingen afkomstig van een aantal ingenui Franci, aan Echternach ook goederen heeft geschonken, die hij had ontvangen van de familie der Pippiniden. Wat Wampach (a.w., Le, p. go vig. ) hierover ook moge gezegd hebben, het blijft een feit dat deze passage op onhandige wijze een contekst verstoort, die bovendien verderop met de inhoud van deze toch belangrijke inlas weer geen rekening houdt. Tot de onechte passages moet verder het zinsdeel vel ad illam sanctam congre"

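# detect / detect_langs come from the langdetect package (imported here in case no earlier cell did so).
from langdetect import detect, detect_langs

# langdetect has no Latin profile, so the Latin sample gets mapped to a Romance language.
print(detect(latin_text))
print(detect_langs(latin_text))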

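# Tokenise on non-word characters. Only the Modern Dutch terms are counted further down;
# the Latin terms get their own name so they are no longer overwritten.
latin_terms = re.split(r"\W+", latin_text.lower())
terms = re.split(r"\W+", modern_dutch_text.lower())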

# Modern Dutch text characteristics:
# - many abbreviations and initials: /\b[a-zA-Z]\./ (single character followed by a dot)
# - many numbers
# - many page references: /p\. \d+/
# - many domain-specific phrases: ["Drukken", "tenzij anders vermeld", "afschrift", "waarschijnlijk", "oorkonde"]
# - many typical names of editors, scholars: ["Poncelet", "Dijkhof", "Wampach"]

# Latin text characteristics:
# - many typical word-final suffixes: /\w+(um|ur|us|is)\b/
# - few numbers

# Middle Dutch text characteristics:
# - many typical ngrams: /(ae|ugh|ck)/
# - few numbers

# French text characteristics

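# Counter is not among the notebook's top-level imports, so import it here.
from collections import Counter

# Term frequencies of the Modern Dutch sample, most frequent first.
tf = Counter(terms)
tf.most_common()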
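
The characteristics noted in the comments above could be turned into a rough rule-based guesser for paragraphs that a general-purpose detector gets wrong. What follows is a minimal sketch, not the notebook's actual method: the names `paragraph_features` and `guess_language`, the cue lists, and the unweighted scoring are all assumptions made for illustration and have not been calibrated on the charter books.

```python
import re

# Hypothetical cue lists, taken from the notes above; extend and tune as needed.
EDITORIAL_PHRASES = ["drukken", "tenzij anders vermeld", "afschrift",
                     "waarschijnlijk", "oorkonde"]
LATIN_SUFFIX = re.compile(r"\w+(?:um|ur|us|is)\b")
MIDDLE_DUTCH_NGRAM = re.compile(r"ae|ugh|ck")
PAGE_REF = re.compile(r"\bp\. ?\d+")
INITIAL = re.compile(r"\b[a-zA-Z]\.")


def paragraph_features(text):
    """Return the rate per token of each cue in the paragraph."""
    lowered = text.lower()
    tokens = [t for t in re.split(r"\W+", lowered) if t]
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return {
        "latin_suffix": sum(1 for t in tokens if LATIN_SUFFIX.fullmatch(t)) / n,
        "middle_dutch_ngram": sum(1 for t in tokens if MIDDLE_DUTCH_NGRAM.search(t)) / n,
        "number": sum(1 for t in tokens if t.isdigit()) / n,
        "page_ref": len(PAGE_REF.findall(lowered)) / n,
        "initial": len(INITIAL.findall(text)) / n,
        "editorial_phrase": sum(lowered.count(p) for p in EDITORIAL_PHRASES) / n,
    }


def guess_language(text):
    """Pick the label with the highest cue rate (assumed, unweighted scoring)."""
    f = paragraph_features(text)
    scores = {
        # Editorial Modern Dutch: numbers, page references, initials, stock phrases.
        "modern_dutch": f["number"] + f["page_ref"] + f["initial"] + f["editorial_phrase"],
        "latin": f["latin_suffix"],
        "middle_dutch": f["middle_dutch_ngram"],
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Calling `guess_language(latin_text)` and `guess_language(modern_dutch_text)` on the two samples would be a first sanity check. French is left out because the notes list no cues for it yet.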

ca
[ca:0.7157915619408123, pt:0.1428558161578159, ro:0.14134893023544892]


[('de', 19),
 ('van', 15),
 ('deze', 10),
 ('p', 10),
 ('het', 8),
 ('oorkonde', 7),
 ('dat', 6),
 ('der', 6),
 ('door', 6),
 ('een', 6),
 ('w', 5),
 ('in', 5),
 ('willibrord', 5),
 ('is', 4),
 ('a', 4),
 ('aan', 4),
 ('geen', 4),
 ('heeft', 4),
 ('poncelet', 3),
 ('zijn', 3),
 ('b', 3),
 ('ook', 3),
 ('kritiek', 3),
 ('echternach', 3),
 ('die', 3),
 ('bij', 3),
 ('niet', 2),
 ('neemt', 2),
 ('staat', 2),
 ('los', 2),
 ('plaatsen', 2),
 ('ge', 2),
 ('dit', 2),
 ('mabillon', 2),
 ('op', 2),
 ('m', 2),
 ('amsterdam', 2),
 ('vgl', 2),
 ('l', 2),
 ('noot', 2),
 ('2', 2),
 ('essen', 2),
 ('diederik', 2),
 ('weer', 2),
 ('hij', 2),
 ('had', 2),
 ('s', 2),
 ('il', 2),
 ('le', 2),
 ('er', 2),
 ('te', 2),
 ('aantal', 2),
 ('bezittingen', 2),
 ('met', 2),
 ('hebben', 2),
 ('geschonken', 2),
 ('tekst', 2),
 ('bewaard', 2),
 ('gebleven', 2),
 ('passage', 2),
 ('moet', 2),
 ('tekstoverlevering', 1),
 ('bijzonder', 1),
 ('goed', 1),
 ('166', 1),
 ('drukken', 1),
 ('alle', 1),
 ('direct', 1),
 ('of',