##### Experiment, sheet 11:
So far I see three ways to implement text plus text coordinates:
1. based on an immutable list of basic segments (such as the lines in Republic);
2. based on an initial segmentation determined by how the source to be processed is already divided (such as the TEI example, with head, section, paragraph and page), where the segmentation can be refined later;
3. based on an initial segmentation, as in case 2, where finer-grained segmentation can be added using a character offset relative to the initial segment in question.

I have now implemented cases 1 and 2 as subclasses of an abstract class SegmentedText. Case 2 already works (by definition), because it is identical to sheet 10.
In this sheet I adapt the most recent Republic code from sheet 7 to use IndexedSegmentedText instead of a plain list. IndexedSegmentedText, incidentally, does little more than wrap a list.
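
To make the idea of a list-backed segmented text concrete, here is a minimal sketch. It is not the code from the `textservice` module: the class names `SegmentedTextSketch` and `IndexedSegmentedTextSketch` are hypothetical, and only the method names `len`, `append` and `extend` are taken from how the notebook uses `IndexedSegmentedText` further down. Iteration and indexing support are assumptions.

```python
from abc import ABC, abstractmethod


class SegmentedTextSketch(ABC):
    """Hypothetical base class: text addressed by segment index."""

    @abstractmethod
    def len(self):
        """Return the number of segments."""

    @abstractmethod
    def append(self, segment):
        """Add one segment at the end."""

    @abstractmethod
    def extend(self, other):
        """Add all segments of another segmented text at the end."""


class IndexedSegmentedTextSketch(SegmentedTextSketch):
    """Case 1: an ordered list of basic segments (for example lines)."""

    def __init__(self, segments=None):
        self._segments = list(segments) if segments is not None else []

    def len(self):
        return len(self._segments)

    def append(self, segment):
        self._segments.append(segment)

    def extend(self, other):
        # relies on `other` being iterable (assumed, see __iter__ below)
        self._segments.extend(other)

    def __iter__(self):
        return iter(self._segments)

    def __getitem__(self, index):
        return self._segments[index]


lines = IndexedSegmentedTextSketch()
lines.append('first line')
lines.extend(IndexedSegmentedTextSketch(['second line', 'third line']))
print(lines.len(), lines[1])
```

An anchor in this case is simply an integer index into the segment list, which is what the `begin_anchor` and `end_anchor` values in the rest of this sheet contain.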

Possible follow-up:
- find out whether I also need an abstract Anchor class
- instantiate the right SegmentedText for both cases and use it in the REST services

In [1]:
import sys
sys.path.append('../un-t-ann-gle')

from textservice import segmentedtext

datadir = '../data/'

In [2]:
import json
import glob
import re

# read files

all_textlines=segmentedtext.IndexedSegmentedText()
all_annotations=[]

# We want to load 'text containers' that contain more or less contiguous text and are as long as practically
# possible. Container size is determined by pragmatic reasons, e.g. technical (performance) or user driven
# (corresponding with all scans in a book or volume). This function returns all component files IN TEXT ORDER.
# Examples: sorted list of files, part of IIIF manifest.

def get_file_sequence_for_container(text_container):
    path = "../data/sessions/meeting-1705*"
    session_file_names = (f for f in glob.glob(path))
    return sorted(session_file_names)

# Many file types contain a hierarchy of ordered text and/or annotation elements of different types. Some form of
# depth-first, post order traversal is necessary. Examples: processing a json hierarchy with dictionaries
# and lists (republic) or parsing TEI XML (DBNL document).

def traverse(node,node_label,text,annotations):
    # find the list that represents the children, each child is a dict, assume first list is the correct one
    children = []
    label_of_children = ''
    for key,val in node.items():
        if (type(val) == list):
            children = val
            label_of_children = key
            break 
    
    if 'coords' in node:
        coords = node['coords']
    else:
        coords = None
    
    begin_index = text.len()
    annotation_info = {'label' : node_label,'image_coords': coords,'begin_anchor' : begin_index}
    if len(children) == 0:        # if no children, do your 'leaf node thing'
        annotation_info['id'] = node['id']
        annotation_info['end_anchor'] = text.len()
        node_text = node['text']
        
        if node_text is None:
            node_text = '\n'

        text.append(node_text)
    else:                         # if non-leaf node, first visit children     
        for child in children:
            traverse(child,label_of_children,text,annotations)
        
        # THE CODE BELOW IS A HACK: it depends on uncertain assumptions
        idkey = ''
        for k in node['metadata'].keys():
            if k.endswith('id'):
                idkey = k
                break
        annotation_info['id'] = node['metadata'][idkey]
        
        end_index = text.len()-1
        annotation_info['end_anchor'] = end_index    # after child text segments are added
        
        # if node contains iiif_url, create extra annotation_info for 'scanpage'
        if 'iiif_url' in node['metadata']:
            scan_annot_info = {'label':'scanpage','iiif_url':node['metadata']['iiif_url'],\
                               'begin_anchor':begin_index,'end_anchor':end_index}
            scan_annot_info['scan_num'] = node['metadata']['scan_num']
            annotations.append(scan_annot_info)
        
    annotations.append(annotation_info)
    return

# In case of presence of a hierarchical structure, processing/traversal typically starts from a root element.

def get_root_tree_element(file):
    with open(file, 'r') as myfile:
        session_file=myfile.read() 
        
    session_data = json.loads(session_file)      
    return session_data['_source']

# Rudimentary version of a scanpage_handler

def deduplicate_scanpage_annotations(a_array):
    # use a list comprehension to create a list of only scanpage annotation_info dicts
    scan_page_annots = [ann_info for ann_info in a_array if ann_info['label'] == 'scanpage']
        
    # use groupby on a list of dicts (see Python Cookbook 1.15)
    from operator import itemgetter
    from itertools import groupby

    # first sort on scan_num
    scan_page_annots.sort(key=itemgetter('scan_num'))

    # iterate in groups
    aggregated_scan_annots = []

    for scan_num, items in groupby(scan_page_annots, key=itemgetter('scan_num')):
        # first, convert the 'items' iterator to a list, to be able to use it twice (iterators can be used once)
        itemlist = list(items)
    
        # copy the item with the lowest begin_index
        aggr_scan_annot = min(itemlist, key=itemgetter('begin_anchor')).copy()
    
        # replace 'end_anchor' with the highest end_index in the group
        max_end_index = max(itemlist, key=itemgetter('end_anchor'))['end_anchor']
        aggr_scan_annot['end_anchor'] = max_end_index
        
        # add to result
        aggregated_scan_annots.append(aggr_scan_annot)

#    for scan_ann in aggregated_scan_annots:
#        scan_ann['iiif_url'] = re.sub(r'(\d+),(\d+),(\d+),(\d+)/(full)', r'\5/,\4', scan_ann['iiif_url'])
    
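    # NOTE: this rebinds the local name only, so the caller's list is not modified;
    # assigning to a_array[:] instead would update the caller's list in place
    a_array = [ann for ann in a_array if ann not in scan_page_annots]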
    a_array.extend(aggregated_scan_annots)
           
    return

def correct_scanpage_imageurls(a_array):
    scan_page_annots = [ann_info for ann_info in a_array if ann_info['label'] == 'scanpage']
    
    for scan_ann in scan_page_annots:
        scan_ann['iiif_url'] = re.sub(r'(\d+),(\d+),(\d+),(\d+)/(full)', r'\5/,\4', scan_ann['iiif_url'])
        
    return

# Rudimentary version of a page_handler

def add_page_annotations(source_data, ann_array):
    page_data = source_data['page_versions']
        
    # generator
    page_identifiers = (pg['page_id'] for pg in page_data)
    page_annots = [{'label' : 'pages','id' : page_id} for page_id in page_identifiers]
        
    for pa in page_annots:
        scan_num = int(re.search(r'(\d+)-page-', pa['id']).group(1))
        scanpage_for_scan_num = [ai for ai in ann_array if 'scan_num' in ai.keys() and ai['scan_num'] == \
                                 scan_num]
        pa['begin_anchor'] = scanpage_for_scan_num[0]['begin_anchor']
        pa['end_anchor'] = scanpage_for_scan_num[0]['end_anchor']
        pa['indexesByContainment'] = True
     
    ann_array.extend(page_annots)
    return

# Process per file and concatenate the results, keeping references to the baseline text elements correct
for f_name in get_file_sequence_for_container('volume-1705-1'):
    text_array = segmentedtext.IndexedSegmentedText()
    annotation_array = []
            
    source_data = get_root_tree_element(f_name)

    traverse(source_data,'sessions',text_array,annotation_array)
        
    scanpages = deduplicate_scanpage_annotations(annotation_array) 
    correct_scanpage_imageurls(annotation_array)
    
    add_page_annotations(source_data, annotation_array)
           
    # properly concatenate annotation info taking ongoing line indexes into account
    for ai in annotation_array:
        ai['begin_anchor'] += all_textlines.len()
        ai['end_anchor'] += all_textlines.len()
    
    all_textlines.extend(text_array)       
    all_annotations.extend(annotation_array)

In [3]:
all_annotations[-20:]

[{'label': 'lines',
  'image_coords': {'left': 2591,
   'right': 3313,
   'top': 2226,
   'bottom': 2302,
   'height': 76,
   'width': 722},
  'begin_anchor': 4670,
  'id': 'NL-HaNA_1.01.02_3760_0032-page-63-col-0-tr-3-line-2',
  'end_anchor': 4670},
 {'label': 'lines',
  'image_coords': {'left': 2586,
   'right': 3416,
   'top': 2274,
   'bottom': 2351,
   'height': 77,
   'width': 830},
  'begin_anchor': 4671,
  'id': 'NL-HaNA_1.01.02_3760_0032-page-63-col-0-tr-3-line-3',
  'end_anchor': 4671},
 {'label': 'lines',
  'image_coords': {'left': 2636,
   'right': 3247,
   'top': 2324,
   'bottom': 2396,
   'height': 72,
   'width': 611},
  'begin_anchor': 4672,
  'id': 'NL-HaNA_1.01.02_3760_0032-page-63-col-0-tr-3-line-4',
  'end_anchor': 4672},
 {'label': 'lines',
  'image_coords': {'left': 2590,
   'right': 3417,
   'top': 2371,
   'bottom': 2449,
   'height': 78,
   'width': 827},
  'begin_anchor': 4673,
  'id': 'NL-HaNA_1.01.02_3760_0032-page-63-col-0-tr-3-line-5',
  'end_anchor': 467
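
The two core steps of the cell above, the post-order traversal and the shifting of anchors when files are concatenated, can be illustrated on toy data. This is a simplified sketch, not the notebook's code: it uses plain lists instead of `IndexedSegmentedText`, the helper names `traverse_sketch`, `concatenate` and `toy_file` are hypothetical, and the toy nodes carry their `id` directly rather than in a `metadata` dict.

```python
def traverse_sketch(node, label, text, annotations):
    """Post-order walk: children are visited before their parent is recorded."""
    begin = len(text)
    # assume the first list-valued entry holds the children (as in the notebook)
    children_label, children = next(
        ((k, v) for k, v in node.items() if isinstance(v, list)), (None, []))
    if not children:
        text.append(node['text'])       # leaf: contributes exactly one segment
        end = begin
    else:
        for child in children:
            traverse_sketch(child, children_label, text, annotations)
        end = len(text) - 1             # last segment added by the descendants
    annotations.append({'label': label, 'id': node['id'],
                        'begin_anchor': begin, 'end_anchor': end})


def concatenate(all_text, all_annotations, text, annotations):
    """Shift per-file anchors by the number of segments already collected."""
    offset = len(all_text)
    for ann in annotations:
        ann['begin_anchor'] += offset
        ann['end_anchor'] += offset
    all_text.extend(text)
    all_annotations.extend(annotations)


def toy_file(prefix):
    """A tiny session: one column with two lines."""
    return {'id': prefix, 'columns': [
        {'id': prefix + '-col-0', 'lines': [
            {'id': prefix + '-col-0-line-0', 'text': 'a'},
            {'id': prefix + '-col-0-line-1', 'text': 'b'}]}]}


all_text, all_annotations = [], []
for name in ('s1', 's2'):
    text, annotations = [], []
    traverse_sketch(toy_file(name), 'sessions', text, annotations)
    concatenate(all_text, all_annotations, text, annotations)

for ann in all_annotations:
    print(ann['label'], ann['id'], ann['begin_anchor'], ann['end_anchor'])
```

Because the offset is read once per file before the text is extended, every annotation of a file is shifted by the same amount, and the second session ends up covering anchors 2 to 3.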

Read the resolution info from the results of queries to Marijn's resolution index, for the 7 session days concerned. For now I only keep the resolution id and the line ids of the first and last line of each resolution.

In [4]:
resolution_annotations=[]

def get_resolution_files_for_container(text_container):
    path = "../data/resolutions/*-resolutions.json"
    resolution_file_names = (f for f in glob.glob(path))
    return sorted(resolution_file_names)

def res_traverse(node, line_ids):
    # find the list that represents the children, each child is a dict, assume first list is the correct one
    children = []
    label_of_children = ''
    
    # assume, first list in dict are the children
    for key,val in node.items():
        # HACK ALERT! assumption that first list contains children has exceptions
        if type(val) == list and key != 'paragraphs' and key != 'evidence':
            children = val
            label_of_children = key
            break 
    
    if len(children) == 0:        # if no children, do your 'leaf node thing'
        line_ids.append(node['metadata']['id'])
    else:                         # if non-leaf node, first visit children     
        for child in children:
            res_traverse(child,line_ids)
                            
    return

# In case of presence of a hierarchical structure, processing/traversal typically starts from a root element.

def get_res_root_element(file):
    with open(file, 'r') as myfile:
        resolution_file=myfile.read() 
        
    resolution_data = json.loads(resolution_file)      
    return resolution_data['hits']['hits']

for f_name in get_resolution_files_for_container('volume-1705-1'):    
    # get list of resolution 'hits'
    hits = get_res_root_element(f_name)
    print(f_name)
    for hit in hits:
        # each hit corresponds with a resolution
        resolution_line_ids = []
        res_traverse(hit['_source'],resolution_line_ids)
        
        resolution_info = {'label' : 'resolutions','begin_anchor' : resolution_line_ids[0], \
                                      'end_anchor': resolution_line_ids[-1], 'id': hit['_id']}
        resolution_annotations.append(resolution_info)

resolution_annotations

../data/resolutions/1705-01-02-resolutions.json
../data/resolutions/1705-01-03-resolutions.json
../data/resolutions/1705-01-06-resolutions.json
../data/resolutions/1705-01-07-resolutions.json
../data/resolutions/1705-01-09-resolutions.json
../data/resolutions/1705-01-10-resolutions.json
../data/resolutions/1705-01-11-resolutions.json


[{'label': 'resolutions',
  'begin_anchor': 'NL-HaNA_1.01.02_3760_0011-page-20-column-0-tr-0-line-26',
  'end_anchor': 'NL-HaNA_1.01.02_3760_0012-page-22-column-0-tr-0-line-7',
  'id': 'meeting-1705-01-02-session-1-resolution-27'},
 {'label': 'resolutions',
  'begin_anchor': 'NL-HaNA_1.01.02_3760_0012-page-22-column-0-tr-0-line-8',
  'end_anchor': 'NL-HaNA_1.01.02_3760_0012-page-22-column-0-tr-0-line-20',
  'id': 'meeting-1705-01-02-session-1-resolution-28'},
 {'label': 'resolutions',
  'begin_anchor': 'NL-HaNA_1.01.02_3760_0012-page-22-column-0-tr-0-line-21',
  'end_anchor': 'NL-HaNA_1.01.02_3760_0012-page-22-column-0-tr-0-line-43',
  'id': 'meeting-1705-01-02-session-1-resolution-29'},
 {'label': 'resolutions',
  'begin_anchor': 'NL-HaNA_1.01.02_3760_0012-page-22-column-0-tr-0-line-44',
  'end_anchor': 'NL-HaNA_1.01.02_3760_0012-page-22-column-1-tr-0-line-2',
  'id': 'meeting-1705-01-02-session-1-resolution-30'},
 {'label': 'resolutions',
  'begin_anchor': 'NL-HaNA_1.01.02_3760_0012-

For all line ids, create a dict that maps line id to text_index

In [5]:
line_ids_vs_indexes = {}
for line in all_annotations:
    if line['label'] == 'lines':
        line_ids_vs_indexes.update({line['id'] : line['begin_anchor']})
        
line_ids_vs_indexes

{'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-0-line-8': 0,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-0-line-9': 1,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-0-line-10': 2,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-0-line-11': 3,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-0-line-12': 4,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-0-line-13': 5,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-0-line-14': 6,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-0-line-15': 7,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-4-line-2': 8,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-4-line-3': 9,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-1-line-0': 10,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-1-line-1': 11,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-1-line-2': 12,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-1-line-3': 13,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-1-line-4': 14,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr-1-line-5': 15,
 'NL-HaNA_1.01.02_3760_0008-page-15-col-0-tr

For all resolutions, add begin_index and end_index.
Note: Marijn uses 'col' in the ids for session days and 'column' for resolutions. Fix this with a regex.

In [6]:
num_errors = 0    # count failed lookups over all resolutions
for res in resolution_annotations:
    try:
        res['begin_anchor'] = re.sub(r'-column-', r'-col-', res['begin_anchor'])
        res['end_anchor'] = re.sub(r'-column-', r'-col-', res['end_anchor'])
    
        res['begin_anchor'] = line_ids_vs_indexes[res['begin_anchor']]
        res['end_anchor'] = line_ids_vs_indexes[res['end_anchor']]
    except KeyError:    # line id not present in the session data
        res['begin_anchor'] = 0
        res['end_anchor'] = 0
        num_errors += 1
if num_errors > 0:
    print(f"number of lookup errors: {num_errors}")
    
resolution_annotations

number of lookup errors: 1


[{'label': 'resolutions',
  'begin_anchor': 599,
  'end_anchor': 830,
  'id': 'meeting-1705-01-02-session-1-resolution-27'},
 {'label': 'resolutions',
  'begin_anchor': 831,
  'end_anchor': 843,
  'id': 'meeting-1705-01-02-session-1-resolution-28'},
 {'label': 'resolutions',
  'begin_anchor': 844,
  'end_anchor': 866,
  'id': 'meeting-1705-01-02-session-1-resolution-29'},
 {'label': 'resolutions',
  'begin_anchor': 867,
  'end_anchor': 885,
  'id': 'meeting-1705-01-02-session-1-resolution-30'},
 {'label': 'resolutions',
  'begin_anchor': 886,
  'end_anchor': 899,
  'id': 'meeting-1705-01-02-session-1-resolution-31'},
 {'label': 'resolutions',
  'begin_anchor': 901,
  'end_anchor': 945,
  'id': 'meeting-1705-01-02-session-1-resolution-32'},
 {'label': 'resolutions',
  'begin_anchor': 946,
  'end_anchor': 958,
  'id': 'meeting-1705-01-02-session-1-resolution-33'},
 {'label': 'resolutions',
  'begin_anchor': 959,
  'end_anchor': 983,
  'id': 'meeting-1705-01-02-session-1-resolution-34'},
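
The id normalisation and lookup used above can be isolated in a small helper. This is a sketch on made-up ids: the function name `resolve_anchor` and the shortened `scan-...` ids are hypothetical, and returning `None` for an unknown id is a design choice of this sketch, not what the cell above does.

```python
import re


def resolve_anchor(line_id, index_by_line_id):
    """Map a line id to its text index, or None if the id is unknown."""
    # resolution ids use '-column-', session ids use '-col-'
    normalised = re.sub(r'-column-', '-col-', line_id)
    return index_by_line_id.get(normalised)


index_by_line_id = {
    'scan-0012-page-22-col-0-tr-0-line-8': 831,
    'scan-0012-page-22-col-0-tr-0-line-20': 843,
}

found = resolve_anchor('scan-0012-page-22-column-0-tr-0-line-8', index_by_line_id)
missing = resolve_anchor('scan-0012-page-22-column-9-tr-0-line-1', index_by_line_id)
print(found, missing)
```

Returning `None` keeps a failed lookup distinguishable from a genuine anchor at index 0, which may matter when annotations are later searched by interval.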


In [7]:
all_annotations.extend(resolution_annotations)

In [8]:
from annotation import asearch
asearch.get_annotations_at_anchor(2675, all_annotations)

[{'label': 'lines',
  'image_coords': {'left': 1390,
   'right': 2278,
   'top': 955,
   'bottom': 1015,
   'height': 60,
   'width': 888},
  'begin_anchor': 2675,
  'id': 'NL-HaNA_1.01.02_3760_0022-page-42-col-1-tr-0-line-12',
  'end_anchor': 2675},
 {'label': 'textregions',
  'image_coords': {'left': 1380,
   'right': 2294,
   'top': 372,
   'bottom': 3290,
   'height': 2918,
   'width': 914},
  'begin_anchor': 2663,
  'id': 'NL-HaNA_1.01.02_3760_0022-page-42-col-1-tr-0',
  'end_anchor': 2724},
 {'label': 'scanpage',
  'iiif_url': 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3118/0/default.jpg',
  'begin_anchor': 2663,
  'end_anchor': 2724,
  'scan_num': 22},
 {'label': 'columns',
  'image_coords': {'left': 1380,
   'right': 2294,
   'top': 372,
   'bottom': 3290,
   'height': 2918,
   'width': 914},
  'begin_anchor': 2663,
  'id': 'NL-HaNA_1.01.02_3760_0022-page-42-col-1',
  'end_anchor': 2724},
 {'label': 'sessions',
  'image_coords': N

Find all annotations that overlap with an interval. That interval can be derived from the begin_index and end_index of a resolution.

In [9]:
for a in asearch.get_annotations_of_type('resolutions', all_annotations):
    print(a)

{'label': 'resolutions', 'begin_anchor': 599, 'end_anchor': 830, 'id': 'meeting-1705-01-02-session-1-resolution-27'}
{'label': 'resolutions', 'begin_anchor': 831, 'end_anchor': 843, 'id': 'meeting-1705-01-02-session-1-resolution-28'}
{'label': 'resolutions', 'begin_anchor': 844, 'end_anchor': 866, 'id': 'meeting-1705-01-02-session-1-resolution-29'}
{'label': 'resolutions', 'begin_anchor': 867, 'end_anchor': 885, 'id': 'meeting-1705-01-02-session-1-resolution-30'}
{'label': 'resolutions', 'begin_anchor': 886, 'end_anchor': 899, 'id': 'meeting-1705-01-02-session-1-resolution-31'}
{'label': 'resolutions', 'begin_anchor': 901, 'end_anchor': 945, 'id': 'meeting-1705-01-02-session-1-resolution-32'}
{'label': 'resolutions', 'begin_anchor': 946, 'end_anchor': 958, 'id': 'meeting-1705-01-02-session-1-resolution-33'}
{'label': 'resolutions', 'begin_anchor': 959, 'end_anchor': 983, 'id': 'meeting-1705-01-02-session-1-resolution-34'}
{'label': 'resolutions', 'begin_anchor': 984, 'end_anchor': 1056

In [10]:
for a in asearch.get_annotations_overlapping_with(2813,2825,all_annotations):
    print(a)

{'label': 'lines', 'image_coords': {'left': 3575, 'right': 4413, 'top': 1561, 'bottom': 1657, 'height': 96, 'width': 838}, 'begin_anchor': 2813, 'id': 'NL-HaNA_1.01.02_3760_0022-page-43-col-1-tr-2-line-15', 'end_anchor': 2813}
{'label': 'lines', 'image_coords': {'left': 3522, 'right': 4413, 'top': 1609, 'bottom': 1702, 'height': 93, 'width': 891}, 'begin_anchor': 2814, 'id': 'NL-HaNA_1.01.02_3760_0022-page-43-col-1-tr-2-line-16', 'end_anchor': 2814}
{'label': 'lines', 'image_coords': {'left': 3517, 'right': 4413, 'top': 1655, 'bottom': 1749, 'height': 94, 'width': 896}, 'begin_anchor': 2815, 'id': 'NL-HaNA_1.01.02_3760_0022-page-43-col-1-tr-2-line-17', 'end_anchor': 2815}
{'label': 'lines', 'image_coords': {'left': 3515, 'right': 4414, 'top': 1703, 'bottom': 1799, 'height': 96, 'width': 899}, 'begin_anchor': 2816, 'id': 'NL-HaNA_1.01.02_3760_0022-page-43-col-1-tr-2-line-18', 'end_anchor': 2816}
{'label': 'lines', 'image_coords': {'left': 3518, 'right': 4406, 'top': 1753, 'bottom': 1848

Combine both: all annotations of a given type within an interval

In [11]:
for a in asearch.get_annotations_of_type_overlapping('resolutions',2599,2850,all_annotations):
    print(a)

{'label': 'resolutions', 'begin_anchor': 2586, 'end_anchor': 2609, 'id': 'meeting-1705-01-07-session-1-resolution-1'}
{'label': 'resolutions', 'begin_anchor': 2611, 'end_anchor': 2622, 'id': 'meeting-1705-01-07-session-1-resolution-2'}
{'label': 'resolutions', 'begin_anchor': 2624, 'end_anchor': 2647, 'id': 'meeting-1705-01-07-session-1-resolution-3'}
{'label': 'resolutions', 'begin_anchor': 2649, 'end_anchor': 2668, 'id': 'meeting-1705-01-07-session-1-resolution-4'}
{'label': 'resolutions', 'begin_anchor': 2669, 'end_anchor': 2685, 'id': 'meeting-1705-01-07-session-1-resolution-5'}
{'label': 'resolutions', 'begin_anchor': 2686, 'end_anchor': 2695, 'id': 'meeting-1705-01-07-session-1-resolution-6'}
{'label': 'resolutions', 'begin_anchor': 2696, 'end_anchor': 2706, 'id': 'meeting-1705-01-07-session-1-resolution-7'}
{'label': 'resolutions', 'begin_anchor': 2707, 'end_anchor': 2797, 'id': 'meeting-1705-01-07-session-1-resolution-8'}
{'label': 'resolutions', 'begin_anchor': 2798, 'end_anch
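
The implementation of the `asearch` functions is not shown in this sheet. The sketch below is an assumption about how such a search could work, using a closed-interval overlap test. That test is at least consistent with the output above, where the resolution spanning 2586 to 2609 is returned for the query interval 2599 to 2850. The function names `overlaps` and `of_type_overlapping` and the toy ids are hypothetical.

```python
def overlaps(ann, begin, end):
    """True if the closed interval of the annotation intersects [begin, end]."""
    return ann['begin_anchor'] <= end and ann['end_anchor'] >= begin


def of_type_overlapping(label, begin, end, annotations):
    """Lazily yield annotations with the given label that overlap [begin, end]."""
    return (a for a in annotations
            if a['label'] == label and overlaps(a, begin, end))


toy_annotations = [
    {'label': 'resolutions', 'begin_anchor': 2586, 'end_anchor': 2609, 'id': 'r1'},
    {'label': 'resolutions', 'begin_anchor': 2611, 'end_anchor': 2622, 'id': 'r2'},
    {'label': 'resolutions', 'begin_anchor': 2900, 'end_anchor': 2950, 'id': 'r3'},
    {'label': 'lines', 'begin_anchor': 2600, 'end_anchor': 2600, 'id': 'l1'},
]

hits = [a['id'] for a in of_type_overlapping('resolutions', 2599, 2850, toy_annotations)]
print(hits)
```

A linear scan like this is fine for a few thousand annotations; for larger collections an interval index would be the obvious next step.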

In [12]:
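def get_textlines_between(begin,end,annotations): 
    # `segmentedtext` and `asearch` are the modules imported in In [1] and In [8]; the NameError
    # outputs recorded below stem from a run that referred to an undefined `tservice`
    textlines = segmentedtext.IndexedSegmentedText()
    for line_annot in asearch.get_annotations_of_type_overlapping('lines',begin,end,annotations):
        textlines.append((line_annot['begin_anchor'],all_textlines[line_annot['begin_anchor']]))
    return textlines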
        

In [13]:
# get text for (text segment of) resolution
for a in get_textlines_between(2813,2825,all_annotations):
    print(a)

NameError: name 'tservice' is not defined

Now request the text for the column that this resolution is part of

In [15]:
# resolution boundaries 2813 and 2825 were retrieved earlier via 'get_annotations_of_type'
# first request the columns that overlap with those boundaries
begin = 2813
end = 2825
column_annots = asearch.get_annotations_of_type_overlapping('columns',begin,end,all_annotations)

# for each column, print column_id, print column text
for col_annot in column_annots:
    print(f"column id: {col_annot['id']}")
    print(f"{col_annot['begin_anchor']}   {col_annot['end_anchor']}\n")
    
    for a in get_textlines_between(col_annot['begin_anchor'],col_annot['end_anchor'],all_annotations):
        print(a)

column id: NL-HaNA_1.01.02_3760_0022-page-43-col-1
2785   2850



NameError: name 'tservice' is not defined

Next step: defining named entities. For now only by referring to the index of the line(s) on which these entities occur

In [16]:
entity_segments = [(2814,2814,'per','Arnoldus Boomhouwer'),(2832,2833,'per','van Rabenpeé'),\
                   (2819,2819,'loc','Roermond')]

id_suffix = 1
entity_annotations = []
for es in entity_segments:
    ent_annot = {'label':'entities','begin_anchor':es[0],'end_anchor':es[1],\
                 'id':'NL-HaNA_1.01.02_3760_entity-'+str(id_suffix),\
                 'entity_type': es[2],'entity_text':es[3]}
    entity_annotations.append(ent_annot)
    id_suffix += 1

entity_annotations

[{'label': 'entities',
  'begin_anchor': 2814,
  'end_anchor': 2814,
  'id': 'NL-HaNA_1.01.02_3760_entity-1',
  'entity_type': 'per',
  'entity_text': 'Arnoldus Boomhouwer'},
 {'label': 'entities',
  'begin_anchor': 2832,
  'end_anchor': 2833,
  'id': 'NL-HaNA_1.01.02_3760_entity-2',
  'entity_type': 'per',
  'entity_text': 'van Rabenpeé'},
 {'label': 'entities',
  'begin_anchor': 2819,
  'end_anchor': 2819,
  'id': 'NL-HaNA_1.01.02_3760_entity-3',
  'entity_type': 'loc',
  'entity_text': 'Roermond'}]

In [17]:
all_annotations.extend(entity_annotations)

In [18]:
for a in asearch.get_annotations_of_type('entities', all_annotations):
    print(a)

{'label': 'entities', 'begin_anchor': 2814, 'end_anchor': 2814, 'id': 'NL-HaNA_1.01.02_3760_entity-1', 'entity_type': 'per', 'entity_text': 'Arnoldus Boomhouwer'}
{'label': 'entities', 'begin_anchor': 2832, 'end_anchor': 2833, 'id': 'NL-HaNA_1.01.02_3760_entity-2', 'entity_type': 'per', 'entity_text': 'van Rabenpeé'}
{'label': 'entities', 'begin_anchor': 2819, 'end_anchor': 2819, 'id': 'NL-HaNA_1.01.02_3760_entity-3', 'entity_type': 'loc', 'entity_text': 'Roermond'}


In [19]:
# get text for (text segment of) entity
for a in get_textlines_between(2832,2833,all_annotations):
    print(a)

NameError: name 'tservice' is not defined

In [20]:
# get text for (text segment of) entity
for a in get_textlines_between(2819,2819,all_annotations):
    print(a)

NameError: name 'tservice' is not defined

In [21]:
# get all entities for a specific session

# first get session that contains our example resolution from 2813 to 2825
for a in asearch.get_annotations_of_type_overlapping('sessions',2599,2850,all_annotations):
    print(a)

{'label': 'sessions', 'image_coords': None, 'begin_anchor': 2568, 'id': 'meeting-1705-01-07-session-1', 'end_anchor': 3093}


In [22]:
# then, get overlapping entities
for a in asearch.get_annotations_of_type_overlapping('entities',2568,3093,all_annotations):
    print(a)

{'label': 'entities', 'begin_anchor': 2814, 'end_anchor': 2814, 'id': 'NL-HaNA_1.01.02_3760_entity-1', 'entity_type': 'per', 'entity_text': 'Arnoldus Boomhouwer'}
{'label': 'entities', 'begin_anchor': 2832, 'end_anchor': 2833, 'id': 'NL-HaNA_1.01.02_3760_entity-2', 'entity_type': 'per', 'entity_text': 'van Rabenpeé'}
{'label': 'entities', 'begin_anchor': 2819, 'end_anchor': 2819, 'id': 'NL-HaNA_1.01.02_3760_entity-3', 'entity_type': 'loc', 'entity_text': 'Roermond'}


Experiment: add an 'image_range' element for each resolution

In [23]:
resolution_annotations

[{'label': 'resolutions',
  'begin_anchor': 599,
  'end_anchor': 830,
  'id': 'meeting-1705-01-02-session-1-resolution-27'},
 {'label': 'resolutions',
  'begin_anchor': 831,
  'end_anchor': 843,
  'id': 'meeting-1705-01-02-session-1-resolution-28'},
 {'label': 'resolutions',
  'begin_anchor': 844,
  'end_anchor': 866,
  'id': 'meeting-1705-01-02-session-1-resolution-29'},
 {'label': 'resolutions',
  'begin_anchor': 867,
  'end_anchor': 885,
  'id': 'meeting-1705-01-02-session-1-resolution-30'},
 {'label': 'resolutions',
  'begin_anchor': 886,
  'end_anchor': 899,
  'id': 'meeting-1705-01-02-session-1-resolution-31'},
 {'label': 'resolutions',
  'begin_anchor': 901,
  'end_anchor': 945,
  'id': 'meeting-1705-01-02-session-1-resolution-32'},
 {'label': 'resolutions',
  'begin_anchor': 946,
  'end_anchor': 958,
  'id': 'meeting-1705-01-02-session-1-resolution-33'},
 {'label': 'resolutions',
  'begin_anchor': 959,
  'end_anchor': 983,
  'id': 'meeting-1705-01-02-session-1-resolution-34'},


In [24]:
# for each resolution, determine image_range and add it

def get_bounding_box_for(annotations): 
    ann_list = list(annotations) # because a generator can only be used once
    
    min_left = min([ann['image_coords']['left'] for ann in ann_list if 'image_coords' in ann])
    max_right = max([ann['image_coords']['right'] for ann in ann_list if 'image_coords' in ann])
    min_top = min([ann['image_coords']['top'] for ann in ann_list if 'image_coords' in ann])
    max_bottom = max([ann['image_coords']['bottom'] for ann in ann_list if 'image_coords' in ann])
    height = max_bottom - min_top
    width = max_right - min_left

    return {'left': min_left, 'right': max_right, 'top': min_top, 'bottom': max_bottom, 'height': height, 'width': width}

def add_image_range(ann):
    ann['image_range'] = []
    
    ann_begin=ann['begin_anchor']
    ann_end=ann['end_anchor']
        
    # loop over the scans that overlap with the annotation
    for a in asearch.get_annotations_of_type_overlapping('scanpage',ann_begin,ann_end,all_annotations):
        bounding_boxes = []
        image_url = a['iiif_url']
        
        scan_begin=a['begin_anchor']
        scan_end=a['end_anchor']
        
        # loop over all columns on the scan in question. Per column, compute the bounding box for the
        # overlapping resolution lines
        for clm in asearch.get_annotations_of_type_overlapping('columns',scan_begin,scan_end,all_annotations):
            clm_begin=clm['begin_anchor']
            clm_end=clm['end_anchor']
            
            # determine the overlap_begin and overlap_end indexes for the column
            overlap_begin=max(ann_begin, clm_begin)
            overlap_end=min(ann_end, clm_end)
                        
            # derive the bounding box coords for this column from these
            if overlap_end-overlap_begin >= 0: # resolution and column are overlapping
                bounding_box=get_bounding_box_for(asearch.get_annotations_of_type_overlapping('lines',\
                                                        overlap_begin,overlap_end,all_annotations))
                bounding_boxes.append(bounding_box)
        
        ann['image_range'].append((image_url, bounding_boxes))
        print(ann['image_range'])
    return
    
for r in resolution_annotations:
    add_image_range(r)                

[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3108/0/default.jpg', [{'left': 408, 'right': 1295, 'top': 1574, 'bottom': 3274, 'height': 1700, 'width': 887}])]
[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3108/0/default.jpg', [{'left': 408, 'right': 1295, 'top': 1574, 'bottom': 3274, 'height': 1700, 'width': 887}]), ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3053/0/default.jpg', [{'left': 1347, 'right': 2270, 'top': 350, 'bottom': 3203, 'height': 2853, 'width': 923}])]
[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3108/0/default.jpg', [{'left': 408, 'right': 1295, 'top': 1574, 'bottom': 3274, 'height': 1700, 'width': 887}]), ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3053/0/default.jpg', [{'left': 1347, 'right': 2270, 'top': 350, 'bottom': 3203, 'heig

[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0014.jpg/full/,3127/0/default.jpg', [{'left': 2509, 'right': 3403, 'top': 2300, 'bottom': 3300, 'height': 1000, 'width': 894}]), ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0014.jpg/full/,3130/0/default.jpg', [{'left': 3505, 'right': 4409, 'top': 386, 'bottom': 740, 'height': 354, 'width': 904}])]
[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0014.jpg/full/,3130/0/default.jpg', [{'left': 3489, 'right': 4407, 'top': 727, 'bottom': 2492, 'height': 1765, 'width': 918}])]
[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0014.jpg/full/,3130/0/default.jpg', [{'left': 3483, 'right': 4382, 'top': 2471, 'bottom': 3316, 'height': 845, 'width': 899}])]
[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0014.jpg/full/,3130/0/default.jpg', [{'left': 3483, 'right': 4382, 'top': 2471, 'bottom': 3316, 'hei

[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0027.jpg/full/,3113/0/default.jpg', [{'left': 2524, 'right': 3409, 'top': 3069, 'bottom': 3230, 'height': 161, 'width': 885}]), ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0027.jpg/full/,3206/0/default.jpg', [{'left': 3371, 'right': 4373, 'top': 279, 'bottom': 3285, 'height': 3006, 'width': 1002}])]
[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0027.jpg/full/,3113/0/default.jpg', [{'left': 2524, 'right': 3409, 'top': 3069, 'bottom': 3230, 'height': 161, 'width': 885}]), ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0027.jpg/full/,3206/0/default.jpg', [{'left': 3371, 'right': 4373, 'top': 279, 'bottom': 3285, 'height': 3006, 'width': 1002}]), ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0028.jpg/full/,3112/0/default.jpg', [{'left': 413, 'right': 1305, 'top': 343, 'bottom': 794, 'heigh

Interim conclusion: for a resolution it is possible to derive all enclosing rectangles, even when the resolution text runs across multiple columns and/or multiple scans. In principle, IIIF image URLs for the resolution parts on the scans can be derived from these image_ranges, or the whole scans can be fetched and the resolution parts outlined on them.
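
The per-column bounding box computation behind this conclusion can be shown on toy data. This is a simplified sketch, not the notebook's `add_image_range`: it takes the columns and lines as plain lists instead of querying `asearch`, it ignores the scan level, and the names `bounding_box` and `boxes_per_column` as well as all coordinates are made up for illustration.

```python
def bounding_box(coords_list):
    """Smallest rectangle enclosing all given rectangles."""
    coords_list = list(coords_list)
    left = min(c['left'] for c in coords_list)
    right = max(c['right'] for c in coords_list)
    top = min(c['top'] for c in coords_list)
    bottom = max(c['bottom'] for c in coords_list)
    return {'left': left, 'right': right, 'top': top, 'bottom': bottom,
            'height': bottom - top, 'width': right - left}


def boxes_per_column(resolution, columns, lines):
    """One bounding box per column that the resolution runs through."""
    boxes = []
    for col in columns:
        lo = max(resolution['begin_anchor'], col['begin_anchor'])
        hi = min(resolution['end_anchor'], col['end_anchor'])
        if lo > hi:                      # no overlap with this column
            continue
        coords = [ln['image_coords'] for ln in lines
                  if lo <= ln['begin_anchor'] <= hi]
        boxes.append(bounding_box(coords))
    return boxes


# toy data: two columns of three lines each, anchors 0..5
lines = []
for i in range(6):
    col, row = divmod(i, 3)
    left, top = 100 + 1000 * col, 100 + 50 * row
    lines.append({'begin_anchor': i, 'end_anchor': i,
                  'image_coords': {'left': left, 'right': left + 800,
                                   'top': top, 'bottom': top + 40}})
columns = [{'begin_anchor': 0, 'end_anchor': 2},
           {'begin_anchor': 3, 'end_anchor': 5}]
resolution = {'begin_anchor': 1, 'end_anchor': 4}

for box in boxes_per_column(resolution, columns, lines):
    print(box)
```

The toy resolution starts in the middle of the first column and ends in the middle of the second, so it yields two rectangles, one per column.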

Because it is fun: also generate a list of IIIF image URLs per resolution (and then put them side by side in your browser).

In [25]:
resolution_annotations

[{'label': 'resolutions',
  'begin_anchor': 599,
  'end_anchor': 830,
  'id': 'meeting-1705-01-02-session-1-resolution-27',
  'image_range': [('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3108/0/default.jpg',
    [{'left': 408,
      'right': 1295,
      'top': 1574,
      'bottom': 3274,
      'height': 1700,
      'width': 887}]),
   ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3053/0/default.jpg',
    [{'left': 1347,
      'right': 2270,
      'top': 350,
      'bottom': 3203,
      'height': 2853,
      'width': 923}]),
   ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3112/0/default.jpg',
    [{'left': 2497,
      'right': 3438,
      'top': 341,
      'bottom': 3253,
      'height': 2912,
      'width': 941}]),
   ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3112/0/default.jpg',
    [{'left': 3462,

In [26]:
def add_region_links(ann):
    """Add a 'region_links' list to ann: one IIIF region url per rectangle in its image_range."""
    region_links = []
    try:
        for image_url, regions in ann['image_range']:
            for coords in regions:
                # IIIF region parameter: x,y,w,h
                coord_str = f"{coords['left']},{coords['top']},{coords['width']},{coords['height']}"
                # replace 'full/,<height>' (region/size) by '<x,y,w,h>/full'
                region_url = re.sub(r'(full)/(,\d+)', rf'{coord_str}/\1', image_url)
                region_links.append(region_url)
    except KeyError:
        # only a missing key is expected here; other errors should surface
        print('error: annotation without image range')

    ann['region_links'] = region_links

for res in resolution_annotations:
    add_region_links(res)
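
The regex above relies on the literal text 'full/,<height>' appearing in the url. As an alternative sketch, the url can be split into its trailing IIIF Image API parts ({region}/{size}/{rotation}/{quality}.{format}) and reassembled. The function name iiif_region_url is made up for this example, and it assumes every image url ends in exactly those four path segments, as the urls in this notebook do. The size is set to 'full', matching the region_links output shown further on.

```python
def iiif_region_url(image_url: str, coords: dict) -> str:
    """Build an IIIF Image API url for one rectangular region of a scan.

    Assumes image_url ends in {region}/{size}/{rotation}/{quality}.{format}.
    """
    # Split off the last four path segments; the remainder identifies the image
    base, _region, _size, rotation, quality_format = image_url.rsplit('/', 4)
    region = f"{coords['left']},{coords['top']},{coords['width']},{coords['height']}"
    return '/'.join([base, region, 'full', rotation, quality_format])


# Url and rectangle taken from the entity annotation output in this notebook
url = ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/'
       'NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg')
box = {'left': 3522, 'right': 4413, 'top': 1609, 'bottom': 1702,
       'height': 93, 'width': 891}
link = iiif_region_url(url, box)
print(link)
```

Unlike the regex version, this raises a ValueError on a url with fewer path segments instead of silently returning it unchanged.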

In [27]:
resolution_annotations

[{'label': 'resolutions',
  'begin_anchor': 599,
  'end_anchor': 830,
  'id': 'meeting-1705-01-02-session-1-resolution-27',
  'image_range': [('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3108/0/default.jpg',
    [{'left': 408,
      'right': 1295,
      'top': 1574,
      'bottom': 3274,
      'height': 1700,
      'width': 887}]),
   ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3053/0/default.jpg',
    [{'left': 1347,
      'right': 2270,
      'top': 350,
      'bottom': 3203,
      'height': 2853,
      'width': 923}]),
   ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3112/0/default.jpg',
    [{'left': 2497,
      'right': 3438,
      'top': 341,
      'bottom': 3253,
      'height': 2912,
      'width': 941}]),
   ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3112/0/default.jpg',
    [{'left': 3462,

In [28]:
for a in asearch.get_annotations_of_type('resolutions', all_annotations):
    print(a['region_links'])

['https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/408,1574,887,1700/full/0/default.jpg', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/1347,350,923,2853/full/0/default.jpg', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/2497,341,941,2912/full/0/default.jpg', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/3462,347,941,2912/full/0/default.jpg', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0012.jpg/422,364,897,397/full/0/default.jpg']
['https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0012.jpg/422,751,892,639/full/0/default.jpg']
['https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0012.jpg/420,1382,894,1120/full/0/default.jpg']
['https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0012.jpg/422,2495,894,783/full/0/default.jpg', 'https://

In [30]:
asearch.get_annotation_by_id('meeting-1705-01-02-session-1-resolution-27', all_annotations)

{'label': 'resolutions',
 'begin_anchor': 599,
 'end_anchor': 830,
 'id': 'meeting-1705-01-02-session-1-resolution-27',
 'image_range': [('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3108/0/default.jpg',
   [{'left': 408,
     'right': 1295,
     'top': 1574,
     'bottom': 3274,
     'height': 1700,
     'width': 887}]),
  ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3053/0/default.jpg',
   [{'left': 1347,
     'right': 2270,
     'top': 350,
     'bottom': 3203,
     'height': 2853,
     'width': 923}]),
  ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3112/0/default.jpg',
   [{'left': 2497,
     'right': 3438,
     'top': 341,
     'bottom': 3253,
     'height': 2912,
     'width': 941}]),
  ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3112/0/default.jpg',
   [{'left': 3462,
     'right': 4403,
     '

In [31]:
for a in asearch.get_annotations_of_type_overlapping('resolutions',130,130,all_annotations):
    print(a['region_links'])

['https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0009.jpg/420,2436,897,837/full/0/default.jpg', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0009.jpg/1361,349,903,212/full/0/default.jpg']
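
For readers without the asearch module at hand, the overlap query can be sketched as follows. This is a hypothetical stand-in, not the asearch implementation. It assumes begin_anchor and end_anchor are inclusive line indexes, which the single-line entity annotations (begin_anchor equal to end_anchor) suggest.

```python
def annotations_of_type_overlapping(label, begin, end, annotations):
    """Yield annotations with this label whose anchor range overlaps [begin, end].

    Both ranges are treated as inclusive (an assumption).
    """
    for ann in annotations:
        if (ann['label'] == label
                and ann['begin_anchor'] <= end
                and ann['end_anchor'] >= begin):
            yield ann


# Anchors and ids copied from outputs in this notebook; other fields omitted
sample_annotations = [
    {'label': 'resolutions', 'begin_anchor': 599, 'end_anchor': 830,
     'id': 'meeting-1705-01-02-session-1-resolution-27'},
    {'label': 'entities', 'begin_anchor': 2814, 'end_anchor': 2814,
     'id': 'NL-HaNA_1.01.02_3760_entity-1'},
]
hits = [a['id'] for a in
        annotations_of_type_overlapping('resolutions', 600, 600, sample_annotations)]
print(hits)
```

Passing the same value for begin and end, as in the cell above, asks which annotations of that type cover a single text line.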


In [32]:
for a in get_textlines_between(127,147,all_annotations):
    print(a)

NameError: name 'tservice' is not defined

In [34]:
for a in asearch.get_annotations_of_type('resolutions', all_annotations):
    print(json.dumps(a, sort_keys=False, indent=2))

{
  "label": "resolutions",
  "begin_anchor": 599,
  "end_anchor": 830,
  "id": "meeting-1705-01-02-session-1-resolution-27",
  "image_range": [
    [
      "https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3108/0/default.jpg",
      [
        {
          "left": 408,
          "right": 1295,
          "top": 1574,
          "bottom": 3274,
          "height": 1700,
          "width": 887
        }
      ]
    ],
    [
      "https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3053/0/default.jpg",
      [
        {
          "left": 1347,
          "right": 2270,
          "top": 350,
          "bottom": 3203,
          "height": 2853,
          "width": 923
        }
      ]
    ],
    [
      "https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3112/0/default.jpg",
      [
        {
          "left": 2497,
          "right": 3438,
          "top": 341,
          "botto

In [35]:
all_textlines.slice(20,30)

['Dau Tour.',
 'Wichers, Gockinga.',
 'ai',
 'a',
 'Rinm gE Resolutien eergisteren genomen',
 'jp En geleen en geresumeert , ge-',
 '|',
 'A He steert zyn de Depesches daer uyt re-',
 'a ) sulterende.',
 'Ontfangen een Missive van den Resident Bil-',
 'derbeeck, geschreven tot Keulen den dertigsten']

In [37]:
# write all_textlines to a file
with open(datadir+'all_textlines.json', 'w') as filehandle:
    json.dump(all_textlines, filehandle, cls=segmentedtext.SegmentEncoder)
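
json.dump cannot serialize arbitrary objects on its own, which is why an encoder class is passed via cls. The sketch below shows the general mechanism with made-up classes ExampleSegments and ExampleSegmentEncoder. The real segmentedtext.SegmentEncoder and the JSON layout it produces are not shown in this notebook, so the 'segments' key here is purely an assumption.

```python
import json


class ExampleSegments:
    """Hypothetical stand-in for a list-backed segmented text."""

    def __init__(self, segments=None):
        self.segments = list(segments or [])


class ExampleSegmentEncoder(json.JSONEncoder):
    """Tell json how to serialize ExampleSegments objects."""

    def default(self, obj):
        if isinstance(obj, ExampleSegments):
            # Represent the object as a plain dict that json understands
            return {'segments': obj.segments}
        # Fall back to the default behaviour (raises TypeError)
        return super().default(obj)


text = ExampleSegments(['Dau Tour.', 'Wichers, Gockinga.'])
encoded = json.dumps(text, cls=ExampleSegmentEncoder)
decoded = json.loads(encoded)
print(encoded)
```

Reading the file back yields plain dicts and lists, so a loader would have to rebuild the segmented text object from them.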

Add image_range and region_links to the entity_annotations

In [38]:
entity_annotations

[{'label': 'entities',
  'begin_anchor': 2814,
  'end_anchor': 2814,
  'id': 'NL-HaNA_1.01.02_3760_entity-1',
  'entity_type': 'per',
  'entity_text': 'Arnoldus Boomhouwer'},
 {'label': 'entities',
  'begin_anchor': 2832,
  'end_anchor': 2833,
  'id': 'NL-HaNA_1.01.02_3760_entity-2',
  'entity_type': 'per',
  'entity_text': 'van Rabenpeé'},
 {'label': 'entities',
  'begin_anchor': 2819,
  'end_anchor': 2819,
  'id': 'NL-HaNA_1.01.02_3760_entity-3',
  'entity_type': 'loc',
  'entity_text': 'Roermond'}]

In [39]:
for ea in entity_annotations:
    add_image_range(ea)

entity_annotations

[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg', [{'left': 3522, 'right': 4413, 'top': 1609, 'bottom': 1702, 'height': 93, 'width': 891}])]
[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg', [{'left': 3488, 'right': 4379, 'top': 2477, 'bottom': 2621, 'height': 144, 'width': 891}])]
[('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg', [{'left': 3515, 'right': 4403, 'top': 1849, 'bottom': 1944, 'height': 95, 'width': 888}])]


[{'label': 'entities',
  'begin_anchor': 2814,
  'end_anchor': 2814,
  'id': 'NL-HaNA_1.01.02_3760_entity-1',
  'entity_type': 'per',
  'entity_text': 'Arnoldus Boomhouwer',
  'image_range': [('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg',
    [{'left': 3522,
      'right': 4413,
      'top': 1609,
      'bottom': 1702,
      'height': 93,
      'width': 891}])]},
 {'label': 'entities',
  'begin_anchor': 2832,
  'end_anchor': 2833,
  'id': 'NL-HaNA_1.01.02_3760_entity-2',
  'entity_type': 'per',
  'entity_text': 'van Rabenpeé',
  'image_range': [('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg',
    [{'left': 3488,
      'right': 4379,
      'top': 2477,
      'bottom': 2621,
      'height': 144,
      'width': 891}])]},
 {'label': 'entities',
  'begin_anchor': 2819,
  'end_anchor': 2819,
  'id': 'NL-HaNA_1.01.02_3760_entity-3',
  'entity_type': 'loc',
  'entit

In [40]:
for ea in entity_annotations:
    add_region_links(ea)
    
entity_annotations

[{'label': 'entities',
  'begin_anchor': 2814,
  'end_anchor': 2814,
  'id': 'NL-HaNA_1.01.02_3760_entity-1',
  'entity_type': 'per',
  'entity_text': 'Arnoldus Boomhouwer',
  'image_range': [('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg',
    [{'left': 3522,
      'right': 4413,
      'top': 1609,
      'bottom': 1702,
      'height': 93,
      'width': 891}])],
  'region_links': ['https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/3522,1609,891,93/full/0/default.jpg']},
 {'label': 'entities',
  'begin_anchor': 2832,
  'end_anchor': 2833,
  'id': 'NL-HaNA_1.01.02_3760_entity-2',
  'entity_type': 'per',
  'entity_text': 'van Rabenpeé',
  'image_range': [('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg',
    [{'left': 3488,
      'right': 4379,
      'top': 2477,
      'bottom': 2621,
      'height': 144,
      'width': 891}])],


In [41]:
for a in asearch.get_annotations_of_type('entities', all_annotations):
    print(json.dumps(a, sort_keys=False, indent=2))

{
  "label": "entities",
  "begin_anchor": 2814,
  "end_anchor": 2814,
  "id": "NL-HaNA_1.01.02_3760_entity-1",
  "entity_type": "per",
  "entity_text": "Arnoldus Boomhouwer",
  "image_range": [
    [
      "https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg",
      [
        {
          "left": 3522,
          "right": 4413,
          "top": 1609,
          "bottom": 1702,
          "height": 93,
          "width": 891
        }
      ]
    ]
  ],
  "region_links": [
    "https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/3522,1609,891,93/full/0/default.jpg"
  ]
}
{
  "label": "entities",
  "begin_anchor": 2832,
  "end_anchor": 2833,
  "id": "NL-HaNA_1.01.02_3760_entity-2",
  "entity_type": "per",
  "entity_text": "van Rabenpe\u00e9",
  "image_range": [
    [
      "https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0022.jpg/full/,3244/0/default.jpg",
      [
        {
  

In [43]:
for t in all_textlines.slice(1148,1164):
    print(t)

Op de Requeste van Michiel Veruly, Verw-
verkooper tot Rotterdam, is na voorgaende de-
liberatie goetgevonden en verstaen, dat ten be-
hoeve van den Suppliant een Pasport sal werden
vedepescheert , om twaelf lasten gerafineerde
Swavel by kleyne partyen by overschepinge na
Brabandt ende Vlaenderen te mogen uytvoeren,
mits betalende ’s Landts gerechtigheyt, mitsga-
ders de Spaensche rechten van het Fort Ste. Maria
zen den Ontfanger Flinck: Ende sal Extract van
dese haer Hoogh Mogende Resolutie gesonden
werden aen het Collegie ter Admiraliteyt in Zee-
landt en het selve aengeschreven soodanige or-
dre te stellen, en die voorsieninge te doen, dat
de voorschreve Swavel onverhindert en sonder
eenioe molestatie op der selver Comptoiren mo-
en passeren.


In [45]:
for a in asearch.get_annotations_of_type('resolutions', all_annotations):
    if len(a['region_links']) > 1:
        print(a['id'])

meeting-1705-01-02-session-1-resolution-27
meeting-1705-01-02-session-1-resolution-30
meeting-1705-01-02-session-1-resolution-35
meeting-1705-01-02-session-1-resolution-2
meeting-1705-01-02-session-1-resolution-8
meeting-1705-01-02-session-1-resolution-11
meeting-1705-01-02-session-1-resolution-14
meeting-1705-01-02-session-1-resolution-17
meeting-1705-01-02-session-1-resolution-21
meeting-1705-01-02-session-1-resolution-22
meeting-1705-01-02-session-1-resolution-24
meeting-1705-01-02-session-1-resolution-26
meeting-1705-01-03-session-1-resolution-6
meeting-1705-01-03-session-1-resolution-9
meeting-1705-01-03-session-1-resolution-15
meeting-1705-01-03-session-1-resolution-17
meeting-1705-01-03-session-1-resolution-19
meeting-1705-01-03-session-1-resolution-21
meeting-1705-01-03-session-1-resolution-23
meeting-1705-01-03-session-1-resolution-24
meeting-1705-01-06-session-1-resolution-4
meeting-1705-01-06-session-1-resolution-7
meeting-1705-01-06-session-1-resolution-10
meeting-1705-01-0

In [47]:
asearch.get_annotation_by_id('meeting-1705-01-02-session-1-resolution-27', all_annotations)

{'label': 'resolutions',
 'begin_anchor': 599,
 'end_anchor': 830,
 'id': 'meeting-1705-01-02-session-1-resolution-27',
 'image_range': [('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3108/0/default.jpg',
   [{'left': 408,
     'right': 1295,
     'top': 1574,
     'bottom': 3274,
     'height': 1700,
     'width': 887}]),
  ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3053/0/default.jpg',
   [{'left': 1347,
     'right': 2270,
     'top': 350,
     'bottom': 3203,
     'height': 2853,
     'width': 923}]),
  ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3112/0/default.jpg',
   [{'left': 2497,
     'right': 3438,
     'top': 341,
     'bottom': 3253,
     'height': 2912,
     'width': 941}]),
  ('https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3760/NL-HaNA_1.01.02_3760_0011.jpg/full/,3112/0/default.jpg',
   [{'left': 3462,
     'right': 4403,
     '

In [48]:
# write all_annotations to a file
with open(datadir+'republic_annotations.json', 'w') as filehandle:
    json.dump(all_annotations, filehandle, cls=segmentedtext.AnchorEncoder)