This notebook explores the use of the Goetgevonden APIs. It starts with trivial scenarios (a simple query with basic results) and then gradually moves into use cases and research questions collected by the Goetgevonden team.


Step 1: connect to the goetgevonden broccoli service

In [1]:
import requests as rq

broccoli_endpoint = 'https://api.goetgevonden.nl/about'
# Note: this sends 'accept' as a URL query parameter, not as an HTTP Accept header.
params = {'accept': 'application/json'}

result = rq.get(broccoli_endpoint, params=params)

result.json()

{'appName': 'Broccoli',
 'version': '0.40.2',
 'startedAt': '2024-12-04T09:20:25.540539035Z',
 'baseURI': 'https://api.goetgevonden.nl',
 'hucLogLevel': 'DEBUG'}

Step 1b: retrieve some information about the project and, if possible, about the indexName to use in step 2

In [2]:
query = 'https://api.goetgevonden.nl/projects/republic'
result = rq.get(query, params = params)

result.json()

['DateOccurrence',
 'Entity',
 'Page',
 'Paragraph',
 'Resolution',
 'Session',
 'Volume']

In [3]:
query = 'https://api.goetgevonden.nl/brinta/republic/indices'
result = rq.get(query, params = params)

result.json()

{'republic-2024.11.18': {'textType': 'keyword',
  'resolutionType': 'keyword',
  'propositionType': 'keyword',
  'delegateName': 'keyword',
  'personName': 'keyword',
  'roleName': 'keyword',
  'roleCategories': 'keyword',
  'locationName': 'keyword',
  'locationCategories': 'keyword',
  'organisationName': 'keyword',
  'organisationCategories': 'keyword',
  'commissionName': 'keyword',
  'commissionCategories': 'keyword',
  'sessionWeekday': 'keyword',
  'delegateId': 'keyword',
  'personId': 'keyword',
  'roleId': 'keyword',
  'locationId': 'keyword',
  'organisationId': 'keyword',
  'commissionId': 'keyword',
  'bodyType': 'keyword',
  'sessionDate': 'date',
  'sessionDay': 'byte',
  'sessionMonth': 'byte',
  'sessionYear': 'short'},
 'republic-2024.11.30': {'textType': 'keyword',
  'resolutionType': 'keyword',
  'propositionType': 'keyword',
  'delegateName': 'keyword',
  'personName': 'keyword',
  'roleName': 'keyword',
  'roleCategories': 'keyword',
  'locationName': 'keyword',
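
The index names returned above appear to end in a date (republic-2024.11.18, republic-2024.11.30). A minimal sketch for picking the most recent one, assuming every name follows the <project>-YYYY.MM.DD pattern. That pattern is inferred only from the two names visible in this output, and latest_index is a hypothetical helper, not part of the API:

```python
from datetime import datetime


def latest_index(indices):
    """Return the index name whose date suffix is most recent.

    Assumes names look like 'republic-2024.11.30'; names that do not
    match this pattern are ignored.
    """
    dated = []
    for name in indices:
        _, _, suffix = name.rpartition('-')
        try:
            dated.append((datetime.strptime(suffix, '%Y.%m.%d'), name))
        except ValueError:
            continue  # not a date-suffixed index name
    if not dated:
        raise ValueError('no date-suffixed index names found')
    return max(dated)[1]


# Example with the two names shown above (field mappings omitted).
example = {'republic-2024.11.18': {}, 'republic-2024.11.30': {}}
print(latest_index(example))  # republic-2024.11.30
```

With the live response, latest_index(result.json()) would give the value to use for indexName in step 2.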
 

Step 2: perform a basic search query

In [4]:
query = 'https://api.goetgevonden.nl/projects/republic/search?indexName=republic-2024.11.30&fragmentSize=10&from=0&size=2&sortBy=_score&sortOrder=desc'

body = {
    'text': 'maurits',
    'terms': {},
    'date': {
        'name': 'sessionDate',
        'from': '1576-08-04',
        'to': '1796-03-01'
    }
}

result = rq.post(url=query, json=body)
result.json()

{'total': {'value': 2780, 'relation': 'eq'},
 'results': [{'_id': 'urn:republic:inv-4845-date-1644-01-16-session-427-resolution-1',
   '_hits': {'text': ['Graeff <em>Maurits</em>',
     'Graeff <em>Maurits</em>',
     'geschreven in <em>Maurits</em>']},
   'textType': 'handwritten',
   'resolutionType': 'speciaal',
   'propositionType': 'onbekend',
   'sessionWeekday': 'zaterdag',
   'bodyType': 'Resolution',
   'sessionDate': '1644-01-16',
   'sessionDay': 16,
   'sessionMonth': 1,
   'sessionYear': 1644},
  {'_id': 'urn:republic:inv-3195-date-1636-03-24-session-47-resolution-7',
   '_hits': {'text': ['Graeff <em>Maurits</em>',
     '<em>Maurits</em> van',
     'Graeff <em>Maurits</em>']},
   'textType': 'handwritten',
   'resolutionType': 'ordinaris',
   'propositionType': 'onbekend',
   'sessionWeekday': 'maandag',
   'bodyType': 'Resolution',
   'sessionDate': '1636-03-24',
   'sessionDay': 24,
   'sessionMonth': 3,
   'sessionYear': 1636}],
 'aggs': {}}
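
The _hits field returns short fragments with the matched term wrapped in <em> tags. If plain text is needed, the tags can be stripped. A small sketch (plain_hits is a hypothetical helper name):

```python
import re

EM_TAG = re.compile(r'</?em>')


def plain_hits(result):
    """Return the highlighted fragments of one search result without <em> tags."""
    fragments = result.get('_hits', {}).get('text', [])
    return [EM_TAG.sub('', fragment) for fragment in fragments]


# One result taken from the output above, reduced to the relevant fields.
sample = {
    '_id': 'urn:republic:inv-3195-date-1636-03-24-session-47-resolution-7',
    '_hits': {'text': ['Graeff <em>Maurits</em>', '<em>Maurits</em> van']},
}
print(plain_hits(sample))  # ['Graeff Maurits', 'Maurits van']
```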

Step 3: first simple scenario where I do something with search results and/or facet counts

Take one of Marijn's first use cases: I want to query the API to retrieve all metadata, entities and texts for resolutions with proposition type ‘request’ and session date > 1700-01-01, in three separate queries (one for metadata, one for entities, one for texts). The goal is to gather data for 18th-century resolutions about petitions and run text analysis on them, grouping resolutions along various metadata dimensions (one dimension is resolutions per month, another is resolutions by province of the president).

First step: query with date > 1700-01-01 and propositionType == rekest, and return the metadata triplets _id, sessionMonth, president. The province of each president still has to be found somewhere else. The result is a table with 4 columns.

In [5]:
body = {
    'terms': {'propositionType':['rekest']},
    'date': {
        'name': 'sessionDate',
        'from': '1700-01-01',
        'to': '1796-03-01'
    },
    'aggs': {
        'propositionType': {
            'order': 'countDesc',
            'size': 10
        }
    }
}

result = rq.post(url=query, json=body)
result.json()

{'total': {'value': 76523, 'relation': 'eq'},
 'results': [{'_id': 'urn:republic:inv-3760-date-1705-05-18-session-131-resolution-29',
   'textType': 'printed',
   'resolutionType': 'ordinaris',
   'propositionType': 'rekest',
   'sessionWeekday': 'maandag',
   'bodyType': 'Resolution',
   'sessionDate': '1705-05-18',
   'sessionDay': 18,
   'sessionMonth': 5,
   'sessionYear': 1705},
  {'_id': 'urn:republic:inv-3760-date-1705-05-18-session-131-resolution-30',
   'textType': 'printed',
   'resolutionType': 'ordinaris',
   'propositionType': 'rekest',
   'sessionWeekday': 'maandag',
   'bodyType': 'Resolution',
   'sessionDate': '1705-05-18',
   'sessionDay': 18,
   'sessionMonth': 5,
   'sessionYear': 1705}],
 'aggs': {'propositionType': {'rekest': 76523,
   'missive': 157761,
   'onbekend': 83064,
   'rapport': 13089,
   'memorie': 9311,
   'voordracht': 6827,
   'conclusie': 1784,
   'declaratie': 1369,
   'advies': 1055,
   'rekening': 771}}}
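
The aggs block maps each facet value to a count. In the output above the values are not strictly in descending order of count ('rekest' is listed before 'missive'), so it may be convenient to sort them locally. A small sketch using a subset of the counts shown above (sorted_facet is a hypothetical helper):

```python
def sorted_facet(aggs, facet):
    """Return (value, count) pairs for one facet, highest count first."""
    counts = aggs.get(facet, {})
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Subset of the 'aggs' output above.
aggs = {'propositionType': {'rekest': 76523, 'missive': 157761,
                            'onbekend': 83064, 'rapport': 13089}}
for value, count in sorted_facet(aggs, 'propositionType'):
    print(f'{value:10} {count:>7}')
```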

Now without aggs. Keep the above just for reference.

In [6]:
query = 'https://api.goetgevonden.nl/projects/republic/search?indexName=republic-2024.11.30&fragmentSize=10&from=0&size=2&sortBy=_score&sortOrder=desc'
body = {
    'terms': {'propositionType':['rekest']},
    'date': {
        'name': 'sessionDate',
        'from': '1700-01-01',
        'to': '1796-03-01'
    }
}

result = rq.post(url=query, json=body)
result.json()

{'total': {'value': 76523, 'relation': 'eq'},
 'results': [{'_id': 'urn:republic:inv-3760-date-1705-05-18-session-131-resolution-29',
   'textType': 'printed',
   'resolutionType': 'ordinaris',
   'propositionType': 'rekest',
   'sessionWeekday': 'maandag',
   'bodyType': 'Resolution',
   'sessionDate': '1705-05-18',
   'sessionDay': 18,
   'sessionMonth': 5,
   'sessionYear': 1705},
  {'_id': 'urn:republic:inv-3760-date-1705-05-18-session-131-resolution-30',
   'textType': 'printed',
   'resolutionType': 'ordinaris',
   'propositionType': 'rekest',
   'sessionWeekday': 'maandag',
   'bodyType': 'Resolution',
   'sessionDate': '1705-05-18',
   'sessionDay': 18,
   'sessionMonth': 5,
   'sessionYear': 1705}],
 'aggs': {}}
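
The _id values appear to encode the inventory number, session date, session number and resolution number. A sketch that splits an identifier into those parts, assuming all resolution identifiers follow the pattern seen in the results above. This is inferred from the examples, not from API documentation, and parse_resolution_id is a hypothetical helper:

```python
import re

ID_PATTERN = re.compile(
    r'^urn:republic:inv-(?P<inventory>[^-]+)'
    r'-date-(?P<date>\d{4}-\d{2}-\d{2})'
    r'-session-(?P<session>\d+)'
    r'-resolution-(?P<resolution>\d+)$'
)


def parse_resolution_id(resolution_id):
    """Split a resolution URN into its parts, or return None if it does not match."""
    match = ID_PATTERN.match(resolution_id)
    if match is None:
        return None
    parts = match.groupdict()
    parts['session'] = int(parts['session'])
    parts['resolution'] = int(parts['resolution'])
    return parts


print(parse_resolution_id(
    'urn:republic:inv-3760-date-1705-05-18-session-131-resolution-29'))
# {'inventory': '3760', 'date': '1705-05-18', 'session': 131, 'resolution': 29}
```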

Setting size=100000 appears to exceed a built-in limit; size=10000 works. A strategy is needed to retrieve and process all 76523 results.

Try pagination with fixed-size pages (the cell below uses 100 results per page). Build in a check for reaching the last result.

In [7]:
import pandas as pd

end_reached = False

base_url = "https://api.goetgevonden.nl/projects/republic/search"

params = {
    'indexName': 'republic-2024.11.30',
    'fragmentSize': 10,
    'sortBy':'_score',
    'sortOrder': 'desc',
    'from': 0,
    'size': 5
}

body = {
    'terms': {'propositionType':['rekest']},
    'date': {
        'name': 'sessionDate',
        'from': '1700-01-01',
        'to': '1796-03-01'
    }
}

start = 0
size = 100

def get_page_results(starting, size):
    """Fetch one page of search results, starting at offset `starting`."""
    params['from'] = starting
    params['size'] = size

    result = rq.post(url=base_url, params=params, json=body)

    # Print the first page once as a sanity check.
    if starting == 0:
        print(result.json())

    return result.json()

all_results = []

while not end_reached:
    res = get_page_results(start, size)
    start += size

    # Stop when the response has no 'results' field (e.g. an error response)
    # or when the page is empty (no more results).
    if not res or not res.get('results'):
        end_reached = True
    else:
        for r in res['results']:
            all_results.append([r['_id'], r['sessionMonth']])


{'total': {'value': 76523, 'relation': 'eq'}, 'results': [{'_id': 'urn:republic:inv-3760-date-1705-05-18-session-131-resolution-29', 'textType': 'printed', 'resolutionType': 'ordinaris', 'propositionType': 'rekest', 'sessionWeekday': 'maandag', 'bodyType': 'Resolution', 'sessionDate': '1705-05-18', 'sessionDay': 18, 'sessionMonth': 5, 'sessionYear': 1705}, {'_id': 'urn:republic:inv-3760-date-1705-05-18-session-131-resolution-30', 'textType': 'printed', 'resolutionType': 'ordinaris', 'propositionType': 'rekest', 'sessionWeekday': 'maandag', 'bodyType': 'Resolution', 'sessionDate': '1705-05-18', 'sessionDay': 18, 'sessionMonth': 5, 'sessionYear': 1705}, {'_id': 'urn:republic:inv-3760-date-1705-05-18-session-131-resolution-31', 'textType': 'printed', 'resolutionType': 'ordinaris', 'propositionType': 'rekest', 'sessionWeekday': 'maandag', 'bodyType': 'Resolution', 'sessionDate': '1705-05-18', 'sessionDay': 18, 'sessionMonth': 5, 'sessionYear': 1705}, {'_id': 'urn:republic:inv-3760-date-170

In [8]:
df = pd.DataFrame(all_results)
df

                                                      0  1
0     urn:republic:inv-3760-date-1705-05-18-session-...  5
1     urn:republic:inv-3760-date-1705-05-18-session-...  5
2     urn:republic:inv-3760-date-1705-05-18-session-...  5
3     urn:republic:inv-3760-date-1705-05-18-session-...  5
4     urn:republic:inv-3760-date-1705-05-18-session-...  5
...                                                 ... ..
9995  urn:republic:inv-3761-date-1706-03-18-session-...  3
9996  urn:republic:inv-3761-date-1706-03-18-session-...  3
9997  urn:republic:inv-3761-date-1706-03-18-session-...  3
9998  urn:republic:inv-3761-date-1706-03-18-session-...  3
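
With [_id, sessionMonth] pairs collected as in all_results, the "resolutions per month" dimension reduces to a count per month value. A sketch using only the standard library, with a few placeholder rows standing in for all_results (the row values are illustrative, not real query output; count_per_month is a hypothetical helper):

```python
from collections import Counter


def count_per_month(records):
    """Count records per sessionMonth; each record is an [_id, sessionMonth] pair."""
    return Counter(month for _id, month in records)


# Illustrative rows in the same shape as all_results (not real query output).
rows = [['id-a', 5], ['id-b', 5], ['id-c', 3]]
for month, count in sorted(count_per_month(rows).items()):
    print(month, count)
```

On the DataFrame itself, df[1].value_counts().sort_index() should give the same counts.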


OK, two issues: (1) gracefully detecting the end of the results and (2) the built-in maximum number of ES results. For the moment, both are handled as shown above, by checking whether the response contains a 'results' field.
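
One way to make the stop condition explicit is to use the total value from the response and to cap the requests at the window limit. The sketch below takes the page-fetching function as an argument so it can be tried without network access. iter_results and max_window are hypothetical names, and the value 10000 is the limit observed above, not a documented constant.

```python
def iter_results(fetch_page, page_size=100, max_window=10000):
    """Yield results page by page until the total or the window limit is reached.

    fetch_page(start, size) must return a dict shaped like the search
    responses above: {'total': {'value': ...}, 'results': [...]}.
    """
    start = 0
    while start < max_window:
        size = min(page_size, max_window - start)
        page = fetch_page(start, size) or {}
        results = page.get('results', [])
        if not results:
            break  # error response or no more results
        yield from results
        start += len(results)
        total = page.get('total', {}).get('value')
        if total is not None and start >= total:
            break  # everything has been retrieved


# Stand-in for the API: 250 fake results, served in slices.
fake = [{'_id': f'id-{i}', 'sessionMonth': 1 + i % 12} for i in range(250)]


def fake_fetch(start, size):
    return {'total': {'value': len(fake)}, 'results': fake[start:start + size]}


print(len(list(iter_results(fake_fetch))))  # 250
```

Against the real service, fetch_page could wrap the rq.post call from In [7], setting 'from' and 'size' in the query parameters for each page.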

Next issue: overcome the 10k barrier. Then, potentially in parallel and in batches of 10k results: 1. add the province-of-president metadata; 2. collect the texts for each result; 3. collect the entities for each result.
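
A possible way around the 10k limit is to split the date range into smaller windows and paginate each window separately, since the date filter in the request body already accepts 'from' and 'to'. The sketch below generates one window per calendar year. Whether every year stays below the limit for a given query is not verified here, and year_windows is a hypothetical helper.

```python
from datetime import date


def year_windows(date_from, date_to):
    """Split an ISO date range into consecutive (from, to) pairs, one per calendar year."""
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    windows = []
    while start <= end:
        window_end = min(date(start.year, 12, 31), end)
        windows.append((start.isoformat(), window_end.isoformat()))
        start = date(start.year + 1, 1, 1)
    return windows


for window in year_windows('1700-01-01', '1702-03-01'):
    print(window)
# ('1700-01-01', '1700-12-31')
# ('1701-01-01', '1701-12-31')
# ('1702-01-01', '1702-03-01')
```

Each pair can be written into body['date']['from'] and body['date']['to'] before paginating that window.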

In [None]:
res_detail_query = "https://api.goetgevonden.nl/projects/republic/urn:republic:inv-3773-date-1718-05-04-session-118-resolution-5?overlapTypes=Resolution,Session,Entity,Page,DateOccurrence&includeResults=anno,iiif,text&views=self&relativeTo=Origin"
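# `params` was overwritten with search parameters in In [7]; pass the original value explicitly.
result = rq.get(res_detail_query, params={'accept': 'application/json'})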

result.json()
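
The detail URL above combines a resolution _id with a fixed set of query parameters. A sketch that builds the same URL for any identifier, reusing the parameter values from the cell above (detail_url is a hypothetical helper; the meaning of the individual parameters is not documented in this notebook):

```python
from urllib.parse import urlencode

BASE = 'https://api.goetgevonden.nl/projects/republic'

# Parameter values copied from the detail query above.
DETAIL_PARAMS = {
    'overlapTypes': 'Resolution,Session,Entity,Page,DateOccurrence',
    'includeResults': 'anno,iiif,text',
    'views': 'self',
    'relativeTo': 'Origin',
}


def detail_url(resolution_id):
    """Build the detail URL for one resolution identifier."""
    # safe=',' keeps the comma-separated lists unescaped, as in the original URL.
    return f'{BASE}/{resolution_id}?{urlencode(DETAIL_PARAMS, safe=",")}'


print(detail_url('urn:republic:inv-3773-date-1718-05-04-session-118-resolution-5'))
```

Looping detail_url over the collected _id values would be one way to gather texts and entities per resolution.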