In [1]:
import pandas as pd
import transformers
import requests
import time
import json

from tqdm import tqdm

tqdm.pandas()

## Named entity tagging

Load the WikiNEural named entity recognition model, which supports 9 languages, including Dutch and French.

In [2]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

Build the actual pipeline. Passing `device=0` would run it on the first GPU; here it runs on the CPU.

In [3]:
# Set to device="cpu" because this machine has no discrete GPU
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first", device="cpu")

Example: *Diksmuide* is indeed tagged as a location, whereas *Wereldoorlog I* is tagged as miscellaneous.

In [4]:
example = "Diksmuide. Wereldoorlog I (1914-1918)"

ner_results = ner(example)
print(ner_results)

[{'entity_group': 'LOC', 'score': 0.9867822, 'word': 'Diksmuide', 'start': 0, 'end': 9}, {'entity_group': 'MISC', 'score': 0.9712353, 'word': 'Wereldoorlog I', 'start': 11, 'end': 25}]


Let's apply this to our dataset

In [5]:
df = pd.read_csv('/Users/dawn/Desktop/hackathon/20230301_Postcards.csv')

In [6]:
df[df['Uniform title'].isnull()]

Unnamed: 0,MMS ID,Uniform title,Main title,Variant title,Place of publication,Publisher,Date,Material type,Colour,General note,Copyright status of physical object,Copyright status of digital object,Terms of use,Author (main entry),Author (added entry),Group title,Language,Country of publication,Resolver URL,Label (Library Call number)
35447,9992739808601488,,Musée Royal d'Anvers. Wouverman. Chasse à Courre,,[lieu de publication inconnu],[éditeur inconnu],ca. 1910,Graphic,Brown,Titelgegevens ontleend aan prentkaart,public domain,available as open data,gebruiksvoorwaarden,,"Wouwerman, Philips 1619-1668 artist",België. Provincie Antwerpen. Prentkaarten ; Be...,fre,xx,http://resolver.libis.be/IE16593333/representa...,KU Leuven Libraries BCOL BRES GP102409


One record (row 35447) has no *Uniform title*; we fill it in by hand so that the model doesn't fail on a missing value.

In [7]:
df.loc[35447, 'Uniform title'] = 'Antwerpen. Beelden en objecten. Koninklijk Museum voor Schone Kunsten'

In [8]:
df['Uniform title'].isnull().sum()

0

Apply the tagger to *Uniform title* and *Main title*. For now we only use *Uniform title* for the geo lookup (*Main title* often contains more precise data, but is also messier).

Note: batching the titles together would make this run more efficiently, but the run doesn't take that long in the end, so we won't bother.
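
The batching idea mentioned in the note could be sketched roughly as follows. This is a minimal sketch, not what the notebook actually runs: `chunked` and `tag_in_batches` are hypothetical helpers, and it assumes the tagger accepts a list of strings and returns one result per input text.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def tag_in_batches(texts: Sequence[str],
                   tagger: Callable[[List[str]], list],
                   batch_size: int = 32) -> list:
    """Call `tagger` once per batch and return one result per input text."""
    results = []
    for batch in chunked(texts, batch_size):
        results.extend(tagger(list(batch)))
    return results


# Hypothetical usage with the pipeline defined above (not run here):
# df['uniform_ner'] = tag_in_batches(df['Uniform title'].tolist(), ner)
```

Whether this is actually faster depends on the hardware; on a CPU the gain from batching may be small.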

In [9]:
df['uniform_ner'] =  df['Uniform title'].progress_apply(ner)

100%|██████████| 35650/35650 [1:24:52<00:00,  7.00it/s]  


In [10]:
df['uniform_ner'][1]

[{'entity_group': 'LOC',
  'score': 0.9939957,
  'word': 'Belœil',
  'start': 0,
  'end': 6},
 {'entity_group': 'LOC',
  'score': 0.99763954,
  'word': 'Gebouwen',
  'start': 8,
  'end': 16},
 {'entity_group': 'LOC',
  'score': 0.9932414,
  'word': 'Kastelen',
  'start': 18,
  'end': 26},
 {'entity_group': 'LOC',
  'score': 0.7462085,
  'word': 'Park',
  'start': 28,
  'end': 32}]

In [24]:
df['main_ner'] =  df['Main title'].progress_apply(ner)

 10%|█         | 3580/35650 [05:12<46:36, 11.47it/s]  


KeyboardInterrupt: 

## Get openstreetmap data

This function fetches OpenStreetMap data for the named entities tagged as locations. A number of generic location names are filtered out, and *Belgium* is appended to the final list of named entities. The request is sent to OpenStreetMap's Nominatim API. We pause for one second after each request (rate limiting, as per the API rules).

Note: we should probably cache results and reuse them for repeated queries.
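
A minimal sketch of such a cache, assuming the lookup is keyed on the query string. `make_cached_lookup` and `fetch` are hypothetical names; `fetch` stands for any function that takes a query string and returns the parsed API response.

```python
from typing import Callable, Dict, List


def make_cached_lookup(fetch: Callable[[str], List[dict]]) -> Callable[[str], List[dict]]:
    """Wrap `fetch` so that each distinct query string is only fetched once."""
    cache: Dict[str, List[dict]] = {}

    def lookup(query: str) -> List[dict]:
        # Only call the underlying function for queries we have not seen yet
        if query not in cache:
            cache[query] = fetch(query)
        return cache[query]

    return lookup
```

Since many postcards share the same uniform title (for example *Blankenberge. Panorama*), repeated queries would then cost neither a request nor the one-second pause.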

In [11]:
def get_openstreetmap_data(ner_data):
    # Generic category words that get tagged as LOC but are not place names
    stop_elements = ['Gebouwen', 'Kastelen', 'Molens', 'Kapellen', 'Panorama',
                     'Boten', 'Ramp', 'Vertrekken', 'Natuur']

    # Keep the words tagged as locations, minus the generic ones
    locations = [ne['word'] for ne in ner_data
                 if ne['entity_group'] == 'LOC' and ne['word'] not in stop_elements]
    # Append the country to narrow down the search
    locations.append('Belgium')
    payload = {'q': ' '.join(locations), 'format': 'json'}
    r = requests.get('https://nominatim.openstreetmap.org/search', params=payload)
    openstreetmap_data = r.json()
    # Pause for one second between requests (rate limiting)
    time.sleep(1)
    return openstreetmap_data

In [12]:
df['uniform_ner'][1]

[{'entity_group': 'LOC',
  'score': 0.9939957,
  'word': 'Belœil',
  'start': 0,
  'end': 6},
 {'entity_group': 'LOC',
  'score': 0.99763954,
  'word': 'Gebouwen',
  'start': 8,
  'end': 16},
 {'entity_group': 'LOC',
  'score': 0.9932414,
  'word': 'Kastelen',
  'start': 18,
  'end': 26},
 {'entity_group': 'LOC',
  'score': 0.7462085,
  'word': 'Park',
  'start': 28,
  'end': 32}]

Example result returned for the named entities shown above:

In [13]:
get_openstreetmap_data(df['uniform_ner'][1])

[{'place_id': 144582695,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  'osm_type': 'way',
  'osm_id': 165635705,
  'boundingbox': ['50.5454353', '50.553389', '3.7193868', '3.7327803'],
  'lat': '50.5494354',
  'lon': '3.7264512060049375',
  'display_name': 'Parc & Château de Belœil, Belœil, Ath, Hainaut, Wallonie, 7970, België / Belgique / Belgien',
  'class': 'leisure',
  'type': 'park',
  'importance': 0.26}]
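
Note that `lat` and `lon` come back as strings. A small sketch of how coordinates could be pulled out of such a result, assuming we simply take the first hit; `first_coordinates` is a hypothetical helper, not part of the original notebook.

```python
from typing import List, Optional, Tuple


def first_coordinates(results: List[dict]) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) of the first result as floats, or None if there are no results."""
    if not results:
        return None
    top = results[0]
    return float(top['lat']), float(top['lon'])


# Values copied from the example result above
sample = [{'lat': '50.5494354', 'lon': '3.7264512060049375'}]
print(first_coordinates(sample))
```

Queries that return nothing yield an empty list, hence the `None` case.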

This queries results for the entire dataset, which takes quite some time to run. Note: we should probably save intermediate results properly.
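
One way to save intermediate results might look like the sketch below. It assumes that results are JSON-serialisable (as the API responses are) and that items are always processed in the same order; `run_with_checkpoint` and the checkpoint file are hypothetical, not something the notebook uses.

```python
import json
import os
from typing import Callable, List


def run_with_checkpoint(items: List, func: Callable, path: str) -> List:
    """Apply `func` to each item, appending each result to a JSON Lines file.

    If `path` already exists, the results stored there are reused and
    only the remaining items are processed.
    """
    results = []
    if os.path.exists(path):
        with open(path, encoding='utf-8') as fh:
            results = [json.loads(line) for line in fh if line.strip()]
    with open(path, 'a', encoding='utf-8') as fh:
        for item in items[len(results):]:
            result = func(item)
            # One JSON document per line, flushed so that an interrupt loses little
            fh.write(json.dumps(result, ensure_ascii=False) + '\n')
            fh.flush()
            results.append(result)
    return results
```

After an interruption, calling the function again with the same list and path would pick up where the previous run stopped. A line left half-written by a crash would need to be removed by hand first.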

In [14]:
df['openstreetmap_data'] = df['uniform_ner'].progress_apply(get_openstreetmap_data)

100%|██████████| 35650/35650 [11:31:29<00:00,  1.16s/it]  


In [16]:
df['openstreetmap_data']


[{'place_id': 144582695,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  'osm_type': 'way',
  'osm_id': 165635705,
  'boundingbox': ['50.5454353', '50.553389', '3.7193868', '3.7327803'],
  'lat': '50.5494354',
  'lon': '3.7264512060049375',
  'display_name': 'Parc & Château de Belœil, Belœil, Ath, Hainaut, Wallonie, 7970, België / Belgique / Belgien',
  'class': 'leisure',
  'type': 'park',
  'importance': 0.26}]

In [17]:
df.to_csv('withopenstreetmapdata.csv', encoding='utf-8', index=False)

In [19]:
dfnew = pd.read_csv('/Users/dawn/Desktop/hackathon/poststarts/notebooks/withopenstreetmapdata.csv')

In [20]:
dfnew.head(10)

Unnamed: 0,MMS ID,Uniform title,Main title,Variant title,Place of publication,Publisher,Date,Material type,Colour,General note,...,Terms of use,Author (main entry),Author (added entry),Group title,Language,Country of publication,Resolver URL,Label (Library Call number),uniform_ner,openstreetmap_data
0,1,130$a,245$a,246$a,264$a,264$b,264$c,340$a,340$o,500$a,...,542$u,100$a Name 100$d biographical data 100$e relat...,700$a Name 700$d biographical data 700$e relat...,830$a,008$35:3,008$15:2,856$u,856$y,"[{'entity_group': 'MISC', 'score': 0.95299965,...","[{'place_id': 297452507, 'licence': 'Data © Op..."
1,9990136310101488,Belœil. Gebouwen. Kastelen. Park,Belœil. Le parc. Le groupe de Neptune - Het pa...,,[lieu de publication inconnu],[éditeur inconnu],[date de publication inconnue],Graphic,Black-and-white.,Neptunusfontein;Kasteel van Belœil;Titelgegeve...,...,gebruiksvoorwaarden,,,België. Provincie Henegouwen. Prentkaarten ; B...,fre,\\,http://resolver.libis.be/IE2777387/representation,KU Leuven Libraries BIBC BRES GP002180,"[{'entity_group': 'LOC', 'score': 0.9939957, '...","[{'place_id': 144582695, 'licence': 'Data © Op..."
2,9990302540101488,Beringen. Folklore en volkscultuur,Beeringen. Grand'Place. Souvenir des Fêtes de ...,,[lieu de publication inconnu],[éditeur inconnu],[date de publication inconnue],Graphic,Black-and-white.,Titelgegevens ontleend aan prentkaart,...,gebruiksvoorwaarden,,,België. Provincie Limburg. Prentkaarten ; Belg...,fre,\\,http://resolver.libis.be/IE2783050/representation,KU Leuven Libraries BIBC BRES GP003454,"[{'entity_group': 'LOC', 'score': 0.99905795, ...","[{'place_id': 298277599, 'licence': 'Data © Op..."
3,9990544990101488,Bilzen. Panorama,Bilzen. Panorama,,Brussel,Thill,ca. 1948,Graphic,Sepia.,Titelgegevens ontleend aan prentkaart,...,licentievoorwaarden#digi0020,,,België. Provincie Limburg. Prentkaarten ; Belg...,dut,\\,http://resolver.libis.be/IE2785913/representation,KU Leuven Libraries BIBC BRES GP004279,"[{'entity_group': 'LOC', 'score': 0.92356896, ...","[{'place_id': 298038741, 'licence': 'Data © Op..."
4,9990616780101488,Blankenberge. Panorama,"Blankenberge. Là, tout n'est qu'ordre et beaut...",,Bruxelles,Thill,ca. 1954,Graphic,Black-and-white.,Titelgegevens ontleend aan prentkaart,...,licentievoorwaarden#digi0020,,,België. Provincie West-Vlaanderen. Prentkaarte...,fre,\\,http://resolver.libis.be/IE2786767/representation,KU Leuven Libraries BIBC BRES GP004401,"[{'entity_group': 'LOC', 'score': 0.9660541, '...","[{'place_id': 297970958, 'licence': 'Data © Op..."
5,9990731350101488,Blankenberge. Zeedijk,Blankenberge. Grote hotels op zeedijk - Grands...,,Bruxelles,Thill,[datum van uitgave onbekend],Graphic,Black-and-white.,Titelgegevens ontleend aan prentkaart,...,licentievoorwaarden#digi0020,,,België. Provincie West-Vlaanderen. Prentkaarte...,dut,\\,http://resolver.libis.be/IE2788279/representation,KU Leuven Libraries BIBC BRES GP004617,"[{'entity_group': 'LOC', 'score': 0.9758136, '...","[{'place_id': 339204887, 'licence': 'Data © Op..."
6,9990942900101488,Borgerhout. Beelden en objecten,Anvers (Borgerhout). Statue du Général Carnot ...,,[lieu de publication inconnu],[éditeur inconnu],ca. 1908,Graphic,Brown.,Standbeeld van generaal Carnot;Titelgegevens o...,...,gebruiksvoorwaarden,,,België. Provincie Antwerpen. Prentkaarten ; Be...,fre,\\,http://resolver.libis.be/IE2795671/representation,KU Leuven Libraries BIBC BRES GP006388,"[{'entity_group': 'LOC', 'score': 0.99743384, ...","[{'place_id': 298237127, 'licence': 'Data © Op..."
7,9991507520101488,Brussel-Bruxelles. 1905 : 75 jaar Belgische On...,75e Anniversaire de l'Indépendance Belge. Gran...,,Bruxelles,Lagaert,[date de publication inconnue],Graphic,Black-and-white.,75ste verjaardag van de Belgische Onafhankelij...,...,gebruiksvoorwaarden,,,België. Brussels Hoofdstedelijk Gewest. Prentk...,fre,\\,http://resolver.libis.be/IE2824672/representation,KU Leuven Libraries BIBC BRES GP015174,"[{'entity_group': 'LOC', 'score': 0.6504053, '...","[{'place_id': 297569644, 'licence': 'Data © Op..."
8,9991825100101488,Brugge. Gebouwen. Molens,Brugge [01],,Brussel,Thill,[datum van uitgave onbekend],Graphic,Coloured.,Titelgegevens ontleend aan prentkaart,...,licentievoorwaarden#digi0020,,,België. Provincie West-Vlaanderen. Prentkaarte...,dut,\\,http://resolver.libis.be/IE2811127/representation,KU Leuven Libraries BIBC BRES GP010153,"[{'entity_group': 'LOC', 'score': 0.99931943, ...","[{'place_id': 297704144, 'licence': 'Data © Op..."
9,9991874840101488,Brussel-Bruxelles. Gebouwen. Kerken en Kapellen,Bruxelles. Eglise Saint Joseph. Square Frère-O...,,Brussel,Thill,[date de publication inconnue],Graphic,Coloured.,Onze Lieve Vrouw van de Altijddurende Bijstand...,...,licentievoorwaarden#digi0020,,,België. Brussels Hoofdstedelijk Gewest. Prentk...,fre,\\,http://resolver.libis.be/IE2832855/representation,KU Leuven Libraries BIBC BRES GP017055,"[{'entity_group': 'LOC', 'score': 0.81561995, ...","[{'place_id': 297569644, 'licence': 'Data © Op..."
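
After the round trip through CSV, the `uniform_ner` and `openstreetmap_data` columns hold strings rather than lists. A minimal sketch of how they could be parsed back, assuming every cell is a plain Python literal like the ones shown above; `parse_cell` is a hypothetical helper.

```python
import ast


def parse_cell(text: str) -> list:
    """Turn the string form of a list of dicts back into Python objects."""
    return ast.literal_eval(text)


cell = "[{'entity_group': 'LOC', 'score': 0.9939957, 'word': 'Belœil'}]"
parsed = parse_cell(cell)
print(parsed[0]['word'])  # prints: Belœil

# Hypothetical usage when reloading the file written above (not run here):
# dfnew = pd.read_csv('withopenstreetmapdata.csv',
#                     converters={'uniform_ner': parse_cell,
#                                 'openstreetmap_data': parse_cell})
```

Saving to a format that keeps nested structures (JSON, for instance) would avoid this step altogether.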
