In [21]:
import pandas as pd
import json
import requests
import time
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
from tqdm import tqdm

tqdm.pandas()

## Named entity tagging

Load the WikiNEural named entity tagging model, which covers 9 languages, including Dutch and French.

In [22]:
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

Build the actual pipeline; `device=0` would run it on the GPU.

In [23]:
# Changed to device="cpu" because this machine has no discrete GPU
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first", device="cpu")

Example: *Diksmuide* is indeed tagged as a location, whereas *Wereldoorlog I* is tagged as miscellaneous.

Let's apply this to our dataset

In [25]:
df = pd.read_csv('../data/raw/20230301_Postcards.csv')
df.drop(0, inplace=True)

In [26]:
df[df['Uniform title'].isnull()]

Unnamed: 0,MMS ID,Uniform title,Main title,Variant title,Place of publication,Publisher,Date,Material type,Colour,General note,Copyright status of physical object,Copyright status of digital object,Terms of use,Author (main entry),Author (added entry),Group title,Language,Country of publication,Resolver URL,Label (Library Call number)
35447,9992739808601488,,Musée Royal d'Anvers. Wouverman. Chasse à Courre,,[lieu de publication inconnu],[éditeur inconnu],ca. 1910,Graphic,Brown,Titelgegevens ontleend aan prentkaart,public domain,available as open data,gebruiksvoorwaarden,,"Wouwerman, Philips 1619-1668 artist",België. Provincie Antwerpen. Prentkaarten ; Be...,fre,xx,http://resolver.libis.be/IE16593333/representa...,KU Leuven Libraries BCOL BRES GP102409


One *Uniform title* is missing from the dataset; we'll just fill it in, so that the NER pipeline doesn't fail on a null value.

In [27]:
df.loc[35447, 'Uniform title'] = 'Antwerpen. Beelden en objecten. Koninklijk Museum voor Schone Kunsten'

In [28]:
df['Uniform title'].isnull().sum()

0

Apply the tagger to *Uniform title* and *Main title*. For now we'll only use *Uniform title* for the geo lookup (*Main title* often contains more precise data, but it is also messier).

Note: if we batched the titles together in a dataset, this could run more efficiently (but it doesn't take that long in the end, so we won't bother).
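For reference, a minimal sketch of what such batching could look like. The helper `tag_unique_in_batches` and its `tagger` argument are hypothetical names, not part of the notebook; `tagger` is assumed to be any callable that takes a list of strings and returns one list of entities per string. The sketch also tags each distinct title only once, since many postcards share a *Uniform title*.

```python
from typing import Callable, Dict, List, Sequence


def tag_unique_in_batches(
    texts: Sequence[str],
    tagger: Callable[[List[str]], List[list]],
    batch_size: int = 32,
) -> List[list]:
    """Tag each distinct text once, in batches, and map results back in order."""
    # De-duplicate while keeping first-seen order.
    unique_texts = list(dict.fromkeys(texts))
    results: Dict[str, list] = {}
    for start in range(0, len(unique_texts), batch_size):
        batch = unique_texts[start:start + batch_size]
        # One tagger call per batch; assumes one result per input text.
        for text, entities in zip(batch, tagger(batch)):
            results[text] = entities
    # Expand back to one result per original row.
    return [results[text] for text in texts]
```

With the pipeline above, `tagger` could be something like `lambda batch: ner(batch)`, assuming the pipeline returns one list of entities per input text when it is given a list. This has not been run against the dataset here.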

In [None]:
df['uniform_ner'] =  df['Uniform title'].progress_apply(ner)

 32%|███▏      | 11426/35650 [20:49<1:03:54,  6.32it/s]

In [207]:
len(df['uniform_ner'])

35650

In [24]:
df['main_ner'] =  df['Main title'].progress_apply(ner)

 10%|█         | 3580/35650 [05:12<46:36, 11.47it/s]  


KeyboardInterrupt: 

## Get openstreetmap data

Function that fetches OpenStreetMap data for the named entities tagged as locations. Note: a number of generic location names are filtered out, and *Belgium* is appended to the final list of named entities. The request is sent to OpenStreetMap's Nominatim API. We pause for one second after each request (rate limiting, as per the API rules).

Note: we should probably cache results and reuse them for identical queries.
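A minimal sketch of such a cache, assuming the lookup is split into a function that takes the query string. `make_cached_lookup` and `fetch` are hypothetical names, not part of the notebook; `fetch` stands for whatever sends the actual request (including the one-second pause), so repeated queries skip both the request and the pause.

```python
from typing import Callable, Dict, List


def make_cached_lookup(fetch: Callable[[str], List[dict]]) -> Callable[[str], List[dict]]:
    """Wrap a query function so each distinct query string is fetched only once."""
    cache: Dict[str, List[dict]] = {}

    def lookup(query: str) -> List[dict]:
        if query not in cache:
            # Only reached for queries we have not seen before.
            cache[query] = fetch(query)
        return cache[query]

    return lookup
```

The cache lives in memory only, so it helps within one run but is lost when the kernel restarts.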

In [14]:
def get_openstreetmap_data(ner_data):
    """Query Nominatim with the LOC entities of one title; return the list of matches."""
    # Generic location names to filter out
    stop_elements = ['Gebouwen', 'Kastelen', 'Molens', 'Kapellen', 'Panorama',
                     'Boten', 'Ramp', 'Vertrekken', 'Natuur']

    # Keep the entities tagged as locations, minus the generic names
    locations = [ne['word'] for ne in ner_data
                 if ne['entity_group'] == 'LOC' and ne['word'] not in stop_elements]
    locations.append('Belgium')
    payload = {'q': ' '.join(locations), 'format': 'json'}
    r = requests.get('https://nominatim.openstreetmap.org/search', params=payload)
    openstreetmap_data = r.json()
    time.sleep(1)  # rate limiting: pause one second between requests
    return openstreetmap_data

This runs the lookup for the entire dataset, which takes a while. Note: we should probably save intermediate results properly.
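One way to save intermediate results, sketched under assumptions: every result is appended to a JSON Lines file as soon as it is available, and a re-run skips keys that are already in the file. `run_with_checkpoint`, `work` and the file layout are hypothetical, not what the notebook does; `work` stands for the per-item lookup and must return something JSON-serialisable.

```python
import json
import os
from typing import Callable, Dict, Iterable


def run_with_checkpoint(
    keys: Iterable[str],
    work: Callable[[str], object],
    path: str,
) -> Dict[str, object]:
    """Apply `work` to each key, appending every result to a JSON Lines file."""
    done: Dict[str, object] = {}
    if os.path.exists(path):
        # Reload whatever an earlier, interrupted run already wrote.
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                done[record["key"]] = record["value"]
    with open(path, "a", encoding="utf-8") as f:
        for key in keys:
            if key in done:
                continue  # already computed in a previous run
            done[key] = work(key)
            f.write(json.dumps({"key": key, "value": done[key]}, ensure_ascii=False) + "\n")
            f.flush()  # push the record out of Python's buffer right away
    return done
```

If the keys are the query strings, this doubles as a cache that survives kernel restarts. The sketch does not handle a half-written last line after a crash; that would need an extra check when reloading.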

In [14]:
# The output file was lost by accident.
# I tried to re-run this cell, but it failed because the internet connection in my dorm dropped.

df['openstreetmap_data'] = df['uniform_ner'].progress_apply(get_openstreetmap_data)

100%|██████████| 35650/35650 [11:31:29<00:00,  1.16s/it]  


In [219]:
# Latitude of the first Nominatim match, or '' when nothing was found
latlist = [item[0]['lat'] if len(item) != 0 else '' for item in df['openstreetmap_data']]

In [220]:
# Longitude of the first Nominatim match, or '' when nothing was found
lonlist = [item[0]['lon'] if len(item) != 0 else '' for item in df['openstreetmap_data']]
df['Lat'] = latlist
df['Lng'] = lonlist

Unnamed: 0,MMS ID,Uniform title,Main title,Variant title,Place of publication,Publisher,Date,Material type,Colour,General note,...,Author (added entry),Group title,Language,Country of publication,Resolver URL,Label (Library Call number),uniform_ner,openstreetmap_data,lat,lon
0,1,130$a,245$a,246$a,264$a,264$b,264$c,340$a,340$o,500$a,...,700$a Name 700$d biographical data 700$e relat...,830$a,008$35:3,008$15:2,856$u,856$y,"[{'entity_group': 'MISC', 'score': 0.95299965,...","[{'place_id': 297452507, 'licence': 'Data © Op...",50.6402809,4.6667145
1,9990136310101488,Belœil. Gebouwen. Kastelen. Park,Belœil. Le parc. Le groupe de Neptune - Het pa...,,[lieu de publication inconnu],[éditeur inconnu],[date de publication inconnue],Graphic,Black-and-white.,Neptunusfontein;Kasteel van Belœil;Titelgegeve...,...,,België. Provincie Henegouwen. Prentkaarten ; B...,fre,\\,http://resolver.libis.be/IE2777387/representation,KU Leuven Libraries BIBC BRES GP002180,"[{'entity_group': 'LOC', 'score': 0.9939957, '...","[{'place_id': 144582695, 'licence': 'Data © Op...",50.5494354,3.7264512060049375
2,9990302540101488,Beringen. Folklore en volkscultuur,Beeringen. Grand'Place. Souvenir des Fêtes de ...,,[lieu de publication inconnu],[éditeur inconnu],[date de publication inconnue],Graphic,Black-and-white.,Titelgegevens ontleend aan prentkaart,...,,België. Provincie Limburg. Prentkaarten ; Belg...,fre,\\,http://resolver.libis.be/IE2783050/representation,KU Leuven Libraries BIBC BRES GP003454,"[{'entity_group': 'LOC', 'score': 0.99905795, ...","[{'place_id': 298277599, 'licence': 'Data © Op...",51.0502026,5.220809082300834
3,9990544990101488,Bilzen. Panorama,Bilzen. Panorama,,Brussel,Thill,ca. 1948,Graphic,Sepia.,Titelgegevens ontleend aan prentkaart,...,,België. Provincie Limburg. Prentkaarten ; Belg...,dut,\\,http://resolver.libis.be/IE2785913/representation,KU Leuven Libraries BIBC BRES GP004279,"[{'entity_group': 'LOC', 'score': 0.92356896, ...","[{'place_id': 298038741, 'licence': 'Data © Op...",50.8707787,5.5181089
4,9990616780101488,Blankenberge. Panorama,"Blankenberge. Là, tout n'est qu'ordre et beaut...",,Bruxelles,Thill,ca. 1954,Graphic,Black-and-white.,Titelgegevens ontleend aan prentkaart,...,,België. Provincie West-Vlaanderen. Prentkaarte...,fre,\\,http://resolver.libis.be/IE2786767/representation,KU Leuven Libraries BIBC BRES GP004401,"[{'entity_group': 'LOC', 'score': 0.9660541, '...","[{'place_id': 297970958, 'licence': 'Data © Op...",51.31700275,3.133658034483461
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35645,9992683362801488,Antwerpen. Beelden en objecten. Koninklijk Mus...,"Musée Royal d'Anvers. Le Jugement dernier, les...",,Anvers,Hermans,ca. 1913,Graphic,Brown,Titelgegevens ontleend aan prentkaart,...,"Bosch, Hiëronymus approximately 1450-1516 artist",België. Provincie Antwerpen. Prentkaarten ; Be...,fre,xx,http://resolver.libis.be/IE16587223/representa...,KU Leuven Libraries BCOL BRES GP101210,"[{'entity_group': 'LOC', 'score': 0.98758906, ...","[{'place_id': 112710147, 'licence': 'Data © Op...",51.2084689,4.394877186720867
35646,9992688280601488,Antwerpen. Leysstraat,Antwerpen. Ingang der Leysstraat - Anvers. Ent...,,Bruxelles,Thill,ca. 1947,Graphic,Brown,Titelgegevens ontleend aan prentkaart,...,,België. Provincie Antwerpen. Prentkaarten ; Be...,dut,xx,http://resolver.libis.be/IE16350487/representa...,KU Leuven Libraries BCOL BRES GP106321,"[{'entity_group': 'LOC', 'score': 0.9977471, '...","[{'place_id': 103115332, 'licence': 'Data © Op...",51.2183319,4.4133781
35647,9992688285001488,Antwerpen. Gebouwen. Algemeen. Den Botaniek,Anvers. Rue Botanique [01],,Bruxelles,Nels,[date de publication inconnue],Graphic,Black-and-white,Titelgegevens ontleend aan prentkaart,...,,België. Provincie Antwerpen. Prentkaarten ; Be...,fre,xx,http://resolver.libis.be/IE16350424/representa...,KU Leuven Libraries BCOL BRES GP106314,"[{'entity_group': 'LOC', 'score': 0.9902444, '...","[{'place_id': 111932666, 'licence': 'Data © Op...",51.2144284,4.406539483589943
35648,9992704600301488,Antwerpen. Beelden en objecten. Koninklijk Mus...,"Musée Royal d'Anvers. Le Sauveur mort, pleuré ...",,Anvers,Hermans,[date de publication inconnue],Graphic,Brown,Titelgegevens ontleend aan prentkaart,...,"Rubens, Peter Paul 1577-1640 artist",België. Provincie Antwerpen. Prentkaarten ; Be...,fre,xx,http://resolver.libis.be/IE16588609/representa...,KU Leuven Libraries BCOL BRES GP101367,"[{'entity_group': 'LOC', 'score': 0.98758906, ...","[{'place_id': 112710147, 'licence': 'Data © Op...",51.2084689,4.394877186720867


In [11]:
df[['MMS ID', 'Uniform title', 'uniform_ner', 'openstreetmap_data', 'Lat', 'Lng']].to_csv('./data/processed/geolocation.csv', encoding='utf-8', index=False)