In [28]:
import pandas as pd
import transformers
import requests
import time
import json

from tqdm import tqdm

tqdm.pandas()

## Named entity tagging

Load the wikineural named entity tagging model, which supports 9 languages, including Dutch and French.

In [3]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

Build the actual pipeline; `device=0` runs it on the first GPU.

In [8]:
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first", device=0)

Example: *Diksmuide* is indeed tagged as a location, whereas *Wereldoorlog I* is tagged as miscellaneous.

In [9]:
example = "Diksmuide. Wereldoorlog I (1914-1918)"

ner_results = ner(example)
print(ner_results)

[{'entity_group': 'LOC', 'score': 0.9867822, 'word': 'Diksmuide', 'start': 0, 'end': 9}, {'entity_group': 'MISC', 'score': 0.9712352, 'word': 'Wereldoorlog I', 'start': 11, 'end': 25}]


Let's apply this to our dataset.

In [6]:
df = pd.read_csv('20230301_Postcards.csv')

One title is missing from the dataset; we fill it in so that the pipeline doesn't fail on an empty value.

In [7]:
df.loc[35447, 'Uniform title'] = 'Antwerpen. Beelden en objecten. Koninklijk Museum voor Schone Kunsten'

Apply the pipeline to *Uniform title* and *Main title*. For now we only use *Uniform title* for the geo lookup (*Main title* often contains more precise data, but is also messier).

Note: batching the titles together in a dataset would make this run more efficiently, but the run is short enough that we won't bother.
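The batching mentioned in the note could look roughly like the sketch below. It is only an illustration: `tag_in_batches` is a hypothetical helper, not part of the notebook, and it assumes the tagger accepts a list of strings and returns one result per input string, which is how Hugging Face pipelines generally behave when called on a list.

```python
from typing import Callable, List, Sequence


def tag_in_batches(texts: Sequence[str],
                   tagger: Callable[[List[str]], list],
                   batch_size: int = 64) -> list:
    """Call `tagger` on consecutive slices of `texts` and concatenate the results.

    `tagger` is assumed to return one result per input text, in order.
    """
    results = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        results.extend(tagger(chunk))
    return results


# Possible use with the pipeline from this notebook (assumption, not run here):
# df['uniform_ner'] = tag_in_batches(
#     df['Uniform title'].tolist(),
#     lambda chunk: ner(chunk, batch_size=64),
# )
```

Whether this is actually faster depends on the batch size and the GPU, so it is worth timing on a small sample first.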

In [13]:
df['uniform_ner'] = df['Uniform title'].progress_apply(ner)

100%|██████████| 35650/35650 [04:05<00:00, 145.00it/s]


In [18]:
df['uniform_ner'][1]

[{'entity_group': 'LOC',
  'score': 0.9939957,
  'word': 'Belœil',
  'start': 0,
  'end': 6},
 {'entity_group': 'LOC',
  'score': 0.99763954,
  'word': 'Gebouwen',
  'start': 8,
  'end': 16},
 {'entity_group': 'LOC',
  'score': 0.9932414,
  'word': 'Kastelen',
  'start': 18,
  'end': 26},
 {'entity_group': 'LOC',
  'score': 0.7462094,
  'word': 'Park',
  'start': 28,
  'end': 32}]

In [19]:
df['main_ner'] = df['Main title'].progress_apply(ner)

100%|██████████| 35650/35650 [04:26<00:00, 133.58it/s]


## Get OpenStreetMap data

The function below fetches OpenStreetMap data for the named entities tagged as locations. A number of generic location names are filtered out, and *Belgium* is appended to the final list of named entities. The request is sent to OpenStreetMap's Nominatim API. We pause for one second after each request (rate limiting as per the API rules).

Note: we should probably cache results and reuse them for identical queries.

In [30]:
def get_openstreetmap_data(ner_data):
    # Generic terms that the tagger labels as LOC but that are not place names
    stop_elements = ['Gebouwen', 'Kastelen', 'Molens', 'Kapellen', 'Panorama',
                     'Boten', 'Ramp', 'Vertrekken', 'Natuur']

    # Keep only location entities that are not generic terms
    locations = [ne['word'] for ne in ner_data
                 if ne['entity_group'] == 'LOC' and ne['word'] not in stop_elements]
    # Append the country name to steer the search towards Belgium
    locations.append('Belgium')
    payload = {'q': ' '.join(locations), 'format': 'json'}
    r = requests.get('https://nominatim.openstreetmap.org/search', params=payload)
    openstreetmap_data = r.json()
    # Pause for one second to respect the API's rate limit
    time.sleep(1)
    return openstreetmap_data
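The caching mentioned in the note above could be sketched as follows. `make_cached_lookup` and `fetch` are hypothetical names introduced for illustration; the idea is simply to key results on the query string so that identical queries hit the API only once.

```python
from typing import Callable, Dict


def make_cached_lookup(fetch: Callable[[str], list]) -> Callable[[str], list]:
    """Wrap `fetch` so that each distinct query string is fetched only once."""
    cache: Dict[str, list] = {}

    def lookup(query: str) -> list:
        # Only call the underlying fetch function for queries not seen before
        if query not in cache:
            cache[query] = fetch(query)
        return cache[query]

    return lookup
```

To use this with the function above, the request and the `time.sleep(1)` call would have to move into the `fetch` function, so that the pause only happens when a request is actually sent. Many uniform titles presumably share the same place names, so this could cut the number of requests considerably.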

Example result for the NE tags of row 1 (the *Belœil* title shown earlier):

In [40]:
get_openstreetmap_data(df['uniform_ner'][1])

[{'place_id': 144582695,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  'osm_type': 'way',
  'osm_id': 165635705,
  'boundingbox': ['50.5454353', '50.553389', '3.7193868', '3.7327803'],
  'lat': '50.5494354',
  'lon': '3.7264512060049375',
  'display_name': 'Parc & Château de Belœil, Belœil, Ath, Hainaut, Wallonie, 7970, België / Belgique / Belgien',
  'class': 'leisure',
  'type': 'park',
  'importance': 0.26}]

This queries Nominatim for the entire dataset. With a one-second pause per request, the 35,650 rows take at least about ten hours. Note: we should probably save intermediate results properly.

In [None]:
df['openstreetmap_data'] = df['uniform_ner'].progress_apply(get_openstreetmap_data)
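One way to save intermediate results, as suggested in the note above, is sketched below. `apply_with_checkpoints` and the checkpoint file path are hypothetical; the sketch assumes the inputs are in an indexable list and that every result is JSON-serializable, which holds for the parsed Nominatim responses.

```python
import json
import os


def _save(results, path):
    # Write to a temporary file first so a crash mid-write cannot corrupt the checkpoint
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(results, f)
    os.replace(tmp_path, path)


def apply_with_checkpoints(items, func, path, every=100):
    """Apply `func` to each item, saving all results to `path` every `every` items.

    If `path` already exists, the results stored there are reused, so an
    interrupted run resumes at the first item that has no result yet.
    """
    results = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            results = json.load(f)
    for index in range(len(results), len(items)):
        results.append(func(items[index]))
        if len(results) % every == 0:
            _save(results, path)
    _save(results, path)
    return results


# Possible use in this notebook (assumption, not run here):
# df['openstreetmap_data'] = apply_with_checkpoints(
#     df['uniform_ner'].tolist(), get_openstreetmap_data, 'osm_checkpoint.json')
```

Rewriting the whole file at every checkpoint is simple but gets slower as the results grow; for a run of this size, a larger `every` value or an append-only format may be preferable.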