In [1]:
import json
import time

import pandas as pd
import requests
import transformers

from tqdm import tqdm

tqdm.pandas()

## Named entity tagging

Load the wikineural named entity tagging model, which works for 9 languages, including Dutch and French.

In [2]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

Build the actual pipeline; `device=0` runs it on the first GPU.

In [4]:
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first", device=0)

AssertionError: Torch not compiled with CUDA enabled
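
The `AssertionError` above means the installed PyTorch build has no CUDA support, so `device=0` cannot be used on this machine. A minimal sketch of a fallback, assuming the usual `transformers` convention that `device=-1` selects the CPU; `pick_device` is a hypothetical helper, not part of the original notebook:

```python
def pick_device():
    """Return 0 (first GPU) if CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to ask, so fall back to CPU
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()

# In the notebook, the result would be passed to the pipeline call, e.g.:
# ner = pipeline("ner", model=model, tokenizer=tokenizer,
#                aggregation_strategy="first", device=device)
```

On a CPU-only install this will be slower than on a GPU, but the cells below can then run.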

Example: *Diksmuide* is indeed tagged as a location, whereas *Wereldoorlog I* is tagged as miscellaneous.

In [None]:
example = "Diksmuide. Wereldoorlog I (1914-1918)"

ner_results = ner(example)
print(ner_results)

Let's apply this to our dataset

In [None]:
df = pd.read_csv('20230301_Postcards.csv')

One title is actually missing from the dataset; we'll just add it in, to make sure the model doesn't break

In [None]:
df.loc[35447, 'Uniform title'] = 'Antwerpen. Beelden en objecten. Koninklijk Museum voor Schone Kunsten'

Apply the tagger to *Uniform title* and *Main title*, though we'll just use *Uniform title* for the geo lookup for now (*Main title* often contains more precise data, but it is also messier).

Note: if we batched the titles together, this could run more efficiently (but it doesn't take that long in the end, so we won't bother).
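
For reference, the batching idea could look roughly like the sketch below. `chunks`, `tag_in_batches` and `fake_tagger` are hypothetical names introduced here; the stand-in tagger only exists so the example runs without the model. It assumes the tagger accepts a list of strings and returns one result per string, which is how Hugging Face pipelines are generally called with list input.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def tag_in_batches(tagger: Callable[[List[str]], list],
                   texts: Sequence[str], size: int = 32) -> list:
    """Call `tagger` once per chunk and concatenate the per-text results."""
    results = []
    for chunk in chunks(texts, size):
        results.extend(tagger(list(chunk)))
    return results


# Stand-in tagger for illustration only: keeps capitalised words.
def fake_tagger(batch):
    return [[w for w in text.split() if w[:1].isupper()] for text in batch]


titles = ["Diksmuide. Wereldoorlog I", "Antwerpen. Beelden en objecten", "Gent"]
print(tag_in_batches(fake_tagger, titles, size=2))
```

In the notebook this would be called as `tag_in_batches(ner, df['Uniform title'].tolist())`. Whether it is actually faster depends on the hardware and batch size, so it is worth timing on a sample first.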

In [None]:
df['uniform_ner'] = df['Uniform title'].progress_apply(ner)

In [None]:
df['uniform_ner'][1]

In [None]:
df['main_ner'] = df['Main title'].progress_apply(ner)

## Get OpenStreetMap data

The function below fetches OpenStreetMap data for the named entities tagged as locations. Note: a number of generic location names are filtered out, and *Belgium* is appended to the final list of named entities. The request is sent to OpenStreetMap's Nominatim API. We pause for one second after each request (rate limiting, as per the API rules).

Note: we should probably cache results and reuse them for identical queries.

In [None]:
def get_openstreetmap_data(ner_data):
    """Query Nominatim with the location entities found in one title."""
    # Generic location names to filter out
    stop_elements = ['Gebouwen', 'Kastelen', 'Molens', 'Kapellen', 'Panorama',
                     'Boten', 'Ramp', 'Vertrekken', 'Natuur']

    # Keep only entities tagged as LOC that are not generic names
    locations = [ne['word'] for ne in ner_data
                 if ne['entity_group'] == 'LOC' and ne['word'] not in stop_elements]
    # Append 'Belgium' to narrow the query
    locations.append('Belgium')
    payload = {'q': ' '.join(locations), 'format': 'json'}
    r = requests.get('https://nominatim.openstreetmap.org/search', params=payload)
    r.raise_for_status()  # fail on HTTP errors instead of parsing an error page
    openstreetmap_data = r.json()
    # Pause for one second to respect the API's rate limit
    time.sleep(1)
    return openstreetmap_data
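
The caching mentioned in the note above could be sketched as follows. This is only an illustration: `make_cached_lookup` and `fake_fetch` are hypothetical names, and `fetch` stands for whatever function sends one query string to Nominatim and returns the parsed JSON. The pause is applied only after a real request, since cache hits do not touch the API.

```python
import time


def make_cached_lookup(fetch, delay=1.0):
    """Wrap `fetch(query)` so that each distinct query is requested only once."""
    cache = {}

    def lookup(query):
        if query not in cache:
            cache[query] = fetch(query)
            time.sleep(delay)  # pause only after a real request
        return cache[query]

    return lookup


# Stand-in for the real Nominatim request, for illustration only
requested = []


def fake_fetch(query):
    requested.append(query)
    return [{"display_name": query}]


lookup = make_cached_lookup(fake_fetch, delay=0)
lookup("Diksmuide Belgium")
lookup("Diksmuide Belgium")  # served from the cache, no second request
print(requested)
```

Since many postcards share the same place names, a cache keyed on the query string would likely cut the number of requests considerably.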

Example result returned for the NE tags of the row at index 1

In [None]:
get_openstreetmap_data(df['uniform_ner'][1])

This queries results for the entire dataset and will take a while to run. Note: we should probably save intermediate results properly.

In [None]:
df['openstreetmap_data'] = df['uniform_ner'].progress_apply(get_openstreetmap_data)
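
Saving intermediate results, as noted above, could be approached with a small checkpointing helper. The sketch below is an assumption-laden illustration, not the notebook's method: `run_with_checkpoints`, `save_json_atomically` and the checkpoint path are hypothetical, results are keyed by position (so the input order must be the same on every run), and the function's return values must be JSON-serialisable.

```python
import json
import os
import tempfile


def save_json_atomically(obj, path):
    """Write `obj` to a temporary file first, then move it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(obj, fh)
    os.replace(tmp_path, path)


def run_with_checkpoints(items, fn, path, every=100):
    """Apply `fn` to each item, persisting results so a rerun can resume."""
    items = list(items)
    results = {}
    if os.path.exists(path):
        # Resume from an earlier, possibly interrupted, run
        with open(path, encoding="utf-8") as fh:
            results = json.load(fh)
    for i, item in enumerate(items):
        key = str(i)  # JSON object keys are always strings
        if key in results:
            continue
        results[key] = fn(item)
        if len(results) % every == 0:
            save_json_atomically(results, path)
    save_json_atomically(results, path)
    return [results[str(i)] for i in range(len(items))]
```

In the notebook this might be used as `df['openstreetmap_data'] = run_with_checkpoints(df['uniform_ner'], get_openstreetmap_data, 'osm_checkpoint.json')`, where the file name is a placeholder.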