# Extracting candidates from the coordinates in the _Encyclopédie_

A short program showing how to retrieve from Wikidata the places close to a point given by its geographic coordinates in the _Encyclopédie_.

Author: Pierre Nugues

In [1]:
import json
from tqdm import tqdm
import matplotlib.pyplot as plt
from collections import Counter
import numpy as np
import statistics
import regex as re
import plotly.graph_objects as go
import plotly.express as px
import bs4
import requests
from geopy import distance

## The dataset and the extraction of coordinates from the text
We use data for which the geographic coordinates were extracted from Wikidata beforehand.

In [2]:
with open('diderot_1751_wd_extraction.json', 'r', encoding='utf-8') as f:
    diderot_wd = json.load(f)

An example:

In [3]:
diderot_wd[7675]

{'vedette': 'MOLINA',
 'entreeid': 'v10-1676-0',
 'texte': 'MOLINA, (Géog.)\u200b ville d’Espagne, dans la nouvelle Castille, sur le Gallo, à 3 lieues des frontieres de l’Arragon, près de Caracena. Cette ville est dans un pays de pâturage, où l’on nourrit des brebis qui portent une laine précieuse. Elle est située à 10 lieues S. E. de Siguenza, 28 N. E. de Madrid. Long. 15. 55. lat. 40. 50. (D. J.)\u200b',
 'qid': ['Q919050'],
 'wd_statements': [{'desc_en': ['municipality of Spain'],
   'desc_fr': ['commune espagnole'],
   'libellé_en': ['Molina de Aragón'],
   'libellé_fr': ["Molina d'Aragon"],
   'type': ['Q2074737'],
   'geo': ['Point(-1.888611111 40.843888888)']}]}

The regular expressions

In [4]:
lat_long = r'(latitude|latit\.?|lat\.?|longitude|longit\.?|long\.?|lon\.?)'
lat_long

'(latitude|latit\\.?|lat\\.?|longitude|longit\\.?|long\\.?|lon\\.?)'

In [5]:
coord = r'((?:\p{N}+\.\s*)+)' # Capture only the longest run of numbers

In [6]:
expr = (lat_long) + r'\s*' + coord
expr

'(latitude|latit\\.?|lat\\.?|longitude|longit\\.?|long\\.?|lon\\.?)\\s*((?:\\p{N}+\\.\\s*)+)'
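To see what the pattern captures, here is a minimal sketch applied to the end of the MOLINA entry. It assumes the standard-library `re` module, with `\d` standing in for `\p{N}` (which requires the third-party `regex` module); the names `motif`, `extrait` and `paires` are illustrative.

```python
import re

# Same alternation as lat_long; \d replaces \p{N} so that the stdlib
# re module is enough (assumption made for this sketch only).
lat_long = r'(latitude|latit\.?|lat\.?|longitude|longit\.?|long\.?|lon\.?)'
coord = r'((?:\d+\.\s*)+)'
motif = lat_long + r'\s*' + coord

extrait = '28 N. E. de Madrid. Long. 15. 55. lat. 40. 50. (D. J.)'
paires = re.findall(motif, extrait.lower())
print(paires)
# [('long.', '15. 55. '), ('lat.', '40. 50. ')]
```

Each pair holds the keyword that was matched and the raw run of numbers that follows it; the numbers still have to be converted to decimal degrees.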

Extraction and normalization with respect to the Ferro (île de Fer) meridian

In [7]:
long_ile_de_fer = 17 + 39 / 60 + 46 / 3600
long_ile_de_fer

17.662777777777777
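As a worked example of this normalization, the sketch below converts the MOLINA longitude, given as 15° 55′ from the Ferro meridian, into a Greenwich longitude. The helper `dms_vers_decimal` is a hypothetical name introduced for illustration and is not part of the notebook.

```python
def dms_vers_decimal(degres: int, minutes: int = 0, secondes: int = 0) -> float:
    """Convert degrees, minutes, seconds to decimal degrees (hypothetical helper)."""
    return degres + minutes / 60 + secondes / 3600

# Offset used in the notebook for the Ferro meridian: 17° 39' 46"
long_ile_de_fer = dms_vers_decimal(17, 39, 46)

# MOLINA: "Long. 15. 55." is counted from the Ferro meridian
long_ferro = dms_vers_decimal(15, 55)
long_greenwich = long_ferro - long_ile_de_fer
print(round(long_greenwich, 4))  # -1.7461
```

The negative value places the town west of Greenwich, which agrees with the sign of the Wikidata longitude for Molina de Aragón.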

In [8]:
def convertit_coord(coords: list[tuple], long_ile_de_fer=17.662777777777777):
    """
    Convert a list of two elements such as
    [('long.', '26. 57. '), ('lat.', '47. 55.')]
    into a (latitude, longitude) pair with Greenwich as the reference meridian:
    (47.916666666666664, 9.287222222222223)
    """
    latitude = 0
    longitude = 0
    for coord in coords:
        coord_dec = 0
        chiffres = coord[1].strip().split()
        nouv_chiffres = []
        for chiffre in chiffres:
            chiffre = chiffre.strip()
            if chiffre[-1] == '.':
                chiffre = chiffre[:-1]
            if chiffre.isdigit():
                nouv_chiffres += [int(chiffre)]
            else:
                # Several numbers glued together without spaces, e.g. '15.55'
                nouv_chiffres += map(lambda x: int(x.strip()), chiffre.split('.'))

        for i, chiffre in enumerate(nouv_chiffres):
            coord_dec += chiffre/(60**i)
        if re.match('lon', coord[0].lower()):
            longitude = coord_dec - long_ile_de_fer
        elif re.match('lat', coord[0].lower()):
            latitude = coord_dec
        else:
            return False
    return (latitude, longitude)

In [9]:
def extrait_coord_txt(expr, texte):
    m = re.findall(expr, texte.strip().lower())
    if m and len(m) == 2:
        return m
    else:
        return False

In [10]:
def extrait_coord_format_wd(article):
    if 'wd_statements' in article and 'geo' in article['wd_statements'][0]:
        # WKT literal: 'Point(longitude latitude)'
        coords_wd = article['wd_statements'][0]['geo'][0]
        [long_wd, lat_wd] = re.findall(r'-?\p{N}+(?:\.\p{N}+)?', coords_wd)
        return (float(lat_wd), float(long_wd))
    else:
        return False
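Note that a WKT point lists the longitude first, whereas `geopy` expects (latitude, longitude) pairs. A minimal sketch of this parsing step, assuming the standard-library `re` module and a hypothetical helper name `wkt_vers_lat_long`:

```python
import re

def wkt_vers_lat_long(wkt: str):
    """Parse 'Point(long lat)' into (lat, long); return None if the literal does not match."""
    m = re.fullmatch(r'Point\((-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?)\)', wkt.strip())
    if m is None:
        return None
    long_wd, lat_wd = map(float, m.groups())
    return (lat_wd, long_wd)

resultat = wkt_vers_lat_long('Point(-1.888611111 40.843888888)')
print(resultat)
# (40.843888888, -1.888611111)
```

Anchoring the pattern on the whole literal makes a malformed value return `None` rather than raise an unpacking error.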

Computing the distance between the Wikidata coordinates and the coordinates from the _Encyclopédie_

In [11]:
cnt = 0
for i, article in enumerate(diderot_wd[7675:7676]):
    #if 'qid' in article:
    #    continue
    coord_txt = extrait_coord_txt(expr, article['texte'])
    if coord_txt:
        coord_encyclo = convertit_coord(coord_txt)
    else:
        continue
    print('Article :', article)
    print('Coordonnées extraites :', coord_txt)
    print('Coordonnées normalisées :', coord_encyclo)
    coords_wd = extrait_coord_format_wd(article)
    print('Coordonnées wikidata :', coords_wd)
    if coord_encyclo and coords_wd:
        dist = distance.geodesic(coords_wd, coord_encyclo).km
        cnt += 1
        print('Distance :', dist)
cnt

Article : {'vedette': 'MOLINA', 'entreeid': 'v10-1676-0', 'texte': 'MOLINA, (Géog.)\u200b ville d’Espagne, dans la nouvelle Castille, sur le Gallo, à 3 lieues des frontieres de l’Arragon, près de Caracena. Cette ville est dans un pays de pâturage, où l’on nourrit des brebis qui portent une laine précieuse. Elle est située à 10 lieues S. E. de Siguenza, 28 N. E. de Madrid. Long. 15. 55. lat. 40. 50. (D. J.)\u200b', 'qid': ['Q919050'], 'wd_statements': [{'desc_en': ['municipality of Spain'], 'desc_fr': ['commune espagnole'], 'libellé_en': ['Molina de Aragón'], 'libellé_fr': ["Molina d'Aragon"], 'type': ['Q2074737'], 'geo': ['Point(-1.888611111 40.843888888)']}]}
Coordonnées extraites : [('long.', '15. 55. '), ('lat.', '40. 50. ')]
Coordonnées normalisées : (40.833333333333336, -1.7461111111111105)
Coordonnées wikidata : (40.843888888, -1.888611111)
Distance : 12.075488278906363


1
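As a rough cross-check of this distance that needs only the standard library, one can use the haversine formula. This is a sketch under a spherical-Earth assumption (mean radius 6371.0088 km), so it should differ slightly from the ellipsoidal result of `geopy`; `distance_haversine` is a hypothetical helper, not a function of the notebook.

```python
from math import asin, cos, radians, sin, sqrt

def distance_haversine(p1, p2, rayon_terre_km=6371.0088):
    """Great-circle distance in km between two (lat, long) pairs given in degrees."""
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * rayon_terre_km * asin(sqrt(a))

coords_wd = (40.843888888, -1.888611111)
coord_encyclo = (40.833333333333336, -1.7461111111111105)
d = distance_haversine(coords_wd, coord_encyclo)
print(round(d, 2))
```

The value should land close to the 12.08 km reported above; most of the gap between the two points comes from the longitude, which is where the choice of reference meridian matters.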

## Building the extraction queries
We build a query and run it. The output shows that the retrieved data needs cleaning.

In [12]:
url = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'

In [13]:
prefixes = '''PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>'''

In [14]:
headers = {
    'User-Agent': 'NLP-project/1.0 (pierre.nugues@cs.lth.se)'
}

In [15]:
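def applique_requete(qt, prefixes, headers):
    try:
        # The headers go in the headers argument, not in the query parameters
        data = requests.get(
            url,
            params={'query': prefixes + qt, 'format': 'json'},
            headers=headers).json()
        data = data['results']['bindings']
    except (requests.RequestException, ValueError, KeyError):
        data = ['échec']
    return data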

The query, adapted from Stack Overflow:
`https://stackoverflow.com/questions/49302399/how-to-get-cities-around-a-location-in-wikidata`

In [16]:
def cree_req_coord(coord):
    query_dist = """SELECT DISTINCT ?distance ?qid ?qidLabel ?coords WHERE {{

   # Use the around service
   SERVICE wikibase:around {{ 
     # Looking for items with coordinate locations(P625)
     ?qid wdt:P625 ?coords . 
     
     # That are in a circle with a centre of with a point
     bd:serviceParam wikibase:center "{0}"^^geo:wktLiteral   . 
     # Where the circle has a radius of 20km
     bd:serviceParam wikibase:radius "20" . 
     bd:serviceParam wikibase:distance ?distance .
   }} .

   ?qid wdt:P31/wdt:P279* wd:Q486972.

   # Use the label service to get the English label
   SERVICE wikibase:label {{
   bd:serviceParam wikibase:language "en" . 
   }}
}}
ORDER BY ?distance""".format(coord)
    return query_dist
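The 20 km radius is hard-coded in the query above. A possible variant, sketched here with a hypothetical function name `cree_req_coord_rayon` and a hypothetical parameter `rayon_km`, exposes the radius as an argument, which may help when the coordinates from the _Encyclopédie_ are too imprecise for a narrow search:

```python
def cree_req_coord_rayon(coord: str, rayon_km: float = 20) -> str:
    """Hypothetical variant of cree_req_coord with a configurable radius in km."""
    # Doubled braces are literal braces for str.format
    modele = """SELECT DISTINCT ?distance ?qid ?qidLabel ?coords WHERE {{
   SERVICE wikibase:around {{
     ?qid wdt:P625 ?coords .
     bd:serviceParam wikibase:center "{centre}"^^geo:wktLiteral .
     bd:serviceParam wikibase:radius "{rayon}" .
     bd:serviceParam wikibase:distance ?distance .
   }} .
   ?qid wdt:P31/wdt:P279* wd:Q486972 .
   SERVICE wikibase:label {{
     bd:serviceParam wikibase:language "en" .
   }}
}}
ORDER BY ?distance"""
    return modele.format(centre=coord, rayon=rayon_km)

req_large = cree_req_coord_rayon('Point(3.43,50.45)', rayon_km=35)
print(req_large)
```

A larger radius returns more candidates and makes the query slower, so the default stays at the 20 km used in the notebook.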

In [17]:
req = cree_req_coord('Point(-6.50,39.1)')
req

'SELECT DISTINCT ?distance ?qid ?qidLabel ?coords WHERE {\n\n   # Use the around service\n   SERVICE wikibase:around { \n     # Looking for items with coordinate locations(P625)\n     ?qid wdt:P625 ?coords . \n     \n     # That are in a circle with a centre of with a point\n     bd:serviceParam wikibase:center "Point(-6.50,39.1)"^^geo:wktLiteral   . \n     # Where the circle has a radius of 20km\n     bd:serviceParam wikibase:radius "20" . \n     bd:serviceParam wikibase:distance ?distance .\n   } .\n\n   ?qid wdt:P31/wdt:P279* wd:Q486972.\n\n   # Use the label service to get the English label\n   SERVICE wikibase:label {\n   bd:serviceParam wikibase:language "en" . \n   }\n}\nORDER BY ?distance'

In [18]:
req = cree_req_coord('Point(3.43,50.45)')
req

'SELECT DISTINCT ?distance ?qid ?qidLabel ?coords WHERE {\n\n   # Use the around service\n   SERVICE wikibase:around { \n     # Looking for items with coordinate locations(P625)\n     ?qid wdt:P625 ?coords . \n     \n     # That are in a circle with a centre of with a point\n     bd:serviceParam wikibase:center "Point(3.43,50.45)"^^geo:wktLiteral   . \n     # Where the circle has a radius of 20km\n     bd:serviceParam wikibase:radius "20" . \n     bd:serviceParam wikibase:distance ?distance .\n   } .\n\n   ?qid wdt:P31/wdt:P279* wd:Q486972.\n\n   # Use the label service to get the English label\n   SERVICE wikibase:label {\n   bd:serviceParam wikibase:language "en" . \n   }\n}\nORDER BY ?distance'

In [19]:
applique_requete(req, prefixes, headers)

[{'distance': {'datatype': 'http://www.w3.org/2001/XMLSchema#double',
   'type': 'literal',
   'value': '0.103'},
  'qid': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q2820650'},
  'qidLabel': {'xml:lang': 'en',
   'type': 'literal',
   'value': 'Abbaye de Notre-Dame-de-la-Paix'},
  'coords': {'datatype': 'http://www.opengis.net/ont/geosparql#wktLiteral',
   'type': 'literal',
   'value': 'Point(3.42861111 50.44972222)'}},
 {'distance': {'datatype': 'http://www.w3.org/2001/XMLSchema#double',
   'type': 'literal',
   'value': '0.103'},
  'qid': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q334131'},
  'qidLabel': {'xml:lang': 'en',
   'type': 'literal',
   'value': 'Saint-Amand Abbey'},
  'coords': {'datatype': 'http://www.opengis.net/ont/geosparql#wktLiteral',
   'type': 'literal',
   'value': 'Point(3.42861111 50.44972222)'}},
 {'distance': {'datatype': 'http://www.w3.org/2001/XMLSchema#double',
   'type': 'literal',
   'value': '0.208'},
  'qid': {'type': 'ur

## Removing the verbose headers
The bindings returned by the server are verbose. We clean them up.

In [21]:
def condense_dict(dico: dict) -> dict:
    # Keep only the 'value' field of each SPARQL binding
    nouv_dict = dict()
    for key, val in dico.items():
        if isinstance(val, dict) and 'value' in val:
            nouv_dict[key] = val['value']
        else:
            nouv_dict[key] = val
    return nouv_dict

In [22]:
def deduplique(dicos: list[dict]) -> dict:
    nouv_dict = dict()
    for dico in dicos:
        for key in dico.keys():
            if key in nouv_dict:
                nouv_dict[key] += [dico[key]]
            else:
                nouv_dict[key] = [dico[key]]
    for key in nouv_dict.keys():
        nouv_dict[key] = set(nouv_dict[key])
    for key in nouv_dict.keys():
        vals = []
        for val in nouv_dict[key]:
            if val.startswith('http://www.wikidata.org/entity/'):
                vals += [val[len('http://www.wikidata.org/entity/'):]]
            else:
                vals += [val]
        nouv_dict[key] = vals
    return nouv_dict

In [23]:
def creer_nuplet(dicts_cond: list[dict]):
    prefixe = 'http://www.wikidata.org/entity/'
    nuplets = []
    for dict_cond in dicts_cond:
        # Keep only the QID, dropping the entity URI prefix if present
        qid = dict_cond['qid']
        if qid.startswith(prefixe):
            qid = qid[len(prefixe):]
        nuplet = (dict_cond['distance'],
                  qid,
                  dict_cond['qidLabel'],
                  dict_cond['coords'])
        nuplets += [nuplet]
    return nuplets

In [24]:
def fusionne_dicts_wd(dicts_wd: list) -> list[tuple]:
    print(dicts_wd)
    dicts_cond = [condense_dict(dict_wd)
                  for dict_wd in dicts_wd]
    nouv_dict = creer_nuplet(dicts_cond)
    return nouv_dict
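To illustrate what the cleaning functions above do, this self-contained sketch flattens one binding, copied from the server response shown earlier, into a tuple. `aplatit_binding` is a hypothetical helper that combines the steps of `condense_dict` and `creer_nuplet` for a single result.

```python
PREFIXE_ENTITE = 'http://www.wikidata.org/entity/'

def aplatit_binding(binding: dict) -> tuple:
    """Keep the 'value' of each field and strip the entity URI prefix from the QID."""
    valeurs = {cle: champ['value'] for cle, champ in binding.items()}
    qid = valeurs['qid']
    if qid.startswith(PREFIXE_ENTITE):
        qid = qid[len(PREFIXE_ENTITE):]
    return (valeurs['distance'], qid, valeurs['qidLabel'], valeurs['coords'])

binding = {
    'distance': {'datatype': 'http://www.w3.org/2001/XMLSchema#double',
                 'type': 'literal', 'value': '0.103'},
    'qid': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q2820650'},
    'qidLabel': {'xml:lang': 'en', 'type': 'literal',
                 'value': 'Abbaye de Notre-Dame-de-la-Paix'},
    'coords': {'datatype': 'http://www.opengis.net/ont/geosparql#wktLiteral',
               'type': 'literal', 'value': 'Point(3.42861111 50.44972222)'},
}
nuplet = aplatit_binding(binding)
print(nuplet)
# ('0.103', 'Q2820650', 'Abbaye de Notre-Dame-de-la-Paix', 'Point(3.42861111 50.44972222)')
```

The distance stays a string, as in the notebook; it would need a `float()` conversion before any numeric comparison.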

## Running the query
We now extract the candidates with the query. We take _Molina_ as an example and limit the candidates to a maximum radius, here 20 km, hard-coded in the SPARQL query.

In [25]:
def extrait_serveur_wd(qt: str):
    # The headers go in the headers argument, not in the query parameters
    data = requests.get(url,
                        params={'query': prefixes + qt, 'format': 'json'},
                        headers=headers).json()
    #print(data)
    cond_data = fusionne_dicts_wd(data['results']['bindings'])
    return cond_data

In [26]:
req = cree_req_coord('Point(-1.7461111111111105,40.833333333333336)')
req

'SELECT DISTINCT ?distance ?qid ?qidLabel ?coords WHERE {\n\n   # Use the around service\n   SERVICE wikibase:around { \n     # Looking for items with coordinate locations(P625)\n     ?qid wdt:P625 ?coords . \n     \n     # That are in a circle with a centre of with a point\n     bd:serviceParam wikibase:center "Point(-1.7461111111111105,40.833333333333336)"^^geo:wktLiteral   . \n     # Where the circle has a radius of 20km\n     bd:serviceParam wikibase:radius "20" . \n     bd:serviceParam wikibase:distance ?distance .\n   } .\n\n   ?qid wdt:P31/wdt:P279* wd:Q486972.\n\n   # Use the label service to get the English label\n   SERVICE wikibase:label {\n   bd:serviceParam wikibase:language "en" . \n   }\n}\nORDER BY ?distance'

In [27]:
candidats = extrait_serveur_wd(req)

[{'distance': {'datatype': 'http://www.w3.org/2001/XMLSchema#double', 'type': 'literal', 'value': '1.959'}, 'qid': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q1656043'}, 'qidLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Castellar de la Muela'}, 'coords': {'datatype': 'http://www.opengis.net/ont/geosparql#wktLiteral', 'type': 'literal', 'value': 'Point(-1.759444444 40.818888888)'}}, {'distance': {'datatype': 'http://www.w3.org/2001/XMLSchema#double', 'type': 'literal', 'value': '4.284'}, 'qid': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q6149117'}, 'qidLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Tordelpalo'}, 'coords': {'datatype': 'http://www.opengis.net/ont/geosparql#wktLiteral', 'type': 'literal', 'value': 'Point(-1.7963 40.826816666)'}}, {'distance': {'datatype': 'http://www.w3.org/2001/XMLSchema#double', 'type': 'literal', 'value': '4.465'}, 'qid': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q118092717'}, 'qidLabel': {'

In [28]:
for i, candidat in enumerate(candidats):
    if 'Molina' in candidat[2]:
        print('->', i, candidat)
    else:
        print(i, candidat)

0 ('1.959', 'Q1656043', 'Castellar de la Muela', 'Point(-1.759444444 40.818888888)')
1 ('4.284', 'Q6149117', 'Tordelpalo', 'Point(-1.7963 40.826816666)')
2 ('4.465', 'Q118092717', 'Q118092717', 'Point(-1.7857 40.8066)')
3 ('5.07', 'Q5765761', 'Chera, Guadalajara', 'Point(-1.771388888 40.791944444)')
4 ('5.402', 'Q54801224', 'Anchuela del Pedregal', 'Point(-1.808972222 40.843280555)')
5 ('6.169', 'Q1654736', 'Morenilla', 'Point(-1.7076042 40.7861279)')
6 ('6.225', 'Q534562', 'Hombrados', 'Point(-1.6853606 40.8013884)')
7 ('6.266', 'Q30317286', 'Cubillejo de la Sierra', 'Point(-1.774472222 40.885444444)')
8 ('6.677', 'Q24011113', 'Prados Redondos', 'Point(-1.79329 40.78505)')
9 ('6.72', 'Q1656490', 'Prados Redondos', 'Point(-1.7935166 40.7847021)')
10 ('7.603', 'Q24011243', 'Campillo de Dueñas', 'Point(-1.68505 40.88376)')
11 ('7.786', 'Q1655088', 'Campillo de Dueñas', 'Point(-1.683611111 40.885)')
12 ('8.018', 'Q21003995', 'Cubillejo del Sitio', 'Point(-1.803694444 40.890805555)')
13 ('