# GEOPARSING HISTORICAL DOCUMENTS

This notebook is proposed by [L. Moncla](https://ludovicmoncla.github.io/) and [K. McDonough](https://www.turing.ac.uk/people/researchers/katherine-mcdonough) as part of the [Sunoikisis Digital Classics](https://github.com/SunoikisisDC/SunoikisisDC-2021-2022/wiki/SunoikisisDC-Summer-2022-Session-9) Summer course on NLP for Historical maps (Session 9).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/ludovicmoncla/SunoikisisDC-Summer2022-Session9/blob/main/GeoparsingEncyclopedie.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ludovicmoncla/SunoikisisDC-Summer2022-Session9/master?filepath=GeoparsingEncyclopedie.ipynb)


## Overview

In this tutorial, we will:

- Load data from TEI-XML files into a Python dataframe
- Use the dataframe for simple data analysis
- Test the [Perdido](https://github.com/ludovicmoncla/perdido) library for geoparsing (geotagging + geocoding) Encyclopédie articles
- Display custom geotagging results (PERDIDO TEI-XML) with the [displaCy Named Entity Visualizer](https://spacy.io/usage/visualizers)
- Display geocoding results on a map
- Discuss the limits of geoparsing historical French texts

## Introduction

Geoparsing (also known as toponym resolution) refers to the process of extracting place names from text and assigning geographic coordinates to them.
This involves two main tasks: geotagging and geocoding.
Geotagging consists of identifying the spans of text that refer to place names, while geocoding consists of finding unambiguous geographic coordinates for them.


### The Perdido Geoparser python library

[Perdido](https://github.com/ludovicmoncla/perdido) is a Python text geoparser. It provides NLP and GIS methods for geoparsing French texts.
It was initially developed as a REST API for extracting and retrieving displacements from French hiking descriptions, as part of the [PERDIDO](http://erig.univ-pau.fr/PERDIDO/) and [ANR Choucas](http://choucas.ign.fr) projects.

More recently, as part of the [GEODE project](https://geode-project.github.io) we have developed a custom version for historical documents and more specifically for the Encyclopédie.


In this tutorial we'll see how to use the `Perdido` Python library for geoparsing French texts.
We will apply geoparsing to the version of the Encyclopédie corpus released by the [ARTFL project](https://encyclopedie.uchicago.edu/) and show the limits of geotagging and geocoding historical documents.

### Acknowledgement

Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu/), University of Chicago.


## Setting up the environment



### Install the Perdido Geoparser python library

In [None]:
!pip install --upgrade perdido

### Import the libraries

In [1]:
from perdido.geoparser import Geoparser
from perdido.datasets import load_edda_artfl

import lxml.etree as etree
from spacy import displacy


## 1. Getting started

### 1.1 Loading the Encyclopédie ARTFL dataset



In [15]:
d = load_edda_artfl()
dataset = d['data']
dataset.head()

Unnamed: 0,filename,volume,number,head,normClass,author,text
0,volume01-1063.tei,1,1063,AFRIQUE,Géographie,Diderot,"* AFRIQUE, (Géog.) l'une des quatre parties pr..."
1,volume01-1468.tei,1,1468,AISNE,Géographie,Diderot,"* AISNE, (Géog.) riviere de France, qui a sa s..."
2,volume01-1789.tei,1,1789,ALLEMAGNE,Géographie,Diderot,"* ALLEMAGNE, (Geog.) grand pays situé au milie..."
3,volume01-2568.tei,1,2568,ANCUD,Géographie moderne,Diderot,"* ANCUD, (Géog. mod.) l'Archipel d'Ancud ou de..."
4,volume01-4093.tei,1,4093,ARRACIFES,Géographie,Diderot,"* ARRACIFES, (Géog.) une des îles des Larrons,..."
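
`load_edda_artfl()` hides the parsing step. To give a rough idea of what turning one TEI-XML file into one dataframe row involves, here is a minimal sketch using only the standard library. The TEI layout (a `<head>` element and `<p>` paragraphs inside `<body>`) and the helper name `tei_to_row` are assumptions made for illustration; they are not the actual ARTFL schema or the Perdido loader.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI article (not the real ARTFL markup)
SAMPLE_TEI = """<TEI>
  <text>
    <body>
      <head>AISNE</head>
      <p>AISNE, (Géog.) riviere de France.</p>
    </body>
  </text>
</TEI>"""


def tei_to_row(xml_string, filename):
    """Turn one TEI document into a flat dict (one future dataframe row)."""
    root = ET.fromstring(xml_string)
    head = root.findtext(".//head", default="").strip()
    # Join the text of every paragraph, including text nested in child elements
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
    return {"filename": filename, "head": head, "text": " ".join(paragraphs)}


rows = [tei_to_row(SAMPLE_TEI, "sample.tei")]
print(rows[0]["head"], "->", rows[0]["text"])
```

A list of such dicts can be passed directly to `pandas.DataFrame(rows)` to obtain a dataframe.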


### 1.2 First look at the data

Now we have access to all the attributes and methods of the dataframe object. For instance, we can easily print the number of rows in our dataframe, which corresponds to the number of articles in our corpus:

In [6]:
n = dataset.shape[0]
print('There are ' + str(n) + ' articles in the input directory')

There are 35 articles in the input directory




Now that the data from the XML-TEI files are loaded into a Python dataframe, we can take a look at them.
For instance, we can select articles based on their classification in the Encyclopédie.
To get all articles classified under geography, we can do the following:

In [None]:
# create the list of classes that refer to 'Géographie'
normclassGEO = ['Géographie', 'Géographie moderne',
                 'Géographie ancienne', 'Géographie moderne | Géographie ancienne',
                 'Géographie ancienne | Géographie moderne', 'Géographie sacrée', 'Géographie sainte',
                 'Géographie | Histoire ancienne', 'Géographie historique', 'Géographie | Histoire',
                 'Histoire | Géographie', 'Géographie | Histoire naturelle', 'Géographie | Mythologie',
                 'Géographie ancienne | Mythologie', 'Histoire moderne | Géographie',
                 'Géographie ancienne | Géographie sainte', 'Géographie ancienne | Géographie sacrée',
                 'Géographie sacrée | Géographie ancienne', 'Géographie du moyen âge', 'Géographie des Arabes',
                 'Géographie | Commerce', 'Histoire | Géographie ancienne',
                 'Géographie | Histoire ancienne | Histoire moderne', 'Géographie ancienne | Littérature | Histoire',
                 'Histoire naturelle | Géographie', 'Géographie | Histoire ancienne | Mythologie',
                 'Géographie moderne | Commerce', 'Géographie ancienne | Géographie antique',
                 'Géographie moderne | Histoire', 'Géographie | Histoire monastique',
                 'Géographie ancienne | Géographie moderne | Mythologie', 'Géographie ancienne | Histoire',
                 'Géographie ancienne | Littérature | Mythologie', 'Géographie ancienne | Médailles'
                 ]

# query the dataframe for all articles matching one of the classes in our list
df_geo = dataset.loc[dataset['normClass'].isin(normclassGEO)]
df_geo.head(10)

In [None]:
print('There are ' + str(df_geo.shape[0]) + ' geography articles')

Then, we can also make a query based on the value of the data. For instance, we can query all the articles of a specific author:

In [None]:
val = 'Jaucourt'
n = df_geo.loc[df_geo['author'] == val].shape[0]
print(str(n) + ' articles were written by ' + val)

We can also easily show the number of articles per author:

In [None]:
df_geo.groupby(['author'])["filename"].count()

It is possible to show the value of one of the columns of our dataframe for a specific row (i.e., article) based on its headword. For instance, if we want to know who wrote the article about Lyon, or if we want to see its content:

In [None]:
dataset.loc[dataset['head'] == 'LYON'].author.item()

In [None]:
dataset.loc[dataset['head'] == 'LYON'].text.item()

We can also perform a keyword search over the text content of all articles:

In [None]:
val = 'france'
df_2 = dataset[dataset['text'].str.contains(val, case=False)]
print(str(df_2.shape[0]) + ' articles contain the word \''+ val + '\'')

Another example with the expression "ville de" will extract all articles that contain the expression 'ville de':

In [None]:
dataset[dataset['text'].str.contains("ville de", case=False)]

The same works with the expressions 'océan pacifique' and 'mer pacifique', which can be used to study how much the Encyclopédie covers the Pacific area:

In [None]:
dataset[dataset['text'].str.contains("océan pacifique|mer pacifique", case=False)]


The same approach works for a more thematic search, for instance on 'esclavage' (slavery):

In [None]:
dataset[dataset['text'].str.contains("esclavage", case=False)]

## 2. The NLP pipeline: Perdido Geoparser

### 2.1 Preprocessing: tokenization and part-of-speech (POS) tagging 

In Natural Language Processing (NLP), the first steps before any further processing are usually to split the text into sentences and words (tokenization) and to assign each word its grammatical category (part-of-speech tagging). This makes it possible to write rules or queries that are more expressive than a simple keyword search.
This preprocessing step is language dependent, so the tool must match the language of the documents. This is a major difficulty with historical texts: for French, for instance, it is hard to find a POS tagger suited to older states of the language, since well-known taggers are trained on contemporary corpora.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.
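
To make the tokenization step concrete, here is a deliberately naive sketch based on a regular expression. It is not the tokenizer used by Perdido; the pattern is an assumption that only handles elided words written with a straight apostrophe (`d'`, `l'`) and basic punctuation.

```python
import re

# An elided word ending in an apostrophe, or a plain word, or one punctuation mark
TOKEN_PATTERN = re.compile(r"\w+'|\w+|[^\w\s]")


def tokenize(text):
    """Split French text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("petite ville d'Egypte, sur la côte occidentale de la mer Rouge")
print(tokens)
```

Even this small example shows why the step is language dependent: the elision rule is specific to French, and historical spellings or typographic apostrophes would need additional handling.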


The `Perdido` geoparser uses [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) for part-of-speech tagging.


### 2.2 Geoparsing: geotagging + geocoding

Geoparsing is divided into two main tasks: geotagging (NER) and geocoding.

The geotagging service of the `Perdido` API uses a cascade of finite-state transducers defining specific patterns for NER and identification of geographic information (spatial relations, etc.). 
> Mauro Gaio and Ludovic Moncla (2019). “Geoparsing and geocoding places in a dynamic space context.“ In The Semantics of Dynamic Space in French: Descriptive, experimental and formal studies on motion expression, 66, 353.
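
Perdido's transducer cascade is far richer than anything that fits here, but the underlying idea of matching patterns over POS-tagged tokens can be sketched in a few lines. In this sketch the tags are written by hand, the tag names are illustrative, and the single rule (common noun + `de`/`d'` + proper noun) is an assumption made for the example, not one of Perdido's actual patterns.

```python
# Hand-tagged tokens: (word, tag). Tag names are illustrative.
tagged = [
    ("petite", "A"), ("ville", "N"), ("d'", "PREP"), ("Egypte", "NPr"),
    (",", "PUN"), ("sur", "PREP"), ("la", "DET"), ("côte", "N"),
]


def find_extended_entities(tokens):
    """Return spans matching: common noun + de/d' + proper noun."""
    spans = []
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tokens[i:i + 3]
        if t1 == "N" and w2.lower() in ("de", "d'") and t3 == "NPr":
            # Glue an elided preposition directly to the following word
            sep = "" if w2.endswith("'") else " "
            spans.append(f"{w1} {w2}{sep}{w3}")
    return spans


print(find_extended_entities(tagged))
```

A cascade chains many such rules, each one working on the output of the previous one, so that a proper noun is first recognized as a name and then absorbed into a larger expression.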

For our custom version of the `Perdido` Geoparser, the geocoding task uses a simple gazetteer lookup method. We use the French wikiGazetteer (a gazetteer based on Wikipedia and enriched with Geonames data) generated following this work: https://github.com/alan-turing-institute/lwm_GIR19_resolving_places/tree/master/gazetteer_construction
> Mariona Coll Ardanuy, Katherine McDonough, Amrey Krause, Daniel CS Wilson, Kasra Hosseini, and Daniel van Strien. (2019) “Resolving Places, Past and Present: Toponym Resolution in Historical British Newspapers Using Multiple Resources”. In Proceedings of the 13th Workshop on Geographic Information Retrieval (GIR19).
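
A gazetteer lookup boils down to searching a table of names for the toponym string. The sketch below uses a tiny hand-made gazetteer; the entries, their approximate coordinates, and the `lookup` helper are hypothetical and unrelated to the real wikiGazetteer or to Perdido's implementation.

```python
# Tiny hand-made gazetteer with approximate coordinates (illustration only)
GAZETTEER = [
    {"name": "Paris", "alt_names": ["Lutèce"], "lat": 48.86, "lng": 2.35, "country": "France"},
    {"name": "Paris", "alt_names": [], "lat": 33.66, "lng": -95.56, "country": "United States"},
    {"name": "Lyon", "alt_names": ["Lugdunum"], "lat": 45.76, "lng": 4.84, "country": "France"},
]


def lookup(toponym, use_alternate_names=False, limit=1):
    """Return up to `limit` gazetteer records whose name matches the toponym."""
    key = toponym.casefold()
    matches = []
    for record in GAZETTEER:
        names = [record["name"]]
        if use_alternate_names:
            names += record["alt_names"]
        if any(key == n.casefold() for n in names):
            matches.append(record)
    return matches[:limit]


print(lookup("Paris", limit=5))                      # two candidates: ambiguous
print(lookup("Lugdunum"))                            # no exact match
print(lookup("Lugdunum", use_alternate_names=True))  # found through an alternate name
```

With `limit=1` the first record wins regardless of context, which is exactly why homonyms and historical name variants are a problem for a simple lookup.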

Geographic text analysis research in the digital humanities has focused on projects analyzing modern English-language corpora. 
This tutorial highlights the difficulties of extracting and mapping geographical information from historical French texts.
As we will see below, in addition to the language problem posed by historical documents, the early-modern period lacks temporally appropriate gazetteers.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.




The PERDIDO Geoparsing service (`http://erig.univ-pau.fr/PERDIDO/api/geoparsing/`) takes 6 parameters:
1. api_key: API key of the user
2. lang: language of the document (currently only available for French)
3. content: textual content to parse
4. mode: indicates whether the query uses an exact match on the name (mode: *s*) or also uses alternate names (mode: *a*). Default: *s*
5. records_limit: maximum number of records found in the gazetteer for each toponym (default: 1)
6. version: indicates the version of the geoparser (Encyclopedie or Standard). Default: Standard (the standard version was developed for the analysis of hiking descriptions)
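
For illustration, the parameters above could be assembled into a query string as follows. Whether the service expects them in a GET query string or in a POST body is not stated here, so treat the transport as an assumption. `YOUR_API_KEY` is a placeholder, the value given to `lang` is a guess, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "http://erig.univ-pau.fr/PERDIDO/api/geoparsing/"

params = {
    "api_key": "YOUR_API_KEY",   # placeholder
    "lang": "French",            # assumed value; the accepted spelling may differ
    "content": "AZIRUTH (Géographie.) petite ville d'Egypte",
    "mode": "s",                 # exact match on the name
    "records_limit": 1,          # keep a single gazetteer record per toponym
    "version": "Encyclopedie",
}

# urlencode percent-encodes spaces, accents and punctuation
query_url = BASE_URL + "?" + urlencode(params)
print(query_url)
```

In practice the `Perdido` Python library wraps this step, as the cells below show with `Geoparser(version="Encyclopedie")`.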

The PERDIDO Geoparser returns XML-TEI. The `<name>` element refers to named entities (proper nouns) and the type attribute indicates its class (place, person, etc.). The `<rs>` element refers to extended named entities (e.g. ville d'Egypte). The `<location>` element indicates that geographic coordinates were found during geocoding.  



As we will see in the following cells, we can apply the PERDIDO Geoparser to this example (Volume 1, article 5236, available online from the [ARTFL project](https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/1/5236/)):

>AZIRUTH (Géographie.) petite ville d'Egypte, sur la côte occidentale de la mer Rouge ; ce n'est presque plus qu'un village.


Three spatial entities are found during geotagging:
1. Aziruth, 
2. petite ville d'Egypte
3. la côte occidentale de la mer Rouge

while only one entity (*Egypte*) is found during geocoding:

```xml
<name type="place" subtype="edda" id="en.2">
   <w lemma="null" type="NPr" xml:id="w9">Egypte</w>
   <location>
      <geo source="wiki">35.4833 24.1333</geo>
   </location>
</name>
```
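
Because the output is plain XML, any XML parser can pull out the geocoded entities. Here is a minimal sketch with the standard library, applied to the snippet above. The order of the two numbers in `<geo>` is not specified here, so they are kept as a raw pair.

```python
import xml.etree.ElementTree as ET

snippet = """<name type="place" subtype="edda" id="en.2">
   <w lemma="null" type="NPr" xml:id="w9">Egypte</w>
   <location>
      <geo source="wiki">35.4833 24.1333</geo>
   </location>
</name>"""

name = ET.fromstring(snippet)
# Surface form: concatenate the text of the <w> (word) elements
surface = " ".join(w.text for w in name.iter("w"))
geo = name.find("location/geo")
if geo is not None:
    coords = tuple(float(x) for x in geo.text.split())
    print(surface, name.get("type"), coords, geo.get("source"))
else:
    # No <location> child means the entity was tagged but not geocoded
    print(surface, "was not geocoded")
```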

In [3]:
content = dataset.loc[dataset['head'] == 'AZIRUTH'].text.item()
content

"* AZIRUTH (Géographie.) petite ville d'Egypte, sur la côte occidentale de la mer Rouge ; ce n'est presque plus qu'un village."

#### 2.2.1 Run the geoparser

In [4]:
geoparser = Geoparser(version="Encyclopedie")
doc = geoparser(content)

In [5]:
doc.tei

'<TEI><teiheader></teiheader><text><body><s><phr type="loc" subtype="relationHead"><phr type="relationHead"><rs type="ene" id="en.0"><rs type="unknown" subtype="ene" id="en.1" start="0" end="0" startT="0" endT="2"><term type="unknown" start="0" end="0" startT="0" endT="1"><w lemma="*" type="N" id="w0">*</w></term><rs type="place" subtype="no" id="en.2" start="1" end="0" startT="1" endT="2"><name type="place" subtype="edda" id="en.3" start="1" end="0" startT="1" endT="2"><w lemma="aziruth" type="NPr" id="w1">AZIRUTH</w></name></rs></rs></rs><term type="articleClass" start="1" end="0" startT="2" endT="6"><w lemma="(" type="PUN" id="w2">(</w><w lemma="g&#233;ographie" type="N" id="w3">G&#233;ographie</w><w type="PUN" lemma="" id="w4">.</w><w lemma=")" type="PUN" id="w5">)</w></term><rs type="ene" id="en.4"><rs type="place" subtype="ene" id="en.5" start="1" end="0" startT="6" endT="10"><term type="place" start="1" end="0" startT="6" endT="8"><w lemma="petit" type="A" id="w6">petite</w><w l

In [6]:
doc.geojson

{'type': 'FeatureCollection',
 'features': [{'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [29.267547, 26.254049]},
   'properties': {'id': 'en.7',
    'name': 'Egypte',
    'sourceName': 'مصر',
    'type': 'administrative',
    'country': 'مصر',
    'source': 'nominatim'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [38.534297, 20.296545]},
   'properties': {'id': 'en.11',
    'name': 'mer Rouge',
    'sourceName': 'البحر الأحمر',
    'type': 'sea',
    'country': 'البحر الأحمر',
    'source': 'nominatim'}}]}
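
The GeoJSON output shown above appears to be an ordinary Python dictionary, so it can be traversed directly. The sketch below works on a trimmed copy of that result rather than on `doc.geojson` itself. Note that GeoJSON stores positions as `[longitude, latitude]`.

```python
# Trimmed copy of the output above
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [29.267547, 26.254049]},
         "properties": {"id": "en.7", "name": "Egypte", "source": "nominatim"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [38.534297, 20.296545]},
         "properties": {"id": "en.11", "name": "mer Rouge", "source": "nominatim"}},
    ],
}

places = []
for feature in geojson["features"]:
    lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order: longitude first
    places.append((feature["properties"]["name"], lat, lon))

for name, lat, lon in places:
    print(f"{name}: lat={lat}, lon={lon}")
```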

In [7]:
df = doc.to_dataframe()
df.head()

Unnamed: 0,name,tag,lat,lng,toponym_candidates
0,AZIRUTH,place,,,
1,Egypte,place,29.267547,26.254049,"[{'name': '', 'lat': '29.267547', 'lng': '26.2..."
2,mer Rouge,place,38.534297,20.296545,"[{'name': '', 'lat': '38.534297', 'lng': '20.2..."


#### 2.2.2 Save the results in files

In [8]:
doc.to_xml('aziruth-perdido.xml')

In [9]:
doc.to_geojson('aziruth-perdido.geojson')

In [10]:
doc.to_csv('aziruth-perdido.csv')

#### 2.2.3 Display named entities



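The geotagging results can be rendered with the [displaCy visualizers](https://spacy.io/usage/visualizers#span), using either the `ent` or the `span` style: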

In [7]:
displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)

In [8]:
displacy.render(doc.to_spacy_doc(), style="span", jupyter=True)

#### 2.2.4 Mapping place names


In [None]:
doc.get_folium_map()


## 3. Comparison to other NER tools
### 3.1 SpaCy

In [None]:
!pip install -U spaCy
!python -m spacy download fr_core_news_sm

In [None]:
import spacy

In [None]:
spacy_parser = spacy.load('fr_core_news_sm')

In [None]:
doc = spacy_parser(content)
for ent in doc.ents:
    print(ent.text, ent.label_)


In [None]:
displacy.render(doc, style="ent", jupyter=True) 

### 3.2 Stanza

In [None]:
!pip install stanza

In [None]:
import stanza

In [None]:
stanza_parser = stanza.Pipeline(lang='fr', processors='tokenize,ner')

In [None]:
doc = stanza_parser(content)
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

## 4. Processing several documents at once


In [16]:
df_sampled = dataset.sample(4)
df_sampled

Unnamed: 0,filename,volume,number,head,normClass,author,text
3,volume01-2568.tei,1,2568,ANCUD,Géographie moderne,Diderot,"* ANCUD, (Géog. mod.) l'Archipel d'Ancud ou de..."
15,volume04-4137.tei,4,4137,DENAT,Géographie moderne,unsigned,"DENAT, (Géog. mod.) petite ville de France au ..."
32,volume17-2675.tei,17,2675,"Zacatula, la",Géographie moderne,unsigned,"Zacatula, la, (Géog. mod.) riviere de l'Amériq..."
10,volume02-1650.tei,2,1650,Benin,Géographie,Diderot,"* Benin, (Géog.) capitale du royaume de même n..."


In [17]:
df_sampled.text

3     * ANCUD, (Géog. mod.) l'Archipel d'Ancud ou de...
15    DENAT, (Géog. mod.) petite ville de France au ...
32    Zacatula, la, (Géog. mod.) riviere de l'Amériq...
10    * Benin, (Géog.) capitale du royaume de même n...
Name: text, dtype: object

In [12]:
docs = geoparser(df_sampled.text)

In [14]:
for doc in docs:
    displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True) 

In [20]:
for e in docs[3].named_entities:
    print(e)

DAS VELAS place
 toponym candidate >  -28.205296 38.679753 nominatim Velas, Velas (São Jorge), Açores, Portugal

Océan place
 toponym candidate >  10.188326 2.859961 nominatim Océan, Sud, Cameroun

mer du Sud place
 toponym candidate >  7.009304 43.548716 nominatim Mer du Sud, 61-63, Rue Georges Clemenceau, Vallon Provençal, Cannes, Grasse, Alpes-Maritimes, Provence-Alpes-Côte d'Azur, France métropolitaine, 06400, France

Guan place
 toponym candidate >  115.539561 36.533939 nominatim 冠县, 聊城市, 山东省, 252500, 中国

Urac place
 toponym candidate >  0.044315 43.246812 nominatim Urac, Tarbes, Hautes-Pyrénées, Occitanie, France métropolitaine, 65000, France

Magellan place
 toponym candidate >  170.14217 -43.57091 nominatim Magellan, Westland District, West Coast, New Zealand / Aotearoa

Michel Lopez de Legaspi unknown

Philippe place
 toponym candidate >  -0.144005 44.554648 nominatim Philippe, Castets-en-Dorthe, Castets et Castillon, Langon, Gironde, Nouvelle-Aquitaine, France métropolitaine,

## 5. Toponym disambiguation using network analysis

In our work, we construct a network from the cross-references between "géographie" articles.
We propose using network analysis measures to establish an approximate location, defined by qualitative relations, for each named toponym in EDDA (the Encyclopédie of Diderot and d'Alembert). Throwing a list of decontextualized toponyms at an external resource like Geonames is risky. We therefore hypothesize that defining meaningful links between places can provide essential information to improve disambiguation (and potentially replace resolution as the end goal). We establish connections between places based on the citation of “headword” toponyms (those that appear as headwords of entries) in other EDDA entries.

>Moncla, L., McDonough, K., Vigier, D., Joliveau, T., & Brenon, A. (2019). Toponym disambiguation in historical documents using network analysis of qualitative relationships. Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Geospatial Humanities, 1–4. Chicago, IL, USA.

This method draws on relations in the corpus of EDDA articles, which improves disambiguation at a later stage with an external resource. We suggest the network as an alternative to geospatial representation, a useful proxy when no historical gazetteer exists for the source material's period. Our first experiments have shown that this approach goes beyond a simple text analysis and is able to find relations between toponyms that are not co-occurring in the same documents. Network relations are also usefully compared with disambiguated toponyms to evaluate geographical coverage, and the ways that geographical discourse is expressed, in historical texts.
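
The core of the idea can be sketched without any graph library: treat each headword as a node, add a directed edge from article A to article B whenever B's headword is cited in A's text, and count incoming edges (in-degree). The three mini-articles are invented for the example and the whole-word match is a simplifying assumption; the published method is more elaborate.

```python
import re
from collections import Counter

# Invented mini-corpus: headword -> article text
articles = {
    "LYON": "grande ville de France, au confluent du Rhône et de la Saône.",
    "RHÔNE": "grand fleuve de France, qui passe à Lyon.",
    "SAÔNE": "riviere de France, qui se jette dans le Rhône à Lyon.",
}


def citation_edges(corpus):
    """Yield (source, target) pairs when target's headword occurs in source's text."""
    for source, text in corpus.items():
        for target in corpus:
            if target == source:
                continue
            pattern = r"\b" + re.escape(target) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                yield source, target


edges = sorted(citation_edges(articles))
in_degree = Counter(target for _, target in edges)
print(edges)
print(in_degree.most_common())
```

Headwords cited by many other articles get a high in-degree, which is the measure used for node and label size in the first figure below.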


<table>
  <tr>
    <td> <img src="img/labels_indegree2.png" width ="500px"> </td>
    <td> <img src="img/nodes_betweenness+class2.png" width ="500px" > </td>
  </tr>
  <tr>
    <td>Node and label size indicate in-degree centrality</td>
    <td>Node size indicates betweenness centrality<br/> 
        colors refer to geographic feature types <br/> 
        (city: red, hydronym: blue, country: green, mountain: brown, unclassified: grey)</td>
  </tr>
</table> 



We also ran some preliminary tests by assigning the geographic coordinates found in our French wikiGazetteer to each node (headword). Only 2,535 of the 13,734 nodes (about 18%) have coordinates.

Our first experiment is shown below. Colors identify clusters of nodes computed with the [modularity measure](https://en.wikipedia.org/wiki/Modularity_(networks)) implemented in Gephi.

<table><tr>
<td> <img src="img/geocodingEDDA1.png" width ="500"> </td>
<td> <img src="img/geocodingEDDA_network.png" width ="500" > </td>
</tr></table> 