# NLP for historical texts

This notebook is proposed by [L. Moncla](https://ludovicmoncla.github.io/) and [K. McDonough](https://www.turing.ac.uk/people/researchers/katherine-mcdonough) as part of the [Sunoikisis Digital Classics](https://github.com/SunoikisisDC/SunoikisisDC-2021-2022/wiki/SunoikisisDC-Summer-2022-Session-9) Summer course on NLP for historical texts (Session 9).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/ludovicmoncla/SunoikisisDC-Summer2022-Session9/blob/main/Tutorial-geoparsing.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ludovicmoncla/SunoikisisDC-Summer2022-Session9/main?filepath=Tutorial-geoparsing.ipynb)


## 1. Overview


In this tutorial, we'll learn how to:


- Load a dataset from the `Perdido` library as a Python dataframe (articles from Diderot and d'Alembert's *Encyclopédie*)
  - Use Python dataframe for simple data analysis
- Use the `Perdido Geoparser` library for geoparsing French texts
  - Display geotagging results
  - Map geocoding results
- Compare `Perdido` NER results with `spaCy` and `Stanza` (Python libraries)
- Reflect on the limits of geoparsing historical French (and multilingual) texts.

## 2. Introduction

Geoparsing (also known as toponym resolution) refers to the process of extracting place names from text and assigning geographic coordinates to them.
This involves two main tasks: geotagging and geocoding.
Geotagging consists of identifying spans of text that refer to place names, while geocoding consists of finding unambiguous geographic coordinates for those names.
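
To make the two tasks concrete, here is a minimal sketch of geotagging followed by geocoding. It is only an illustration and not how `Perdido` works: the `TOY_GAZETTEER` dictionary and the `geotag` and `geocode` helpers are hypothetical names, and the geotagging step cheats by only looking for names that are already in the gazetteer. The coordinates are the Nominatim values that appear later in this notebook.

```python
import re

# Hypothetical mini-gazetteer: place name -> (longitude, latitude).
TOY_GAZETTEER = {
    "Frontignan": (3.753064, 43.448762),
    "France": (1.888334, 46.603354),
}


def geotag(text, known_names):
    """Geotagging: return (start, end, surface form) for each place name found."""
    spans = []
    for name in known_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), match.group()))
    return sorted(spans)


def geocode(surface, gazetteer):
    """Geocoding: look up coordinates for a surface form (case-insensitive)."""
    lookup = {name.lower(): coords for name, coords in gazetteer.items()}
    return lookup.get(surface.lower())


text = "FRONTIGNAN, (Géog.) petite ville de France."
spans = geotag(text, TOY_GAZETTEER)
results = [(surface, geocode(surface, TOY_GAZETTEER)) for _, _, surface in spans]
print(spans)
print(results)
```

A real geotagger has to find place names it has never seen, and a real geocoder has to choose between several candidates with the same name. Both problems come up in the rest of this tutorial.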

Geographic text analysis research in the digital humanities has focused on projects analyzing modern English-language corpora. 
In this tutorial we propose to highlight the difficulties of extracting and mapping geographical information from historical French texts.
As we'll see in the following, in addition to the problem of language when it comes to historical documents, the early-modern period lacks temporally appropriate gazetteers.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.


### Discussion Part 1

1. Input/output: what do you need for geoparsing, and what do you get at the end?
2. How is geoparsing related to other NLP tasks?
3. Why is geoparsing an unsolved task in GIScience?
4. What methods allow us to evaluate the results of automatic geoparsing?
5. What kinds of humanities research questions can we answer using the results of geoparsing?

### 2.1 The Perdido Geoparser python library

[Perdido](https://github.com/ludovicmoncla/perdido) is a Python text geoparser. It provides NLP and GIS methods for geoparsing French texts.
It was initially developed as a REST API for extracting and retrieving displacements from French hiking descriptions, within the framework of the [PERDIDO](http://erig.univ-pau.fr/PERDIDO/) and [ANR Choucas](http://choucas.ign.fr) projects.

More recently, as part of the [GEODE project](https://geode-project.github.io) we have developed a custom version for historical documents and more specifically for the Encyclopédie.


In this tutorial we'll see how to use the `Perdido` Python library for geoparsing French texts.
We will apply geoparsing to volume 7 of the *Encyclopédie*, using the version of the corpus released by the [ARTFL project](https://encyclopedie.uchicago.edu/), and we'll show the limits of geotagging and geocoding historical documents.

### 2.2 Acknowledgement

Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu/), University of Chicago.


## 3. Setting up the environment



### 3.1 Install python packages

* If you configured your conda environment using the `requirements.txt` file, you can skip this step and go to the `Import` section.
* If you configured your conda environment using the `environment.yml` file or if you use a Google Colab environment, you need to install `perdido` using `pip`:

In [None]:
!pip install --upgrade perdido

* Then, if you already configured your conda environment, either with conda or pip (see readme file) you can skip the next cell.
* If you're running this notebook from Google Colab, you need to run the next cell.


In [None]:
!pip install stanza

### 3.2 Import the libraries

First, we will load some specific libraries from `Perdido` that we will use in this notebook. Next, we import some tools that will help us parse and visualize the text.

In [1]:
from perdido.geoparser import Geoparser
from perdido.geocoder import Geocoder
from perdido.datasets import load_edda_artfl, load_edda_perdido

from spacy import displacy


## 4. Getting started

In this notebook, we'll test out some basic queries of the *Encyclopédie* articles from volume 7 (Foang - Gythium, published in 1757). You can learn more about the other volumes [here](https://encyclopedie.uchicago.edu/node/102).


### 4.1 Loading the ARTFL *Encyclopédie* dataset

First, we load the data. You can view this sample dataset under the 'Data' directory [here](https://github.com/ludovicmoncla/SunoikisisDC-Summer2022-Session9).

The next cell loads the data, defines the data as `dataset`, and shows you the top 5 records (`head`). The data has been saved as a dataframe.

In [2]:
d = load_edda_artfl()
dataset = d['data']
dataset.head()

Unnamed: 0,filename,volume,number,head,normClass,author,text
0,volume07-1.tei,7,1,Title Page,unclassified,unsigned,"ENCYCLOPÉDIE, ou DICTIONNAIRE RAISONNÉ DES SCI..."
1,volume07-10.tei,7,10,FOESNE ou FOUANE,Marine | Pêche,Bellin,"FOESNE ou FOUANE, sub. s. (Marine & Pêche.) c'..."
2,volume07-100.tei,7,100,Fond de la hune,unclassified,Bellin,Fond de la hune ; ce sont les planches qu on p...
3,volume07-1000.tei,7,1000,Fronteau,Bourrelier | Sellier,Diderot,"* Fronteau, terme de Sellier-Bourrelier ; c'es..."
4,volume07-1001.tei,7,1001,FRONTIERE,Géographie,Diderot,"* FRONTIERE, s. f. (Géog.) se dit des limites,..."


### 4.2 Exploring the data

Now we have access to all the attributes and methods of the [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) object. For instance, we can easily print the number of rows in our dataframe, which corresponds to the number of articles in our corpus:

In [3]:
n = dataset.shape[0]
print('There are ' + str(n) + ' articles in the dataset.')

There are 3385 articles in the dataset.


#### 4.2.1 Searching by metadata

Now that the data from the XML-TEI files are loaded into a python dataframe, we can select groups of articles based on the article metadata that was originally stored in the TEI header.

For instance, we can select articles based on their classification in the *Encyclopédie*. There are actually a few different ways that the ARTFL *Encyclopédie* articles have been classified. In this notebook we will use the `normClass` field, which normalizes the classifications given at the time of publication (these had many spelling variants). In the cell below, we have hand-selected all the `normClass` combinations that include Geography, as well as Geography on its own.

If we want all articles classified as 'Geography', we can make the request as follows (the output is stored as a new dataframe, `df_geo`):

In [4]:
normclassGEO = ['Géographie', 'Géographie moderne',
                 'Géographie ancienne', 'Géographie moderne | Géographie ancienne',
                 'Géographie ancienne | Géographie moderne', 'Géographie sacrée', 'Géographie sainte',
                 'Géographie | Histoire ancienne', 'Géographie historique', 'Géographie | Histoire',
                 'Histoire | Géographie', 'Géographie | Histoire naturelle', 'Géographie | Mythologie',
                 'Géographie ancienne | Mythologie', 'Histoire moderne | Géographie',
                 'Géographie ancienne | Géographie sainte', 'Géographie ancienne | Géographie sacrée',
                 'Géographie sacrée | Géographie ancienne', 'Géographie du moyen âge', 'Géographie des Arabes',
                 'Géographie | Commerce', 'Histoire | Géographie ancienne',
                 'Géographie | Histoire ancienne | Histoire moderne', 'Géographie ancienne | Littérature | Histoire',
                 'Histoire naturelle | Géographie', 'Géographie | Histoire ancienne | Mythologie',
                 'Géographie moderne | Commerce', 'Géographie ancienne | Géographie antique',
                 'Géographie moderne | Histoire', 'Géographie | Histoire monastique',
                 'Géographie ancienne | Géographie moderne | Mythologie', 'Géographie ancienne | Histoire',
                 'Géographie ancienne | Littérature | Mythologie', 'Géographie ancienne | Médailles'
                 ]
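
A hand-built list like the one above is explicit but tedious to maintain. As a possible alternative (not the approach used in this notebook), the sketch below tests each label programmatically by splitting it on `|` and checking whether any part starts with 'Géographie'. The helper name `is_geography` is hypothetical, and this rule may not select exactly the same articles as the hand-picked list.

```python
def is_geography(norm_class):
    """Return True if any '|'-separated part of the label starts with 'Géographie'."""
    parts = [part.strip() for part in norm_class.split("|")]
    return any(part.startswith("Géographie") for part in parts)


# Labels taken from the dataframe outputs and the list above.
labels = [
    "Géographie",
    "Histoire | Géographie ancienne",
    "Marine | Pêche",
    "unclassified",
]
geo_labels = [label for label in labels if is_geography(label)]
print(geo_labels)
```

With pandas, the same function could be applied to the whole column, for example `dataset[dataset['normClass'].map(is_geography)]`.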

* Query the dataframe for all articles matching one of the classes in our list:

In [5]:
df_geo = dataset.loc[dataset['normClass'].isin(normclassGEO)]
df_geo.head(10)

Unnamed: 0,filename,volume,number,head,normClass,author,text
4,volume07-1001.tei,7,1001,FRONTIERE,Géographie,Diderot,"* FRONTIERE, s. f. (Géog.) se dit des limites,..."
5,volume07-1002.tei,7,1002,FRONTIGNAN,Géographie,Jaucourt,"FRONTIGNAN, (Géog.) petite ville de France. au..."
29,volume07-1024.tei,7,1024,"FROWARD, le cap.",Géographie,Jaucourt,"FROWARD, le cap. (Géog.) & par les François le..."
53,volume07-1046.tei,7,1046,FUEGO (Isla del-),Géographie,Jaucourt,"FUEGO (Isla del-), Géog ou en françois, l'île ..."
54,volume07-1047.tei,7,1047,Fuego ou Fogo (Isle de-),Géographie,Jaucourt,"Fuego ou Fogo (Isle de-), Géog. cette seconde ..."
55,volume07-1048.tei,7,1048,FUENCHEU ou FOUENTCHÉOU,Géographie,Jaucourt,"FUENCHEU ou FOUENTCHÉOU, (Géogr.) grande ville..."
56,volume07-1049.tei,7,1049,"FUESSEN, ou FUSSER",Géographie,Jaucourt,"FUESSEN, ou FUSSER, en latin Fucena, & par que..."
61,volume07-1053.tei,7,1053,FULDE,Géographie,unsigned,"FULDE, Fulda, (Géog.) ville & abbaye célebre d..."
82,volume07-1072.tei,7,1072,FUM-CHIM,Géographie,Jaucourt,"FUM-CHIM, (Géog.) petite ville de la province ..."
104,volume07-1092.tei,7,1092,FUNCHAL,Géographie,Jaucourt,"FUNCHAL, (Géog.) ville de l'Océan atlantique, ..."


In [6]:
print('There are ' + str(df_geo.shape[0]) + ' geography articles')

There are 489 geography articles


We can query based on any value in the dataframe (e.g. article metadata). For instance, we can query all the articles written by a specific author:

* Count the articles written by a single named author (Jaucourt):

In [7]:
val = 'Jaucourt'
n = df_geo.loc[df_geo['author'] == val].shape[0]
print(str(n) + ' were written by '+ val)

472 were written by Jaucourt


We can also easily show the number of articles per author:

In [8]:
df_geo.groupby(['author'])["filename"].count()

author
Diderot                 1
Jaucourt              472
La Condamine            1
Robert de Vaugondy      2
unsigned               13
Name: filename, dtype: int64

It is possible to show the value of one column in our dataframe for a specific row (i.e., for a specific article) based on its name. For instance, if we want to know who wrote the article about Frontignan, or if we want to see its content, we make these requests:

In [9]:
dataset.loc[dataset['head'] == 'FRONTIGNAN'].author.item()

'Jaucourt'

#### 4.2.2 Searching by text 

It is also possible to display and search the full text content of the articles stored in the dataframe. 

* Show full text for a specific article:

In [10]:
dataset.loc[dataset['head'] == 'FRONTIGNAN'].text.item()

"FRONTIGNAN, (Géog.) petite ville de France. au Bas-Languedoc, connue par ses excellens vins muscats, & ses raisins de caisse qu'on appelle passerilles. Quelques savans croyent, sans en donner de preuves, que cette ville est le forum Domitii des Romains. Elle est située sur l'étang de Maguelone, à six lieues N. E. d'Agde, & cinq S. O. de Montpellier. Long. 15d. 24'. lat. 43d. 28'. (D. J.)"

We can also perform a **keyword search** over the text content of all articles:

* Select articles that contain 'france':

In [11]:
# search corpus by keyword (val)
val = 'france'
df_2 = dataset[dataset['text'].str.contains(val, case=False)]
print(str(df_2.shape[0]) + ' articles contain the word \''+ val + '\'')

239 articles contain the word 'france'
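
Note that a plain substring search also matches longer words that happen to contain the keyword, so the count above may include some false hits. The sketch below contrasts a substring test with a word-boundary regular expression. It uses Python's `re` module on two short strings: the first is from the FRONTIGNAN article, the second is a made-up sentence used only for illustration.

```python
import re

texts = [
    "FRONTIGNAN, (Géog.) petite ville de France.",  # from the article above
    "la souffrance des peuples",                    # made-up example sentence
]

# Substring test: 'souffrance' contains the letters 'france'.
substring_hits = [t for t in texts if "france" in t.lower()]

# Whole-word test: \b requires a word boundary on both sides.
word_pattern = re.compile(r"\bfrance\b", flags=re.IGNORECASE)
word_hits = [t for t in texts if word_pattern.search(t)]

print(len(substring_hits), len(word_hits))
```

Since `str.contains` treats its pattern as a regular expression by default, the same pattern can be passed to it: `dataset['text'].str.contains(r'\bfrance\b', case=False)`.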


It is also possible to search by **phrases**. The expression "ville de" is commonly used in the *Encyclopédie* to define the country or region of a place. Searching by this phrase gives us a sense of the broader geographical coverage of the corpus. 

Here we extract all articles that contain the expression 'ville de':

In [12]:
dataset[dataset['text'].str.contains("ville de", case=False)]

Unnamed: 0,filename,volume,number,head,normClass,author,text
5,volume07-1002.tei,7,1002,FRONTIGNAN,Géographie,Jaucourt,"FRONTIGNAN, (Géog.) petite ville de France. au..."
82,volume07-1072.tei,7,1072,FUM-CHIM,Géographie,Jaucourt,"FUM-CHIM, (Géog.) petite ville de la province ..."
104,volume07-1092.tei,7,1092,FUNCHAL,Géographie,Jaucourt,"FUNCHAL, (Géog.) ville de l'Océan atlantique, ..."
114,volume07-1100.tei,7,1100,Funérailles des Romains,unclassified,Jaucourt,Funérailles des Romains. Les Romains ont eté s...
129,volume07-1114.tei,7,1114,FUNG,Géographie,Jaucourt,"FUNG, (Géog.) ville de la Chine, dans la provi..."
...,...,...,...,...,...,...,...
3277,volume07-901.tei,7,901,FRIAS,Géographie,Jaucourt,"FRIAS, (Géog.) petite ville de la Castille vie..."
3279,volume07-903.tei,7,903,Fribourg,Géographie,Jaucourt,"Fribourg, Friburgum, (Géog.) ville de Suisse f..."
3282,volume07-906.tei,7,906,FRICENTI,Géographie,Jaucourt,"FRICENTI, en latin moderne Fricentium, (Géog.)..."
3288,volume07-911.tei,7,911,FRIDERICKSTADT,Géographie,Jaucourt,"FRIDERICKSTADT, (Géog.) petite ville de la pre..."


Next, we can try a thematic search, for instance about 'esclavage' (slavery): 

In [13]:
dataset[dataset['text'].str.contains("esclavage", case=False)]

Unnamed: 0,filename,volume,number,head,normClass,author,text
456,volume07-1409.tei,7,1409,GALÉRIEN,Jurisprudence | Marine,unsigned,"GALÉRIEN, s. m. (Jurisprud. Marine.) criminel ..."
2283,volume07-3054.tei,7,3054,GROTESQUES,Beaux-Arts,Watelet,"GROTESQUES, s. f. pl. (Beaux-Arts.) vient du m..."
2943,volume07-600.tei,7,600,FOURRAGE,Maréchallerie,Bourgelat,"FOURRAGE, s. m. (Maréchall.) nourriture des ch..."
3054,volume07-700.tei,7,700,Franc,unclassified,Boucher d'Argis,"Franc signifie quelquefois une personne libre,..."
3131,volume07-770.tei,7,770,"FRANÇOIS, ou FRANÇAIS",Littérature | Histoire | Morale,Voltaire,"FRANÇOIS, ou FRANÇAIS, s. m. (Hist. Littérat. ..."


## 5. The NLP pipeline: Perdido Geoparser


In Natural Language Processing (NLP), the first steps before processing text content are usually to split the text into sentences and words (tokenization) and to assign each word its grammatical category (part-of-speech, or POS, tagging).

This allows the construction of more complex rules or queries compared to a simple keyword search. E.g. we would know that "city" is a noun, and we can perform a search for all nouns in the corpus.
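
As a small illustration of such rules, the sketch below filters a POS-tagged token list, first for all nouns and then for adjective + noun sequences. The tokens and tags are copied by hand from the `Perdido` output for the FRONTIGNAN article shown later in this notebook. The variable names are only for illustration.

```python
# (token, POS tag) pairs, copied from the Perdido TEI output for FRONTIGNAN.
tagged = [
    ("FRONTIGNAN", "N"),
    (",", "PUN"),
    ("(", "PUN"),
    ("Géog", "NPr"),
    (".", "PUN"),
    (")", "PUN"),
    ("petite", "A"),
    ("ville", "N"),
]

# Simple query: all tokens tagged as nouns.
nouns = [token for token, pos in tagged if pos == "N"]

# Slightly more complex rule: an adjective immediately followed by a noun.
adj_noun = [
    (first, second)
    for (first, pos1), (second, pos2) in zip(tagged, tagged[1:])
    if pos1 == "A" and pos2 == "N"
]

print(nouns)
print(adj_noun)
```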


These preprocessing steps are language dependent, so we have to choose the right tool according to the language, style, and period of our documents. This is a major difficulty when dealing with historical or ancient texts. For instance, it is difficult to find a POS tagger for pre-20th-century French, as the most widely used taggers are trained on contemporary corpora.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.



### 5.1 Perdido Geoparser


The `Perdido` geoparser uses [Treetagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) for part-of-speech tagging. 


The geotagging step of `Perdido` is performed using a cascade of finite-state transducers that define specific patterns for NER and for the identification of geographic information (spatial relations, etc.).
> Mauro Gaio and Ludovic Moncla (2019). “Geoparsing and geocoding places in a dynamic space context.“ In The Semantics of Dynamic Space in French: Descriptive, experimental and formal studies on motion expression, 66, 353.

Geoparsing in the digital humanities began with projects analyzing modern English-language corpora. But now, many researchers are developing projects for automatically identifying and geolocating places named in texts of many languages. 

Here we highlight the difficulties of extracting and mapping geographical information from historical French texts. In addition to language-related problems, which impact the quality of tokenization, POS tagging, and NER, geocoding presents its own challenges. Once place names have been identified in a text, correctly associating geographical coordinates with those places is a challenge. **Gazetteers** are knowledge bases that help researchers link place names with information about a place, including its location.

For our custom version of the `Perdido` Geoparser, the geocoding task uses a simple gazetteer lookup method. Several gazetteers can be used:
 - Nominatim (ie, OpenStreetMap) by default, 
 - Geonames, 
 - World Historical Gazetteer, 
 - Pleiades

Like using the most appropriate POS tagger, finding the best gazetteer for your corpus can be challenging. Luckily, for the ancient world, there are some excellent options. Here, you will also be able to test the [Pleiades gazetteer](https://pleiades.stoa.org/) and compare the results with the other contemporary gazetteers.


The PERDIDO Geoparser returns XML-TEI. The `<name>` element refers to named entities (proper nouns) and the type attribute indicates its class (place, person, etc.). The `<rs>` element refers to extended named entities (e.g. ville d'Egypte). The `<location>` element indicates that geographic coordinates were found during geocoding.  


#### 5.1.1 Getting started with `Perdido`

* Get the content from the article 'FRONTIGNAN' ([https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/1002/](https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/1002/)) from the dataset:

In [14]:
content = dataset.loc[dataset['head'] == 'FRONTIGNAN'].text.item()
content

"FRONTIGNAN, (Géog.) petite ville de France. au Bas-Languedoc, connue par ses excellens vins muscats, & ses raisins de caisse qu'on appelle passerilles. Quelques savans croyent, sans en donner de preuves, que cette ville est le forum Domitii des Romains. Elle est située sur l'étang de Maguelone, à six lieues N. E. d'Agde, & cinq S. O. de Montpellier. Long. 15d. 24'. lat. 43d. 28'. (D. J.)"

* Create a Geoparser object from the `Perdido` library. Specify that we are working with the *Encyclopédie* version.

In [15]:
geoparser = Geoparser(version="Encyclopedie")

* Now you can use this geoparser for geoparsing text content. Let's try it with the `content` variable that we declared earlier:

In [16]:
doc = geoparser(content)

The geoparser returns a `Perdido` object. This object has several attributes and methods. We'll now see some of them.

* Accessing the XML-TEI result:

In [17]:
doc.tei

'<TEI><teiheader></teiheader><text><body><s><phr type="relationHead"><rs type="place" subtype="no" id="en.0" start="0" end="0" startT="0" endT="1"><name type="place" subtype="edda" id="en.1" start="0" end="0" startT="0" endT="1"><w pos="N" lemma="frontignan" id="w0">FRONTIGNAN</w><location><geo source="nominatim" rend="Frontignan, Montpellier, H&#233;rault, Occitanie, France m&#233;tropolitaine, 34110, France">3.753064 43.448762</geo></location></name></rs><w pos="PUN" lemma="" id="w1">,</w><term type="articleClass" start="1" end="0" startT="2" endT="6"><w pos="PUN" lemma="(" id="w2">(</w><w pos="NPr" lemma="g&#233;og" id="w3">G&#233;og</w><w pos="PUN" lemma="" id="w4">.</w><w pos="PUN" lemma=")" id="w5">)</w></term><rs type="ene" id="en.2"><rs type="place" subtype="ene" id="en.3" start="1" end="0" startT="6" endT="10"><term type="place" start="1" end="0" startT="6" endT="8"><w pos="A" lemma="petit" id="w6">petite</w><w pos="N" lemma="ville" id="w7">ville</w></term><w pos="PREP" lemma=
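
The TEI string can be read with any XML parser. As a rough sketch, the code below uses the standard library's `xml.etree.ElementTree` to pull place names and coordinates out of a shortened fragment of the output above. The fragment was trimmed by hand (some attributes removed) so that it is well-formed on its own. With the real result you would parse `doc.tei` instead.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed, well-formed fragment of the TEI output shown above.
tei_fragment = (
    '<s><rs type="place" subtype="no" id="en.0">'
    '<name type="place" subtype="edda" id="en.1">'
    '<w pos="N" lemma="frontignan" id="w0">FRONTIGNAN</w>'
    '<location><geo source="nominatim">3.753064 43.448762</geo></location>'
    '</name></rs>'
    '<w pos="PUN" lemma="" id="w1">,</w></s>'
)

root = ET.fromstring(tei_fragment)

places = []
for name in root.iter("name"):
    if name.get("type") != "place":
        continue
    # The surface form is the text of the <w> tokens inside <name>.
    surface = " ".join(w.text for w in name.iter("w"))
    # <location><geo> is only present when geocoding found coordinates.
    geo = name.find("location/geo")
    coords = tuple(float(x) for x in geo.text.split()) if geo is not None else None
    places.append((surface, coords))

print(places)
```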

* Accessing the GeoJSON results generated during the geocoding phase:

In [18]:
doc.geojson

{'type': 'FeatureCollection',
 'features': [{'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [3.753064, 43.448762]},
   'properties': {'id': 'en.1',
    'name': 'FRONTIGNAN',
    'sourceName': 'Frontignan, Montpellier, Hérault, Occitanie, France métropolitaine, 34110, France',
    'type': 'administrative',
    'country': 'France',
    'source': 'nominatim'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [1.888334, 46.603354]},
   'properties': {'id': 'en.5',
    'name': 'France',
    'sourceName': 'France',
    'type': 'administrative',
    'country': 'France',
    'source': 'nominatim'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [-68.50451, 47.532931]},
   'properties': {'id': 'en.7',
    'name': 'Bas-Languedoc',
    'sourceName': 'Ruisseau Languedoc, Dégelis, Témiscouata, Bas-Saint-Laurent, Québec, G5T 1P8, Canada',
    'type': 'stream',
    'country': 'Canada',
    'source': 'nominatim'}},
  {'type': 'Feature',
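
The GeoJSON output is a plain Python dictionary, so it is easy to inspect. The sketch below rebuilds the first three features shown above (keeping only some properties) and flags the results whose `country` differs from the one we expect for this article. The `expected_country` check is only an illustrative heuristic, not a feature of `Perdido`. Note that GeoJSON stores coordinates as `[longitude, latitude]`.

```python
# First three features from the output above, with a subset of their properties.
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [3.753064, 43.448762]},
         "properties": {"name": "FRONTIGNAN", "type": "administrative", "country": "France"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [1.888334, 46.603354]},
         "properties": {"name": "France", "type": "administrative", "country": "France"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-68.50451, 47.532931]},
         "properties": {"name": "Bas-Languedoc", "type": "stream", "country": "Canada"}},
    ],
}

expected_country = "France"  # assumption: the article describes a French town

suspicious = []
for feature in geojson["features"]:
    lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order: lon, lat
    props = feature["properties"]
    if props["country"] != expected_country:
        suspicious.append((props["name"], props["country"], lon, lat))

print(suspicious)
```

Here the historical region 'Bas-Languedoc' was matched to a stream in Québec. This is a typical error when a contemporary gazetteer is queried with a historical place name.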

* Transform the Perdido object into a dataframe (only some of the attributes are kept):

In [19]:
df = doc.to_dataframe()
df.head()

Unnamed: 0,name,tag,lat,lng,toponym_candidates
0,FRONTIGNAN,place,3.753064,43.448762,"[{'name': '', 'lat': '3.753064', 'lng': '43.44..."
1,France,place,1.888334,46.603354,"[{'name': '', 'lat': '1.888334', 'lng': '46.60..."
2,Bas-Languedoc,place,-68.50451,47.532931,"[{'name': '', 'lat': '-68.50451', 'lng': '47.5..."
3,Domitii des Romains,unknown,,,
4,Maguelone,place,3.883389,43.514004,"[{'name': '', 'lat': '3.883389', 'lng': '43.51..."


#### 5.1.2 Save the results in files

In [None]:
doc.to_xml('FRONTIGNAN-perdido.xml')

In [None]:
doc.to_geojson('FRONTIGNAN-perdido.geojson')

In [None]:
doc.to_csv('FRONTIGNAN-perdido.csv')

#### 5.1.3 Display named entities

Often, it is useful to visualize the output in sentence form. The `spacy` library provides a useful tool for this: `displacy`.


In [20]:
displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)

In [21]:
displacy.render(doc.to_spacy_doc(), style="span", jupyter=True)

#### 5.1.4 Map place names


For many projects, it is important to view the results of geoparsing on a map. Here, we can see the results plotted on a map. But remember, these are only the results for which coordinates could be found. Results that could not be matched to records in the gazetteer will not be mapped.

* Here we see the geocoding results for 'Frontignan' mapped:


In [22]:
doc.get_folium_map()

#### 5.1.5 Try another example

GESSORIACUM - https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/2046/



In [23]:
content = dataset.loc[dataset['head'] == 'GESSORIACUM'].text.item()
content

"GESSORIACUM, (Géog. anc.) le Gessoriacum de Suétone & de Ptolomée, ce fameux port des Romains d'où se faisoit le passage des Gaules dans la Grande-Bretagne ; ce port décoré d'un phare magnifique bâti  par Caligula, étoit Boulogne-sur-mer ; on n'en peut pas douter par l'ancienne carte de Peutinger, qui dit Gessoriacum quod nunc Bononia. Ce port étoit dans le pays des Morins ; & depuis Jules-César jusqu'au tems des derniers empereurs, tous ceux que l'Histoire dit avoir passé des Gaules dans la Grande-Bretagne, se sont embarqués à Gessoriacum, c'est-à-dire à Boulogne. Voyez la Martiniere, & les mémoires  de l'acad des Inscrip. tom. IX. (D. J.)"

In [24]:
doc = geoparser(content)
displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)

In [25]:
doc.get_folium_map()

## 6. Comparison to other NER tools

`Perdido` is a custom geoparsing library for French language documents. How does it compare to other state-of-the-art libraries?

Comparing `Perdido`, `SpaCy`, and `Stanza` outputs, even for just 1-2 articles from our corpus, allows us to see common errors. One library might excel at identifying people, but struggle with complex place names. Another is better at capturing places named within phrases, but mixes up people and places. It is important to test multiple geoparsers for your corpus, and to understand how they can be adapted, in order to get the best results.

### 6.1 SpaCy

[SpaCy](https://spacy.io/) is a commonly-used NLP library that supports documents in many languages. `SpaCy` uses Machine Learning to perform NER (versus being a rule-based system).

* Install the `spaCy` French pre-trained language model:

In [None]:
!python -m spacy download fr_core_news_sm

* Import the `spaCy` library

In [None]:
import spacy

* Load the `spaCy` French pre-trained language model

In [None]:
spacy_parser = spacy.load('fr_core_news_sm')

* Run the NER pipeline

In [None]:
doc = spacy_parser(content)

* Show the named entities

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

* Display the named entities with `displaCy`

In [None]:
displacy.render(doc, style="ent", jupyter=True) 

### 6.2 Stanza

[Stanza](https://stanfordnlp.github.io/stanza/) is another machine-learning-based NLP library, developed by Stanford, that is designed to work across many languages.

* Import the `Stanza` library and download the pre-trained French language model:

In [26]:
import stanza
# This can take a while depending on your internet connection (fr model is 572M)
stanza.download('fr')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.2.2.json:   0%|   …

2022-06-23 14:53:14 INFO: Downloading default packages for language: fr (French)...
2022-06-23 14:53:15 INFO: File exists: /Users/lmoncla/stanza_resources/fr/default.zip.
2022-06-23 14:53:18 INFO: Finished downloading models and saved to /Users/lmoncla/stanza_resources.


* Declare the NER pipeline:

In [27]:
stanza_parser = stanza.Pipeline(lang='fr', processors='tokenize,ner')

2022-06-23 14:53:24 INFO: Loading these models for language: fr (French):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
| ner       | wikiner |

2022-06-23 14:53:24 INFO: Use device: cpu
2022-06-23 14:53:24 INFO: Loading: tokenize
2022-06-23 14:53:24 INFO: Loading: mwt
2022-06-23 14:53:24 INFO: Loading: ner
2022-06-23 14:53:25 INFO: Done loading processors!


* Run the NER pipeline:

In [28]:
doc = stanza_parser(content)




* Show the named entities:

In [29]:
for ent in doc.ents:
    print(ent.text, ent.type)

GESSORIACUM PER
Gessoriacum MISC
Suétone & de Ptolomée PER
Romains LOC
Gaules LOC
Grande-Bretagne LOC
Caligula PER
Boulogne-sur-mer LOC
Peutinger LOC
Gessoriacum quod nunc Bononia MISC
Morins LOC
Jules-César PER
Histoire MISC
Gaules LOC
Grande-Bretagne LOC
Gessoriacum LOC
Boulogne LOC
Voyez la Martiniere MISC
Inscrip. tom. IX MISC
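
One common way to compare NER outputs is to measure precision, recall, and F1 against a manually annotated reference. The sketch below does this for the `LOC` entities returned by `Stanza` above, comparing unique names rather than individual mentions. The `gold` set is a hypothetical hand annotation written for this example only. It is not an official gold standard, and other annotators might reasonably decide differently (for example about 'Morins').

```python
# Unique names tagged LOC by Stanza in the output above.
predicted = {
    "Romains", "Gaules", "Grande-Bretagne", "Boulogne-sur-mer",
    "Peutinger", "Morins", "Gessoriacum", "Boulogne",
}

# Hypothetical hand annotation of the place names in the article (assumption).
gold = {
    "Gessoriacum", "Gaules", "Grande-Bretagne",
    "Boulogne-sur-mer", "Bononia", "Boulogne",
}

true_positives = predicted & gold
precision = len(true_positives) / len(predicted)  # share of predictions that are correct
recall = len(true_positives) / len(gold)          # share of gold names that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A stricter evaluation would compare character offsets of each mention, so that a correct name in the wrong position is not counted as a hit.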


### 6.3 Geocoding


#### 6.3.1 Quick start with the Geocoder class from the `Perdido` library

The `Geocoder()` can take several parameters (all optional) such as:
1. sources: list of gazetteers (possible values are: 'nominatim' (default), 'geonames', 'whg', 'pleiades', 'ign' (only for France))
2. max_rows: maximum number of toponym candidates returned by the gazetteer (default = 1)




* The next cell shows how to create a geocoder and geocode 'Lyon':

In [30]:
geocoder = Geocoder()
doc = geocoder('Lyon')

* Show the geojson results:

In [31]:
doc.geojson

{'type': 'FeatureCollection',
 'features': [{'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [4.832011, 45.757814]},
   'properties': {'id': 0,
    'name': 'Lyon',
    'sourceName': 'Lyon, Métropole de Lyon, Circonscription départementale du Rhône, Auvergne-Rhône-Alpes, France métropolitaine, France',
    'type': 'administrative',
    'country': 'France',
    'source': 'nominatim'}}]}

* Map the results:

In [32]:
doc.get_folium_map()

#### 6.3.2 Geocode Stanza results

* Run the Stanza NER pipeline again:

In [33]:
doc = stanza_parser(content)

[{
   "text": "GESSORIACUM",
   "type": "PER",
   "start_char": 0,
   "end_char": 11
 },
 {
   "text": "Gessoriacum",
   "type": "MISC",
   "start_char": 29,
   "end_char": 40
 },
 {
   "text": "Suétone & de Ptolomée",
   "type": "PER",
   "start_char": 44,
   "end_char": 65
 },
 {
   "text": "Romains",
   "type": "LOC",
   "start_char": 86,
   "end_char": 93
 },
 {
   "text": "Gaules",
   "type": "LOC",
   "start_char": 125,
   "end_char": 131
 },
 {
   "text": "Grande-Bretagne",
   "type": "LOC",
   "start_char": 140,
   "end_char": 155
 },
 {
   "text": "Caligula",
   "type": "PER",
   "start_char": 205,
   "end_char": 213
 },
 {
   "text": "Boulogne-sur-mer",
   "type": "LOC",
   "start_char": 221,
   "end_char": 237
 },
 {
   "text": "Peutinger",
   "type": "LOC",
   "start_char": 288,
   "end_char": 297
 },
 {
   "text": "Gessoriacum quod nunc Bononia",
   "type": "MISC",
   "start_char": 307,
   "end_char": 336
 },
 {
   "text": "Morins",
   "type": "LOC",
   "start_char": 369,


* Get the list of place entities:

In [38]:
places = [d.text for d in doc.ents if d.type == 'LOC']
places

['Romains',
 'Gaules',
 'Grande-Bretagne',
 'Boulogne-sur-mer',
 'Peutinger',
 'Morins',
 'Gaules',
 'Grande-Bretagne',
 'Gessoriacum',
 'Boulogne']
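
The list contains repeated names ('Gaules' and 'Grande-Bretagne' each appear twice). Whether the geocoder handles duplicates internally is not covered here, so as an optional step you could remove them yourself before sending the names to a gazetteer. This is a minimal sketch using only the standard library. `unique_places` and `mention_counts` are illustrative names.

```python
from collections import Counter

places = [
    "Romains", "Gaules", "Grande-Bretagne", "Boulogne-sur-mer", "Peutinger",
    "Morins", "Gaules", "Grande-Bretagne", "Gessoriacum", "Boulogne",
]

# dict.fromkeys keeps the first occurrence of each name, in order.
unique_places = list(dict.fromkeys(places))

# Keep the number of mentions, e.g. to scale markers on a map later.
mention_counts = Counter(places)

print(unique_places)
print(mention_counts.most_common(2))
```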

* Geocode the list of place entities with the Perdido Geocoder:

In [40]:
locations = geocoder(places)

* Map the results:

In [41]:
locations.get_folium_map()

The same method can be used for geocoding and mapping the NER results from `spaCy`.

## 7. Processing several documents at once

Usually, we want to process a sample of documents, not just one. 

As the process can be time consuming we will first select a small sample from our dataset to show how it works.

* We can use the [sample()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method from the pandas library to randomly select a small number of documents:


In [None]:
df_sampled = dataset.sample(3)
df_sampled

* Then, we keep only the text content of those documents:

In [None]:
contents = df_sampled.text

`geoparser` can parse a `string`, a `list` of strings, or a `pandas.Series`.
When the argument is a `list` or a `pandas.Series`, the geoparser returns a `PerdidoCollection` object; when it is a `string`, it returns a `Perdido` object.

In [None]:
docs = geoparser(contents)

In [None]:
for doc in docs:
    print('-----')
    displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True) 

## 8. Explore *Encyclopédie* geoparsed example dataset

Now, let's work with the same sample of articles (all of volume 7), but already processed in Perdido. Below, you can select a subsample from this dataset based on a keyword search, and then make a map of those results.


* Load the dataset from the library using the `load_edda_perdido()` function:


### 8.1 Load the data geocoded with Pleiades

Pleiades is a gazetteer of the classical world. Let's explore how it performs on this 18th-century text. The results presented here are still preliminary, and the geocoding with Pleiades still needs some improvement. At the moment, querying the Pleiades database only consists of a strict string match. This will be improved in further versions of the library.

We use this as an example to highlight the difficulties of dealing with historical data.
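
To see why a strict string match is limiting, consider the sketch below. It compares an exact lookup with a lookup that first normalizes case and accents. Everything here is hypothetical: the two-entry `toy_gazetteer`, its placeholder identifiers, and the helper functions are for illustration only and do not reflect how `Perdido` or Pleiades work internally.

```python
import unicodedata


def normalize(name):
    """Lower-case a name and strip accents and surrounding whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.casefold().strip()


# Hypothetical gazetteer: name -> placeholder identifier (not real Pleiades IDs).
toy_gazetteer = {"Gessoriacum": "toy-id-1", "Bononia": "toy-id-2"}
normalized_index = {normalize(name): ident for name, ident in toy_gazetteer.items()}


def strict_lookup(name):
    """Exact string match, sensitive to case and accents."""
    return toy_gazetteer.get(name)


def normalized_lookup(name):
    """Match after normalizing both the query and the gazetteer names."""
    return normalized_index.get(normalize(name))


# Article headwords in the Encyclopédie are often written in capitals.
print(strict_lookup("GESSORIACUM"))
print(normalized_lookup("GESSORIACUM"))
```

Normalization only addresses surface variation. Historical spelling variants and name changes (such as Gessoriacum and Boulogne) need richer gazetteer data.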

In [42]:
d = load_edda_perdido('pleiades')
dataset_pleiades = d['data']

In [43]:
df = dataset_pleiades.to_dataframe()
df.head()

Unnamed: 0,filename,volume,number,head,normClass,author,text,#_places,#_person,#_event,#_date,#_misc,#_locations
0,volume07-1099.tei,7,1099,Funérailles des Grecs,unclassified,unsigned,Funérailles des Grecs. Nous passons aux funéra...,5,14,0,0,1,0
1,volume07-1106.tei,7,1106,Funérailles des Misilimakinaks,unclassified,Jaucourt,Funérailles des Misilimakinaks. Il y a d'autre...,0,0,0,0,0,0
2,volume07-1112.tei,7,1112,FUNESTE,Grammaire,Diderot,"* FUNESTE, adj. (Gramm.) qui porte malheur ; c...",0,0,0,0,1,0
3,volume07-1175.tei,7,1175,Fusée,Manège | Maréchallerie,unsigned,"Fusée, (Manége, Maréchall.) nous appellons de ...",0,0,0,0,0,0
4,volume07-1258.tei,7,1258,GAGE,Jurisprudence,Boucher d'Argis,"GAGE, pignus, s. m. (Jurisprud.) est un effet ...",5,8,0,5,1,0


* Let's do a basic keyword search:

In [44]:
collection = dataset_pleiades.keyword_search(keyword='rome')
collection.to_dataframe()

Unnamed: 0,filename,volume,number,head,normClass,author,text,#_places,#_person,#_event,#_date,#_misc,#_locations
0,volume07-1258.tei,07,1258,GAGE,Jurisprudence,Boucher d'Argis,"GAGE, pignus, s. m. (Jurisprud.) est un effet ...",5,8,0,5,1,0
1,volume07-1808.tei,07,1808,GELÉE,Physique,Ratte,"GELÉE, s. f. (Physique.) froid par lequel l'ea...",13,11,0,0,2,0
2,volume07-1890.tei,07,1890,GENÈVE,Histoire | Politique,d'Alembert,"GENÈVE, (Hist. & Politiq.) Cette ville est sit...",27,21,0,8,5,0
3,volume07-2396.tei,07,2396,GORDIEN (Noeud),Littérature,Jaucourt,"GORDIEN (Noeud), s. m. (Littérat.) noeud du ch...",1,5,0,0,1,0
4,volume07-2802.tei,07,2802,Grecs (philosophie des),unclassified,Diderot,* Grecs (philosophie des). Je tirerai la divis...,53,100,0,0,26,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,volume07-1971.tei,07,1971,Géographie physique,unclassified,Desmarest,"Géographie physique, est la description raison...",139,7,0,0,1,0
68,volume07-3141.tei,07,3141,GUAXACA,Géographie,Jaucourt,"GUAXACA, (Géogr.) province de l'Amerique septe...",1,0,0,0,0,0
69,volume07-1298.tei,07,1298,Gageure,Jurisprudence,Boucher d'Argis,"Gageure, (Jurisprud.) est une convention sur u...",4,12,0,3,6,0
70,volume07-2168.tei,07,2168,Glace,Médecine,unsigned,"Glace, (Medecine.) Il y a différentes observat...",2,7,0,1,0,0


Let's take a brief look at these results.

Which articles have places that could be located? 
What surprises you about the results?

Now, let's move from the metadata to look at the text for one of the articles above: 'Funérailles des Grecs' (https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/1099/). 


In [45]:
# filter by metadata
collection = dataset_pleiades.filter_equal(column='head', value='Funérailles des Grecs')
collection.to_dataframe()

Unnamed: 0,filename,volume,number,head,normClass,author,text,#_places,#_person,#_event,#_date,#_misc,#_locations
0,volume07-1099.tei,7,1099,Funérailles des Grecs,unclassified,unsigned,Funérailles des Grecs. Nous passons aux funéra...,5,14,0,0,1,0


* Get the doc from the collection and display the NER annotations:

In [46]:
doc = collection.data[0]
displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True) 

* Filter by metadata for multiple articles:

In [47]:
collection = dataset_pleiades.filter_in(column='head', values=['GENÈVE', 'Funérailles des Grecs'])
collection.to_dataframe()

Unnamed: 0,filename,volume,number,head,normClass,author,text,#_places,#_person,#_event,#_date,#_misc,#_locations
0,volume07-1099.tei,7,1099,Funérailles des Grecs,unclassified,unsigned,Funérailles des Grecs. Nous passons aux funéra...,5,14,0,0,1,0
1,volume07-1890.tei,7,1890,GENÈVE,Histoire | Politique,d'Alembert,"GENÈVE, (Hist. & Politiq.) Cette ville est sit...",27,21,0,8,5,0


* Get only docs that contain place names:

In [51]:
collection = dataset_pleiades.filter_gt(column='#_places', value=0)
collection.to_dataframe()

Unnamed: 0,filename,volume,number,head,normClass,author,text,#_places,#_person,#_event,#_date,#_misc,#_locations
0,volume07-1099.tei,07,1099,Funérailles des Grecs,unclassified,unsigned,Funérailles des Grecs. Nous passons aux funéra...,5,14,0,0,1,0
1,volume07-1258.tei,07,1258,GAGE,Jurisprudence,Boucher d'Argis,"GAGE, pignus, s. m. (Jurisprud.) est un effet ...",5,8,0,5,1,0
2,volume07-1462.tei,07,1462,GAMBESON ou GOBESON,Histoire moderne,Le Blond,"GAMBESON ou GOBESON, s. m. (Hist. mod.) terme ...",1,2,0,0,0,0
3,volume07-1476.tei,07,1476,GANESBOROUGH,Géographie,Jaucourt,"GANESBOROUGH, (Géog.) ville à marche d'Anglete...",5,2,1,3,1,0
4,volume07-1489.tei,07,1489,GANSE,Rubanier,unsigned,"GANSE, s. f. (Rubanier.) espece de petit cordo...",1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1212,volume07-3383.tei,07,3383,Etoile,Imprimerie,unsigned,"Etoile, terme d'Imprimeur, a été oublié. C'est...",1,0,0,0,0,0
1213,volume07-845.tei,07,845,Fréne,unclassified,"Daubenton, Pierre","Fréne, grand arbre qui croit naturellement dan...",6,0,0,6,0,0
1214,volume07-879.tei,07,879,Frere,unclassified,Jaucourt,Frere ; ce nom étoit donné à des empereurs col...,1,3,0,0,0,0
1215,volume07-886.tei,07,886,"FRESANGE, ou FRESSENGE",Jurisprudence,Boucher d'Argis,"FRESANGE, ou FRESSENGE, s. f. (Jurispr.) est u...",1,3,0,0,2,0


* Get only docs that contain locations (i.e. geocoded place names):

In [49]:
collection = dataset_pleiades.filter_gt(column='#_locations', value=0)
collection.to_dataframe()

Unnamed: 0,filename,volume,number,head,normClass,author,text,#_places,#_person,#_event,#_date,#_misc,#_locations
0,volume07-1150.tei,7,1150,Fusain,unclassified,"Daubenton, Pierre","Fusain, arbrisseau qui se trouve communément d...",5,0,0,4,0,2
1,volume07-2529.tei,7,2529,Goutte,Horlogerie,Le Roy,"Goutte, parmi les Horlogers ; c'est une petite...",1,0,0,0,0,1
2,volume07-2771.tei,7,2771,GRAVIER,unclassified,unsigned,"GRAVIER, s. m. Voyez Arene.",1,0,0,0,0,1
3,volume07-1647.tei,7,1647,Gardien,unclassified,Boucher d'Argis,Gardien ; ce titre étoit quelquefois donné au ...,2,0,0,1,0,1
4,volume07-2776.tei,7,2776,GRAVITÉ,Physique | Méchanique,d'Alembert,"GRAVITÉ, s. f. (Phys. & Méchaniq.) on appelle ...",2,18,0,2,4,1
5,volume07-3079.tei,7,3079,Grue,Astronomie,d'Alembert,"Grue, (Astron.) constellation de l'hémisphere ...",2,0,0,0,0,2
6,volume07-2690.tei,7,2690,GRAPHIQUE,Astronomie,d'Alembert,"GRAPHIQUE, adjectif, (Astron.) on appelle en A...",1,0,0,0,0,2
7,volume07-2352.tei,7,2352,GOMME,Physique générale,Jaucourt,"GOMME, s. f. (Phys. génér.) suc végétal concre...",3,0,0,0,0,1
8,volume07-2463.tei,7,2463,GOUFFRE,Physique,unsigned,"GOUFFRE, s. m. (Phys.) les gouffres ne paroiss...",6,2,0,0,0,1
9,volume07-1965.tei,7,1965,GÉOCENTRIQUE,Astronomie,d'Alembert,"GÉOCENTRIQUE, adj. (Astron.) se dit de l'orbit...",8,1,0,0,0,2
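
One way to quantify the gap between recognized place names and geocoded ones is to compare the `#_places` and `#_locations` columns. The sketch below uses only the standard library and a hand-copied excerpt of the output above (headword and the two count columns for the ten rows returned); `EXCERPT` and the other variable names are illustrative.

```python
import csv
import io

# Excerpt copied from the output above: headword, #_places, #_locations.
EXCERPT = """head,#_places,#_locations
Fusain,5,2
Goutte,1,1
GRAVIER,1,1
Gardien,2,1
GRAVITÉ,2,1
Grue,2,2
GRAPHIQUE,1,2
GOMME,3,1
GOUFFRE,6,1
GÉOCENTRIQUE,8,2
"""

rows = list(csv.DictReader(io.StringIO(EXCERPT)))

total_places = sum(int(row["#_places"]) for row in rows)
total_locations = sum(int(row["#_locations"]) for row in rows)

# Articles reporting more locations than place names deserve a closer look.
more_locations_than_places = [
    row["head"] for row in rows if int(row["#_locations"]) > int(row["#_places"])
]

print(f"{total_locations} locations for {total_places} place names")
print("More locations than places:", more_locations_than_places)
```

Checks like this are a cheap way to spot rows worth inspecting by hand before trusting a map built from the results.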


* Select the article 'Gabale' and show the NER annotations:

Gabale - https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/1219/

In [52]:
collection = dataset_pleiades.filter_equal(column='head', value='Gabale')
collection.to_dataframe()

Unnamed: 0,filename,volume,number,head,normClass,author,text,#_places,#_person,#_event,#_date,#_misc,#_locations
0,volume07-1219.tei,7,1219,Gabale,Mythologie,Diderot,"* Gabale, s. m. (Myth.) dieu adoré à Emese & à...",2,0,0,0,0,1


In [53]:
doc = collection.data[0]
displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True) 

* Check if locations have been found for place names using Pleiades:

In [54]:
doc.geojson

'{"features": [], "type": "FeatureCollection"}'
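
Since `doc.geojson` is returned as a JSON string (as the output above shows), a quick way to check whether anything can be mapped is to parse it and count the features. A minimal sketch using only the standard library, with the output above pasted into `geojson_str` (an illustrative variable name):

```python
import json

# The string returned by `doc.geojson` for the article 'Gabale',
# copied from the output above.
geojson_str = '{"features": [], "type": "FeatureCollection"}'

feature_collection = json.loads(geojson_str)
features = feature_collection["features"]

print(feature_collection["type"], "with", len(features), "feature(s)")

# Each GeoJSON feature carries a geometry; list whatever coordinates are present.
for feature in features:
    print(feature["geometry"]["coordinates"])
```

An empty feature list means there is nothing to draw on a map for this article, even though the metadata row above reports one location.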

* Example of geocoding 'Héliopolis' with OpenStreetMap:

In [55]:
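# Import the Geocoder in case it was not imported earlier in the notebook
from perdido.geocoder import Geocoder

# Keep up to 10 candidate results for the query
geocoder = Geocoder(max_rows=10)
doc = geocoder('Héliopolis')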

In [56]:
m = doc.get_folium_map()
m

## 9. Toponym disambiguation using network analysis


As we have seen, it can be difficult to recognize a place name in a text, and even more difficult to locate that place on the earth. Researchers are actively improving methods for both of these tasks. In the last few years, the #multilingualDH community has helped to advocate for resources and methods that address the specific needs of historical texts in languages other than English.

In our own work, we have used network analysis both as an aid to toponym disambiguation and as an alternative to mapping.

We constructed a network from the citations of "géographie" articles in other *Encyclopédie* articles that are also classified as geography. As an alternative to georesolution, this approach diversifies the toolkit for visualizing place names in text when many of those names are unlikely to be geolocated. Instead of being mapped by geospatial location, place names in a text can be represented in a graph based on their topological, qualitative relationships.

>Moncla, L., McDonough, K., Vigier, D., Joliveau, T., & Brenon, A. (2019). Toponym disambiguation in historical documents using network analysis of qualitative relationships. Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Geospatial Humanities, 1–4. Chicago, IL, USA.



<table>
  <tr>
    <td> <img src="img/labels_indegree2.png" width ="500px"> </td>
    <td> <img src="img/nodes_betweenness+class2.png" width ="500px" > </td>
  </tr>
  <tr>
    <td>Node and label size indicate in-degree centrality</td>
    <td>Node size indicates betweenness centrality<br/> 
        colors refer to geographic feature types <br/> 
        (city: red, hydronym: blue, country: green, mountain: brown, unclassified: grey)</td>
  </tr>
</table> 
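
To make the in-degree measure concrete: in a directed citation network, the in-degree of a headword is the number of other articles that cite it. The sketch below computes it with the standard library on a hypothetical handful of cross-references. The placeholder headwords and edges are invented for illustration and are not taken from the *Encyclopédie*. Betweenness centrality requires shortest-path computations and is usually delegated to a graph library or to a tool such as Gephi.

```python
from collections import Counter

# Hypothetical cross-references: (citing headword, cited headword).
# Placeholder names only; these are not real Encyclopédie citations.
citations = [
    ("VILLE_A", "FLEUVE_X"),
    ("VILLE_B", "FLEUVE_X"),
    ("VILLE_C", "FLEUVE_X"),
    ("VILLE_A", "PAYS_Y"),
    ("VILLE_B", "PAYS_Y"),
    ("FLEUVE_X", "PAYS_Y"),
    ("PAYS_Y", "VILLE_A"),
]

nodes = sorted({headword for edge in citations for headword in edge})

# In-degree: how many distinct articles cite each headword.
in_degree = Counter(cited for _, cited in set(citations))

# Normalized in-degree centrality: divide by the number of other nodes.
in_degree_centrality = {
    node: in_degree[node] / (len(nodes) - 1) for node in nodes
}

for node, score in sorted(in_degree_centrality.items(), key=lambda item: -item[1]):
    print(f"{node:10s} in-degree={in_degree[node]}  centrality={score:.2f}")
```

In this toy graph the river and the country are cited most often, which is the kind of pattern the node and label sizes in the left-hand figure are meant to reveal.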



We also experimented with assigning geographic coordinates found in our French wikiGazetteer to each node (headword). Only 2,535 of the 13,734 nodes (about 18%) have coordinates.

Our first experiment is shown below. Colors identify clusters of nodes computed with the [modularity measure](https://en.wikipedia.org/wiki/Modularity_(networks)) implemented in Gephi.

<table><tr>
<td> <img src="img/geocodingEDDA1.png" width ="500"> </td>
<td> <img src="img/geocodingEDDA_network.png" width ="500" > </td>
</tr></table> 

## 10. Conclusion and discussion

### Discussion Part 2

1. How did testing Perdido on Encyclopédie text allow you to think about the texts in new ways?
2. In what ways would geoparsing post-Classical texts shed light on Classical history/historiography?
3. How would you describe some of the limitations of using geoparsing tools on historical texts?