# NLP for historical texts

This notebook is proposed by [L. Moncla](https://ludovicmoncla.github.io/) and [K. McDonough](https://www.turing.ac.uk/people/researchers/katherine-mcdonough) as part of the [Sunoikisis Digital Classics](https://github.com/SunoikisisDC/SunoikisisDC-2021-2022/wiki/SunoikisisDC-Summer-2022-Session-9) Summer course on NLP for historical texts (Session 9).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/ludovicmoncla/SunoikisisDC-Summer2022-Session9/blob/main/Tutorial-geoparsing.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ludovicmoncla/SunoikisisDC-Summer2022-Session9/main?filepath=Tutorial-geoparsing.ipynb)


## 1. Overview

In this tutorial, we will learn how to:


- Load a dataset from the `Perdido` library as a pandas dataframe (articles from Diderot and d'Alembert's *Encyclopédie*)
- Use the dataframe for simple data analysis
- Use the `Perdido` geoparser library to geoparse French texts
- Display geotagging results
- Map geocoding results
- Compare the NER results with `spaCy` and `Stanza` (Python libraries)
- Reflect on the limits of geoparsing historical French (and multilingual) texts

## 2. Introduction

Geoparsing (also known as toponym resolution) refers to the process of extracting place names from text and assigning geographic coordinates to them.
This involves two main tasks: geotagging and geocoding.
Geotagging consists in identifying the spans of text that refer to place names, while geocoding consists in finding unambiguous geographic coordinates for them.
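
The two tasks can be illustrated with a minimal sketch. It assumes a hypothetical two-entry gazetteer (`TOY_GAZETTEER`) with approximate coordinates and uses plain string matching. It is not how `Perdido` works internally; it only shows how geotagging (finding spans) and geocoding (attaching coordinates) differ.

```python
import re

# Hypothetical two-entry gazetteer: place name -> (latitude, longitude).
# The coordinates are approximate and only serve the illustration.
TOY_GAZETTEER = {
    "Frontignan": (43.45, 3.75),
    "Montpellier": (43.61, 3.88),
}


def geotag(text):
    """Return (start, end, surface form) for each gazetteer name found in text."""
    pattern = re.compile("|".join(re.escape(name) for name in TOY_GAZETTEER))
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def geocode(name):
    """Look a tagged name up in the gazetteer; None if it is missing."""
    return TOY_GAZETTEER.get(name)


text = "Frontignan, petite ville de France, près de Montpellier."
spans = geotag(text)                                     # geotagging step
located = {name: geocode(name) for _, _, name in spans}  # geocoding step
print(spans)
print(located)
```

Note that "France" is not tagged, simply because it is absent from the toy gazetteer: a lookup-only approach can never find a place it does not already know.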

Geographic text analysis research in the digital humanities has focused on projects analyzing modern English-language corpora. 
In this tutorial we highlight the difficulties of extracting and mapping geographical information from historical French texts.
As we will see, beyond the language problems specific to historical documents, the early modern period also lacks temporally appropriate gazetteers.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.


### 2.1 The Perdido Geoparser python library

[Perdido](https://github.com/ludovicmoncla/perdido) is a Python text geoparser. It provides NLP and GIS methods for geoparsing French texts.
It was initially developed as a REST API for extracting and retrieving displacements from French hiking descriptions, under the framework of the [PERDIDO](http://erig.univ-pau.fr/PERDIDO/) and [ANR Choucas](http://choucas.ign.fr) projects.

More recently, as part of the [GEODE project](https://geode-project.github.io) we have developed a custom version for historical documents and more specifically for the Encyclopédie.


In this tutorial we'll see how to use the `Perdido` Python library for geoparsing French texts.
We will geoparse volume 7 of the *Encyclopédie*, in the version released by the [ARTFL project](https://encyclopedie.uchicago.edu/), and show the limits of geotagging and geocoding historical documents.

### 2.2 Acknowledgement

Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu/), University of Chicago.


## 3. Setting up the environment



### 3.1 Install python packages

* If you configured your conda environment using the `requirements.txt` file, you can skip this step and go to the `Import` section.
* If you configured your conda environment using the `environment.yml` file or if you use a Google Colab environment, you need to install `perdido` using `pip`:

In [1]:
!pip install --upgrade perdido



* Then, if you already configured your conda environment, either with conda or pip (see readme file) you can skip the next cell.
* If you're running this notebook from Google Colab, you need to run the next cell.


In [None]:
!pip install stanza

### 3.2 Import the libraries

First, we will load some specific libraries from `Perdido` that we will use in this notebook. Next, we import some tools that will help us parse and visualize the text.

In [1]:
from perdido.geoparser import Geoparser
from perdido.datasets import load_edda_artfl, load_edda_perdido

from spacy import displacy


## 4. Getting started

In this notebook, we'll test out some basic queries of the *Encyclopédie* articles from volume 7 (Foang - Gythium, published in 1757). You can learn more about the other volumes [here](https://encyclopedie.uchicago.edu/node/102).


### 4.1 Loading the ARTFL *Encyclopédie* dataset

First, we load the data. You can view this sample dataset in the 'Data' directory [here](https://github.com/ludovicmoncla/SunoikisisDC-Summer2022-Session9).

The next cell loads the data, defines the data as `dataset`, and shows you the top 5 records (`head`). The data has been saved as a dataframe.

A dataframe is a two-dimensional table from the `pandas` library: here, each row is one article, and the columns hold the article text and its metadata.

In [None]:
d = load_edda_artfl()
dataset = d['data']
dataset.head()

### 4.2 Exploring the data

Now we have access to all the attributes and methods of the [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) object. For instance, we can easily print the number of rows in our dataframe, which corresponds to the number of articles in our corpus:

In [None]:
n = dataset.shape[0]
print('There are ' + str(n) + ' articles in the dataset.')

#### 4.2.1 Searching by metadata

Now that the data from the XML-TEI files are loaded into a python dataframe, we can select groups of articles based on the article metadata that was originally stored in the TEI header.

For instance, we can select articles based on their classification in the *Encyclopédie*. (There are actually a few different ways that the ARTFL *Encyclopédie* articles have been classified. In this notebook we will be using the `normclass` field, which normalizes classifications given at time of publication that had many spelling variants. In the cell below, we have hand selected all the `normclass` combinations that include Geography as well as Geography on its own.)

If we want all articles classified as 'Geography', we first list the classification values to match and then query the dataframe (the output is stored as a new dataframe, `df_geo`):

In [None]:
normclassGEO = ['Géographie', 'Géographie moderne',
                 'Géographie ancienne', 'Géographie moderne | Géographie ancienne',
                 'Géographie ancienne | Géographie moderne', 'Géographie sacrée', 'Géographie sainte',
                 'Géographie | Histoire ancienne', 'Géographie historique', 'Géographie | Histoire',
                 'Histoire | Géographie', 'Géographie | Histoire naturelle', 'Géographie | Mythologie',
                 'Géographie ancienne | Mythologie', 'Histoire moderne | Géographie',
                 'Géographie ancienne | Géographie sainte', 'Géographie ancienne | Géographie sacrée',
                 'Géographie sacrée | Géographie ancienne', 'Géographie du moyen âge', 'Géographie des Arabes',
                 'Géographie | Commerce', 'Histoire | Géographie ancienne',
                 'Géographie | Histoire ancienne | Histoire moderne', 'Géographie ancienne | Littérature | Histoire',
                 'Histoire naturelle | Géographie', 'Géographie | Histoire ancienne | Mythologie',
                 'Géographie moderne | Commerce', 'Géographie ancienne | Géographie antique',
                 'Géographie moderne | Histoire', 'Géographie | Histoire monastique',
                 'Géographie ancienne | Géographie moderne | Mythologie', 'Géographie ancienne | Histoire',
                 'Géographie ancienne | Littérature | Mythologie', 'Géographie ancienne | Médailles'
                 ]

* Query the dataframe for all articles matching one of the classes in our list:

In [None]:
df_geo = dataset.loc[dataset['normClass'].isin(normclassGEO)]
df_geo.head(10)

In [None]:
print('There are ' + str(df_geo.shape[0]) + ' geography articles')
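
The hand-selected list gives full control over which class combinations count as geography. As a point of comparison, here is a minimal sketch of a substring match on the classification column. It runs on an invented four-row dataframe (`toy`), not on the real dataset.

```python
import pandas as pd

# Toy stand-in for the article metadata; the rows are invented.
toy = pd.DataFrame({
    "head": ["FRONTIGNAN", "GAGE", "GESSORIACUM", "GELÉE"],
    "normClass": ["Géographie", "Jurisprudence",
                  "Géographie ancienne | Histoire", "Physique"],
})

# Exact membership, as with a hand-selected list of values.
exact = toy[toy["normClass"].isin(["Géographie"])]

# Substring match: any classification that mentions 'Géographie'.
# regex=False treats the pattern as a literal string.
mentions_geo = toy[toy["normClass"].str.contains("Géographie", regex=False)]

print(len(exact), len(mentions_geo))
```

The substring match is shorter to write, but it may also pull in combinations that a curated list leaves out on purpose, so the two approaches are not interchangeable.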

We can query based on any value in the dataframe (e.g. article metadata). For instance, we can query all the articles written by a specific author:

* Count the geography articles written by a single named author (Jaucourt):

In [None]:
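val = 'Jaucourt'
# build the mask from df_geo itself so that the mask and the dataframe share the same index
n = df_geo.loc[df_geo['author'] == val].shape[0]
print(str(n) + ' geography articles were written by ' + val)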

We can also easily show the number of articles per author:

In [None]:
df_geo.groupby(['author'])["filename"].count()

It is possible to show the value of one column in our dataframe for a specific row (i.e., for a specific article) based on its headword. For instance, to find out who wrote the article about Frontignan, we make this request:

In [None]:
dataset.loc[dataset['head'] == 'FRONTIGNAN'].author.item()

#### 4.2.2 Searching by text

It is also possible to display and search the full text content of the articles stored in the dataframe. 

* Show full text for a specific article:

In [None]:
dataset.loc[dataset['head'] == 'FRONTIGNAN'].text.item()

We can also perform a **keyword search** over the text content of all articles:

* Select articles that contain 'france':

In [None]:
# search corpus by keyword (val)

val = 'france'
df_2 = dataset[dataset['text'].str.contains(val, case=False)]
print(str(df_2.shape[0]) + ' articles contain the word \''+ val + '\'')
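
One caveat with `str.contains` is that it matches substrings, so 'france' is also found inside 'souffrance'. A minimal sketch on invented sentences, using a regular expression with word boundaries (`\b`) to keep whole-word matches only:

```python
import pandas as pd

# Invented three-row corpus for the illustration.
toy = pd.DataFrame({
    "head": ["A", "B", "C"],
    "text": [
        "ville de France, en Languedoc",
        "la souffrance des peuples",
        "les Grecs et les Romains",
    ],
})

# Plain substring search: 'souffrance' also contains 'france'.
substring_hits = toy[toy["text"].str.contains("france", case=False, regex=False)]

# Whole-word search: \b marks a word boundary in a regular expression.
word_hits = toy[toy["text"].str.contains(r"\bfrance\b", case=False, regex=True)]

print(list(substring_hits["head"]), list(word_hits["head"]))
```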

It is also possible to search by **phrases**. The expression "ville de" is commonly used in the *Encyclopédie* to define the country or region of a place. Searching by this phrase gives us a sense of the broader geographical coverage of the corpus. 

Here we extract all articles that contain the expression 'ville de':

In [None]:
dataset[dataset['text'].str.contains("ville de", case=False)]

Next, we can try a thematic search, for instance about 'esclavage' (slavery): 

In [None]:
dataset[dataset['text'].str.contains("esclavage", case=False)]

## 5. The NLP pipeline: Perdido Geoparser


In Natural Language Processing (NLP), the first steps before processing text content are usually to split the text into sentences and words (tokenization) and to assign each word its grammatical category (part of speech, or POS).

This allows the construction of more complex rules or queries compared to a simple keyword search. E.g. we would know that "city" is a noun, and we can perform a search for all nouns in the corpus.
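
To make this concrete, here is a minimal sketch of two queries over POS-tagged tokens. The tokens are tagged by hand and the tag names (`NOUN`, `PROPN`, ...) are an assumption made for readability; a real tagger produces its own tagset.

```python
# Hand-tagged tokens for "Frontignan, petite ville de France".
# The tags are written by hand for the illustration, not produced by a tagger.
tagged = [
    ("Frontignan", "PROPN"), (",", "PUNCT"), ("petite", "ADJ"),
    ("ville", "NOUN"), ("de", "ADP"), ("France", "PROPN"),
]

# Query 1: all common nouns.
nouns = [tok for tok, pos in tagged if pos == "NOUN"]

# Query 2: the pattern NOUN + "de" + PROPN, e.g. "ville de France".
matches = []
for i in range(len(tagged) - 2):
    (w1, p1), (w2, _), (w3, p3) = tagged[i:i + 3]
    if p1 == "NOUN" and w2.lower() == "de" and p3 == "PROPN":
        matches.append(f"{w1} {w2} {w3}")

print(nouns)
print(matches)
```

A keyword search for "ville de" would return the same span here, but the POS-based pattern generalizes to any noun ("bourg de", "royaume de", ...) without listing them one by one.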


These preprocessing steps are language dependent, so we have to choose the right tool for the language, style and period of our documents. This is a major difficulty when dealing with historical or ancient texts. For instance, it is difficult to find a POS tagger for pre-20th-century French, as the major well-known taggers are trained on contemporary corpora.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.



### 5.1 Perdido Geoparser


The `Perdido` geoparser uses [Treetagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) for part-of-speech tagging. 


The geotagging step of `Perdido` is performed using a cascade of finite-state transducers that define specific patterns for NER and for the identification of geographic information (spatial relations, etc.).
> Mauro Gaio and Ludovic Moncla (2019). “Geoparsing and geocoding places in a dynamic space context.“ In The Semantics of Dynamic Space in French: Descriptive, experimental and formal studies on motion expression, 66, 353.

Geoparsing in the digital humanities began with projects analyzing modern English-language corpora. But now, many researchers are developing projects for automatically identifying and geolocating places named in texts of many languages. 

Here we highlight the difficulties of extracting and mapping geographical information from historical French texts. In addition to language-related problems, which impact the quality of tokenization, POS tagging, and NER, geocoding presents its own challenges. Once place names have been identified in a text, correctly associating geographical coordinates with each place is difficult. **Gazetteers** are knowledge bases that help researchers link place names with information about a place, including its location.

For our custom version of the `Perdido` Geoparser, the geocoding task uses a simple gazetteer lookup method. Several gazetteers can be used:
 - Nominatim (i.e., OpenStreetMap) by default, 
 - Geonames, 
 - World Historical Gazetteer, 
 - Pleiades
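
A simple lookup can be sketched as follows. The records, field names and helper functions (`TOY_RECORDS`, `lookup`, `geocode_first`) are hypothetical and do not reproduce any of the gazetteers listed above. The sketch only shows the two typical failure modes of lookup-based geocoding: several candidates for one name, and no candidate at all.

```python
# Hypothetical gazetteer records: one name can map to several places.
# Names, countries and coordinates are illustrative, not from a real gazetteer dump.
TOY_RECORDS = [
    {"name": "Valence", "country": "France", "lat": 44.93, "lon": 4.89},
    {"name": "Valence", "country": "Spain", "lat": 39.47, "lon": -0.38},
    {"name": "Genève", "country": "Switzerland", "lat": 46.20, "lon": 6.14},
]


def lookup(name):
    """Return every record whose name matches, ignoring case."""
    return [r for r in TOY_RECORDS if r["name"].casefold() == name.casefold()]


def geocode_first(name):
    """Naive strategy: keep the first candidate, or None if there is none."""
    candidates = lookup(name)
    return candidates[0] if candidates else None


print(len(lookup("valence")))        # ambiguous name: several candidates
print(geocode_first("Gessoriacum"))  # ancient name missing from the toy gazetteer
```

Keeping the first candidate is the simplest possible choice and will often be wrong for ambiguous names, which is why the choice of gazetteer and a disambiguation strategy matter.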

Like using the most appropriate POS tagger, finding the best gazetteer for your corpus can be challenging. Luckily, for the ancient world, there are some excellent options. Here, you will also be able to test the [Pleiades gazetteer](https://pleiades.stoa.org/) and compare the results with the other contemporary gazetteers.


The PERDIDO Geoparser returns XML-TEI. The `<name>` element refers to named entities (proper nouns) and the type attribute indicates its class (place, person, etc.). The `<rs>` element refers to extended named entities (e.g. ville d'Egypte). The `<location>` element indicates that geographic coordinates were found during geocoding.  
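
To give an idea of how such output can be read, here is a minimal sketch that parses a hand-written snippet with the standard library. The snippet only imitates the elements described above: it is not actual `Perdido` output, and real TEI documents usually declare an XML namespace that the element lookups would have to include.

```python
import xml.etree.ElementTree as ET

# Hand-written snippet in the spirit of the description above;
# it is NOT actual Perdido output.
snippet = """
<p>
  <rs type="place">ville d'<name type="place">Egypte</name></rs>
  fondée par <name type="person">Alexandre</name>
</p>
"""
root = ET.fromstring(snippet)

# Collect every <name> element with its type attribute.
names = [(el.text, el.get("type")) for el in root.iter("name")]

# Extended entities: full text of each <rs>, with whitespace normalized.
extended = [" ".join("".join(rs.itertext()).split()) for rs in root.iter("rs")]

print(names)
print(extended)
```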


#### 5.1.1 Getting started with `Perdido`

* Get the content from one article from the dataset:

In [None]:
content = dataset.loc[dataset['head'] == 'FRONTIGNAN'].text.item()
content

* Instantiate a Geoparser object from the `Perdido` library. Specify that we are working with the *Encyclopédie* version.

In [None]:
geoparser = Geoparser(version="Encyclopedie")

* Now you can use this geoparser to geoparse text content. Let's try it with the `content` variable that we declared earlier:

In [None]:
doc = geoparser(content)

The geoparser returns a `Perdido` object. This object has several attributes and methods. We'll now see some of them.

* Accessing the XML-TEI result:

In [None]:
doc.tei

* Accessing the GeoJSON results generated during the geocoding phase:

In [None]:
doc.geojson
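
When reusing GeoJSON in other tools, keep in mind that the format stores positions as `[longitude, latitude]`. A minimal sketch with a hand-written `FeatureCollection` (not actual `Perdido` output; the `name` property is an assumption):

```python
import json

# Minimal hand-written FeatureCollection; not actual Perdido output.
raw = """
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature",
    "geometry": {"type": "Point", "coordinates": [3.75, 43.45]},
    "properties": {"name": "Frontignan"}}
 ]}
"""
collection = json.loads(raw)

# GeoJSON stores positions as [longitude, latitude], in that order.
points = []
for feature in collection["features"]:
    lon, lat = feature["geometry"]["coordinates"]
    points.append((feature["properties"]["name"], lat, lon))

print(points)
```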

* Transform the Perdido object into a dataframe (only some of the attributes are kept):

In [None]:
df = doc.to_dataframe()
df.head()

#### 5.1.2 Save the results in files

In [None]:
doc.to_xml('FRONTIGNAN-perdido.xml')

In [None]:
doc.to_geojson('FRONTIGNAN-perdido.geojson')

In [None]:
doc.to_csv('FRONTIGNAN-perdido.csv')

#### 5.1.3 Display named entities

Often, it is useful to visualize the output in sentence form. The `spaCy` library provides a useful tool for this: `displaCy`.


In [None]:
displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)

In [None]:
displacy.render(doc.to_spacy_doc(), style="span", jupyter=True)

#### 5.1.4 Mapping place names


For many projects, it is important to view the results of geoparsing on a map. Here, we can see the results plotted on a map. But remember, these are only the results for which coordinates could be found. Results that could not be matched to records in the gazetteer will not be mapped.


In [None]:
doc.get_folium_map()

#### 5.1.5 Another example

https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/2046/



In [None]:
content = dataset.loc[dataset['head'] == 'GESSORIACUM'].text.item()
content

In [None]:
doc = geoparser(content)
displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)

## 6. Comparison to other NER tools

`Perdido` is a custom geoparsing library for French language documents. How does it compare to other state-of-the-art libraries?

### 6.1 SpaCy

[SpaCy](https://spacy.io/) is a commonly-used NLP library that supports documents in many languages. `SpaCy` uses Machine Learning to perform NER (versus being a rule-based system).

* Install the `spaCy` pre-trained French language model:

In [None]:
!python -m spacy download fr_core_news_sm

* Import the `spaCy` library

In [None]:
import spacy

* Load the `spaCy` pre-trained French language model

In [None]:
spacy_parser = spacy.load('fr_core_news_sm')

* Run the NER pipeline

In [None]:
doc = spacy_parser(content)

* Show the named entities

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

* Display the named entities with `displaCy`

In [None]:
displacy.render(doc, style="ent", jupyter=True) 

### 6.2 Stanza

[Stanza](https://stanfordnlp.github.io/stanza/) is another NLP ML library developed by Stanford that is designed to work across many languages.

* Import the `Stanza` library and download the pre-trained French language model:

In [None]:
import stanza
# This can take a while depending on your internet connection (fr model is 572M)
stanza.download('fr')

* Declare the NER pipeline:

In [None]:
stanza_parser = stanza.Pipeline(lang='fr', processors='tokenize,ner')

* Run the NER pipeline:

In [None]:
doc = stanza_parser(content)


* Show the named entities:

In [None]:
for ent in doc.ents:
    print(ent.text, ent.type)

Comparing `Perdido`, `SpaCy`, and `Stanza` outputs, even for just 1-2 articles from our corpus, allows us to see common errors. One library might excel at identifying people, but struggle with complex place names. Another is better at capturing places named within phrases, but mixes up people and places. It is important to test multiple geoparsers for your corpus, and to understand how they can be adapted, in order to get the best results.
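
Such a comparison can also be made quantitative once a small hand-annotated reference exists. The following sketch uses invented entity sets (`gold`, `tool_a`, `tool_b`) for a single sentence and computes precision and recall. It illustrates the two error types described above and is not a measurement of any of the three libraries.

```python
# Invented outputs for one sentence: sets of (text, label) pairs per tool.
gold = {("Gessoriacum", "LOC"), ("Boulogne", "LOC"), ("Suétone", "PER")}
tool_a = {("Gessoriacum", "LOC"), ("Boulogne", "LOC")}  # misses the person
tool_b = {("Gessoriacum", "PER"), ("Boulogne", "LOC"), ("Suétone", "PER")}  # mislabels a place


def precision_recall(predicted, reference):
    """Share of predictions that are correct, and share of the reference that is found."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


for name, predicted in [("tool_a", tool_a), ("tool_b", tool_b)]:
    p, r = precision_recall(predicted, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```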

## 7. Processing several documents at once

Usually, we want to process a sample of documents, not just one. 

As the process can be time consuming we will first select a small sample from our dataset to show how it works.

* We can use the [sample()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method from the pandas library to randomly select a small number of documents


In [None]:
df_sampled = dataset.sample(4)
df_sampled
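
Because the selection is random, each run returns different articles. If you want to get the same sample again later, `sample()` accepts a `random_state` seed. A minimal sketch on an invented dataframe:

```python
import pandas as pd

# Invented dataframe standing in for the corpus.
toy = pd.DataFrame({"head": [f"ARTICLE {i}" for i in range(20)]})

# Same seed, same rows: the sample can be reproduced later.
first = toy.sample(4, random_state=42)
second = toy.sample(4, random_state=42)

print(first.equals(second))
```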

* Then, we keep only the text content of those documents:

In [None]:
contents = df_sampled.text

`geoparser` can parse a `string`, a `list` of strings or a `pandas.Series`.
When the argument is a `list` or a `pandas.Series`, the geoparser returns a `PerdidoCollection` object; when it is a `string`, it returns a `Perdido` object.

In [None]:
docs = geoparser(contents)

In [None]:
for doc in docs:
    print('-----')
    displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True) 

## 8. Explore *Encyclopédie* geoparsed example dataset

Now, let's work with the same sample of articles (all of volume 7), but already processed in Perdido. Below, you can select a subsample from this dataset based on a keyword search, and then make a map of those results.


* Load the dataset from the library using the `load_edda_perdido()` function:

In [2]:
d = load_edda_perdido('pleiades')
dataset = d['data']

In [None]:
for index, doc in enumerate(dataset):
    print(dataset.metadata[index])

In [3]:
df = dataset.to_dataframe()
df.head()

| | filename | volume | number | head | normClass | author | text | #_places | #_person | #_event | #_date | #_misc | #_locations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | volume07-1099.tei | 7 | 1099 | Funérailles des Grecs | unclassified | unsigned | Funérailles des Grecs. Nous passons aux funéra... | 5 | 14 | 0 | 0 | 1 | 0 |
| 1 | volume07-1106.tei | 7 | 1106 | Funérailles des Misilimakinaks | unclassified | Jaucourt | Funérailles des Misilimakinaks. Il y a d'autre... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | volume07-1112.tei | 7 | 1112 | FUNESTE | Grammaire | Diderot | \* FUNESTE, adj. (Gramm.) qui porte malheur ; c... | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | volume07-1175.tei | 7 | 1175 | Fusée | Manège \| Maréchallerie | unsigned | Fusée, (Manége, Maréchall.) nous appellons de ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | volume07-1258.tei | 7 | 1258 | GAGE | Jurisprudence | Boucher d'Argis | GAGE, pignus, s. m. (Jurisprud.) est un effet ... | 5 | 8 | 0 | 5 | 1 | 0 |


In [4]:
# keyword search
collection = dataset.keyword_search(keyword='rome')
collection.to_dataframe()

| | filename | volume | number | head | normClass | author | text | #_places | #_person | #_event | #_date | #_misc | #_locations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | volume07-1258.tei | 07 | 1258 | GAGE | Jurisprudence | Boucher d'Argis | GAGE, pignus, s. m. (Jurisprud.) est un effet ... | 5 | 8 | 0 | 5 | 1 | 0 |
| 1 | volume07-1808.tei | 07 | 1808 | GELÉE | Physique | Ratte | GELÉE, s. f. (Physique.) froid par lequel l'ea... | 13 | 11 | 0 | 0 | 2 | 0 |
| 2 | volume07-1890.tei | 07 | 1890 | GENÈVE | Histoire \| Politique | d'Alembert | GENÈVE, (Hist. & Politiq.) Cette ville est sit... | 27 | 21 | 0 | 8 | 5 | 0 |
| 3 | volume07-2396.tei | 07 | 2396 | GORDIEN (Noeud) | Littérature | Jaucourt | GORDIEN (Noeud), s. m. (Littérat.) noeud du ch... | 1 | 5 | 0 | 0 | 1 | 0 |
| 4 | volume07-2802.tei | 07 | 2802 | Grecs (philosophie des) | unclassified | Diderot | \* Grecs (philosophie des). Je tirerai la divis... | 53 | 100 | 0 | 0 | 26 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67 | volume07-1971.tei | 07 | 1971 | Géographie physique | unclassified | Desmarest | Géographie physique, est la description raison... | 139 | 7 | 0 | 0 | 1 | 0 |
| 68 | volume07-3141.tei | 07 | 3141 | GUAXACA | Géographie | Jaucourt | GUAXACA, (Géogr.) province de l'Amerique septe... | 1 | 0 | 0 | 0 | 0 | 0 |
| 69 | volume07-1298.tei | 07 | 1298 | Gageure | Jurisprudence | Boucher d'Argis | Gageure, (Jurisprud.) est une convention sur u... | 4 | 12 | 0 | 3 | 6 | 0 |
| 70 | volume07-2168.tei | 07 | 2168 | Glace | Médecine | unsigned | Glace, (Medecine.) Il y a différentes observat... | 2 | 7 | 0 | 1 | 0 | 0 |


In [5]:
# filter by metadata
collection = dataset.filter_equal(column='head', value='GENÈVE')
collection.to_dataframe()

| | filename | volume | number | head | normClass | author | text | #_places | #_person | #_event | #_date | #_misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | volume07-1890.tei | 7 | 1890 | GENÈVE | Histoire \| Politique | d'Alembert | GENÈVE, (Hist. & Politiq.) Cette ville est sit... | 27 | 21 | 0 | 8 | 5 |


In [6]:
# filter by metadata
collection = dataset.filter_in(column='head', values=['GENÈVE', 'GAGE'])
collection.to_dataframe()

| | filename | volume | number | head | normClass | author | text | #_places | #_person | #_event | #_date | #_misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | volume07-1258.tei | 7 | 1258 | GAGE | Jurisprudence | Boucher d'Argis | GAGE, pignus, s. m. (Jurisprud.) est un effet ... | 5 | 8 | 0 | 5 | 1 |
| 1 | volume07-1890.tei | 7 | 1890 | GENÈVE | Histoire \| Politique | d'Alembert | GENÈVE, (Hist. & Politiq.) Cette ville est sit... | 27 | 21 | 0 | 8 | 5 |


In [None]:
# filter by metadata
collection = dataset.filter_equal(column='#_locations', value=0)
collection.to_dataframe()

In [None]:
# get only docs that contain place names
docs_with_places = dataset.contains('place')

In [None]:
# get only docs that contain locations

In [None]:
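from IPython.display import display

# map a collection of documents
m = collection.get_folium_map()
if m is None:
    print('No location found!')
else:
    # a bare `m` inside an if/else block is not rendered by Jupyter, so display it explicitly
    display(m)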

In [None]:
# plot number of entities per documents

# get the document with the maximum places...



## 9. Toponym disambiguation using network analysis

In our work, we construct a network based on the citations between "géographie" articles.
We proposed using network analysis measures to establish an approximate location, defined by qualitative relations, for each named toponym in EDDA (the *Encyclopédie* of Diderot and d'Alembert). Throwing a list of decontextualized toponyms at an external resource like Geonames is risky. We therefore hypothesize that defining meaningful links between places can provide essential information to improve disambiguation (and potentially replace resolution as the end goal). We establish connections between places based on the citation of “headword” toponyms (those that appear as headwords of entries) in other EDDA entries.

>Moncla, L., McDonough, K., Vigier, D., Joliveau, T., & Brenon, A. (2019). Toponym disambiguation in historical documents using network analysis of qualitative relationships. Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Geospatial Humanities, 1–4. Chicago, IL, USA.

This method draws on relations in the corpus of EDDA articles, which improves disambiguation at a later stage with an external resource. We suggest the network as an alternative to geospatial representation, a useful proxy when no historical gazetteer exists for the source material's period. Our first experiments have shown that this approach goes beyond a simple text analysis and is able to find relations between toponyms that are not co-occurring in the same documents. Network relations are also usefully compared with disambiguated toponyms to evaluate geographical coverage, and the ways that geographical discourse is expressed, in historical texts.
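
As an illustration of the kind of measure involved, the sketch below computes in-degree (how many other entries cite a headword) on a handful of invented citation edges. The edges are hypothetical and the code is not the pipeline used in the paper; it only shows what in-degree captures.

```python
from collections import Counter

# Invented citation edges: (citing headword, cited headword).
edges = [
    ("FRONTIGNAN", "MONTPELLIER"),
    ("FRONTIGNAN", "LANGUEDOC"),
    ("MONTPELLIER", "LANGUEDOC"),
    ("GENÈVE", "RHÔNE"),
    ("LYON", "RHÔNE"),
    ("LYON", "LANGUEDOC"),
]

# In-degree: number of incoming citations per headword.
in_degree = Counter(target for _, target in edges)
ranked = in_degree.most_common()

print(ranked)
```

Headwords that are cited by many entries, such as a region or a river in this toy example, end up with a high in-degree even if their own entry cites nothing.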


<table>
  <tr>
    <td> <img src="img/labels_indegree2.png" width ="500px"> </td>
    <td> <img src="img/nodes_betweenness+class2.png" width ="500px" > </td>
  </tr>
  <tr>
    <td>Node and label size indicate in-degree centrality</td>
    <td>Node size indicates betweenness centrality<br/> 
        colors refer to geographic feature types <br/> 
        (city: red, hydronym: blue, country: green, mountain: brown, unclassified: grey)</td>
  </tr>
</table> 



We also ran some preliminary tests, assigning geographic coordinates found in our French wikiGazetteer to each node (headword). Only 2535 of the 13734 nodes have coordinates.

Our first experiment is shown below. Colors identify clusters of nodes computed with the [modularity measure](https://en.wikipedia.org/wiki/Modularity_(networks)) implemented in Gephi.

<table><tr>
<td> <img src="img/geocodingEDDA1.png" width ="500"> </td>
<td> <img src="img/geocodingEDDA_network.png" width ="500" > </td>
</tr></table> 