# Creating Geospatial Data from Historical Texts in French

This notebook is proposed by [L. Moncla](https://ludovicmoncla.github.io/) and [K. McDonough](https://www.turing.ac.uk/people/researchers/katherine-mcdonough) as part of the [GEODE](https://geode-project.github.io) (2020-2024) project.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GEODE-project/perdido-geoparsing-notebook/blob/master/Tutorial-geoparsing.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/GEODE-project/perdido-geoparsing-notebook/master?filepath=Tutorial-geoparsing.ipynb)


## 1. Overview


In this tutorial, we will learn how to:


- Load a dataset from the `Perdido` library as a Python dataframe (articles from Diderot and d'Alembert's *Encyclopédie*)
- Load data from TEI-XML files into a Python dataframe
- Use a dataframe for simple data analysis
- Use the `Perdido Geoparser` library for geoparsing French texts (geotagging + geocoding)
  - Display geotagging results
  - Map geocoding results
- Compare `Perdido` NER results with `spaCy` and `Stanza` (Python libraries)
- Reflect on the limits of geoparsing historical French (and multilingual) texts.


## 2. Introduction

Geoparsing refers to the process of extracting place names from text and assigning geographic coordinates to them.
This involves two main tasks: geotagging and geocoding.
Geotagging consists of identifying spans of text that refer to place names, while geocoding (or toponym resolution) consists of finding unambiguous geographic coordinates for them.
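
The two steps can be sketched on a toy example. The gazetteer `TOY_GAZETTEER`, its approximate coordinates, the sample sentence and the plain string matching are all assumptions made for illustration; this is not how `Perdido` works internally.

```python
# Toy illustration of the two geoparsing steps (not Perdido's implementation).
# Hand-entered, approximate (latitude, longitude) pairs.
TOY_GAZETTEER = {
    "Frontignan": (43.45, 3.75),
    "Montpellier": (43.61, 3.88),
}


def geotag(text, gazetteer):
    """Geotagging: return sorted (start, end, name) spans of known place names."""
    spans = []
    for name in gazetteer:
        start = text.find(name)
        while start != -1:
            spans.append((start, start + len(name), name))
            start = text.find(name, start + 1)
    return sorted(spans)


def geocode(spans, gazetteer):
    """Geocoding: attach (latitude, longitude) to each tagged place name."""
    return [(name, gazetteer[name]) for _, _, name in spans]


text = "Frontignan est une petite ville proche de Montpellier."
spans = geotag(text, TOY_GAZETTEER)
places = geocode(spans, TOY_GAZETTEER)
print(spans)
print(places)
```

Real texts need more than exact string matching: place names are ambiguous, spelled inconsistently, or absent from the gazetteer, which is what the rest of this tutorial explores.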

Geographic text analysis research in the digital humanities has focused on projects analyzing modern English-language corpora. 
In this tutorial, we highlight the difficulties of extracting and mapping geographical information from historical French texts.
As we'll see below, in addition to the language problems raised by historical documents, the early modern period lacks temporally appropriate gazetteers.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.



### 2.1 The Perdido Geoparser Python library

[Perdido](https://github.com/ludovicmoncla/perdido) is a Python text geoparser. It provides NLP and GIS methods for geoparsing French texts.
It was initially developed as a REST API for extracting and retrieving displacements from French hiking descriptions, during the [PERDIDO](http://erig.univ-pau.fr/PERDIDO/) and [ANR Choucas](http://choucas.ign.fr) projects.

More recently, as part of the [GEODE project](https://geode-project.github.io), we have developed a custom version for historical documents, and more specifically for the *Encyclopédie*.


In this tutorial, we'll see how to use the `Perdido` Python library for geoparsing French texts. 
We will apply geoparsing to volume 7 of the *Encyclopédie*, in the version released by the [ARTFL project](https://encyclopedie.uchicago.edu/), and we'll show the limits of geotagging and geocoding historical documents.

### 2.2 Acknowledgement

Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu/), University of Chicago.


## 3. Setting up the environment



### 3.1 Install python packages

* If you have already configured your environment using conda (`environment.yml`) or pip (`requirements.txt`), you can skip this step and go to section [3.2 Import the libraries](#3.2-Import-the-libraries).


In [None]:
!pip install perdido==0.1.28
!pip install stanza==1.4.2

### 3.2 Import the libraries

First, we will load some specific libraries from `Perdido` that we will use in this notebook. Next, we import some tools that will help us parse and visualize the text.

In [None]:
import warnings
warnings.filterwarnings('ignore')

from perdido.geoparser import Geoparser
from perdido.geocoder import Geocoder
from perdido.datasets import load_edda_artfl, load_edda_perdido
from spacy import displacy

import os
import lxml.etree as etree
import xml.dom.minidom as xml
import pandas as pd


## 4. Getting started

In this notebook, we'll test out some basic queries on the *Encyclopédie* articles from volume 7 (Foang - Gythium, published in 1757). You can learn more about the other volumes [here](https://encyclopedie.uchicago.edu/node/102).




### 4.1 Loading the ARTFL *Encyclopédie* dataset from the Perdido library

Now we will see how we can load the dataset directly from the Perdido library. 

The next cell loads the data from the Perdido library, stores it in the variable `dataset`, and shows the first 10 records (using the [head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method from the [Pandas](https://pandas.pydata.org/docs/index.html) library). The data is stored as a dataframe.

In [None]:
# Use the load_edda_artfl function to get the dataset
d = load_edda_artfl()

# The load_edda_artfl function returns a Python dictionary, containing the dataset as a dataframe in the 'data' entry
dataset = d['data']

# Display the first 10 rows of the dataframe
dataset.head(10)

### 4.2 Exploring the data

Now we have access to all the attributes and methods of the [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) object. 

For instance, we can print the number of rows in our dataframe which corresponds to the number of articles in our corpus:

In [None]:
n = dataset.shape[0]
print('There are ' + str(n) + ' articles in the dataset.')

It is possible to display and search the full text content of the articles stored in the dataframe. 

* Show full text for a specific article:

In [None]:
dataset.loc[dataset['head'] == 'FRONTIGNAN'].text.item()

More examples of how you can search and filter the dataframe structure are given in Section 9.

## 5. Perdido Geoparser


In Natural Language Processing (NLP), the first steps before processing text content are usually to split the text into sentences and words (tokenization) and to assign each word its grammatical category (part-of-speech, or POS, tagging). 

This allows the construction of more complex rules or queries compared to a simple keyword search. E.g. we would know that "city" is a noun, and we can perform a search for all nouns in the corpus.
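
As a small sketch of what this enables, the cell below filters a hand-tagged sentence by part of speech. The sentence and its tags are written by hand for illustration; the tag names (`NOM`, `VER`, etc.) only imitate a French tagset and are assumptions, not the output of any tagger.

```python
# Hand-tagged tokens: (word, part-of-speech) pairs written for illustration.
tagged = [
    ("La", "DET"), ("ville", "NOM"), ("est", "VER"), ("située", "VER"),
    ("près", "ADV"), ("de", "PRP"), ("la", "DET"), ("mer", "NOM"),
]


def words_with_tag(tagged_tokens, tag):
    """Return the words whose part-of-speech tag equals `tag`."""
    return [word for word, pos in tagged_tokens if pos == tag]


nouns = words_with_tag(tagged, "NOM")
print(nouns)
```

A keyword search for "ville" would only find that one word, whereas the tag-based query returns every noun, whatever its spelling.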


These preprocessing steps are language dependent, so we have to choose the right tool for the language, style and period of our documents. This is a major difficulty when dealing with historical or ancient texts. For instance, it is difficult to find a POS tagger for pre-20th-century French, because the most widely used taggers are trained on contemporary corpora.

> McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33, 2498–2522.

For now, the `Perdido` geoparser uses [Treetagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) for preprocessing (tokenization, lemmatization and part-of-speech tagging). 


The geotagging step of `Perdido` is performed using a cascade of finite-state transducers that define specific patterns for NER and for the identification of geographic information (spatial relations, etc.). 
> Mauro Gaio and Ludovic Moncla (2019). “Geoparsing and geocoding places in a dynamic space context.“ In The Semantics of Dynamic Space in French: Descriptive, experimental and formal studies on motion expression, 66, 353.
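
To give a feel for rule-based tagging, here is a much simplified sketch that uses a single regular expression in place of a transducer cascade. The rule (a feature noun, a linking preposition, then a capitalised name), the list of feature nouns and the sample sentence are all hypothetical; `Perdido`'s actual patterns are richer and operate on POS-tagged tokens.

```python
import re

# Hypothetical rule: feature noun + linking preposition + capitalised name.
FEATURE = r"\b(?P<feature>ville|bourg|rivière|île)"
LINK = r"\s+(?:de\s+la\s+|de\s+|du\s+|d')"
NAME = r"(?P<name>[A-ZÉ][\w-]+)"
RULE = re.compile(FEATURE + LINK + NAME)


def tag_places(text):
    """Return one annotation per match, with the character span of the name."""
    return [
        {
            "feature": m.group("feature"),
            "name": m.group("name"),
            "start": m.start("name"),
            "end": m.end("name"),
        }
        for m in RULE.finditer(text)
    ]


sentence = "Petite ville de France, près de la rivière du Lez."
annotations = tag_places(sentence)
for annotation in annotations:
    print(annotation)
```

The feature noun matched by the rule ("ville", "rivière") also hints at the type of place, which a purely statistical tagger does not provide directly.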

Geoparsing in the digital humanities began with projects analyzing modern English-language corpora. But now, many researchers are developing projects for automatically identifying and geolocating places named in texts of many languages. 

Here we highlight the difficulties of extracting and mapping geographical information from historical French texts. In addition to language-related problems, which affect the quality of tokenization, POS tagging, and NER, geocoding presents its own challenges. Once place names have been identified in a text, correctly associating geographical coordinates with each place is difficult. **Gazetteers** are knowledge bases that help researchers link place names with information about the place, including its location. 

For our custom version of the `Perdido` Geoparser, the geocoding task uses a simple gazetteer lookup method. Several gazetteers can be used:
 - Nominatim (i.e., OpenStreetMap) by default, 
 - Geonames, 
 - World Historical Gazetteer, 
 - IGN (French National Institute of Geographic Information)

Like using the most appropriate POS tagger, finding the best gazetteer for your corpus can be challenging. 
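
The sketch below illustrates why the choice of gazetteer matters, using a simple lookup over two hand-made gazetteers. The dictionaries `MODERN` and `HISTORICAL`, their approximate coordinates and the `lookup` helper are assumptions for illustration; they do not reproduce the content of any real gazetteer or the `Perdido` API.

```python
# Two hand-made gazetteers: name -> list of approximate (latitude, longitude).
MODERN = {"Frontignan": [(43.45, 3.75)]}
HISTORICAL = {"Grumentum": [(40.28, 15.89)]}


def lookup(name, sources, max_rows=1):
    """Return up to `max_rows` candidates per source as (source, lat, lon)."""
    results = []
    for source_name, gazetteer in sources.items():
        for lat, lon in gazetteer.get(name, [])[:max_rows]:
            results.append((source_name, lat, lon))
    return results


sources = {"modern": MODERN, "historical": HISTORICAL}
for name in ["Frontignan", "Grumentum", "Formose"]:
    print(name, lookup(name, sources))
```

A name missing from every source returns an empty list, which is the situation we regularly face with historical place names.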

### 5.1 Getting started with `Perdido`

* Get the content from the article 'FRONTIGNAN' ([https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/1002/](https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/1002/)) from the dataset:

In [None]:
content = dataset.loc[dataset['head'] == 'FRONTIGNAN'].text.item()
content

* Create a Geoparser object from the `Perdido` library. Specify that we are working with the *Encyclopédie* version.

In [None]:
geoparser = Geoparser(version="Encyclopedie")

* Now you can use this geoparser for geoparsing text content. Let's try with the `content` variable that we declared before:

In [None]:
doc = geoparser(content)

The geoparser returns a `Perdido` object. This object has several attributes and methods. We'll now see some of them.

* Accessing the XML-TEI result:

In [None]:
doc.tei

* We can use the `xml` library to get something easier to read:

In [None]:
print(xml.parseString(doc.tei).toprettyxml(indent=' ')) 

* Accessing the GeoJSON results generated during the geocoding phase:

In [None]:
doc.geojson
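
GeoJSON stores point coordinates in longitude, latitude order, which is easy to get wrong. The sketch below reads names and coordinates from a hand-written `FeatureCollection`; we assume here that the geocoding output follows this standard layout, and the `name` property key and the coordinates are illustrative assumptions.

```python
# Hand-written GeoJSON sample; the "name" property key is an assumption.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [3.75, 43.45]},
            "properties": {"name": "Frontignan"},
        },
    ],
}


def point_features(geojson):
    """Return (name, latitude, longitude) for every Point feature."""
    rows = []
    for feature in geojson.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "Point":
            continue
        # GeoJSON positions are ordered longitude first, then latitude.
        lon, lat = geometry["coordinates"][:2]
        properties = feature.get("properties") or {}
        rows.append((properties.get("name"), lat, lon))
    return rows


rows = point_features(sample)
print(rows)
```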

* Transform the Perdido object into a dataframe (only some of the attributes are kept):

In [None]:
df = doc.to_dataframe()
df.head()

### 5.2 Display named entities

Often, it is useful to visualize the output in sentence form. The `spacy` library provides a useful tool for this: `displacy`.


In [None]:
displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)

In [None]:
displacy.render(doc.to_spacy_doc(), style="span", jupyter=True)

### 5.3 Map place names


For many projects, it is important to view the results of geoparsing on a map. Here, we can see the results plotted on a map. But remember, these are only the results for which coordinates could be found. Results that could not be matched to records in the gazetteer will not be mapped.

* Here we see the geocoding results for 'Frontignan' mapped:


In [None]:
doc.get_folium_map()

### 5.4 Save/export the results in external files

In [None]:
doc.to_xml('FRONTIGNAN-perdido.xml')

In [None]:
doc.to_geojson('FRONTIGNAN-perdido.geojson')

In [None]:
doc.to_csv('FRONTIGNAN-perdido.csv')

### 5.5 Try another example

GESSORIACUM - https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/2046/



In [None]:
content = dataset.loc[dataset['head'] == 'GESSORIACUM'].text.item()
content

In [None]:
doc = geoparser(content)
displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)

In [None]:
doc.get_folium_map()

## 6. Comparison to other NER tools

`Perdido` is a custom geoparsing library for French language documents. How does it compare to other state-of-the-art libraries?

Comparing `Perdido`, `SpaCy`, and `Stanza` outputs, even for just 1-2 articles from our corpus, allows us to see common errors. One library might excel at identifying people, but struggle with complex place names. Another is better at capturing places named within phrases, but mixes up people and places. It is important to test multiple geoparsers for your corpus, and to understand how they can be adapted, in order to get the best results.
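
One simple way to organise such a comparison is to treat each tool's output as a set of (text, label) pairs. The two sets below are invented for illustration and are not actual results from any of the libraries.

```python
# Hypothetical outputs of two NER tools on the same text: (text, label) pairs.
tool_a = {("Frontignan", "place"), ("Languedoc", "place"), ("Montpellier", "place")}
tool_b = {("Frontignan", "place"), ("Languedoc", "person")}


def compare(a, b):
    """Summarise entities found by only one tool and label disagreements."""
    labels_a, labels_b = dict(a), dict(b)
    texts_a, texts_b = set(labels_a), set(labels_b)
    shared = texts_a & texts_b
    return {
        "only_a": sorted(texts_a - texts_b),
        "only_b": sorted(texts_b - texts_a),
        "label_conflicts": sorted(t for t in shared if labels_a[t] != labels_b[t]),
    }


report = compare(tool_a, tool_b)
print(report)
```

This sketch assumes each entity text appears with a single label per tool; a fuller comparison would also align character offsets.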

### 6.1 SpaCy

[SpaCy](https://spacy.io/) is a commonly-used NLP library that supports documents in many languages. `SpaCy` uses Machine Learning to perform NER (versus being a rule-based system).

* Install the `spaCy` French pre-trained language model:

In [None]:
!python -m spacy download fr_core_news_sm

* Import the `spaCy` library

In [None]:
import spacy

* Load the `spaCy` French pre-trained language model

In [None]:
spacy_parser = spacy.load('fr_core_news_sm')

* Run the NER pipeline

In [None]:
doc = spacy_parser(content)

* Show the named entities

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

* Display the named entities with `displaCy`

In [None]:
displacy.render(doc, style="ent", jupyter=True) 

### 6.2 Stanza

[Stanza](https://stanfordnlp.github.io/stanza/) is another NLP ML library developed by Stanford that is designed to work across many languages.

* Import the `Stanza` library and download the pre-trained French language model:

In [None]:
import stanza
# This can take a while depending on your internet connection (fr model is 572M)
stanza.download('fr')

* Declare the NER pipeline:

In [None]:
stanza_parser = stanza.Pipeline(lang='fr', processors='tokenize,ner')

* Run the NER pipeline:

In [None]:
doc = stanza_parser(content)


* Show the named entities:

In [None]:
for ent in doc.ents:
    print(ent.text, ent.type)


### 6.3 Rule-based vs. machine learning methods


<img src="img/annotation_schema_guideline.png" alt="annotation schema" width="800px"/>

## 7. Geocoding / Toponym resolution


### 7.1 Quick start with the Geocoder class from the `Perdido` library

The `Geocoder()` can take several parameters (all optional) such as:
1. sources: list of gazetteers (possible values are: 'nominatim' (default), 'geonames', 'whg', 'pleiades', 'ign' (only for France))
2. max_rows: maximum number of toponym candidates returned by the gazetteer (default = 1)
3. country_code: filter the results per country
4. bbox: filter the results with a bounding box (http://bboxfinder.com)

* The next cell shows how to create a geocoder and geocode 'Frontignan':

In [None]:
geocoder = Geocoder()
doc = geocoder('Frontignan')

* Show the geojson results:

In [None]:
doc.geojson

* Map the results:

In [None]:
doc.get_folium_map()

* Another example with 'Formose':

In [None]:
content = dataset.loc[dataset['head'] == 'FORMOSE'].text.item()
content

In [None]:
geocoder = Geocoder()
doc = geocoder('Formose')
doc.get_folium_map()

### 7.2 Gazetteer comparison and geocoding parameters




* Geocoding 'Grumentum':

In [None]:
content = dataset.loc[dataset['head'] == 'GRUMENTUM'].text.item()
content

#### 7.2.1 Nominatim (OpenStreetMap) - default gazetteer

In [None]:
geocoder = Geocoder(sources=['nominatim'])
doc = geocoder('Grumentum')
doc.get_folium_map()

In [None]:
doc.geojson

#### 7.2.2 Geonames

http://www.geonames.org



In [None]:
geocoder = Geocoder(sources=['geonames'])
doc = geocoder('Grumentum')
doc.get_folium_map()

In [None]:
doc.geojson

#### 7.2.3 World Historical Gazetteer


https://whgazetteer.org

In [None]:
geocoder = Geocoder(sources=['whg'])
doc = geocoder('Grumentum')
doc.get_folium_map()

In [None]:
doc.geojson

<img src="img/grumento_whg.png" alt="grumento whg" width="400px"/>

#### 7.2.4 Querying all gazetteers at once

In [None]:
geocoder = Geocoder(sources=['nominatim', 'geonames', 'whg'])
doc = geocoder('Grumentum')
doc.get_folium_map()

#### 7.2.5 Getting more toponym candidates from gazetteers

In [None]:
geocoder = Geocoder(sources=['nominatim', 'geonames', 'whg'], max_rows=20)
doc = geocoder('Grumentum')
doc.get_folium_map()

In [None]:
doc.geojson

#### 7.2.6 Geoparsing 'Grumentum' article

In [None]:
geoparser = Geoparser(sources=['whg'])
doc = geoparser(content)

displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)

doc.get_folium_map()

In [None]:
places = [d.text for d in doc.named_entities if d.tag == 'place' or d.tag == 'unknown']
places

Since the geotagging is already done and we have the list of named entities, we can now use the `Geocoder` class directly.

In [None]:
geocoder = Geocoder(sources=['nominatim'])
doc = geocoder(places)
doc.get_folium_map()

In [None]:
geocoder = Geocoder(sources=['geonames'])
doc = geocoder(places)
doc.get_folium_map()

In [None]:
geocoder = Geocoder(sources=['nominatim', 'geonames', 'whg'], max_rows=10)
doc = geocoder(places)
doc.get_folium_map()

* Setting a bounding box to restrict the results to a specific area:

In [None]:
geocoder = Geocoder(sources=['nominatim', 'geonames', 'whg'], max_rows=10, bbox=[13.079224,38.856820,18.753662,41.037931])
doc = geocoder(places)
doc.get_folium_map()
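
To make the effect of the bounding box explicit, here is a minimal sketch of the filtering logic. It assumes the four values are ordered `[min_lon, min_lat, max_lon, max_lat]`, which is consistent with the values used above (an area in southern Italy); the two candidates and their approximate coordinates are invented for illustration.

```python
def in_bbox(lon, lat, bbox):
    """True if the point lies inside bbox = [min_lon, min_lat, max_lon, max_lat]."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


bbox = [13.079224, 38.856820, 18.753662, 41.037931]

# Invented candidates as (label, longitude, latitude).
candidates = [
    ("candidate A", 15.89, 40.28),  # inside the box
    ("candidate B", 2.35, 48.86),   # far outside the box
]

kept = [label for label, lon, lat in candidates if in_bbox(lon, lat, bbox)]
print(kept)
```

A bounding box is a blunt but effective filter: it removes homonyms located elsewhere in the world, at the risk of discarding correct places that fall just outside the box.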

#### 7.2.7 Geoparsing 'Frontignan'

In [None]:
content = dataset.loc[dataset['head'] == 'FRONTIGNAN'].text.item()
content

In [None]:
geoparser = Geoparser(version="Encyclopedie", sources=['nominatim', 'geonames', 'ign'])
doc = geoparser(content)

displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)
doc.get_folium_map()

One solution for toponym disambiguation is to build clusters based on spatial density and to keep only the cluster that contains the most distinct toponyms.

> Moncla, L., Renteria-Agualimpia, W., Nogueras-Iso, J., & Gaio, M. (2014). Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus. In Proceedings of the 22nd acm sigspatial international conference on advances in geographic information systems (pp. 183-192).

Perdido provides this solution with the `cluster_disambiguation()` method.

In [None]:
doc.cluster_disambiguation(0.4) 

doc.get_folium_map()
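
The idea behind this kind of disambiguation can be sketched with a simple distance-threshold grouping, used here in place of a proper density-based algorithm. The candidates, their approximate coordinates and the `max_km` threshold are assumptions for illustration; this is not the implementation behind `cluster_disambiguation()`, and `max_km` is unrelated to the parameter passed above.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def cluster_candidates(candidates, max_km):
    """Group candidates linked by a chain of neighbours closer than max_km."""
    unvisited = list(range(len(candidates)))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        members = []
        while stack:
            i = stack.pop()
            members.append(candidates[i])
            near = [
                j for j in unvisited
                if haversine_km(candidates[i][1], candidates[i][2],
                                candidates[j][1], candidates[j][2]) <= max_km
            ]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
        clusters.append(members)
    return clusters


def keep_best_cluster(candidates, max_km):
    """Keep the cluster that contains the most distinct toponym names."""
    clusters = cluster_candidates(candidates, max_km)
    return max(clusters, key=lambda c: len({name for name, _, _ in c}))


# Invented candidates as (name, latitude, longitude); one is a distant homonym.
candidates = [
    ("Frontignan", 43.45, 3.75),
    ("Montpellier", 43.61, 3.88),
    ("Montpellier", 44.26, -72.58),
]
best = keep_best_cluster(candidates, max_km=50)
print(best)
```

The assumption behind the method is that places mentioned in the same text tend to be close to one another, so the isolated homonym is discarded.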

## 8. Wrapping up

## 9. Going further

### 9.1 Extracting metadata and content from XML-TEI

Here we assume that we have access to a directory with the corpus of documents. 
In our case, documents are XML-TEI files.

In [None]:
path = './data/EDdA_vol7/' # path of the directory containing the corpus of documents

# select one document for testing
file = 'volume07-1002.tei' # FRONTIGNAN: https://artflsrv03.uchicago.edu/philologic4/encyclopedie1117/navigate/7/1002/

# get the XML-TEI content of the document
root = etree.parse(path + file, etree.XMLParser(remove_blank_text=True)).getroot()

# print the XML-TEI content
print(xml.parseString(etree.tostring(root)).toprettyxml(indent=' ')) 

In the following cell, we define a function for parsing and extracting metadata and text content from an XML-TEI file.
In this example, we only extract from the metadata the normclass (classification of the article, e.g. 'Géographie'), the head (head word of the article), and the author of the article. Then, we also extract the textual content as raw text.

In [None]:
def getDataFromEDDATEI(file_path, filename):
    """Extract metadata and raw text from one Encyclopédie XML-TEI file."""
    d = []
    try:
        # file names look like 'volume07-1002.tei'
        volume = filename[6:8]
        number = filename[9:-4]
        head = ''
        normClass = ''
        author = ''
        txtContent = ''
        root = etree.parse(file_path+filename).getroot()
        div1 = root.find('./text/body/div1')
        # find() returns None when the element is missing
        if div1 is not None:
            for elt in div1:
                if elt.tag == 'p':
                    # concatenate the text of all paragraphs
                    txtContent += ''.join(elt.itertext())
                    txtContent = txtContent.replace('\n', ' ').strip()
                elif elt.tag == 'index':
                    # metadata are stored as <index type="..." value="..."/>
                    if elt.get('type') == 'normclass':
                        normClass = elt.get('value')
                    if elt.get('type') == 'head':
                        head = elt.get('value')
                    if elt.get('type') == 'author':
                        author = elt.get('value')
        d = [filename, volume, number, head, normClass, author, txtContent]
    except etree.XMLSyntaxError:
        # skip files that are not well-formed XML (an empty list is returned)
        pass
    return d

* Use this function to get the metadata of an XML-TEI file:

In [None]:
getDataFromEDDATEI(path, file)
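
If you do not have the corpus directory at hand, the structure that the function expects can be sketched on an inline document using only the standard library. The sample below is invented (head word, author and text are placeholders); it only mirrors the `./text/body/div1` path and the `index` elements used by the function above.

```python
import xml.etree.ElementTree as ET

# Invented XML-TEI sample that mirrors the structure parsed above.
SAMPLE_TEI = """<TEI>
  <text><body><div1>
    <index type="head" value="EXEMPLE"/>
    <index type="normclass" value="Géographie"/>
    <index type="author" value="Anonyme"/>
    <p>EXEMPLE, petite ville imaginaire.</p>
    <p>Elle sert ici <i>uniquement</i> de test.</p>
  </div1></body></text>
</TEI>"""

root = ET.fromstring(SAMPLE_TEI)
div1 = root.find('./text/body/div1')

# Metadata: one <index type="..." value="..."/> element per field.
meta = {elt.get('type'): elt.get('value') for elt in div1.findall('index')}

# Text: join the paragraphs, including text nested in child elements.
text = ' '.join(''.join(p.itertext()).strip() for p in div1.findall('p'))

print(meta)
print(text)
```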

To make these data easy to analyse and use, we will now load the information about all the documents in our directory into a [Python dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html):

In [None]:
data = []
for doc in os.listdir(path):
    if doc[-4:] == '.tei':
        data.append(getDataFromEDDATEI(path, doc))
df = pd.DataFrame(data, columns=['filename', 'volume', 'number', 'head', 'normClass', 'author', 'txtContent'])
df = df.dropna()
df = df.sort_values(['volume', 'number']).reset_index(drop = True)

df.head(10) # show the first 10 rows of the dataframe
#df.tail(10) # show the last 10 rows of the dataframe

Now we have access to all the attributes and methods of the dataframe object. For instance, we can easily print the number of rows in our dataframe, which corresponds to the number of articles in our corpus:

In [None]:
n = df.shape[0]
print('There are ' + str(n) + ' articles in the input directory')

### 9.2 Searching by metadata

We can select articles based on their classification in the *Encyclopédie*. (There are actually a few different ways that the ARTFL *Encyclopédie* articles have been classified. In this notebook we will be using the `normclass` field, which normalizes classifications given at time of publication that had many spelling variants).

If we want all articles classified as 'Geography', we can make the request as follows (the output is stored as a new dataframe `df_geo`): 

In [None]:
req = 'Géographie'
df_geo = dataset[dataset['normClass'].str.contains(req, case=False)]

n = df_geo.shape[0]
print('There are ' + str(n) + ' geography articles ('+ req +')')

We can query based on any value in the dataframe (e.g. article metadata). For instance, we can query all the articles written by a specific author:

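* Count the geography articles written by a single named author (Jaucourt)

In [None]:
val = 'Jaucourt'
# build the mask from df_geo itself so that the index matches the dataframe being filtered
n = df_geo.loc[df_geo['author'] == val].shape[0]
print(str(n) + ' geography articles were written by '+ val)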

We can also easily show the number of articles per author:

In [None]:
df_geo.groupby(['author'])["filename"].count()

It is possible to show the value of one column in our dataframe for a specific row (i.e., a specific article) based on its head word. For instance, to find out who wrote the article about Frontignan, we make this request (replace `author` with `text` to see the content of the article):

In [None]:
dataset.loc[dataset['head'] == 'FRONTIGNAN'].author.item()

We can also perform a **keyword search** over the text content of all articles:

* Select articles that contain 'france':

In [None]:
# search corpus by keyword (val)
val = 'france'
df_2 = dataset[dataset['text'].str.contains(val, case=False)]
print(str(df_2.shape[0]) + ' articles contain the word \''+ val + '\'')

It is also possible to search by **phrases**. The expression "ville de" is commonly used in the *Encyclopédie* to define the country or region of a place. Searching by this phrase gives us a sense of the broader geographical coverage of the corpus. 

Here we extract all articles that contain the expression 'ville de':

In [None]:
dataset[dataset['text'].str.contains("ville de", case=False)]

### 9.3 Processing several documents at once

Usually, we want to process a sample of documents, not just one. 

As the process can be time-consuming, we will first select a small sample from our dataset to show how it works.

* We can use the [sample()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method from the pandas library to randomly select a small number of documents:

In [None]:
df_sampled = dataset.sample(3)
df_sampled

* Then, we keep only the text content of those documents:

In [None]:
contents = df_sampled.text

`geoparser` can parse a `string`, a `list` of strings or a `pandas.Series`.
When the argument is a `list` or a `pandas.Series`, the geoparser returns a `PerdidoCollection` object; when it is a `string`, it returns a `Perdido` object.

In [None]:
docs = geoparser(contents)

In [None]:
for doc in docs:
    print('-----')
    displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True) 