# Extracting location names and data from Gibbon

In this notebook we will put together and practice many of the skills we have learned so far this term. Starting with just the raw text files from Gibbon's _Decline and Fall_, we will create a DataFrame containing location names, location counts, and location data.

The code in this notebook may seem complex, but if you read through it carefully, you will likely understand what most of the code is doing.


## Set-up

In [2]:
# install necessary libraries. To stop the notebook from printing all of the
# install output, add "%%capture" as the first line of this cell. Leave it out
# if you need to troubleshoot.
!pip install stanza



In [3]:
# install necessary libraries. To stop the notebook from printing all of the
# install output, add "%%capture" as the first line of this cell. Leave it out
# if you need to troubleshoot.
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9656 sha256=470a3f0123b6045e15e3585aa050461c317b37775f043243efb341c6fa54da6c
  Stored in directory: /Users/Max/Library/Caches/pip/wheels/40/b3/0f/a40dbd1c6861731779f62cc4babcb234387e11d697df70ee97
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [4]:
# import necessary libraries
import os
import pandas as pd
import stanza
import json
import wget



## NLP pipeline
Now that all the necessary libraries have been installed and imported into our project, we need to set up our NLP pipeline. We will use [Stanza](https://stanfordnlp.github.io/stanza/).

In [5]:
# load stanza nlp pipeline that tokenizes and performs Named Entity Recognition
nlp_ner = stanza.Pipeline(lang='en', processors='tokenize,ner')

2023-11-03 21:18:26 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2023-11-03 21:18:27 INFO: Loading these models for language: en (English):
| Processor | Package          |
--------------------------------
| tokenize  | combined         |
| ner       | ontonotes_charlm |

2023-11-03 21:18:27 INFO: Using device: cpu
2023-11-03 21:18:27 INFO: Loading: tokenize
2023-11-03 21:18:27 INFO: Loading: ner
2023-11-03 21:18:28 INFO: Done loading processors!


## Load text data
If you are using a **Colab Notebook** you will need to run the cell below to get the text files.

Otherwise, you should have all of the text files for Gibbon's _Decline and Fall of the Roman Empire_ already downloaded from Canvas.

In [None]:
# load text files, Colab only
! git clone https://github.com/jdeen33/Gibbon_text.git

## Extract location information from text file(s)

In [6]:
# create function that will take a text string as input and return a dictionary
# with locations and location counts from the text string
def get_locations_from_text(text):
    locations_dict = {}
    doc = nlp_ner(text)
    for sentence in doc.sentences:
        for token in sentence.tokens:
            # 'S-GPE' marks a single-token geopolitical entity (e.g. a city or country)
            if token.ner == 'S-GPE':
                if token.text not in locations_dict:
                    locations_dict[token.text] = 1
                else:
                    locations_dict[token.text] += 1
    return locations_dict
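
The counting step in the function above can be tried on its own, without loading a model. The sketch below is only an illustration: `count_single_token_places` and the `sample` list are hypothetical, and the tags in `sample` are written by hand rather than produced by Stanza.

```python
from collections import Counter

def count_single_token_places(tagged_tokens, tag='S-GPE'):
    """Count the tokens whose NER tag equals `tag` (hypothetical helper)."""
    return dict(Counter(text for text, ner in tagged_tokens if ner == tag))

# Hand-written (text, tag) pairs for illustration, NOT real model output.
sample = [
    ('Rome', 'S-GPE'), ('fell', 'O'), ('Rome', 'S-GPE'),
    ('Italy', 'S-GPE'), ('Asia', 'B-GPE'), ('Minor', 'E-GPE'),
]

counts = count_single_token_places(sample)
print(counts)
```

Note that the two-token name at the end of `sample` is not counted, because its tokens carry `B-GPE` and `E-GPE` rather than `S-GPE`. The same is true of the notebook's function: it only picks up one-word place names. If multi-word names matter for your chapter, Stanza also exposes whole entities through `doc.ents` (each with a `.text` and a `.type`), which may be worth exploring.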

You will need to choose which chapter you would like to extract locations from. For this example I will use Chapter 16.

For **Colab** it will look something like this:
`/content/Gibbon_text/gibbon_decline_volume1_chap16.txt`

For **Jupyter** it will look something like this:
`../text/gibbon_decline_and_fall/gibbon_decline_volume1_chap16.txt`

In [15]:
# identify the path to the text file you want to use (the path must be a
# string in quotes; adjust it to match where your files are, see examples above)
path_to_file = 'text/gibbon_decline_and_fall/gibbon_decline_volume1_chap16.txt'

In [None]:
# read text from text file
with open(path_to_file, encoding='utf-8', mode='r') as f:
    text = f.read()

In [None]:
# apply function to get locations and location counts
# this will take a few minutes
locations = get_locations_from_text(text)

In [None]:
# sanity check
locations

In [None]:
# you may want to save the locations dictionary
path = './' # <-- Path of your choosing
file_name = 'locations_data.json'
with open(os.path.join(path, file_name), encoding='utf-8', mode='w') as f:
    json.dump(locations, f)
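
Saving the dictionary pays off when you come back later, since you can reload it instead of rerunning the slow NER step. Here is a minimal sketch of that round trip. It assumes a made-up dictionary (`example_locations`) and writes to a temporary folder so it does not touch your own files.

```python
import json
import os
import tempfile

# Made-up counts for illustration; in the notebook this would be `locations`.
example_locations = {'Rome': 12, 'Italy': 5}

# Write to a temporary folder (an assumption made so the example is harmless to run).
example_path = os.path.join(tempfile.mkdtemp(), 'locations_data.json')
with open(example_path, encoding='utf-8', mode='w') as f:
    json.dump(example_locations, f)

# Later (for example in a new session), load the dictionary back.
with open(example_path, encoding='utf-8', mode='r') as f:
    reloaded_locations = json.load(f)

print(reloaded_locations)
```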

In [None]:
# convert dictionary to dataframe for easier processing
location_count_df = pd.DataFrame.from_dict(locations, orient='index').reset_index().rename(columns={'index':'place_name', 0:'count'})


In [None]:
# preview DataFrame
location_count_df.head()

## Load data from Pleiades

In [None]:
# data from Pleiades, thanks to Peter Nadel!
if not os.path.isfile('places.csv'):  # check to see if we already have this file
    wget.download('https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/places.csv')
if not os.path.isfile('names.csv'):
    wget.download('https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/names.csv')

In [None]:
# load and preview places DataFrame
places_df = pd.read_csv('places.csv')
places_df.head()

In [None]:
# load and preview names DataFrame
names_df = pd.read_csv('names.csv')
names_df.head()

In [None]:
# quick example: find 'Roma' in places DataFrame
places_df.loc[places_df['title'] == 'Roma']

In [None]:
# quick example: find 'Rome' in names DataFrame
names_df.loc[names_df['romanized_form_1'] == 'Rome']

## Extract data from Pleiades data
For each location we identified in the text, we will extract the longitude, latitude, and a description. First we need to find each location in the Pleiades data.

In [None]:
def get_pleiades_id(location):
    """
    Checks each name column in the names.csv data, in order, for an exact match.
    Returns the Pleiades place id as soon as exactly one row matches a column.
    Returns None if no column gives a single match.
    """
    name_columns = ['attested_form', 'romanized_form_1',
                    'romanized_form_2', 'romanized_form_3']
    for column in name_columns:
        name_row = names_df.loc[names_df[column] == location]
        if len(name_row) == 1:
            return int(name_row.place_id.iloc[0])
    return None
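
To see how this lookup behaves, it can help to try the same logic on a tiny table. The sketch below is an illustration only: `toy_names_df` contains made-up rows (not real Pleiades data), and `find_place_id` is a hypothetical variant that takes the names table as an argument instead of relying on the global `names_df`.

```python
import pandas as pd

# Toy stand-in for names.csv; the rows and ids are made up for illustration.
toy_names_df = pd.DataFrame({
    'place_id': [1, 2, 3],
    'attested_form': ['Roma', 'Athenai', 'Athenai'],
    'romanized_form_1': ['Rome', 'Athens', 'Athens'],
})

def find_place_id(location, names, columns=('attested_form', 'romanized_form_1')):
    """Return the place id when exactly one row matches in some column, else None."""
    for column in columns:
        matches = names.loc[names[column] == location]
        if len(matches) == 1:
            return int(matches['place_id'].iloc[0])
    return None

rome_id = find_place_id('Rome', toy_names_df)          # one match in romanized_form_1
athens_id = find_place_id('Athens', toy_names_df)      # two rows match, so ambiguous
missing_id = find_place_id('Atlantis', toy_names_df)   # no match at all
print(rome_id, athens_id, missing_id)
```

The second call is worth noticing: a name that matches more than one row is treated the same as a name that is not found. That keeps the function simple, but it means ambiguous place names will be missing from the final DataFrame.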

In [None]:
# apply the above function to each row in our location count DataFrame and then
# add a new column with the Pleiades id
location_count_df['pleiades_id'] = location_count_df['place_name'].apply(get_pleiades_id)

In [None]:
# preview new location count DataFrame.
# the NaN means we were unable to find the location in the Pleiades data.
location_count_df.head()

In [None]:
# we can drop the rows with NaN values
location_count_df = location_count_df.dropna().reset_index(drop=True)

In [None]:
# preview updated location count DataFrame
location_count_df.head()

Now that we have a `pleiades_id` for each location from names.csv, we can use that information to get more data from places.csv. It would be possible to combine the functions below into one, but I have separated them out for clarity.

In [None]:
def get_description(pleiades_id):
    """return description from a pleiades id"""
    places_row = places_df.loc[places_df['id'] == pleiades_id]
    if len(places_row) == 1:
        return places_row.description.iloc[0]

In [None]:
def get_uri(pleiades_id):
    """return uri from a pleiades id"""
    places_row = places_df.loc[places_df['id'] == pleiades_id]
    if len(places_row) == 1:
        return places_row.uri.iloc[0]

In [None]:
def get_latitude(pleiades_id):
    """return latitude from a pleiades id"""
    places_row = places_df.loc[places_df['id'] == pleiades_id]
    if len(places_row) == 1:
        return places_row.representative_latitude.iloc[0]
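
As mentioned earlier, these lookup functions could be combined into one. The sketch below shows one possible way to do that, under some assumptions: `get_place_field` is a hypothetical helper (it is not used elsewhere in this notebook), and `toy_places_df` is a made-up table standing in for `places_df`.

```python
import pandas as pd

# Toy stand-in for places.csv; the values are made up for illustration.
toy_places_df = pd.DataFrame({
    'id': [1, 2],
    'description': ['A made-up description of a city', 'Another made-up description'],
    'representative_latitude': [41.9, 38.0],
})

def get_place_field(pleiades_id, field, places):
    """Return the value in column `field` for the row whose id matches, else None."""
    places_row = places.loc[places['id'] == pleiades_id]
    if len(places_row) == 1:
        return places_row[field].iloc[0]
    return None

desc = get_place_field(1, 'description', toy_places_df)
lat = get_place_field(2, 'representative_latitude', toy_places_df)
nothing = get_place_field(99, 'description', toy_places_df)
print(desc, lat, nothing)
```

With a helper like this, the column name is passed in as an argument, so a single function can serve every field. It can be used with `apply` by passing the extra arguments through `args`, for example `location_count_df['pleiades_id'].apply(get_place_field, args=('description', places_df))`. The separate functions in this notebook do the same work and are easier to read one at a time.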

In [None]:
# Challenge: Can you write a function to get the longitude data?





In [None]:
# add new column for description
location_count_df['description'] = location_count_df['pleiades_id'].apply(get_description)

In [None]:
# Challenge: can you write the code to add a column for the uri?


In [None]:
# add new column for latitude
location_count_df['latitude'] = location_count_df['pleiades_id'].apply(get_latitude)

In [None]:
# Challenge: can you write the code to add a column for the longitude?


Now that we have all the data we need, I am going to make a few little changes to the DataFrame.

In [None]:
# now that we have a uri we don't need the pleiades_id
location_count_df = location_count_df.drop(columns=['pleiades_id'])

In [None]:
# for our purposes we don't really need an index, so I will make the place_name column the index
location_count_df.set_index('place_name', inplace=True)

In [None]:
# final sanity check
location_count_df

## Save location data for further use

In [None]:
# create path and file name variables
path = './'  # <-- set path variable (not necessary for Colab)
file_name = 'location_data.csv'  # <-- example name only; set file_name variable

In [None]:
# save DataFrame to a .csv file
location_count_df.to_csv(file_name) # <-- For Jupyter you may want to add path
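
Because `place_name` is now the index, it is written as the first column of the .csv file. When you load the file again, you can tell pandas to use that column as the index. Here is a small sketch of the round trip, assuming a made-up DataFrame (`toy_df`) and a temporary folder so that none of your own files are touched.

```python
import os
import tempfile

import pandas as pd

# Made-up data for illustration, with place_name as the index like our DataFrame.
toy_df = pd.DataFrame(
    {'place_name': ['Rome', 'Italy'], 'count': [12, 5]}
).set_index('place_name')

# Save to a temporary folder (an assumption made so the example is harmless to run).
csv_path = os.path.join(tempfile.mkdtemp(), 'toy_locations.csv')
toy_df.to_csv(csv_path)

# Reload, restoring place_name as the index.
reloaded_df = pd.read_csv(csv_path, index_col='place_name')
print(reloaded_df)
```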

In [None]:
# Colab only: import the helper that downloads files from Colab to your computer
from google.colab import files
files.download(file_name)