# Their Eyes Were Watching God (Map)

## Introduction

This project maps all locations mentioned in *Their Eyes Were Watching God* using machine learning tools. The script processes the text, extracts toponyms (place names) with a geoparser library, and assigns coordinates to each location.

The locations are visualized on a map, where the size of each circle corresponds to the frequency of mentions. Clicking a circle reveals the underlying text.

## 1 Data Acquisition

### Step-by-Step Guide

In [1]:
import pandas as pd
import nltk
import re

### Import text

In [3]:
file_path = "data/hurston_their_eyes_were_watching_god.txt"

with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

# Create a DataFrame with the entire text in one cell
df = pd.DataFrame([[text]], columns=["Text"])

---


## 2 Clean DataFrame for Analysis

To prepare our data for analysis we will: 

- Split it into sentences
- Clean the individual sentences
- Drop unnecessary data

The code below operates on the DataFrame `df` created above. If you named your DataFrame differently, substitute that name.

### Step 1: Tokenize Text into Sentences

In [5]:
# sent_tokenize needs the Punkt tokenizer data. Without it, this cell fails with
# "LookupError: Resource punkt not found", so download it once first.
# (Some newer NLTK releases ask for 'punkt_tab' instead.)
nltk.download('punkt')

# Explode the DataFrame so that each row corresponds to a single sentence
df_sentences = df.assign(
    sentences=df['Text'].apply(nltk.sent_tokenize)
).explode('sentences')
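
To see what the `assign(...).explode(...)` pattern does, here is a minimal sketch on a toy DataFrame. The `naive_split` helper is a hypothetical stand-in for `nltk.sent_tokenize` so the example runs without any NLTK data. It is far cruder than the real tokenizer.

```python
import re
import pandas as pd

# Toy one-cell DataFrame standing in for the full novel text
toy = pd.DataFrame(
    [["Janie came home. The porch watched. Nobody spoke."]],
    columns=["Text"],
)

def naive_split(text):
    # Hypothetical stand-in for nltk.sent_tokenize: split after ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Each list element becomes its own row; the other columns are repeated
toy_sentences = toy.assign(
    sentences=toy["Text"].apply(naive_split)
).explode("sentences")

print(toy_sentences["sentences"].tolist())
print(toy_sentences.index.tolist())  # every row keeps the original index label
```

Note that every exploded row keeps the index label of the row it came from, which is why the index is reset later on.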


### Step 2: Drop the Original Text Column

In [8]:
df_sentences = df_sentences.drop(columns='Text')

### Step 3: Define a Cleaning Function for Sentences

The following function cleans each sentence, a common preprocessing step in this type of analysis. All of the text needs to be in the same format for the later steps to work.

In [9]:
def clean_sentence(sentence):
    # 1. Remove text inside square brackets
    sentence = re.sub(r'\[.*?\]', '', sentence)
    # 2. Remove unwanted punctuation but retain sentence-ending punctuation
    sentence = re.sub(r'[^\w\s,.!?\'"‘’“”`]', '', sentence)
    # 3. Remove newline and carriage return characters, and underscores
    sentence = sentence.replace('\n', ' ').replace('\r', ' ').replace('_', '')
    # 4. Return an empty string for all-uppercase sentences (likely headers or TOC entries)
    return '' if sentence.isupper() else sentence
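
As a quick check of what the function does, the sketch below repeats the same definition (so it runs on its own) and applies it to two made-up strings. The sample sentences are invented for illustration and do not come from the novel.

```python
import re

def clean_sentence(sentence):
    # Same logic as the cell above, repeated so this example is self-contained
    sentence = re.sub(r'\[.*?\]', '', sentence)
    sentence = re.sub(r'[^\w\s,.!?\'"‘’“”`]', '', sentence)
    sentence = sentence.replace('\n', ' ').replace('\r', ' ').replace('_', '')
    return '' if sentence.isupper() else sentence

# Invented sample: bracketed note, underscore emphasis, semicolon, line break
sample = "Janie saw [1]Eatonville_ again;\nit was home."
cleaned = clean_sentence(sample)

# An all-uppercase line is treated as a header and blanked out
header = clean_sentence("CHAPTER 1")

print(repr(cleaned))
print(repr(header))
```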

### Step 4: Apply Cleaning and Remove Empty Sentences

In [10]:
# Apply the cleaning function, then filter out any cleaned sentences that are empty strings
df_sentences['cleaned_sentences'] = df_sentences['sentences'].apply(clean_sentence)
df_sentences = df_sentences[df_sentences['cleaned_sentences'] != '']


### Step 5: Reset Index for the Cleaned DataFrame

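After exploding and filtering, the index still carries the original row labels, with repeats and gaps. Resetting it gives every sentence a unique, sequential row number. The uncleaned `sentences` column is dropped as well, since only the cleaned text is needed from here on.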

In [11]:
df_sentences = df_sentences.reset_index(drop=True)
df_sentences = df_sentences.drop(columns='sentences')

## 3 Geoparsing

### Overview

Geoparsing identifies place names in text and assigns them geographic coordinates. Some limitations:
1. It can only detect explicitly named locations.
2. It makes an educated guess when multiple locations share the same name.

In [12]:
from geoparser import Geoparser
from tqdm.autonotebook import tqdm

Because of some compatibility issues with the `geoparser` package, a number of warnings pop up when it runs. They do not affect the output, but they clutter the console. The lines below filter them out.

In [13]:
import warnings

# Suppress all FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Load Geoparser

To use Geoparser, instantiate an object of the Geoparser class with optional specifications for the spaCy model, transformer model, and gazetteer. By default, the library uses an accuracy-optimised configuration:

In [13]:
geo = Geoparser(spacy_model='en_core_web_trf', transformer_model='dguzh/geo-all-distilroberta-v1', gazetteer='geonames')

Define the `geoparse_column` function, which runs the geoparser over the `cleaned_sentences` column and stores the recognized places, coordinates, and feature names as new columns.

In [None]:
def geoparse_column(df):
    sentences = df['cleaned_sentences'].tolist()  # Convert column to list
    docs = geo.parse(sentences, feature_filter=['A', 'P'])  # Run geo.parse on the entire list

    # Initialize lists to store the extracted fields
    places, latitudes, longitudes, feature_names = [], [], [], []

    # Iterate through the results and extract toponyms and their locations
    for doc in docs:
        doc_places = []
        doc_latitudes = []
        doc_longitudes = []
        doc_feature_names = []

        for toponym in doc.toponyms:
            if toponym.location:
                doc_places.append(toponym.location.get('name'))
                doc_latitudes.append(toponym.location.get('latitude'))
                doc_longitudes.append(toponym.location.get('longitude'))
                doc_feature_names.append(toponym.location.get('feature_name'))
            else:
                doc_places.append(None)
                doc_latitudes.append(None)
                doc_longitudes.append(None)
                doc_feature_names.append(None)

        # Append the extracted data for the document
        places.append(doc_places)
        latitudes.append(doc_latitudes)
        longitudes.append(doc_longitudes)
        feature_names.append(doc_feature_names)

    # Assign the extracted data to the DataFrame as new columns
    df['place'] = places
    df['latitude'] = latitudes
    df['longitude'] = longitudes
    df['feature_name'] = feature_names

    return df


In [None]:
geoparse_column(df_sentences)

### Export Pickle 2

Because geoparsing takes a long time, save the result as soon as it finishes so that you do not have to run it again.

In [None]:
# Assumes the 'basic_geoparser' folder already exists
df_sentences.to_pickle('basic_geoparser/df_sentences_places.pickle')

In [None]:
df_sentences = pd.read_pickle('basic_geoparser/df_sentences_places.pickle')
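
Pickle is a convenient format here because the geoparsed columns hold Python lists, which a plain CSV export would turn into strings. The following is a minimal round-trip sketch on a toy DataFrame. The file name and the temporary folder are placeholders for illustration only.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "cleaned_sentences": ["They went to Eatonville."],
    "place": [["Eatonville"]],  # list-valued cell, like the geoparsed columns
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_places.pickle")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The list survives the round trip as a real Python list
restored_place = restored.loc[0, "place"]
print(type(restored_place), restored_place)
```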

### Clean up the resulting dataframe

Many sentences contain no resolved place name, and carrying those rows along only slows down the later steps. Start by eliminating the empty results:

In [None]:
# Keep only rows whose 'place' list holds at least one resolved (non-None) place
df_places = df_sentences[df_sentences['place'].apply(lambda x: isinstance(x, list) and any(p is not None for p in x))].copy()
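
The effect of this kind of filter is easiest to see on a toy table. The sketch below uses a named helper, `has_resolved_place`, which is a hypothetical name introduced here for readability. It keeps a row only when its `place` list has at least one entry that is not `None`.

```python
import pandas as pd

toy = pd.DataFrame({
    "cleaned_sentences": ["No place here.", "Unresolved name.", "They left Eatonville."],
    "place": [[], [None], ["Eatonville"]],
})

def has_resolved_place(value):
    # True only for a list holding at least one non-None entry
    return isinstance(value, list) and any(p is not None for p in value)

kept = toy[toy["place"].apply(has_resolved_place)].copy()
print(kept["cleaned_sentences"].tolist())
```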


#### Unnest Places

Since a sentence can mention several locations, some rows will hold multiple places. These have to be unnested, that is, turned into individual rows.

In [None]:
df_places = df_places.explode(['place', 'latitude', 'longitude', 'feature_name']).reset_index(drop=True)
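
Here is a small sketch of multi-column `explode` on a toy row, assuming a pandas version that accepts a list of columns (1.3 or later). The coordinates are rough illustrative values, not geoparser output.

```python
import pandas as pd

toy = pd.DataFrame({
    "cleaned_sentences": ["From Eatonville to Jacksonville."],
    "place": [["Eatonville", "Jacksonville"]],
    "latitude": [[28.6, 30.3]],     # rough illustrative values
    "longitude": [[-81.4, -81.7]],  # rough illustrative values
})

# The lists are unpacked in parallel, so each place stays paired with its coordinates
unnested = toy.explode(["place", "latitude", "longitude"]).reset_index(drop=True)
print(unnested[["place", "latitude", "longitude"]])
```

The lists in the exploded columns must have the same length in every row, which holds here because `geoparse_column` appends to all of them together.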

### Aggregate place

Once the places are unnested, they can be aggregated so that the mentions of each place are counted.

In [None]:
df_consolidated = df_places.groupby('place').agg(
    sentence_count=('cleaned_sentences', 'count'),
    sentences=('cleaned_sentences', list),
    latitude=('latitude', 'first'),
    longitude=('longitude', 'first'),
    feature_name=('feature_name', 'first')
).reset_index()
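
The named-aggregation pattern used above can be tried on a toy table first. The sentences and coordinates below are placeholders for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "cleaned_sentences": ["s1", "s2", "s3"],  # placeholder sentences
    "place": ["Eatonville", "Jacksonville", "Eatonville"],
    "latitude": [28.6, 30.3, 28.6],           # rough illustrative values
})

summary = toy.groupby("place").agg(
    sentence_count=("cleaned_sentences", "count"),  # number of mentions
    sentences=("cleaned_sentences", list),          # all sentences for the place
    latitude=("latitude", "first"),                 # one coordinate per place
).reset_index()

print(summary)
```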

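Generally, the data will be strongly right-skewed: many places are mentioned only a few times, while a handful are mentioned often. You might want to filter out some of the lower values.

In [None]:
# Keep only places mentioned in more than 50 sentences; adjust the threshold to suit your text
df_consolidated = df_consolidated[df_consolidated['sentence_count'] > 50]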

### Bucket Data

When mapping quantitative data you want to break it down into buckets so that it is easier to differentiate small from large.

In [None]:
import mapclassify as mc  # If this import fails, install the package with: pip install mapclassify

jenks_breaks = mc.NaturalBreaks(y=df_consolidated['sentence_count'], k=5)
df_consolidated.loc[:,'location_count_bucket'] = jenks_breaks.find_bin(df_consolidated['sentence_count'])+1
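
To illustrate what bucketing means, the sketch below assigns 1-based bucket numbers with `np.digitize`. The break values are hand-picked assumptions, not Jenks breaks computed by `mapclassify`, so this shows only the binning step and not the natural-breaks optimisation.

```python
import numpy as np

counts = np.array([1, 2, 3, 7, 10, 40])  # toy mention counts
upper_bounds = np.array([2, 10, 40])     # hand-picked upper bounds, NOT Jenks output

# right=True puts a value equal to an upper bound into that bound's bucket;
# adding 1 turns the 0-based bin index into a 1-based bucket number
buckets = np.digitize(counts, upper_bounds, right=True) + 1
print(buckets.tolist())
```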

### Export Pickle 3

Exporting the file to save it for later.


In [None]:
df_consolidated.to_pickle('basic_geoparser/df_consolidated_buckets.pickle')

---

## 4 Map Your Data

### Overview

With all this data we can create a custom map.

In [None]:
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd

# Build a legend label for each bucket from the Jenks upper bounds
# (jenks_breaks.bins[i] is the upper bound of bucket i + 1)
jenks_labels = {
    1: f"{min(df_consolidated['sentence_count'])} - {int(jenks_breaks.bins[0])}",
    2: f"{int(jenks_breaks.bins[0])} - {int(jenks_breaks.bins[1])}",
    3: f"{int(jenks_breaks.bins[1])} - {int(jenks_breaks.bins[2])}",
    4: f"{int(jenks_breaks.bins[2])} - {int(jenks_breaks.bins[3])}",
    5: f"{int(jenks_breaks.bins[3])} - {max(df_consolidated['sentence_count'])}"
}

# Map the bucket numbers to labels
df_consolidated["bucket_label"] = df_consolidated["location_count_bucket"].map(jenks_labels)

# Create the scatter map
fig = px.scatter_mapbox(
    df_consolidated,
    lat="latitude",
    lon="longitude",
    size="location_count_bucket",  # Bubble sizes scale correctly
    color="bucket_label",  # Legend shows Jenks bucket ranges
    
    hover_name="place",
    size_max=20,
    category_orders={"bucket_label": list(jenks_labels.values())},  # Keep correct legend order
    center={"lat": 27, "lon": -81},
    zoom=3
)

# Ensure legend bubbles are properly sized
fig.update_traces(marker=dict(sizemode="area"), selector=dict(mode="markers"))

# Update layout to improve legend readability
fig.update_layout(
    mapbox_style="carto-positron",
    margin={"r": 0, "t": 0, "l": 0, "b": 0},
    legend_title_text="Location Count"
)

# Create Dash App
app = dash.Dash(__name__)

app.layout = html.Div([
    dcc.Graph(id="map", figure=fig),  # Display the map
    html.Div(id="output-container", style={'padding': '20px', 'font-size': '16px'})  # Placeholder for dialog box
])

# Callback to update dialog on click
@app.callback(
    Output("output-container", "children"),
    Input("map", "clickData")
)
def display_dialog(clickData):
    if clickData:
        clicked_place = clickData["points"][0]["hovertext"]  # Get the place name
        sentences = df_consolidated[df_consolidated["place"] == clicked_place]["sentences"].values
        if len(sentences) > 0:
            sentences_list = "\n".join(sentences[0])  # One sentence per line; 'pre-line' below renders the line breaks
            return html.Div([
                html.H3(f"Sentences about {clicked_place}:"),
                html.P(sentences_list, style={'white-space': 'pre-line'})
            ])
    return "Click a location to see related sentences."

# Run the app (on older Dash releases that lack app.run, use app.run_server instead)
if __name__ == "__main__":
    app.run(debug=True)
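
The lookup inside the callback can be tried without starting the server. In the sketch below, `sentences_for` is a hypothetical helper that mirrors that lookup, and `fake_click` is a minimal stand-in for the click payload. A real `clickData` dictionary carries more keys than shown here.

```python
import pandas as pd

toy_consolidated = pd.DataFrame({
    "place": ["Eatonville", "Jacksonville"],
    "sentences": [["s1", "s3"], ["s2"]],  # placeholder sentences
})

def sentences_for(click_data, table):
    # Hypothetical helper: mirrors the lookup done inside display_dialog
    if not click_data:
        return None
    clicked_place = click_data["points"][0]["hovertext"]
    match = table.loc[table["place"] == clicked_place, "sentences"]
    return "\n".join(match.iloc[0]) if len(match) else None

fake_click = {"points": [{"hovertext": "Eatonville"}]}  # minimal stand-in payload
result = sentences_for(fake_click, toy_consolidated)
print(result)
```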

Happy mapping!