# Project Template

The following template repeats the essential steps for scraping, tokenizing, and mapping sentiment data. Each major step is further broken down into smaller steps. You can step through most of the code. 
- You will  have to create your own dataframe names
- Customize the filters
- Customize the maps

#### Note

When your individual input is required the following symbol will appear:</br> 
![](edit-code.png)

You should save your results to a `pickle` file after each major step. These points are indicated by the save icon:</br>
![](save.png)
</br>
They are also titled **Export Pickle**
You will need to import these results into the `project_presentation_template`

---


## 2 Data Input and Wrangling

### Introduction

In this step, you will download a CSV file from Gutenberg, modify it, and save it locally for further analysis.

#### Process Steps:
- Load a CSV file from a remote source
- Save the file locally

### Step-by-Step Guide

#### Step 1: Import the Required Libraries

The primary library for handling CSV data in Python is `pandas`. Make sure it’s imported at the beginning of your script.

```python
import pandas as pd
```

In [2]:
import pandas as pd

#### Step 2: Import Pickle File 
In lesson 2_1 we created a clean pickle file of the catalog, we can simply start there rather than running through all those steps.

In [4]:
pg_catalog_clean = pd.read_pickle('pg_catalog_clean.pickle')

---


![](edit-code.png)

#### Step 3: Create a custom dataframe

##### Overview

In lesson 2_2 we learned how to filter dataframes. For this part of the project, you are going to create your own pg_catalog dataframe that includes the corpus of works you want to analyze. Your corpus can be as large as you want, but keep in mind that having a lot of text can significantly increase your processing time. 

##### Requirements
Please follow the following parameters when creating our custom dataframe:

1. Unique descriptive dataframe name (i.e. not `df_virginia_history`)

2. Corpus should have a logical coherence.

3. You should aim for at least 25 texts

4. Save the resulting dataframe as a pickle file

##### Note
Remember to make a deep copy of your dataframe by using the method `.copy()`

###### *Example*
```python
df_virginia_history = df_pg_catalog[
    (df_pg_catalog.language == 'en') & 
    (df_pg_catalog.type == 'Text') 
].copy()
```


Save the dataframe to a pickle file: 

```python
df_virginia_history.to_pickle('custom_file.pickle')
```

---


## 3 Scraping Gutenberg

### Overview

In the previous section you were meant to create a custom dataframe. Here you are going to scrape that data frame. The scraping function assumes that your first column is called `text_id`. If that is not the case then something went wrong with the filtering process. 

#### Load `gutenberg_scraper`

Since you only need to perform one function `fetch_text_data()` all of the other logic has been tucked away in a `.py` file. You can load the python file and function in like any other import as long as the file is present in your root directory. 

In [3]:
from gutenberg_scraper import fetch_text_data

In [5]:
import pickle
with open('pg_catalog_clean.pickle', 'rb') as file:

    data = pickle.load(file)
df_new_york_history = data[
    (data.language == 'en') & 
    (data.type == 'Text') &
    (data.subjects.str.contains('new york', case=False,)) &
    ~ (data.subjects.str.contains('fiction|speech*', case=False,)) &
    (data.birth > 1600) &
    ~ (data.subjects.str.contains('poetry|short story|travel|art|essay|biography|cooking', case=False,)) &
    (data.subjects.str.contains('history', case=False,))
] .copy ()

df_new_york_history.to_pickle('pg_catalog_clean.pickle')

In [7]:
fetch_text_data(df_new_york_history)

 93%|███████████████████████████████████████▉   | 26/28 [07:01<00:34, 17.33s/it]ERROR:gutenberg_scraper:Error fetching https://www.gutenberg.org/ebooks/6/8/2/3/68232/68232-8.txt: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
100%|███████████████████████████████████████████| 28/28 [08:09<00:00, 17.50s/it]


Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death,text_data
2096,2128,Text,2000-04-01,"Narratives of New Netherland, 1609-1664",en,New York (State) -- History -- Colonial period...,F106,United States; Browsing: History - American; B...,,Jameson,J. Franklin (John Franklin),1859,1937,***
3120,3161,Text,2002-04-01,"Narratives of New Netherland, 1609-1664",en,New York (State) -- History -- Colonial period...,F106,Browsing: History - American; Browsing: Histor...,,Jameson,J. Franklin (John Franklin),1859,1937,"NARRATIVES OF NEW NETHERLAND, 1609-1664 ***\r\..."
6814,6856,Text,2004-11-01,"The Great Riots of New York, 1712 to 1873",en,Riots -- New York (State) -- New York; Draft R...,F106,Browsing: History - American; Browsing: Histor...,,Headley,Joel Tyler,1813,1897,"THE GREAT RIOTS OF NEW YORK, 1712 TO 1873 ***\..."
7193,7235,Text,2005-01-01,The Bride of Fort Edward: Founded on an Incide...,en,Fort Edward (N.Y.) -- History -- Drama; New Yo...,PS,Browsing: History - American; Browsing: Fiction,,Bacon,Delia Salter,1811,1859,THE BRIDE OF FORT EDWARD: FOUNDED ON AN INCIDE...
12971,13042,Text,2004-07-29,"Knickerbocker's History of New York, Complete",en,New York (State) -- History -- Colonial period...,F106,Browsing: History - American,,Irving,Washington,1783,1859,"KNICKERBOCKER'S HISTORY OF NEW YORK, COMPLETE ..."
13740,13811,Text,2004-10-20,"Peter Stuyvesant, the Last Dutch Governor of N...",en,New York (State) -- History -- Colonial period...,F106,Browsing: History - American; Browsing: Histor...,,Abbott,John S. C. (John Stevens Cabot),1805,1877,"PETER STUYVESANT, THE LAST DUTCH GOVERNOR OF N..."
21919,21990,Text,2007-07-03,The Campaign of 1776 around New York and Brook...,en,"Long Island, Battle of, New York, N.Y., 1776; ...",E201,Browsing: History - American; Browsing: Histor...,,Johnston,Henry Phelps,1842,1923,THE CAMPAIGN OF 1776 AROUND NEW YORK AND BROOK...
24641,24712,Text,2008-02-28,The Negro at Work in New York City: A Study in...,en,African Americans -- History; African American...,E151; H,United States; Browsing: Culture/Civilization/...,,Haynes,George Edmund,1880,1960,THE NEGRO AT WORK IN NEW YORK CITY: A STUDY IN...
31901,31974,Text,2010-04-13,Last Days of the Rebellion The Second New Yor...,en,"United States -- History -- Civil War, 1861-18...",E456,Browsing: History - American; Browsing: Histor...,,Randol,Alanson M.,1837,1887,***
31940,32013,Text,2010-04-16,The Last Campaign of the Twenty-Second Regimen...,en,United States. Army. New York Infantry Regimen...,E456,US Civil War; Browsing: History - American; Br...,,Wingate,George Wood,1840,1928,THE LAST CAMPAIGN OF THE TWENTY-SECOND REGIMEN...


### Export Pickle 1

![](save.png)

Export the file as a `pickle` file for presentation.

```python
YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_TEXTS.pickle')
```


In [12]:
df_new_york_history.to_pickle('df_new_york_history.pickle')

---

![](edit-code.png)

## 4 Clean DataFrame for Analysis

To prepare our data for analysis we will: 

- Split it into sentences
- Clean the individual sentences
- Drop unnecessary data

For the code below you will have to replace `YOUR_DATAFRAME` with the name of your dataframe.

### Import `NLTK`

We can use NLTK for some basic preprocessing.

```python
import nltk
import re
```

In [9]:
import nltk
import re

### Step 1: Tokenize Text into Sentences

In [10]:
df_new_york_history = df_new_york_history.assign(
    sentences = df_new_york_history['text_data'].apply(nltk.sent_tokenize)
).explode('sentences')

### Step 2: Remove the 'text_data' column 

In [12]:
df_new_york_history = df_new_york_history.drop(columns='text_data')

### Step 3: Define a Cleaning Function for Sentences

In [15]:
def clean_sentence(sentence):
    # 1. Remove text inside square brackets
    sentence = re.sub(r'\[.*?\]', '', sentence)
    # 2. Remove unwanted punctuation but retain sentence-ending punctuation
    sentence = re.sub(r'[^\w\s,.!?\'"‘’“”`]', '', sentence)
    # 3. Remove newline and carriage return characters, and underscores
    sentence = sentence.replace('\n', ' ').replace('\r', ' ').replace('_', '')
    # 4. Return an empty string for all-uppercase sentences (likely headers or TOC entries)
    return '' if sentence.isupper() else sentence

### Step 4: Apply Cleaning and Remove Empty Sentences

In [17]:
df_new_york_history['cleaned_sentences'] = df_new_york_history['sentences'].apply(clean_sentence)
df_new_york_history = df_new_york_history[df_new_york_history['sentences'] != '']


### Step 5: Reset Index for the Cleaned DataFrame

In [19]:
df_new_york_history = df_new_york_history.reset_index(drop=True)


### Step 6: (OPTIONAL) Save deep copy of dataframe and pickle

In [21]:
df_new_york_history_DEEP = df_new_york_history.copy()
df_new_york_history_DEEP.to_pickle('df_new_york_history_DEEP.pickle')

---

## 5 Perform Initial Tokenization

### Overview

Since the geoparsing process is quite intense, we can actually reduce our processing overhead a bit by eliminating those sentences that likely don't have toponyms. We can do so by first running a pass with the lightweight `en_core_web_sm` `spacy` library

### Load Spacy
We are going to load spacy and the small library at the same time.



In [23]:
import spacy
from tqdm import tqdm
tqdm.pandas()
nlp = spacy.load('en_core_web_sm')

#### Load Functions into memory

In [24]:
# Function to extract GPE (Geopolitical Entities) from a batch of docs
def extract_gpe_from_docs(docs):
    return [[ent.text for ent in doc.ents if ent.label_ == 'GPE'] or None for doc in docs]

# Use nlp.pipe() for faster batch processing with multiple cores
def process_sentences_in_batches(sentences, batch_size=50, n_process=-1):
    # Process sentences using nlp.pipe with batch processing and multi-processing
    gpe_results = []
    for doc in tqdm(nlp.pipe(sentences, batch_size=batch_size, n_process=n_process), total=len(sentences)):
        gpes = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
        gpe_results.append(gpes if gpes else None)
    return gpe_results

#### Process your DataFrame

In [27]:
df_new_york_history['toponyms'] = process_sentences_in_batches(df_new_york_history['cleaned_sentences'])
df_new_york_history['cleaned_sentences']

100%|████████████████████████████████████| 61981/61981 [04:00<00:00, 257.77it/s]


0                                                         
1        NARRATIVES OF NEW NETHERLAND, 16091664        ...
2                              Michaelius, Reverend Jonas.
3            "Letter of Reverend Jonas  Michaelius, 1628."
4        In J. Franklin Jameson, ed., Narratives  of Ne...
                               ...                        
61976    Sometimes is found the  touching Gedachtenis, ...
61977    More impressive still, from  its calm repetiti...
61978    Not only in memory of those deadandgone coloni...
61979    The lichened  lettering of those unfamiliar wo...
61980                                                     
Name: cleaned_sentences, Length: 61981, dtype: object

### Clean up the result

As we saw in the most intense part of the extraction process in lesson_5 we want to reduce the number of sentences being processed to lower the computation time. We can do two things at this stage. 
 1. Eliminate unnecessary columns
 2. Eliminate all sentences for which there is no result
 3. Eliminate all sentences with very few results. Your group can decide on the threshold, but suffice to say that all toponyms with a count of 1 won't be relevant. You can adjust this number as you fine-tune your model.

In [34]:
df_new_york_history.sample(5)

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death,sentences,cleaned_sentences,toponyms
16796,21990,Text,2007-07-03,The Campaign of 1776 around New York and Brook...,en,"Long Island, Battle of, New York, N.Y., 1776; ...",E201,Browsing: History - American; Browsing: Histor...,,Johnston,Henry Phelps,1842,1923,But in all Washington's army\r\nthere was not ...,But in all Washington's army there was not a ...,"[Washington, Long Island, Kings]"
977,6856,Text,2004-11-01,"The Great Riots of New York, 1712 to 1873",en,Riots -- New York (State) -- New York; Draft R...,F106,Browsing: History - American; Browsing: Histor...,,Headley,Joel Tyler,1813,1897,"He wisely forebore to give the order, for if\r...","He wisely forebore to give the order, for if ...",
71,3161,Text,2002-04-01,"Narratives of New Netherland, 1609-1664",en,New York (State) -- History -- Colonial period...,F106,Browsing: History - American; Browsing: Histor...,,Jameson,J. Franklin (John Franklin),1859,1937,In 1636-1637 he made arrangements\r\nwith Blom...,In 16361637 he made arrangements with Blommae...,[New Sweden]
34701,37421,Text,2011-09-14,"The Life and Times of Kateri Tekakwitha, the L...",en,New York (State) -- History -- Colonial period...,E011,Browsing: History - American; Browsing: Histor...,,Walworth,Ellen H. (Ellen Hardin),1858,1932,The Onondagas that very year sent presents to ...,The Onondagas that very year sent presents to ...,[Quebec]
57583,70129,Text,2023-02-24,General Washington's spies on Long Island and ...,en,"New York (N.Y.) -- History -- Revolution, 1775...",E201,Browsing: History - American; Browsing: Histor...,,Pennypacker,Morton,1872,1956,Meanwhile Benedict Arnold was uneasy.,Meanwhile Benedict Arnold was uneasy.,


#### Eliminate Unncessary Columns

In [29]:
df_new_york_history = df_new_york_history.drop(columns=['language', 'issued', 'type', 'locc', 'bookshelves', 'second_author']).copy()

#### Eliminate `None`

In [31]:
df_new_york_history = df_new_york_history[df_new_york_history.toponyms.notna()]

In [33]:
df_new_york_history

Unnamed: 0,text_id,title,subjects,last_name,first_name,birth,death,sentences,cleaned_sentences,toponyms
4,3161,"Narratives of New Netherland, 1609-1664",New York (State) -- History -- Colonial period...,Jameson,J. Franklin (John Franklin),1859,1937,"In J. Franklin Jameson, ed., Narratives\r\nof ...","In J. Franklin Jameson, ed., Narratives of Ne...","[Narratives, New Netherland]"
6,3161,"Narratives of New Netherland, 1609-1664",New York (State) -- History -- Colonial period...,Jameson,J. Franklin (John Franklin),1859,1937,INTRODUCTION\r\n\r\nTHE established church in ...,INTRODUCTION THE established church in the ...,[the United Netherlands]
7,3161,"Narratives of New Netherland, 1609-1664",New York (State) -- History -- Colonial period...,Jameson,J. Franklin (John Franklin),1859,1937,Its polity was that of Geneva or of\r\nPresbyt...,Its polity was that of Geneva or of Presbyter...,[Geneva]
12,3161,"Narratives of New Netherland, 1609-1664",New York (State) -- History -- Colonial period...,Jameson,J. Franklin (John Franklin),1859,1937,In 1624 the synod of North\r\nHolland decreed ...,In 1624 the synod of North Holland decreed th...,"[North , Holland]"
15,3161,"Narratives of New Netherland, 1609-1664",New York (State) -- History -- Colonial period...,Jameson,J. Franklin (John Franklin),1859,1937,Many extracts\r\nfrom the minutes of that clas...,Many extracts from the minutes of that classi...,"[New York, Albany]"
...,...,...,...,...,...,...,...,...,...,...
61946,72327,Colonial days in old New York,New York (State) -- History -- Colonial period...,Earle,Alice Morse,1851,1911,“Burial-cakes” were advertised by\r\na baker i...,“Burialcakes” were advertised by a baker in 1...,"[Burialcakes, Philadelphia]"
61954,72327,Colonial days in old New York,New York (State) -- History -- Colonial period...,Earle,Alice Morse,1851,1911,"Perhaps with\r\ngifts of gloves, spoons, bottl...","Perhaps with gifts of gloves, spoons, bottles...",[scarfs]
61960,72327,Colonial days in old New York,New York (State) -- History -- Colonial period...,Earle,Alice Morse,1851,1911,"In the “New York Gazette” of December 24, 1750...","In the “New York Gazette” of December 24, 1750...","[New York Gazette, Esq]"
61966,72327,Colonial days in old New York,New York (State) -- History -- Colonial period...,Earle,Alice Morse,1851,1911,As this is intended as a small Tribute to the ...,As this is intended as a small Tribute to the ...,"[Morrisania, Esq, the Province of New Jersey]"


To eliminate the some of the complicated processing the function below adds a count column to the dataframe.

In [35]:
def add_toponym_count(df, toponym_col='toponyms', sentence_col='cleaned_sentences'):
    """
    Processes the DataFrame to count toponyms and aggregate back to sentences, keeping all original columns.
    
    Args:
        df (pd.DataFrame): The DataFrame containing toponyms and sentences.
        toponym_col (str): Column containing the toponyms as lists.
        sentence_col (str): Column containing the cleaned sentences.
    
    Returns:
        pd.DataFrame: A DataFrame grouped by sentences with a list of toponyms, their counts, and all original columns.
    """
    
    # Step 1: Explode the 'toponyms' column
    exploded_df = df.explode(toponym_col)
    
    # Step 2: Group by 'toponyms' to count occurrences and add 'nltk_toponym_count' column
    toponym_counts = exploded_df.groupby(toponym_col).size().reset_index(name='nltk_toponym_count')
    
    # Step 3: Merge the counts back to the exploded DataFrame
    exploded_df = exploded_df.merge(toponym_counts, on=toponym_col, how='left')
    
    # Step 4: Group by 'cleaned_sentences' and aggregate all columns
    # Use 'first' to retain the first non-null value for each original column, and 'list' for the toponym_col
    aggregation_dict = {col: 'first' for col in df.columns if col not in [sentence_col, toponym_col]}
    aggregation_dict[toponym_col] = lambda x: list(x)  # Aggregate toponyms into lists
    aggregation_dict['nltk_toponym_count'] = 'first'   # Take the first count (all counts are the same within groups)
    
    result_df = exploded_df.groupby(sentence_col).agg(aggregation_dict).reset_index()
    
    return result_df

In [44]:
df_new_york_history.sample(5)

Unnamed: 0,text_id,title,subjects,last_name,first_name,birth,death,sentences,cleaned_sentences,toponyms
10141,13042,"Knickerbocker's History of New York, Complete",New York (State) -- History -- Colonial period...,Irving,Washington,1783,1859,"For\r\nonce in his life, and only for once, di...","For once in his life, and only for once, did ...",[New Netherlands]
37561,37421,"The Life and Times of Kateri Tekakwitha, the L...",New York (State) -- History -- Colonial period...,Walworth,Ellen H. (Ellen Hardin),1858,1932,"p. 9, will\r\n show you an Iroquois villag...","p. 9, will show you an Iroquois village ...",[Iroquois]
21592,21990,The Campaign of 1776 around New York and Brook...,"Long Island, Battle of, New York, N.Y., 1776; ...",Johnston,Henry Phelps,1842,1923,When we had proceeded to within a mile and a h...,When we had proceeded to within a mile and a h...,[Princeton]
13938,13811,"Peter Stuyvesant, the Last Dutch Governor of N...",New York (State) -- History -- Colonial period...,Abbott,John S. C. (John Stevens Cabot),1805,1877,While New Netherland was thus fearfully menace...,While New Netherland was thus fearfully menace...,[England]
14169,13811,"Peter Stuyvesant, the Last Dutch Governor of N...",New York (State) -- History -- Colonial period...,Abbott,John S. C. (John Stevens Cabot),1805,1877,"We regret to say, but\r\nhistory will bear us ...","We regret to say, but history will bear us ou...",[Great Britain]


In [37]:
df_new_york_history = add_toponym_count(df_new_york_history)

In [39]:
df_new_york_history

Unnamed: 0,cleaned_sentences,text_id,title,subjects,last_name,first_name,birth,death,sentences,toponyms,nltk_toponym_count
0,...,24712,The Negro at Work in New York City: A Study in...,African Americans -- History; African American...,Haynes,George Edmund,1880,1960,| |\r\n ...,"[Cincinnati, New ...",7
1,...,24712,The Negro at Work in New York City: A Study in...,African Americans -- History; African American...,Haynes,George Edmund,1880,1960,|\r\n | ...,[NEW YORK],53
2,...,24712,The Negro at Work in New York City: A Study in...,African Americans -- History; African American...,Haynes,George Edmund,1880,1960,|\r\n | ...,[Virginia],224
3,...,70129,General Washington's spies on Long Island and ...,"New York (N.Y.) -- History -- Revolution, 1775...",Pennypacker,Morton,1872,1956,* * * * *\r\n\r\n ...,[FAIRFIELD],9
4,...,24712,The Negro at Work in New York City: A Study in...,African Americans -- History; African American...,Haynes,George Edmund,1880,1960,| | ...,[America],212
...,...,...,...,...,...,...,...,...,...,...,...
12209,“Whenever I sit down I always feel and know my...,70129,General Washington's spies on Long Island and ...,"New York (N.Y.) -- History -- Revolution, 1775...",Pennypacker,Morton,1872,1956,“Whenever I sit down I always feel and know my...,[Letter],7
12210,“Whereas several of the inhabitants on the fe...,56078,"Notes Geographical and Historical, Relating to...","Brooklyn (New York, N.Y.) -- History",Furman,Gabriel,1800,1854,“Whereas several of the inhabitants\r\non the ...,[Breuckland],3
12211,“Whether there are any Works upon the Island o...,70129,General Washington's spies on Long Island and ...,"New York (N.Y.) -- History -- Revolution, 1775...",Pennypacker,Morton,1872,1956,“Whether there are any Works upon the Island o...,[Washington],753
12212,“Ye fyremasters” were also ordered to see that...,72327,Colonial days in old New York,New York (State) -- History -- Colonial period...,Earle,Alice Morse,1851,1911,“Y^e fyre-masters” were also ordered to see th...,[Flatbush],11


![](edit-code.png)

#### Filter out low toponym counts

Your dataframe now has the new variable `nltk_toponym_count`. You can filter out low count results to get fewer sentences. You can get a data frame for all cleaned sentences where the nltk_toponym_count is **greater** than 1.

In [96]:
#nltk_toponym_count = df_new_york_history
#if nltk_toponym_count > 1:
    

![](save.png)

#### (Optional) Save pickle of tokenization

In [50]:
df_new_york_history.to_pickle('df_new_york_history.pickle')

---

## 6 Geoparsing (Deep Scan)

### Overview

Since the deep scan for toponyms will likely reduce the size of the dataframe again, we can backload the sentiment analysis as the last step to ensure we don't process data unnecessarily.

In [49]:
from geoparser import Geoparser
from tqdm.notebook import tqdm


Because there are some compatibility issues with the `geoparser` package, there are pesky warnings that pop-up. These do not affect the output, but they are annoying. The line below filters these out of the console.

In [50]:
import warnings

# Suppress all FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Load Geoparser

To use Geoparser, instantiate an object of the Geoparser class with optional specifications for the spaCy model, transformer model, and gazetteer. By default, the library uses an accuracy-optimised configuration:

In [51]:
geo = Geoparser(spacy_model='en_core_web_trf', transformer_model='dguzh/geo-all-distilroberta-v1', gazetteer='geonames')

Load in the `geoparse_column` function to simplify the toponym recognition process.

In [54]:
def geoparse_column(df):
    sentences = df['cleaned_sentences'].tolist()  # Convert column to list
    docs = geo.parse(sentences, feature_filter=['A', 'P'])  # Run geo.parse on the entire list

    # Initialize lists to store the extracted fields
    places, latitudes, longitudes, feature_names = [], [], [], []

    # Iterate through the results and extract toponyms and their locations
    for doc in docs:
        doc_places = []
        doc_latitudes = []
        doc_longitudes = []
        doc_feature_names = []

        for toponym in doc.toponyms:
            if toponym.location:
                doc_places.append(toponym.location.get('name'))
                doc_latitudes.append(toponym.location.get('latitude'))
                doc_longitudes.append(toponym.location.get('longitude'))
                doc_feature_names.append(toponym.location.get('feature_name'))
            else:
                doc_places.append(None)
                doc_latitudes.append(None)
                doc_longitudes.append(None)
                doc_feature_names.append(None)

        # Append the extracted data for the document
        places.append(doc_places)
        latitudes.append(doc_latitudes)
        longitudes.append(doc_longitudes)
        feature_names.append(doc_feature_names)

    # Assign the extracted data to the DataFrame as new columns
    df['place'] = places
    df['latitude'] = latitudes
    df['longitude'] = longitudes
    df['feature_name'] = feature_names

    return df


In [55]:
geoparse_column(df_new_york_history)

Toponym Recognition...


Batches:   0%|          | 0/12214 [00:00<?, ?it/s]

Toponym Resolution...


Batches:   0%|          | 0/3906 [00:00<?, ?it/s]

Batches:   0%|          | 0/2477 [00:00<?, ?it/s]

Unnamed: 0,cleaned_sentences,text_id,title,subjects,last_name,first_name,birth,death,sentences,toponyms,nltk_toponym_count,place,latitude,longitude,feature_name
0,...,24712,The Negro at Work in New York City: A Study in...,African Americans -- History; African American...,Haynes,George Edmund,1880,1960,| |\r\n ...,"[Cincinnati, New ...",7,"[Cincinnati, New York City]","[39.12711, 40.71427]","[-84.51439, -74.00597]",[seat of a second-order administrative divisio...
1,...,24712,The Negro at Work in New York City: A Study in...,African Americans -- History; African American...,Haynes,George Edmund,1880,1960,|\r\n | ...,[NEW YORK],53,[New York City],[40.71427],[-74.00597],[populated place]
2,...,24712,The Negro at Work in New York City: A Study in...,African Americans -- History; African American...,Haynes,George Edmund,1880,1960,|\r\n | ...,[Virginia],224,[Virginia],[37.54812],[-77.44675],[first-order administrative division]
3,...,70129,General Washington's spies on Long Island and ...,"New York (N.Y.) -- History -- Revolution, 1775...",Pennypacker,Morton,1872,1956,* * * * *\r\n\r\n ...,[FAIRFIELD],9,[Pleasant Hill],[39.44338],[-90.87235],[populated place]
4,...,24712,The Negro at Work in New York City: A Study in...,African Americans -- History; African American...,Haynes,George Edmund,1880,1960,| | ...,[America],212,"[United States, New York]","[39.76, 43.00035]","[-98.5, -75.4999]","[independent political entity, first-order adm..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12209,“Whenever I sit down I always feel and know my...,70129,General Washington's spies on Long Island and ...,"New York (N.Y.) -- History -- Revolution, 1775...",Pennypacker,Morton,1872,1956,“Whenever I sit down I always feel and know my...,[Letter],7,[],[],[],[]
12210,“Whereas several of the inhabitants on the fe...,56078,"Notes Geographical and Historical, Relating to...","Brooklyn (New York, N.Y.) -- History",Furman,Gabriel,1800,1854,“Whereas several of the inhabitants\r\non the ...,[Breuckland],3,"[None, None, None]","[None, None, None]","[None, None, None]","[None, None, None]"
12211,“Whether there are any Works upon the Island o...,70129,General Washington's spies on Long Island and ...,"New York (N.Y.) -- History -- Revolution, 1775...",Pennypacker,Morton,1872,1956,“Whether there are any Works upon the Island o...,[Washington],753,"[None, City, None, Washington County]","[None, 18.24408, None, 43.3137]","[None, -77.4972, None, -73.43076]","[None, populated place, None, second-order adm..."
12212,“Ye fyremasters” were also ordered to see that...,72327,Colonial days in old New York,New York (State) -- History -- Colonial period...,Earle,Alice Morse,1851,1911,“Y^e fyre-masters” were also ordered to see th...,[Flatbush],11,[Flatbush],[40.65205],[-73.95903],[populated place]


![](save.png)

### Export Pickle 2

As the geoparsing process takes a long time, you should store it right after the result. You will also import these results for your `project_presentation_template`

```python
YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_PLACES.pickle')
```


In [None]:
df_new_york_history.to_pickle('df_new_york_history.pickle')

In [None]:
df_new_york_history.sample(5)

### Clean up the resulting dataframe

As with the previous instance of toponym resolution, there will be some rows that do not contain relevant information. This will slow down the sentiment analysis. 
1. Eliminate empty results

In [None]:
df_new_york_history = df_new_york_history[df_new_york_history['place'].str.len() != 0].copy()

---

## 7 Run Sentiment Analysis

### Overview

We will now implement the sentiment analysis on the remaining sentences.

Read step through and read all the prerequisites into memory.

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
from scipy.special import softmax
from typing import Dict, Any

In [None]:
df_new_york_history

In [None]:
# Initialize RoBERTa. There will probably be a warning. You can ignore this.
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [None]:
# Function to calculate RoBERTa sentiment scores
def polarity_scores_roberta(text: str) -> Dict[str, float]:
    """
    Calculate RoBERTa sentiment scores for a given text.
    
    Args:
    - text: The text to analyze
    
    Returns:
    - A dictionary with sentiment scores for negative, neutral, and positive sentiment
    """
    # Tokenize and truncate to max length (512 tokens)
    encoded_text = tokenizer.encode_plus(
        text, 
        max_length=512, 
        truncation=True, 
        return_tensors='pt'
    )
    
    # Get model output and convert to probabilities
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    
    return {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }


In [None]:
# Function to attach sentiment analysis to a specific column in the dataframe
def add_sentiment_to_column(
    df: pd.DataFrame, column_name: str, num_rows: int = None
) -> pd.DataFrame:
    """
    Adds RoBERTa sentiment analysis to a specified column in a dataframe.
    
    Args:
    - df: The dataframe to process
    - column_name: The name of the column containing the text to analyze
    - num_rows: The number of rows to process (default: 500)
    
    Returns:
    - df: A dataframe with added sentiment analysis columns
    """
        # If num_rows is specified, limit the dataframe, otherwise process all rows
    if num_rows:
        df_subset = df.head(num_rows).reset_index(drop=True)
    else:
        df_subset = df.reset_index(drop=True)  # Process all rows and reset the index
    
    # Function to process each row and add sentiment analysis
    def process_row(text: str) -> Dict[str, Any]:
        try:
            return polarity_scores_roberta(text)
        except Exception as e:
            print(f"Error processing text: {text}. Error: {e}")
            return {'roberta_neg': None, 'roberta_neu': None, 'roberta_pos': None}
    
    # Apply the RoBERTa sentiment analysis to each row
    tqdm.pandas(desc="Processing Sentiment Analysis")
    sentiment_scores = df_subset[column_name].progress_apply(process_row)
    
    # Convert the resulting list of dictionaries into a DataFrame and concatenate it with the original subset
    sentiment_df = pd.DataFrame(sentiment_scores.tolist())
    df_subset = pd.concat([df_subset, sentiment_df], axis=1)
    
    return df_subset

In [None]:
df_new_york_history_sentiment = add_sentiment_to_column(df_new_york_history, 'cleaned_sentences')


### Create an aggregate score

Since the roberta score is positive, negative, and neutral, we will have to consolidate it into one easier to understand score. We will take the difference between positive and negative, and multiply it by the percentage of neutral. This way if a score is very neutral it will even out the difference between positive and negative.

In [None]:
# Calculate the compound score and add it as a new column 'roberta_compound'
df_new_york_history_sentiment['roberta_compound'] = (
    df_new_york_history_sentiment['roberta_pos'] - df_new_york_history_sentiment['roberta_neg']
) * (1 - df_new_york_history_sentiment['roberta_neu'])


### Explode, filter, and aggregate

At the moment, there are places and sentiments, but since some of the sentences contain multiple places these need to be unnested.

In [None]:
df_new_york_history_sentiment = df_new_york_history_sentiment.explode(['place', 'latitude', 'longitude', 'feature_name'])

Remove empty values


In [None]:
df_new_york_history_sentiment = df_new_york_history_sentiment[df_new_york_history_sentiment.place.notna()]

Aggregate the data

In [None]:
df_new_york_history_sentiment = df_new_york_history_sentiment.groupby('place').agg(
    location_count=('place', 'size'),  # Count occurrences of each location
    latitude=('latitude', 'first'),    # Take the first latitude (you can also use 'mean')
    longitude=('longitude', 'first'),  # Take the first longitude (or 'mean')
    location=('feature_name','first'),
    avg_roberta_pos=('roberta_pos', 'mean'),  # Average of roberta_pos
    avg_roberta_neu=('roberta_neu', 'mean'),  # Average of roberta_neu
    avg_roberta_neg=('roberta_neg', 'mean'), # Average of roberta_neg
    avg_roberta_compound =('roberta_compound','mean')
).reset_index()

#### Create Histogram of Count Values (Optional)

To get a sense of how the data is distributed and to decide which data to include, you can create a histogram of the `location_count` column fairly easily.

In [64]:
import matplotlib.pyplot as plt
# You might need to install matplotlib with 
pip install matplotlib

df_new_york_history_sentiment.location_count.plot.hist(bins=10, alpha=0.7)

SyntaxError: invalid syntax (725985143.py, line 3)

Generally, the data will be very left skewed. You might want to filter out some of the lower values.

In [None]:
df_new_york_history_sentiment.sample(5)



### Filter out low counts

As very low counts will not show up on the map anyway, filter them out here. No code has been provided, but the procedure is essentially the same as before.

In [None]:
df_new_york_history_sentiment_filtered = df_new_york_history_sentiment[(df_new_york_history_sentiment.location_count > 20)]

### Bucket Data

In [None]:
df_new_york_history_sentiment_filtered.sample(5)


As we saw in lesson_5, the distribution of the data is tricky. We can solve this by bucketing it along the lines of Jenks Natural Breaks.

In [None]:
import mapclassify as mc #you may get an error. If so install mapclassify with pip install mapclassify

jenks_breaks = mc.NaturalBreaks(y=df_new_york_history_sentiment_filtered_SENTIMENTS['location_count'], k=5)
df_new_york_history_sentiment_filtered_SENTIMENTS.loc[:,'location_count_bucket'] = jenks_breaks.find_bin(df_new_york_history_sentiment_filtered['location_count'])+1

### Export Pickle 3

This is the final export of the file for the `project_presentation_template`

```python
YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_SENTIMENTS.pickle')
```


In [None]:
df_new_york_history_sentiment_filtered.to_pickle('df_new_york_history_sentiment_filtered_SENTIMENTS.pickle')

---

## Map your Data

### Overview

This is the core of the project. Use the stub below to map your data and then customize the map. I have deliberately set some of the values very poorly to encourage you to work on your own map!

In [None]:
import plotly.express as px

fig = px.scatter_mapbox(
    df_new_york_history_sentiment_filtered,  #put your dataframe here
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="location_count_bucket",        # Bubble size based on location count
    color="avg_roberta_compound",      # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.Twilight[::-1],  # Use Twilight scale (blue to red)
    size_max=30,                  # Maximum size of the bubbles
    center={"lat": 48, "lon": 2},
    zoom=6                       # Adjust zoom level for better visibility
)

# Update the layout to use the default map style (which doesn't need a token)
fig.update_layout(
    mapbox_style="open-street-map",  # No token needed for this style
    margin={"r":0,"t":0,"l":0,"b":0}  # Remove margins for a cleaner view
)



fig.show()

Happy mapping!