# Lesson 4: Sentiment Analysis on Toponym Sentences

## Overview

This lesson will cover two sentiment analysis methods:
- Using the **NLTK** library's VADER sentiment analysis tool.
- Using **Hugging Face's RoBERTa** model for sentiment analysis.

We will compare how these two tools perform on sentences containing toponyms, loaded from the `df_virginia_toponyms.pickle` file, and store the results in a **Pandas DataFrame** for further analysis. The key goal is to understand how different tools analyze sentiment, identify their limitations, and explore why their outputs might differ.

---

## 1. Loading the Dataset

We will begin by loading the data containing the sentences with toponyms into a dataframe.

In [3]:
import pandas as pd

In [4]:
df_virginia_toponyms = pd.read_pickle('df_virginia_toponyms.pickle')

### 1.1 More clean up...yes that's most of the work!

In the previous lesson, we discovered that toponyms for UPPER CASE sentences were mostly garbage. Let's drop those to reduce processing overhead.

In [6]:
df_virginia_toponyms = df_virginia_toponyms[~df_virginia_toponyms.cleaned_sentences.str.isupper()]
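
To see what this filter does, here is a minimal sketch on a toy dataframe. The three sentences are made up for illustration; only the column name `cleaned_sentences` comes from the lesson data.

```python
import pandas as pd

# Toy data (hypothetical sentences, not taken from the corpus)
toy = pd.DataFrame({"cleaned_sentences": [
    "CHAPTER II THE VIRGINIA COMPANY",
    "Smith sailed up the James River.",
    "TABLE OF CONTENTS",
]})

# str.isupper() is True when every cased character in the string is upper case
is_all_caps = toy["cleaned_sentences"].str.isupper()

# ~ flips the boolean mask, so we keep only the rows that are NOT all caps
kept = toy[~is_all_caps]
print(kept)
```

Note that the surviving row keeps its original index label (1), so after filtering the index is no longer a clean 0, 1, 2, ... sequence.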

#### 1.1.1 Reset the index

Since we extracted all of the sentences, the index (the leftmost column) is no longer a clean sequence. This wasn't an issue for the toponyms, but it will cause problems for the sentiment analysis.

In [8]:
df_virginia_toponyms = df_virginia_toponyms.reset_index(drop=True)
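
One way an irregular index can cause trouble: `pd.concat(..., axis=1)` lines rows up by index label, not by position. The sketch below uses toy data and hypothetical names (`sentences`, `scores`) to show what happens when new score columns with a fresh index are attached to a dataframe whose index has gaps.

```python
import pandas as pd

# Toy sentences whose index has gaps, as it might after filtering rows
sentences = pd.DataFrame({"cleaned_sentences": ["a", "b", "c"]}, index=[0, 2, 5])

# Freshly computed scores get a default 0, 1, 2 index
scores = pd.DataFrame({"score": [0.1, 0.2, 0.3]})

# Without a reset, labels do not match: extra rows and NaN values appear
misaligned = pd.concat([sentences, scores], axis=1)

# After reset_index(drop=True), both frames share 0, 1, 2 and line up row by row
aligned = pd.concat([sentences.reset_index(drop=True), scores], axis=1)

print(misaligned)
print(aligned)
```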

#### 1.1.2 Drop Unnecessary Columns

As the dataframe keeps getting wider and wider, we'll want to drop some unnecessary columns just so the view is manageable.

In [10]:
df_virginia_toponyms_compact = df_virginia_toponyms.drop(columns=['language', 'issued', 'type', 'locc', 'bookshelves', 'second_author']).copy()
df_virginia_toponyms_compact.sample(3)

| | text_id | title | subjects | last_name | first_name | birth | death | cleaned_sentences | toponyms |
|---|---|---|---|---|---|---|---|---|---|
| 29097 | 44229 | The Birth of the Nation, Jamestown, 1607 | Virginia -- History -- Colonial period, ca. 16... | Pryor | Sara Agnes Rice | 1830.0 | 1912.0 | Smith was sent overland to invite the Emperor... | [Smith] |
| 14106 | 32507 | The Planters of Colonial Virginia | Virginia -- History -- Colonial period, ca. 16... | Wertenbaker | Thomas Jefferson | 1879.0 | 1966.0 | Why, it was asked, should Englishmen be forc... | [America] |
| 4584 | 27117 | Tobacco in Colonial Virginia "The Sovereign Re... | Tobacco -- Virginia -- History; Virginia -- Hi... | Herndon | G. Melvin | | | In 1771 there were rumors that at least one h... | [Virginia] |


#### 1.1.3 Set Pandas column width to max

Some of the sentences are quite long, and to see them in full on the screen we need to set the column width to its maximum.

```python
pd.set_option('display.max_colwidth', None)
```
When we are done we can set this back to a more reasonable number by replacing `None` with an integer.

```python
pd.set_option('display.max_colwidth', 100)
```


In [12]:
pd.set_option('display.max_colwidth', None)

## 2. Sentiment Analysis with NLTK (VADER)

### 2.1 Overview
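VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. Rather than being trained like a neural model, it relies on a hand-built lexicon of word-level sentiment ratings, tuned largely on Twitter-style text, plus a handful of rules for things like negation, intensifiers, and punctuation. This makes it relatively speedy, but it also means it sees very little of a sentence's wider context.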



### 2.2 Loading VADER

In [15]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

You will only need to download the lexicon once.

In [17]:
#Download the 'vader_lexicon'
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\joost\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [18]:
# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

### 2.3 Using `sia.polarity_scores()`

The sentiment analyzer works by applying the VADER model to any text passed into the function `sia.polarity_scores()`. It returns a dictionary of scores (`neg`, `neu`, `pos`, and `compound`) for that particular phrase.

#### 2.3.1 Good Vibes!

In [21]:
sia.polarity_scores('JMU is the best university!')

{'neg': 0.0, 'neu': 0.471, 'pos': 0.529, 'compound': 0.6696}

#### 2.3.2 Bad Vibes!

In [23]:
sia.polarity_scores('UVA is not the best university!')

{'neg': 0.423, 'neu': 0.577, 'pos': 0.0, 'compound': -0.5661}
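
The `compound` value in the outputs above is a normalized sum of word-level valences. As far as I understand VADER's reference implementation, the raw sum `x` is squashed into the range -1 to 1 with `x / sqrt(x**2 + alpha)`, where `alpha` defaults to 15. The sketch below is a stand-alone illustration, not the library's code. The raw sums `3.492` and `-2.66` are my own reconstruction (a lexicon valence for "best", adjusted for the exclamation mark and, in the second case, for negation), not values printed by the notebook.

```python
import math

def normalize_sketch(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range -1..1 (illustrative)."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp to the closed interval, just in case of rounding noise
    return max(-1.0, min(1.0, score))

# Assumed raw sums (reconstructed, see the note above)
print(round(normalize_sketch(3.492), 4))   # 0.6696
print(round(normalize_sketch(-2.66), 4))   # -0.5661
```

Because the curve only approaches 1.0 and -1.0 as the raw sum grows, extreme compound scores need many strongly loaded words in one sentence.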

### 2.4 Critical Thinking Challenge

For the next activity, you are going to try to push the limits of the analyzer. For each challenge, think of a sentence that will get the scores you want, even if those scores don't make sense.


#### 2.4.1 Most Goodest Vibes

Try to create a sentence with a compound polarity score of 1.0.

In [26]:
sia.polarity_scores('')

{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

#### 2.4.2 Most Baddest Vibes

Try to create a sentence with a compound polarity score of -1.0, but keep it PG-13!

In [28]:
sia.polarity_scores('')

{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

#### 2.4.3 Most Strangest Vibes

Try to create a sentence with either a positive or negative compound score, but that means the exact opposite of what it says.

In [30]:
sia.polarity_scores('')

{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

### 2.5 Run VADER on all sentences

In [32]:
# Perform sentiment analysis on each sentence and store the compound score in the DataFrame
df_virginia_toponyms_compact['nltk_sentiment'] = df_virginia_toponyms_compact['cleaned_sentences'].apply(lambda x: sia.polarity_scores(x)['compound'])


See result

In [34]:

df_virginia_toponyms_compact[['cleaned_sentences','nltk_sentiment']].sample(5, random_state=50)


| | cleaned_sentences | nltk_sentiment |
|---|---|---|
| 39161 | Enemies accused him of profiting by the maladministration of his officials, and he himself confessed in a rather cynical letter to Lord Arlington that, while advancing years had taken away his ambition, they had left him covetous. | -0.6597 |
| 1233 | Oade. A thing of so great vent and vse amongst English Diers, which cannot bee yeelded sufficiently in our owne countrey for spare of ground may bee planted in Virginia, there being ground enough. | 0.7384 |
| 45299 | Mr. Rubsamen told me that lead ore is found on New River and the Greenbrier, copper on the Roanoke Dan, and iron everywhere about, particularly in Buckingham County. | 0.0 |
| 26192 | All seemed in a reverie, dreaming a long sweet dream of the past, and entering into the grief of the sisters, who lived afterward for many years in a pleasant home on a pleasant street in Richmond, with warm friends to serve them, yet their tears never ceased to flow at the mention of Mount Erin. | 0.8885 |
| 32279 | From the Ohio River to the sea, from North Carolina to the Pennsylvania line, the people of the commonwealth were stirred by the fervor of the campaign and the magnitude of the issues upon which they were called to pass. | 0.0 |


### 2.6 Evaluate the result

The compound score ranges from -1 to 1. The more negative a passage, the closer its score is to -1; the more positive, the closer to 1. Read through the passages above and try to figure out why these passages received the sentiments they did.
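
When reading the scores, it can help to turn them into labels. A common convention, suggested in VADER's own documentation, treats a compound score of 0.05 or higher as positive, -0.05 or lower as negative, and anything in between as neutral. The helper below is a hypothetical sketch (`label_compound` is not part of NLTK), and the cut-off is a parameter you can change.

```python
def label_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse label (illustrative convention)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the sample output above
for score in [-0.6597, 0.7384, 0.0, 0.8885]:
    print(score, label_compound(score))
```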

### 2.7 Critical Question

How effective is the VADER analyzer in dealing with sentiments in historical manuscripts?

## 3. Sentiment Analysis with Hugging Face (RoBERTa)

RoBERTa (Robustly Optimized BERT Pretraining Approach) is a transformer-based model that has been fine-tuned for sentiment analysis tasks. We will use Hugging Face's `transformers` library to analyze the sentiment of the toponym-containing sentences. This model is available on a site called [Hugging Face](https://huggingface.co/). Check out the sentiment models [here](https://huggingface.co/models?sort=trending&search=sentiment).

### 3.1 Prepping your system

You will need to install yet more libraries.

Open up a new terminal window in Jupyter and type the following commands:

- `pip install transformers`
- `pip install torch`
- `pip install scipy`
  

#### 3.1.1 Possible Warning

When running the import:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
```

You might get a warning message telling you to upgrade JupyterLab and ipywidgets. If that is the case, use the commands:

- `conda update jupyterlab`
- `conda install -c conda-forge ipywidgets`

### 3.2 Load Functions into Memory

Getting RoBERTa to code the sentiments is a fairly common procedure. There is a great in-depth video [here](https://www.youtube.com/watch?v=QpzMWQvxXWk). I have adapted and updated the code for newer versions of Python. The only thing you need to do is to load the functions into memory.

Step through the code blocks below. 

In [44]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
from scipy.special import softmax
from typing import Dict, Any

In [45]:
# Initialize RoBERTa. There will probably be a warning. You can ignore this.
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)



In [46]:
# Function to calculate RoBERTa sentiment scores

def polarity_scores_roberta(text: str) -> Dict[str, float]:
    """
    Calculate RoBERTa sentiment scores for a given text.
    
    Args:
    - text: The text to analyze
    
    Returns:
    - A dictionary with sentiment scores for negative, neutral, and positive sentiment
    """
    # Tokenize and truncate to max length (512 tokens)
    encoded_text = tokenizer.encode_plus(
        text, 
        max_length=512, 
        truncation=True, 
        return_tensors='pt'
    )
    
    # Get model output and convert to probabilities
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    
    return {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }
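
The `softmax` step in the function above turns the model's three raw output scores (logits) into probabilities that sum to 1. The lesson uses `scipy.special.softmax`; the pure-Python version below is only a sketch of the same idea, and the logits in it are made up rather than real model output.

```python
import math

def softmax_sketch(logits):
    """Convert raw scores into probabilities that sum to 1 (illustrative)."""
    # Subtracting the largest logit first avoids overflow in exp()
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order negative, neutral, positive
probs = softmax_sketch([-1.2, 2.3, 0.1])
print(probs)
print(sum(probs))
```

The largest logit always gets the largest probability, so the ranking of the three classes is unchanged; softmax only rescales them.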


In [47]:
# Function to attach sentiment analysis to a specific column in the dataframe
def add_sentiment_to_column(
    df: pd.DataFrame, column_name: str, num_rows: int = None
) -> pd.DataFrame:
    """
    Adds RoBERTa sentiment analysis to a specified column in a dataframe.
    
    Args:
    - df: The dataframe to process
    - column_name: The name of the column containing the text to analyze
    - num_rows: The number of rows to process (default: None, which processes all rows)
    
    Returns:
    - df: A dataframe with added sentiment analysis columns
    """
    # If num_rows is specified, limit the dataframe, otherwise process all rows
    if num_rows:
        df_subset = df.head(num_rows).reset_index(drop=True)
    else:
        df_subset = df.reset_index(drop=True)  # Process all rows and reset the index
    
    # Function to process each row and add sentiment analysis
    def process_row(text: str) -> Dict[str, Any]:
        try:
            return polarity_scores_roberta(text)
        except Exception as e:
            print(f"Error processing text: {text}. Error: {e}")
            return {'roberta_neg': None, 'roberta_neu': None, 'roberta_pos': None}
    
    # Apply the RoBERTa sentiment analysis to each row
    tqdm.pandas(desc="Processing Sentiment Analysis")
    sentiment_scores = df_subset[column_name].progress_apply(process_row)
    
    # Convert the resulting list of dictionaries into a DataFrame and concatenate it with the original subset
    sentiment_df = pd.DataFrame(sentiment_scores.tolist())
    df_subset = pd.concat([df_subset, sentiment_df], axis=1)
    
    return df_subset

#### 3.2.1 A Very Simple Explanation

The code blocks above are quite complex, but they essentially do one thing: add columns with sentiment scores to a dataframe that contains sentences. The function `add_sentiment_to_column()` is fairly straightforward and has three possible parameters: 

- `df` - The dataframe where you want to perform the function. In our case, `df_virginia_toponyms_compact`.
- `column_name` - The name of the column where the sentences are stored.
- `num_rows` - (Optional) Integer value of the number of rows you want to process. Since this is very processor intensive, it makes sense to be able to grab just a sample; the function takes the first `num_rows` rows. Leaving this out will process every row.

```python
df_virginia_toponym_sentiment_sample = add_sentiment_to_column(df_virginia_toponyms_compact, 'cleaned_sentences', num_rows=1000)
```
With that explanation in mind, what does the above line of code do?


### 3.3 Get a Sample Sentiment Column

In [50]:
df_virginia_toponym_sentiment_sample = add_sentiment_to_column(df_virginia_toponyms_compact, 'cleaned_sentences', num_rows=1000)

Processing Sentiment Analysis: 100%|███████████████████████████████████████████████| 1000/1000 [00:56<00:00, 17.73it/s]


In [51]:
df_virginia_toponym_sentiment_sample.to_pickle('df_virginia_toponym_sentiment_sample.pickle')

#### Evaluate the Sample

If you could not get the model to work, you can get the result by running this line of code:

```python
df_virginia_toponym_sentiment_sample = pd.read_pickle('df_virginia_toponym_sentiment_sample.pickle')
```

In [53]:
# Display the results for five randomly sampled rows
df_virginia_toponym_sentiment_sample[['cleaned_sentences', 'roberta_neg', 'roberta_neu', 'roberta_pos']].sample(5, random_state=47)

| | cleaned_sentences | roberta_neg | roberta_neu | roberta_pos |
|---|---|---|---|---|
| 530 | Captain Nathaniel Butler, who had once been Governor of the Somers Islands and had now returned to England by way of Virginia, published in London "The Unmasked Face of Our Colony in Virginia", containing a savage attack upon every item of Virginian administration. | 0.478677 | 0.49852 | 0.022803 |
| 926 | Blair sailed back to Virginia with the charter of the college, some money, a plan for the main building drawn by Christopher Wren, and for himself the office of President. | 0.022498 | 0.928171 | 0.049331 |
| 586 | Baltimore was a reflective man, a dreamer in the good sense of the term, and religiously minded. | 0.019395 | 0.408428 | 0.572177 |
| 25 | But Rembrandt was not born in Massachusetts people hardly ever do know where to be born until it is too late. | 0.428928 | 0.527906 | 0.043166 |
| 332 | Incontinently Smith was seized, dragged to a great stone lying before Powhatan, forced down and bound. | 0.500344 | 0.486323 | 0.013333 |
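
Three probabilities per sentence are harder to scan than one number. The sketch below shows two simple ways to summarize them: take the most likely class as a label, or subtract the negative probability from the positive one to get a signed score between -1 and 1. Both helpers are hypothetical conveniences, not part of `transformers`, and the signed score is just one possible convention, not an equivalent of VADER's compound score.

```python
def roberta_label(neg: float, neu: float, pos: float) -> str:
    """Return the name of the class with the highest probability."""
    scores = {"negative": neg, "neutral": neu, "positive": pos}
    return max(scores, key=scores.get)

def roberta_signed(neg: float, pos: float) -> float:
    """Collapse the probabilities to one signed number (pos minus neg)."""
    return pos - neg

# Probabilities copied from rows 530, 586 and 332 of the sample output above
rows = {
    530: (0.478677, 0.49852, 0.022803),
    586: (0.019395, 0.408428, 0.572177),
    332: (0.500344, 0.486323, 0.013333),
}
for idx, (neg, neu, pos) in rows.items():
    print(idx, roberta_label(neg, neu, pos), round(roberta_signed(neg, pos), 4))
```

Rows 530 and 332 are a reminder that the top label can hide a near tie between two classes, so it is worth looking at the probabilities too.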


#### 3.3.1 Critical Question

1. How did the model do?
2. Where would you dispute the sentiment?

#### 3.3.2 Critical Activity

1. Cycle through the samples by changing `random_state=` to a different integer.
2. Look through the sentences.
3. Identify a sentence where the language model does particularly well or poorly.
4. If you were not able to run the model, load in the sample pickle file below.


In [56]:
df_virginia_toponym_sentiment_sample = pd.read_pickle('df_virginia_toponym_sentiment_sample.pickle')

## 4. Creating the entire dataset

This process will take a very long time. I will create this data set for you, but if you ever want to do it on your own, the lines of code are below. Simply remove the `#` to uncomment them.

In [58]:

#df_virginia_toponym_sentiment_full = add_sentiment_to_column(df_virginia_toponyms_compact, 'cleaned_sentences')

In [59]:
#df_virginia_toponym_sentiment_full.to_pickle('df_virginia_toponym_sentiment_full.pickle')

In [60]:
df_virginia_toponym_sentiment_full = pd.read_pickle('df_virginia_toponym_sentiment_full.pickle')

In [61]:
df_virginia_toponym_sentiment_full.sample(5, random_state = 23)

| | text_id | title | subjects | last_name | first_name | birth | death | cleaned_sentences | toponyms | nltk_sentiment | roberta_neg | roberta_neu | roberta_pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14360 | 32507 | The Planters of Colonial Virginia | Virginia -- History -- Colonial period, ca. 1600-1775; Slavery -- Virginia; Virginia -- Economic conditions | Wertenbaker | Thomas Jefferson | 1879 | 1966 | If confined to England alone, only a fraction of the output could be consumed and disaster was certain. | [England] | -0.6124 | 0.692587 | 0.286966 | 0.020447 |
| 32351 | 46026 | Virginia's Attitude Toward Slavery and Secession | United States -- Politics and government -- 1861-1865; Slavery -- Virginia; Virginia -- Politics and government -- 1861-1865 | Munford | Beverley B. (Beverley Bland) | 1856 | 1910 | " The Governors of Kentucky, Missouri, Arkansas, Tennessee and North Carolina returned like answers to the requisitions of the Federal authorities for troops. | [Missouri, Arkansas, Tennessee, North Carolina] | 0.3612 | 0.075799 | 0.871962 | 0.052239 |
| 7834 | 29055 | The Present State of Virginia | Indians of North America -- Virginia -- Early works to 1800; Slavery -- Virginia -- Early works to 1800; African Americans -- Virginia -- Early works to 1800; Virginia -- Description and travel -- Early works to 1800 | Jones | Hugh | 1669 | 1760 | Near this is a large Octogon Tower, which is the Magazine or Repository of Arms and Ammunition, landing far from any House except James Town CourtHouse for the Town is half in James Town County, and half in York County. | [York County] | 0.0 | 0.073226 | 0.88394 | 0.042834 |
| 23821 | 39148 | How Justice Grew: Virginia Counties, An Abstract of Their Formation | Virginia -- History; Counties -- Virginia -- History; Virginia -- History, Local | Hiden | Martha W. (Martha Woodroof) | 1883 | 1959 | Wood and Harrison are also West Virginia counties. | [West Virginia] | 0.0 | 0.069118 | 0.890772 | 0.04011 |
| 45413 | 63221 | Travels in Virginia in Revolutionary Times | Virginia -- Description and travel | Morrison | Alfred J. (Alfred James) | 1876 | 1923 | Taking a road, however, as nearly as I could guess, in a direct line from the river up the country, at the end of an hour I came upon a narrow road, which led to a large old brick house, somewhat similar to those I had met with on the Maryland shore. | [Maryland] | 0.0 | 0.023312 | 0.814797 | 0.161891 |


### 4.1 Check performance

We can check the performance of both tools by looking up "edge cases" where one gives a negative evaluation and the other positive. 

#### 4.1.1 Critical Activity
How would we go about this? I have stubbed out some of the code below.

In [63]:
df_virginia_toponym_sentiment_full[(df_virginia_toponym_sentiment_full.nltk_sentiment )&
                                    (df_virginia_toponym_sentiment_full.roberta_pos )]

TypeError: unsupported operand type(s) for &: 'float' and 'float'
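
The stub fails because `&` needs a boolean condition on each side, not the raw float columns. One possible way to complete it, shown here on a toy dataframe with made-up scores, is to compare each column against a cut-off. The ±0.05 thresholds and the "most likely RoBERTa class" rule are assumptions for illustration; other choices are just as defensible.

```python
import pandas as pd

# Toy scores (hypothetical values; only the column names match the lesson's dataframe)
toy = pd.DataFrame({
    "nltk_sentiment": [-0.61, 0.74, -0.55, 0.00],
    "roberta_neg":    [0.69, 0.60, 0.05, 0.07],
    "roberta_neu":    [0.29, 0.30, 0.25, 0.88],
    "roberta_pos":    [0.02, 0.10, 0.70, 0.05],
})

# Name of the RoBERTa column with the highest probability in each row
roberta_top = toy[["roberta_neg", "roberta_neu", "roberta_pos"]].idxmax(axis=1)

# Each side of & is a boolean Series, so the combined mask is valid
vader_neg_roberta_pos = toy[(toy.nltk_sentiment <= -0.05) & (roberta_top == "roberta_pos")]
vader_pos_roberta_neg = toy[(toy.nltk_sentiment >= 0.05) & (roberta_top == "roberta_neg")]

print(vader_neg_roberta_pos)
print(vader_pos_roberta_neg)
```

The parentheses around each comparison matter: `&` binds more tightly than `<=` or `==` in Python, so leaving them out changes the meaning of the expression.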

## 5. Analyzing the Differences

### Differences Between NLTK and RoBERTa:
1. **NLTK (VADER)**:
    - Uses a lexicon-based approach.
    - Performs well on short social media-style texts, but may not capture the full context in longer, more complex sentences.

2. **RoBERTa**:
    - Uses a transformer-based deep learning model, which can better understand context.
    - However, RoBERTa can sometimes be biased towards its training data (in this case, Twitter-based sentiments).

### Why Might These Differences Occur?
- **Context Understanding**: RoBERTa is a neural network model that reads each word in the context of the whole sentence, allowing it to grasp nuances better than NLTK's VADER.
- **Lexicon Limitations**: NLTK relies on predefined dictionaries of words, which means it may miss certain contextual clues or interpret complex sentences inaccurately.
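
The lexicon limitation can be illustrated with a deliberately tiny word-counting scorer. This is a toy sketch, not VADER: the word list and its valences are invented, and the sentence is abridged from one of the sampled passages earlier in the lesson.

```python
# Hypothetical mini-lexicon; these valences are made up, not VADER's values
TOY_LEXICON = {"sweet": 2.0, "pleasant": 2.0, "warm": 1.5, "grief": -2.5, "tears": -1.5}

def toy_lexicon_score(sentence: str) -> float:
    """Sum the valence of every known word, ignoring word order and context."""
    words = [w.strip(".,;:!?\"'").lower() for w in sentence.split()]
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)

sentence = (
    "dreaming a long sweet dream of the past, and entering into the grief "
    "of the sisters, who lived in a pleasant home on a pleasant street, "
    "with warm friends, yet their tears never ceased to flow"
)
print(toy_lexicon_score(sentence))  # 3.5
```

The passage is about lasting grief, yet the pleasant-sounding words outnumber the sad ones, so a pure word count comes out positive.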

### Limitations:
- **RoBERTa**: While more accurate in many cases, it might be overfitted to specific domains (e.g., Twitter), which could skew its results on historical or formal texts.
- **NLTK**: Fast and simple, but its lexicon-based approach might not always provide a detailed or accurate sentiment analysis in nuanced contexts.


## 6. Conclusion

By comparing NLTK’s VADER and Hugging Face's RoBERTa, we can see that different sentiment analysis tools offer different strengths. NLTK’s rule-based system is fast and straightforward but can miss complex sentiment cues. RoBERTa, being a transformer model, performs better on context-heavy sentences but can sometimes be biased by its training data. 
