# Lesson 4: Sentiment Analysis on Toponym Sentences

## Overview

This lesson will cover two sentiment analysis methods:
- Using the **NLTK** library's VADER sentiment analysis tool.
- Using **Hugging Face's RoBERTa** model for sentiment analysis.

We will compare how these two tools perform on sentences containing toponyms, loaded from the `df_virginia_toponyms.pickle` file, and we will store the results in a **Pandas DataFrame** for further analysis. The key goal is to understand how different tools analyze sentiment, identify their limitations, and explore why their outputs might differ.

---

## 1. Loading the Dataset

We will begin by loading the data containing the sentences with toponyms into a dataframe.

In [4]:
import pandas as pd

In [5]:
df_virginia_toponyms = pd.read_pickle('df_virginia_toponyms.pickle')

### 1.1 More clean up...yes that's most of the work!

In the previous lesson, we discovered that toponyms for UPPER CASE sentences were mostly garbage. Let's drop those to reduce processing overhead.

In [7]:
df_virginia_toponyms = df_virginia_toponyms[~df_virginia_toponyms.cleaned_sentences.str.isupper()]
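To see what this filter does, here is a minimal sketch on a made-up three-row dataframe (the sentences are invented for illustration, not taken from the corpus). `str.isupper()` is True only when every cased character in a string is upper case, and `~` inverts the mask so those rows are dropped.

```python
import pandas as pd

# Toy data: invented sentences, not taken from the corpus
toy = pd.DataFrame({
    'cleaned_sentences': [
        'CHAPTER II THE VIRGINIA COMPANY',
        'Smith sailed up the James River.',
        'THE END',
    ]
})

# True where every cased character in the sentence is upper case
is_all_caps = toy.cleaned_sentences.str.isupper()

# ~ flips the mask, so only the mixed-case sentences are kept
kept = toy[~is_all_caps]
print(kept)
```

Notice that the surviving row keeps its old index label (1) rather than being renumbered from 0.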

#### 1.1.1 Reset the index

Since we extracted all of the sentences (and have now dropped some rows), the index (the leftmost column) is no longer a clean 0, 1, 2, ... sequence. This wasn't an issue for the toponyms, but it will cause problems for the sentiment analysis.

In [9]:
df_virginia_toponyms = df_virginia_toponyms.reset_index(drop=True)
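Why does the index matter? Pandas lines rows up by index label when it combines dataframes side by side, which is what happens later when score columns are attached. The sketch below uses invented toy data to show the misalignment that an index with gaps can cause, and how `reset_index(drop=True)` avoids it.

```python
import pandas as pd

# Toy data: a dataframe whose index has gaps, as it would after filtering
filtered = pd.DataFrame({'sentence': ['a', 'b', 'c']}, index=[0, 2, 5])

# New scores built from a plain list get a fresh 0, 1, 2 index
scores = pd.DataFrame({'score': [0.1, 0.2, 0.3]})

# concat aligns on index labels, so the gaps produce rows with missing values
misaligned = pd.concat([filtered, scores], axis=1)

# After resetting the index, the rows line up one-to-one
aligned = pd.concat([filtered.reset_index(drop=True), scores], axis=1)

print(misaligned)
print(aligned)
```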

#### 1.1.2 Drop Unnecessary Columns

As the dataframe keeps getting wider and wider, we'll want to drop some unnecessary columns just so the view is manageable.

In [11]:
df_virginia_toponyms_compact = df_virginia_toponyms.drop(columns=['language', 'issued', 'type', 'locc', 'bookshelves', 'second_author']).copy()
df_virginia_toponyms_compact.sample(3)

Unnamed: 0,text_id,title,subjects,last_name,first_name,birth,death,cleaned_sentences,toponyms
6418,28555,"The Virginia Company Of London, 1606-1624","Virginia -- History -- Colonial period, ca. 16...",Craven,Wesley Frank,1905.0,1981.0,It is not easy for the modern American to read...,[Virginia]
24253,40044,Journal and Letters of Philip Vickers Fithian:...,Virginia -- Social life and customs -- To 1775...,Fithian,Philip Vickers,1747.0,1776.0,The sounds very much resemble the human voice...,[Organ]
30844,45233,History of the Twelfth West Virginia Volunteer...,"United States -- History -- Civil War, 1861-18...",Hewitt,William,,,But it was here that the Twelfth won its eagle...,[Twelfth]


#### 1.1.3 Set Pandas column width to max

Some of the sentences are quite long. To see them in full on the screen, we need to set the column width to its maximum.

```python
pd.set_option('display.max_colwidth', None)
```
When we are done we can set this back to a more reasonable number by replacing `None` with an integer.

```python
pd.set_option('display.max_colwidth', 100)
```


In [13]:
pd.set_option('display.max_colwidth', None)
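If you would rather not change the setting for the whole notebook, one alternative is `pd.option_context`, which applies a display option only inside a `with` block. This is an optional sketch, not part of the lesson's workflow; the toy sentence is invented.

```python
import pandas as pd

# One long invented sentence, just to have something worth truncating
toy = pd.DataFrame({'cleaned_sentences': ['word ' * 40]})

before = pd.get_option('display.max_colwidth')

# Inside this block, long cells are shown in full
with pd.option_context('display.max_colwidth', None):
    full_view = repr(toy)

# Inside this block, long cells are cut off
with pd.option_context('display.max_colwidth', 20):
    short_view = repr(toy)

# Outside the blocks, the original setting is back in place
after = pd.get_option('display.max_colwidth')
print(len(full_view), len(short_view), before == after)
```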

## 2. Sentiment Analysis with NLTK (VADER)

### 2.1 Overview
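VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is not a trained model: it looks up each word in a sentiment lexicon that was built and validated largely on Twitter-style text, then applies a handful of rules for things like negation, intensifiers, capitalization, and punctuation. This makes it relatively speedy, but it also means it has very little sense of context, which causes some issues.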
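To make "rule-based" and "sentiment-per-word" concrete, here is a toy sketch. The three-word lexicon and the single negation rule are invented for illustration and are far simpler than VADER's real lexicon and heuristics. The final step, dividing the summed score by the square root of (score squared plus 15), mirrors the way VADER squashes a raw total into the -1 to 1 range; everything else is an assumption.

```python
import math

# Hypothetical mini-lexicon: word -> valence (values invented for illustration)
TOY_LEXICON = {'best': 3.0, 'good': 1.9, 'terrible': -2.1}
NEGATIONS = {'not', 'never', 'no'}

def toy_compound(sentence: str, alpha: float = 15.0) -> float:
    """Sum word valences, flip after a negation, then normalize to (-1, 1)."""
    words = sentence.lower().replace('!', '').replace('.', '').split()
    total = 0.0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word, 0.0)
        # Look back up to three words for a negation (a simplified rule)
        if valence and NEGATIONS & set(words[max(0, i - 3):i]):
            valence *= -0.74  # dampened sign flip, treated here as an assumption
        total += valence
    # Squash the raw sum into the -1..1 range
    return total / math.sqrt(total * total + alpha)

print(toy_compound('JMU is the best university!'))
print(toy_compound('UVA is not the best university!'))
```

The toy scorer never sees more than a few neighboring words, which is the core limitation to keep in mind when the real tool is applied to long historical sentences.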



### 2.2 Loading VADER

In [16]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

You will only need to download the lexicon once.

In [18]:
#Download the 'vader_lexicon'
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\joost\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [19]:
# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

### 2.3 Using `sia.polarity_scores()`

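The sentiment analyzer works by applying the VADER model to any text passed into the function `sia.polarity_scores()`. It returns a dictionary with four scores for that text: the proportions of the text rated negative (`neg`), neutral (`neu`), and positive (`pos`), plus a normalized overall `compound` score.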

#### 2.3.1 Good Vibes!

In [22]:
sia.polarity_scores('JMU is the best university!')

{'neg': 0.0, 'neu': 0.471, 'pos': 0.529, 'compound': 0.6696}

#### 2.3.2 Bad Vibes!

In [24]:
sia.polarity_scores('UVA is not the best university!')

{'neg': 0.423, 'neu': 0.577, 'pos': 0.0, 'compound': -0.5661}
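If you want a single label instead of four numbers, one common convention is to call a text positive when `compound` is at least 0.05, negative when it is at most -0.05, and neutral in between. The sketch below applies that convention to the two outputs above. The function name `label_from_compound` is hypothetical, and the 0.05 cutoff is a rule of thumb rather than something this lesson depends on.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Turn a VADER compound score into a coarse label (threshold is a convention)."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# The two dictionaries returned in the cells above
good_vibes = {'neg': 0.0, 'neu': 0.471, 'pos': 0.529, 'compound': 0.6696}
bad_vibes = {'neg': 0.423, 'neu': 0.577, 'pos': 0.0, 'compound': -0.5661}

print(label_from_compound(good_vibes['compound']))
print(label_from_compound(bad_vibes['compound']))
```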

### 2.4 Critical Thinking Challenge

For the next activity, you are going to try to push the limits of the analyzer. For each challenge, think of a sentence that will get the scores you want, even if those scores don't make sense.


#### 2.4.1 Most Goodest Vibes

Try to create a sentence with a compound polarity score of 1.0.

In [27]:
sia.polarity_scores('')

{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

#### 2.4.2 Most Baddest Vibes

Try to create a sentence with a compound polarity score of -1.0, but keep it PG-13!

In [29]:
sia.polarity_scores('')

{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

#### 2.4.3 Most Strangest Vibes

Try to create a sentence with either a positive or negative compound score, but that means the exact opposite of what it says.

In [31]:
sia.polarity_scores('')

{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

### 2.5 Run VADER on all sentences.

In [33]:
# Perform sentiment analysis on each sentence and store the compound score in the DataFrame
df_virginia_toponyms_compact['nltk_sentiment'] = df_virginia_toponyms_compact['cleaned_sentences'].apply(lambda x: sia.polarity_scores(x)['compound'])
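The line above keeps only the `compound` score. If you ever want all four scores as separate columns, one possible approach is sketched below. The helper `expand_scores` is hypothetical, and `stub_scorer` is a stand-in that returns fixed values so the example runs without NLTK; in the notebook you would pass `sia.polarity_scores` instead.

```python
import pandas as pd

def expand_scores(df, column, scorer, prefix='nltk_'):
    """Apply a scorer that returns a dict and add one column per dict key."""
    scores = pd.DataFrame(df[column].map(scorer).tolist(), index=df.index)
    return df.join(scores.add_prefix(prefix))

def stub_scorer(text):
    # Stand-in for sia.polarity_scores: always returns a neutral result
    return {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

# Invented sentences for illustration
toy = pd.DataFrame({'cleaned_sentences': ['First invented sentence.',
                                          'Second invented sentence.']})
expanded = expand_scores(toy, 'cleaned_sentences', stub_scorer)
print(expanded.columns.tolist())
```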


See result

In [35]:

df_virginia_toponyms_compact[['cleaned_sentences','nltk_sentiment']].sample(5, random_state=50)


Unnamed: 0,cleaned_sentences,nltk_sentiment
39161,"Enemies accused him of profiting by the maladministration of his officials, and he himself confessed in a rather cynical letter to Lord Arlington that, while advancing years had taken away his ambition, they had left him covetous.",-0.6597
1233,"Oade. A thing of so great vent and vse amongst English Diers, which cannot bee yeelded sufficiently in our owne countrey for spare of ground may bee planted in Virginia, there being ground enough.",0.7384
45299,"Mr. Rubsamen told me that lead ore is found on New River and the Greenbrier, copper on the Roanoke Dan, and iron everywhere about, particularly in Buckingham County.",0.0
26192,"All seemed in a reverie, dreaming a long sweet dream of the past, and entering into the grief of the sisters, who lived afterward for many years in a pleasant home on a pleasant street in Richmond, with warm friends to serve them, yet their tears never ceased to flow at the mention of Mount Erin.",0.8885
32279,"From the Ohio River to the sea, from North Carolina to the Pennsylvania line, the people of the commonwealth were stirred by the fervor of the campaign and the magnitude of the issues upon which they were called to pass.",0.0


### 2.6 Evaluate the result

The compound score ranges from -1 to 1. A very negative passage scores close to -1, a very positive passage scores close to 1, and a neutral one scores near 0. Read through the passages above and try to figure out why these passages received the sentiments they did.

### 2.7 Critical Question

How effective is the VADER analyzer in dealing with sentiments in historical manuscripts?

## 3. Sentiment Analysis with Hugging Face (RoBERTa)

RoBERTa (Robustly Optimized BERT Pretraining Approach) is a transformer-based language model. The version we use here has been fine-tuned for sentiment analysis on tweets. We will use Hugging Face's `transformers` library to analyze the sentiment of the toponym-containing sentences. This model is available on a site called [Hugging Face](https://huggingface.co/). Check out the sentiment models [here](https://huggingface.co/models?sort=trending&search=sentiment).

### 3.1 Prepping your system

You will need to install yet more libraries.

Open up a new terminal window in Jupyter and type the following commands:

- `pip install transformers`
- `pip install torch`
- `pip install scipy`
  

#### 3.1.1 Possible Warning

When running the import:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
```

You might get a warning message telling you to upgrade JupyterLab and ipywidgets. If that is the case, use these commands:

- `conda update jupyterlab`
- `conda install -c conda-forge ipywidgets`
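Before moving on, it can help to confirm that the libraries are visible to the Python environment your notebook is using. The check below is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical.

```python
import importlib.util

def missing_packages(names):
    """Return the packages from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means everything needed for this section is installed
print(missing_packages(['transformers', 'torch', 'scipy', 'tqdm']))
```

If the list is not empty, install the named packages from the terminal and restart the kernel.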

### 3.2 Load Functions into memory

Getting RoBERTa to score sentiment is a fairly common procedure. There is a great in-depth video [here](https://www.youtube.com/watch?v=QpzMWQvxXWk). I have adapted and updated the code for newer versions of Python. The only thing you need to do is to load the functions into memory.

Step through the code blocks below. 

In [45]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
from scipy.special import softmax
from typing import Dict, Any

In [46]:
# Initialize RoBERTa. There will probably be a warning. You can ignore this.
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)



In [90]:
# Function to calculate RoBERTa sentiment scores

def polarity_scores_roberta(text: str) -> Dict[str, float]:
    """
    Calculate RoBERTa sentiment scores for a given text.
    
    Args:
    - text: The text to analyze
    
    Returns:
    - A dictionary with sentiment scores for negative, neutral, and positive sentiment
    """
    # Tokenize and truncate to the model's maximum length (512 tokens)
    encoded_text = tokenizer(
        text,
        max_length=512,
        truncation=True,
        return_tensors='pt'
    )

    # Run the model, then convert the raw logits to probabilities
    output = model(**encoded_text)
    scores = output.logits[0].detach().numpy()
    scores = softmax(scores)

    # Label order for this model: 0 = negative, 1 = neutral, 2 = positive
    return {
        'roberta_neg': float(scores[0]),
        'roberta_neu': float(scores[1]),
        'roberta_pos': float(scores[2])
    }
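The model itself returns three raw numbers (logits), one per class, and the softmax step turns them into probabilities that add up to 1. The sketch below reimplements softmax with NumPy on invented logits to show what that conversion does; the function in this lesson uses `scipy.special.softmax` for the same job.

```python
import numpy as np

def softmax_1d(logits: np.ndarray) -> np.ndarray:
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

# Invented logits in the order negative, neutral, positive
logits = np.array([-1.2, 0.3, 2.1])
probs = softmax_1d(logits)

labels = ['roberta_neg', 'roberta_neu', 'roberta_pos']
print(dict(zip(labels, probs.round(3))))
```

Because the three values always sum to 1, a sentence cannot score high on all three classes at once, unlike VADER's single compound score, which is computed separately from its `neg`, `neu`, and `pos` proportions.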


In [92]:
# Function to attach sentiment analysis to a specific column in the dataframe
def add_sentiment_to_column(
    df: pd.DataFrame, column_name: str, num_rows: int = None
) -> pd.DataFrame:
    """
    Adds RoBERTa sentiment analysis to a specified column in a dataframe.
    
    Args:
    - df: The dataframe to process
    - column_name: The name of the column containing the text to analyze
    - num_rows: The number of rows to process, taken from the top (default: None, which processes every row)
    
    Returns:
    - df: A dataframe with added sentiment analysis columns
    """
    # If num_rows is specified, limit the dataframe; otherwise process all rows
    if num_rows:
        df_subset = df.head(num_rows).reset_index(drop=True)
    else:
        df_subset = df.reset_index(drop=True)  # Process all rows and reset the index
    
    # Function to process each row and add sentiment analysis
    def process_row(text: str) -> Dict[str, Any]:
        try:
            return polarity_scores_roberta(text)
        except Exception as e:
            print(f"Error processing text: {text}. Error: {e}")
            return {'roberta_neg': None, 'roberta_neu': None, 'roberta_pos': None}
    
    # Apply the RoBERTa sentiment analysis to each row
    tqdm.pandas(desc="Processing Sentiment Analysis")
    sentiment_scores = df_subset[column_name].progress_apply(process_row)
    
    # Convert the resulting list of dictionaries into a DataFrame and concatenate it with the original subset
    sentiment_df = pd.DataFrame(sentiment_scores.tolist())
    df_subset = pd.concat([df_subset, sentiment_df], axis=1)
    
    return df_subset

### 3.2 A Very Simple Explanation

The code blocks above are quite complex, but they essentially do one thing: add columns with sentiment scores to a dataframe that contains sentences. The function `add_sentiment_to_column()` is fairly straightforward and has three possible parameters:

- `df` - The dataframe where you want to perform the function. In our case, `df_virginia_toponyms_compact`
- `column_name` - The name of the column where the sentences are stored
- `num_rows` - (Optional) Integer value of the number of rows you want to process. Since this is very processor intensive, it makes sense to be able to grab just the first `num_rows` rows as a sample. Leaving this out will process every row.

```python
df_virginia_toponym_sentiment_sample = add_sentiment_to_column(df_virginia_toponyms_compact, 'cleaned_sentences', num_rows=1000)
```
With that explanation in mind, what does the above line of code do?


### 3.3 Get a Sample Sentiment Column

In [51]:
df_virginia_toponym_sentiment_sample = add_sentiment_to_column(df_virginia_toponyms_compact, 'cleaned_sentences', num_rows=1000)

Processing Sentiment Analysis: 100%|███████████████████████████████████████████████| 1000/1000 [00:58<00:00, 17.16it/s]


In [52]:
df_virginia_toponym_sentiment_sample.to_pickle('df_virginia_toponym_sentiment_sample')

#### Evaluate the Sample

If you could not get the model to work, you can get the result by running this line of code:

```python
df_virginia_toponym_sentiment_sample = pd.read_pickle('df_virginia_toponym_sentiment_sample')
```

In [54]:
# Display the results for a random sample of five rows
df_virginia_toponym_sentiment_sample[['cleaned_sentences', 'roberta_neg', 'roberta_neu', 'roberta_pos']].sample(5, random_state=47)

Unnamed: 0,cleaned_sentences,roberta_neg,roberta_neu,roberta_pos
530,"Captain Nathaniel Butler, who had once been Governor of the Somers Islands and had now returned to England by way of Virginia, published in London ""The Unmasked Face of Our Colony in Virginia"", containing a savage attack upon every item of Virginian administration.",0.478677,0.49852,0.022803
926,"Blair sailed back to Virginia with the charter of the college, some money, a plan for the main building drawn by Christopher Wren, and for himself the office of President.",0.022498,0.928171,0.049331
586,"Baltimore was a reflective man, a dreamer in the good sense of the term, and religiously minded.",0.019395,0.408428,0.572177
25,But Rembrandt was not born in Massachusetts people hardly ever do know where to be born until it is too late.,0.428928,0.527906,0.043166
332,"Incontinently Smith was seized, dragged to a great stone lying before Powhatan, forced down and bound.",0.500344,0.486323,0.013333
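Three probabilities per sentence can be hard to scan. One way to summarize them, sketched below with the five rows shown above, is to take the highest-scoring class as the label and record how far ahead of the runner-up it is. The column names `roberta_label` and `roberta_margin` are hypothetical additions, not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# The five sampled rows from the output above (scores only)
sample = pd.DataFrame(
    {
        'roberta_neg': [0.478677, 0.022498, 0.019395, 0.428928, 0.500344],
        'roberta_neu': [0.498520, 0.928171, 0.408428, 0.527906, 0.486323],
        'roberta_pos': [0.022803, 0.049331, 0.572177, 0.043166, 0.013333],
    },
    index=[530, 926, 586, 25, 332],
)
score_cols = ['roberta_neg', 'roberta_neu', 'roberta_pos']

# The winning class for each sentence
sample['roberta_label'] = sample[score_cols].idxmax(axis=1)

# Gap between the top two scores: a small margin means the model is on the fence
sorted_scores = np.sort(sample[score_cols].to_numpy(), axis=1)
sample['roberta_margin'] = sorted_scores[:, -1] - sorted_scores[:, -2]

print(sample[['roberta_label', 'roberta_margin']])
```

Rows 530 and 332 have margins under 0.02, so the model is nearly split between negative and neutral for both, even though they end up with different labels.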


#### 3.3.1 Critical Question

1. How did the model do?
2. Where would you dispute the sentiment?

#### 3.3.2 Critical Activity

1. Cycle through the samples by changing `random_state=` to a different integer.
2. Look through the sentences.
3. Identify a sentence where the language model does particularly well or poorly.
4. If you were not able to run the model, load the sample pickle file below.


In [57]:
df_virginia_toponym_sentiment_sample = pd.read_pickle('df_virginia_toponym_sentiment_sample')

## 4. Creating the entire dataset

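This process will take a very long time (about 52 minutes for the 45,972 sentences in the run shown below). I will create this dataset for you, but if you ever want to do it on your own, the line of code is below. If it is commented out in your copy of the notebook, simply remove the hashtag to uncomment it.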

In [94]:

df_virginia_toponym_sentiment_full = add_sentiment_to_column(df_virginia_toponyms_compact, 'cleaned_sentences')

Processing Sentiment Analysis: 100%|█████████████████████████████████████████████| 45972/45972 [52:29<00:00, 14.59it/s]


In [96]:
df_virginia_toponym_sentiment_full.to_pickle('df_virginia_toponym_sentiment_full.pickle')

In [98]:
df_virginia_toponym_sentiment_full

Unnamed: 0,text_id,title,subjects,last_name,first_name,birth,death,cleaned_sentences,toponyms,nltk_sentiment,roberta_neg,roberta_neu,roberta_pos
0,2674,The Complete Writings of Charles Dudley Warner — Volume 4,"Autobiographies; Virginia -- Description and travel; North Carolina -- Description and travel; Tennessee -- Description and travel; Mexico -- Description and travel; Boys -- Biography; Warner, Charles Dudley, 1829-1900 -- Travel -- Southern States; Appalachian Region -- Description and travel; Warner, Charles Dudley, 1829-1900 -- Travel -- Appalachian Region; Warner, Charles Dudley, 1829-1900 -- Travel -- Mexico",Warner,Charles Dudley,1829,1900,"Title The Complete Writings of Charles Dudley Warner Volume 4 Author Charles Dudley Warner June, 2001 Project Gutenberg The Complete Writings of Charles Dudley Warner This file should be named 2674.txt or 2674.zip This etext was prepared by David Widger, widgercecomet.net Project Gutenberg Etexts are usually created from multiple editions, all of which are in the Public Domain in the United States, unless a copyright notice is included.","[the Public Domain, the United States]",0.6908,0.106633,0.790353,0.103014
1,2674,The Complete Writings of Charles Dudley Warner — Volume 4,"Autobiographies; Virginia -- Description and travel; North Carolina -- Description and travel; Tennessee -- Description and travel; Mexico -- Description and travel; Boys -- Biography; Warner, Charles Dudley, 1829-1900 -- Travel -- Southern States; Appalachian Region -- Description and travel; Warner, Charles Dudley, 1829-1900 -- Travel -- Appalachian Region; Warner, Charles Dudley, 1829-1900 -- Travel -- Mexico",Warner,Charles Dudley,1829,1900,"The Goal of Project Gutenberg is to Give Away One Trillion Etext Files by December 31, 2001.",[Files],0.0000,0.137262,0.606347,0.256391
2,2674,The Complete Writings of Charles Dudley Warner — Volume 4,"Autobiographies; Virginia -- Description and travel; North Carolina -- Description and travel; Tennessee -- Description and travel; Mexico -- Description and travel; Boys -- Biography; Warner, Charles Dudley, 1829-1900 -- Travel -- Southern States; Appalachian Region -- Description and travel; Warner, Charles Dudley, 1829-1900 -- Travel -- Appalachian Region; Warner, Charles Dudley, 1829-1900 -- Travel -- Mexico",Warner,Charles Dudley,1829,1900,"Among other things, this means that no one owns a United States copyright on or for this work, so the Project and you!",[a United States],0.2244,0.676226,0.294574,0.029200
3,2674,The Complete Writings of Charles Dudley Warner — Volume 4,"Autobiographies; Virginia -- Description and travel; North Carolina -- Description and travel; Tennessee -- Description and travel; Mexico -- Description and travel; Boys -- Biography; Warner, Charles Dudley, 1829-1900 -- Travel -- Southern States; Appalachian Region -- Description and travel; Warner, Charles Dudley, 1829-1900 -- Travel -- Appalachian Region; Warner, Charles Dudley, 1829-1900 -- Travel -- Mexico",Warner,Charles Dudley,1829,1900,can copy and distribute it in the United States without permission and without paying copyright royalties.,[the United States],0.4215,0.269529,0.665300,0.065171
4,2674,The Complete Writings of Charles Dudley Warner — Volume 4,"Autobiographies; Virginia -- Description and travel; North Carolina -- Description and travel; Tennessee -- Description and travel; Mexico -- Description and travel; Boys -- Biography; Warner, Charles Dudley, 1829-1900 -- Travel -- Southern States; Appalachian Region -- Description and travel; Warner, Charles Dudley, 1829-1900 -- Travel -- Appalachian Region; Warner, Charles Dudley, 1829-1900 -- Travel -- Mexico",Warner,Charles Dudley,1829,1900,"If I were a boy, I am not sure but I would rather drive the oxen than have a birthday.",[oxen],-0.1232,0.393779,0.554246,0.051975
...,...,...,...,...,...,...,...,...,...,...,...,...,...
45967,70331,Educational laws of Virginia,"African Americans -- Education -- Virginia; Douglass, Margaret Crittenden, 1822-",Douglass,Margaret Crittenden,,,"It is copied from the code of Virginia, passed by the General Assembly of the Commonwealth of Virginia, in the month of August, 1849, and will be found on page 747, chapter 198.",[Virginia],0.0000,0.037465,0.931740,0.030795
45968,70331,Educational laws of Virginia,"African Americans -- Education -- Virginia; Douglass, Margaret Crittenden, 1822-",Douglass,Margaret Crittenden,,,"If a white person assemble with negroes for the purpose of instructing them to read or write, or if he associate with them in an unlawful assembly, he shall be confined in jail not exceeding six months, and fined not exceeding one hundred dollars and any Justice may require him to enter into a recognizance, with sufficient security, to appear before the Circuit, County, or Corporation Court, where the offence was committed, at its next term, to answer therefor and in the meantime, to keep the peace and be of good behavior.” It will be seen from this, that in the enlightened State of Virginia, it is a crime for one portion of human beings to worship their Maker!","[County, Virginia]",0.9230,0.382407,0.571231,0.046361
45969,70331,Educational laws of Virginia,"African Americans -- Education -- Virginia; Douglass, Margaret Crittenden, 1822-",Douglass,Margaret Crittenden,,,"Since my trial and conviction, I have been advised by one of the most eminent counsel in Virginia, that the Norfolk Court exceeded its powers, and violated the law by not construing the act literally in my case.",[Virginia],-0.5267,0.642206,0.341482,0.016312
45970,70331,Educational laws of Virginia,"African Americans -- Education -- Virginia; Douglass, Margaret Crittenden, 1822-",Douglass,Margaret Crittenden,,,"It is the one great evil hanging over the Southern slave States, destroying domestic happiness and the peace of thousands.",[States],0.4939,0.913753,0.082423,0.003824


In [106]:
# Sentences that VADER scores as strongly negative but RoBERTa scores as strongly positive
df_virginia_toponym_sentiment_full[
    (df_virginia_toponym_sentiment_full.nltk_sentiment < -0.6)
    & (df_virginia_toponym_sentiment_full.roberta_pos > 0.6)
]

Unnamed: 0,text_id,title,subjects,last_name,first_name,birth,death,cleaned_sentences,toponyms,nltk_sentiment,roberta_neg,roberta_neu,roberta_pos
376,2898,Pioneers of the Old South: A Chronicle of English Colonial Beginnings,"Southern States -- History -- Colonial period, ca. 1600-1775; United States -- History -- Colonial period, ca. 1600-1775; Frontier and pioneer life -- Southern States; British Americans -- Southern States; Maryland -- History; Virginia -- History",Johnston,Mary,1870.0,1936.0,On St. James's day there rose and broke a fearsome storm.,[St. James's],-0.6705,0.005326,0.341994,0.65268
3395,22067,The Story of a Cannoneer Under Stonewall Jackson In Which is Told the Part Taken by the Rockbridge Artillery in the Army of Northern Virginia,"United States -- History -- Civil War, 1861-1865 -- Campaigns; United States -- History -- Civil War, 1861-1865 -- Personal narratives, Confederate; Moore, Edward Alexander, 1842-; Jackson, Stonewall, 1824-1863; Confederate States of America. Army. Virginia Artillery. Rockbridge Battery, 1st; Soldiers -- Virginia -- Biography; Virginia -- History -- Civil War, 1861-1865 -- Campaigns",Moore,Edward Alexander,,,"The beautiful character of Randolph Fairfax, a descendant of Lord Fairfax, who was killed on December 13, 1862, on that fatal hill near Fredericksburg, has been worthily portrayed in a memoir by the Rev.","[Fairfax, Fredericksburg]",-0.6249,0.03978,0.334726,0.625493
12811,30747,Seaport in Virginia George Washington's Alexandria,"Historic buildings -- Virginia -- Alexandria; Alexandria (Va.) -- History; Alexandria (Va.) -- Buildings, structures, etc.",Moore,Gay Montague,,,"The 200 block of Prince Street is probably the finest left in Old Alexandria, in that it has suffered less change.",[Old],-0.7269,0.00893,0.161089,0.829981
13078,30747,Seaport in Virginia George Washington's Alexandria,"Historic buildings -- Virginia -- Alexandria; Alexandria (Va.) -- History; Alexandria (Va.) -- Buildings, structures, etc.",Moore,Gay Montague,,,"Of the many quaint, historical figures whose memories haunt the old streets and houses of Alexandria, none is more interesting than Dr. Craik.",[Alexandria],-0.634,0.023045,0.167167,0.809787
22708,38130,Legends of Loudoun An account of the history and homes of a border county of Virginia's Northern Neck,Loudoun County (Va.) -- History; Historic buildings -- Virginia -- Loudoun County,Williams,Harrison,1873.0,1946.0,"In spite of all his tribulations and the very real dangers he incurred in his American sojourn, he records that ""Virginia is the very finest country I ever was in""no small concession.",[Virginia],-0.7832,0.059485,0.332529,0.607986
35523,52395,"Journal of my journey over the mountains while surveying for Lord Thomas Fairfax, baron of Cameron, in the northern neck of Virginia, beyond the Blue Ridge, in 1747-8.","Virginia -- Description and travel -- Early works to 1800; Washington, George, 1732-1799 -- Travel -- Shenandoah River Valley (Va. and W. Va.); Washington, George, 1732-1799 -- Diaries; Shenandoah River Valley (Va. and W. Va.) -- Description and travel -- Early works to 1800",Washington,George,1732.0,1799.0,to a White Oak on a Mountain side thence No 40 Et 38 po to 3 Red Oaks on a Mountain side near a Spring Branch this Lot very good Lot ye 16th and 17th Widow Wolfs and Henry Sheplars a Black Smith by trade Begins at a Black Walnut on ye Fork Runs So 17 W 76 po to a Red Oak Hickory 90 po Crossing ye Road about 20 po above ye house 226 po to 2 W O thence No 41 Wt 96 po to 2 White Oaks in ye Mannor line to ye River the line of ye 16th Lot from ye 2 W O S 41 Et Lot 18th Jeremiah Osborne's Begins at a Sycamore on ye Fork extending No 80 Et 215 po.,[Mountain],-0.981,0.004906,0.37087,0.624225
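The filter above catches one direction of disagreement (VADER negative, RoBERTa positive). To rank sentences by how much the two tools disagree in either direction, one option is to collapse RoBERTa's scores into a single number and compare it with VADER's compound score. The sketch below does this for four rows taken from the outputs above. Defining `roberta_compound` as `roberta_pos - roberta_neg` is an assumption made for this illustration, not a standard measure.

```python
import pandas as pd

# Scores copied from rows 376, 12811, 45967, and 45970 in the outputs above
toy = pd.DataFrame(
    {
        'nltk_sentiment': [-0.6705, -0.7269, 0.0, 0.4939],
        'roberta_neg': [0.005326, 0.00893, 0.037465, 0.913753],
        'roberta_pos': [0.65268, 0.829981, 0.030795, 0.003824],
    },
    index=[376, 12811, 45967, 45970],
)

# Assumed single-number summary of RoBERTa's output, roughly on a -1 to 1 scale
toy['roberta_compound'] = toy.roberta_pos - toy.roberta_neg

# Absolute gap between the two tools; larger means stronger disagreement
toy['disagreement'] = (toy.nltk_sentiment - toy.roberta_compound).abs()

ranked = toy.sort_values('disagreement', ascending=False)
print(ranked[['nltk_sentiment', 'roberta_compound', 'disagreement']])
```

Row 45970 ("the one great evil hanging over the Southern slave States") shows the opposite pattern from the filter: VADER scores it positive while RoBERTa scores it strongly negative.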


## 5. Analyzing the Differences

### Differences Between NLTK and RoBERTa:
1. **NLTK (VADER)**:
    - Uses a lexicon-based approach.
    - Performs well on short social media-style texts, but may not capture the full context in longer, more complex sentences.

2. **RoBERTa**:
    - Uses a transformer-based deep learning model, which can better understand context.
    - However, this particular model was fine-tuned on tweets, so its judgments can be skewed toward the way sentiment is expressed on Twitter.

### Why Might These Differences Occur?
- **Context Understanding**: RoBERTa reads each word in the context of the whole sentence, which lets it pick up nuances that VADER's word-by-word scoring misses.
- **Lexicon Limitations**: VADER relies on a predefined dictionary of word scores, which means it may miss contextual clues or interpret complex sentences inaccurately.

### Limitations:
- **RoBERTa**: While often more accurate, this model was fine-tuned on one specific domain (Twitter), which could skew its results on historical or formal texts.
- **VADER (NLTK)**: Fast and simple, but its lexicon-based approach might not always provide a detailed or accurate sentiment analysis in nuanced contexts.


## 6. Conclusion

By comparing NLTK’s VADER and Hugging Face's RoBERTa, we can see that different sentiment analysis tools offer different strengths. VADER’s rule-based system is fast and straightforward but can miss complex sentiment cues. RoBERTa, being a transformer model, handles context-heavy sentences better but can be skewed by its Twitter training data.
