# Sentiment Analysis with NLTK

## Python setup


We need the following modules:

- NLTK
- Pandas
- Altair

In [1]:
import nltk

# Suppress FutureWarning messages to keep the notebook output readable
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Data

### Data import

In [2]:
import pandas as pd

# Import prepared Twitter data from CNN Breaking News
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/sentiment-cnn.csv")

df.head(3)

Unnamed: 0,text,created_at,text_token,text_token_s,text_si,text_sil
0,the body of missing princeton university stude...,2022-10-20 19:58:17+00:00,"['the', 'body', 'of', 'missing', 'princeton', ...","['body', 'missing', 'princeton', 'university',...",body missing princeton university student misr...,body missing princeton university student misr...
1,uk prime minister liz truss quits after a disa...,2022-10-20 12:37:10+00:00,"['uk', 'prime', 'minister', 'liz', 'truss', 'q...","['uk', 'prime', 'minister', 'liz', 'truss', 'q...",prime minister liz truss quits disastrous six ...,prime minister liz truss quits disastrous six ...
2,trump weighs letting federal agents return to ...,2022-10-19 23:03:37+00:00,"['trump', 'weighs', 'letting', 'federal', 'age...","['trump', 'weighs', 'letting', 'federal', 'age...",trump weighs letting federal agents return mar...,trump weighs letting federal agents return mar...


### Data corrections

In [3]:
df['created_at'] = pd.to_datetime(df['created_at'])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   text          22 non-null     object             
 1   created_at    22 non-null     datetime64[ns, UTC]
 2   text_token    22 non-null     object             
 3   text_token_s  22 non-null     object             
 4   text_si       22 non-null     object             
 5   text_sil      22 non-null     object             
dtypes: datetime64[ns, UTC](1), object(5)
memory usage: 1.2+ KB


## Analysis

### VADER lexicon

- NLTK provides a simple rule-based model for general sentiment analysis called VADER, which stands for "Valence Aware Dictionary and Sentiment Reasoner" (Hutto & Gilbert, 2014).
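
To make the idea of a lexicon-based model concrete, here is a deliberately simplified sketch. It is not the VADER algorithm: the mini-lexicon `TOY_LEXICON` and its valence values are invented for illustration, and the function only sums word scores. VADER additionally applies rules for things like negation, intensifiers, punctuation and capitalization.

```python
# Toy illustration of a lexicon lookup (NOT the real VADER algorithm).
# The words and valence values below are made up for this example.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "disastrous": -3.1}


def toy_valence(text: str) -> float:
    """Sum the valence of every known word; unknown words count as 0."""
    return sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())


print(toy_valence("a great result"))          # positive total
print(toy_valence("a disastrous week"))       # negative total
print(toy_valence("parliament meets today"))  # no lexicon words, so 0.0
```

Texts without any lexicon word end up with a score of zero, which is why purely factual headlines tend to be classified as neutral by this kind of model.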

In [4]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

### Sentiment Intensity Analyzer

- Create a `SentimentIntensityAnalyzer` object named `analyzer`:

In [5]:
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

### Polarity scores

- Use the `polarity_scores` method to score each cleaned tweet in `text_sil`:

In [6]:
df['polarity'] = df['text_sil'].apply(lambda x: analyzer.polarity_scores(x))
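
`polarity_scores` returns a dictionary with the keys `neg`, `neu`, `pos` and `compound`. The first three are proportions of the text, while `compound` is a single score between -1 and 1. As far as I understand the reference implementation, the compound score is the sum of the word valences squashed with `x / sqrt(x^2 + alpha)` and `alpha = 15`. The sketch below reproduces only that normalization step; treat the constant as an assumption and check it against your installed NLTK version.

```python
import math


def normalize(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Larger absolute sums move towards -1 or 1 but never reach them.
for raw in (-4.0, 0.0, 1.0, 10.0):
    print(raw, round(normalize(raw), 4))
```

A raw sum of 0 stays at 0, which corresponds to the tweets that end up with `compound == 0.0`.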

In [7]:
df.head(3)

Unnamed: 0,text,created_at,text_token,text_token_s,text_si,text_sil,polarity
0,the body of missing princeton university stude...,2022-10-20 19:58:17+00:00,"['the', 'body', 'of', 'missing', 'princeton', ...","['body', 'missing', 'princeton', 'university',...",body missing princeton university student misr...,body missing princeton university student misr...,"{'neg': 0.216, 'neu': 0.784, 'pos': 0.0, 'comp..."
1,uk prime minister liz truss quits after a disa...,2022-10-20 12:37:10+00:00,"['uk', 'prime', 'minister', 'liz', 'truss', 'q...","['uk', 'prime', 'minister', 'liz', 'truss', 'q...",prime minister liz truss quits disastrous six ...,prime minister liz truss quits disastrous six ...,"{'neg': 0.206, 'neu': 0.794, 'pos': 0.0, 'comp..."
2,trump weighs letting federal agents return to ...,2022-10-19 23:03:37+00:00,"['trump', 'weighs', 'letting', 'federal', 'age...","['trump', 'weighs', 'letting', 'federal', 'age...",trump weighs letting federal agents return mar...,trump weighs letting federal agents return mar...,"{'neg': 0.0, 'neu': 0.86, 'pos': 0.14, 'compou..."


### Transform data

In [8]:
# Change data structure (we unnest the column polarity and add new columns)
df = pd.concat([df.drop(['polarity'], axis=1), df['polarity'].apply(pd.Series)], axis=1)
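
`apply(pd.Series)` builds one Series per row, which can become slow on larger DataFrames. A possible alternative is to build the score columns in one step from the list of dictionaries. The sketch below uses a small stand-in DataFrame called `demo` (a hypothetical name); the scores are copied from the first two rows of the output above.

```python
import pandas as pd

# Small stand-in frame; the text values are placeholders.
demo = pd.DataFrame({
    "text": ["first tweet", "second tweet"],
    "polarity": [
        {"neg": 0.216, "neu": 0.784, "pos": 0.0, "compound": -0.296},
        {"neg": 0.206, "neu": 0.794, "pos": 0.0, "compound": -0.5994},
    ],
})

# Turn the list of dicts into columns, keeping the original index
# so that the join lines up row by row.
scores = pd.DataFrame(demo["polarity"].tolist(), index=demo.index)
demo = demo.drop(columns="polarity").join(scores)

print(demo.columns.tolist())
```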

In [9]:
df.head()

Unnamed: 0,text,created_at,text_token,text_token_s,text_si,text_sil,neg,neu,pos,compound
0,the body of missing princeton university stude...,2022-10-20 19:58:17+00:00,"['the', 'body', 'of', 'missing', 'princeton', ...","['body', 'missing', 'princeton', 'university',...",body missing princeton university student misr...,body missing princeton university student misr...,0.216,0.784,0.0,-0.296
1,uk prime minister liz truss quits after a disa...,2022-10-20 12:37:10+00:00,"['uk', 'prime', 'minister', 'liz', 'truss', 'q...","['uk', 'prime', 'minister', 'liz', 'truss', 'q...",prime minister liz truss quits disastrous six ...,prime minister liz truss quits disastrous six ...,0.206,0.794,0.0,-0.5994
2,trump weighs letting federal agents return to ...,2022-10-19 23:03:37+00:00,"['trump', 'weighs', 'letting', 'federal', 'age...","['trump', 'weighs', 'letting', 'federal', 'age...",trump weighs letting federal agents return mar...,trump weighs letting federal agents return mar...,0.0,0.86,0.14,0.3818
3,britain's home secretary suella braverman quit...,2022-10-19 16:17:29+00:00,"['britain', 's', 'home', 'secretary', 'suella'...","['britain', 'home', 'secretary', 'suella', 'br...",britain home secretary suella braverman quits ...,britain home secretary suella braverman quits ...,0.152,0.848,0.0,-0.3612
4,russian president vladimir putin signs a decre...,2022-10-19 12:02:15+00:00,"['russian', 'president', 'vladimir', 'putin', ...","['russian', 'president', 'vladimir', 'putin', ...",russian president vladimir putin signs decree ...,russian president vladimir putin signs decree ...,0.0,1.0,0.0,0.0


Create a new variable called `sentiment` that contains the entries "neutral", "positive" or "negative" (depending on the compound score).


Hint:


---

```python
df['___'] = df['___'].___(___ x: '___' if ___ >___ else '___' if ___ else '___')
```

---

- Name the new variable `sentiment`.
- Use the variable `compound` as the basis.
- Apply a lambda function to each value of `compound`.
- The lambda function should return a label for each cell:
  - 'positive' `if x>0`
  - 'neutral' `if x==0`
  - 'negative' for all other cases (`else`)


In [10]:
### BEGIN SOLUTION
df['sentiment'] = df['compound'].apply(lambda x: 'positive' if x >0 else 'neutral' if x==0 else 'negative')
### END SOLUTION

In [11]:
# check your code
assert df.loc[0, 'sentiment'] == 'negative'
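
The rule in this exercise labels a tweet as neutral only when the compound score is exactly 0. A common convention for VADER is to use a small band around zero instead, typically ±0.05. The helper below is a sketch of that variant; the function name `label_sentiment` and the `threshold` parameter are hypothetical and not part of NLTK.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to a label using a symmetric neutral band."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores taken from the table above, plus one value close to zero.
for score in (-0.296, 0.3818, 0.0, 0.02):
    print(score, label_sentiment(score))
```

With this variant, weakly scored tweets such as the 0.02 example fall into the neutral class instead of being counted as positive.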

In [12]:
df.head()

Unnamed: 0,text,created_at,text_token,text_token_s,text_si,text_sil,neg,neu,pos,compound,sentiment
0,the body of missing princeton university stude...,2022-10-20 19:58:17+00:00,"['the', 'body', 'of', 'missing', 'princeton', ...","['body', 'missing', 'princeton', 'university',...",body missing princeton university student misr...,body missing princeton university student misr...,0.216,0.784,0.0,-0.296,negative
1,uk prime minister liz truss quits after a disa...,2022-10-20 12:37:10+00:00,"['uk', 'prime', 'minister', 'liz', 'truss', 'q...","['uk', 'prime', 'minister', 'liz', 'truss', 'q...",prime minister liz truss quits disastrous six ...,prime minister liz truss quits disastrous six ...,0.206,0.794,0.0,-0.5994,negative
2,trump weighs letting federal agents return to ...,2022-10-19 23:03:37+00:00,"['trump', 'weighs', 'letting', 'federal', 'age...","['trump', 'weighs', 'letting', 'federal', 'age...",trump weighs letting federal agents return mar...,trump weighs letting federal agents return mar...,0.0,0.86,0.14,0.3818,positive
3,britain's home secretary suella braverman quit...,2022-10-19 16:17:29+00:00,"['britain', 's', 'home', 'secretary', 'suella'...","['britain', 'home', 'secretary', 'suella', 'br...",britain home secretary suella braverman quits ...,britain home secretary suella braverman quits ...,0.152,0.848,0.0,-0.3612,negative
4,russian president vladimir putin signs a decre...,2022-10-19 12:02:15+00:00,"['russian', 'president', 'vladimir', 'putin', ...","['russian', 'president', 'vladimir', 'putin', ...",russian president vladimir putin signs decree ...,russian president vladimir putin signs decree ...,0.0,1.0,0.0,0.0,neutral


### Max and min sentiment

In [13]:
# Tweet with the highest compound score (most positive sentiment)
df[['text', 'compound', 'neg', 'neu', 'pos', 'sentiment']].loc[df['compound'].idxmax()]

text         bruce sutter, a cy young award-winning relief ...
compound                                                0.8834
neg                                                      0.126
neu                                                      0.386
pos                                                      0.488
sentiment                                             positive
Name: 16, dtype: object

In [14]:
# Tweet with the lowest compound score (most negative sentiment)
# ...seems to be a case of wrong classification because of the word "deficit"
df[['text', 'compound', 'neg', 'neu', 'pos', 'sentiment']].loc[df['compound'].idxmin()]

text         jury finds paul flores guilty of first-degree ...
compound                                               -0.9531
neg                                                       0.58
neu                                                      0.337
pos                                                      0.083
sentiment                                             negative
Name: 6, dtype: object

### Visualize data

In [15]:
import altair as alt

# Map each sentiment label to a fixed color for the Altair plots
domain = ['neutral', 'positive', 'negative']
range_ = ['#b2d8d8', '#008080', '#db3d13']


alt.Chart(df).mark_bar().encode(
    x=alt.X('count()', title=None),
    y=alt.Y('sentiment', sort="-x"),
    color=alt.Color('sentiment', legend=None, scale=alt.Scale(domain=domain, range=range_))
).properties(
    title="Sentiment analysis",
    width=400,
    height=150,
)

In [16]:
# Function to add date variables to the DataFrame (modifies df in place).
def add_date_info(df):
    df['created_at'] = pd.to_datetime(df['created_at'])
    created = df['created_at'].dt  # datetime accessor for the column
    df['Year'] = created.year
    df['Month'] = created.month
    df['Day'] = created.day
    df['DOY'] = created.dayofyear  # day of the year (1-366)
    df['Date'] = created.date
    return df
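
To see what these date fields contain for a single value, the following sketch decomposes the timestamp of the first tweet with the standard library only. The dictionary name `parts` is just for illustration.

```python
from datetime import datetime

# Timestamp of the first tweet in the data set
ts = datetime.fromisoformat("2022-10-20 19:58:17+00:00")

parts = {
    "Year": ts.year,
    "Month": ts.month,
    "Day": ts.day,
    "DOY": ts.timetuple().tm_yday,  # day of the year
    "Date": ts.date().isoformat(),
}
print(parts)
```

October 20 is day 293 of 2022, which matches the `DOY` value pandas reports for this tweet.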

In [17]:
add_date_info(df)

Unnamed: 0,text,created_at,text_token,text_token_s,text_si,text_sil,neg,neu,pos,compound,sentiment,Year,Month,Day,DOY,Date
0,the body of missing princeton university stude...,2022-10-20 19:58:17+00:00,"['the', 'body', 'of', 'missing', 'princeton', ...","['body', 'missing', 'princeton', 'university',...",body missing princeton university student misr...,body missing princeton university student misr...,0.216,0.784,0.0,-0.296,negative,2022,10,20,293,2022-10-20
1,uk prime minister liz truss quits after a disa...,2022-10-20 12:37:10+00:00,"['uk', 'prime', 'minister', 'liz', 'truss', 'q...","['uk', 'prime', 'minister', 'liz', 'truss', 'q...",prime minister liz truss quits disastrous six ...,prime minister liz truss quits disastrous six ...,0.206,0.794,0.0,-0.5994,negative,2022,10,20,293,2022-10-20
2,trump weighs letting federal agents return to ...,2022-10-19 23:03:37+00:00,"['trump', 'weighs', 'letting', 'federal', 'age...","['trump', 'weighs', 'letting', 'federal', 'age...",trump weighs letting federal agents return mar...,trump weighs letting federal agents return mar...,0.0,0.86,0.14,0.3818,positive,2022,10,19,292,2022-10-19
3,britain's home secretary suella braverman quit...,2022-10-19 16:17:29+00:00,"['britain', 's', 'home', 'secretary', 'suella'...","['britain', 'home', 'secretary', 'suella', 'br...",britain home secretary suella braverman quits ...,britain home secretary suella braverman quits ...,0.152,0.848,0.0,-0.3612,negative,2022,10,19,292,2022-10-19
4,russian president vladimir putin signs a decre...,2022-10-19 12:02:15+00:00,"['russian', 'president', 'vladimir', 'putin', ...","['russian', 'president', 'vladimir', 'putin', ...",russian president vladimir putin signs decree ...,russian president vladimir putin signs decree ...,0.0,1.0,0.0,0.0,neutral,2022,10,19,292,2022-10-19
5,rising food and housing costs drove uk inflati...,2022-10-19 09:04:45+00:00,"['rising', 'food', 'and', 'housing', 'costs', ...","['rising', 'food', 'housing', 'costs', 'drove'...",rising food housing costs drove inflation back...,rising food housing costs drove inflation back...,0.0,1.0,0.0,0.0,neutral,2022,10,19,292,2022-10-19
6,jury finds paul flores guilty of first-degree ...,2022-10-18 21:35:05+00:00,"['jury', 'finds', 'paul', 'flores', 'guilty', ...","['jury', 'finds', 'paul', 'flores', 'guilty', ...",jury finds paul flores guilty first degree mur...,jury finds paul flores guilty first degree mur...,0.58,0.337,0.083,-0.9531,negative,2022,10,18,291,2022-10-18
7,russian expat who was the main source for infa...,2022-10-18 20:29:03+00:00,"['russian', 'expat', 'who', 'was', 'the', 'mai...","['russian', 'expat', 'main', 'source', 'infamo...",russian expat main source infamous trump dossi...,russian expat main source infamous trump dossi...,0.364,0.445,0.19,-0.6486,negative,2022,10,18,291,2022-10-18
8,a pilot and passenger are dead after a plane c...,2022-10-18 16:00:21+00:00,"['a', 'pilot', 'and', 'passenger', 'are', 'dea...","['pilot', 'passenger', 'dead', 'plane', 'crash...",pilot passenger dead plane crashed car dealers...,pilot passenger dead plane crashed car dealers...,0.318,0.682,0.0,-0.7906,negative,2022,10,18,291,2022-10-18
9,ukraine's president zelensky says 30% of the c...,2022-10-18 10:33:29+00:00,"['ukraine', 's', 'president', 'zelensky', 'say...","['ukraine', 'president', 'zelensky', 'says', '...",ukraine president zelensky says country power ...,ukraine president zelensky says country power ...,0.357,0.643,0.0,-0.7269,negative,2022,10,18,291,2022-10-18


In [20]:
# Convert Date from Python date objects to datetime64 so Altair can treat it as a temporal field
df['Date'] = pd.to_datetime(df['Date'])

In [24]:
alt.Chart(df).mark_area().encode(
   x=alt.X('Date', axis=alt.Axis(format='%e.%-m.')),
   y=alt.Y('count(sentiment)'),
   color=alt.Color('sentiment', scale=alt.Scale(domain=domain, range=range_))
)
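
The area chart relies on Altair's `count()` aggregation per date and sentiment. To inspect the numbers behind such a chart, the same aggregation can be sketched in pandas. The stand-in DataFrame `demo` (a hypothetical name) holds the dates and labels of the ten rows shown in the output above.

```python
import pandas as pd

# Date and sentiment of the ten tweets listed above
demo = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2022-10-20"] * 2 + ["2022-10-19"] * 4 + ["2022-10-18"] * 4
    ),
    "sentiment": [
        "negative", "negative",
        "positive", "negative", "neutral", "neutral",
        "negative", "negative", "negative", "negative",
    ],
})

# Count tweets per day and sentiment; missing combinations become 0.
counts = demo.groupby(["Date", "sentiment"]).size().unstack(fill_value=0)
print(counts)
```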

In [22]:
alt.Chart(df).mark_boxplot().encode(
    x=alt.X('sentiment'),
    y=alt.Y('compound'),
    color=alt.Color('sentiment', scale=alt.Scale(domain=domain, range=range_))
).properties(
    width=200,
    height=200
)

Literature:

[Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.](https://ojs.aaai.org/index.php/ICWSM/article/view/14550)