# Sentiment Analysis with NLTK

## Python setup


We need the following modules:

- NLTK
- Pandas
- Altair

In [1]:
import nltk

# Suppress FutureWarning messages to keep the output readable
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

## Data

### Data import

In [27]:
import pandas as pd

# Import prepared Twitter data from CNN Breaking News
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/sentiment-cnn.csv")

df.head(3)

Unnamed: 0,text,created_at,text_token,text_token_s,text_si,text_sil
0,the body of missing princeton university stude...,2022-10-20 19:58:17+00:00,"['the', 'body', 'of', 'missing', 'princeton', ...","['body', 'missing', 'princeton', 'university',...",body missing princeton university student misr...,body missing princeton university student misr...
1,uk prime minister liz truss quits after a disa...,2022-10-20 12:37:10+00:00,"['uk', 'prime', 'minister', 'liz', 'truss', 'q...","['uk', 'prime', 'minister', 'liz', 'truss', 'q...",prime minister liz truss quits disastrous six ...,prime minister liz truss quits disastrous six ...
2,trump weighs letting federal agents return to ...,2022-10-19 23:03:37+00:00,"['trump', 'weighs', 'letting', 'federal', 'age...","['trump', 'weighs', 'letting', 'federal', 'age...",trump weighs letting federal agents return mar...,trump weighs letting federal agents return mar...


### Data corrections

In [28]:
df['created_at'] = pd.to_datetime(df['created_at'])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   text          22 non-null     object             
 1   created_at    22 non-null     datetime64[ns, UTC]
 2   text_token    22 non-null     object             
 3   text_token_s  22 non-null     object             
 4   text_si       22 non-null     object             
 5   text_sil      22 non-null     object             
dtypes: datetime64[ns, UTC](1), object(5)
memory usage: 1.2+ KB


## Analysis

### VADER lexicon

- NLTK provides a simple rule-based model for general sentiment analysis called VADER, which stands for "Valence Aware Dictionary and Sentiment Reasoner" (Hutto & Gilbert, 2014).
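
For intuition about the `compound` score used later in this notebook: VADER sums the valence of the words it recognizes and then squashes that sum into the range from -1 to 1. The sketch below mirrors that normalization step as it is implemented, to my understanding, in NLTK's VADER module (with the constant `alpha = 15`). The function name `normalize_valence` and the example inputs are assumptions chosen for illustration.

```python
import math


def normalize_valence(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Made-up valence sums: larger magnitudes move the result towards -1 or 1
for raw in (-4.0, 0.0, 1.5, 4.0):
    print(raw, round(normalize_valence(raw), 4))
```

A valence sum of 0 (no lexicon word found) maps to a compound score of exactly 0, which is the "neutral" case used further below.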

In [29]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

### Sentiment Intensity Analyzer

- Create a `SentimentIntensityAnalyzer` instance named `analyzer`:

In [30]:
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

### Polarity scores

- Use the `polarity_scores` method, which returns a dictionary with the keys `neg`, `neu`, `pos` and `compound` for each text:

In [31]:
df['polarity'] = df['text_token'].apply(lambda x: analyzer.polarity_scores(x))

In [32]:
df.head(3)

Unnamed: 0,text,created_at,text_token,text_token_s,text_si,text_sil,polarity
0,the body of missing princeton university stude...,2022-10-20 19:58:17+00:00,"['the', 'body', 'of', 'missing', 'princeton', ...","['body', 'missing', 'princeton', 'university',...",body missing princeton university student misr...,body missing princeton university student misr...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
1,uk prime minister liz truss quits after a disa...,2022-10-20 12:37:10+00:00,"['uk', 'prime', 'minister', 'liz', 'truss', 'q...","['uk', 'prime', 'minister', 'liz', 'truss', 'q...",prime minister liz truss quits disastrous six ...,prime minister liz truss quits disastrous six ...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
2,trump weighs letting federal agents return to ...,2022-10-19 23:03:37+00:00,"['trump', 'weighs', 'letting', 'federal', 'age...","['trump', 'weighs', 'letting', 'federal', 'age...",trump weighs letting federal agents return mar...,trump weighs letting federal agents return mar...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
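
Every row displayed above is scored as fully neutral. One possible reason (an assumption, not verified here) is that `text_token` stores each token list as a single string such as `"['uk', 'prime', ...]"`, whereas `polarity_scores` expects a plain sentence. The sketch below shows how such strings could be parsed back into lists and joined into sentences; the small frame `demo` and the column names `tokens` and `sentence` are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical mini example that imitates the string-encoded token lists
demo = pd.DataFrame({
    "text_token": [
        "['uk', 'prime', 'minister', 'quits']",
        "['trump', 'weighs', 'letting', 'agents', 'return']",
    ]
})

# Parse each string back into a real Python list
demo["tokens"] = demo["text_token"].apply(ast.literal_eval)

# Join the tokens into a plain sentence that a scorer could consume
demo["sentence"] = demo["tokens"].apply(" ".join)

print(demo["sentence"].tolist())
```

Since the `text` column already holds the raw tweet, scoring `df['text']` instead of `df['text_token']` would be another option worth comparing.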


### Transform data

In [33]:
# Unnest the polarity dictionaries: one new column per score (neg, neu, pos, compound)

df = pd.concat([df.drop(['polarity'], axis=1), df['polarity'].apply(pd.Series)], axis=1)
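
As a possible alternative to `apply(pd.Series)`, the dictionaries can be turned into a DataFrame in one step and joined back. This is only a sketch on a made-up frame `demo`; the scores in it are invented for illustration and do not come from the CNN data.

```python
import pandas as pd

# Made-up example with a column of score dictionaries
demo = pd.DataFrame({
    "text": ["good news", "bad news"],
    "polarity": [
        {"neg": 0.0, "neu": 0.3, "pos": 0.7, "compound": 0.44},
        {"neg": 0.8, "neu": 0.2, "pos": 0.0, "compound": -0.54},
    ],
})

# Build one column per dictionary key, keeping the original row index
scores = pd.DataFrame(demo["polarity"].tolist(), index=demo.index)

# Replace the nested column with the flat score columns
demo = demo.drop(columns="polarity").join(scores)

print(demo.columns.tolist())
```

On larger frames this tends to be faster than `apply(pd.Series)`, because it avoids creating one Series object per row.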

In [34]:
df.head()

Unnamed: 0,text,created_at,text_token,text_token_s,text_si,text_sil,neg,neu,pos,compound
0,the body of missing princeton university stude...,2022-10-20 19:58:17+00:00,"['the', 'body', 'of', 'missing', 'princeton', ...","['body', 'missing', 'princeton', 'university',...",body missing princeton university student misr...,body missing princeton university student misr...,0.0,1.0,0.0,0.0
1,uk prime minister liz truss quits after a disa...,2022-10-20 12:37:10+00:00,"['uk', 'prime', 'minister', 'liz', 'truss', 'q...","['uk', 'prime', 'minister', 'liz', 'truss', 'q...",prime minister liz truss quits disastrous six ...,prime minister liz truss quits disastrous six ...,0.0,1.0,0.0,0.0
2,trump weighs letting federal agents return to ...,2022-10-19 23:03:37+00:00,"['trump', 'weighs', 'letting', 'federal', 'age...","['trump', 'weighs', 'letting', 'federal', 'age...",trump weighs letting federal agents return mar...,trump weighs letting federal agents return mar...,0.0,1.0,0.0,0.0
3,britain's home secretary suella braverman quit...,2022-10-19 16:17:29+00:00,"['britain', 's', 'home', 'secretary', 'suella'...","['britain', 'home', 'secretary', 'suella', 'br...",britain home secretary suella braverman quits ...,britain home secretary suella braverman quits ...,0.0,1.0,0.0,0.0
4,russian president vladimir putin signs a decre...,2022-10-19 12:02:15+00:00,"['russian', 'president', 'vladimir', 'putin', ...","['russian', 'president', 'vladimir', 'putin', ...",russian president vladimir putin signs decree ...,russian president vladimir putin signs decree ...,0.0,1.0,0.0,0.0


Create a new variable called `sentiment` that contains the entries "neutral", "positive" or "negative", depending on the compound score.


Hint:


---

```python
df['___'] = df['___'].___(___ x: '___' if ___ >___ else '___' if ___ else '___')
```

---

- Name the new variable `sentiment`.
- Use the variable `compound` as the basis.
- Apply a lambda function to each value of `compound`.
- The lambda function should return a label:
  - 'positive' `if x > 0`
  - 'neutral' `if x == 0`
  - 'negative' for all other cases (`else`)


In [35]:
### BEGIN SOLUTION
df['sentiment'] = df['compound'].apply(lambda x: 'positive' if x > 0 else 'neutral' if x == 0 else 'negative')
### END SOLUTION
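
The same labeling rule can also be written without a lambda. The sketch below uses `numpy.select` on a made-up frame `demo`; it is an alternative formulation under the same thresholds, not the exercise's expected answer.

```python
import numpy as np
import pandas as pd

# Made-up compound scores covering all three cases
demo = pd.DataFrame({"compound": [0.6, 0.0, -0.3]})

# Conditions are checked in order; rows matching none get the default label
conditions = [demo["compound"] > 0, demo["compound"] == 0]
labels = ["positive", "neutral"]
demo["sentiment"] = np.select(conditions, labels, default="negative")

print(demo)
```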

In [39]:
# check your code
assert df.loc[0, 'sentiment'] == 'neutral'

In [38]:
df.head()

Unnamed: 0,text,created_at,text_token,text_token_s,text_si,text_sil,neg,neu,pos,compound,sentiment
0,the body of missing princeton university stude...,2022-10-20 19:58:17+00:00,"['the', 'body', 'of', 'missing', 'princeton', ...","['body', 'missing', 'princeton', 'university',...",body missing princeton university student misr...,body missing princeton university student misr...,0.0,1.0,0.0,0.0,neutral
1,uk prime minister liz truss quits after a disa...,2022-10-20 12:37:10+00:00,"['uk', 'prime', 'minister', 'liz', 'truss', 'q...","['uk', 'prime', 'minister', 'liz', 'truss', 'q...",prime minister liz truss quits disastrous six ...,prime minister liz truss quits disastrous six ...,0.0,1.0,0.0,0.0,neutral
2,trump weighs letting federal agents return to ...,2022-10-19 23:03:37+00:00,"['trump', 'weighs', 'letting', 'federal', 'age...","['trump', 'weighs', 'letting', 'federal', 'age...",trump weighs letting federal agents return mar...,trump weighs letting federal agents return mar...,0.0,1.0,0.0,0.0,neutral
3,britain's home secretary suella braverman quit...,2022-10-19 16:17:29+00:00,"['britain', 's', 'home', 'secretary', 'suella'...","['britain', 'home', 'secretary', 'suella', 'br...",britain home secretary suella braverman quits ...,britain home secretary suella braverman quits ...,0.0,1.0,0.0,0.0,neutral
4,russian president vladimir putin signs a decre...,2022-10-19 12:02:15+00:00,"['russian', 'president', 'vladimir', 'putin', ...","['russian', 'president', 'vladimir', 'putin', ...",russian president vladimir putin signs decree ...,russian president vladimir putin signs decree ...,0.0,1.0,0.0,0.0,neutral


### Max and min sentiment

In [40]:
# Tweet with the most positive sentiment (highest compound score; ties resolve to the first row)
df.loc[df['compound'].idxmax()].values

array(['the body of missing princeton university student misrach ewunetie has been found. https://t.co/66wv0od5ut',
       Timestamp('2022-10-20 19:58:17+0000', tz='UTC'),
       "['the', 'body', 'of', 'missing', 'princeton', 'university', 'student', 'misrach', 'ewunetie', 'has', 'been', 'found', 'https', 't', 'co', '66wv0od5ut']",
       "['body', 'missing', 'princeton', 'university', 'student', 'misrach', 'ewunetie', 'found', '66wv0od5ut']",
       'body missing princeton university student misrach ewunetie found 66wv0od5ut',
       'body missing princeton university student misrach ewunetie found 66wv0od5ut',
       0.0, 1.0, 0.0, 0.0, 'neutral'], dtype=object)

In [41]:
# Tweet with the most negative sentiment (lowest compound score; ties resolve to the first row)
# ...seems to be a case of wrong classification because of the word "deficit"
df.loc[df['compound'].idxmin()].values

array(['the body of missing princeton university student misrach ewunetie has been found. https://t.co/66wv0od5ut',
       Timestamp('2022-10-20 19:58:17+0000', tz='UTC'),
       "['the', 'body', 'of', 'missing', 'princeton', 'university', 'student', 'misrach', 'ewunetie', 'has', 'been', 'found', 'https', 't', 'co', '66wv0od5ut']",
       "['body', 'missing', 'princeton', 'university', 'student', 'misrach', 'ewunetie', 'found', '66wv0od5ut']",
       'body missing princeton university student misrach ewunetie found 66wv0od5ut',
       'body missing princeton university student misrach ewunetie found 66wv0od5ut',
       0.0, 1.0, 0.0, 0.0, 'neutral'], dtype=object)
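
Both cells above return the same tweet with a compound score of 0.0. `idxmax` and `idxmin` return the first matching label when values tie, so this is what one would expect if all compound scores are identical. A minimal sketch on a made-up frame `demo` shows the behavior and a quick way to check for it:

```python
import pandas as pd

# Made-up frame in which every compound score is the same
demo = pd.DataFrame({"compound": [0.0, 0.0, 0.0]}, index=[0, 1, 2])

# With ties, both methods fall back to the first row label
first_max = demo["compound"].idxmax()
first_min = demo["compound"].idxmin()

# True if the column holds a single distinct value
all_equal = demo["compound"].nunique() == 1

print(first_max, first_min, all_equal)
```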

### Visualize data

In [42]:
import altair as alt

# Map each sentiment category to a color in the Altair plots
domain = ['neutral', 'positive', 'negative']
range_ = ['#b2d8d8', '#008080', '#db3d13']


alt.Chart(df).mark_bar().encode(
    x=alt.X('count()', title=None),
    y=alt.Y('sentiment', sort="-x"),
    color=alt.Color('sentiment', legend=None, scale=alt.Scale(domain=domain, range=range_))
).properties(
    title="Sentiment analysis",
    width=400,
    height=150,
)
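
The bar chart encodes the number of tweets per sentiment category. To read the same numbers as a table, `value_counts` can be used; the sketch below runs on a made-up frame `demo` rather than on the notebook's `df`.

```python
import pandas as pd

# Made-up sentiment labels
demo = pd.DataFrame(
    {"sentiment": ["neutral", "positive", "neutral", "negative", "neutral"]}
)

# Count rows per category, most frequent first
counts = demo["sentiment"].value_counts()

print(counts)
```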

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   text          22 non-null     object             
 1   created_at    22 non-null     datetime64[ns, UTC]
 2   text_token    22 non-null     object             
 3   text_token_s  22 non-null     object             
 4   text_si       22 non-null     object             
 5   text_sil      22 non-null     object             
 6   neg           22 non-null     float64            
 7   neu           22 non-null     float64            
 8   pos           22 non-null     float64            
 9   compound      22 non-null     float64            
 10  sentiment     22 non-null     object             
dtypes: datetime64[ns, UTC](1), float64(4), object(6)
memory usage: 2.0+ KB


In [44]:
alt.Chart(df).mark_line().encode(
   x=alt.X('created_at:T'),
   y=alt.Y('compound'),
   color=alt.Color('sentiment', scale=alt.Scale(domain=domain, range=range_))
)

In [45]:
alt.Chart(df).mark_boxplot().encode(
    x=alt.X('sentiment'),
    y=alt.Y('compound'),
    color=alt.Color('sentiment', scale=alt.Scale(domain=domain, range=range_))
).properties(
    width=200,
    height=200
)
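
A boxplot summarizes the distribution of `compound` within each sentiment group. The sketch below computes a few of the underlying statistics with `groupby`; the frame `demo` and its scores are made up for illustration.

```python
import pandas as pd

# Made-up labels and compound scores
demo = pd.DataFrame({
    "sentiment": ["positive", "positive", "neutral", "negative", "negative"],
    "compound": [0.8, 0.4, 0.0, -0.2, -0.6],
})

# Count, minimum, median and maximum of the compound score per group
summary = demo.groupby("sentiment")["compound"].agg(["count", "min", "median", "max"])

print(summary)
```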

Literature:

[Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.](https://ojs.aaai.org/index.php/ICWSM/article/view/14550)