# Sentiment Analysis with NLTK

## Python setup


We need the following modules:

- NLTK
- Pandas
- Altair

In [1]:
import altair as alt
import nltk

# we suppress some unimportant warnings
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

## Data

### Data import

In [2]:
import pandas as pd

# Import a prepared sample of tweets by Elon Musk
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/twitter-sentiment.csv")

df.head(3)

Unnamed: 0,created_at,id,author_id,text,text_token,text_token_s,text_si,text_sil
0,2021-12-10 07:20:45+00:00,1469205428227784711,44196397,@albi_sidearms maybe i will …,"['albi_sidearms', 'maybe']","['albi_sidearms', 'maybe']",albi_sidearms maybe,albi_sidearms maybe
1,2021-12-10 07:19:05+00:00,1469205011687223298,44196397,@jack https://t.co/ueyr6nawap,"['jack', 'co', 'ueyr6nawap']","['jack', 'co', 'ueyr6nawap']",jack ueyr6nawap,jack ueyr6nawap
2,2021-12-10 06:44:19+00:00,1469196261953884160,44196397,@sawyermerritt 🤣🤣,['sawyermerritt'],['sawyermerritt'],sawyermerritt,sawyermerritt


### Data corrections

In [3]:
df['created_at'] = pd.to_datetime(df['created_at'])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   created_at    72 non-null     datetime64[ns, UTC]
 1   id            72 non-null     int64              
 2   author_id     72 non-null     int64              
 3   text          72 non-null     object             
 4   text_token    72 non-null     object             
 5   text_token_s  72 non-null     object             
 6   text_si       72 non-null     object             
 7   text_sil      72 non-null     object             
dtypes: datetime64[ns, UTC](1), int64(2), object(5)
memory usage: 4.6+ KB
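
The `object` dtype of `text_token` and `text_token_s` suggests that the token lists were stored in the CSV as strings such as `"['albi_sidearms', 'maybe']"` rather than as Python lists. The following is a minimal sketch of how such a cell could be parsed back into a list, assuming every cell is a valid Python list literal. The helper name `parse_tokens` is hypothetical and not part of the original notebook.

```python
import ast


def parse_tokens(cell):
    """Parse a stringified list such as "['a', 'b']" into a real list of strings."""
    # literal_eval only evaluates Python literals, so it is safer than eval()
    tokens = ast.literal_eval(cell)
    if not isinstance(tokens, list):
        raise ValueError(f"expected a list literal, got {type(tokens).__name__}")
    return [str(t) for t in tokens]


# Example cell copied from the first row of the table above
example = "['albi_sidearms', 'maybe']"
tokens = parse_tokens(example)
print(tokens)
print(" ".join(tokens))
```

On the DataFrame above, this could be applied with `df['text_token'].apply(parse_tokens)`.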


## Sentiment analysis

- NLTK provides a simple rule-based model for general sentiment analysis called VADER, which stands for "Valence Aware Dictionary and Sentiment Reasoner" (Hutto & Gilbert, 2014).

In [4]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jankirenz/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

### Sentiment Intensity Analyzer

- Create a `SentimentIntensityAnalyzer` instance named `analyzer`:

In [5]:
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

### Polarity scores

- Use the `polarity_scores` method:

In [6]:
# Caution: text_token holds stringified token lists; VADER expects raw text such as df['text']
df['polarity'] = df['text_token'].apply(analyzer.polarity_scores)

In [7]:
df.head(3)

Unnamed: 0,created_at,id,author_id,text,text_token,text_token_s,text_si,text_sil,polarity
0,2021-12-10 07:20:45+00:00,1469205428227784711,44196397,@albi_sidearms maybe i will …,"['albi_sidearms', 'maybe']","['albi_sidearms', 'maybe']",albi_sidearms maybe,albi_sidearms maybe,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
1,2021-12-10 07:19:05+00:00,1469205011687223298,44196397,@jack https://t.co/ueyr6nawap,"['jack', 'co', 'ueyr6nawap']","['jack', 'co', 'ueyr6nawap']",jack ueyr6nawap,jack ueyr6nawap,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
2,2021-12-10 06:44:19+00:00,1469196261953884160,44196397,@sawyermerritt 🤣🤣,['sawyermerritt'],['sawyermerritt'],sawyermerritt,sawyermerritt,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
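
Each result is a dictionary with the keys `neg`, `neu`, `pos` and `compound`. In the VADER reference implementation, `compound` is the summed word valence squashed into the range -1 to 1. The sketch below illustrates that normalization step, assuming the default constant `alpha=15`. The standalone function is written for illustration only and is not a call into the NLTK API.

```python
import math


def normalize(score, alpha=15.0):
    """Squash an unbounded valence sum into the interval [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, norm))


# Hypothetical raw valence sums, chosen only to show the shape of the mapping
for raw in (0.0, 1.9, -3.1, 10.0):
    print(raw, round(normalize(raw), 4))
```

A sum of zero stays at zero, which is why tweets without any lexicon words end up with `compound` equal to 0.0.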


### Transform data

In [8]:
# Change data structure (we unnest the column polarity and add new columns)

df = pd.concat([df.drop(['polarity'], axis=1), df['polarity'].apply(pd.Series)], axis=1)
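
The unnesting step turns each dictionary key into its own column. The sketch below shows the same idea on toy data (the two score dictionaries are made up for illustration). It builds the new columns with `pd.DataFrame` on the list of dictionaries, which we assume gives the same result as `apply(pd.Series)` here and tends to be faster on larger frames.

```python
import pandas as pd

# Toy frame with a column of score dictionaries (values are made up)
toy = pd.DataFrame({
    "text": ["good", "bad"],
    "polarity": [
        {"neg": 0.0, "neu": 0.0, "pos": 1.0, "compound": 0.44},
        {"neg": 1.0, "neu": 0.0, "pos": 0.0, "compound": -0.54},
    ],
})

# Build one column per dictionary key, keeping the original row index
scores = pd.DataFrame(toy["polarity"].tolist(), index=toy.index)

# Replace the nested column with the flat score columns
toy = toy.drop(columns="polarity").join(scores)
print(toy.columns.tolist())
```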

In [16]:
df.head()

Unnamed: 0,created_at,id,author_id,text,text_token,text_token_s,text_si,text_sil,neg,neu,pos,compound
0,2021-12-10 07:20:45+00:00,1469205428227784711,44196397,@albi_sidearms maybe i will …,"['albi_sidearms', 'maybe']","['albi_sidearms', 'maybe']",albi_sidearms maybe,albi_sidearms maybe,0.0,1.0,0.0,0.0
1,2021-12-10 07:19:05+00:00,1469205011687223298,44196397,@jack https://t.co/ueyr6nawap,"['jack', 'co', 'ueyr6nawap']","['jack', 'co', 'ueyr6nawap']",jack ueyr6nawap,jack ueyr6nawap,0.0,1.0,0.0,0.0
2,2021-12-10 06:44:19+00:00,1469196261953884160,44196397,@sawyermerritt 🤣🤣,['sawyermerritt'],['sawyermerritt'],sawyermerritt,sawyermerritt,0.0,1.0,0.0,0.0
3,2021-12-10 04:42:00+00:00,1469165476911755264,44196397,@sawyermerritt tesla china has done amazing work,"['sawyermerritt', 'tesla', 'china', 'done', 'a...","['sawyermerritt', 'tesla', 'china', 'done', 'a...",sawyermerritt tesla china done amazing work,sawyermerritt tesla china done amazing work,0.0,1.0,0.0,0.0
4,2021-12-10 04:21:25+00:00,1469160298158383109,44196397,@mrbeast 🙏,['mrbeast'],['mrbeast'],mrbeast,mrbeast,0.0,1.0,0.0,0.0


Create a new variable that labels each tweet as "neutral", "positive" or "negative".


Hint:


---

```python
df['___'] = df['___'].___(___ x: '___' if ___ >___ else '___' if ___ else '___')
```

---

- Name the new variable `sentiment`
- Use variable `compound` as basis
- Apply a lambda function to each value of `compound`.
- The lambda function should return a label for each value:
  - 'positive' `if x>0`
  - 'neutral' `if x==0`
  - 'negative' for all other cases (`else`)


In [None]:
### BEGIN SOLUTION
df['sentiment'] = df['compound'].apply(lambda x: 'positive' if x > 0 else 'neutral' if x == 0 else 'negative')
### END SOLUTION
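
The nested conditional expression can be harder to read than a named function. The sketch below is an equivalent helper for numeric scores. The name `label_sentiment` and the `threshold` parameter are hypothetical additions. With the default `threshold=0.0` it reproduces the rule above. A small positive value of around 0.05, which the VADER authors suggest as a typical cut-off, widens the neutral band.

```python
def label_sentiment(compound, threshold=0.0):
    """Map a compound score to 'positive', 'neutral' or 'negative'."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores to illustrate the three outcomes
for score in (0.62, 0.0, -0.3, 0.03):
    print(score, label_sentiment(score), label_sentiment(score, threshold=0.05))
```

On the DataFrame, this could be used as `df['sentiment'] = df['compound'].apply(label_sentiment)`.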

In [15]:
df.head()

Unnamed: 0,created_at,id,author_id,text,text_token,text_token_s,text_si,text_sil,neg,neu,pos,compound,sentiment
0,2021-12-10 07:20:45+00:00,1469205428227784711,44196397,@albi_sidearms maybe i will …,"['albi_sidearms', 'maybe']","['albi_sidearms', 'maybe']",albi_sidearms maybe,albi_sidearms maybe,0.0,1.0,0.0,0.0,neutral
1,2021-12-10 07:19:05+00:00,1469205011687223298,44196397,@jack https://t.co/ueyr6nawap,"['jack', 'co', 'ueyr6nawap']","['jack', 'co', 'ueyr6nawap']",jack ueyr6nawap,jack ueyr6nawap,0.0,1.0,0.0,0.0,neutral
2,2021-12-10 06:44:19+00:00,1469196261953884160,44196397,@sawyermerritt 🤣🤣,['sawyermerritt'],['sawyermerritt'],sawyermerritt,sawyermerritt,0.0,1.0,0.0,0.0,neutral
3,2021-12-10 04:42:00+00:00,1469165476911755264,44196397,@sawyermerritt tesla china has done amazing work,"['sawyermerritt', 'tesla', 'china', 'done', 'a...","['sawyermerritt', 'tesla', 'china', 'done', 'a...",sawyermerritt tesla china done amazing work,sawyermerritt tesla china done amazing work,0.0,1.0,0.0,0.0,neutral
4,2021-12-10 04:21:25+00:00,1469160298158383109,44196397,@mrbeast 🙏,['mrbeast'],['mrbeast'],mrbeast,mrbeast,0.0,1.0,0.0,0.0,neutral


### Analyze data

In [None]:
# Tweet with the most positive sentiment (highest compound score)
df.loc[df['compound'].idxmax()].values

In [None]:
# Tweet with the most negative sentiment (lowest compound score)
# ...this appears to be a misclassification caused by the word "deficit"
df.loc[df['compound'].idxmin()].values

### Visualize data

In [None]:
# Map each sentiment label to a color for the Altair plots
domain = ['neutral', 'positive', 'negative']
range_ = ['#b2d8d8', '#008080', '#db3d13']

alt.Chart(df).mark_bar().encode(
    x=alt.X('count()', title=None),
    y=alt.Y('sentiment', sort="-x"),
    color=alt.Color('sentiment', legend=None, scale=alt.Scale(domain=domain, range=range_))
).properties(
    title="Sentiment analysis",
    width=400,
    height=150,
)
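
The bar chart counts tweets per sentiment class. The sketch below shows a quick way to cross-check such counts numerically with `value_counts`, using a made-up label column (the toy values are not taken from the dataset).

```python
import pandas as pd

# Made-up sentiment labels for illustration only
labels = pd.Series(
    ["neutral", "positive", "neutral", "negative", "positive", "neutral"],
    name="sentiment",
)

# Absolute number of rows per label, sorted in descending order
counts = labels.value_counts()

# Relative share of each label
shares = labels.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```

On the DataFrame above, the corresponding call would be `df['sentiment'].value_counts()`.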

In [None]:
df.info()

In [None]:
alt.Chart(df).mark_line().encode(
    x=alt.X('created_at:T'),
    y=alt.Y('compound'),
    color=alt.Color('sentiment', scale=alt.Scale(domain=domain, range=range_))
)

In [None]:
alt.Chart(df).mark_boxplot().encode(
    x=alt.X('sentiment'),
    y=alt.Y('compound'),
    color=alt.Color('sentiment', scale=alt.Scale(domain=domain, range=range_))
).properties(
    width=200,
    height=200
)

Literature:

[Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.](https://ojs.aaai.org/index.php/ICWSM/article/view/14550)