# Sentiment Analysis with NLTK

## Python setup


We need the following modules:

- NLTK
- Pandas
- Altair

In [None]:
import nltk

# we suppress some unimportant warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Data

### Data import

In [None]:
import pandas as pd

# Import prepared Twitter data from CNN Breaking News
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/sentiment-cnn.csv")

df.head(3)

### Data corrections

In [None]:
df['created_at'] = pd.to_datetime(df['created_at'])

df.info()

## Analysis

### VADER lexicon

- NLTK provides a simple rule-based model for general sentiment analysis called VADER, which stands for "Valence Aware Dictionary and Sentiment Reasoner" (Hutto & Gilbert, 2014).

In [None]:
nltk.download('vader_lexicon')
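To make "rule-based" concrete, here is a toy sketch of lexicon-based scoring. It is not VADER's algorithm: the words, the valence values and the single negation rule are made up for illustration, and `TOY_LEXICON`, `NEGATIONS` and `toy_valence` are hypothetical names. VADER itself combines a much larger valence lexicon with several heuristics (for example for negation and intensifiers).

```python
# Toy illustration only: made-up valences and one made-up rule, NOT VADER.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}
NEGATIONS = {"not", "never", "no"}


def toy_valence(text):
    """Sum word valences; flip the sign of a word that follows a negation."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token, 0.0)  # unknown words count as 0
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence
        total += valence
    return total


print(toy_valence("a great result"))
print(toy_valence("not good"))
```

The point of the sketch is that no training data is involved: the score follows directly from the lexicon and the rules.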

### Sentiment Intensity Analyzer

- Create a `SentimentIntensityAnalyzer` instance named `analyzer`:

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

### Polarity scores

- Apply the `polarity_scores` method to each text. It returns a dictionary with the keys `neg`, `neu`, `pos` and `compound`:

In [None]:
df['polarity'] = df['text_sil'].apply(analyzer.polarity_scores)

In [None]:
df.head(3)
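The `compound` value is a single summary score between -1 and 1. As a hedged sketch of how such a bounded score can be obtained: to our knowledge, the reference VADER implementation squashes the summed word valences with x / sqrt(x² + alpha) and a default of alpha = 15. Treat both the formula and the constant as assumptions and check them against your installed NLTK version; `normalize` below is our own helper, not an NLTK call.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1).

    Assumption: mirrors the normalization used by the reference VADER code.
    """
    return score / math.sqrt(score * score + alpha)


for raw in (-50, -2, 0, 2, 50):
    print(raw, round(normalize(raw), 3))
```

Larger absolute sums move the result closer to -1 or 1 without ever reaching them, which is why very emotional tweets cluster near the extremes.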

### Transform data

In [None]:
# Unnest the polarity dictionaries: each key (neg, neu, pos, compound) becomes its own column
df = pd.concat([df.drop(['polarity'], axis=1), df['polarity'].apply(pd.Series)], axis=1)

In [None]:
df.head()
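The unnesting step can be tried on a small example first. The sketch below uses a made-up two-row DataFrame (`toy`, with invented scores) and builds the score columns with `pd.DataFrame(...tolist())`, an alternative to `apply(pd.Series)` that tends to be faster on larger data.

```python
import pandas as pd

# Made-up data for illustration; the scores are not real VADER output.
toy = pd.DataFrame({
    "text": ["good news", "bad news"],
    "polarity": [
        {"neg": 0.0, "neu": 0.3, "pos": 0.7, "compound": 0.44},
        {"neg": 0.7, "neu": 0.3, "pos": 0.0, "compound": -0.54},
    ],
})

# One column per dictionary key, aligned on the original index
scores = pd.DataFrame(toy["polarity"].tolist(), index=toy.index)
toy_wide = pd.concat([toy.drop(columns="polarity"), scores], axis=1)

print(toy_wide.columns.tolist())
# ['text', 'neg', 'neu', 'pos', 'compound']
```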

Create a new variable called `sentiment` that contains the entries "neutral", "positive" or "negative" (depending on the compound score).


Hint:


---

```python
df['___'] = df['___'].___(___ x: '___' if ___ >___ else '___' if ___ else '___')
```

---

- Name the new variable `sentiment`.
- Use the variable `compound` as the basis.
- Apply a lambda function to each value of `compound`.
- The lambda function should return a label:
  - 'positive' `if x > 0`
  - 'neutral' `if x == 0`
  - 'negative' in all other cases (`else`)


In [None]:
### BEGIN SOLUTION
df['sentiment'] = df['compound'].apply(lambda x: 'positive' if x > 0 else 'neutral' if x == 0 else 'negative')
### END SOLUTION

In [None]:
# check your code
# the first tweet should be labeled 'negative'
assert df['sentiment'].iloc[0] == 'negative'

In [None]:
df.head()
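The same labeling rule can also be written as a named function, which is easier to test than a nested conditional expression. This is a sketch: `label_sentiment` and its `threshold` parameter are our own additions. With the default `threshold=0.0` it reproduces the rule used above for numeric scores; a small dead zone such as 0.05 is a commonly cited convention for VADER, but that value is an assumption here and not part of this notebook.

```python
def label_sentiment(compound, threshold=0.0):
    """Map a compound score to 'positive', 'neutral' or 'negative'.

    Scores within [-threshold, threshold] are labeled 'neutral'.
    """
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


for score in (0.6, 0.0, -0.3):
    print(score, label_sentiment(score))
```

With pandas, it would be applied as `df['compound'].apply(label_sentiment)`.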

### Max and min sentiment

In [None]:
# Tweet with the highest compound score (most positive)
df[['text', 'compound', 'neg', 'neu', 'pos', 'sentiment']].loc[df['compound'].idxmax()]

In [None]:
# Tweet with the lowest compound score (most negative)
# ...this seems to be a misclassification caused by the word "deficit"
df[['text', 'compound', 'neg', 'neu', 'pos', 'sentiment']].loc[df['compound'].idxmin()]
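Note that `idxmax` and `idxmin` return index labels, not positions, which is why they are combined with `.loc` here. A minimal sketch on made-up data (`toy` and its values are invented) with a non-default index makes the difference visible:

```python
import pandas as pd

# Made-up data with index labels that differ from row positions
toy = pd.DataFrame(
    {"text": ["a", "b", "c"], "compound": [0.1, -0.8, 0.6]},
    index=[10, 20, 30],
)

best_label = toy["compound"].idxmax()   # index label of the maximum
worst_label = toy["compound"].idxmin()  # index label of the minimum

print(best_label, toy.loc[best_label, "text"])
print(worst_label, toy.loc[worst_label, "text"])
```

Using `.iloc` with these labels would fail or pick the wrong row whenever the index is not 0, 1, 2, and so on.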

### Visualize data

In [None]:
import altair as alt

# Map each sentiment category to a fixed color in the Altair plots
domain = ['neutral', 'positive', 'negative']
range_ = ['#b2d8d8', '#008080', '#db3d13']

alt.Chart(df).mark_bar().encode(
    x=alt.X('count()', title=None),
    y=alt.Y('sentiment', sort="-x"),
    color=alt.Color('sentiment', legend=None, scale=alt.Scale(domain=domain, range=range_))
).properties(
    title="Sentiment analysis",
    width=400,
    height=150,
)

In [None]:
# Function to add date variables to the DataFrame (modifies df in place and returns it).
def add_date_info(df):
    df['created_at'] = pd.to_datetime(df['created_at'], unit='ns')
    created = df['created_at'].dt  # datetime accessor of the column
    df['Year'] = created.year
    df['Month'] = created.month
    df['Day'] = created.day
    df['DOY'] = created.dayofyear
    df['Date'] = created.date
    return df

In [None]:
add_date_info(df)

In [None]:
# Convert Date from Python date objects to datetime64 so that Altair treats it as a temporal field
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
alt.Chart(df).mark_area().encode(
    x=alt.X('Date', axis=alt.Axis(format='%e.%-m.')),
    y=alt.Y('count(sentiment)'),
    color=alt.Color('sentiment', scale=alt.Scale(domain=domain, range=range_))
)
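The area chart counts tweets per date and sentiment category. As a sanity check, the same table can be computed directly in pandas. The sketch below uses a made-up three-row DataFrame (`toy`); the column name `n` is our own choice.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
    "sentiment": ["positive", "negative", "positive"],
})

# Number of rows per (Date, sentiment) combination
counts = toy.groupby(["Date", "sentiment"]).size().reset_index(name="n")
print(counts)
```

On the real data, replacing `toy` with `df` gives the values that the stacked areas represent.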

In [None]:
alt.Chart(df).mark_boxplot().encode(
    x=alt.X('sentiment'),
    y=alt.Y('compound'),
    color=alt.Color('sentiment', scale=alt.Scale(domain=domain, range=range_))
).properties(
    width=200,
    height=200
)

Literature:

[Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.](https://ojs.aaai.org/index.php/ICWSM/article/view/14550)