<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Sentiment Analysis with VADER

**Description:** This notebook describes Sentiment Analysis and demonstrates basic applications using:
* VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based algorithm
* scikit-learn's Multinomial Naive Bayes, a machine learning classifier

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion Time:** 60 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** None

**Data Format:** None

**Libraries Used:** vaderSentiment, sklearn, pandas, nltk

**Research Pipeline:** None
___

## Methods for Sentiment Analysis

Sentiment analysis can help an analyst discover whether feedback is positive, negative, or mixed. For example, a large company like Amazon or Walmart could use sentiment analysis on user reviews to determine whether a featured product should be promoted or discontinued. Sentiment analysis generally falls into two categories:

* Rule-based algorithms
* Machine Learning models 

### Rule-Based Algorithms

Rule-based algorithms assign sentiment scores to particular words or multi-word constructions. Simple algorithms may assess each word in a feedback document individually and add up an overall score. More complex algorithms may assess multi-word (or n-gram) constructions and apply special rules for issues such as negation, so they can detect the difference between "bad", "not bad", and "bad ass". Some algorithms also support emojis and emoticons, such as "=)" and "😁".
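
To make the idea concrete, here is a minimal sketch of a rule-based scorer with a single negation rule. The lexicon values, the `NEGATIONS` set, and the function name `toy_score` are all invented for illustration; this is not how any particular library implements its rules.

```python
# Toy lexicon: token -> valence. These values are illustrative only.
TOY_LEXICON = {"good": 1.9, "bad": -2.5, "great": 3.1, "awful": -2.0}

# Words that flip the sentiment of the word that follows them (an assumption)
NEGATIONS = {"not", "never", "no"}


def toy_score(text):
    """Sum token valences, flipping the sign of a word that follows a negation."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token, 0.0)  # unknown tokens count as neutral
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence
        total += valence
    return total


print(toy_score("bad"))      # -2.5
print(toy_score("not bad"))  # 2.5
```

A real rule-based algorithm needs many more rules than this, which is why the rest of this notebook relies on an existing one.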

### Machine Learning Models

Machine learning models rely on feedback data that has already been assessed by humans to have a particular sentiment. Each piece of feedback is **labeled** by a human reader who may place the feedback into a particular category. The categories could be as simple as positive, negative, or neutral. As long as **labeled** data exists, a machine learning model can often identify complex concepts. For example, a car manufacturer may wish to classify feedback from past buyers as "budget-conscious", "eco-conscious", "tech-enthusiastic", "luxury-driven", "performance-driven", etc. Assuming there is adequately labeled **training data** for each of these categories, a machine learning model could assign a score for each category. This could help analysts understand the brand better, answering questions about what consumers do or do not like about a particular vehicle.

In the humanities, sentiment analysis could be used to track emerging trends on social media. For example, we might ask: "How are Twitter or Reddit users responding to a particular government policy or public event?" We could look at a hashtag like "#blm" and get a sense of national sentiment on the Black Lives Matter movement. The project [On the Books: Jim Crow and Algorithms of Resistance](https://onthebooks.lib.unc.edu/) is using machine classification to detect racist laws based on the pioneering work of [Pauli Murray](https://en.wikipedia.org/wiki/Pauli_Murray) and [Safiya Noble](https://en.wikipedia.org/wiki/Safiya_Noble)'s concept of "algorithmic oppression". 

## VADER

This notebook uses VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based algorithm that is "specifically attuned to sentiments expressed in social media." It relies on a specialized **lexicon** of words, phrases, and emojis. Each token in the lexicon is assigned a "mean-sentiment rating" between -4 (extremely negative) and 4 (extremely positive). Here are a few examples:

|Token|Mean-Sentiment Rating|
|---|---|
|(:|2.2|
|/:|-1.3|
|):<|-1.9|
|rotflmao|2.8|
|aghast|-1.9|
|awesome|3.1|
|awful|-2.0|

There are over 7500 tokens listed in the VADER lexicon. (You can also add your own if you like.) VADER also applies grammatical and syntactical rules to measure intensity based on word order and sensitive relationships between terms. For example, it increases or decreases a sentiment score based on degree modifiers: compare "The product is good" with "the product is very good" and "the product is marginally good." To read more about VADER, including how it works and to see its code, [visit the GitHub page](https://github.com/cjhutto/vaderSentiment).
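
The effect of degree modifiers can be sketched in a few lines. This is a simplified illustration, not VADER's actual rule: the `BOOST` constant, the `BOOSTERS` table, the toy lexicon, and the function `boosted_valence` are assumptions chosen to resemble the behavior described above.

```python
# Illustrative constant: how much a degree modifier shifts a word's valence.
BOOST = 0.293

# Intensifiers push a score away from zero; dampeners pull it toward zero.
BOOSTERS = {"very": BOOST, "extremely": BOOST, "marginally": -BOOST, "slightly": -BOOST}

# Toy lexicon with made-up valences
TOY_LEXICON = {"good": 1.9, "bad": -2.5}


def boosted_valence(modifier, word):
    """Return the valence of `word` after applying the preceding `modifier`."""
    valence = TOY_LEXICON.get(word, 0.0)
    if valence == 0.0:
        return 0.0  # neutral words stay neutral
    boost = BOOSTERS.get(modifier, 0.0)
    # Add the boost for positive words, subtract it for negative words
    return valence + boost if valence > 0 else valence - boost


for modifier in ("", "very", "marginally"):
    print(repr(modifier), round(boosted_valence(modifier, "good"), 3))
```

With these toy values, "very good" scores higher than "good", which in turn scores higher than "marginally good".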

## Applying the VADER Algorithm
First, we need to import the SentimentIntensityAnalyzer. Here we create an analyzer object, which holds the VADER lexicon, and assign it to a variable `sa`.

In [None]:
# Import the SentimentIntensityAnalyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Create the variable sa to hold the analyzer object (which contains the VADER lexicon)
sa = SentimentIntensityAnalyzer()

We can preview the contents of the lexicon by using `sa.lexicon`. This will return a dictionary, where each key is a token and each value is a sentiment rating.

In [None]:
# Preview the lexicon contents
# There are over 7500 tokens in the lexicon
sa.lexicon

In [None]:
# Check if a word is in the lexicon
test_word = 'sweet' # The word to check for

# Get the word's score or print a message for missing words
sa.lexicon.get(test_word, 'No score for that word') 

In order to do our analysis, we will use a very small sample of 8 user reviews. Each review is a simple text string inside a list variable called `product_reviews`.

In [None]:
# Define a list of product reviews

product_reviews = [
    'I love this product. It helps me get so much work done. I tell everyone about what a great thing it is.',
    'This product is defective. I feel like it is broken because it does not do what it promises. Do not buy this.',
    'Do yourself a favor and buy this product as soon as possible. I recommend it to everyone I know. It has saved me so much time!',
    'This product is overpriced and useless. It was a waste of money and it made all my hair fall out.',
    'Works like a dream and it is a bargain! It solves my problems with ease. I bought two!',
    'Do not buy! This product is a ripoff. I wish it was better, but it fails constantly. What a mistake!',
    'This thing is garbage. Do yourself a favor and save the money. Mine is a dumpster fire and fell apart.',
    'I adore this product. =) It makes my life so much easier. And it is a deal!'
]

Now we will analyze each review and assign it a "normalized, weighted composite score" based on summing the valence scores of each word in the review that appears in the lexicon (with some adjustments based on word order and other rules). VADER also measures the proportion of text that falls into positive, negative, and neutral sentiment. The composite score, returned under the key `compound`, falls between -1 (the most negative) and +1 (the most positive). (This is different from the lexicon scores, which fall between -4 and +4!)
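
How can a sum of scores from -4 to +4 end up between -1 and +1? One way to squash the sum is sketched below. The formula and the default `alpha=15` reflect our understanding of VADER's published source code, so treat both as assumptions rather than a specification.

```python
import math


def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the range -1 to +1."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clip to guard against any floating point overshoot
    return max(-1.0, min(1.0, score))


# Larger sums move toward -1 or +1 but never pass them
for total in (-8.0, -1.9, 0.0, 3.1, 8.0):
    print(total, round(normalize(total), 4))
```

A sum of 0 stays at 0, and the output grows more slowly as the sum gets larger, so a long review full of positive words approaches +1 without exceeding it.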

In [None]:
# For each review in our `product_reviews` list
# Store a polarity score in `scores`
# Then print the score followed by the review
for review in product_reviews:
    scores = sa.polarity_scores(review)
    print(scores['compound'], review)

Our simple analysis does a fairly good job of assessing positive and negative sentiment. Notice that our second-to-last review was not scored accurately, though:
> 0.5423 This thing is garbage. Do yourself a favor and save the money. Mine is a dumpster fire and fell apart.

The VADER lexicon contains the following entries:

|Token|Mean-Sentiment Rating|
|---|---|
|favor|1.7|
|fire|-1.4|

VADER assigns a value of -1.4 for "fire" but "fire" can also have a positive connotation, such as "straight fire." However, words like "garbage" and "dumpster," as in "dumpster fire," are less ambiguous. If a specific token is not found in the VADER lexicon, it is considered to be neutral. Like any other statistical approach, the process benefits from having more data. In this case, the sentences are very short and several significant words do not happen to exist in our lexicon. 

## Adding Tokens to the VADER Lexicon

The `sa.lexicon` attribute is a simple dictionary, so we can add any words that we want included. There are some guidelines for best scoring practices in the academic paper linked on [VADER's GitHub repository](https://github.com/cjhutto/vaderSentiment). (Remember that lexicon tokens are scored from -4 to +4.)

In [None]:
# Adding the dictionary of `new_words`
# to sa.lexicon

new_words = {
    'garbage': -2.0,
    'dumpster': -3.1,
}

sa.lexicon.update(new_words)
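
Because the lexicon is a plain dictionary, nothing stops us from adding a score outside the -4 to +4 range by mistake. The helper below is a hypothetical convenience (the name `validated_update` is ours, not part of vaderSentiment) that checks the range before updating. It is shown on a toy dictionary, but it should work equally well on `sa.lexicon`.

```python
def validated_update(lexicon, new_words, low=-4.0, high=4.0):
    """Add new_words to lexicon, rejecting any score outside low..high."""
    for token, score in new_words.items():
        if not low <= score <= high:
            raise ValueError(f"{token!r} has score {score}, expected {low} to {high}")
    # Only update once every score has passed the check
    lexicon.update(new_words)
    return lexicon


toy_lexicon = {"awesome": 3.1}
validated_update(toy_lexicon, {"garbage": -2.0, "dumpster": -3.1})
print(toy_lexicon)
```

Checking all scores before calling `update` means a single bad entry leaves the lexicon unchanged.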

Let's try our analysis again with the new lexicon.

In [None]:
# For each review in our `product_reviews` list
# Store a polarity score in `scores`
# Then print the score followed by the review

for review in product_reviews:
    scores = sa.polarity_scores(review)
    print(scores['compound'], review)

## Sentiment Analysis with Machine Learning

The primary advantage of using a machine learning classifier for sentiment analysis is that there is no need to maintain a lexicon, assign sentiment scores to particular words, develop linguistic rules based on grammatical structures (negation, intensifiers), or keep track of novel expressions (slang, emoticons, etc.).

In order to deploy a machine learning classifier, however, we need accurate, human-labeled data. If there is no existing labeled data, then collecting and labeling the data can be a laborious task. Without labeled data, there is no way for us to know if a machine learning classifier is accurate.

There are a variety of classifiers, such as Naive Bayes, Decision Trees, Logistic Regression, K-Nearest Neighbor, Deep Learning, and Support Vector Machines. We will use a [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) approach with the Python library [scikit-learn](https://scikit-learn.org/stable/).

### Train/Validation/Test Split

While we could throw all our data into training our machine, we would have no way to know if the machine's predictions were accurate. For this reason, it is common practice to split the labeled data into two or three groups. 

A simple train/test split is usually 80/20 or 70/30. With a 70/30 split, for example, 70% of the data is used to train the machine while 30% of the data is held out. This "held out" test data is then used to discover whether the machine's predictions generalize to data it has not yet seen.

Another, more rigorous, approach is to split the data into three groups: Train, Validation, Test. The mix is usually about 60/20/20. In practice, there should be little difference between the Validation and Test sets. However, usually the model is trained on the Train set, the hyperparameters are tuned on the Validation set, and the Test set confirms the model's accuracy on novel data.
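
A 60/20/20 split can be sketched with nothing but the standard library. The function name `train_val_test_split` and its parameters are our own; scikit-learn offers `train_test_split` for the two-way case, which can be called twice to get three groups.

```python
import random


def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle items and split them into train, validation, and test lists."""
    shuffled = list(items)
    # A seeded generator makes the shuffle repeatable
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before splitting matters: if the data is sorted (for example, all positive reviews first), an unshuffled split would put very different data in each group.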

## Classifying Film Review Sentiment Snippets

The fundamental methods for employing machine learning are fairly straightforward. The hard part, by far, is getting access to good data. The VADER repository contains several labeled datasets for testing and demonstration, including tweets, NY Times editorials, movie reviews, and Amazon reviews. We could certainly use this data to improve the VADER model by modifying our lexicon and implementing new linguistic rules, but we can also use it to train a machine learning classifier.

First, we need to download and extract the data from the VADER GitHub repository.

In [None]:
# Download sample datasets from VADER and decompress them
import urllib.request
import tarfile
import os

# Create the data directory if it does not exist, then move into it
os.makedirs('./data', exist_ok=True)
os.chdir('./data')

# Retrieve the file
file_name = 'sentiment_data.tar.gz'

url = 'https://github.com/cjhutto/vaderSentiment/raw/master/additional_resources/hutto_ICWSM_2014.tar.gz'
urllib.request.urlretrieve(url, file_name)
print('Sample datasets retrieved.')

# Uncompress the file; the with-statement closes the archive when done
with tarfile.open(file_name, 'r:gz') as tar:
    tar.extractall()
print('Datasets uncompressed.')

Now we create a pandas dataframe to store our film review snippets.

In [None]:
# Import data into Pandas
import pandas as pd
movies = pd.read_csv('./hutto_ICWSM_2014/movieReviewSnippets_GroundTruth.txt', sep="\t", header=None)
movies.columns = ["id", "sentiment", "text"]
movies = movies.set_index('id')

Let's expand the column width in Pandas so we can read the full snippets.

In [None]:
# Expand Pandas display
pd.set_option('display.max_colwidth', 400)

In [None]:
# Preview the data
movies.head(10)

If we look at the `'sentiment'` column, we can see the total number of snippets and the min/max for the sentiment scoring.

In [None]:
movies['sentiment'].describe().round(1)

We create a bag of words corpus from the text of the `movies` dataframe using a tokenizer from the Natural Language Toolkit called `casual_tokenize` which specializes in informal language, particularly Twitter.

In [None]:
# Tokenize our dataset, create a bag of words corpus using Counter() objects
from collections import Counter
from nltk.tokenize import casual_tokenize

bow_corpus = []
for text in movies.text:
    bow_corpus.append(Counter(casual_tokenize(text)))

In [None]:
# Preview our Python list of bags of words.
# Each snippet is converted into a Counter() object
# which counts the number of occurrences of its tokens
bow_corpus
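
If `Counter` is unfamiliar, this small sketch shows what a bag of words looks like for two made-up snippets. It uses `str.split` as a stand-in tokenizer so that it runs without NLTK; `casual_tokenize` handles punctuation and emoticons far more carefully.

```python
from collections import Counter

# Two invented snippets, already spaced so that split() tokenizes them
snippets = ["a great , great film", "not a great film"]

# One Counter per snippet: each maps a token to its number of occurrences
bags = [Counter(snippet.split()) for snippet in snippets]

print(bags[0]["great"])   # 2
print(bags[1]["boring"])  # 0, since a Counter returns 0 for missing tokens
```

Word order is discarded: the bag for "not a great film" records that "not" and "great" both occur, but not that one precedes the other.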

In [None]:
# Convert our list of Counter() objects into a dataframe
# where each row represents a snippet
# and each column is a particular token
df_bows = pd.DataFrame.from_records(bow_corpus)

In [None]:
# Preview our dataframe 
df_bows.head()

In [None]:
df_bows.shape

Our data has a lot of NaN entries which will throw an error when we start our training, so we fill these in with 0s.

In [None]:
# Change NaNs to 0
df_bows = df_bows.fillna(0).astype(int)
df_bows.head()
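
The NaN values appear because most tokens are absent from most snippets. The sketch below builds the same kind of snippet-by-token table in plain Python, filling in 0 for absent tokens directly. The two bags are invented, and the variable names are ours.

```python
from collections import Counter

# Two invented bags of words
bags = [Counter({"great": 2, "film": 1}), Counter({"boring": 1, "film": 1})]

# The vocabulary is every token seen in any bag, in a fixed order
vocabulary = sorted(set().union(*bags))

# One row per snippet, one column per token, 0 where a token is absent
matrix = [[bag.get(token, 0) for token in vocabulary] for bag in bags]

print(vocabulary)  # ['boring', 'film', 'great']
print(matrix)      # [[0, 1, 2], [1, 1, 0]]
```

With thousands of snippets and tokens, nearly every cell in such a table is 0, which is why it is called sparse.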

In [None]:
# Preview the data frame with columns from a given snippet
columns = list(bow_corpus[3].keys())

df_bows.head()[columns]

Now we actually train the model. We import sklearn and fit the model.

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# Convert the output variable (sentiment float) to a discrete label
nb = nb.fit(df_bows, movies.sentiment > 0)
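
To see roughly what a multinomial Naive Bayes classifier computes, here is a from-scratch sketch using add-one smoothing. It is a teaching aid under our own assumptions (the function names, the toy documents, and the smoothing value `alpha=1.0`), not a description of scikit-learn's internals.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Return (log_priors, log_likelihoods) learned from token-count documents."""
    vocabulary = sorted({token for doc in docs for token in doc})
    class_counts = Counter(labels)
    token_counts = {label: Counter() for label in class_counts}
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)  # add this document's counts to its class
    log_priors = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihoods = {}
    for c, counts in token_counts.items():
        # Smoothing gives unseen tokens a small, non-zero probability
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihoods[c] = {t: math.log((counts[t] + alpha) / total) for t in vocabulary}
    return log_priors, log_likelihoods


def predict_nb(doc, log_priors, log_likelihoods):
    """Return the class with the highest log-probability for a token-count document."""
    scores = {}
    for c in log_priors:
        scores[c] = log_priors[c] + sum(
            n * log_likelihoods[c][t] for t, n in doc.items() if t in log_likelihoods[c]
        )
    return max(scores, key=scores.get)


docs = [Counter("a great film".split()), Counter("great fun".split()),
        Counter("a boring film".split()), Counter("boring and dull".split())]
labels = [True, True, False, False]  # True means positive sentiment
model = train_nb(docs, labels)
print(predict_nb(Counter("great great film".split()), *model))  # True
print(predict_nb(Counter("boring film".split()), *model))       # False
```

Tokens that never appeared in training are skipped at prediction time, a simplification that keeps the sketch short.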

In [None]:
# Convert our binary classification (False or True) to -4 or 4
# This will help us compare it to the "ground truth" sentiment
movies['predicted_sentiment'] = nb.predict(df_bows) * 8 - 4

In [None]:
# We calculate the average absolute value of the prediction error
# This is the Mean Absolute Error (MAE)
movies['error'] = (movies.predicted_sentiment - movies.sentiment).abs()
movies.error.mean()
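
The Mean Absolute Error calculation is easy to check by hand on a few values. The numbers below are invented; the function name `mean_absolute_error` is ours (scikit-learn provides a function of the same name in `sklearn.metrics`).

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between predictions and true values."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


actual = [2.5, -1.0, 0.5, -3.0]   # made-up "ground truth" ratings
predicted = [4, -4, 4, -4]        # binary predictions mapped to -4 or 4

# Individual errors are 1.5, 3.0, 3.5, and 1.0
print(mean_absolute_error(actual, predicted))  # 2.25
```

Note that every prediction here has the correct sign, yet the error is still large. Because the classifier can only answer -4 or 4, mildly positive or mildly negative snippets always contribute a sizable error.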

In [None]:
movies['sentiment_ispositive'] = (movies.sentiment > 0).astype(int)

In [None]:
movies['predicted_ispositive'] = (movies.predicted_sentiment > 0).astype(int)

In [None]:
movies['sentiment predicted_sentiment sentiment_ispositive predicted_ispositive'.split()].head(15)

In [None]:
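# Calculate the accuracy of our predictions (positive/negative)
# Note: these predictions were made on the same data used for training,
# so this is not a measure of accuracy on unseen data
(movies.predicted_ispositive == movies.sentiment_ispositive).sum() / len(movies)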
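
Accuracy is a single number, so it can hide which kind of mistake a classifier makes. The sketch below, using invented labels and our own function name `confusion_counts`, breaks the predictions down into true/false positives and negatives before computing accuracy.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1   # predicted positive, actually positive
        elif not a and not p:
            counts["tn"] += 1   # predicted negative, actually negative
        elif p:
            counts["fp"] += 1   # predicted positive, actually negative
        else:
            counts["fn"] += 1   # predicted negative, actually positive
    return counts


actual = [1, 1, 1, 0, 0, 0, 1, 0]      # made-up true labels
predicted = [1, 1, 0, 0, 0, 1, 1, 0]   # made-up predictions

counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
print(counts)    # {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}
print(accuracy)  # 0.75
```

For a fair estimate, these counts should be computed on held-out test data rather than on the data the model was trained on.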