Sentiment analysis can be done with or without building a machine learning model. This article goes over the Python implementation of VADER for non-model sentiment analysis.

After reading the article, you will learn:
* What VADER is
* How to use the Python library VADER for sentiment analysis


Let's get started!

# Step 1: Install And Import Python Libraries

The first step is to install and import Python libraries.
We need to install the `vaderSentiment` package for VADER.

In [8]:
# Install vaderSentiment package for VADER
!pip install vaderSentiment


After installing the packages, let's import the Python libraries. We need to import `pandas` and `numpy` for data processing. 

For the sentiment analysis, we need to import `SentimentIntensityAnalyzer` from `vaderSentiment`.

To check the sentiment prediction accuracy, we need to import `accuracy_score` from `sklearn`.

Last but not least, we set the `pandas` dataframe column width to be 1000, which will allow us to see more content from the review.

In [7]:
# Data processing
import pandas as pd
import numpy as np


# Import VADER sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Import accuracy_score to check performance
from sklearn.metrics import accuracy_score

# Set a wider column width
pd.set_option('display.max_colwidth', 1000)

# Step 2: Download And Read In Data

The second step is to download and read in the dataset. 




Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

In [4]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("./drive/My Drive/3SentimentAnalysis")

# Print out the current directory
!pwd

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/3SentimentAnalysis
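
If the same notebook is run both inside and outside Colab, the mount step can be guarded instead of skipped by hand. The sketch below is one possible approach: it assumes that Colab has already loaded the `google.colab` module when the kernel starts, and the helper name `running_in_colab` is made up for illustration.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab module is already loaded (assumed Colab behavior)."""
    return "google.colab" in sys.modules


if running_in_colab():
    # Only reached inside Colab, where the google.colab package is available
    from google.colab import drive
    drive.mount("/content/drive")
else:
    print("Not running in Colab; skipping the Drive mount.")
```

Outside Colab the import is never attempted, so the cell runs without an error.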


Now let's read the data into a `pandas` dataframe and see what the dataset looks like.

In [9]:
# Read in the tab-separated data; quoting=3 is csv.QUOTE_NONE, so quote characters stay literal
amz_review = pd.read_csv('a1_dump.tsv', delimiter='\t', quoting=3)

# Take a look at the data
amz_review.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.,1
4,The selection on the menu was great and so were the prices.,1
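
The `quoting` argument set to 3 in the `read_csv` call corresponds to `csv.QUOTE_NONE` from the standard library, which tells the parser to treat quote characters as ordinary text. The sketch below shows the effect with the standard `csv` module on two made-up rows; the sample text is an assumption and is not taken from the review file.

```python
import csv
import io

# Two made-up rows in the same tab-separated layout as the review file
sample = 'Review\tLiked\n"Best" pizza in town.\t1\nCrust is not good.\t0\n'

# csv.QUOTE_NONE has the integer value 3, which is what quoting=3 refers to
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

header, records = rows[0], rows[1:]
for review, liked in records:
    # The quotation marks around "Best" are kept as part of the review text
    print(int(liked), review)
```

Reviews often contain unbalanced quotation marks, so keeping them literal avoids rows being merged or split by accident.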


`.info()` helps us get information about the dataset.

From the output, we can see that this dataset has two columns, 900 records, and no missing data. The `Review` column is `object` type, and the `Liked` column is `int64` type.

In [10]:
# Get the dataset information
amz_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  900 non-null    object
 1   Liked   900 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 14.2+ KB



Next, let's check the distribution of the label. There are 496 positive and 404 negative reviews in the dataset, so the dataset is roughly balanced. For a roughly balanced dataset, we can use accuracy as the performance metric.



In [13]:
# Check the label distribution
amz_review['Liked'].value_counts()

1    496
0    404
Name: Liked, dtype: int64
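
To put the later accuracy number in context, it helps to know the class shares and the accuracy of a trivial model that always predicts the most common label. The sketch below works this out from the counts printed above; the variable names are made up for illustration.

```python
# Label counts copied from the value_counts() output above
counts = {1: 496, 0: 404}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common label
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any useful sentiment method should score clearly above this baseline of roughly 55%.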

# Step 4: What is VADER?

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool tuned for social media text and available as a Python library. It has built-in rules that adjust sentiment intensity based on punctuation, slang, emojis, and acronyms.

The output of VADER includes four scores: compound score, negative score, neutral score, and positive score.

* The pos, neu, and neg scores represent the proportion of the text that falls into each category, so they add up to 1 (100%).
* The compound score is a single score that measures the overall sentiment of the text. Similar to TextBlob's polarity score, it ranges from -1 (extremely negative) to 1 (extremely positive). Scores near 0 represent neutral sentiment.
* The compound score is not a simple aggregation of the pos, neu, and neg scores. Instead, it incorporates rule-based enhancements such as punctuation amplifiers.

In [14]:
# Example text
text = 'GrabNGoInfo.com is a great machine learning tutorial website.'

# VADER Sentiment
vader = SentimentIntensityAnalyzer()
vader_sentiment = vader.polarity_scores(text)
vader_sentiment

{'neg': 0.0, 'neu': 0.631, 'pos': 0.369, 'compound': 0.6249}

VADER gave the sample text 'GrabNGoInfo.com is a great machine learning tutorial website.' a compound score of 0.6249. There is no negative word in the sentence, so the neg score is 0. The neu score is 0.631 and the pos score is 0.369. These are valence-weighted proportions of the text rather than plain word counts, which is why the single positive word "great" accounts for 36.9% of the total.

The output of VADER is saved as a dictionary. We can extract the compound sentiment score by the key 'compound'.

In [15]:
# Extract sentiment score
vader_sentiment['compound']

0.6249

Let's check another sentence using VADER sentiment analysis.

In [16]:
text2 = 'The food is bad'
vader_sentiment2 = vader.polarity_scores(text2)
vader_sentiment2

{'neg': 0.538, 'neu': 0.462, 'pos': 0.0, 'compound': -0.5423}

In [17]:
vader_sentiment2['compound']

-0.5423
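
The scores in the two examples above can be reproduced by hand, which shows how the compound score differs from the pos, neu, and neg proportions. The sketch below is a simplified reading of VADER's scoring, not the library itself: it assumes lexicon valences of 3.1 for "great" and -2.5 for "bad", treats every other token as neutral, and skips all rule-based adjustments such as negation or punctuation. The function names are made up for illustration.

```python
import math


def normalize(valence_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum**2 + alpha)
    return max(-1.0, min(1.0, score))


def vader_like_scores(valences: list[float]) -> dict[str, float]:
    """Score a sentence from per-token valences (0 means a neutral token)."""
    # Each sentiment-bearing token counts as its valence magnitude plus 1
    pos_sum = sum(v + 1 for v in valences if v > 0)
    neg_sum = sum(-v + 1 for v in valences if v < 0)
    neu_count = sum(1 for v in valences if v == 0)
    total = pos_sum + neg_sum + neu_count
    return {
        "neg": round(neg_sum / total, 3),
        "neu": round(neu_count / total, 3),
        "pos": round(pos_sum / total, 3),
        "compound": round(normalize(sum(valences)), 4),
    }


# Assumed valences: "great" = 3.1, "bad" = -2.5, every other token = 0
great_sentence = [0, 0, 0, 3.1, 0, 0, 0, 0]  # 8 tokens, one positive
bad_sentence = [0, 0, 0, -2.5]               # 4 tokens, one negative

print(vader_like_scores(great_sentence))
print(vader_like_scores(bad_sentence))
```

Under these assumptions the sketch gives a compound score of 0.6249 for the first sentence and -0.5423 for the second, matching the VADER output shown above.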

# Step 7: How To Use VADER For Sentiment Analysis

In this step, we will apply VADER to the Amazon review dataset and see how it performs.

We first get the compound sentiment score for each review and save the values into a column called 'scores_VADER'. Then we check whether the compound score is non-negative. If the score is greater than or equal to zero, the predicted sentiment for the review is positive (labeled as 1). Otherwise, the predicted sentiment for the review is negative (labeled as 0).
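
The labeling rule can be sketched on its own to make the cutoff explicit. Note that a review with a compound score of exactly 0 (no sentiment words found) is labeled positive under this rule. The VADER documentation suggests 0.05 and -0.05 as conventional cutoffs for positive and negative text, so the sketch also shows a stricter threshold for comparison. The helper name `label_from_compound` and the scores are made up for illustration.

```python
def label_from_compound(score: float, threshold: float = 0.0) -> int:
    """Return 1 (positive) when score >= threshold, otherwise 0 (negative)."""
    return 1 if score >= threshold else 0


# Illustrative compound scores, including an exactly neutral 0.0
scores = [0.8271, -0.3412, 0.0, 0.03, -0.5574]

# Rule used in this article: anything at or above zero is positive
print([label_from_compound(s) for s in scores])

# Stricter variant: weakly positive or neutral text falls into the negative class
print([label_from_compound(s, threshold=0.05) for s in scores])
```

Because the dataset only has two labels, neutral scores must land in one class or the other, and the choice of threshold decides which.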

In [18]:
amz_review.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.,1
4,The selection on the menu was great and so were the prices.,1


In [21]:
# Get sentiment score for each review
vader_sentiment = SentimentIntensityAnalyzer()
amz_review['scores_VADER'] = amz_review['Review'].apply(lambda s: vader_sentiment.polarity_scores(s)['compound'])

# Predict sentiment label for each review
amz_review['pred_VADER'] = amz_review['scores_VADER'].apply(lambda x: 1 if x >= 0 else 0)
amz_review.head()

Unnamed: 0,Review,Liked,scores_VADER,pred_VADER
0,Wow... Loved this place.,1,0.8271,1
1,Crust is not good.,0,-0.3412,0
2,Not tasty and the texture was just nasty.,0,-0.5574,0
3,Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.,1,0.6908,1
4,The selection on the menu was great and so were the prices.,1,0.6249,1


After getting predictions from VADER, let's check the prediction accuracy. 

In [23]:
# Compare Actual and Predicted
accuracy_score(amz_review['Liked'],amz_review['pred_VADER'])

0.7544444444444445

Comparing the actual label with the VADER prediction, we get an accuracy score of about 0.754, which means that VADER predicted the review sentiment correctly 75.4% of the time.
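
Accuracy is the share of reviews where the predicted label matches the actual label. The sketch below computes it by hand and also counts the two kinds of errors, which a single accuracy number hides. The labels are made up for illustration and are not taken from the review data.

```python
from collections import Counter

# Made-up labels for illustration only, not taken from the review data
actual = [1, 0, 0, 1, 1, 0, 1, 0]
predicted = [1, 0, 1, 1, 1, 1, 0, 0]

# Accuracy is the share of positions where prediction and label agree
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Count each (actual, predicted) pair to see which kind of error dominates
confusion = Counter(zip(actual, predicted))

print(f"accuracy: {accuracy:.3f}")
print("false positives:", confusion[(0, 1)])
print("false negatives:", confusion[(1, 0)])
```

With a threshold that labels neutral scores as positive, it is worth checking whether false positives outnumber false negatives on the real data.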

TextBlob, which is not run in this article, has a prediction accuracy of 68.8% for the same dataset, so VADER is about 7 percentage points more accurate than TextBlob.