# VADER Sentiment Analysis

**VADER** is a handy library for getting quick sentiment analysis results, particularly on *social media* text. Its main **advantages** are:

* No labeling process is required.
* Fast and easy to deploy.
* Reasonable accuracy even without text preprocessing.

However, there are some **disadvantages** as well. The primary one is that VADER is a rule-based approach: it looks up a predefined polarity score for each word (and emoji!) and sums them to get the final score of the sentence or paragraph, depending on the level at which we want to extract the sentiment.

A second disadvantage I have discovered so far, connected to the first, is that accuracy tops out at a certain level compared to trained models. When a higher success rate is needed, I usually prefer training a model (such as BERT). In a future notebook, I intend to compare these results with a BERT model.

* Rule-based sentiment analysis with no learning.
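To make the "sum of predefined polarity scores" idea concrete, here is a minimal sketch. The lexicon `TOY_LEXICON` and the function `toy_compound` are hypothetical and written only for illustration; the real VADER lexicon is far larger and the library also applies heuristics for negation, intensifiers, capitalization and punctuation. The squashing step uses the form x / sqrt(x² + 15), which is, as far as I know, how VADER normalizes its compound score.

```python
import math

# Hypothetical toy lexicon; VADER's real lexicon is far larger and its values differ.
TOY_LEXICON = {"great": 3.0, "good": 2.0, "bad": -2.5, "terrible": -3.0}


def toy_compound(text, alpha=15.0):
    """Sum per-word polarity scores, then squash the total into (-1, 1)."""
    words = text.lower().split()
    # Unknown words contribute 0; strip simple trailing/leading punctuation first
    total = sum(TOY_LEXICON.get(w.strip(".,!?"), 0.0) for w in words)
    # Normalization of the form x / sqrt(x^2 + alpha)
    return total / math.sqrt(total * total + alpha)


print(toy_compound("Great sound, good price"))   # positive score
print(toy_compound("Terrible battery"))          # negative score
print(toy_compound("It arrived on Tuesday"))     # no lexicon words
```

Because nothing here is learned from data, a word missing from the lexicon simply contributes nothing, which is one reason accuracy plateaus.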

In [1]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2

In [2]:
import numpy as np 
import pandas as pd 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import time
import os

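We will be using the review data in `amazon/test.csv`, keeping the first 32,000 rows. The two features of interest are "text" and "label"; judging by the samples shown in the outputs, 1 marks a positive review and 0 a negative one.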

In [3]:
data = pd.read_csv('amazon/test.csv').head(32000)

In [4]:
data.shape

(32000, 2)

In [5]:
data.head()

Unnamed: 0,text,label
0,Great CD: My lovely Pat has one of the GREAT v...,1
1,One of the best game music soundtracks - for a...,1
2,Batteries died within a year ...: I bought thi...,0
3,"works fine, but Maha Energy is better: Check o...",1
4,Great for the non-audiophile: Reviewed quite a...,1


In [6]:
data.tail()

Unnamed: 0,text,label
31995,Not his best: This is one of the worst of Merr...,0
31996,Great little unit: Bought this unit to try out...,1
31997,Cute: This book is cute for kids. My preschool...,1
31998,What was I thinking?: I read all of Terry Broo...,0
31999,Hail Terry Brooks: I have enjoyed a lot of thi...,1


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32000 entries, 0 to 31999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    32000 non-null  object
 1   label   32000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 500.1+ KB


In [8]:
data.isnull().sum()

text     0
label    0
dtype: int64

In [9]:
data.dropna(inplace=True)

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32000 entries, 0 to 31999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    32000 non-null  object
 1   label   32000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 500.1+ KB


Initialize the sentiment analyzer and calculate the sentiment score of each entry in the "text" feature:

In [11]:
analyzer = SentimentIntensityAnalyzer()

In [12]:
def calculate_sentiment_scores(sentence):
    # 'compound' is VADER's normalized overall score, between -1 and 1
    return analyzer.polarity_scores(sentence)['compound']

In [13]:
start = time.time()

eng_snt_score = []

for comment in data.text.to_list():
    snts_score = calculate_sentiment_scores(comment)
    eng_snt_score.append(snts_score)

end = time.time()

# total time taken
print(f"Runtime of the program is {(end - start)/60} minutes or {(end - start)} seconds")

Runtime of the program is 0.235244083404541 minutes or 14.114645004272461 seconds
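The explicit loop can also be written with pandas `Series.map`, and `time.perf_counter` is generally the better clock for timing intervals. The sketch below is only an illustration: `toy_score` is a hypothetical stand-in scorer so the example runs without the library, and the three-row frame is made up. In this notebook the function passed to `map` would be `calculate_sentiment_scores`.

```python
import time

import pandas as pd


def toy_score(text):
    # Hypothetical stand-in scorer so the sketch runs without vaderSentiment
    return 1.0 if "great" in text.lower() else 0.0


# Made-up rows for illustration only
reviews = pd.DataFrame({"text": ["Great CD", "Batteries died", "Great little unit"]})

start = time.perf_counter()  # monotonic clock intended for measuring intervals
reviews["sentiment_score"] = reviews["text"].map(toy_score)
elapsed = time.perf_counter() - start

print(reviews)
print(f"Scored {len(reviews)} rows in {elapsed:.4f} seconds")
```

`map` still calls the Python function once per row, so it should not be expected to run faster than the loop; it is mainly shorter and keeps the result aligned with the frame's index.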


In [14]:
data['sentiment_score'] = np.array(eng_snt_score)
data.head()

Unnamed: 0,text,label,sentiment_score
0,Great CD: My lovely Pat has one of the GREAT v...,1,0.9642
1,One of the best game music soundtracks - for a...,1,0.9082
2,Batteries died within a year ...: I bought thi...,0,0.7297
3,"works fine, but Maha Energy is better: Check o...",1,0.8945
4,Great for the non-audiophile: Reviewed quite a...,1,0.8388


In [15]:
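thresh = 0.5
# Predict positive (True) when the compound score exceeds the threshold
data["pred"] = data.sentiment_score > thresh
# Fraction of rows where the prediction matches the 0/1 label
acc = (data.pred == data.label).mean()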
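The threshold of 0.5 is a choice, not something VADER prescribes; the library's own documentation suggests cutoffs around ±0.05 for three-way positive/neutral/negative labeling. One way to see how much the cutoff matters is to sweep it and record the accuracy at each value. The sketch below does this on hypothetical toy scores and labels (not the notebook's data), and the helper `accuracy_at` is a name introduced here for illustration.

```python
import numpy as np

# Hypothetical toy data: compound scores and 0/1 labels (not from the notebook's dataset)
scores = np.array([0.96, 0.91, 0.73, 0.89, -0.40, 0.10, -0.85, 0.30])
labels = np.array([1, 1, 0, 1, 0, 0, 0, 1])


def accuracy_at(threshold, scores, labels):
    """Accuracy when predicting positive for scores above the threshold."""
    preds = (scores > threshold).astype(int)
    return float((preds == labels).mean())


# Candidate thresholds across the whole compound range
thresholds = np.linspace(-1.0, 1.0, 41)
accuracies = [accuracy_at(t, scores, labels) for t in thresholds]
best = thresholds[int(np.argmax(accuracies))]

print(f"accuracy at 0.5: {accuracy_at(0.5, scores, labels):.3f}")
print(f"best threshold on this toy data: {best:.2f}")
```

A threshold tuned on the same rows used to report accuracy will look better than it really is, so in practice the sweep belongs on a held-out validation split.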

In [16]:
data.to_csv("VADER.csv",index=False)


In [17]:
from sklearn.metrics import roc_curve
import seaborn as sns

fpr, tpr, thresholds = roc_curve(data.label, data.sentiment_score)
# Recent seaborn versions accept x and y only as keyword arguments;
# estimator=None draws every point instead of averaging tpr at repeated fpr values
ax = sns.lineplot(x=fpr, y=tpr, estimator=None)
ax.set(title="VADER ROC", xlabel="fpr", ylabel="tpr")
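A ROC curve is often summarized by the area under it (AUC), which equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly with NumPy on hypothetical toy data; `pairwise_auc` is a name introduced here for illustration, not part of any library.

```python
import numpy as np


def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Hypothetical toy data, not taken from the notebook
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))
```

The pairwise comparison builds a positives-by-negatives matrix, so it is only practical for small inputs. For the full 32,000 rows, `sklearn.metrics.roc_auc_score(data.label, data.sentiment_score)` would be the more sensible choice.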