<a href="https://colab.research.google.com/github/j4ck132/SMPNet/blob/main/Social_Media_Sentiment_Analysis_In_Python_With%C2%A0VADER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Social Media Sentiment Analysis In Python With VADER

This notebook contains all the code from my Medium article, available [here](https://zoumanakeita.medium.com/).

# **Prerequisites for VADER**


In [None]:
# Install and import nltk
!pip install nltk
import nltk

# Download the lexicon
nltk.download("vader_lexicon")

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [None]:
# Import the lexicon
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create an instance of SentimentIntensityAnalyzer
sent_analyzer = SentimentIntensityAnalyzer()

# Example
sentence = "VADER is pretty good at identifying the underlying sentiment of a text!"
print(sent_analyzer.polarity_scores(sentence))

{'neg': 0.0, 'neu': 0.585, 'pos': 0.415, 'compound': 0.75}


The `compound` score summarizes the overall sentiment and is interpreted as follows:
- positive sentiment: compound ≥ 0.05.
- negative sentiment: compound ≤ -0.05.
- neutral sentiment: compound strictly between -0.05 and 0.05.

The previous result shows that the sentence does not carry any negative information (neg=0). It has some neutral and positive tones (neu=0.585 and pos=0.415), and the overall sentiment is positive because compound = 0.75 ≥ 0.05.
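
As a quick sanity check on how to read this dictionary, the sketch below verifies that `neg`, `neu`, and `pos` behave like proportions summing to roughly 1, while `compound` stays within [-1, 1]. The helper name `check_scores` and the tolerance are my own choices, not part of NLTK, and the scores are copied by hand from the output above rather than recomputed.

```python
import math

def check_scores(scores, tol=2e-3):
    """Return True if neg/neu/pos sum to ~1 and compound lies in [-1, 1]."""
    total = scores["neg"] + scores["neu"] + scores["pos"]
    in_range = -1.0 <= scores["compound"] <= 1.0
    return math.isclose(total, 1.0, abs_tol=tol) and in_range

# Output copied from the cell above (not recomputed here)
example = {'neg': 0.0, 'neu': 0.585, 'pos': 0.415, 'compound': 0.75}
print(check_scores(example))
```

The tolerance allows for the three proportions being rounded to three decimals.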

In [None]:
# What about this sentence with repeated exclamations and capitalization?
sentence_ = "VADER is a REALLY AMAZING library!!!!"
print(sent_analyzer.polarity_scores(sentence_))

{'neg': 0.0, 'neu': 0.373, 'pos': 0.627, 'compound': 0.8284}


As you can see from this example, the compound rose from 0.75 to 0.83, so by this score the sentence is more positive than the previous one.
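
To get a feel for why stronger emphasis pushes the compound up without it ever exceeding 1, here is a minimal sketch of the normalization step that, to my knowledge, VADER applies to the summed word valences: x / sqrt(x² + α) with α = 15. The raw sums below are made-up inputs for illustration, not values extracted from the two sentences.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash a raw valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Hypothetical raw sums: larger totals move closer to 1 but never reach it
for raw_sum in (1.0, 3.0, 6.0, 12.0):
    print(raw_sum, round(normalize(raw_sum), 4))
```

Capitalization and repeated exclamation marks increase the raw sum, which is why the normalized compound moves closer to 1.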

In [None]:
# A last example with negative sentiment
negative_sent = "I do HATE those fake news on internet!!😡"
print(sent_analyzer.polarity_scores(negative_sent))

{'neg': 0.619, 'neu': 0.381, 'pos': 0.0, 'compound': -0.8449}


From this last example, we can see that the sentence does not carry any positive information (pos=0). It has some neutral and negative tones (neu=0.381 and neg=0.619), and the overall sentiment is negative because compound = -0.8449 ≤ -0.05.
My guess here is that removing the exclamation marks will make the sentiment less negative. Why don't you give it a try :)
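
One way to run that experiment is sketched below. The helpers `strip_exclamations` and `compare_compound` are hypothetical (my naming). The scoring function is passed in as an argument, so in this notebook you would pass `sent_analyzer.polarity_scores`. The block itself does not call NLTK and makes no claim about the actual result.

```python
def strip_exclamations(text):
    """Return the text with every exclamation mark removed."""
    return text.replace("!", "")

def compare_compound(score_fn, text):
    """Score a text with and without its exclamation marks.

    score_fn is any callable returning a dict with a 'compound' key,
    e.g. sent_analyzer.polarity_scores from the cells above.
    """
    return {
        "with_exclamations": score_fn(text)["compound"],
        "without_exclamations": score_fn(strip_exclamations(text))["compound"],
    }

# In the notebook (assumes sent_analyzer and negative_sent already exist):
# print(compare_compound(sent_analyzer.polarity_scores, negative_sent))
```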

# *VADER on a Large Dataset*
We are going to use the license-free tweet dataset available on the Sentiment140 website to see how well VADER performs.
To compare its output with the dataset labels, we will use a helper function that returns the polarity (positive, negative, or neutral) directly instead of the dictionary output.

### Load and preprocess the dataset

In [None]:
import pandas as pd

# Read the data set
data_url = "https://raw.githubusercontent.com/keitazoumana/VADER_sentiment-Analysis/main/data/testdata.manual.2009.06.14.csv"
sentiment_data = pd.read_csv(data_url)

sentiment_data.head(3)

| | 4 | 3 | Mon May 11 03:17:40 UTC 2009 | kindle2 | tpryan | @stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right. |
|---|---|---|---|---|---|---|
| 0 | 4 | 4 | Mon May 11 03:18:03 UTC 2009 | kindle2 | vcu451 | Reading my kindle2... Love it... Lee childs i... |
| 1 | 4 | 5 | Mon May 11 03:18:54 UTC 2009 | kindle2 | chadfu | Ok, first assesment of the #kindle2 ...it fuck... |
| 2 | 4 | 6 | Mon May 11 03:19:04 UTC 2009 | kindle2 | SIX15 | @kenburbary You'll love your Kindle2. I've had... |


The file has no header row, so pandas used the first record as column names (that first tweet is therefore not part of the rows). We are only interested in two columns:
- **'4'**, the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive).
- **'@stellargi..right'**, the actual tweet text.  
Let's format the dataset to make it clearer.  


In [None]:
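def format_data(data):

  # The first column holds the polarity, the last one the tweet text
  last_col = str(data.columns[-1])
  first_col = str(data.columns[0])

  # rename() returns a new DataFrame, so the caller's data is left untouched
  data = data.rename(columns = {last_col: 'tweet_text', first_col: 'polarity'})

  # Change 0, 2, 4 to negative, neutral and positive
  labels = {0: 'negative', 2: 'neutral', 4: 'positive'}
  data['polarity'] = data['polarity'].map(labels)

  # Return an independent copy of the two columns, so that adding
  # columns later does not trigger a SettingWithCopyWarning
  return data[['tweet_text', 'polarity']].copy()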

In [None]:
data = format_data(sentiment_data)
data.head(3)

| | tweet_text | polarity |
|---|---|---|
| 0 | Reading my kindle2... Love it... Lee childs i... | positive |
| 1 | Ok, first assesment of the #kindle2 ...it fuck... | positive |
| 2 | @kenburbary You'll love your Kindle2. I've had... | positive |


### Implement VADER Sentiment Analysis

In [None]:
def format_output(output_dict):
  # Map VADER's compound score to a polarity label
  polarity = "neutral"

  if output_dict['compound'] >= 0.05:
    polarity = "positive"

  elif output_dict['compound'] <= -0.05:
    polarity = "negative"

  return polarity

def predict_sentiment(text):
  # Score the text with VADER and keep only the polarity label
  output_dict = sent_analyzer.polarity_scores(text)
  return format_output(output_dict)

In [None]:
data["vader_prediction"] = data["tweet_text"].apply(predict_sentiment)



In [None]:
data.sample(5)

| | tweet_text | polarity | vader_prediction |
|---|---|---|---|
| 308 | New nike muppet commercials are pretty cute. W... | positive | positive |
| 180 | Jake's going to safeway! | neutral | neutral |
| 65 | is scrapbooking with Nic =D | positive | positive |
| 407 | @kirstiealley I hate going to the dentist.. !!! | negative | negative |
| 275 | SOOO DISSAPOiNTED THEY SENT DANNY GOKEY HOME..... | positive | positive |


### VADER Performance on the Dataset
From the original polarity column and VADER's predictions, we can compute the overall accuracy and a classification report (precision, recall, and f1-score for each polarity).

In [None]:
from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(data['polarity'], data['vader_prediction'])

print("Accuracy: {}\n".format(accuracy))

# Show the classification report
print(classification_report(data['polarity'], data['vader_prediction']))

Accuracy: 0.716297786720322

              precision    recall  f1-score   support

    negative       0.84      0.64      0.72       177
     neutral       0.66      0.70      0.68       139
    positive       0.68      0.81      0.74       181

    accuracy                           0.72       497
   macro avg       0.73      0.71      0.71       497
weighted avg       0.73      0.72      0.72       497
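
The classification report does not show which classes get mistaken for one another; a confusion matrix does. Below is a minimal, dependency-free sketch using `collections.Counter`. The function name `confusion_counts` and the toy labels are mine, for illustration only. In the notebook you would pass `data['polarity']` and `data['vader_prediction']` instead.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]

def confusion_counts(y_true, y_pred, labels=LABELS):
    """Return a matrix whose rows are true labels and columns are predictions."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

# Toy labels (made up), not the Sentiment140 results
y_true = ["negative", "negative", "neutral", "positive", "positive"]
y_pred = ["negative", "positive", "neutral", "positive", "neutral"]

for label, row in zip(LABELS, confusion_counts(y_true, y_pred)):
    print(f"{label:>8}", row)
```

Each row adds up to the support of that true label, and the diagonal holds the correct predictions.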



VADER seems to be doing a good job: with an accuracy of about 0.72, it is much better than a uniform random guess over the three classes (accuracy ≈ 0.33). The same observation can be made from the f1-scores of each polarity.
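
To put the 0.72 accuracy in context, the naive baselines can be worked out from the support counts in the report above (177 negative, 139 neutral, 181 positive). This is only arithmetic on those printed numbers; the variable names are mine.

```python
# Support counts copied from the classification report above
support = {"negative": 177, "neutral": 139, "positive": 181}
total = sum(support.values())

# Expected accuracy when guessing uniformly at random among the three labels
uniform_baseline = 1 / len(support)

# Accuracy when always predicting the most frequent label
majority_baseline = max(support.values()) / total

print(f"total tweets:   {total}")
print(f"uniform guess:  {uniform_baseline:.3f}")
print(f"majority class: {majority_baseline:.3f}")
```

Both baselines sit well below VADER's accuracy on this dataset.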
Before diving into building machine learning models, it might be better to take VADER as your baseline model for such a task.