<a href="https://colab.research.google.com/github/mithunkumarsr/NLPNov21/blob/main/SentAnalysis_VADER_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Reference: https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-rule-based-vader-and-nltk-72067970fb71


In [None]:
import tensorflow as tf

(train_data_raw, train_labels), (test_data_raw, test_labels) = tf.keras.datasets.imdb.load_data(index_from=3)
words2idx = tf.keras.datasets.imdb.get_word_index()
idx2words = {idx:word for word, idx in words2idx.items()}

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [None]:
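# Decode the first review as an example.
# We subtract 3 because load_data(index_from=3) shifts every word index by 3;
# the first element is the start-of-sequence marker, so we skip it with [1:].
train_ex = [idx2words[x-3] for x in train_data_raw[0][1:]]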
train_ex = ' '.join(train_ex)
print(train_ex)

this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be p
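The decoding step can be illustrated without downloading the dataset. This is a minimal sketch, assuming a made-up four-word vocabulary (`toy_word_index`) and a hypothetical helper `decode_review`; neither is part of the notebook. It uses `dict.get` so that an index with no vocabulary entry becomes `<unk>` instead of raising `KeyError`.

```python
# Hypothetical miniature vocabulary; the real one comes from get_word_index().
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
toy_idx2word = {idx: word for word, idx in toy_word_index.items()}

INDEX_FROM = 3  # same offset that the notebook passes to load_data(index_from=3)


def decode_review(encoded, idx2word, index_from=INDEX_FROM):
    """Turn a list of shifted word ids back into text.

    The first id is the start-of-sequence marker and is skipped.
    Ids that have no vocabulary entry after the shift become '<unk>'.
    """
    words = [idx2word.get(token_id - index_from, "<unk>") for token_id in encoded[1:]]
    return " ".join(words)


# Start marker (1), then "the film was brilliant" with every id shifted by 3.
encoded_review = [1, 4, 5, 6, 7]
print(decode_review(encoded_review, toy_idx2word))
```

With this fallback, a review containing an unexpected id would still be decoded rather than dropped; whether that is preferable to skipping the review is a judgment call.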

In [None]:
imdb_reviews = []
for review, label in zip(train_data_raw, train_labels):
  try:
    tokens = [idx2words[x-3] for x in review[1:]]
    text = ' '.join(tokens)
    imdb_reviews.append([text, label])
  except KeyError: # One review contains an index with no entry in idx2words after the shift, so we skip it
    print('Small index number')

Small index number


In [None]:
import pandas as pd

imdb_df = pd.DataFrame(imdb_reviews,columns=['Text', 'Label'])
print(imdb_df.info())
print(imdb_df.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24999 entries, 0 to 24998
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    24999 non-null  object
 1   Label   24999 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.7+ KB
None
                                                Text  Label
0  this film was just brilliant casting location ...      1
1  big hair big boobs bad music and a giant safet...      0
2  this has to be one of the worst films of the 1...      0
3  the scots excel at storytelling the traditiona...      1
4  worst mistake of my life br br i picked this m...      0
5  begins better than it ends funny that the russ...      0
6  lavish production values and solid performance...      1
7  the hamiltons tells the story of the four hami...      0
8  just got out and cannot believe what a brillia...      1
9  this movie has many problem associated with it...      0


In [None]:
# Loading VADER Sentiment Intensity Analyzer
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...




In [None]:

sentences = ['Hello, world. I am terrible']
for sentence in sentences:
  print(sentence)
  ss = sia.polarity_scores(sentence)
  for k in sorted(ss):
    print('{0}: {1}, '.format(k, ss[k]), end='')

Hello, world. I am terrible
compound: -0.4767, neg: 0.508, neu: 0.492, pos: 0.0, 

In [None]:
# Shuffle the rows. VADER is rule-based, so this is not required; it is only good practice
imdb_slice = imdb_df.sample(frac=1.0).reset_index(drop=True)

In [None]:

# Create Prediction column based on Polarity Score
imdb_slice['Prediction'] = imdb_slice['Text'].apply(lambda x: 1 if sia.polarity_scores(x)['compound'] >= 0 else -1)
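The rule above maps a compound score of exactly 0 (typically a text in which VADER found no sentiment-bearing words) to the positive class. The sketch below isolates that decision in a hypothetical helper, `label_from_compound`, so the cut-off can be varied. The `neutral_band` parameter is an assumption for illustration: with the default of 0.0 it reproduces the notebook's rule, while a value such as 0.05 (a commonly cited convention for VADER) leaves a neutral zone that returns 0.

```python
def label_from_compound(compound, neutral_band=0.0):
    """Map a VADER compound score in [-1, 1] to 1, -1, or 0.

    neutral_band=0.0 reproduces the notebook's rule (compound >= 0 is positive).
    A positive neutral_band returns 0 for scores strictly inside the band.
    """
    if compound >= neutral_band:
        return 1
    if compound <= -neutral_band:
        return -1
    return 0


# Compound score from the 'Hello, world. I am terrible' example in this notebook.
print(label_from_compound(-0.4767))
# A weakly positive score is treated as neutral when a band is used.
print(label_from_compound(0.01, neutral_band=0.05))
```

Because the IMDB labels are strictly binary, a neutral class would need its own handling in the evaluation, which is why the notebook keeps a single cut-off.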

In [None]:
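# Recode the Label column: 1 for positive, -1 for negative (the raw labels are 1 and 0)
imdb_slice['Label'] = imdb_slice['Label'].apply(lambda x: -1 if x == 0 else 1)

# Flag rows where Label and Prediction match (1) or differ (0) for the accuracy calculation.
# Columns are compared by name; integer keys such as x[1] on a string-labelled row are deprecated in pandas.
imdb_slice['Accuracy'] = (imdb_slice['Label'] == imdb_slice['Prediction']).astype(int)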

In [None]:
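def conf_matrix(row):
  # Classify one row as TP, FN, FP, or TN from its Label and Prediction values
  label, prediction = row['Label'], row['Prediction']
  if label == 1 and prediction == 1:
    return 'TP'
  elif label == 1 and prediction == -1:
    return 'FN'
  elif label == -1 and prediction == 1:
    return 'FP'
  elif label == -1 and prediction == -1:
    return 'TN'
  else:
    return 0

imdb_slice['Conf_Matrix'] = imdb_slice.apply(conf_matrix, axis=1)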

In [None]:
imdb_slice.tail(10)

                                                    Text  Label  Prediction  Accuracy  Conf_Matrix
24989  one would think that a film about a young pers...     -1           1         0           FP
24990  kazan's early film noir won an oscar some of t...      1           1         1           TP
24991  i remember hitch hiking to spain at 25 getting...      1          -1         0           FN
24992  i really do not know what people have against ...      1           1         1           TP
24993  this time around blackadder is no longer royal...      1          -1         0           FN
24994  henri verneuil's film may be not so famous as ...      1          -1         0           FN
24995  brilliant actor as he is al pacino completely ...     -1          -1         1           TN
24996  turn your backs away or you're gonna get in bi...     -1          -1         1           TN
24997  this is an awesome amicus horror anthology wit...      1           1         1           TP
24998  i may not be the one to review this movie beca...     -1          -1         1           TN
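The labelling logic can be checked by hand on the ten rows shown above. The sketch below is a plain-Python version that does not need pandas; `confusion_label` is a hypothetical helper, and the two lists simply copy the Label and Prediction values from that output.

```python
from collections import Counter


def confusion_label(actual, predicted):
    """Return 'TP', 'FN', 'FP', or 'TN' for labels encoded as 1 / -1."""
    if actual == 1:
        return "TP" if predicted == 1 else "FN"
    return "FP" if predicted == 1 else "TN"


# Label and Prediction values copied from the ten rows printed by tail(10).
labels = [-1, 1, 1, 1, 1, 1, -1, -1, 1, -1]
predictions = [1, 1, -1, 1, -1, -1, -1, -1, 1, -1]

counts = Counter(confusion_label(a, p) for a, p in zip(labels, predictions))
print(dict(counts))
```

Counting with `Counter` also returns 0 for a category that never occurs, which avoids a `KeyError` when a small sample happens to contain no false positives, for example.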


In [None]:
conf_vals = imdb_slice.Conf_Matrix.value_counts().to_dict()
print(conf_vals)

accuracy = (conf_vals['TP'] + conf_vals['TN']) / (conf_vals['TP'] + conf_vals['TN'] + conf_vals['FP'] + conf_vals['FN'])
precision = conf_vals['TP'] / (conf_vals['TP'] + conf_vals['FP'])
recall = conf_vals['TP'] / (conf_vals['TP'] + conf_vals['FN'])
f1_score = 2*precision*recall / (precision + recall)
print('Accuracy: ', round(100 * accuracy, 2),'%',
      '\nPrecision: ', round(100 * precision, 2),'%',
      '\nRecall: ', round(100 * recall, 2),'%',
      '\nF1 Score: ', round(100 * f1_score, 2),'%')

{'TP': 10638, 'TN': 6741, 'FP': 5758, 'FN': 1862}
Accuracy:  69.52 % 
Precision:  64.88 % 
Recall:  85.1 % 
F1 Score:  73.63 %
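The four metrics can be reproduced directly from the printed counts. This is a minimal sketch, assuming a hypothetical helper `binary_metrics` that is not part of the notebook; it also guards against zero denominators, which the cell above does not.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the conf_vals dictionary printed above.
metrics = binary_metrics(tp=10638, tn=6741, fp=5758, fn=1862)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f} %")
```

Recall is well above precision here: with 5758 false positives against 1862 false negatives, most of the errors are negative reviews that were scored as positive.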
