In [None]:
import nltk
nltk.download('vader_lexicon')
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Data loading

We use the IMDB movie review dataset.

In [None]:
# Mount Google Drive in Google Colab to load the data file directly from it
from google.colab import drive
drive.mount('/content/drive')
test_df = pd.read_csv("/content/drive/MyDrive/EPITA_NLP/Course2/IMDB Dataset.csv")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# These IMDB data are movie reviews with sentiment labels (positive/negative);
# the dataset has 50,000 rows and 2 columns
test_df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

## Sentiment Analysis with Vader library

We create a SentimentIntensityAnalyzer object.

In [None]:
sid_obj = SentimentIntensityAnalyzer()

VADER is a lexicon- and rule-based tool specifically tuned to analyse social media text. To illustrate this, we can test it on classic emoticons.

In [None]:
happy_smiley = ":)"
print(f"We can try on the happy smiley: {happy_smiley}")
print(sid_obj.polarity_scores(happy_smiley))
sad_smiley = ":("
print(f"We can try on the sad smiley: {sad_smiley}")
print(sid_obj.polarity_scores(sad_smiley))

We can try on the happy smiley: :)
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4588}
We can try on the sad smiley: :(
{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.4404}


The compound score is a normalized value between -1 and 1. Here, a compound score of 0 or above is treated as positive; otherwise the sentiment is treated as negative.
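
As a minimal sketch of this decision rule, the helper below maps a compound score to a label. The name `label_from_compound` and its `threshold` parameter are illustrative assumptions, not part of NLTK; the first two scores are the ones printed above for the emoticons. A score of exactly 0 is counted as positive here, matching the `>=0` comparison used in the evaluation loop.

```python
def label_from_compound(compound, threshold=0.0):
    """Map a VADER compound score to 'positive' or 'negative'.

    Hypothetical helper for illustration: scores at or above the
    threshold count as positive, all others as negative.
    """
    return "positive" if compound >= threshold else "negative"


# Compound scores reported above for ":)" and ":("
print(label_from_compound(0.4588))   # positive
print(label_from_compound(-0.4404))  # negative
# A score of exactly 0 falls on the positive side with this rule
print(label_from_compound(0.0))      # positive
```

Raising `threshold` would make the rule stricter about what counts as positive; the value 0 is simply the choice made in this notebook.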

We now run VADER directly on the IMDB dataset and compare the predicted sentiment with the ground-truth label provided in the original dataset.

In [None]:
list_vader_test = []
for row in test_df.itertuples():
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys
    compound = sid_obj.polarity_scores(row.review)['compound']
    # A compound score >= 0 is treated as positive, otherwise as negative
    predicted_sentiment = "positive" if compound >= 0 else "negative"
    # True when the prediction matches the ground-truth label
    list_vader_test.append(predicted_sentiment == row.sentiment)

In [None]:
list_vader_test[0:10]

[False, True, True, True, True, True, True, False, False, True]

We can now check the rate of correct predictions obtained with VADER used as is, without any training on this dataset.

In [None]:
rate_success = list_vader_test.count(True)/len(list_vader_test) * 100
print(f"Good prediction rate: {int(rate_success)}%")

Good prediction rate: 69%
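
Because `True` counts as 1 in Python, the same rate can also be obtained by summing the flags. The sketch below applies this to the first ten flags displayed earlier, so it gives the rate on that small sample only, not on the full dataset; `success_rate` is a hypothetical helper name introduced for illustration.

```python
def success_rate(flags):
    """Return the percentage of True values in a list of booleans."""
    if not flags:
        raise ValueError("flags must not be empty")
    # sum() counts the True values because True == 1 and False == 0
    return 100 * sum(flags) / len(flags)


# First ten flags shown earlier in the notebook
first_ten = [False, True, True, True, True, True, True, False, False, True]
print(f"Good prediction rate on this sample: {success_rate(first_ten):.1f}%")
```

Formatting with `:.1f` keeps one decimal place, whereas `int()` truncates toward zero, so a rate of 69.9 would be displayed as 69.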
