
## Tutorial 8: Sentiment Analysis with `nltk.sentiment.SentimentAnalyzer` and VADER tools

### ***Step 1: Exploring the `subjectivity` corpus***

The Subjectivity Dataset contains 5,000 subjective and 5,000 objective processed sentences. Learn more about the subjectivity corpus in the [NLTK corpus how-to](https://www.nltk.org/howto/corpus.html).

Import the `subjectivity` corpus and list its file IDs.

In [None]:
import nltk
nltk.download('subjectivity')
from nltk.corpus import subjectivity

subjectivity.fileids()

[nltk_data] Downloading package subjectivity to /root/nltk_data...
[nltk_data]   Package subjectivity is already up-to-date!


['plot.tok.gt9.5000', 'quote.tok.gt9.5000']

Get the tokenized sentences in the `plot.tok.gt9.5000` file.

In [None]:
subjectivity.sents('plot.tok.gt9.5000')

[['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.'], ['emerging', 'from', 'the', 'human', 'psyche', 'and', 'showing', 'characteristics', 'of', 'abstract', 'expressionism', ',', 'minimalism', 'and', 'russian', 'constructivism', ',', 'graffiti', 'removal', 'has', 'secured', 'its', 'place', 'in', 'the', 'history', 'of', 'modern', 'art', 'while', 'being', 'created', 'by', 'artists', 'who', 'are', 'unconscious', 'of', 'their', 'artistic', 'achievements', '.'], ...]

Get the tokenized sentences in the `quote.tok.gt9.5000` file.

In [None]:
subjectivity.sents('quote.tok.gt9.5000')

[['smart', 'and', 'alert', ',', 'thirteen', 'conversations', 'about', 'one', 'thing', 'is', 'a', 'small', 'gem', '.'], ['color', ',', 'musical', 'bounce', 'and', 'warm', 'seas', 'lapping', 'on', 'island', 'shores', '.', 'and', 'just', 'enough', 'science', 'to', 'send', 'you', 'home', 'thinking', '.'], ...]

Retrieve the categories in subjectivity corpus (objective and subjective sentences).

In [None]:
subjectivity.categories() # The mapping between documents and categories does not depend on the file structure.

['obj', 'subj']

Get the tokenized sentences in the corpus that are categorized as "objective".

In [None]:
subjectivity.sents(categories='obj')

[['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.'], ['emerging', 'from', 'the', 'human', 'psyche', 'and', 'showing', 'characteristics', 'of', 'abstract', 'expressionism', ',', 'minimalism', 'and', 'russian', 'constructivism', ',', 'graffiti', 'removal', 'has', 'secured', 'its', 'place', 'in', 'the', 'history', 'of', 'modern', 'art', 'while', 'being', 'created', 'by', 'artists', 'who', 'are', 'unconscious', 'of', 'their', 'artistic', 'achievements', '.'], ...]

Get the tokenized sentences in the corpus that are categorized as "subjective".

In [None]:
subjectivity.sents(categories='subj')

[['smart', 'and', 'alert', ',', 'thirteen', 'conversations', 'about', 'one', 'thing', 'is', 'a', 'small', 'gem', '.'], ['color', ',', 'musical', 'bounce', 'and', 'warm', 'seas', 'lapping', 'on', 'island', 'shores', '.', 'and', 'just', 'enough', 'science', 'to', 'send', 'you', 'home', 'thinking', '.'], ...]

### ***Step 2: Building and testing a classifier with `SentimentAnalyzer`***

Import necessary classifiers and modules. 

In [None]:
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer  # SentimentAnalyzer is a tool to implement and facilitate sentiment analysis.
# mark_negation(): append the _NEG suffix to words that appear in the scope between a negation and a punctuation mark.
# extract_unigram_feats(): populate a dictionary of unigram features, reflecting the presence/absence in the document of each of the tokens in unigrams.
from nltk.sentiment.util import mark_negation, extract_unigram_feats


Set the number of instances to 100, then create two new lists for objective and subjective docs and fill each with the first `n_instances` (100) sentences of its category. Each document is represented by a tuple `(sentence, label)`. The sentence is tokenized, so it is represented by a list of strings.

Print length of each list to check they both contain 100 sentences.

In [None]:
n_instances = 100
obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:n_instances]]
subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:n_instances]]
len(obj_docs), len(subj_docs)

(100, 100)

Print a sentence in obj_docs list to check:

In [None]:
obj_docs[0]

(['the',
  'movie',
  'begins',
  'in',
  'the',
  'past',
  'where',
  'a',
  'young',
  'boy',
  'named',
  'sam',
  'attempts',
  'to',
  'save',
  'celebi',
  'from',
  'a',
  'hunter',
  '.'],
 'obj')

Divide the sentences into training and testing groups: the first 80 sentences of each list are for training, the last 20 for testing. Split the objective and subjective docs separately so both classes stay balanced, then combine them into two larger groups (all training and all testing).

In [None]:
train_obj_docs = obj_docs[:80]
test_obj_docs = obj_docs[80:100]
train_subj_docs = subj_docs[:80]
test_subj_docs = subj_docs[80:100]

training_docs = train_obj_docs + train_subj_docs
testing_docs = test_obj_docs + test_subj_docs
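The same per-class split can be written as a small helper so that the 80/20 ratio is not hard-coded. This is only a minimal sketch: `split_docs` and `train_fraction` are hypothetical names introduced here, and the toy documents stand in for the corpus sentences.

```python
def split_docs(docs, train_fraction=0.8):
    """Split a list of (tokens, label) documents into train and test parts.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    cutoff = round(len(docs) * train_fraction)
    return docs[:cutoff], docs[cutoff:]

# Toy stand-ins for obj_docs and subj_docs (100 documents each).
toy_obj_docs = [(['token', str(i)], 'obj') for i in range(100)]
toy_subj_docs = [(['token', str(i)], 'subj') for i in range(100)]

toy_train_obj, toy_test_obj = split_docs(toy_obj_docs)
toy_train_subj, toy_test_subj = split_docs(toy_subj_docs)

# Combine the per-class parts so both classes stay balanced in each set.
toy_training_docs = toy_train_obj + toy_train_subj
toy_testing_docs = toy_test_obj + toy_test_subj
print(len(toy_training_docs), len(toy_testing_docs))
```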

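Create a `SentimentAnalyzer()` instance. Use `mark_negation()` to append the `_NEG` suffix to words that appear between a detected negation and a punctuation mark, then use `all_words()` to collect every word from the marked training documents into a single list.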

In [None]:
sentim_analyzer = SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])
#all_words_neg
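To make the idea of negation marking concrete, here is a simplified pure-Python sketch. It is not NLTK's implementation: `mark_negation_sketch`, the negation word list, and the set of clause-ending punctuation are all assumptions chosen for illustration, and the real `mark_negation()` uses more thorough patterns.

```python
# Simplified negation cues and clause-ending punctuation (assumptions).
NEGATION_WORDS = {"not", "no", "never", "cannot"}
CLAUSE_END = {".", ":", ";", "!", "?"}

def mark_negation_sketch(tokens):
    """Return a copy of tokens with _NEG appended inside a negation scope."""
    marked = []
    in_scope = False
    for token in tokens:
        if token in CLAUSE_END:
            # Punctuation closes the negation scope and is left unmarked.
            in_scope = False
            marked.append(token)
        elif in_scope:
            marked.append(token + "_NEG")
        else:
            marked.append(token)
            lowered = token.lower()
            if lowered in NEGATION_WORDS or lowered.endswith("n't"):
                in_scope = True
    return marked

marked_example = mark_negation_sketch(
    "I didn't like this movie . It was bad .".split()
)
print(marked_example)
```

Words after "didn't" receive the suffix until the first full stop; the second sentence is left unchanged.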

Return the list of the most common unigram (single-word) features in `all_words_neg`, using `min_freq=4` as the frequency threshold.

In [None]:
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
len(unigram_feats)

83
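Frequency filtering of this kind can be sketched with `collections.Counter`. The function name `frequent_unigrams` and the toy word list are hypothetical. The strict `>` comparison is an assumption about how NLTK applies `min_freq` and is worth verifying against the installed version.

```python
from collections import Counter

def frequent_unigrams(words, min_freq=4):
    """Return words occurring more than min_freq times, most common first.

    Hypothetical helper; the strict comparison is an assumption.
    """
    counts = Counter(words)
    return [word for word, count in counts.most_common() if count > min_freq]

# Toy word list: 'the' x6, 'movie' x5, 'gem' x4, 'rare' x1.
toy_words = ['the'] * 6 + ['movie'] * 5 + ['gem'] * 4 + ['rare']
toy_unigram_feats = frequent_unigrams(toy_words, min_freq=4)
print(toy_unigram_feats)
```

Under this assumption, a word that occurs exactly four times ('gem' here) is dropped.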

Register `extract_unigram_feats` (with `unigram_feats` as its vocabulary) as a feature extractor, so the sentiment analyzer will extract these features from the data.

In [None]:
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)
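A unigram presence feature extractor can be sketched in a few lines. This is an illustrative re-implementation, not NLTK's code: `unigram_presence_feats` is a hypothetical name, and the `contains(word)` key format is assumed to mirror what `extract_unigram_feats` produces.

```python
def unigram_presence_feats(document, unigrams):
    """Map each unigram to True/False depending on whether the document contains it."""
    doc_words = set(document)
    return {'contains({0})'.format(word): word in doc_words for word in unigrams}

toy_document = 'ice is melting due to global warming'.split()
toy_feats = unigram_presence_feats(toy_document, ['ice', 'police', 'riot'])
print(sorted(toy_feats.items()))
```

Each document becomes a fixed-size dictionary of booleans, which is the kind of input a Naive Bayes classifier can consume.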

Apply the feature extractor to the training and test documents, so that each sentence is represented by the presence or absence of each item in `unigram_feats`.

In [None]:
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)
#training_set[0]

We can now train our classifier on the training set, and subsequently output the evaluation results. 

In [None]:
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)


Training classifier


Interpretation of results, from the [Python NLTK Cookbook](https://streamhacker.com/2010/05/17/text-classification-sentiment-analysis-precision-recall/):

*  **Accuracy** measures the proportion of elements in a data set that are correctly identified.
*  **F-measure** is the weighted harmonic mean of precision and recall.
*  **Precision** measures the exactness of a classifier. Higher precision means fewer false positives, while lower precision means more false positives.
*  **Recall** measures the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall often reduces precision.
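The metrics above can be computed by hand from gold and predicted labels. The following is a minimal sketch with made-up labels; `evaluate_labels` is a hypothetical helper, and the F-measure shown is the balanced case (equal weight on precision and recall), which is an assumption about the weighting.

```python
def evaluate_labels(gold, predicted, target):
    """Compute accuracy plus precision, recall, and balanced F-measure for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)  # true positives
    fp = sum(1 for g, p in pairs if g != target and p == target)  # false positives
    fn = sum(1 for g, p in pairs if g == target and p != target)  # false negatives
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f_measure': f_measure}

# Made-up labels for illustration only (not results from the corpus).
toy_gold = ['obj'] * 5 + ['subj'] * 5
toy_predicted = ['obj', 'obj', 'obj', 'obj', 'subj',
                 'subj', 'subj', 'subj', 'obj', 'obj']
toy_metrics = evaluate_labels(toy_gold, toy_predicted, target='obj')
for name, value in sorted(toy_metrics.items()):
    print('{0}: {1:.3f}'.format(name, value))
```

In this toy case, two false positives pull precision for `obj` below its recall, which illustrates how the two measures can diverge.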


In [None]:
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

Evaluating NaiveBayesClassifier results...
Accuracy: 0.8
F-measure [obj]: 0.8
F-measure [subj]: 0.8
Precision [obj]: 0.8
Precision [subj]: 0.8
Recall [obj]: 0.8
Recall [subj]: 0.8


### ***Step 3: Building and testing a classifier with `nltk.sentiment.vader.SentimentIntensityAnalyzer`***

Import `SentimentIntensityAnalyzer` from [VADER](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf). This will assign an "intensity score" to each sentence based on its identified sentiment.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Add list of sentences for analysis.

In [None]:
sentences = [
    "You are a jerk, and I will step on you.",
    "THIS SUX!!!",
    "This kinda sux...",
    "You're good, man",
    "HAHAHA YOU ARE THE BEST!!!!! VERY FUNNY!!!"
            ]

Use `SentimentIntensityAnalyzer` (instantiated as `sid`) to get the "intensity" scores of each sentence in the list.

In [None]:
import nltk
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

for sentence in sentences:
    print('\n' + sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...

You are a jerk, and I will step on you.
compound: -0.34, neg: 0.255, neu: 0.745, pos: 0.0, 
THIS SUX!!!
compound: -0.5229, neg: 0.771, neu: 0.229, pos: 0.0, 
This kinda sux...
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
You're good, man
compound: 0.4404, neg: 0.0, neu: 0.408, pos: 0.592, 
HAHAHA YOU ARE THE BEST!!!!! VERY FUNNY!!!
compound: 0.8386, neg: 0.0, neu: 0.386, pos: 0.614, 
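The `compound` value is commonly turned into a single label with cut-offs at ±0.05, the thresholds suggested by VADER's authors. The sketch below applies them to the compound scores printed above; `label_from_compound` is a hypothetical helper, and the cut-off can be adjusted for a given data set.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Compound scores copied from the output above.
compound_scores = {
    "You are a jerk, and I will step on you.": -0.34,
    "THIS SUX!!!": -0.5229,
    "This kinda sux...": 0.0,
    "You're good, man": 0.4404,
    "HAHAHA YOU ARE THE BEST!!!!! VERY FUNNY!!!": 0.8386,
}

compound_labels = {s: label_from_compound(c) for s, c in compound_scores.items()}
for sentence, label in compound_labels.items():
    print('{0} -> {1}'.format(sentence, label))
```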