# Sentiment Analysis with `nltk.sentiment.SentimentAnalyzer` and VADER

## 1. Exploring the `subjectivity` corpus

The `subjectivity` corpus contains 5,000 objective and 5,000 subjective sentences, labeled `obj` and `subj` respectively.

In [1]:
from nltk.corpus import subjectivity

subjectivity.categories()

['obj', 'subj']

In [2]:
subjectivity.sents(categories='obj')[0]

['the',
 'movie',
 'begins',
 'in',
 'the',
 'past',
 'where',
 'a',
 'young',
 'boy',
 'named',
 'sam',
 'attempts',
 'to',
 'save',
 'celebi',
 'from',
 'a',
 'hunter',
 '.']

In [3]:
subjectivity.sents(categories='subj')[0]

['smart',
 'and',
 'alert',
 ',',
 'thirteen',
 'conversations',
 'about',
 'one',
 'thing',
 'is',
 'a',
 'small',
 'gem',
 '.']

## 2. Building and testing a classifier with `SentimentAnalyzer`

We will build a classifier that labels a sentence as either objective or subjective.

We start by importing `NaiveBayesClassifier`, `SentimentAnalyzer`, and two helper functions from `nltk.sentiment.util`. `mark_negation()` appends a `_NEG` suffix to every word that falls in the scope between a negation and the next punctuation mark. `extract_unigram_feats()` builds a dictionary of word-unigram features. For this quick demo, we then create a small dataset of 100 objective and 100 subjective sentences from the corpus.

In [4]:
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import (mark_negation, extract_unigram_feats)

N_INSTANCES = 100
obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:N_INSTANCES]]
subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:N_INSTANCES]]
len(obj_docs), len(subj_docs)

(100, 100)
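
To see what the two helper functions imported above do in isolation, here is a minimal sketch on a hand-written sentence. The sentence and the three candidate unigrams are made up for illustration and do not come from the corpus; the snippet only assumes that NLTK is installed.

```python
from nltk.sentiment.util import mark_negation, extract_unigram_feats

# Illustrative, hand-tokenized input (not taken from the subjectivity corpus).
tokens = "I didn't like this movie . It was bad .".split()

# Words between the negation ("didn't") and the next punctuation mark get "_NEG".
marked = mark_negation(tokens)
print(marked)

# Each candidate unigram becomes a boolean "contains(...)" feature.
feats = extract_unigram_feats(marked, ['like', 'like_NEG', 'bad'])
for name, value in sorted(feats.items()):
    print(name, value)
```

Note that after negation marking the plain token `like` is no longer present in the sentence, so only `contains(like_NEG)` and `contains(bad)` come out as `True`. This is why the negation-marked words are used when collecting the vocabulary further down.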

In [5]:
obj_docs[0]

(['the',
  'movie',
  'begins',
  'in',
  'the',
  'past',
  'where',
  'a',
  'young',
  'boy',
  'named',
  'sam',
  'attempts',
  'to',
  'save',
  'celebi',
  'from',
  'a',
  'hunter',
  '.'],
 'obj')

In [6]:
TRAIN_TEST_SPLIT = .8
cutoff = int(len(obj_docs) * TRAIN_TEST_SPLIT)

train_obj_docs = obj_docs[:cutoff]
test_obj_docs = obj_docs[cutoff:]
train_subj_docs = subj_docs[:cutoff]
test_subj_docs = subj_docs[cutoff:]

training_docs = train_obj_docs + train_subj_docs
testing_docs = test_obj_docs + test_subj_docs

sentiment_analyzer = SentimentAnalyzer()
all_words_with_negation = sentiment_analyzer.all_words([mark_negation(doc) for doc in training_docs])
all_words_with_negation[:10]

['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy']

In [7]:
unigram_feats = sentiment_analyzer.unigram_word_feats(all_words_with_negation, min_freq=4)
len(unigram_feats)

83

In [8]:
sentiment_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

training_set = sentiment_analyzer.apply_features(training_docs)
test_set = sentiment_analyzer.apply_features(testing_docs)

print('SOME EXAMPLE UNIGRAM FEATURES FOR THE FIRST OBJECTIVE SENTENCE')
print('contains(.):', training_set[0][0]['contains(.)'])
print('contains(it):', training_set[0][0]['contains(it)'])
print('contains(the):', training_set[0][0]['contains(the)'])
print('contains(love):', training_set[0][0]['contains(love)'])
print('contains(but_NEG):', training_set[0][0]['contains(but_NEG)'])

SOME EXAMPLE UNIGRAM FEATURES FOR THE FIRST OBJECTIVE SENTENCE
contains(.): True
contains(it): False
contains(the): True
contains(love): False
contains(but_NEG): False


In [9]:
trainer = NaiveBayesClassifier.train
classifier = sentiment_analyzer.train(trainer, training_set)

Training classifier


In [10]:
for key, value in sorted(sentiment_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

Evaluating NaiveBayesClassifier results...
Accuracy: 0.8
F-measure [obj]: 0.8
F-measure [subj]: 0.8
Precision [obj]: 0.8
Precision [subj]: 0.8
Recall [obj]: 0.8
Recall [subj]: 0.8


## 3. Sentiment analysis with `nltk.sentiment.vader.SentimentIntensityAnalyzer`

**VADER** is a parsimonious rule-based model for the sentiment analysis of social media text. It was first developed as a [separate tool](https://github.com/cjhutto/vaderSentiment) and later integrated into NLTK.

VADER is based on a **sentiment lexicon** with sentiment ratings from 10 independent human raters for over 7,500 tokens on a scale from $-4$ (extremely negative) to $4$ (extremely positive). For example, `okay` has a positive valence of $0.9$, `good` is $1.9$, and `great` is $3.1$, whereas `sucks`/`sux` is $-1.5$, `:(` is $-2.2$, and `horrible` is $-2.5$.

On top of the sentiment lexicon, VADER employs several **rule-based enhancements** like word-order sensitivity, degree modifiers, word-shape amplifiers, punctuation amplifiers, negation polarity switches, and contrastive conjunction sensitivity.
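
To make the idea of a lexicon plus rules concrete, here is a toy sketch of two of these enhancements, degree modifiers and negation polarity switches. It is not VADER's actual implementation: `TOY_LEXICON`, `toy_valence`, the word lists, and the three-token negation window are hypothetical simplifications, and the constants `0.293` and `-0.74` are assumed to mirror those in the reference implementation. The valence ratings are the ones quoted above.

```python
# Toy lexicon built from the valence ratings quoted in the text.
TOY_LEXICON = {"okay": 0.9, "good": 1.9, "great": 3.1,
               "sucks": -1.5, "horrible": -2.5}

BOOSTERS = {"very": 0.293, "extremely": 0.293}  # assumed booster increment
NEGATIONS = {"not", "never", "isn't"}           # tiny illustrative list
NEGATION_SCALAR = -0.74                         # assumed damped polarity flip


def toy_valence(tokens):
    """Sum rule-adjusted valences over a list of tokens (simplified sketch)."""
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token.lower())
        if valence is None:
            continue  # word carries no sentiment in the toy lexicon

        # Degree modifier: a booster directly before the word intensifies it.
        previous = tokens[i - 1].lower() if i > 0 else ""
        if previous in BOOSTERS:
            boost = BOOSTERS[previous]
            valence += boost if valence > 0 else -boost

        # Negation: a negator among the three preceding tokens flips and damps.
        window = [t.lower() for t in tokens[max(0, i - 3):i]]
        if any(t in NEGATIONS for t in window):
            valence *= NEGATION_SCALAR

        total += valence
    return total


for text in ("the food is good", "the food is very good",
             "the food is not good", "the food is not very good"):
    print(f"{text!r}: {toy_valence(text.split()):+.3f}")
```

Multiplying by a negative scalar rather than simply changing the sign reflects the intuition that "not good" is negative, but less strongly so than "good" is positive.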

According to its authors, when tested on a benchmark set of tweets, VADER classified sentiment as positive, neutral, or negative more accurately than individual human raters did.

In [11]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentences = [
    "You are a piece of shit, and I will step on you.",
    "THIS SUCKS!",
    "This kinda sux...",
    "You're good, man!",
    "DAMN, YOU ARE THE BEST! VERY FUNNY!!!"
]

sid = SentimentIntensityAnalyzer()

for sentence in sentences:
    print(sentence)
    sentiment_strength = sid.polarity_scores(sentence)
    for k in sorted(sentiment_strength):
        print('{0}: {1}, '.format(k, sentiment_strength[k]), end='')
    print('\n')

You are a piece of shit, and I will step on you.
compound: -0.5574, neg: 0.286, neu: 0.714, pos: 0.0, 

THIS SUCKS!
compound: -0.4199, neg: 0.736, neu: 0.264, pos: 0.0, 

This kinda sux...
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 

You're good, man!
compound: 0.4926, neg: 0.0, neu: 0.385, pos: 0.615, 

DAMN, YOU ARE THE BEST! VERY FUNNY!!!
compound: 0.7821, neg: 0.177, neu: 0.262, pos: 0.561, 



Above, `compound` is the aggregated, final score. It is computed by summing the valence scores of the words in the text, after the rule-based adjustments have been applied, and then normalizing the sum to the range $[-1, 1]$. The individual `neg`, `neu`, and `pos` scores are the proportions of the text that fall into each category (so they should sum to $1$, up to rounding) and are useful for digging deeper.
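
The normalization step can be sketched as follows. This assumes the formula used in the reference VADER implementation, $x / \sqrt{x^2 + \alpha}$ with $\alpha = 15$, and further assumes that a single exclamation mark shifts the raw sum by $0.292$ in the direction of the sentiment. The names `normalize`, `ALPHA`, and `PUNCT_EMPHASIS` are chosen here for illustration.

```python
import math

ALPHA = 15              # assumed normalization constant
PUNCT_EMPHASIS = 0.292  # assumed boost contributed by one "!"


def normalize(score, alpha=ALPHA):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# "You're good, man!": `good` (1.9) plus one exclamation mark.
good_compound = round(normalize(1.9 + PUNCT_EMPHASIS), 4)

# "THIS SUCKS!": `sucks` (-1.5), pushed further negative by the "!".
sucks_compound = round(normalize(-1.5 - PUNCT_EMPHASIS), 4)

print(good_compound)   # 0.4926
print(sucks_compound)  # -0.4199
```

Under these assumptions, the two values agree with the `compound` scores printed above for "You're good, man!" and "THIS SUCKS!".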