## 2. Sentiment Analysis
In this exercise, we will classify the sentiment of text documents. Complete the code marked with a TODO tag.

References and Further Readings:
+ http://www.nltk.org/howto/sentiment.html
+ https://www.nltk.org/api/nltk.sentiment.html
+ http://datameetsmedia.com/vader-sentiment-analysis-explained/
+ https://github.com/cjhutto/vaderSentiment
+ https://marcobonzanini.com/2015/05/17/mining-twitter-data-with-python-part-6-sentiment-analysis-basics/
+ https://github.com/marrrcin/ml-twitter-sentiment-analysis


### 2.1. Classification approach

The classification approach uses previously labeled data to determine the sentiment of never-before-seen text. A model is trained on labeled text and then predicts (classifies) the sentiment of new input text. The nice thing is that, with a greater volume of training data, we generally get better classification results. However, unlike the lexical approach, it requires previously labeled data.
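To make the idea concrete, here is a minimal sketch of a word-count Naive Bayes classifier in plain Python. It is a simplified stand-in for illustration only, not NLTK's `NaiveBayesClassifier`; the toy training data and the helper names `train_naive_bayes` and `classify` are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect per-label word counts from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(tokens, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # ignore words never seen during training
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data (not taken from movie_reviews)
toy_train = [
    (["great", "fun", "movie"], "pos"),
    (["wonderful", "great", "cast"], "pos"),
    (["boring", "dull", "movie"], "neg"),
    (["awful", "boring", "plot"], "neg"),
]
model = train_naive_bayes(toy_train)
prediction_pos = classify(["great", "cast"], model)
prediction_neg = classify(["boring", "plot"], model)
print(prediction_pos, prediction_neg)
```

With more labeled examples the word counts become more reliable, which is why this approach tends to improve as the training set grows.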

In [58]:
import nltk
from nltk.classify import NaiveBayesClassifier
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
from nltk.corpus import subjectivity
from nltk.corpus import movie_reviews
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import mark_negation, extract_unigram_feats

# n_instances = 100
# subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:n_instances]]
# obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:n_instances]]
# len(subj_docs), len(obj_docs)

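n_instances = None  # total number of documents to use; None means the whole corpus
if n_instances is not None:
    n_instances = n_instances // 2  # half per class, kept as an integer slice bound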

pos_docs = [(list(movie_reviews.words(pos_id)), 'pos') for pos_id in movie_reviews.fileids('pos')[:n_instances]]
neg_docs = [(list(movie_reviews.words(neg_id)), 'neg') for neg_id in movie_reviews.fileids('neg')[:n_instances]]
len(pos_docs), len(neg_docs)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Pube\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Pube\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


(1000, 1000)

Each document is represented by a tuple (words, label). The review text is tokenized, so it is represented by a list of strings:

In [59]:
pos_docs[0]

(['films',
  'adapted',
  'from',
  'comic',
  'books',
  'have',
  'had',
  'plenty',
  'of',
  'success',
  ',',
  'whether',
  'they',
  "'",
  're',
  'about',
  'superheroes',
  '(',
  'batman',
  ',',
  'superman',
  ',',
  'spawn',
  ')',
  ',',
  'or',
  'geared',
  'toward',
  'kids',
  '(',
  'casper',
  ')',
  'or',
  'the',
  'arthouse',
  'crowd',
  '(',
  'ghost',
  'world',
  ')',
  ',',
  'but',
  'there',
  "'",
  's',
  'never',
  'really',
  'been',
  'a',
  'comic',
  'book',
  'like',
  'from',
  'hell',
  'before',
  '.',
  'for',
  'starters',
  ',',
  'it',
  'was',
  'created',
  'by',
  'alan',
  'moore',
  '(',
  'and',
  'eddie',
  'campbell',
  ')',
  ',',
  'who',
  'brought',
  'the',
  'medium',
  'to',
  'a',
  'whole',
  'new',
  'level',
  'in',
  'the',
  'mid',
  "'",
  '80s',
  'with',
  'a',
  '12',
  '-',
  'part',
  'series',
  'called',
  'the',
  'watchmen',
  '.',
  'to',
  'say',
  'moore',
  'and',
  'campbell',
  'thoroughly',
  'researche

We split the positive and negative documents separately to keep a balanced class distribution in both the training and test sets.

In [60]:
# TODO: split training and testing data as 80/20
def getSplitData(data):
    """Use the first 20% of the documents for testing and the remaining 80% for training."""
    total_len = len(data)
    split_index = total_len // 5  # integer index of the 20% boundary
    test_set = data[:split_index]
    train_set = data[split_index:]
    print("Original Data Length: "+str(total_len))
    print("Training Set Length: " +str(len(train_set)))
    print("Testing Set Length: " +str(len(test_set)))
    print("\n")
    return train_set, test_set
print("> Pos Data <")
train_pos_docs, test_pos_docs = getSplitData(pos_docs)
print("> Neg Data <")
train_neg_docs, test_neg_docs = getSplitData(neg_docs)


training_docs = train_pos_docs+train_neg_docs
testing_docs = test_pos_docs+test_neg_docs
sentim_analyzer = SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])

all_words_neg[:5]

> Pos Data <
Original Data Length: 1000
Training Set Length: 800
Testing Set Length: 200


> Neg Data <
Original Data Length: 1000
Training Set Length: 800
Testing Set Length: 200




['a', 'common', 'complaint', 'amongst', 'film']
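The per-class split can be sketched on toy data as follows. The helper `split_per_class` and the toy documents are hypothetical and only illustrate why splitting each class separately keeps the label proportions equal in both sets.

```python
def split_per_class(docs_by_label, test_fraction=0.2):
    """Split each class separately so both sets keep the same class balance."""
    train, test = [], []
    for label, docs in docs_by_label.items():
        cut = int(len(docs) * test_fraction)  # number of test documents for this class
        test.extend(docs[:cut])
        train.extend(docs[cut:])
    return train, test

# Hypothetical toy corpus: 10 documents per class
toy = {
    "pos": [(["good"], "pos")] * 10,
    "neg": [(["bad"], "neg")] * 10,
}
train, test = split_per_class(toy)
test_label_counts = {label: sum(1 for _, l in test if l == label) for label in toy}
print(len(train), len(test), test_label_counts)
```

Splitting the concatenated list in one pass instead would put only positive documents in the first 20%, which is the situation the per-class split avoids.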

We use simple unigram word features, handling negation:

In [61]:
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
print(len(unigram_feats))
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

14730
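As a rough illustration of what "handling negation" means, the sketch below appends `_NEG` to tokens that follow a negation word until the clause ends. It is a simplified approximation of the idea behind `mark_negation`, not NLTK's implementation; the small `NEGATION_WORDS` set and the function name are assumptions for this example.

```python
NEGATION_WORDS = {"not", "no", "never", "n't"}  # illustrative subset (assumption)
CLAUSE_END = {".", ":", ";", "!", "?"}           # punctuation that closes the negation scope

def mark_negation_sketch(tokens):
    """Append '_NEG' to every token between a negation word and the end of the clause."""
    marked, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            in_scope = False
            marked.append(tok)
        elif tok in NEGATION_WORDS and not in_scope:
            in_scope = True          # open the scope; the negation word itself stays unmarked
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked

example = mark_negation_sketch("i did not like the plot . the cast was great".split())
print(example)
```

The effect is that `like` and `like_NEG` become distinct unigram features, so the classifier can learn different weights for a word depending on whether it appears in a negated clause.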


We apply features to obtain a feature-value representation of our datasets:

In [62]:
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)
print(training_set[0])
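The feature-value representation can be pictured as one boolean per vocabulary word. The sketch below mimics that idea in plain Python; the function name `unigram_features_sketch` is hypothetical, and the `contains(...)` key format is modeled on the keys NLTK's unigram extractor produces.

```python
def unigram_features_sketch(tokens, unigrams):
    """Map each vocabulary word to True/False depending on whether the document contains it."""
    present = set(tokens)  # set lookup keeps the membership test cheap
    return {"contains({0})".format(word): word in present for word in unigrams}

feats = unigram_features_sketch(["a", "great", "film"], ["great", "boring"])
print(feats)
```

Each document then becomes a pair of such a dictionary and its label, which is the input format a feature-based classifier expects.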



We can now train our classifier on the training set, and subsequently output the evaluation results:

In [14]:
# TODO: Use Naive Bayes to train the sentiment classifier

trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))


Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.7875
F-measure [neg]: 0.8045977011494253
F-measure [pos]: 0.7671232876712328
Precision [neg]: 0.7446808510638298
Precision [pos]: 0.8484848484848485
Recall [neg]: 0.875
Recall [pos]: 0.7
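As a quick sanity check, the reported F-measures can be recomputed from the reported precision and recall, assuming the balanced F-measure (harmonic mean of precision and recall). The helper name `f_measure` below is just for this example.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation output above
f_neg = f_measure(0.7446808510638298, 0.875)
f_pos = f_measure(0.8484848484848485, 0.7)
print(f_neg, f_pos)
```

Both values agree with the `F-measure [neg]` and `F-measure [pos]` lines printed by the evaluation.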


### 2.2. Lexical approach

Lexical approaches map words to sentiment by building a lexicon, or 'dictionary of sentiment'. We can use this dictionary to assess the sentiment of phrases and sentences without any other resource. Sentiment can be categorical, such as {negative, neutral, positive}, or numerical, such as a range of intensities or scores. A lexical approach looks up the sentiment category or score of each word in a sentence and combines them into a category or score for the whole sentence. The strength of lexical approaches is that no model has to be trained on labeled data, since everything needed to assess the sentiment of a sentence is already in the sentiment dictionary. VADER is an example of a lexical method.
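The following is a minimal sketch of the dictionary-lookup idea. The scores in `TOY_LEXICON` are made up for illustration, and the summing rule is an assumption; VADER itself uses a much larger lexicon plus additional rules (for example for negation and intensifiers), none of which are modeled here.

```python
TOY_LEXICON = {"great": 2.0, "good": 1.0, "boring": -1.5, "awful": -2.5}  # made-up scores

def lexicon_score(tokens, lexicon=TOY_LEXICON):
    """Sum the scores of all known words; unknown words contribute 0."""
    return sum(lexicon.get(tok, 0.0) for tok in tokens)

def lexicon_label(tokens, lexicon=TOY_LEXICON):
    """Turn the numerical score into a categorical label."""
    score = lexicon_score(tokens, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

mixed = lexicon_label("a great cast but an awful plot".split())
plain = lexicon_label("a good film".split())
empty = lexicon_label("the film".split())
print(mixed, plain, empty)
```

No training step is involved: the only knowledge the method has is the dictionary itself.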

In [63]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Run the lexical approach on the test documents:

In [64]:
sid = SentimentIntensityAnalyzer()
for doc in testing_docs:
    doc = " ".join(doc[0])
    print(doc[:100] + "...")
    ss = sid.polarity_scores(doc)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()

films adapted from comic books have had plenty of success , whether they ' re about superheroes ( ba...
compound: -0.9933, neg: 0.159, neu: 0.724, pos: 0.117, 
every now and then a movie comes along from a suspect studio , with every indication that it will be...
compound: 0.9776, neg: 0.097, neu: 0.767, pos: 0.136, 
you ' ve got mail works alot better than it deserves to . in order to make the film a success , all ...
compound: 0.9965, neg: 0.085, neu: 0.702, pos: 0.213, 
" jaws " is a rare film that grabs your attention before it shows you a single image on screen . the...
compound: 0.977, neg: 0.084, neu: 0.809, pos: 0.106, 
moviemaking is a lot like being the general manager of an nfl team in the post - salary cap era -- y...
compound: -0.9895, neg: 0.137, neu: 0.759, pos: 0.104, 
on june 30 , 1960 , a self - taught , idealistic , yet pragmatic , young man became , at age 36 , th...
compound: 0.9964, neg: 0.068, neu: 0.798, pos: 0.133, 
apparently , director tony kaye had a major b

compound: 0.9982, neg: 0.081, neu: 0.689, pos: 0.23, 
screen story by kevin yagher and andrew kevin walker . screenplay by andrew kevin walker . inspired ...
compound: 0.9412, neg: 0.098, neu: 0.774, pos: 0.129, 
mpaa : not rated ( though i feel it would likely be pg , for martial - arts violence . ) with three ...
compound: -0.3236, neg: 0.124, neu: 0.745, pos: 0.13, 
well , i know that stallone is 50 years old now , but in daylight he doesn ' t look it ! daylight is...
compound: -0.7447, neg: 0.158, neu: 0.709, pos: 0.133, 
logical time travel movies are a near - impossibility . considering that the skeptic ' s best argume...
compound: 0.999, neg: 0.085, neu: 0.737, pos: 0.178, 
blade is the movie that shows that wesley snipes really can live up to his potential as one of holly...
compound: 0.9956, neg: 0.062, neu: 0.79, pos: 0.148, 
the truman show ( paramount pictures ) running time : 1 hour 42 minutes starring jim carrey and ed h...
compound: 0.997, neg: 0.059, neu: 0.77, pos: 0.1

compound: 0.996, neg: 0.131, neu: 0.691, pos: 0.178, 
as the film opens up , expectant unwed mother sally ( played by drew barrymore ) encounters her baby...
compound: 0.9865, neg: 0.032, neu: 0.828, pos: 0.139, 
the second jackal - based film to come out in 1997 ( the other starring bruce willis was simply enti...
compound: 0.9968, neg: 0.078, neu: 0.729, pos: 0.193, 
one of the last entries in the long - running carry on series , carry on behind is very similar to c...
compound: 0.7531, neg: 0.072, neu: 0.846, pos: 0.082, 
the cryptic teaser trailer has been unspooling in moviehouses for quite sometime now : " it mu5t be ...
compound: 0.9951, neg: 0.087, neu: 0.775, pos: 0.138, 
" very bad things , " is the most delightfully morbid film of the year , a movie that goes so far ov...
compound: 0.7441, neg: 0.168, neu: 0.651, pos: 0.181, 
it is hard to imagine that a movie which includes abortion and incest as prominent plot devices coul...
compound: 0.9944, neg: 0.049, neu: 0.82, pos: 0

compound: 0.9978, neg: 0.048, neu: 0.791, pos: 0.161, 
damn those trailers . had it not been for the advertising of this film , which reveals far too much ...
compound: 0.9682, neg: 0.092, neu: 0.756, pos: 0.152, 
bob the happy bastard ' s quickie review : rush hour so what ' s the problem with 48 hours clones th...
compound: 0.9843, neg: 0.103, neu: 0.715, pos: 0.182, 
this sunday afternoon i had the priviledge of attending a private screening at the sony astor cinema...
compound: 0.9986, neg: 0.089, neu: 0.722, pos: 0.189, 
plot : two teen couples go to a church party , drink and then drive . they get into an accident . on...
compound: 0.988, neg: 0.103, neu: 0.754, pos: 0.144, 
the happy bastard ' s quick movie review damn that y2k bug . it ' s got a head start in this movie s...
compound: 0.8534, neg: 0.054, neu: 0.854, pos: 0.092, 
it is movies like these that make a jaded movie viewer thankful for the invention of the timex indig...
compound: 0.9669, neg: 0.088, neu: 0.782, pos: 

compound: 0.9955, neg: 0.109, neu: 0.731, pos: 0.16, 
has it really been two decades since walter matthau coached the bad news bears ? nineteen years and ...
compound: 0.9755, neg: 0.097, neu: 0.765, pos: 0.138, 
funny how your expectations can be defeated , and not in good ways . the ghost and the darkness prom...
compound: 0.5217, neg: 0.079, neu: 0.837, pos: 0.084, 
unfortunately it doesn ' t get much more formulaic than one tough cop . there ' s the renegade cop w...
compound: 0.1358, neg: 0.119, neu: 0.76, pos: 0.12, 
supposedly based on a true story in which the british drive to build a rail bridge deep in africa gr...
compound: -0.8568, neg: 0.128, neu: 0.783, pos: 0.09, 
of course i knew this going in . why is it that whenever a tv - star makes a movie it ' s always a r...
compound: 0.9675, neg: 0.077, neu: 0.758, pos: 0.165, 
louie is a trumpeter swan with no voice . in order to woo his lady love serina , louie makes friends...
compound: 0.9927, neg: 0.057, neu: 0.767, pos: 0.

compound: 0.9772, neg: 0.109, neu: 0.753, pos: 0.138, 
" showgirls " is the first big - budget , big - studio film to receive an nc - 17 rating . and its r...
compound: -0.9232, neg: 0.156, neu: 0.713, pos: 0.131, 
here ' s a concept -- jean - claude van damme gets killed within the first ten minutes of the movie ...
compound: 0.8606, neg: 0.065, neu: 0.848, pos: 0.087, 
what makes reindeer games even more disappointing than just a predictable , lifeless action flick is...
compound: -0.9898, neg: 0.16, neu: 0.706, pos: 0.134, 
( dreamworks skg ) running time : 2 hours starring robert duvall , tea leoni and elijah wood directe...
compound: 0.9943, neg: 0.077, neu: 0.756, pos: 0.167, 
seen may 2 , 1998 at 3 : 40 p . m . at the crossgates cinema 18 , theater # 13 , with chris wessell ...
compound: -0.9955, neg: 0.16, neu: 0.74, pos: 0.1, 
the only thing worse than watching a bad movie is realizing that the film had a lot of potential and...
compound: -0.952, neg: 0.157, neu: 0.709, pos: 0

compound: 0.9913, neg: 0.112, neu: 0.743, pos: 0.145, 
an experience like baby geniuses can have certain effects on an average moviegoer . you may be scarr...
compound: -0.8833, neg: 0.117, neu: 0.761, pos: 0.122, 
" houston . we have a serious problem . " after making " mission : impossible " , brian de palma has...
compound: -0.4116, neg: 0.128, neu: 0.749, pos: 0.123, 
gun wielding arnold schwarzenegger has a change of heart by the film ' s end and becomes a believer ...
compound: -0.9977, neg: 0.167, neu: 0.751, pos: 0.082, 
after enduring mariah carey ' s film debut , glitter , i ' m reminded of a bit from chris rock ' s b...
compound: -0.9005, neg: 0.124, neu: 0.763, pos: 0.113, 
poster boy for co - dependency needs patching patch adams a film review by michael redman copyright ...
compound: 0.9964, neg: 0.094, neu: 0.748, pos: 0.158, 
young einstein is embarrassingly lame , but that didn ' t stop it from becoming a phenomenon in aust...
compound: 0.9507, neg: 0.108, neu: 0.75, p

### 2.3. Comparing the two approaches

First, we convert the sentiment score produced by the lexical approach into a label using the following rules:

+ positive sentiment: compound score > 0
+ negative sentiment: compound score <= 0

In [65]:
def lexical_sentiment(doc, sid=None):
    """TODO: return the label 'pos' or 'neg' for a document"""
    if sid is None:
        sid = SentimentIntensityAnalyzer()
    ss = sid.polarity_scores(doc)
    # compound > 0 is positive; anything else (including exactly 0) is negative
    return "pos" if ss['compound'] > 0 else "neg"

for doc in testing_docs:
    doc = " ".join(doc[0])
    label = lexical_sentiment(doc, sid)
    print(doc[:100] + "...", label)


films adapted from comic books have had plenty of success , whether they ' re about superheroes ( ba... neg

every now and then a movie comes along from a suspect studio , with every indication that it will be... pos

you ' ve got mail works alot better than it deserves to . in order to make the film a success , all ... pos

" jaws " is a rare film that grabs your attention before it shows you a single image on screen . the... pos

moviemaking is a lot like being the general manager of an nfl team in the post - salary cap era -- y... neg

on june 30 , 1960 , a self - taught , idealistic , yet pragmatic , young man became , at age 36 , th... pos

apparently , director tony kaye had a major battle with new line regarding his new film , american h... neg

one of my colleagues was surprised when i told her i was willing to see betsy ' s wedding . and she ... pos

after bloody clashes and independence won , lumumba refused to pander to the belgians , who continue... pos

the american actio


a big surprise to me . the good trailer had hinted that they pulled the impossible off , but making ... pos

after having heard so many critics describe " return to me " as an old - fashioned hollywood romance... pos

wild things is a suspenseful thriller starring matt dillon , denise richards , and neve campbell tha... pos

* * * * * * minor plot spoilers in review * * * * * * * * * * * * no major spoilers are in review * ... pos

are you tired of all the hot new releases being gone by the time you get to the video store ? waffle... neg

many people dislike french films for their lack of closure . while possibly shallow , i ' ve often h... pos

the keen wisdom of an elderly bank robber , the naive ambitions of a sexy hospital nurse , and a par... pos

robert benton has assembled a stellar , mature cast for his latest feature , twilight , a film noir ... pos


accepting his oscar as producer of this year ' s best picture winner , saul zaentz remarked that his... pos

a bleak look at h


a standoff . a man holds a woman , a diplomat ' s daughter , hostage in his embrace , a gun pressed ... neg

with stars like sigourney weaver ( " alien " trilogy ) and academy award winner holly hunter ( the p... neg

no , it is not a bad film , in fact it is so good in achieving its purpose , i actually wished for t... pos

capsule : earthy , experimental , difficult , shockingly frank ( even for 1997 ! ) , and ultimately ... pos

a fully loaded entertainment review - website coming in july ! i didn ' t really expect very much wh... pos

" oh my god , i sounded just like a mother ! " mrs . pascal , played with devilish wickedness by gen... pos

i don ' t know how many other people have had the idea cross their mind that their life could be an ... pos

no filmmaker deconstructs a story as well as atom egoyan . i ' m referring , specifically , to the n... pos

i know what i would do with $ 4 . 4 million if i found it in a previously undiscovered airplane cras... pos

i saw simon birch 


talk about a movie that seemed dated before it even hit the theaters ! spice world is the feature fi... pos

this feature is like a double header , two sets of clich ? s for the price of one . not only do we g... pos

one of the responses those that enjoy " detroit rock city " ( probably kiss fans , mostly ) might ha... pos

has it really been two decades since walter matthau coached the bad news bears ? nineteen years and ... pos

funny how your expectations can be defeated , and not in good ways . the ghost and the darkness prom... pos

unfortunately it doesn ' t get much more formulaic than one tough cop . there ' s the renegade cop w... pos

supposedly based on a true story in which the british drive to build a rail bridge deep in africa gr... neg

of course i knew this going in . why is it that whenever a tv - star makes a movie it ' s always a r... pos

louie is a trumpeter swan with no voice . in order to woo his lady love serina , louie makes friends... pos

one of the most re


some talented actresses are blessed with a demonstrated wide acting range while others , almost as g... pos

susan granger ' s review of " ghosts of mars " ( sony pictures entertainment ) horror auteur john ca... neg

plot : based on the wildly popular " jerry springer " tv show , this movie follows the lives of two ... pos

i never understood what the clich ? " hell on earth " truly meant until very recently . i ' ve just ... neg

about an hour or so into " the jackal , " a character wandered around as people were being shot at i... pos

" showgirls " is the first big - budget , big - studio film to receive an nc - 17 rating . and its r... neg

here ' s a concept -- jean - claude van damme gets killed within the first ten minutes of the movie ... pos

what makes reindeer games even more disappointing than just a predictable , lifeless action flick is... neg

( dreamworks skg ) running time : 2 hours starring robert duvall , tea leoni and elijah wood directe... pos

seen may 2 , 1998 

Now we evaluate the lexical approach by computing accuracy metrics:

In [71]:
from collections import defaultdict
from nltk.metrics import (accuracy as eval_accuracy, precision as eval_precision,
        recall as eval_recall, f_measure as eval_f_measure)

gold_results = defaultdict(set)   # label -> indices of documents with that gold label
test_results = defaultdict(set)   # label -> indices of documents predicted as that label
acc_gold_results = []             # gold labels in document order
acc_test_results = []             # predicted labels in document order
labels = set()
for i, (text, label) in enumerate(testing_docs):
    labels.add(label)
    gold_results[label].add(i)
    acc_gold_results.append(label)
    observed = lexical_sentiment(" ".join(text), sid)
    acc_test_results.append(observed)
    test_results[observed].add(i)
metrics_results = {}

# TODO: compute the accuracy metrics
# Accuracy compares the two label sequences; the other metrics compare index sets per label.
metrics_results['Accuracy'] = eval_accuracy(acc_gold_results, acc_test_results)
for label in labels:
    metrics_results['Precision [{0}]'.format(label)] = eval_precision(gold_results[label], test_results[label])
    metrics_results['Recall [{0}]'.format(label)] = eval_recall(gold_results[label], test_results[label])
    metrics_results['F-measure [{0}]'.format(label)] = eval_f_measure(gold_results[label], test_results[label])

for result in sorted(metrics_results):
    print('{0}: {1}'.format(result, metrics_results[result]))

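To show how per-label index sets such as `gold_results` and `test_results` relate to the metrics, here is a small self-contained sketch on made-up data. The helper functions are written from the standard definitions of precision and recall and are not NLTK's own code; the document indices are hypothetical.

```python
def precision(reference, test):
    """Fraction of predicted items that are correct (None if nothing was predicted)."""
    if not test:
        return None
    return len(reference & test) / len(test)

def recall(reference, test):
    """Fraction of gold items that were found (None if there are no gold items)."""
    if not reference:
        return None
    return len(reference & test) / len(reference)

# Hypothetical document indices grouped by gold label and by predicted label
gold = {"pos": {0, 1, 2, 3}, "neg": {4, 5}}
pred = {"pos": {0, 1, 4}, "neg": {2, 3, 5}}

p_pos, r_pos = precision(gold["pos"], pred["pos"]), recall(gold["pos"], pred["pos"])
p_neg, r_neg = precision(gold["neg"], pred["neg"]), recall(gold["neg"], pred["neg"])

# Accuracy: documents whose index appears under the same label in both mappings
correct = sum(len(gold[label] & pred[label]) for label in gold)
accuracy = correct / sum(len(ids) for ids in gold.values())
print(p_pos, r_pos, p_neg, r_neg, accuracy)
```

Precision and recall are computed per label from set intersections, while accuracy is a single number over all documents, which is why the two kinds of metric need differently shaped inputs.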