## 2. Sentiment Analysis
In this exercise, we classify the sentiment of text documents. Complete the code wherever it is marked with a TODO tag.

References and further reading:
+ http://www.nltk.org/howto/sentiment.html
+ https://www.nltk.org/api/nltk.sentiment.html
+ http://datameetsmedia.com/vader-sentiment-analysis-explained/
+ https://github.com/cjhutto/vaderSentiment
+ https://marcobonzanini.com/2015/05/17/mining-twitter-data-with-python-part-6-sentiment-analysis-basics/
+ https://github.com/marrrcin/ml-twitter-sentiment-analysis


### 2.1. Classification approach

The classification approach uses previously labeled data to determine the sentiment of never-before-seen sentences. It involves training a model on labeled text and then using that model to predict (classify) the sentiment of new input text. An advantage is that more training data generally yields better classification results. However, unlike the lexical approach, it requires labeled data.
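To make the train-then-predict idea concrete, here is a minimal pure-Python sketch of a word-count Naive Bayes classifier with add-one smoothing. It is a simplified illustration, not NLTK's `NaiveBayesClassifier`; the function names `train_naive_bayes` and `classify` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents and word occurrences per label from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: share of training documents carrying this label
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen (word, label) pairs above zero
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy labeled data in the same (tokens, label) shape used in this notebook
toy_docs = [
    (['a', 'great', 'fun', 'film'], 'pos'),
    (['great', 'acting', 'and', 'fun', 'plot'], 'pos'),
    (['a', 'boring', 'dull', 'film'], 'neg'),
    (['dull', 'plot', 'and', 'bad', 'acting'], 'neg'),
]
model = train_naive_bayes(toy_docs)
print(classify(model, ['fun', 'and', 'great']))
print(classify(model, ['boring', 'bad', 'plot']))
```

With only four training documents the model is trivially small; the point is the workflow of counting from labeled examples and then scoring unseen text.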

In [2]:
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, subjectivity
from nltk.sentiment import SentimentAnalyzer
# Import only the helpers used below instead of a wildcard import
from nltk.sentiment.util import extract_unigram_feats, mark_negation

# Fetch the resources used in this exercise
nltk.download('vader_lexicon')
nltk.download('movie_reviews')

# n_instances = 100
# subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:n_instances]]
# obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:n_instances]]
# len(subj_docs), len(obj_docs)

# Total number of documents to load; None loads the full corpus
n_instances = None
if n_instances is not None:
    n_instances = n_instances // 2  # half of the documents per class

# Each document is a (list of word tokens, label) tuple
pos_docs = [(list(movie_reviews.words(pos_id)), 'pos') for pos_id in movie_reviews.fileids('pos')[:n_instances]]
neg_docs = [(list(movie_reviews.words(neg_id)), 'neg') for neg_id in movie_reviews.fileids('neg')[:n_instances]]
len(pos_docs), len(neg_docs)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


(1000, 1000)

Each document is represented by a tuple (words, label). The review text is tokenized, so it is represented by a list of strings:

In [3]:
pos_docs[0]

(['films',
  'adapted',
  'from',
  'comic',
  'books',
  'have',
  'had',
  'plenty',
  'of',
  'success',
  ',',
  'whether',
  'they',
  "'",
  're',
  'about',
  'superheroes',
  '(',
  'batman',
  ',',
  'superman',
  ',',
  'spawn',
  ')',
  ',',
  'or',
  'geared',
  'toward',
  'kids',
  '(',
  'casper',
  ')',
  'or',
  'the',
  'arthouse',
  'crowd',
  '(',
  'ghost',
  'world',
  ')',
  ',',
  'but',
  'there',
  "'",
  's',
  'never',
  'really',
  'been',
  'a',
  'comic',
  'book',
  'like',
  'from',
  'hell',
  'before',
  '.',
  'for',
  'starters',
  ',',
  'it',
  'was',
  'created',
  'by',
  'alan',
  'moore',
  '(',
  'and',
  'eddie',
  'campbell',
  ')',
  ',',
  'who',
  'brought',
  'the',
  'medium',
  'to',
  'a',
  'whole',
  'new',
  'level',
  'in',
  'the',
  'mid',
  "'",
  '80s',
  'with',
  'a',
  '12',
  '-',
  'part',
  'series',
  'called',
  'the',
  'watchmen',
  '.',
  'to',
  'say',
  'moore',
  'and',
  'campbell',
  'thoroughly',
  'researche

We split the positive and negative instances separately to keep a balanced class distribution in both the training and test sets.

In [25]:
lst = [1, 2, 3, 4, 5, 6]

print(lst[0:3])

[1, 2, 3]


In [30]:
# TODO: split training and testing data as 80/20
train_pos_docs = pos_docs[0:800]
test_pos_docs = pos_docs[800:1000]
train_neg_docs = neg_docs[0:800]
test_neg_docs = neg_docs[800:1000]

training_docs = train_pos_docs+train_neg_docs
testing_docs = test_pos_docs+test_neg_docs
sentim_analyzer = SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])

all_words_neg[:5]

['films', 'adapted', 'from', 'comic', 'books']
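The slice indices in the split above are hard-coded for 1000 documents per class. As a sketch of a more general alternative, the helper below computes the cut point per class from a fraction. The name `split_per_class` and the toy documents are assumptions for illustration, not part of NLTK or the original notebook.

```python
def split_per_class(docs_by_label, train_fraction=0.8):
    """Split each class separately so both sets keep the same class balance."""
    train, test = [], []
    for label in sorted(docs_by_label):
        docs = docs_by_label[label]
        cut = int(len(docs) * train_fraction)  # index where the test part starts
        train.extend(docs[:cut])
        test.extend(docs[cut:])
    return train, test


# Toy stand-ins for pos_docs and neg_docs: ten documents per class
toy = {
    'pos': [(['good', str(i)], 'pos') for i in range(10)],
    'neg': [(['bad', str(i)], 'neg') for i in range(10)],
}
train, test = split_per_class(toy)
print(len(train), len(test))
```

In the notebook this could be called as `split_per_class({'pos': pos_docs, 'neg': neg_docs})`, assuming both lists are already loaded.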

We use simple unigram word features, handling negation:

In [31]:
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
print(len(unigram_feats))
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

14667


We apply features to obtain a feature-value representation of our datasets:

In [32]:
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)
print(training_set[0])
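To make the feature-value representation concrete, the sketch below builds a boolean feature dictionary for one tokenized document. It is a simplified stand-in for `extract_unigram_feats`; the helper name `unigram_features`, the `contains(...)` key format, and the toy vocabulary are assumptions for illustration. Tokens ending in `_NEG` mimic the suffix that negation marking appends to words following a negation.

```python
def unigram_features(tokens, unigrams):
    """Map each vocabulary word to True/False depending on whether it occurs."""
    present = set(tokens)
    return {'contains({0})'.format(word): (word in present) for word in unigrams}


# Hypothetical vocabulary; 'good' and 'good_NEG' are distinct features
vocab = ['good', 'bad', 'good_NEG']
feats = unigram_features(['not', 'good_NEG', 'at_NEG', 'all_NEG'], vocab)
print(feats)
```

Because the negated form is a separate feature, "not good" and "good" no longer look identical to the classifier.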



We can now train our classifier on the training set, and subsequently output the evaluation results:

In [None]:
# TODO: Use Naive Bayes to train the sentiment classifier
naive_bayes = NaiveBayesClassifier.train
sentim_analyzer...
sentim_analyzer...

### 2.2. Lexical approach

Lexical approaches map words to sentiment by building a lexicon, or 'dictionary of sentiment'. We can use this dictionary to assess the sentiment of phrases and sentences without any labeled training data. Sentiment can be categorical, such as {negative, neutral, positive}, or numerical, such as a range of intensities or scores. A lexical approach looks up the sentiment category or score of each word in a sentence and combines them to decide the category or score of the whole sentence. The strength of lexical approaches is that no model has to be trained on labeled data: everything needed to assess a sentence is already in the lexicon. VADER is an example of a lexical method.
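As a rough illustration of the dictionary lookup idea, the sketch below sums per-word scores from a tiny hand-made lexicon. The lexicon `TOY_LEXICON`, its scores, and the function `lexicon_score` are made up for this example; VADER's actual lexicon and scoring rules are considerably richer.

```python
# Hypothetical word scores; not taken from VADER or any real lexicon
TOY_LEXICON = {'great': 2.0, 'good': 1.0, 'bad': -1.0, 'awful': -2.0}


def lexicon_score(tokens, lexicon=TOY_LEXICON):
    """Sum the scores of all tokens; words missing from the lexicon count as 0."""
    return sum(lexicon.get(tok.lower(), 0.0) for tok in tokens)


sentence = 'a great cast but an awful script and bad pacing'.split()
print(lexicon_score(sentence))
```

No training step is involved: the score depends only on the lexicon entries and the words in the sentence.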

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Run the lexical approach on the test documents and print the scores for each one:

In [None]:
sid = SentimentIntensityAnalyzer()
for doc in testing_docs:
    doc = " ".join(doc[0])
    print(doc[:100] + "...")
    ss = sid.polarity_scores(doc)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()

### 2.3. Comparing the two approaches

First, we convert the sentiment score produced by the lexical approach into a label using the following rules:

+ positive sentiment: compound score > 0
+ negative sentiment: compound score <= 0
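The two rules can be sketched as a small threshold function. The name `label_from_compound` and the example scores are assumptions for illustration; in the exercise, the compound value would come from the `compound` entry of the scores returned by `sid.polarity_scores(doc)`.

```python
def label_from_compound(compound):
    """Apply the rule above: strictly positive scores are 'pos', the rest 'neg'."""
    return 'pos' if compound > 0 else 'neg'


# Hypothetical compound scores, including the boundary case 0.0
for score in (0.73, 0.0, -0.42):
    print(score, label_from_compound(score))
```

Note that a score of exactly 0 falls into the negative class under these rules, so neutral documents are counted as negative.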

In [None]:
def lexical_sentiment(doc, sid=None):
    """TODO: return the label 'pos' or 'neg' for a document"""
    if sid is None: sid = SentimentIntensityAnalyzer()
    ...
    return label

for doc in testing_docs:
    doc = " ".join(doc[0])
    label = lexical_sentiment(doc, sid)
    print(doc[:100] + "...", label)

Now we evaluate the lexical approach by computing accuracy, precision, recall, and F-measure against the gold labels:

In [None]:
from collections import defaultdict
from nltk.metrics import (accuracy as eval_accuracy, precision as eval_precision,
        recall as eval_recall, f_measure as eval_f_measure)

gold_results = defaultdict(set)
test_results = defaultdict(set)
acc_gold_results = []
acc_test_results = []
labels = set()
num = 0
for i, (text, label) in enumerate(testing_docs):
    labels.add(label)
    gold_results[label].add(i)
    acc_gold_results.append(label)
    observed = lexical_sentiment(" ".join(text), sid)
    num += 1
    acc_test_results.append(observed)
    test_results[observed].add(i)
metrics_results = {}

# TODO: compute the accuracy metrics
for label in labels:
    ...

for result in sorted(metrics_results):
    print('{0}: {1}'.format(result, metrics_results[result]))
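For reference, the set-based metrics can be sketched in plain Python as follows. These helpers are written for illustration and are assumed to follow the usual definitions (precision is the share of predicted items that are correct, recall is the share of gold items that were found); they are not the `nltk.metrics` functions themselves, and the toy index sets are made up.

```python
def precision(reference, test):
    """Fraction of predicted items that are also in the gold set."""
    return len(reference & test) / len(test) if test else None


def recall(reference, test):
    """Fraction of gold items that were also predicted."""
    return len(reference & test) / len(reference) if reference else None


def f_measure(reference, test):
    """Harmonic mean of precision and recall (equal weighting)."""
    p, r = precision(reference, test), recall(reference, test)
    if p is None or r is None:
        return None
    if p == 0 or r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Toy data: document indices grouped by gold label and by predicted label
gold = {'pos': {0, 1, 2}, 'neg': {3, 4, 5}}
predicted = {'pos': {0, 1, 3}, 'neg': {2, 4, 5}}

for label in sorted(gold):
    print(label,
          precision(gold[label], predicted[label]),
          recall(gold[label], predicted[label]),
          f_measure(gold[label], predicted[label]))

# Accuracy compares the two label sequences position by position
gold_seq = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
pred_seq = ['pos', 'pos', 'neg', 'pos', 'neg', 'neg']
accuracy = sum(g == p for g, p in zip(gold_seq, pred_seq)) / len(gold_seq)
print('accuracy', accuracy)
```

The per-label sets play the same role as `gold_results` and `test_results` in the cell above, and the two sequences correspond to `acc_gold_results` and `acc_test_results`.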