# Sentiment Analysis

Based on *Exercise B: Sentiment Analysis* in [Natural Language Processing with Python/NLTK](https://github.com/luchux/ipython-notebook-nltk/blob/master/NLP%20-%20MelbDjango.ipynb) by Luciano M. Guasco.

## 1. Exploring the `movie_reviews` corpus

In [1]:
from nltk.corpus import movie_reviews

print(movie_reviews.readme()[:64])

Sentiment Polarity Dataset Version 2.0
Bo Pang and Lillian Lee




In [2]:
movie_reviews.categories()

['neg', 'pos']

In [3]:
movie_reviews.fileids()[:5]

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt']

In [4]:
print(movie_reviews.raw("neg/cv000_29416.txt")[:260])

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . .


## 2. Building and testing the classifier

In [5]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
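# Also drop punctuation and the contraction fragments produced by tokenization.
stop_words.extend(['.', ',', '[', ']', '(', ')', ';', '/', '-', "'", '?', '"', ':', '<', '>',
                   "n't", '|', '#', "'s", "'re", "'ve", "'ll", "'d"])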
stop_words[:5]

['a', 'about', 'above', 'after', 'again']

In [6]:
from nltk.classify import NaiveBayesClassifier

def extract_word_presence_features(words):
    return {word: True for word in words if word not in stop_words and word.isalpha()}

pos_ids = movie_reviews.fileids('pos')
neg_ids = movie_reviews.fileids('neg')

len(pos_ids) + len(neg_ids)

2000

In [7]:
pos_feats = [(extract_word_presence_features(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]
neg_feats = [(extract_word_presence_features(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]

print('SOME EXAMPLE WORD PRESENCE FEATURES FOR THE FIRST POSITIVE REVIEW')
print('adapted:', pos_feats[0][0]['adapted'])
print('success:', pos_feats[0][0]['success'])
print('like:', pos_feats[0][0]['like'])
print('plenty:', pos_feats[0][0]['plenty'])
print('never:', pos_feats[0][0]['never'])

SOME EXAMPLE WORD PRESENCE FEATURES FOR THE FIRST POSITIVE REVIEW
adapted: True
success: True
like: True
plenty: True
never: True


In [8]:
TRAIN_TEST_SPLIT = 3 / 4

pos_len_train = int(len(pos_feats) * TRAIN_TEST_SPLIT)
neg_len_train = int(len(neg_feats) * TRAIN_TEST_SPLIT)

pos_len_train

750

In [9]:
import nltk.classify.util

train_feats = neg_feats[:neg_len_train] + pos_feats[:pos_len_train]
test_feats = neg_feats[neg_len_train:] + pos_feats[pos_len_train:]

classifier = NaiveBayesClassifier.train(train_feats)

print('Accuracy:', round(nltk.classify.util.accuracy(classifier, test_feats), 2))

Accuracy: 0.71
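
The accuracy above is the fraction of test examples whose predicted label equals the true label. A minimal sketch of that computation, using a hypothetical rule-based function (`toy_classify`) and made-up data in place of the trained classifier so that it runs without the corpus:

```python
def accuracy(classify, labeled_feats):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for feats, label in labeled_feats if classify(feats) == label)
    return correct / len(labeled_feats)

# Hypothetical stand-in classifier: 'neg' if the word 'bad' is present, else 'pos'.
def toy_classify(feats):
    return 'neg' if 'bad' in feats else 'pos'

# Made-up test set for illustration only.
toy_test = [
    ({'bad': True, 'plot': True}, 'neg'),
    ({'great': True}, 'pos'),
    ({'boring': True}, 'neg'),               # predicted 'pos', so counted as wrong
    ({'bad': True, 'acting': True}, 'pos'),  # predicted 'neg', so counted as wrong
]

toy_accuracy = accuracy(toy_classify, toy_test)
print(toy_accuracy)
```

Because the test set here holds 250 reviews of each class, always guessing the same label would score 0.5, which is the baseline to compare 0.71 against.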


In [10]:
classifier.show_most_informative_features()

Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0


## 3. Classifying new data

In [11]:
from nltk import word_tokenize, pos_tag

sentence_1 = "I feel so miserable, and that's amazing!"
tokens_1 = [word for word in word_tokenize(sentence_1) if word not in stop_words]
tokens_1

['I', 'feel', 'miserable', 'amazing', '!']

In [12]:
feats_1 = extract_word_presence_features(tokens_1)
feats_1

{'I': True, 'feel': True, 'miserable': True, 'amazing': True}

In [13]:
classifier.classify(feats_1)

'pos'
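
Note that `'I'` survived the stop-word filter in cell 11: the NLTK stop-word list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. It uses a small hypothetical stop-word set (`demo_stop_words`) and a hand-written token list in place of `stop_words` and `word_tokenize`, so that it runs on its own:

```python
# Hypothetical stand-in for the stop_words list built earlier.
demo_stop_words = {'i', 'so', 'and', 'that', ',', "'s"}

def remove_stop_words(tokens, stop_words):
    """Keep tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

# Hand-written tokens for "I feel so miserable, and that's amazing!"
tokens = ['I', 'feel', 'so', 'miserable', ',', 'and', 'that', "'s", 'amazing', '!']
filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

Lowercasing the tokens themselves before feature extraction may also help, since the review text shown earlier is entirely lowercase and a capitalized word would not match any training feature.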

In [14]:
sentence_2 = "You are a pathetic fool, a terrible excuse for a human being."
tokens_2 = [word for word in word_tokenize(sentence_2) if word not in stop_words]
tokens_2

['You', 'pathetic', 'fool', 'terrible', 'excuse', 'human']

Adjectives usually carry most of a sentence's sentiment. We can extract the adjectives from this sentence and feed only those to the classifier.

In [15]:
pos_tags_2 = [(word, tag) for (word, tag) in pos_tag(tokens_2) if tag == 'JJ']
pos_tags_2

[('pathetic', 'JJ'), ('terrible', 'JJ')]

In [16]:
feats_2 = extract_word_presence_features([word for (word, _) in pos_tags_2])
feats_2

{'pathetic': True, 'terrible': True}

In [17]:
classifier.classify(feats_2)

'neg'
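
The `== 'JJ'` test keeps only base-form adjectives. In the Penn Treebank tag set, which `pos_tag` uses by default, comparatives and superlatives are tagged `JJR` and `JJS`, so a prefix match would keep those as well. A sketch on hand-written (word, tag) pairs, which are illustrative and not actual tagger output:

```python
# Illustrative (word, tag) pairs written by hand, not produced by pos_tag.
tagged = [('worse', 'JJR'), ('than', 'IN'), ('the', 'DT'), ('worst', 'JJS'),
          ('terrible', 'JJ'), ('film', 'NN')]

def adjectives(tagged_tokens):
    """Return the words whose tag starts with 'JJ' (JJ, JJR, or JJS)."""
    return [word for word, tag in tagged_tokens if tag.startswith('JJ')]

adjs = adjectives(tagged)
print(adjs)
```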

To improve the classifier, bigram features can be extracted using `nltk.util.ngrams`, because “not funny” means something very different from “funny”.
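
A minimal sketch of such bigram presence features. The helper name `extract_bigram_presence_features` is hypothetical, and the fallback definition of `ngrams` is there only so that the example runs when NLTK is not installed:

```python
try:
    from nltk.util import ngrams
except ImportError:
    # Fallback with the same behavior for this use: tuples of n adjacent items.
    def ngrams(sequence, n):
        seq = list(sequence)
        return zip(*(seq[i:] for i in range(n)))

def extract_bigram_presence_features(words):
    """Hypothetical helper: mark each pair of adjacent lowercase tokens as present."""
    tokens = [w.lower() for w in words]
    return {' '.join(bigram): True for bigram in ngrams(tokens, 2)}

bigram_feats = extract_bigram_presence_features(['this', 'movie', 'is', 'not', 'funny'])
print(bigram_feats)
```

This sketch deliberately skips stop-word removal: `'not'` is in the NLTK English stop-word list, so filtering first would drop the very bigram (`'not funny'`) that motivates the change.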