# Sentiment Analysis with Python & Naive-Bayes

## Typical Supervised ML Workflow

1. Acquire Data
2. Preprocess / Clean Data
3. Build a feature set
4. Build a trained model from training data and feature set
5. Apply the model to test data
6. Score the results of the trained model
7. Revise hypothesis, train & test again until you reach acceptable performance
   or until your PhD funding runs out...

## Background

Sentiment analysis is an NLP technique for determining the opinion polarity
(positive or negative) of a given text.

Let's apply this technique to movie reviews!

### What is the task?

For the following two reviews, we'd expect a classifier to return the sentiment
"labels" shown alongside them.

| Review                         | Label |
|--------------------------------|-------|
| "I love this movie!"           | `pos` |
| "This movie really stinks :-(" | `neg` |

## Building a Naive-Bayes Sentiment Classifier

### Acquire & Preprocess the data set

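A common dataset for training sentiment analysis algorithms
is a set of IMDB movie reviews. NLTK ships one such collection as its
`movie_reviews` corpus: 2,000 reviews, each labeled with its sentiment
polarity (i.e., `pos`/`neg`). If the corpus is not installed yet, fetch it
once with `nltk.download('movie_reviews')`.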

In [21]:
from nltk.corpus import movie_reviews

negative_ids = movie_reviews.fileids('neg')
positive_ids = movie_reviews.fileids('pos')

print(movie_reviews.sents(negative_ids[0]))

[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]


### Build a feature set

Let's define a function to create our `features`. Features
are the named, measurable properties of each example that a learning
algorithm consumes. Their type depends on the algorithm being
used, but they are typically binary or float values. Therefore, a
transform is necessary to convert our textual data into numerical
data.

In [22]:
from typing import List, Dict
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

from nltk.stem import PorterStemmer
ps = PorterStemmer()


def words_to_features(words: List[str]) -> Dict[str, bool]:

    # remove common, low information words: a, an, the, etc
    filtered_words = filter(lambda x: x not in stop_words, words)

    # "Stemming" the words means stripping tense modifiers and
    # other inflections to reduce them to a base form. Roughly,
    # this means:
    #   * if the word ends in 'ed', remove the 'ed'
    #   * if the word ends in 'ing', remove the 'ing'
    #   * if the word ends in 'ly', remove the 'ly'
    # This is a noisy process (stems such as 'pretti' are not always real
    # words), but it helps increase your observations of each word.
    stemmed_words = list(map(lambda x: ps.stem(x), filtered_words))

    # Finally, represent each word as a boolean: True indicates that the
    # word appeared in a review.
    word_mapping = [(word, True) for word in stemmed_words]
    return dict(word_mapping)

print(words_to_features(["NLP", "is", "pretty", "fun"]))

{'nlp': True, 'pretti': True, 'fun': True}


Here, we did a simple transformation of text data to boolean. Note that
`is` was dropped as a stop word and `pretty` was stemmed to `pretti`.
Now, let's create the positive and negative `features`:

In [24]:
negative_features = [(words_to_features(movie_reviews.words(fileids=[f])), 'neg') for f in negative_ids]
positive_features = [(words_to_features(movie_reviews.words(fileids=[f])), 'pos') for f in positive_ids]
print(negative_features[0])

({'plot': True, ':': True, 'two': True, 'teen': True, 'coupl': True, 'go': True, 'church': True, 'parti': True, ',': True, 'drink': True, 'drive': True, '.': True, 'get': True, 'accid': True, 'one': True, 'guy': True, 'die': True, 'girlfriend': True, 'continu': True, 'see': True, 'life': True, 'nightmar': True, "'": True, 'deal': True, '?': True, 'watch': True, 'movi': True, '"': True, 'sorta': True, 'find': True, 'critiqu': True, 'mind': True, '-': True, 'fuck': True, 'gener': True, 'touch': True, 'cool': True, 'idea': True, 'present': True, 'bad': True, 'packag': True, 'make': True, 'review': True, 'even': True, 'harder': True, 'write': True, 'sinc': True, 'applaud': True, 'film': True, 'attempt': True, 'break': True, 'mold': True, 'mess': True, 'head': True, '(': True, 'lost': True, 'highway': True, '&': True, 'memento': True, ')': True, 'good': True, 'way': True, 'type': True, 'folk': True, 'snag': True, 'correctli': True, 'seem': True, 'taken': True, 'pretti': True, 'neat': True, 

This creates two lists of `(features, label)` tuples, where every features
dict corresponds to the set of words found in a particular positive or negative
document.

Next, we need to split our labeled data into training and
testing data sets. Why? We want to measure how accurate the model
we are going to develop is, and to do that we need labeled data
that the model has not seen during training. An 80/20 split is typical.

### Split the dataset into training & testing sets

In [25]:
neg_cutoff = round(len(negative_features) * 0.80)
print(neg_cutoff)
pos_cutoff = round(len(positive_features) * 0.80)
print(pos_cutoff)
training_features = negative_features[:neg_cutoff] + positive_features[:pos_cutoff]
testing_features = negative_features[neg_cutoff:] + positive_features[pos_cutoff:]
print('train on %d instances, test on %d instances' % (len(training_features), len(testing_features)))

800
800
train on 1600 instances, test on 400 instances
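
The split above takes the first 80% of each class in file order. If that order happens to correlate with anything (date, author, topic), a shuffled split may give a more representative test set. The following is a minimal sketch, not part of the original notebook; `shuffled_split` and its `seed` parameter are hypothetical names introduced here for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffled_split(
    items: Sequence[T], train_fraction: float = 0.8, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `items` reproducibly, then cut it into train/test parts."""
    shuffled = list(items)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cutoff = round(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


train, test = shuffled_split(range(10))
print(len(train), len(test))  # prints: 8 2
```

Applied separately to `negative_features` and `positive_features`, this would keep the two classes balanced in both sets, just as the slicing above does.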


### Train a model

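We're ready to train our model. One of the simplest machine learning algorithms is the Naive Bayes classifier. It applies Bayes' rule under the "naive" assumption that features are conditionally independent of each other given the label.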
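
To make the idea concrete, here is a rough from-scratch sketch of a Naive Bayes classifier over boolean word features. It is an illustration only and is not how NLTK implements it: the add-one smoothing, the choice to score only the words present in a review, and the helper names `train_naive_bayes` and `classify` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

FeatureSet = Dict[str, bool]


def train_naive_bayes(data: List[Tuple[FeatureSet, str]]):
    """Count documents per label and, per label, documents containing each word."""
    label_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    for features, label in data:
        word_counts[label].update(features.keys())
    return label_counts, word_counts


def classify(features: FeatureSet, label_counts, word_counts) -> str:
    """Return the label with the highest log posterior for the observed words."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        # Log prior: log P(label)
        score = math.log(n_docs / total_docs)
        for word in features:
            # Add-one smoothed estimate of P(word present | label)
            p = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


toy_data = [
    ({"love": True, "movi": True}, "pos"),
    ({"great": True, "fun": True}, "pos"),
    ({"stink": True, "movi": True}, "neg"),
    ({"bore": True, "plod": True}, "neg"),
]
label_counts, word_counts = train_naive_bayes(toy_data)
print(classify({"love": True, "fun": True}, label_counts, word_counts))  # prints: pos
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow when a review contains hundreds of words.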

In [26]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(training_features)
print('accuracy:', nltk.classify.util.accuracy(classifier, testing_features))

accuracy: 0.71
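
Accuracy alone does not show whether the errors are concentrated in one class. As a sketch of a finer-grained score, precision and recall for a single label can be computed from gold and predicted label lists. The helper `precision_recall` and the toy lists below are hypothetical and only illustrate the arithmetic.

```python
from typing import List, Tuple


def precision_recall(
    gold: List[str], predicted: List[str], target: str
) -> Tuple[float, float]:
    """Compute precision and recall for one target label."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == target and p == target)
    false_pos = sum(1 for g, p in pairs if g != target and p == target)
    false_neg = sum(1 for g, p in pairs if g == target and p != target)
    # Guard against division by zero when the label is never predicted or never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


gold = ["pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "neg", "neg", "pos", "neg"]
print(precision_recall(gold, predicted, "pos"))  # prints: (0.5, 0.5)
```

With the notebook's variables, the inputs could be built as `gold = [label for _, label in testing_features]` and `predicted = [classifier.classify(f) for f, _ in testing_features]`.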


Can we get any sense of how these decisions are being made? What are the most
important words for classification?

In [27]:
classifier.show_most_informative_features()

Most Informative Features
                outstand = True              pos : neg    =     13.9 : 1.0
                  ludicr = True              neg : pos    =     13.8 : 1.0
                uninvolv = True              neg : pos    =     13.0 : 1.0
                  themat = True              pos : neg    =     12.3 : 1.0
                    plod = True              neg : pos    =     11.0 : 1.0
                    anna = True              pos : neg    =     10.3 : 1.0
                  darker = True              pos : neg    =     10.3 : 1.0
                  seagal = True              neg : pos    =     10.3 : 1.0
                  annual = True              pos : neg    =      9.0 : 1.0
                    hatr = True              pos : neg    =      9.0 : 1.0


Okay, cool. What happens if we test on new data?

In [None]:
test_reviews = [
"""Wow! That's about all one can say about this movie. The first time that I saw
it I was mesmerized. The movie looked so cool and hey, it actually had a good
plot. If you haven't seen this movie yet, get out from your cave and see it
right away. I have seen this movie umpteen times and it still shocks and
surprises me.""",
"""Anyway, back to the movie. It is as bad as you've no doubt heard. The scene
changes from night to day to night, the spaceship is a hubcap (you can see the
string it hangs from catch on fire at one point), I could do a better job
acting, etc. """]

from nltk.tokenize import word_tokenize

for review in test_reviews:
    review_features = words_to_features(word_tokenize(review.lower()))
    label = classifier.classify(review_features)
    prob_results = classifier.prob_classify(review_features)
    prob_str = " ({0:.2}/{1:.2})".format(prob_results.prob("pos"), prob_results.prob("neg"))
    print(review[:25], ": ", label, prob_str)


### Next steps

Some directions to explore: can we improve the features (n-grams, TF-IDF,
etc.)? Would a different algorithm give better performance?
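
As one possible starting point for the n-gram idea, the sketch below adds word n-grams (bigrams by default) alongside the single-word features, so that a phrase such as "not good" becomes its own feature. `words_to_ngram_features` is a hypothetical variant of `words_to_features`; it skips stop word removal and stemming for brevity, and it assumes negation words survive whatever preprocessing is applied.

```python
from typing import Dict, List


def words_to_ngram_features(words: List[str], n: int = 2) -> Dict[str, bool]:
    """Return boolean features for every word and every run of n adjacent words."""
    features = {word: True for word in words}
    for i in range(len(words) - n + 1):
        # Join adjacent tokens with a space to form the n-gram feature name.
        features[" ".join(words[i:i + n])] = True
    return features


print(words_to_ngram_features(["not", "good"]))
# prints: {'not': True, 'good': True, 'not good': True}
```

Because the output is still a dict of boolean features, it could be passed to the same training and classification calls used earlier.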