# Sentiment Analysis with Python

## Background

Sentiment analysis is an NLP technique for determining the
opinion polarity of a given text.

"I love this movie!" <- (positive)

"This movie really stinks :-(" <- (negative)

### First, import the required libraries

We're using the Natural Language Toolkit (NLTK), a Python library
that includes many datasets as well as NLP and ML algorithms.

In [None]:
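# The movie_reviews corpus must be available locally;
# if it is not, run nltk.download('movie_reviews') once.
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews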


### Next, let's preprocess our data

A common dataset for training sentiment analysis algorithms
is a collection of IMDB movie reviews, available in NLTK as the
`movie_reviews` corpus. It contains 2,000 movie reviews, 1,000 per
class, each labeled with its sentiment polarity (pos/neg).

In [None]:
negative_ids = movie_reviews.fileids('neg')
positive_ids = movie_reviews.fileids('pos')

Let's define a function to create our `features`. Features
are the named inputs that a learning algorithm works with.
Their type depends on the algorithm being used, but they are
typically binary or float values, so we need a transform that
converts our textual data into such values. Here, every word that
appears in a document becomes a binary feature set to `True`.

In [None]:
def word_feats(words):
    # Map each distinct word to True (word-presence features).
    return {word: True for word in words}
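
To see what this transform produces, here is a minimal sketch applied to a hand-written token list. The sample tokens are made up for illustration and do not come from the corpus; `word_feats` is repeated so the snippet runs on its own.

```python
def word_feats(words):
    # Map every distinct token to True (a word-presence feature).
    return {word: True for word in words}

# Hypothetical tokenized review, not taken from the corpus.
sample_tokens = ["i", "love", "this", "movie", "i", "love", "it"]
sample_features = word_feats(sample_tokens)

# Repeated tokens collapse into one key, so word counts are lost.
print(sample_features)
```

Because the result is a plain dict keyed by word, the classifier only sees which words occur in a document, not how often they occur.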

Now, create the positive and negative `features`:

In [None]:
negative_features = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negative_ids]
positive_features = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in positive_ids]

This creates two lists of `(features, label)` tuples, where every
feature dict corresponds to the set of words found in a particular
document.

Next, we need to split our labeled data into training and
testing data sets. Why? We want to measure how accurate the
model we are going to develop is, and to do that we need labeled
data that the model has not seen during training. An 80/20 split
is typical.

In [None]:
neg_cutoff = round(len(negative_features) * 0.80)
pos_cutoff = round(len(positive_features) * 0.80)

training_features = negative_features[:neg_cutoff] + positive_features[:pos_cutoff]
testing_features = negative_features[neg_cutoff:] + positive_features[pos_cutoff:]
print('train on %d instances, test on %d instances' % (len(training_features), len(testing_features)))
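
The slicing above takes the first 80% of each class in file order. As a sketch, the same per-class split can be wrapped in a small helper. The name `split_by_class`, the toy data, and the optional seeded shuffle are assumptions added for illustration; they are not part of the notebook's own code.

```python
import random

def split_by_class(examples, train_fraction=0.80, seed=None):
    """Split one class's examples into (train, test) lists.

    If a seed is given, a copy is shuffled first so the split does
    not depend on the original ordering.
    """
    examples = list(examples)
    if seed is not None:
        random.Random(seed).shuffle(examples)
    cutoff = round(len(examples) * train_fraction)
    return examples[:cutoff], examples[cutoff:]

# Toy stand-ins for negative_features and positive_features.
toy_neg = [({"bad": True}, "neg")] * 10
toy_pos = [({"good": True}, "pos")] * 10

neg_train, neg_test = split_by_class(toy_neg)
pos_train, pos_test = split_by_class(toy_pos)

train = neg_train + pos_train
test = neg_test + pos_test
print('train on %d instances, test on %d instances' % (len(train), len(test)))
```

Splitting each class separately keeps the pos/neg ratio the same in the training and testing sets.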

### Classification

We're ready to train our model. One of the simplest Machine
Learning algorithms is the `Naive Bayes Classifier`.

To find the probability of a label, this algorithm first uses
Bayes' rule to express `P(label|features)` in terms of `P(label)`
and `P(features|label)`:

                         P(label) * P(features|label)
    P(label|features) = ------------------------------
                              P(features)

The algorithm then makes the 'naive' assumption that all features are
independent, given the label:

                        P(label) * P(f1|label) * ... * P(fn|label)
    P(label|features) = --------------------------------------------
                                        P(features)

Rather than computing `P(features)` explicitly, the algorithm just
calculates the numerator for each label, and normalizes them so they
sum to one:

                        P(label) * P(f1|label) * ... * P(fn|label)
    P(label|features) = --------------------------------------------
                         SUM[l]( P(l) * P(f1|l) * ... * P(fn|l) )
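
The normalization step can be worked through by hand. The sketch below assumes made-up priors and per-word likelihoods for two labels; the numbers and the `posterior` helper are illustrative only and are not estimated from the corpus or taken from NLTK.

```python
from math import prod

# Made-up probabilities for two labels and two words (illustration only).
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"love": 0.20, "stinks": 0.01},
    "neg": {"love": 0.05, "stinks": 0.10},
}

def posterior(features):
    # Numerator for each label: P(label) * P(f1|label) * ... * P(fn|label)
    scores = {
        label: prior[label] * prod(likelihood[label][f] for f in features)
        for label in prior
    }
    # Normalize so the scores sum to one instead of computing P(features).
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

result = posterior(["stinks"])
print(result)
```

With these numbers the numerators are 0.005 for `pos` and 0.05 for `neg`, so after dividing by their sum (0.055) the `neg` label gets a probability of 10/11.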

In [None]:
classifier = NaiveBayesClassifier.train(training_features)
print('accuracy:', nltk.classify.util.accuracy(classifier, testing_features))
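
Accuracy here is the fraction of test examples whose predicted label matches the known label. A minimal sketch of that calculation follows, using a hypothetical keyword rule, `toy_classify`, in place of the trained classifier and a hand-written test set.

```python
def accuracy(classify, labeled_examples):
    """Fraction of (features, label) pairs that classify() labels correctly."""
    correct = sum(1 for feats, label in labeled_examples if classify(feats) == label)
    return correct / len(labeled_examples)

def toy_classify(feats):
    # Hypothetical stand-in for a trained model: a single keyword rule.
    return "pos" if "love" in feats else "neg"

toy_test_set = [
    ({"love": True, "movie": True}, "pos"),
    ({"stinks": True, "movie": True}, "neg"),
    ({"great": True, "movie": True}, "pos"),  # the rule gets this one wrong
    ({"boring": True}, "neg"),
]

print('accuracy:', accuracy(toy_classify, toy_test_set))
```

The rule labels three of the four toy examples correctly, so the sketch reports 0.75.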
