# Sentiment Analysis with Python

## Background

Sentiment analysis is an NLP technique for determining the
opinion polarity of a given text. Let's apply this technique
to movie reviews.

"I love this movie!" <- (positive)

"This movie really stinks :-(" <- (negative)

### First, import the required libraries

We're using the Python Natural Language Toolkit (NLTK) library,
which includes many datasets as well as NLP and ML algorithms.

In [1]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

### Next, let's preprocess our data

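A common dataset for training sentiment analysis algorithms
is a collection of IMDB movie reviews. The version bundled with NLTK
as the `movie_reviews` corpus contains 2,000 reviews with their
sentiment polarity labeled (pos/neg). If the corpus is not installed
yet, run `nltk.download('movie_reviews')` first.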

In [8]:
negative_ids = movie_reviews.fileids('neg')
positive_ids = movie_reviews.fileids('pos')

print(movie_reviews.sents(negative_ids[0]))

[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]


Let's define a function to create our `features`. Features
are named pieces of data that a learning algorithm can use.
Features can be of different types depending on the algorithm being
used, but they are typically binary or float values. Therefore, a
transform is necessary to convert our textual data into numerical
data.

In [None]:
def word_feats(words):
    # Mark every distinct word as present (a binary "bag of words" feature).
    return {word: True for word in words}
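
To make the transform concrete, here is a small sketch of what `word_feats` returns for a tokenized sentence. The token list is made up for illustration, and the function is repeated so the snippet runs by itself.

```python
def word_feats(words):
    # Map every distinct token to True (a word-presence feature).
    return {word: True for word in words}

# Hypothetical tokenized review, lowercased like the corpus tokens.
sample_tokens = ['i', 'love', 'this', 'movie', '!', 'i', 'really', 'do']
sample_features = word_feats(sample_tokens)

print(sample_features)
print(len(sample_tokens), 'tokens ->', len(sample_features), 'features')
```

Note that word order and word counts are discarded: the repeated token `'i'` collapses into a single key.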

Now, create the positive and negative `features`:

In [None]:
negative_features = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negative_ids]
positive_features = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in positive_ids]

This creates two lists of `(features, label)` pairs, where every
feature dict corresponds to the set of words found in a particular
positive or negative document.

Next, we need to split our labeled data into training and
testing data sets. Why? We want to be able to test how accurate
the model we are going to develop is, and to do that we need
labeled data that the model has not seen during training. An
80/20 split is typical.

In [None]:
neg_cutoff = round(len(negative_features) * 0.80)
pos_cutoff = round(len(positive_features) * 0.80)
training_features = negative_features[:neg_cutoff] + positive_features[:pos_cutoff]
testing_features = negative_features[neg_cutoff:] + positive_features[pos_cutoff:]
print('train on %d instances, test on %d instances' % (len(training_features), len(testing_features)))
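
Splitting each class separately, as above, keeps both labels equally represented in the training and testing sets. Here is a minimal sketch of the same idea on made-up data. `split_by_label` and the toy lists are hypothetical names for illustration, not part of NLTK.

```python
def split_by_label(features, train_fraction=0.80):
    # Cut one class's examples into a training part and a testing part.
    cutoff = int(len(features) * train_fraction)
    return features[:cutoff], features[cutoff:]

# Hypothetical stand-ins for the real feature lists: 10 examples per class.
toy_negative = [({'bad': True}, 'neg')] * 10
toy_positive = [({'good': True}, 'pos')] * 10

neg_train, neg_test = split_by_label(toy_negative)
pos_train, pos_test = split_by_label(toy_positive)
toy_training = neg_train + pos_train
toy_testing = neg_test + pos_test

print('train on %d instances, test on %d instances' % (len(toy_training), len(toy_testing)))
```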

## Classification
We're ready to train our model. One of the simplest Machine Learning algorithms is the Naive Bayes Classifier.

In [None]:
classifier = NaiveBayesClassifier.train(training_features)
print('accuracy:', nltk.classify.util.accuracy(classifier, testing_features))
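
For intuition about what a Naive Bayes classifier computes, here's a rough pure-Python sketch over word-presence features. It is not NLTK's implementation: it assumes add-one smoothing and only scores the words present in the document. The names `train_nb` and `classify_nb` and the toy data are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_features):
    # Count documents per label, and documents per label containing each word.
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for features, label in labeled_features:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
    return label_counts, word_counts

def classify_nb(model, features):
    label_counts, word_counts = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        # log P(label) + sum of log P(word present | label), add-one smoothed.
        score = math.log(n_docs / total_docs)
        for word in features:
            score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        scores[label] = score
    # Pick the label with the highest log-probability score.
    return max(scores, key=scores.get)

# Hypothetical toy training set in the same (features, label) shape as above.
toy_training = [
    ({'great': True, 'movie': True}, 'pos'),
    ({'love': True, 'movie': True}, 'pos'),
    ({'boring': True, 'movie': True}, 'neg'),
    ({'awful': True, 'plot': True}, 'neg'),
]
model = train_nb(toy_training)
print(classify_nb(model, {'great': True, 'movie': True}))
print(classify_nb(model, {'boring': True, 'plot': True}))
```

The "naive" part is the assumption that words occur independently given the label, which is why the per-word log probabilities can simply be added up.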

Can we get any sense of how these decisions are being made?

In [None]:
classifier.show_most_informative_features()

Okay, cool. What about on some new data?

In [None]:
test_reviews = [
"""Wow! That's about all one can say about this movie. The first time that I saw
it I was mesmerized. The movie looked so cool and hey, it actually had a good
plot. If you haven't seen this movie yet, get out from your cave and see it
right away. I have seen this movie umpteen times and it still shocks and
surprises me. """,
"""Anyway, back to the movie. It is as bad as you've no doubt heard. The scene
changes from night to day to night, the spaceship is a hubcap (you can see the
string it hangs from catch on fire at one point), I could do a better job
acting, etc."""]

# word_tokenize relies on NLTK's Punkt tokenizer data, which must be downloaded once.
from nltk.tokenize import word_tokenize

for review in test_reviews:
    review_features = word_feats(word_tokenize(review.lower()))
    label = classifier.classify(review_features)
    prob_results = classifier.prob_classify(review_features)
    prob_str = " ({0:.2}/{1:.2})".format(prob_results.prob("pos"), prob_results.prob("neg"))
    print(review[:25], ": ", label, prob_str)


Next steps: try other classifiers, such as decision trees or support vector machines, and look at ways of improving the features, such as TF-IDF weighting.
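
As one illustration of the TF-IDF idea, here's a rough sketch that weights each word by its frequency in a document, scaled down by how many documents contain it. It assumes the plain `tf * log(N / df)` variant (other variants exist), and `tfidf_feats` and the toy documents are hypothetical names, not an NLTK API.

```python
import math
from collections import Counter

def tfidf_feats(documents):
    # documents: list of token lists. Returns one {word: weight} dict per document.
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word once per document it appears in.
        doc_freq.update(set(tokens))
    weighted = []
    for tokens in documents:
        counts = Counter(tokens)
        weighted.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weighted

# Hypothetical tokenized reviews.
toy_docs = [
    ['great', 'movie', 'great', 'cast'],
    ['boring', 'movie'],
    ['awful', 'movie', 'awful', 'plot'],
]
for feats in tfidf_feats(toy_docs):
    print(feats)
```

A word that appears in every document, like `'movie'` here, gets a weight of zero, while words concentrated in few documents score higher. Numeric weights like these are generally a better fit for classifiers that handle continuous features, such as support vector machines.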