We will be working with a CSV file containing movie reviews. **Each row contains the text of the review, as well as a number indicating whether the tone of the review is positive (1) or negative (-1)**.

We want to predict whether a review is negative or positive, based on the text alone. To do this, we will train an algorithm using the reviews and classifications in train.csv, and then make predictions on the reviews in test.csv. We'll be able to calculate our error using the actual classifications in test.csv to see how good our predictions are.

We'll use Naive Bayes for our classification algorithm. A Naive Bayes classifier (based on Bayes' theorem) works by figuring out how likely data attributes are to be associated with a certain class.
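
For reference, the rule the classifier applies can be written as below. The symbols are notation introduced for this sketch ($y$ for the class, $w_1, \dots, w_n$ for the words in a review) and do not appear in the code in this walkthrough.

$$P(y \mid w_1, \dots, w_n) \propto P(y) \prod_{i=1}^{n} P(w_i \mid y)$$

The "naive" part is the assumption that the words are conditionally independent given the class, which is what allows the likelihood to be written as a simple product.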

# Finding Word Counts

We are trying to determine if we should classify a data row as negative or positive. We have to calculate the probabilities of each classification, and the probabilities of each feature falling into each classification.

All we have is one long string, but we can generate features from it. The easiest way to generate features from text is to split the text up into words. Each word in a movie review will then be a feature that we can work with. To do this, we will split the reviews based on whitespace.

Then, we will determine word frequency in the negative reviews and in the positive reviews. Eventually, we'll use these word frequencies to compute the probability that a new review belongs to one class versus the other.
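
As a small illustration of the splitting and counting steps just described, here is a sketch on a made-up sentence. The names `sample`, `sample_tokens` and `sample_counts` are hypothetical and are not used elsewhere in this walkthrough.

```python
import re
from collections import Counter

# A made-up review fragment, used only for illustration
sample = "a great movie . a great cast"

# Split on runs of whitespace, then count each token
sample_tokens = re.split(r"\s+", sample)
sample_counts = Counter(sample_tokens)

print(sample_tokens)              # ['a', 'great', 'movie', '.', 'a', 'great', 'cast']
print(sample_counts["great"])     # 2
print(sample_counts["terrible"])  # 0, because a Counter returns 0 for unseen words
```

Note that punctuation such as "." ends up as its own token, since the split looks only at whitespace.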

In [8]:
# The Counter class allows us to count how many times items occur in a list
from collections import Counter
import csv
import re

# Read in the training data
with open("train.csv", 'r') as file:
    reviews = list(csv.reader(file))
    
reviews[0]

['plot : two teen couples go to a church party drink and then drive . they get into an accident . one of the guys dies but his girlfriend continues to see him in her life and has nightmares . what\'s the deal ? watch the movie and " sorta " find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea but presents it in a very bad package . which is what makes this review an even harder one to write since i generally applaud films which attempt',
 '-1']

In [13]:
def get_text(reviews, score):
    # Join all reviews with the given score into one lowercase string
    return " ".join([r[0].lower() for r in reviews if r[1] == str(score)])

def count_text(text):
    # Split text into words based on whitespace. words is a list
    words = re.split(r"\s+", text)
    # Count occurrences of each word and return them as a Counter object,
    # e.g. Counter({'the': 3181, '.': 2752, 'a': 1941, 'of': 1649, ...})
    return Counter(words)

negative_text = get_text(reviews, -1)
positive_text = get_text(reviews, 1)

# Generate word counts for negative tone. 
negative_counts = count_text(negative_text)

# Generate word counts for positive tone
positive_counts = count_text(positive_text)

print("Negative text sample: {0}".format(negative_text[:100]))
print("Positive text sample: {0}".format(positive_text[:100]))

Negative text sample: plot : two teen couples go to a church party drink and then drive . they get into an accident . one 
Positive text sample: films adapted from comic books have had plenty of success whether they're about superheroes ( batman


# Making Predictions About Review Classifications

Now that we have the word counts, we just need to convert them to probabilities and multiply them out to predict the classifications.

Let's say we wanted to find the probability that the review "didn't like it" expresses a negative sentiment. We would find the total number of times the word "didn't" occurred in the negative reviews, and divide it by the total number of words in the negative reviews. This gives the probability of the word given the class, P("didn't" | negative). We would then do the same for "like" and "it". Finally, we would multiply all three probabilities together, and multiply the result by the probability of any document expressing a negative sentiment, to get a score for the sentence being negative.

We would do the same for positive sentiment. Then, whichever probability is greater would be the class that the algorithm assigns the review to.
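
To make the arithmetic concrete, here is a sketch of the "didn't like it" calculation for the negative class. All of the numbers are invented for illustration (they do not come from train.csv), and the `toy_` names are hypothetical. Smoothing is left out here to keep the example short.

```python
from collections import Counter

# Hypothetical word counts for the negative class (not from train.csv)
toy_negative_counts = Counter({"didn't": 3, "like": 2, "it": 5, "boring": 10})
toy_total_negative_words = sum(toy_negative_counts.values())  # 20 words in total

# Hypothetical prior: 6 of 10 training reviews are negative
toy_prob_negative = 6 / 10

toy_score = toy_prob_negative
for word in ["didn't", "like", "it"]:
    # P(word | negative) = count of the word in negative reviews / total negative words
    toy_score *= toy_negative_counts[word] / toy_total_negative_words

# 0.6 * (3/20) * (2/20) * (5/20), roughly 0.00225
print(toy_score)
```

The same calculation with positive-class counts and the positive prior gives the competing score, and the larger of the two decides the class.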

To accomplish all of this, we'll need to compute the probabilities of each class occurring in the data, and then write a function that computes the classification.

In [18]:
import re
from collections import Counter

def get_y_count(score):
    # Compute the count of each classification occurring in the data
    return len([r for r in reviews if r[1] == str(score)])

# We'll use these counts for smoothing when computing the prediction
positive_review_count = get_y_count(1)
negative_review_count = get_y_count(-1)

# These are the class probabilities 
prob_positive = positive_review_count / len(reviews)
prob_negative = negative_review_count / len(reviews)

def make_class_prediction(text, counts, class_prob, class_count):
    prediction = 1
    text_counts = Counter(re.split(r"\s+", text))
    for word in text_counts:
        # For every word in the text, we get the number of times that word occurred in the reviews 
        # for a given class, add 1 to smooth the value, and divide by the total number of words in 
        # the class (plus the class_count, also to smooth the denominator)
        
        # Smoothing ensures that we don't multiply the prediction by 0 if the word didn't exist in 
        # the training data
        
        # We also smooth the denominator counts to keep things even
        prediction *=  text_counts.get(word) * ((counts.get(word, 0) + 1) / 
                       (sum(counts.values()) + class_count))
    # Now we multiply by the probability of the class existing in the documents
    return prediction * class_prob

# Now we can generate probabilities for the classes our reviews belong to
# The probabilities themselves aren't very useful -- we make our classification decision based on 
# which value is greater
print("Sampling an actual negative review:\n")
print("Review: {0}".format(reviews[2][0]))
print("Rating: {0}".format(reviews[2][1]))
print("Negative prediction: {0}".format(make_class_prediction(reviews[2][0], negative_counts, prob_negative, negative_review_count)))
print("Positive prediction: {0}".format(make_class_prediction(reviews[2][0], positive_counts, prob_positive, positive_review_count)))

print("\nSampling an actual positive review:\n")
print("Review: {0}".format(reviews[710][0]))
print("Rating: {0}".format(reviews[710][1]))
print("Negative prediction: {0}".format(make_class_prediction(reviews[710][0], negative_counts, prob_negative, negative_review_count)))
print("Positive prediction: {0}".format(make_class_prediction(reviews[710][0], positive_counts, prob_positive, positive_review_count)))

Sampling an actual negative review:

Review: it is movies like these that make a jaded movie viewer thankful for the invention of the timex indiglo watch . based on the late 1960's television show by the same name the mod squad tells the tale of three reformed criminals under the employ of the police to go undercover . however things go wrong as evidence gets stolen and they are immediately under suspicion . of course the ads make it seem like so much more . quick cuts cool music claire dane's nice hair and cute outfits car
Rating: -1
Negative prediction: 5.327907949310682e-234
Positive prediction: 2.0820582627346563e-241

Sampling an actual positive review:

Review: you've got mail works alot better than it deserves to . in order to make the film a success all they had to do was cast two extremely popular and attractive stars have them share the screen for about two hours and then collect the profits . no real acting was involved and there is not an original or inventive bone in it's 
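
The scores printed above are on the order of 1e-234. A longer review could push the product below what a float can represent, at which point both classes would score 0.0. A common remedy is to add log-probabilities instead of multiplying probabilities. The sketch below assumes the same smoothing as `make_class_prediction`. The function name `make_class_log_prediction` and the toy counts are hypothetical. It also differs from the function above in one respect: a word that appears n times contributes n times its log-probability, which is the usual multinomial form, rather than scaling the product by n.

```python
import math
import re
from collections import Counter

def make_class_log_prediction(text, counts, class_prob, class_count):
    # Hypothetical log-space variant of make_class_prediction
    total_words = sum(counts.values())
    text_counts = Counter(re.split(r"\s+", text))
    log_prediction = math.log(class_prob)
    for word, n in text_counts.items():
        # Same add-one smoothing as above, with class_count added to the denominator
        word_prob = (counts.get(word, 0) + 1) / (total_words + class_count)
        # A word seen n times contributes n * log(P(word | class))
        log_prediction += n * math.log(word_prob)
    return log_prediction

# Toy counts, made up for illustration
toy_negative = Counter({"bad": 8, "plot": 2})
toy_positive = Counter({"good": 8, "plot": 2})

toy_neg_score = make_class_log_prediction("bad bad plot", toy_negative, 0.5, 5)
toy_pos_score = make_class_log_prediction("bad bad plot", toy_positive, 0.5, 5)
print(toy_neg_score > toy_pos_score)  # True: the negative class scores higher
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the raw products would, as long as neither product has underflowed.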

# Predicting the Test Set

Now that we can make predictions, let's predict the probabilities for the reviews in test.csv. We may get misleadingly good results if we predict on the reviews in train.csv, because we used that data set to generate the probabilities in the first place (so the algorithm has prior knowledge about the data it's predicting on).

Getting good results on the training set could mean that our model is overfit, and just picking up random noise. Testing on a set the model wasn't trained with is the only way to tell if it's performing properly.

In [4]:
import csv

def make_decision(text, make_class_prediction):
    # Compute the negative and positive probabilities
    negative_prediction = make_class_prediction(text, negative_counts, prob_negative, negative_review_count)
    positive_prediction = make_class_prediction(text, positive_counts, prob_positive, positive_review_count)

    # We assign a classification based on which probability is greater
    if negative_prediction > positive_prediction:
        return -1
    return 1

with open("test.csv", 'r') as file:
    test = list(csv.reader(file))

predictions = [make_decision(r[0], make_class_prediction) for r in test]

# Computing Prediction Error

Now that we know the predictions, we'll measure how good they are using the area under the ROC curve (AUC). An AUC of 0.5 is what random guessing achieves, and the closer the value is to 1, the better the model separates the two classes.

Tracking a metric like this is how you tell whether your model is "good," and whether a change makes it better or worse.

In [5]:
actual = [int(r[1]) for r in test]

from sklearn import metrics

# Generate the ROC curve using scikit-learn
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)

# Measure the area under the curve
# The closer to 1 it is, the "better" the predictions
print("AUC of the predictions: {0}".format(metrics.auc(fpr, tpr)))

AUC of the predictions: 0.680701754385965
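
One detail worth keeping in mind: `predictions` holds only the hard labels -1 and 1, not scores, so the ROC curve has a single operating point and the AUC reduces to the average of the true positive rate and the true negative rate. The pure-Python sketch below illustrates this on made-up labels. `pairwise_auc` is a hypothetical helper written for this example and is not part of scikit-learn.

```python
def pairwise_auc(actual, scores, pos_label=1):
    # AUC as the probability that a random positive outranks a random negative
    # (ties count as half); hypothetical helper, not part of scikit-learn
    pos = [s for a, s in zip(actual, scores) if a == pos_label]
    neg = [s for a, s in zip(actual, scores) if a != pos_label]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and hard predictions
toy_actual = [1, 1, 1, 1, -1, -1, -1, -1]
toy_predictions = [1, 1, 1, -1, -1, -1, 1, 1]

toy_tpr = 3 / 4  # 3 of the 4 positives were predicted as 1
toy_tnr = 2 / 4  # 2 of the 4 negatives were predicted as -1

print(pairwise_auc(toy_actual, toy_predictions))  # 0.625
print((toy_tpr + toy_tnr) / 2)                    # 0.625
```

Passing continuous scores (for example, the difference between the two class scores) instead of hard labels would let the curve reflect how confident each prediction is.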


# A Faster Way to Make Predictions

There are a lot of extensions we could add to this algorithm to make it perform better. We could look at n-grams instead of unigrams, for example. We could also remove punctuation and other non-word characters. We could remove stop words, or perform stemming or lemmatization.
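
A rough sketch of three of these extensions (dropping punctuation, removing stop words, and building bigrams) might look like the following. The stop-word list is a tiny made-up sample, and `tokenize` and `bigrams` are hypothetical helpers, not functions used elsewhere in this walkthrough.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer
STOP_WORDS = {"the", "a", "is", "it", "of"}

def tokenize(text):
    # Lowercase, then keep runs of letters and apostrophes (this drops punctuation)
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bigrams(tokens):
    # Pair each token with its successor
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

sample_words = tokenize("The plot is thin , but it didn't bore me .")
print(sample_words)           # ['plot', 'thin', 'but', "didn't", 'bore', 'me']
print(bigrams(sample_words))  # ['plot thin', 'thin but', "but didn't", "didn't bore", 'bore me']
```

Either list could be passed to `Counter` in place of the whitespace-split words to produce the per-class counts.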

An easier way to use Naive Bayes is the implementation in scikit-learn, a Python machine learning library that provides implementations of many common machine learning algorithms.

In [39]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# Generate counts from text using a vectorizer  
# We can choose from other available vectorizers, and set many different options
# This code performs our step of computing word counts
vectorizer = CountVectorizer(stop_words='english', max_df=.05)
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])

# Fit a Naive Bayes model to the training data
# This will train the model using the word counts we computed and the existing classifications in the training set
nb = MultinomialNB()
nb.fit(train_features, [int(r[1]) for r in reviews])

# Now we can use the model to predict classifications for our test features
predictions = nb.predict(test_features)

# Evaluation of the performance on the test set

In [29]:
# Compute the AUC of the scikit-learn predictions
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)

# Measure the area under the curve
# The closer to 1 it is, the "better" the predictions
print("AUC of the predictions: {0}".format(metrics.auc(fpr, tpr)))

AUC of the predictions: 0.635500515995872
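
The AUC above is computed from hard class labels. As a hedged sketch of an alternative, the example below scores a made-up miniature corpus with `predict_proba` and passes the positive-class probabilities to `roc_auc_score`. The corpus and all `toy_` names are invented for illustration, and the vectorizer is used with default settings rather than the options chosen above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Made-up miniature corpus, for illustration only
toy_train = ["great fun film", "loved the cast", "dull and slow", "boring weak plot"]
toy_train_labels = [1, 1, -1, -1]
toy_test = ["great cast", "slow boring film"]
toy_test_labels = [1, -1]

toy_vectorizer = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_train), toy_train_labels)

# Column order of predict_proba follows toy_nb.classes_, so look up the positive class
positive_column = list(toy_nb.classes_).index(1)
toy_scores = toy_nb.predict_proba(toy_vectorizer.transform(toy_test))[:, positive_column]

toy_auc = roc_auc_score(toy_test_labels, toy_scores)
print(toy_auc)
```

With probabilities as scores, the ROC curve can have one point per distinct score rather than a single operating point.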


In [37]:
# Accuracy: the fraction of predictions that match the actual labels
import numpy as np
target_review = [int(r[1]) for r in test]
np.mean(predictions == target_review)

0.63451776649746194

# Using SVM

Let’s see if we can do better with a linear support vector machine (SVM), which is widely regarded as one of the best text classification algorithms (although it’s also a bit slower than Naive Bayes).

In [36]:
from sklearn.svm import SVC

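# Use a linear kernel to match the text; SVC() defaults to an RBF kernel
svc = SVC(kernel="linear")
svc.fit(train_features, [int(r[1]) for r in reviews])
# Predict with the SVM itself. The accuracy recorded below equals the Naive Bayes
# figure because the original cell called nb.predict here; rerun for the SVM result.
predictions = svc.predict(test_features)
np.mean(predictions == target_review)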

0.63451776649746194