## Sentiment Analysis of Rotten Tomatoes Reviews using Naive Bayes

#### Import nltk, the Natural Language Toolkit

NLTK is one of the most popular Python packages for natural language processing on text data. It provides APIs to access a large collection of corpora and other lexical resources.

In [1]:
import numpy as np
import nltk

### Extract the review text and corresponding sentiment label from the review files

The dataset is available for download from this site: http://www.cs.cornell.edu/people/pabo/movie-review-data/

Look for the entry "sentence polarity dataset v1.0" (includes sentence polarity dataset README v1.0: 5331 positive and 5331 negative processed sentences / snippets. Introduced in Pang/Lee ACL 2005. Released July 2005.)

Download this archive, then unzip and untar it.

**Store the files with positive and negative reviews (rt-polarity.pos and rt-polarity.neg) in the same directory as this code.**


In [2]:
def get_reviews(path, positive=True):
    # Label positive reviews 1 and negative reviews 0
    label = 1 if positive else 0

    with open(path, 'r') as f:
        review_text = f.readlines()

    # Pair each review's text with its sentiment label
    reviews = []
    for text in review_text:
        reviews.append((text, label))

    return reviews

In [3]:
def extract_reviews():
    positive_reviews = get_reviews('rt-polarity.pos', positive=True)
    negative_reviews = get_reviews('rt-polarity.neg', positive=False)
    
    return positive_reviews, negative_reviews

In [4]:
positive_reviews, negative_reviews = extract_reviews()

In [5]:
len(positive_reviews)

5331

In [6]:
len(negative_reviews)

5331

In [7]:
positive_reviews[:2]

[('the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \n',
  1),
 ('the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . \n',
  1)]

In [8]:
TRAIN_DATA = 5000
TOTAL_DATA = len(positive_reviews)

train_reviews = positive_reviews[:TRAIN_DATA] + negative_reviews[:TRAIN_DATA]

test_positive_reviews = positive_reviews[TRAIN_DATA:TOTAL_DATA]
test_negative_reviews = negative_reviews[TRAIN_DATA:TOTAL_DATA]

In [9]:
len(train_reviews)

10000

#### Get a list of all the unique words in the training data (the vocabulary)

In [10]:
def get_vocabulary(train_reviews):
    words_set = set()
    
    for review in train_reviews:
        words_set.update(review[0].split())
        
    return list(words_set)


vocabulary = get_vocabulary(train_reviews)

In [11]:
len(vocabulary)

20704

In [12]:
vocabulary[:5]

["crowd-pleaser's", 'riveting', 'satisfactory', 'oblivious', "cq's"]

### Represent the words in the review as a feature vector

* *review_text* The review in text form

Each review is represented as a dictionary whose keys are all the words in the vocabulary. The value for each key is True if the word is present in the review and False otherwise.

In [13]:
def extract_features(review_text):
    # Split the review into words, and create a set of the words
    review_words = set(review_text.split())
    
    features = {}
    for word in vocabulary:
        features[word] = (word in review_words)
        
    return features    
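
To make the representation concrete, here is a minimal sketch on a made-up three-word vocabulary. The names `toy_vocabulary` and `toy_extract_features` are illustrative and not part of the notebook; the logic mirrors `extract_features`, except that the vocabulary is passed in as an argument.

```python
# Hypothetical toy vocabulary; the real one has over 20,000 entries
toy_vocabulary = ['amazing', 'boring', 'movie']

def toy_extract_features(review_text, vocabulary):
    # Split on whitespace and record, for every vocabulary word,
    # whether it occurs in the review
    review_words = set(review_text.split())
    return {word: (word in review_words) for word in vocabulary}

features = toy_extract_features('an amazing movie', toy_vocabulary)
print(features)
# {'amazing': True, 'boring': False, 'movie': True}
```

Because the text is split on whitespace only and compared case-sensitively, a token such as `movie!` or `Amazing` would not match the vocabulary entries `movie` or `amazing`. The dataset itself is lowercased, with punctuation separated by spaces, so free-form input may need the same treatment to match.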

#### Map feature vector to labels

* *extract_features* Function to extract the features in feature vector form
* *train_reviews* Training dataset, a list of tuples of the form (review_text, label)

In [14]:
train_features = nltk.classify.apply_features(extract_features, train_reviews)
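
According to the NLTK documentation, `apply_features` on labeled input behaves like a list of `(featureset, label)` pairs, but computes each featureset lazily so that the full set of feature dictionaries is never held in memory at once. The eager sketch below only illustrates the shape of the result on two made-up reviews; `toy_words`, `toy_reviews`, and `toy_features_for` are hypothetical names.

```python
# Hypothetical toy data standing in for the vocabulary and train_reviews
toy_words = ['good', 'bad']
toy_reviews = [('a good film', 1), ('a bad film', 0)]

def toy_features_for(review_text):
    # Boolean presence features over the toy vocabulary
    review_words = set(review_text.split())
    return {word: (word in review_words) for word in toy_words}

# Eager equivalent of mapping the feature function over labeled reviews
toy_train_features = [(toy_features_for(text), label) for text, label in toy_reviews]
print(toy_train_features[0])
# ({'good': True, 'bad': False}, 1)
```

With roughly 20,000 vocabulary words and 10,000 training reviews, building every dictionary up front as this sketch does would be memory-hungry, which is the motivation for the lazy version.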

#### Train the classifier on the training data

In [15]:
trained_classifier = nltk.NaiveBayesClassifier.train(train_features)
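
For intuition about what training computes, here is a from-scratch sketch of a Naive Bayes classifier over boolean features. It is not NLTK's implementation: the add-one smoothing, the function names `train_naive_bayes` and `classify`, and the toy data are all assumptions made for illustration, and NLTK's own probability estimator may differ.

```python
import math
from collections import defaultdict

def train_naive_bayes(labeled_features):
    """Count labels and (label, feature, value) triples from (featureset, label) pairs."""
    label_counts = defaultdict(int)
    value_counts = defaultdict(int)
    for features, label in labeled_features:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name, value)] += 1
    return label_counts, value_counts

def classify(features, label_counts, value_counts):
    """Return the label with the highest log posterior (up to a constant)."""
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, count in label_counts.items():
        # Start from the log prior, log P(label)
        score = math.log(count / total)
        for name, value in features.items():
            # Add log P(feature = value | label), with add-one smoothing
            # over the two possible boolean values
            score += math.log((value_counts[(label, name, value)] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set: (featureset, label) pairs
toy_data = [
    ({'good': True, 'bad': False}, 1),
    ({'good': True, 'bad': False}, 1),
    ({'good': False, 'bad': True}, 0),
    ({'good': False, 'bad': True}, 0),
]
label_counts, value_counts = train_naive_bayes(toy_data)
print(classify({'good': True, 'bad': False}, label_counts, value_counts))
# 1
```

The "naive" part is the inner loop: each feature contributes its own term independently of the others, given the label.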

#### Classify and measure the accuracy of the model on test data

In [16]:
def sentiment_calculator(review_text):
    features = extract_features(review_text)
    return trained_classifier.classify(features)

In [17]:
sentiment_calculator('What an amazing movie!')

1

In [18]:
sentiment_calculator('Light travels faster than sound. This is why some people appear bright until they speak.')

0

In [19]:
sentiment_calculator('I don’t believe in plastic surgery, But in your case, Go ahead.')

0

In [20]:
sentiment_calculator('I am not young enough to know everything.')

1

In [21]:
def classify_test_reviews(test_positive_reviews, test_negative_reviews, sentiment_calculator):
    # Predict a label for every held-out review
    positive_results = [sentiment_calculator(review[0]) for review in test_positive_reviews]
    negative_results = [sentiment_calculator(review[0]) for review in test_negative_reviews]

    # Count the predictions that match the true label (1 = positive, 0 = negative)
    true_positive = sum(x > 0 for x in positive_results)
    true_negative = sum(x == 0 for x in negative_results)

    percent_true_positive = float(true_positive) / len(positive_results)
    percent_true_negative = float(true_negative) / len(negative_results)

    total_accurate = true_positive + true_negative
    total = len(positive_results) + len(negative_results)

    print('Accuracy on positive reviews = ' + '%.2f' % (percent_true_positive * 100) + '%')
    print('Accuracy on negative reviews = ' + '%.2f' % (percent_true_negative * 100) + '%')
    print('Overall Accuracy = ' + '%.2f' % (total_accurate * 100 / total) + '%')

In [22]:
classify_test_reviews(test_positive_reviews, test_negative_reviews, sentiment_calculator)

Accuracy on positive reviews = 78.25%
Accuracy on negative reviews = 80.66%
Overall Accuracy = 79.46%
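
The "accuracy on positive reviews" reported above is the recall of the positive class. As a complement, the sketch below computes accuracy, precision, and recall from lists of true and predicted labels. The function name `binary_metrics` and the toy label lists are hypothetical and are not results from this notebook.

```python
def binary_metrics(gold, predicted):
    """Return (accuracy, precision, recall) for label 1 as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical toy labels: four positive and four negative reviews
toy_gold = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 1]
print(binary_metrics(toy_gold, toy_pred))
# (0.625, 0.6, 0.75)
```

Returning the numbers instead of printing them makes it easier to compare runs, for example after changing the tokenization or the size of the training split.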
