### Basic Imports

In [1]:
import numpy as np
import pandas as pd
import os

import matplotlib.pyplot as plt

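#Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer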
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib inline

### Loading Data

In [2]:
#set relative path to data
DATA_PATH = os.path.join("..", "data")

def load_movie_reviews(path=DATA_PATH, train=True):
    """Load the labeled training set (train=True) or the unlabeled test set (train=False)."""
    if train:
        tsv_path = os.path.join(path, "labeledTrainData.tsv")
    else:
        tsv_path = os.path.join(path, "testData.tsv")
    #quoting=3 is csv.QUOTE_NONE, so double quotes in the file are kept as plain text
    return pd.read_csv(tsv_path, delimiter='\t', header=0, quoting=3)
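
The effect of `quoting=3` can be seen on a tiny in-memory sample. This is only a sketch: `sample_tsv` is a hypothetical two-row stand-in for the real file, not data from the competition.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout as the real file.
sample_tsv = (
    'id\tsentiment\treview\n'
    '"1_8"\t1\t"A "great" film"\n'
    '"2_3"\t0\t"Not for me"\n'
)

# quoting=3 is the numeric value of csv.QUOTE_NONE.
assert csv.QUOTE_NONE == 3

sample = pd.read_csv(io.StringIO(sample_tsv), delimiter='\t', header=0,
                     quoting=csv.QUOTE_NONE)

# The double quotes are not interpreted; they stay part of each value.
print(sample['id'].tolist())
print(sample['review'][0])
```

This is why the `id` and `review` values shown in the outputs of this notebook are still wrapped in double quotes.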

In [3]:
#load dataframe and make copy
reviews_orig = load_movie_reviews()
reviews = reviews_orig.copy()

#load test dataframe and make a copy
review_test_orig = load_movie_reviews(path=DATA_PATH, train=False)
review_test = review_test_orig.copy()

### Basic Data Exploration

It looks like our training data has 25,000 movie reviews with the following columns:

- id
- sentiment
- review

In [4]:
reviews.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


In [5]:
#get examples of a positive and negative review
pos_review = reviews['review'][0]
neg_review = reviews['review'][2]

In [6]:
print(pos_review)

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [7]:
print(neg_review)

"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against on

### Checking Class Balance

In [8]:
class_values = reviews['sentiment'].value_counts()

negative_class_count = class_values[0]
positive_class_count = class_values[1]

print("Positive Class Count: {}, Negative Class Count: {}".format(positive_class_count, negative_class_count))
print("Bad Review Percentage: {0:.2f}%".format(negative_class_count/len(reviews) * 100))
print("Good Review Percentage: {0:.2f}%".format(positive_class_count/len(reviews) * 100))

Positive Class Count: 12500, Negative Class Count: 12500
Bad Review Percentage: 50.00%
Good Review Percentage: 50.00%
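
As a possible shortcut, `value_counts(normalize=True)` returns the class shares directly. The sketch below uses `toy_sentiment`, a made-up stand-in for `reviews['sentiment']`, so it runs without the dataset.

```python
import pandas as pd

# Hypothetical stand-in for reviews['sentiment'] (three positive, three negative).
toy_sentiment = pd.Series([1, 0, 1, 0, 0, 1])

# normalize=True returns fractions instead of raw counts; scale to percentages.
class_share = toy_sentiment.value_counts(normalize=True) * 100

print("Good Review Percentage: {0:.2f}%".format(class_share[1]))
print("Bad Review Percentage: {0:.2f}%".format(class_share[0]))
```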


### Clean and Tokenize Movie Reviews

In [16]:
#vectorize all reviews: lowercase the text, drop English stop words, keep the 5000 most frequent terms
vectorizer = CountVectorizer(
    stop_words='english',
    lowercase=True,
    max_features=5000
)

bag_of_words_matrix = vectorizer.fit_transform(reviews['review'])
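
To make the bag-of-words step concrete, here is a minimal sketch on two made-up sentences. The names `toy_docs`, `toy_vectorizer`, and `toy_matrix` are illustrative only; the settings mirror the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents standing in for the review text.
toy_docs = ["The movie was great, great fun", "The movie was boring"]

toy_vectorizer = CountVectorizer(stop_words='english', lowercase=True,
                                 max_features=5000)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index in the matrix.
print(sorted(toy_vectorizer.vocabulary_))

# One row per document, one column per term, each cell a term count.
print(toy_matrix.toarray())
```

Stop words such as "the" are dropped before counting, so they never get a column in the matrix.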

In [10]:
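#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
vocab = vectorizer.get_feature_names_out().tolist()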
vocab[:20]

['00',
 '000',
 '10',
 '100',
 '11',
 '12',
 '13',
 '13th',
 '14',
 '15',
 '16',
 '17',
 '18',
 '1930',
 '1930s',
 '1933',
 '1936',
 '1940',
 '1950',
 '1950s']

<br>
**Note:** We capped the bag-of-words vocabulary at 5,000 terms, but the first 20 entries are all numeric tokens such as '00' and '1930s'. Numbers are unlikely to carry much predictive power for sentiment, so it makes sense to filter them out.
<br>
**Note:** Based on the two example reviews above, we also need to filter out the HTML tags that are mixed into the review text.
<br>
<br>


In [11]:
#remove all digit sequences from reviews (regex=True is required on pandas >= 2.0, where it defaults to False)
reviews['review'] = reviews['review'].str.replace(pat='[0-9]+', repl='', regex=True)
#remove all HTML tags from reviews
reviews['review'] = reviews['review'].str.replace(pat='<[^<]+?>', repl='', regex=True)
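
The two patterns can be checked on a small made-up Series before running them on the full dataset. `toy_reviews` below is hypothetical, and `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string otherwise.

```python
import pandas as pd

# Made-up review snippets containing digits and <br /> tags.
toy_reviews = pd.Series(["Rated 10 out of 10.<br /><br />Loved it", "Made in 1936"])

# Step 1: strip runs of digits.
no_digits = toy_reviews.str.replace(pat='[0-9]+', repl='', regex=True)

# Step 2: strip anything that looks like an HTML tag.
cleaned = no_digits.str.replace(pat='<[^<]+?>', repl='', regex=True)

print(cleaned.tolist())
# → ['Rated  out of .Loved it', 'Made in ']
```

Because the replacement string is empty, text on either side of a removed tag is joined together (`.Loved`). Replacing with a single space instead would be one way to avoid that, at the cost of a slightly different vocabulary.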

In [12]:
bag_of_words_matrix = vectorizer.fit_transform(reviews['review'])

In [13]:
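#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
vocab = vectorizer.get_feature_names_out().tolist()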
vocab[:20]

['abandoned',
 'abc',
 'abilities',
 'ability',
 'able',
 'abraham',
 'absence',
 'absent',
 'absolute',
 'absolutely',
 'absurd',
 'absurdity',
 'abuse',
 'abused',
 'abusive',
 'abysmal',
 'academy',
 'accent',
 'accents',
 'accept']

<br>
**Note:** Now the vocab list looks a lot better!
<br>
<br>

### Generating labels for test data

**Note:** In the training dataframe, every positive review (sentiment == 1) has an ID whose suffix after the underscore is 5 or greater, so we can apply the same rule to the test IDs to generate labels.

In [14]:
review_test.head()

| | id | review |
|---|---|---|
| 0 | "12311_10" | "Naturally in a film who's main themes are of ... |
| 1 | "8348_2" | "This movie is a disaster within a disaster fi... |
| 2 | "5828_4" | "All in all, this is a movie for kids. We saw ... |
| 3 | "7186_2" | "Afraid of the Dark left me with the impressio... |
| 4 | "12128_7" | "A very accurate depiction of small time mob l... |


In [15]:
sentiment = [1 if int(x[1].strip('"')) >= 5 else 0 for x in review_test['id'].str.split('_')]
review_test['sentiment'] = sentiment

review_test.head()

| | id | review | sentiment |
|---|---|---|---|
| 0 | "12311_10" | "Naturally in a film who's main themes are of ... | 1 |
| 1 | "8348_2" | "This movie is a disaster within a disaster fi... | 0 |
| 2 | "5828_4" | "All in all, this is a movie for kids. We saw ... | 0 |
| 3 | "7186_2" | "Afraid of the Dark left me with the impressio... | 0 |
| 4 | "12128_7" | "A very accurate depiction of small time mob l... | 1 |
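
The same labeling rule can also be written without a Python-level loop. This is only a sketch of an alternative, using `str.extract` on a small frame (`toy_test`, a hypothetical name) built from three of the IDs shown above.

```python
import pandas as pd

# Three IDs copied from the test output above, surrounding quotes included.
toy_test = pd.DataFrame({'id': ['"12311_10"', '"8348_2"', '"12128_7"']})

# Capture the digits after the underscore; the trailing quote is not matched.
rating = toy_test['id'].str.extract(r'_(\d+)', expand=False).astype(int)

# A suffix of 5 or greater is treated as positive, matching the rule above.
toy_test['sentiment'] = (rating >= 5).astype(int)

print(toy_test)
```

The result should match the list comprehension for these rows (1, 0, 1), and the regex avoids having to strip the trailing quote by hand.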
