### Basic Imports

In [16]:
import numpy as np
import pandas as pd
import os

import matplotlib.pyplot as plt

# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib inline

# to make this notebook's output stable across runs
np.random.seed(42)

### Loading Data

In [2]:
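DATA_PATH = os.path.join("..", "data")

def load_movie_reviews(path=DATA_PATH, train=True):
    """Load the labeled training set (train=True) or the test set as a DataFrame."""
    if train:
        tsv_path = os.path.join(path, "labeledTrainData.tsv")
    else:
        tsv_path = os.path.join(path, "testData.tsv")
    # quoting=3 is csv.QUOTE_NONE: double quotes are kept as literal characters
    return pd.read_csv(tsv_path, delimiter='\t', header=0, quoting=3)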

In [3]:
# load the training dataframe
reviews_orig = load_movie_reviews()
# work on a copy so the original stays untouched
reviews = reviews_orig.copy()

### Basic Data Exploration

Our training data has 25,000 movie reviews with the following columns:

- id
- sentiment
- review
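
The row count and column names above can be checked programmatically. The following is a minimal sketch that assumes a tiny stand-in frame (`sample`, hypothetical) with the same three columns instead of the real file; `summarize` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real training frame (same column names).
sample = pd.DataFrame({
    "id": ['"5814_8"', '"2381_9"', '"7759_3"'],
    "sentiment": [1, 1, 0],
    "review": ['"first review"', '"second review"', '"third review"'],
})

def summarize(df):
    """Return the row count, column names, and number of rows per sentiment label."""
    return {
        "n_rows": len(df),
        "columns": list(df.columns),
        "label_counts": df["sentiment"].value_counts().to_dict(),
    }

summary = summarize(sample)
print(summary)
```

Calling `summarize(reviews)` on the real data would report the actual row count and how many reviews carry each sentiment label.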

In [5]:
reviews.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


In [13]:
pos_review = reviews['review'][0]
neg_review = reviews['review'][2]

In [14]:
print(pos_review)

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [15]:
print(neg_review)

"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against on

### Clean and Tokenize Movie Reviews
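
The printed reviews contain `<br />` tags and are wrapped in literal double quotes (a side effect of `quoting=3`). With its default token pattern, `CountVectorizer` would likely count `br` as a word, so it can help to strip the markup first. The following is one possible sketch using only the standard library; `clean_review` is a hypothetical helper and the regex is a simple assumption, not a full HTML parser.

```python
import re

def clean_review(text):
    """Remove HTML tags and the enclosing double quotes, then normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # replace tags such as <br /> with a space
    text = text.strip().strip('"')             # drop the literal quotes around the review
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

example = '"Visually impressive.<br /><br />The actual feature film bit"'
print(clean_review(example))
```

In the notebook this could be applied with `reviews['review'].apply(clean_review)` before vectorizing.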

In [32]:
vectorizer = CountVectorizer(
    stop_words='english',
    lowercase=True,
    max_features=5000,
)

bag_of_words_matrix = vectorizer.fit_transform(reviews['review'])
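
To see what `fit_transform` produces, here is a small sketch on a two-document toy corpus (the documents and the `toy_` names are made up for illustration). Each row of the resulting sparse matrix is a document and each column is a vocabulary term; the cell value is how often that term occurs in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = ["the movie was great great fun", "the movie was boring"]

toy_vectorizer = CountVectorizer(stop_words="english", lowercase=True, max_features=5000)
toy_matrix = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index in the matrix.
great_col = toy_vectorizer.vocabulary_["great"]
print(sorted(toy_vectorizer.vocabulary_))
print(toy_matrix.shape)
print(toy_matrix[0, great_col])  # occurrences of "great" in the first document
```

Stop words such as "the" are dropped from the vocabulary, and `max_features` only has an effect once the corpus contains more distinct terms than the cap.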

In [35]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()
len(vocab)

5000
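
A natural next check is which vocabulary terms occur most often. The sketch below sums each column of the count matrix and sorts the totals; it uses an invented toy corpus, and `top_terms` is a hypothetical helper rather than anything from the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_terms(vectorizer, matrix, k=3):
    """Return the k terms with the highest total count across all documents."""
    counts = np.asarray(matrix.sum(axis=0)).ravel()  # total count per column
    index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}
    order = np.argsort(counts)[::-1][:k]             # column indices, largest first
    return [(index_to_term[i], int(counts[i])) for i in order]

# Toy corpus, invented for illustration only.
docs = ["good film good cast", "bad film", "good plot"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

top = top_terms(vec, X, k=2)
print(top)
```

On the real data the same helper could be called as `top_terms(vectorizer, bag_of_words_matrix, k=20)`.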