#### Introduction

This notebook follows the Kaggle tutorial on the Bag of Words model for sentiment analysis. The tutorial linked below gives detailed instructions, so I will not repeat the explanations here; everything I do in this notebook comes from it.

Tutorial link:
https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words

In [1]:
import sys
sys.path.append('/home/hoanvu/anaconda2/envs/ds/lib/python2.7/site-packages/')

import re
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

In [3]:
reviews = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)
reviews.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


#### Cleaning a single review

Before cleaning every review, let's work out the steps needed to clean one review:

- Clean HTML tags
- Convert review to lowercase
- Remove unwanted characters
- Remove stop words
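
The four steps above can be sketched on a toy string using only the standard library. This is an illustration under simplifying assumptions: the regex tag stripper is a crude stand-in for BeautifulSoup, and `TOY_STOPWORDS` and `toy_clean` are hypothetical names with a tiny hand-picked stop-word list rather than NLTK's English list.

```python
import re

# Hypothetical mini stop-word list; the notebook itself uses NLTK's English list
TOY_STOPWORDS = {'the', 'a', 'is', 'and', 'this', 'it'}

def toy_clean(raw_text):
    # 1. Strip HTML tags (crude regex; BeautifulSoup is more robust)
    text = re.sub(r'<[^>]+>', ' ', raw_text)
    # 2. Replace every character that is not a letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # 3. Lowercase and split on whitespace
    words = text.lower().split()
    # 4. Drop stop words and join the rest back together
    return ' '.join(word for word in words if word not in TOY_STOPWORDS)

cleaned = toy_clean('This movie is GREAT!<br />The plot... 10/10 and fun.')
# cleaned == 'movie great plot fun'
```

Note that digits and punctuation disappear entirely, so a rating such as "10/10" leaves no trace in the cleaned text.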

In [4]:
# Take a look at a sample review
sample_review = reviews.review[2]
sample_review

'"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like \xc2\xa8Jurassik Park\xc2\xa8, and some scientists resurrect one of nature\'s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and 

In [5]:
# Clean all HTML tags
sample_review = BeautifulSoup(sample_review, 'html.parser').get_text()
sample_review

u'"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like \xa8Jurassik Park\xa8, and some scientists resurrect one of nature\'s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight a

In [6]:
# Remove all characters which are not letters
sample_review = re.sub(r'[^a-zA-Z]', ' ', sample_review)
sample_review

u' The film starts with a manager  Nicholas Bell  giving welcome investors  Robert Carradine  to Primal Park   A secret project mutating a primal animal using fossilized DNA  like  Jurassik Park   and some scientists resurrect one of nature s most fearsome predators  the Sabretooth tiger or Smilodon   Scientific ambition turns deadly  however  and when the high voltage fence is opened the creature escape and begins savagely stalking its prey   the human visitors   tourists and scientific Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre historical animals which are deadlier and bigger   In addition   a security agent  Stacy Haiduk  and her mate  Brian Wimmer  fight hardly against the carnivorous Smilodons  The Sabretooths  themselves   of course  are the real star stars and they are astounding terrifyingly though not convincing  The giant animals savagely are stalking its prey and the group run afoul and fight against 

In [7]:
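# Remove all stopwords inside a review (build the set once instead of
# reloading the stopword list for every word)
english_stopwords = set(stopwords.words('english'))
sample_review_words = [word for word in sample_review.lower().split(' ')
                       if word and word not in english_stopwords]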

# Let's see our sample review after cleaning
' '.join(sample_review_words)

u'film starts manager nicholas bell giving welcome investors robert carradine primal park secret project mutating primal animal using fossilized dna like jurassik park scientists resurrect one nature fearsome predators sabretooth tiger smilodon scientific ambition turns deadly however high voltage fence opened creature escape begins savagely stalking prey human visitors tourists scientific meanwhile youngsters enter restricted area security center attacked pack large pre historical animals deadlier bigger addition security agent stacy haiduk mate brian wimmer fight hardly carnivorous smilodons sabretooths course real star stars astounding terrifyingly though convincing giant animals savagely stalking prey group run afoul fight one nature fearsome predators furthermore third sabretooth dangerous slow stalks victims movie delivers goods lots blood gore beheading hair raising chills full scares sabretooths appear mediocre special effects story provides exciting stirring entertainment resu

It is more useful to combine the steps above into a single function, so that every review can be cleaned with one call:

In [8]:
# Build the stopword set once; a set makes membership checks fast
all_stopwords = set(stopwords.words('english'))

def clean_review(raw_review):
    # Remove all HTML tags present in the review
    review_without_html = BeautifulSoup(raw_review, 'html.parser').get_text()

    # Replace every character that is not a letter with a space
    review_with_only_letters = re.sub(r'[^a-zA-Z]', ' ', review_without_html)

    # Lowercase, split into words, and drop empty strings and stopwords
    words = [word for word in review_with_only_letters.lower().split(' ')
             if word and word not in all_stopwords]

    # Join the remaining words back into a single cleaned string
    return ' '.join(words)

Now let's test the function. It should return exactly what the step-by-step cells above produced:

In [9]:
clean_review(reviews.review[2])

u'film starts manager nicholas bell giving welcome investors robert carradine primal park secret project mutating primal animal using fossilized dna like jurassik park scientists resurrect one nature fearsome predators sabretooth tiger smilodon scientific ambition turns deadly however high voltage fence opened creature escape begins savagely stalking prey human visitors tourists scientific meanwhile youngsters enter restricted area security center attacked pack large pre historical animals deadlier bigger addition security agent stacy haiduk mate brian wimmer fight hardly carnivorous smilodons sabretooths course real star stars astounding terrifyingly though convincing giant animals savagely stalking prey group run afoul fight one nature fearsome predators furthermore third sabretooth dangerous slow stalks victims movie delivers goods lots blood gore beheading hair raising chills full scares sabretooths appear mediocre special effects story provides exciting stirring entertainment resu

Now, let's start cleaning every review in the training data set:

In [10]:
all_clean_reviews = [clean_review(review) for review in reviews.review]

In [11]:
len(all_clean_reviews)

25000

We use `CountVectorizer` from scikit-learn to extract features from the training data. `CountVectorizer` implements the Bag-of-Words model.

In the code below, `max_features=5000` means that the vocabulary is limited to the 5000 most frequent words in the training data. Each review is then converted to a 1D vector of 5000 word counts, one per vocabulary word. Here is an example of such a vector when `max_features=10`:

[3 1 0 0 2 3 1 1 1 0]
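
As a rough illustration of what a capped vocabulary does, the sketch below builds a bag of words by hand with `collections.Counter`. The helper names (`build_vocabulary`, `to_count_vector`) and the toy reviews are made up for this example, and the tokenization and tie-breaking are simpler than what `CountVectorizer` actually does.

```python
from collections import Counter

def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent words, sorted alphabetically."""
    totals = Counter(word for doc in documents for word in doc.split())
    most_frequent = [word for word, _ in totals.most_common(max_features)]
    return sorted(most_frequent)

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

toy_reviews = [
    'great film great cast',
    'boring film',
    'great film boring plot',
]

toy_vocabulary = build_vocabulary(toy_reviews, max_features=3)
# toy_vocabulary == ['boring', 'film', 'great']

toy_vectors = [to_count_vector(doc, toy_vocabulary) for doc in toy_reviews]
# toy_vectors == [[0, 1, 2], [1, 1, 0], [1, 1, 1]]
```

The rare words "cast" and "plot" fall outside the three-word vocabulary, so they contribute nothing to any vector.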

In [12]:
vectorizer = CountVectorizer(analyzer='word', max_features=5000)

`fit_transform()` does two things:

- First, it fits the model and learns the vocabulary 
- Second, it transforms our training data into feature vectors. 

The input to `fit_transform()` should be a list of strings.
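
To make the two steps concrete, this small sketch compares one `fit_transform()` call with a separate `fit()` followed by `transform()` on three toy documents (the documents and variable names are invented for illustration). It assumes scikit-learn is installed, as in the rest of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['great film great cast', 'boring film', 'great film boring plot']

# One call: learn the vocabulary and build the count matrix
combined = CountVectorizer(analyzer='word')
combined_matrix = combined.fit_transform(toy_docs)

# Two calls: learn the vocabulary first, then build the count matrix
stepwise = CountVectorizer(analyzer='word')
stepwise.fit(toy_docs)
stepwise_matrix = stepwise.transform(toy_docs)

# Both routes give the same vocabulary and the same counts
same_vocabulary = combined.vocabulary_ == stepwise.vocabulary_
same_counts = (combined_matrix.toarray() == stepwise_matrix.toarray()).all()
```

The separate `transform()` step matters later, because it lets new data be encoded with a vocabulary that was already learned.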

In [13]:
train_data_features = vectorizer.fit_transform(all_clean_reviews)

In [14]:
# Convert train_data_features into numpy array so that it's easier to work with in prediction
train_data_features = train_data_features.toarray()

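Let's take a look at the output. After `toarray()` it is a dense NumPy array, although most of its entries are zero (`fit_transform()` itself returns a SciPy sparse matrix):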

In [15]:
train_data_features[:10]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [16]:
train_data_features.shape

(25000, 5000)

After calling `fit_transform()` on our training data, what vocabulary did we get?

In [18]:
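# Note: get_feature_names() was deprecated in scikit-learn 1.0
# in favor of get_feature_names_out()
vocab = vectorizer.get_feature_names()
# Print first 20 words in vocabulary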
vocab[:20]

[u'abandoned',
 u'abc',
 u'abilities',
 u'ability',
 u'able',
 u'abraham',
 u'absence',
 u'absent',
 u'absolute',
 u'absolutely',
 u'absurd',
 u'abuse',
 u'abusive',
 u'abysmal',
 u'academy',
 u'accent',
 u'accents',
 u'accept',
 u'acceptable',
 u'accepted']

Print the counts for the first 20 words in our vocabulary:

In [19]:
# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist)[:20]:
    print count, tag

187 abandoned
125 abc
108 abilities
454 ability
1259 able
85 abraham
116 absence
83 absent
352 absolute
1485 absolutely
306 absurd
192 abuse
91 abusive
98 abysmal
297 academy
485 accent
203 accents
300 accept
130 acceptable
144 accepted


#### Training using Random Forest

In [20]:
forest = RandomForestClassifier(n_estimators=100)

In [21]:
forest.fit(train_data_features, reviews.sentiment)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

#### Cleaning the test data set and predicting

In [23]:
test_data = pd.read_csv('../data/testData.tsv', header=0, quoting=3, delimiter='\t')

In [24]:
clean_test_review = []
for review in test_data.review:
    clean_test_review.append(clean_review(review))

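For the test data we only call `transform()`, never `fit()` or `fit_transform()`. The vocabulary must be the one learned from the training data: refitting on the test set would produce a different vocabulary, so the feature columns would no longer match the ones the classifier was trained on, and information from the test set would leak into the features.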
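
To see why the test set is only transformed, here is a small pure-Python sketch. The helpers `learn_vocabulary` and `count_vector` are hypothetical stand-ins for the fit and transform steps, not scikit-learn's implementation, and the documents are toy data.

```python
def learn_vocabulary(documents):
    """Return the sorted list of distinct words (the 'fit' step)."""
    return sorted({word for doc in documents for word in doc.split()})

def count_vector(document, vocabulary):
    """Count each vocabulary word in one document (the 'transform' step)."""
    words = document.split()
    return [words.count(term) for term in vocabulary]

train_docs = ['great film', 'boring film']
test_docs = ['great plot']

train_vocab = learn_vocabulary(train_docs)  # ['boring', 'film', 'great']
refit_vocab = learn_vocabulary(test_docs)   # ['great', 'plot']

# Reusing the training vocabulary keeps the columns aligned with training;
# the unseen word 'plot' is simply ignored
aligned = count_vector(test_docs[0], train_vocab)     # [0, 0, 1]

# A vocabulary refit on the test data changes both the width and the
# meaning of the columns, so a trained model could not use this vector
misaligned = count_vector(test_docs[0], refit_vocab)  # [1, 1]
```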

In [25]:
test_data_features = vectorizer.transform(clean_test_review)

In [26]:
test_data_features = test_data_features.toarray()

In [27]:
test_data_features.shape

(25000, 5000)

#### Predict

In [28]:
result = forest.predict(test_data_features)

#### Output to csv file and submit

In [29]:
output = pd.DataFrame({'id': test_data['id'], 'sentiment': result})
output.to_csv('bag_of_words.csv', index=False, quoting=3)

This submission reaches an accuracy of about 88%.