### Based on [this](https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-1-for-beginners-bag-of-words) Kaggle tutorial

In [27]:
# read the training file
import pandas as pd

train = pd.read_csv('labeledTrainData.tsv', sep='\t')

In [28]:
train.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


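The default parser mishandles the quote characters in the data: the review in the second row now starts with a stray `\The`. Re-reading the file with `quoting=csv.QUOTE_NONE` keeps every quote character as literal text.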

In [32]:
import csv
train = pd.read_csv('labeledTrainData.tsv', sep='\t', quoting=csv.QUOTE_NONE) # treat quote characters as literal text
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."
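With `quoting=csv.QUOTE_NONE`, the surrounding quotes and the backslash-escaped inner quotes stay in every field, as the output above shows. If clean strings are wanted, one option is a small helper like the sketch below. `unquote_field` is a hypothetical name (not part of pandas), and it assumes each field is wrapped in at most one pair of double quotes.

```python
def unquote_field(text):
    """Drop one pair of surrounding double quotes and unescape inner \\" sequences."""
    # Assumption: a quoted field starts and ends with exactly one double quote.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    # Turn backslash-escaped quotes back into plain quotes.
    return text.replace('\\"', '"')


print(unquote_field('"5814_8"'))
print(unquote_field('"\\"Mad Max II\\" is referenced here."'))
```

It could be applied with, for example, `train["review"].map(unquote_field)`. The cleaning steps later in this notebook strip all non-letter characters anyway, so this is optional for the reviews.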


In [36]:
print(train.shape, train.columns.values, sep='\n')

(25000, 3)
['id' 'sentiment' 'review']


In [49]:
train["review"][9]

'"<br /><br />This movie is full of references. Like \\"Mad Max II\\", \\"The wild one\\" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future."'

### Data Cleaning and Text Preprocessing

The review above contains HTML tags such as `<br />` as well as punctuation. We'll use BeautifulSoup to remove the HTML markup.

In [39]:
from bs4 import BeautifulSoup as bsoup

In [52]:
example1 = bsoup(train["review"][9], 'html.parser')

print(train.review[9], 2*'\n')
print(example1.get_text())

"<br /><br />This movie is full of references. Like \"Mad Max II\", \"The wild one\" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future." 


"This movie is full of references. Like \"Mad Max II\", \"The wild one\" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future."


* When considering how to clean the text, we should think about the data problem we are trying to solve. For many problems, it makes sense to remove punctuation. 

* On the other hand, in this case, we are tackling a sentiment analysis problem, and it is possible that "!!!" or ":-(" could carry sentiment, and should be treated as words.

For simplicity, this tutorial removes punctuation altogether. It also removes numbers, although other approaches make just as much sense: we could treat numbers as words, or replace them all with a placeholder string such as "NUM".
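As an illustration of the alternative described above, here is a minimal sketch of a tokenizer that keeps a few emoticons and runs of `!` or `?` as tokens and replaces numbers with "NUM". The pattern and the name `tokenize_keep_sentiment` are assumptions made for this example; the rest of the notebook does not use them, and the emoticon list is far from complete.

```python
import re

# Hypothetical token pattern; alternatives are tried left to right.
TOKEN_PATTERN = re.compile(
    r"""
    [:;=][-']?[()DPp]      # a few common emoticons, e.g. :-(  ;)  :D
    | [!?]+                # runs of exclamation or question marks
    | \d+(?:\.\d+)?        # integers and decimals
    | [a-zA-Z]+            # plain words
    """,
    re.VERBOSE,
)


def tokenize_keep_sentiment(text):
    """Split text into lowercase words, 'NUM' placeholders, and sentiment marks."""
    tokens = []
    for token in TOKEN_PATTERN.findall(text):
        if token[0].isdigit():
            tokens.append("NUM")          # every number becomes the same placeholder
        elif token[0].isalpha():
            tokens.append(token.lower())  # normalize case for words only
        else:
            tokens.append(token)          # keep emoticons and !!! unchanged
    return tokens


print(tokenize_keep_sentiment("Worst 90 minutes ever!!! :-("))
```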

In [53]:
import re

letters_only = re.sub("[^a-zA-Z]", " ", example1.get_text())
print(letters_only)

 This movie is full of references  Like   Mad Max II      The wild one   and many others  The ladybug s face it s a clear reference  or tribute  to Peter Lorre  This movie is a masterpiece  We ll talk much more about in the future  


In [54]:
lower_case = letters_only.lower()
words = lower_case.split()

In [55]:
import nltk  # run nltk.download("stopwords") once if the corpus is not installed
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [56]:
print("Before removing stopwords: \n", words)
stops = set(stopwords.words("english"))  # build the set once instead of reloading the list for every word
words = [word for word in words if word not in stops]
print("After removing stopwords: \n", words)

Before removing stopwords: 
 ['this', 'movie', 'is', 'full', 'of', 'references', 'like', 'mad', 'max', 'ii', 'the', 'wild', 'one', 'and', 'many', 'others', 'the', 'ladybug', 's', 'face', 'it', 's', 'a', 'clear', 'reference', 'or', 'tribute', 'to', 'peter', 'lorre', 'this', 'movie', 'is', 'a', 'masterpiece', 'we', 'll', 'talk', 'much', 'more', 'about', 'in', 'the', 'future']
After removing stopwords: 
 ['movie', 'full', 'references', 'like', 'mad', 'max', 'ii', 'wild', 'one', 'many', 'others', 'ladybug', 'face', 'clear', 'reference', 'tribute', 'peter', 'lorre', 'movie', 'masterpiece', 'talk', 'much', 'future']


In [62]:
def review_to_words(raw_review):
    review_text = bsoup(raw_review, 'html.parser').get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [word for word in words if word not in stops]
    
    return ' '.join(meaningful_words)    

In [63]:
clean_review = review_to_words(train["review"][9])
print(clean_review)

movie full references like mad max ii wild one many others ladybug face clear reference tribute peter lorre movie masterpiece talk much future


### Working on the whole dataset

In [65]:
from tqdm import tqdm

num_reviews = train["review"].size
clean_train_reviews = []

for i in tqdm(range(0, num_reviews)):
    clean_train_reviews.append(review_to_words(train["review"][i]))
    
print(clean_train_reviews[9])

100%|███████████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [00:26<00:00, 942.92it/s]


movie full references like mad max ii wild one many others ladybug face clear reference tribute peter lorre movie masterpiece talk much future


### Representing input as Bag of Words

In [66]:
# CountVectorizer is scikit-learn's bag-of-words tool
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)

train_data_features = vectorizer.fit_transform(clean_train_reviews)
print(type(train_data_features))
train_data_features = train_data_features.toarray()

<class 'scipy.sparse.csr.csr_matrix'>


In [76]:
import numpy as np
print(train_data_features.shape)
print(train_data_features[9])

(25000, 5000)
[0 0 0 ... 0 0 0]


In [78]:
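# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer.get_feature_names_out().tolist()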
print(vocab[:10])

['abandoned', 'abc', 'abilities', 'ability', 'able', 'abraham', 'absence', 'absent', 'absolute', 'absolutely']
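To make explicit what the vectorizer computes, here is a rough pure-Python sketch of a bag of words: pick the most frequent words as the vocabulary, then count how often each vocabulary word occurs in each document. The functions `fit_vocabulary` and `transform` are hypothetical names for this illustration. This is only an approximation of `CountVectorizer`, which by default also lowercases text, ignores single-character tokens, and may break frequency ties differently.

```python
from collections import Counter


def fit_vocabulary(documents, max_features=None):
    """Return the most frequent whitespace-separated words, sorted alphabetically."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.split())
    most_common = [word for word, _ in totals.most_common(max_features)]
    return sorted(most_common)


def transform(documents, vocabulary):
    """Turn each document into a list of counts, one entry per vocabulary word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.split():
            if word in index:          # words outside the vocabulary are ignored
                row[index[word]] += 1
        rows.append(row)
    return rows


# Made-up toy documents, already cleaned
docs = ["movie full references movie masterpiece", "movie terrible plot"]
toy_vocab = fit_vocabulary(docs)
print(toy_vocab)
print(transform(docs, toy_vocab))
```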


In [85]:
# print vocabulary entries for which this review has nonzero values
non_zero_indices = list(np.flatnonzero(train_data_features[9]))
print(np.array(vocab)[non_zero_indices])

['clear' 'face' 'full' 'future' 'ii' 'like' 'mad' 'many' 'masterpiece'
 'max' 'movie' 'much' 'one' 'others' 'peter' 'reference' 'references'
 'talk' 'tribute' 'wild']


In [90]:
# get count of each vocabulary word in the data set
dist = np.sum(train_data_features, axis=0)
for tag, count in zip(vocab, dist):
    print(count, tag)
    break

187 abandoned


In [97]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data_features, train["sentiment"])
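The forest above is trained on all labeled reviews, so nothing in this notebook estimates how well it generalizes. One common option is to hold out part of the labeled data for validation. The sketch below shows the idea on a tiny made-up corpus that stands in for `clean_train_reviews` and `train["sentiment"]`; the names `val_vectorizer` and `val_forest`, the 25% split, and the fixed `random_state` are assumptions for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up toy data standing in for the cleaned reviews and their labels
reviews = [
    "movie masterpiece wonderful acting",
    "great film loved every minute",
    "superb story brilliant cast",
    "wonderful fun great movie",
    "terrible plot awful acting",
    "boring film waste time",
    "awful movie bad script",
    "bad boring terrible story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the labeled data, keeping the class balance in both parts
train_texts, val_texts, y_train, y_val = train_test_split(
    reviews, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vocabulary on the training part only, then reuse it for validation
val_vectorizer = CountVectorizer(max_features=5000)
X_train = val_vectorizer.fit_transform(train_texts)
X_val = val_vectorizer.transform(val_texts)

val_forest = RandomForestClassifier(n_estimators=100, random_state=0)
val_forest.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, val_forest.predict(X_val))
print(f"validation accuracy: {val_accuracy:.2f}")
```

With only eight toy reviews the printed score is not meaningful; on the real data the same pattern would use the 25,000 cleaned reviews and their sentiment labels.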

In [98]:
test = pd.read_csv("testData.tsv", header=0, delimiter='\t', quoting=csv.QUOTE_NONE)
print(test.shape)

num_reviews = len(test["review"])
clean_test_reviews = []
for review in tqdm(test["review"]):
    clean_test_reviews.append(review_to_words(review))

(25000, 2)




In [99]:
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

In [100]:
result = forest.predict(test_data_features)
output = pd.DataFrame(data={"id":test['id'], "sentiment":result})