### Download the dataset
https://www.kaggle.com/c/word2vec-nlp-tutorial/download/6p5lry6q8vtNVre4DXOg%2Fversions%2FLH6HdfqTrHnAeosA007i%2Ffiles%2FlabeledTrainData.tsv.zip

### Read the data

In [1]:
import pandas as pd       
df = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [2]:
df.shape

(25000, 3)

In [3]:
df.head()

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...


In [4]:
df["review"][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

### Split the data into train and test sets

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['review'],
    df['sentiment'],
    test_size=0.2, 
    random_state=42
)

In [6]:
print(len(X_train), len(X_test), len(y_train), len(y_test))

20000 5000 20000 5000
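
To make the 80/20 split concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-cut split. It is not scikit-learn's implementation: `shuffled_split` and the toy data are hypothetical names made up for illustration, and the exact rows chosen will differ from what `train_test_split` picks for the same seed.

```python
import random

def shuffled_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then reserve a test_size share for testing."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle, same split
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )

# Toy data standing in for the 25,000 reviews and their sentiment labels
toy_reviews = ["review %d" % i for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

a_train, a_test, b_train, b_test = shuffled_split(toy_reviews, toy_labels)
print(len(a_train), len(a_test), len(b_train), len(b_test))
```

Fixing the seed is what makes the split repeatable between runs, which is the role `random_state=42` plays in the cell above.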


In [15]:
type(X_train)

pandas.core.series.Series

In [16]:
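# After the split, each Series keeps its original (shuffled) index labels,
# so X_train[0] would look up the label 0 rather than the first element.
# Converting to plain lists gives simple positional indexing.
X_train = X_train.to_list()
X_test = X_test.to_list()
y_train = y_train.to_list()
y_test = y_test.to_list()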

### Data Cleaning and Text Preprocessing

In [12]:
#!pip install BeautifulSoup4



In [17]:
# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup

# Initialize the BeautifulSoup object on a single movie review.
# Naming the parser explicitly avoids the "no parser was specified" warning.
example1 = BeautifulSoup(X_train[0], "html.parser")

# Print the raw review and then the output of get_text(), for comparison
print(X_train[0])
print(example1.get_text())

"This movie is just plain dumb.<br /><br />From the casting of Ralph Meeker as Mike Hammer to the fatuous climax, the film is an exercise in wooden predictability.<br /><br />Mike Hammer is one of detective fiction's true sociopaths. Unlike Marlow and Spade, who put pieces together to solve the mystery, Hammer breaks things apart to get to the truth. This film turns Hammer into a boob by surrounding him with bad guys who are ... well, too dumb to get away with anything. One is so poorly drawn that he succumbs to a popcorn attack.<br /><br />Other parts of the movie are right out of the Three Stooges play book. Velda's dance at the barre, for instance, or the bad guy who accidentally stabs his boss in the back. And the continuity breaks are shameful: Frau Blucher is running down the centerline of the road when the camera is tight on her lower legs but she's way over the side when the camera pulls back for a wider shot. The worst break, however, precedes the popcorn attack. The bad guy s

In [18]:
import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text() )  # The text to search
print(letters_only)

 This movie is just plain dumb From the casting of Ralph Meeker as Mike Hammer to the fatuous climax  the film is an exercise in wooden predictability Mike Hammer is one of detective fiction s true sociopaths  Unlike Marlow and Spade  who put pieces together to solve the mystery  Hammer breaks things apart to get to the truth  This film turns Hammer into a boob by surrounding him with bad guys who are     well  too dumb to get away with anything  One is so poorly drawn that he succumbs to a popcorn attack Other parts of the movie are right out of the Three Stooges play book  Velda s dance at the barre  for instance  or the bad guy who accidentally stabs his boss in the back  And the continuity breaks are shameful  Frau Blucher is running down the centerline of the road when the camera is tight on her lower legs but she s way over the side when the camera pulls back for a wider shot  The worst break  however  precedes the popcorn attack  The bad guy stalking Hammer passes a clock second

In [19]:
lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words

In [20]:
import nltk
# nltk.download("stopwords")  # Uncomment on first use to fetch the stop word corpus
from nltk.corpus import stopwords  # Import the stop word list
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [21]:
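# Remove stop words from "words". Build the stop word set once,
# rather than reloading the list for every word being checked.
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]
print(words)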

['movie', 'plain', 'dumb', 'casting', 'ralph', 'meeker', 'mike', 'hammer', 'fatuous', 'climax', 'film', 'exercise', 'wooden', 'predictability', 'mike', 'hammer', 'one', 'detective', 'fiction', 'true', 'sociopaths', 'unlike', 'marlow', 'spade', 'put', 'pieces', 'together', 'solve', 'mystery', 'hammer', 'breaks', 'things', 'apart', 'get', 'truth', 'film', 'turns', 'hammer', 'boob', 'surrounding', 'bad', 'guys', 'well', 'dumb', 'get', 'away', 'anything', 'one', 'poorly', 'drawn', 'succumbs', 'popcorn', 'attack', 'parts', 'movie', 'right', 'three', 'stooges', 'play', 'book', 'velda', 'dance', 'barre', 'instance', 'bad', 'guy', 'accidentally', 'stabs', 'boss', 'back', 'continuity', 'breaks', 'shameful', 'frau', 'blucher', 'running', 'centerline', 'road', 'camera', 'tight', 'lower', 'legs', 'way', 'side', 'camera', 'pulls', 'back', 'wider', 'shot', 'worst', 'break', 'however', 'precedes', 'popcorn', 'attack', 'bad', 'guy', 'stalking', 'hammer', 'passes', 'clock', 'seconds', 'hero', 'except',

In [22]:
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words.
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review).
    #
    # 1. Remove HTML (parser named explicitly to avoid a warning)
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Replace non-letters with spaces
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by spaces,
    #    and return the result
    return " ".join(meaningful_words)

In [23]:
clean_review = review_to_words( X_train[0] )
print(clean_review)

movie plain dumb casting ralph meeker mike hammer fatuous climax film exercise wooden predictability mike hammer one detective fiction true sociopaths unlike marlow spade put pieces together solve mystery hammer breaks things apart get truth film turns hammer boob surrounding bad guys well dumb get away anything one poorly drawn succumbs popcorn attack parts movie right three stooges play book velda dance barre instance bad guy accidentally stabs boss back continuity breaks shameful frau blucher running centerline road camera tight lower legs way side camera pulls back wider shot worst break however precedes popcorn attack bad guy stalking hammer passes clock seconds hero except clock shows seven minutes behind guy fair interesting camera angles lighting grand finale bad must seen reason gets two points
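
Step 4 of `review_to_words` relies on set lookups being faster than list lookups. A rough way to check this is sketched below with `timeit`. The stop word list and the repeated review here are hypothetical stand-ins (padded with filler entries), not NLTK's actual list, and the timings will vary by machine.

```python
import timeit

# Hypothetical stand-ins: a padded stop word list and a repeated toy review
stop_list = ["the", "a", "an", "is", "of", "and", "to", "in", "it", "this"] + [
    "filler%d" % i for i in range(170)
]
stop_set = set(stop_list)
toy_words = "this movie is just plain dumb and the plot is thin".split() * 200

def filter_with_list():
    # Each "not in" scans the list from the start
    return [w for w in toy_words if w not in stop_list]

def filter_with_set():
    # Each "not in" is a single hash lookup
    return [w for w in toy_words if w not in stop_set]

# Both approaches keep exactly the same words
same_result = filter_with_list() == filter_with_set()

list_seconds = timeit.timeit(filter_with_list, number=20)
set_seconds = timeit.timeit(filter_with_set, number=20)
print("same result:", same_result)
print("list: %.4fs  set: %.4fs" % (list_seconds, set_seconds))
```

The output is identical either way; only the lookup cost changes, which adds up when the function runs over thousands of reviews.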


In [25]:
# Get the number of reviews from the length of the training list
num_reviews = len(X_train)

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

print("Cleaning and parsing the training set movie reviews...\n")

# Loop over each review by index, from 0 to num_reviews - 1
for i in range(num_reviews):
    # Print a progress message every 5000 reviews
    if (i + 1) % 5000 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    clean_train_reviews.append(review_to_words(X_train[i]))

Cleaning and parsing the training set movie reviews...

Review 5000 of 20000

Review 10000 of 20000

Review 15000 of 20000

Review 20000 of 20000



### Creating Features from a Bag of Words

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# fit_transform() returns a SciPy sparse matrix; convert it to a dense
# NumPy array, which is easy to work with at this size
train_data_features = train_data_features.toarray()

In [27]:
train_data_features.shape

(20000, 5000)

In [28]:
train_data_features[17][0:50]

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0], dtype=int64)

In [29]:
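# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() (available since 1.0) replaces it
vocab = vectorizer.get_feature_names_out().tolist()
vocab[:10]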

['abandoned',
 'abc',
 'abilities',
 'ability',
 'able',
 'abraham',
 'absence',
 'absolute',
 'absolutely',
 'absurd']
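
To show what the vectorizer is doing conceptually, here is a minimal pure-Python sketch of a bag of words. It is not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical helpers, and the alphabetical tie-breaking between equally frequent words is an assumption made only to keep the example deterministic.

```python
from collections import Counter

def fit_vocabulary(docs, max_features=None):
    """Count every word, keep the most frequent ones, return them sorted."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Rank by descending count; ties broken alphabetically (an assumption)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [word for word, _ in ranked[:max_features]]
    return sorted(kept)

def transform(docs, vocabulary):
    """Turn each document into a list of counts, one slot per vocabulary word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.split():
            if word in index:  # words outside the vocabulary are ignored
                row[index[word]] += 1
        rows.append(row)
    return rows

train_docs = ["bad movie bad acting", "good movie", "good acting good plot"]
toy_vocab = fit_vocabulary(train_docs, max_features=4)
train_rows = transform(train_docs, toy_vocab)

# New text is mapped onto the vocabulary learned from the training docs
test_rows = transform(["good movie terrible ending"], toy_vocab)

print(toy_vocab)
print(train_rows)
print(test_rows)
```

Each row has one column per vocabulary word, just as each row of `train_data_features` has 5000 columns. Words that were never in the learned vocabulary ("terrible", "ending") simply contribute nothing, which is why the vocabulary is learned from the training data only.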

In [30]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

159 abandoned
97 abc
75 abilities
345 ability
1025 able
70 abraham
87 absence
275 absolute
1159 absolutely
247 absurd
156 abuse
73 abusive
81 abysmal
234 academy
382 accent
165 accents
251 accept
104 acceptable
117 accepted
73 access
256 accident
156 accidentally
70 accompanied
63 accomplish
100 accomplished
236 according
150 account
66 accuracy
208 accurate
101 accused
154 achieve
102 achieved
98 achievement
69 acid
770 across
973 act
515 acted
5193 acting
2746 action
249 actions
1907 actor
3580 actors
957 actress
289 actresses
311 acts
631 actual
3427 actually
116 ad
251 adam
78 adams
360 adaptation
126 adapted
659 add
337 added
68 addicted
123 adding
281 addition
267 adds
91 adequate
96 admire
503 admit
100 admittedly
63 adopted
79 adorable
430 adult
316 adults
88 advance
65 advanced
126 advantage
398 adventure
154 adventures
68 advertising
213 advice
66 advise
288 affair
72 affect
97 affected
82 afford
105 aforementioned
269 afraid
155 africa
196 african
135 afternoon
102 afterward

377 computer
140 con
95 conceived
404 concept
81 concern
213 concerned
82 concerning
121 concerns
102 concert
384 conclusion
117 condition
77 confidence
221 conflict
69 conflicts
64 confrontation
289 confused
287 confusing
131 confusion
99 connect
121 connected
207 connection
78 connery
65 conscience
70 conscious
105 consequences
77 conservative
385 consider
82 considerable
395 considered
431 considering
79 consistent
87 consistently
114 consists
97 conspiracy
231 constant
328 constantly
75 constructed
75 construction
116 contact
130 contain
84 contained
320 contains
160 contemporary
292 content
213 context
249 continue
101 continued
197 continues
170 continuity
103 contract
89 contrary
173 contrast
183 contrived
417 control
118 controversial
93 conventional
63 conventions
136 conversation
78 conversations
141 convey
155 convince
163 convinced
427 convincing
75 convincingly
93 convoluted
125 cook
786 cool
140 cooper
519 cop
72 copies
238 cops
452 copy
212 core
120 corner
208 corny
78 c

103 glover
4094 go
110 goal
922 god
104 godfather
81 godzilla
1906 goes
3278 going
234 gold
92 goldberg
209 golden
618 gone
197 gonna
12077 good
88 goodness
122 goofy
186 gordon
826 gore
284 gorgeous
197 gory
2850 got
111 gothic
101 gotta
224 gotten
353 government
84 grab
262 grace
381 grade
88 gradually
65 graham
256 grand
82 grandfather
91 grandmother
204 grant
156 granted
200 graphic
147 graphics
80 grasp
186 gratuitous
134 grave
81 gray
75 grayson
7264 great
125 greater
599 greatest
119 greatly
66 greatness
73 greed
71 greedy
86 greek
319 green
65 greg
194 grew
125 grey
68 griffith
154 grim
114 gripping
156 gritty
149 gross
64 grotesque
278 ground
824 group
88 groups
182 grow
231 growing
209 grown
105 grows
135 gruesome
79 guarantee
132 guard
1034 guess
84 guessed
118 guessing
104 guest
104 guide
115 guilt
167 guilty
463 gun
73 gundam
221 guns
100 guts
2466 guy
1067 guys
158 ha
65 hackneyed
393 hair
92 hal
1658 half
141 halfway
176 hall
198 halloween
69 ham
98 hamilton
125 hamlet
8

487 nudity
821 number
323 numbers
221 numerous
103 nurse
70 nuts
67 nyc
119 object
65 objective
126 obnoxious
93 obscure
189 obsessed
129 obsession
867 obvious
954 obviously
89 occasion
154 occasional
208 occasionally
94 occur
90 occurred
112 occurs
82 ocean
464 odd
135 oddly
91 odds
82 offended
174 offensive
293 offer
148 offered
81 offering
268 offers
453 office
224 officer
79 officers
86 official
1258 often
1186 oh
110 oil
822 ok
545 okay
3611 old
517 older
63 oldest
178 oliver
91 olivier
63 omen
21463 one
762 ones
67 online
266 onto
527 open
128 opened
806 opening
197 opens
306 opera
75 operation
766 opinion
70 opinions
68 opportunities
304 opportunity
102 opposed
216 opposite
63 option
68 orange
765 order
63 ordered
89 orders
207 ordinary
2676 original
134 originality
249 originally
69 orleans
85 orson
701 oscar
108 oscars
66 othello
1266 others
561 otherwise
87 ought
96 outcome
85 outer
85 outfit
95 outrageous
481 outside
346 outstanding
71 overacting
1137 overall
119 overcome
10

207 segment
111 segments
68 seldom
921 self
90 selfish
186 sell
77 sellers
106 selling
163 semi
182 send
119 sends
1848 sense
83 senseless
153 sensitive
299 sent
84 sentence
69 sentiment
111 sentimental
122 separate
72 september
629 sequel
174 sequels
719 sequence
602 sequences
312 serial
2734 series
758 serious
815 seriously
132 serve
131 served
149 serves
164 service
71 serving
1940 set
656 sets
500 setting
141 settings
80 settle
291 seven
95 seventies
1114 several
77 severe
1358 sex
554 sexual
110 sexuality
107 sexually
365 sexy
77 sh
143 shadow
100 shadows
74 shake
232 shakespeare
64 shaky
108 shall
202 shallow
542 shame
72 shanghai
133 shape
281 share
111 shark
157 sharp
72 shaw
71 shed
202 sheer
73 shelf
68 shelley
197 sheriff
89 shine
113 shines
94 shining
272 ship
64 ships
83 shirley
90 shirt
326 shock
170 shocked
281 shocking
100 shoes
366 shoot
396 shooting
114 shoots
208 shop
1506 short
89 shortly
119 shorts
1644 shot
749 shots
65 shoulders
4980 show
64 showcase
69 showdown
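
The full listing is long, so it can help to look only at the most frequent words. The sketch below sorts by count and keeps the top few. `top_words` is a hypothetical helper, and the dictionary holds a handful of word counts copied from the listing above rather than the full vocabulary.

```python
# A few word counts copied from the listing above
word_counts = {
    "abc": 97,
    "acting": 5193,
    "good": 12077,
    "great": 7264,
    "one": 21463,
    "show": 4980,
}

def top_words(counts, n=3):
    """Return the n (word, count) pairs with the highest counts."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

for word, count in top_words(word_counts):
    print(count, word)
```

In the notebook itself, the same idea could be applied to the real data by passing `dict(zip(vocab, dist))` instead of the toy dictionary.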


### Model

In [32]:
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100) 

# Fit the forest to the training set, using the bag of words as 
# features and the sentiment labels as the response variable

forest = forest.fit( train_data_features, y_train)

### Testing

In [34]:
# Create an empty list and append the clean reviews one by one
num_reviews = len(X_test)
clean_test_reviews = []

print("Cleaning and parsing the test set movie reviews...\n")
for i in range(num_reviews):
    # Print a progress message every 5000 reviews
    if (i + 1) % 5000 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    clean_review = review_to_words(X_test[i])
    clean_test_reviews.append(clean_review)

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
y_pred = forest.predict(test_data_features)

Cleaning and parsing the test set movie reviews...

Review 5000 of 5000



In [35]:
y_pred[:10]

array([0, 1, 0, 1, 0, 1, 1, 1, 0, 0])

### Evaluation

In [37]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.86      0.85      2481
           1       0.86      0.83      0.84      2519

    accuracy                           0.85      5000
   macro avg       0.85      0.85      0.85      5000
weighted avg       0.85      0.85      0.85      5000
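
For readers unfamiliar with the columns in this report, the following sketch computes precision, recall, and F1 by hand for one class. `binary_report` and the toy labels are hypothetical and used only for illustration; the numbers it prints are for the toy data, not for the model above.

```python
def binary_report(y_true, y_hat, positive=1):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything predicted as this class, how much was right
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything truly in this class, how much was found
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels and predictions, not taken from the model above
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 1, 1, 0]

for label in (0, 1):
    p, r, f = binary_report(toy_true, toy_pred, positive=label)
    print("%d  precision=%.2f  recall=%.2f  f1=%.2f" % (label, p, r, f))

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print("accuracy=%.3f" % toy_accuracy)
```

The `support` column in the report is simply the number of true examples of each class in the test set.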



### Summary

<li>Read the data and split it into train and test sets
<li>Clean (pre-process) the review text: remove HTML tags, keep only letters, lowercase, split into words, and remove stop words
<li>Use CountVectorizer to fit-transform the cleaned training text
<li>Initialize and fit a RandomForest model with the count vectors as features and the sentiment values as labels
<li>Clean the test data the same way as the training data
<li>Apply only the CountVectorizer transform method to the test data to generate count vectors
<li>Use the fitted model to predict the sentiment for the count vectors of the test data

<b>Reference</b><br>
https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-1-for-beginners-bag-of-words