# Part 1 - Bag of Words

Following https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words

## What is NLP?

We will apply a [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model to IMDB movie reviews to perform sentiment analysis.

## Reading the Data

In [1]:
import pandas as pd

In [27]:
from six.moves import range

In [5]:
!head -n 2 labeledTrainData.tsv 

id	sentiment	review
"5814_8"	1	"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feat

In [7]:
# quoting=3 (csv.QUOTE_NONE) -> treat double quotes as ordinary characters instead of field delimiters
train = pd.read_csv("./labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
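To see what `quoting=3` does, here is a small sketch using only the standard library `csv` module, written for Python 3. The two-row `sample` string is made up for illustration and is not taken from the real data file. The value 3 corresponds to `csv.QUOTE_NONE`, so quote characters stay in the field text, which is why the review printed later still begins with a `"`.

```python
import csv
import io

# Hypothetical two-row sample in the same tab-separated layout as labeledTrainData.tsv.
sample = 'id\tsentiment\treview\n"1_1"\t1\t"He said "wow" twice."\n'

# quoting=3 is the integer value of csv.QUOTE_NONE.
assert csv.QUOTE_NONE == 3

# With QUOTE_NONE the reader does no quote processing: fields are split on tabs only.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
header, first = rows[0], rows[1]

print(header)
print(first[2])  # the surrounding and embedded quotes are preserved as plain text
```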

In [9]:
train.shape

(25000, 3)

In [10]:
train.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

In [11]:
print train['review'][0]

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

## Data Cleaning and Text Preprocessing

### Removing HTML Markup: BeautifulSoup

In [12]:
from bs4 import BeautifulSoup

In [13]:
example1 = BeautifulSoup(train['review'][0])



Note: this call emits a BeautifulSoup warning that no parser was explicitly specified. As the warning suggests, passing a parser explicitly, e.g. `BeautifulSoup([your markup], "lxml")`, silences it.


In [14]:
print example1.get_text()

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

`get_text()` gives us the text without tags or markup. It is not considered a reliable practice to remove markup using regular expressions (well, shoot), so even for a simple application it is usually best to use a package like BeautifulSoup.
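As a rough illustration of what a parser-based approach does, the sketch below uses Python 3's standard library `html.parser` as a stand-in for BeautifulSoup (it is not how BeautifulSoup is implemented). The `TextExtractor` class and `strip_markup` helper are names invented here. A parser also decodes entities such as `&amp;`, which a naive tag-stripping regular expression would leave untouched.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (a hypothetical stand-in for get_text())."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


snippet = "drugs are bad m'kay.<br /><br />Visually impressive &amp; loud"
clean = strip_markup(snippet)
print(clean)
```

Note that, as in the notebook output above, removing `<br />` tags joins the neighbouring sentences with no space in between.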

### Dealing with Punctuation, Numbers, and Stopwords: NLTK and regular expressions

For simplicity, we will remove punctuation, although this is not obviously the right choice for sentiment analysis, since marks such as "!!!" or ":-(" can themselves carry sentiment. Also for simplicity, we will remove numbers.

In [15]:
import re

In [16]:
letters_only = re.sub(r"[^a-zA-Z]", " ", example1.get_text())

In [17]:
print letters_only

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi

We will also "tokenize" the text: convert everything to lowercase and split it into individual words.

In [18]:
lower_case = letters_only.lower()
words = lower_case.split()
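The two steps so far (keep letters only, then lowercase and split) can be sketched together on a short string. The `tokenize` helper name and the sample sentence are assumptions made for this illustration.

```python
import re


def tokenize(text):
    """Replace every non-letter with a space, lowercase, and split on whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return letters_only.lower().split()


tokens = tokenize("MJ's feature film is only on for 20 minutes, m'kay!")
print(tokens)
```

Because apostrophes are replaced by spaces, contractions and possessives break into fragments such as `s` and `m`, and the number `20` disappears entirely.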

How do we deal with frequently occurring words that don't carry much information? Such words are called "stop words", e.g., "a", "and", "is", "the", etc.

In [19]:
import nltk

In [20]:
from nltk.corpus import stopwords

In [21]:
print stopwords.words("english")

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u

In [22]:
words = [w for w in words if w not in stopwords.words('english')]
print words

[u'stuff', u'going', u'moment', u'mj', u'started', u'listening', u'music', u'watching', u'odd', u'documentary', u'watched', u'wiz', u'watched', u'moonwalker', u'maybe', u'want', u'get', u'certain', u'insight', u'guy', u'thought', u'really', u'cool', u'eighties', u'maybe', u'make', u'mind', u'whether', u'guilty', u'innocent', u'moonwalker', u'part', u'biography', u'part', u'feature', u'film', u'remember', u'going', u'see', u'cinema', u'originally', u'released', u'subtle', u'messages', u'mj', u'feeling', u'towards', u'press', u'also', u'obvious', u'message', u'drugs', u'bad', u'kay', u'visually', u'impressive', u'course', u'michael', u'jackson', u'unless', u'remotely', u'like', u'mj', u'anyway', u'going', u'hate', u'find', u'boring', u'may', u'call', u'mj', u'egotist', u'consenting', u'making', u'movie', u'mj', u'fans', u'would', u'say', u'made', u'fans', u'true', u'really', u'nice', u'actual', u'feature', u'film', u'bit', u'finally', u'starts', u'minutes', u'excluding', u'smooth', u'cri
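A minimal sketch of the same filtering step without NLTK is shown below. The `STOP_WORDS` set is a small hand-picked list for illustration only (NLTK's English list is much longer), and `remove_stop_words` is a hypothetical helper. Storing the stop words in a `set` gives average constant-time membership tests, whereas testing against a list scans it for every word.

```python
# Hypothetical, hand-picked stop list for illustration; not NLTK's actual list.
STOP_WORDS = {"a", "and", "is", "the", "this", "of", "with", "all"}


def remove_stop_words(tokens, stops=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stops]


tokens = "with all this stuff going down at the moment".split()
filtered = remove_stop_words(tokens)
print(filtered)
```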

We could do more: Porter stemming and lemmatizing (both available in NLTK) would allow us to treat "messages", "message", and "messaging" as the same word, which could be useful. For simplicity, however, we will stop here.

### Putting it all together

In [23]:
def review_to_words(raw_review):
    # remove html
    review_text = BeautifulSoup(raw_review).get_text()
    # remove non-letters
    letters_only = re.sub(r'[^a-zA-Z]', ' ', review_text)
    # convert to lower case, split into words
    words = letters_only.lower().split()
    # searching a set is faster than searching a list
    stops = set(stopwords.words('english'))
    # remove stop words
    meaningful_words = [w for w in words if w not in stops]
    # return it all back as a joined string
    return " ".join(meaningful_words)

In [24]:
clean_review = review_to_words(train['review'][0])
print clean_review

stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working

In [25]:
num_reviews = train['review'].size
print num_reviews

25000


In [26]:
clean_train_reviews = []

In [28]:
for i in range(0, num_reviews):
    if (i + 1) % 1000 == 0:
        print 'review %d of %d\n' % (i+1, num_reviews)
    clean_train_reviews.append(review_to_words(train['review'][i]))

review 1000 of 25000

review 2000 of 25000

review 3000 of 25000

review 4000 of 25000

review 5000 of 25000

review 6000 of 25000

review 7000 of 25000

review 8000 of 25000

review 9000 of 25000

review 10000 of 25000

review 11000 of 25000

review 12000 of 25000

review 13000 of 25000

review 14000 of 25000

review 15000 of 25000

review 16000 of 25000

review 17000 of 25000

review 18000 of 25000

review 19000 of 25000

review 20000 of 25000

review 21000 of 25000

review 22000 of 25000

review 23000 of 25000

review 24000 of 25000

review 25000 of 25000



## Creating Features from a Bag of Words

We cap the vocabulary at the 5,000 most frequent words (`max_features=5000`) to keep the feature matrix a manageable size.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer  # sklearn's bag-of-words tool

In [30]:
# note we could pass our own functions in for pre-processing and cleaning stop words all in one go here...
vectorizer = CountVectorizer(analyzer='word', tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)

In [31]:
# fit_transform does two things: it fits the model and learns the vocabulary, then it transforms our training data
# into feature vectors - it takes a list of strings as input
train_data_features = vectorizer.fit_transform(clean_train_reviews)

In [32]:
# numpy arrays are easier to work with...
train_data_features = train_data_features.toarray()

In [33]:
train_data_features.shape

(25000, 5000)
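To make the idea behind the `(25000, 5000)` matrix concrete, here is a simplified pure-Python sketch of a bag-of-words transform with a vocabulary cap. It is not scikit-learn's implementation: the `bag_of_words` function, the whitespace tokenization, and the alphabetical tie-breaking are assumptions made for this example. Each row is one document and each column counts one vocabulary word.

```python
from collections import Counter


def bag_of_words(docs, max_features):
    """Return (vocab, rows): a sorted vocabulary and one count vector per document."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.split())

    # Keep the most frequent words; ties are broken alphabetically (an assumption).
    top = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    vocab = sorted(word for word, _ in top)
    index = {word: i for i, word in enumerate(vocab)}

    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in index:  # words outside the capped vocabulary are dropped
                row[index[word]] += 1
        rows.append(row)
    return vocab, rows


docs = ["film bad bad", "film good", "good film fun"]
vocab, rows = bag_of_words(docs, max_features=3)
print(vocab)
print(rows)
```

Here "fun" occurs only once, so it falls outside the three-word vocabulary and the third document is encoded without it.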

In [34]:
vocab = vectorizer.get_feature_names()

In [35]:
type(vocab)

list

In [36]:
vocab[:20]

[u'abandoned',
 u'abc',
 u'abilities',
 u'ability',
 u'able',
 u'abraham',
 u'absence',
 u'absent',
 u'absolute',
 u'absolutely',
 u'absurd',
 u'abuse',
 u'abusive',
 u'abysmal',
 u'academy',
 u'accent',
 u'accents',
 u'accept',
 u'acceptable',
 u'accepted']

In [37]:
import numpy as np

In [38]:
dist = np.sum(train_data_features, axis=0)
print dist

[187 125 108 ..., 740 518 147]


In [39]:
zippered = zip(vocab, dist)
for tag, count in zippered[:10]:
    print tag, count
    

abandoned 187
abc 125
abilities 108
ability 454
able 1259
abraham 85
absence 116
absent 83
absolute 352
absolutely 1485
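The per-word totals are just column sums of the feature matrix. The sketch below reproduces that on a tiny made-up matrix (the `vocab` and `rows` values are illustrative, not from the IMDB data) and then sorts by count. One caveat when porting the cell above to Python 3: `zip` returns an iterator there, so it must be wrapped in `list(...)` before slicing.

```python
# Tiny illustrative matrix: rows are documents, columns follow vocab order.
vocab = ["bad", "film", "good"]
rows = [
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
]

# Column sums, equivalent in spirit to np.sum(matrix, axis=0).
dist = [sum(column) for column in zip(*rows)]

# list(...) is needed in Python 3 because zip is lazy and cannot be sliced.
pairs = list(zip(vocab, dist))
most_common = sorted(pairs, key=lambda pair: pair[1], reverse=True)

for word, count in most_common[:2]:
    print(word, count)
```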


## Random Forest