# Kaggle's Way of NLP (Phase 1)

The [original tutorial](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words) was developed by **Angela Chapman** during her summer 2014 internship at Kaggle.  
Full credits to her!

Converted by Prashant Brahmbhatt

In [1]:
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

In [2]:
train = pd.read_csv('kaggleNLP/labeledTrainData.tsv', sep = '\t', quoting = 3)
#quoting=3 (csv.QUOTE_NONE) tells pandas to treat double quotes as ordinary characters; otherwise you may encounter errors trying to read the file
#Possible values: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3)
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."
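
As a small, self-contained illustration of what `quoting = 3` does, the sketch below reads a made-up, one-row TSV string (not the real data file) with and without it. The sample row is an assumption modeled on the first row shown above.

```python
import csv
import io

import pandas as pd

# Made-up sample mimicking the file layout: tab-separated, fields wrapped in quotes
sample = 'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff"\n'

# csv.QUOTE_NONE is the named constant behind the bare number 3
quote_none_value = csv.QUOTE_NONE

# quoting=3: quote characters are kept as ordinary text
raw = pd.read_csv(io.StringIO(sample), sep='\t', quoting=csv.QUOTE_NONE)

# Default quoting (QUOTE_MINIMAL): the parser strips the surrounding quotes
stripped = pd.read_csv(io.StringIO(sample), sep='\t')

print(quote_none_value)
print(repr(raw['review'][0]))
print(repr(stripped['review'][0]))
```

This is why the reviews printed in this notebook still start with a literal `"` character.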


Taking a look at a single review

In [3]:
print(train['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

The review contains HTML tags such as <**br/**>, abbreviations, and punctuation: all common issues when processing text from the web.  
So we need to remove these tags using the **BeautifulSoup** library.

In [4]:
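example = BeautifulSoup(train['review'][0], 'html.parser')
print(example.get_text())
#Calling get_text() gives the text of the review, without tags or markup.
#Naming the parser ('html.parser' ships with Python) avoids BeautifulSoup's "no parser specified" warning.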

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

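Notice that in the output above the sentences on either side of a `<br />` pair are fused together ("m'kay.Visually"). A minimal sketch of one way to avoid that, assuming the `separator` argument of `get_text()`; the snippet is a short excerpt of the review, not the full text:

```python
from bs4 import BeautifulSoup

snippet = "drugs are bad m'kay.<br /><br />Visually impressive"
soup = BeautifulSoup(snippet, 'html.parser')

# Default: text pieces on either side of the tags are concatenated directly
plain = soup.get_text()

# With a separator, a space is placed between the pieces instead
spaced = soup.get_text(separator=' ')

print(plain)
print(spaced)
```

In this tutorial the punctuation is replaced by spaces in the next step anyway, so the fused words only matter where a tag is not preceded by punctuation.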
_____

#### Dealing with Punctuation, Numbers and Stopwords: NLTK and regular expressions

For many problems, it makes sense to remove punctuation. On the other hand, in this case, we are tackling a sentiment analysis problem, and it is possible that **!!!** or **:-(** could carry sentiment, and should be treated as words.  
However, for the sake of simplicity, we are ignoring them.

Here we will remove the numbers, but there are various sensible ways to deal with them.  
For example, we could treat them as words, or replace them all with a placeholder string such as **NUM**.
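
The two alternatives mentioned above can be sketched with regular expressions. This is only an illustration and not what the rest of this notebook does; the function names and the small emoticon pattern are assumptions made for the example.

```python
import re

def replace_numbers(text, placeholder='NUM'):
    # Swap every run of digits for a single placeholder token
    return re.sub(r'\d+', placeholder, text)

def tokenize_keep_marks(text):
    # Keep words, runs of '!' and a few simple emoticons such as :-( :-) :( :)
    return re.findall(r"[a-zA-Z]+|!+|:-?[()]", text)

numbers_replaced = replace_numbers('only on for 20 minutes')
tokens = tokenize_keep_marks('Loved it!!! The ending :-(')

print(numbers_replaced)
print(tokens)
```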

To remove punctuation and numbers, we will use a package for dealing with regular expressions, called **re**. 

In [5]:
letters_only = re.sub('[^a-zA-Z]', " ", example.get_text())
#for now, it is sufficient to know that [] indicates group membership and ^ means "not".
#In other words, the re.sub() statement above says, "Find anything that is NOT a lowercase letter 
#(a-z) or an upper case letter (A-Z), and replace it with a space."
print(letters_only)

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi

We'll also convert our reviews to lowercase and split them into individual words (called **"tokenization"** in NLP lingo).

In [6]:
lower_case = letters_only.lower()
words = lower_case.split()

Finally, we need to decide how to deal with frequently occurring words that don't carry much meaning. Such words are called **stop words**; in English they include words such as **"a"**, **"and"**, **"is"**, and **"the"**. 

In [7]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [8]:
words = [w for w in words if w not in stopwords.words('english')]

In [9]:
print(words)

['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 

___________

There are many other things we could do to the data. For example, **Porter Stemming** and **Lemmatizing** (both available in NLTK) would allow us to treat **"messages"**, **"message"**, and **"messaging"** as the same word, which could certainly be useful. However, for simplicity's sake, we skip them for now.
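
For the curious, a minimal sketch of Porter stemming with NLTK's `PorterStemmer` (it needs no extra corpus download; lemmatizing is not shown here because it requires additional NLTK data):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

forms = ['messages', 'message', 'messaging']

# Each form is reduced to a common stem, which need not be a dictionary word
stems = [stemmer.stem(w) for w in forms]
print(stems)
```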

_______

We have worked through a single review as an example; now we have to do the same for all of the reviews.  
So, we wrap the steps in a reusable function.

We will also convert the stop word list to a different data type, a set. This is for speed: since we'll be calling this function tens of thousands of times, it needs to be fast, and searching sets in Python is much faster than searching lists.

In [10]:
def review_to_words(raw_review):
    #Input: a single raw review (a string)
    #Output: the cleaned review, as one string of space-separated words
    
    #Removing HTML
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    
    #Removing everything that is not a letter
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    
    #Converting to lowercase and splitting into words
    words = letters_only.lower().split()
    
    #Converting the stop word list to a set for fast membership tests
    stops = set(stopwords.words('english'))
    
    #Removing stop words
    meaningful_words = [w for w in words if w not in stops]
    
    #Joining the words back into a single string separated by spaces
    return ' '.join(meaningful_words)
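
The set-versus-list speed claim can be checked with a rough timing sketch. The 200-entry word list below is made up (an assumption standing in for a real stop word list), and the exact timings will vary by machine.

```python
import timeit

# Stand-in list of 200 made-up entries, roughly the scale of a stop word list
stop_list = ['word{}'.format(i) for i in range(200)]
stop_set = set(stop_list)

# A token that is NOT in the list is the worst case for a list:
# every entry is compared before the answer is known
token = 'moonwalker'

list_time = timeit.timeit(lambda: token in stop_list, number=20000)
set_time = timeit.timeit(lambda: token in stop_set, number=20000)

print('list: {:.4f}s  set: {:.4f}s'.format(list_time, set_time))
```

A list is scanned entry by entry, while a set uses hashing, so the gap grows with the size of the collection.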

_______

Now we loop through and clean all of the training set at once.

In [11]:
#Getting the number of reviews
num_reviews = len(train['review'])  #OR   train['review'].size

#Initializing an empty list to hold the cleaned reviews
clean_train_reviews = []

#Looping through and cleaning
for i in range(0, num_reviews):
    #Print a progress update after every 1000 reviews
    if (i + 1) % 1000 == 0:
        print("Cleaned {} reviews".format(i + 1))
    clean_train_reviews.append(review_to_words(train['review'][i]))

Cleaned 1000 reviews
Cleaned 2000 reviews
Cleaned 3000 reviews
Cleaned 4000 reviews
Cleaned 5000 reviews
Cleaned 6000 reviews
Cleaned 7000 reviews
Cleaned 8000 reviews
Cleaned 9000 reviews
Cleaned 10000 reviews
Cleaned 11000 reviews
Cleaned 12000 reviews
Cleaned 13000 reviews
Cleaned 14000 reviews
Cleaned 15000 reviews
Cleaned 16000 reviews
Cleaned 17000 reviews
Cleaned 18000 reviews
Cleaned 19000 reviews
Cleaned 20000 reviews
Cleaned 21000 reviews
Cleaned 22000 reviews
Cleaned 23000 reviews
Cleaned 24000 reviews
Cleaned 25000 reviews
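
If the progress messages are not needed, the same cleaning can be written as a list comprehension. The sketch below uses a tiny made-up DataFrame and a simplified stand-in cleaner (`simple_clean` is hypothetical and skips the HTML and stop word steps) so that it runs on its own.

```python
import re

import pandas as pd

def simple_clean(text):
    # Stand-in for review_to_words: letters only, lowercased
    return ' '.join(re.sub('[^a-zA-Z]', ' ', text).lower().split())

# Tiny made-up frame standing in for the real training data
toy = pd.DataFrame({'review': ['Great film!!!', 'Boring... 2/10', 'Loved it']})

# Explicit index-based loop, in the style of the cell above
looped = []
for i in range(len(toy['review'])):
    looped.append(simple_clean(toy['review'][i]))

# Equivalent list comprehension over the column
comprehended = [simple_clean(r) for r in toy['review']]

print(comprehended)
```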


____________

### Creating Features from a Bag of Words

To convert the words to numeric features for machine learning, one approach is to use a **Bag of Words**.  
The **Bag of Words** model learns a **vocabulary** from all of the documents, then models each document by counting the number of times each word appears.

**Sentences**  
"The cat sat on the hat"  
"The dog ate the cat and the hat"  

**Vocabulary** -  { the, cat, sat, on, hat, dog, ate, and }  
To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once. With the entries ordered as in the vocabulary, { the, cat, sat, on, hat, dog, ate, and }, the feature vectors are:

**Bag Of Words**  
Sentence 1 - { 2, 1, 1, 1, 1, 0, 0, 0 }  
Sentence 2 - { 3, 1, 0, 0, 1, 1, 1, 1}
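
The counting above can be reproduced in a few lines of plain Python. This is only a sketch of the idea, not how scikit-learn implements it.

```python
sentences = ['The cat sat on the hat', 'The dog ate the cat and the hat']

# Lowercase and split each sentence into words
tokenized = [s.lower().split() for s in sentences]

# Build the vocabulary in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One count per vocabulary word, for each sentence
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```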

__________

In the **IMDB** data, we have a very large number of reviews, which will give us a large vocabulary. To limit the size of the feature vectors, we should choose some maximum vocabulary size. Below, we use the **5000** most frequent words (remembering that stop words have already been removed).

We use the **feature_extraction** module from scikit-learn to create the **bag-of-words** features.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

#Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer= "word",
                            tokenizer = None,
                            preprocessor = None,
                            stop_words = None,
                            max_features = 5000)


#fit_transform() does two things: first, it fits the model and learns the vocabulary;
#second, it transforms our training data into feature vectors. 
#The input to fit_transform should be a list of strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

#Here train_data_features is a SciPy sparse matrix (CSR format).
#Numpy arrays are easy to work with, so convert the result to a dense array with toarray().
#(np.array() would not densify it; it only wraps the sparse matrix in a 0-d object array.)

train_data_features = train_data_features.toarray()

In [13]:
train_data_features.shape

(25000, 5000)

Note that **CountVectorizer** comes with its own options to automatically do preprocessing, tokenization, and stop word removal -- for each of these, instead of specifying **"None"**, we could have used a built-in method or specified our own function to use.
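
A minimal sketch of those built-in options on a two-sentence toy corpus (the corpus and variable names are made up for the example). Note that scikit-learn's built-in English stop word list is not the same list as NLTK's, so results on real data can differ from the approach used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['The cat sat on the hat', 'The dog ate the cat and the hat']

# Let CountVectorizer lowercase the text and drop its own English stop words
builtin_vectorizer = CountVectorizer(lowercase=True, stop_words='english')
counts = builtin_vectorizer.fit_transform(docs)

# vocabulary_ maps each kept word to its column index
kept_words = sorted(builtin_vectorizer.vocabulary_)

print(kept_words)
print(counts.toarray())
```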

____

Taking a look at words in the vocabulary

In [14]:
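vocab = vectorizer.get_feature_names()
#In scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead;
#get_feature_names() was deprecated in 1.0 and removed in 1.2.
print(vocab)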



To look at the number of occurrences of each word:


<table border="1">
<tr>
<th></th> <th>Message 1</th> <th>Message 2</th> <th>...</th> <th>Message N</th> 
</tr>
<tr>
<td><b>Word 1 Count</b></td><td>0</td><td>1</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word 2 Count</b></td><td>0</td><td>0</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>...</b></td> <td>1</td><td>2</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word N Count</b></td> <td>0</td><td>1</td><td>...</td><td>1</td>
</tr>
</table>
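
A sketch of how such counts can be read off the feature matrix, using a two-sentence toy corpus (an assumption, so the block runs without the IMDB data). In scikit-learn's matrix the rows are documents and the columns are words, so summing over rows gives the total count of each word.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['The cat sat on the hat', 'The dog ate the cat and the hat']

toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(docs)

# Sum each column (one column per vocabulary word) across all documents
totals = np.asarray(toy_features.sum(axis=0)).ravel()

# vocabulary_ maps word -> column index, so it can label the totals
word_totals = {word: int(totals[idx])
               for word, idx in toy_vectorizer.vocabulary_.items()}

for word in sorted(word_totals):
    print(word, word_totals[word])
```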

To look up the index of any word:

In [81]:
vectorizer.vocabulary_.get("message")

2801

To look up a word by its index:

In [83]:
vectorizer.get_feature_names()[2496]

'latter'

_____

## Training the Model

We now have numeric training features for our reviews, so we can move on to supervised learning.

In [87]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators= 100)

# Fit the classifier to the training set, using the bag of words as 
# features and the sentiment labels as the response variable

classifier.fit(train_data_features, train['sentiment'])

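If `fit` raises `ValueError: setting an array element with a sequence.`, the feature matrix is not a plain numeric array. Calling `np.array()` on a SciPy sparse matrix does not densify it; it only wraps the matrix in a 0-dimensional object array, which the classifier cannot read. Pass the sparse matrix returned by `fit_transform` directly, or convert it with its `.toarray()` method.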
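
A minimal sketch of the fit-and-predict step on a made-up four-review corpus (the toy data and all `toy_` names are assumptions, not the IMDB set). It passes the sparse output of `fit_transform` straight to the classifier, which scikit-learn's random forest accepts; a dense array from `.toarray()` works as well.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned reviews with sentiment labels (1 = positive, 0 = negative)
toy_reviews = ['great fun film', 'loved great acting',
               'boring bad film', 'bad awful plot']
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_reviews)  # sparse matrix

toy_classifier = RandomForestClassifier(n_estimators=100, random_state=0)
toy_classifier.fit(toy_features, toy_labels)

# New text must go through transform() (not fit_transform()) to reuse the vocabulary
new_features = toy_vectorizer.transform(['great acting'])
prediction = toy_classifier.predict(new_features)

print(prediction)
```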