# Bag of Words Meets Bags of Popcorn, Part 1
Last updated: 2020.07.19

This notebook is an introduction to natural language processing and text preprocessing. It walks through the code from Kaggle's [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-1-for-beginners-bag-of-words) tutorial.

The tutorial code for Part 1 is [here](https://github.com/wendykan/DeepLearningMovies/blob/master/BagOfWords.py).

## Reading the Data

In [1]:
import pandas as pd

train = pd.read_csv('./data/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
print(train.shape)
train.head()
# contains 25,000 IMDB movie reviews, each with a positive or negative sentiment label
# quoting=3 (csv.QUOTE_NONE): treat double quotes as ordinary characters rather than field quoting

(25000, 3)


          id  sentiment                                             review
0   "5814_8"          1  "With all this stuff going down at the moment ...
1   "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2   "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3   "3630_4"          0  "It must be assumed that those who praised thi...
4   "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
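
The `quoting=3` argument corresponds to the constant `csv.QUOTE_NONE` in Python's standard `csv` module. As a rough illustration of what that setting changes, the sketch below parses one made-up tab-separated row with `csv.reader` instead of pandas. The sample row is an assumption for illustration, not a row from the dataset.

```python
import csv

# A made-up row (not from the dataset) in the same tab-separated layout.
line = '"1_8"\t1\t"A "great" film"'

# Default quoting: double quotes are interpreted as quote characters,
# so the quotes around a well-formed field such as the id are stripped.
default_row = next(csv.reader([line], delimiter="\t"))

# QUOTE_NONE (the value 3): double quotes are ordinary characters and are kept.
raw_row = next(csv.reader([line], delimiter="\t", quoting=csv.QUOTE_NONE))

print(csv.QUOTE_NONE)
print(default_row[0])
print(raw_row)
```

This is consistent with the `id` and `review` values in the output above, which still carry their double quotes.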


In [2]:
train.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

In [3]:
# print a raw review
print(train['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

## Data Cleaning and Text Preprocessing
[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for pulling data out of HTML and XML files. It is used here to strip the HTML tags (such as `<br />`) from the reviews.

In [4]:
from bs4 import BeautifulSoup
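# initialize the BeautifulSoup object on a single movie review
# (naming the parser explicitly avoids the "no parser was explicitly specified" warning)
soup1 = BeautifulSoup(train["review"][0], "html.parser")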

# get_text() gives the text of soup1, without tags or markup
print(soup1.get_text())

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

Punctuation, numbers, and stop words still remain; they are handled with regular expressions and NLTK. To remove punctuation and numbers, Python's built-in regular expression package, ```re``` (see the [package documentation](https://docs.python.org/2/library/re.html#)), can be used.

In [5]:
import re
# use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]", " ", soup1.get_text())
# the pattern matches anything that is NOT ('^') a lowercase letter (a-z) or an uppercase letter (A-Z); each match is replaced with a space
print(letters_only)

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi
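
To make the effect of the `[^a-zA-Z]` pattern easier to see, here is a small sketch on a short made-up string (the sample text is an assumption, not taken from the dataset). Every non-letter character, including digits, apostrophes, and existing spaces, becomes a single space, so the length of the string is unchanged.

```python
import re

# Hypothetical sample text, chosen to contain an apostrophe, digits and punctuation.
text = "MJ's 20 min. film!"

# Replace every character that is not an ASCII letter with a space.
cleaned = re.sub("[^a-zA-Z]", " ", text)

# Lowercase and split on whitespace; runs of spaces collapse into one separator.
tokens = cleaned.lower().split()

print(repr(cleaned))
print(tokens)
```

Note that the apostrophe splits `MJ's` into two tokens, which is why fragments such as `s` and `ve` appear in the cleaned review above.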

In [6]:
# Tokenization: converting the review to lower case individual words
lower_case = letters_only.lower()
words = lower_case.split()

In [7]:
import nltk
nltk.download()  # opens the interactive downloader; nltk.download('stopwords') fetches only the stop word list
from nltk.corpus import stopwords
print(stopwords.words("english"))

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'o

<strong>Error Handling</strong>

```certificate verify failed: unable to get local issuer certificate```

Solution (macOS): if `nltk.download()` fails with this error, run the <i>"Install Certificates.command"</i> script found in the Python application folder
(see [here](https://stackoverflow.com/questions/52805115/certificate-verify-failed-unable-to-get-local-issuer-certificate) for details).

In [8]:
words = [w for w in words if w not in stopwords.words("english")]
print(words)

['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 

In [9]:
def review_to_words(raw_review):
    reviewed_text = BeautifulSoup(raw_review, "html.parser").get_text()
    letters_only = re.sub("[^a-zA-Z]"," ", reviewed_text)
    words = letters_only.lower().split()
    # searching a set is much faster than searching a list, so convert the stop words to a set
    stops = set(stopwords.words("english")) 
    meaningful_words = [w for w in words if w not in stops] 
    return (' '.join(meaningful_words))

In [10]:
clean_review = review_to_words(train['review'][0])
print(clean_review)

stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working
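
The comment in `review_to_words` says that searching a set is faster than searching a list. A minimal way to check this is sketched below using `timeit`. The short stop word list and the sample sentence are placeholders (assumptions), so NLTK is not required; the absolute timings depend on the machine and are not reproduced here.

```python
import timeit

# Placeholder stop words (not NLTK's actual list) and a made-up sentence.
stop_list = ["i", "me", "my", "the", "and", "of", "a", "to", "is", "was"]
stop_set = set(stop_list)
words = "the film was a slow and moody study of grief".split() * 1000

def filter_with(container):
    # Keep only the words that are not in the given stop word container.
    return [w for w in words if w not in container]

# Time the same filtering step with a list and with a set.
list_time = timeit.timeit(lambda: filter_with(stop_list), number=20)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=20)

print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```

Both containers give the same filtered words; only the lookup cost differs. A list is scanned element by element, while a set uses a hash lookup, so the gap tends to widen as the stop word list grows.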

In [11]:
print("Cleaning and parsing the training set movie reviews...\n")
num_reviews = train['review'].size
clean_train_reviews = []
for i in range(0, num_reviews):
    if ((i+1)%1000 == 0):
        # if the index is evenly divisible by 1000, print a message
        print(f"Reviewing {i+1} of {num_reviews}") 
    clean_train_reviews.append(review_to_words(train['review'][i]))

Cleaning and parsing the training set movie reviews...

Reviewing 1000 of 25000
Reviewing 2000 of 25000
Reviewing 3000 of 25000
Reviewing 4000 of 25000
Reviewing 5000 of 25000
Reviewing 6000 of 25000
Reviewing 7000 of 25000
Reviewing 8000 of 25000
Reviewing 9000 of 25000
Reviewing 10000 of 25000
Reviewing 11000 of 25000
Reviewing 12000 of 25000
Reviewing 13000 of 25000
Reviewing 14000 of 25000
Reviewing 15000 of 25000
Reviewing 16000 of 25000
Reviewing 17000 of 25000
Reviewing 18000 of 25000
Reviewing 19000 of 25000
Reviewing 20000 of 25000
Reviewing 21000 of 25000
Reviewing 22000 of 25000
Reviewing 23000 of 25000
Reviewing 24000 of 25000
Reviewing 25000 of 25000


## Creating Features from a Bag of Words (using scikit-learn)
The [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. 

The IMDB data has a very large number of reviews, which will produce a large vocabulary. To limit the size of the feature vectors, some maximum vocabulary size should be set. Below, the 5000 most frequent words are used (remember that stop words have already been removed).

The `feature_extraction` module from scikit-learn is used to create the bag-of-words features. ```fit_transform()``` does two things: it (1) fits the model and learns the vocabulary, and (2) transforms the training data into feature vectors. (The input to `fit_transform` should be a list of strings.)

Reference: [Document preprocessing features in Scikit-Learn](https://datascienceschool.net/view-notebook/3e7aadbf88ed4f0d87a76f9ddc925d69/)
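
Before handing the work to scikit-learn, it may help to see the idea in plain Python. The sketch below is not how `CountVectorizer` is implemented; it only mirrors the two steps described above (learn a capped vocabulary, then count words per document) using `collections.Counter`. The three toy documents and the cap of 3 are assumptions chosen for illustration.

```python
from collections import Counter

# Toy documents (made up), already cleaned and lowercased.
docs = [
    "great movie great cast",
    "boring movie boring plot boring cast",
    "great plot great movie",
]
max_features = 3  # hypothetical cap, playing the role of max_features=5000

tokenized = [doc.split() for doc in docs]

# Step 1: learn the vocabulary, keeping only the most frequent words overall.
totals = Counter(word for doc in tokenized for word in doc)
vocab = sorted(word for word, _ in totals.most_common(max_features))

# Step 2: represent each document as a vector of counts over that vocabulary.
vectors = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

print(vocab)
for vec in vectors:
    print(vec)
```

Words that fall outside the capped vocabulary (`cast` and `plot` here) are simply dropped from the feature vectors.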

In [12]:
print("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
# numpy arrays are easy to work with, so convert the result to an array
train_data_features = train_data_features.toarray() 

Creating the bag of words...



In [13]:
print(train_data_features.shape)

(25000, 5000)


In [14]:
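# note: scikit-learn 1.0 added get_feature_names_out() and version 1.2 removed get_feature_names();
# on recent versions, call vectorizer.get_feature_names_out() instead
vocab = vectorizer.get_feature_names()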
print(vocab[:100])

['abandoned', 'abc', 'abilities', 'ability', 'able', 'abraham', 'absence', 'absent', 'absolute', 'absolutely', 'absurd', 'abuse', 'abusive', 'abysmal', 'academy', 'accent', 'accents', 'accept', 'acceptable', 'accepted', 'access', 'accident', 'accidentally', 'accompanied', 'accomplished', 'according', 'account', 'accuracy', 'accurate', 'accused', 'achieve', 'achieved', 'achievement', 'acid', 'across', 'act', 'acted', 'acting', 'action', 'actions', 'activities', 'actor', 'actors', 'actress', 'actresses', 'acts', 'actual', 'actually', 'ad', 'adam', 'adams', 'adaptation', 'adaptations', 'adapted', 'add', 'added', 'adding', 'addition', 'adds', 'adequate', 'admire', 'admit', 'admittedly', 'adorable', 'adult', 'adults', 'advance', 'advanced', 'advantage', 'adventure', 'adventures', 'advertising', 'advice', 'advise', 'affair', 'affect', 'affected', 'afford', 'aforementioned', 'afraid', 'africa', 'african', 'afternoon', 'afterwards', 'age', 'aged', 'agent', 'agents', 'ages', 'aging', 'ago', 'ag

<strong>Counts of each word in the vocabulary</strong>

In [None]:
import numpy as np
dist = np.sum(train_data_features, axis=0)
for tag, count in zip(vocab, dist):
    print(count,tag)

## Random Forest

In [15]:
print("Training the random forest...")
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(train_data_features, train["sentiment"])

Training the random forest...


## Creating a submission

In [17]:
# Read the test data...
test = pd.read_csv('./data/testData.tsv', header=0, delimiter='\t', quoting=3)
print(test.shape)

(25000, 2)


In [18]:
# Create an empty list and append the clean reviews one by one
clean_test_reviews = []
num_test_reviews = len(test['review'])  # use the test set's own size, not the training set's
print('Cleaning and parsing the test set movie reviews...\n')
for i in range(0, num_test_reviews):
    if ((i+1)%1000 == 0):
        print(f'Reviewing {i+1} of {num_test_reviews}')
    clean_review = review_to_words(test['review'][i])
    clean_test_reviews.append(clean_review)

Cleaning and parsing the test set movie reviews...

Reviewing 1000 of 25000
Reviewing 2000 of 25000
Reviewing 3000 of 25000
Reviewing 4000 of 25000
Reviewing 5000 of 25000
Reviewing 6000 of 25000
Reviewing 7000 of 25000
Reviewing 8000 of 25000
Reviewing 9000 of 25000
Reviewing 10000 of 25000
Reviewing 11000 of 25000
Reviewing 12000 of 25000
Reviewing 13000 of 25000
Reviewing 14000 of 25000
Reviewing 15000 of 25000
Reviewing 16000 of 25000
Reviewing 17000 of 25000
Reviewing 18000 of 25000
Reviewing 19000 of 25000
Reviewing 20000 of 25000
Reviewing 21000 of 25000
Reviewing 22000 of 25000
Reviewing 23000 of 25000
Reviewing 24000 of 25000
Reviewing 25000 of 25000


In [19]:
# Get the bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)
# Copy the results to a pandas dataframe with an 'id' and 'sentiment' column
output = pd.DataFrame(data={'id': test['id'], 'sentiment': result})
# Use pandas to write the csv file
output.to_csv('bag-of-words-model.csv', index=False, quoting=3)
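
Note that the test reviews are passed to `vectorizer.transform()`, not `fit_transform()`, so they are encoded with the vocabulary learned from the training set. The plain-Python sketch below illustrates that idea with made-up documents and a hypothetical helper, `to_vector`; it is an illustration of the principle, not scikit-learn's implementation.

```python
from collections import Counter

# Made-up training documents; the vocabulary is learned from these only.
train_docs = ["great movie", "boring plot"]
vocab = sorted({word for doc in train_docs for word in doc.split()})

def to_vector(doc, vocab):
    # Hypothetical helper: count only the words that are in the fixed vocabulary.
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# A made-up test document containing a word never seen in training.
test_vec = to_vector("great great soundtrack", vocab)

print(vocab)
print(test_vec)
```

The unseen word `soundtrack` has no column, so it is ignored, and the test vector has the same length as the training vectors. This is what lets the trained random forest accept the test features.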