<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#IMDB-Movie-Review-Sentiment-Classification" data-toc-modified-id="IMDB-Movie-Review-Sentiment-Classification-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>IMDB Movie Review Sentiment Classification</a></span></li><li><span><a href="#Purpose" data-toc-modified-id="Purpose-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Purpose</a></span></li><li><span><a href="#Process" data-toc-modified-id="Process-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Process</a></span></li><li><span><a href="#Configure-notebook,-import-libraries,-and-import-dataset" data-toc-modified-id="Configure-notebook,-import-libraries,-and-import-dataset-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Configure notebook, import libraries, and import dataset</a></span></li><li><span><a href="#Examine-the-data" data-toc-modified-id="Examine-the-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Examine the data</a></span></li><li><span><a href="#Cleaning-and-preprocessing" data-toc-modified-id="Cleaning-and-preprocessing-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Cleaning and preprocessing</a></span></li><li><span><a href="#Bag-of-words-feature-creation" data-toc-modified-id="Bag-of-words-feature-creation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Bag-of-words feature creation</a></span></li><li><span><a href="#Baseline-Model-development" data-toc-modified-id="Baseline-Model-development-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Baseline Model development</a></span></li></ul></div>

<h1>IMDB Movie Review Sentiment Classification</h1>

<img style="float: left; margin-right: 15px; width: 30%; height: 30%;" src="images/imdb.jpg" />

# Purpose

The overall goal of this set of write-ups is to explore a number of machine learning algorithms utilizing natural language processing (NLP) to classify the sentiment in a set of IMDB movie reviews.

The specific goals of this write-up include:
1. Create a sparser feature set by removing words not directly related to sentiment
2. Run the models from the [last write-up](./Model-06.ipynb) against the new feature set
3. Determine if the new feature set improves our ability to correctly classify movie review sentiment

This series of write-ups is inspired by the Kaggle [
Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial) competition.    

Dataset source:  [IMDB Movie Reviews](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)

# Process

Previously covered [here](./Model-06.ipynb#Process).

# Configure notebook, import libraries, and import dataset

##### Import libraries

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
from pandas import set_option

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# http://www.nltk.org/index.html
# pip install nltk
import nltk
from nltk.corpus import stopwords

# https://www.crummy.com/software/BeautifulSoup/bs4/doc/
# pip install BeautifulSoup4
from bs4 import BeautifulSoup

# https://pypi.org/project/gensim/
# pip install gensim
from gensim.models import word2vec

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level = logging.INFO)

##### Define global variables

In [2]:
seed = 10
np.random.seed(seed)

# Opens a GUI that allows us to download the NLTK data
# nltk.download()

dataPath = os.path.join('.', 'datasets', 'imdb_movie_reviews')
labeledTrainData = os.path.join(dataPath, 'labeledTrainData.tsv')

##### Import dataset

In [3]:
df = pd.read_csv(labeledTrainData, sep = '\t', header = 0, quoting = 3)

# Examine the data

Previously covered [here](./Model-06.ipynb#Examine-the-data).

# Cleaning and preprocessing

Process justification and methodology previously covered [here](./Model-06.ipynb#Cleaning-and-preprocessing).

Define a 'cleaning' function, and clean the training set:

In [4]:
# Convert the stop words to a set
stopWords = set(stopwords.words("english"))

# Clean IMDB review text
def cleanReview(review, stopWords):
    # Remove HTML
    clean = BeautifulSoup(review)
    
    # Remove non-alpha chars
    clean = re.sub("[^a-zA-Z]", ' ', clean.get_text())
    
    # Convert to lower case and "tokenize"
    clean = clean.lower().split()
    
    # Remove stop words
    clean = [x for x in clean if not x in stopWords]

    # Prepare final, cleaned review
    clean = " ".join(clean)
    
    # Return results
    return clean
    

In [5]:
cleanReviews = [cleanReview(x, stopWords) for x in df['review']]
assert(len(df) == (len(cleanReviews)))

# Bag-of-words feature creation

Initial discussion of the `bag-of-words` algorithm was previously covered [here](./Model-06.ipynb#Bag-of-words-feature-creation).

Next, in the [first write-up](http://localhost:8888/notebooks/Machine-Learning/Python/04-Classic-Datasets/Model-06.ipynb) of this series we examined a sample review--index 108--during the analysis, cleaning, and preprocessing.  We'll post it here again for reference:

In [6]:
cleanReviews[108]

'question one sees movie bad necessarily movie bad get made even see awful first place learned experience learned rules horror movies catalogued satirized countless times last ten years mean someone go ahead make movie uses without shred humor irony movie described loosely based video game script problems black character may always die first asian character always know kung fu may proud figured matrix effect budget necessarily mean use ad nausea ron howard brother guarantee choice roles whenever scene edit together use footage video game one notice cousin rap metal band offers write movie theme free politely decline zombie movies people killing zombies zombies killing people preferably gruesome way possible makes scary white people pay get rave deserve die find old book tell everything need know anything else figure two lines someone asks bare breasts horror movie panacea helicopter boom shot licensing deal sega magically transforms movie student film major studio release try name drop

Since the bag-of-words creation is doing a word count analysis I wanted to explore what would happen if we removed the 'noise' from the reviews.  (And by 'noise' I mean words that likely wouldn't help or hinder sentiment.)  From the review text above we have this string sample for instance:

```
whenever scene edit together use footage video game one notice cousin rap metal band offers
```

It is doubtful this series of words will give the model any 'insights' into if this is a positive or negative review.  However, this next string sample does seem like it would give an indication to review sentiment:

```
question one sees movie bad necessarily movie bad get made even see awful
```

In order to explore this idea let's load a sentiment lexicon into the notebook, and then remove any 'noise' words not found in the sentiment lexicon from the review texts.  We'll then run the 'de-noised' review texts through the same models as we did in the [previous write-up](./Model-06.ipynb), and see if we gain any improvements in speed and/or accuracy.

##### Download the sentiment lexicon

The sentiment lexicon we'll utilize can be found here:  https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon

Using a few commands we can download and extract it:

```
wget https://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
unrar e opinion-lexicon-English.rar
```

##### Applying the sentiment lexicon  - Single observation

Next we'll want to combine all the positive and negative sentiment words into a single list, and then remove any words from the reviews not found in the sentiment lexicon:

In [7]:
# Combine the positive and negative lists of sentiment lexicon words

with open(os.path.join('.', 'datasets', 'positive-words.txt')) as f:
    _positive = f.read().splitlines()
    
with open(os.path.join('.', 'datasets', 'negative-words.txt')) as f:
    _negative = f.read().splitlines()
    
allWords = _positive[35:] + _negative[35:]

assert( len(allWords) == (len(_positive[35:]) + len(_negative[35:])) )

In [8]:
# Preview our sample review before sentiment lexicon parsing
cleanReviews[108]

'question one sees movie bad necessarily movie bad get made even see awful first place learned experience learned rules horror movies catalogued satirized countless times last ten years mean someone go ahead make movie uses without shred humor irony movie described loosely based video game script problems black character may always die first asian character always know kung fu may proud figured matrix effect budget necessarily mean use ad nausea ron howard brother guarantee choice roles whenever scene edit together use footage video game one notice cousin rap metal band offers write movie theme free politely decline zombie movies people killing zombies zombies killing people preferably gruesome way possible makes scary white people pay get rave deserve die find old book tell everything need know anything else figure two lines someone asks bare breasts horror movie panacea helicopter boom shot licensing deal sega magically transforms movie student film major studio release try name drop

In [9]:
# Apply the sentiment lexicon parsing
_tmp = [x for x in cleanReviews[108].split() if x in allWords]

In [10]:
# Example the 'de-noised' list of remaining words
_tmp

['bad',
 'bad',
 'awful',
 'humor',
 'irony',
 'problems',
 'die',
 'proud',
 'guarantee',
 'free',
 'decline',
 'zombie',
 'killing',
 'killing',
 'preferably',
 'gruesome',
 'scary',
 'die',
 'boom',
 'dead',
 'worse',
 'annihilation']

##### Applying the sentiment lexicon  - All observations

Everything looks good so far, so let's 'de-noise' the entire dataset:

In [11]:
sparseCleanReviews = []

for review in cleanReviews:
    _tmp = [x for x in review.split() if x in allWords]
    sparseCleanReviews.append(" ".join(_tmp))

In [12]:
# Sanity check examination

sparseCleanReviews[108]

'bad bad awful humor irony problems die proud guarantee free decline zombie killing killing preferably gruesome scary die boom dead worse annihilation'

##### CountVectorizer application

We'll now simply repeat the CountVectorizer steps as we did in the [first write-up](./Model-06.ipynb) to create the 'bags-of-words' numeric representation of the 'de-noised' reviews suitable for the machine learning model.

In [13]:
# Utilize the defaults for the object instantiation other than max_features
vec = CountVectorizer(max_features = 5000)

# Similar to how almost every other Scikit-Learn objects works we'll call the fit() and transform() methods
features = vec.fit_transform(sparseCleanReviews)

# And finally we'll convert to a np.array
features = features.toarray()

print("Features shape: ", features.shape)

Features shape:  (25000, 5000)


##### Examine vocabulary

We'll examine what the 'de-noising' did to the top ten top and bottom vocabulary listings:

In [14]:
# Take a look at the first 10 words in the vocabulary
vocab = vec.get_feature_names()
print(vocab[:10])

['abnormal', 'abolish', 'abominable', 'abominably', 'abomination', 'abort', 'aborted', 'aborts', 'abound', 'abounds']


In [15]:
_df = pd.DataFrame(data = features, columns = vocab).sum()
_df.sort_values(ascending = False, inplace = True)

In [16]:
print("Top 10:\n")
print(_df.head(10))

Top 10:

like      20274
good      15140
well      10662
bad        9301
great      9058
plot       6585
love       6454
best       6416
better     5737
work       4372
dtype: int64


Original 'Top 10' before 'de-noising':

```
Top 10:

movie     44031
film      40147
one       26788
like      20274
good      15140
time      12724
even      12646
would     12436
story     11983
really    11736
```

In [17]:
print("Bottom 10:\n")
print(_df.tail(10))

Bottom 10:

hothead          1
pillory          1
immorally        1
immodest         1
beckoned         1
beckoning        1
immoderate       1
horrify          1
hotbeds          1
overbearingly    1
dtype: int64


Original 'Bottom 10' before 'de-noising':

```
Bottom 10:

skull       78
sopranos    78
premiere    78
bunny       78
flair       78
fishing     78
awhile      78
stumbled    78
amused      78
cream       78
```

# Baseline Model development

We are finally ready to develop the baseline model on the data we've explored, cleaned, and processed.  Because the IMDB data set doesn't include a validation set we'll create one from a portion of the training data.  The processes is similar to our work in previous write-ups such as the [Iris classifier](.//Model-01.ipynb).

In [18]:
# Pull in the labeled data
labeledTrainData = os.path.join(dataPath, 'labeledTrainData.tsv')
df = pd.read_csv(labeledTrainData, sep = '\t', header = 0, quoting = 3)

# Pull in the unlabeled data since it can be utilized by Word2Vec
unlabeledTrainData = os.path.join(dataPath, 'unlabeledTrainData.tsv')
dfUn = pd.read_csv(unlabeledTrainData, sep = '\t', header = 0, quoting = 3)

In [19]:
# Validation
print('df.shape :', df.shape)
print('dfUn.shape :', dfUn.shape)

df.shape : (25000, 3)
dfUn.shape : (50000, 2)


Word2Vec expects single sentences, each one as a list of words. In other words, the input format is a list of lists.

In [20]:
# Update stop word helper function to output a list of words

# Clean IMDB review text
def cleanReview(review, removeStopWords = False):
    # Convert the stop words to a set
    stopWords = set(stopwords.words("english"))
    
    # Remove HTML
    clean = BeautifulSoup(review)
    
    # Remove non-alpha chars
    clean = re.sub("[^a-zA-Z]", ' ', clean.get_text())
    
    # Convert to lower case and "tokenize"
    clean = clean.lower().split()
    
    # Remove stop words
    if removeStopWords:
        clean = [x for x in clean if not x in stopWords]
    
    # Return results
    return clean

In [21]:
# Examine
cleanReview(df.iloc[25,2])[:12]

['looking',
 'for',
 'quo',
 'vadis',
 'at',
 'my',
 'local',
 'video',
 'store',
 'i',
 'found',
 'this']

In [22]:
# Examine
cleanReview(dfUn.iloc[0,1])[:12]

['watching',
 'time',
 'chasers',
 'it',
 'obvious',
 'that',
 'it',
 'was',
 'made',
 'by',
 'a',
 'bunch']

Create function to break review into a list of sentences which are list of words (i.e. list of lists)

In [23]:
# Creating function implementing punkt tokenizer for sentence splitting
import nltk.data

# Only need this the first time...
# nltk.download('punkt')

# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define a function to split a review into parsed sentences
def createSentences(review, tokenizer, remove_stopwords = False):
    # Init container to hold results
    sentences = []
    
    # Split review string into sentences
    tokenSentences = tokenizer.tokenize(review.strip())

    # Clean the sentences via cleanReview() function
    for s in tokenSentences:
        # If a sentence is empty, skip it
        if len(s) > 0:
            # Clean sentence
            sentences.append( cleanReview( s, remove_stopwords ))
    
    # Return list of clean sentences
    return sentences

In [24]:
# Examine
_ = createSentences(df.iloc[25,2], tokenizer)
print(_[0][:12])
print(len(_))

['looking', 'for', 'quo', 'vadis', 'at', 'my', 'local', 'video', 'store', 'i', 'found', 'this']
8


In [25]:
# Examine
_ = createSentences(dfUn.iloc[0,1], tokenizer)
print(_[0][:12])
print(len(_))

['watching', 'time', 'chasers', 'it', 'obvious', 'that', 'it', 'was', 'made', 'by', 'a', 'bunch']
5


Now combine the labeled and unlabeled list of lists:

In [26]:
combined = []

for s in df.iloc[:,2]:
    combined += createSentences(s, tokenizer)

In [27]:
for s in dfUn.iloc[:,1]:
    combined += createSentences(s, tokenizer)

Quick examination:

In [28]:
print('len(combined): ', len(combined))
print("\nSample sentence:")
print(combined[0])

len(combined):  795538

Sample sentence:
['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again']


Train the Word2Vec model:

In [29]:
# Set Word2Vec params
features = 300       # Word vector dimensionality                      
minWordCount = 40    # Minimum word count                        
workers = 4          # Number of threads to run in parallel
context = 10         # Context window size                                                                                    
downSampling = 1e-3  # Downsample setting for frequent words

model = word2vec.Word2Vec(combined, 
                          workers=workers,
                          size=features, 
                          min_count = minWordCount,
                          window = context, 
                          sample = downSampling)

# https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.init_sims.html
# If replace is set, forget the original vectors and only keep the normalized ones = saves lots of memory!
# Note that you cannot continue training after doing a replace. 
# The model becomes effectively read-only = you can call most_similar, similarity etc., but not train.
model.init_sims(replace = True)

# Save model to disk
model.save("300features_40minwords_10context")

2018-10-19 08:58:49,660 : INFO : collecting all words and their counts
2018-10-19 08:58:49,661 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-10-19 08:58:49,709 : INFO : PROGRESS: at sentence #10000, processed 225803 words, keeping 17776 word types
2018-10-19 08:58:49,751 : INFO : PROGRESS: at sentence #20000, processed 451892 words, keeping 24948 word types
2018-10-19 08:58:49,794 : INFO : PROGRESS: at sentence #30000, processed 671315 words, keeping 30034 word types
2018-10-19 08:58:49,833 : INFO : PROGRESS: at sentence #40000, processed 897815 words, keeping 34348 word types
2018-10-19 08:58:49,874 : INFO : PROGRESS: at sentence #50000, processed 1116963 words, keeping 37761 word types
2018-10-19 08:58:49,914 : INFO : PROGRESS: at sentence #60000, processed 1338404 words, keeping 40723 word types
2018-10-19 08:58:49,954 : INFO : PROGRESS: at sentence #70000, processed 1561580 words, keeping 43333 word types
2018-10-19 08:58:49,995 : INFO : PROGRESS: 

2018-10-19 08:58:52,652 : INFO : PROGRESS: at sentence #720000, processed 16105489 words, keeping 118221 word types
2018-10-19 08:58:52,695 : INFO : PROGRESS: at sentence #730000, processed 16331870 words, keeping 118954 word types
2018-10-19 08:58:52,739 : INFO : PROGRESS: at sentence #740000, processed 16552903 words, keeping 119668 word types
2018-10-19 08:58:52,783 : INFO : PROGRESS: at sentence #750000, processed 16771230 words, keeping 120295 word types
2018-10-19 08:58:52,824 : INFO : PROGRESS: at sentence #760000, processed 16990622 words, keeping 120930 word types
2018-10-19 08:58:52,866 : INFO : PROGRESS: at sentence #770000, processed 17217759 words, keeping 121703 word types
2018-10-19 08:58:52,910 : INFO : PROGRESS: at sentence #780000, processed 17447905 words, keeping 122402 word types
2018-10-19 08:58:52,951 : INFO : PROGRESS: at sentence #790000, processed 17674981 words, keeping 123066 word types
2018-10-19 08:58:52,974 : INFO : collected 123504 word types from a corp

2018-10-19 08:59:38,042 : INFO : EPOCH 4 - PROGRESS: at 93.01% examples, 1070262 words/s, in_qsize 7, out_qsize 0
2018-10-19 08:59:38,794 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-10-19 08:59:38,800 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-19 08:59:38,805 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-19 08:59:38,812 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-19 08:59:38,813 : INFO : EPOCH - 4 : training on 17798082 raw words (12748587 effective words) took 11.8s, 1076128 effective words/s
2018-10-19 08:59:39,823 : INFO : EPOCH 5 - PROGRESS: at 8.88% examples, 1126747 words/s, in_qsize 7, out_qsize 1
2018-10-19 08:59:40,828 : INFO : EPOCH 5 - PROGRESS: at 18.25% examples, 1150927 words/s, in_qsize 5, out_qsize 0
2018-10-19 08:59:41,829 : INFO : EPOCH 5 - PROGRESS: at 27.36% examples, 1152695 words/s, in_qsize 7, out_qsize 0
2018-10-19 08:59:42,830 : INFO : EPOCH 5

Explore the results:

In [30]:
model.most_similar("great")

[('fantastic', 0.7339777946472168),
 ('terrific', 0.7324297428131104),
 ('wonderful', 0.7300347685813904),
 ('superb', 0.634523868560791),
 ('fine', 0.6318458318710327),
 ('good', 0.6126289367675781),
 ('marvelous', 0.6096097826957703),
 ('excellent', 0.6084640026092529),
 ('brilliant', 0.6066595315933228),
 ('fabulous', 0.6053268909454346)]

In [31]:
model.most_similar("awful")

[('terrible', 0.7501150965690613),
 ('atrocious', 0.7334952354431152),
 ('dreadful', 0.723301351070404),
 ('horrible', 0.7197309136390686),
 ('abysmal', 0.709122896194458),
 ('horrendous', 0.6842007040977478),
 ('horrid', 0.6649243831634521),
 ('appalling', 0.6638507843017578),
 ('amateurish', 0.6086978912353516),
 ('lousy', 0.6026760339736938)]

In [32]:
# One of the reviews we referred to often in previous write-ups was of a zombie movie,
# so let's see what words are similar/associated with the word 'zombie'
model.most_similar("zombie")

[('cannibal', 0.674963116645813),
 ('horror', 0.6222474575042725),
 ('slasher', 0.6154731512069702),
 ('zombies', 0.6051949858665466),
 ('werewolf', 0.5976965427398682),
 ('vampire', 0.5963820219039917),
 ('fulci', 0.5891251564025879),
 ('splatter', 0.5795792937278748),
 ('monster', 0.5760996341705322),
 ('mummy', 0.5549553632736206)]

In [33]:
len(model.wv.index2word)

16490

Now that we have a trained model with some semantic understanding of words, how should we use it?

We'll try clustering--even though according to Kaggle it doesn't offer an improvement--as a programming exercise before moving on to other methods.

Note that the Word2Vec model we trained consists of a feature vector for each word in the vocabulary.  The feature vectors can be accessed via the "syn0" object property.


In [34]:
from sklearn.cluster import KMeans

# Set "k" to be 1/5th of the vocabulary size, or an average of 5 words per cluster
wordVecs = model.wv.syn0
k = int(wordVecs.shape[0] / 5)

# Initalize a k-means object and use it to extract centroids
kMeans = KMeans( n_clusters = k )
kModel = kMeans.fit_predict(wordVecs)

So now we have K clusters, and each word in the Word2Vec vocabulary has been assigned to one of the clusters.  Next we want to combine the actual words with their cluster assignments.  We can pull the words themselves from the Word2Vec object with the following property:

```python
model.wv.index2word
```

Let's ensure the list lengths match and then combine the words and their assignments into a dictionary object:

In [52]:
print(len(model.wv.index2word))
print(len(kModel))

print(model.wv.index2word[:5])
print(kModel[:5])

16490
16490
['the', 'and', 'a', 'of', 'to']
[1962 1020 2828 2121 2846]


In [55]:
clusterDict = dict(zip(model.wv.index2word, kModel))

Quick visual inspection:

In [56]:
for i, k in enumerate(clusterDict.keys()):
    print(k, "=", clusterDict[k])
    
    if i > 3:
        break

the = 1962
and = 1020
a = 2828
of = 2121
to = 2846


In [70]:
# Examine the first 10 clusters
for cluster in range(0,10):
    #
    # Print the cluster number  
    print("\nCluster %d" % cluster)
    #
    # Find all of the words for that cluster number, and print them out
    words = []
    for i in range(0,len(clusterDict.values())):
        if( list(clusterDict.values())[i] == cluster ):
            words.append(list(clusterDict.keys())[i])
    print(words)


Cluster 0
['uninteresting', 'meaningless', 'incoherent', 'unoriginal', 'inane', 'senseless', 'illogical', 'derivative', 'nonsensical', 'incomprehensible', 'banal', 'messy', 'jumbled', 'aimless', 'untrue']

Cluster 1
['ariel', 'creasy', 'morgana']

Cluster 2
['myra', 'hoax', 'sic']

Cluster 3
['island', 'enterprise', 'expedition', 'alliance', 'intruder']

Cluster 4
['bam', 'cart']

Cluster 5
['preaching', 'civilized', 'divide', 'persecution', 'discrimination', 'fundamentalist', 'secular']

Cluster 6
['nature', 'dilemma', 'fundamental', 'implications', 'rooted', 'complexities', 'subtleties', 'contradictions', 'conflicting']

Cluster 7
['gregory']

Cluster 8
['passionate', 'sensual', 'seductive', 'strikingly', 'tasteful', 'sensuous', 'forceful']

Cluster 9
['widow', 'waitress', 'housewife', 'heiress', 'fated', 'socialite', 'suitor', 'spinster', 'counselor', 'penniless']


In [65]:
print(_vals)

TypeError: 'dict_values' object does not support indexing

Previously when we implemented bag-of-words we counted up how many times a certain word appeared in each review.  We were hoping that word count patterns would emerge in similar reviews, and that would help us classify unseen reviews as good or bad by comparing their word count patterns.  

In this instance we are doing the same thing, but instead of counting word occurrences we are counting how many times the cluster containing a given word appears in the review.  Again, we are hoping that cluster count patterns emerge that are similar between like reviews, and that we can use this to identify unseen reviews as good or bad.  We are switching from individual words to semantically related clusters comparisons.

The first thing we need to do is write a function that returns a an array for a given review   Each entry in the array should correspond to a cluster in our set, and the values for the array entries will the number of times the cluster was found in the review text.

Next we need to collect each of the feature arrays into a single object suitable for being passed to a machine learning algorithm for training.

##### Feature array creation

In [71]:
def createFeatureArray(wordlist, clusterDict):
    #
    # The number of clusters is equal to the highest cluster index
    # in the word / centroid map
    num_centroids = max( clusterDict.values() ) + 1
    #
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros( num_centroids, dtype="float32" )
    #
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count 
    # by one
    for word in wordlist:
        if word in clusterDict:
            index = clusterDict[word]
            bag_of_centroids[index] += 1
    #
    # Return the "bag of centroids"
    return bag_of_centroids

In [92]:
trainSentences = []

for s in df.iloc[:,2]:
    trainSentences.append(createSentences(s, tokenizer, remove_stopwords = True))
    


In [93]:
k = int(wordVecs.shape[0] / 5)

# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros( (df.iloc[:,2].size, k), dtype="float32" )

# Transform the training set reviews into bags of centroids
counter = 0
for review in trainSentences:
    #train_centroids[counter,:] = createFeatureArray(review, clusterDict)
    counter += 1

In [97]:
# Init vars and params
eFolds = 10
eSeed = 10

# Use accuracy since this is a classification problem
eScore = 'accuracy'

modelName = 'RandomForestClassifier'

xTrain = train_centroids
yTrain = df.iloc[:, 1]

_DF = pd.DataFrame(columns = ['Model', 'Accuracy', 'StdDev'])
_Results = {}
_model = RandomForestClassifier(n_estimators = 100)

kFold = KFold(n_splits = eFolds, random_state = eSeed)
_Results[modelName] = cross_val_score(_model, xTrain, yTrain, cv = kFold, scoring = eScore)

_DF.loc[len(_DF)] = list(['RandomForestClassifier', _Results[modelName].mean(), _Results[modelName].std()])
display(_DF.sort_values(by = ['Accuracy', 'StdDev', 'Model'], ascending = [False, True, True]))

Unnamed: 0,Model,Accuracy,StdDev
0,RandomForestClassifier,0.49264,0.008644


In [81]:
k

3298

In [79]:
df.iloc[:,2].size

25000

In [80]:
train_centroids.shape

(25001, 3298)

In [82]:
len(trainSentences)

266551

In [83]:
max( clusterDict.values() ) + 1

3298

In [86]:
len(train_centroids)

25000

In [89]:
counter

266551

In [90]:
len(trainSentences)

266551