# Sentiment Analysis using Doc2Vec

Word2Vec is dope. In short, it takes in a corpus and churns out a vector for each word in it. What's so special about these vectors, you ask? Well, similar words are near each other. Furthermore, these vectors represent how we use the words. For example, `v_man - v_woman` is approximately equal to `v_king - v_queen`, illustrating the relationship that "man is to woman as king is to queen". This process, in NLP voodoo, is called **word embedding**. These representations have been applied widely. This is made even more awesome with the introduction of Doc2Vec, which represents not only words but entire sentences and documents. Imagine being able to represent an entire sentence using a fixed-length vector and proceeding to run all your standard classification algorithms. Isn't that amazing?
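
To make the analogy concrete, here is a minimal sketch using made-up two-dimensional vectors. The numbers are purely illustrative and do not come from any trained model; the point is only that the two difference vectors come out (almost) identical.

```python
# Toy 2-d "embeddings", invented for illustration only
vectors = {
    'man':   [1.0, 0.0],
    'woman': [1.0, 1.0],
    'king':  [3.0, 0.1],
    'queen': [3.0, 1.1],
}

def subtract(a, b):
    """Component-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]

offset_people = subtract(vectors['man'], vectors['woman'])
offset_royals = subtract(vectors['king'], vectors['queen'])

# Largest component-wise gap between the two offsets
gap = max(abs(x - y) for x, y in zip(offset_people, offset_royals))
print(offset_people, offset_royals, gap)
```

In a real model the two offsets are only approximately equal, so you would compare them with a similarity measure rather than expect an exact match.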

However, Word2Vec documentation is shit. The C-code is nigh unreadable (700 lines of highly optimized, and sometimes weirdly optimized code). I personally spent a lot of time untangling Doc2Vec and crashing into ~50% accuracies due to implementation mistakes. This tutorial aims to help other users get off the ground using Word2Vec for their own research. We use Word2Vec for **sentiment analysis** by attempting to classify the Cornell IMDB movie review corpus (http://www.cs.cornell.edu/people/pabo/movie-review-data/).

The source code used in this demo can be found at https://github.com/linanqiu/word2vec-sentiments

## Setup

### Modules

We use `gensim`, since `gensim` has a much more readable implementation of Word2Vec (and Doc2Vec). Bless those guys. We also use `numpy` for general array manipulation, and `sklearn` for the logistic regression classifier.

In [1]:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

In [7]:
# numpy
import numpy as np

# classifier
from sklearn.linear_model import LogisticRegression

# random
import random

# os
import os

### Input Format

We can't input the raw reviews from the Cornell movie review data repository. Instead, we clean them up by converting everything to lower case and removing punctuation. I did this via bash, and you can do this easily via Python, JS, or your favorite poison. This step is trivial.
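
For those who would rather do the cleanup in Python, here is a minimal sketch. It assumes that punctuation is replaced by a space (which matches the `costner s character` style of the sample reviews) and that digits are kept; the original bash commands are not shown, so treat `clean_review` as a hypothetical equivalent rather than the exact script used.

```python
import re

def clean_review(text):
    """Lower-case a review and replace anything that is not a letter, digit or whitespace with a space."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # Collapse runs of whitespace (including newlines) so the review stays on one line
    return ' '.join(text.split())

print(clean_review("Mr. Costner's movie was FAR too long!"))
```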

The result is to have five documents:

- `test-neg.txt`: 12500 negative movie reviews from the test data
- `test-pos.txt`: 12500 positive movie reviews from the test data
- `train-neg.txt`: 12500 negative movie reviews from the training data
- `train-pos.txt`: 12500 positive movie reviews from the training data
- `train-unsup.txt`: 50000 unlabelled movie reviews

Each of the reviews should be formatted as such:

```
once again mr costner has dragged out a movie for far longer than necessary aside from the terrific sea rescue sequences of which there are very few i just did not care about any of the characters most of us have ghosts in the closet and costner s character are realized early on and then forgotten until much later by which time i did not care the character we should really care about is a very cocky overconfident ashton kutcher the problem is he comes off as kid who thinks he s better than anyone else around him and shows no signs of a cluttered closet his only obstacle appears to be winning over costner finally when we are well past the half way point of this stinker costner tells us all about kutcher s ghosts we are told why kutcher is driven to be the best with no prior inkling or foreshadowing no magic here it was all i could do to keep from turning it off an hour in
this is an example of why the majority of action films are the same generic and boring there s really nothing worth watching here a complete waste of the then barely tapped talents of ice t and ice cube who ve each proven many times over that they are capable of acting and acting well don t bother with this one go see new jack city ricochet or watch new york undercover for ice t or boyz n the hood higher learning or friday for ice cube and see the real deal ice t s horribly cliched dialogue alone makes this film grate at the teeth and i m still wondering what the heck bill paxton was doing in this film and why the heck does he always play the exact same character from aliens onward every film i ve seen with bill paxton has him playing the exact same irritating character and at least in aliens his character died which made it somewhat gratifying overall this is second rate action trash there are countless better films to see and if you really want to see this one watch judgement night which is practically a carbon copy but has better acting and a better script the only thing that made this at all worth watching was a decent hand on the camera the cinematography was almost refreshing which comes close to making up for the horrible film itself but not quite
```

The sample up there contains two movie reviews, each one taking up one entire line. Yes, **each document should be on one line, separated by new lines**. This is extremely important, because our parser depends on this to identify sentences.

### Feeding Data to Doc2Vec

Doc2Vec (the portion of `gensim` that implements the Doc2Vec algorithm) does a great job at word embedding, but a terrible job at reading in files. It only takes in `LabeledLineSentence` objects, which basically yield `LabeledSentence`s (a class from `gensim.models.doc2vec` representing a single sentence). Why the "Labeled" word? Well, here's how Doc2Vec differs from Word2Vec.

Word2Vec simply converts a word into a vector.

Doc2Vec not only does that, but also aggregates all the words in a sentence into a vector. To do that, it simply treats a sentence label as a special word, and does some voodoo on that special word. Hence, that special word is a label for a sentence. 
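
As a rough illustration of the "label as a special word" idea, the sketch below builds the (context, target) pairs that a distributed-memory style model would train on, with the sentence label added to every context window. This is a simplification for intuition only, not gensim's implementation; `training_pairs` and its arguments are hypothetical names.

```python
def training_pairs(words, label, window=2):
    """Yield (context, target) pairs, with the sentence label in every context."""
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        # The label acts like an extra context word shared by the whole sentence
        yield [label] + left + right, target

pairs = list(training_pairs(['this', 'movie', 'was', 'great'], 'TRAIN_POS_0', window=1))
for context, target in pairs:
    print(context, '->', target)
```

Because the label shows up in every window of its sentence, the vector learned for it ends up summarising the sentence as a whole.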

So we have to format sentences into

```python
[['word1', 'word2', 'word3', 'lastword'], ['label1']]
```

`LabeledSentence` is simply a tidier way to do that. It contains a list of words, and a label for the sentence. We don't really need to care about how `LabeledSentence` works exactly, we just have to know that it stores those two things -- a list of words and a label.
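
For intuition, a container with the same two fields can be sketched with a named tuple. `LabeledReview` below is a hypothetical stand-in for illustration, not the gensim class.

```python
from collections import namedtuple

# Hypothetical stand-in: a list of words plus a list of labels
LabeledReview = namedtuple('LabeledReview', ['words', 'labels'])

line = 'no magic here it was all i could do to keep from turning it off'
review = LabeledReview(words=line.split(), labels=['TRAIN_NEG_0'])

print(review.words[:3])  # first three tokens of the review
print(review.labels)
```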

However, we need a way to convert our new line separated corpus into a collection of `LabeledSentence`s. The default constructor for the default `LabeledLineSentence` class in Doc2Vec can do that for a single text file, but can't do that for multiple files. In classification tasks however, we usually deal with multiple documents (test, training, positive, negative etc). Ain't that annoying?

So we write our own `LabeledLineSentence` class. The constructor takes in a dictionary that defines the files to read and the label prefixes sentences from that document should take on. Then, Doc2Vec can either read the collection directly via the iterator, or we can access the array directly. We also need a function to return a permuted version of the array of `LabeledSentence`s. We'll see why later on.

In [3]:
class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        
        flipped = {}
        
        # make sure that the prefixes (the dictionary values) are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')
    
    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])
    
    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences
    
    def sentences_perm(self):
        shuffled = list(self.sentences)
        random.shuffle(shuffled)
        return shuffled

Now we can feed the data files to `LabeledLineSentence`. As we mentioned earlier, `LabeledLineSentence` simply takes a dictionary with the file names as keys and the special prefixes for sentences from each file as values. The prefixes need to be unique, so that there is no ambiguity for sentences from different documents.

The prefixes will have a counter appended to them to label individual sentences in the documents.
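
The labelling scheme is easy to picture in isolation. The two helpers below are hypothetical and only mirror the prefix-plus-counter convention and the uniqueness requirement described here.

```python
def make_labels(prefix, n_lines):
    """Return labels such as 'TRAIN_POS_0', 'TRAIN_POS_1', ... for n_lines lines."""
    return ['%s_%s' % (prefix, item_no) for item_no in range(n_lines)]

def has_unique_prefixes(sources):
    """True when no two files share a prefix."""
    prefixes = list(sources.values())
    return len(prefixes) == len(set(prefixes))

print(make_labels('TRAIN_POS', 3))
print(has_unique_prefixes({'a.txt': 'TRAIN_POS', 'b.txt': 'TRAIN_POS'}))
```

If two files shared a prefix, the first review of each would both be called, say, `TRAIN_POS_0`, and their vectors could no longer be told apart.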

In [4]:
sources = {'test-neg.txt':'TEST_NEG', 'test-pos.txt':'TEST_POS', 'train-neg.txt':'TRAIN_NEG', 
           'train-pos.txt':'TRAIN_POS', 'train-unsup.txt':'TRAIN_UNS'}

sentences = LabeledLineSentence(sources)


## Model

### Building the Vocabulary Table

Doc2Vec requires us to build the vocabulary table (simply digesting all the words and filtering out the unique words, and doing some basic counts on them). So we feed it the array of sentences. `model.build_vocab` takes an array of `LabeledSentence`s, hence our `to_array` function in the `LabeledLineSentence` class.

If you're curious about the parameters, do read the Word2Vec documentation. Otherwise, here's a quick rundown:

- `min_count`: ignore all words with total frequency lower than this. You have to set this to 1, since the sentence labels only appear once. Setting it any higher than 1 will miss out on the sentences.
- `window`: the maximum distance between the current and predicted word within a sentence. Word2Vec uses a skip-gram model, and this is simply the window size of the skip-gram model.
- `vector_size`: dimensionality of the feature vectors in output (older gensim releases call this parameter `size`). 100 is a good number. If you're extreme, you can go up to around 400.
- `sample`: threshold for configuring which higher-frequency words are randomly downsampled
- `workers`: use this many worker threads to train the model 
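
The counting step behind the vocabulary table, together with the `min_count` filter, can be pictured with a few lines of plain Python. This is a toy sketch of the idea only; gensim's actual vocabulary building does considerably more, and `build_vocab_counts` is a hypothetical helper.

```python
from collections import Counter

def build_vocab_counts(sentences, min_count=1):
    """Count every word and drop those seen fewer than min_count times."""
    counts = Counter(word for words in sentences for word in words)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ['a', 'good', 'movie'],
    ['a', 'bad', 'movie'],
]
vocab = build_vocab_counts(toy_sentences, min_count=2)
print(sorted(vocab.items()))
```

With `min_count=2` the words `good` and `bad` vanish from the toy vocabulary, which shows why a high threshold can silently throw away anything that appears only once.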

In [13]:
os.chdir('/Users/jamespearce/repos/dl/data/sentiment')

In [24]:
model = Doc2Vec(min_count=1, window=10, vector_size=100, sample=1e-4, negative=5, workers=7,
               alpha=0.025, min_alpha=0.025)

model.build_vocab(sentences.to_array())



### Training Doc2Vec

Now we train the model. The model is better trained if **in each training epoch, the sequence of sentences fed to the model is randomized**. This is important: missing out on this step gives you really shitty results. This is the reason for the `sentences_perm` method in our `LabeledLineSentence` class.

We train it for 10 epochs. If I had more time, I'd have done 20.

This process takes around 10 mins, so go grab some coffee.
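
The two ingredients of this training loop, a fresh shuffle for every pass and a learning rate that is lowered by hand after each pass, can be sketched without gensim. The starting rate 0.025 and the step 0.002 mirror the values used in this notebook; the helper names are hypothetical.

```python
import random

def alpha_schedule(start=0.025, step=0.002, epochs=10):
    """Learning rate used in each pass when it is lowered by `step` after every pass."""
    return [round(start - step * epoch, 6) for epoch in range(epochs)]

def shuffled_copy(items, rng):
    """Return a shuffled copy, leaving the original order untouched."""
    copy = list(items)
    rng.shuffle(copy)
    return copy

rng = random.Random(0)  # seeded only to make the sketch repeatable
labels = ['TRAIN_POS_0', 'TRAIN_POS_1', 'TRAIN_NEG_0', 'TRAIN_NEG_1']
for alpha in alpha_schedule(epochs=3):
    print(alpha, shuffled_copy(labels, rng))
```

Shuffling a copy matters: the original list keeps its order, so the label-to-review mapping is never disturbed.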

In [25]:
max_epochs = 10
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    # feed a freshly shuffled copy of the sentences on every iteration
    model.train(sentences.sentences_perm(),
                total_examples=model.corpus_count,
                epochs=model.iter)  # `model.iter` is called `model.epochs` in newer gensim releases
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate so there is no decay within an iteration
print('Done!')

iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9


### Inspecting the Model

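Let's see what our model gives. It seems that it has kind of understood the word `good`, since the most similar words to it include `nice`, `decent`, `great` and `fine`. Note that `bad` and `disappointing` also show up: words used in similar contexts end up close together even when their sentiment is opposite. Still, this is really awesome (and important), since we are doing sentiment analysis.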

In [26]:
model.wv.most_similar('good')

[('nice', 0.6834791302680969),
 ('decent', 0.6789529323577881),
 ('great', 0.6568922400474548),
 ('bad', 0.6568381190299988),
 ('fine', 0.6295196413993835),
 ('solid', 0.6145359873771667),
 ('terrific', 0.5557154417037964),
 ('fantastic', 0.5430033802986145),
 ('excellent', 0.5312724709510803),
 ('disappointing', 0.5216973423957825)]
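
The scores in this output are cosine similarities between word vectors. Here is a minimal sketch of that kind of ranking on invented toy vectors; `most_similar` below is a hypothetical plain-Python function, not the gensim method.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to `query`."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, invented for illustration only
toy_vectors = {
    'good':  [1.0, 0.2],
    'nice':  [0.9, 0.3],
    'table': [-0.2, 1.0],
}
ranking = most_similar('good', toy_vectors)
print(ranking)
```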

We can also prop the hood open and see what the model actually contains: a vector for each of the words and sentences in the model. We can access all of them using `model.syn0` (for the geekier ones among you, `syn0` holds the input-side weights of the shallow neural network, i.e. the projection layer rather than the output layer). However, we don't want to use the entire `syn0` since that contains the vectors for the words as well, but we are only interested in the ones for sentences.

Here's a sample vector for the first sentence in the training set for negative reviews:

In [27]:
model['TRAIN_NEG_0']

array([-0.96430969,  0.10271791,  0.12474386,  0.7186532 ,  1.82451093,
        1.2003125 , -0.23583592,  1.01816785,  0.69412196, -0.65088362,
        0.57726198, -1.93346131, -1.05376565, -0.26727399, -0.43875602,
        0.07049139, -0.91318643,  1.08978844, -0.23533964, -1.03512585,
       -1.02525222, -1.13230348, -0.25420204,  0.05271002, -0.74665707,
       -0.28303775, -0.50565034,  2.35043812, -0.11464602, -0.29187128,
       -1.31907737, -0.42764592,  0.15069836,  1.18162823,  0.16823852,
       -0.9416728 ,  0.27668956,  0.40300432, -1.81020391,  0.86010969,
        1.19802105, -0.36606476,  1.31520963,  0.57425082, -0.0485524 ,
        0.74351275,  1.5635519 ,  1.09844077,  1.85646403,  0.06110585,
        0.84082252,  0.7019977 ,  1.02865839, -0.91546881,  0.00917614,
       -1.51288021,  1.30717695, -0.01342327,  0.98863602, -1.02998543,
       -0.82786661, -0.81280404, -0.26240173,  0.67181259, -0.12052402,
       -0.57238591, -0.17024867, -1.30503106, -0.70099688, -0.27

### Saving and Loading Models

To avoid training the model again, we can save it.

In [28]:
model.save('./imdb.d2v')

And load it.

In [29]:
model = Doc2Vec.load('./imdb.d2v')

## Classifying Sentiments

### Training Vectors

Now let's use these vectors to train a classifier. First, we must extract the training vectors. Remember that we have a total of 25000 training reviews, with equal numbers of positive and negative ones (12500 positive, 12500 negative).

Hence, we create `numpy` arrays (since the classifier we use only takes numpy arrays). There are two parallel arrays, one containing the vectors (`train_arrays`) and the other containing the labels (`train_labels`).

We simply put the positive ones at the first half of the array, and the negative ones at the second half.

In [32]:
train_arrays = np.zeros((25000, 100))
train_labels = np.zeros(25000)

for i in range(12500):
    prefix_train_pos = 'TRAIN_POS_' + str(i)
    prefix_train_neg = 'TRAIN_NEG_' + str(i)
    train_arrays[i] = model[prefix_train_pos]
    train_arrays[12500 + i] = model[prefix_train_neg]
    train_labels[i] = 1
    train_labels[12500 + i] = 0

The training array looks like this: rows and rows of vectors representing each sentence.

In [34]:
print (train_arrays)

[[-1.87297237 -0.38670766 -0.73900181 ...,  0.58850521 -0.68588346
  -2.17237258]
 [-3.88280177 -3.56120467  1.16035223 ...,  2.52555108  0.72337848
   0.52997857]
 [-1.08022726 -1.01178837 -0.47317252 ..., -0.30459636 -0.95116109
   0.02349397]
 ..., 
 [-0.28385359  0.1684605  -0.98514009 ...,  0.66366428 -2.31382298
   1.0009079 ]
 [ 0.27408305 -2.1246891   0.89160365 ...,  2.27301073  0.43184912
  -1.51522982]
 [-0.20670033 -1.0146699   0.37222779 ...,  1.33061361 -0.60825211
  -1.41617215]]


The labels are simply category labels for the sentence vectors -- 1 representing positive and 0 for negative.

In [35]:
print (train_labels)

[ 1.  1.  1. ...,  0.  0.  0.]


### Testing Vectors

We do the same for testing data -- data that we are going to feed to the classifier after we've trained it using the training data. This allows us to evaluate our results. The process is pretty much the same as extracting the vectors for the training data.

In [36]:
test_arrays = np.zeros((25000, 100))
test_labels = np.zeros(25000)

for i in range(12500):
    prefix_test_pos = 'TEST_POS_' + str(i)
    prefix_test_neg = 'TEST_NEG_' + str(i)
    test_arrays[i] = model[prefix_test_pos]
    test_arrays[12500 + i] = model[prefix_test_neg]
    test_labels[i] = 1
    test_labels[12500 + i] = 0
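
Since the training and testing loops are identical apart from the prefixes, they could be factored into one helper. The sketch below is a hypothetical refactoring; `build_arrays` is not part of the original notebook, and a plain dictionary stands in for the trained model so that the example runs on its own.

```python
import numpy as np

def build_arrays(lookup, pos_prefix, neg_prefix, n_per_class, dim):
    """Stack positive vectors first, then negative ones, with matching 1/0 labels."""
    arrays = np.zeros((2 * n_per_class, dim))
    labels = np.zeros(2 * n_per_class)
    for i in range(n_per_class):
        arrays[i] = lookup['%s_%s' % (pos_prefix, i)]
        arrays[n_per_class + i] = lookup['%s_%s' % (neg_prefix, i)]
        labels[i] = 1  # the negative half keeps its initial 0
    return arrays, labels

# Stand-in for the trained model: any mapping from label to vector works here
fake_model = {
    'TEST_POS_0': [1.0, 1.0], 'TEST_POS_1': [2.0, 2.0],
    'TEST_NEG_0': [-1.0, -1.0], 'TEST_NEG_1': [-2.0, -2.0],
}
arrays, labels = build_arrays(fake_model, 'TEST_POS', 'TEST_NEG', n_per_class=2, dim=2)
print(arrays.shape, labels)
```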

### Classification

Now we train a logistic regression classifier using the training data.

In [37]:
classifier = LogisticRegression()
classifier.fit(train_arrays, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

And find that we have achieved near 87% accuracy for sentiment analysis. This is rather incredible, given that we are only using a logistic regression classifier and a very shallow neural network.

In [38]:
classifier.score(test_arrays, test_labels)

0.86643999999999999
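
For readers who want to see what a fitted logistic regression does with each sentence vector: it computes a weighted sum of the features, squashes it through the sigmoid function, and predicts the positive class when the result exceeds 0.5. The weights below are invented for illustration and are not the ones learned by the classifier above.

```python
import math

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def predict(weights, bias, x):
    """Return 1 (positive) when the probability exceeds 0.5, otherwise 0."""
    return 1 if predict_proba(weights, bias, x) > 0.5 else 0

# Hypothetical weights for a 3-dimensional vector
weights, bias = [0.8, -0.5, 0.1], 0.0
print(predict_proba(weights, bias, [1.0, 0.0, 0.0]))
print(predict(weights, bias, [0.0, 2.0, 0.0]))
```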

Isn't this fantastic? Hope I saved you some time!

### Classification in H2O


In [40]:
import h2o
from h2o.automl import H2OAutoML

In [41]:
h2o.init(min_mem_size="8G")

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_121"; OpenJDK Runtime Environment (Zulu 8.20.0.5-macosx) (build 1.8.0_121-b15); OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-macosx) (build 25.121-b15, mixed mode)
  Starting server from /Users/jamespearce/miniconda3/envs/dl/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/53/ywlydrqn0dn1x3lz3_c3j0jr0000gn/T/tmpvm4dwjj5
  JVM stdout: /var/folders/53/ywlydrqn0dn1x3lz3_c3j0jr0000gn/T/tmpvm4dwjj5/h2o_jamespearce_started_from_python.out
  JVM stderr: /var/folders/53/ywlydrqn0dn1x3lz3_c3j0jr0000gn/T/tmpvm4dwjj5/h2o_jamespearce_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime: 02 secs
H2O cluster timezone: Australia/Melbourne
H2O data parsing timezone: UTC
H2O cluster version: 3.18.0.11
H2O cluster version age: 2 months and 13 days
H2O cluster name: H2O_from_python_jamespearce_bp8ox0
H2O cluster total nodes: 1
H2O cluster free memory: 7.667 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4


In [42]:
import pandas as pd

In [46]:
train_df = pd.DataFrame(train_arrays)
train_df.columns = ["X_" + str(col) for col in train_df.columns]
predictors = train_df.columns.tolist()

train_df['label'] = train_labels
train_df.head()


Unnamed: 0,X_0,X_1,X_2,X_3,X_4,X_5,X_6,X_7,X_8,X_9,...,X_91,X_92,X_93,X_94,X_95,X_96,X_97,X_98,X_99,label
0,-1.872972,-0.386708,-0.739002,-0.211689,-0.692805,1.092025,0.282619,-0.281471,-0.108712,0.531614,...,-1.464429,-1.756734,0.040509,0.61518,-0.018213,1.224909,0.588505,-0.685883,-2.172373,1.0
1,-3.882802,-3.561205,1.160352,-1.731776,-1.580566,0.57321,2.434028,0.005852,0.599403,3.115823,...,0.59588,-4.09379,1.911239,-0.129363,-1.769357,-1.461325,2.525551,0.723378,0.529979,1.0
2,-1.080227,-1.011788,-0.473173,-0.357528,0.085,0.701937,0.102466,-1.749127,0.589693,0.094849,...,-0.454196,-0.541502,-0.08864,-0.719358,-0.714584,0.407747,-0.304596,-0.951161,0.023494,1.0
3,-0.29975,-0.738922,0.60012,-0.767567,0.517145,-0.035061,0.031394,-0.964378,-1.2111,1.675051,...,-1.160086,1.39005,-0.06782,-1.799572,0.309106,1.03091,1.238531,-1.810848,-1.875456,1.0
4,-0.340574,-0.91217,0.293172,0.985274,-0.920305,-0.931942,0.890645,-0.926597,0.388034,1.36073,...,-0.709455,-1.316947,-0.665811,-1.451187,0.912446,-0.637925,0.144428,-2.237141,-1.279103,1.0


In [47]:
test_df = pd.DataFrame(test_arrays)
test_df.columns = ["X_" + str(col) for col in test_df.columns]
predictors = test_df.columns.tolist()

test_df['label'] = test_labels
test_df.head()


Unnamed: 0,X_0,X_1,X_2,X_3,X_4,X_5,X_6,X_7,X_8,X_9,...,X_91,X_92,X_93,X_94,X_95,X_96,X_97,X_98,X_99,label
0,-0.056883,0.257902,0.917862,1.182799,0.011944,-0.427487,0.970976,-0.364005,0.253675,1.067611,...,-0.44902,-0.587609,1.004264,-0.289868,-0.187725,0.497251,0.089193,-0.084107,1.055744,1.0
1,-0.451884,-0.487854,2.27577,0.750989,-0.312313,0.352184,0.398389,-1.584866,-0.253115,0.191817,...,1.354279,-0.109212,-3.25456,-2.187411,0.595878,1.386712,-0.833938,1.638248,-0.001585,1.0
2,0.373235,-1.60523,2.652814,1.438553,-1.020367,-1.605608,1.944203,0.136347,2.2951,1.479881,...,-0.611217,0.097585,-3.050295,-1.556332,-1.142373,1.27113,-0.353239,-0.161483,0.550128,1.0
3,-1.062646,-1.402895,0.857206,-0.338758,-0.485924,-1.390626,0.936949,-1.338978,0.005824,0.732505,...,-0.57927,0.256764,-0.334779,0.232547,-0.23431,-0.698297,0.526001,-1.230899,-1.205009,1.0
4,1.61261,-1.75997,1.994322,-0.865679,0.791905,2.591355,1.133041,-0.38524,0.303026,0.010593,...,-1.367788,-0.126217,-2.231446,-0.150784,0.113917,0.346769,-0.623452,-0.091713,0.570061,1.0


In [56]:
train_h2o = h2o.H2OFrame(train_df)
test_h2o = h2o.H2OFrame(test_df)



Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [110]:
Y = 'label'

# For binary classification, response should be a factor
train_h2o[Y] = train_h2o[Y].asfactor()
test_h2o[Y] = test_h2o[Y].asfactor()

In [111]:
# Run AutoML for a fixed time budget
minutes = 5
aml = H2OAutoML(max_runtime_secs=minutes * 60,
#                 exclude_algos=["GBM", "DRF", "DeepLearning"],
                seed=2055)
aml.train(x=predictors, y=Y,
          training_frame=train_h2o)

AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [112]:
# view the leaderboard
lb = aml.leaderboard
lb

model_id,auc,logloss
StackedEnsemble_AllModels_0_AutoML_20180806_222557,0.936507,0.325168
StackedEnsemble_BestOfFamily_0_AutoML_20180806_222557,0.936152,0.325734
GLM_grid_0_AutoML_20180806_222557_model_0,0.933265,0.333024
GBM_grid_0_AutoML_20180806_222557_model_9,0.927504,0.343677
GBM_grid_0_AutoML_20180806_222557_model_15,0.927502,0.344424
GBM_grid_0_AutoML_20180806_222557_model_4,0.925291,0.348505
GBM_grid_0_AutoML_20180806_222557_model_0,0.924301,0.350655
GBM_grid_0_AutoML_20180806_222557_model_1,0.923835,0.351459
GBM_grid_0_AutoML_20180806_222557_model_2,0.92263,0.35538
GBM_grid_0_AutoML_20180806_222557_model_7,0.921052,0.431395




In [113]:
aml.leader

Model Details
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_0_AutoML_20180806_222557
No model summary for this model


ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **

MSE: 0.03770424172684333
RMSE: 0.19417580108459273
LogLoss: 0.146354704467591
Null degrees of freedom: 20067
Residual degrees of freedom: 20055
Null deviance: 27819.786690884554
Residual deviance: 5874.092418511233
AIC: 5900.092418511233
AUC: 0.9927791119976621
Gini: 0.9855582239953242
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4130220812610122: 


,0.0,1.0,Error,Rate
0,9366.0,711.0,0.0706,(711.0/10077.0)
1,346.0,9645.0,0.0346,(346.0/9991.0)
Total,9712.0,10356.0,0.0527,(1057.0/20068.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.4130221,0.9480513,228.0
max f2,0.2479523,0.9686033,280.0
max f0point5,0.7182623,0.9637949,135.0
max accuracy,0.4688593,0.9478772,211.0
max precision,0.9777394,1.0,0.0
max recall,0.1175536,1.0,328.0
max specificity,0.9777394,1.0,0.0
max absolute_mcc,0.4688593,0.8958033,211.0
max min_per_class_accuracy,0.4943596,0.9470080,204.0


Gains/Lift Table: Avg response rate: 49.79 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100159,0.9765526,2.0086077,2.0086077,1.0,1.0,0.0201181,0.0201181,100.8607747,100.8607747
,2,0.0200319,0.9757117,2.0086077,2.0086077,1.0,1.0,0.0201181,0.0402362,100.8607747,100.8607747
,3,0.0300478,0.9750917,2.0086077,2.0086077,1.0,1.0,0.0201181,0.0603543,100.8607747,100.8607747
,4,0.0400140,0.9744964,2.0086077,2.0086077,1.0,1.0,0.0200180,0.0803723,100.8607747,100.8607747
,5,0.0500299,0.9739236,2.0086077,2.0086077,1.0,1.0,0.0201181,0.1004904,100.8607747,100.8607747
,6,0.1000100,0.9709830,2.0086077,2.0086077,1.0,1.0,0.1003904,0.2008808,100.8607747,100.8607747
,7,0.1500399,0.9674312,2.0086077,2.0086077,1.0,1.0,0.1004904,0.3013712,100.8607747,100.8607747
,8,0.2000199,0.9623489,2.0086077,2.0086077,1.0,1.0,0.1003904,0.4017616,100.8607747,100.8607747
,9,0.3000299,0.9383943,2.0086077,2.0086077,1.0,1.0,0.2008808,0.6026424,100.8607747,100.8607747




ModelMetricsBinomialGLM: stackedensemble
** Reported on validation data. **

MSE: 0.10160634528899569
RMSE: 0.31875750232582084
LogLoss: 0.33141342582651173
Null degrees of freedom: 4931
Residual degrees of freedom: 4919
Null deviance: 6838.031463986159
Residual deviance: 3269.062032352712
AIC: 3295.062032352712
AUC: 0.9342239172984684
Gini: 0.8684478345969369
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.49994828185931894: 


,0.0,1.0,Error,Rate
0,2067.0,356.0,0.1469,(356.0/2423.0)
1,318.0,2191.0,0.1267,(318.0/2509.0)
Total,2385.0,2547.0,0.1367,(674.0/4932.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.4999483,0.8666930,205.0
max f2,0.1567613,0.9064151,313.0
max f0point5,0.7683774,0.8815899,116.0
max accuracy,0.4999483,0.8633414,205.0
max precision,0.9766120,1.0,0.0
max recall,0.0342643,1.0,389.0
max specificity,0.9766120,1.0,0.0
max absolute_mcc,0.4999483,0.7266126,205.0
max min_per_class_accuracy,0.5360029,0.8617416,193.0


Gains/Lift Table: Avg response rate: 50.87 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0101379,0.9726630,1.9657234,1.9657234,1.0,1.0,0.0199283,0.0199283,96.5723396,96.5723396
,2,0.0200730,0.9711930,1.9256066,1.9458676,0.9795918,0.9898990,0.0191311,0.0390594,92.5606592,94.5867604
,3,0.0300081,0.9699010,1.9657234,1.9524415,1.0,0.9932432,0.0195297,0.0585891,96.5723396,95.2441481
,4,0.0401460,0.9686289,1.9264089,1.9458676,0.98,0.9898990,0.0195297,0.0781188,92.6408928,94.5867604
,5,0.0500811,0.9670187,1.9256066,1.9418482,0.9795918,0.9878543,0.0191311,0.0972499,92.5606592,94.1848213
,6,0.1001622,0.9612030,1.9179730,1.9299106,0.9757085,0.9817814,0.0960542,0.1933041,91.7973030,92.9910621
,7,0.1500406,0.9537619,1.9497419,1.9365032,0.9918699,0.9851351,0.0972499,0.2905540,94.9741905,93.6503183
,8,0.2001217,0.9438967,1.8940978,1.9258911,0.9635628,0.9797366,0.0948585,0.3854125,89.4097847,92.5891108
,9,0.3000811,0.9043114,1.7743345,1.8754064,0.9026369,0.9540541,0.1773615,0.5627740,77.4334505,87.5406375




ModelMetricsBinomialGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.09917463355076918
RMSE: 0.31492004310740396
LogLoss: 0.32516766415604176
Null degrees of freedom: 20067
Residual degrees of freedom: 20053
Null deviance: 27828.826038623632
Residual deviance: 13050.929368566893
AIC: 13080.929368566893
AUC: 0.9365071017026368
Gini: 0.8730142034052737
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.45716668609840594: 


,0.0,1.0,Error,Rate
0,8606.0,1471.0,0.146,(1471.0/10077.0)
1,1227.0,8764.0,0.1228,(1227.0/9991.0)
Total,9833.0,10235.0,0.1344,(2698.0/20068.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.4571667,0.8666073,214.0
max f2,0.1266258,0.9055332,330.0
max f0point5,0.7393757,0.8797216,126.0
max accuracy,0.4571667,0.8655571,214.0
max precision,0.9762402,1.0,0.0
max recall,0.0280349,1.0,396.0
max specificity,0.9762402,1.0,0.0
max absolute_mcc,0.4571667,0.7313535,214.0
max min_per_class_accuracy,0.4983875,0.8643779,202.0


Gains/Lift Table: Avg response rate: 49.79 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100159,0.9727256,2.0086077,2.0086077,1.0,1.0,0.0201181,0.0201181,100.8607747,100.8607747
,2,0.0200319,0.9710852,2.0086077,2.0086077,1.0,1.0,0.0201181,0.0402362,100.8607747,100.8607747
,3,0.0300478,0.9695620,1.9786285,1.9986147,0.9850746,0.9950249,0.0198178,0.0600540,97.8628527,99.8614674
,4,0.0400140,0.9683753,1.9784786,1.9935995,0.985,0.9925280,0.0197177,0.0797718,97.8478631,99.3599470
,5,0.0500299,0.9671935,1.9586424,1.9866011,0.9751244,0.9890438,0.0196177,0.0993895,95.8642380,98.6601088
,6,0.1000100,0.9612347,1.9725609,1.9795845,0.9820538,0.9855506,0.0985887,0.1979782,97.2560948,97.9584516
,7,0.1500399,0.9539609,1.9505902,1.9699165,0.9711155,0.9807373,0.0975878,0.2955660,95.0590193,96.9916532
,8,0.2000199,0.9436005,1.9244985,1.9585677,0.9581256,0.9750872,0.0961866,0.3917526,92.4498549,95.8567693
,9,0.3000299,0.9028300,1.8314660,1.9162004,0.9118087,0.9539944,0.1831648,0.5749174,83.1465958,91.6200448







In [114]:
# predict against the test set
preds = aml.leader.predict(test_h2o)

stackedensemble prediction progress: |████████████████████████████████████| 100%


In [115]:
preds.head()

predict,p0,p1
1,0.0638986,0.936101
1,0.0313767,0.968623
1,0.0380281,0.961972
1,0.0315846,0.968415
1,0.0262308,0.973769
1,0.0332806,0.966719
1,0.0611084,0.938892
1,0.184531,0.815469
1,0.242612,0.757388
1,0.0420261,0.957974




In [116]:
test_df['predict'] = preds['predict'].as_data_frame()
test_df.head()

Unnamed: 0,X_0,X_1,X_2,X_3,X_4,X_5,X_6,X_7,X_8,X_9,...,X_92,X_93,X_94,X_95,X_96,X_97,X_98,X_99,label,predict
0,-0.056883,0.257902,0.917862,1.182799,0.011944,-0.427487,0.970976,-0.364005,0.253675,1.067611,...,-0.587609,1.004264,-0.289868,-0.187725,0.497251,0.089193,-0.084107,1.055744,1.0,1
1,-0.451884,-0.487854,2.27577,0.750989,-0.312313,0.352184,0.398389,-1.584866,-0.253115,0.191817,...,-0.109212,-3.25456,-2.187411,0.595878,1.386712,-0.833938,1.638248,-0.001585,1.0,1
2,0.373235,-1.60523,2.652814,1.438553,-1.020367,-1.605608,1.944203,0.136347,2.2951,1.479881,...,0.097585,-3.050295,-1.556332,-1.142373,1.27113,-0.353239,-0.161483,0.550128,1.0,1
3,-1.062646,-1.402895,0.857206,-0.338758,-0.485924,-1.390626,0.936949,-1.338978,0.005824,0.732505,...,0.256764,-0.334779,0.232547,-0.23431,-0.698297,0.526001,-1.230899,-1.205009,1.0,1
4,1.61261,-1.75997,1.994322,-0.865679,0.791905,2.591355,1.133041,-0.38524,0.303026,0.010593,...,-0.126217,-2.231446,-0.150784,0.113917,0.346769,-0.623452,-0.091713,0.570061,1.0,1


In [117]:
pd.crosstab(test_df.label, test_df.predict)

predict,0,1
label,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,10826,1674
1.0,1700,10800


In [118]:
sum(test_df['label'] == test_df['predict'])/test_df.shape[0]

0.86504
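
The same accuracy can be recomputed by hand from the crosstab above, along with precision and recall for the positive class. The counts are copied from that table; the variable names are just for this sketch.

```python
# Counts copied from the crosstab above (rows: true label, columns: prediction)
tn, fp = 10826, 1674   # true label 0
fn, tp = 1700, 10800   # true label 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the reviews predicted positive, how many really are
recall = tp / (tp + fn)      # of the truly positive reviews, how many were found

print('accuracy  %.5f' % accuracy)
print('precision %.5f' % precision)
print('recall    %.5f' % recall)
```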

In [119]:
aml.leader.save_mojo()

'/Users/jamespearce/repos/dl/data/sentiment/StackedEnsemble_AllModels_0_AutoML_20180806_222557.zip'

In [120]:
h2o.save_model(aml.leader)

'/Users/jamespearce/repos/dl/data/sentiment/StackedEnsemble_AllModels_0_AutoML_20180806_222557'

In [121]:
test_df.to_pickle("imdb_pred.pkl")

## References

- Doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
- Paper that inspired this: http://arxiv.org/abs/1405.4053