# Ensemble

For the grand finale, let's assemble all of our models into one.

In [3]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



## Load all Model Predictions

Here we load the sentiment predictions produced by each of our individual models and use them as input features for a final learner. Hopefully the individual predictions are diverse enough for the ensemble to score better than any single model.

In [4]:
train = pd.read_csv('../../data/train.tsv', sep='\t')
test = pd.read_csv('../../data/test.tsv', sep='\t')

lda_train = pd.read_csv('../latent-dirichlet-allocation/results_train.csv')
w2v_train = pd.read_csv('../word-2-vector/results_train.csv')
sent_train = pd.read_csv('../sentiment/results_train.csv')
# bow_train = pd.read_csv('../bag-of-words/results_train.csv')
# pos_train = pd.read_csv('../parts-of-speech/results_train.csv')

X_train = pd.DataFrame({
    'lda': lda_train.Predicted,
    'w2v': w2v_train.Predicted,
    'sent': sent_train.Predicted,  
#     'bow': bow_train.Sentiment
})
y_train = sent_train.Sentiment

lda_test = pd.read_csv('../latent-dirichlet-allocation/results_test.csv')
w2v_test = pd.read_csv('../word-2-vector/results_test.csv')
sent_test = pd.read_csv('../sentiment/results_test.csv')
# bow_test = pd.read_csv('../bag-of-words/results_test.csv')
# pos_test = pd.read_csv('../parts-of-speech/results_test.csv')

X_test = pd.DataFrame({
    'lda': lda_test.Sentiment,
    'w2v': w2v_test.Sentiment,
    'sent': sent_test.Sentiment,
#     'bow': bow_test.Sentiment
})
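
Building `X_train` and `X_test` column by column assumes that every results file lists the phrases in the same order. The snippet below is a minimal sketch of a sanity check for that assumption, run on small made-up frames. The helper name `check_aligned` is hypothetical, and so is the assumption that each results file carries a `PhraseId` column.

```python
import pandas as pd

def check_aligned(reference, others, key="PhraseId"):
    """Return the names of frames whose key column differs from the reference.

    `reference` is a DataFrame; `others` maps a model name to a DataFrame.
    """
    mismatched = []
    for name, frame in others.items():
        same_length = len(frame) == len(reference)
        # Compare values positionally, ignoring the index labels.
        if not same_length or not (frame[key].values == reference[key].values).all():
            mismatched.append(name)
    return mismatched

# Toy data standing in for the real results files.
reference = pd.DataFrame({"PhraseId": [1, 2, 3]})
others = {
    "lda": pd.DataFrame({"PhraseId": [1, 2, 3], "Predicted": [2, 2, 1]}),
    "w2v": pd.DataFrame({"PhraseId": [1, 3, 2], "Predicted": [2, 3, 2]}),
}
mismatched = check_aligned(reference, others)
print(mismatched)
```

If the list comes back non-empty, merging on the key column would be a safer way to assemble the features than relying on row position.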

## Deciding on a Learner and Model Representation

To decide which learner to use, we will run some cross-validation. The input features are all sentiment predictions (0-4), so the most natural way to combine them is probably a weighted average of each model's predictions. I therefore expect the tree-based models to do worse than models such as logistic regression.
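
As a rough sketch of the weighted-average idea, the snippet below blends three models' predictions with fixed weights and rounds to the nearest class. The function name `weighted_vote`, the weights, and the toy predictions are all made up for illustration; a fitted learner would choose its own weights rather than use hand-picked ones.

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Blend per-model class predictions (0-4) into one prediction per row."""
    predictions = np.asarray(predictions, dtype=float)  # shape: (n_rows, n_models)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                    # normalize so the weights sum to 1
    blended = predictions.dot(weights)                   # weighted average per row
    # Round to the nearest class and keep the result inside the 0-4 range.
    return np.clip(np.rint(blended), 0, 4).astype(int)

# Hypothetical predictions: one row per phrase, one column per model.
toy_predictions = [
    [2, 2, 3],
    [0, 1, 1],
    [4, 3, 4],
]
toy_weights = [0.2, 0.4, 0.4]
blended = weighted_vote(toy_predictions, toy_weights)
print(blended)
```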

In [11]:
def cv(X, y):
    import time

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # cross_val_score now lives in sklearn.model_selection.
    from sklearn.model_selection import cross_val_score

    forest = RandomForestClassifier(n_estimators=50)
    boost = AdaBoostClassifier()
    svc = SVC()
    log = LogisticRegression()

    # Score each learner with cross-validation and time how long it takes.
    t0 = time.time()
    print("Logistic Regression cross validation running...")
    log_score = cross_val_score(log, X, y).mean()
    print("Logistic Regression Score: %2.2f" % log_score)
    print("dt: %f" % (time.time() - t0))
    print("")

    t0 = time.time()
    print("Random Forest cross validation running...")
    forest_score = cross_val_score(forest, X, y).mean()
    print("Random Forest Score: %2.2f" % forest_score)
    print("dt: %f" % (time.time() - t0))
    print("")

    t0 = time.time()
    print("AdaBoost cross validation running...")
    boost_score = cross_val_score(boost, X, y).mean()
    print("AdaBoost Score:      %2.2f" % boost_score)
    print("dt: %f" % (time.time() - t0))
    print("")

    t0 = time.time()
    print("SVC cross validation running...")
    svc_score = cross_val_score(svc, X, y).mean()
    print("SVC Score:           %2.2f" % svc_score)
    print("dt: %f" % (time.time() - t0))
    print("")

cv(X_train, y_train)

Logistic Regression cross validation running...
Logistic Regression Score: 0.80
dt: 0.010617

Random Forest cross validation running...
Random Forest Score: 0.75
dt: 0.219761

AdaBoost cross validation running...
AdaBoost Score:      0.81
dt: 0.198987

SVC cross validation running...
SVC Score:           0.80
dt: 0.007176



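SVC scores 0.80, level with logistic regression and one point behind AdaBoost (0.81), and it had the shortest cross-validation time, so we'll use it for the final predictions.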

In [None]:
from sklearn.svm import SVC
clf = SVC()

print("training svc...")
clf.fit(X_train, y_train)

print("predicting...")
y_pred = clf.predict(X_test)

# Write the predictions in the PhraseId/Sentiment submission format.
results = pd.DataFrame({
    'PhraseId': test.PhraseId,
    'Sentiment': y_pred
})
results.to_csv('results.csv', index=False)

print('done.')

## Kaggle Results

![Kaggle Results]()

## Analysis

Let's take a step back and inspect the data we are learning from.

### How neutral is our data?

In [35]:
# The mean of a boolean Series is the fraction of rows that are True.
print("Percent LDA Neutrality: %f" % (100.0 * (X_train.lda == 2).mean()))
print("Percent Word2Vec Neutrality: %f" % (100.0 * (X_train.w2v == 2).mean()))
print("Percent Sentiment Neutrality: %f" % (100.0 * (X_train.sent == 2).mean()))

Percent LDA Neutrality: 0.315263
Percent Word2Vec Neutrality: 78.441625
Percent Sentiment Neutrality: 82.054338


### How diverse is our data?

In [41]:
# True on rows where all three models predict the same sentiment.
all_same_predictions = (X_train.w2v == X_train.sent) & (X_train.w2v == X_train.lda)
print("Percent Similarity: %f" % (100 * sum(all_same_predictions) / float(len(X_train))))

Percent Similarity: 70.762527
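
A single similarity figure hides which models agree with which. As a small sketch on made-up predictions, the snippet below computes the percentage of rows on which each pair of models agrees. The helper name `pairwise_agreement` and the toy frame are hypothetical stand-ins for `X_train`.

```python
import pandas as pd

def pairwise_agreement(predictions):
    """Percent of rows on which each pair of model columns agrees."""
    names = list(predictions.columns)
    table = pd.DataFrame(index=names, columns=names, dtype=float)
    for a in names:
        for b in names:
            # The mean of a boolean Series is the fraction of matching rows.
            table.loc[a, b] = 100.0 * (predictions[a] == predictions[b]).mean()
    return table

# Toy predictions standing in for the real feature frame.
toy = pd.DataFrame({
    "lda": [2, 2, 1, 3],
    "w2v": [2, 2, 2, 3],
    "sent": [2, 1, 2, 3],
})
agreement = pairwise_agreement(toy)
print(agreement)
```

Pairs with low agreement are the ones that bring the most diversity to the ensemble.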


### How often does each model predict the true sentiment?

In [40]:
print("Percent LDA is correct: %f" % (100 * sum((X_train.lda == y_train)) / float(len(X_train))))
print("Percent Sentiment is correct: %f" % (100 * sum((X_train.sent == y_train)) / float(len(X_train))))
print("Percent Word2Vec is correct: %f" % (100 * sum((X_train.w2v == y_train)) / float(len(X_train))))

Percent LDA is correct: 0.202486
Percent Sentiment is correct: 54.218890
Percent Word2Vec is correct: 54.204793


### How often does at least one model predict the true sentiment?

In [37]:
# True on rows where at least one model matches the true sentiment.
any_correct = (X_train.lda == y_train) | (X_train.sent == y_train) | (X_train.w2v == y_train)
print("Percent any model is correct: %f" % (100 * sum(any_correct) / float(len(X_train))))

Percent any model is correct: 64.203511
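
The "any model" figure can be broken down further by counting how many models are right on each row. The snippet below is a minimal sketch on made-up data; the helper name `correct_counts` and the toy values are hypothetical, not taken from the real results.

```python
import pandas as pd

def correct_counts(predictions, truth):
    """Percent of rows on which exactly k models match the true label."""
    hits = predictions.eq(truth, axis=0)   # boolean frame: prediction == truth
    per_row = hits.sum(axis=1)             # number of correct models on each row
    n_models = predictions.shape[1]
    # Make sure every count from 0 to n_models appears, even if it never occurs.
    counts = per_row.value_counts().reindex(range(n_models + 1), fill_value=0)
    return 100.0 * counts / len(predictions)

# Toy predictions and labels standing in for X_train and y_train.
toy_X = pd.DataFrame({
    "lda": [2, 2, 1, 3],
    "w2v": [2, 2, 2, 3],
    "sent": [2, 1, 2, 3],
})
toy_y = pd.Series([2, 1, 0, 3])
breakdown = correct_counts(toy_X, toy_y)
print(breakdown)
```

Rows with a count of zero are phrases that none of the models gets right; 100 minus that percentage equals the "any model is correct" figure.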
