# Sentiment using TextBlob

This model will simply use TextBlob to determine the sentiment of each phrase.

In [5]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
train = pd.read_csv('../Lemmatization/result_train.csv', encoding='ascii')
test = pd.read_csv('../Lemmatization/result_test.csv', encoding='ascii')
train.fillna('', inplace=True)
test.fillna('', inplace=True)

## Calculate Sentiments

We'll calculate sentiments with TextBlob, using its polarity score (a float between -1.0 and 1.0) as the only feature for each phrase.

In [7]:
from textblob import TextBlob

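def text_blob_sentiment(phrase):
    # Polarity is a float in [-1.0, 1.0]; fall back to neutral (0) if TextBlob fails
    try:
        return TextBlob(phrase).sentiment.polarity
    except Exception:
        return 0

def get_text_blob_sentiments(phrases):
    # Build a list so the column is populated under Python 3, where map() is lazy
    sentiments = [text_blob_sentiment(phrase) for phrase in phrases]
    return pd.DataFrame({'sentiment': sentiments})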
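
To make the idea of a per-phrase polarity feature concrete, here is a toy stand-in written in plain Python. It is not TextBlob's implementation: `TOY_LEXICON`, `toy_polarity`, and `toy_sentiments` are hypothetical names, and the word scores are made up for illustration. It only shows the shape of the computation: one score per phrase, with a neutral fallback when nothing can be scored.

```python
# Hypothetical word scores in [-1, 1]; a real lexicon is far larger.
TOY_LEXICON = {"good": 0.7, "great": 0.8, "bad": -0.7, "awful": -1.0}

def toy_polarity(phrase):
    """Average the scores of known words; return 0.0 when none are known."""
    scores = [TOY_LEXICON[w] for w in phrase.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def toy_sentiments(phrases):
    """Score every phrase, giving one feature value per phrase."""
    return [toy_polarity(p) for p in phrases]

# The empty phrase mirrors the rows that fillna('') produced earlier.
features = toy_sentiments(["a great film", "bad and awful", ""])
```

As in the notebook's own helper, an empty or unscorable phrase ends up with a neutral score of 0 rather than raising an error.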

## Split the Data

We will split the training data into two halves. One half is used to train the model, and the other is held out so the model can make predictions on it. Later on, we will feed these predictions into a higher-level learner as a feature. Because the held-out data is labeled, we can give the ensemble learner these predictions along with the corresponding true sentiments, which lets it fit the predictions to the actual sentiments.

In [12]:
split_index = len(train) // 2
cv = train[:split_index]       # first half: held out for predictions
train = train[split_index:]    # second half: used to fit the model
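
The role of the held-out half can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's pipeline: the `rows` data is invented, and `train_threshold_model` is a hypothetical stand-in for the base model. The point is that every held-out row produces a (prediction, true label) pair that a higher-level learner could later be fitted on.

```python
def split_in_half(rows):
    """Return (held_out, training): first half held out, second half for training."""
    split_index = len(rows) // 2
    return rows[:split_index], rows[split_index:]

def train_threshold_model(training):
    """Toy base model: predict 1 when a score exceeds the mean training score."""
    threshold = sum(score for score, _ in training) / len(training)
    return lambda score: 1 if score > threshold else 0

# Hypothetical (polarity score, true label) pairs.
rows = [(0.9, 1), (-0.4, 0), (0.1, 1), (-0.8, 0), (0.6, 1), (-0.2, 0)]
held_out, training = split_in_half(rows)
base_model = train_threshold_model(training)

# Each held-out row yields (base prediction, true label) for the higher-level learner.
meta_rows = [(base_model(score), label) for score, label in held_out]
```

Because the base model never saw the held-out rows during training, its predictions on them behave more like predictions on unseen data, which is what the higher-level learner will receive at test time.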

## Find Sentiments with TextBlob

In [13]:
X_train_sent = get_text_blob_sentiments(train.Phrase)
X_cv_sent = get_text_blob_sentiments(cv.Phrase)
X_test_sent = get_text_blob_sentiments(test.Phrase)

## Cross Validation

We can cross-validate a few candidate classifiers on the held-out half to decide which model to use.

In [15]:
def cross_val(X, y):
    import time

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.svm import SVC
    # sklearn.cross_validation was removed; cross_val_score now lives in model_selection
    from sklearn.model_selection import cross_val_score

    forest = RandomForestClassifier(n_estimators=100)
    boost = AdaBoostClassifier()
    svc = SVC()

    # Report the mean cross-validation score and the elapsed time for each model
    t0 = time.time()
    print("Random Forest cross validation running...")
    forest_score = cross_val_score(forest, X, y).mean()
    print("Random Forest Score: %2.2f" % forest_score)
    print("dt: %f" % (time.time() - t0))
    print("")

    t0 = time.time()
    print("AdaBoost cross validation running...")
    boost_score = cross_val_score(boost, X, y).mean()
    print("AdaBoost Score:      %2.2f" % boost_score)
    print("dt: %f" % (time.time() - t0))
    print("")

    t0 = time.time()
    print("SVC cross validation running...")
    svc_score = cross_val_score(svc, X, y).mean()
    print("SVC Score:           %2.2f" % svc_score)
    print("dt: %f" % (time.time() - t0))
    print("")

cross_val(X_cv_sent, cv.Sentiment)

Random Forest cross validation running...
Random Forest Score: 0.50
dt: 0.410119

AdaBoost cross validation running...
AdaBoost Score:      0.58
dt: 0.211438

SVC cross validation running...
SVC Score:           0.69
dt: 0.008381
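
Each score above is a mean over folds: `cross_val_score` returns one score per fold and `.mean()` averages them. The sketch below shows that idea in plain Python. It is not scikit-learn's implementation (which, among other things, stratifies the folds for classifiers); `k_fold_indices` and `majority_class_cv_accuracy` are hypothetical helpers, the labels are invented, and the "model" simply predicts the most common training label.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def majority_class_cv_accuracy(labels, k=3):
    """Mean accuracy over k folds of always predicting the training majority label."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        hits = sum(1 for i in test_idx if labels[i] == majority)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

labels = [2, 2, 1, 2, 3, 2, 2, 0, 2]  # hypothetical sentiment labels
mean_accuracy = majority_class_cv_accuracy(labels, k=3)
```

A majority-class baseline like this one can serve as a reference point when reading classifier scores, since it shows what accuracy is achievable without using the feature at all.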



## Learn with SVC

In [24]:
from sklearn.svm import SVC
svc = SVC()

print("training SVC...")
svc.fit(X_train_sent, train.Sentiment)

# Predict on the held-out half; these predictions feed the higher-level learner
cv_pred = svc.predict(X_cv_sent)
results_cv = pd.DataFrame({
    'PhraseId': cv.PhraseId,
    'Predicted': cv_pred,
    'Sentiment': cv.Sentiment
})
results_cv.to_csv('results_train.csv', index=False)

# Predict on the unlabeled test set
print("predicting...")
y_pred = svc.predict(X_test_sent)

results_test = pd.DataFrame({
    'PhraseId': test.PhraseId,
    'Sentiment': y_pred
})

results_test.to_csv('results_test.csv', index=False)
print("done.")

training SVC...
predicting...
done.


## Kaggle Results

![Kaggle Results]()