# 2. Machine learning on words

In this notebook we'll assign better weights to the words we identified earlier -- weights that are fit to the training set rather than chosen by hand. To do so, we first need to represent each paragraph as a vector of numbers.

In [None]:
import pickle
import csv
import numpy as np

In [None]:
with open("data/sentiment_splits.p", "rb") as f:
    X_train, X_dev, X_test, y_train, y_dev, y_test = pickle.load(f)

In [None]:
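# Get the set of words we're interested in
# (the hand-assigned scores in the file are ignored here)
words_of_interest = {}
with open("data/hand_weights.csv") as f:
    reader = csv.reader(f)
    for word, score in reader:
        words_of_interest[word] = 1
# Freeze the words into a list so the feature columns keep a fixed order
words_of_interest = list(words_of_interest.keys())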

In [None]:
# Check out the words we'll learn information about
print(words_of_interest)

In [None]:
# Create a function that will convert each paragraph to a vector.
#
# The presence of each word in the wordlist is a feature.
# So a cell is 1 if the word of interest appears in the
# paragraph, and 0 otherwise
def convert_to_vector(paragraph):
    representation = np.zeros(len(words_of_interest))
    for i, word in enumerate(words_of_interest):
        # Note: this is a substring test on the raw paragraph string
        if word in paragraph:
            representation[i] = 1
    return representation

In [None]:
# Test the conversion
print(convert_to_vector("not bad"))
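
One caveat worth knowing: `word in paragraph` is a substring test on the raw string, so a word such as "bad" also matches inside "badly". If whole-word matching is preferred, a minimal sketch might look like the following. The names `tokenize` and `convert_to_vector_tokens` and the toy wordlist are hypothetical, not part of this notebook, and the tokenizer is a deliberately simple assumption. Multi-word entries, if the wordlist has any, would need separate handling.

```python
import re

def tokenize(paragraph):
    # Lowercase, then keep runs of letters and apostrophes as tokens
    # (a simplifying assumption about what counts as a word)
    return set(re.findall(r"[a-z']+", paragraph.lower()))

def convert_to_vector_tokens(paragraph, wordlist):
    # 1.0 if the word appears as a whole token, 0.0 otherwise
    tokens = tokenize(paragraph)
    return [1.0 if word in tokens else 0.0 for word in wordlist]

toy_wordlist = ["bad", "good", "not"]  # illustrative wordlist only
print("bad" in "not badly done")       # the substring test matches
print(convert_to_vector_tokens("Not badly done", toy_wordlist))
```

This version returns a plain list; wrapping the result in `np.array` would make it a drop-in row for the dataset matrix built below.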

In [None]:
def convert_dataset(dataset):
    # Convert every paragraph in the dataset to its vector representation
    dataset_vector = np.zeros((len(dataset), len(words_of_interest)))
    for i, paragraph in enumerate(dataset):
        dataset_vector[i] = convert_to_vector(paragraph)
    return dataset_vector

X_train_vector = convert_dataset(X_train)
print(X_train_vector.shape)
X_dev_vector = convert_dataset(X_dev)
print(X_dev_vector.shape)

In [None]:
from sklearn import linear_model

clf = linear_model.LogisticRegression()
clf.fit(X_train_vector, y_train)
y_dev_hat = clf.predict(X_dev_vector)
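
Since the goal is to learn a weight per word, it can help to look at what the model actually learned. For a binary problem, a fitted `LogisticRegression` exposes one coefficient per feature column in `clf.coef_[0]`, in the same order as the columns of `X_train_vector`; positive values push toward the class in `clf.classes_[1]`. The helper below is a hypothetical sketch, and the words and numbers in it are made up for illustration only.

```python
def rank_weights(words, coefficients):
    # Pair each word with its coefficient, sorted from most negative to most positive
    return sorted(zip(words, coefficients), key=lambda pair: pair[1])

# Illustrative values only. In this notebook the call would be
# rank_weights(words_of_interest, clf.coef_[0])
example_words = ["awful", "fine", "great"]
example_coefs = [-1.2, 0.1, 0.9]
for word, weight in rank_weights(example_words, example_coefs):
    print("%-10s %+.2f" % (word, weight))
```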

## Evaluation

In [None]:
# Let's evaluate
# No cross-validation this round, but we can use that in the
# future to get a sense of the variability of the method
from sklearn import metrics

print("Accuracy:")
print(metrics.accuracy_score(y_dev, y_dev_hat))

print("")

print("Classification metrics:")
print(metrics.classification_report(y_dev, y_dev_hat))

print("")

print("Confusion matrix:")
print("(Rows are truth, columns are predictions)")
print(metrics.confusion_matrix(y_dev, y_dev_hat))
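
To make the relationship between these numbers concrete, here is a small sketch that derives accuracy, precision, recall, and F1 from the four confusion-matrix counts. It assumes binary labels encoded as 0 and 1 with 1 as the positive class, which may not match this dataset's encoding; the function names and toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true/false positives and negatives for a binary problem
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def summarize(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / float(total),
            "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only
toy_truth = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
print(confusion_counts(toy_truth, toy_pred))
print(summarize(toy_truth, toy_pred))
```

With labels sorted as 0 then 1, `metrics.confusion_matrix` lays these counts out as `[[tn, fp], [fn, tp]]`.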

Our performance went up relative to the hand-assigned weights!  That's good, and what we'd expect.

Let's brainstorm some possible reasons why it isn't great yet, though:
- Too much regularization
- Overfit to the training data
- Too little data, such that what we're learning is noisy
- Too many rows contain none of our words of interest, so their feature vectors are all zeros
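
The last hypothesis in the list is easy to check directly. A minimal sketch, where `fraction_all_zero` is a hypothetical helper and the toy matrix is illustrative only:

```python
def fraction_all_zero(vectors):
    # Share of rows in which no word of interest was found
    empty = sum(1 for row in vectors if not any(row))
    return empty / float(len(vectors))

# Toy matrix for illustration; two of the four rows are empty
toy_vectors = [[0, 1, 0], [0, 0, 0], [0, 0, 0], [1, 1, 0]]
print(fraction_all_zero(toy_vectors))
```

In this notebook the call would be `fraction_all_zero(X_train_vector)`. For rows like these the model can only fall back on its intercept, so a large fraction would put a ceiling on accuracy.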

It turns out regularization isn't the problem. In scikit-learn, C is the inverse of the regularization strength, so you can weaken regularization by passing C=2.0 or higher to the LogisticRegression constructor.  Regardless of the value, performance doesn't improve.
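
For reference, here is why larger C means less regularization. With the default L2 penalty and labels written as $y_i \in \{-1, +1\}$, the scikit-learn user guide states the objective in roughly this form (newer versions rescale it, which does not change the minimizer):

$$
\min_{w,\, b} \;\; \frac{1}{2} w^\top w \;+\; C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + b\right)\right)\right)
$$

The first term penalizes large weights and the second is the data-fit term. Increasing C makes the data-fit term count for more relative to the penalty, so the weights are freer to grow; decreasing C shrinks them toward zero.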

Next, let's check whether we're overfitting or underfitting...

In [None]:
# validation_curve lives in sklearn.model_selection
# (the old sklearn.learning_curve module was removed in scikit-learn 0.20)
from sklearn.model_selection import validation_curve

param = "C"
param_range = [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0]
train_scores, test_scores = validation_curve(
    linear_model.LogisticRegression(),
    X_train_vector,
    y_train,
    cv=10,
    param_name=param,
    param_range=param_range,
    scoring="f1")

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

print(train_mean)
print(train_std)
print(test_mean)
print(test_std)

plt.title("Validation Curve")
plt.xlabel(param)
plt.ylabel("F1")
plt.ylim(0.0, 1.0)
plt.semilogx(param_range,
             train_mean, 
             label="Training score", 
             color="r")
plt.fill_between(param_range,
            train_mean - train_std,
            train_mean + train_std,
            alpha=0.2,
            color="r")
plt.semilogx(param_range,
             test_mean, 
             label="Crossvalidation score", 
             color="g")
plt.fill_between(param_range,
            test_mean - test_std,
            test_mean + test_std,
            alpha=0.2,
            color="g")
plt.legend(loc="best")
plt.show()

This looks pretty great -- we're not overfitting or underfitting much at all, and the amount of regularization is reasonably good where it is (though decreasing it a bit might improve performance).
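
Reading the plot by eye can be backed up with numbers. `validation_curve` returns one row of per-fold scores for each parameter value, so the train/validation gap and the best-scoring value can be computed directly. The sketch below uses a hypothetical helper, `summarize_curve`, and made-up scores that are not results from this notebook.

```python
def summarize_curve(param_range, train_scores, test_scores):
    # Each row of train_scores/test_scores holds the per-fold scores
    # for one parameter value
    rows = []
    for value, train_row, test_row in zip(param_range, train_scores, test_scores):
        train_mean = sum(train_row) / float(len(train_row))
        test_mean = sum(test_row) / float(len(test_row))
        # A large positive gap suggests overfitting at this value
        rows.append((value, train_mean, test_mean, train_mean - test_mean))
    # Best value = highest mean cross-validation score
    best = max(rows, key=lambda row: row[2])
    return rows, best

# Made-up scores for illustration only (3 parameter values, 2 folds each)
toy_range = [0.1, 1.0, 10.0]
toy_train = [[0.60, 0.62], [0.70, 0.72], [0.80, 0.82]]
toy_test = [[0.58, 0.60], [0.66, 0.68], [0.64, 0.66]]

rows, best = summarize_curve(toy_range, toy_train, toy_test)
for value, train_mean, test_mean, gap in rows:
    print("C=%-5s train=%.3f cv=%.3f gap=%.3f" % (value, train_mean, test_mean, gap))
print("Best C by cross-validation score: %s" % best[0])
```

In this notebook the call would be `summarize_curve(param_range, train_scores, test_scores)`. A small gap together with low scores everywhere points to underfitting; a gap that widens as C grows points to overfitting.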

Overall, this is great news. We can get a 5% gain over just guessing by using the small number of words we thought a priori might matter.