# 7. Machine learning with feature extraction on GloVe

In this notebook we'll combine hand-built feature extraction with GloVe vectors to see if we can get a model that performs better than simply averaging the GloVe vectors of the words in each text.

In [None]:
import pickle
import string
import csv
import numpy as np

from sklearn import preprocessing
from scipy.spatial.distance import cosine 

In [None]:
with open("data/sentiment_splits.p", "rb") as f:
    X_train, X_dev, X_test, y_train, y_dev, y_test = pickle.load(f)

In [None]:
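NUM_DIM = 50  # Using more dimensions usually gives better accuracy
PATH_TO_GLOVE = "../glove.6B/"

path_name = PATH_TO_GLOVE + "glove.6B.{0}d.txt".format(NUM_DIM)
# Each line holds a word followed by NUM_DIM space-separated floats.
# QUOTE_NONE stops tokens such as '"' from being treated as quote characters.
with open(path_name) as f:
    reader = csv.reader(f, delimiter=' ', quoting=csv.QUOTE_NONE)
    glove = {line[0]: np.array(list(map(float, line[1:]))) for line in reader}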

In [None]:
# Let's identify a centroid within the GloVe vector space for
# positive and negative words.
#
# There are many seed sets available, or you could create your own
# from scratch or from crawling synonyms in a tool like WordNet.
#
# Here I grab from Turney and Littman 2003.

POS_SEEDS = ['good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior']
NEG_SEEDS = ['bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior']

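def get_centroid(word_set):
    """Average the GloVe vectors of the words in word_set that are in the vocabulary."""
    vectors = [glove[word] for word in word_set if word in glove]
    if not vectors:
        # Avoid a silent division by zero when no seed word is known
        raise ValueError("None of the seed words are in the GloVe vocabulary")
    return np.mean(vectors, axis=0)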


POSITIVE_CENTROID = get_centroid(POS_SEEDS)
NEGATIVE_CENTROID = get_centroid(NEG_SEEDS)


In [None]:
# Let's get a sense of the space
print("Difference between centroids:")
print(POSITIVE_CENTROID - NEGATIVE_CENTROID)
print("\nDistance between centroids:")
print(cosine(POSITIVE_CENTROID, NEGATIVE_CENTROID))
print("\nAverage distance positive seeds -> positive centroid")
print(np.mean([cosine(POSITIVE_CENTROID, glove[word]) for word in POS_SEEDS]))
print("Average distance positive seeds -> negative centroid")
print(np.mean([cosine(NEGATIVE_CENTROID, glove[word]) for word in POS_SEEDS]))
print("\nAverage distance negative seeds -> positive centroid")
print(np.mean([cosine(POSITIVE_CENTROID, glove[word]) for word in NEG_SEEDS]))
print("Average distance negative seeds -> negative centroid")
print(np.mean([cosine(NEGATIVE_CENTROID, glove[word]) for word in NEG_SEEDS]))
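
Note that `scipy.spatial.distance.cosine` returns a distance (one minus the cosine similarity), not a similarity, so smaller numbers mean "closer". A minimal sketch with made-up 2-d vectors (not GloVe vectors) to make the scale concrete:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Toy 2-d vectors, purely illustrative (not taken from GloVe).
a = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])
orthogonal = np.array([0.0, 2.0])
opposite = np.array([-1.0, 0.0])

d_same = cosine(a, same_direction)  # same direction; vector length is ignored
d_orth = cosine(a, orthogonal)      # perpendicular directions
d_opp = cosine(a, opposite)         # opposite directions

print(d_same)
print(d_orth)
print(d_opp)
```

The three values sit at the bottom, middle, and top of the distance's range, which is why the features built later treat small values as "close to the centroid".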

In [None]:
# We want a dataset in which we have clever features that are
# based on the additional information we get from GloVe.

TOTAL_FEATURES = 10

def convert_to_vector(paragraph):
    unigrams = paragraph.split()
    
    num_out_of_vocab = 0
    dist_to_positive = np.zeros(len(unigrams))
    dist_to_negative = np.zeros(len(unigrams))
    for i, word in enumerate(unigrams):
        if word in glove:
            dist_to_positive[i-num_out_of_vocab] = cosine(POSITIVE_CENTROID, glove[word])
            dist_to_negative[i-num_out_of_vocab] = cosine(NEGATIVE_CENTROID, glove[word])
        else:
            num_out_of_vocab += 1
            
    # Trim the unused slots left at the end by out-of-vocabulary words
    if num_out_of_vocab > 0:
        dist_to_positive = dist_to_positive[:-num_out_of_vocab]
        dist_to_negative = dist_to_negative[:-num_out_of_vocab]
        
    # Sort ascending so the closest words (smallest distances) come
    # first and we can take the top-n easily
    dist_to_positive = sorted(dist_to_positive)
    dist_to_negative = sorted(dist_to_negative)
            
    # Derive a feature representation
    representation = np.zeros(TOTAL_FEATURES)
    # First feature: How close is this paragraph on average
    # to the centroid of "positive" words?
    representation[0] = np.mean(dist_to_positive)

    # Second feature: Same for "negative" words
    representation[1] = np.mean(dist_to_negative)

    # Third feature: What's the average of the top 3 words
    # for positive?
    representation[2] = np.mean(dist_to_positive[:3])

    # Fourth feature: What's the average of the top 3 words
    # for negative?
    representation[3] = np.mean(dist_to_negative[:3])

    # Fifth feature: Top 1 positive word?
    representation[4] = dist_to_positive[0]

    # Sixth feature: Top 1 negative word?
    representation[5] = dist_to_negative[0]

    # Seventh feature: What percent of words are out of vocab?
    representation[6] = num_out_of_vocab*1. / len(unigrams)
    
    # Eighth feature: What's the difference between the words'
    # distances to the positive seed centroid vs. the negative
    # seed centroid?
    representation[7] = np.sum(dist_to_positive) - np.sum(dist_to_negative)

    # Ninth feature: On average over the in-vocabulary words, what's
    # the difference between a word's closest distance to the positive
    # seed set vs. the negative seed set?
    # (The leftover loop variable `word` would only cover the last token,
    # so we iterate over every in-vocabulary word explicitly.)
    known = [glove[w] for w in unigrams if w in glove]
    pos_vecs = [glove[s] for s in POS_SEEDS if s in glove]
    neg_vecs = [glove[s] for s in NEG_SEEDS if s in glove]
    if known:
        representation[8] = np.mean(
            [min(cosine(v, p) for p in pos_vecs) - min(cosine(v, n) for n in neg_vecs)
             for v in known])

    # Tenth feature: Same comparison, but using the summed distance to
    # all positive seeds vs. all negative seeds.
    # This is "semantic orientation"
    if known:
        representation[9] = np.mean(
            [sum(cosine(v, p) for p in pos_vecs) - sum(cosine(v, n) for n in neg_vecs)
             for v in known])

    return representation

convert_to_vector("this is an awful dfsd sentence .")

In [None]:
def convert_dataset(dataset):
    dataset_matrix = np.zeros((len(dataset), TOTAL_FEATURES))
    for i,paragraph in enumerate(dataset):
        dataset_matrix[i] = convert_to_vector(paragraph)
    return dataset_matrix

X_train_vector = convert_dataset(X_train)
print(X_train_vector.shape)
X_dev_vector = convert_dataset(X_dev)
print(X_dev_vector.shape)

In [None]:
from sklearn import linear_model

clf = linear_model.LogisticRegression()
clf.fit(X_train_vector, y_train)
y_dev_hat = clf.predict(X_dev_vector)

## Evaluation

In [None]:
# Let's evaluate
# No cross-validation this round, but we can use that in the
# future to get a sense of the variability of the method
from sklearn import metrics

print("Accuracy:")
print(metrics.accuracy_score(y_dev, y_dev_hat))

print("\nClassification metrics:")
print(metrics.classification_report(y_dev, y_dev_hat))

print("\nConfusion matrix:")
print("(Rows are truth, columns are predictions)")
print(metrics.confusion_matrix(y_dev, y_dev_hat))
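
The comment in the cell above mentions cross-validation as a way to gauge variability. A minimal sketch of what that could look like, using synthetic data from `make_classification` as a stand-in for `X_train_vector` and `y_train` (the stand-in data and the choice of 5 folds are assumptions, not part of this notebook's pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train_vector / y_train (10 features, as above).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold cross-validation: each fold yields one held-out accuracy score.
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         cv=5, scoring="accuracy")

print("Mean accuracy: %.3f" % scores.mean())
print("Std across folds: %.3f" % scores.std())
```

The spread across folds gives a rough idea of how much a single dev-set accuracy might move from split to split.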

## Let's scale the data

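It's often a good idea to standardize the features so that each has a mean of 0 and a variance of 1.

But remember, we can only use the training data to estimate the mean and variance; the dev data is then transformed with those same training statistics.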

In [None]:
scaler = preprocessing.StandardScaler().fit(X_train_vector)
X_train_vector_s = scaler.transform(X_train_vector)
X_dev_vector_s = scaler.transform(X_dev_vector)
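
To see what fitting on the training data alone implies, here is a minimal sketch on synthetic matrices (the random data and the `*_demo` names are illustrative assumptions, standing in for the real feature matrices):

```python
import numpy as np
from sklearn import preprocessing

# Synthetic stand-ins for the train / dev feature matrices.
rng = np.random.RandomState(0)
train_demo = rng.normal(loc=5.0, scale=2.0, size=(500, 3))
dev_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Fit on the training data only, then apply the same shift and scale to dev.
demo_scaler = preprocessing.StandardScaler().fit(train_demo)
train_demo_s = demo_scaler.transform(train_demo)
dev_demo_s = demo_scaler.transform(dev_demo)

# Training columns are standardized exactly (up to floating-point rounding).
print(train_demo_s.mean(axis=0))
print(train_demo_s.std(axis=0))
# Dev columns are only approximately standardized, because the statistics
# came from the training set.
print(dev_demo_s.mean(axis=0))
print(dev_demo_s.std(axis=0))
```

The dev set ending up only approximately standardized is expected and harmless; what matters is that no information from the dev set leaks into the transformation.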

In [None]:
rev_clf = linear_model.LogisticRegression()
rev_clf.fit(X_train_vector_s, y_train)
y_dev_hat = rev_clf.predict(X_dev_vector_s)

In [None]:
# Let's evaluate again, this time on the scaled features
from sklearn import metrics

print("Accuracy:")
print(metrics.accuracy_score(y_dev, y_dev_hat))

print("\nClassification metrics:")
print(metrics.classification_report(y_dev, y_dev_hat))

print("\nConfusion matrix:")
print("(Rows are truth, columns are predictions)")
print(metrics.confusion_matrix(y_dev, y_dev_hat))

So we do get a gain from scaling.

We're doing pretty well for only 10 features, and since many of these features are probably correlated with each other, we could probably do about as well with even fewer.

But really, we would probably want to combine this information with the unigram model that was performing best earlier.
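
One possible way to combine the two feature sets is to stack them column-wise. The sketch below is only an illustration: it assumes the unigram model is built with `CountVectorizer`, uses a four-sentence toy corpus, and fills in random numbers where the output of `convert_dataset` would go.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels, purely illustrative.
texts = ["a good film", "a bad film", "an excellent story", "a nasty story"]
toy_labels = [1, 0, 1, 0]

# Sparse unigram counts (assumed representation of the unigram model).
vectorizer = CountVectorizer()
X_unigram = vectorizer.fit_transform(texts)

# Placeholder dense features standing in for convert_dataset(texts).
rng = np.random.RandomState(0)
X_dense = rng.rand(len(texts), 10)

# Stack the two blocks side by side, keeping the result sparse.
X_combined = hstack([X_unigram, csr_matrix(X_dense)]).tocsr()

combined_clf = LogisticRegression().fit(X_combined, toy_labels)
print(X_combined.shape)
```

Keeping the combined matrix sparse matters once the unigram vocabulary is large; the ten dense columns add little to its size.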

## Let's see which features are most helpful

We can look at the coefficients on our model to get a sense for which features matter most.

In [None]:
print(rev_clf.coef_)

This means that:

In [None]:
labels = ["How close to the positive centroid on average?",
"How close to the negative centroid on average?",
"How close are the top 3 words to the positive centroid?",
"How close are the top 3 words to the negative centroid?",
"How close is the closest word to the positive centroid?",
"How close is the closest word to the negative centroid?",
"What percent of words are out of vocabulary?",
"What's the difference between dist to positive centroid & negative centroid?",
"What's the difference between closest dist to positive vs. negative word?",
"What's the semantic orientation?"]

# Pair each label with its coefficient and sort by absolute magnitude
data = zip(labels, rev_clf.coef_[0])
data_sorted = sorted(data, key=lambda x: abs(x[1]), reverse=True)

print("From most important to least important feature:")
for label, coef in data_sorted:
    print("%5.2f %s" % (coef, label))

Interpreting the signs takes a little care.

These features are distances, so large values mean "not closely related". A negative coefficient on a positive-centroid feature therefore means that smaller distances to the positive centroid push the prediction toward the positive class. Similarly, a positive coefficient on a negative-centroid feature means that larger distances from the negative centroid push the prediction toward the positive class.

This suggests that features aggregated over many words are more stable. One-off signals, such as the single closest word, aren't as useful in this context as the sentiment of the sentence as a whole.
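
To make the sign convention concrete, here is a minimal sketch on synthetic data. The two "distance" columns are generated by hand so that positive examples sit closer to a hypothetical positive centroid and negative examples closer to a hypothetical negative one; nothing here comes from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 400
y_toy = rng.randint(0, 2, size=n)  # 1 = positive class, 0 = negative class

# Synthetic "distance to positive centroid": smaller for positive examples.
dist_pos = np.where(y_toy == 1, 0.3, 0.7) + rng.normal(scale=0.1, size=n)
# Synthetic "distance to negative centroid": smaller for negative examples.
dist_neg = np.where(y_toy == 1, 0.7, 0.3) + rng.normal(scale=0.1, size=n)

X_toy = np.column_stack([dist_pos, dist_neg])
toy_clf = LogisticRegression().fit(X_toy, y_toy)

# coef_[0] holds the weights for predicting class 1 (the positive class).
coef_pos, coef_neg = toy_clf.coef_[0]
print("distance to positive centroid: %.2f" % coef_pos)
print("distance to negative centroid: %.2f" % coef_neg)
```

The weight on the positive-centroid distance comes out negative and the weight on the negative-centroid distance comes out positive, matching the reading of the signs given above.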