# 6. Machine learning starting from GloVe

In this notebook we'll build on the unigram model by representing words with GloVe dense vectors.

Dense representations such as GloVe and word2vec are helpful because they map words into a shared vector space in which direction and distance carry meaning. Nearby words tend to be similar to each other, and some analogies can be solved by comparing the direction and distance between pairs of words. For graphics and more details, see the GloVe page: http://nlp.stanford.edu/projects/glove/
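
To make "direction and distance have meaning" concrete, here is a minimal sketch using made-up 3-dimensional vectors. The `toy` dictionary and the helper names `cosine_similarity` and `solve_analogy` are assumptions for illustration only; real GloVe vectors are learned from text and behave far less neatly.

```python
import numpy as np

# Hypothetical 3-d "embeddings", invented for illustration (not real GloVe values).
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
    "apple": np.array([0.0, 0.1, 0.1]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# "man is to king as woman is to ?"
analogy_answer = solve_analogy("man", "king", "woman", toy)
print(analogy_answer)
```

With these toy numbers the offset `king - man + woman` lands on `queen`; with real embeddings the nearest neighbour is only sometimes the expected word.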

In [None]:
import pickle
import csv
import string
import numpy as np
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

In [None]:
with open("data/sentiment_splits.p", "rb") as f:
    X_train, X_dev, X_test, y_train, y_dev, y_test = pickle.load(f)

In [None]:
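NUM_DIM = 50  # glove.6B also ships 100, 200 and 300 dimensions; more may improve accuracy
PATH_TO_GLOVE = "../glove.6B/"

path_name = PATH_TO_GLOVE + "glove.6B.{0}d.txt".format(NUM_DIM)

# Each line holds a token followed by NUM_DIM space-separated floats.
# QUOTE_NONE keeps tokens such as '"' from being treated as CSV quote characters.
with open(path_name) as f:
    reader = csv.reader(f, delimiter=' ', quoting=csv.QUOTE_NONE)
    glove = {line[0]: np.array(list(map(float, line[1:]))) for line in reader}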
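
The loading step relies on the layout of the GloVe text files: one token per line, followed by its vector components, separated by single spaces. The sketch below parses two made-up lines in that layout from an in-memory buffer, so it runs without the real download. The sample values and the name `sample` are invented for illustration.

```python
import csv
import io
import numpy as np

# Two made-up lines in the GloVe text layout (values are not real GloVe numbers).
# The second token is a double-quote character, which is a legitimate token.
sample = io.StringIO('the 0.1 -0.2 0.3\n" 0.4 0.5 -0.6\n')

# QUOTE_NONE tells the csv module to treat '"' as an ordinary character.
reader = csv.reader(sample, delimiter=" ", quoting=csv.QUOTE_NONE)
vectors = {row[0]: np.array([float(x) for x in row[1:]]) for row in reader}

for token, vector in vectors.items():
    print(repr(token), vector)
```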

In [None]:
# We want a dataset in which each example is an average of its 
# individual word embeddings

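def convert_to_vector(paragraph):
    unigrams = paragraph.split()
    representation = np.zeros(NUM_DIM)
    ctr = 0.
    for word in unigrams:
        if word in glove:
            representation += glove[word]
            ctr += 1
    if ctr == 0:
        # No word had a GloVe vector; return zeros instead of dividing by zero
        return representation
    return representation / ctr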

def convert_dataset(dataset):
    dataset_matrix = np.zeros((len(dataset), NUM_DIM))
    for i,paragraph in enumerate(dataset):
        dataset_matrix[i] = convert_to_vector(paragraph)
    return dataset_matrix

X_train_vector = convert_dataset(X_train)
print(X_train_vector.shape)
X_dev_vector = convert_dataset(X_dev)
print(X_dev_vector.shape)
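
As a small worked example of what averaging word embeddings does to a text, the sketch below uses a hypothetical 2-dimensional vocabulary (`toy_glove`) and a stand-alone helper (`average_embedding`); both are assumptions for illustration. Returning a zero vector when no token is known is one possible convention, not the only one.

```python
import numpy as np

# Hypothetical 2-d vocabulary, invented for illustration;
# it stands in for the much larger `glove` dictionary.
toy_glove = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "not": np.array([0.0, 1.0]),
}

def average_embedding(text, vectors, dim):
    """Mean of the vectors of in-vocabulary tokens; zeros if none are known."""
    found = [vectors[token] for token in text.split() if token in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Unknown tokens ("zzz") are skipped rather than counted in the denominator.
with_unknown = average_embedding("good zzz not", toy_glove, dim=2)

# Word order is discarded: these two phrases get the same representation.
first = average_embedding("good not bad", toy_glove, dim=2)
second = average_embedding("bad not good", toy_glove, dim=2)
print(with_unknown, np.allclose(first, second))
```

Because the average ignores word order, phrases with opposite meanings can collapse to the same vector, which is worth keeping in mind when reading the results.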

In [None]:
from sklearn import linear_model

clf = linear_model.LogisticRegression()
clf.fit(X_train_vector, y_train)
y_dev_hat = clf.predict(X_dev_vector)

# Evaluation

In [None]:
# Let's evaluate
# No cross-validation this round, but we can use that in the 
# future to get a sense of the variability of the method
from sklearn import metrics

print("Accuracy:")
print(metrics.accuracy_score(y_dev, y_dev_hat))

print("")

print("Classification metrics:")
print(metrics.classification_report(y_dev, y_dev_hat))

print("")

print("Confusion matrix:")
print("(Rows are truth, columns are predictions)")
print(metrics.confusion_matrix(y_dev, y_dev_hat))
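
To make the row and column convention concrete, here is a minimal sketch that tallies a confusion matrix by hand on five made-up labels. The labels `"neg"` and `"pos"` and the helper name `confusion_matrix_rows_truth` are hypothetical; the layout is meant to mirror the "rows are truth, columns are predictions" convention noted above.

```python
import numpy as np

def confusion_matrix_rows_truth(y_true, y_pred, labels):
    """Count (truth, prediction) pairs; entry [i, j] is truth=labels[i], predicted=labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth], index[pred]] += 1
    return matrix

# Made-up labels for illustration only.
y_true = ["neg", "neg", "neg", "pos", "pos"]
y_pred = ["neg", "neg", "pos", "pos", "neg"]

cm = confusion_matrix_rows_truth(y_true, y_pred, labels=["neg", "pos"])
# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = float(np.trace(cm)) / cm.sum()
print(cm)
print(accuracy)
```

Here the top-right entry counts true `"neg"` examples predicted as `"pos"`, and the bottom-left entry counts the reverse.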

So blindly averaging word vectors isn't the best way to assess sentiment.

What if we instead used the single word in the sentence that is closest to a set of "positive words"? Or what if we based the classification on the distance to the positive words vs. the negative words? That may be a better use of the dense vector representations. We'll turn to that next.
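
As a rough preview of the second idea, the sketch below scores a text by how much closer its words are to a positive centroid than to a negative one. Everything here is an assumption for illustration: the 2-dimensional `toy` vectors, the seed lists, and the helper names are invented, and this is not necessarily the approach taken later.

```python
import numpy as np

# Hypothetical 2-d embeddings, invented for illustration (not real GloVe values).
toy = {
    "great": np.array([0.9, 0.1]),
    "wonderful": np.array([0.8, 0.2]),
    "awful": np.array([0.1, 0.9]),
    "terrible": np.array([0.2, 0.8]),
    "superb": np.array([0.85, 0.2]),
    "dreadful": np.array([0.15, 0.9]),
    "movie": np.array([0.5, 0.5]),
}
POSITIVE_SEEDS = ["great", "wonderful"]  # hypothetical seed words
NEGATIVE_SEEDS = ["awful", "terrible"]   # hypothetical seed words

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def centroid(words, vectors):
    """Mean vector of a list of seed words."""
    return np.mean([vectors[w] for w in words], axis=0)

def polarity_score(text, vectors):
    """Average, over known words, of similarity to the positive centroid
    minus similarity to the negative centroid. Above zero leans positive."""
    pos_c = centroid(POSITIVE_SEEDS, vectors)
    neg_c = centroid(NEGATIVE_SEEDS, vectors)
    scores = [
        cosine_similarity(vectors[w], pos_c) - cosine_similarity(vectors[w], neg_c)
        for w in text.split() if w in vectors
    ]
    return float(np.mean(scores)) if scores else 0.0

print(polarity_score("superb movie", toy))
print(polarity_score("dreadful movie", toy))
```

In this toy setup a neutral word like "movie" sits equally far from both centroids and contributes roughly nothing, while sentiment-bearing words push the score in one direction.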