# 5. Machine learning with smarter features

In this notebook we'll augment the unigram model, often called a "bag of words" model, with a handful of smarter features.  We'll start doing *feature engineering*.  

You should be able to add some features of your own that might also improve performance.

In [None]:
import pickle
import string
import numpy as np
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

In [None]:
with open("data/sentiment_splits.p", "rb") as f:
    X_train, X_dev, X_test, y_train, y_dev, y_test = pickle.load(f)

In [None]:
# Count how many training documents each word occurs in (its document
# frequency); further down we keep only words that occur in at least 2 documents
words_of_interest = Counter()
for item in X_train:
    for word in set(item.split()):
        words_of_interest[word] += 1
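
To see why the loop iterates over `set(item.split())` rather than `item.split()`, here is a minimal sketch on a made-up three-document corpus (`toy_docs` is hypothetical, not the sentiment data). Taking the set first means a word repeated inside one document is counted once, so the counter holds document frequencies rather than raw token counts.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "good good good movie",
    "bad movie",
    "not a good plot",
]

doc_freq = Counter()    # number of documents containing each word
token_freq = Counter()  # total number of occurrences of each word
for doc in toy_docs:
    tokens = doc.split()
    token_freq.update(tokens)
    doc_freq.update(set(tokens))

print(token_freq["good"])  # 4 occurrences in total
print(doc_freq["good"])    # but only 2 documents contain it

# Words that appear in more than one document
known = sorted(w for w, n in doc_freq.items() if n > 1)
print(known)  # ['good', 'movie']
```

With a threshold on document frequency, a word that one reviewer repeats many times in a single review is still treated as rare.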

In [None]:
# Let's build a feature function that prefixes every word that falls
# between "not" and the next punctuation mark with "NOT_",
# as per:
# Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
# Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
#
# This could be much more complicated -- we could use parsing, 
# for instance -- so that's something to play with if you'd like.
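# PUNCTUATION is a string, so `word in PUNCTUATION` is a substring test;
# appending "--" makes the token "--" count as punctuation too.
PUNCTUATION = string.punctuation + "--"
# Keep only words that occur in more than one training document;
# everything else is mapped to "UNK" below.
KNOWN_WORDS = {word for word, count in words_of_interest.items() if count > 1}
def get_notted_unigrams(paragraph, verbose=False):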
    preceded_by_not = False
    
    unigrams = paragraph.split()
    for i, word in enumerate(unigrams):
        if word not in KNOWN_WORDS:
            unigrams[i] = word = "UNK"
        
        if word in PUNCTUATION:
            preceded_by_not = False
        elif preceded_by_not:
            unigrams[i] = "NOT_"+word
        elif word == "not":
            preceded_by_not = True
    if verbose:
        print(unigrams)
    return Counter(unigrams)

print("Samples:")
get_notted_unigrams("the abject big brown bear is not happy .", True)
get_notted_unigrams("it was not terrible and not awful ; in fact , it was quite good .", True)
print()

In [None]:
# This time we leverage the DictVectorizer to get features:
# we walk over all of the training data to build a list of feature
# dicts (one Counter per document), and the vectorizer then turns
# that list into a sparse matrix.
X_train_feat_dicts = [get_notted_unigrams(item) for item in X_train]

# Fit the vectorizer on the training feature dicts and build the matrix
vectorizer = DictVectorizer(sparse=True)
X_train_matrix = vectorizer.fit_transform(X_train_feat_dicts)


In [None]:
# Use the same vectorizer to transform the dev data
X_dev_feat_dicts = [get_notted_unigrams(item) for item in X_dev]

X_dev_matrix = vectorizer.transform(X_dev_feat_dicts)
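
As a rough illustration of what the vectorizer does with these feature dicts, here is a small sketch on made-up inputs (`toy_train`, `toy_dev` and `toy_vectorizer` are hypothetical names, standing in for the output of `get_notted_unigrams`). It shows that `fit_transform` assigns one column per feature name seen in training, and that `transform` silently drops features it has never seen.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy feature dicts, for illustration only
toy_train = [
    Counter({"it": 1, "was": 1, "good": 1}),
    Counter({"it": 1, "was": 1, "not": 1, "NOT_good": 1}),
]
# "NOT_awful" never appears in toy_train
toy_dev = [Counter({"good": 2, "NOT_awful": 1})]

toy_vectorizer = DictVectorizer(sparse=True)
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)
toy_dev_matrix = toy_vectorizer.transform(toy_dev)

# One column per feature name seen during fit, in sorted order
print(toy_vectorizer.feature_names_)
print(toy_train_matrix.toarray())
# The unseen "NOT_awful" feature has no column, so only "good" is nonzero
print(toy_dev_matrix.toarray())
```

This is why the dev data must go through `transform` on the already-fitted vectorizer rather than a fresh `fit_transform`: the columns have to line up with the ones the classifier was trained on.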

In [None]:
from sklearn import linear_model

clf = linear_model.LogisticRegression()
clf.fit(X_train_matrix, y_train)
y_dev_hat = clf.predict(X_dev_matrix)

# Evaluation

In [None]:
# Let's evaluate
# No cross-validation this round, but we can use that in the 
# future to get a sense of the variability of the method
from sklearn import metrics

print("Accuracy:")
print(metrics.accuracy_score(y_dev, y_dev_hat))

print()

print("Classification metrics:")
print(metrics.classification_report(y_dev, y_dev_hat))

print()

print("Confusion matrix:")
print("(Rows are truth, columns are predictions)")
print(metrics.confusion_matrix(y_dev, y_dev_hat))

It looks like incorporating this feature had basically no effect on overall accuracy. Looking at the per-class metrics, though, recall goes up for the negative cases and precision goes up for the positive cases, which suggests that we are capturing negativity better now. This is what we were going for, so it's good news!
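
Negative recall and positive precision tend to move together because both depend on the same cell of the confusion matrix: negative reviews that were predicted positive. The sketch below uses made-up counts (they are NOT this notebook's results) and a hypothetical helper, `precision_recall`, to show the effect of moving five such reviews into the correct cell.

```python
def precision_recall(confusion, cls):
    """Return (precision, recall) for class index `cls`.

    Rows of `confusion` are truth, columns are predictions.
    """
    true_pos = confusion[cls][cls]
    predicted = sum(row[cls] for row in confusion)  # column sum
    actual = sum(confusion[cls])                    # row sum
    return true_pos / float(predicted), true_pos / float(actual)

# Hypothetical counts; class order is [negative, positive]
before = [
    [40, 10],  # truly negative: 40 predicted negative, 10 predicted positive
    [5, 45],   # truly positive: 5 predicted negative, 45 predicted positive
]
# Same data after 5 negative reviews are no longer mistaken for positive
after = [
    [45, 5],
    [5, 45],
]

for name, confusion in [("before", before), ("after", after)]:
    neg_precision, neg_recall = precision_recall(confusion, 0)
    pos_precision, pos_recall = precision_recall(confusion, 1)
    print("%s: negative recall %.2f, positive precision %.2f"
          % (name, neg_recall, pos_precision))
```

In this toy case negative recall rises from 0.80 to 0.90 and positive precision from about 0.82 to 0.90, the same pattern described above.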

But it would be nice to get a sense of the statistical range of what's going on here. So let's switch to cross-validation and compare the "NOT_"-augmented model with the original model.

In [None]:
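# Cross-validation on this model
# (sklearn.cross_validation was removed in scikit-learn 0.20;
# cross_val_score now lives in sklearn.model_selection)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X_train_matrix, y_train, cv=10)

# Mean +/- 2 standard deviations is only a rough 95% interval
print("The 95% confidence interval on the model performance is...")
print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(scores), 2*np.std(scores)))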

This suggests that the new model performs statistically similarly to the original on this training corpus. However, since we are now able to differentiate "good" from "not good", we have a theoretical reason to prefer the current model.
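
One informal way to read "statistically similar" is to check whether the two models' intervals overlap. The sketch below does that with made-up per-fold accuracies (`baseline_scores` and `notted_scores` are hypothetical numbers, NOT results from this notebook). Overlapping intervals are only a rough heuristic, not a formal significance test.

```python
import statistics

# Hypothetical per-fold accuracies for two models (made-up numbers)
baseline_scores = [0.76, 0.79, 0.74, 0.78, 0.77, 0.75, 0.80, 0.76, 0.78, 0.77]
notted_scores = [0.77, 0.78, 0.75, 0.79, 0.78, 0.74, 0.80, 0.77, 0.79, 0.76]

def interval(scores):
    """Mean +/- 2 population standard deviations of the fold scores."""
    mean = statistics.mean(scores)
    # pstdev is the population standard deviation, like np.std's default
    spread = 2 * statistics.pstdev(scores)
    return mean - spread, mean + spread

def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

baseline_interval = interval(baseline_scores)
notted_interval = interval(notted_scores)
print("baseline:   %.3f to %.3f" % baseline_interval)
print("NOT_ model: %.3f to %.3f" % notted_interval)
print("Intervals overlap: %s" % overlaps(baseline_interval, notted_interval))
```

When the same folds are used for both models, comparing the per-fold differences directly is usually more sensitive than comparing two separate intervals.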

To further improve performance, you might want to introduce other feature functions, or improvements to the current feature function (like stopping at phrase boundaries instead of punctuation).
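
As one possible starting point, here is a simplified sketch of a feature function that also ends the negation scope at a few clause-boundary words. The word list `SCOPE_ENDERS` and the function name `get_notted_unigrams_clause` are assumptions made for illustration, and the sketch leaves out the "UNK" handling used earlier to stay self-contained. A parser-based approach would find phrase boundaries more reliably.

```python
import string
from collections import Counter

PUNCTUATION_TOKENS = set(string.punctuation) | {"--"}
# Hypothetical list of words treated as clause boundaries (an assumption)
SCOPE_ENDERS = {"but", "and", "or", "although", "however"}

def get_notted_unigrams_clause(paragraph):
    """Prefix words after "not" with "NOT_" until punctuation
    or a scope-ending word is reached."""
    in_scope = False
    features = []
    for word in paragraph.split():
        if word in PUNCTUATION_TOKENS or word in SCOPE_ENDERS:
            in_scope = False          # boundary: stop negating
            features.append(word)
        elif word == "not":
            in_scope = True           # start (or continue) negating
            features.append(word)
        elif in_scope:
            features.append("NOT_" + word)
        else:
            features.append(word)
    return Counter(features)

feats = get_notted_unigrams_clause("it was not terrible and quite good .")
print(sorted(feats.items()))
```

Here "terrible" is negated, but "quite" and "good" are not, because "and" closes the scope. Whether that helps on real data is an empirical question that the cross-validation setup above can answer.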