# 1. Hand-engineered features

In this notebook we'll hand-engineer some word features and score our performance.

In [None]:
import pickle
import csv
import random
import numpy as np

In [None]:
with open("data/sentiment_splits.p", "rb") as f:
    X_train, X_dev, X_test, y_train, y_dev, y_test = pickle.load(f)

Let's frame this problem as averaging word scores across a sentence. Each word gets a score between -1 (strongly negative) and 1 (strongly positive).

We might start out by generating this list:
- good, 1.0
- enjoyable, 1.0
- interesting, 0
- bad, -1.0
- really, 0
- very, 0
- boring, -1

We might also want to include phrases, like:
- edge of my seat, 1
- couldn't stop eating my popcorn, 1

For this exercise, we'll stick with single words ("unigrams") for simplicity.
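The averaging idea can be sketched in a few lines. This is only an illustration, assuming each sentence is already split into lowercase tokens; `toy_weights` and `average_score` are names made up for this example, and words without a score simply count as 0 here.

```python
# Toy weights copied from the starter list above
toy_weights = {
    "good": 1.0, "enjoyable": 1.0, "interesting": 0.0, "bad": -1.0,
    "really": 0.0, "very": 0.0, "boring": -1.0,
}

def average_score(tokens, weights):
    """Average the per-word scores; words without a score count as 0."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty sentence
    return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)

# A positive average suggests a positive review, a negative one the opposite
print(average_score("a really good and enjoyable movie".split(), toy_weights))
print(average_score("very boring".split(), toy_weights))
```

Note that unscored words dilute the average without changing its sign, so long sentences with few known words end up with scores close to 0.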

Figuring out which words to include from nothing is hard!

Let's take a look at some training data to help us along.

In [None]:
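# Materialize the pairs as a list so they can be iterated more than once
# (in Python 3, zip returns a one-shot iterator, which would leave neg empty)
X = list(zip(X_train, y_train))
pos = [x for x in X if x[1] == 1]
neg = [x for x in X if x[1] == -1]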

In [None]:
# Positive first...
for item in pos[0:5]:
    print(item)

Let's keep:
- nice, 0.75
- marvelous, 1
- courage, 0.75
- emotional, 0.75

Do you see anything else?

In [None]:
# Now negative...
for item in neg[0:5]:
    print(item)

Let's keep:
- somber, -0.25
- little, -0.1
- funny, 1
- extraordinarily, 0.8
- beautiful, 1
- wonderful, 1
- weird, -1

This is clearly time-consuming. It's hard to list all the variants of the same word, though stemming can help with that. The scores are also driven by personal judgments that are hard to justify.
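To make the stemming point concrete, here is a rough sketch. The `crude_stem` function is a made-up toy for this example, not a real stemming algorithm (a library stemmer such as Porter's would be the usual choice), and `stem_weights` is a hypothetical lexicon keyed by stem instead of by surface form.

```python
def crude_stem(word):
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when a stem of at least 3 letters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# One hand-assigned score, keyed by the stem rather than the full word
stem_weights = {crude_stem("boring"): -1.0}

# All three variants share the stem "bor", so one entry covers them
for variant in ["boring", "bored", "bores"]:
    print(variant, stem_weights.get(crude_stem(variant), 0.0))
```

A stemmer this simple misses plenty of cases ("bore" keeps its final "e" and would not match), which is part of why hand-built lexicons are hard to get right.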

Within the problem itself, much of the meaning seems to be compositional rather than word-level. It's also not always clear from a paragraph alone, out of context, whether the review is marked positive or negative.
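Negation is one simple case of compositional meaning that word-level scores cannot capture. The sketch below uses a hypothetical two-word lexicon (`demo_weights`) and a made-up helper (`unigram_sum`) to show that a unigram model scores a sentence and its negation identically.

```python
demo_weights = {"good": 1.0, "bad": -1.0}  # hypothetical two-word lexicon

def unigram_sum(tokens):
    """Sum word scores independently, ignoring word order and context."""
    return sum(demo_weights.get(tok, 0.0) for tok in tokens)

plain = unigram_sum("this was good".split())
negated = unigram_sum("this was not good".split())

# "not" has no score of its own, so both sentences get the same total
print(plain, negated)
```

Giving "not" a negative score of its own would not fix this either, because its effect depends on the word it modifies.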

These scores and some others have been saved in data/hand_weights.csv.  Feel free to look through the data and/or to add your own.

## Evaluation

In [None]:
# Let's pull in the weights
weights = {}
with open("data/hand_weights.csv") as f:
    reader = csv.reader(f)
    for word, score in reader:
        weights[word] = float(score)

# And check them
print("Number of words:", len(weights))
print()
print("Sample of words:")
for k in list(weights)[:10]:
    print(k, weights[k])

In [None]:
# Let's predict
y_dev_hat = []

for item in X_dev:
    score = 0
    for word in item:
        if word in weights:
            # Use the score if we have it
            score += weights[word]
        else:
            # Otherwise add a small random number centered on 0
            # (this helps ensure we always get a positive or negative
            # total and should not drown out any real scores)
            score += random.uniform(-0.001, 0.001)
    # Convert the score to an assessment:
    # a positive average maps to 1, anything else to -1
    avg_score = score/len(item)
    y_dev_hat.append((avg_score > 0)*2 - 1)

In [None]:
# Let's evaluate
# No cross-validation this round, but we can use that in the
# future to get a sense of the variability of the method
from sklearn import metrics

print("Accuracy:")
print(metrics.accuracy_score(y_dev, y_dev_hat))

print()

print("Classification metrics:")
print(metrics.classification_report(y_dev, y_dev_hat))

print()

print("Confusion matrix:")
print("(Rows are truth, columns are predictions)")
print(metrics.confusion_matrix(y_dev, y_dev_hat))

So this is pretty lousy -- if we put on rose-colored glasses, we might be slightly above chance.  We can't be sure, of course, unless we sample repeatedly (but why bother -- with performance this bad, this method is useless).
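If we did want to check, one way to "sample repeatedly" is a bootstrap over the dev set: resample the (label, prediction) pairs with replacement and see how much the accuracy moves. The sketch below is one possible approach, not something this notebook runs; `bootstrap_accuracy` is a made-up helper and the toy labels are invented for illustration.

```python
import numpy as np

def bootstrap_accuracy(y_true, y_pred, n_boot=1000, seed=0):
    """Resample (label, prediction) pairs with replacement and return the
    mean accuracy plus a 95% percentile interval."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.RandomState(seed)
    n = len(y_true)
    accs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.randint(0, n, size=n)  # indices drawn with replacement
        accs[b] = np.mean(y_true[idx] == y_pred[idx])
    low, high = np.percentile(accs, [2.5, 97.5])
    return accs.mean(), low, high

# Toy labels, not the notebook's data: 6 of 10 predictions are correct
toy_truth = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
toy_preds = [1, 1, 1, -1, -1, -1, -1, -1, 1, 1]
mean_acc, low, high = bootstrap_accuracy(toy_truth, toy_preds)
print("accuracy ~ %.2f (95%% interval %.2f to %.2f)" % (mean_acc, low, high))
```

In the notebook, the call would be `bootstrap_accuracy(y_dev, y_dev_hat)`. Assuming the two classes are roughly balanced, an interval that includes 0.5 means the method cannot be distinguished from chance.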

You can probably imagine a host of ways to improve this method.

We'll investigate more rigorous, data-driven methods going forward, but you're welcome to experiment with other improvements on your own.