# 3. Machine learning with WordNet

In this notebook we'll expand our list of words of interest to include synonyms of the words we initially identified. To do that, we will use WordNet. WordNet is like a megathesaurus: it groups words into sets of synonyms called "synsets" and provides relations such as antonymy and meronymy between them. It can also reduce inflected forms to their base forms. It covers nouns, verbs, adjectives, and adverbs.

WordNet is an oldie-but-goodie resource: when you need synonyms, definitions, or usage examples for a word, it puts them at your fingertips.

https://wordnet.princeton.edu/

In [None]:
import pickle
import csv
import numpy as np
from nltk.corpus import wordnet as wn

In [None]:
with open("data/sentiment_splits.p", "rb") as f:
    X_train, X_dev, X_test, y_train, y_dev, y_test = pickle.load(f)

In [None]:
# Get the set of words we're interested in from the CSV file
words_of_interest = {}
with open("data/hand_weights.csv") as f:
    reader = csv.reader(f)
    for word, score in reader:
        words_of_interest[word] = 1

In [None]:
# Demonstrate WordNet senses:
print(wn.synsets('cat'))

In [None]:
# What are all these senses, you ask?  There is an online tool:
# http://wordnetweb.princeton.edu/perl/webwn

# You can also look in more depth...
# Look for only a certain part of speech
print(wn.synsets('cat', pos=wn.VERB))

In [None]:
# Look at definitions
print(wn.synsets('cat', pos=wn.VERB)[0].definition())
print(wn.synsets('cat', pos=wn.VERB)[1].definition())

In [None]:
# Look at examples
print(wn.synsets('cat', pos=wn.VERB)[1].examples())

In [None]:
# Look at lemmas. A lemma pairs a word form (a particular string of characters) with one of its senses.
print(wn.synsets('cat', pos=wn.VERB)[1].lemmas())

In [None]:
# You can also access a synset more directly using its surface form + part of speech + sense number
print(wn.synset('cat.v.1').definition())

In [None]:
# WordNet is all about word *senses*, so it wants:
# (a) the part of speech, and (b) which sense you are interested in
# These are hard to get from raw text! So what we are going to do
# is assume the part of speech is always adjective (which is on
# the whole true -- but fails for words like w00t), and we are going
# to always choose the most common sense of a word, which is listed
# first.
# In real texts, you can use part-of-speech tagging preprocessing
# to get more information as to the right sense to guess.

# Let's see how well the first sense does on the first 10 words,
# assuming adjective status
for word in list(words_of_interest)[:10]:
    print("%-10s %-30s" % (word, wn.synsets(word, wn.ADJ)))
    print("")
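
The comment above suggests using part-of-speech tagging to pick a better sense. Taggers such as NLTK's default one emit Penn Treebank tags, which have to be mapped to WordNet's four part-of-speech codes before calling `wn.synsets`. Here is a minimal sketch of that mapping. The helper name `penn_to_wordnet_pos` is hypothetical, and the single-letter codes are written out literally (they are the values of `wn.ADJ`, `wn.NOUN`, `wn.VERB`, and `wn.ADV`) so the snippet runs without WordNet loaded.

```python
# Hypothetical helper: map a Penn Treebank tag to a WordNet POS code.
# The literals below mirror wn.ADJ, wn.NOUN, wn.VERB, and wn.ADV.
WORDNET_ADJ, WORDNET_NOUN, WORDNET_VERB, WORDNET_ADV = 'a', 'n', 'v', 'r'


def penn_to_wordnet_pos(tag):
    """Return the WordNet POS code for a Penn Treebank tag, or None."""
    if tag.startswith('JJ'):
        return WORDNET_ADJ
    if tag.startswith('NN'):
        return WORDNET_NOUN
    if tag.startswith('VB'):
        return WORDNET_VERB
    if tag.startswith('RB'):
        return WORDNET_ADV
    # Determiners, prepositions, and so on have no WordNet entry
    return None


for tag in ['JJ', 'NNS', 'VBD', 'RB', 'DT']:
    print("%-4s -> %s" % (tag, penn_to_wordnet_pos(tag)))
```

With a tag in hand, the lookup would become `wn.synsets(word, penn_to_wordnet_pos(tag))` in place of the hard-coded `wn.ADJ`.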

In [None]:
# Now let's build out our words of interest
expanded_adjectives = set(words_of_interest.keys())
for word in words_of_interest.keys():
    synsets = wn.synsets(word, wn.ADJ)
    if len(synsets) > 0:
        for lemma in synsets[0].lemmas():
            expanded_adjectives.add(lemma.name())

In [None]:
# How much did our set change?
print("The original list:")
print(len(words_of_interest))
print("The expanded list:")
print(len(expanded_adjectives))

In [None]:
print("New words that we missed on the first pass:")
print(expanded_adjectives - set(words_of_interest.keys()))
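
One thing to watch for in the expanded list: WordNet lemma names join multiword expressions with underscores (a name shaped like `all_right`), so those entries will not match raw text as written. Below is a small sketch of one way to handle this, assuming you simply want the spaces back. The helper name `to_surface_form` and the sample names are illustrative stand-ins, not output from this notebook.

```python
def to_surface_form(lemma_name):
    # WordNet writes multiword expressions with underscores
    return lemma_name.replace('_', ' ')


# Illustrative names only; real ones come from lemma.name()
sample_names = {'good', 'all_right'}
print(sorted(to_surface_form(name) for name in sample_names))
```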

In [None]:
# Create a function that will convert each paragraph to a vector.
#
# The presence of each word in the wordlist is a feature.
# So a cell is 1 if the word of interest appears in the
# paragraph (as a substring), and 0 otherwise
def convert_to_vector(paragraph):
    # Decode byte strings once, up front; leave already-decoded text alone
    if isinstance(paragraph, bytes):
        paragraph = paragraph.decode('latin-1')
    representation = np.zeros(len(expanded_adjectives))
    for i, word in enumerate(expanded_adjectives):
        if word in paragraph:
            representation[i] = 1
    return representation
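
Note that `word in paragraph` is a substring test, so a word like "bad" also fires on "badge". The sketch below shows one alternative, assuming a simple regex tokenizer: it lowercases the paragraph, splits it into word tokens, and checks membership token by token. The names `tokenize` and `convert_to_vector_by_token` and the tokenizer pattern are illustrative assumptions, not part of the notebook, and plain lists stand in for NumPy arrays to keep the example dependency-free.

```python
import re


def tokenize(text):
    # Assumption: a token is a run of letters, digits, or apostrophes
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def convert_to_vector_by_token(paragraph, vocabulary):
    # 1 if the vocabulary word occurs as a whole token, 0 otherwise
    tokens = tokenize(paragraph)
    return [1 if word in tokens else 0 for word in vocabulary]


vocabulary = ['bad', 'good']
text = "He wore a badge. Good grief!"

print('bad' in text)                                 # substring test: True
print(convert_to_vector_by_token(text, vocabulary))  # token test: [0, 1]
```

Whether whole-token matching helps is an empirical question. It removes false hits like "badge", but it also stops "good" from matching inflected or compound forms.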

In [None]:
def convert_dataset(dataset):
    # Convert every paragraph in the dataset to its feature vector
    dataset_vector = np.zeros((len(dataset), len(expanded_adjectives)))
    for i, paragraph in enumerate(dataset):
        dataset_vector[i] = convert_to_vector(paragraph)
    return dataset_vector

X_train_vector = convert_dataset(X_train)
print(X_train_vector.shape)
X_dev_vector = convert_dataset(X_dev)
print(X_dev_vector.shape)

In [None]:
from sklearn import linear_model

clf = linear_model.LogisticRegression()
clf.fit(X_train_vector, y_train)
y_dev_hat = clf.predict(X_dev_vector)

# Evaluation

In [None]:
# Let's evaluate
# No cross-validation this round, but we can use that in the 
# future to get a sense of the variability of the method
from sklearn import metrics

print("Accuracy:")
print(metrics.accuracy_score(y_dev, y_dev_hat))

print("")

print("Classification metrics:")
print(metrics.classification_report(y_dev, y_dev_hat))

print("")

print("Confusion matrix:")
print("(Rows are truth, columns are predictions)")
print(metrics.confusion_matrix(y_dev, y_dev_hat))
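
To make the "rows are truth, columns are predictions" convention concrete, here is a small pure-Python sketch that tallies a 2x2 confusion matrix and the accuracy for made-up labels. The labels and the helper name `confusion_counts` are invented for illustration. On real data the `metrics` functions above do this for you.

```python
def confusion_counts(y_true, y_pred, labels):
    # matrix[i][j] counts items whose true label is labels[i]
    # and whose predicted label is labels[j]
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


# Made-up labels, purely for illustration
toy_truth = [0, 0, 1, 1, 1]
toy_guess = [0, 1, 1, 1, 0]

toy_matrix = confusion_counts(toy_truth, toy_guess, labels=[0, 1])
# Accuracy is the diagonal (correct predictions) over the total
toy_accuracy = sum(toy_matrix[k][k] for k in range(2)) / float(len(toy_truth))

print(toy_matrix)    # [[1, 1], [1, 2]]
print(toy_accuracy)  # 0.6
```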

You could easily imagine using all the synsets instead of just the primary sense of a word.  If you try that, does performance go up?
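
One way to try it is sketched below. The function name `expand_with_all_senses` and its `get_synsets` parameter are hypothetical. Passing the lookup in as a callable keeps the sketch independent of how WordNet is loaded. It relies on `Synset.lemma_names()`, which returns a synset's lemma names as strings.

```python
def expand_with_all_senses(words, get_synsets):
    """Add the lemma names of every synset of every word."""
    expanded = set(words)
    for word in words:
        for synset in get_synsets(word):
            expanded.update(synset.lemma_names())
    return expanded


# With the notebook's names, the call would look like this (not run here):
# expanded_all = expand_with_all_senses(
#     words_of_interest, lambda word: wn.synsets(word, wn.ADJ))
```

Rebuilding the vectors with `expanded_all` in place of `expanded_adjectives` and rerunning the classifier gives the comparison. The feature set will be larger, and rarer senses may bring in only loosely related words.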