# Learning Word Vectors from Sherlock Holmes Series
Patrick Coady (pcoady@alum.mit.edu)

In [1]:
from wordvector import WordVector
from windowmodel import WindowModel
import docload

import numpy as np
import sklearn.utils
import matplotlib.pyplot as plt

### Load Books, Build Dictionary & Convert Books to Integer Vector
Start with these 3 books (all written by Sir Arthur Conan Doyle):
1. The Adventures of Sherlock Holmes
2. The Hound of the Baskervilles
3. The Sign of the Four

Load the books and build a dictionary of all unique words. The dictionary maps each unique word to an integer. All words are converted to lower case, and punctuation marks (i.e. " , . ? and !) are treated as words. If the size of the book vocabulary exceeds the pre-set limit (**vocab_size**), then the least frequent words are all mapped to the last integer in the dictionary.
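
The mapping described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the code inside `docload.build_word_array` (which is not shown here): the function name `build_dictionary`, the tokenizing regular expression, and the tie-breaking between equally frequent words are all assumptions.

```python
import re
from collections import Counter


def build_dictionary(text, vocab_size):
    """Hypothetical sketch: map tokens to integers, rare words share the last one."""
    # Lower-case the text, then split it into words and the marks " , . ? !
    tokens = re.findall(r'[a-z]+|[",.?!]', text.lower())
    counts = Counter(tokens)
    # The most frequent tokens get their own integers; one slot is kept for the rest
    common = [tok for tok, _ in counts.most_common(vocab_size - 1)]
    dictionary = {tok: i for i, tok in enumerate(common)}
    unknown_id = len(dictionary)  # last integer, shared by all rare words
    word_array = [dictionary.get(tok, unknown_id) for tok in tokens]
    return word_array, dictionary, unknown_id


word_array_demo, dictionary_demo, unknown_id_demo = build_dictionary(
    'Holmes said, "Come, Watson!" Watson came.', vocab_size=4)
```

With a limit of 4, only the three tokens that occur twice keep their own integer; every other token in the sentence is mapped to the shared last entry.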

In [2]:
files = ['../data/adventures_of_sherlock_holmes.txt',
        '../data/hound_of_the_baskervilles.txt',
        '../data/sign_of_the_four.txt']
word_array, dictionary, num_lines, num_words = docload.build_word_array(
    files, vocab_size=50000, gutenberg=True)

print('Document loaded and processed: {} lines, {} words.'
      .format(num_lines, num_words))

Document loaded and processed: 24080 lines, 244986 words.


## Neural Net Architecture
![](notebook_images/NN_diagram.png)

In [None]:
print('Building training set ...')
x, y = WindowModel.build_training_set(word_array)

# shuffle and split 10% validation data
x_shuf, y_shuf = sklearn.utils.shuffle(x, y, random_state=0)
split = round(x_shuf.shape[0]*0.9)
x_val, y_val = (x_shuf[split:, :], y_shuf[split:, :])
# take the training rows from the shuffled arrays too, so they don't overlap validation
x_train, y_train = (x_shuf[:split, :], y_shuf[:split, :])

print('Training set built.')
graph_params = {'batch_size': 32,
                'vocab_size': np.max(x)+1,  # +1 for unknown word dictionary entry
                'embed_size': 16,
                'hid_size': 16,
                'learn_rate': 0.01,
                'momentum': 0.9}
model = WindowModel(graph_params)
print('Model built. Vocab size = {}. Document length = {} words.'
      .format(np.max(x)+1, len(word_array)))

print('Training ...')
results = model.train(x_train, y_train, x_val, y_val, epochs=120)

word_vector_embed = WordVector(results['embed_weights'], dictionary)
word_vector_nce = WordVector(results['nce_weights'], dictionary)
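
As a sanity check on the shuffle-and-split step, here is a minimal sketch of a 90/10 split in plain NumPy. It swaps in `np.random.default_rng(...).permutation` for `sklearn.utils.shuffle`, and the helper name `shuffle_split` and the toy arrays are assumptions made for illustration. The point it shows is that both partitions are sliced from the same shuffled arrays, so no row lands in both.

```python
import numpy as np


def shuffle_split(x, y, val_fraction=0.1, seed=0):
    """Hypothetical helper: shuffle rows of x and y together, then split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(x.shape[0])  # one permutation keeps x and y aligned
    x_shuf, y_shuf = x[order], y[order]
    split = round(x.shape[0] * (1 - val_fraction))
    # Training rows come before the split point, validation rows after it
    return x_shuf[:split], y_shuf[:split], x_shuf[split:], y_shuf[split:]


x_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20).reshape(20, 1)
x_tr, y_tr, x_va, y_va = shuffle_split(x_demo, y_demo)
```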

### 100 Most Common Words

In [None]:
print(word_vector_embed.most_common(100))

## Word Similarities
The model learns 2 word vector representations. 
1. The embedding vector from the one-hot input
2. The vector from the hidden layer to the network output

In general, the output-layer vectors seem to give more meaningful word representations. As a quick check, we look up the closest words (by cosine similarity) to the word "six". Remember that this model had no human-labeled data and no data sources other than the raw book text. The hidden-to-output matrix correctly finds that other numbers are most similar to "six".

In [None]:
word = "six"
print('Embedding layer: 10 closest words to:', "'" + word + "'")
print(word_vector_embed.n_closest(word=word, num_closest=10, metric='cosine'))
print()
print('Hidden-to-output layer: 10 closest words to:', "'" + word + "'")
print(word_vector_nce.n_closest(word=word, num_closest=10, metric='cosine'))
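
For readers unfamiliar with the metric, a closest-words lookup by cosine similarity can be sketched as follows. This is not the `WordVector.n_closest` implementation (which is not shown here); the function `n_closest_cosine` and the toy vectors are assumptions, and the sketch works on row indices rather than words.

```python
import numpy as np


def n_closest_cosine(vectors, index, num_closest):
    """Hypothetical sketch: indices of the rows most similar to vectors[index]."""
    # Normalize rows to unit length so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = unit @ unit[index]
    similarity[index] = -np.inf  # exclude the query row itself
    return np.argsort(-similarity)[:num_closest]  # most similar first


toy_vectors = np.array([[1.0, 0.0],    # row 0: the query
                        [0.9, 0.1],    # row 1: nearly parallel to row 0
                        [0.0, 1.0],    # row 2: orthogonal to row 0
                        [-1.0, 0.0]])  # row 3: opposite to row 0
closest = n_closest_cosine(toy_vectors, index=0, num_closest=2)
```

Because cosine similarity ignores vector length, two words count as close when their vectors point in a similar direction, regardless of magnitude.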

In [None]:
# skipping first 100 words (i.e. 'the', 'if', 'and', '.', ',', ...) gives more
# interesting visualization
embed_2d, word_list = word_vector_nce.project_2d(100, 600) # t-sne projection
reverse_dict = word_vector_nce.get_reverse_dict()

In [None]:
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
# minx, maxx, miny, maxy = (-20, 0, -20, 10)
# plt.ylim(miny, maxy)
# plt.xlim(minx, maxx)
plt.scatter(embed_2d[:,0], embed_2d[:,1])
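# rows of embed_2d correspond to dictionary indices 100..599, so offset the label lookup
for i in range(embed_2d.shape[0]):
    plt.text(embed_2d[i,0], embed_2d[i,1], reverse_dict[i + 100], clip_on=True)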
plt.show()

In [None]:
word_vector_nce.analogy('london', 'england', 'scotland', 5)
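
A common way to answer analogy queries with word vectors is vector arithmetic: for "a is to b as c is to ?", look for the vectors closest to vec(b) - vec(a) + vec(c). The sketch below illustrates that idea only. Whether `WordVector.analogy` uses this exact formula, normalization, or argument order is not shown here, so the function `analogy_indices` and the toy vectors are assumptions.

```python
import numpy as np


def analogy_indices(vectors, a, b, c, num_closest):
    """Hypothetical sketch: rows closest to vec(b) - vec(a) + vec(c)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    target = unit[b] - unit[a] + unit[c]
    target /= np.linalg.norm(target)
    similarity = unit @ target           # cosine similarity to the target
    similarity[[a, b, c]] = -np.inf      # never return the query rows
    return np.argsort(-similarity)[:num_closest]


toy = np.array([[1.0, 0.0, 0.0],   # row 0: a
                [1.0, 1.0, 0.0],   # row 1: b = a plus a shared offset
                [0.0, 0.0, 1.0],   # row 2: c
                [0.0, 1.0, 1.0],   # row 3: c plus the same offset
                [1.0, 0.0, 1.0]])  # row 4: distractor
answer = analogy_indices(toy, a=0, b=1, c=2, num_closest=2)
```

In the toy data, row 3 differs from row 2 by the same offset that separates row 1 from row 0, so it ranks first.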

In [None]:
# predict the first 100 examples and count how many match their targets exactly
x_predict = x[:100,:]
y_hat = model.predict(x_predict, 120)
np.sum(y_hat[0] == y[:100,0])