# Learning Word Vectors from Sherlock Holmes Series
*Patrick Coady (pcoady@alum.mit.edu)*  

With no human teacher, a neural net can learn that the words "seven" and "eight" are similar. The network also learns slightly more subtle relationships: "**looking** is to **look** as **see** is to \_\_\_." The algorithm learns from nothing but the raw ASCII text of three Sir Arthur Conan Doyle *Sherlock Holmes* books (thanks to Project Gutenberg); it has no access to dictionaries or the Internet.  


By learning word vectors from unlabeled training data, you save your precious human-labeled data for other aspects of learning, rather than "wasting" it on learning the basic structure of your problem. This approach has been applied with great success to many applications beyond "reading" text.  

This notebook illustrates how a neural net can learn meaningful vector representations of English words. However, my primary objective was to learn TensorFlow. I built 3 Python modules that support this notebook:

1. **wordvector.py**: A handy Python class (WordVector) for exploring word vectors learned by the neural network. For example:  
    a) Returns words that are "close" to each other based on different distance metrics  
    b) "Guess" the 4th word in an analogy  
    c) Project the vectors onto 2D space for plotting
2. **windowmodel.py**: This is where the TensorFlow graph, training and prediction routines reside. The class also contains a static method to build the training set from an integer word array.  
3. **docload.py**: Load ASCII documents (with some special hooks for Project Gutenberg books). Returns the book as a NumPy integer array, along with dictionaries for converting back and forth between integers and the English words.

In [1]:
from wordvector import WordVector
from windowmodel import WindowModel
import docload

import numpy as np
import sklearn.utils

### Load Books, Build Dictionary & Convert Books to Integer Vector
Start with these 3 books (all written by Sir Arthur Conan Doyle):
1. The Adventures of Sherlock Holmes
2. The Hound of the Baskervilles
3. The Sign of the Four

Load the books and build a dictionary of all unique words. The dictionary maps each unique word to an integer. All words are converted to lower case, and punctuation marks (i.e. " , . ? and !) are treated as words. If the size of the book vocabulary exceeds the pre-set limit (**vocab_size**), then the most infrequent words are mapped to the last integer in the dictionary.

In [2]:
files = ['../data/adventures_of_sherlock_holmes.txt',
        '../data/hound_of_the_baskervilles.txt',
        '../data/sign_of_the_four.txt']
word_array, dictionary, num_lines, num_words = docload.build_word_array(
    files, vocab_size=50000, gutenberg=True)

print('Document loaded and processed: {} lines, {} words.'
      .format(num_lines, num_words))

Document loaded and processed: 24080 lines, 244986 words.
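
The dictionary-building step described above can be sketched in a few lines. This is a minimal illustration, not the code in **docload.py**: the helper name `build_vocab` and the choice to always reserve the last integer for out-of-vocabulary words are my assumptions.

```python
from collections import Counter


def build_vocab(tokens, vocab_size):
    """Map each token to an integer, most frequent word first.

    Hypothetical helper (not docload's API): words that do not fit
    within vocab_size all share the last integer, vocab_size - 1.
    """
    counts = Counter(tokens)
    # Keep the vocab_size - 1 most frequent words; the last id is shared.
    kept = [word for word, _ in counts.most_common(vocab_size - 1)]
    word_to_int = {word: i for i, word in enumerate(kept)}
    unknown_id = vocab_size - 1
    int_array = [word_to_int.get(word, unknown_id) for word in tokens]
    return int_array, word_to_int


tokens = "the cat saw the dog , the cat saw the cat .".split()
int_array, word_to_int = build_vocab(tokens, vocab_size=4)
print(word_to_int)  # {'the': 0, 'cat': 1, 'saw': 2}
print(int_array)    # [0, 1, 2, 0, 3, 3, 0, 1, 2, 0, 1, 3]
```

Here "dog", "," and "." are the rarest tokens, so all three collapse onto the shared id 3.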


## Neural Net Architecture
![](notebook_images/NN_diagram.png)

In [26]:
print('Building training set ...')
x, y = WindowModel.build_training_set(word_array)

# shuffle and split 10% validation data
x_shuf, y_shuf = sklearn.utils.shuffle(x, y, random_state=0)
split = round(x_shuf.shape[0]*0.9)
x_val, y_val = (x_shuf[split:, :], y_shuf[split:, :])
x_train, y_train = (x_shuf[:split, :], y_shuf[:split, :])

print('Training set built.')
graph_params = {'batch_size': 32,
                'vocab_size': np.max(x)+1,
                'embed_size': 64,
                'hid_size': 64,
                'neg_samples': 64,
                'learn_rate': 0.01,
                'momentum': 0.9,
                'embed_noise': 0.1,
                'hid_noise': 0.3,
                'optimizer': 'Momentum'}
model = WindowModel(graph_params)
print('Model built. Vocab size = {}. Document length = {} words.'
      .format(np.max(x)+1, len(word_array)))

print('Training ...')
results = model.train(x_train, y_train, x_val, y_val, epochs=120, verbose=False)

word_vector_embed = WordVector(results['embed_weights'], dictionary)
word_vector_nce = WordVector(results['nce_weights'], dictionary)

Building training set ...
Training set built.
Model built. Vocab size = 11750. Document length = 244986 words.
Training ...
End Training: total batches = 826800. train loss = 1.39, val loss = 1.64
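
The passage-prediction section of this notebook notes that the model sees only the 2 preceding and 2 following words when predicting a word. A minimal sketch of how such a training set could be built from the integer word array follows. The function `build_windows` is a hypothetical stand-in, and the real `WindowModel.build_training_set` may lay out its arrays differently.

```python
def build_windows(word_ints, half_window=2):
    """Return (contexts, targets) for a symmetric context window.

    Hypothetical stand-in for WindowModel.build_training_set: each
    context holds the half_window words before and after the target.
    """
    contexts, targets = [], []
    for i in range(half_window, len(word_ints) - half_window):
        before = word_ints[i - half_window:i]
        after = word_ints[i + 1:i + 1 + half_window]
        contexts.append(list(before) + list(after))
        targets.append(word_ints[i])
    return contexts, targets


contexts, targets = build_windows([10, 11, 12, 13, 14, 15])
print(contexts)  # [[10, 11, 13, 14], [11, 12, 14, 15]]
print(targets)   # [12, 13]
```

Under this layout an array of N words yields N - 4 examples, because the first and last 2 words have no complete window.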


### 100 Most Common Words

In [68]:
print(word_vector_embed.most_common(100))

[',', '.', 'the', '"', 'and', 'i', 'of', 'to', 'a', 'that', 'it', 'in', 'he', 'you', 'was', 'his', 'is', 'my', 'have', 'had', 'with', 'as', 'at', '?', 'for', 'which', 'we', 'but', 'be', 'not', 'me', 'this', 'there', 'upon', 'him', 'said', 'from', 'so', 'no', 'on', 'one', 'all', 'holmes', 'been', 'her', 'were', 'what', 'very', 'by', 'your', 'an', 'she', 'are', '!', 'would', 'man', 'out', 'could', 'then', 'if', 'our', 'up', 'when', 'has', 'do', 'will', "'", 'us', 'who', 'some', 'into', 'sir', 'now', 'see', 'down', 'they', 'or', 'should', 'little', 'mr', 'well', 'more', 'over', 'can', 'may', 'know', 'about', 'am', 'think', 'them', 'only', 'must', 'did', 'here', 'come', 'time', 'than', 'how', 'two', 'before']


## Word Similarities
The model learns 2 word vector representations. 
1. The embedding vector from the one-hot input
2. The vector from the hidden layer to the network output

In general, the output layer vector seems to learn more meaningful vector representations of words. We quickly check the closest words (by cosine similarity) to the word "seven". Remember, this model had no human-labeled data or any data sources outside of the raw book text. The hidden-layer-to-output matrix correctly finds that other numbers are most similar to "seven".

In [27]:
word = "seven"
print('Embedding layer: 8 closest words to:', "'" + word + "'")
print(word_vector_embed.n_closest(word=word, num_closest=8, metric='cosine'), '\n')
print('Hidden-to-output layer: 8 closest words to:', "'" + word + "'")
print(word_vector_nce.n_closest(word=word, num_closest=8, metric='cosine'))

Embedding layer: 8 closest words to: 'seven'
['four', 'ten', 'five', 'three', 'six', 'fifty', 'eight', 'different'] 

Hidden-to-output layer: 8 closest words to: 'seven'
['eight', 'five', 'ten', 'eleven', 'twelve', 'six', 'four', 'nine']
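
For readers who want to see what a "closest words" lookup amounts to, here is a minimal sketch using cosine similarity. It is pure Python on made-up 2-D vectors, and this `n_closest` is a hypothetical function with a different signature from the `WordVector.n_closest` method used above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def n_closest(word, vectors, num_closest):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:num_closest]]


# Toy 2-D vectors, made up for illustration (not learned weights).
toy_vectors = {
    'seven': [1.0, 0.1],
    'eight': [0.9, 0.2],
    'nine': [0.8, 0.4],
    'holmes': [0.1, 1.0],
}
print(n_closest('seven', toy_vectors, 2))  # ['eight', 'nine']
```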


## Qualitative Performance of *embed_weights* vs. *nce_weights*  

From a qualitative perspective, the *nce_weights* consistently give more meaningful results when asking for similar words, although both do OK at suggesting words similar to "seven". Also, for the analogy task (e.g. A is to B, as C is to ?) the *nce_weights* give more "sensible" results.  

Clearly, the *embed_weights* are learning something. These weights are the first stage in the model, and the model wouldn't perform if they were learning nonsense. 

Cosine similarity is used as the distance metric for the similarity and analogy tasks. A quick look at Euclidean distance was not promising. The code supports 15+ different distance metrics from *scipy.spatial.distance*, so some experimentation here might be interesting.  
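
To illustrate why the choice of metric matters, the sketch below shows cosine and Euclidean distance disagreeing about which of two vectors is "closest". The vectors are made up, and I am not claiming this explains the results seen with the learned weights. The two functions are hand-written so the example has no dependencies; *scipy.spatial.distance* provides `cosine` and `euclidean` functions for the same quantities.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance: sensitive to vector length as well."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


query = [1.0, 1.0]
same_direction_long = [5.0, 5.0]  # parallel to query, much larger norm
nearby_other_angle = [1.0, 0.0]   # short step away, 45 degrees off

for name, metric in [('cosine', cosine_distance),
                     ('euclidean', euclidean_distance)]:
    d_long = metric(query, same_direction_long)
    d_near = metric(query, nearby_other_angle)
    closest = ('same_direction_long' if d_long < d_near
               else 'nearby_other_angle')
    print(name, '->', closest)
# cosine -> same_direction_long
# euclidean -> nearby_other_angle
```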

That said, to avoid clutter, the rest of this notebook will focus on the word vectors from the *nce_weights* matrix.

In [34]:
word = "laughing"
print('Hidden-to-output layer: 8 closest words to:', "'" + word + "'")
print(word_vector_nce.n_closest(word=word, num_closest=8, metric='cosine'))

Hidden-to-output layer: 8 closest words to: 'laughing'
['smiling', 'rising', 'chuckling', "'and", 'wild', 'pensively', "'well", 'yawning']


In [36]:
word = "mr"
print('Hidden-to-output layer: 8 closest words to:', "'" + word + "'")
print(word_vector_nce.n_closest(word=word, num_closest=8, metric='cosine'))

Hidden-to-output layer: 8 closest words to: 'mr'
['mrs', 'dr', "'mr", 'blind', 'earnestly', 'l', 'servant', 'st']


## Analogies  

Because words are represented as vectors, it is interesting to try vector addition to predict the 4th word in an analogy. The premise being that A is to B as C is to D can be represented as:  $\mathbf{x_d}=\mathbf{x_b}-\mathbf{x_a}+\mathbf{x_c}$.  

And it does indeed work. The first example predicts that "**had** is to **has** as **was** is to **is**". So the word vectors can translate between past and present tense. Note that the predictions include words that are already in the analogy. In 64-dimension vector space, it is hard to get too far away from $\mathbf{x_a}$, $\mathbf{x_b}$ and $\mathbf{x_c}$ in the above equation.  

In the 2nd example, the vector addition can correctly translate between the singular and plural form of a noun. With the fun example being, "**boot** is to **boots** as **arm** is to (**arms** or **limbs**)".  

(I am being generous by giving credit for any correct guess in the top 5. I don't feel too bad about this, since the training set is fairly small.)

In [37]:
print(word_vector_nce.analogy('had', 'has', 'was', 5))

['was', 'is', 'has', 'lives', 'makes']


In [62]:
print(word_vector_nce.analogy('boot', 'boots', 'arm', 5))

['boots', 'arms', 'weeks', 'limbs', 'heart']
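
The vector arithmetic behind these guesses, $\mathbf{x_d}=\mathbf{x_b}-\mathbf{x_a}+\mathbf{x_c}$ followed by a nearest-neighbor search, can be sketched as follows. The vectors are made up so that one axis loosely acts as "tense". The function is a hypothetical illustration with its own signature, not the `WordVector.analogy` method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors, num_guesses):
    """Guess d in 'a is to b as c is to d' via x_d = x_b - x_a + x_c."""
    target = [xb - xa + xc
              for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])]
    # Rank the whole vocabulary, including a, b and c themselves.
    ranked = sorted(vectors,
                    key=lambda w: cosine_similarity(target, vectors[w]),
                    reverse=True)
    return ranked[:num_guesses]


# Toy 2-D vectors, made up for illustration (not learned weights).
toy_vectors = {
    'had': [1.0, 0.0],
    'has': [1.0, 2.0],
    'was': [4.0, 0.5],
    'is': [4.0, 2.5],
    'boot': [-1.0, 0.3],
}
print(analogy('had', 'has', 'was', toy_vectors, 3))  # ['is', 'was', 'has']
```

Because this sketch does not exclude the three input words from the ranking, they can show up among the guesses, just as 'was' and 'has' do in the notebook output above.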


### Predict Replacement Words in a Passage

For fun, I took a random 100-word passage and used the network to predict a replacement for every word. The result has some semblance of grammar, but is mostly nonsensical. This is to be expected, since the model only uses the 2 preceding and 2 following words to make predictions. A recurrent NN is the right tool for this, but here it goes anyway:

#### Original Passage

well,  it is just as i have been telling you,  mr.  sherlock holmes, " said jabez wilson,  mopping his forehead" i have a small pawnbroker's business at coburg square,  near the city.

#### Reconstructed Passage

oh,  it was,  as i have been told you,  mr.  sherlock holmes, " said sherlock wilson,  upon his? " i am a small public business at coburg square,  with the time.

In [69]:
# grab 100 word passage from book
reverse_dict = word_vector_nce.get_reverse_dict()
passage = [reverse_dict[i] for i in word_array[12200:12300]]

In [70]:
# print passage with some crude formatting (e.g. space after comma)
readable = ''
for word in passage:
    if word == '"':
        readable += word
    elif word in ['?', '!', '.', ',']:
        readable += word + ' '
    else: 
        readable += ' ' + word
print(readable)

"" well,  it is just as i have been telling you,  mr.  sherlock holmes, " said jabez wilson,  mopping his forehead" i have a small pawnbroker's business at coburg square,  near the city.  it's not a very large affair,  and of late years it has not done more than just give me a living.  i used to be able to keep two assistants,  but now i only keep one and i would have a job to pay him but that he is willing to come for half wages


In [71]:
# use model to replace words in original passage with predicted words
# need to grab 2 words before and after passage
x, y = WindowModel.build_training_set(word_array[(12200-2):(12300+2)])
y_hat = model.predict(x, 120)
passage_predict = [reverse_dict[i] for i in y_hat[0]]

In [72]:
# print predicted passage
readable = ''
for word in passage_predict:
    if word == '"':
        readable += word
    elif word in ['?', '!', '.', ',']:
        readable += word + ' '
    else: 
        readable += ' ' + word
print(readable)

"" oh,  it was,  as i have been told you,  mr.  sherlock holmes, " said sherlock wilson,  upon his? " i am a small public business at coburg square,  with the time. " all a very old one,  and of those,  he is been two more than you let me a word.  you seemed to be able to take an minutes,  and as i must only it that he should have a time to help. .  because he is able to come for half past
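
The two printing cells above repeat the same formatting loop. A small helper like the following would remove the duplication. The name `format_passage` is my own; the spacing rules are copied from the cells as written.

```python
def format_passage(words):
    """Join tokens using the notebook's crude spacing rules."""
    readable = ''
    for word in words:
        if word == '"':
            readable += word        # quotes attach with no added space
        elif word in ['?', '!', '.', ',']:
            readable += word + ' '  # punctuation gets a trailing space
        else:
            readable += ' ' + word  # ordinary words get a leading space
    return readable


print(repr(format_passage(['well', ',', 'it', 'is', '.'])))
# ' well,  it is. '
```

With it, each printing cell reduces to `print(format_passage(passage))` or `print(format_passage(passage_predict))`.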


## Conclusion

I hope you found this interesting or even useful. It is amazing that in a few afternoons you can demonstrate results like this, all using free, open-source tools. 

I've been pleased with TensorFlow. I haven't benchmarked for performance, but the API is intuitive and you can get something working quickly. I was previously using Keras to build NN models. I decided to try TensorFlow to have more flexibility to try ideas and also to understand what was going on behind the scenes.

The code in the 3 Python modules is intended to be easily understood and reused. I encourage people to run the model on different books or documents - I'd love to hear your results. I welcome suggestions and even Pull Requests with fixes or improvements.

Next I plan to train an RNN on the same data set. 