# Vector representation of word
 Tham khao *Patrick Coady (pcoady@alum.mit.edu)*  

In [1]:
from wordvector import WordVector
from windowmodel import WindowModel
import docload

import numpy as np
import sklearn.utils

### Load Books, Build Dictionary & Convert Books to Integer Vector
Input: 3 file .txt 
1. adventures_of_sherlock_holmes.txt
2. hound_of_the_baskervilles.txt
3. sign_of_the_four.txt

Load the books and build a dictionary of all unique words. The dictionary maps each unique word to an integer. All words are converted to lower case. And punctuation are treated as words (i.e. " , . ? and !). If the size of the book vocabulary exceeds the pre-set limit (**vocab_size**), then the most infrequent words are mapped to the last integer in the dictionary.

In [2]:
files = ['../data/adventures_of_sherlock_holmes.txt',
        '../data/hound_of_the_baskervilles.txt',
        '../data/sign_of_the_four.txt']
word_array, dictionary, num_lines, num_words = docload.build_word_array(
    files, vocab_size=50000, gutenberg=True)

print('Document loaded and processed: {} lines, {} words.'
      .format(num_lines, num_words))

Document loaded and processed: 24080 lines, 247812 words.


In [3]:
print('Building training set ...')
x, y = WindowModel.build_training_set(word_array)

# shuffle and split 10% validation data
x_shuf, y_shuf = sklearn.utils.shuffle(x, y, random_state=0)
split = round(x_shuf.shape[0]*0.9)
x_val, y_val = (x_shuf[split:, :], y_shuf[split:, :])
x_train, y_train = (x[:split, :], y[:split, :])

print('Training set built.')
graph_params = {'batch_size': 32,
                'vocab_size': np.max(x)+1,
                'embed_size': 64,
                'hid_size': 64,
                'neg_samples': 64,
                'learn_rate': 0.01,
                'momentum': 0.9,
                'embed_noise': 0.1,
                'hid_noise': 0.3,
                'optimizer': 'Momentum'}
model = WindowModel(graph_params)
print('Model built. Vocab size = {}. Document length = {} words.'
      .format(np.max(x)+1, len(word_array)))

print('Training ...')
results = model.train(x_train, y_train, x_val, y_val, epochs=120, verbose=False)

word_vector_embed = WordVector(results['embed_weights'], dictionary)
word_vector_nce = WordVector(results['nce_weights'], dictionary)

Building training set ...
Training set built.
Model built. Vocab size = 11756. Document length = 247812 words.
Training ...
End Training: total batches = 836280. train loss = 1.40, val loss = 1.65


### 100 Most Common Words

In [4]:
print(word_vector_embed.most_common(100))

[',', '.', 'the', '"', 'and', 'i', 'of', 'to', 'a', 'that', 'it', 'in', 'he', 'you', 'was', '-', 'his', 'is', 'my', 'have', 'had', 'with', 'as', 'at', '?', 'for', 'which', 'we', 'but', 'be', 'not', 'me', 'this', 'there', 'upon', 'him', 'said', 'from', 'so', 'no', 'on', 'one', 'all', 'holmes', 'been', 'her', 'were', 'what', 'very', 'by', 'your', 'an', 'she', 'are', 'would', '!', 'man', 'out', 'could', 'then', 'if', 'our', 'up', 'when', 'has', 'do', 'will', "'", 'us', 'who', 'some', 'into', 'sir', 'now', 'see', 'down', 'they', 'or', 'should', 'little', 'mr', 'well', 'more', 'over', 'can', 'may', 'know', 'about', 'am', 'think', 'them', 'only', 'must', ';', 'did', 'here', 'come', 'time', 'than', 'how']


## Word Similarities
The model learns 2 word vector representations. 
1. The embedding vector from the one-hot input
2. The vector from the hidden layer to the network output

In general, the output layer vector seems to learn more meaningful vector representation of words. We quickly check the closest words (cosine similarity) to the word "seven". Remember, this model had no human-labeled data or any data sources outside of the raw book text. The hidden layer to output matrix correctly finds that other numbers are most similar to "seven".

In [20]:
word = "seven"
print('Embedding layer: 8 closest words to:', "'" + word + "'")
print(word_vector_embed.n_closest(word=word, num_closest=8, metric='cosine'), '\n')
print('Hidden-to-output layer: 8 closest words to:', "'" + word + "'")
print(word_vector_nce.n_closest(word=word, num_closest=8, metric='cosine'))

Embedding layer: 8 closest words to: 'seven'
['eight', 'ten', 'four', 'sixty', 'five', 'arc', 'straining', 'wandered'] 

Hidden-to-output layer: 8 closest words to: 'seven'
['eight', 'thirty', 'twenty', 'ten', 'eleven', 'five', 'zero', 'twelve']


## Qualitative Performance of *embed_weights* vs. *nce_weights*  

From a qualitative perspective, the *nce_weights* consistently give more meaningful results when asking for similar words. Although, they both do OK on suggesting similar words to "seven". Also, for the analogy task (e.g. A is to B, as C is to ?) the *nce_weights* give more "sensible" results.  

Clearly, the *embed_weights* are learning something. These weights are the first stage in the model, and the model wouldn't perform if they were learning nonsense. 

Cosine similarity is used as a distance metric for running similarity and analogy tasks. It might be interesting to experiment with other distance measures. A quick look at Euclidean distance was not promising. The code supports 15+ different distance metrics from *scipy.spatial.distance*, some experimentation here might be interesting.  

That said, to avoid clutter, the rest of this notebook will focus on the word vectors from the *nce_weights* matrix and use cosine similarity.

In [6]:
word = "laughing"
print('8 closest words to:', "'" + word + "'")
print(word_vector_nce.n_closest(word=word, num_closest=8, metric='cosine'))

8 closest words to: 'laughing'
['pensively', 'yawning', 'chuckling', 'lightly', 'languidly', 'dryly', 'earnestly', 'underneath']


In [7]:
word = "mr"
print('8 closest words to:', "'" + word + "'")
print(word_vector_nce.n_closest(word=word, num_closest=8, metric='cosine'))

8 closest words to: 'mr'
['mrs', 'st', 'dr', 'others', "'mr", 'c', 'l', 'reading']


## Analogies  

Because words are represented as vectors, it is interesting to try vector addition to predict the 4th word in an analogy. The premise being that A is to B as C is to D can be represented as:  $\mathbf{x_d}=\mathbf{x_b}-\mathbf{x_a}+\mathbf{x_c}$.  

![](notebook_images/analogies.png)



In [19]:
print(word_vector_nce.analogy('had', 'has', 'was', 5))

['is', 'was', 'has', 'looks', 'seems']


In [9]:
print(word_vector_nce.analogy('boot', 'boots', 'arm', 5))

['forehead', 'arm', 'cheeks', 'forearm', 'boots']


### Predict Replacement Words in a Passage

For fun, I took a random 200 word passage and used the network to make predictions to replace all the words. The results has some semblance of grammar, but is mostly nonsensical. This is to be expected, the model only uses the 2 preceding and 2 following words to make predictions. A Recurrent NN is a more appropriate tool for this, but here it is anyway:

#### Original Passage

**well,  it is just as i have been telling you,  mr.  sherlock holmes, " said jabez wilson,  mopping his forehead" i have a small pawnbroker's business at coburg square,  near the city.**

#### Reconstructed Passage

**oh,  it was,  as i have been told you,  mr.  sherlock holmes, " said sherlock wilson,  upon his? " i am a small public business at coburg square,  with the time.**

In [10]:
# grab 100 word passage from book
reverse_dict = word_vector_nce.get_reverse_dict()
passage = [x for x in map(lambda x: reverse_dict[x], word_array[12200:12300])]

In [11]:
# print passage with some crude formatting (e.g. space after comma)
readable = ''
for word in passage:
    if word == '"':
        readable += word
    elif word in ['?', '!', '.', ',']:
        readable += word + ' '
    else: 
        readable += ' ' + word
print(readable)

,  at eleven o'clock,  to duncan ross,  at the offices of the league,  7 pope's court,  fleet street. "" what on earth does this mean? " i ejaculated after i had twice read over the extraordinary announcement.  holmes chuckled and wriggled in his chair,  as was his habit when in high spirits. " it is a little off the beaten track,  isn't it? " said he. " and now,  mr.  wilson,  off you go at scratch and tell us all about yourself,  your


In [12]:
# use model to replace words in original passage with predicted words
# need to grab 2 words before and after passage
x, y = WindowModel.build_training_set(word_array[(12200-2):(12300+2)])
y_hat = model.predict(x, 120)
passage_predict = [x for x in map(lambda x: reverse_dict[x], y_hat[0])]

INFO:tensorflow:Restoring parameters from ../model-save/model_save-120


In [13]:
# print predicted passage
readable = ''
for word in passage_predict:
    if word == '"':
        readable += word
    elif word in ['?', '!', '.', ',']:
        readable += word + ' '
    else: 
        readable += ' ' + word
print(readable)

                        

,  after two o'clock,  to step baskerville,  at the time of the night,  though pope's court,  oxford street. "" what on earth will it be? " i thought after i could just right on the old open.  holmes rose and away in his death,  there in his wife,  in black street. " it is a man from the low door,  is he? " said holmes. " and now,  mr.  holmes,  when you go at last and give you all for it,  your


In [14]:
print(word_vector_nce.analogy('were', 'is', 'was', 5))

['is', 'was', 'over', 'becomes', 'does']


In [15]:
word = ","
print('8 closest words to:', "'" + word + "'")
print(word_vector_embed.n_closest(word=word, num_closest=8, metric='cosine'))
print('Hidden-to-output layer: 8 closest words to:', "'" + word + "'")
print(word_vector_nce.n_closest(word=word, num_closest=8, metric='cosine'))

8 closest words to: ','
['!', ';', '.', '?', 'illegibly', ':', 'chimerical', 'trove']
Hidden-to-output layer: 8 closest words to: ','
[';', '!', '?', 'doubt', ':', 'anyone', 't', 'blinds']


In [17]:
# word_counter is a list: 
# dictionary is a collection:
print('Model built. Vocab size = {}. Document length = {} words.'
      .format(len(dictionary), len(word_array)))
print("number of words: ", num_words)
print("dictionary length ", len(dictionary))

Model built. Vocab size = 11756. Document length = 247812 words.
number of words:  247812
dictionary length  11756
