# Introduction
Before feeding data to our recurrent models, we first need to convert words to a numeric representation (a feature vector) that also conveys some of their meaning. Two widely used options are:
* [GloVe](https://nlp.stanford.edu/projects/glove/)
* [Word2Vec](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
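
As a rough illustration of what converting words to a numeric representation means in practice, the sketch below maps each token of a sentence to a fixed-size vector. The result is the sequence of feature vectors that a recurrent model would consume one step at a time. The three-dimensional values in `toy_vectors`, the `sentence_to_vectors` helper, and the choice to map unknown words to a zero vector are all assumptions made for this example; real GloVe or Word2Vec vectors are learned from a corpus and have many more dimensions.

```python
# Toy embedding table: the values are invented for illustration only.
toy_vectors = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.9, 0.3, 0.1],
    "sat": [0.2, 0.8, 0.5],
}
EMBEDDING_DIM = 3

def sentence_to_vectors(sentence):
    """Return one vector per whitespace-separated token (hypothetical helper)."""
    unknown = [0.0] * EMBEDDING_DIM  # assumption: unknown words become zeros
    return [toy_vectors.get(token, unknown) for token in sentence.lower().split()]

sequence = sentence_to_vectors("The cat sat down")
print(len(sequence), "time steps of size", EMBEDDING_DIM)
```

Each inner list plays the role of one input step; with real embeddings only the table and its dimensionality would change.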

#### References
* [PyTorch tutorials](http://pytorch.org/tutorials/)
* [PyTorch text GloVe tutorial](https://github.com/spro/practical-pytorch/blob/master/glove-word-vectors/glove-word-vectors.ipynb)
* [PyTorch word embeddings tutorial](http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#sphx-glr-beginner-nlp-word-embeddings-tutorial-py)
* [The amazing power of word vectors](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/)
* [GloVe: global vectors for word representation](https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/)
* [NLTK Python NLP Toolkit](http://www.nltk.org/)

### Download GloVe word vectors and define some helper functions

In [1]:
import torch
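# PyTorch extension for NLP (datasets, plus tools such as pretrained GloVe vectors)
# Note: load_word_vectors comes from an early torchtext release; later
# versions expose pretrained vectors through torchtext.vocab.GloVe instead.
from torchtext.vocab import load_word_vectors

# Get a word-to-index dictionary, the array of word vectors, and the vector size (100 dimensions)
# This will download around 800 MB if the vectors are not already cached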
wv_dict, wv_arr, wv_size = load_word_vectors('.', 'glove.6B', 100)
print('Loaded', len(wv_arr), 'words')

# Define simple function to get the word vector (using returned dictionaries)
def get_embeddings(word):
    return wv_arr[wv_dict[word.lower()]]

# Get closest n words
def get_closest(d, n=10):
    all_dists = [(w, torch.dist(d, get_embeddings(w))) for w in wv_dict]
    return sorted(all_dists, key=lambda t: t[1])[:n]

loading word vectors from ./glove.6B.100d.pt
Loaded 400000 words
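
The two helpers above rely on a simple layout: `wv_dict` maps each word to a row index and `wv_arr` holds one vector per row. The sketch below reproduces that layout with plain Python lists, so the lookup and the nearest-neighbour search can be followed without downloading anything. The names `toy_dict`, `toy_arr`, `toy_get_embeddings`, and `toy_get_closest` and the 2-D vectors are made up for this example; `math.dist` (Python 3.8+) stands in for `torch.dist`, and both compute the Euclidean distance by default.

```python
import math

# Word -> row index, mirroring the role of wv_dict (invented vocabulary)
toy_dict = {"paris": 0, "london": 1, "banana": 2, "apple": 3}

# One row per word, mirroring the role of wv_arr (invented coordinates)
toy_arr = [
    [1.0, 1.0],  # paris
    [1.2, 0.9],  # london
    [5.0, 4.0],  # banana
    [5.1, 4.2],  # apple
]

def toy_get_embeddings(word):
    # Lowercase the word, find its row index, return that row
    return toy_arr[toy_dict[word.lower()]]

def toy_get_closest(vector, n=2):
    # Distance from `vector` to every word, then keep the n smallest
    dists = [(w, math.dist(vector, toy_get_embeddings(w))) for w in toy_dict]
    return sorted(dists, key=lambda pair: pair[1])[:n]

closest = toy_get_closest(toy_get_embeddings("Paris"))
print(closest)
```

A word is always its own nearest neighbour at distance 0, so the first entry here is `paris` itself, followed by `london`.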


In [2]:
vec_europa = get_embeddings('europa')

### Doing some vector arithmetic
One of the coolest features of a well-trained word-vector model is that we can do arithmetic on the semantic vectors: for example, the vector for king - man + woman lands close to the vector for queen.
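
To see why such arithmetic can work, consider a hand-made 2-D space in which the first coordinate loosely encodes "royalty" and the second "gender". These coordinates are invented for this illustration and are not taken from GloVe; learned embeddings spread such properties over many dimensions and capture them only approximately. Subtracting `man` from `king` removes the gender component, and adding `woman` puts the opposite one back:

```python
import math

# Invented coordinates: [royalty, gender]
toy = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
}

# king - man + woman, computed coordinate by coordinate
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank the toy vocabulary by Euclidean distance to the target
ranked = sorted(toy, key=lambda word: math.dist(target, toy[word]))
print(ranked[0])
```

In this toy space the nearest word to the target is `queen`. With real vectors the match is never exact, which is why the search returns a ranked list of candidates rather than a single answer.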

In [3]:
# Analogy in the form w1 - w2 + w3 : ?
# n is the number of candidates fetched before the given words are filtered out
def analogy(w1, w2, w3, n=10, filter_given=True):
    print('\n[%s : %s :: %s : ?]' % (w1, w2, w3))

    # w1 - w2 + w3 = w4
    closest_words = get_closest(get_embeddings(w1) - get_embeddings(w2) + get_embeddings(w3), n=n)
    
    # Optionally filter out given words
    if filter_given:
        closest_words = [t for t in closest_words if t[0] not in [w1, w2, w3]]
    
    return closest_words

In [4]:
analogy('king','man', 'woman')


[king : man :: woman : ?]


[('queen', 4.081078406285764),
 ('monarch', 4.642907605578163),
 ('throne', 4.905500998697052),
 ('elizabeth', 4.921558575828151),
 ('prince', 4.981146936056392),
 ('daughter', 4.985715105960012),
 ('mother', 5.064087418704118),
 ('cousin', 5.077497332002661),
 ('princess', 5.07868555349649)]