Word embeddings are a neural network's representation of the relationships between words. A network that has seen,
say, 20 billion words in English and a bunch of other languages often has a lot to say about what words are all about.
A word embedding takes the form of a giant matrix,
which sounds a bit boring, but what's neat is that every row of the matrix represents a word 
as a vector of real numbers.
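
To make the row-per-word idea concrete, here is a minimal sketch with a made-up three-word vocabulary and invented 4-dimensional vectors. Real embeddings have hundreds of dimensions and a far larger vocabulary, and the names `vocab`, `embeddings`, and `vector_for` are purely illustrative:

```python
import numpy as np

# Toy vocabulary: each word maps to a row index in the embedding matrix.
vocab = {"cat": 0, "dog": 1, "essay": 2}

# Invented 3 x 4 matrix: one row per word, one column per dimension.
embeddings = np.array([
    [0.9, 0.1, 0.3, 0.0],   # cat
    [0.8, 0.2, 0.4, 0.1],   # dog
    [0.0, 0.7, 0.1, 0.9],   # essay
])

def vector_for(word):
    """Return the row of the matrix that represents `word`."""
    return embeddings[vocab[word]]

print(vector_for("dog"))
```

Looking up a word is just indexing a row, which is why the matrix and the list of word vectors are the same thing.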

With these vectors you can compare words in nifty ways. Computing the cosine of the angle between two vectors
gives the cosine similarity score, which maxes out at 1 if the vectors have the same direction and gets lower as the angle between the vectors increases, bottoming out at -1 when they point in opposite directions:

$$\cos(\theta) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
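
As a rough sketch, the formula translates almost directly into NumPy. The function name `cosine_similarity` and the example vectors are my own choices for illustration, not part of any library used in this post:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
right_angle = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions
```

The three results come out at roughly 1, 0, and -1. Note that the parallel pair scores 1 even though one vector is twice as long as the other: only direction matters, not length.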

## Riding English
Using this metric you can choose an arbitrary vector and find the words closest to it.
This vector could be the vector for a word, or something you calculate yourself. Let's look up the nearest neighbors to "English" in a
200-dimensional GloVe embedding trained on 27 billion words from Twitter (available from the GloVe creators):


In [1]:
# load word vectors
import os
import zipfile
import numpy as np
try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3
from miniglove import Glove

myglove = Glove()
glove_folder = "downloads"
glove_path = os.path.join(glove_folder, "glove.twitter.27B.200d.txt")
zip_path = os.path.join(glove_folder, "glove.zip")
# Download if necessary (big)
if not os.path.isfile(glove_path):
    print("Downloading pretrained GloVe")
    if not os.path.isdir(glove_folder):
        os.makedirs(glove_folder)
    with open(zip_path, "wb") as zip_file:
        glove_url = "http://nlp.stanford.edu/data/glove.twitter.27B.zip"
        zip_file.write(urlopen(glove_url).read())
    print("Downloaded.\nExtracting...")
    # Unpack every file in the archive into the download folder
    with zipfile.ZipFile(zip_path, "r") as glove_zip:
        glove_zip.extractall(glove_folder)
    print("Extracted")
myglove.load_glove(glove_path, gz=False)
# Keep only the word (first element) from each result
near_words = [i[0] for i in myglove.get_nearest('english')]
for wd in near_words:
    print(wd)

english
spanish
language
math
french
speaking
class
arabic
exam
essay
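
The internals of `get_nearest` aren't shown here, but a lookup like this can be sketched in plain NumPy: normalize every row, take the dot product with the normalized query, and sort. The words and 3-dimensional vectors below are invented for illustration, and `nearest` is a hypothetical helper, not the `miniglove` implementation:

```python
import numpy as np

words = ["english", "spanish", "math", "cat"]
# Invented 3-dimensional vectors, chosen only for illustration.
vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.0],
    [0.6, 0.1, 0.7],
    [0.0, 0.1, 0.9],
])

def nearest(query, k=3):
    """Return the k (word, cosine similarity) pairs closest to `query`."""
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows.dot(unit_query)      # cosine similarity to every row
    best = np.argsort(-scores)[:k]          # indices of the highest scores
    return [(words[i], float(scores[i])) for i in best]

neighbors = nearest(vectors[words.index("english")])
```

With these toy vectors the query word itself comes back first with a similarity of about 1, which matches the real output above, where "english" heads its own neighbor list.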


This example shows some of the diversity in relationships that a word embedding model can represent. The relationship
between _English_ and _Spanish_ is different from the relationship between _English_ and _language_. It also points to a shortcoming of the model: the _English_ that is related to _math_, _class_, and _exam_ is a different word sense from the _English_ that is related to _language_ and _speaking_, yet both senses share a single vector.

A lot of ink has already been spilled on how and why GloVe and word2vec encode the semantic and syntactic content of words. What I'd like to point out is the extent to which they also encode
*stylistic* relationships, even across semantically and syntactically diverse contexts.

Twitter data provides a nice playground for this because it plays host to many different styles and varieties of English. 

In [8]:
boy = myglove.get_vec("boy")
girl = myglove.get_vec("girl")
cat = myglove.get_vec("cat")
print myglove.nearest_to_vec(cat)
myglove[myglove['piccolo'] - myglove['flute'] + myglove['drum']]

[(u'cat', 1.0000000000000002), (u'dog', 0.83243026697337774), (u'cats', 0.76851845026425081), (u'kitty', 0.75044559650017162), (u'kitten', 0.74896981271733654), (u'pet', 0.73198618323283737), (u'puppy', 0.70231925278655516), (u'dogs', 0.70163820992334303), (u'animal', 0.6421107261361646), (u'bear', 0.63091854794411106)]
<type 'str'>


KeyError: 'piccolo'
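
The last line of the cell above attempts vector arithmetic of the form `piccolo - flute + drum`. Independently of `miniglove`, the idea can be sketched with hand-placed toy vectors. Everything below is an assumption made for illustration: the 2-dimensional vectors are invented rather than taken from a trained embedding, and `analogy` is a hypothetical helper.

```python
import numpy as np

# Hand-placed 2-d toy vectors; NOT taken from any trained embedding.
toy = {
    "flute":   np.array([1.0, 0.2]),
    "piccolo": np.array([1.0, 0.9]),
    "drum":    np.array([-1.0, 0.2]),
    "bongo":   np.array([-1.0, 0.8]),
    "cello":   np.array([0.5, -0.5]),
}

def cosine(x, y):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def analogy(a, b, c):
    """Return the word closest to toy[a] - toy[b] + toy[c], skipping the inputs."""
    target = toy[a] - toy[b] + toy[c]
    candidates = [w for w in toy if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(toy[w], target))

answer = analogy("piccolo", "flute", "drum")
```

In this toy space the offset from `flute` to `piccolo` is added to `drum`, and the closest remaining word is `bongo`. Whether a real embedding gives an equally tidy answer depends on the data it was trained on.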