# Introduction to Word Embeddings

In the previous section we discussed topic models, which find groups of words that tend to occur together. These clusters often correspond to recognizable themes, which is why we describe them as "topics". Topic models do not consider word order within documents.

Word embeddings are a similar method for finding relationships between words. They represent each word as a vector, usually in a space with 50 to 200 dimensions. Two words $a$ and $b$ whose vectors are close together tend to be related in one of two ways:

* Substitutability. If word $a$ appears in a sentence, you could replace it with word $b$ and not change the meaning that much.
* Collocation. Word $a$ usually appears immediately before or after word $b$.

Embedding algorithms pay attention to word order, but only within a small sliding window. We will discuss how this works in more detail on Wednesday. For now we will just look at what we can get from embeddings.
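To make the sliding window concrete, here is a minimal sketch of how (target, context) pairs could be collected from a sentence. It is only an illustration: the function name `context_pairs`, the window size of 2, and the example sentence are made up here, and the algorithm that produced the vectors in this notebook may build its windows differently.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for each token and every neighbor within `window` positions."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

# Hypothetical example sentence, not taken from either collection
sentence = "the king rode his horse to the ship".split()
pairs = list(context_pairs(sentence, window=2))

# Context words seen around "horse": only neighbors at distance 1 or 2
horse_contexts = [context for target, context in pairs if target == "horse"]
print(horse_contexts)
```

Words further apart than the window never form a pair, so "horse" and "ship" are not linked directly in this sentence.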

In [None]:
import numpy, sys, math
from matplotlib import pyplot
from sklearn.cluster import KMeans
from IPython.display import display, clear_output, Markdown, Latex

from collections import Counter

We'll compare word vectors from two collections:

* English translations of Icelandic sagas
* The same Old Bailey 1820s collection we looked at last week

I preprocessed these in slightly different ways. See if you can tell how.

In [None]:
vocabulary = []
reverse_vocabulary = {}

#vector_filename = "../data/OldBailey/oldbailey.vec"
vector_filename = "../data/Sagas/sagas_en.vec"

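with open(vector_filename) as infile:

    # The first line gives the matrix shape: number of words, then number of dimensions
    matrix_shape = [int(x) for x in infile.readline().split()]

    embeddings = numpy.zeros(matrix_shape)

    # Each remaining line is a word followed by the values of its vector
    for line in infile:
        fields = line.rstrip().split()

        word_id = len(vocabulary)
        vocabulary.append(fields[0])
        reverse_vocabulary[fields[0]] = word_id

        embeddings[word_id,:] = numpy.array([float(x) for x in fields[1:]])

# Scale every row to unit length, so a dot product between rows is a cosine similarity
normalizer = 1.0 / numpy.sqrt(numpy.sum(embeddings ** 2, axis=1))
embeddings *= normalizer[:, numpy.newaxis]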

In [None]:
# The vocabulary includes punctuation and the end-of-sentence marker </s>
" ".join(vocabulary[:20])

### Why linear algebra is great

An embedding is really a matrix, with one row for each word. That means that anything we can do with matrices we can do with these embeddings.

We'll start by looking for near neighbors by cosine similarity.
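The next cell scores neighbors with a single matrix-vector product. The sketch here checks, on made-up data, that once every row has unit length a plain dot product gives the same number as cosine similarity computed the long way. The names `toy`, `unit`, and `cosine` are introduced only for this illustration and are not part of the notebook's data.

```python
import numpy

rng = numpy.random.default_rng(0)
toy = rng.normal(size=(5, 4))  # 5 hypothetical "words" in 4 dimensions

def cosine(u, v):
    """Cosine similarity from raw (unnormalized) vectors."""
    return u.dot(v) / (numpy.linalg.norm(u) * numpy.linalg.norm(v))

# Scale each row to unit length, as the loading cell does for the real embeddings
unit = toy / numpy.linalg.norm(toy, axis=1, keepdims=True)

# One matrix-vector product scores every row against row 0
scores = unit.dot(unit[0, :])

# The same scores, one pair at a time
long_way = numpy.array([cosine(toy[i], toy[0]) for i in range(len(toy))])
print(numpy.allclose(scores, long_way))
```

A word always has similarity 1 with itself, so the query word should come first in any nearest-neighbor list built this way.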

In [None]:
def nearest(query):
    q_id = reverse_vocabulary[query]
    scores = embeddings.dot(embeddings[q_id,:])
    return sorted(zip(scores, vocabulary), reverse=True)

def show_nearest(query, n=20):
    markdown_table = "|Cosine similarity | Word|\n|---:|:---|\n"
    sorted_words = nearest(query)
    for score, word in sorted_words[:n]:
        markdown_table += "|{:.3f}|{}|\n".format(score, word)
    display(Markdown(markdown_table))

In [None]:
show_nearest("horse")

Next, we'll cluster words by their embeddings.

In [None]:
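num_clusters = 50
# Fixing n_init and random_state keeps the cluster numbering the same from run to run
clustering = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(embeddings)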

In [None]:
numpy_vocab = numpy.array(vocabulary)
for cluster in range(num_clusters):
    cluster_words = numpy_vocab[clustering.labels_ == cluster]
    print(cluster, " ".join(cluster_words[:12]))

Finally, we can use an SVD to visualize the dimensions of maximum variation, and look for semantic clusters.

In [None]:
U, dimension_weights, Vt = numpy.linalg.svd(embeddings, full_matrices=False)

The first dimension mostly reflects word frequency, which is why the plot below uses the second and third dimensions instead. The remaining singular values are close to one another, so no pair of dimensions dominates, and we shouldn't expect a 2D plot to summarize everything.
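To judge how much a 2D plot can show, one option is to look at each dimension's share of the total squared singular values. This is a rough sketch on random stand-in data, not the saga or Old Bailey vectors. The names `toy`, `share`, and `plotted_share` are made up for this example, and because the matrix is not mean-centered the shares are only a loose stand-in for "variance explained".

```python
import numpy

rng = numpy.random.default_rng(0)
toy = rng.normal(size=(200, 10))                      # stand-in for an embedding matrix
toy /= numpy.linalg.norm(toy, axis=1, keepdims=True)  # unit-length rows, as in this notebook

# Singular values come back sorted from largest to smallest
_, weights, _ = numpy.linalg.svd(toy, full_matrices=False)

# Each dimension's share of the total squared singular values
share = weights ** 2 / numpy.sum(weights ** 2)

# Share covered by the two dimensions used for plotting (indices 1 and 2)
plotted_share = share[1] + share[2]
print(share.round(3), plotted_share.round(3))
```

The same two lines applied to `dimension_weights` would show how much of the real embedding matrix the 2D plot covers.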

In [None]:
dimension_weights[:10]

In [None]:
x = U[:,1]
y = U[:,2]

pyplot.figure(figsize=(30,30))
pyplot.scatter(x, y, alpha=0.1)
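# Label each of the first 700 words, then every fifth word up to index 3000
num_labeled = min(3000, len(vocabulary))
for i in range(0, min(700, num_labeled)):
    pyplot.text(x[i], y[i], vocabulary[i])
for i in range(700, num_labeled, 5):
    pyplot.text(x[i], y[i], vocabulary[i])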
pyplot.show()