### Eighteenth-century word vectors

To aid our discussion, here are some tools for analyzing Heuser's ECCO vectors.

The source is [Eighteenth Century Collections Online (ECCO)](https://www.gale.com/primary-sources/eighteenth-century-collections-online). You should be able to access specific works from this collection from a Cornell IP. There is also [ECCO TCP](https://quod.lib.umich.edu/e/ecco?key=author;page=browse;value=g), which is a smaller, better-curated open-access subset. The vectors we have seem to have been trained on the larger, uncorrected OCR version: errors like the word *honefly* (what was this really?) do occur.

In [None]:
import numpy, sys, math
from matplotlib import pyplot
from sklearn.cluster import KMeans
from IPython.display import display, clear_output, Markdown, Latex

from collections import Counter

Heuser released the vectors he trained for his experiments. Here are the first 25,000 vectors.

In [None]:
vocabulary = []
reverse_vocabulary = {}

vector_filename = "../data/ecco/ecco_vectors.vec"

with open(vector_filename) as infile:
    
    matrix_shape = [int(x) for x in infile.readline().split()]
    
    embeddings = numpy.zeros(matrix_shape)
    
    for line in infile:
        fields = line.rstrip().split()
        
        word_id = len(vocabulary)
        vocabulary.append(fields[0])
        reverse_vocabulary[fields[0]] = word_id
        
        embeddings[word_id,:] = numpy.array([float(x) for x in fields[1:]])
        
normalizer = 1.0 / numpy.sqrt(numpy.sum(embeddings ** 2, axis=1))
embeddings *= normalizer[:, numpy.newaxis]
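
The last two lines of the loading cell rescale every row to unit length, which is why a plain dot product can stand in for cosine similarity later on. Here is a minimal sketch of that idea on a made-up 3-by-4 matrix; `toy`, `toy_normalizer`, and `toy_unit` are hypothetical names and the values are not from the ECCO vectors.

```python
import numpy

# Toy stand-in for the embedding matrix: 3 "words", 4 dimensions (made-up values).
toy = numpy.array([[3.0, 0.0, 4.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0],
                   [0.0, 2.0, 0.0, 0.0]])

# Same normalization as in the loading cell: divide each row by its Euclidean length.
toy_normalizer = 1.0 / numpy.sqrt(numpy.sum(toy ** 2, axis=1))
toy_unit = toy * toy_normalizer[:, numpy.newaxis]

# After normalization every row has length 1.
row_lengths = numpy.linalg.norm(toy_unit, axis=1)

# A dot product of two unit rows matches the cosine similarity of the original rows.
dot_01 = toy_unit[0].dot(toy_unit[1])
cosine_01 = toy[0].dot(toy[1]) / (numpy.linalg.norm(toy[0]) * numpy.linalg.norm(toy[1]))
```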

This next function returns the vector associated with a word. It looks up the numeric ID for the string in the `reverse_vocabulary` dictionary, and grabs the associated row from the embedding matrix.

In [None]:
def vector(word):
    word_id = reverse_vocabulary[word]
    return embeddings[word_id,:]
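
Because only part of the vocabulary is loaded, `vector` raises a `KeyError` for any word that is missing. If that gets in the way, one possible workaround is a lookup that returns `None` instead. This is only a sketch on toy data: `safe_vector`, `toy_reverse_vocabulary`, and `toy_embeddings` are hypothetical names, not part of the notebook.

```python
import numpy

# Toy stand-ins for reverse_vocabulary and embeddings (made-up values).
toy_reverse_vocabulary = {"king": 0, "queen": 1}
toy_embeddings = numpy.array([[1.0, 0.0],
                              [0.0, 1.0]])

def safe_vector(word):
    # dict.get returns None when the word is not in the vocabulary.
    word_id = toy_reverse_vocabulary.get(word)
    if word_id is None:
        return None
    return toy_embeddings[word_id, :]

missing = safe_vector("honefly")   # not in the toy vocabulary
found = safe_vector("queen")
```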

Calling this function returns a 300-dimensional vector. Here are the first 10 elements:

In [None]:
vector("king")[:10]

In [None]:
vector("queen")[:10]

Since anything we get from a call to `vector` is a `numpy` array, we can do mathematical operations on them, like subtraction.

In [None]:
diff = vector("king") - vector("man")
diff[:10]

In [None]:
def nearest(v):
    scores = embeddings.dot(v) / numpy.linalg.norm(v)
    return sorted(zip(scores, vocabulary), reverse=True)

def show_nearest(v, n=20):
    markdown_table = "|Cosine similarity | Word|\n|---:|:---|\n"
    sorted_words = nearest(v)
    for score, word in sorted_words[:n]:
        markdown_table += "|{:.3f}|{}|\n".format(score, word)
    markdown_table += "| ... | ... |\n"
    for score, word in sorted_words[-n:]:
        markdown_table += "|{:.3f}|{}|\n".format(score, word)

    display(Markdown(markdown_table))
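
`nearest` sorts the whole vocabulary, which is what lets `show_nearest` print both the top and the bottom of the list. If only the top few matches are needed, a variant based on `numpy.argsort` is one option. The sketch below uses a hypothetical three-word vocabulary with made-up unit vectors; `toy_nearest` is an illustrative name, not a replacement for the function above.

```python
import numpy

# Hypothetical toy vocabulary with unit-length rows (made-up values).
toy_vocabulary = ["king", "queen", "apple"]
toy_embeddings = numpy.array([[1.0, 0.0],
                              [0.8, 0.6],
                              [0.0, 1.0]])

def toy_nearest(v, n=2):
    # Rows already have length 1, so dividing by the length of v gives cosine similarity.
    scores = toy_embeddings.dot(v) / numpy.linalg.norm(v)
    # argsort is ascending: reverse it and keep the first n indices.
    top_ids = numpy.argsort(scores)[::-1][:n]
    return [(float(scores[i]), toy_vocabulary[i]) for i in top_ids]

# The query does not need to be unit length; the division above handles that.
result = toy_nearest(numpy.array([2.0, 0.0]))
```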

Here's the famous example of vector arithmetic providing analogies. What if you just do *king* + *woman*? What about just *king*? Or instead of *man* and *woman* use *he* and *she*?

In [None]:
show_nearest(vector("king") - vector("man") + vector("woman"), 7)
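
When trying analogies like this, the query words themselves often show up near the top of the list. A common convention (an assumption here, not something the notebook does) is to drop them from the results. The sketch below shows that filtering on a made-up five-word vocabulary; `toy_analogy` and the toy vectors are hypothetical.

```python
import numpy

# Made-up 4-dimensional vectors: think of the columns as royal, male, female, fruit.
toy_vocabulary = ["king", "queen", "man", "woman", "apple"]
toy_embeddings = numpy.array([
    [1.0, 1.0, 0.0, 0.0],   # king
    [1.0, 0.0, 1.0, 0.0],   # queen
    [0.0, 1.0, 0.0, 0.0],   # man
    [0.0, 0.0, 1.0, 0.0],   # woman
    [0.0, 0.0, 0.0, 1.0],   # apple
])
# Normalize rows to unit length, as the notebook does for the real matrix.
toy_embeddings /= numpy.linalg.norm(toy_embeddings, axis=1, keepdims=True)
toy_ids = {w: i for i, w in enumerate(toy_vocabulary)}

def toy_analogy(positive, negative):
    # Add the positive words, subtract the negative ones.
    v = sum(toy_embeddings[toy_ids[w]] for w in positive)
    v = v - sum(toy_embeddings[toy_ids[w]] for w in negative)
    scores = toy_embeddings.dot(v) / numpy.linalg.norm(v)
    # Rank every word, then skip the ones that were part of the query.
    query_words = set(positive) | set(negative)
    ranked = sorted(zip(scores, toy_vocabulary), reverse=True)
    return [(float(s), w) for s, w in ranked if w not in query_words]

analogy_result = toy_analogy(["king", "woman"], ["man"])
```

With these toy values the remaining top word is *queen*, but that is by construction; the real vectors may behave differently.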

Here's a direct look at the vectors. This heatmap shows the first 100 dimensions for nine words: the first three are male royalty, the middle three are female royalty, and the last three are female pronouns. Can we spot dimensions (columns) that seem to code for royalness (or maybe personhood) and others that code for gender? (... maybe? it's not obvious to me)

In [None]:
def get_word_rows(word_list):
    ids_array = numpy.array([reverse_vocabulary[w] for w in word_list])
    return embeddings[ids_array,:100]

In [None]:
words = ["king", "emperor", "prince", "queen", "empress", "princess", "she", "her", "hers"]

pyplot.figure(figsize=(14, 8))
pyplot.xticks([])
pyplot.yticks(range(len(words)), words)
pyplot.imshow(get_word_rows(words))
pyplot.show()
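
If eyeballing the heatmap is inconclusive, one possible follow-up is to compare group averages column by column and rank the columns by how far apart the groups are. This is a sketch of that idea on made-up numbers, not a claim about what the ECCO vectors contain; `male_rows`, `female_rows`, `column_gap`, and `ranked_columns` are hypothetical names.

```python
import numpy

# Made-up rows standing in for two groups of words (2 words each, 3 dimensions).
male_rows = numpy.array([[0.9, 0.1, 0.5],
                         [0.8, 0.2, 0.4]])
female_rows = numpy.array([[0.1, 0.1, 0.5],
                           [0.2, 0.3, 0.7]])

# Difference between the group means in each column.
column_gap = male_rows.mean(axis=0) - female_rows.mean(axis=0)

# Column indices ordered from largest to smallest absolute gap.
ranked_columns = numpy.argsort(-numpy.abs(column_gap))
```

With the real data, the two arrays could come from `get_word_rows` called on each group of words. A large gap in one column is only suggestive, since meaning in these vectors need not line up with individual dimensions.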

Here's an attempt to reproduce Heuser's plot showing simple and refined virtues/vices. The centering and distance calculations are my attempt to automate the "put labels on words on the periphery" aesthetic. Zoom in!

In [None]:
x = embeddings.dot(vector("virtue") - vector("vice"))
y = embeddings.dot(vector("simplicity") - vector("refinement"))

x -= x.mean()
y -= y.mean()
x /= x.std()
y /= y.std()

pyplot.figure(figsize=(30,30))
pyplot.scatter(x, y, alpha=0.3)
for i in range(len(vocabulary)):
    distance = 0.08 * (x[i]**2 + y[i]**2)
    if numpy.random.random() < distance ** 3:
        pyplot.text(x[i], y[i], vocabulary[i], alpha=0.8)
pyplot.show()
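
To unpack what each axis in that plot measures: taking the dot product of every row with a difference vector such as *virtue* minus *vice* gives a score that is positive for words leaning toward the first word and negative for words leaning toward the second. Centering and dividing by the standard deviation then puts both axes on a comparable scale. A minimal sketch with a made-up four-word vocabulary (all names and values here are hypothetical):

```python
import numpy

# Made-up unit vectors; "honesty" and "theft" are placed by hand for illustration.
toy_vocabulary = ["virtue", "vice", "honesty", "theft"]
toy_embeddings = numpy.array([[1.0, 0.0],
                              [-1.0, 0.0],
                              [0.6, 0.8],
                              [-0.6, 0.8]])
toy_ids = {w: i for i, w in enumerate(toy_vocabulary)}

# The axis is the difference between the two pole words.
axis = toy_embeddings[toy_ids["virtue"]] - toy_embeddings[toy_ids["vice"]]

# One score per word: its projection onto the axis (not divided by the axis length).
projection = toy_embeddings.dot(axis)

# Same standardization as in the plotting cell: zero mean, unit standard deviation.
standardized = (projection - projection.mean()) / projection.std()
```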