# Vector representations of words

There are many ways to build a semantic network with words as nodes and similarity scores as edge weights. We could also threshold the weights and keep only unweighted edges between word pairs whose score exceeds the threshold. A Bayesian approach such as Latent Dirichlet Allocation would build a semantic network whose weights are probabilities of relatedness. Here I use a vector space model (VSM) of semantics instead, in which similarity is given by the cosine of the angle between the vector representations of two words, $\mathbf{x}$ and $\mathbf{y}$,

\begin{equation}
\cos \theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \lVert \mathbf{y} \rVert}
\end{equation}
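As a minimal sketch of the formula above and of the thresholding idea, the following computes cosine similarities with NumPy and keeps an unweighted edge wherever the score exceeds a threshold. The function names, the toy vectors, and the threshold value are all illustrative assumptions and are not part of `word2vec_basic.py`.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def similarity_network(vectors, threshold):
    """Return (weights, adjacency) for the rows of `vectors`.

    weights[i, j] is the cosine similarity between rows i and j;
    adjacency[i, j] is True where that similarity exceeds `threshold`
    (self-loops on the diagonal are removed).
    """
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    weights = unit @ unit.T
    adjacency = weights > threshold
    np.fill_diagonal(adjacency, False)
    return weights, adjacency


# Hypothetical 2-d "word vectors", only for illustration.
toy_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
weights, adjacency = similarity_network(toy_vectors, threshold=0.5)
```

With these toy vectors, rows 0 and 2 are orthogonal, so no edge joins them, while row 1 sits at 45 degrees to both and is linked to each.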

In this notebook, I'll use TensorFlow to build and query a semantic network by following this tutorial: https://www.tensorflow.org/versions/r0.12/tutorials/word2vec/

In particular, I have taken the code from the tutorial's `word2vec_basic` file, which was originally run as a script, and either reproduced it here or rolled it into helper functions that I import. I have tried to keep the most relevant lines in this notebook so that the essential elements are clear. In my version of `word2vec_basic.py`, the remaining steps are wrapped in a function called `run_word2vec()`.

In [1]:
from word2vec_basic import (
    maybe_download, read_data, build_dataset, generate_batch, run_word2vec
)

In [2]:
# I don't understand what's going on with the printing in
# foreach below.
#
# This comes from what I've made into the __main__ of
# my edited version of word2vec_basic.py.

filename = maybe_download('text8.zip', 31344016)

vocabulary = read_data(filename)

vocabulary_size = 50000

data, count, dictionary, reverse_dictionary = build_dataset(
    vocabulary, vocabulary_size
)

del vocabulary

data_index = 0

Found and verified text8.zip

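To make the four return values of `build_dataset` concrete, here is a rough sketch of what such a function might do, modelled on my reading of the TensorFlow tutorial: keep the most frequent words, give each an integer id by frequency rank, and map everything else to an `UNK` token with id 0. The name `build_dataset_sketch` and the toy corpus are hypothetical, and the imported helper may differ in its details.

```python
from collections import Counter


def build_dataset_sketch(words, vocabulary_size):
    """Map words to integer ids, keeping the most frequent ones.

    Assumption: id 0 is reserved for 'UNK' (any word outside the
    vocabulary) and the remaining ids follow frequency rank.
    """
    counts = [["UNK", 0]]
    most_common = Counter(words).most_common(vocabulary_size - 1)
    counts.extend([word, n] for word, n in most_common)

    # word -> id, in order of decreasing frequency
    dictionary = {word: index for index, (word, _) in enumerate(counts)}

    # The corpus re-expressed as a list of ids; unknown words become 0.
    data = [dictionary.get(word, 0) for word in words]
    counts[0][1] = data.count(0)

    # id -> word, used when printing results
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return data, counts, dictionary, reverse_dictionary


# Hypothetical toy corpus, only for illustration.
toy_words = "the cat sat on the mat the cat".split()
toy_data, toy_count, toy_dictionary, toy_reverse = build_dataset_sketch(
    toy_words, 3
)
```

With a vocabulary of 3, only `the` and `cat` receive their own ids; `sat`, `on`, and `mat` all collapse to `UNK`.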

In [3]:
BATCH_SIZE = 20
batch, labels = generate_batch(
    data, count, dictionary, reverse_dictionary,
    batch_size=BATCH_SIZE, num_skips=2, skip_window=1
)

In [4]:
for i in range(BATCH_SIZE):
    print(batch[i], reverse_dictionary[batch[i]], '->', labels[i, 0],
          reverse_dictionary[labels[i, 0]])

3081 originated -> 5234 anarchism
3081 originated -> 12 as
12 as -> 6 a
12 as -> 3081 originated
6 a -> 195 term
6 a -> 12 as
195 term -> 2 of
195 term -> 6 a
2 of -> 195 term
2 of -> 3134 abuse
3134 abuse -> 46 first
3134 abuse -> 2 of
46 first -> 59 used
46 first -> 3134 abuse
59 used -> 156 against
59 used -> 46 first
156 against -> 128 early
156 against -> 59 used
128 early -> 156 against
128 early -> 742 working

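The output above pairs each centre word with the words on either side of it, which is what a skip-gram batch with `skip_window=1` and `num_skips=2` should contain. The sketch below reproduces that pairing for a short list of ids. It is only an illustration: `skipgram_pairs` is a hypothetical helper, it lists context words in left-to-right order, and the real `generate_batch` appears to pick them in random order.

```python
def skipgram_pairs(token_ids, skip_window=1):
    """List (centre, context) id pairs for every full window.

    Each centre word is paired with every word at most `skip_window`
    positions to its left or right. Words too close to either end of
    the sequence to have a full window are skipped as centres.
    """
    pairs = []
    for centre in range(skip_window, len(token_ids) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset != 0:
                pairs.append((token_ids[centre], token_ids[centre + offset]))
    return pairs


# Ids taken from the printed batch: anarchism originated as a term
toy_ids = [5234, 3081, 12, 6, 195]
pairs = skipgram_pairs(toy_ids)
```

For these five ids the sketch yields the same six pairs as the first six lines of the printed batch, though not in the same order within each centre word.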

In [7]:
run_word2vec(data, count, dictionary, reverse_dictionary, vocabulary_size)

Initialized
Average loss at step  0 :  271.258544922
Nearest to UNK: hedonistic, bosworth, bodies, precipitating, formalization, sponge, transaction, cain,
Nearest to while: boii, asses, heiner, teborg, taxa, chivalry, stamford, bats,
Nearest to th: sedative, coerce, applicants, edomites, marshallese, barbagia, currency, bartle,
Nearest to can: beech, sited, outtakes, onshore, sheehan, trustworthy, goddesses, redirection,
Nearest to many: dinar, torquay, mediating, itching, mep, oppression, geffen, fiery,
Nearest to other: team, envelopment, poorest, embarrassed, rheims, miata, fighting, watchful,
Nearest to states: orphan, caribs, uighur, appendices, misgivings, outfielder, meson, frey,
Nearest to b: stricter, halogen, humours, asymptotic, cube, fractional, regan, encourages,
Nearest to system: permissions, annoy, kanji, through, inflation, ferrite, occitan, othniel,
Nearest to was: cordless, cups, buyout, mah, museum, llewelyn, lassa, snowmobiles,
Nearest to an: subregions, kroto, la

Average loss at step  44000 :  0.964760509759
Average loss at step  46000 :  0.947724177212
Average loss at step  48000 :  0.955948749095
Average loss at step  50000 :  0.950863456696
Nearest to UNK: the, been, whilst, used, to, as, self, that,
Nearest to while: boii, asses, heiner, teborg, taxa, chivalry, stamford, bats,
Nearest to th: sedative, coerce, applicants, edomites, marshallese, barbagia, currency, bartle,
Nearest to can: beech, sited, outtakes, onshore, sheehan, trustworthy, goddesses, redirection,
Nearest to many: dinar, torquay, mediating, itching, mep, oppression, geffen, fiery,
Nearest to other: team, envelopment, poorest, embarrassed, rheims, miata, fighting, watchful,
Nearest to states: orphan, caribs, uighur, appendices, misgivings, outfielder, meson, frey,
Nearest to b: stricter, halogen, humours, asymptotic, cube, fractional, regan, encourages,
Nearest to system: permissions, annoy, kanji, through, inflation, ferrite, occitan, othniel,
Nearest to was: cordless, cups

Average loss at step  92000 :  0.942836084843
Average loss at step  94000 :  0.938486485243
Average loss at step  96000 :  0.936984151125
Average loss at step  98000 :  0.939497591645
Average loss at step  100000 :  0.934990822941
Nearest to UNK: used, whilst, been, the, as, to, a, means,
Nearest to while: boii, asses, heiner, teborg, taxa, chivalry, stamford, bats,
Nearest to th: sedative, coerce, applicants, edomites, marshallese, barbagia, currency, bartle,
Nearest to can: beech, sited, outtakes, onshore, sheehan, trustworthy, goddesses, redirection,
Nearest to many: dinar, torquay, mediating, itching, mep, oppression, geffen, fiery,
Nearest to other: team, envelopment, poorest, embarrassed, rheims, miata, fighting, watchful,
Nearest to states: orphan, caribs, uighur, appendices, misgivings, outfielder, meson, frey,
Nearest to b: stricter, halogen, humours, asymptotic, cube, fractional, regan, encourages,
Nearest to system: permissions, annoy, kanji, through, inflation, ferrite, occ