
<h1 id="Random-Indexing">Random Indexing<a class="anchor-link" href="#Random-Indexing">¶</a></h1>


By **Robert Östling**

<p>After the third presenter in a row quoted Firth's famous words, "<em>You shall know a word by the company it keeps</em>", a colleague sitting next to me at a conference whispered "<em>not again!</em>"</p>
<p>Even though it may be overused, this observation underlies the field of distributional semantics, where we try to aggregate the contexts of words in order to approximate their meaning.</p>
<p>We start with an example. Let us make a list of the words occurring within two words of the word <em>four</em> in the Brown corpus. The first step is to extract all <em>context windows</em>, in this case 5-grams:</p>


In [1]:
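import nltk
from nltk.util import ngrams

# Fetch the Brown corpus if it is not installed yet (does nothing when it is already present)
nltk.download('brown', quiet=True)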

n = 5

text = list(nltk.corpus.brown.words())
vocabulary = set(text)

brown_ngrams = list(ngrams(text, n))

brown_ngrams[:8]

[('The', 'Fulton', 'County', 'Grand', 'Jury'),
 ('Fulton', 'County', 'Grand', 'Jury', 'said'),
 ('County', 'Grand', 'Jury', 'said', 'Friday'),
 ('Grand', 'Jury', 'said', 'Friday', 'an'),
 ('Jury', 'said', 'Friday', 'an', 'investigation'),
 ('said', 'Friday', 'an', 'investigation', 'of'),
 ('Friday', 'an', 'investigation', 'of', "Atlanta's"),
 ('an', 'investigation', 'of', "Atlanta's", 'recent')]
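
<p>To make the windowing concrete, here is a minimal sketch on a made-up sentence, without NLTK. The helper <em>sliding_windows</em> and the toy sentence are illustrative assumptions, not part of the original notebook; the split into focus word and context mirrors what is done with the Brown corpus.</p>

```python
# Toy illustration of sliding context windows; the sentence is invented for the example.
def sliding_windows(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the dog chased four small cats today'.split()
windows = sliding_windows(tokens, 5)

for window in windows:
    middle = len(window) // 2
    # The middle token is the focus word, the remaining four are its context
    focus = window[middle]
    context = window[:middle] + window[middle + 1:]
    print(focus, context)
```

<p>One consequence of this scheme is that the first two and last two tokens of a text never land in the middle position, so they are counted as context words but not as focus words. For a corpus the size of Brown this is negligible.</p>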


<p>Next, we gather for each word a list of other words which occur in the same 5-gram.</p>


In [None]:
from collections import defaultdict, Counter

# neighbors['focus']['context'] will contain the number of times the word 'context' occurs in the same n-gram as 'focus'
neighbors = defaultdict(Counter)

# Compute the position of the middle word in an n-gram (this is the focus word)
middle_position = n // 2

# For each n-gram, add the cooccurrence statistics:
for ngram in brown_ngrams:
    # This is the focus word
    focus = ngram[middle_position]
    # These are all the words _except_ the focus word
    context = ngram[:middle_position] + ngram[middle_position+1:]
    for word in context:
        neighbors[focus][word] += 1

# Now we can answer the original question:
print(', '.join('"%s"' % word for word,_ in neighbors['four'].most_common(10)))


<p>Now, let's look at a different word, <em>five</em>:</p>


In [None]:
print(', '.join('"%s"' % word for word,_ in neighbors['five'].most_common(10)))


<p>And a third word, <em>dog</em>:</p>


In [None]:
print(', '.join('"%s"' % word for word,_ in neighbors['dog'].most_common(10)))


<p>Note that the lists for <em>four</em> and <em>five</em> are more similar to each other than to that of <em>dog</em>.</p>
<p>Note that comparing the top neighbors is a very crude measure of contextual similarity; better ways of doing this are discussed below.</p>
<p>So far we have relied on an intuitive sense of similarity; the next step is to formalize it. To begin with, we simply count the number of overlapping words among the 100 most common neighbors:</p>


In [None]:
# Return the number of overlapping words in the top 100 list of contexts
def top100_similarity(word1, word2):
    return len({word for word,_ in neighbors[word1].most_common(100)} &
               {word for word,_ in neighbors[word2].most_common(100)})

# Print the similarity between each pair of words in the given list
def pairwise_similarity(words):
    for word1 in words:
        for word2 in words:
            print('%-8s %-8s %4d' % (word1, word2, top100_similarity(word1, word2)))
            
pairwise_similarity('three four dog'.split())


<p>As we can see, each word is (obviously) most similar to itself, but <em>three</em> and <em>four</em> indeed have more overlapping contexts than <em>three</em> and <em>dog</em> or <em>four</em> and <em>dog</em>.</p>
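
<p>The overlap measure can also be tried on a small scale without the corpus. The sketch below uses invented co-occurrence counts (the numbers are hypothetical, chosen only for illustration) and a generalized helper, <em>top_k_overlap</em>, that takes the list length as a parameter instead of fixing it at 100.</p>

```python
from collections import Counter

def top_k_overlap(counts1, counts2, k):
    """Count the words shared by the k most common entries of two Counters."""
    top1 = {word for word, _ in counts1.most_common(k)}
    top2 = {word for word, _ in counts2.most_common(k)}
    return len(top1 & top2)

# Hypothetical neighbor counts, not taken from the Brown corpus
toy_neighbors = {
    'three': Counter({'of': 9, 'the': 8, 'years': 5, 'or': 3}),
    'four': Counter({'the': 9, 'of': 7, 'years': 6, 'hundred': 2}),
    'dog': Counter({'the': 9, 'a': 6, 'barked': 4, 'his': 3}),
}

for word1, word2 in [('three', 'four'), ('three', 'dog'), ('four', 'dog')]:
    print(word1, word2, top_k_overlap(toy_neighbors[word1], toy_neighbors[word2], 3))
```

<p>Even in this toy data the very frequent word "the" is shared by all three words, which hints at why raw top lists are a crude measure. The measure is also symmetric, so each unordered pair only needs to be computed once.</p>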



<h2 id="Adding-randomness">Adding randomness<a class="anchor-link" href="#Adding-randomness">¶</a></h2>



<p>In distributional semantics we normally work with large corpora, with billions or even trillions of words. This means we want to minimize the amount of storage and computation needed for each word. The key idea behind Random Indexing is that you can assign every context word a random vector with a fixed number of dimensions (usually around 1000), and then represent a word by the sum of the vectors of its context words (instead of the top 100 lists used above).</p>
<p>First, let's create a random vector for each word in our vocabulary. These are termed <em>index vectors</em>:</p>


In [None]:
import random

# Dimensionality of vectors. Normally this is a large number, but for demonstration purposes we use a smaller number.
d = 5

# We represent a vector using a list of float values.
def random_vector():
    return [random.random() for _ in range(d)]

index_vector = { word: random_vector() for word in vocabulary }

[index_vector[word] for word in 'four five dog'.split()]


<p>These vectors are random, and there is no correlation between related words. This correlation will appear when we add the index vectors together into <em>context vectors</em>.</p>
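
<p>One way to see why summing produces correlation: adding an index vector once per co-occurrence gives the same result as weighting each index vector by its co-occurrence count. Two words with similar neighbor counts therefore end up with similar sums. The sketch below checks this on a toy vocabulary; the names <em>dim</em>, <em>toy_index</em> and both helper functions are assumptions made for this illustration only.</p>

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so repeated runs give the same vectors
dim = 4         # tiny dimensionality, for readability only
toy_vocabulary = ['the', 'of', 'years', 'barked']
toy_index = {word: [random.random() for _ in range(dim)] for word in toy_vocabulary}

def sum_from_stream(context_words):
    """Add one index vector per observed context word."""
    total = [0.0] * dim
    for word in context_words:
        for i, x in enumerate(toy_index[word]):
            total[i] += x
    return total

def sum_from_counts(counts):
    """Add each index vector once, weighted by its co-occurrence count."""
    total = [0.0] * dim
    for word, count in counts.items():
        for i, x in enumerate(toy_index[word]):
            total[i] += count * x
    return total

observed = ['the', 'of', 'the', 'years', 'the']
streamed = sum_from_stream(observed)
weighted = sum_from_counts(Counter(observed))
print(all(abs(x - y) < 1e-9 for x, y in zip(streamed, weighted)))
```

<p>The practical difference is storage: the streamed version never needs the full table of counts, only one fixed-size vector per word.</p>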


In [None]:
# Start with zero vectors
context_vector = { word: [0.0]*d for word in vocabulary }

# Add vector b to vector a and store the result in a:
# a <- a + b
def add_vector(a, b):
    for i,x in enumerate(b):
        a[i] += x

# This is almost exactly the same as we did earlier,
# except now we modify the context_vector object instead of the neighbors object.
for ngram in brown_ngrams:
    # This is the focus word
    focus = ngram[middle_position]
    # These are all the words _except_ the focus word
    context = ngram[:middle_position] + ngram[middle_position+1:]
    for word in context:
        add_vector(context_vector[focus], index_vector[word])

[context_vector[word] for word in 'four five dog'.split()]


<p>The values in these vectors are quite different, but that is largely because the words have different frequencies. Let us see what happens if we normalize the vectors, so that the elements of each sum to 1:</p>


In [None]:
# Normalize a vector so that the sum of its elements is 1
# We also round the result to 3 decimals, so it looks prettier when we print it,
# but this is not necessary.
def normalize(a):
    total = sum(a)
    return [round(x/total, 3) for x in a]

[normalize(context_vector[word]) for word in 'four five dog'.split()]


<p>Now the first two (representing <em>four</em> and <em>five</em>) are quite similar! The <em>dog</em> vector is not that far off either, but again we can use a distance measure to quantify this instead of relying on subjective judgment:</p>


In [None]:
import math

# Compute the Euclidean distance between vectors a and b
def euclidean(a, b):
    return math.sqrt(sum((x-y)*(x-y) for x,y in zip(a,b)))

# Print the distance between each pair of words in the given list
def pairwise_distance(words):
    for word1 in words:
        for word2 in words:
            print('%-8s %-8s %.3f' % (word1, word2, euclidean(
                normalize(context_vector[word1]),
                normalize(context_vector[word2]))))
            
pairwise_distance('three four dog'.split())
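
<p>A likely reason why even unrelated words come out fairly close is that <em>random.random()</em> only produces values between 0 and 1, so every index vector has the same expected value (0.5) in every dimension, and the sum-normalized context vectors are all pulled towards 1/d. Descriptions of Random Indexing in the literature commonly use sparse index vectors instead, with a few randomly placed +1 and -1 entries and zeros elsewhere. The following is a minimal sketch of such a generator; the function name and the parameter values are assumptions for illustration, not the method used in this notebook.</p>

```python
import random

def sparse_ternary_vector(dim, nonzero, rng=random):
    """Return a list of length dim with `nonzero` non-zero entries.

    About half of the non-zero entries are +1.0 and the rest are -1.0;
    all other entries are 0.0.
    """
    vector = [0.0] * dim
    positions = rng.sample(range(dim), nonzero)
    half = nonzero // 2
    for position in positions[:half]:
        vector[position] = 1.0
    for position in positions[half:]:
        vector[position] = -1.0
    return vector

example = sparse_ternary_vector(dim=1000, nonzero=10)
print(example.count(1.0), example.count(-1.0), example.count(0.0))
```

<p>With vectors like these the elements of a context vector can sum to nearly zero, so dividing by the sum is no longer a sensible normalization. Dividing by the vector length, or comparing directions directly, would be the natural replacement.</p>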


<p>For demonstration purposes we have used only pure Python, but in practice it is easier and faster to use the <em>numpy</em> and <em>scipy</em> libraries.</p>
<p>In particular, the module <a href="http://docs.scipy.org/doc/scipy/reference/spatial.distance.html">scipy.spatial.distance</a> contains functions for a wide variety of distance measures (including Euclidean distance, as shown above, and cosine distance, the counterpart of the cosine similarity that is perhaps more commonly used in practice).</p>
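
<p>For comparison with the Euclidean distance used earlier, here is a pure-Python sketch of cosine similarity. It is an illustrative helper written for this tutorial, not the scipy implementation. Because it only depends on the direction of the vectors, it needs no separate normalization step for word frequency.</p>

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_a * norm_b)

# Scaling a vector does not change its direction, so the similarity stays at 1
print(cosine_similarity([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
# Vectors at a right angle have similarity 0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

<p>Note that the scipy function named <em>cosine</em> returns a distance, that is, 1 minus this similarity, so smaller values mean more similar vectors.</p>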
