# What do word vectors represent?

On Monday we saw the result of running a word embedding algorithm on two collections. We "embed" words in a vector space such that words that are substitutable ("ship" and "boat") or that occur together ("Stop" and "thief!") end up close to each other.

In today's work we will look at the intuition for how the properties of embedding vectors relate to properties we can directly observe in texts. The vectors are really just representations of the words that occur near a given word.

What can we tell about a word from the words that occur near it?

In [None]:
import numpy, sys, math
from IPython.display import display, clear_output, Markdown, Latex

from collections import Counter

In [None]:
## Helper functions to nicely display numeric word scores

def show(sorted_words, n=20):
    markdown_table = "|Score | Word|\n|---:|:---|\n"
    for score, word in sorted_words[:n]:
        markdown_table += "|{:.3f}|{}|\n".format(score, word)
    display(Markdown(markdown_table))
    
def show_counter(counter, n=20):
    markdown_table = "|Count | Word|\n|---:|:---|\n"
    for word, score in counter.most_common(n):
        markdown_table += "|{}|{}|\n".format(score, word)
    display(Markdown(markdown_table))

First we'll read the texts. I've already split the tokens with spaCy and written the output to a file with one sentence per line, so punctuation marks appear as distinct tokens.

While we read this, we'll also count the frequency of each word type in `all_counter`.

In [None]:
text_filename = "../data/Sagas/sagas_en_split.txt"

sentences = []
all_counter = Counter()

with open(text_filename, encoding="utf-8") as reader:
    for line in reader:
        ## The file has already been tokenized, so we can split on whitespace
        tokens = line.strip().split()
        all_counter.update(tokens)
        
        sentences.append(tokens)
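
To make the difference between word *tokens* and word *types* concrete, here's a small sketch of the same reading loop on a made-up corpus. The `toy_lines` data and the `toy_` names are hypothetical, not drawn from the sagas file.

```python
from collections import Counter

# Hypothetical pre-tokenized lines standing in for the saga file
toy_lines = ["Then the king sailed north .", "The king was glad ."]

toy_sentences = []
toy_counter = Counter()
for line in toy_lines:
    tokens = line.strip().split()   # already tokenized, so whitespace is enough
    toy_counter.update(tokens)
    toy_sentences.append(tokens)

n_tokens = sum(toy_counter.values())  # running words, counting repeats
n_types = len(toy_counter)            # distinct strings

print(n_tokens, n_types)
print(toy_counter["the"], toy_counter["The"])
```

Note that "the" and "The" are counted as different types because no case folding is applied, and the period is a token like any other.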

Let's start by looking at the context that words appear in. The next block defines a *key word in context* (KWIC) view.

In [None]:
window_size = 5

In [None]:
def keyword_in_context(query):
    table_markdown = "|left context|word|right context|\n|--:|--|:--|\n"
    for sentence in sentences:
        
        if query not in sentence:
            continue
            
        for i, word in enumerate(sentence):
            if word == query:
                start = max(i-window_size, 0)
                left_context = sentence[start:i]
                right_context = sentence[(i+1):(i+window_size+1)]
                table_markdown += "|{}|{}|{}|\n".format(" ".join(left_context), word, " ".join(right_context))
                    
    display(Markdown(table_markdown))
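
The slicing in the KWIC function is worth a closer look. Here's a minimal sketch, using a hypothetical helper `context_window` and an invented sentence, of how the window behaves in the middle of a sentence and at its edges.

```python
def context_window(tokens, i, size):
    """Return (left, right) context lists around position i."""
    start = max(i - size, 0)             # clamp so a negative index does not wrap around
    left = tokens[start:i]
    right = tokens[i + 1:i + size + 1]   # slicing past the end just returns fewer items
    return left, right

# Invented example sentence, not taken from the sagas
toy_sentence = "Earl Sigurd sailed west to Shetland in the spring".split()

middle = context_window(toy_sentence, toy_sentence.index("Shetland"), 2)
first = context_window(toy_sentence, 0, 2)
last = context_window(toy_sentence, len(toy_sentence) - 1, 2)

print(middle)
print(first)
print(last)
```

Without the `max(..., 0)` clamp, a negative start index would count from the end of the list and the left context near the start of a sentence would silently come out wrong.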

### Part 1

I've given you an example for *Shetland*, a chain of islands north of Scotland, near the Orkney Islands.

Add 10 additional cells, each with one call to the `keyword_in_context` function. Choose five pairs of words that you think might be similar (e.g. *Shetland* and *Orkneys*). Select a variety of parts of speech, such as nouns, verbs, adjectives, prepositions, and proper names.

Discuss what you notice about the similarities and differences between the contexts of these words.

**Answer here**

In [None]:
keyword_in_context("Shetland")

In [None]:
# add more `keyword_in_context` cells here

Now let's look at the distribution of words immediately preceding (*left* or *previous* context) and immediately following (*right* or *next* context) a word. This block creates two dictionaries, each mapping a word to a `Counter`: `previous_context_counters` counts the words that immediately precede the key, and `next_context_counters` counts the words that immediately follow it.

In [None]:
previous_context_counters = {} # count words that precede the key
next_context_counters = {} # count words that follow the key

for sentence in sentences:
    for i in range(len(sentence) - 1): # stop at the next-to-last token
        word = sentence[i]
        next_word = sentence[i+1]
        
        if word not in next_context_counters:
            next_context_counters[word] = Counter()
        if next_word not in previous_context_counters:
            previous_context_counters[next_word] = Counter()
        
        next_context_counters[word][next_word] += 1
        previous_context_counters[next_word][word] += 1
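
The same counts can be built a little more compactly. This is only a sketch on a made-up three-sentence corpus: `toy_sentences`, `toy_next`, and `toy_prev` are hypothetical names, and `defaultdict` is swapped in for the explicit membership checks used above.

```python
from collections import Counter, defaultdict

# Invented sentences, not taken from the sagas
toy_sentences = [
    "then she went home".split(),
    "then she spoke".split(),
    "then he went home".split(),
]

toy_next = defaultdict(Counter)   # word -> counts of the word that follows it
toy_prev = defaultdict(Counter)   # word -> counts of the word that precedes it

for sentence in toy_sentences:
    # zip pairs each token with its successor and stops at the next-to-last token
    for word, next_word in zip(sentence, sentence[1:]):
        toy_next[word][next_word] += 1
        toy_prev[next_word][word] += 1

print(toy_next["then"].most_common())
print(toy_prev["went"].most_common())
```

One trade-off: looking up a word that was never seen in a `defaultdict` quietly creates an empty `Counter` for it, whereas the plain dictionaries above raise a `KeyError`, which can be the more helpful behavior when you mistype a query.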

### Part 2

In the next code cell I'm demonstrating how to get the most frequent following words for a query word.

Use this function like a "predictive text" feature. Generate two Viking sentences of 10-20 words.
* In the first sentence, start with "Then" and pick the most frequent following word. Record your sentence, and comment on why always picking the most common word might not be a good idea.
* In the second sentence, start with "Then" but choose the next word based on both the frequency distribution and your artistic sensibilities.
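
If you'd like the frequency distribution to guide your second sentence without always taking the top word, one option is to draw a candidate in proportion to the counts and then accept or reject it yourself. This is a sketch only: `sample_next` is a hypothetical helper, and `toy_following` is an invented stand-in for something like `next_context_counters["Then"]`.

```python
import random
from collections import Counter

def sample_next(counter, rng):
    """Draw one following word with probability proportional to its count."""
    words = list(counter.keys())
    weights = list(counter.values())
    return rng.choices(words, weights=weights, k=1)[0]

# Invented counts standing in for a real next-word Counter
toy_following = Counter({"she": 3, "he": 1})

rng = random.Random(0)  # fixed seed so the draws are repeatable
draws = Counter(sample_next(toy_following, rng) for _ in range(1000))
print(draws)
```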

**First sentence here**


**Comment on first sentence**


**Second sentence here**



Add cells to show previous *and* next context words for at least 10 more words. Use a selection of nouns, verbs, adjectives, prepositions, and names. These may be the same words you looked at before, but you may also want to add additional examples.

Discuss whether the words to the right or left of a word indicate its part of speech. Cite examples to support your argument. Are the two contexts equally informative for a given part of speech, and is that consistent across different parts of speech?

**Answer here**

In [None]:
show_counter(next_context_counters["she"])

In [None]:
show_counter(previous_context_counters["she"])

In [None]:
## add cells here

Next we'll look at sums over the full context window, which extends `window_size` (five) tokens to each side of a word. This code creates one `Counter` for each word type, which adds up all the words that appear within the window around any occurrence of that word.

In [None]:
word_context_counters = {}

for sentence in sentences:
    
    for i, word in enumerate(sentence):
        start = max(i-window_size, 0)
        left_context = sentence[start:i]
        right_context = sentence[(i+1):(i+window_size+1)]
        
        if word not in word_context_counters:
            word_context_counters[word] = Counter()
        
        word_context_counters[word].update(left_context)
        word_context_counters[word].update(right_context)
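
Here's the same windowed counting on a tiny invented corpus, so you can check the arithmetic by hand. The names `toy_window`, `toy_sentences`, and `toy_context` are hypothetical, and the window is shrunk to two tokens per side to keep the output readable.

```python
from collections import Counter

toy_window = 2  # two tokens on each side, smaller than the notebook's window_size
toy_sentences = [
    "the earl sailed to Shetland".split(),
    "the king sailed to Norway".split(),
]

toy_context = {}
for sentence in toy_sentences:
    for i, word in enumerate(sentence):
        start = max(i - toy_window, 0)
        neighbors = sentence[start:i] + sentence[i + 1:i + toy_window + 1]
        # setdefault creates the Counter the first time a word type is seen
        toy_context.setdefault(word, Counter()).update(neighbors)

print(toy_context["sailed"].most_common())
print(toy_context["Shetland"].most_common())
```

Because the window is symmetric, the counts are symmetric too: if *Shetland* is in the window of *sailed*, then *sailed* is in the window of *Shetland*.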

### Part 3

This next cell is an example showing output for the full context counts of a word, essentially adding up all the words you saw in the KWIC view earlier.

Show output for at least 10 words, from a mix of parts of speech.

Discuss how this view of a word's context differs from the single-previous-word and single-next-word context views we saw in Part 2.

**Answer here**

In [None]:
show_counter(word_context_counters["Shetland"], 15)

In [None]:
## add cells here

Finally, let's look at a way of comparing the word frequencies we actually observed near a word to the word frequencies in the collection as a whole. We'll use a method called *pointwise mutual information* (PMI).

PMI is closely related to KL divergence. In this case, the two distributions we want to compare are the probability of context word $c$ *near* word $w$ and the probability of $c$ anywhere. The word *the* is common throughout the collection, so we expect to see it in almost any context. The log term measures the ratio between the frequency with which we actually saw $c$ in the context of $w$ and our expectation for a random context. Strictly speaking, that log ratio alone is the PMI; the version below also multiplies it by $P(c, w)$, which tends to keep rare, one-off co-occurrences from dominating the ranking.

Notation:
* $N(c|w)$ is `word_context_counters[w][c]`
* $N(w)$ is `sum(word_context_counters[w].values())`
* $N(c)$ is `all_counter[c]`
* $N$ is `all_sum`, that is, `sum(all_counter.values())`

$$
\begin{align}
PMI(c, w) & = P(c, w) \log \frac{P(c,w)}{P(c)P(w)} \\
& = P(c, w) \log \frac{P(c|w)P(w)}{P(c)P(w)} \\
& = P(c, w) \log \frac{P(c|w)}{P(c)} \\
& \propto N(c|w) \log \frac{\frac{N(c|w)}{N(w)} }{ \frac{N(c)}{N}  } \\
& = N(c|w) \log \frac{N(c|w)N}{N(w)N(c)}
\end{align}
$$


In [None]:
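def log_ratio(word):
    """Score each context word of `word` by N(c|w) times the log of observed over expected frequency."""
    counter = word_context_counters[word]
    
    all_sum = sum(all_counter.values()) ## N: total tokens in the collection
    word_sum = sum(counter.values())    ## N(w): total context tokens seen around `word`
    
    comparisons = []
    for c in counter.keys():
        ## counter[c] is N(c|w) and all_counter[c] is N(c)
        score = counter[c] * math.log((counter[c] * all_sum) / (word_sum * all_counter[c]))
        comparisons.append((score, c))
    
    ## Highest-scoring context words first
    return sorted(comparisons, reverse=True)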
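
To get a feel for the scores, it helps to plug in a few numbers by hand. The sketch below pulls the formula out into a hypothetical standalone function, `weighted_log_ratio`, and feeds it invented counts (none of these numbers come from the sagas).

```python
import math

def weighted_log_ratio(n_cw, n_w, n_c, n_total):
    """N(c|w) * log( N(c|w) * N / (N(w) * N(c)) ), the last line of the derivation."""
    return n_cw * math.log((n_cw * n_total) / (n_w * n_c))

# Invented setting: 10,000 tokens overall, 40 context tokens around the query word
n_total, n_w = 10000, 40

# A context word seen exactly as often as its overall frequency predicts
as_expected = weighted_log_ratio(n_cw=4, n_w=n_w, n_c=1000, n_total=n_total)

# A rarer context word seen the same 4 times: far more often than expected
surprising = weighted_log_ratio(n_cw=4, n_w=n_w, n_c=10, n_total=n_total)

# A frequent context word seen less often than expected
less_than_expected = weighted_log_ratio(n_cw=2, n_w=n_w, n_c=1000, n_total=n_total)

print(as_expected, surprising, less_than_expected)
```

A score of zero means the context word shows up near the query exactly as often as it does anywhere else, positive scores mean it is over-represented, and negative scores mean it is under-represented.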


### Part 4

Compare results using this `log_ratio` function to the output of the `nearest` function used in Monday's notebook.

Provide some examples, and describe how they are similar to or different from the output of the word embedding. If there are "missing" words in the output here that are close in the embedding space, show the `log_ratio` output for those words. Do the two words have similar context words? Describe whether this is true and mention examples.

**Answer here**

In [None]:
## 'spae' is a Scots word for prophecy. Gunnhilda was the wife of Eric Bloodaxe.
## She was ordered to be drowned in a bog by King Harald Bluetooth, the namesake
## of the wireless standard. Think about that next time you put on some headphones.

show(log_ratio("queen"))

In [None]:
## add cells with examples here

**Extra bonus for those interested** The embedding algorithm adds an additional step: subsampling the most frequent words. Here's code that generates this subsampling probability, the chance that any one occurrence of a word is kept. Values above 1 mean the word is always kept.

In [None]:
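sampling_probs = {}
all_sum = sum(all_counter.values())  ## N: total tokens in the collection
for word in all_counter.keys():
    p_word = all_counter[word] / all_sum  ## relative frequency of the word
    score = 1.0 / (10000 * p_word)        ## threshold of 1e-4 divided by the frequency
    sampling_probs[word] = math.sqrt(score) + score  ## can exceed 1 for rarer words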
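
To see what these values do, here's a sketch that applies the same formula to a few invented relative frequencies. The helper `keep_probability` and its `threshold` parameter are hypothetical names, and clipping at 1 is my assumption about how the value is used: each token is kept when a uniform random draw falls below it.

```python
import math

def keep_probability(p_word, threshold=1e-4):
    """Same formula as the cell above, clipped so it reads as a probability."""
    score = threshold / p_word   # equals 1.0 / (10000 * p_word) when threshold is 1e-4
    return min(1.0, math.sqrt(score) + score)

very_frequent = keep_probability(0.01)   # a word making up 1% of all tokens
rare = keep_probability(1e-5)            # a word making up 0.001% of all tokens

print(very_frequent, rare)
```

Under these assumptions a word that makes up 1% of the collection keeps only about one occurrence in nine, while rarer words are never dropped.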