<div class="alert alert-danger">
**Due date:** 2018-02-23
</div>

# Lab 5: Semantic analysis

A **word embedding** represents words as vectors in a high-dimensional vector space. In this lab you will train word embeddings on the Simple English Wikipedia via truncated singular value decomposition of the co-occurrence matrix.

Start by loading the Python module for this lab:

In [27]:
import nlp5
import scipy.sparse
import numpy as np
import time
import json

## Accessing the data

The data for this lab is the text of the [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page). We have excluded articles shorter than 50&nbsp;words, as well as certain meta-articles. The remaining articles were pre-processed by removing non-textual elements, sentence splitting, and tokenisation. The result is a text file containing 2.4M sentences, spanning 23M tokens.

<div class="alert alert-warning">
Note that the data file is quite big. It is therefore very important to think about efficiency in this lab!
</div>

Because the data file is so big, we have compressed it using [bz2](https://en.wikipedia.org/wiki/Bzip2), which can be processed sequentially without completely decompressing the file. This functionality is provided by the `bz2` module:

In [2]:
import bz2
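
To see what sequential reading looks like in practice, here is a minimal sketch on a toy file (the file name `toy.txt.bz2` and its contents are made up for illustration and are not part of the lab data). It also shows that `bz2.open` reads in binary mode by default, so lines arrive as `bytes` objects unless you ask for text mode with `'rt'`.

```python
import bz2
import os
import tempfile

# Write a tiny compressed file (hypothetical toy data, not the lab corpus)
path = os.path.join(tempfile.mkdtemp(), 'toy.txt.bz2')
with bz2.open(path, 'wt', encoding='utf-8') as f:
    f.write('april is the fourth month\nit comes after march\n')

# Default mode is binary: each line is read as a bytes object
with bz2.open(path) as source:
    first_binary = next(iter(source))

# Text mode decodes each line to str while still reading line by line
with bz2.open(path, 'rt', encoding='utf-8') as source:
    lines = [line.rstrip('\n') for line in source]

print(type(first_binary).__name__, lines)
```

This is why the tokens in the outputs further down appear as `b'april'` rather than `'april'`: the lab code opens the data in the default binary mode.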

The (uncompressed) text contains one sentence per line, with individual tokens separated by spaces. To loop over the tokens, you can use the following generator function:

In [3]:
def tokens(source):
    for sentence in source:
        for token in sentence.split():
            yield token

The next code cell shows you how you can open the compressed data file and print the number of tokens in the text:

In [4]:
with bz2.open('/home/TDDE09/labs/l5/data/simplewiki.txt.bz2') as source:
    print(sum(1 for t in tokens(source)))

22963492


Note that the use of the generator function to obtain the tokens is essential here &ndash; returning the tokens as a list would require a lot of memory. If you have not worked with generators and iterators before, now is a good time to read up on them; see [More information about generators](https://wiki.python.org/moin/Generators).
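
The following sketch illustrates why generators help. The source `endless_sentences` is a hypothetical stand-in for a very large file: it never terminates, so building a list of all tokens would be impossible, yet taking the first few tokens from the generator works fine because tokens are produced only on demand.

```python
from itertools import islice

def tokens(source):
    for sentence in source:
        for token in sentence.split():
            yield token

def endless_sentences():
    # Hypothetical stand-in for a huge file: yields sentences forever
    while True:
        yield 'the quick brown fox'

# Only the six requested tokens are ever computed
first_six = list(islice(tokens(endless_sentences()), 6))
print(first_six)
```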

## Build the vocabulary

Your first task in this lab is to build the vocabulary of the word embedding that you are about to construct.

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Write code that builds the vocabulary of the word embedding. Represent the vocabulary as a dictionary that maps words to a contiguous range of integer ids. Ignore words that occur fewer than 100&nbsp;times.
</div>
</div>

To solve this problem, complete the skeleton code in the following cell:

In [5]:
def make_vocab(source, threshold=100):
    # Count how often each token occurs in the data
    token_count = {}
    for t in tokens(source):
        token_count[t] = token_count.get(t, 0) + 1
    # Assign consecutive ids to the tokens that occur at least `threshold` times
    V = {}
    for t, count in token_count.items():
        if count >= threshold:
            V[t] = len(V)
    return V
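
As a point of comparison, the same idea can be sketched with `collections.Counter` on a toy input. The function name `make_vocab_counter`, the toy sentences, and the low threshold are assumptions chosen for illustration; this is one possible approach, not the reference solution.

```python
from collections import Counter

def make_vocab_counter(sentences, threshold=2):
    # Count all tokens in a single pass over the sentences
    counts = Counter(token for sentence in sentences for token in sentence.split())
    # Give consecutive ids to the tokens that reach the threshold
    vocab = {}
    for token, count in counts.items():
        if count >= threshold:
            vocab[token] = len(vocab)
    return vocab

toy = ['a b a c', 'b a d', 'c b']
toy_vocab = make_vocab_counter(toy, threshold=2)
print(toy_vocab)
```

Here `d` occurs only once and is therefore dropped, while the remaining words receive the ids 0, 1 and 2.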

In order to help you test your implementation, we provide a smaller data file with the first 1M tokens from the full data. The code in the next cell builds the word-to-index mapping for this file and prints the size of the vocabulary.

In [6]:
with bz2.open('/home/TDDE09/labs/l5/data/simplewiki-small.txt.bz2') as source:
    small_vocab = make_vocab(source)
    print(len(small_vocab))

1256


Once you are confident that your implementation is correct, you can run it on the full data:

In [7]:
with bz2.open('/home/TDDE09/labs/l5/data/simplewiki.txt.bz2') as source:
    vocab = make_vocab(source)
    print(len(vocab))

14887


## Extract context windows

To build the co-occurrence matrix, we need to define the notion of &lsquo;context&rsquo;. Here we will use **linear contexts**, consisting of the words that precede and follow the target word in a window of $k$ tokens on each side. Your next task is to implement a generator function that extracts all such context windows from the data.

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Implement a generator function that yields all context windows for the data. Represent context windows as tuples consisting of $2k+1$ tokens, with the target word in the center component of the tuple.
</div>
</div>

Later in the lab you will use context windows of width $k=2$, but your code should support any width $k \geq 1$. The windows at the beginning and end of each sentence should be padded with `<bos>` and `<eos>` markers. With this padding, the total number of contexts should equal the number of tokens in the data.
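
To make the padding concrete, here is a small sketch for a single, already tokenised sentence. The helper name `padded_windows` and the toy sentence are made up for illustration; it pads the sentence once and then slices out one window per token, which is only one of several ways to produce the windows.

```python
def padded_windows(sentence_tokens, k):
    # Pad k markers on each side so that every token has a full window
    padded = ['<bos>'] * k + list(sentence_tokens) + ['<eos>'] * k
    for i in range(len(sentence_tokens)):
        # The target word ends up at position k of the window
        yield tuple(padded[i:i + 2 * k + 1])

sentence = ['april', 'is', 'nice']
windows = list(padded_windows(sentence, k=1))
for window in windows:
    print(window)
```

The sentence has three tokens and yields exactly three windows, each of length $2k+1 = 3$.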

To solve the problem, complete the skeleton code in the following cell:

In [8]:
def contexts(source, k):
    for sentence in source:
        s_list = sentence.split()
        for i in range(len(s_list)):
            context = []
            # Collect the 2k+1 positions centred on i, padding outside the sentence
            for pos in range(i-k, i+k+1):
                if pos < 0:
                    context.append('<bos>')
                elif pos >= len(s_list):
                    context.append('<eos>')
                else:
                    context.append(s_list[pos])
            yield tuple(context)

To test your code, you can run the following cell:

In [9]:
from itertools import islice

with bz2.open('/home/TDDE09/labs/l5/data/simplewiki.txt.bz2') as source:
    for context in islice(contexts(source, 2), 10):    # 10 windows of width k = 2
        print(context)

('<bos>', '<bos>', b'april', b'is', b'the')
('<bos>', b'april', b'is', b'the', b'th')
(b'april', b'is', b'the', b'th', b'month')
(b'is', b'the', b'th', b'month', b'of')
(b'the', b'th', b'month', b'of', b'the')
(b'th', b'month', b'of', b'the', b'year')
(b'month', b'of', b'the', b'year', b'and')
(b'of', b'the', b'year', b'and', b'comes')
(b'the', b'year', b'and', b'comes', b'between')
(b'year', b'and', b'comes', b'between', b'march')


## Build the co-occurrence matrix

Your next task is to construct the co-occurrence matrix for the data. However, rather than with raw counts, you will fill it with positive pointwise mutual information values.

Recall that the **pointwise mutual information (PMI)** between a target word $w$ and a context word $c$ is defined as

$$
\text{PMI}(w, c) = \log \Biggl( \frac{\#(w, c) \cdot N}{\#(w) \cdot \#(c)} \Biggr)
$$

where $\#(w, c)$ is the number of times $w$ was observed in the same context as $c$, $\#(w)$ is the total number of times $w$ was observed, $\#(c)$ is the total number of times $c$ was observed, and $N$ is the total number of observations. In the case where either the numerator or the denominator of this expression is zero, we let $\text{PMI}(w, c) = 0$.
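
As a worked example of this definition, the sketch below computes PMI from four counts. The function name `pmi` and the counts are invented for illustration.

```python
import math

def pmi(n_wc, n_w, n_c, n_total):
    numerator = n_wc * n_total
    denominator = n_w * n_c
    # By convention, PMI is 0 when the numerator or the denominator is zero
    if numerator == 0 or denominator == 0:
        return 0.0
    return math.log(numerator / denominator)

# 10 joint observations, 100 of w, 50 of c, 10000 observations in total
value = pmi(n_wc=10, n_w=100, n_c=50, n_total=10000)
print(value)
```

With these counts the ratio is $10 \cdot 10000 / (100 \cdot 50) = 20$, so the PMI is $\log 20$: the pair occurs 20 times more often than expected if $w$ and $c$ were independent.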

**Positive pointwise mutual information (PPMI)** is derived from PMI by clipping all negative values:

$$
\text{PPMI}(w, c) = \max \bigl(0, \text{PMI}(w, c) \bigr)
$$

Here we will actually use a shifted version of PPMI, where the PMI value is decreased by a constant $\log \delta$ before clipping. For $\delta = 1$, this gives the same result as standard PPMI. Higher values of $\delta$ can improve the performance of word embeddings for different tasks ([Levy and Goldberg, 2014](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization/)).

$$
\text{PPMI}(w, c) = \max \bigl(0, \text{PMI}(w, c) - \log \delta \bigr)
$$
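
The effect of the shift can be sketched with a few lines of Python. The function name `shifted_ppmi` and the example values are assumptions for illustration.

```python
import math

def shifted_ppmi(pmi_value, delta=1.0):
    # Subtract log(delta), then clip negative values to zero
    return max(0.0, pmi_value - math.log(delta))

pmi_value = math.log(20)
unshifted = shifted_ppmi(pmi_value, delta=1.0)   # same as standard PPMI
shifted = shifted_ppmi(pmi_value, delta=5.0)     # log(20) - log(5) = log(4)
clipped = shifted_ppmi(pmi_value, delta=40.0)    # would be negative, so 0
print(unshifted, shifted, clipped)
```

Larger values of $\delta$ therefore keep only the stronger associations and make the matrix sparser.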

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
Write code that builds the shifted PPMI co-occurrence matrix for the data. Represent it as a [SciPy sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html) whose row indices correspond to the target words, and whose column indices correspond to the context words.
</div>
</div>

To solve this problem, complete the skeleton code in the following cell:

In [10]:
def make_ppmi_matrix(vocab, source, k=2, delta=1.0):
    # Dense count matrix: rows are target words, columns are context words.
    # Note that this needs memory quadratic in the vocabulary size.
    context_matrix = np.zeros((len(vocab), len(vocab)))
    N = 0
    # Stream the contexts from the generator instead of storing them in a list
    for ctxt in contexts(source, k):
        word = ctxt[k]
        if word not in vocab:
            continue
        # Keep the in-vocabulary context tokens that differ from the target word
        context_words = [w for w in ctxt if w != word and w in vocab]
        N += len(context_words)

        for w in context_words:
            context_matrix[vocab[word], vocab[w]] += 1

    w_count = np.sum(context_matrix, axis=1, keepdims=True)
    c_count = np.sum(context_matrix, axis=0, keepdims=True)

    # Where the numerator or the denominator is zero, use 1 (log(1) == 0).
    # This also avoids NaN values from a division of zero by zero.
    numerator = context_matrix * N
    denominator = w_count.dot(c_count)
    norm_cm = np.ones_like(context_matrix)
    valid = (numerator > 0) & (denominator > 0)
    norm_cm[valid] = numerator[valid] / denominator[valid]

    log_norm_cm = np.log(norm_cm) - np.log(delta)

    # Set negative values to zero
    log_norm_cm[log_norm_cm < 0] = 0

    return scipy.sparse.csr_matrix(log_norm_cm)

def ref_ppmi_matrix(vocab, source, k=2, delta=1.0):
    return nlp5.make_ppmi_matrix(vocab, source, k, delta)

While the problem of constructing the shifted PPMI matrix is not hard from a conceptual point of view, writing efficient code for it is slightly harder. We recommend proceeding in two steps. First, collect the relevant counts $\#(w, c)$, $\#(w)$, $\#(c)$, and $N$ in standard Python data structures. Then, use these counts to compute the shifted PPMI values and return them as a matrix in [CSR format](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html). (See the documentation of `scipy.sparse.csr_matrix` for an example of how to do this.)
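
The two-step approach can be sketched as follows on toy data. The function name `ppmi_from_pairs`, the toy vocabulary, and the list of (target, context) pairs are assumptions for illustration; the sketch also assumes $\delta \geq 1$, so that unobserved pairs stay zero and never need to be stored. It is not the reference implementation.

```python
import math
from collections import Counter

import scipy.sparse

def ppmi_from_pairs(pairs, vocab, delta=1.0):
    # Step 1: collect the counts in standard Python data structures
    pair_count = Counter()
    w_count = Counter()
    c_count = Counter()
    for w, c in pairs:
        if w in vocab and c in vocab:
            pair_count[vocab[w], vocab[c]] += 1
            w_count[vocab[w]] += 1
            c_count[vocab[c]] += 1
    n_total = sum(pair_count.values())

    # Step 2: compute shifted PPMI for the observed pairs only
    data, rows, cols = [], [], []
    for (i, j), n_wc in pair_count.items():
        value = math.log(n_wc * n_total / (w_count[i] * c_count[j])) - math.log(delta)
        if value > 0:
            data.append(value)
            rows.append(i)
            cols.append(j)
    size = len(vocab)
    return scipy.sparse.csr_matrix((data, (rows, cols)), shape=(size, size))

toy_vocab = {'a': 0, 'b': 1, 'c': 2}
toy_pairs = [('a', 'b'), ('a', 'b'), ('b', 'a'), ('b', 'a'), ('a', 'c'), ('c', 'a')]
matrix = ppmi_from_pairs(toy_pairs, toy_vocab)
print(matrix.toarray())
```

Because only the nonzero entries are ever created, memory use grows with the number of distinct observed pairs rather than with the square of the vocabulary size.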

You can test your code by running it on the small data set:

In [11]:
with bz2.open('/home/TDDE09/labs/l5/data/simplewiki-small.txt.bz2') as source:
    small_vocab = make_vocab(source)

now = time.time()
with bz2.open('/home/TDDE09/labs/l5/data/simplewiki-small.txt.bz2') as source:
    ppmi_matrix = make_ppmi_matrix(small_vocab, source)
end = time.time()
print("Ours: %s" % (end - now))

now = time.time()
with bz2.open('/home/TDDE09/labs/l5/data/simplewiki-small.txt.bz2') as source:
    ppmi_matrix = ref_ppmi_matrix(small_vocab, source)
end = time.time()
print("Reference: %s" % (end - now))

Ours: 5.356584310531616
Processed 1000000 contexts
Computed 1256 word vectors
Reference: 5.4228174686431885



You should now be able to obtain the PPMI as in the following example:

In [12]:
ppmi_matrix[small_vocab[b'april'], small_vocab[b'december']]

2.952847218567449

Once you feel confident that your code is correct, you can run it on the full data. (This will take a while.)

In [13]:
with bz2.open('/home/TDDE09/labs/l5/data/simplewiki.txt.bz2') as source:
    ppmi_matrix = make_ppmi_matrix(vocab, source)

To avoid re-computing the matrix several times, you can save it to a file as follows:

In [28]:
import scipy.sparse

scipy.sparse.save_npz('simplewiki.npz', ppmi_matrix)

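# To load the matrix again later:
# ppmi_matrix = scipy.sparse.load_npz('simplewiki.npz')

# Note: saving the vocabulary with json.dump(vocab, json_file) fails with
# "TypeError: key b'mix' is not a string", because the vocabulary keys are
# bytes objects. Decode the keys to str before writing them to JSON.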

Storing the PPMI matrix will require approximately 60&nbsp;MB of disk space.
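
If you also want to store the vocabulary, keep in mind that a vocabulary built from a file opened in binary mode has `bytes` keys, which the `json` module does not accept as object keys. A minimal sketch of one workaround, assuming the tokens are UTF-8 encoded (the toy vocabulary is made up for illustration):

```python
import json

# Toy vocabulary with bytes keys, as produced when reading in binary mode
vocab_bytes = {b'april': 0, b'month': 1}

# Decode the keys so that the dictionary can be serialised
serialisable = {word.decode('utf-8'): index for word, index in vocab_bytes.items()}
text = json.dumps(serialisable)

# Encode the keys again after loading to recover the original vocabulary
restored = {word.encode('utf-8'): index for word, index in json.loads(text).items()}
print(text)
```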

## Build the word embedding

Once you have the PPMI matrix, you can construct the word embedding.

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
Implement a class representing word embeddings. The class should support the construction of the embedding by means of truncated singular value decomposition.
</div>
</div>

More specifically, we ask you to implement the following interface:

In [15]:
import numpy as np
from sklearn.decomposition import TruncatedSVD

class Embedding(object):
    """A word embedding.

    A word embedding represents words as vectors in a d-dimensional
    vector space. The central attribute of the word embedding is a
    dense matrix whose rows correspond to the words in some
    vocabulary, and whose columns correspond to the d dimensions in
    the embedding space.

    Attributes:

        vocab: The vocabulary, specified as a dictionary that maps
            words to integer indices, identifying rows in the embedding
            matrix.
        dim: The dimensionality of the word vectors.
        m: The embedding matrix. The rows of this matrix correspond to the
            words in the vocabulary.
    """
    
    def __init__(self, vocab, matrix, dim=100):
        """Initialises a new word embedding.

        Args:
            vocab: The vocabulary for the embedding, specified as a
                dictionary that maps words to integer indices.
            matrix: The co-occurrence matrix, represented as a SciPy
                sparse matrix with one row and one column for each word in
                the vocabulary.
            dim: The dimensionality of the word vectors.
        """
        self.vocab = vocab
        self.dim = dim
        self.svd = TruncatedSVD(n_components=dim,n_iter=10)
        self.m = self.svd.fit_transform(matrix)

    def vec(self, w):
        """Returns the vector for the specified word.

        Args:
            w: A word, an element of the vocabulary.

        Returns:
            The word vector for the specified word.
        """
        return self.m[self.vocab[w]]

    def distance(self, w1, w2):
        """Computes the cosine similarity between the specified words.

        Args:
            w1: The first word (an element of the vocabulary), or a
                NumPy vector in the embedding space.
            w2: The second word (an element of the vocabulary).

        Returns:
            The cosine similarity between the specified words in this
            embedding.
        """
        # Accept a vector for w1 so that analogy() can compare vectors with words
        if isinstance(w1, np.ndarray):
            v1 = w1
        else:
            v1 = self.vec(w1)
        v2 = self.vec(w2)

        return v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    
    def most_similar(self, w, n=10):
        """Returns the most similar words for the specified word.

        Args:
            w: A word, an element of the vocabulary, or a NumPy vector
                in the embedding space.
            n: The maximal number of most similar words to return.

        Returns:
            A list of the n most similar words, most similar first.
        """
        words = list(self.vocab.keys())
        distances = [self.distance(w, w_) for w_ in words]
        similarity_indices = np.argsort(distances)[::-1]
        similar_words = [words[i] for i in similarity_indices[:n]]
        return similar_words

    def analogy(self, w1, w2, w3):
        """Answers an analogy question of the form w1 - w2 + w3 = ?

        Args:
            w1: A word, an element of the vocabulary.
            w2: A word, an element of the vocabulary.
            w3: A word, an element of the vocabulary.

        Returns:
            The word closest to the vector w1 - w2 + w3 that is different
            from all the other words.
        """
        v1 = self.vec(w1)
        v2 = self.vec(w2)
        v3 = self.vec(w3)
        
        v = v1 - v2 + v3
        
        most_sim = self.most_similar(v, n=4)
        for w in most_sim:
            if w not in [w1,w2,w3]:
                return w

Recall that the **singular value decomposition** of an $m \times n$ matrix $\mathbf{M}$ is a factorization of the form $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^*$, where $\mathbf{U}$ is an $m \times m$ matrix whose columns we may assume to be sorted in decreasing order of importance when it comes to explaining the variance of $\mathbf{M}$. (Formally, the importance of each column is given by the corresponding singular value in the matrix $\mathbf{\Sigma}$.) By truncating $\mathbf{U}$ after the first $\mathit{dim}$ columns, we thus obtain a low-dimensional approximation of the original matrix $\mathbf{M}$. In your case, $\mathbf{M}$ is the PPMI matrix, and the truncated matrix $\mathbf{U}$ gives the word vectors of the embedding. To compute the matrix $\mathbf{U}$, you can use the class [TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), which is available in [scikit-learn](http://scikit-learn.org/stable/index.html).
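
The truncation step can be illustrated on a small dense matrix. Note that this sketch uses `numpy.linalg.svd` rather than `TruncatedSVD`, purely to keep the example tiny and self-contained; the random matrix and the helper name `truncated_error` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 6))          # toy stand-in for the PPMI matrix

# Full SVD; the singular values in S are sorted in decreasing order
U, S, Vt = np.linalg.svd(M)

dim = 2
vectors = U[:, :dim]            # one dim-dimensional vector per row of M

def truncated_error(d):
    # Reconstruct M from the first d components and measure the error
    approx = U[:, :d] @ np.diag(S[:d]) @ Vt[:d, :]
    return np.linalg.norm(M - approx)

print(vectors.shape, truncated_error(2), truncated_error(4))
```

Keeping more components can only reduce the reconstruction error. As far as we are aware, `TruncatedSVD.fit_transform` returns the rows of the truncated $\mathbf{U}$ scaled by the corresponding singular values, that is $\mathbf{U}\mathbf{\Sigma}$, rather than $\mathbf{U}$ itself.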

## Exploring the embedding

The following cell shows how to initialise a new embedding with the PPMI matrix:

In [16]:
print (ppmi_matrix.shape)
embedding = Embedding(vocab, ppmi_matrix)

(14887, 14887)


Here are some things that you can do with the embedding:

#### Word similarity

What is the semantic similarity between &lsquo;man&rsquo; and &lsquo;woman&rsquo;?

In [17]:
print (embedding.distance(b'man', b'woman'))
print (embedding.distance(b'man', b'guy'))
print (embedding.distance(b'man', b'banana'))

0.8322261392443765
0.6748007883787293
0.2653145159961847
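
The numbers above are cosine similarities. As a reminder of how this measure behaves, here is a small sketch on hand-picked toy vectors (the function name `cosine_similarity` and the vectors are chosen for illustration only):

```python
import numpy as np

def cosine_similarity(v1, v2):
    # Dot product divided by the product of the vector lengths
    return v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a

same_direction = cosine_similarity(a, b)
orthogonal = cosine_similarity(a, c)
print(same_direction, orthogonal)
```

The measure depends only on the angle between the vectors, not on their lengths: vectors pointing in the same direction score 1, orthogonal vectors score 0.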


What words are most similar to &lsquo;man&rsquo;?

In [18]:
print (embedding.most_similar(b'man'))

[b'man', b'boy', b'woman', b'girl', b'dog', b'little', b'dead', b'young', b'child', b'who']


What words are most similar to &lsquo;woman&rsquo;?

In [19]:
print (embedding.most_similar(b'woman'))

[b'woman', b'child', b'man', b'girl', b'herself', b'her', b'person', b'girls', b'children', b'boy']
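
Computing the similarity to every word in a Python loop can be slow for a large vocabulary. One possible alternative, sketched here on a toy matrix, is to normalise all rows once and obtain all similarities with a single matrix-vector product. The function name `top_n`, the toy words, and the toy vectors are assumptions for illustration, and the sketch assumes that no row is the zero vector.

```python
import numpy as np

def top_n(matrix, words, query_index, n=3):
    # Scale every row to unit length (assumes no all-zero rows)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms
    # Cosine similarity of the query row to all rows at once
    sims = unit @ unit[query_index]
    order = np.argsort(-sims)[:n]
    return [(words[i], float(sims[i])) for i in order]

toy_words = ['man', 'woman', 'banana']
toy_matrix = np.array([[1.0, 0.2], [0.9, 0.3], [0.0, 1.0]])
result = top_n(toy_matrix, toy_words, query_index=0, n=2)
print(result)
```

As in the outputs above, the query word itself comes first, since every vector has similarity 1 to itself.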


#### Analogies

Here is the famous king &minus; man + woman = ? example.

In [20]:
embedding.analogy(b'king', b'man', b'woman')

b'heir'
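
To see how the vector arithmetic behind such a query works in the ideal case, consider the following sketch. The vectors are hand-crafted for illustration (dimensions roughly meaning royal, male, female) and do not come from the trained embedding, which answered `b'heir'` above; the name `toy_analogy` is likewise made up.

```python
import numpy as np

# Hand-crafted toy vectors: [royal, male, female]
toy_vectors = {
    'king': np.array([1.0, 1.0, 0.0]),
    'queen': np.array([1.0, 0.0, 1.0]),
    'man': np.array([0.0, 1.0, 0.0]),
    'woman': np.array([0.0, 0.0, 1.0]),
    'prince': np.array([1.0, 1.0, 0.1]),
}

def toy_analogy(w1, w2, w3):
    # Target vector w1 - w2 + w3
    target = toy_vectors[w1] - toy_vectors[w2] + toy_vectors[w3]

    def cosine(v):
        return v.dot(target) / (np.linalg.norm(v) * np.linalg.norm(target))

    # Exclude the three query words, then pick the closest remaining word
    candidates = [w for w in toy_vectors if w not in (w1, w2, w3)]
    return max(candidates, key=lambda w: cosine(toy_vectors[w]))

answer = toy_analogy('king', 'man', 'woman')
print(answer)
```

In this idealised setting the target vector coincides exactly with the vector for *queen*. Real embeddings are much noisier, which is why the trained model may return a related but different word.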

When experimenting with other examples, you will find that the embedding picks up common stereotypes:

In [21]:
embedding.analogy(b'doctor', b'man', b'woman')

b'nurse'

The model knows the capital of Sweden.

In [22]:
embedding.analogy(b'berlin', b'germany', b'sweden')

b'stockholm'

The embedding also &lsquo;learns&rsquo; some syntactic analogies, such as the analogy between the past-tense and present-tense forms of verbs (here: *jump* and *eat*):

In [23]:
embedding.analogy(b'jumped', b'jump', b'eat')

b'feed'