<div class="alert alert-danger">
**Due date:** 2018-02-23
</div>

# Lab 5: Semantic analysis

A **word embedding** represents words as vectors in a high-dimensional vector space. In this lab you will train word embeddings on the English Wikipedia via truncated singular value decomposition of the co-occurrence matrix.

Start by loading the Python module for this lab:

In [1]:
import nlp5
import scipy.sparse  # load the sparse submodule explicitly; it is used in make_ppmi_matrix below
import numpy as np
import time
import json

## Accessing the data

In [2]:
import bz2

In [3]:
def tokens(source):
    for sentence in source:
        for token in sentence.split():
            yield token

Note that using a generator function to obtain the tokens is essential here &ndash; returning them as a list would require holding the whole corpus in memory at once. If you have not worked with generators and iterators before, now is a good time to read up on them; see [more information about generators](https://wiki.python.org/moin/Generators).
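
To make the point about generators concrete, here is a minimal sketch on a hypothetical two-sentence corpus (`toy_source` is an assumption standing in for the real data file). It shows that the generator does no work until a token is requested:

```python
def tokens(source):
    # Same generator as in the cell above: yields one token at a time.
    for sentence in source:
        for token in sentence.split():
            yield token

# Hypothetical toy corpus standing in for the real data file.
toy_source = ["the cat sat", "the dog ran"]

gen = tokens(toy_source)   # nothing has been read or split yet
first = next(gen)          # computes only the first token
rest = list(gen)           # consumes the remaining tokens

print(first)
print(rest)
```

Because only one token exists at a time, memory use stays flat no matter how large the source is, as long as you do not collect the tokens into a list yourself.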

In [4]:
def make_vocab(source, threshold=100):
    # Count how often each token occurs in the source.
    token_count = {}
    for t in tokens(source):
        token_count[t] = token_count.get(t, 0) + 1
    # Assign consecutive indices to the tokens that occur at least
    # `threshold` times.
    V = {}
    for t, count in token_count.items():
        if count >= threshold:
            V[t] = len(V)
    return V
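
As a quick sanity check of the thresholding logic, the following sketch builds a vocabulary for a hypothetical three-sentence corpus. The names `toy_vocab` and `toy_corpus` are illustrative only, and `collections.Counter` is used here instead of a hand-written counting loop:

```python
from collections import Counter

def toy_vocab(sentences, threshold):
    # Count all tokens, then index those that reach the threshold.
    counts = Counter(tok for s in sentences for tok in s.split())
    vocab = {}
    for tok, count in counts.items():
        if count >= threshold:
            vocab[tok] = len(vocab)
    return vocab

# Hypothetical toy corpus; 'the' occurs 3 times, 'sat' twice, the rest once.
toy_corpus = ["the cat sat", "the dog sat", "the end"]
small_vocab = toy_vocab(toy_corpus, threshold=2)
print(small_vocab)
```

Only `the` and `sat` reach the threshold of 2, so they are the only words that receive an index.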

In [5]:
with bz2.open('/home/TDDE09/labs/l5x/data/oanc.txt.bz2') as source:
    vocab = make_vocab(source)
    print(len(vocab))
    #print(vocab.keys())

9313


In order to help you test your implementation, we provide a smaller data file with the first 1M tokens from the full data. The code in the next cell builds the word-to-index mapping for this file and prints the size of the vocabulary.

In [6]:
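def contexts(source, k):
    # Yield one (2k+1)-tuple per token: the token itself (at index k)
    # together with the k tokens to its left and to its right.
    for sentence in source:
        s_list = sentence.split()
        for i in range(len(s_list)):
            context = []
            for pos in range(i - k, i + k + 1):
                if pos < 0:
                    # Pad positions before the start of the sentence.
                    context.append('<bos>')
                elif pos >= len(s_list):
                    # Pad positions after the end of the sentence.
                    context.append('<eos>')
                else:
                    context.append(s_list[pos])
            yield tuple(context)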
    #yield from nlp5.contexts(source, k)
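
To see what the context windows look like, here is a minimal sketch with window size `k = 1` on a single hypothetical sentence. `toy_contexts` is an illustrative re-implementation that pads the sentence up front; it is meant to produce the same tuples as the function above, not to replace it:

```python
def toy_contexts(source, k):
    for sentence in source:
        words = sentence.split()
        # Pad with k boundary markers on each side, then slide a window
        # of width 2k+1 over the padded sentence.
        padded = ['<bos>'] * k + words + ['<eos>'] * k
        for i in range(len(words)):
            yield tuple(padded[i:i + 2 * k + 1])

windows = list(toy_contexts(["a b c"], k=1))
for window in windows:
    print(window)
```

Each tuple has the target word in the middle position (index `k`), which is how `make_ppmi_matrix` below picks it out.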

    

To test your code, you can run the following cell:

In [7]:
def make_ppmi_matrix(vocab, source, k=2, delta=1.0):
    # Count co-occurrences of target words (rows) and context words (columns).
    # Iterating over the generator directly avoids holding all contexts in memory.
    context_matrix = np.zeros((len(vocab), len(vocab)))
    for ctxt in contexts(source, k):
        word = ctxt[k]
        if word not in vocab:
            continue
        # Padding symbols and out-of-vocabulary words are skipped, and so are
        # repetitions of the target word itself inside the window.
        for w in ctxt:
            if w != word and w in vocab:
                context_matrix[vocab[word], vocab[w]] += 1

    # N is the total number of counted (word, context) pairs.
    N = context_matrix.sum()
    w_count = np.sum(context_matrix, axis=1, keepdims=True)
    c_count = np.sum(context_matrix, axis=0, keepdims=True)
    denom = w_count.dot(c_count)

    # Shifted PMI, computed only for pairs that were actually observed;
    # unobserved pairs keep the value 0, also when delta != 1.
    ppmi = np.zeros_like(context_matrix)
    seen = context_matrix > 0
    ppmi[seen] = np.log(context_matrix[seen] * N / denom[seen]) - np.log(delta)

    # Set negative values to zero
    ppmi[ppmi < 0] = 0

    return scipy.sparse.csr_matrix(ppmi)

def ref_ppmi_matrix(vocab, source, k=2, delta=1.0):
    return nlp5.make_ppmi_matrix(vocab, source, k, delta)
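
If you want to check the PPMI computation by hand, the following sketch works through a hypothetical 2&times;2 count matrix. The counts are made up for illustration, and the formula assumed is the shifted PPMI, max(0, log(#(w,c) &middot; N / (#(w) &middot; #(c))) &minus; log &delta;), which is what the implementation above computes:

```python
import numpy as np

# Hypothetical co-occurrence counts (rows: target words, columns: context words).
counts = np.array([[4.0, 0.0],
                   [1.0, 3.0]])
delta = 1.0

N = counts.sum()                              # 8 counted pairs in total
w_count = counts.sum(axis=1, keepdims=True)   # row totals: 4 and 4
c_count = counts.sum(axis=0, keepdims=True)   # column totals: 5 and 3

# Counts we would expect if words and contexts were independent.
expected = w_count @ c_count / N

ppmi = np.zeros_like(counts)
seen = counts > 0                             # unobserved pairs stay at 0
ppmi[seen] = np.log(counts[seen] / expected[seen]) - np.log(delta)
ppmi = np.maximum(ppmi, 0.0)                  # clip negative PMI values

# ppmi[0, 0] = log(1.6), ppmi[1, 1] = log(2), all other cells are 0
print(ppmi)
```

The pair in cell (1, 0) was observed once but is expected 2.5 times under independence, so its PMI is negative and is clipped to zero.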

In [8]:
with bz2.open('/home/TDDE09/labs/l5x/data/oanc.txt.bz2') as source:
    ppmi_matrix = make_ppmi_matrix(vocab, source)
    
    


You should now be able to obtain the PPMI as in the following example:

In [9]:
import scipy.sparse

scipy.sparse.save_npz('simplewiki.npz', ppmi_matrix)
#ppmi_matrix = scipy.sparse.load_npz('simplewiki.npz')

#with open('vocab.json') as json_file:
#    vocab = json.loads(json_file.read())
    
#print (vocab)

Storing the PPMI matrix will require approximately 60&nbsp;MB of disk space.

## Build the word embedding

Once you have the PPMI matrix, you can construct the word embedding.

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
Implement a class representing word embeddings. The class should support the construction of the embedding by means of truncated singular value decomposition.
</div>
</div>

More specifically, we ask you to implement the following interface:

In [10]:
import numpy as np
from sklearn.decomposition import TruncatedSVD

class Embedding(object):
    """A word embedding.

    A word embedding represents words as vectors in a d-dimensional
    vector space. The central attribute of the word embedding is a
    dense matrix whose rows correspond to the words in some
    vocabulary, and whose columns correspond to the d dimensions in
    the embedding space.

    Attributes:

        vocab: The vocabulary, specified as a dictionary that maps
            words to integer indices, identifying rows in the embedding
            matrix.
        dim: The dimensionality of the word vectors.
        m: The embedding matrix. The rows of this matrix correspond to the
            words in the vocabulary.
    """
    
    def __init__(self, vocab, matrix, dim=100):
        """Initializes a new word embedding.

        Args:
            vocab: The vocabulary for the embedding, specified as a
                dictionary that maps words to integer indices.
            matrix: The co-occurrence matrix, represented as a SciPy
                sparse matrix with one row and one column for each word in
                the vocabulary.
            dim: The dimensionality of the word vectors.
        """
        self.vocab = vocab
        self.dim = dim
        self.svd = TruncatedSVD(n_components=dim)
        self.m = self.svd.fit_transform(matrix)

    def vec(self, w):
        """Returns the vector for the specified word.

        Args:
            w: A word, an element of the vocabulary.

        Returns:
            The word vector for the specified word.
        """
        # TODO: Replace the following line with your own code
        return self.m[self.vocab[w]]

    def distance(self, w1, w2):
        """Computes the cosine similarity between the specified words.

        Args:
            w1: The first word (an element of the vocabulary), or a
                NumPy vector, so that analogy() can compare a computed
                vector against words.
            w2: The second word (an element of the vocabulary).

        Returns:
            The cosine similarity between the specified words in this
            embedding, or 0 if a word is not in the vocabulary.
        """
        try:
            v1 = w1 if isinstance(w1, np.ndarray) else self.vec(w1)
            v2 = self.vec(w2)
        except KeyError:
            # Out-of-vocabulary words get similarity 0.
            return 0
        return v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    
    def most_similar(self, w, n=10):
        """Returns the most similar words for the specified word.

        Args:
            w: A word, an element of the vocabulary.
            n: The maximal number of most similar words to return.

        Returns:
            A list of the n most similar words, most similar first. A
            query word from the vocabulary is itself part of the result.
        """
        words = list(self.vocab.keys())
        distances = [self.distance(w, w_) for w_ in words]
        similarity_indices = np.argsort(distances)[::-1]
        similar_words = [words[i] for i in similarity_indices[:n]]
        return similar_words

    def analogy(self, w1, w2, w3):
        """Answers an analogy question of the form w1 - w2 + w3 = ?

        Args:
            w1: A word, an element of the vocabulary.
            w2: A word, an element of the vocabulary.
            w3: A word, an element of the vocabulary.

        Returns:
            The word closest to the vector w1 - w2 + w3 that is different
            from all the other words.
        """
        v1 = self.vec(w1)
        v2 = self.vec(w2)
        v3 = self.vec(w3)
        
        v = v1 - v2 + v3
        
        most_sim = self.most_similar(v, n=4)
        for w in most_sim:
            if w not in [w1,w2,w3]:
                return w
        

Recall that the **singular value decomposition** of an $m \times n$ matrix $\mathbf{M}$ is a factorization of the form $\mathbf{M} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^*$, where $\mathbf{U}$ is an $m \times m$ matrix whose columns we may assume to be sorted in decreasing order of importance for explaining the variance of $\mathbf{M}$. (Formally, the columns are ordered by the corresponding singular values on the diagonal of $\mathbf{\Sigma}$.) By truncating $\mathbf{U}$ after the first $\mathit{dim}$ columns, we thus obtain a low-rank approximation of the original matrix $\mathbf{M}$. In your case, $\mathbf{M}$ is the PPMI matrix, and the truncated matrix $\mathbf{U}$ gives the word vectors of the embedding. To compute the matrix $\mathbf{U}$, you can use the class [TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), which is available in [scikit-learn](http://scikit-learn.org/stable/index.html).
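
The truncation step can be illustrated with plain NumPy on a small random matrix. This is only a sketch: the matrix `M` is a hypothetical stand-in for the PPMI matrix, and `np.linalg.svd` computes the full decomposition, which would be too expensive for the real data. Keep in mind that `TruncatedSVD.fit_transform` returns the truncated $\mathbf{U}$ with its columns scaled by the singular values ($\mathbf{U}\mathbf{\Sigma}$), which is what the `vectors` variable mimics here:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 6))   # hypothetical stand-in for the PPMI matrix
dim = 2

# Full SVD; NumPy returns the singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep the first `dim` columns of U and scale them by their singular values.
vectors = U[:, :dim] * s[:dim]

# Rank-`dim` approximation of M built from the truncated factors.
M_approx = vectors @ Vt[:dim]

# The Frobenius error equals the norm of the discarded singular values.
error = np.linalg.norm(M - M_approx)
discarded = np.sqrt(np.sum(s[dim:] ** 2))
print(vectors.shape, error, discarded)
```

Each row of `vectors` plays the role of one word vector. The individual columns are only determined up to sign, so the numbers can differ between SVD implementations while cosine similarities stay the same.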

## Exploring the embedding

The following cell shows how to initialise a new embedding with the PPMI matrix:

In [11]:
print (ppmi_matrix.shape)
embedding = Embedding(vocab, ppmi_matrix)

(9313, 9313)


Here are some things that you can do with the embedding:

#### Word similarity

What is the semantic similarity between &lsquo;man&rsquo; and &lsquo;woman&rsquo;?

In [12]:
print (embedding.distance(b'man', b'woman'))
print (embedding.distance(b'man', b'guy'))
print (embedding.distance(b'man', b'banana'))

0.918921016019255
0.8158151704019476
0
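
The numbers above are cosine similarities. The following sketch shows the measure on hypothetical two-dimensional vectors; the helper name `cosine` is illustrative and not part of the lab code. The bare `0` printed for `b'banana'` is presumably the fallback value that `distance` returns for a word missing from the vocabulary, not a computed similarity.

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the vector lengths.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a

print(cosine(a, b))    # same direction: 1.0
print(cosine(a, c))    # orthogonal: 0.0
print(cosine(a, -b))   # opposite direction: -1.0
```

Because the measure ignores vector length, frequent and rare words can be compared directly even if their vectors differ a lot in magnitude.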


What words are most similar to &lsquo;man&rsquo;?

In [13]:
print (embedding.most_similar(b'man'))

[b'man', b'woman', b'boy', b'girl', b'person', b'father', b'guy', b'mother', b'wife', b'dead']


What words are most similar to &lsquo;woman&rsquo;?

In [14]:
print (embedding.most_similar(b'woman'))

[b'woman', b'man', b'girl', b'person', b'boy', b'mother', b'lady', b'father', b'daughter', b'men']


#### Analogies

Here is the famous king &minus; man + woman = ? example.

In [15]:
embedding.analogy(b'king', b'man', b'woman')

b'prince'
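
The mechanics of the analogy query can be sketched with hand-picked vectors. Everything in this example is hypothetical: the two-dimensional vectors in `toy_vecs` are chosen so that the arithmetic works out exactly, which real embeddings (as the output above shows) do not guarantee.

```python
import numpy as np

# Hypothetical 2-d vectors: first axis ~ royalty, second axis ~ gender.
toy_vecs = {
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.2]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_analogy(w1, w2, w3):
    # Build the query vector w1 - w2 + w3 and return the closest word
    # that is not one of the three input words.
    target = toy_vecs[w1] - toy_vecs[w2] + toy_vecs[w3]
    candidates = [w for w in toy_vecs if w not in (w1, w2, w3)]
    return max(candidates, key=lambda w: cosine(target, toy_vecs[w]))

answer = toy_analogy('king', 'man', 'woman')
print(answer)
```

Excluding the input words matters: the query vector often stays close to one of them, so without the filter the answer would frequently be `w1` or `w3` itself.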

When experimenting with other examples, you will find that the embedding picks up common stereotypes:

In [16]:
embedding.analogy(b'doctor', b'man', b'woman')

b'nurse'

The model knows the capital of Sweden.

In [17]:
#embedding.analogy(b'berlin', b'germany', b'sweden')

The embedding also &lsquo;learns&rsquo; some syntactic analogies, such as the analogy between the past-tense and present-tense forms of verbs (here: *jump* and *eat*):

In [18]:
#embedding.analogy(b'jumped', b'jump', b'eat')

In [19]:
with open('/home/TDDE09/labs/l5x/data/toefl.txt') as fp:
    predictions = []
    ground_truths = []
    for line in fp:
        # Each line is read as: target word, index of the correct
        # candidate, then the candidate words.
        elements = line.split()
        word = elements[0]
        correct = int(elements[1])
        # The vocabulary keys are bytes, so words are encoded before lookup.
        similarities = [embedding.distance(word.encode(), other_word.encode())
                        for other_word in elements[2:]]
        # Predict the candidate with the highest cosine similarity.
        predictions.append(np.argmax(similarities))
        ground_truths.append(correct)
predictions = np.array(predictions)
ground_truths = np.array(ground_truths)
accuracy = np.sum(predictions == ground_truths) / len(predictions)
print("Accuracy: ", accuracy)

Accuracy:  0.425
