<div class="alert alert-danger">
**Due date:** 2017-02-24
</div>

# Lab 5: Semantic analysis

**Students:** Ludvig Noring (ludno249), Michael Sörsäter (micso554), Victor Tranell (victr593)

## Introduction

A **word space model** of word meanings represents words as vectors in a high-dimensional vector space. In this lab you will experiment with a word space model that was trained on the Swedish Wikipedia using [word2vec](https://code.google.com/archive/p/word2vec/). In order to use word2vec in Python, we use the [gensim](https://radimrehurek.com/gensim/) library.

The library and some more essentials for this lab are contained in the module we load in the following cell.

In [1]:
import nlp5

## Explore the lab system

Run the next cell to load the pre-trained word space model:

In [2]:
model = nlp5.load_model("/home/TDDE09/labs/nlp5/data/wikipedia-sv.bin")

The model consists of word vectors. In Python a word vector is represented as an [*array*](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html). For the purposes of this lab, you can treat arrays as lists. The next line of code prints the vector for the word *student*:

In [3]:
model['student']

array([ 0.38911471, -0.25333604,  0.10631166,  0.36140671,  0.14798231,
       -0.28869128,  0.4014135 , -1.01921523, -0.00860699,  0.76315218,
       -0.30077016,  0.31991726, -0.30887559, -0.21920508, -0.10915887,
        0.41282091, -0.23703265,  0.93853813,  0.81494588,  0.01140385,
        0.24421778,  0.3621935 ,  0.44512981,  0.32729414,  0.82020354,
       -0.79330653, -0.044444  , -0.42768687, -0.88712269,  0.13306266,
        0.57084686,  0.46596572, -0.48475036,  0.22611499,  0.36376789,
        0.12183799,  0.71142977,  0.33212078,  1.33993554, -0.78250617,
       -0.60445988,  0.0656125 , -0.18711154,  0.70977861,  0.11466026,
        0.36936972,  0.19929321, -0.41768453,  0.88794357, -0.49968722,
       -0.53722715, -0.32763621, -0.05238692, -0.21328461,  0.68021107,
       -0.49659464,  0.78859437,  0.51455098,  0.59885758,  0.31225756,
       -0.47549149,  0.12143688,  0.79769266, -0.1092938 ,  0.05594372,
        0.94904387,  1.13171613, -0.23624501,  0.31303319, -0.45

All vectors in the model have the same dimensionality $n$; this value is a parameter that is fixed when training the model.

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Write some code that prints $n$ for the model we loaded earlier.
</div>
</div>

In [4]:
print(len(model['student']))

100


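Given a word space model, we can compute the semantic similarity between words as the cosine similarity between their respective word vectors: a value of 1 means the vectors point in the same direction, and values near 0 mean they are nearly orthogonal. The next line of code shows how to compute the cosine similarity: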

In [5]:
print(model.similarity('student', 'lärare'))

0.67229160209
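
As a minimal sketch of what such a similarity score measures, the cosine similarity can be computed by hand from two vectors. The function name `cosine_similarity` and the three toy vectors are illustrative assumptions, not part of `nlp5` or gensim; plain lists are used since arrays can be treated as lists here.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made up for illustration, not taken from the model)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice as long
c = [3.0, 0.0, -1.0]  # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Because the measure depends only on the angle between the vectors, `a` and `b` count as maximally similar even though their lengths differ.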


<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
<p>Write code to print the following:</p>
<ul>
<li>the cosine similarity between a word and the word itself</li>
<li>the cosine similarity between two words that are, according to your judgement, semantically related</li>
<li>the cosine similarity between two words that are, according to your judgement, semantically unrelated</li>
</ul>
</div>
</div>

In [6]:
print(model.similarity('tomato', 'tomato'))
print(model.similarity('kaviar', 'knäckebröd'))
print(model.similarity('kaviar', 'betong'))

1.0
0.840633706043
0.396592828848


## Word analogies

In a word analogy task you are given two pairs of words that share a common semantic relation. A well-known example is *man/woman* and *king/queen*, where the semantic relation could be dubbed &lsquo;female&rsquo;. The task is to predict one of the words, e.g. *queen*, given the other three. By doing that we answer the question: &lsquo;*man* is to *woman* as *king* is to —?&rsquo;.

### Predict the fourth word

[Mikolov et al. (2013)](http://www.aclweb.org/anthology/N13-1090) have shown that the word analogy task can be solved by adding and subtracting word vectors in a word2vec model: the vector for *queen* is close (in terms of cosine similarity) to the vector *king* $-$ *man* $+$ *woman*. In the next problem you will implement this idea.

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
Write a function `complete()` that takes the first three words of a word analogy quadruple as input and predicts the fourth word.
</div>
</div>

To solve the problem you should complete the following code cell:

In [7]:
def complete(model, a, b, c):
    """Returns the fourth word in the analogy quadruple"""
    # Look for the word closest to c - a + b and keep only the top hit
    word, _score = model.most_similar([c, b], [a], 1)[0]
    return word

The function is supposed to be called like this:

In [8]:
print(complete(model, "man", "kvinna", "kung"))
print(complete(model, "tenta", "plugg", "ångest"))

drottning
andningsuppehåll


To solve Problem&nbsp;3 you can use the following method of the model:

`model.most_similar(pos, neg, n)`

The method takes as its inputs two lists of words (strings), `pos` and `neg`, and a number `n`. It returns the `n` words whose vectors are closest to the vector that one gets by adding all the vectors in the `pos` list and subtracting all the vectors in the `neg` list, each paired with its similarity score. Here is an example:

In [9]:
print(model.most_similar(['kung', 'kvinna'], ['man'], 3))

[('drottning', 0.7310913801193237), ('tronföljare', 0.7307088971138), ('prinsessa', 0.7277407646179199)]
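
The arithmetic behind this kind of query can be sketched with a handful of toy vectors. This is a simplified illustration, not gensim's implementation (which, among other things, normalises the vectors before combining them); the function `analogy` and the three-dimensional vectors in `toy` are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def analogy(vectors, a, b, c):
    """Return the word closest to vec(c) - vec(a) + vec(b), excluding a, b and c."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical vectors: second axis ~ 'female', third axis ~ 'royal'
toy = {
    "man":       [1.0, 0.0, 0.0],
    "kvinna":    [1.0, 1.0, 0.0],
    "kung":      [1.0, 0.0, 1.0],
    "drottning": [1.0, 1.0, 1.0],
    "prinsessa": [0.5, 1.0, 0.5],
}

print(analogy(toy, "man", "kvinna", "kung"))
```

Excluding the three input words matters: without it, the nearest neighbour of the target vector is often one of the inputs themselves.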


### Categories of analogies

Word vectors are computed based on co-occurrence counts: words that co-occur frequently with certain other words are going to have similar vectors. In order to get a better understanding of the model&rsquo;s possibilities and limitations, we load a list of ten analogy pairs:

In [10]:
analogies = nlp5.load_data("/home/TDDE09/labs/nlp5/data/analogies.txt")

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
<p>
Write code that computes the model&rsquo;s accuracy on the task of predicting the fourth word in every analogy pair, given the other three. Feel free to use your `complete()`-function.
</p>
<p>
Analyse the mistakes that the model makes and write a short explanation. Ground your explanations in your understanding of the basic distributional principle that underlies word space models.
</p>
</div>
</div>

Use the next code cell to solve the problem:

In [11]:
def evaluate(model, analogies):
    """Computes the accuracy of the specified model on the specified list of analogy quadruples"""
    res = [(complete(model, a, b, c), gold) for a, b, c, gold in [x.split() for x in analogies]]

    accuracy = []
    for i, (pred, golden) in enumerate(res):
        print(analogies[i])
        print("Predicted: {0}".format(pred))
        accuracy.append(pred == golden)
        print()
        
    return sum(accuracy) / len(accuracy)

print(evaluate(model, analogies))

Stockholm Sverige Berlin Tyskland
Predicted: Tyskland

Östergötland Linköping Södermanland Nyköping
Predicted: Nyköping

USA dollar Sverige krona
Predicted: kronor

Sverige svensk Frankrike fransk
Predicted: fransk

kvinna man syster bror
Predicted: bror

stor liten gammal ung
Predicted: pojke

stor större liten mindre
Predicted: mindre

stor störst liten minst
Predicted: starkast

cykla cyklade gå gick
Predicted: gick

bil bilar människa människor
Predicted: människor

0.7
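
In the output above, one prediction (*kronor* for *krona*) is a near miss. One way to quantify such cases, sketched here under assumptions, is a more lenient measure that counts an analogy as solved if the gold word appears anywhere among the top candidates. The function name `top_n_accuracy` is hypothetical, and the candidate lists below are invented for illustration (only the top-ranked words come from the output above); in practice they could come from `model.most_similar(pos, neg, n)` with `n` greater than 1.

```python
def top_n_accuracy(ranked_candidates, golds):
    """Fraction of analogies whose gold word appears among the ranked candidates."""
    hits = [
        any(word == gold for word, _score in candidates)
        for candidates, gold in zip(ranked_candidates, golds)
    ]
    return sum(hits) / len(hits)

# Candidate lists in (word, score) format; scores and second-ranked words are invented
ranked = [
    [("kronor", 0.9), ("krona", 0.8)],   # gold word found at rank 2
    [("pojke", 0.7), ("flicka", 0.6)],   # gold word not among the candidates
]
golds = ["krona", "ung"]

print(top_n_accuracy(ranked, golds))  # 0.5
```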


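Analogies whose two word pairs stand in a closely parallel relation seem to be predicted more accurately.

The model gets "stor liten gammal ung" wrong (it predicts "pojke"), probably because the relation between "stor" and "liten" is not closely parallel to the relation between "gammal" and "ung".

"USA dollar Sverige krona" is almost correct: the model predicts the plural "kronor" instead of "krona". We probably miss because "dollar" is ambiguous in grammatical number.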

<div class="panel panel-primary">
<div class="panel-heading">Problem 5</div>
<div class="panel-body">
The analogies in the example file have been picked from ten different categories. Invent names for these categories. Which categories would you call syntactic in nature, and which semantic? Choose four categories and find a new example for each of them. Choose two examples where the model succeeds in reproducing the analogy and two where it fails.
</div>
</div>

In [12]:
new_examples = [
    'Köpenhamn Danmark Paris Frankrike',
    'flyga flög prata pratade',
    'bra bättre ful fulare',
    'Tyskland Euro Thailand Bath'
]
print(evaluate(model, new_examples))

Köpenhamn Danmark Paris Frankrike
Predicted: Frankrike

flyga flög prata pratade
Predicted: pratade

bra bättre ful fulare
Predicted: vackrare

Tyskland Euro Thailand Bath
Predicted: MX

0.5


### Playground

Besides the ten categories from the previous problem, there are also other categories the model &lsquo;understands&rsquo;. Here are some examples:

* ``Frankrike vin Sverige ?``
* ``Jesus kristendom Buddha ?``
* ``Tyskland Hitler Italien ?``

Your next task will be to find some examples of your own.

<div class="panel panel-primary">
<div class="panel-heading">Problem 6</div>
<div class="panel-body">
Find at least eight new word analogies (like above) from new categories and write code to see if the model &lsquo;understands&rsquo; them. Try to find at least one example from a syntactic category. Summarise your results in a short reflective text.
</div>
</div>

In [13]:
new_examples = [    
    'hockey klubba tennis racket',
    'man kostym kvinna klänning',
    'Churchill whiskey Löfven mjölk',
    'Sverige socialism Nordkorea kommunism',
    'chef rik student fattig',
    'sitta sittandes stå ståendes',
    'kaffe te öl vin',
]
print(evaluate(model, new_examples))

hockey klubba tennis racket
Predicted: racket

man kostym kvinna klänning
Predicted: klänning

Churchill whiskey Löfven mjölk
Predicted: glögg

Sverige socialism Nordkorea kommunism
Predicted: kommunism

chef rik student fattig
Predicted: kultiverad

sitta sittandes stå ståendes
Predicted: ståendes

kaffe te öl vin
Predicted: vin

0.7142857142857143


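We underestimated the student: for "chef rik student fattig" the model predicted "kultiverad" rather than "fattig".

If the words are closely related, the model performs surprisingly well: it reproduced five of our seven analogies, including the syntactic one ("sitta sittandes stå ståendes").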

## Application of vector space models

Your last task for this lab is to reflect on how one could use word space models in a practical application.

<div class="panel panel-primary">
<div class="panel-heading">Problem 7</div>
<div class="panel-body">
How could one apply word vectors in a recommendation system for books?
</div>
</div>

One could represent each book by the word vectors of its metadata, such as genre, title, author, language and summary, and then recommend books whose representations lie close to each other in the $n$-dimensional space.
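
A minimal sketch of this idea, under stated assumptions: each book is represented by the average of the word vectors of its keywords, and other books are ranked by cosine similarity to it. The functions `mean_vector` and `recommend`, the two-dimensional vectors and the book titles are all made up for illustration; a real system would take the vectors from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def mean_vector(words, vectors):
    """Average the vectors of all known words; return None if none is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(v[i] for v in known) / len(known) for i in range(len(known[0]))]

def recommend(title, books, vectors, n=1):
    """Return the n books most similar to the given one as (title, score) pairs."""
    query = mean_vector(books[title], vectors)
    if query is None:
        return []
    scored = []
    for other, words in books.items():
        vec = mean_vector(words, vectors)
        if other != title and vec is not None:
            scored.append((other, cosine(query, vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical word vectors and book keywords
toy_vectors = {
    "mord": [1.0, 0.0], "polis": [0.9, 0.1], "detektiv": [0.8, 0.2],
    "kärlek": [0.0, 1.0], "bröllop": [0.1, 0.9],
}
books = {
    "Bok A": ["mord", "polis"],
    "Bok B": ["detektiv", "mord"],
    "Bok C": ["kärlek", "bröllop"],
}

print(recommend("Bok A", books, toy_vectors))
```

Averaging is only one simple way to combine word vectors; it ignores word order and treats every keyword as equally important.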