# L5: Semantic analysis

## Introduction

A **word space model** of word meanings represents words as vectors in a high-dimensional vector space. In this lab you will experiment with a word space model which was trained on the Swedish Wikipedia using [word2vec](https://code.google.com/archive/p/word2vec/). In order to use word2vec in Python, we use the [gensim](https://radimrehurek.com/gensim/) library.

The library and some more essentials for this lab are contained in the module we load in the following cell.

In [1]:
import lt5

## Explore the lab system

Run the next cell to load the pre-trained word space model:

In [2]:
model = lt5.load_model("/home/TDDE17/labs/l5/data/wikipedia-sv.bin")

The model consists of word vectors. In Python a word vector is represented as an [*array*](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html). For the purposes of this lab, you can treat arrays as lists. The next line of code prints the vector for the word *student*:

In [3]:
model['student']

array([ 0.3891147 , -0.25333604,  0.10631166,  0.3614067 ,  0.14798231,
       -0.28869128,  0.4014135 , -1.0192152 , -0.00860699,  0.7631522 ,
       -0.30077016,  0.31991726, -0.3088756 , -0.21920508, -0.10915887,
        0.4128209 , -0.23703265,  0.93853813,  0.8149459 ,  0.01140385,
        0.24421778,  0.3621935 ,  0.4451298 ,  0.32729414,  0.82020354,
       -0.7933065 , -0.044444  , -0.42768687, -0.8871227 ,  0.13306266,
        0.57084686,  0.46596572, -0.48475036,  0.22611499,  0.3637679 ,
        0.12183799,  0.7114298 ,  0.33212078,  1.3399355 , -0.78250617,
       -0.6044599 ,  0.0656125 , -0.18711154,  0.7097786 ,  0.11466026,
        0.36936972,  0.19929321, -0.41768453,  0.88794357, -0.49968722,
       -0.53722715, -0.3276362 , -0.05238692, -0.21328461,  0.68021107,
       -0.49659464,  0.78859437,  0.514551  ,  0.5988576 ,  0.31225756,
       -0.4754915 ,  0.12143688,  0.79769266, -0.1092938 ,  0.05594372,
        0.94904387,  1.1317161 , -0.236245  ,  0.3130332 , -0.45

All vectors in the model have the same dimensionality $n$; this value is a parameter that is fixed when training the model.

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Write some code that prints $n$ for the model we loaded earlier.
</div>
</div>

In [4]:
print(len(model['student']))

100


Given a word space model, we can measure the semantic similarity between two words as the cosine similarity of their word vectors: values close to 1 mean that the vectors point in nearly the same direction. The next line of code shows how to compute it:

In [5]:
print(model.similarity('student', 'lärare'))

0.67229164
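
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The following is a minimal sketch in NumPy on made-up three-dimensional vectors (`u`, `v` and `w` are illustrative and not taken from the model). It assumes that `model.similarity` computes the standard cosine similarity, in which case applying the function to `model['student']` and `model['lärare']` should give approximately the value printed above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between the vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors for illustration only
u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]   # same direction as u
w = [0.0, 0.0, 3.0]   # orthogonal to u

print(cosine_similarity(u, v))  # very close to 1: same direction
print(cosine_similarity(u, w))  # 0.0: orthogonal vectors
```

Because only the direction of the vectors matters, scaling a vector does not change its similarity to other vectors.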




<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
<p>Write code to print the following:</p>
<ul>
<li>the cosine similarity between some word of your liking and the word itself</li>
<li>the cosine similarity between two words that are, according to your judgement, semantically related</li>
<li>the cosine similarity between two words that are, according to your judgement, semantically unrelated</li>
</ul>
</div>
</div>

In [9]:
print(model.similarity('hej', 'hej'))
print(model.similarity('ja', 'nej'))
print(model.similarity('bord', 'jaga'))

0.99999994
0.8446068
0.35996804




## Word analogies

In a word analogy task you are given two pairs of words that share a common semantic relation. A well-known example is *man/woman* and *king/queen*, where the semantic relation could be dubbed &lsquo;female&rsquo;. The task is to predict one of the words, e.g. *queen*, given the other three. By doing that we answer the question: &lsquo;*man* is to *woman* as *king* is to —?&rsquo;.

### Predict the fourth word

[Mikolov et al. (2013)](http://www.aclweb.org/anthology/N13-1090) have shown that the word analogy task can be solved by adding and subtracting word vectors in a word2vec model: the vector for *queen* is close (in terms of cosine similarity) to the vector *king* $-$ *man* $+$ *woman*. In the next problem you will implement this idea.

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
    
Write a function `complete()` that takes the first three words of a word analogy quadruple as input and predicts the fourth word.
</div>
</div>

To solve the problem you should complete the following code cell:

In [31]:
def complete(model, a, b, c):
    """Returns the fourth word in the analogy quadruple"""
    return model.most_similar(positive=[b, c], negative=[a])[0][0]

The function is supposed to be called like this:

In [32]:
complete(model, "man", "kvinna", "kung")



'drottning'

To solve Problem&nbsp;3 you can use the following method of the model:

`model.most_similar(pos, neg, n)`

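The method takes as its inputs two lists of words (strings), `pos` and `neg`, and a number `n`. It returns the `n` words whose vectors are closest to the vector that one gets by adding all the vectors in the `pos` list and subtracting all the vectors in the `neg` list, each paired with its similarity score. In gensim the corresponding keyword arguments are named `positive`, `negative` and `topn`. Here is an example: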

In [22]:
print(model.most_similar(['kung', 'kvinna'], ['man'], 3))

('drottning', 0.7310913801193237)
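
To make the vector arithmetic concrete, here is a minimal sketch of the same idea on a toy vocabulary. The two-dimensional vectors in `toy` are made up for illustration, and `toy_most_similar` is a hypothetical helper, not gensim's implementation (which differs in details such as how vectors are normalised). The sketch leaves the query words out of the candidates, on the assumption that returning one of the input words is never the intended answer.

```python
import numpy as np

# Made-up 2-d vectors for illustration only; real word2vec vectors are learned from text.
toy = {
    "man":       np.array([1.0, 0.0]),
    "kvinna":    np.array([1.0, 1.0]),
    "kung":      np.array([3.0, 0.0]),
    "drottning": np.array([3.0, 1.0]),
    "bord":      np.array([-1.0, 2.0]),
}

def unit(v):
    """Scale v to length 1 so that a dot product equals the cosine similarity."""
    return v / np.linalg.norm(v)

def toy_most_similar(vectors, pos, neg, n):
    """Hypothetical stand-in for most_similar on a dict of word vectors."""
    # Add the vectors of the words in pos and subtract those of the words in neg
    target = sum(vectors[w] for w in pos) - sum(vectors[w] for w in neg)
    target = unit(target)
    # Score every remaining word by cosine similarity to the target vector
    scores = [
        (word, float(np.dot(unit(vec), target)))
        for word, vec in vectors.items()
        if word not in pos and word not in neg
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# kung - man + kvinna = [3, 1], which is exactly the toy vector for drottning
print(toy_most_similar(toy, ["kung", "kvinna"], ["man"], 1))
```

In the real model the match is never exact, which is why the method ranks candidates by similarity instead of looking for an identical vector.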




### Categories of analogies

Word vectors are computed based on co-occurrence counts: words that co-occur frequently with certain other words are going to have similar vectors. In order to get a better understanding of the model&rsquo;s possibilities and limitations, we load a list of ten analogy pairs:

In [16]:
analogies = lt5.load_data("/home/729G17/labs/l5/data/analogies.txt")

Each element of `analogies` is a string consisting of four space-separated words. Here is an example:

In [39]:
print(analogies[0].split(" ")[3])

Tyskland
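
Since every entry consists of exactly four words, tuple unpacking is a convenient way to name the parts instead of indexing into the split list repeatedly. The sketch below uses a hypothetical line in the same format; it is not taken from `analogies.txt`.

```python
# Hypothetical entry in the same four-word format (not read from analogies.txt)
line = "man kvinna kung drottning"

# Split once and give each of the four words its own name
a, b, c, d = line.split(" ")

print(a, b, c)  # the three words given as input
print(d)        # the word the model is expected to predict
```

Unpacking raises a `ValueError` if an entry does not contain exactly four words, which makes malformed lines easy to spot.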


<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
<p>Write code that computes the model&rsquo;s accuracy on the task of predicting the fourth word in every analogy pair, given the other three. Feel free to use the <code>complete()</code> function that you implemented for Problem&nbsp;3.</p>
</div>
</div>

Use the next code cell to solve the problem:

In [45]:
def evaluate(model, analogies):
    """Computes the accuracy of the specified model on the specified list of analogy quadruples"""
    correct = 0
    for analogy in analogies:
        # Each entry holds four space-separated words: a b c d
        a, b, c, d = analogy.split(" ")
        # Count the entry as correct if the predicted fourth word matches d
        if complete(model, a, b, c) == d:
            correct += 1
    return correct / len(analogies)

print(evaluate(model, analogies))



0.7


<div class="panel panel-primary">
<div class="panel-heading">Problem 5</div>
<div class="panel-body">
    <p>The analogies in the example file have been picked from ten different categories. Invent names for these categories. Which categories would you call semantic (related to the <em>meaning</em> of the words), which would you call syntactic (related to the <em>form</em> and the <em>grammatical behaviour</em> of the words)?</p>
    <p>Select four categories, and find one new example for each of them. Of the four examples, two should be examples where the model succeeds in reproducing the intended analogy, and two should be examples where the model fails to do so.</p>
</div>
</div>

*TODO: Answer for Problem 5 by completing the following tables*

<p><strong>Part 1: Naming the categories</strong></p>

<table>
    <tr><th>Example</th><th>Category</th></tr>
    <tr><td>1</td><td>Capital - Country</td></tr>
    <tr><td>2</td><td>Province - City in the province</td></tr>
    <tr><td>3</td><td>Country - Currency</td></tr>
    <tr><td>4</td><td>Country - Ethnicity</td></tr>
    <tr><td>5</td><td>Gender - Family relation</td></tr>
    <tr><td>6</td><td>Size - Age</td></tr>
    <tr><td>7</td><td>Positive - Comparative</td></tr>
    <tr><td>8</td><td>Positive - Superlative</td></tr>
    <tr><td>9</td><td>Present tense - Past tense</td></tr>
    <tr><td>10</td><td>Singular - Plural</td></tr>
</table>

<p><strong>Part 2: New examples for four of the categories</strong></p>

<table>
    <tr><th>Category</th><th>Example</th><th>Model&rsquo;s completion</th></tr>
    <tr><td>0</td><td>man kvinna kung <em>drottning</em></td><td>man kvinna kung <em>drottning</em></td></tr>
    <tr><td>1</td><td>Kina Peking Norge <em>Oslo</em></td><td>Kina Peking Norge <em>Lillehammer</em></td></tr>
    <tr><td>3</td><td>USA USD UK <em>GBP</em></td><td>USA USD UK <em>Chart</em></td></tr>
    <tr><td>4</td><td>Norge Norsk Mexiko <em>Mexikan</em></td><td>Norge Norsk Mexiko <em>Mexikos</em></td></tr>
    <tr><td>10</td><td>Banan Bananer Boll <em>Bollar</em></td><td>Banan Bananer Boll <em>Stropp</em></td></tr>
</table>

In [65]:
words = ['Kina', 'Peking', 'Norge']
print(complete(model, words[0], words[1], words[2]))
words = ['USA', 'USD', 'UK']
print(complete(model, words[0], words[1], words[2]))
words = ['Norge', 'Norsk', 'Mexiko']
print(complete(model, words[0], words[1], words[2]))
words = ['Banan', 'Bananer', 'Boll']
print(complete(model, words[0], words[1], words[2]))

Lillehammer
Chart
Mexikos
Stropp




## Limitations of word embeddings

In the last problem of this lab, you will reflect on shortcomings of word embeddings.

<div class="panel panel-primary">
<div class="panel-heading">Problem 6 (Reflection)</div>
<div class="panel-body">
    <p>The lecture on word embeddings mentioned several shortcomings of the model. Design an experiment to find concrete examples that illustrate these shortcomings. Write a short reflection piece about your experience. Use the following prompts:</p>
    <ul>
        <li>How did you set up the experiment? What were the results?</li>
        <li>Based on your previous knowledge, did you expect the results? How do you explain them?</li>
        <li>What did you learn from this experiment? How, exactly, did you learn it? Why does this learning matter?</li>
    </ul>
</div>
</div>

In [79]:
words = ['Palestina', 'Palestinier', 'Israel']
print(complete(model, words[0], words[1], words[2]))
words = ['Svensk', 'Sverige', 'Kurd']
print(complete(model, words[0], words[1], words[2]))
words = ['Öland', 'Sverige', 'Krim']
print(complete(model, words[0], words[1], words[2]))

måltavlorna
Laurentien
Ukraina




Because the output of a word embedding depends heavily on its training data, the results will be strongly biased by the contexts in which the words occur. Whether an analogy has an unambiguous answer at all also affects the outcome.
We therefore decided to test "controversial" analogies to see how well the algorithm handles them. Our hypothesis was that words which occur in many different contexts (and with different meanings or agendas) would not be predicted with high confidence.
We chose analogies based on present-day conflicts, since we expected these to be mentioned often and therefore to be backed by a larger amount of data.

The results for the three analogies were not entirely implausible, but not entirely expected either.

What we learned from the experiment is how strongly a word embedding is affected by its sample data, especially when the analogy has no single unambiguous "answer". We had already heard this in the lecture, but our own tests produced results that demonstrated it. This matters when building a word prediction program: it is often inappropriate to use this kind of function without some kind of filter for input words that cannot yield an unambiguous answer.
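
One way to examine the confidence of a prediction, rather than only the top word, is to look at several candidates together with their similarity scores. The following is a small sketch; `top_candidates` is a hypothetical helper, and it assumes that the loaded model accepts gensim's `positive`, `negative` and `topn` keyword arguments for `most_similar`.

```python
def top_candidates(model, a, b, c, n=3):
    """Return the n best completions of 'a is to b as c is to ?' with their scores."""
    # Assumption: model.most_similar accepts gensim's keyword arguments
    return model.most_similar(positive=[b, c], negative=[a], topn=n)

# Usage in the notebook, where `model` is the loaded word space model:
# for word, score in top_candidates(model, 'Öland', 'Sverige', 'Krim'):
#     print(word, score)
```

A small gap between the first and the second score would suggest that the model has no clear preference among the candidates.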

<div class="alert alert-info">
    <p>Once you have finished all problems, submit this notebook according to the instructions on the course web site.</p>
</div>