# Semantics and Pragmatics, KIK-LG103

## Lab session 4, Part 1

---

In this lab session we have two main topics: **similarity** and **clustering**. Last time we learned how to work with distributed representations of meaning (vectors). Through visualizations and by studying word occurrences and their contexts in corpora, we tried to get an intuitive understanding of what the notion of *meaning as a vector* is all about. We also examined the notion of *similarity* visually.

In this first part we will move toward a more rigorous study of similarity. We will introduce a mathematical measure called **cosine similarity** and see how it relates to the visualizations we made last time.

As always, let's first import some tools. The module `lab4utils` contains a few convenient functions that will be explained on the go. Run the code cell below and read on.

In [None]:
%matplotlib notebook

import plot_utils
from lab4utils import cosine_similarity, embed, get_words, \
                      get_mapping, get_top, plot_w2v_2d, plot_w2v_3d, \
                      tabulate_similarities, tabulate_angles, get_n

---
### Section 1.1: Problems with visualization
---

Let's start this part with a short motivating example. In the code cell below we first retrieve the 5 closest and the 5 farthest words for our target word *random*. The embeddings used are the familiar Word2Vec embeddings we saw last time. We then plot the set of words using the function `plot_w2v_3d`.

---

**Exercise 1.1.1** Run the code cell below and examine the output. Pay special attention to how the different sets of words (closest/farthest) are spread out in the figure. Can you see any problems with examining word similarities using visualizations only? Can you figure out an explanation for the results?

Change the word to something else and see what kinds of results you get.

---

In [None]:
# Select a word
word = "random"

# Embed the word using Word2Vec
word_embedding = embed(word)

# 'get_n(embedding, n, mode)' gives us 'n' closest
# (if mode == "best") or 'n' farthest (if mode == "worst")
# embeddings.
print("Closest:", get_n(word_embedding, 5, "best"))
print("Farthest:", get_n(word_embedding, 5, "worst"))

plot_w2v_3d("random randomly arbitrary anonymous deranged thorn mahogany poached statesman strides".split())

---

### Section 1.2: Back to semantic feature analysis

---

Recall semantic feature analysis from last session. In the code cell below we show an example of an analysis with three features.

In [None]:
features = {
    "x": "human",
    "y": "adult",
    "z": "male"
}

# Features are binary: 1 means the word has the feature, -1 means it lacks it;
# 0 means 'not defined'
words = [
    ("girl",  [ 1, -1, -1]),
    ("boy",   [ 1, -1,  1]),
    ("adult", [ 1,  1,  0]),
    ("woman", [ 1,  1, -1]),
    ("calf",  [-1, -1,  0]),
    ("cow",   [-1,  1, -1]),
    ("mature",[ 0,  1,  0])
]

plot_utils.plot_3d_binary(features, words)

Looking at the figure already gives us some information about the similarities (or differences) between the words, but as already mentioned, this time we want to properly quantify those similarities. In the code cell below we show you how to tabulate pairwise cosine similarities and angles for the word vectors.

As a reminder: cosine similarity is the cosine of the angle between two vectors, so it is a real number between `-1` and `1`. It is `1` for vectors that point in the same direction (identical vectors included), `0` for perpendicular vectors, `-1` for vectors that point in exactly opposite directions, and something in between for the rest. You can see the lecture slides for more information if you want. We will also dive deep into cosine similarity in Part 3 of this session.
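To make the reminder concrete, here is a minimal sketch of how the cosine similarity and the angle between two feature vectors could be computed by hand. The helper names `cosine_sim` and `angle_deg` are made up for this illustration; they are not the functions in `lab4utils`, whose implementation may differ. The sketch assumes that neither vector consists of zeros only.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors (lists of numbers)."""
    dot = sum(a * b for a, b in zip(u, v))   # dot product u . v
    sq_u = sum(a * a for a in u)             # squared length of u
    sq_v = sum(b * b for b in v)             # squared length of v
    return dot / math.sqrt(sq_u * sq_v)

def angle_deg(u, v):
    """Angle between two non-zero vectors, in degrees."""
    # Clamp to [-1, 1] so that rounding errors never break acos
    cos = max(-1.0, min(1.0, cosine_sim(u, v)))
    return math.degrees(math.acos(cos))

# Two vectors from the feature analysis above (human, adult, male)
girl = [1, -1, -1]
boy = [1, -1, 1]

print("similarity:", cosine_sim(girl, boy))
print("angle:", angle_deg(girl, boy))
```

The vectors for *girl* and *boy* differ in one feature only, so the similarity comes out at roughly 0.33 and the angle at about 70.5°. Note that the measure depends only on the directions of the vectors, not on their lengths.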

---

**Exercise 1.2.1** Run the code cell below, examine the output and compare the tables to the figure above.

Change some of the vectors or add new ones to the list `words`. You can also analyze some other set of words using your own features.

How do the similarities and angles change when you change the vectors? Can you see how the similarities relate to the figure?

Pick one word from the figure and come up with a new word so that the angle between the two is either 45°, 90°, or 180°. Figure out at least one new word for each angle.

---

In [None]:
# 'words' was defined in the code cell above
tabulate_similarities(words)
tabulate_angles(words)

After this, continue to Part 2.