# Lab 2 home assignment: Distributional semantics

---

<font color="red">**This page contains interactive graphics. It only works properly if you use the "classic notebook" user interface. If you did not do that yet, start by selecting *Launch Classic Notebook* from the *Help* menu.**</font>

---

Your homework consists of three tasks, worth 4 points combined. Add your solutions to this notebook page. When you are done, download the page as a notebook file (.ipynb) and submit that file on Moodle.

---

### Task 1: Semantic feature analysis of emotions

*You can get a maximum of 1.5 points from this task.*

Your first task concerns *sentiment analysis*, or more specifically the *emotional* connotations that words carry. According to the American psychologist [Robert Plutchik](https://en.wikipedia.org/wiki/Robert_Plutchik) (1927 – 2006), there are eight primary emotions: **anger**, **fear**, **sadness**, **disgust**, **surprise**, **anticipation**, **trust**, and **joy**.

In Plutchik's theory, more complex emotions are derived from the eight primary emotions by combining different intensities of the primary emotions.

For instance, the word "*reward*" involves the emotions *anticipation*, *joy*, *surprise*, and *trust* to different degrees. The word "*worry*" involves *anticipation*, *fear*, and *sadness*. The word "*suddenly*" is only about *surprise*, whereas the word "*garbage*" triggers an emotion of *disgust*.
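
As a small illustration of how such descriptions translate into feature vectors, the sketch below encodes the four example words above. The intensity values are made-up guesses chosen only to match the emotions named in the text, and `EMOTIONS`, `example_words` and `active_emotions` are hypothetical names, not part of the course code.

```python
# Feature order: anger, fear, sadness, disgust, surprise, anticipation, trust, joy
EMOTIONS = ["anger", "fear", "sadness", "disgust",
            "surprise", "anticipation", "trust", "joy"]

# Illustrative values only: the intensities are subjective guesses.
example_words = [
    ("reward",   [0.0, 0.0, 0.0, 0.0, 0.3, 0.6, 0.4, 0.8]),
    ("worry",    [0.0, 0.7, 0.4, 0.0, 0.0, 0.6, 0.0, 0.0]),
    ("suddenly", [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]),
    ("garbage",  [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]),
]

def active_emotions(vector):
    """Return the names of the emotions with a non-zero feature value."""
    return [name for name, value in zip(EMOTIONS, vector) if value > 0]

for word, vector in example_words:
    print(word, active_emotions(vector))
```

Your own vectors do not need to agree with these values; the point is only that each position in the list corresponds to one fixed emotion.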

**Task 1.1:** `(1 point)` Your task is to do a semantic feature analysis for a set of words. As semantic features you should use Plutchik's eight primary emotions. You should perform dimensionality reduction and plot the word vectors you obtain in both 2D and 3D, exactly as in the _games_ example in Part 2 of the class assignment. Also include a table of the pairwise angles between all vectors.

However, first you need to import the necessary plot utilities:

In [None]:
%matplotlib notebook
import sys
sys.path.append("../../../sem-prag-2025/src")
import plot_utils

For your analysis, pick *10 words* of your own choice from the following list of words: `anaconda, bucket, concert, dirty, etymology, forsaken, gain, hairy, instruction, judging, kind, lustful, monster, night, overestimate, profit, quick, refund, summer, tolerate, underpants, vinegar, warn, xenophobia, yesterday, zodiac`.

In the code cell below, produce eight-dimensional vectors for the ten words you have selected. For every word, go through all eight primary emotions. Depending on what emotions the word conveys to you, set the feature values as follows: use zero (0) if the emotion is totally absent and one (1) if the emotion is strongly present. You can also use values between zero and one for some of the feature dimensions.

In [None]:
# features: (1) anger, (2) fear, (3) sadness, (4) disgust,
#           (5) surprise, (6) anticipation, (7) trust, (8) joy

words = [            # (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)
    ("word1",        [ 0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0]),  # modify these!
    ("word2",        [ 0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0]),
    ("word3",        [ 0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0]),
    # etc.
]

Next perform dimensionality reduction and plot in two and three dimensions. Also produce the table of pairwise angles between the word vectors. (If the plot does not work as it should, go back to Part 1 of the class assignment to see what to do about that.)

In [None]:
plot_utils.plot_2d_binary_hd(words, arrows=False)
plot_utils.plot_3d_binary_hd(words, arrows=True)
plot_utils.tabulate_angles(words)
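
To help you read the angle table, the following sketch computes the angle between two feature vectors with the standard cosine formula. It is an assumption that `plot_utils.tabulate_angles` uses this same definition; `angle_degrees` is a hypothetical helper and not part of `plot_utils`.

```python
import numpy as np

def angle_degrees(u, v):
    """Angle between two non-zero vectors, in degrees (cosine formula)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to [-1, 1] to guard against tiny floating-point overshoots.
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))

# Vectors with no shared emotions are orthogonal.
print(angle_degrees([1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0]))
# Vectors pointing in the same direction have an angle of (about) zero,
# even if their lengths differ.
print(angle_degrees([0, 0.5, 0.5, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0, 0, 0]))
```

With this definition a small angle means similar emotion profiles and a large angle means dissimilar ones. The angle is undefined for a vector that is zero in all eight dimensions.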

**Task 1.2:** `(0.5 points)` When you have produced the ten word vectors, plotted them and computed the angles between the vectors, please answer the following *questions*:

1. Looking at the pairwise angles, which two words are *most similar* and which two words are *least similar* to each other based on the emotion features? Does this make sense?

2. Compared to the exactly measured angles, do the two- and three-dimensional projections of the vectors reflect the similarities and dissimilarities accurately? That is, in comparison to the measured angles, do the plotted figures give you the same impression of which words are most similar and which words are least similar? Discuss.

In [None]:
# Your answer to question 1:
#
#

# Your answer to question 2:
#
#

# Put a hashtag (#) at the beginning of every line of your answer, so
# that your text is treated as comments rather than Python commands.

---

### Task 2: Word contexts of related and unrelated words

*You can get a maximum of 1.5 points from this task. You need to answer 5 questions, and each question is worth 0.3 points.*

Before proceeding to Tasks 2 and 3, you need to import some further libraries and data:

In [None]:
# Import two book texts from the Gutenberg corpus: Moby Dick and Sense and Sensibility
import sys
!{sys.executable} -m pip install nltk

import nltk
from nltk.corpus import gutenberg
from nltk.text import Text
nltk.download("gutenberg")

moby_dick = Text(gutenberg.words("melville-moby_dick.txt"))
sense_and_sensibility = Text(gutenberg.words("austen-sense.txt"))

# Import some additional auxiliary libraries
import distribsem
import numpy as np

# These lines filter out some characters from the texts to make it less noisy
moby_dick = distribsem.filter_text(moby_dick)
sense_and_sensibility = distribsem.filter_text(sense_and_sensibility)

Your second task is to study the contexts of different words, keeping in mind the ideas about learning word embeddings from corpora. You can read slides 55 – 58 from Lecture 2 if you need a refresher on this. We also remind you of some basics here.

When learning word embeddings from a corpus, there are a few parameters we need to decide on beforehand. First of all, we need to define a window size for the context we want to use. For example, a window of +/- 4 words means we will consider as context the four words before and the four words after the focus word.
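
As a minimal sketch of what a context window means in practice, the example below extracts the words around one occurrence of a focus word from a short made-up token list. `context_window` is a hypothetical helper written for illustration; it is not part of `distribsem`.

```python
def context_window(tokens, position, window_size):
    """Return the tokens before and after `position`, up to `window_size` each."""
    start = max(0, position - window_size)
    before = tokens[start:position]
    after = tokens[position + 1:position + 1 + window_size]
    return before, after

# Made-up example sentence, not taken from the course corpora.
tokens = "the ship sailed over the deep water toward the distant shore".split()
position = tokens.index("water")

before, after = context_window(tokens, position, window_size=4)
print(before)  # the four words before "water"
print(after)   # the four words after "water"
```

Near the beginning or end of a text, the window simply contains fewer words.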

We also need to decide the dimensionality of the final vectors. In our case this means the size of the context vocabulary we want to take into account. We can decide, for example, to use the 1000 most common words in the corpus as context vocabulary. Consequently, our embeddings will be 1000-dimensional. Any words outside the top 1000 will simply not be taken into account.
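
The idea can be sketched as a simple count vector: one dimension per context-vocabulary word, counting how often that word appears inside the window around the focus word. This is only an illustration with raw counts on a made-up sentence; `cooccurrence_vector` is a hypothetical function, and the course library may weight or normalize the counts differently.

```python
from collections import Counter

def cooccurrence_vector(tokens, focus_word, window_size, dimensionality):
    """Count context words around every occurrence of `focus_word`."""
    # Context vocabulary: the `dimensionality` most frequent tokens.
    vocabulary = [w for w, _ in Counter(tokens).most_common(dimensionality)]
    index = {w: i for i, w in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for position, token in enumerate(tokens):
        if token != focus_word:
            continue
        start = max(0, position - window_size)
        neighbours = (tokens[start:position]
                      + tokens[position + 1:position + 1 + window_size])
        for neighbour in neighbours:
            # Words outside the context vocabulary are ignored.
            if neighbour in index:
                vector[index[neighbour]] += 1
    return vocabulary, vector

tokens = "the whale swam in the sea and the ship sailed on the sea".split()
vocabulary, vector = cooccurrence_vector(
    tokens, focus_word="sea", window_size=2, dimensionality=2
)
print(vocabulary)
print(vector)
```

Here the vector has only two dimensions because `dimensionality=2`; all other context words are dropped.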

In the code cell below, you are given a function `show_kwic` (kwic = **key-word in context**) that you can use to retrieve instances of a word of your choice in a corpus. The example shows the first 10 occurrences of the word *water* in *Moby Dick*, using a window size of +/- 4 and the 500 most common words as context vocabulary.

---

**Your task:** Pick **one pair** of *closely related* words (for example, *sky* and *cloud* can be considered closely related) and **one pair** of *unrelated* words (for example, *house* and *jump*). Study the contexts of the words and see how/whether the (un)relatedness is manifested in the contexts. Answer as comments in the code cell.

You should also try different values for `dimensionality` and `window_size`. `dimensionality` controls the size of the context vocabulary (the most frequent words in the text) and `window_size` controls the size of the context window. Try `dimensionality` values in different ranges (tens, hundreds, thousands) and vary the window size, for example between 1 and 10. Write answers to the following questions as comments in the code cell below:

1. Which pairs of words did you pick as the two related words and the two unrelated words?
2. How does dimensionality affect how easy it is to see the (un)relatedness of the words?
3. How does window size affect how easy it is to see the (un)relatedness of the words?
4. What kinds of characteristics do you think the embeddings capture with small window sizes and dimensionalities?
5. What kinds of characteristics do you think the embeddings capture with large window sizes and dimensionalities?

You can use either the text `moby_dick` or `sense_and_sensibility`.

**Note:** *Words outside the `dimensionality` most frequent words are replaced with UNK in the kwic output.*
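
The replacement can be pictured roughly as follows. This is a sketch on a made-up sentence, and `mask_rare_words` is a hypothetical function; how `show_kwic` treats details such as the focus word itself is not shown here.

```python
from collections import Counter

def mask_rare_words(tokens, dimensionality, unk="UNK"):
    """Replace every token outside the most frequent `dimensionality` tokens."""
    keep = {w for w, _ in Counter(tokens).most_common(dimensionality)}
    return [t if t in keep else unk for t in tokens]

tokens = "the sea and the ship and the whale".split()
masked = mask_rare_words(tokens, dimensionality=2)
print(" ".join(masked))
```

With a very small `dimensionality`, most content words turn into UNK and only frequent function words remain visible.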

---

In [None]:
source_text = moby_dick   # alternatively: sense_and_sensibility
focus_word = "water"      # use different words
window_size = 4           # try values between 1 and 10
dimensionality = 500      # try values in the range of tens, hundreds, thousands
show_n_occurrences = 10   # increase this if you like

distribsem.show_kwic(
    text=source_text,
    word=focus_word,
    window=window_size,
    dimensionality=dimensionality,
    show_n=show_n_occurrences
)

# Answer questions 1 – 5 as comments in this code cell:
#
#

---

### Task 3: Comparing embeddings across corpora

*You can get a maximum of 1 point from this task.*

In this third task we will continue the theme of word embeddings and contexts. This time we will study embeddings trained on two different texts. The texts we will use are the familiar *Moby Dick* (variable `moby_dick`) and *Sense and Sensibility* (variable `sense_and_sensibility`). 

In the code cell below we train embeddings for a set of words on both texts (function `create_vectors_shared`). This will result in two embedding matrices (`M_moby_shared` and `M_sense_shared`) as well as a mapping from words to their row indices (`mapping_shared`). The vocabulary size and embedding dimensionality may look unusual because of the way they are restricted to make the embeddings comparable. (Don't worry about them.)

Run the code cell and read on.

In [None]:
M_moby_shared, M_sense_shared, mapping_shared = distribsem.create_vectors_shared(
    max_vocab_size=10000,
    min_dimensionality=1000,
    window_size=4,
    text1=moby_dick,
    text2=sense_and_sensibility
)

In the code cell below we show how you can visualize the word embeddings. Embeddings trained on different texts are color-coded. 

**Task 3.1:** `(0.5 points)` Your first task is to find one word that has relatively similar embeddings in the two texts (meaning the two embeddings for the *same word* are close to each other in the figure) as well as one word that has completely different embeddings (again, a case where the same word gets two very different embeddings in the two texts). Plot the two words as in the example code and include comments where you explain which word has similar embeddings and which does not.

**Do not use the words *water* or *man* that have been supplied.** Also, don't change the window size or dimensionality for this task.

Read on to **Task 3.2** after you have done this part of the task.

**Note:** *Use your best judgment on what "relatively close" means. It is hard to find words that have completely identical embeddings or words whose embeddings lie in completely opposite regions of the vector space. If you run the code cell below, you can see two words visualized, "water" and "man". "Water" is a good example of dissimilar embeddings and "man" is an example of a word with two similar embeddings.*
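
If you want a numeric sanity check in addition to the figure, one option is the cosine similarity between the two embeddings of the same word. The sketch below uses small made-up matrices instead of the real ones. It assumes that each matrix is a NumPy array with one row per word and that both matrices share one word-to-row mapping, as described above; `cosine_similarity` and `same_word_similarity` are hypothetical helpers, not part of `distribsem`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_word_similarity(word, matrix_1, matrix_2, mapping):
    """Compare the two embeddings of `word`, looked up by its shared row index."""
    row = mapping[word]
    return cosine_similarity(matrix_1[row], matrix_2[row])

# Toy stand-ins for the two embedding matrices and the shared mapping.
toy_mapping = {"man": 0, "water": 1}
toy_matrix_1 = np.array([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
toy_matrix_2 = np.array([[4.0, 2.0, 0.0], [3.0, 0.0, 0.0]])

for word in toy_mapping:
    print(word, same_word_similarity(word, toy_matrix_1, toy_matrix_2, toy_mapping))
```

A value near 1 suggests similar embeddings and a value near 0 suggests dissimilar ones. The task itself still asks you to judge closeness from the plot.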

In [None]:
focus_words = "water man".split()  # replace this with two other words!

distribsem.plot_two_embeddings(
    words=focus_words,
    embeddings_1=M_moby_shared,
    embeddings_2=M_sense_shared,
    mapping_1=mapping_shared,
    embeddings_1_name="Moby Dick",
    embeddings_2_name="Sense and Sensibility"
)

# Task 3.1: Answer the questions here
#

**Task 3.2:** `(0.5 points)` Your second task is to take the two words you found in **Task 3.1**: one with similar embeddings and the other with dissimilar embeddings. Analyze the contexts of the two words in the two texts, similar to what you did in Task 2. How do the (dis)similarities of the embeddings show in the word contexts in the different texts? Again, answer the question as comments in the code cell.

You do not need to change the window size or dimensionality in this task.

In [None]:
show_n_occurrences = 20
focus_word = "water"      # change this!

print("Occurrences in Sense and Sensibility:")
distribsem.show_kwic(
    text=sense_and_sensibility,
    word=focus_word,
    window=4,
    dimensionality=1000,
    show_n=show_n_occurrences
)

print("\nOccurrences in Moby Dick:")
distribsem.show_kwic(
    text=moby_dick,
    word=focus_word,
    window=4,
    dimensionality=1000,
    show_n=show_n_occurrences
)

# Task 3.2: Answer the questions here
#

When you are done, download this page as a notebook file (.ipynb) and submit it through Moodle before the deadline. Good luck!