# Semantics and Pragmatics, KIK-LG103

## Lab session 3, Part 1: Clustering

---

<font color="red">**This page contains interactive graphics. It only works properly if you change to the "classic notebook" user interface. Start by selecting *Launch Classic Notebook* from the *Help* menu.**</font>

---

In this lab, we will continue using word2vec (W2V) word embeddings.

Start by running the code in the cell below. This code imports the modules needed for plotting vectors and for clustering. It also initializes the word2vec word embeddings.

In [None]:
%matplotlib notebook
import sys
sys.path.append("../../../sem-prag-2025/src")
import plot_utils

from sklearn.cluster import AgglomerativeClustering, KMeans

embeddings, mapping = plot_utils.get_embeddings()

We will start the lab session by taking a look at **clustering**. We will try out two different clustering methods that were introduced in the lecture. The first one is **k-means** and the second one is **hierarchical (agglomerative)** clustering.

---

### Section 1.1: Flat clustering 

---

The first method we will look at is a "flat" clustering algorithm called k-means. Flat in this case means that the resulting clusters do not have any explicit structure. The optimal result is simply that words within a cluster are maximally similar to each other, while words in different clusters are maximally different from each other. We saw a [demo](http://shabal.in/visuals/kmeans/1.html) of how k-means clustering works; if you need a refresher you can check that out again.
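
As a refresher, the two alternating steps of k-means can be sketched in a few lines of NumPy. This is only an illustrative toy, not the implementation behind scikit-learn's `KMeans`; the function name `kmeans_sketch` and the example points are made up for this sketch.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Toy k-means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated groups of made-up 2-D points (not word vectors)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

The cluster ids themselves are arbitrary: which group is called 0 and which is called 1 depends on the random starting points.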

---

**Exercise 1.1.1** In the code cell below we show you how to cluster a set of words and plot the results. You only need to choose which words to cluster and how many clusters to use (`words = ...`, `clusters = ...`).

Try out different words and numbers of clusters and think about the following questions: 

- How well does the clustering work?
- Which words seem to work best?
- What kind of categories do you think the clusters represent?
- Is there a number of clusters that gives sensible results most of the time, or one that doesn't work at all?
- Do you see any problems with having to define the number of clusters yourself?
- Do the clusters change when you run the algorithm several times?

Note that here the *colors* indicate the clusters. The two-dimensional projection is not in itself a reliable view of which words are close to each other in this task.
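
If you want to probe the last question more systematically, one option is to fix the random seed and compare several runs. The sketch below assumes made-up random vectors (`X_toy`) in place of the real feature matrix, and uses `n_init=1` so that each fit relies on a single random start; both choices are only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up stand-in for the feature matrix: 12 random 5-dimensional vectors
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(12, 5))

results = {}
for seed in range(5):
    # n_init=1 keeps a single random start, so differences between seeds stay visible
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X_toy)
    results[seed] = (km.labels_, km.inertia_)
    # inertia_ is the summed squared distance of the points to their centroids
    print(seed, km.labels_, round(km.inertia_, 3))
```

When comparing runs, keep in mind that the same grouping can appear with permuted cluster ids, so compare which words share a cluster rather than the ids themselves.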

---

In [None]:
# Define the words to be clustered and plotted
words = "run jump swim walk go take cry laugh speak talk hear".split()
clusters = 2

# Represent the words in a suitable way for the clustering algorithm
X = plot_utils.to_feature_matrix(words, embeddings, mapping)
# Initialize clustering algorithm
model = KMeans(n_clusters=clusters)
# Train model
model = model.fit(X)
    
# Plot results
plot_utils.plot_kmeans(model, words, embeddings, mapping)

In [None]:
# Comment on your observations (1.1.1) here:
#
#
#

---

### Section 1.2: Hierarchical clustering

---

In this section we will look at the second method: **hierarchical agglomerative clustering**. The method is hierarchical because it gives us a hierarchy of clusters instead of the flat, structureless clusters of k-means. We can investigate the hierarchy at different depths, resulting in different clusters depending on the level where we decide to group the words. Agglomerative means that the algorithm works in a bottom-up manner. Initially, each word is considered its own cluster. The clusters are then iteratively merged until we end up with one cluster. This results in the hierarchical structure.
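
The bottom-up merging can be made concrete with SciPy's `linkage` function, which records every merge in order. This is a sketch on made-up 2-D points rather than word vectors. It assumes Ward linkage, which is also the default of scikit-learn's `AgglomerativeClustering`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five made-up points: two tight pairs and one outlier
X_toy = np.array([[0.0, 0.0], [0.0, 1.0],
                  [10.0, 0.0], [10.0, 1.5],
                  [30.0, 0.0]])

# Each row of Z describes one merge: the two cluster ids that were joined,
# the distance at which they were joined, and the size of the new cluster.
# Ids 0..4 are the original points; each new cluster gets the next free id.
Z = linkage(X_toy, method="ward")
for step, (a, b, dist, size) in enumerate(Z):
    print(f"merge {step}: {int(a)} + {int(b)} at distance {dist:.2f} (size {int(size)})")
```

With five points there are four merges, and the last one contains all five points.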

All of this is easier to see in a dendrogram. Run the code cell below to see the results.

---

**Exercise 1.2.1** Again, try out different words. This time the number of clusters is less important: as you might notice, the number you define doesn't change the structure of the resulting hierarchy. What it changes is the depth at which the algorithm groups the words into clusters. These clusters are shown after the word labels (`word/cluster_id`). Think about the following questions:

- Do you see any potential benefits to using a hierarchical clustering instead of a flat one like k-means? Any problems? 
- Do the clusters produced by this method match the clusters produced by k-means?
- Do the clusters change when you run the algorithm multiple times?
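
To check for yourself how the requested number of clusters relates to the hierarchy, you can fit the model twice and compare the merge trees stored in `children_`. The sketch below uses made-up random vectors (`X_toy`) instead of word embeddings, and sets `compute_full_tree=True` explicitly so that the whole tree is built in both fits.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up stand-in for the feature matrix: 8 random 4-dimensional vectors
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(8, 4))

# Fit the same data with two different numbers of clusters
fits = {
    k: AgglomerativeClustering(n_clusters=k, compute_full_tree=True).fit(X_toy)
    for k in (2, 4)
}

# children_ lists the merges that make up the hierarchy
print("same tree:", np.array_equal(fits[2].children_, fits[4].children_))
for k, fitted in fits.items():
    print(k, fitted.labels_)
```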

---

In [None]:
# Define the words to be clustered and plotted
words = "run jump swim walk go take cry laugh speak talk hear".split()
clusters = 4

# Represent the words in a suitable way for the clustering algorithm
X = plot_utils.to_feature_matrix(words, embeddings, mapping)
# Initialize clustering algorithm
model = AgglomerativeClustering(n_clusters=clusters)

# Train model
model = model.fit(X)
    
# Plot results
plot_utils.plot_dendrogram(model, labels=words)

In [None]:
# Comment on your observations (1.2.1) here:
#
#

---

**Exercise 1.2.2** In this exercise you have a chance to make sure you understand how the final clusters can be determined from the dendrogram. 

In the code cell below you are given a function, `get_clusters_at_cutoff`, that takes the clustering `model`, the source words, and a cutoff value as arguments. It prints the clusters that we would get if we cut the hierarchy at the depth determined by that value *(Note: Depth in this case is calculated "bottom-up")*.

In the example code we use a cutoff value of 9. In that case all the words end up in a single cluster.

Now make sure you understand how that is determined. Look at the dendrogram above, and try to predict what kind of clusters form at a certain depth. Then use the function below to check if you predicted correctly. 

Try this with a few different sets of words and cutoff values.
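
For comparison, the general idea of cutting a dendrogram can also be sketched with SciPy's `fcluster`. Note the assumptions: this sketch cuts at a merge *distance*, which is not necessarily how `get_clusters_at_cutoff` measures its cutoff, and the helper `clusters_at_distance`, the labels, and the points are all made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical labels and made-up 2-D points (not word vectors)
toy_words = ["a", "b", "c", "d", "e"]
X_toy = np.array([[0.0, 0.0], [0.0, 1.0],
                  [10.0, 0.0], [10.0, 1.5],
                  [30.0, 0.0]])
Z = linkage(X_toy, method="ward")

def clusters_at_distance(Z, labels, cutoff):
    """Group labels by the flat clusters obtained when cutting at `cutoff`."""
    # Merges made at a distance above the cutoff are undone
    ids = fcluster(Z, t=cutoff, criterion="distance")
    groups = {}
    for label, cluster_id in zip(labels, ids):
        groups.setdefault(int(cluster_id), []).append(label)
    return sorted(groups.values())

# A low cutoff keeps many small clusters; a high cutoff merges everything
for cutoff in (2.0, 20.0, 40.0):
    print(cutoff, clusters_at_distance(Z, toy_words, cutoff))
```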

---

In [None]:
plot_utils.get_clusters_at_cutoff(model, words, 9)

After this you can continue to Part 2.