# Lab 3 home assignment: Similarity and clustering

---

Run the code cell below to import some necessary functions and modules.

In [None]:
%matplotlib notebook

import plot_utils
import operator

from sklearn.cluster import KMeans

---

### Task 1: Clustering Word2Vec embeddings

---

In this first task we will explore clustering on a larger scale, which here means roughly 26,000 words. Running the code cell below plots all the words, color-coded by cluster. Labeling this many points would make the figure unreadable, but the plot might still contain some interesting information.

In [None]:
# Get embeddings and mapping from words to matrix rows
embeddings, mapping = plot_utils.get_embeddings()
# Get all the words and sort by row id
words = [w for w, i in sorted(mapping.items(), key=operator.itemgetter(1))]

# Number of clusters
clusters = 3

# Initialize algorithm and perform k-means on embeddings
model = KMeans(n_clusters=clusters)
model = model.fit(embeddings)
    
# Plot clusters
plot_utils.plot_kmeans(model, words, embeddings, mapping, plot_text=False, small_points=True)

Below we show you how you can randomly sample words from each cluster. In the example code we sample 100 words (`show_n = 100`) from the clusters 0-2 (`show_clusters = [0, 1, 2]`). Notice that the clusters are given as integers in a list. Because of random sampling you will get different words for each cluster every time you run the code.
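To make the sampling step concrete, here is a minimal sketch of how per-cluster sampling could work. It is not the actual implementation of `plot_utils.sample_clusters`; the function `sample_words_per_cluster` and the toy data are hypothetical. It only assumes that `words[i]` belongs to row `i` of the embedding matrix and that the labels hold one cluster ID per row, as `model.labels_` does for a fitted `KMeans`.

```python
import numpy as np


def sample_words_per_cluster(labels, words, show_n, show_clusters, seed=None):
    """Hypothetical helper: randomly pick up to show_n words from each cluster."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    samples = {}
    for cluster_id in show_clusters:
        # Row indices of all words assigned to this cluster
        member_rows = np.flatnonzero(labels == cluster_id)
        n = min(show_n, len(member_rows))
        if n == 0:
            samples[cluster_id] = []
            continue
        # Sample without replacement so no word is repeated
        chosen = rng.choice(member_rows, size=n, replace=False)
        samples[cluster_id] = [words[i] for i in chosen]
    return samples


# Toy data standing in for the real words and cluster labels
toy_words = ["cat", "dog", "run", "walk", "red", "blue"]
toy_labels = [0, 0, 1, 1, 2, 2]
samples = sample_words_per_cluster(
    toy_labels, toy_words, show_n=1, show_clusters=[0, 2], seed=0
)
print(samples)
```

With the real data you would pass `model.labels_` and `words` instead of the toy lists. Leaving `seed` unset gives a different sample on every call.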

Your task is to explore the results of the clustering. Follow the steps below.

---

**1.1** The example code performs k-means with 3 clusters. Run the code cell above a few times to see how robust the three clusters are. **Do they change in different runs?** Note that the cluster IDs might change every time you run the cell; what we mean by changing here is that the points are clearly grouped differently. **Answer the question as comments in the dedicated code cell below.**

**1.2** After you have performed the clustering a few times, leave one result as it is and sample words from the three clusters in the code cell below. You can change the number of words to be sampled if you want. Again, run the sampling multiple times to get a good picture of the words in each cluster. **Do the clusters represent clearly different groups or is it hard to see any connections? How do you come to that conclusion? Answer the question as comments in the dedicated code cell below.**

**1.3** In this third step you should explore different numbers of clusters. Change the line `clusters = 3` above and run the cell to get the new clustering. The number can be anything from a few clusters to tens or hundreds of them. Try at least **three** different numbers. **For each clustering, sample some of the clusters and see if you can find clear connections between the words. Answer as comments whether the clusters correspond to some sensible categories (these might be for example semantic or syntactic) and how you come to that conclusion.** 

*Note 1: In 1.3 you do not have to sample every cluster when you have tens or hundreds of them. Just try to get a good overview of the quality.*

*Note 2: The more clusters you use, the slower it is to run the code. If it is unbearably slow, reduce the number of clusters. It works fine (takes a few minutes to run) with up to a few hundred clusters, at least.*

**You can get a maximum of 2 points from this task.**
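Visual inspection is what step 1.1 asks for, but if you want a number to go with it, one possible complement is the adjusted Rand index, which compares two clusterings while ignoring how the cluster IDs are named. The sketch below uses made-up toy points rather than the Word2Vec embeddings; the fixed `random_state` values are there only to make the two runs differ in a reproducible way.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Toy data (an assumption for illustration): three well-separated 2D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_points = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

# Two independent k-means runs with different random initializations
labels_a = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(toy_points)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(toy_points)

# 1.0 means identical groupings; values near 0 mean chance-level agreement
agreement = adjusted_rand_score(labels_a, labels_b)
print("Agreement between runs:", agreement)

# Renaming the cluster IDs does not change the score
renamed = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print("Same grouping, swapped IDs:", renamed)
```

To apply the same idea here, you could replace `toy_points` with `embeddings`. The score does not replace looking at the plots and the sampled words.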

In [None]:
show_n = 100
show_clusters = [0, 1, 2]

plot_utils.sample_clusters(model, words, show_n, show_clusters)

In [None]:
# Answer the above questions (1.1 – 1.3) as comments here
#
#

---

### Task 2: Training and clustering embeddings

---

Run the code cell below to import the additional modules needed for this task.

In [None]:
import distribsem

from nltk.book import text1 as moby_dick, text2 as sense_and_sensibility

In this task we will train embeddings on our own text. As you might guess, you have two choices: you can use either *Moby Dick* or *Sense and Sensibility*. The first code cell below contains the code for training the embeddings. Like last time, `vocab_size` is the number of words we will train embeddings for, `dimensionality` is the size of the context vocabulary (and consequently the dimensionality of the embeddings), `window_size` is the number of words we take into account on each side of the target word, and `text` is the text we train the embeddings on (either `moby_dick` or `sense_and_sensibility`).
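The sketch below illustrates how these three parameters could interact in a simple count-based model. It is not the actual code of `distribsem.create_vectors`, whose internals are not shown here. The function `cooccurrence_vectors` is hypothetical and rests on two assumptions: the target vocabulary is the `vocab_size` most frequent words, and the context vocabulary is the `dimensionality` most frequent words.

```python
from collections import Counter

import numpy as np


def cooccurrence_vectors(tokens, vocab_size, dimensionality, window_size):
    """Hypothetical sketch: raw co-occurrence counts within a symmetric window."""
    ranked = [w for w, _ in Counter(tokens).most_common()]
    # Assumption: targets and contexts are both chosen by frequency
    targets = {w: i for i, w in enumerate(ranked[:vocab_size])}
    contexts = {w: j for j, w in enumerate(ranked[:dimensionality])}

    vectors = np.zeros((len(targets), len(contexts)))
    for pos, word in enumerate(tokens):
        if word not in targets:
            continue
        # Look at window_size tokens on each side of the target word
        start = max(0, pos - window_size)
        end = min(len(tokens), pos + window_size + 1)
        for ctx_pos in range(start, end):
            ctx_word = tokens[ctx_pos]
            if ctx_pos != pos and ctx_word in contexts:
                vectors[targets[word], contexts[ctx_word]] += 1
    return vectors, targets


toy_tokens = "the whale saw the ship and the whale sank the ship".split()
toy_vectors, toy_mapping = cooccurrence_vectors(
    toy_tokens, vocab_size=3, dimensionality=2, window_size=1
)
print(toy_vectors.shape)  # one row per target word, one column per context word
```

In this sketch a larger `window_size` adds counts from more distant words, and a larger `dimensionality` adds columns for rarer context words.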

Last time we examined the word contexts and tried to figure out how different window sizes and dimensionalities affect the resulting word embeddings. This time we will examine how the different parameters affect the clustering.

---

**Your task** is to train word embeddings on one of the two texts using different dimensionalities and window sizes. Try at least **two different combinations of dimensionality and window size**. For each combination, try out different numbers of clusters like in the first task. Answer the following questions as comments in the dedicated code cell below:

**2.1** How do the dimensionalities/window sizes affect the resulting clustering?

**2.2** Are the differences easier to see with a small or a large number of clusters?

**2.3** Can you find an optimal number of clusters for some combination?

**2.4** How does the quality of the clustering compare to that of the W2V embeddings in Task 1?

*Note: The code cells are split as follows: the first code cell contains the code for training the embeddings, the second cell contains the code for plotting the clusters, and the third cell contains the code for sampling the clusters. They are split so that you do not have to train the embeddings every time you want to plot something, for example.*

**You can get a maximum of 2 points from this task.**
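For question 2.3, judging the sampled words is the main tool, but a numeric score can serve as a cross-check. The sketch below computes the silhouette score for several values of k and picks the highest. This is one common heuristic, not a method prescribed by this assignment, and the toy points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (an assumption for illustration): three well-separated 2D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_points = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

# Silhouette score per candidate k (higher means tighter, better-separated clusters)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_points)
    scores[k] = silhouette_score(toy_points, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("Highest silhouette score at k =", best_k)
```

On the real embeddings the full computation can be slow. `silhouette_score` accepts a `sample_size` argument that scores a random subset instead. A high score says the clusters are compact, not that they are semantically meaningful, so it should be read together with the sampled words.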

In [None]:
# These lines filter some characters out of the texts to make them less noisy
moby_dick = distribsem.filter_text(moby_dick)
sense_and_sensibility = distribsem.filter_text(sense_and_sensibility)

vocab_size = 10000
dimensionality = 1000
window_size = 4
text = moby_dick


embeddings, mapping = distribsem.create_vectors(
    vocab_size, 
    dimensionality, 
    window_size, 
    text
)

In [None]:
words = [w for w, i in sorted(mapping.items(), key=operator.itemgetter(1))]

# Number of clusters
clusters = 4

# Initialize algorithm and perform k-means on embeddings
model = KMeans(n_clusters=clusters)
model = model.fit(embeddings)
    
# Plot clusters
plot_utils.plot_kmeans(model, words, embeddings, mapping, plot_text=False, small_points=True)

In [None]:
show_n = 50
show_clusters = [0, 1, 2, 3]

plot_utils.sample_clusters(model, words, show_n, show_clusters)

In [None]:
# Answer the questions (2.1 – 2.4) as comments here
#
#

Download this page with your additions as a Notebook file (.ipynb) and submit it through Moodle.