In [None]:
%load_ext autoreload
%autoreload 2

# Working with Word Embeddings

In this notebook, we will apply linear algebra operations using NumPy to find analogies between words manually.

In [None]:
import numpy as np

from htwgnlp.embeddings import WordEmbeddings

The embeddings we use for this lab are from the [Google News Word2Vec model](https://code.google.com/archive/p/word2vec/). This model was trained on part of the Google News dataset (about 100 billion words). 

The model contains 300-dimensional vectors for 3 million words and phrases and is about 3.5 GB in size.

For this notebook, we use a small subset of 243 words, which were selected beforehand and are stored in the pickle file `data/embeddings.pkl`.

Besides some sample words, it contains mostly capitals and countries. We will use the embeddings to find analogies between words.

In [None]:
embeddings = WordEmbeddings()
embeddings.load_embeddings("data/embeddings.pkl")

Now that the model is loaded, we can take a look at the word representations. 

We can see that these word embeddings are 300-dimensional vectors.

In our case, we only use a small dataset of 243 words.

In [None]:
print(f"number of features: {len(embeddings.get_embeddings('queen'))}")
print(f"number of words: {len(embeddings._embeddings.keys())}")
print(embeddings.get_embeddings("queen"))

## Operating on word embeddings

Word embeddings are learned by machine learning models and typically serve as part of the input for further models.

Word embeddings are multidimensional arrays, usually with hundreds of dimensions, which makes them hard to interpret directly.

We can try to visually inspect the embeddings of some words by plotting just two of their 300 dimensions.

In [None]:
import matplotlib.pyplot as plt

words = [
    "oil",
    "gas",
    "happy",
    "sad",
    "city",
    "town",
    "village",
    "country",
    "continent",
    "petroleum",
    "joyful",
]

# Convert each word to its vector representation
vectors_2d = np.array([embeddings.get_embeddings(word) for word in words])

fig, ax = plt.subplots(figsize=(10, 10))

# Select a column for the x and y axes
x_axis = 3
y_axis = 2

# Plot an arrow for each word
for word in vectors_2d:
    ax.arrow(
        0,
        0,
        word[x_axis],
        word[y_axis],
        head_width=0.005,
        head_length=0.005,
        fc="r",
        ec="r",
        width=1e-5,
    )

# Plot a dot for each word
ax.scatter(vectors_2d[:, x_axis], vectors_2d[:, y_axis])

# Add the word label over each dot in the scatter plot
for i in range(0, len(words)):
    ax.annotate(words[i], (vectors_2d[i, x_axis], vectors_2d[i, y_axis]))


plt.show()

Note that similar words like 'village' and 'town' or 'petroleum', 'oil', and 'gas' tend to point in the same direction. 

Also, note that 'sad' and 'happy' look close to each other; however, their vectors point in opposite directions.
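
The difference between "close together" and "pointing in the same direction" can be made concrete with two measures: Euclidean distance and cosine similarity. The following is a minimal sketch using toy 2-D vectors that are invented for illustration; they are not taken from the real embeddings, and `cosine_similarity` is a helper defined here, not part of `htwgnlp`.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v (1 = same direction, -1 = opposite)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy 2-D vectors, invented for illustration (not the real embedding values)
toy_town = np.array([0.20, 0.10])
toy_village = np.array([0.40, 0.22])
toy_happy = np.array([0.05, 0.03])
toy_sad = np.array([-0.05, -0.02])

# Direction: cosine similarity ignores vector length
sim_town_village = cosine_similarity(toy_town, toy_village)
sim_happy_sad = cosine_similarity(toy_happy, toy_sad)

# Position: Euclidean distance between the vector tips
dist_town_village = float(np.linalg.norm(toy_town - toy_village))
dist_happy_sad = float(np.linalg.norm(toy_happy - toy_sad))

print(f"town/village: cosine = {sim_town_village:.3f}, distance = {dist_town_village:.3f}")
print(f"happy/sad:    cosine = {sim_happy_sad:.3f}, distance = {dist_happy_sad:.3f}")
```

In this toy setup, 'happy' and 'sad' are closer to each other than 'town' and 'village' in terms of distance, yet their cosine similarity is strongly negative because the vectors point in opposite directions.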

## Word distance

Now we plot the words 'sad', 'happy', 'town', and 'village' and display the vector from 'village' to 'town' and the vector from 'sad' to 'happy'.

In [None]:
words = ["sad", "happy", "town", "village"]

# Convert each word to its vector representation
vectors_2d = np.array([embeddings.get_embeddings(word) for word in words])

fig, ax = plt.subplots(figsize=(10, 10))

# Select a column for the x and y axes
x_axis = 3
y_axis = 2

# Plot an arrow for each word
for word in vectors_2d:
    ax.arrow(
        0,
        0,
        word[x_axis],
        word[y_axis],
        head_width=0.0005,
        head_length=0.0005,
        fc="r",
        ec="r",
        width=1e-5,
    )

# plot the vector difference between village and town
village = embeddings.get_embeddings("village")
town = embeddings.get_embeddings("town")
diff = town - village
ax.arrow(
    village[x_axis],
    village[y_axis],
    diff[x_axis],
    diff[y_axis],
    fc="b",
    ec="b",
    width=1e-5,
)

# plot the vector difference between sad and happy
sad = embeddings.get_embeddings("sad")
happy = embeddings.get_embeddings("happy")
diff = happy - sad
ax.arrow(
    sad[x_axis], sad[y_axis], diff[x_axis], diff[y_axis], fc="b", ec="b", width=1e-5
)

# Plot a dot for each word
ax.scatter(vectors_2d[:, x_axis], vectors_2d[:, y_axis])

# Add the word label over each dot in the scatter plot
for i in range(0, len(words)):
    ax.annotate(words[i], (vectors_2d[i, x_axis], vectors_2d[i, y_axis]))


plt.show()

## Predicting capitals

Now, by applying vector addition or subtraction, one can create a vector representation for a new word. For example, we can say that the vector difference between 'France' and 'Paris' represents the concept of the capital of a country.

We can move from the city of Madrid in the direction of this concept and obtain something close to the country of which Madrid is the capital.

For this, recall how vector subtraction works:

![Vector subtraction](https://upload.wikimedia.org/wikipedia/commons/thumb/2/24/Vector_subtraction.svg/206px-Vector_subtraction.svg.png)
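
Before using the real embeddings, the arithmetic can be traced by hand on a small example. This is a sketch with toy 2-D vectors that are made up for illustration (the dictionary `toy` and its values are assumptions, not real Word2Vec values).

```python
import numpy as np

# Toy 2-D vectors, invented for illustration (not the real Word2Vec values)
toy = {
    "Paris": np.array([1.0, 1.0]),
    "France": np.array([1.0, 3.0]),
    "Madrid": np.array([2.0, 1.2]),
    "Spain": np.array([2.1, 3.1]),
}

# Direction that leads from a capital to its country
toy_capital_to_country = toy["France"] - toy["Paris"]

# Apply the same direction to another capital
toy_predicted = toy["Madrid"] + toy_capital_to_country

# The prediction lands near 'Spain', but not exactly on it
toy_error = float(np.linalg.norm(toy_predicted - toy["Spain"]))

print(f"capital -> country direction: {toy_capital_to_country}")
print(f"predicted country vector:     {toy_predicted}")
print(f"distance to 'Spain':          {toy_error:.3f}")
```

The small remaining distance is typical: the analogy gives a point in the neighborhood of the target word rather than the target vector itself.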

In [None]:
capital_to_country = embeddings.get_embeddings("France") - embeddings.get_embeddings(
    "Paris"
)

print(capital_to_country)

In [None]:
predicted_country = embeddings.get_embeddings("Madrid") + capital_to_country
print(predicted_country)

We can observe that we do not end up exactly in the corresponding country.

In [None]:
diff = predicted_country - embeddings.get_embeddings("Spain")
print(diff)

So, we have to look for the word in the embeddings whose vector is closest to the predicted country.

If the word embeddings work as expected, the most similar word should be 'Spain'.
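
To illustrate what such a lookup involves, here is a minimal sketch of a nearest-neighbor search over a dictionary of vectors. The helper `closest_word` and the toy vectors are hypothetical and written only for this illustration; the actual `find_closest_word` implementation in `htwgnlp` may differ.

```python
import numpy as np


def closest_word(
    vectors: dict[str, np.ndarray], query: np.ndarray, metric: str = "euclidean"
) -> str:
    """Return the word whose vector is nearest to `query` (hypothetical helper)."""
    words = list(vectors)
    matrix = np.stack([vectors[w] for w in words])  # shape: (n_words, n_dims)
    if metric == "euclidean":
        # Smaller distance means more similar, so negate to use argmax below
        scores = -np.linalg.norm(matrix - query, axis=1)
    elif metric == "cosine":
        # Larger cosine similarity means more similar
        scores = (matrix @ query) / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
        )
    else:
        raise ValueError(f"unknown metric: {metric}")
    return words[int(np.argmax(scores))]


# Toy 2-D vectors, invented for illustration (not the real embedding values)
toy_vectors = {
    "Spain": np.array([2.1, 3.1]),
    "France": np.array([1.0, 3.0]),
    "Madrid": np.array([2.0, 1.2]),
}
toy_query = np.array([2.0, 3.2])

closest_euclidean = closest_word(toy_vectors, toy_query, metric="euclidean")
closest_cosine = closest_word(toy_vectors, toy_query, metric="cosine")
print(closest_euclidean, closest_cosine)
```

Stacking all vectors into one matrix lets NumPy compute every distance or similarity in a single vectorized operation instead of a Python loop.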

In [None]:
embeddings.find_closest_word(predicted_country, metric="euclidean")

In [None]:
embeddings.euclidean_distance(predicted_country).shape

Let's see if cosine similarity also works as expected.

In [None]:
embeddings.find_closest_word(predicted_country, metric="cosine")

## Predicting other countries

Let's play around a little bit and see if we also end up in Spain when we derive the capital-to-country direction from another country, here Italy and Rome.

In [None]:
embeddings.find_closest_word(
    embeddings.get_embeddings("Italy")
    - embeddings.get_embeddings("Rome")
    + embeddings.get_embeddings("Madrid")
)

Now let's try to predict the countries of some other capitals.

In [None]:
country_of_berlin = embeddings.get_embeddings("Berlin") + capital_to_country
country_of_beijing = embeddings.get_embeddings("Beijing") + capital_to_country

print(f"Berlin is the capital of: {embeddings.find_closest_word(country_of_berlin)}")
print(f"Beijing is the capital of: {embeddings.find_closest_word(country_of_beijing)}")

Now let's test the prediction with cosine similarity.

In [None]:
print(
    f"Berlin is the capital of: {embeddings.find_closest_word(country_of_berlin, metric='cosine')}"
)
print(
    f"Beijing is the capital of: {embeddings.find_closest_word(country_of_beijing, metric='cosine')}"
)

Now let's use the `get_most_similar_words` function and see what happens.

In [None]:
embeddings.get_most_similar_words("Spain")

In [None]:
word = "Spain"
print(
    f"Most similar words to '{word}' by euclidean: {embeddings.get_most_similar_words(word, metric='euclidean')}"
)
print(
    f"Most similar words to '{word}' by cosine: {embeddings.get_most_similar_words(word, metric='cosine')}"
)

Note that Spain itself is not returned, as we want to find similar words, not the same word.
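
One way to get this behavior is to drop the query word from the candidates before ranking. The following sketch shows the idea with a hypothetical helper `most_similar` and toy vectors invented for illustration; it is not the `htwgnlp` implementation of `get_most_similar_words`, which may work differently.

```python
import numpy as np


def most_similar(vectors: dict[str, np.ndarray], word: str, n: int = 3) -> list[str]:
    """Return the n nearest words by Euclidean distance, skipping `word` itself."""
    query = vectors[word]
    # Exclude the query word so it cannot be its own nearest neighbor
    candidates = [w for w in vectors if w != word]
    distances = np.array([np.linalg.norm(vectors[w] - query) for w in candidates])
    # argsort returns indices ordered from smallest to largest distance
    order = np.argsort(distances)[:n]
    return [candidates[i] for i in order]


# Toy 2-D vectors, invented for illustration (not the real embedding values)
toy_countries = {
    "Spain": np.array([2.1, 3.1]),
    "France": np.array([1.0, 3.0]),
    "Italy": np.array([2.5, 3.4]),
    "happy": np.array([-3.0, 0.5]),
}

toy_neighbors = most_similar(toy_countries, "Spain", n=2)
print(toy_neighbors)
```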

In [None]:
word = "Berlin"
print(
    f"Most similar words to '{word}' by euclidean: {embeddings.get_most_similar_words(word, metric='euclidean')}"
)
print(
    f"Most similar words to '{word}' by cosine: {embeddings.get_most_similar_words(word, metric='cosine')}"
)

In [None]:
word = "happy"
print(
    f"Most similar words to '{word}' by euclidean: {embeddings.get_most_similar_words(word, metric='euclidean')}"
)
print(
    f"Most similar words to '{word}' by cosine: {embeddings.get_most_similar_words(word, metric='cosine')}"
)

## Conclusion

If we have word embeddings available, simple vector operations such as addition, subtraction, and distance or similarity measures are enough to find relationships between words, for example the relation between a capital and its country.