<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/02b_embeddings_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scaling Word Embeddings & Document Embeddings

In [None]:
!pip install gensim # restart after installation

In [None]:
import numpy as np

## Calculating Projections and Cosine Similarity

In [None]:
## again we load the 100-dimensional GloVe model trained on Wikipedia data
import gensim.downloader as api
wv = api.load('glove-wiki-gigaword-100')

We can calculate the magnitude of the projection of a vector a onto a vector b (the scalar projection) by dividing their dot product by the magnitude of b:

In [None]:
## calculate the projection magnitude (dot product divided by the length of b)
from numpy import dot
from numpy.linalg import norm

def proj_mag(a, b):
    return dot(a, b) / norm(b)

proj_mag(wv['nurse'], wv['woman'])
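To see what `proj_mag` returns, it can help to try it on vectors small enough to check by hand. The following is a minimal sketch with made-up 2D vectors (they are not taken from the GloVe model):

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def proj_mag(a, b):
    # scalar projection: signed length of a's shadow on the direction of b
    return dot(a, b) / norm(b)

# toy vectors, chosen for illustration only
a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])  # points along the x-axis

print(proj_mag(a, b))       # 3.0: the x-component of a
print(proj_mag(a, 10 * b))  # 3.0: the length of b does not matter
print(proj_mag(2 * a, b))   # 6.0: but the length of a does
print(proj_mag(-a, b))      # -3.0: negative when a points away from b
```

The result depends on the length of the projected vector, which is why raw projection magnitudes are hard to compare across words.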

But cosine similarity is a lot easier to interpret, since it is bounded between -1 and 1 and does not depend on the length of either vector:

In [None]:
def cos_sim(a, b):
    return dot(a, b)/(norm(a)*norm(b))

cos_sim(wv['nurse'], wv['woman'])

As you have seen in the first tutorial, there is also a function in the `gensim`-package to calculate the cosine similarity:

In [None]:
wv.similarity('nurse', 'woman')
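The two measures are closely related: the cosine similarity is the projection magnitude divided by the length of the projected vector. A small sketch with made-up 2D vectors (not real embeddings) shows this, along with the two bounds:

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def proj_mag(a, b):
    return dot(a, b) / norm(b)

def cos_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# toy vectors, chosen for illustration only
a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])

print(cos_sim(a, b))                  # 0.6
print(proj_mag(a, b) / norm(a))       # 0.6: projection magnitude divided by the length of a
print(cos_sim(5 * a, b))              # 0.6: rescaling a leaves the cosine unchanged
print(cos_sim(a, a), cos_sim(a, -a))  # 1.0 -1.0: the two bounds
```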

## Defining Scales

The association between two words alone might not tell us much, especially because opposite concepts are often located close to each other in the vector space. Rather than the association of two words, we are often interested in the relative position of a concept on a pre-defined scale. We can create such a scale by subtracting the vector of the negative pole from the vector of the positive pole. You have already seen this idea when we calculated the word vector for 'berlin' from the vectors for 'paris', 'france', and 'germany'. Try to imagine this subtraction in the vector space.

By calculating the magnitude of the projection of a given word onto this axis, we can identify the position of this word on the scale. This visualization from the Kozlowski paper might be helpful to understand this concept:

![](https://www.dropbox.com/scl/fi/fy9ihaxiwiql2c0s13vjy/projection_example.png?rlkey=aszlo5mp3rdzy4svawug5eybl&dl=1)

You can hence create pretty much any scale by defining two poles for it, subtracting the negative from the positive pole, and calculating the magnitude of the projection:

In [None]:
## simple scaling, projection magnitude
fr_de_scale = wv['french'] - wv['german']
print("Frenchness of Paris: ", proj_mag(wv['paris'], fr_de_scale))
print("Frenchness of Lausanne: ", proj_mag(wv['lausanne'], fr_de_scale))
print("Frenchness of Bern: ", proj_mag(wv['bern'], fr_de_scale))
print("Frenchness of Berlin: ", proj_mag(wv['berlin'], fr_de_scale))

As mentioned, the boundedness of the cosine similarity [-1,1] makes it a lot more interpretable:

In [None]:
## simple scaling, cosine similarity
fr_de_scale = wv['french'] - wv['german']
print("Frenchness of Paris: ", cos_sim(wv['paris'], fr_de_scale))
print("Frenchness of Lausanne: ", cos_sim(wv['lausanne'], fr_de_scale))
print("Frenchness of Bern: ", cos_sim(wv['bern'], fr_de_scale))
print("Frenchness of Berlin: ", cos_sim(wv['berlin'], fr_de_scale))
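A toy example with invented 2D vectors may make the geometry of such a scale easier to picture. The names `toy`, `pos_pole`, `neg_pole`, `word_a`, and `word_b` are placeholders for illustration, not entries of the GloVe model:

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# invented 2D "embeddings", for illustration only
toy = {
    'pos_pole': np.array([1.0, 1.0]),
    'neg_pole': np.array([-1.0, 1.0]),
    'word_a':   np.array([2.0, 0.5]),   # leans towards the positive pole
    'word_b':   np.array([-0.5, 2.0]),  # leans slightly towards the negative pole
}

# the scale is the vector pointing from the negative to the positive pole
toy_scale = toy['pos_pole'] - toy['neg_pole']

for word in ['word_a', 'word_b', 'pos_pole', 'neg_pole']:
    print(word, round(float(cos_sim(toy[word], toy_scale)), 3))
```

Note that the poles themselves do not score exactly 1 or -1 here: the cosine measures the direction of a vector relative to the axis, not its distance to a pole.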

In practice, it is better to create a scale from an averaged vector of multiple pole words (often referred to as the 'centroid'). To do so, we define our pole words and average across their vectors:

In [None]:
## start by defining dictionaries (usually larger than 3-4 words)
rich_words = ['rich', 'wealthy', 'wealth', 'billionaire']
poor_words = ['poor', 'broke', 'poverty', 'beggar']

female_words = ['female', 'woman', 'feminine']
male_words = ['male', 'man', 'masculine']

In [None]:
## pro tip: you can extend your dictionaries with embeddings!
wv.most_similar(positive=rich_words, topn=5)

In [None]:
## create mean vectors for your concepts of interest
rich_vec = np.mean(wv[rich_words], axis = 0)
poor_vec = np.mean(wv[poor_words], axis = 0)

female_vec = np.mean(wv[female_words], axis = 0)
male_vec = np.mean(wv[male_words], axis = 0)
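Indexing `wv` with a list raises a `KeyError` as soon as one dictionary term is missing from the vocabulary, which becomes more likely as dictionaries grow. One possible workaround is to skip missing terms. The sketch below uses a plain dictionary with invented numbers as a stand-in for `wv`, and `centroid` is a hypothetical helper name:

```python
import numpy as np

def centroid(words, vectors):
    """Average the vectors of all words found in `vectors`; report the rest."""
    found = [w for w in words if w in vectors]
    missing = [w for w in words if w not in vectors]
    if missing:
        print('Skipping out-of-vocabulary terms:', missing)
    if not found:
        raise ValueError('None of the words are in the vocabulary.')
    return np.mean([vectors[w] for w in found], axis=0)

# plain dict as a stand-in for `wv`; the numbers are invented
toy_vectors = {
    'rich':    np.array([1.0, 0.0]),
    'wealthy': np.array([0.8, 0.2]),
    'wealth':  np.array([0.6, 0.4]),
}

# 'billionaire' is missing from the toy vocabulary and gets skipped
toy_rich_vec = centroid(['rich', 'wealthy', 'wealth', 'billionaire'], toy_vectors)
print(toy_rich_vec)
```

The membership test `word in wv` also works on the `gensim` vectors, so the helper should carry over to the real model.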

Try to imagine what these vectors look like in the vector space, relative to the dictionary terms. These vectors can already be informative in themselves, given that they are designed to represent the essence of a concept:

In [None]:
cos_sim(wv['nurse'], female_vec)

In [None]:
cos_sim(wv['nurse'], male_vec)

Research often uses the difference between the cosine similarities to the two poles as a measure of association:

In [None]:
## cosine similarity difference
nurse_mf_cos = cos_sim(wv['nurse'], female_vec) - cos_sim(wv['nurse'], male_vec)
teacher_mf_cos = cos_sim(wv['teacher'], female_vec) - cos_sim(wv['teacher'], male_vec)
lifeguard_mf_cos = cos_sim(wv['lifeguard'], female_vec) - cos_sim(wv['lifeguard'], male_vec)
banker_mf_cos = cos_sim(wv['banker'], female_vec) - cos_sim(wv['banker'], male_vec)
miner_mf_cos = cos_sim(wv['miner'], female_vec) - cos_sim(wv['miner'], male_vec)

print("Association of 'nurse' with female (vs. male) terms: ", nurse_mf_cos)
print("Association of 'teacher' with female (vs. male) terms: ", teacher_mf_cos)
print("Association of 'lifeguard' with female (vs. male) terms: ", lifeguard_mf_cos)
print("Association of 'banker' with female (vs. male) terms: ", banker_mf_cos)
print("Association of 'miner' with female (vs. male) terms: ", miner_mf_cos)

However, the neat property of projections is that they can capture a vector's position on that very scale. To calculate them, we first need to subtract the poles from each other to create the axes:

In [None]:
## define axes
gender_axis = female_vec - male_vec
rich_axis = rich_vec - poor_vec

... and calculate the cosine similarities:

In [None]:
import numpy as np
import pandas as pd

scaled_occupations = pd.DataFrame(
    [['Nurse', 'Teacher', 'Lifeguard', 'Banker', 'Miner'],
    [
        cos_sim(wv['nurse'], gender_axis),
        cos_sim(wv['teacher'], gender_axis),
        cos_sim(wv['lifeguard'], gender_axis),
        cos_sim(wv['banker'], gender_axis),
        cos_sim(wv['miner'], gender_axis)
    ],
    [
        cos_sim(wv['nurse'], rich_axis),
        cos_sim(wv['teacher'], rich_axis),
        cos_sim(wv['lifeguard'], rich_axis),
        cos_sim(wv['banker'], rich_axis),
        cos_sim(wv['miner'], rich_axis)
    ]]
).T.rename(columns={0:'Occupation', 1:'Gender score', 2:'Economic Score'})

In [None]:
scaled_occupations
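Writing out one `cos_sim` call per word and axis gets tedious as the word list grows. A loop can build the same rows. This sketch uses invented 3D vectors and axes in place of `wv`, `gender_axis`, and `rich_axis`; `score_words` is a hypothetical helper name:

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def score_words(words, vectors, axes):
    """Return one row per word with its cosine similarity to every axis."""
    rows = []
    for word in words:
        row = {'Occupation': word.capitalize()}
        for name, axis in axes.items():
            row[name] = float(cos_sim(vectors[word], axis))
        rows.append(row)
    return rows

# invented stand-ins for the embeddings and the two axes
toy_vectors = {
    'nurse':  np.array([0.9, -0.2, 0.1]),
    'banker': np.array([-0.3, 0.8, 0.2]),
}
toy_axes = {
    'Gender score':   np.array([1.0, 0.0, 0.0]),
    'Economic Score': np.array([0.0, 1.0, 0.0]),
}

toy_rows = score_words(['nurse', 'banker'], toy_vectors, toy_axes)
for row in toy_rows:
    print(row)
```

Passing such a list of rows to `pd.DataFrame` yields a table with numeric score columns, without the need for a transpose step.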

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(scaled_occupations['Gender score'], scaled_occupations['Economic Score'])

# Add labels and title
plt.xlabel('Gender Score (Higher = More Female)')
plt.ylabel('Economic Score (Higher = Richer)')
plt.title('Occupations Scaled by Gender and Status')

# Add text labels for each point
for i, row in scaled_occupations.iterrows():
    plt.text(row['Gender score'], row['Economic Score'], row['Occupation'])

plt.grid(True)
plt.show()

In practice, both approaches (differences in cosine similarities and the cosine similarity relative to a scale) lead to highly correlated scores:

In [None]:
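scaled_occupations['cos_dif_gender'] = [nurse_mf_cos, teacher_mf_cos, lifeguard_mf_cos, banker_mf_cos, miner_mf_cos]
## the transposed table stores scores as generic objects, so convert to float before correlating
scaled_occupations[['Gender score', 'cos_dif_gender']].astype(float).corr()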
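The close agreement is not a coincidence. Because the dot product is linear, the numerator of the cosine with the axis equals the difference of the two dot products with the poles. If both pole vectors have the same length, the two scores differ only by a constant factor. A sketch with random vectors (stand-ins, not real embeddings) illustrates this:

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

rng = np.random.default_rng(0)

# random stand-ins for the two pole centroids, rescaled to the same length
pole_pos = rng.normal(size=50)
pole_neg = rng.normal(size=50)
pole_pos = pole_pos / norm(pole_pos)
pole_neg = pole_neg / norm(pole_neg)
toy_axis = pole_pos - pole_neg

# random stand-ins for the word vectors to be scaled
toy_words = rng.normal(size=(5, 50))

cos_dif = np.array([cos_sim(w, pole_pos) - cos_sim(w, pole_neg) for w in toy_words])
cos_axis = np.array([cos_sim(w, toy_axis) for w in toy_words])

# with equal-length poles the ratio is the same constant for every word
print(cos_axis / cos_dif)
print(1 / norm(toy_axis))
```

With real centroids the pole lengths differ somewhat, so the relationship is only approximately proportional.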

## Document Embeddings

We'll return to the adapted dataset of dialogue from The Simpsons from [Kaggle](https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset?resource=download&select=simpsons_script_lines.csv). This time, we estimate a model with document embeddings, where each document is tagged with the character who spoke the lines. This allows us to create semantic vectors for each character in the same space as the word embeddings.

Again, we load and clean our texts:

In [None]:
## load data
import pandas as pd
dataset = pd.read_csv('https://www.dropbox.com/scl/fi/n5ffxvm4qyjkp8ws7qgoq/simpsons_script_lines_clean.csv?rlkey=gfliitwgi8cqsjxlcdmmdwtym&dl=1')
dataset['cleaned_text'] = dataset['spoken_words'].str.replace('[^a-zA-Z ]','', regex=True) # remove anything that is not a letter or whitespace
dataset['cleaned_text'] = dataset['cleaned_text'].str.lower() # lowercase
dataset['cleaned_text'] = dataset['cleaned_text'].str.replace(' +', ' ', regex=True).str.strip() # collapse repeated whitespace into a single space
dataset = dataset[dataset['cleaned_text'] != '']
dataset.loc[:,'cleaned_text'] = dataset['cleaned_text'].astype(str)
dataset.cleaned_text.head()

We then need to prepare this data for the Doc2Vec model. Remember that, alongside the vectors of the context words, a vector indicating the document ID is passed to the model. We concatenate all texts per speaker in order to create a single 'document' for each speaker. Then, we split those documents into tokens (in our case words, as in the word2vec model) and use `gensim`'s `TaggedDocument` class to pass the information about the speaker:

In [None]:
## concatenate all cleaned texts from the same speaker and tag documents (this runs a bit)
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
character_texts = list()
for speaker in dataset.raw_character_text.unique():
    print('Preparing texts from ', speaker, '...')
    concat_text = ' '.join(dataset[dataset.raw_character_text == speaker].cleaned_text)
    character_texts.append(TaggedDocument(concat_text.split(" "), [speaker]))
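Filtering the full table once per speaker is what makes this step slow. A single pass over (speaker, text) pairs can collect the same token lists. The sketch below runs on invented lines; `ToyTaggedDocument` is a stand-in with the same two fields (`words`, `tags`) as `TaggedDocument`, and `tag_by_speaker` is a hypothetical helper name:

```python
from collections import defaultdict, namedtuple

# stand-in with the same two fields as gensim's TaggedDocument
ToyTaggedDocument = namedtuple('ToyTaggedDocument', 'words tags')

def tag_by_speaker(pairs):
    """Collect the tokens of every (speaker, text) pair under its speaker."""
    tokens = defaultdict(list)
    for speaker, text in pairs:
        # split() without arguments also swallows repeated whitespace
        tokens[speaker].extend(text.split())
    return [ToyTaggedDocument(words, [speaker]) for speaker, words in tokens.items()]

# invented lines, for illustration only
toy_lines = [
    ('Speaker A', 'this is a line'),
    ('Speaker B', 'another  line'),
    ('Speaker A', 'and one more'),
]

toy_docs = tag_by_speaker(toy_lines)
for doc in toy_docs:
    print(doc.tags, doc.words)
```

On the real data, the pairs could come from `zip(dataset.raw_character_text, dataset.cleaned_text)`, with `TaggedDocument` swapped in for the stand-in.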

Each document then contains all words uttered by a given character, as well as an indicator for the character:

In [None]:
character_texts[1100]

Time to define our model and train it on these documents!

In [None]:
model = Doc2Vec(vector_size=300, min_count=2, epochs=40)

In [None]:
## estimate doc2vec model (this takes some patience)
model.build_vocab(character_texts)
model.train(character_texts, total_examples=model.corpus_count, epochs=model.epochs)

This results in a model with both word and document embeddings. The document embeddings can be accessed via `model.dv`, the word embeddings via `model.wv`:

In [None]:
model.dv['Homer Simpson'].shape

In [None]:
model.wv['donut'].shape

Crucially, the document embeddings are learned in the same vector space as the word embeddings. This means we can assess which words are associated with a character, that is, which words lie close to the learned vector of that character. Note that these words are predictive of the words used by the character; the character might not even use them themselves.

In [None]:
model.wv.most_similar(model.dv['Homer Simpson'], topn=5)

In [None]:
model.wv.most_similar(model.dv['Lisa Simpson'], topn=5)

Of course, we can also plot the positions of different characters in our vector space:

In [None]:
## run a dimensionality reduction
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(model.dv[model.dv.index_to_key])
dv_2d = pca.transform(model.dv[model.dv.index_to_key])
dv_2d = pd.DataFrame(dv_2d, index = model.dv.index_to_key)

In [None]:
import matplotlib.pyplot as plt

## each point is one character, positioned on the first two principal components
plt.scatter(dv_2d[0], dv_2d[1])
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()
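What PCA does here can be sketched in a few lines of NumPy: center the vectors, take a singular value decomposition, and keep the coordinates along the first two directions. This is an illustration on random data, not the `sklearn` implementation, and the signs of the components may differ from what `PCA` returns:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_doc_vectors = rng.normal(size=(20, 300))  # 20 invented 'document vectors'

# 1. center every dimension at zero
centered = toy_doc_vectors - toy_doc_vectors.mean(axis=0)

# 2. the rows of vt are the principal directions, ordered by explained variance
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# 3. project the centered data onto the first two directions
toy_2d = centered @ vt[:2].T

print(toy_2d.shape)  # (20, 2)

# share of the total variance captured by each of the two axes
explained = s**2 / np.sum(s**2)
print(explained[:2])
```

If the two shares are small, the 2D plot hides most of the structure in the original space and should be read with caution.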

And scale them:

In [None]:
cool_boring_scale = model.wv['cool'] - model.wv['boring']
print("Coolness of Homer: ", cos_sim(model.dv['Homer Simpson'], cool_boring_scale))
print("Coolness of Bart: ", cos_sim(model.dv['Bart Simpson'], cool_boring_scale))

In [None]:
## ...or predict vectors for new documents...
new_document = [
    "it", "wont", "last", "brothers", "and", "sisters", "are", "natural",
    "enemies", "like", "englishmen", "and", "scots", "or", "welshmen", "and",
    "scots", "or", "japanese", "and", "scots", "or", "scots", "and", "other",
    "scots", "damn", "scots", "they", "ruined", "scotland"
]
vector = model.infer_vector(new_document)

In [None]:
## ...and compare them
model.dv.most_similar([vector], topn=5)

## Exercise

1) You are interested in the political leaning of the characters. How could you measure their political leaning, based on their language?

2) Define dictionaries to measure your concept of interest.

3) Create a scale (or if you are very eager, two). Select some characters of interest that you would like to scale. Calculate the cosine similarity of their vectors relative to the scale.

4) Create a table of the political leanings of your characters. What do you observe? Who do you think might vote Republican?

5) Estimate the political leaning of all characters and create a plot of their distribution.