<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/02a_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to embedding manipulation with `gensim`

To work with word embeddings, we will use the `gensim` package. Because it is not installed in Colab by default, we first install it with the package manager `pip`. The leading `!` tells the notebook to run the line as a shell command rather than as Python code.

In [None]:
!pip install gensim # restart after installation

## Word Embeddings from Pre-Trained Models

In [None]:
## we load a 100-dimensional GloVe model trained on Wikipedia data
import gensim.downloader as api
wv = api.load('glove-wiki-gigaword-100') # small model so we don't have to wait too long...

In [None]:
wv['ziti'] # what do our embeddings look like?

### Get most similar words

In [None]:
## so what are 'ziti' according to our model? Let's check the most similar embeddings:
print(wv.most_similar(positive=['ziti'], topn=5))

In [None]:
## you can assess the similarity to other words with the `similarity` function (we'll cover what this score is later today)
wv.similarity('ziti', 'penne')

In [None]:
# Whereas
wv.similarity('ziti', 'banana')

In [None]:
# and
wv.similarity('ziti', 'car')
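
As a short preview of that score: `similarity` returns the cosine similarity between two word vectors, i.e. the cosine of the angle between them. The sketch below computes it by hand on three toy vectors. The vectors and their names (`toy_ziti`, `toy_penne`, `toy_car`) are invented for illustration and are not taken from the GloVe model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration (not real embeddings)
toy_ziti = [0.9, 0.1, 0.0]
toy_penne = [0.8, 0.2, 0.1]
toy_car = [0.0, 0.1, 0.9]

print(cosine_similarity(toy_ziti, toy_penne))  # close to 1: similar direction
print(cosine_similarity(toy_ziti, toy_car))    # close to 0: nearly orthogonal
```

Because only the direction of the vectors matters, scaling a vector by a positive constant leaves its cosine similarity to any other vector unchanged.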

In [None]:
## You can calculate with these embeddings:
wv_london = wv['paris'] - wv['france'] + wv['england']
print(wv.most_similar(positive=[wv_london], topn=5))

In [None]:
## Though it does not always work perfectly
wv_queen = wv['king'] + wv['woman'] - wv['man']
print(wv.most_similar(positive=[wv_queen], topn=5))

In [None]:
## you can also use the inbuilt function to get weighted averages
print(wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=5))
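
The manual calculation and the built-in call can rank words differently. As far as we can tell from the gensim source, `most_similar` removes words that were passed by name from its results, whereas a precomputed vector carries no name to exclude, so 'king' itself can come out on top. The sketch below reproduces this effect on toy vectors. The dictionary `toy` and the helper `nearest` are hypothetical names made up for this illustration.

```python
import numpy as np

# Toy vectors invented for illustration (not taken from the GloVe model)
toy = {
    "king":   np.array([1.0, 0.0, 0.2]),
    "queen":  np.array([0.8, 0.6, 0.2]),
    "prince": np.array([0.9, -0.1, 0.3]),
    "man":    np.array([0.0, 0.0, 1.0]),
    "woman":  np.array([0.0, 0.3, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two numpy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(target, exclude=()):
    """Rank the toy words by cosine similarity to `target`, skipping `exclude`."""
    scores = {w: cosine(target, v) for w, v in toy.items() if w not in exclude}
    return sorted(scores, key=scores.get, reverse=True)

target = toy["king"] + toy["woman"] - toy["man"]

print(nearest(target))                                    # input words allowed
print(nearest(target, exclude=("king", "woman", "man")))  # input words skipped
```

In this toy space the computed vector stays closer to 'king' than to 'queen', so 'queen' only ranks first once the input words are excluded.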

In [None]:
## other cute functions
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))
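
One way to think about `doesnt_match`: average the length-normalised vectors of all words and return the word that is least similar to that average. The following is a minimal sketch of this idea on toy vectors, not gensim's actual implementation. `toy` and `odd_one_out` are hypothetical names.

```python
import numpy as np

# Toy 2-dimensional vectors invented for illustration
toy = {
    "fire":  np.array([0.9, 0.1]),
    "water": np.array([0.8, 0.3]),
    "air":   np.array([0.7, 0.2]),
    "car":   np.array([0.1, 0.9]),
}

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    similarities = unit @ mean  # cosine similarity of each word to the mean
    return words[int(np.argmin(similarities))]

print(odd_one_out(["fire", "water", "air", "car"], toy))
```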

## Training your own Model

We'll use an adapted dataset of dialogue from The Simpsons, originally published on [Kaggle](https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset?resource=download&select=simpsons_script_lines.csv).

![](https://media.giphy.com/media/v1.Y2lkPWVjZjA1ZTQ3cHA1dDZ5MWUwOWIyMmd3dHk3MGNyNGdvamEzc2w2dzVjdzdvMW5wOCZlcD12MV9naWZzX3NlYXJjaCZjdD1n/tkYpAbKdWj4TS/giphy.gif)

In [None]:
## load data
import pandas as pd
dataset = pd.read_csv('https://www.dropbox.com/scl/fi/n5ffxvm4qyjkp8ws7qgoq/simpsons_script_lines_clean.csv?rlkey=gfliitwgi8cqsjxlcdmmdwtym&dl=1')
dataset.head()

In [None]:
## clean the texts (very rough here)
dataset['cleaned_text'] = dataset['spoken_words'].str.replace('[^a-zA-Z ]','', regex=True) # remove anything that is not a letter or whitespace
dataset['cleaned_text'] = dataset['cleaned_text'].str.lower() # lowercase
dataset['cleaned_text'] = dataset['cleaned_text'].str.replace(' +', ' ', regex=True) # remove multiple whitespaces
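
To see what these three steps do to a single line, here is the same logic applied to one string with Python's `re` module. The helper `clean` and the sample sentence are invented for illustration.

```python
import re

def clean(text):
    """Mirror the three cleaning steps above on a single string."""
    text = re.sub(r'[^a-zA-Z ]', '', text)  # drop everything except letters and spaces
    text = text.lower()                     # lowercase
    text = re.sub(r' +', ' ', text)         # collapse runs of spaces
    return text

sample = "D'oh! I can't find 2 donuts..."
print(clean(sample))
```

Note the side effects of such rough cleaning: apostrophes disappear, so "can't" becomes "cant", and digits are removed entirely.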

In [None]:
dataset.cleaned_text.head()

In [None]:
## filter empty rows and enforce string
dataset = dataset[dataset['cleaned_text'] != '']
dataset.loc[:,'cleaned_text'] = dataset['cleaned_text'].astype(str)

In [None]:
## how many texts are left?
dataset.shape

In [None]:
## create a list of lists of words by splitting along whitespaces
sentences = [s.split(" ") for s in dataset['cleaned_text']]

In [None]:
sentences[1] # the pre-processing is not perfect
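
One possible source of such imperfections, assuming some cleaned lines begin or end with a space: `split(" ")` keeps empty strings at those positions, while `split()` without an argument discards them. The example line below is invented.

```python
line = " oh yeah "  # invented example with a leading and trailing space

tokens_with_blanks = line.split(" ")  # splits on every single space
tokens_clean = line.split()           # splits on any whitespace run, drops empties

print(tokens_with_blanks)  # ['', 'oh', 'yeah', '']
print(tokens_clean)        # ['oh', 'yeah']
```

Empty tokens that occur often enough would pass the `min_count` threshold and end up in the vocabulary as a "word" of their own.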

In [None]:
## import Word2Vec (for larger datasets, stream the corpus with an iterator: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#training-your-own-model)
from gensim.models import Word2Vec

## estimate a model and save
model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # number of dimensions of word embeddings
    window=5,         # number of context words in either direction to include
    min_count=5,      # minimum number of occurrences for a word to enter the vocabulary
    workers=4         # number of worker threads used to fit the model
    )
model.save("simpsons.w2v.model")
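
For corpora that do not fit into memory, the sentences can be streamed from disk instead of being held in a list. The sketch below shows one way to do this with a small restartable iterable. The class name `LineCorpus` and the temporary demo file are assumptions made for this example. It expects one whitespace-tokenised text per line.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable that yields one tokenised line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be read repeatedly
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens

# Demo on a small temporary file (a stand-in for a corpus too large for memory)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("hello world\n\nhow are you\n")
    demo_path = tmp.name

corpus = LineCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass works because __iter__ reopens the file
os.remove(demo_path)

print(first_pass)
```

An object like this can be passed as `sentences=` in place of the list. A class is used rather than a plain generator because training reads the corpus more than once (to build the vocabulary and again for each epoch), and a generator is exhausted after a single pass.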

In [None]:
## what is the main output?
model.wv.vectors.shape
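
The first number in this shape is the vocabulary size and the second is `vector_size`. The vocabulary is smaller than the number of distinct tokens in the data because of `min_count`. The sketch below illustrates that filtering step on three made-up sentences. `build_vocab` is a hypothetical helper, not a gensim function.

```python
from collections import Counter

# Made-up sentences for illustration
toy_sentences = [
    ["homer", "eats", "donuts"],
    ["homer", "likes", "donuts"],
    ["bart", "likes", "skateboards"],
]

def build_vocab(sentences, min_count):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return sorted(word for word, count in counts.items() if count >= min_count)

vocab = build_vocab(toy_sentences, min_count=2)
print(vocab)       # ['donuts', 'homer', 'likes']
print(len(vocab))  # 3: this would be the number of rows in the embedding matrix
```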

In [None]:
## assess model
model.wv.most_similar('homer', topn=5)

In [None]:
## subset words of interest (note: `wv` is the pre-trained GloVe model loaded earlier, not the Simpsons model)
interesting_words = [
    'banana', 'pineapple', 'mango',
    'car', 'bike', 'motorcycle',
    'bart', 'lisa', 'homer']
interesting_vecs = wv[interesting_words]

In [None]:
## dimensionality reduction using PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(interesting_vecs)
wv_2d = pca.transform(interesting_vecs)
wv_2d = pd.DataFrame(wv_2d, index = interesting_words)

In [None]:
wv_2d
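
To make the PCA step less of a black box, here is a minimal sketch of the same idea using only `numpy`: centre the data, find the directions of largest variance with a singular value decomposition, and project onto the first two. The random matrix `toy_vecs` is a stand-in for real word vectors, and the resulting coordinates may differ in sign from those of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vecs = rng.normal(size=(9, 100))  # random stand-in for 9 word vectors of length 100

# 1) centre each dimension around zero
centered = toy_vecs - toy_vecs.mean(axis=0)

# 2) the rows of `components` are the principal directions, ordered by variance
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# 3) project onto the first two directions
projected = centered @ components[:2].T

# share of the total variance kept by the two components
explained = (singular_values[:2] ** 2) / np.sum(singular_values ** 2)

print(projected.shape)
print(explained.sum())
```

With scikit-learn, the corresponding share of variance is available as `pca.explained_variance_ratio_` after fitting. A low value is a reminder that a two-dimensional plot discards much of the information in the original vectors.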

In [None]:
import matplotlib.pyplot as plt

plt.scatter(wv_2d[0], wv_2d[1])

for i in wv_2d.index:
    plt.annotate(i, (wv_2d[0][i], wv_2d[1][i]))

plt.show()

In [None]:
## subset words of interest
interesting_words = [
    'paris', 'berlin', 'france', 'germany']
interesting_vecs = wv[interesting_words]

In [None]:
## dimensionality reduction
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(interesting_vecs)
wv_2d = pca.transform(interesting_vecs)
wv_2d = pd.DataFrame(wv_2d, index = interesting_words)

In [None]:
import matplotlib.pyplot as plt

plt.scatter(wv_2d[0], wv_2d[1])

for i in wv_2d.index:
    plt.annotate(i, (wv_2d[0][i], wv_2d[1][i]))

plt.show()

## Exercise

1) Calculate the word vector for Berlin from the vectors for 'paris', 'france', and 'germany'. Explain your reasoning.

In [None]:
## your answer

2) Assess the most similar words to this vector.

In [None]:
## your answer

3) Plot the calculated vector into the same vector space alongside paris, france, germany, and berlin using PCA.

In [None]:
## your answer

4) We have trained these embeddings on dialogue from The Simpsons, a popular cartoon series. What consequences might this choice of data have for the word embeddings? How might they compare to embeddings trained on Wikipedia data?

**Your answer**


5) Try to find an interesting bias in the data by comparing word similarities.

In [None]:
## your answer