# Embeddings

We will start by analysing embeddings: how machines represent words numerically whilst preserving meaning. First, we need to import some utility functions.

In [1]:
from embedding_utils import *


Now we can look at an example. Here is some simple code that embeds our words, then calculates a measure of similarity between them. Embeddings are vectors, so we can perform mathematical operations on them. The similarity measure here is <i>cosine similarity</i>, which measures the cosine of the angle between two vectors:

## $d(a, b) = \frac{a \cdot b}{\|a\|\|b\|}$

If it is close to 1, the angle is close to 0, meaning the vectors point in nearly the same direction. This is what <i>we want</i> for words that have similar meanings. Creating embedding models that actually do this is a) hard and b) subjective.
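To make the formula concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy. The function name `cosine_similarity` and the toy 2-D vectors are made up for illustration; they are not real embeddings and not part of `embedding_utils`.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 2-D vectors (invented for illustration, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # vectors at a right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])      # vectors pointing apart

print(same_direction, orthogonal, opposite)
```

Note that the length of the vectors does not matter, only their direction: scaling either vector leaves the result unchanged.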

In [2]:
doc = nlp("dog cat banana apple")

similarities = np.zeros((4, 4))

for i, token1 in enumerate(doc):
    for j, token2 in enumerate(doc):
        similarities[i, j] = token1.similarity(
            token2
        )  # computes the cosine similarity

pd.DataFrame(
    similarities,
    index=["Dog", "Cat", "Banana", "Apple"],
    columns=["Dog", "Cat", "Banana", "Apple"],
)

| | Dog | Cat | Banana | Apple |
|---|---|---|---|---|
| Dog | 1.0 | 0.822082 | 0.20909 | 0.22881 |
| Cat | 0.822082 | 1.0 | 0.223588 | 0.203681 |
| Banana | 0.20909 | 0.223588 | 1.0 | 0.66467 |
| Apple | 0.22881 | 0.203681 | 0.66467 | 1.0 |


These embeddings are very high dimensional (here, 300 dimensions!), making them hard to visualise. One tactic is to perform some <i>dimensionality reduction</i>: compressing the information into fewer dimensions whilst still preserving some of the original meaning. We first turn to PCA.

In [3]:
init_notebook_mode(connected=True)
words = ["cat", "meow", "dog", "woof", "bird", "tweet", "lion", "roar", "horse", "neigh", ]

pca = PCA()

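# Stack one 300-dimensional vector per word into a (10, 300) array
glove_vectors = np.concatenate(
    [nlp(word).vector.reshape(1, 300) for word in words]
)

# Fit PCA, then plot the first two principal components, labelled by word
glove_pca = pca.fit_transform(glove_vectors)
scatter(glove_pca[:, 0], glove_pca[:, 1], np.array(words), np.array(words))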
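For intuition about what PCA does to the vectors, here is a rough sketch of the same projection written with NumPy only. It assumes random stand-in data of the same shape (10 points in 300 dimensions) rather than the real word vectors, and it is not how `embedding_utils` or scikit-learn implement it internally; coordinates from a library PCA may also differ by a sign flip per axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 10 "words" in 300 dimensions (random, not real embeddings)
X = rng.normal(size=(10, 300))

# PCA step 1: centre each dimension on zero
X_centred = X - X.mean(axis=0)

# PCA step 2: the right singular vectors are the principal directions
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Project onto the first two principal directions to get 2-D coordinates
X_2d = X_centred @ Vt[:2].T

# Fraction of the total variance captured by each component
explained = S**2 / np.sum(S**2)

print(X_2d.shape)
print(explained[:2].sum())
```

The last number is worth checking on real data: it tells you how much of the original variance survives in the 2-D plot, and therefore how cautiously the plot should be read.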

## Exercises
1) Think about why these relationships aren't exact (here it isn't exactly king - man + woman = queen). I can think of two main reasons. As a hint for one of them: why might tweet be a bit of an outlier here? And, to a lesser extent, roar?
2) Play around yourself and find some relationships. Really try to think about what's going on here: what does each dimension represent?
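As a starting point for the exercises, the king - man + woman arithmetic can be sketched with toy vectors. Everything in this block is hypothetical: the 2-D vectors are invented so that one axis loosely stands for royalty and the other for gender, which real embeddings do not do so cleanly.

```python
import numpy as np

# Hypothetical 2-D "embeddings", invented for illustration only.
# Axis 0 loosely encodes royalty, axis 1 loosely encodes gender.
vectors = {
    "king": np.array([0.9, 0.8]),
    "man": np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "queen": np.array([0.95, -0.85]),
    "apple": np.array([-0.8, 0.1]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# The analogy: start at "king", remove "man", add "woman"
result = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank every word by cosine similarity to the result vector
ranked = sorted(vectors, key=lambda w: cosine(result, vectors[w]), reverse=True)
best = ranked[0]

print(result, best)
```

The result vector does not land exactly on the vector for "queen"; it is only closer to it than to any other word in the toy vocabulary, which is the sense in which such relationships hold in practice.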

As a note, investigating what it is that AI 'understands' is a hard problem, but there are ways to gain intuition. I suggest Semantle (https://semantle.com/). It uses a different embedding, word2vec, which is brilliantly visualised by TensorFlow's Embedding Projector (https://projector.tensorflow.org/).

# Case Study - Newspost Data

In [4]:
df = pd.read_csv('data/newsposts/newsposts_science.csv', index_col=0)
labels = df['Class Name'].astype('category').values.unique().tolist()
df.head()

| | Text | Class Name |
|---|---|---|
| 0 | (Graham Toal)Re: Do we need the clipper for c... | sci.crypt |
| 1 | (Graham Toal)Let's build software cryptophone... | sci.crypt |
| 2 | (technopagan priest)Re: Would "clipper" make ... | sci.crypt |
| 3 | (Vesselin Bontchev)Re: Once tapped, your code... | sci.crypt |
| 4 | (Phil G. Fraering)Re: Once tapped, your code ... | sci.crypt |


In [5]:
df['glove_embedding'] = df['Text'].apply(get_embedding_glove)

In [6]:
init_notebook_mode(connected=True)

X = np.vstack(df['glove_embedding'].values)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

hover_plot(X_pca, df)

We see poor separation. Why? Can we change something?

In [7]:
df['SBERT_embedding'] = df['Text'].apply(get_embedding_SBERT)

In [8]:
X = np.vstack(df['SBERT_embedding'].values)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

hover_plot(X_pca, df)

We are now starting to see real separation, but can we do better? What else can we change?

## Exercise

Change the dimensionality reduction technique. Try t-SNE.

In [9]:
X = np.vstack(df['SBERT_embedding'].values)

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)

hover_plot(X_tsne, df)


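t-SNE does come with the disadvantage of lost interpretability: unlike PCA, its axes are not fixed linear combinations of the original dimensions, and distances between well-separated clusters are not reliably meaningful.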

## Extra Exercises

Try UMAP. It is widely used in practice and often performs well, but it is rather complicated.

Perform k-means on this two-dimensional plot and do your own train/test split to see how effective this embedding method is.
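One possible shape for the k-means exercise is sketched below using scikit-learn. It runs on synthetic 2-D blobs from `make_blobs`, which stand in for the reduced embeddings, so the score it prints says nothing about the newspost data. The choice of the adjusted Rand index as the metric is also an assumption; other clustering metrics would work too.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

# Synthetic 2-D points standing in for the reduced embeddings (not the newspost data)
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Hold out a quarter of the points, keeping class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the cluster centres on the training points only
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)

# Assign each held-out point to its nearest learned centre
test_clusters = kmeans.predict(X_test)

# Compare cluster assignments with the true labels; 1.0 means perfect agreement,
# and the score ignores how the cluster ids happen to be numbered
score = adjusted_rand_score(y_test, test_clusters)
print(score)
```

When adapting this to the notebook, set `n_clusters` to the number of class labels. Bear in mind that scikit-learn's `TSNE` has no `transform` method for unseen points, so with t-SNE the split has to be made after the 2-D coordinates have been computed.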

Redo the animal/noises plot with SBERT.

In [18]:
init_notebook_mode(connected=True)
words = ["cat", "meow", "dog", "woof", "bird", "tweet", "lion", "roar", "horse", "neigh", ]

# perplexity must be smaller than the number of samples (only 10 words here)
tsne = TSNE(n_components=2, perplexity=5)

SBERT_vectors = get_embedding_SBERT(words)
SBERT_tsne = tsne.fit_transform(SBERT_vectors)

scatter(SBERT_tsne[:, 0], SBERT_tsne[:, 1], np.array(words), np.array(words))