In [5]:
# Run these in your environment before working through this document
#pip install gensim
#pip install --upgrade plotly nbformat

<div class="alert alert-block alert-info">

# Embeddings

This document will serve as less of a [tutorial](https://diataxis.fr/tutorials/) and more of an [explanation](https://diataxis.fr/explanation/) of what embeddings are.




<div class="alert alert-block alert-info">

### LLM Embeddings <a class="anchor" id="embeddings"></a>
Now, assuming you have a basic idea of embeddings as numerical representations of words (tokens), you might be wondering how these embeddings come to carry semantic meaning.
Remember that this is needed for our RAG machine to find the relevant piece of data (chunk) to answer your question. The key lies in training another model to transform our words (technically [tokens](https://medium.com/thedeephub/all-you-need-to-know-about-tokenization-in-llms-7a801302cf54)) into tensors (essentially n-dimensional vectors) that sit closer to each other the more similar their meanings are. We also want the representations to stay meaningful in combination. A classic way to illustrate this is to look at embeddings in a space we can picture easily, like 3D:






In [1]:
import plotly.graph_objects as go
import numpy as np
import plotly.io as pio

pio.renderers.default = 'notebook_connected'  # or 'iframe' if VS Code acts up

# Define origins and vectors (toy values chosen so that king = man + monarch)
origins = np.array([[0, 0, 0], [0, 0, 0], [0, 0, 0]])
vectors = np.array([[1, 1, 1], [2, -1, 1], [3, 0, 2]])

fig = go.Figure()
legend_names = ["man", "monarch", "king"]

for o, v, name in zip(origins, vectors, legend_names):
    fig.add_trace(go.Scatter3d(
        x=[o[0], o[0] + v[0]],
        y=[o[1], o[1] + v[1]],
        z=[o[2], o[2] + v[2]],
        mode='lines+markers',
        line=dict(width=6),
        marker=dict(size=4),
        name=name
    ))

fig.update_layout(
    scene=dict(
        xaxis_title='X',
        yaxis_title='Y',
        zaxis_title='Z',
        aspectmode='cube'
    ),

)

fig.show()



<div class="alert alert-block alert-info">

Imagine the blue and red vectors represent the words 'man' and 'monarch'. Theoretically, if we add these two vectors together, we could expect the resulting vector to combine the two words' meanings.
So our resultant vector might be the green embedding, the one for the token 'king'. This does work for some examples, but the way meaning is encoded in embeddings is not always this clear, and interactions between words can be tricky. The mathematical tool usually used to analyse the similarity between two embeddings should be familiar from your maths classes: cosine similarity. Cosine similarity computes the dot product of two vectors and divides it by the product of their magnitudes, giving a score between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions:
$$
\begin{aligned}
\text{cosine\_similarity}(A, B) &= \frac{A \cdot B}{\|A\| \, \|B\|} \\
\text{where:} \quad A \cdot B &= \sum_{i=1}^{n} A_i B_i, \\
\|A\| &= \sqrt{\sum_{i=1}^{n} A_i^2}, \quad
\|B\| = \sqrt{\sum_{i=1}^{n} B_i^2}
\end{aligned}
$$
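
To make the formula concrete, here is a minimal sketch that spells out each sum in plain Python and applies it to the toy 3D vectors from the plot. The function name `cosine_similarity` and the helper variables are our own choices for illustration, and the vectors are made-up values rather than real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, term by term."""
    dot = sum(a_i * b_i for a_i, b_i in zip(a, b))   # A . B
    norm_a = math.sqrt(sum(a_i ** 2 for a_i in a))   # ||A||
    norm_b = math.sqrt(sum(b_i ** 2 for b_i in b))   # ||B||
    return dot / (norm_a * norm_b)

# Toy vectors from the 3D plot (illustrative values, not real embeddings)
man = [1, 1, 1]
monarch = [2, -1, 1]
king = [3, 0, 2]

# Element-wise sum of the two vectors: [3, 0, 2]
man_plus_monarch = [m + n for m, n in zip(man, monarch)]
# A vector pointing in exactly the opposite direction to king
opposite_of_king = [-x for x in king]

print(round(cosine_similarity(king, man), 3))               # 0.801
print(round(cosine_similarity(king, monarch), 3))           # 0.906
print(round(cosine_similarity(king, man_plus_monarch), 3))  # 1.0
print(round(cosine_similarity(king, opposite_of_king), 3))  # -1.0
```

In this toy setup the sum matches 'king' exactly because the vectors were chosen that way. Real embeddings are not this tidy.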

For our RAG purposes, cosine similarity lets us use a clustering algorithm to identify topics (clusters of chunk embeddings) that are related to our prompt.
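
As a rough illustration of how similarity scores can pick out relevant chunks, the sketch below ranks a few chunks against a prompt. Note the assumptions: it swaps in a simple nearest-neighbour ranking instead of a clustering algorithm, the chunk names and 3D vectors are hypothetical, and `rank_chunks` is a helper invented for this example. A real embedding model would produce vectors with many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical chunk embeddings (made-up 3D values for illustration only)
chunk_embeddings = {
    "chunk about royalty": [0.9, 0.1, 0.3],
    "chunk about cooking": [0.1, 0.9, 0.2],
    "chunk about castles": [0.7, 0.2, 0.6],
}

# Hypothetical embedding of the user's prompt
prompt_embedding = [0.8, 0.0, 0.4]

def rank_chunks(prompt_vec, chunks):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(prompt_vec, vec), name)
              for name, vec in chunks.items()]
    return sorted(scored, reverse=True)

ranked = rank_chunks(prompt_embedding, chunk_embeddings)
for score, name in ranked:
    print(f"{score:.3f}  {name}")
```

The chunks at the top of the ranking are the ones we would hand to the language model as context for answering the prompt.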

If you are keen for a more visual explanation of these embeddings, check out 3Blue1Brown's explanation, made with the visual library Manim, on [YouTube](https://youtu.be/wjZofJX0v4M?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&t=753).

<div class="alert alert-block alert-info">

Below is an example of loading and comparing pre-trained embeddings (50-dimensional GloVe vectors) using the gensim library. Feel free to change the code and test different embeddings.

In [None]:
import gensim.downloader

# Download (on first run) and load pre-trained 50-dimensional GloVe word vectors
model = gensim.downloader.load("glove-wiki-gigaword-50")


In [14]:
import numpy as np
from numpy.linalg import norm


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    return np.dot(a, b) / (norm(a) * norm(b))


# Look up the pre-trained embedding for each word
king_embedding = model["king"]
monarch_embedding = model["monarch"]
man_embedding = model["man"]

# Combine two meanings by adding their vectors element-wise
monarch_plus_man_embedding = monarch_embedding + man_embedding

print("Cosine Similarity between 'king' and 'monarch':", cosine_similarity(king_embedding, monarch_embedding))
print("Cosine Similarity between 'king' and 'man':", cosine_similarity(king_embedding, man_embedding))
print("Cosine Similarity between 'king' and 'monarch + man':", cosine_similarity(king_embedding, monarch_plus_man_embedding))

Cosine Similarity between 'king' and 'monarch': 0.71930116
Cosine Similarity between 'king' and 'man': 0.5309377
Cosine Similarity between 'king' and 'monarch + man': 0.74011725


<div class="alert alert-block alert-info">
As you can hopefully see, the embeddings roughly track our theory: for king, monarch and man, the highest similarity is between the embedding for king and the sum of the embeddings for man and monarch. However, the similarity score is only 0.74 rather than 1, because the way language is used doesn't map exactly onto addition; that is, king does not literally mean man + monarch. There is a lot more going on with these embeddings that isn't always clear and is not the focus of our workshop.