# Retrieval Augmented Generation

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

**What is Retrieval augmented generation?**

See slides!

**What libraries will we use?**
- Embedding step: `sentence_transformers`, other good options available.
- Indexing step: `faiss`, it's my favorite so far.
- Generation step: `transformers` or `ollama`.

**Pre-requisites**
- Basic Python, including NumPy
- The intro slides

## Imports

In [1]:
try:
    import google.colab
    print("Looks like you are working on Google Colab! Let's install the necessary packages:")
    !pip install faiss-cpu scikit-learn umap-learn hf_xet
except ModuleNotFoundError:
    print("Looks like you are working locally! Make sure you create a virtual environment and install the necessary packages.")

Looks like you are working locally! Make sure you create a virtual environment and install the necessary packages.


In [2]:
# LLM libraries
from sentence_transformers import SentenceTransformer
import faiss

# Helper libraries
import pandas as pd
import numpy as np
from pathlib import Path

## Load Data

We first need "anchoring" data. This data will be retrieved as necessary by our system to answer relevant prompts. In our case, we are going to use a dataset of song lyrics.

Indicate the path to the data file:

In [3]:
data_path = Path("data/songs.csv")

Load the data into a pandas dataframe:

In [4]:
lyrics = pd.read_csv(data_path)

Let's take a look at the first few songs:

In [5]:
lyrics.head()

Unnamed: 0,Artist,Title,Lyrics
0,Taylor Swift,cardigan,"Vintage tee, brand new phone\nHigh heels on co..."
1,Taylor Swift,exile,"I can see you standing, honey\nWith his arms a..."
2,Taylor Swift,Lover,We could leave the Christmas lights up 'til Ja...
3,Taylor Swift,the 1,"I'm doing good, I'm on some new shit\nBeen say..."
4,Taylor Swift,Look What You Made Me Do,I don't like your little games\nDon't like you...


Check how many songs we have:

In [6]:
lyrics.shape

(745, 3)

And let's look at the artists we have songs from:

In [7]:
lyrics['Artist'].unique()

array(['Taylor Swift', 'Billie Eilish', 'The Beatles', 'David Bowie',
       'Billy Joel', 'Ed Sheeran', 'Eric Clapton', 'Bruce Springsteen',
       'Vance Joy', 'Lana Del Rey', 'Bryan Adams', 'Leonard Cohen',
       'Nat King Cole', 'twenty one pilots', 'Ray LaMontagne',
       'Bob Dylan', 'John Denver', 'Frank Sinatra', 'Queen', 'Elton John',
       'George Michael'], dtype=object)

## Preprocessing

There are many elements of preprocessing. The main ones you will encounter are:
* Removing boilerplate text, cleaning white spaces, etc.
* Chunking
* Tokenization

I have done the first one for you. In this workshop we may not have time to go over tokenization or chunking, but I've written sections about them at the end of the notebook, under *Optional*.

## Create Embeddings

This will be our first crucial step. To create embeddings, you need a model designed for the type of data you have: there are models for text, for images, for multimodal data, and so on.

**Step 1** Choose your model

In [8]:
# Let's pick a popular text model
model_name = 'all-mpnet-base-v2' 

# We create the SentenceTransformer based on our model. This is the function that takes texts and produces embeddings.
emb_model = SentenceTransformer(model_name)

**Step 2** Create embeddings

In [9]:
# Just one line!!
embeddings = emb_model.encode(lyrics['Lyrics'])

Now let's take a second to study our embeddings.

In [10]:
# SentenceTransformer returns numpy arrays. Other embedding libraries may return different data types.
type(embeddings)

numpy.ndarray

In [11]:
# The numpy array is basically an nxd matrix, where n is number of songs and d is the embedding dimension
embeddings.shape

(745, 768)

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

Different embedding models will perform differently. Indeed, some are trained with specific purposes in mind (Q&A, semantic search, multimodality, etc.). Throughout the exercises you'll be comparing our running example model with a different embedding model.

1. Create an embedding SentenceTransformer using the model `all-MiniLM-L6-v2`. For comparison, this model is smaller than our running example model (roughly $80$ MB, compared to $500$ MB). Call it something different from what we have above, for example `exercise_emb_model`.

2. Create embeddings of our lyrics for your exercise model. Call them something like `exercise_embeddings`.

3. Check if the number of dimensions is the same for this model.

In [12]:
# Create SentenceTransformer here
exercise_emb_model = SentenceTransformer("all-MiniLM-L6-v2")

In [13]:
# Create embeddings here
exercise_emebeddings = exercise_emb_model.encode(lyrics['Lyrics'])

In [14]:
# Check number of dimensions
exercise_emebeddings.shape

(745, 384)

<hr>

**Normalization**

Note that our vectors are normalized (their Euclidean norm is 1). This is not always the case, and it is very important to know whether your vectors are normalized or not. We'll get back to this when we create our index.

In [18]:
# Compute the squared norm of the first embedding. If normalized, it should be 1.
np.inner(embeddings[0], embeddings[0])

np.float32(1.0)

In [19]:
# Normalization is not always ensured by default, but you can set it to be so with an argument:
embeddings = emb_model.encode(lyrics['Lyrics'], normalize_embeddings=True)
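
If a library hands you unnormalized vectors and offers no such argument, you can normalize them yourself. The following is a minimal sketch with NumPy on made-up toy vectors (`toy_embeddings` is a hypothetical stand-in for a real $(n,d)$ embedding matrix):

```python
import numpy as np

# Toy stand-in for an (n, d) embedding matrix; the values are made up.
toy_embeddings = np.array([[3.0, 4.0],
                           [1.0, 0.0],
                           [0.5, 0.5]], dtype=np.float32)

# Euclidean norm of each row, kept as a column so it broadcasts over rows.
row_norms = np.linalg.norm(toy_embeddings, axis=1, keepdims=True)

# Divide each row by its own norm.
toy_normalized = toy_embeddings / row_norms

# Every row should now have norm 1 (up to floating point error).
print(np.linalg.norm(toy_normalized, axis=1))
```

Only the length of each vector changes; its direction stays the same, which is why normalizing does not alter cosine similarities.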

SentenceTransformers provides a handy `similarity` function, which computes the pairwise similarity between two sets of embeddings.

In [20]:
# Comparing two different songs
emb_model.similarity(embeddings[0], embeddings[1])

tensor([[0.4468]])

The result is a *tensor*, which you can index as you would with numpy arrays:

In [21]:
cos_sim_0_1 = emb_model.similarity(embeddings[0], embeddings[1])
print(f"The cosine similarity between song 0 and song 1 is {cos_sim_0_1[0,0]:.3f}")

The cosine similarity between song 0 and song 1 is 0.447


Note the tensor is actually a $1\times1$ matrix:

In [22]:
cos_sim_0_1.shape

torch.Size([1, 1])

The maximum possible similarity is $1$.

In [23]:
# A maximum of 1 is achieved if the vectors are the same
emb_model.similarity(embeddings[0], embeddings[0])

tensor([[1.0000]])

You can compare one song to multiple songs:

In [24]:
cos_sim_0_05 = emb_model.similarity(embeddings[0], embeddings[0:5])
print(cos_sim_0_05)
print(cos_sim_0_05.shape)

tensor([[1.0000, 0.4468, 0.6080, 0.5934, 0.5363]])
torch.Size([1, 5])


And now we got back a $1\times5$ matrix.

Or you can compare multiple songs to multiple songs, in which case you get a matrix of similarities:

In [25]:
cos_sim_05_05 = emb_model.similarity(embeddings[0:5], embeddings[0:5])
print(cos_sim_05_05)

tensor([[1.0000, 0.4468, 0.6080, 0.5934, 0.5363],
        [0.4468, 1.0000, 0.5352, 0.5242, 0.6023],
        [0.6080, 0.5352, 1.0000, 0.5728, 0.4673],
        [0.5934, 0.5242, 0.5728, 1.0000, 0.5427],
        [0.5363, 0.6023, 0.4673, 0.5427, 1.0000]])


In [26]:
cos_sim_05_05.shape

torch.Size([5, 5])

And now we got a $5\times5$ matrix.

By default, `sentence_transformers` uses the cosine similarity (see slides). You can also use other measures, such as the Euclidean or Manhattan distances (see their documentation).
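
To make the shapes above less mysterious, here is a rough sketch of what a pairwise cosine similarity computes, written with plain NumPy. The helper `cosine_similarity_matrix` and the toy vectors are hypothetical, written only for illustration; this is not how `sentence_transformers` implements `similarity` internally.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.atleast_2d(a)  # a single (d,) vector becomes a (1, d) matrix
    b = np.atleast_2d(b)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T  # entry [i, j] compares row i of a with row j of b

# Made-up 2-dimensional "embeddings"
toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])

one_vs_many = cosine_similarity_matrix(toy[0], toy)   # shape (1, 3)
many_vs_many = cosine_similarity_matrix(toy, toy)     # shape (3, 3)
print(one_vs_many)
print(many_vs_many.shape)
```

Comparing $m$ vectors against $n$ vectors always gives an $m\times n$ matrix, which is why even a single comparison comes back as $1\times1$.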

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

1. Check if the Euclidean norm of some of your `exercise_embeddings` is 1 or not.
2. Calculate the cosine similarity for the first 5 of your exercise embeddings.

In [27]:
np.inner(exercise_emebeddings[0], exercise_emebeddings[0])

np.float32(1.0)

In [28]:
exercise_emb_model.similarity(exercise_emebeddings[0:5], exercise_emebeddings[0:5])

tensor([[1.0000, 0.3810, 0.5573, 0.6025, 0.5036],
        [0.3810, 1.0000, 0.4271, 0.4929, 0.4268],
        [0.5573, 0.4271, 1.0000, 0.5127, 0.3940],
        [0.6025, 0.4929, 0.5127, 1.0000, 0.4785],
        [0.5036, 0.4268, 0.3940, 0.4785, 1.0000]])

<hr>

### Visualization for Intuition

**Similarity Heatmap**

Before continuing, let's try to develop an intuition about these embeddings.

First, recall that embeddings are vectors in a d-dimensional space, where d is quite large. We can't visualize them directly, but we can see how they interact with each other. Let's look at a heatmap of the similarities between different artists' songs.

In the image below, we show the similarities between $5$ songs from Taylor Swift and $5$ from Bob Dylan.

![heatmap-swift-dylan](images/taylor_dylan_heatmap.png){width=80%}

It seems that Taylor Swift's songs tend to be similar among themselves, while Bob Dylan's songs are less similar among themselves. As expected, songs from different artists are the most dissimilar, with the exception of *Like a Rolling Stone*, which is more similar to Taylor's songs than to the rest of Dylan's!

**Dimensionality Reduction**

We can also visualize embeddings by projecting them to a lower dimension. The image below shows the embeddings in $2$ dimensions. If you want to hover over each point to find out which song it represents, open the interactive file `images/song_embeddings.html` instead.

![song-embeddings](images/song_embeddings.png){width=70%}

(To see how this was done, you can check the `supplementary.ipynb` file. To learn more about dimensionality reduction, you can come to my machine learning with scikit-learn workshop!)
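
For a feel of what such a projection involves, here is a minimal sketch that reduces made-up vectors to two dimensions with PCA, computed through NumPy's SVD. This is an assumption chosen for illustration: it is not necessarily the method used in `supplementary.ipynb`, and `toy_embeddings` is random data, not our song embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(20, 8))  # made-up (n, d) matrix with n=20, d=8

# PCA works on centered data: subtract the mean of each dimension.
centered = toy_embeddings - toy_embeddings.mean(axis=0)

# The rows of vt are the principal directions, ordered by explained variance.
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Project every vector onto the top two directions.
coords_2d = centered @ vt[:2].T

print(coords_2d.shape)
```

Each row of `coords_2d` is a point you could place on a scatter plot. Keep in mind that any projection to two dimensions discards information, so distances in the plot are only a rough guide.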

### Summary of Embeddings

In [29]:
# You can create embeddings with just a few lines:
model_name = 'all-mpnet-base-v2' 
emb_model = SentenceTransformer(model_name)
embeddings = emb_model.encode(lyrics['Lyrics'], normalize_embeddings=True)

# and you can compare them to each other to see how similar they are (semantically)
emb_model.similarity(embeddings[0:5], embeddings[0:5])

tensor([[1.0000, 0.4468, 0.6080, 0.5934, 0.5363],
        [0.4468, 1.0000, 0.5352, 0.5242, 0.6023],
        [0.6080, 0.5352, 1.0000, 0.5728, 0.4673],
        [0.5934, 0.5242, 0.5728, 1.0000, 0.5427],
        [0.5363, 0.6023, 0.4673, 0.5427, 1.0000]])

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

1. Write some text of your choice. We will compare it to the song lyrics. You can write something silly or nonsense.
    - Store it in a variable named something like `exercise_my_lyrics`.
2. Using your exercise embedding model, embed your lyrics.
3. Compute the similarity to all other lyrics, and find the song closest to yours. Which one is it?
4. Repeat steps $2$ and $3$ using the main model (`emb_model`). Did the two models deem the same song as closest to yours?

In [None]:
# I will help you with finding the closest song.
# I'm assuming you stored the similarities in exercise_sims_to_my_song
closest_song_index = list(exercise_sims_to_my_song[0]).index(max(exercise_sims_to_my_song[0]))
print(lyrics.iloc[closest_song_index]['Lyrics'])
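
An equivalent way to find the best position is `argmax`, and `argsort` gives you the top few at once. A small sketch on a made-up similarity row (`toy_sims` is a hypothetical stand-in for `exercise_sims_to_my_song`, here as a NumPy array):

```python
import numpy as np

# Made-up (1, n) similarity row standing in for exercise_sims_to_my_song.
toy_sims = np.array([[0.12, 0.57, 0.33, 0.49]])

# Position of the largest similarity.
closest = int(np.argmax(toy_sims[0]))

# Positions of the three largest similarities, best first.
top3 = np.argsort(toy_sims[0])[::-1][:3]

print(closest, top3)
```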

<hr>

## Create Index

OK, now that we have this collection of high-dimensional embeddings, we can create our index!

As a reminder, an *index* is a data structure that lets us efficiently find the vectors in our collection most similar to a reference point (usually, a user's query). See slides if you need a refresher.

Creating a `faiss` index takes one line!

In [46]:
# We need the dimension of our embeddings
d_emb = len(embeddings[0])

# Create a faiss index
faiss_index = faiss.IndexFlatIP(d_emb) # <-- creating the d-dimensional index (empty for now)
print(faiss_index.is_trained)
print(faiss_index.ntotal)

True
0


We will talk about the meaning of `FlatIP` later. For now, just know that this index ranks vectors by inner product, and that **only because our embeddings are normalized** this is the same as ranking by cosine similarity.
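
Conceptually, a flat inner-product search compares the query against every stored vector and keeps the best matches. The sketch below mimics that idea with plain NumPy. The function `brute_force_ip_search` and the toy data are hypothetical, and this is not `faiss`'s actual implementation; it only illustrates what kind of answer to expect.

```python
import numpy as np

def brute_force_ip_search(vectors, queries, k):
    """Return (similarities, indices) of the k largest inner products per query."""
    scores = queries @ vectors.T                     # shape (n_queries, n_vectors)
    top_idx = np.argsort(-scores, axis=1)[:, :k]     # largest scores first
    top_scores = np.take_along_axis(scores, top_idx, axis=1)
    return top_scores, top_idx

# Made-up normalized vectors and a normalized query of shape (1, d)
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
toy_query = np.array([[0.8, 0.6]], dtype=np.float32)

sims, idxs = brute_force_ip_search(toy_vectors, toy_query, k=2)
print(sims, idxs)
```

Note the output format: one row per query, with similarities and positions returned as two separate 2-D arrays.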

### Add embeddings to index

In [47]:
faiss_index.add(embeddings)

In [48]:
print(faiss_index.is_trained)
print(faiss_index.ntotal)

True
745


<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

1. Create a new `exercise_index` and add your exercise embeddings to it.

<hr>

### Search

In [49]:
# Make a query and embed it, don't forget to normalize!
query_1 = "Life is good and I will survive. I am happy that things turned out this way"
qemb_1 = emb_model.encode(query_1, normalize_embeddings=True)

# Let's remind ourselves how the embedding is returned:
print(f"Query embedding info:\n\nType: {type(qemb_1)}\nShape: {qemb_1.shape}")

Query embedding info:

Type: <class 'numpy.ndarray'>
Shape: (768,)


In theory, we can ask `faiss` to find the closest songs to it with just one line of code:

```python
faiss_index.search(qemb_1, 1)
```

However, the above will produce an error because `faiss` expects an $(n,d)$ numpy array, but when we encode only one query we get a $(d,)$ numpy array back (after all, `faiss` and `sentence_transformers` were developed by different people).

Therefore, we need to reshape!

In [51]:
qemb_1 = qemb_1.reshape((1, qemb_1.shape[0])) # reshape into (1,d)

print(f"Query embedding info:\n\nType: {type(qemb_1)}\nShape: {qemb_1.shape}")

Query embedding info:

Type: <class 'numpy.ndarray'>
Shape: (1, 768)
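
There are several equivalent ways to turn a $(d,)$ array into a $(1,d)$ array. A quick sketch on a made-up vector (`toy_qemb` is a hypothetical stand-in for a query embedding):

```python
import numpy as np

toy_qemb = np.arange(4, dtype=np.float32)  # shape (4,), stand-in for a (d,) embedding

a = toy_qemb.reshape(1, -1)    # -1 lets numpy infer d
b = toy_qemb[np.newaxis, :]    # add a leading axis through indexing
c = np.atleast_2d(toy_qemb)    # leaves arrays that are already 2-D unchanged

print(a.shape, b.shape, c.shape)
```

Passing a one-element list such as `[query_1]` to `encode` should also return a $(1,d)$ array, which avoids the reshape altogether.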


Now we can search!

In [52]:
faiss_index.search(qemb_1, 1) # the 1 indicates how many close neighbors to find.

(array([[0.42986214]], dtype=float32), array([[294]]))

- The first element $[[.43]]$ is the cosine similarity between the query and its closest neighbor.
- The second element $[[294]]$ is the index (song number) of that neighbor.

In [53]:
neighbor_sim, neighbor_idx = faiss_index.search(qemb_1, 1)

print(f"The cosine similarity to the closest neighbor is: {neighbor_sim[0,0]:.2f}\n")
# Use the returned index instead of hard-coding the song number
print(f"And the neighbor is:\n{lyrics.iloc[neighbor_idx[0,0]]}")

The cosine similarity to the closest neighbor is: 0.43

And the neighbor is:
Artist                                         Eric Clapton
Title                                       I’ll Be Alright
Lyrics    I'll be alright\nI'll be alright\nI'll be alri...
Name: 294, dtype: object


Pay special attention to the indexing of the results. If you see double brackets `[[]]`, think of the result as a matrix and access it either as `[i][j]` or `[i,j]`. Even if we ask for a single result, the output is a $1\times1$ matrix.

If we wanted to find the $k$ closest neighbors:

In [54]:
k = 3
faiss_index.search(qemb_1, k)

(array([[0.42986214, 0.41438007, 0.3375955 ]], dtype=float32),
 array([[294, 608, 566]]))

### Searching multiple queries

You can search for multiple queries at the same time. However, some `faiss` versions are buggy when that happens, so to avoid any complications we'll search for each query individually.

First, let's make a helper function that does the embedding and reshaping in one step:

In [55]:
# Let's make a helper function that automatically normalizes and reshapes embeddings for us:
def embed_reshape(query):
    qemb = emb_model.encode(query, normalize_embeddings=True)
    qemb = qemb.reshape((1, qemb.shape[0]))
    return qemb

In [56]:
query_1 = "Life is good and I will survive. I am happy that things turned out this way"
query_2 = "Why did you leave me? I am so sad. The world is so cruel."
queries = [query_1, query_2]

In [57]:
# Embed the queries, remember to normalize
qembs = [embed_reshape(q) for q in queries]

Let's find the 4 closest songs to each query.

In [None]:
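k = 4
D_matched = []  # one (1, k) array of scores per query
I_matched = []  # one (1, k) array of song indices per query
# using a for loop to avoid faiss buggy behavior
for qe in qembs:
    # With IndexFlatIP the returned scores are similarities (larger means closer)
    dists_q_matched, idxs_q_matched = faiss_index.search(qe, k)
    D_matched.append(dists_q_matched)
    I_matched.append(idxs_q_matched)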

In [59]:
D_matched

[array([[0.42986214, 0.41438007, 0.3375955 , 0.33043662]], dtype=float32),
 array([[0.44099832, 0.43135837, 0.41771904, 0.41135132]], dtype=float32)]

In [61]:
I_matched

[array([[294, 608, 566, 621]]), array([[ 76, 682,  82, 657]])]

In [62]:
# Let's look at matched songs for the first query:
print(f"Matched song to your query \'{query_1}\':\n")
for i in I_matched[0][0]:
    artist = lyrics['Artist'].iloc[i]
    title = lyrics['Title'].iloc[i]
    song_lyrics = lyrics['Lyrics'].iloc[i]
    print(f"Artist: {artist}\nTitle: {title}\nLyrics:{song_lyrics[:100]}\n")

Matched song to your query 'Life is good and I will survive. I am happy that things turned out this way':

Artist: Eric Clapton
Title: I’ll Be Alright
Lyrics:I'll be alright
I'll be alright
I'll be alright someday
If in my heart
I do not give
Then I'll be al

Artist: John Denver
Title: Poems, Prayers, & Promises
Lyrics:I've been lately thinking
About my life's time
All the things I've done
And how it's been
And I can'

Artist: Ray LaMontagne
Title: Part of the Light
Lyrics:Why so many people always runnin' 'round
Looking for a happiness that can't be found?
I don't know
I

Artist: John Denver
Title: Matthew
Lyrics:I had an Uncle name of Matthew
He was his father's only boy
Born just south of Colby, Kansas
He was 



In [63]:
# Let's look at matched songs for the second query:
print(f"Matched song to your query \'{query_2}\':\n")
for i in I_matched[1][0]:
    artist = lyrics['Artist'].iloc[i]
    title = lyrics['Title'].iloc[i]
    song_lyrics = lyrics['Lyrics'].iloc[i]
    print(f"Artist: {artist}\nTitle: {title}\nLyrics:{song_lyrics[:100]}\n")

Matched song to your query 'Why did you leave me? I am so sad. The world is so cruel.':

Artist: Billie Eilish
Title: bitches broken hearts
Lyrics:You can pretend you don't miss me (Me)
You can pretend you don't care
All you wanna do is kiss me (M

Artist: Queen
Title: Too Much Love Will Kill You
Lyrics:I'm just the pieces
Of the man I used to be
Too many bitter tears
Are raining down on me
I'm far awa

Artist: Billie Eilish
Title: goodbye
Lyrics:Please, please
Don't leave me
Be

It's not true
Take me to the rooftop
Told you not to worry
What d

Artist: Queen
Title: Love of My Life
Lyrics:Love of my life, you've hurt me
You've broken my heart
And now you leave me

Love of my life, can't 



<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

1. Search for the $4$ closest songs to the first query using your exercise index.
2. Compare the results to what we got in our main index. Are the songs the same? Do they actually make sense?
3. Search for the $4$ closest songs to the lyrics you wrote earlier, and print them.

<hr>

### RECAP

OK, let's do a quick recap. Now that we have developed an intuition, we can perform the above steps very quickly:

In [64]:
# :: Load the data ::
data_path = Path("data/songs.csv")
lyrics = pd.read_csv(data_path)

# :: Create embeddings ::
model_name = 'all-mpnet-base-v2' 
emb_model = SentenceTransformer(model_name)
embeddings = emb_model.encode(lyrics['Lyrics'], normalize_embeddings=True)

# :: Create index ::
d_emb = len(embeddings[0])
faiss_index = faiss.IndexFlatIP(d_emb)
faiss_index.add(embeddings)     # Add embeddings to index

# :: Search ::
# Make query and embed it
query_1 = "Life is good and I will survive. I am happy that things turned out this way"
qemb_1 = emb_model.encode(query_1, normalize_embeddings=True)
# Reshape if necessary
qemb_1 = qemb_1.reshape((1, qemb_1.shape[0]))
# Search
k = 2
faiss_index.search(qemb_1, k)

(array([[0.42986214, 0.41438007]], dtype=float32), array([[294, 608]]))

And there you have it! A full retrieval system in fewer than 20 lines :-)

## (Optional)

### A note about indexes

**IndexFlatIP vs IndexFlatL2**

IP stands for *inner product*, while L2 stands for *L2 norm* (Euclidean distance). In general, these may produce different results, so you need to choose carefully. A simple heuristic is:
- Text data: `cosine similarity`.
- Image data: `euclidean distance`.

Still, you should do some research on the model you are using, the data type, and what you care about (direction, magnitude, etc.). More details in the slides.
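
To see that the two measures really can disagree, consider this small sketch with made-up, unnormalized vectors (all names and values are hypothetical):

```python
import numpy as np

toy_vectors = np.array([[10.0, 0.0],   # long vector, same direction as the query
                        [1.0, 0.1]])   # short vector, close to the query in space
toy_query = np.array([1.0, 0.0])

inner_products = toy_vectors @ toy_query
l2_distances = np.linalg.norm(toy_vectors - toy_query, axis=1)

best_by_ip = int(np.argmax(inner_products))  # larger inner product means more similar
best_by_l2 = int(np.argmin(l2_distances))    # smaller distance means more similar

print(best_by_ip, best_by_l2)
```

The inner product rewards the long vector for its magnitude, while the Euclidean distance prefers the vector that sits nearest to the query. Which behavior you want depends on whether magnitude carries meaning in your embeddings.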

**Cosine Similarity, Inner Product, Normalization**

If your vectors are *normalized*, the cosine similarity is the same as the inner product; for unnormalized vectors the two generally differ. In our case, since our embeddings are normalized, we can use `IndexFlatIP` and it will be equivalent to using the cosine similarity, which is what we want.
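
To see why, write out the standard definitions for two vectors $a$ and $b$ (a short derivation, using nothing beyond the definitions of cosine similarity and Euclidean norm):

$$
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\quad\Longrightarrow\quad
\cos(a, b) = a \cdot b \quad \text{when } \lVert a \rVert = \lVert b \rVert = 1 .
$$

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b = 2 - 2\, a \cdot b \quad \text{when } \lVert a \rVert = \lVert b \rVert = 1 .
$$

The second identity shows that, for normalized vectors, ranking by largest inner product and ranking by smallest Euclidean distance give the same order.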

**Flat vs Other Indexes**

Our flat index is not the most computationally efficient. It doesn't quantize vectors (see slides) and all searches are brute force (that is, it compares a query to every vector in the index). For our toy dataset this is fine, but for larger datasets you should consider other indexes. Different libraries implement different indexes. For example, `faiss` has the hierarchical navigable small world index `IndexHNSWFlat`, which searches more efficiently but returns approximate results.

We can't go over all index types in this workshop, but here's a handy table for the ones `faiss` offers:
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.

### Save index on disk

In general you will create a large index, and store it on disk for further use. This could be purely for convenience (you don't need to create a new index every time you use your RAG system) but will be vital if your index is quite large and memory becomes a limitation.

In [None]:
# Saving to a .index file
index_directory = 'rag_workshop.index'
faiss.write_index(faiss_index, index_directory)

If you need to read it later on, you can use `faiss.read_index(index_directory, faiss.IO_FLAG_MMAP)`. The MMAP flag tells `faiss` not to load the full index into memory.