# Retrieval Augmented Generation

**What is Retrieval Augmented Generation?**

See slides!

**What libraries will we use?**
- Embedding step: `sentence_transformers`, other good options available.
- Indexing step: `faiss`, it's my favorite so far.
- Generation step: `transformers` or `ollama`.

**Pre-requisites**
- Basic Python, including numpy
- The intro slides

## Imports

TO-DO:
* Install faiss-cpu or faiss-gpu?

In [None]:
# for better performance when loading xet models:
# pip install hf_xet

# or:
#pip install huggingface_hub[hf_xet]

In [None]:
# load olmo model

In [None]:
# in_google_colab = 1
# if in_google_colab:
#   !pip install faiss-cpu

In [None]:
# # Dictionary to handle cases where package name and module name are different
# package_to_module = {
#     "faiss-cpu": "faiss",
#     "scikit-learn": "sklearn",
#     "umap-learn": "umap"
# }

# Packages to install
# packages = ["faiss-cpu", "numpy", "scikit-learn", "umap-learn"]

# def check_and_install_packages(package_list):
#     """Check if each package in the list is installed, and install it if not."""
#     for package in package_list:
#         # Get the correct module name (or fallback to package name if they are the same)
#         module_name = package_to_module.get(package, package)
#         try:
#             __import__(module_name)  # Try to import the correct module name
#             print(f"{package} ({module_name}) is already installed")
#         except ImportError:
#             print(f"{package} is not installed, installing now...")
#             !pip install {package}
#             # And add import here no?

# # Check and install packages
# check_and_install_packages(packages)

In [None]:
# Helps with threading issues
# import os
# os.environ["OMP_NUM_THREADS"] = "1"
# os.environ["MKL_NUM_THREADS"] = "1"

In [2]:
# LLM libraries
from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline
import ollama

# Machine learning libraries
from umap import UMAP

# Helper libraries
import pandas as pd
import numpy as np
from pathlib import Path

# Visualization libraries
import plotly.express as px
import plotly.graph_objects as go

## Section 1: Building the Retrieval System

### Load Data

We first need "anchoring" data that our system will retrieve as necessary to answer relevant prompts. In our case, we are going to use a dataset of song lyrics.

Indicate the path to the data file:

In [None]:
data_path = Path("data/songs.csv")

Load the data into a pandas dataframe:

In [None]:
lyrics = pd.read_csv(data_path)

Let's take a look at the first few songs:

In [None]:
lyrics.head()

Check how many songs we have:

In [None]:
lyrics.shape

And let's look at the artists we have songs from:

In [None]:
lyrics['Artist'].unique()

### Preprocessing

There are many elements of preprocessing. The main ones you will encounter are:
* Removing boilerplate text, cleaning white spaces, etc.
* Tokenization
* Chunking

I have done the first one for you. In this workshop we may not have time to go over tokenization or chunking, but I've written sections about them at the end of the notebook, in the *Optional* section.

### Create Embeddings

This will be our first crucial step. To create embeddings, you need to have a model that is designed for the same type of data as your data! There are models for text, for images, multimodal, etc.

**Step 1** Choose your model

In [None]:
# Let's pick a popular text model
model_name = 'all-mpnet-base-v2' 

# We create the SentenceTransformer based on our model. This is the function that takes texts and produces embeddings.
emb_model = SentenceTransformer(model_name)

**Step 2** Create embeddings

In [None]:
# Just one line!!
embeddings = emb_model.encode(lyrics['Lyrics'])

Now let's take a second to study our embeddings.

In [None]:
# SentenceTransformer returns numpy arrays. Other libraries may return different data types.
type(embeddings)

In [None]:
# The numpy array is basically an nxd matrix, where n is number of songs and d is the embedding dimension
embeddings.shape

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"><<*ERASE solutions after debugging*>></span>

Different embedding models will perform differently. Indeed, some are trained with specific purposes in mind (Q&A, semantic search, multimodality, etc.). Throughout the exercises you'll be comparing our running example model with a different embedding model.

1. Create an embedding SentenceTransformer using the model `all-MiniLM-L6-v2`. For comparison, this model is smaller than our running example model (80MB, compared to 500MB). Call it something different from what we have above, for example `exercise_emb_model`.

2. Create embeddings of our lyrics for your exercise model. Call them something like `exercise_embeddings`.

3. Check if the number of dimensions is the same for this model.

In [None]:
# Create SentenceTransformer here
exercise_emb_model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
# Create embeddings here
exercise_emebeddings = exercise_emb_model.encode(lyrics['Lyrics'])

In [86]:
# Check number of dimensions
exercise_emebeddings.shape

(745, 384)

<hr>

**Normalization**

Note that our vectors are normalized (their Euclidean norm is 1). This is not always the case, but it is very important to know whether your vectors are normalized or not. We'll get back to this when we create our index.

In [None]:
np.inner(embeddings[0], embeddings[0])

In [None]:
# Normalization is not always ensured by default, but you can set it to be so with an argument:
embeddings = emb_model.encode(lyrics['Lyrics'], normalize_embeddings=True)
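
If a model does not normalize for you, both the check and the fix are short in numpy. The sketch below uses a small made-up matrix (`toy_embeddings`, a hypothetical stand-in, not our lyrics embeddings) so it can run on its own:

```python
import numpy as np

# Hypothetical stand-in for an (n, d) embedding matrix; the values are made up.
toy_embeddings = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]], dtype=np.float32)

# Euclidean norm of each row; a normalized matrix has every norm equal to 1.
norms = np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
print(norms.ravel())

# Divide each row by its own norm to normalize it.
toy_normalized = toy_embeddings / norms
print(np.linalg.norm(toy_normalized, axis=1))
```

The same check on the real data would be `np.linalg.norm(embeddings, axis=1)`, which looks at every song at once instead of one at a time.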

SentenceTransformers provides a handy `similarity` function, which computes the pairwise similarity between two sets of embeddings (in our case, songs).

In [None]:
# Comparing two songs
emb_model.similarity(embeddings[0], embeddings[1])

The result is a *tensor*, which you can index as you would with numpy arrays:

In [None]:
d01 = emb_model.similarity(embeddings[0], embeddings[1])
print(f"The cosine similarity between song 0 and song 1 is {d01[0,0]}")

In [None]:
# A maximum of 1 is achieved if the vectors are the same
emb_model.similarity(embeddings[0], embeddings[0])

You can compare one song to multiple songs:

In [None]:
emb_model.similarity(embeddings[0], embeddings[0:5])

Or multiple songs to multiple songs, in which case you get a matrix of similarities:

In [None]:
emb_model.similarity(embeddings[0:5], embeddings[0:5])

By default, sentence_transformers uses the `cosine` similarity (see slides). You can also use other similarity functions, such as ones based on the Euclidean or Manhattan distances (see their documentation).
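
To make the cosine similarity less of a black box, here is a minimal numpy sketch of the computation. The helper `cosine_similarity_matrix` and the `toy` vectors are hypothetical names used for illustration; this shows the idea, not the library's actual implementation:

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    # Scale every row to unit length, then take all pairwise inner products.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Made-up 2-dimensional vectors so the result is easy to reason about.
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = cosine_similarity_matrix(toy, toy)
print(sims.round(2))
```

The diagonal is 1 (every vector compared with itself), perpendicular vectors score 0, and the length of a vector does not matter, only its direction.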

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"><<*ERASE solutions after debugging*>></span>

1. Check if the Euclidean norm of some of your `exercise_embeddings` is 1 or not.
2. Calculate the cosine similarity for the first 5 of your exercise embeddings.

In [87]:
np.inner(exercise_emebeddings[0], exercise_emebeddings[0])

np.float32(1.0)

In [88]:
exercise_emb_model.similarity(exercise_emebeddings[0:5], exercise_emebeddings[0:5])

tensor([[1.0000, 0.3810, 0.5573, 0.6025, 0.5036],
        [0.3810, 1.0000, 0.4271, 0.4929, 0.4268],
        [0.5573, 0.4271, 1.0000, 0.5127, 0.3940],
        [0.6025, 0.4929, 0.5127, 1.0000, 0.4785],
        [0.5036, 0.4268, 0.3940, 0.4785, 1.0000]])

<hr>

### Visualization for Intuition

**Similarity Heatmap**

Before continuing, let's try to develop an intuition about these embeddings.

First, recall that embeddings are vectors in a d-dimensional space, where d is quite large. We can't visualize them directly, but we can see how they interact with each other. Let's look at a heatmap of the similarities between different artists' songs.

All the code in this section will be skipped! It is not important for our workshop.

In [None]:
n_heatmap = 5
a_artist = 'Taylor Swift'
b_artist = 'Bob Dylan'

In [None]:
lyrics[lyrics['Artist']==a_artist].head(n_heatmap)

In [None]:
lyrics[lyrics['Artist']==b_artist].head(n_heatmap)

In [None]:
# Let's save the indices for easy access
a_idxs = lyrics[lyrics['Artist']==a_artist].index.to_list()[:n_heatmap]
b_idxs = lyrics[lyrics['Artist']==b_artist].index.to_list()[:n_heatmap]

# subset the embeddings of the first n songs by each artist
a_embs = embeddings[a_idxs]
b_embs = embeddings[b_idxs]
both_embs = np.concatenate((a_embs, b_embs), axis=0)

# we'll use this in our visualization (titles must match the songs selected above):
a_titles = lyrics['Title'].iloc[a_idxs].to_list()
b_titles = lyrics['Title'].iloc[b_idxs].to_list()
a_titles = [title[:20] for title in a_titles] # truncating text
b_titles = [title[:20] for title in b_titles]
both_titles = a_titles + b_titles

# compute their similarity, we want to visualize this with a heatmap
fl_sim_matrix = emb_model.similarity(both_embs, both_embs)

In [None]:
fig = px.imshow(
    fl_sim_matrix,
    x=both_titles,
    y=both_titles,
    color_continuous_scale="Viridis",
    text_auto=".2f"
)

fig.update_layout(
    title=f"Cosine Similarity Among {a_artist} and {b_artist} Lyrics",
    width=750,
    height=750,
    xaxis=dict(tickangle=45)
)
fig.show()

**Dimensionality Reduction**

We can also visualize embeddings by looking at them in a lower dimension. We will use a machine learning technique called *dimensionality reduction*. You don't need to know how it's done, and don't worry about the code, we'll use it only for visualization purposes.

(If you attend the topic modeling workshop, you may learn about it more in depth).

In [None]:
dimred_model = UMAP(
    n_neighbors=3,  # umap hyper-parameter
    n_components=2, # dimension we are reducing to
    metric='cosine'
)

two_d_rep = dimred_model.fit_transform(embeddings)

In [None]:
fig_clustering = go.Figure()

fig_clustering.add_trace(go.Scatter(
    x=two_d_rep[:, 0],
    y=two_d_rep[:,1],
    mode='markers',
    marker=dict(size=6),
    text=lyrics['Title'],
    hoverinfo='text'
))

fig_clustering.update_layout(
    height=750, width=750,
    title='Low dimensional view of embedded lyrics',
)

fig_clustering.show()

In [None]:
# A reminder of the artists, which one would you like to see?
lyrics['Artist'].unique()

In [None]:
# Let's highlight an artist's songs just for fun:
artist_highlight = 'John Denver'
artist_idxs = lyrics[lyrics['Artist']==artist_highlight].index.to_list()

fig_clustering = go.Figure()

fig_clustering.add_trace(go.Scatter(
    x=two_d_rep[:, 0],
    y=two_d_rep[:,1],
    mode='markers',
    marker=dict(size=6),
    text=lyrics['Title'],
    hoverinfo='text',
    name = 'All artists'
))

fig_clustering.add_trace(go.Scatter(
    x=two_d_rep[artist_idxs, 0],
    y=two_d_rep[artist_idxs,1],
    mode='markers',
    marker=dict(size=6, color='crimson'),
    text = lyrics['Title'].iloc[artist_idxs],
    hoverinfo = 'text',
    name = artist_highlight
))

fig_clustering.update_layout(
    height=750, width=750,
    title='Low dimensional view of embedded lyrics',
)

fig_clustering.show()

### Create Index

OK, back to our main task. Now we have this collection of high-dimensional embeddings. It's time to create our index! As a reminder, an index is a data structure which efficiently allows us to find the most similar vectors in our collection to one reference point (usually, a user's query). See slides if you need a refresher.

Creating a faiss index takes one line!

In [None]:
# We need the dimension of our embeddings
d_emb = len(embeddings[0])

# Create a faiss index
faiss_index = faiss.IndexFlatIP(d_emb) # <-- creating the d-dimensional index (empty for now)
print(faiss_index.is_trained)
print(faiss_index.ntotal)

We will talk about the meaning of `FlatIP` later. For now, just know that this index works with the cosine similarity **only because our embeddings are normalized**.

### Add embeddings to index

In [None]:
faiss_index.add(embeddings)

In [None]:
print(faiss_index.is_trained)
print(faiss_index.ntotal)

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"><<*ERASE solutions after debugging*>></span>

1. Create a new `exercise_index` and add your exercise embeddings to it.

<hr>

### Search

In [None]:
# Make a query and embed it, don't forget to normalize!
query_1 = "Life is good and I will survive. I am happy that things turned out this way"
qemb_1 = emb_model.encode(query_1, normalize_embeddings=True)

# Let's remind ourselves how the embedding is returned:
print(f"Query embedding info:\n\nType: {type(qemb_1)}\nShape: {qemb_1.shape}")

In theory, we can ask `faiss` to find the closest songs to it with just one line of code:

```python
faiss_index.search(qemb_1, 1)
```

However, the above will produce an error because `faiss` expects a (n,d) numpy array, but when we encode only one query we get a (d,) numpy array back. Therefore, we need to reshape!

In [None]:
qemb_1 = qemb_1.reshape((1, qemb_1.shape[0])) # reshape into (1,d)

print(f"Query embedding info:\n\nType: {type(qemb_1)}\nShape: {qemb_1.shape}")
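
There are a few equivalent ways to turn a (d,) array into a (1,d) array. The sketch below uses a made-up vector (`single`, a hypothetical stand-in for one encoded query) to show three of them:

```python
import numpy as np

d = 4
single = np.arange(d, dtype=np.float32)  # shape (d,), like one encoded query

as_row_1 = single.reshape(1, -1)   # -1 lets numpy infer the dimension d
as_row_2 = single[np.newaxis, :]   # add a leading axis by indexing
as_row_3 = np.atleast_2d(single)   # leaves arrays that are already 2-D untouched

print(single.shape, as_row_1.shape, as_row_2.shape, as_row_3.shape)
```

Encoding a one-element list instead of a bare string, as in `emb_model.encode([query_1], normalize_embeddings=True)`, should also give a (1,d) array directly, since a list of texts is returned as a matrix.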

Now we can search!

In [None]:
faiss_index.search(qemb_1, 1) # the 1 indicates how many close neighbors to find.

The first element [[.43]] is the cosine similarity between the query and its closest neighbor. The second element [[294]] is the index (song number) of that neighbor.

In [None]:
neighbor_sim, neighbor_idx = faiss_index.search(qemb_1, 1)

print(f"The cosine similarity to the closest neighbor is: {neighbor_sim[0,0]:.2f}\n")
print(f"And the neighbor is:\n{lyrics.iloc[neighbor_idx[0, 0]]}")

Pay special attention to the indexing of the results. If you see double brackets [[]] think of it as a matrix, so access it either as [i][j] or [i,j]. Even if we get one result, we'll be given, as output, a 1x1 matrix.
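
As a small illustration of that indexing, the sketch below builds two arrays shaped like the output of a search with one query and k = 2. The numbers are copied from the recap output later in this notebook; the variable names are only for illustration:

```python
import numpy as np

# Shaped like a search result for one query and k = 2:
# one row per query, one column per neighbor.
sims = np.array([[0.42986214, 0.41438007]], dtype=np.float32)
idxs = np.array([[294, 608]])

print(sims.shape, idxs.shape)

# Row 0 holds the results for the first (and only) query.
best_idx = int(idxs[0, 0])  # a plain Python int, handy for lyrics.iloc[...]
print(best_idx)

# Walk through the neighbors of the first query, best match first.
for rank, (sim, idx) in enumerate(zip(sims[0], idxs[0]), start=1):
    print(f"Neighbor {rank}: song {idx} with similarity {sim:.2f}")
```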

If we wanted to find the k closest neighbors:

In [None]:
k = 3
faiss_index.search(qemb_1, k)

### Searching multiple queries

You can search for multiple queries at the same time; however, some `faiss` versions are buggy when that happens. To avoid any complications, we'll search individually for each query.

In [None]:
# Let's make a helper function that automatically normalizes and reshapes embeddings for us:
def embed_reshape(query):
    qemb = emb_model.encode(query, normalize_embeddings=True)
    qemb = qemb.reshape((1, qemb.shape[0]))
    return qemb

In [None]:
query_1 = "Life is good and I will survive. I am happy that things turned out this way"
query_2 = "Why did you leave me? I am so sad. The world is so cruel."
queries = [query_1, query_2]

In [None]:
# Embed the queries, remember to normalize
qembs = [embed_reshape(q) for q in queries]

Let's find the 4 closest songs to each query.

In [None]:
k = 4
D_matched = []
I_matched = []
for qe in qembs:
    dists_q_matched, idxs_q_matched = faiss_index.search(qe, k)
    D_matched.append(dists_q_matched)
    I_matched.append(idxs_q_matched)
# distances_q_matched, indices_q_matched = faiss_index.search(qembs, k)
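
For reference, a batched search works on a single (n_queries, d) array rather than a list of (1,d) arrays. The sketch below only shows the shapes involved, using made-up arrays (`toy_qembs` and `toy_results` are hypothetical stand-ins); it does not call `faiss`:

```python
import numpy as np

d = 3
# Hypothetical stand-ins for two reshaped query embeddings, each of shape (1, d).
toy_qembs = [np.ones((1, d), dtype=np.float32), np.zeros((1, d), dtype=np.float32)]

# Stack them into one (n_queries, d) array.
stacked = np.vstack(toy_qembs)
print(stacked.shape)

# Likewise, per-query results of shape (1, k) can be stacked into (n_queries, k),
# so that row i holds the neighbors of query i.
k = 4
toy_results = [np.arange(k).reshape(1, k), np.arange(k, 2 * k).reshape(1, k)]
I_all = np.vstack(toy_results)
print(I_all.shape, I_all[1])
```

With the results stacked this way, `I_all[1]` plays the same role as `I_matched[1][0]` in the loop above.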

In [None]:
D_matched

In [None]:
I_matched[0][0]

In [None]:
# Let's look at matched songs for the first query:
print(f"Matched song to your query \'{query_1}\':\n")
for i in I_matched[0][0]:
    artist = lyrics['Artist'].iloc[i]
    title = lyrics['Title'].iloc[i]
    song_lyrics = lyrics['Lyrics'].iloc[i]
    print(f"Artist: {artist}\nTitle: {title}\nLyrics:{song_lyrics[:100]}\n")

In [None]:
# Let's look at matched songs for the second query:
print(f"Matched song to your query \'{query_2}\':\n")
for i in I_matched[1][0]:
    artist = lyrics['Artist'].iloc[i]
    title = lyrics['Title'].iloc[i]
    song_lyrics = lyrics['Lyrics'].iloc[i]
    print(f"Artist: {artist}\nTitle: {title}\nLyrics:{song_lyrics[:100]}\n")

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"><<*ERASE solutions after debugging*>></span>

1. Search for the 4 closest songs to the first query using your exercise index.
2. Compare the results to what we got in our main index. Are the songs the same? Do they actually make sense?

<hr>

### RECAP

OK, let's do a quick recap. Now that we have developed an intuition, we can actually perform the above steps super quickly:

In [3]:
# :: Load the data ::
data_path = Path("data/songs.csv")
lyrics = pd.read_csv(data_path)

# :: Create embeddings ::
model_name = 'all-mpnet-base-v2' 
emb_model = SentenceTransformer(model_name)
embeddings = emb_model.encode(lyrics['Lyrics'], normalize_embeddings=True)

# :: Create index ::
d_emb = len(embeddings[0])
faiss_index = faiss.IndexFlatIP(d_emb)
# Add embeddings to index
faiss_index.add(embeddings)

# :: Search ::
# Make query and embed it
query_1 = "Life is good and I will survive. I am happy that things turned out this way"
qemb_1 = emb_model.encode(query_1, normalize_embeddings=True)
# Reshape if necessary
qemb_1 = qemb_1.reshape((1, qemb_1.shape[0]))
# Search
k = 2
faiss_index.search(qemb_1, k)

(array([[0.42986214, 0.41438007]], dtype=float32), array([[294, 608]]))

And there you have it!! A full retrieval system in fewer than 20 lines :-)

### (Optional) A note about indexes

**IndexFlatIP vs IndexFlatL2**

IP stands for *inner product*, while L2 stands for *L2 norm* (euclidean distance). In general, these may produce different results, so you need to choose carefully. A simple heuristic is:
- Text data: `cosine similarity`.
- Image data: `euclidean distance`.

But, you should still do some research on the model you are using, the data type, and what you care about (direction, magnitude, etc.). More details on the slides.

**Cosine Similarity, Inner Product, Normalization**

If your vectors are *normalized*, the cosine similarity is the same as the inner product; without normalization, the two generally differ. In our case, since our embeddings are normalized, we can use `IndexFlatIP` and it will be equivalent to using the cosine similarity, which is what we want.
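
A quick numerical check of this relationship, as a minimal numpy sketch on two random vectors (the vectors and the `cosine` helper are made up for illustration). It also checks a related identity: for unit vectors, the squared Euclidean distance equals 2 minus twice the inner product, so the inner product and the L2 distance rank neighbors in the same order when the vectors are normalized:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))  # two random 8-dimensional vectors

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Unnormalized: inner product and cosine similarity generally differ.
print(float(a @ b), cosine(a, b))

# Normalized: the inner product is the cosine similarity.
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(float(a_unit @ b_unit), cosine(a, b))

# For unit vectors: squared Euclidean distance = 2 - 2 * inner product.
sq_dist = float(np.sum((a_unit - b_unit) ** 2))
print(sq_dist, 2 - 2 * float(a_unit @ b_unit))
```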

**Flat vs Other Indexes**

Our flat index is not the most computationally efficient. It doesn't quantize vectors (see slides) and all searches are brute force (that is, it will compare a query to all vectors in the index). For our toy dataset this is fine, but for larger datasets you should use other indices. Different libraries have implementations of different indices, for example `faiss` has the hierarchical navigable small world index `IndexHNSWFlat`, which is more search efficient, but returns approximate results.
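
To make "brute force" concrete, here is a minimal numpy sketch of what a flat inner-product search does conceptually. The function `flat_ip_search` and the toy data are hypothetical, and this is not how `faiss` is implemented internally; it only mirrors the idea and the shape of the output:

```python
import numpy as np

def flat_ip_search(index_vectors: np.ndarray, queries: np.ndarray, k: int):
    """Brute-force top-k inner-product search.

    Returns (scores, indices), each of shape (n_queries, k).
    """
    # Compare every query against every stored vector.
    scores = queries @ index_vectors.T
    # Indices of the k largest scores per query, best first.
    top_idx = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, top_idx, axis=1)
    return top_scores, top_idx

# Three made-up unit vectors as the "index" and one query.
toy_index = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
toy_query = np.array([[0.0, 1.0]])
D, I = flat_ip_search(toy_index, toy_query, k=2)
print(D, I)
```

The cost grows with the number of stored vectors, since every query touches all of them. That is what approximate indexes avoid.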

We can't go over all index types in this workshop, but here's a handy table for the ones `faiss` offers:
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.

### (Optional) Save index on disk

In general you will create a large index, and store it on disk for further use. This could be purely for convenience (you don't need to create a new index every time you use your RAG system) but will be vital if your index is quite large and memory becomes a limitation.

In [None]:
# Saving to a .index file
index_directory = 'rag_workshop.index'
faiss.write_index(faiss_index, index_directory)

If you need to read it later on, you can use `faiss.read_index(index_directory, faiss.IO_FLAG_MMAP)`. The MMAP flag tells it not to load the full index into memory.

## Section 2: Adding Generation

OK, we are done with the R of RAG. Time to get the A and the G done!

For this, we need a generation model and a library that helps us run such a model.

I recommend using either `transformers` or `ollama`.

In [None]:
# # The transformers way, if using a gated model like gemma, you'll need to provide an access token
# generation_model = "allenai/OLMo-2-0425-1B"
# gen_ppln = pipeline(task="text-generation", model=generation_model)

In [4]:
generation_model = "llama3.2"

# The ollama way:
test_response = ollama.generate(
    model=generation_model,
    prompt="Why is the sky blue and not pink?"
)

print(test_response.response)

The reason why the sky appears blue rather than pink has to do with a combination of physics, optics, and atmospheric conditions.

Here's what happens:

1. **Light from the sun**: When sunlight enters Earth's atmosphere, it consists of a broad spectrum of colors, including all the colors of the visible light range (red, orange, yellow, green, blue, indigo, and violet).
2. **Scattering of light**: As sunlight travels through the atmosphere, shorter (blue) wavelengths are scattered more than longer (red) wavelengths by the tiny molecules of gases such as nitrogen and oxygen.
3. **Rayleigh scattering**: The scattering effect is known as Rayleigh scattering, named after the British physicist Lord Rayleigh, who first described it in the late 19th century. This type of scattering occurs when light interacts with small particles or molecules that are much smaller than the wavelength of light.
4. **Blue color dominance**: Since blue light has a shorter wavelength, it is scattered more extensiv

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"><<*ERASE solutions after debugging*>></span>

1. Play a little bit with your generation model. Feel free to ask either serious or silly questions. Provide specific instructions on how you want it to behave, etc.

<hr>

### Creating a system prompt

Let's create a system prompt with instructions for the LLM. This system prompt will accept some extra grounding data, which we obtain by searching our index. **This is the heart of RAG**: the prompt for our generation model is supplemented with query-relevant information retrieved from a large pool through the index search.

We could write our system prompt here, but these can get quite long and may be used in different scripts, so I recommend writing your prompts in text files and then just loading them. Let's head to `system_prompts/rag_system_prompt.txt`.

**(Optional) Quick Review of python Strings**

We'll be filling in the system prompt with variable content (the user's input and the songs returned by our index search), so here's a quick review of strings:

In [13]:
temp_q = "What is the answer to the ultimate question of life, the universe, and everything?"
temp_a = 42

In [14]:
# f-strings
f_string = f"Question: {temp_q}\nAnswer: {temp_a}"
print(f_string)

Question: What is the answer to the ultimate question of life, the universe, and everything?
Answer: 42


In [16]:
# .format() with format fields {}
temp_text = "Question: {question}\nAnswer: {answer}"
temp_text = temp_text.format(question=temp_q, answer=temp_a)
print(temp_text)

Question: What is the answer to the ultimate question of life, the universe, and everything?
Answer: 42
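
One thing to keep in mind when a prompt template lives in a text file: `.format()` treats every `{...}` in the template as a field. If the template itself needs literal braces (for example, to show the model a JSON shape), they must be doubled. The templates below are made up for illustration:

```python
# A template with one format field and one pair of literal braces.
# Doubling the braces ({{ and }}) tells .format() to print them literally.
template = 'Reply in JSON like {{"verdict": "..."}}.\nUSER INPUT: {user_input}'
print(template.format(user_input="Is this song any good?"))

# Without the doubling, .format() reads the braces as a field and fails.
bad_template = 'Reply in JSON like {"verdict": "..."}.\nUSER INPUT: {user_input}'
try:
    bad_template.format(user_input="Is this song any good?")
except (KeyError, ValueError) as err:
    print(f"Formatting failed with {type(err).__name__}")
```

Braces inside the values you pass in (say, lyrics that happen to contain `{`) are not a problem, because only the template is parsed for fields.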


**Load the system prompt**

In [54]:
sysm_dir = Path('system_prompts/rag_system_prompt.txt')
sysm_text = sysm_dir.read_text()
print(sysm_text)

You are a snarky art critic with extensive music knowledge. Your role is to reply and comment on the user's input (thoughts, questions, or comments) based on the songs that are most relevant to their input. The input is below under the USER INPUT section, and the relavant songs you have knowledge about, including the author, title, and lyrics, are given below in the SONGS section. Make sure you ground your answers as much as possible on the songs provided.

# USER INPUT

{user_input}

# SONGS

{songs}


<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"><<*ERASE solutions after debugging*>></span>

1. Create a new text file called `exercise_sysm.txt` or something similar. Write another system prompt with any instructions your heart desires. Incorporate, as format fields, both the user input (which can be generic, as in my case, or can be a question, a thought, a chat exchange, etc.) and the retrieved songs.
2. If you are using Colab, you will need to upload the file, click on the folder icon on the left menu bar to do so.
3. Load the exercise system prompt (call it `exercise_sysm`) and print it to make sure it works.

<hr>

**Format the System Prompt**

In [55]:
test_song = 'Hey Macarena, ay!'
test_question = 'What is the most philosophical song ever?'
full_prompt = sysm_text.format(songs=test_song, user_input=test_question)
print(full_prompt)

You are a snarky art critic with extensive music knowledge. Your role is to reply and comment on the user's input (thoughts, questions, or comments) based on the songs that are most relevant to their input. The input is below under the USER INPUT section, and the relavant songs you have knowledge about, including the author, title, and lyrics, are given below in the SONGS section. Make sure you ground your answers as much as possible on the songs provided.

# USER INPUT

What is the most philosophical song ever?

# SONGS

Hey Macarena, ay!


<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"><<*ERASE solutions after debugging*>></span>

1. Try the above with your exercise prompt, feel free to input other text.

<hr>

**Generate an answer based on the full prompt**

In [59]:
test_response = ollama.generate(
    model=generation_model,
    prompt=full_prompt
)
print(test_response.response)

*Sigh* Oh boy, this is gonna be a challenge. I mean, what's more philosophical than the existential crisis that comes with being forced to dance the Macarena at a wedding reception? "We're gonna make some noise now..." Yeah, because that's exactly what life is all about - conforming to societal norms and suppressing individuality in favor of catchy pop hooks.

You know what's actually more philosophical, though? "The Sound of Silence" by Simon & Garfunkel. The lyrics speak directly to the human condition, with phrases like "Hello darkness, my old friend / I've come to talk with you again" echoing the existential crises that we all face at some point in our lives.

Or maybe it's "Stairway to Heaven" by Led Zeppelin? Robert Plant's wistful lyrics ("There's a lady who's sure all that glitters is gold") tap into the human desire for meaning and connection in a seemingly indifferent world. But hey, if you want to talk about the philosophical implications of dancing the Macarena, be my guest

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"><<*ERASE solutions after debugging*>></span>

1. Repeat, with your exercise system message + prompt.

<hr>

### Connecting retrieval and generation

OK, we have all the elements now. Let's put it all together. Let's perform our first RAAAAG!!!

**Step 1 - You create an index from a large database**

In [20]:
# :: Load the data ::
data_path = Path("data/songs.csv")
lyrics = pd.read_csv(data_path)

# :: Create embeddings ::
model_name = 'all-mpnet-base-v2' 
emb_model = SentenceTransformer(model_name)
embeddings = emb_model.encode(lyrics['Lyrics'], normalize_embeddings=True)

# :: Create index ::
d_emb = len(embeddings[0])
faiss_index = faiss.IndexFlatIP(d_emb)
# Add embeddings to index
faiss_index.add(embeddings)

**Step 2 - You perform a query and find the most relevant vectors in your index**

In [73]:
# :: Search ::
# Make query and embed it
user_query = "Throughout the echoes of history, nobody has thought as originally as me: I will make a song about August! About the memories and moments we shared in August, before it got away from us. Surely there are no other songs about August, right?"
query_emb = emb_model.encode(user_query, normalize_embeddings=True)
# Reshape if necessary
query_emb = query_emb.reshape((1, query_emb.shape[0]))
# Search
k = 2
D_matched, I_matched = faiss_index.search(query_emb, k)

**Step 3a - You format the relevant songs**

In [74]:
relevant_data = []
for i, song_idx in enumerate(I_matched[0]):
    row = lyrics.iloc[song_idx]
    song_data = f"Song {i}\nTitle: {row['Title']}\nAuthor: {row['Artist']}\nLyrics: {row['Lyrics']}"
    relevant_data.append(song_data)
relevant_data = "\n\n".join(relevant_data)

In [75]:
print(relevant_data)

Song 0
Title: august
Author: Taylor Swift
Lyrics: Salt air, and the rust on your door
I never needed anything more
Whispers of "Are you sure?"
"Never have I ever before"

But I can see us lost in the memory
August slipped away into a moment in time
'Cause it was never mine
And I can see us twisted in bedsheets
August sipped away like a bottle of wine
'Cause you were never mine

Your back beneath the sun
Wishin' I could write my name on it
Will you call when you're back at school?
I remember thinkin' I had you

But I can see us lost in the memory
August slipped away into a moment in time
'Cause it was never mine
And I can see us twisted in bedsheets
August sipped away like a bottle of wine
'Cause you were never mine
Back when we were still changin' for the better
Wanting was enough
For me, it was enough
To live for the hope of it all
Cancel plans just in case you'd call
And say, "Meet me behind the mall"
So much for summer love and saying "us"
'Cause you weren't mine to lose
You weren't

**Step 3b - You add to the system prompt!**

In [76]:
sysm_dir = Path('system_prompts/rag_system_prompt.txt')
sysm_text = sysm_dir.read_text()
full_prompt = sysm_text.format(songs=relevant_data, user_input=user_query)
print(full_prompt)

You are a snarky art critic with extensive music knowledge. Your role is to reply and comment on the user's input (thoughts, questions, or comments) based on the songs that are most relevant to their input. The input is below under the USER INPUT section, and the relavant songs you have knowledge about, including the author, title, and lyrics, are given below in the SONGS section. Make sure you ground your answers as much as possible on the songs provided.

# USER INPUT

Throughout the echoes of history, nobody has thought as originally as me: I will make a song about August! About the memories and moments we shared in August, before it got away from us. Surely there are no other songs about August, right?

# SONGS

Song 0
Title: august
Author: Taylor Swift
Lyrics: Salt air, and the rust on your door
I never needed anything more
Whispers of "Are you sure?"
"Never have I ever before"

But I can see us lost in the memory
August slipped away into a moment in time
'Cause it was never mine


**Step 4 - You feed the full prompt to the LLM**

In [77]:
generation_model = "llama3.2"

# The ollama way:
response = ollama.generate(
    model=generation_model,
    prompt=full_prompt
)

And Abracadabra!

In [78]:
print(response.response)

The naivety is endearing. You think you're the first person to ever have a song about August? Please. Taylor Swift's "August" isn't just a nostalgic ode to a fleeting romance; it's a commentary on how memories can become distorted over time, and how we often idealize past experiences.

But, I suppose that's not exactly what you're looking for. You want to sing about the memories of August, about how it slipped away from you? Well, you're in luck because Nat King Cole's "Those Lazy, Hazy, Crazy Days of Summer" is like the ultimate anthem for those carefree summer days.

However, if I were to offer a more nuanced take on your sentiment, I'd say that both songs capture the bittersweet nature of nostalgia. Your memories of August are tinged with the realization that they never truly belonged to you in the first place. It's like you're stuck in the "wish that summer could always be here" refrain from Nat King Cole's song.

You see, just as those lazy days of summer can't last forever, neith

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"><<*ERASE solutions after debugging*>></span>

1. Perform all of the above steps but:
    a. Using the embeddings from the exercises, keeping our original system prompt.
    b. Keeping our original embeddings, using the exercise system prompt.
    c. Using the exercises embeddings and prompt.

<hr>

## (Optional Sections)

**Tokenization**

Tokenization is the process of transforming words into "tokens", which are subdivisions of words and the actual units a model operates with. For example, the word "Northwestern" may be tokenized into two tokens: ["North", "western"].

Usually, the embedding model will take care of tokenization for you. However, sometimes we need to "chunk" our text by tokens, in which case we need to tokenize before embedding. This can be done using HuggingFace's `AutoTokenizer`.

**Chunking**

Chunking is the process of splitting the text into smaller, manageable portions (chunks), which will be embedded and stored for retrieval. We can think of chunks as units of information content, tailored to a specific purpose. Depending on what this purpose is, we can chunk by line, by paragraph, or by number of tokens (for example, 500).

There is a tradeoff in how you choose your chunks: small chunks may miss context, while larger chunks lose resolution.

It is recommended that chunks have overlapping content so that context is preserved. For example, if we have chunks of 500 tokens each, we can overlap them so that the last 100 tokens of one chunk are the first 100 tokens of the next chunk.
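
To see what that overlap looks like in numbers, here is a minimal sketch that computes the start and end positions of each chunk. The helper `chunk_bounds` is hypothetical and works on token positions only, so it needs no tokenizer:

```python
def chunk_bounds(n_tokens: int, chunk_size: int = 500, overlap: int = 100) -> list:
    """Return (start, end) token positions for overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    bounds = []
    for start in range(0, n_tokens, step):
        end = min(start + chunk_size, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break  # the whole text is covered
    return bounds

# A text of 1200 tokens, chunks of 500 tokens, 100 tokens of overlap.
print(chunk_bounds(1200))
# → [(0, 500), (400, 900), (800, 1200)]
```

Each chunk starts 400 tokens after the previous one, so tokens 400 to 499 appear in both the first and the second chunk.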

<span style = "color:red">ADD FIGURE HERE</span>

#### Chunking (Optional)

What should we chunk by?

In our case, since we are dealing with lyrics, chunking by stanzas could work as a nice middle ground. Note that we could also chunk by line, or not chunk at all, and work with the lyrics as a whole.

However ... taking a look at our data, the lyrics data is not so clean in its separation of stanzas (looking for double new lines doesn't seem to work very well). Hence, we're going to have to do it by number of words.

You can try to figure out how to do the chunking manually, or, you can use libraries like [SO AND SO]. Here's a manual chunking function in case you want to follow the logic more closely:

In [None]:
# Let's create a chunking function
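def chunk_lyrics(song: str, chunk_size: int, overlap: int = 10) -> list:
    """Split a song into chunks of `chunk_size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = song.split()
    chunks = []
    step = chunk_size - overlap  # how many words each new chunk advances
    for start in range(0, len(words), step):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        if end == len(words):
            break  # all words are covered; avoids a redundant trailing chunk
    return chunks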

We can test it out:

In [None]:
chunk_lyrics(lyrics['Lyrics'].iloc[0], chunk_size=100)[0]

Now let's make a new dataframe of lyrics that adds all the chunks per song:

In [None]:
lyrics_chunked = []
for i_song, song in lyrics.iterrows():
    chunk_list = chunk_lyrics(song['Lyrics'], chunk_size=100)
    for i_ch, chunk in enumerate(chunk_list):
        lyrics_chunked.append(
            {
                'Song_id': i_song,
                'Chunk_id': f"{i_song}_{i_ch}",
                'Artist': song['Artist'],
                'Title': song['Title'],
                'chunk': chunk
            }
        )
lyrics_chunked = pd.DataFrame(lyrics_chunked)

In [None]:
lyrics_chunked.head()