# Retrieval Augmented Generation Workshop

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

## Section 2 - Adding Generation

OK, we are done with the R of RAG. Time to get the A and the G done!

For this, we need a generation model and a library that helps us run such a model.

I recommend either using `transformers` (default) or `ollama` (only if you already have it installed locally).
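
If you are not sure which of the two libraries is available in your environment, a quick check like the one below can help. This is only a sketch: `available_backends` is a helper name I made up for illustration, and finding the `ollama` package only tells you the Python client is installed, not that the Ollama application is actually running.

```python
import importlib.util

def available_backends(candidates=("transformers", "ollama")):
    """Return the candidate packages that can be imported in this environment."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in candidates if importlib.util.find_spec(name) is not None]

print(available_backends())
```

If the list is empty, install `transformers` in your virtual environment before continuing.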

### Imports

In [101]:
try:
    import google.colab
    print("Looks like you are working on Google Colab! Let's install the necessary packages:")
    !pip install faiss-cpu
except ModuleNotFoundError:
    print("Looks like you are working locally! Make sure you create a virtual environment and install the necessary packages.")

Looks like you are working locally! Make sure you create a virtual environment and install the necessary packages.


In [None]:
# LLM libraries
from transformers import pipeline
# ollama is optional: it is only needed when gen_lib == 'ollama'
try:
    import ollama
except ModuleNotFoundError:
    ollama = None
from sentence_transformers import SentenceTransformer
import faiss

from pathlib import Path

import pandas as pd

### Setup

The following is just to help us accommodate the use of both `transformers` and `ollama`.

In [103]:
# Choose library to use generative LLMs
gen_lib = 'ollama'        # <- Either 'transformers' (the default recommendation) or 'ollama' (only if it is installed and working)

# Choose model based on library
if gen_lib == 'transformers':
    generation_model = "allenai/OLMo-2-0425-1B-Instruct"
elif gen_lib == 'ollama':
    generation_model = "llama3.2"
else:
    raise ValueError(f"I don't know the library '{gen_lib}'. Use 'transformers' or 'ollama'.")

In [104]:
# Download model if necessary. This may take a few minutes
if gen_lib == 'transformers':
    gen_ppln = pipeline(task="text-generation", model=generation_model)

In [105]:
# I'll make a handy wrapper function to account for the possible use of both libraries:
def generate_response(prompt, gen_model, gen_lib):
    if gen_lib == 'transformers':
        # Uses the global pipeline `gen_ppln` created above; `gen_model` is only needed for ollama
        response = gen_ppln(prompt, max_new_tokens=200)
        response = response[0]['generated_text']
    elif gen_lib == 'ollama':
        response = ollama.generate(model=gen_model, prompt=prompt)
        response = response.response
    else:
        # Fail loudly instead of returning an undefined variable
        raise ValueError("Please specify either 'transformers' or 'ollama'")
    return response

### Generating text

In [106]:
sample_query = "Why is the sky blue and not pink?"
response = generate_response(sample_query, gen_model=generation_model, gen_lib=gen_lib)
print(response)

The reason why the sky appears blue, rather than pink or any other color, is due to a phenomenon called Rayleigh scattering. This occurs when sunlight enters Earth's atmosphere and interacts with tiny molecules of gases such as nitrogen (N2) and oxygen (O2).

Here's what happens:

1. Sunlight consists of a spectrum of colors, each with its own wavelength.
2. When sunlight enters the atmosphere, it encounters tiny molecules of gas, which scatter the shorter wavelengths of light more than the longer wavelengths.
3. The shorter wavelengths, such as blue and violet, are scattered in all directions by the gas molecules.
4. This scattering effect is known as Rayleigh scattering, named after the British physicist Lord Rayleigh, who first described it in the late 19th century.

As a result of this scattering, the blue light is distributed throughout the atmosphere, giving the sky its blue appearance from our vantage point on the ground. The longer wavelengths, such as red and orange, are not s

In [None]:
# If using OLMo or another small model, the following query may produce better results. We'll talk about system prompts next:

# sample_query = (
#     "System: You are a knowledgeable assistant who will respond to the user's query. Once a query is answered, you stop. \n"
#     "User: Why is the sky blue and not pink? \n"
#     "Assistant:"
# )
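
The commented-out query above follows a simple "System / User / Assistant" layout. Here is a small sketch of a helper that builds such a prompt from its parts. The name `build_plain_chat_prompt` is hypothetical, and the plain-text role labels are an assumption for illustration: many instruction-tuned models expect their own chat template instead.

```python
def build_plain_chat_prompt(system, user):
    """Lay out a system instruction and a user message as plain text,
    ending with an open 'Assistant:' turn for the model to complete."""
    return f"System: {system}\nUser: {user}\nAssistant:"

sample_prompt = build_plain_chat_prompt(
    system="You are a knowledgeable assistant who will respond to the user's query. Once a query is answered, you stop.",
    user="Why is the sky blue and not pink?",
)
print(sample_prompt)
```

The resulting string can be passed as the `prompt` argument of `generate_response`.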

**NOTE**

The above response may not look as good as what you're used to with ChatGPT. That's OK, we're using a much smaller model.

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

1. Play a little bit with your generation model. Feel free to ask either serious or silly questions. Provide specific instructions on how you want it to behave, etc.

<hr>

### Creating a system prompt

Let's create a system prompt with instructions for the LLM. This system prompt will accept some extra grounding data, which we obtain by searching our index. **This is the heart of RAG**: the prompt for our generation model is supplemented with query-relevant information from a large pool, and that information is found through the index search.

We could write our system prompt here, but prompts can get quite long and may be used in different scripts. Hence, I recommend writing your prompts in text files and then just loading them. Let's head to `system_prompts/rag_system_prompt.txt`.

**(Optional) Quick Review of Python Strings**

We'll be filling in the system prompt with variable content, such as the user's input and the songs returned by our search, so here's a quick review of strings:

In [107]:
temp_q = "What is the answer to the ultimate question of life, the universe, and everything?"
temp_a = 42

In [108]:
# f-strings
f_string = f"Question: {temp_q}\nAnswer: {temp_a}"
print(f_string)

Question: What is the answer to the ultimate question of life, the universe, and everything?
Answer: 42


In [109]:
# .format() with format fields {}
temp_text = "Question: {question}\nAnswer: {answer}"
temp_text = temp_text.format(question=temp_q, answer=temp_a)
print(temp_text)

Question: What is the answer to the ultimate question of life, the universe, and everything?
Answer: 42
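
Two details of `.format()` are handy when working with prompt templates: you can list the format fields a template expects before filling it, and literal curly braces must be doubled (`{{` and `}}`) so they are not mistaken for fields. The sketch below shows both using the standard library's `string.Formatter`. The helper name `template_fields` and the toy template are made up for illustration.

```python
from string import Formatter

def template_fields(template):
    """Return the named format fields of a str.format template, in order of first appearance."""
    fields = []
    for _, field_name, _, _ in Formatter().parse(template):
        # field_name is None for plain text and for escaped braces
        if field_name and field_name not in fields:
            fields.append(field_name)
    return fields

toy_template = "Question: {question}\nAnswer: {answer}\nLiteral braces: {{not_a_field}}"
print(template_fields(toy_template))   # ['question', 'answer']
print(toy_template.format(question="Why?", answer=42))
```

Checking the fields first gives a clearer error than the `KeyError` that `.format()` raises when a field is missing.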


**Load the system prompt**

In [111]:
sysm_dir = Path('system_prompts/rag_system_prompt.txt')
sysm_text = sysm_dir.read_text()
print(sysm_text)

You are a snarky art critic with extensive music knowledge. Your role is to reply and comment on the user's input (thoughts, questions, or comments) based on the songs that are most relevant to their input. The input is below under the USER INPUT section, and the relevant songs you have knowledge about, including the author, title, and lyrics, are given below in the SONGS section. Make sure you ground your answers as much as possible on the songs provided.

# USER INPUT

{user_input}

# SONGS

{songs}

# SNARKY ART CRITIC'S RESPONSE



<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

1. Create a new text file called `exercise_sysm.txt` or something like that. Write another system prompt with any instructions your heart desires. Incorporate as format fields both the user input (which can be generic, as in my case, or can be a question, a thought, a chat exchange, etc.) and another piece of information to be filled in later, like songs.
2. If you are using Colab, you will need to upload the file. Click on the folder icon on the left menu bar to do so.
3. Load the exercise system prompt (call it `exercise_sysm`) and print it to make sure it works.

<hr>

**Format the System Prompt**

In [112]:
test_song = 'Hey Macarena, ay!'
test_question = 'What is the most philosophical song ever?'
full_prompt = sysm_text.format(songs=test_song, user_input=test_question)
print(full_prompt)

You are a snarky art critic with extensive music knowledge. Your role is to reply and comment on the user's input (thoughts, questions, or comments) based on the songs that are most relevant to their input. The input is below under the USER INPUT section, and the relevant songs you have knowledge about, including the author, title, and lyrics, are given below in the SONGS section. Make sure you ground your answers as much as possible on the songs provided.

# USER INPUT

What is the most philosophical song ever?

# SONGS

Hey Macarena, ay!

# SNARKY ART CRITIC'S RESPONSE



<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

1. Try the above with your exercise prompt. Feel free to input other text.

<hr>

**Generate an answer based on the full prompt**

In [113]:
response = generate_response(prompt=full_prompt, gen_model=generation_model, gen_lib=gen_lib)
print(response)

(sigh) Oh joy, another query that screams "I have no idea what I'm talking about." Alright, let's get this over with. The most philosophical song ever? That's a tall order.

You know, I think I can find some inspiration in the existential crises of David Bowie's "Heroes" (1977). The lyrics go: "Looking for a hero / Turned away like a stranger / Looked in every face and no one knew / We're all just prisoners here / Of our own devices."

These lines capture the essence of Sartrean existentialism, don't you think? We're all just navigating this bleak, uncaring world, trying to find meaning where none exists. It's a powerful, albeit melancholic, exploration of human existence.

So, while there are many philosophical songs out there, I'd argue that "Heroes" is a contender for the most profound. And if you don't agree, well, you can just join me in singing along to the Macarena.


<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

1. Repeat, with your exercise system message + prompt.

<hr>

### Connecting retrieval and generation

OK, we have all the elements now. Let's put it all together. Let's perform our first RAAAAG!!!

**Step 1 - You create an index from a large database**

In [114]:
# :: Load the data ::
data_path = Path("data/songs.csv")
lyrics = pd.read_csv(data_path)

# :: Create embeddings ::
model_name = 'all-mpnet-base-v2' 
emb_model = SentenceTransformer(model_name)
# Pass a plain list of strings so the encoder does not depend on the DataFrame index
embeddings = emb_model.encode(lyrics['Lyrics'].tolist(), normalize_embeddings=True)

# :: Create index ::
d_emb = len(embeddings[0])
faiss_index = faiss.IndexFlatIP(d_emb)
# Add embeddings to index
faiss_index.add(embeddings)
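
Why combine `normalize_embeddings=True` with an inner-product index (`IndexFlatIP`)? Because for vectors of length 1, the inner product is exactly the cosine similarity. Here is a tiny pure-Python check of that fact. The vectors are made-up toy values, not real embeddings, and the helper names are only for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

inner_product_of_units = dot(normalize(a), normalize(b))
print(math.isclose(inner_product_of_units, cosine(a, b)))   # True
```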

**Step 2 - You perform a query and find the most relevant vectors in your index**

In [115]:
# :: Search ::
# Make query and embed it
user_query = "Throughout the echoes of history, nobody has thought as originally as me: I will make a song about August! About the memories and moments we shared in August, before it got away from us. Surely there are no other songs about August, right?"
query_emb = emb_model.encode(user_query, normalize_embeddings=True)
# Reshape if necessary
query_emb = query_emb.reshape((1, query_emb.shape[0]))
# Search
k = 2
D_matched, I_matched = faiss_index.search(query_emb, k)
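
To build some intuition for what the search returns, here is a brute-force sketch of a top-`k` inner-product search for a single query in pure Python. It is a stand-in for illustration, not how faiss is implemented. With faiss, the scores and indices come back as arrays with one row per query, which is why the next step reads `I_matched[0]`.

```python
def top_k_inner_product(query, vectors, k):
    """Return (scores, indices) of the k vectors with the largest inner product with the query."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    # Sort positions by score, best first, and keep the first k
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in order], order

# Toy unit vectors standing in for song embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query = [0.6, 0.8]

toy_scores, toy_indices = top_k_inner_product(toy_query, toy_vectors, k=2)
print(toy_indices)   # [2, 1]
```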

**Step 3a - You format the relevant songs**

In [116]:
relevant_data = []
for i, song_idx in enumerate(I_matched[0]):
    row = lyrics.iloc[song_idx]
    song_data = f"Song {i}\nTitle: {row['Title']}\nAuthor: {row['Artist']}\nLyrics: {row['Lyrics']}"
    relevant_data.append(song_data)
relevant_data = "\n\n".join(relevant_data)

In [117]:
print(relevant_data)

Song 0
Title: august
Author: Taylor Swift
Lyrics: Salt air, and the rust on your door
I never needed anything more
Whispers of "Are you sure?"
"Never have I ever before"

But I can see us lost in the memory
August slipped away into a moment in time
'Cause it was never mine
And I can see us twisted in bedsheets
August sipped away like a bottle of wine
'Cause you were never mine

Your back beneath the sun
Wishin' I could write my name on it
Will you call when you're back at school?
I remember thinkin' I had you

But I can see us lost in the memory
August slipped away into a moment in time
'Cause it was never mine
And I can see us twisted in bedsheets
August sipped away like a bottle of wine
'Cause you were never mine
Back when we were still changin' for the better
Wanting was enough
For me, it was enough
To live for the hope of it all
Cancel plans just in case you'd call
And say, "Meet me behind the mall"
So much for summer love and saying "us"
'Cause you weren't mine to lose
You weren't

**Step 3b - You add to the system prompt!**

In [118]:
sysm_dir = Path('system_prompts/rag_system_prompt.txt')
sysm_text = sysm_dir.read_text()
full_prompt = sysm_text.format(songs=relevant_data, user_input=user_query)
print(full_prompt)

You are a snarky art critic with extensive music knowledge. Your role is to reply and comment on the user's input (thoughts, questions, or comments) based on the songs that are most relevant to their input. The input is below under the USER INPUT section, and the relevant songs you have knowledge about, including the author, title, and lyrics, are given below in the SONGS section. Make sure you ground your answers as much as possible on the songs provided.

# USER INPUT

Throughout the echoes of history, nobody has thought as originally as me: I will make a song about August! About the memories and moments we shared in August, before it got away from us. Surely there are no other songs about August, right?

# SONGS

Song 0
Title: august
Author: Taylor Swift
Lyrics: Salt air, and the rust on your door
I never needed anything more
Whispers of "Are you sure?"
"Never have I ever before"

But I can see us lost in the memory
August slipped away into a moment in time
'Cause it was never mine


**Step 4 - You feed the full prompt to the LLM**

In [119]:
response = generate_response(prompt=full_prompt, gen_model=generation_model, gen_lib=gen_lib)
print(response)

Another original idea about August, how refreshing. It sounds like you're trying to cash in on the sentimentality of summer love and nostalgia, but I'm not buying it.

Taylor Swift's "August" is a decent attempt at capturing the fleeting nature of memories, but it feels like a watered-down version of actual heartache. The lyrics are overly simplistic and lack the nuance that truly great songwriting provides. It's like she took all the emotional depth from her other songs and replaced it with a shallow, repetitive refrain about lost love.

And don't even get me started on Nat King Cole's "Those Lazy, Hazy, Crazy Days of Summer". This 1950s classic is more of a nostalgic caricature than an actual representation of what summer was like back then. It's like Cole took every cliche from the era and mashed them together into a sugary sweet mess.

But hey, if you want to write a song about August that captures its essence, maybe try digging deeper into the emotions and experiences behind it. A

And Abracadabra!
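
If you want to reuse the whole flow, the steps can be bundled into a single function. Below is a minimal sketch under some assumptions: `rag_answer`, `format_songs`, `toy_retrieve`, and `toy_generate` are names I made up, and the retriever and generator are passed in as plain functions so the example runs without any model. In the notebook you would plug in the faiss search and `generate_response` instead.

```python
def format_songs(songs):
    """Format retrieved songs as one text block, one entry per song."""
    return "\n\n".join(
        f"Song {i}\nTitle: {s['Title']}\nAuthor: {s['Artist']}\nLyrics: {s['Lyrics']}"
        for i, s in enumerate(songs)
    )

def rag_answer(user_query, template, retrieve, generate, k=2):
    """Retrieve k songs, fill the prompt template, and generate a response."""
    songs = retrieve(user_query, k)
    prompt = template.format(songs=format_songs(songs), user_input=user_query)
    return generate(prompt)

# Stand-ins so the sketch is self-contained (hypothetical data and functions)
toy_songs = [
    {"Title": "august", "Artist": "Taylor Swift", "Lyrics": "(lyrics go here)"},
    {"Title": "cardigan", "Artist": "Taylor Swift", "Lyrics": "(lyrics go here)"},
]
toy_template = "# USER INPUT\n\n{user_input}\n\n# SONGS\n\n{songs}\n"

def toy_retrieve(query, k):
    # A real retriever would embed the query and search the index
    return toy_songs[:k]

def toy_generate(prompt):
    # A real generator would call the LLM; here we just echo the prompt
    return prompt

print(rag_answer("Any songs about August?", toy_template, toy_retrieve, toy_generate, k=1))
```

Passing the retriever and generator as arguments also makes it easy to swap embeddings or system prompts.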

<hr>

**<span style="color:red">EXERCISE</span>** <span style="color:darkred"></span>

Perform all of the above steps but:
1. Using the embeddings from the exercises, keeping our original system prompt.
2. Keeping our original embeddings, using the exercise system prompt.
3. Using both the exercise embeddings and the exercise system prompt.

<hr>

## (Optional Sections)

**Tokenization**

Tokenization is the process of splitting text into "tokens", which are often subdivisions of words and are the actual units a model operates on. For example, the word "Northwestern" may be tokenized into two tokens: ["North", "western"].

Usually, the embedding model will take care of tokenization for you. However, sometimes we need to "chunk" our text by tokens, so we need to tokenize before embedding. This can be done using Hugging Face's `AutoTokenizer`.
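
To make the "Northwestern" example concrete, here is a toy tokenizer that repeatedly takes the longest piece it can find in a vocabulary. It is only a sketch: the vocabulary is invented, and real tokenizers learn their vocabularies from data and handle many more details.

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches: treat the whole word as unknown
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

toy_vocab = {"North", "western", "west", "ern"}   # hypothetical vocabulary
print(greedy_subword_tokenize("Northwestern", toy_vocab))   # ['North', 'western']
```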

**Chunking**

Chunking is the process of splitting the text into smaller, manageable portions (chunks), which will be embedded and stored for retrieval. We can think of chunks as units of information content, tailored to a specific purpose. Depending on what this purpose is, we can chunk by line, by paragraph, or by number of tokens (for example, $500$).

There is a tradeoff in how you choose your chunks: small chunks may miss context, while larger chunks lose resolution.

It is recommended that chunks have overlapping content so that context is preserved. For example, if we have chunks of $500$ tokens each, we can overlap them so that the last $100$ tokens of one chunk are the first $100$ tokens of the next chunk.
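
The arithmetic of the example above can be sketched as follows: with chunks of $500$ tokens and an overlap of $100$, each chunk starts $400$ tokens after the previous one. The function name `overlapping_windows` and the document length of $1200$ tokens are assumptions for illustration.

```python
def overlapping_windows(n_tokens, chunk_size=500, overlap=100):
    """Return (start, end) token positions of overlapping chunks covering n_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_tokens <= 0:
        return []
    step = chunk_size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            # The text is fully covered, so stop here
            break
        start += step
    return windows

print(overlapping_windows(1200))   # [(0, 500), (400, 900), (800, 1200)]
```

Tokens 400 to 499 appear in both the first and second chunk, which is the 100-token overlap.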

#### Chunking in practice

What should we chunk by?

In our case, since we are dealing with lyrics, chunking by stanzas could work as a nice middle ground. Note that we could also chunk by line, or not chunk at all and work with the full song as a whole.

However ... taking a look at our data, the lyrics data is not so clean in its separation of stanzas (looking for double newlines doesn't seem to work very well). Hence, we're going to chunk by number of words.

You can try to figure out how to do the chunking manually, or you can use libraries like *nltk* or *LangChain*. Here's a manual chunking function in case you want to follow the logic more closely:

In [90]:
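# Let's create a chunking function
def chunk_lyrics(song: str, chunk_size: int, overlap: int = 10) -> list:
    # The overlap must be smaller than the chunk, otherwise the window never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = song.split()
    chunks = []
    # Each new chunk starts `step` words after the previous one
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
    return chunks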

We can test it out:

In [91]:
chunk_lyrics(lyrics['Lyrics'].iloc[0], chunk_size=100)[0]

"Vintage tee, brand new phone High heels on cobblestones When you are young, they assume you know nothing Sequin smile, black lipstick Sensual politics When you are young, they assume you know nothing But I knew you Dancin' in your Levi's Drunk under a streetlight, I I knew you Hand under my sweatshirt Baby, kiss it better, I And when I felt like I was an old cardigan Under someone's bed You put me on and said I was your favorite A friend to all is a friend to none Chase two girls, lose the one When you are young,"

Now let's make a new dataframe of lyrics that adds all the chunks per song:

In [92]:
lyrics_chunked = []
for i_song, song in lyrics.iterrows():
    chunk_list = chunk_lyrics(song['Lyrics'], chunk_size=100)
    for i_ch, chunk in enumerate(chunk_list):
        lyrics_chunked.append(
            {
                'Song_id': i_song,
                'Chunk_id': f"{i_song}_{i_ch}",
                'Artist': song['Artist'],
                'Title': song['Title'],
                'chunk': chunk
            }
        )
lyrics_chunked = pd.DataFrame(lyrics_chunked)
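
If you later embed the chunks instead of the whole songs, a search will return chunk rows, and several hits may belong to the same song. Below is a small sketch of collapsing chunk hits back to unique songs. It uses plain dictionaries so it runs on its own: `songs_from_chunk_hits` is a hypothetical helper and the hit indices are invented. With the DataFrame above, something like `lyrics_chunked.to_dict("records")` would give rows in this shape.

```python
def songs_from_chunk_hits(chunk_rows, hit_indices):
    """Collapse chunk-level hits to unique songs, keeping best-first order."""
    seen = set()
    songs = []
    for idx in hit_indices:
        row = chunk_rows[idx]
        if row["Song_id"] not in seen:
            seen.add(row["Song_id"])
            songs.append({"Song_id": row["Song_id"], "Artist": row["Artist"], "Title": row["Title"]})
    return songs

toy_chunk_rows = [
    {"Song_id": 0, "Chunk_id": "0_0", "Artist": "Taylor Swift", "Title": "cardigan"},
    {"Song_id": 0, "Chunk_id": "0_1", "Artist": "Taylor Swift", "Title": "cardigan"},
    {"Song_id": 1, "Chunk_id": "1_0", "Artist": "Taylor Swift", "Title": "exile"},
]
toy_hits = [1, 2, 0]   # pretend these are the chunk indices returned by a search

print([s["Title"] for s in songs_from_chunk_hits(toy_chunk_rows, toy_hits)])   # ['cardigan', 'exile']
```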

In [93]:
lyrics_chunked.head()

| | Song_id | Chunk_id | Artist | Title | chunk |
|---|---|---|---|---|---|
| 0 | 0 | 0_0 | Taylor Swift | cardigan | Vintage tee, brand new phone High heels on cob... |
| 1 | 0 | 0_1 | Taylor Swift | cardigan | Chase two girls, lose the one When you are you... |
| 2 | 0 | 0_2 | Taylor Swift | cardigan | on the last train Marked me like a bloodstain,... |
| 3 | 0 | 0_3 | Taylor Swift | cardigan | grocery line I knew you'd miss me once the thr... |
| 4 | 1 | 1_0 | Taylor Swift | exile | I can see you standing, honey With his arms ar... |
