# Retrieval Augmented Generation

**What is Retrieval Augmented Generation?**

See slides!

**What libraries will we use?**
- Embedding step: `sentence_transformers`, other good options available.
- Indexing step: `faiss`, it's my favorite so far.
- Generation: `ollama` or `transformers`.

**Pre-requisites**
- Basic python, including numpy
- The intro slides

## Imports

TO-DO:
* Install faiss-cpu or faiss-gpu?

In [2]:
# in_google_colab = 1
# if in_google_colab:
#   !pip install faiss-cpu

In [None]:
# # Dictionary to handle cases where package name and module name are different
# package_to_module = {
#     "faiss-cpu": "faiss",
#     "scikit-learn": "sklearn",
#     "umap-learn": "umap"
# }

# Packages to install
# packages = ["faiss-cpu", "numpy", "scikit-learn", "umap-learn"]

# def check_and_install_packages(package_list):
#     """Check if each package in the list is installed, and install it if not."""
#     for package in package_list:
#         # Get the correct module name (or fallback to package name if they are the same)
#         module_name = package_to_module.get(package, package)
#         try:
#             __import__(module_name)  # Try to import the correct module name
#             print(f"{package} ({module_name}) is already installed")
#         except ImportError:
#             print(f"{package} is not installed, installing now...")
#             !pip install {package}
#             # And add import here no?

# # Check and install packages
# check_and_install_packages(packages)

In [4]:
# Helps with threading issues
# import os
# os.environ["OMP_NUM_THREADS"] = "1"
# os.environ["MKL_NUM_THREADS"] = "1"

In [73]:
# LLM libraries
from sentence_transformers import SentenceTransformer
import faiss
# from transformers import pipeline
import ollama

# Machine learning libraries
from umap import UMAP

# Helper libraries
import pandas as pd
import numpy as np
from pathlib import Path

# Visualization libraries
import plotly.express as px
import plotly.graph_objects as go

## Section 1: Building the Retrieval System

### Load Data

We first need "anchoring" data that our system will retrieve as necessary to answer relevant prompts. In our case, we are going to use a dataset of song lyrics.

Indicate the path to the data file:

In [6]:
data_path = Path("data/songs.csv")

Load the data into a pandas dataframe:

In [7]:
lyrics = pd.read_csv(data_path)

Let's take a look at the first few songs:

In [8]:
lyrics.head()

Unnamed: 0,Artist,Title,Lyrics
0,Taylor Swift,cardigan,"Vintage tee, brand new phone\nHigh heels on co..."
1,Taylor Swift,exile,"I can see you standing, honey\nWith his arms a..."
2,Taylor Swift,Lover,We could leave the Christmas lights up 'til Ja...
3,Taylor Swift,the 1,"I'm doing good, I'm on some new shit\nBeen say..."
4,Taylor Swift,Look What You Made Me Do,I don't like your little games\nDon't like you...


Check how many songs we have:

In [9]:
lyrics.shape

(745, 3)

And let's look at the artists we have songs from:

In [10]:
lyrics['Artist'].unique()

array(['Taylor Swift', 'Billie Eilish', 'The Beatles', 'David Bowie',
       'Billy Joel', 'Ed Sheeran', 'Eric Clapton', 'Bruce Springsteen',
       'Vance Joy', 'Lana Del Rey', 'Bryan Adams', 'Leonard Cohen',
       'Nat King Cole', 'twenty one pilots', 'Ray LaMontagne',
       'Bob Dylan', 'John Denver', 'Frank Sinatra', 'Queen', 'Elton John',
       'George Michael'], dtype=object)

### Preprocessing

There are many elements of preprocessing. The main ones you will encounter are:
* Removing boilerplate text, cleaning white spaces, etc.
* Tokenization
* Chunking

I have done the first one for you. In this workshop we may not have time to go over tokenization or chunking, but I've written sections about them at the end of the notebook, in the *Optional* section.

### Create Embeddings

This will be our first crucial step. To create embeddings, you need to have a model that is designed for the same type of data as your data! There are models for text, for images, multimodal, etc.

**Step 1** Choose your model

In [11]:
# Let's pick a popular text model
model_name = 'all-mpnet-base-v2' 

# We create the SentenceTransformer based on our model. This is the object that takes texts and produces embeddings.
emb_model = SentenceTransformer(model_name)

**Step 2** Create embeddings

In [12]:
# Just one line!!
embeddings = emb_model.encode(lyrics['Lyrics'])

Now let's take a second to study our embeddings.

In [13]:
# SentenceTransformer returns numpy arrays. Other libraries may return different data types.
type(embeddings)

numpy.ndarray

In [14]:
# The numpy array is basically an nxd matrix, where n is number of songs and d is the embedding dimension
embeddings.shape

(745, 768)

**Normalization**

Note that our vectors are normalized (see below). This is not always the case, but it is important you know if your vectors are normalized or not. We'll get back to it when we create our index.

In [15]:
np.inner(embeddings[0], embeddings[0])

np.float32(1.0)

In [16]:
# Normalization is not always ensured by default, but you can set it to be so with an argument:
embeddings = emb_model.encode(lyrics['Lyrics'], normalize_embeddings=True)
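To make "normalized" concrete, here is a small numpy sketch. The array `toy_embs` is made up for illustration (it is not our lyrics embeddings): a vector is normalized when its L2 norm is 1, and you can normalize any vector by dividing it by its norm.

```python
import numpy as np

# Hypothetical toy "embeddings": 3 vectors of dimension 4 (not from our dataset)
toy_embs = np.array([
    [3.0, 4.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 0.0],
], dtype=np.float32)

# L2 norm of every row at once, kept as a column so it broadcasts over rows
norms = np.linalg.norm(toy_embs, axis=1, keepdims=True)
print(norms.ravel())  # [5. 2. 2.]

# Divide each row by its norm to get unit-length vectors
toy_normed = toy_embs / norms

# Every row now has norm 1 (up to floating point error)
print(np.allclose(np.linalg.norm(toy_normed, axis=1), 1.0))  # True
```

The same `np.linalg.norm(..., axis=1)` check is a quick way to verify whether embeddings coming from any library are normalized.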

SentenceTransformers provides a handy `similarity` function, which computes the pairwise similarity between two sets of embeddings (here, songs).

In [17]:
# Comparing two songs
emb_model.similarity(embeddings[0], embeddings[1])

tensor([[0.4468]])

The result is a *tensor*, which you can index as you would with numpy arrays:

In [18]:
d01 = emb_model.similarity(embeddings[0], embeddings[1])
print(f"The cosine similarity between song 0 and song 1 is {d01[0,0]}")

The cosine similarity between song 0 and song 1 is 0.4468323588371277


In [19]:
# A maximum of 1 is achieved if the vectors are the same
emb_model.similarity(embeddings[0], embeddings[0])

tensor([[1.0000]])

You can compare one song to multiple songs:

In [20]:
emb_model.similarity(embeddings[0], embeddings[0:5])

tensor([[1.0000, 0.4468, 0.6080, 0.5934, 0.5363]])

Or multiple songs to multiple songs, in which case you get a matrix of similarities:

In [21]:
emb_model.similarity(embeddings[0:5], embeddings[0:5])

tensor([[1.0000, 0.4468, 0.6080, 0.5934, 0.5363],
        [0.4468, 1.0000, 0.5352, 0.5242, 0.6023],
        [0.6080, 0.5352, 1.0000, 0.5728, 0.4673],
        [0.5934, 0.5242, 0.5728, 1.0000, 0.5427],
        [0.5363, 0.6023, 0.4673, 0.5427, 1.0000]])

By default, sentence_transformers uses the `cosine` similarity (see slides). But you can use other distances like the euclidean or manhattan distances (see their documentation).
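If you want to see how these measures differ, here is a minimal numpy sketch that computes them by hand. The vectors `u` and `v` are made up for illustration, and this is plain numpy, not the `sentence_transformers` implementation.

```python
import numpy as np

# Two hypothetical toy vectors (not real lyric embeddings)
u = np.array([1.0, 0.0])
v = np.array([3.0, 4.0])

# Cosine similarity: inner product divided by the product of the norms
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Euclidean (L2) distance: length of the difference vector
euclidean = np.linalg.norm(u - v)

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan = np.abs(u - v).sum()

print(f"cosine similarity:  {cosine:.3f}")     # 0.600
print(f"euclidean distance: {euclidean:.3f}")  # 4.472
print(f"manhattan distance: {manhattan:.3f}")  # 6.000
```

Keep in mind the direction of each measure: for a similarity, higher means closer, while for a distance, lower means closer.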

### Visualization for Intuition

**Similarity Heatmap**

Before continuing, let's try to develop an intuition about these embeddings.

First, recall that embeddings are vectors in a d-dimensional space, where d is quite large. We can't visualize them directly, but we can see how they interact with each other. Let's look at a heatmap of the similarities between different artists' songs.

All the code in this section will be skipped! It is not important for our workshop.

In [22]:
n_heatmap = 5
a_artist = 'Taylor Swift'
b_artist = 'Bob Dylan'

In [23]:
lyrics[lyrics['Artist']==a_artist].head(n_heatmap)

Unnamed: 0,Artist,Title,Lyrics
0,Taylor Swift,cardigan,"Vintage tee, brand new phone\nHigh heels on co..."
1,Taylor Swift,exile,"I can see you standing, honey\nWith his arms a..."
2,Taylor Swift,Lover,We could leave the Christmas lights up 'til Ja...
3,Taylor Swift,the 1,"I'm doing good, I'm on some new shit\nBeen say..."
4,Taylor Swift,Look What You Made Me Do,I don't like your little games\nDon't like you...


In [24]:
lyrics[lyrics['Artist']==b_artist].head(n_heatmap)

Unnamed: 0,Artist,Title,Lyrics
575,Bob Dylan,Murder Most Foul,"'Twas a dark day in Dallas, November '63\nA da..."
576,Bob Dylan,Blowin’ in the Wind,How many roads must a man walk down\nBefore yo...
577,Bob Dylan,The Times They Are A-Changin’,"Come gather 'round people, wherever you roam\n..."
578,Bob Dylan,All Along the Watchtower,"""There must be some way out of here""\nSaid the..."
579,Bob Dylan,Like a Rolling Stone,Once upon a time you dressed so fine\nThrew th...


In [25]:
# Let's save the indices for easy access
a_idxs = lyrics[lyrics['Artist']==a_artist].index.to_list()[:n_heatmap]
b_idxs = lyrics[lyrics['Artist']==b_artist].index.to_list()[:n_heatmap]

# subset embeddings of the first n songs of each artist
a_embs = embeddings[a_idxs]
b_embs = embeddings[b_idxs]
both_embs = np.concatenate((a_embs, b_embs), axis=0)

# we'll use this in our visualization (titles must match the embeddings selected above):
a_titles = lyrics['Title'].loc[a_idxs].to_list()
b_titles = lyrics['Title'].loc[b_idxs].to_list()
a_titles = [title[:20] for title in a_titles] # truncating text
b_titles = [title[:20] for title in b_titles]
both_titles = a_titles + b_titles

# compute their similarity, we want to visualize this with a heatmap
fl_sim_matrix = emb_model.similarity(both_embs, both_embs)

In [26]:
fig = px.imshow(
    fl_sim_matrix,
    x=both_titles,
    y=both_titles,
    color_continuous_scale="Viridis",
    text_auto=".2f"
)

fig.update_layout(
    title=f"Cosine Similarity Among {a_artist} and {b_artist} Lyrics",
    width=750,
    height=750,
    xaxis=dict(tickangle=45)
)
fig.show()

**Dimensionality Reduction**

We can also visualize embeddings by looking at them in a lower dimension. We will use a machine learning technique called *dimensionality reduction*. You don't need to know how it's done, and don't worry about the code; we'll use it only for visualization purposes.

(If you attend the topic modeling workshop, you may learn about it in more depth).

In [27]:
dimred_model = UMAP(
    n_neighbors=3,  # umap hyper-parameter
    n_components=2, # dimension we are reducing to
    metric='cosine'
)

two_d_rep = dimred_model.fit_transform(embeddings)

In [28]:
fig_clustering = go.Figure()

fig_clustering.add_trace(go.Scatter(
    x=two_d_rep[:, 0],
    y=two_d_rep[:,1],
    mode='markers',
    marker=dict(size=6),
    text=lyrics['Title'],
    hoverinfo='text'
))

fig_clustering.update_layout(
    height=750, width=750,
    title='Low dimensional view of embedded lyrics',
)

fig_clustering.show()

In [29]:
# A reminder of the artists, which one would you like to see?
lyrics['Artist'].unique()

array(['Taylor Swift', 'Billie Eilish', 'The Beatles', 'David Bowie',
       'Billy Joel', 'Ed Sheeran', 'Eric Clapton', 'Bruce Springsteen',
       'Vance Joy', 'Lana Del Rey', 'Bryan Adams', 'Leonard Cohen',
       'Nat King Cole', 'twenty one pilots', 'Ray LaMontagne',
       'Bob Dylan', 'John Denver', 'Frank Sinatra', 'Queen', 'Elton John',
       'George Michael'], dtype=object)

In [30]:
# Let's highlight an artist's songs just for fun:
artist_highlight = 'John Denver'
artist_idxs = lyrics[lyrics['Artist']==artist_highlight].index.to_list()

fig_clustering = go.Figure()

fig_clustering.add_trace(go.Scatter(
    x=two_d_rep[:, 0],
    y=two_d_rep[:,1],
    mode='markers',
    marker=dict(size=6),
    text=lyrics['Title'],
    hoverinfo='text',
    name = 'All artists'
))

fig_clustering.add_trace(go.Scatter(
    x=two_d_rep[artist_idxs, 0],
    y=two_d_rep[artist_idxs,1],
    mode='markers',
    marker=dict(size=6, color='crimson'),
    text = lyrics['Title'].iloc[artist_idxs],
    hoverinfo = 'text',
    name = artist_highlight
))

fig_clustering.update_layout(
    height=750, width=750,
    title='Low dimensional view of embedded lyrics',
)

fig_clustering.show()

### Create Index

OK, back to our main task. Now we have this collection of high-dimensional embeddings. It's time to create our index! As a reminder, an index is a data structure which efficiently allows us to find the most similar vectors in our collection to one reference point (usually, a user's query). See slides if you need a refresher.

Creating a faiss index takes one line!

In [31]:
# We need the dimension of our embeddings
d_emb = len(embeddings[0])

# Create a faiss index
faiss_index = faiss.IndexFlatIP(d_emb) # <-- creating the d-dimensional index (empty for now)
print(faiss_index.is_trained)
print(faiss_index.ntotal)

True
0


We will talk about the meaning of `FlatIP` later. For now, just know that this index works with the cosine similarity **only because our embeddings are normalized**.

### Add embeddings to index

In [32]:
faiss_index.add(embeddings)

In [33]:
print(faiss_index.is_trained)
print(faiss_index.ntotal)

True
745


### Search

In [34]:
# Make a query and embed it, don't forget to normalize!
query_1 = "Life is good and I will survive. I am happy that things turned out this way"
qemb_1 = emb_model.encode(query_1, normalize_embeddings=True)

# Let's remind ourselves how the embedding is returned:
print(f"Query embedding info:\n\nType: {type(qemb_1)}\nShape: {qemb_1.shape}")

Query embedding info:

Type: <class 'numpy.ndarray'>
Shape: (768,)


In theory, we can ask `faiss` to find the closest songs to it with just one line of code:

```python
faiss_index.search(qemb_1, 1)
```

However, the above will produce an error because `faiss` expects a (n,d) numpy array, but when we encode only one query we get a (d,) numpy array back. Therefore, we need to reshape!

In [35]:
qemb_1 = qemb_1.reshape((1, qemb_1.shape[0])) # reshape into (1,d)

print(f"Query embedding info:\n\nType: {type(qemb_1)}\nShape: {qemb_1.shape}")

Query embedding info:

Type: <class 'numpy.ndarray'>
Shape: (1, 768)


Now we can search!

In [36]:
faiss_index.search(qemb_1, 1) # the 1 indicates how many close neighbors to find.

(array([[0.42986214]], dtype=float32), array([[294]]))

The first element, `[[0.43]]`, is the cosine similarity between the query and its closest neighbor. The second element, `[[294]]`, is the index (row number in `lyrics`) of that neighbor.

In [37]:
neighbor_sim, neighbor_idx = faiss_index.search(qemb_1, 1)

print(f"The cosine similarity to the closest neighbor is: {neighbor_sim[0,0]:.2f}\n")
print(f"And the neighbor is:\n{lyrics.iloc[neighbor_idx[0, 0]]}")

The cosine similarity to the closest neighbor is: 0.43

And the neighbor is:
Artist                                         Eric Clapton
Title                                       I’ll Be Alright
Lyrics    I'll be alright\nI'll be alright\nI'll be alri...
Name: 294, dtype: object


If we wanted to find the k closest neighbors:

In [38]:
k = 3
faiss_index.search(qemb_1, k)

(array([[0.42986214, 0.41438007, 0.3375955 ]], dtype=float32),
 array([[294, 608, 566]]))

### Searching multiple queries

You can search for multiple queries at the same time. However, some `faiss` versions are buggy when that happens, so to avoid any complications we'll search individually for each query.
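For reference, if batched search does work in your environment, `faiss` expects all queries stacked into a single (n, d) `float32` array, not a Python list of (1, d) arrays. Here is a minimal sketch of the stacking step with made-up vectors. The names `qemb_a`, `qemb_b`, and `qembs_stacked` are hypothetical, and the `faiss` call is left as a comment because it depends on your install.

```python
import numpy as np

# Hypothetical stand-ins for two embedded queries, each already reshaped to (1, d)
d = 4
qemb_a = np.ones((1, d), dtype=np.float32)
qemb_b = np.zeros((1, d), dtype=np.float32)

# Stack the list of (1, d) arrays into a single (n, d) array
qembs_stacked = np.vstack([qemb_a, qemb_b])
print(qembs_stacked.shape, qembs_stacked.dtype)  # (2, 4) float32

# With a faiss index of dimension d, a batched call would then look like:
# D, I = faiss_index.search(qembs_stacked, k)   # D and I both have shape (n, k)
```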

In [39]:
# Let's make a helper function that automatically normalizes and reshapes embeddings for us:
def embed_reshape(query):
    qemb = emb_model.encode(query, normalize_embeddings=True)
    qemb = qemb.reshape((1, qemb.shape[0]))
    return qemb

In [40]:
query_1 = "Life is good and I will survive. I am happy that things turned out this way"
query_2 = "Why did you leave me? I am so sad. The world is so cruel."
queries = [query_1, query_2]

In [41]:
# Embed the queries, remember to normalize
qembs = [embed_reshape(q) for q in queries]

Let's find the 4 closest songs to each query.

In [46]:
k = 4
D_matched = []
I_matched = []
for qe in qembs:
    dists_q_matched, idxs_q_matched = faiss_index.search(qe, k)
    D_matched.append(dists_q_matched)
    I_matched.append(idxs_q_matched)
# Batched alternative (needs one (n, d) array rather than a list):
# distances_q_matched, indices_q_matched = faiss_index.search(np.vstack(qembs), k)

In [48]:
D_matched

[array([[0.42986214, 0.41438007, 0.3375955 , 0.33043662]], dtype=float32),
 array([[0.44099832, 0.43135837, 0.41771904, 0.41135132]], dtype=float32)]

In [57]:
I_matched[0][0]

array([294, 608, 566, 621])

In [62]:
# Let's look at matched songs for the first query:
print(f"Matched song to your query \'{query_1}\':\n")
for i in I_matched[0][0]:
    artist = lyrics['Artist'].iloc[i]
    title = lyrics['Title'].iloc[i]
    song_lyrics = lyrics['Lyrics'].iloc[i]
    print(f"Artist: {artist}\nTitle: {title}\nLyrics:{song_lyrics[:100]}\n")

Matched song to your query 'Life is good and I will survive. I am happy that things turned out this way':

Artist: Eric Clapton
Title: I’ll Be Alright
Lyrics:I'll be alright
I'll be alright
I'll be alright someday
If in my heart
I do not give
Then I'll be al

Artist: John Denver
Title: Poems, Prayers, & Promises
Lyrics:I've been lately thinking
About my life's time
All the things I've done
And how it's been
And I can'

Artist: Ray LaMontagne
Title: Part of the Light
Lyrics:Why so many people always runnin' 'round
Looking for a happiness that can't be found?
I don't know
I

Artist: John Denver
Title: Matthew
Lyrics:I had an Uncle name of Matthew
He was his father's only boy
Born just south of Colby, Kansas
He was 



In [66]:
# Let's look at matched songs for the second query:
print(f"Matched song to your query \'{query_2}\':\n")
for i in I_matched[1][0]:
    artist = lyrics['Artist'].iloc[i]
    title = lyrics['Title'].iloc[i]
    song_lyrics = lyrics['Lyrics'].iloc[i]
    print(f"Artist: {artist}\nTitle: {title}\nLyrics:{song_lyrics[:100]}\n")

Matched song to your query 'Why did you leave me? I am so sad. The world is so cruel.':

Artist: Billie Eilish
Title: ​bitches broken hearts
Lyrics:You can pretend you don't miss me (Me)
You can pretend you don't care
All you wanna do is kiss me (M

Artist: Queen
Title: Too Much Love Will Kill You
Lyrics:I'm just the pieces
Of the man I used to be
Too many bitter tears
Are raining down on me
I'm far awa

Artist: Billie Eilish
Title: ​goodbye
Lyrics:Please, please
Don't leave﻿ me
Be

It's not true
Take me to the rooftop
Told you not to worry
What d

Artist: Queen
Title: Love of My Life
Lyrics:Love of my life, you've hurt me
You've broken my heart
And now you leave me

Love of my life, can't 



### RECAP

OK, let's do a quick recap. Now that we have developed an intuition, we can perform the above steps super quickly:

In [68]:
# :: Load the data ::
data_path = Path("data/songs.csv")
lyrics = pd.read_csv(data_path)

# :: Create embeddings ::
model_name = 'all-mpnet-base-v2' 
emb_model = SentenceTransformer(model_name)
embeddings = emb_model.encode(lyrics['Lyrics'], normalize_embeddings=True)

# :: Create index ::
d_emb = len(embeddings[0])
faiss_index = faiss.IndexFlatIP(d_emb)
# Add embeddings to index
faiss_index.add(embeddings)

# :: Search ::
# Make query and embed it
query_1 = "Life is good and I will survive. I am happy that things turned out this way"
qemb_1 = emb_model.encode(query_1, normalize_embeddings=True)
# Reshape if necessary
qemb_1 = qemb_1.reshape((1, qemb_1.shape[0]))
# Search
k = 2
faiss_index.search(qemb_1, k)

(array([[0.42986214, 0.41438007]], dtype=float32), array([[294, 608]]))

And there you have it!! A full retrieval system in less than 20 lines :-)

### (Optional) A note about indexes

**IndexFlatIP vs IndexFlatL2**

IP stands for *inner product*, while L2 stands for *L2 norm* (euclidean distance). In general, these may produce different results, so you need to choose carefully. A simple heuristic is:
- Text data: `cosine similarity`.
- Image data: `euclidean distance`.

But, you should still do some research on the model you are using, the data type, and what you care about (direction, magnitude, etc.). More details on the slides.

**Cosine Similarity, Inner Product, Normalization**

If your vectors are *normalized* (unit length), the cosine similarity is the same as the inner product. If they are not, the two can rank neighbors differently. In our case, since our embeddings are normalized, we can use `IndexFlatIP` and it will be equivalent to using the cosine similarity, which is what we want.
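Here is a small numpy sketch of why normalization matters for an inner product index. The data is made up for illustration: with unnormalized vectors, the raw inner product favors the candidate with the larger magnitude, while the cosine similarity favors the candidate pointing in the same direction as the query.

```python
import numpy as np

# Hypothetical toy data: one query and two candidate vectors
query = np.array([1.0, 0.0])
cands = np.array([
    [1.0, 0.0],    # same direction as the query, small magnitude
    [10.0, 10.0],  # 45 degrees away from the query, large magnitude
])

def normalize(x):
    """Scale each row (or a single vector) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Raw inner product rewards magnitude: the second candidate wins
raw_ip = cands @ query

# Cosine similarity only looks at direction: the first candidate wins
cosine = (cands @ query) / (np.linalg.norm(cands, axis=1) * np.linalg.norm(query))

# Inner product of normalized vectors matches the cosine similarity
normed_ip = normalize(cands) @ normalize(query)

print("raw inner product:       ", raw_ip)
print("cosine similarity:       ", cosine)
print("normalized inner product:", normed_ip)
```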

**Flat vs Other Indexes**

Our flat index is not the most computationally efficient. It doesn't quantize vectors (see slides) and all searches are brute force (that is, it will compare a query to all vectors in the index). For our toy dataset this is fine, but for larger datasets you should use other indices. Different libraries have implementations of different indices, for example `faiss` has the hierarchical navigable small world index `IndexHNSWFlat`, which is more search efficient, but returns approximate results.
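To make "brute force" concrete, here is a simplified numpy sketch of what a flat inner product search does conceptually: score the query against every stored vector, then keep the top k. The function `flat_ip_search` and the toy data are hypothetical; this illustrates the idea and is not how `faiss` is actually implemented.

```python
import numpy as np

def flat_ip_search(index_vectors, queries, k):
    """Brute-force top-k inner-product search.

    index_vectors: (n, d) array, queries: (m, d) array.
    Returns (scores, ids), both of shape (m, k), best match first.
    """
    # Inner product of every query with every stored vector: shape (m, n)
    scores = queries @ index_vectors.T
    # Sort each row by descending score and keep the k best column indices
    ids = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, ids, axis=1)
    return top_scores, ids

# Hypothetical toy data: 5 unit vectors in 2D and a single query of shape (1, 2)
angles = np.deg2rad([0, 30, 60, 90, 180])
toy_index = np.stack([np.cos(angles), np.sin(angles)], axis=1)
toy_query = np.array([[np.cos(np.deg2rad(25)), np.sin(np.deg2rad(25))]])

scores, ids = flat_ip_search(toy_index, toy_query, k=2)
print(ids)     # [[1 0]]: the vectors at 30 and 0 degrees are closest to 25 degrees
print(scores)
```

The cost grows with the number of stored vectors, since every query touches all of them. That is the price a flat index pays for exact results.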

We can't go over all index types in this workshop, but here's a handy table for the ones `faiss` offers:
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.

### (Optional) Save index on disk

In general you will create a large index, and store it on disk for further use. This could be purely for convenience (you don't need to create a new index every time you use your RAG system) but will be vital if your index is quite large and memory becomes a limitation.

In [None]:
# Saving to a .index file
index_directory = 'rag_workshop.index'
faiss.write_index(faiss_index, index_directory)

If you need to read it later on, you can use `faiss.read_index(index_directory, faiss.IO_FLAG_MMAP)`. The MMAP flag tells it not to load the full index into memory.

## Section 2 -  Adding Generation

OK, we are done with the R of RAG. Time to get the A and the G done!

For this, we need a generation model and a library that helps us run such a model.

I recommend using either `transformers` or `ollama`.

In [None]:
# The transformers way, you'll need to provide an access token though
# generation_model = "meta-llama/Llama-3.2-1B" # you can also try "google/gemma-2-2b" for example
# gen_ppln = pipeline(task="text-generation", model=generation_model)

In [76]:
generation_model = "llama3.2"

# The ollama way:
test_response = ollama.generate(
    model=generation_model,
    prompt="Why is the sky blue and not pink?"
)

print(test_response.response)

The sky appears blue to us during the daytime because of a phenomenon called Rayleigh scattering. Here's what happens:

1. **Sunlight enters Earth's atmosphere**: When sunlight enters our atmosphere, it consists of a spectrum of colors, including all the colors of the visible light spectrum.
2. **Light interacts with tiny molecules**: The shorter (blue) wavelengths of light are scattered more than the longer (red) wavelengths by the tiny molecules of gases in the atmosphere, such as nitrogen and oxygen.
3. **Scattered light is dispersed in all directions**: As a result of this scattering, the blue light is dispersed in all directions and reaches our eyes from all parts of the sky.
4. **Our eyes perceive the blue color**: Because our eyes are most sensitive to blue light, we perceive the sky as blue.

Now, why doesn't the sky appear pink? There are a few reasons:

1. **Pink light has a longer wavelength**: Pink light has a longer wavelength than blue light, which means it is less scatte

### Creating a system prompt

### Connecting retrieval and generation

### Evaluation

### Structured Output (Optional)

## Section 3 - Full Pipeline with transformers (or something else)

## (Optional Sections)

**Tokenization**

Tokenization is the process of transforming words into "tokens",  which are subdivisions of words and the actual units a model operates with. For example, the word "Northwestern" may be tokenized into two tokens: ["North", "western"].

Usually, the embedding model (see the *Create Embeddings* section) will take care of tokenization for you. However, sometimes we need to "chunk" our text by tokens, so we'll need to tokenize before embedding. This can be done using HuggingFace's `AutoTokenizer`.

**Chunking**

Chunking is the process of splitting the text into smaller, manageable portions (chunks), which will be embedded and stored for retrieval. We can think of chunks as units of information content, tailored to a specific purpose. Depending on what this purpose is, we can chunk by line, by paragraph, or by number of tokens (for example, 500).

There is a tradeoff in how you choose your chunks: small chunks may miss context, while larger chunks lose resolution.

It is recommended that chunks have overlapping content so that context is preserved. For example, if we have chunks of 500 tokens each, we can overlap them so that the last 100 tokens of one chunk are the first 100 tokens of the next chunk.
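As a quick illustration of the overlap example above, here is a minimal sketch that only computes where each chunk starts and ends. The function name `chunk_bounds` and the 1200-token document length are made up for illustration.

```python
def chunk_bounds(n_tokens, chunk_size, overlap):
    """Return (start, end) token positions of overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts this many tokens after the previous one
    bounds = []
    for start in range(0, n_tokens, step):
        end = min(start + chunk_size, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:  # the text is fully covered, stop here
            break
    return bounds

# Hypothetical document of 1200 tokens, chunks of 500 with an overlap of 100
for start, end in chunk_bounds(1200, chunk_size=500, overlap=100):
    print(f"tokens {start:>4} to {end:>4}")
```

Each chunk starts 400 tokens after the previous one, so tokens 400 to 500 appear in both the first and the second chunk, and so on.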

<span style = "color:red">ADD FIGURE HERE</span>

#### Chunking (Optional)

What should we chunk by?

In our case, since we are dealing with lyrics, chunking by stanzas could work as a nice middle ground. Note that we could also chunk by line, or not chunk at all, and work with the lyrics as a whole.

However ... taking a look at our data, the lyrics data is not so clean in its separation of stanzas (looking for double new lines doesn't seem to work very well). Hence, we're going to have to do it by number of words.

You can try to figure out how to do the chunking manually, or, you can use libraries like [SO AND SO]. Here's a manual chunking function in case you want to follow the logic more closely:

In [None]:
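# Let's create a chunking function
def chunk_lyrics(song:str, chunk_size:int, overlap:int = 10) -> list:
    """Split a song into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = song.split()
    chunks = []
    step = chunk_size - overlap  # how far each new chunk starts after the previous one
    for start in range(0, len(words), step):
        end = min(start + chunk_size, len(words))  # don't run past the last word
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
    return chunks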

We can test it out:

In [None]:
chunk_lyrics(lyrics['Lyrics'].iloc[0], chunk_size=100)[0]

"Vintage tee, brand new phone High heels on cobblestones When you are young, they assume you know nothing Sequin smile, black lipstick Sensual politics When you are young, they assume you know nothing But I knew you Dancin' in your Levi's Drunk under a streetlight, I I knew you Hand under my sweatshirt Baby, kiss it better, I And when I felt like I was an old cardigan Under someone's bed You put me on and said I was your favorite A friend to all is a friend to none Chase two girls, lose the one When you are young,"

Now let's make a new dataframe of lyrics that adds all the chunks per song:

In [None]:
lyrics_chunked = []
for i_song, song in lyrics.iterrows():
    chunk_list = chunk_lyrics(song['Lyrics'], chunk_size=100)
    for i_ch, chunk in enumerate(chunk_list):
        lyrics_chunked.append(
            {
                'Song_id': i_song,
                'Chunk_id': f"{i_song}_{i_ch}",
                'Artist': song['Artist'],
                'Title': song['Title'],
                'chunk': chunk
            }
        )
lyrics_chunked = pd.DataFrame(lyrics_chunked)

In [None]:
lyrics_chunked.head()

Unnamed: 0,Song_id,Chunk_id,Artist,Title,chunk
0,0,0_0,Taylor Swift,cardigan,"Vintage tee, brand new phone High heels on cob..."
1,0,0_1,Taylor Swift,cardigan,"Chase two girls, lose the one When you are you..."
2,0,0_2,Taylor Swift,cardigan,"on the last train Marked me like a bloodstain,..."
3,0,0_3,Taylor Swift,cardigan,grocery line I knew you'd miss me once the thr...
4,1,1_0,Taylor Swift,exile,"I can see you standing, honey With his arms ar..."
