This notebook begins integrating the data into a RAG system. Given user-supplied lyrics, we compute their embedding, use FAISS to retrieve the k most similar lyrics from our data, and have the LLM predict a popularity score and explain it based on the popularity and audio features of those similar songs.

There is still work to be done:

- we use the LLM for both prediction and explanation; using a dedicated prediction model instead could improve reproducibility (consistent scores)
- combine song title and artist into the embedding used for FAISS
- better prompting techniques
- use DeepEval to evaluate the performance of the RAG system
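
As a rough illustration of the first item above, one possible deterministic baseline is an inverse-distance-weighted mean of the retrieved neighbors' popularity scores. This is only a sketch: the function name `weighted_popularity`, the `eps` parameter, and the weighting scheme are assumptions and are not part of this notebook.

```python
from typing import Sequence


def weighted_popularity(
    distances: Sequence[float],
    popularities: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Hypothetical baseline: average neighbor popularity, weighting closer neighbors more."""
    if len(distances) != len(popularities) or len(distances) == 0:
        raise ValueError("distances and popularities must be non-empty and equal in length")

    # A smaller distance gives a larger weight; eps avoids division by zero.
    weights = [1.0 / (d + eps) for d in distances]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, popularities)) / total


# Two equally distant neighbors contribute equally.
baseline = weighted_popularity([1.0, 1.0], [40.0, 60.0])
```

Unlike an LLM call, the same neighbors always yield the same score, which is the reproducibility benefit the list item refers to.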

### Downloading the model first

In [12]:
import os, certifi
os.environ["SSL_CERT_FILE"] = certifi.where()
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()

In [13]:
from sentence_transformers import SentenceTransformer
import tqdm as tqdm

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")
print("Model loaded")

Model loaded


### Reading in data and making df

In [1]:
import numpy as np
import pandas as pd
from typing import List, Dict, Any

# Load the five embedding shards and stack them into one DataFrame.
# ignore_index=True gives a clean 0..n-1 index that lines up with FAISS row positions.
shard_paths = [f"data/lyric_embeddings/librosa_shard_{i}.parquet" for i in range(5)]
df = pd.concat([pd.read_parquet(path) for path in shard_paths], ignore_index=True)

### REAL embedding functions

In [3]:
import faiss

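# Stack the per-song embeddings into an (n_songs, dim) float32 matrix.
emb_list = [np.asarray(x, dtype="float32") for x in df["lyrics_embedding"].values]
emb_matrix = np.stack(emb_list, axis=0)
dimension = emb_matrix.shape[1]

# Exact (brute-force) index; searches return squared L2 distances.
index = faiss.IndexFlatL2(dimension)
index.add(emb_matrix)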

print("FAISS index built with", index.ntotal, "vectors.")

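def retrieve_similar_songs(query_embedding: np.ndarray, k: int = 5) -> List[Dict[str, Any]]:
    """Return the k nearest songs as dicts with a row position and a distance."""
    # FAISS expects a 2-D float32 array of shape (n_queries, dim).
    query_vector = np.asarray(query_embedding, dtype="float32").reshape(1, -1)
    D, I = index.search(query_vector, k)

    neighbors = []
    for idx, dist in zip(I[0], D[0]):
        # FAISS pads with -1 when fewer than k results are available.
        if idx != -1:
            neighbors.append({
                "index": int(idx),
                # NOTE: IndexFlatL2 returns squared L2 distances, so despite the
                # key name, LOWER values mean MORE similar lyrics.
                "similarity": float(dist)
            })

    return neighbors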


FAISS index built with 20740 vectors.
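
Because `IndexFlatL2` reports squared L2 distances, the values stored under `"similarity"` are not cosine similarities. If both vectors have unit length, the two are related by cos = 1 - d²/2. The sketch below checks that identity with NumPy only; it assumes unit-normalized vectors (the query is normalized in `embed_lyrics`, but whether the stored `lyrics_embedding` vectors are normalized is not confirmed here), and `l2sq_to_cosine` is a hypothetical helper name.

```python
import numpy as np


def l2sq_to_cosine(squared_l2):
    """Convert squared L2 distances between UNIT vectors to cosine similarity."""
    return 1.0 - np.asarray(squared_l2, dtype=np.float64) / 2.0


# Two random unit vectors stand in for a query and a stored embedding.
rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_distance = float(np.sum((a - b) ** 2))  # what IndexFlatL2 would report
cosine_from_distance = float(l2sq_to_cosine(squared_distance))
cosine_direct = float(a @ b)
```

If the stored embeddings are not normalized, the identity does not hold and the raw distance should be presented as a distance instead.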


In [19]:
import re
import unicodedata


def clean_lyrics_for_query(text: str) -> str:
    """
    Simple lyric cleaner for user queries.

    • Lowercases text
    • Removes [chorus], [verse 1], etc.
    • Flattens newlines / \n into spaces
    • Strips things like (prod. xxx), (remix)
    • Drops repeat markers like x2, x3
    • Keeps letters (any language), digits, spaces, apostrophes
    """

    if not isinstance(text, str):
        return ""

    text = text.lower()

    text = re.sub(r"\[.*?\]", " ", text)

    text = text.replace("\\n", " ").replace("\n", " ")

    # remove (prod. ...), (remix ...); [^()]* keeps each match inside a single pair of parentheses
    text = re.sub(r"\([^()]*prod[^()]*\)", " ", text)
    text = re.sub(r"\([^()]*remix[^()]*\)", " ", text)

    # remove x2, x3, etc.
    text = re.sub(r"\bx\d+\b", " ", text)

    chars = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L") or cat.startswith("N") or ch in [" ", "'", "’"]:
            chars.append(ch)

    text = "".join(chars)

    text = re.sub(r"\s+", " ", text).strip()
    return text




def embed_lyrics(text: str) -> np.ndarray:
    cleaned = clean_lyrics_for_query(text)

    emb = model.encode(
        [cleaned],
        convert_to_numpy=True,
        normalize_embeddings=True
    )

    vec = emb[0]

    vec = np.asarray(vec, dtype="float32").reshape(-1)

    return vec
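
When stripping parenthetical credits such as `(prod. ...)`, a lazy `.*?` between the parentheses can start at an unrelated earlier `(` and swallow the lyric text in between. The sketch below compares that behavior with a pattern that excludes parentheses inside the match. The sample text and the names `LAZY` and `BOUNDED` are made up for illustration.

```python
import re

# Lazy pattern: ".*?" may cross a closing parenthesis on its way to "prod".
LAZY = re.compile(r"\(.*?prod.*?\)")
# Bounded pattern: "[^()]*" cannot leave the current pair of parentheses.
BOUNDED = re.compile(r"\([^()]*prod[^()]*\)")

text = "(hey) love you (prod. someone)"


def squeeze(s: str) -> str:
    """Collapse whitespace the same way the cleaner's final step does."""
    return re.sub(r"\s+", " ", s).strip()


lazy_out = squeeze(LAZY.sub(" ", text))        # the whole line is removed
bounded_out = squeeze(BOUNDED.sub(" ", text))  # only the credit is removed
```

Here `lazy_out` is an empty string while `bounded_out` keeps `(hey) love you`, so the bounded form loses less of the lyric before embedding.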



In [5]:
audio_feature_cols = list(df.columns[df.columns.get_loc("duration")+1:])
df.head()

Unnamed: 0,song_id,title,artist,query_title,query_artist,track_genre,popularity,lyrics,preview_url,track_id,...,spectral_contrast_6,spectral_contrast_7,tonnetz_1,tonnetz_2,tonnetz_3,tonnetz_4,tonnetz_5,tonnetz_6,lyrics_clean,lyrics_embedding
0,4845,State of Mind,Scooter,state of mind,scooter,happy,24.0,The world seems not the same...\n\nIntroducing...,https://audio-ssl.itunes.apple.com/itunes-asse...,1692327616,...,18.328021,39.053367,0.197966,-0.116721,0.142559,-0.069539,-0.044986,-0.047523,the world seems not the same introducing twist...,"[0.07519828, -0.0233649, -0.06524662, -0.07315..."
1,462,Reptilia,The Strokes,reptilia,the strokes,alt-rock,75.0,[Verse 1]\nHe seemed impressed by the way you ...,https://audio-ssl.itunes.apple.com/itunes-asse...,302987569,...,17.382681,39.012014,0.078138,-0.077754,0.063345,0.036541,-0.011976,-0.014041,he seemed impressed by the way you came in tel...,"[-0.08670999, -0.025700577, -0.08122497, -0.02..."
2,16017,None Of My Business,Cher Lloyd,none of my business,cher lloyd,electro,64.0,"[Chorus]\nDamn, I heard that you and her been ...",https://audio-ssl.itunes.apple.com/itunes-asse...,1438630505,...,18.248683,39.966514,0.013912,0.1729,-0.092766,-0.056323,-0.004173,-0.014388,damn i heard that you and her been having prob...,"[0.017929412, 0.0015679213, 0.00086991367, -0...."
3,9478,Trouble Sleeping,The Perishers,trouble sleeping,the perishers,acoustic,48.0,I'm having trouble sleeping\nYou're jumping in...,https://audio-ssl.itunes.apple.com/itunes-asse...,89335271,...,16.969837,28.947224,-0.118755,0.195544,0.025169,-0.130705,0.024176,0.005865,i'm having trouble sleeping you're jumping in ...,"[0.012034113, -0.0008498362, -0.040335782, 0.0..."
4,2822,Shot in the Dark,Ozzy Osbourne,shot in the dark,ozzy osbourne,hard-rock,65.0,[Verse 1]\nOut on the streets I'm stalking the...,https://audio-ssl.itunes.apple.com/itunes-asse...,158711416,...,17.184653,35.540522,-0.113671,0.023209,-0.029743,-0.051142,0.003486,-0.011837,out on the streets i'm stalking the night i ca...,"[-0.054401744, 0.021241566, -0.05175488, -0.01..."


In [6]:
def get_top_k_neighbors(df, query_embedding, k=5):
    raw_neighbors = retrieve_similar_songs(query_embedding, k=k)
    neighbors = []

    for n in raw_neighbors:
        idx = n["index"]
        row = df.iloc[idx]

        audio_features = {}

        for col in audio_feature_cols:
            val = row[col]

            # keep if scalar
            if np.isscalar(val):
                audio_features[col] = float(val)
            
            # flatten if array
            elif isinstance(val, np.ndarray):
                val = val.flatten()
                for j, v in enumerate(val):
                    audio_features[f"{col}_{j}"] = float(v)
            
            # flatten if list
            elif isinstance(val, list):
                for j, v in enumerate(val):
                    audio_features[f"{col}_{j}"] = float(v)

            else:
                try:
                    audio_features[col] = float(val)
                except Exception:
                    audio_features[col] = None

        neighbor_data = {
            "song_id": row["song_id"],
            "title": row["title"],
            "artist": row["artist"],
            "similarity": n.get("similarity", None),
            "popularity": float(row["popularity"]),
            "lyrics_snippet": row["lyrics"][:400].replace("\n", " ") + "...",
            "audio_features": audio_features
        }
        
        neighbors.append(neighbor_data)

    return neighbors


In [7]:
# TODO: continue working on prompting
def build_rag_prompt_for_lyric_popularity(user_lyric: str,neighbors: List[Dict[str, Any]]) -> str:

    lines = []
    lines.append("You are an expert in music analytics, audio features, and lyric interpretation.")
    lines.append("You are given a NEW lyric and several similar songs from a dataset.")
    lines.append("")
    lines.append("Each similar song includes:")
    lines.append(" - song_id, title, artist")
    lines.append(" - lyric snippet")
    lines.append(" - popularity score (0-100)")
    lines.append(" - detailed audio features extracted from 30-second clips")
    lines.append("")
    lines.append("Your tasks are:")
    lines.append("  1. Predict a popularity score (0-100) for the NEW lyric.")
    lines.append("  2. Explain your reasoning using comparisons to the similar songs.")
    lines.append("")
    lines.append("Keep in mind the audio features of the similar songs, and explain what they mean in context to everyday people.")
    lines.append("")
    lines.append("Return your answer as VALID JSON with this exact format:")
    lines.append("{")
    lines.append('  "predicted_popularity": <number>,')
    lines.append('  "explanation": "<multi-paragraph explanation grounded in the provided songs>"')
    lines.append("}")
    lines.append("")
    lines.append("IMPORTANT:")
    lines.append("Return ONLY raw JSON.")
    lines.append("Do NOT include any code fences such as ``` json")
    lines.append("Do NOT include any explanation text outside the JSON.")
    lines.append("Do NOT add commentary before or after the JSON.")
    lines.append("Return JSON ONLY.")
    lines.append("")
    lines.append("------------------------------------------------------------")
    lines.append("NEW LYRIC:")
    lines.append(user_lyric.strip())
    lines.append("------------------------------------------------------------")
    lines.append("")
    lines.append("SIMILAR SONGS FROM THE DATASET (use these as evidence):")

    for i, nb in enumerate(neighbors, start=1):
        lines.append(f"\nNeighbor #{i}:")
        lines.append(f"  song_id: {nb['song_id']}")
        lines.append(f"  title: {nb['title']}")
        lines.append(f"  artist: {nb['artist']}")
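        if nb["similarity"] is not None:
            # The stored value is a squared L2 distance from FAISS, so label it as such for the LLM.
            lines.append(f"  distance (squared L2, lower = more similar): {nb['similarity']:.4f}")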
        lines.append(f"  popularity: {nb['popularity']:.2f}")
        lines.append(f"  lyrics_snippet: {nb['lyrics_snippet']}")
        lines.append("  audio_features:")

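        for feat_name, feat_val in nb["audio_features"].items():
            # get_top_k_neighbors stores None when a value cannot be cast to float;
            # skip those so the :.4f format does not raise a TypeError.
            if feat_val is None:
                continue
            lines.append(f"    {feat_name}: {feat_val:.4f}")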

    lines.append("")
    lines.append(
        "Using ONLY the information above, estimate the popularity of the new lyric "
        "and explain your reasoning in terms of lyric similarity, artist/genre patterns, "
        "and audio features (energy, brightness, tempo, chroma, MFCCs, contrasts, tonnetz, etc.). "
        "Make sure to contextualize what the audio features mean for the average person."
    )

    return "\n".join(lines)


In [8]:
import os
import re
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

def call_llm_for_popularity_and_explanation(prompt: str) -> dict:

    response = client.responses.create(
        model="gpt-4o",
        input=prompt,
        temperature=0.2,
        max_output_tokens=900
    )

    # output_text joins the text of all output items, so this does not rely on output[0] being a message
    raw_text = response.output_text.strip()

    # Remove any ```json ...``` or ```
    raw_text = raw_text.replace("```json", "")
    raw_text = raw_text.replace("```", "")
    raw_text = raw_text.strip()

    # first try direct json parse
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        pass

    # else, look for a JSON object (up to one level of nesting) with a regex
    json_matches = re.findall(r"\{(?:[^{}]|(?:\{[^{}]*\}))*\}", raw_text, flags=re.DOTALL)

    for match in json_matches:
        try:
            return json.loads(match)
        except json.JSONDecodeError:
            continue

    # try to repair json with trailing commas
    repaired = re.sub(r",\s*([}\]])", r"\1", raw_text)

    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        pass

    # else, say failed
    print("Could not parse JSON from LLM output. Returning raw text.")
    return {
        "predicted_popularity": None,
        "explanation": raw_text
    }
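
The regex fallback above only handles one level of nested braces and can be confused by braces inside string values. As an alternative sketch, the standard library's `json.JSONDecoder.raw_decode` can be tried at each `{` in the text, letting the JSON parser itself decide where an object ends. The helper name `extract_first_json_object` is hypothetical and not part of this notebook.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object found anywhere in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass
        else:
            if isinstance(obj, dict):
                return obj
        # Try the next opening brace.
        start = text.find("{", start + 1)
    return None


reply = 'Here is the result: {"predicted_popularity": 42, "explanation": "a {nested} brace"} Hope that helps.'
parsed = extract_first_json_object(reply)
```

This does not repair malformed JSON (for example trailing commas), so the repair step would still be needed as a last resort.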





In [9]:
def rag_lyric_popularity_system(df: pd.DataFrame, user_lyric: str, k_neighbors: int = 5) -> Dict[str, Any]:
    # 1) embed
    query_embedding = embed_lyrics(user_lyric)

    # 2) retrieve the songs whose lyric embeddings are closest to the query
    neighbors = get_top_k_neighbors(df, query_embedding, k=k_neighbors)

    # 3) build prompt
    prompt = build_rag_prompt_for_lyric_popularity(user_lyric, neighbors)

    # 4) call llm
    llm_output = call_llm_for_popularity_and_explanation(prompt)

    pred_pop = llm_output.get("predicted_popularity", None)
    explanation = llm_output.get("explanation", "")

    return {
        "predicted_popularity": pred_pop,
        "explanation": explanation,
        "neighbors_used": neighbors,
        "prompt_sent": prompt,
    }
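
The pipeline returns `predicted_popularity` exactly as the LLM produced it, which may be a string, out of range, or `None` when parsing failed. A small validation step along the following lines could normalize it before use. This is a sketch under assumptions: `coerce_popularity` is a hypothetical helper, and clamping to 0-100 is one possible policy (rejecting out-of-range values is another).

```python
import math
from typing import Any, Optional


def coerce_popularity(value: Any) -> Optional[float]:
    """Return value as a float clamped to [0, 100], or None if it is not numeric."""
    try:
        score = float(value)
    except (TypeError, ValueError):
        # None, lists, or non-numeric strings such as "n/a"
        return None
    if math.isnan(score):
        return None
    return min(100.0, max(0.0, score))


examples = [coerce_popularity(v) for v in ("30", 130, -5, None, "n/a")]
```

Keeping the unparsed case as `None` makes failures visible downstream instead of silently substituting a default score.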


# Now we can test the system

In [10]:
import numpy as np
import pandas as pd

# audio_feature_cols was defined in an earlier cell ('audio_feature_cols = ...') and still
# contains non-numeric columns (for example lyrics_clean and lyrics_embedding, which follow
# the audio features in df). If you get a NameError here, run that cell first.

# Keep ONLY numeric columns by checking each column's dtype.
audio_feature_cols = [
    col for col in audio_feature_cols 
    if pd.api.types.is_numeric_dtype(df[col])
]

print("Sanitized feature list. Removed non-numeric columns.")
print(f"Count of valid audio features: {len(audio_feature_cols)}")

Sanitized feature list. Removed non-numeric columns.
Count of valid audio features: 80


In [22]:
test_lyric = "random test lyrics."
result = rag_lyric_popularity_system(df, test_lyric, k_neighbors=3)

print("Predicted popularity:", result["predicted_popularity"])
print("\nExplanation:\n", result["explanation"])


Predicted popularity: 30

Explanation:
 The new lyric 'random test lyrics' is quite generic and lacks specific emotional or narrative content, which can affect its potential popularity. When comparing it to the similar songs provided, we can draw some insights based on their popularity scores and audio features.

Neighbor #1, 'Marks On My Neck' by Charlie Puth, has a popularity score of 66. This song features emotionally charged lyrics and a strong narrative, which likely contributes to its higher popularity. The audio features show a moderate level of energy and brightness, with a spectral centroid of 2939.8676 and spectral bandwidth of 3459.3342. These features suggest a sound that is engaging and dynamic, appealing to a broad audience. The MFCCs and spectral contrast indicate a balanced and polished production, which is typical for a pop song.

Neighbor #2, 'Greatestlove' by Musiq Soulchild, has a popularity score of 0, despite having a positive and harmonious lyrical theme. The aud