# Advanced Query Strategies

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

Querying is not just about embedding a sentence or document and searching for the closest match. We can get creative when building query vectors!

## Imports

In [1]:
# LLM libraries
from sentence_transformers import SentenceTransformer
import faiss

# Helper libraries
import pandas as pd
import numpy as np

## Preparation

In [2]:
# Embedding model
model_name = 'all-mpnet-base-v2' 
emb_model = SentenceTransformer(model_name)

In [3]:
# lyrics data and embeddings
data_path = "https://raw.githubusercontent.com/nuitrcs/AI_Week_RAG/refs/heads/main/data/songs.csv"
lyrics = pd.read_csv(data_path)
embeddings = emb_model.encode(lyrics['Lyrics'].to_list(), normalize_embeddings=True)

In [4]:
# faiss index
d_emb = len(embeddings[0])
faiss_index = faiss.IndexFlatIP(d_emb)
faiss_index.add(embeddings)
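
A quick aside on why `IndexFlatIP` goes well with `normalize_embeddings=True`: for unit-length vectors, the inner product is the same number as the cosine similarity. Here is a minimal NumPy-only sketch of that fact. The arrays `toy_docs` and `toy_query` are random stand-ins invented for illustration, not the lyrics embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for document and query embeddings (assumed: 5 docs, 8 dims)
toy_docs = rng.normal(size=(5, 8)).astype("float32")
toy_query = rng.normal(size=8).astype("float32")

# L2-normalize every vector, which is what normalize_embeddings=True does in encode()
docs_unit = toy_docs / np.linalg.norm(toy_docs, axis=1, keepdims=True)
query_unit = toy_query / np.linalg.norm(toy_query)

# Inner product of the unit vectors
ip_scores = docs_unit @ query_unit

# Cosine similarity computed from the raw (un-normalized) vectors
cos_scores = (toy_docs @ toy_query) / (
    np.linalg.norm(toy_docs, axis=1) * np.linalg.norm(toy_query)
)

# The two sets of scores agree up to floating-point error
print(np.allclose(ip_scores, cos_scores, atol=1e-5))
```

So the scores printed throughout this notebook can be read as cosine similarities, as long as both the indexed vectors and the query vectors are normalized.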

## Averaging 2 queries

You can average two concepts and find the document that lies somewhere in between!

In [5]:
# Let's compare Swift and Dylan
lyrics_swilan = lyrics[lyrics['Artist'].isin(['Taylor Swift', 'Bob Dylan'])].copy()
lyrics_swilan = lyrics_swilan.reset_index(drop=True)

# new embeddings
embeddings_swilan = emb_model.encode(lyrics_swilan['Lyrics'].to_list(), normalize_embeddings=True)

# new faiss
d_emb = len(embeddings_swilan[0])
faiss_swilan = faiss.IndexFlatIP(d_emb)
faiss_swilan.add(embeddings_swilan)

In [6]:
# queries
q1 = "a heartbreaking love story"
q2 = "a long journey on the open road"

# Encode and average
v_q1 = emb_model.encode(q1, normalize_embeddings=True)
v_q2 = emb_model.encode(q2, normalize_embeddings=True)
v_avg = (v_q1 + v_q2) / 2
v_avg = v_avg / np.linalg.norm(v_avg)  # re-normalize
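
Why re-normalize? The average of two unit vectors is shorter than 1 unless they point the same way: its length is cos(θ/2), where θ is the angle between them. The small sketch below checks this with two toy 2D vectors (assumed values, standing in for `v_q1` and `v_q2`), and also shows that the normalized average is equally similar to both inputs.

```python
import numpy as np

# Two toy unit vectors, 60 degrees apart (assumed values for illustration)
theta = np.pi / 3
a = np.array([1.0, 0.0])
b = np.array([np.cos(theta), np.sin(theta)])

avg = (a + b) / 2
avg_norm = np.linalg.norm(avg)   # equals cos(theta / 2), so it is below 1
avg_unit = avg / avg_norm        # back on the unit sphere

# Cosine similarity of the normalized average to each input
sim_a = float(avg_unit @ a)
sim_b = float(avg_unit @ b)

print("norm of raw average:", round(float(avg_norm), 4))
print("cos(theta / 2):     ", round(float(np.cos(theta / 2)), 4))
print("similarity to a and b:", round(sim_a, 4), round(sim_b, 4))
```

With an inner-product index, rescaling the query by a positive constant does not change the ranking, only the scores. Re-normalizing keeps those scores on the cosine scale, so they stay comparable with the single-query searches.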

In [7]:
# closest song to the first query on its own (love)
_, I_love = faiss_swilan.search(np.expand_dims(v_q1, axis=0), k=1)
print(lyrics_swilan.loc[I_love[0,0], ['Artist', 'Title']])

Artist    Taylor Swift
Title       Love Story
Name: 24, dtype: object


In [8]:
# closest song to the second query on its own (journey)
_, I_journey = faiss_swilan.search(np.expand_dims(v_q2, axis=0), k=1)
print(lyrics_swilan.loc[I_journey[0,0], ['Artist', 'Title']])

Artist                        Bob Dylan
Title     The Times They Are A-Changin’
Name: 52, dtype: object


In [9]:
# search
D, I = faiss_swilan.search(np.expand_dims(v_avg, axis=0), k=5)

# Show results
for idx, score in zip(I[0], D[0]):
    print(lyrics_swilan.iloc[idx]["Artist"], "-", lyrics_swilan.iloc[idx]["Title"], "| score:", round(score, 3))

Taylor Swift - Love Story | score: 0.454
Bob Dylan - It’s Alright, Ma (I’m Only Bleeding) | score: 0.434
Bob Dylan - Tangled Up in Blue | score: 0.433
Bob Dylan - Make You Feel My Love | score: 0.428
Bob Dylan - Tempest | score: 0.421


## Averaging multiple queries

You can also average more than two queries, which creates a semantic "centroid". The centroid can stand in for a central theme, an abstraction over several examples, and so on.

Let's try two different prototypes.

In [10]:
love_queries = [
    "falling in love",
    "heartbreak after a breakup",
    "missing someone you loved",
    "longing for someone far away",
    "the crushing pain of unrequited love",
    "healing after heartbreak",
    "a secret crush",
    "an emotional love story with ups and downs",
    "falling in love again!",
    "getting over you"
]

# compute the centroid
love_embs = emb_model.encode(love_queries, normalize_embeddings=True)
love_prototype = np.mean(love_embs, axis=0)
love_prototype = love_prototype / np.linalg.norm(love_prototype)  # re-normalize

# search
D, I = faiss_swilan.search(np.expand_dims(love_prototype, axis=0), k=5)

# and let's check which songs are the most prototypical
for idx, score in zip(I[0], D[0]):
    print(lyrics_swilan.iloc[idx]["Artist"], "-", lyrics_swilan.iloc[idx]["Title"], "| score:", round(score, 3))

Taylor Swift - Love Story | score: 0.418
Bob Dylan - Tangled Up in Blue | score: 0.402
Taylor Swift - peace | score: 0.398
Bob Dylan - Girl from the North Country | score: 0.382
Taylor Swift - happiness | score: 0.375


In [11]:
rolling_queries = [
    "wandering down the road",
    "restless and moving on",
    "searching for meaning in life",
    "socially conscious",
    "feeling like an outsider",
    "stories about change and transformation",
    "a restless spirit looking for answers",
    "a voice of social awareness",
    "poetic reflections on the world",
    "the open highway calling"
]

# compute the centroid
rolling_embs = emb_model.encode(rolling_queries, normalize_embeddings=True)
rolling_prototype = np.mean(rolling_embs, axis=0)
rolling_prototype = rolling_prototype / np.linalg.norm(rolling_prototype)  # re-normalize

# search
D, I = faiss_swilan.search(np.expand_dims(rolling_prototype, axis=0), k=5)

# and let's check which songs are the most prototypical
for idx, score in zip(I[0], D[0]):
    print(lyrics_swilan.iloc[idx]["Artist"], "-", lyrics_swilan.iloc[idx]["Title"], "| score:", round(score, 3))

Bob Dylan - The Times They Are A-Changin’ | score: 0.478
Bob Dylan - It’s Alright, Ma (I’m Only Bleeding) | score: 0.451
Bob Dylan - Desolation Row | score: 0.403
Bob Dylan - Ballad of a Thin Man | score: 0.4
Bob Dylan - Like a Rolling Stone | score: 0.382
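
Both prototypes follow the same recipe, so it can be wrapped in a small function. The sketch below does that and also checks how well each query agrees with the resulting centroid. The helper name `make_prototype` and the toy vectors are assumptions made for illustration; in the notebook you would pass `love_embs` or `rolling_embs` instead.

```python
import numpy as np

def make_prototype(vectors):
    """Return the L2-normalized mean of a set of row vectors (hypothetical helper)."""
    centroid = np.mean(vectors, axis=0)
    return centroid / np.linalg.norm(centroid)

# Toy stand-in for a set of query embeddings: 4 vectors in 3 dimensions (assumed values)
toy_embs = np.array([
    [1.0, 0.2, 0.0],
    [0.9, 0.0, 0.3],
    [0.8, 0.4, 0.1],
    [0.0, 0.1, 1.0],   # an outlier pointing in a different direction
])
toy_embs = toy_embs / np.linalg.norm(toy_embs, axis=1, keepdims=True)

prototype = make_prototype(toy_embs)

# Cosine similarity of each query to the prototype; a low value flags a query
# that sits far from the shared theme
cohesion = toy_embs @ prototype
for i, score in enumerate(cohesion):
    print(f"query {i}: {score:.3f}")
```

A check like this can help when a prototype returns surprising songs: a query with a much lower score than the rest may be pulling the centroid away from the theme you had in mind.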


## Semantic walks

OK, now let's take a promenade in conceptual space. Say you have two queries (or two songs already in the dataset) and you want to explore what lies in between as you move from one to the other. You definitely can!

In [13]:
# start and end queries
q_start = "a heartbreaking love story, missing someone you loved"
q_end   = "wandering through changing times, a contemplative journey on life and society"

v_start = emb_model.encode(q_start, normalize_embeddings=True)
v_end   = emb_model.encode(q_end, normalize_embeddings=True)

# Number of interludes along the promenade
n_steps = 10

# compute the interludes (this is just linear interpolation, in case you're interested)
interlude_vectors = []
for t in np.linspace(0, 1, n_steps):
    v_step = (1-t)*v_start + t*v_end
    v_step = v_step / np.linalg.norm(v_step)
    interlude_vectors.append(v_step)

# find closest songs to each intermediate spot
for i, v in enumerate(interlude_vectors):
    D, I = faiss_swilan.search(np.expand_dims(v, axis=0), k=3)
    print(f"\nStep {i+1}")
    for idx, score in zip(I[0], D[0]):
        song = lyrics_swilan.iloc[idx]
        print(f"{song['Artist']} - {song['Title']} | score: {score:.2f}")


Step 1
Taylor Swift - Love Story | score: 0.49
Bob Dylan - Tempest | score: 0.45
Bob Dylan - Tangled Up in Blue | score: 0.44

Step 2
Taylor Swift - Love Story | score: 0.49
Bob Dylan - Tempest | score: 0.47
Bob Dylan - Tangled Up in Blue | score: 0.45

Step 3
Bob Dylan - Tempest | score: 0.49
Taylor Swift - Love Story | score: 0.48
Bob Dylan - Tangled Up in Blue | score: 0.45

Step 4
Bob Dylan - Tempest | score: 0.49
Taylor Swift - Love Story | score: 0.47
Bob Dylan - It’s Alright, Ma (I’m Only Bleeding) | score: 0.45

Step 5
Bob Dylan - Tempest | score: 0.49
Bob Dylan - It’s Alright, Ma (I’m Only Bleeding) | score: 0.47
Bob Dylan - Desolation Row | score: 0.44

Step 6
Bob Dylan - Tempest | score: 0.48
Bob Dylan - The Times They Are A-Changin’ | score: 0.47
Bob Dylan - It’s Alright, Ma (I’m Only Bleeding) | score: 0.47

Step 7
Bob Dylan - The Times They Are A-Changin’ | score: 0.50
Bob Dylan - It’s Alright, Ma (I’m Only Bleeding) | score: 0.47
Bob Dylan - Tempest | score: 0.45

[output truncated]
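
The walk above can also be written in vectorized form, computing all the interludes in one shot and scoring them against every song at once. This is a minimal sketch under stated assumptions: `semantic_walk` is a hypothetical helper, the vectors are random toy data, and the search is a brute-force NumPy matrix product rather than a FAISS call.

```python
import numpy as np

def semantic_walk(v_start, v_end, n_steps=10):
    """Linearly interpolate between two vectors and re-normalize each step (hypothetical helper)."""
    ts = np.linspace(0, 1, n_steps)[:, None]        # shape (n_steps, 1)
    steps = (1 - ts) * v_start + ts * v_end          # shape (n_steps, dim)
    return steps / np.linalg.norm(steps, axis=1, keepdims=True)

rng = np.random.default_rng(42)

# Toy stand-ins for the song embeddings (assumed: 20 songs, 16 dims, unit-normalized)
toy_songs = rng.normal(size=(20, 16))
toy_songs /= np.linalg.norm(toy_songs, axis=1, keepdims=True)

# Toy stand-ins for the encoded start and end queries
q_a = rng.normal(size=16)
q_a /= np.linalg.norm(q_a)
q_b = rng.normal(size=16)
q_b /= np.linalg.norm(q_b)

walk = semantic_walk(q_a, q_b, n_steps=5)

# Brute-force inner-product search: rows are steps, columns are songs
scores = walk @ toy_songs.T
top3 = np.argsort(-scores, axis=1)[:, :3]

for i, row in enumerate(top3):
    print(f"Step {i+1}: nearest toy songs {row.tolist()}")
```

FAISS `search` also accepts a 2D array with one query per row, so the stacked steps could be passed to the index in a single call instead of one call per step.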

Let's now try with our full dataset:

In [14]:
# start and end queries
q_start = "a joyful holiday celebration with friends and family, festive winter songs with warmth and cheer"
q_end   = "a heartfelt love story with emotional ups and downs, reflecting on personal growth and change"

v_start = emb_model.encode(q_start, normalize_embeddings=True)
v_end   = emb_model.encode(q_end, normalize_embeddings=True)

# Number of interludes along the promenade
n_steps = 10

# compute the interludes (this is just linear interpolation, in case you're interested)
interlude_vectors = []
for t in np.linspace(0, 1, n_steps):
    v_step = (1-t)*v_start + t*v_end
    v_step = v_step / np.linalg.norm(v_step)
    interlude_vectors.append(v_step)

# find closest songs to each intermediate spot
for i, v in enumerate(interlude_vectors):
    D, I = faiss_index.search(np.expand_dims(v, axis=0), k=3)
    print(f"\nStep {i+1}")
    for idx, score in zip(I[0], D[0]):
        song = lyrics.iloc[idx]
        print(f"{song['Artist']} - {song['Title']} | score: {score:.2f}")


Step 1
Nat King Cole - The Christmas Song | score: 0.56
Frank Sinatra - Have Yourself a Merry Little Christmas | score: 0.55
Nat King Cole - Chestnuts Roasting on an Open Fire | score: 0.53

Step 2
Frank Sinatra - Have Yourself a Merry Little Christmas | score: 0.56
Nat King Cole - The Christmas Song | score: 0.55
Bryan Adams - Christmas Time | score: 0.52

Step 3
Frank Sinatra - Have Yourself a Merry Little Christmas | score: 0.56
Nat King Cole - The Christmas Song | score: 0.52
Bryan Adams - Christmas Time | score: 0.52

Step 4
Frank Sinatra - Have Yourself a Merry Little Christmas | score: 0.55
Bryan Adams - Christmas Time | score: 0.51
John Denver - Rhymes & Reasons | score: 0.51

Step 5
Frank Sinatra - Have Yourself a Merry Little Christmas | score: 0.52
John Denver - Rhymes & Reasons | score: 0.51
Bryan Adams - Christmas Time | score: 0.49

Step 6
Nat King Cole - Stardust | score: 0.50
John Denver - Rhymes & Reasons | score: 0.50
Frank Sinatra - Have Yourself a Merry Little Chri
[output truncated]

You can also explore what lies between two songs already in your dataset; you don't even need queries to take your semantic promenade :-)
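
As a rough sketch of that idea, the start and end vectors can simply be rows of the embedding matrix. Everything below is toy data: `toy_embeddings` stands in for `embeddings`, and the positions `i_start` and `i_end` are hypothetical picks, not real songs.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy stand-in for the `embeddings` matrix (assumed: 30 songs, 16 dims, unit-normalized rows)
toy_embeddings = rng.normal(size=(30, 16))
toy_embeddings /= np.linalg.norm(toy_embeddings, axis=1, keepdims=True)

# Hypothetical row positions of the two songs to walk between
i_start, i_end = 3, 17
v_start = toy_embeddings[i_start]
v_end = toy_embeddings[i_end]

path = []
for t in np.linspace(0, 1, 6):
    v_step = (1 - t) * v_start + t * v_end
    v_step = v_step / np.linalg.norm(v_step)
    # Score every song against this step and keep the best match
    best = int(np.argmax(toy_embeddings @ v_step))
    path.append(best)

# The walk starts at the first song and ends at the second
print(path)
```

In the notebook, the same pattern would presumably use rows of `embeddings` together with `lyrics.iloc[...]` to look up the artist and title at each step.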