
In [None]:
import pandas as pd

# Load the uploaded CSV file
df = pd.read_csv('sample_dataset_100.csv')  # Replace with actual name if different

# View first few rows to check structure
df.head()


Unnamed: 0,title,description,flow,llm_response
0,Reset Password #1,User wants to reset the password using email,Email → OTP → New Password,"Sure, enter your email to reset password."
1,Prompt Engineering #2,User wants to understand how to write effectiv...,Prompt Basics → Examples → Refinement Techniques,"Effective prompts are clear, specific, and inc..."
2,Start a Tech YouTube Channel #3,User is looking for tips to start a successful...,User → Tips → Content Strategy,"Start by picking a niche like reviews, tutoria..."
3,Start a Tech YouTube Channel #4,User is looking for tips to start a successful...,User → Tips → Content Strategy,"Start by picking a niche like reviews, tutoria..."
4,Research Paper on LLMs #5,User needs help structuring a research paper o...,Intro → Literature Review → Methods → Results ...,"Start with an introduction to LLMs, followed b..."


In [None]:
# Combine title, description, and flow into one text per row
df['combined_text'] = df['title'].astype(str) + " " + df['description'].astype(str) + " " + df['flow'].astype(str)
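
One caveat with the concatenation above: `astype(str)` turns a missing value (`NaN`) into the literal string `nan`, which would then be embedded as if it were real text. The following is a minimal sketch of a more defensive variant, assuming missing cells should simply be skipped. The helper name `combine_text_fields` and the small `demo` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def combine_text_fields(frame, columns=("title", "description", "flow"), sep=" "):
    """Join the given text columns row by row, skipping missing or empty cells."""
    # Replace NaN/None with "" first so they never become the strings "nan"/"None".
    parts = frame[list(columns)].fillna("").astype(str)
    # Strip each cell and drop empty ones so gaps don't leave doubled separators.
    return parts.apply(
        lambda row: sep.join(s.strip() for s in row if s.strip()), axis=1
    )

# Hypothetical two-row frame with one missing description.
demo = pd.DataFrame({
    "title": ["Reset Password", "Change Email"],
    "description": ["User wants to reset the password", None],
    "flow": ["Email → OTP → New Password", "Settings → Account → Change Email"],
})
demo["combined_text"] = combine_text_fields(demo)
print(demo["combined_text"].tolist())
```

If the CSV is known to have no missing values, the original one-liner behaves the same way.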


In [None]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for each row
embeddings = model.encode(df['combined_text'].tolist(), show_progress_bar=True)



In [None]:
import numpy as np

# Save embeddings to file (optional)
np.save('text_embeddings.npy', embeddings)
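
Saving only pays off if the array is read back later instead of re-encoding the corpus. Below is a small round-trip sketch using `np.load`. The random `demo_embeddings` array and the temporary directory are stand-ins (assumptions) for the notebook's `embeddings` variable and working directory.

```python
import os
import tempfile

import numpy as np

# Stand-in for the notebook's `embeddings` array (5 rows, 384 dimensions assumed).
demo_embeddings = np.random.default_rng(0).normal(size=(5, 384)).astype(np.float32)

# Write to a temporary location so the sketch does not overwrite real files.
path = os.path.join(tempfile.mkdtemp(), "text_embeddings.npy")
np.save(path, demo_embeddings)

# Reload and confirm that shape, dtype, and values survived the round trip.
loaded = np.load(path)
print(loaded.shape, loaded.dtype)
```

When reloading real embeddings, it is worth checking that `loaded.shape[0]` still equals `len(df)`, since the rows are matched to the DataFrame purely by position.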


In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import numpy as np

def get_knn(query_text, model, corpus_vecs, df, k=3, metric='cosine'):
    """
    Finds top-k most similar rows to the query using cosine or euclidean metric.

    Parameters:
    - query_text: New user input as string
    - model: SentenceTransformer model
    - corpus_vecs: Numpy array of embedded CSV rows
    - df: Original DataFrame (for returning top-k results)
    - k: How many similar entries to return
    - metric: 'cosine' or 'euclidean'
    """
    # Step 1: Embed the query text
    query_vec = model.encode([query_text])[0]  # get vector for user query

    # Step 2: Compute similarity or distance
    if metric == 'cosine':
        sims = cosine_similarity([query_vec], corpus_vecs)[0]
        idx = np.argsort(sims)[-k:][::-1]  # highest similarity = closest
    elif metric == 'euclidean':
        dists = euclidean_distances([query_vec], corpus_vecs)[0]
        idx = np.argsort(dists)[:k]  # lowest distance = closest
    else:
        raise ValueError("Invalid metric. Use 'cosine' or 'euclidean'.")

    # Step 3: Return top-k rows from original dataframe
    return df.iloc[idx]


In [None]:
# Example user input (you can change it to anything)
query = "How to change password using phone number?"

# Top-3 using cosine similarity
top_k_cosine = get_knn(query, model, embeddings, df, k=3, metric='cosine')

# Top-3 using euclidean distance
top_k_euclidean = get_knn(query, model, embeddings, df, k=3, metric='euclidean')

# Print both results, one after the other, for comparison
print("🔹 Top-K (Cosine Similarity):\n", top_k_cosine[['title', 'description', 'flow']])
print("\n🔸 Top-K (Euclidean Distance):\n", top_k_euclidean[['title', 'description', 'flow']])


🔹 Top-K (Cosine Similarity):
                  title                                   description  \
76  Reset Password #77  User wants to reset the password using email   
95  Reset Password #96  User wants to reset the password using email   
41  Reset Password #42  User wants to reset the password using email   

                          flow  
76  Email → OTP → New Password  
95  Email → OTP → New Password  
41  Email → OTP → New Password  

🔸 Top-K (Euclidean Distance):
                  title                                   description  \
76  Reset Password #77  User wants to reset the password using email   
95  Reset Password #96  User wants to reset the password using email   
41  Reset Password #42  User wants to reset the password using email   

                          flow  
76  Email → OTP → New Password  
95  Email → OTP → New Password  
41  Email → OTP → New Password  


In [None]:
# Example user input (you can change it to anything)
query = "artificial intelegence"

# Top-3 using cosine similarity
top_k_cosine = get_knn(query, model, embeddings, df, k=3, metric='cosine')

# Top-3 using euclidean distance
top_k_euclidean = get_knn(query, model, embeddings, df, k=3, metric='euclidean')

# Print both results, one after the other, for comparison
print("🔹 Top-K (Cosine Similarity):\n", top_k_cosine[['title', 'description', 'flow']])
print("\n🔸 Top-K (Euclidean Distance):\n", top_k_euclidean[['title', 'description', 'flow']])


🔹 Top-K (Cosine Similarity):
                          title  \
65  Learn Machine Learning #66   
22  Learn Machine Learning #23   
80  Learn Machine Learning #81   

                                         description  \
65  User asks how to start learning machine learning   
22  User asks how to start learning machine learning   
80  User asks how to start learning machine learning   

                                              flow  
65  Math Basics → Python → Scikit-learn → Projects  
22  Math Basics → Python → Scikit-learn → Projects  
80  Math Basics → Python → Scikit-learn → Projects  

🔸 Top-K (Euclidean Distance):
                          title  \
65  Learn Machine Learning #66   
22  Learn Machine Learning #23   
80  Learn Machine Learning #81   

                                         description  \
65  User asks how to start learning machine learning   
22  User asks how to start learning machine learning   
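80  User asks how to start learning machine learning   

                                              flow  
65  Math Basics → Python → Scikit-learn → Projects  
22  Math Basics → Python → Scikit-learn → Projects  
80  Math Basics → Python → Scikit-learn → Projects  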

In [None]:
query = {
    "title": "Email Update",
    "description": "User needs help updating their email",
    "flow": "Navigate to settings and select email"
}

# Combine the text fields into one string, because get_knn expects raw text
query_text = query["title"] + " " + query["description"] + " " + query["flow"]

# Retrieve the top-3 matches with each metric
top_k_cosine = get_knn(query_text, model, embeddings, df, k=3, metric='cosine')
top_k_euclidean = get_knn(query_text, model, embeddings, df, k=3, metric='euclidean')

# Print the matching rows (get_knn returns a DataFrame slice, not just indices)
print("🔹 Cosine Top Matches:")
print(top_k_cosine[['title', 'description', 'flow']])

print("\n🔸 Euclidean Top Matches:")
print(top_k_euclidean[['title', 'description', 'flow']])



🔹 Cosine Top Matches:
               title                                  description  \
24  Change Email #25  User wants to change their registered email   
57  Change Email #58  User wants to change their registered email   
64  Change Email #65  User wants to change their registered email   

                                 flow  
24  Settings → Account → Change Email  
57  Settings → Account → Change Email  
64  Settings → Account → Change Email  

🔸 Euclidean Top Matches:
               title                                  description  \
24  Change Email #25  User wants to change their registered email   
57  Change Email #58  User wants to change their registered email   
64  Change Email #65  User wants to change their registered email   

                                 flow  
24  Settings → Account → Change Email  
57  Settings → Account → Change Email  
64  Settings → Account → Change Email  


In [None]:
print(top_k_cosine)
print(type(top_k_cosine))



               title                                  description  \
24  Change Email #25  User wants to change their registered email   
57  Change Email #58  User wants to change their registered email   
64  Change Email #65  User wants to change their registered email   

                                 flow  \
24  Settings → Account → Change Email   
57  Settings → Account → Change Email   
64  Settings → Account → Change Email   

                                         llm_response  \
24  You can change your email from the settings page.   
57  You can change your email from the settings page.   
64  You can change your email from the settings page.   

                                        combined_text  
24  Change Email #25 User wants to change their re...  
57  Change Email #58 User wants to change their re...  
64  Change Email #65 User wants to change their re...  
<class 'pandas.core.frame.DataFrame'>


In [None]:
def build_context_from_df(df_subset):
    context = ""
    for _, row in df_subset.iterrows():
        context += f"Title: {row['title']}\nDescription: {row['description']}\nFlow: {row['flow']}\n---\n"
    return context

query = {
    "title": "Email Update",
    "description": "User needs help updating their email",
    "flow": "Navigate to settings and select email"
}

context_cosine = build_context_from_df(top_k_cosine)
prompt = f"# Context\n{context_cosine}\nUser Query: {query['description']}"
print(prompt)


# Context
Title: Change Email #25
Description: User wants to change their registered email
Flow: Settings → Account → Change Email
---
Title: Change Email #58
Description: User wants to change their registered email
Flow: Settings → Account → Change Email
---
Title: Change Email #65
Description: User wants to change their registered email
Flow: Settings → Account → Change Email
---

User Query: User needs help updating their email
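
The printed prompt repeats the same description and flow three times, because the dataset contains near-identical rows that differ only in the title suffix. A possible refinement, sketched below, drops rows whose `description` and `flow` match before building the context. The helper name `build_deduplicated_context` and the `demo_top_k` frame are hypothetical. The frame only mirrors the retrieval output shown above.

```python
import pandas as pd

def build_deduplicated_context(df_subset, fields=("description", "flow")):
    """Build the context string, keeping only the first row for each distinct `fields` combination."""
    unique_rows = df_subset.drop_duplicates(subset=list(fields))
    blocks = [
        f"Title: {row['title']}\nDescription: {row['description']}\nFlow: {row['flow']}\n---"
        for _, row in unique_rows.iterrows()
    ]
    # Join once instead of growing a string inside the loop.
    return ("\n".join(blocks) + "\n") if blocks else ""

# Stand-in for `top_k_cosine`, copied from the output above.
demo_top_k = pd.DataFrame(
    {
        "title": ["Change Email #25", "Change Email #58", "Change Email #65"],
        "description": ["User wants to change their registered email"] * 3,
        "flow": ["Settings → Account → Change Email"] * 3,
    },
    index=[24, 57, 64],
)

context = build_deduplicated_context(demo_top_k)
print(context)
```

Whether to deduplicate is a design choice: it shortens the prompt, but it also means `k=3` may yield fewer than three context entries.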


In [None]:
%pip install bert-score




In [None]:
# Make sure you install bert-score first (run once)
# !pip install bert-score

from bert_score import score

# Example candidate response generated by your LLM (replace this with actual output)
candidate = "You can update your email by going to Settings and selecting Email."

# Reference (ground truth) response from your dataframe (replace with your actual df variable)
# Assuming top_k_cosine is a dataframe returned by your KNN retrieval step
reference = top_k_cosine.iloc[0]['llm_response']

# Calculate BERTScore
P, R, F1 = score([candidate], [reference], lang="en")

print(f"BERTScore Precision: {P[0]:.4f}")
print(f"BERTScore Recall: {R[0]:.4f}")
print(f"BERTScore F1: {F1[0]:.4f}")




BERTScore Precision: 0.9336
BERTScore Recall: 0.9565
BERTScore F1: 0.9449


In [None]:
# Make sure bert-score is installed
# !pip install bert-score

from bert_score import score

# Example candidate response generated by your LLM (replace this with actual output)
candidate = "You can update your email by going to Settings and selecting Email."

# Reference (ground truth) response from the top Euclidean retrieval result
reference_euclidean = top_k_euclidean.iloc[0]['llm_response']

# Calculate BERTScore for Euclidean retrieval
P_euc, R_euc, F1_euc = score([candidate], [reference_euclidean], lang="en")

print("\nBERTScore for Euclidean Retrieval:")
print(f"Precision: {P_euc[0]:.4f}")
print(f"Recall:    {R_euc[0]:.4f}")
print(f"F1:        {F1_euc[0]:.4f}")





BERTScore for Euclidean Retrieval:
Precision: 0.9336
Recall:    0.9565
F1:        0.9449
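
The Euclidean scores match the cosine scores exactly because both retrievals returned row 24 first, so the same `llm_response` was used as the reference in both cells. Before scoring twice, it can help to check how much the two result sets overlap. The sketch below is one way to do that. The function name `retrieval_overlap` is hypothetical, and the index lists are copied from the outputs above.

```python
def retrieval_overlap(idx_a, idx_b):
    """Return (Jaccard overlap of the two index sets, whether the order is identical)."""
    set_a, set_b = set(idx_a), set(idx_b)
    union = set_a | set_b
    # Two empty result sets are treated as fully overlapping.
    jaccard = len(set_a & set_b) / len(union) if union else 1.0
    same_order = list(idx_a) == list(idx_b)
    return jaccard, same_order

# Row indices from the cosine and Euclidean outputs for the "Email Update" query.
cosine_idx = [24, 57, 64]
euclidean_idx = [24, 57, 64]

print(retrieval_overlap(cosine_idx, euclidean_idx))
```

In the notebook, the index lists could be taken from `top_k_cosine.index` and `top_k_euclidean.index`. A Jaccard value of 1.0 with identical order means a second BERTScore run adds no new information.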
