In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from dotenv import load_dotenv
import os
import google.generativeai as genai

In [28]:
df = pd.read_csv('data-ML/cleaned.csv')
df

Unnamed: 0,Title,Article Body,vector
0,Linear Regression,Linear Regression ¶ Introduction Simple regres...,[ 5.06295590e-03 -2.39642903e-01 -2.46081278e-...
1,Gradient Descent,Gradient Descent ¶ Gradient descent is an opti...,[-1.86680302e-01 -1.34919152e-01 -2.72177279e-...
2,Logistic Regression,Logistic Regression ¶ Introduction Comparison ...,[-8.80371779e-03 -2.63402790e-01 -2.24201143e-...
3,Glossary,Glossary ¶ Definitions of common machine learn...,[ 4.11835276e-02 -4.21892256e-01 -1.72787145e-...
4,Calculus,Calculus ¶ Introduction Derivatives Geometric ...,[-2.59588063e-01 -2.30306998e-01 -3.17948580e-...
5,Linear Algebra,Linear Algebra ¶ Vectors Notation Vectors in g...,[ 1.79772496e-01 -4.04821187e-01 -2.41739497e-...
6,Probability (TODO),Probability ¶ Links Screenshots License Basic ...,[-1.55495733e-01 -1.71098158e-01 -3.58953655e-...
7,Statistics (TODO),Statistics ¶ Basic concepts in statistics for ...,[ 6.41153902e-02 -2.15294093e-01 -3.03122133e-...
8,Notation,Notation ¶ Commonly used math symbols in machi...,[-1.06072292e-01 -3.12585503e-01 -2.95823365e-...
9,Concepts,Concepts ¶ Neural Network Neuron Synapse Weigh...,[ 2.20161170e-01 -2.86158323e-01 -2.03859001e-...


In [29]:
model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

In [30]:
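def convert_to_array(x):
    """Parse a vector stored as text, e.g. '[ 0.1 -0.2 0.3]', into a float array."""
    if isinstance(x, str):
        # Drop the surrounding brackets, then split on any whitespace (including newlines)
        cleaned = x.strip('[]').split()
        return np.array([float(num) for num in cleaned])
    # Already numeric: leave unchanged
    return x

df['vector'] = df['vector'].apply(convert_to_array)
# Stack the per-row vectors into one 2-D array of shape (n_articles, embedding_dim)
vectors = np.array([np.array(v) for v in df['vector']])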
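
The parsing step above can be checked in isolation. The sketch below assumes the `vector` column was written to the CSV as NumPy's printed form (which is what the DataFrame output suggests); `parse_vector_string` is a hypothetical helper name, not part of the notebook. It also guards against one known pitfall: NumPy abbreviates arrays longer than its print threshold (1000 elements by default) with `...`, and such a string cannot be parsed back.

```python
import numpy as np

def parse_vector_string(text):
    """Hypothetical helper: parse a printed array such as '[ 0.1 -0.2 0.3]'."""
    tokens = text.strip().strip('[]').split()
    if '...' in tokens:
        # The array was abbreviated when printed, so values are missing
        raise ValueError("vector string was truncated when it was saved")
    return np.array([float(t) for t in tokens])

# Round trip: array -> printed string -> array
original = np.array([0.00506295590, -0.239642903, -0.246081278])
restored = parse_vector_string(str(original))
```

The restored values match the originals only up to the printed precision (8 digits by default), so a tolerance-based comparison such as `np.allclose` is the appropriate check.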

Compute the cosine similarity between the query embedding and each stored vector in turn, then return the `k` most similar articles.

In [31]:
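def search_similar(query, vectors, df, k=3):
    """Return the k rows of df whose vectors are most similar to the query."""
    # Embed the query with the SentenceTransformer loaded above
    query_vec = model.encode(query)
    sims = []
    for vec in vectors:
        # cosine_similarity expects 2-D inputs and returns a 1x1 matrix here
        sim = cosine_similarity([query_vec], [vec])[0][0]
        sims.append(sim)

    sims = np.array(sims)
    # Indices of the k largest similarities, in descending order
    top_k_idx = sims.argsort()[::-1][:k]
    results = df.iloc[top_k_idx].copy()
    results['similarity'] = sims[top_k_idx]
    return results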
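
The sequential loop can also be written as a single matrix operation. The following is a minimal sketch using only NumPy; `top_k_cosine` is a hypothetical name, and the function takes an already-encoded query vector so that it does not depend on the embedding model. It assumes no vector has zero norm (otherwise the division is undefined).

```python
import numpy as np

def top_k_cosine(query_vec, vectors, k=3):
    """Hypothetical helper: indices and cosine similarities of the k closest rows."""
    query_vec = np.asarray(query_vec, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    # cosine(q, v) = (q . v) / (|q| * |v|), computed for all rows at once
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec)
    sims = (vectors @ query_vec) / norms
    # Sort descending and keep the first k indices
    top_idx = np.argsort(sims)[::-1][:k]
    return top_idx, sims[top_idx]

# Toy data: three 2-D "article" vectors and a query pointing along the first axis
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, scores = top_k_cosine([1.0, 0.0], toy_vectors, k=2)
```

For ten or so articles the loop is fast enough; the vectorized form mainly matters as the collection grows.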

Enter a query.

In [32]:
query = input('Enter related text:')

**A more complex input example:** the introduction paragraph of the paper **Adaptively-weighted Nearest Neighbors for Matrix Completion**, available on [arXiv](https://arxiv.org/abs/2505.09612).

Despite being a millennium-old principle, nearest neighbors remains a formidable method owing to its computational scalability and flexibility in terms of minimal model assumptions for getting guarantees on its performance. Nearest neighbors is a popular choice for non-parametric regression, with applications in pattern recognition [FH52, CH67, CBS20]. After seeing huge empirical success, nearest neighbors has paved its way into modern fields like matrix completion. Matrix completion, the art of estimating the true underlying matrix from a data matrix of noisy and missing entries [Rec11, Cha15, CCF+20], serves as a cornerstone of machine learning for addressing missing data challenges, ubiquitous in causal inference. It has been a canonical tool for recommender systems [KBV09, ADSS21].

The three results with the highest similarity to the query:

In [33]:
top_results = search_similar(query, vectors, df, k=3)

In [34]:
top_results

Unnamed: 0,Title,Article Body,vector,similarity
20,Regression,Regression Algorithms ¶ Ordinary Least Squares...,"[0.0233300906, -0.352147609, -0.168644696, -0....",0.490373
21,Reinforcement Learning,"Reinforcement Learning ¶ In machine learning, ...","[0.0133673623, -0.395623982, -0.174896106, -0....",0.464245
3,Glossary,Glossary ¶ Definitions of common machine learn...,"[0.0411835276, -0.421892256, -0.172787145, -0....",0.462598


Use an LLM (`gemini-2.0-flash`) to derive basic questions from the query and the most closely related article.

The questions should be simple and suggestive, designed to guide the user toward understanding key Machine Learning concepts rather than testing deep knowledge.

The goal is to help users identify which fundamental Machine Learning knowledge they may need to learn or review.

In [None]:
load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")

genai.configure(api_key=api_key)

# Use a separate name so the SentenceTransformer stored in `model` is not overwritten
llm = genai.GenerativeModel("gemini-2.0-flash")
# Only the top-ranked article is passed to the LLM as context
context = top_results['Article Body'].iloc[0]
prompt = (
    f"From this input query: {query}, and related knowledge in {context}, "
    "derive some simple questions relating to it, questions have to be simple "
    "and short, not too complex. Return only list of questions."
)
response = llm.generate_content(prompt)
print(response.text)

*   What is a key advantage of nearest neighbors despite its age?
*   What are some applications of nearest neighbors in non-parametric regression?
*   What is matrix completion used for?
*   Name one application of matrix completion.
*   What is minimized in Ordinary Least Squares regression?
*   How does Polynomial regression modify linear regression?
*   What regularization term does Lasso regression add?
*   How does Ridge regression differ from Lasso regression?
*   What type of function does Stepwise regression fit to data?
*   What is the Heaviside step function used for in Stepwise regression?
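
To work with the questions programmatically, the bulleted text can be split into a Python list. This is a minimal sketch under the assumption that the model returns one question per line, prefixed by a bullet or a number as in the output above; `parse_questions` is a hypothetical helper, and the sample string stands in for `response.text`.

```python
import re

def parse_questions(text):
    """Hypothetical helper: turn a bulleted or numbered block into a list of strings."""
    questions = []
    for line in text.splitlines():
        # Remove a leading bullet ("*", "-", "•") or numbering ("1.", "2)")
        cleaned = re.sub(r'^\s*(?:[*\-•]|\d+[.)])\s*', '', line).strip()
        if cleaned:
            questions.append(cleaned)
    return questions

# Stand-in for response.text
sample = (
    "*   What is matrix completion used for?\n"
    "*   Name one application of matrix completion.\n"
)
questions = parse_questions(sample)
```

The output format of an LLM is not guaranteed, so real responses may need additional handling (for example, an introductory sentence before the list).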

