## Similarity Exercise

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

from sentence_transformers import SentenceTransformer
import faiss

In this exercise, you've been provided with the titles and abstracts of 500 recent machine learning research papers posted on arXiv.org.

In [2]:
articles = pd.read_csv('../data/arxiv_papers.csv')
articles.head()

| | title | abstract | url |
|---|---|---|---|
| 0 | GoT-R1: Unleashing Reasoning Capability of MLL... | Visual generation models have made remarkable ... | http://arxiv.org/abs/2505.17022v1 |
| 1 | Delving into RL for Image Generation with CoT:... | Recent advancements underscore the significant... | http://arxiv.org/abs/2505.17017v1 |
| 2 | Interactive Post-Training for Vision-Language-... | We introduce RIPT-VLA, a simple and scalable r... | http://arxiv.org/abs/2505.17016v1 |
| 3 | When Are Concepts Erased From Diffusion Models? | Concept erasure, the ability to selectively pr... | http://arxiv.org/abs/2505.17013v1 |
| 4 | Understanding Prompt Tuning and In-Context Lea... | Prompting is one of the main ways to adapt a p... | http://arxiv.org/abs/2505.17010v1 |


In [3]:
i = 0
print(f'Title: {articles.loc[i,"title"]}\n')
print(f'Text: {articles.loc[i,"abstract"]}')

Title: GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Text: Visual generation models have made remarkable progress in creating realistic
images from text prompts, yet struggle with complex prompts that specify
multiple objects with precise spatial relationships and attributes. Effective
handling of such prompts requires explicit reasoning about the semantic content
and spatial layout. We present GoT-R1, a framework that applies reinforcement
learning to enhance semantic-spatial reasoning in visual generation. Building
upon the Generation Chain-of-Thought approach, GoT-R1 enables models to
autonomously discover effective reasoning strategies beyond predefined
templates through carefully designed reinforcement learning. To achieve this,
we propose a dual-stage multi-dimensional reward framework that leverages MLLMs
to evaluate both the reasoning process and final output, enabling effective
supervision across the entire generation pipeli

In [4]:
print(articles.head(2))

                                               title  \
0  GoT-R1: Unleashing Reasoning Capability of MLL...   
1  Delving into RL for Image Generation with CoT:...   

                                            abstract  \
0  Visual generation models have made remarkable ...   
1  Recent advancements underscore the significant...   

                                 url  
0  http://arxiv.org/abs/2505.17022v1  
1  http://arxiv.org/abs/2505.17017v1  


Let's try out a variety of ways of vectorizing and searching for semantically-similar papers.

### Method 1: Bag of Words

Fit a `CountVectorizer` with all of the default settings to the article abstracts. Then vectorize the dataset using the fitted vectorizer.

In [5]:
vectorizer = CountVectorizer()
abstract_vectors = vectorizer.fit_transform(articles['abstract'])

**Question:** How many dimensions do the embeddings have?

In [6]:
print(f"Number of dimensions: {abstract_vectors.shape[1]}")

Number of dimensions: 7978
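The dimension count equals the vocabulary size, because `CountVectorizer` creates one column per distinct token it sees during fitting. The sketch below illustrates this on a small, hypothetical corpus (the `toy_docs` sentences are made up for illustration and are not part of the arXiv dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the arXiv dataset.
toy_docs = [
    "graph retrieval for language models",
    "retrieval benchmark for math",
    "language models for robot planning",
]

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index.
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(toy_vectors.shape)      # (number of documents, vocabulary size)
print(toy_vectors.toarray())  # one row of word counts per document
```

The three sentences contain nine distinct words, so each document becomes a 9-dimensional count vector. With 500 abstracts, the same logic yields the much larger vocabulary reported above.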


Now, let's use the embeddings to look for articles similar to a search query.

Apply the vectorizer you fit earlier to this query string to get an embedding. 

**Hint:** You can't pass a bare string to a vectorizer's `transform` method, since it expects a collection of documents, but you can pass a list containing a string.

In [7]:
query = "vector databases for retrieval augmented generation"

query_vector = vectorizer.transform([query])
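One consequence of reusing a fitted vectorizer is that query words absent from the fitted vocabulary are silently dropped. The following sketch, using a hypothetical two-document corpus, shows which query words survive `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus used only to illustrate out-of-vocabulary words.
toy_docs = ["vector search for retrieval", "generation with language models"]
toy_vectorizer = CountVectorizer().fit(toy_docs)

# transform expects an iterable of documents, so wrap the query in a list.
toy_query_vector = toy_vectorizer.transform(["vector databases for retrieval"])

# Map the nonzero columns of the query vector back to words.
index_to_word = {idx: word for word, idx in toy_vectorizer.vocabulary_.items()}
matched_words = sorted(index_to_word[int(j)] for j in toy_query_vector.nonzero()[1])

print(toy_query_vector.shape)
print(matched_words)  # "databases" is missing: it never appeared in toy_docs
```

A bag-of-words query can therefore only match on words that literally occur in the corpus, which is worth keeping in mind when inspecting the results.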

Now, we need to find the similarity between our query embedding and each vectorized article.

For this, you can use the [cosine similarity function from scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)

Calculate the similarity between the query embedding and each article embedding and save the result to a variable named `similarity_scores`.

In [8]:
similarity_scores = cosine_similarity(query_vector, abstract_vectors).flatten()
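For intuition about what `cosine_similarity` computes, here is a minimal NumPy sketch of the formula, the dot product divided by the product of the vector lengths. The vectors and the `cosine` helper are hypothetical and exist only for this illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical count vectors over a 4-word vocabulary.
query_counts = np.array([1.0, 1.0, 0.0, 0.0])
doc_a = np.array([2.0, 2.0, 0.0, 0.0])  # same direction as the query, just longer
doc_b = np.array([1.0, 0.0, 1.0, 0.0])  # shares one word with the query
doc_c = np.array([0.0, 0.0, 0.0, 3.0])  # shares no words with the query

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b), ("doc_c", doc_c)]:
    print(name, round(cosine(query_counts, doc), 4))
```

Because the score depends only on direction, `doc_a` scores 1.0 even though it is twice as long as the query, while a document with no shared words scores 0.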

Now, we need to find the most similar results. To help with this, we can use the [argsort function from numpy](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html), which will give the indices sorted by value. 

Use the argsort function to find the indices of the 5 most similar articles. Inspect their titles and abstracts. **Warning:** argsort sorts from smallest to largest.
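As a small illustration of the ordering issue, the sketch below applies `np.argsort` to a hypothetical score array and shows two ways to get the highest-scoring indices first.

```python
import numpy as np

# Hypothetical similarity scores for five documents.
toy_scores = np.array([0.10, 0.45, 0.00, 0.30, 0.27])

ascending = np.argsort(toy_scores)       # indices from lowest to highest score
top_3 = ascending[::-1][:3]              # reverse, then keep the first three
top_3_alt = np.argsort(-toy_scores)[:3]  # same result here: sort the negated scores

print(ascending)  # [2 0 4 3 1]
print(top_3)      # [1 3 4]
```

The two approaches agree whenever the scores are distinct; with tied scores they may order the tied indices differently.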

In [9]:
top_indices = np.argsort(similarity_scores)[::-1][:5]

titles = []
for idx in top_indices:
    print(f"\nTitle: {articles.loc[idx, 'title']}")
    print(f"\nSimilarity Score: {similarity_scores[idx]:.4f}")
    print(f"\nAbstract: {articles.loc[idx, 'abstract']}...\n")
    titles.append(articles.loc[idx, 'title'])


Title: MIRB: Mathematical Information Retrieval Benchmark

Similarity Score: 0.2698

Abstract: Mathematical Information Retrieval (MIR) is the task of retrieving
information from mathematical documents and plays a key role in various
applications, including theorem search in mathematical libraries, answer
retrieval on math forums, and premise selection in automated theorem proving.
However, a unified benchmark for evaluating these diverse retrieval tasks has
been lacking. In this paper, we introduce MIRB (Mathematical Information
Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB
includes four tasks: semantic statement retrieval, question-answer retrieval,
premise retrieval, and formula retrieval, spanning a total of 12 datasets. We
evaluate 13 retrieval models on this benchmark and analyze the challenges
inherent to MIR. We hope that MIRB provides a comprehensive framework for
evaluating MIR systems and helps advance the development of more effective
retrie

Now try a TF-IDF vectorizer (`TfidfVectorizer`). How do the results compare?

In [10]:
display(titles)

['MIRB: Mathematical Information Retrieval Benchmark',
 'SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval',
 'WikiDBGraph: Large-Scale Database Graph of Wikidata for Collaborative Learning',
 'Explaining Neural Networks with Reasons',
 'Scan, Materialize, Simulate: A Generalizable Framework for Physically Grounded Robot Planning']

In [11]:
tfidf = TfidfVectorizer()
abstract_vectors_tfidf = tfidf.fit_transform(articles['abstract'])

query_vector_tfidf = tfidf.transform([query])

similarity_scores_tfidf = cosine_similarity(query_vector_tfidf, abstract_vectors_tfidf).flatten()

top_indices_tfidf = np.argsort(similarity_scores_tfidf)[::-1][:5]
titles = []
for idx in top_indices_tfidf:
    print(f"\nTitle: {articles.loc[idx, 'title']}")
    print(f"\nSimilarity Score (TF-IDF): {similarity_scores_tfidf[idx]:.4f}")
    print(f"\nAbstract: {articles.loc[idx, 'abstract']}...\n")
    titles.append(articles.loc[idx, 'title'])


Title: MIRB: Mathematical Information Retrieval Benchmark

Similarity Score (TF-IDF): 0.2996

Abstract: Mathematical Information Retrieval (MIR) is the task of retrieving
information from mathematical documents and plays a key role in various
applications, including theorem search in mathematical libraries, answer
retrieval on math forums, and premise selection in automated theorem proving.
However, a unified benchmark for evaluating these diverse retrieval tasks has
been lacking. In this paper, we introduce MIRB (Mathematical Information
Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB
includes four tasks: semantic statement retrieval, question-answer retrieval,
premise retrieval, and formula retrieval, spanning a total of 12 datasets. We
evaluate 13 retrieval models on this benchmark and analyze the challenges
inherent to MIR. We hope that MIRB provides a comprehensive framework for
evaluating MIR systems and helps advance the development of more effecti

In [12]:
display(titles)

['MIRB: Mathematical Information Retrieval Benchmark',
 'WikiDBGraph: Large-Scale Database Graph of Wikidata for Collaborative Learning',
 'SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval',
 'HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases',
 'Explaining Neural Networks with Reasons']
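The two rankings share four of their five titles, which is expected: TF-IDF starts from the same word counts and only reweights them, giving less weight to words that occur in many documents. The sketch below shows the learned inverse-document-frequency weights on a hypothetical corpus in which most words appear in every document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: only the last word differs between documents.
toy_docs = [
    "we propose a model for retrieval",
    "we propose a model for planning",
    "we propose a model for generation",
]
toy_tfidf = TfidfVectorizer().fit(toy_docs)

# idf_ holds one weight per vocabulary column; pair each weight with its word.
idf_by_word = {word: float(toy_tfidf.idf_[idx])
               for word, idx in toy_tfidf.vocabulary_.items()}

# Print words from lowest to highest IDF weight.
for word in sorted(idf_by_word, key=idf_by_word.get):
    print(f"{word:12s} {idf_by_word[word]:.3f}")
```

Words found in every document ("we", "propose", "model", "for") get the minimum weight, while the distinguishing words get a higher one, so matches on rare terms contribute more to the similarity score.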

### Method 2: Using a Pretrained Embedding Model

Now, let's compare how we do using the [all-MiniLM-L6-v2 embedding model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

This will create a 384-dimensional dense embedding of each sentence.

In [13]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [14]:
sentences = ["This is an example sentence", "Each sentence is converted"]

embeddings = embedder.encode(sentences)
print(embeddings)

[[ 6.76568300e-02  6.34958893e-02  4.87130992e-02  7.93049634e-02
   3.74480523e-02  2.65282486e-03  3.93749252e-02 -7.09849223e-03
   5.93614690e-02  3.15369889e-02  6.00981005e-02 -5.29052056e-02
   4.06067446e-02 -2.59308629e-02  2.98427884e-02  1.12694502e-03
   7.35149235e-02 -5.03819697e-02 -1.22386590e-01  2.37028170e-02
   2.97265705e-02  4.24769074e-02  2.56337989e-02  1.99519354e-03
  -5.69190606e-02 -2.71598026e-02 -3.29035893e-02  6.60248324e-02
   1.19007140e-01 -4.58791293e-02 -7.26215094e-02 -3.25839706e-02
   5.23413755e-02  4.50553112e-02  8.25296156e-03  3.67023535e-02
  -1.39415273e-02  6.53919056e-02 -2.64272653e-02  2.06366734e-04
  -1.36643462e-02 -3.62809934e-02 -1.95043907e-02 -2.89738290e-02
   3.94270420e-02 -8.84090662e-02  2.62422953e-03  1.36714093e-02
   4.83063050e-02 -3.11565585e-02 -1.17329165e-01 -5.11690415e-02
  -8.85287598e-02 -2.18961909e-02  1.42986253e-02  4.44168225e-02
  -1.34814717e-02  7.43392110e-02  2.66382620e-02 -1.98762268e-02
   1.79191

Use this new embedder to vectorize the abstracts and then find the most similar to the query. How do the results compare to the other methods?

**Warning:** Creating embeddings for all of the articles may take a while.

In [15]:
abstract_embeddings = embedder.encode(articles['abstract'].tolist())

query_embedding = embedder.encode([query])

similarity_scores_st = cosine_similarity(query_embedding, abstract_embeddings).flatten()

top_indices_st = np.argsort(similarity_scores_st)[::-1][:5]
titles = []
print("\n---TOP 5 RESULTS USING SENTENCE TRANSFORMER---\n")
print("--------------------------------------------------")
for idx in top_indices_st:
    print(f"\nTitle: {articles.loc[idx, 'title']}")
    print(f"\nSimilarity Score (Sentence Transformer): {similarity_scores_st[idx]:.4f}")
    print(f"\nAbstract: {articles.loc[idx, 'abstract']}...\n")
    titles.append(articles.loc[idx, 'title'])


---TOP 5 RESULTS USING SENTENCE TRANSFORMER---

--------------------------------------------------

Title: MIRB: Mathematical Information Retrieval Benchmark

Similarity Score (Sentence Transformer): 0.4782

Abstract: Mathematical Information Retrieval (MIR) is the task of retrieving
information from mathematical documents and plays a key role in various
applications, including theorem search in mathematical libraries, answer
retrieval on math forums, and premise selection in automated theorem proving.
However, a unified benchmark for evaluating these diverse retrieval tasks has
been lacking. In this paper, we introduce MIRB (Mathematical Information
Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB
includes four tasks: semantic statement retrieval, question-answer retrieval,
premise retrieval, and formula retrieval, spanning a total of 12 datasets. We
evaluate 13 retrieval models on this benchmark and analyze the challenges
inherent to MIR. We hope that MI

In [16]:
display(titles)

['MIRB: Mathematical Information Retrieval Benchmark',
 'Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation',
 'HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases',
 'The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation',
 'Distance Adaptive Beam Search for Provably Accurate Graph-Based Nearest Neighbor Search']
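One simple way to quantify how the rankings differ is to measure the overlap between the top-5 lists. The sketch below compares the TF-IDF and sentence-transformer titles shown above; the `compare_top_k` helper is a hypothetical name introduced for this illustration.

```python
def compare_top_k(list_a, list_b):
    """Return the titles common to both lists (in list_a order) and the Jaccard overlap."""
    shared = [title for title in list_a if title in list_b]
    union = set(list_a) | set(list_b)
    return shared, len(shared) / len(union)

# Top-5 titles copied from the TF-IDF and sentence-transformer outputs above.
tfidf_titles = [
    'MIRB: Mathematical Information Retrieval Benchmark',
    'WikiDBGraph: Large-Scale Database Graph of Wikidata for Collaborative Learning',
    'SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval',
    'HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases',
    'Explaining Neural Networks with Reasons',
]
st_titles = [
    'MIRB: Mathematical Information Retrieval Benchmark',
    'Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation',
    'HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases',
    'The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation',
    'Distance Adaptive Beam Search for Provably Accurate Graph-Based Nearest Neighbor Search',
]

shared, jaccard = compare_top_k(tfidf_titles, st_titles)
print(shared)
print(jaccard)  # 0.25
```

Only two titles appear in both lists. The embedding model also surfaces papers on retrieval-augmented generation and nearest neighbor search, topics close to the query that the word-matching methods ranked lower.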

### FAISS

[Faiss](https://faiss.ai/index.html) is a library for efficient similarity search and clustering of dense vectors. Here, it can replace the manual similarity and sorting steps used above to find the most similar abstracts.

To get cosine similarity from Faiss, we use an inner-product index. The inner product of two unit-length vectors equals their cosine similarity, so we first need to normalize our vectors so that they all have length 1.

Use the [normalize function](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html) to normalize both the abstract vectors and the query vector.

In [17]:
normalized_embeddings = normalize(abstract_embeddings)
normalized_query = normalize(query_embedding)

dimension = normalized_embeddings.shape[1]
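To check the claim that normalizing first turns the inner product into cosine similarity, here is a NumPy-only sketch on random, hypothetical vectors. The `l2_normalize` helper is an illustrative stand-in that mirrors the default row-wise L2 behavior of scikit-learn's `normalize`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_docs = rng.normal(size=(4, 8))   # 4 hypothetical document vectors
toy_query = rng.normal(size=(1, 8))  # 1 hypothetical query vector

def l2_normalize(x):
    """Scale each row to unit L2 length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

unit_docs = l2_normalize(toy_docs)
unit_query = l2_normalize(toy_query)

# Inner products of the unit-length vectors.
inner_products = unit_query @ unit_docs.T

# Cosine similarity computed directly from the unnormalized vectors.
cosines = (toy_query @ toy_docs.T) / (
    np.linalg.norm(toy_query, axis=1, keepdims=True) * np.linalg.norm(toy_docs, axis=1)
)

print(np.allclose(inner_products, cosines))  # True
```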

Now, create an [IndexFlatIP object](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#summary-of-methods) that has dimensions equal to the dimensionality of your vectors. Then add your normalized abstract vectors.

Hint: You can mimic the example [here](https://github.com/facebookresearch/faiss/wiki/Getting-started#building-an-index-and-adding-the-vectors-to-it), but substitute in the IndexFlatIP class.

In [18]:
index = faiss.IndexFlatIP(dimension) # dot product index

index.add(normalized_embeddings.astype('float32')) # Faiss works with float32 vectors


Finally, use the [search function](https://github.com/facebookresearch/faiss/wiki/Getting-started#searching) on your index object to find the 5 most similar articles.

In [19]:
k = 5
distances, indices = index.search(normalized_query.astype('float32'), k) # inner-product scores and row indices

titles = []
print("\n---TOP 5 RESULTS USING FAISS---\n")
print("--------------------------------------------------")
for i, idx in enumerate(indices[0]):
    print(f"\nTitle: {articles.loc[idx, 'title']}")
    print(f"\nSimilarity Score (FAISS): {distances[0][i]:.4f}")
    print(f"\nAbstract: {articles.loc[idx, 'abstract']}...\n")
    titles.append(articles.loc[idx, 'title'])


---TOP 5 RESULTS USING FAISS---

--------------------------------------------------

Title: MIRB: Mathematical Information Retrieval Benchmark

Similarity Score (FAISS): 0.4782

Abstract: Mathematical Information Retrieval (MIR) is the task of retrieving
information from mathematical documents and plays a key role in various
applications, including theorem search in mathematical libraries, answer
retrieval on math forums, and premise selection in automated theorem proving.
However, a unified benchmark for evaluating these diverse retrieval tasks has
been lacking. In this paper, we introduce MIRB (Mathematical Information
Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB
includes four tasks: semantic statement retrieval, question-answer retrieval,
premise retrieval, and formula retrieval, spanning a total of 12 datasets. We
evaluate 13 retrieval models on this benchmark and analyze the challenges
inherent to MIR. We hope that MIRB provides a compreh

In [20]:
display(titles)

['MIRB: Mathematical Information Retrieval Benchmark',
 'Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation',
 'HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases',
 'The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation',
 'Distance Adaptive Beam Search for Provably Accurate Graph-Based Nearest Neighbor Search']
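The Faiss results match the earlier `cosine_similarity` ranking because a flat inner-product index performs an exact, exhaustive search over the normalized vectors. The NumPy sketch below imitates that behavior on tiny hypothetical vectors. It is an illustrative stand-in, not Faiss code, and `brute_force_ip_search` is a made-up helper name.

```python
import numpy as np

def brute_force_ip_search(index_vectors, queries, k):
    """Exact top-k inner-product search.

    Returns (scores, indices), each of shape (n_queries, k),
    ordered from highest to lowest score.
    """
    scores = queries @ index_vectors.T        # (n_queries, n_vectors)
    top = np.argsort(-scores, axis=1)[:, :k]  # best k columns per query
    return np.take_along_axis(scores, top, axis=1), top

# Hypothetical unit-length vectors in two dimensions.
unit_docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
unit_query = np.array([[0.8, 0.6]], dtype="float32")

scores, indices = brute_force_ip_search(unit_docs, unit_query, k=2)
print(indices)  # [[2 0]]
print(scores)
```

An exhaustive scan like this is fine for 500 abstracts. Faiss also offers approximate index types that trade some accuracy for speed on much larger collections.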