## Similarity Exercise

In [4]:
%pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp312-cp312-win_amd64.whl.metadata (5.0 kB)
Downloading faiss_cpu-1.11.0-cp312-cp312-win_amd64.whl (15.0 MB)

In [6]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

from sentence_transformers import SentenceTransformer
import faiss

In this exercise, you've been provided the titles and abstracts of 500 recent machine learning research papers posted on arXiv.org.

In [8]:
articles = pd.read_csv('../data/arxiv_papers.csv')
articles.head()

| | title | abstract | url |
|---|---|---|---|
| 0 | GoT-R1: Unleashing Reasoning Capability of MLL... | Visual generation models have made remarkable ... | http://arxiv.org/abs/2505.17022v1 |
| 1 | Delving into RL for Image Generation with CoT:... | Recent advancements underscore the significant... | http://arxiv.org/abs/2505.17017v1 |
| 2 | Interactive Post-Training for Vision-Language-... | We introduce RIPT-VLA, a simple and scalable r... | http://arxiv.org/abs/2505.17016v1 |
| 3 | When Are Concepts Erased From Diffusion Models? | Concept erasure, the ability to selectively pr... | http://arxiv.org/abs/2505.17013v1 |
| 4 | Understanding Prompt Tuning and In-Context Lea... | Prompting is one of the main ways to adapt a p... | http://arxiv.org/abs/2505.17010v1 |


In [12]:
i = 1
print(f'Title: {articles.loc[i,"title"]}\n')
print(f'Text: {articles.loc[i,"abstract"]}')

Title: Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Text: Recent advancements underscore the significant role of Reinforcement Learning
(RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large
language models (LLMs). Two prominent RL algorithms, Direct Preference
Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central
to these developments, showcasing different pros and cons. Autoregressive image
generation, also interpretable as a sequential CoT reasoning process, presents
unique challenges distinct from LLM-based CoT reasoning. These encompass
ensuring text-image consistency, improving image aesthetic quality, and
designing sophisticated reward models, rather than relying on simpler
rule-based rewards. While recent efforts have extended RL to this domain, these
explorations typically lack an in-depth analysis of the domain-specific
challenges and the characteristics of different RL strategies. To bridge this
gap, we

Let's try out a variety of ways of vectorizing the abstracts and searching for semantically similar papers.

### Method 1: Bag of Words

Fit a `CountVectorizer` to the abstracts of the articles, using all of the default settings. Then vectorize the dataset using the fitted vectorizer.

In [24]:
abstracts = articles['abstract']
vectorizer = CountVectorizer()
vectorizer.fit(abstracts)
X = vectorizer.transform(abstracts)

In [28]:
print(vectorizer.vocabulary_)



In [34]:


# Get the vocabulary: a dict mapping each token to its column index in X
vocab_dict = vectorizer.vocabulary_

# Convert to DataFrame: words as one column, indices as another
vocab_df = pd.DataFrame(list(vocab_dict.items()), columns=['token', 'index'])

# Optional: sort by index to see tokens in the order of their vector columns
vocab_df = vocab_df.sort_values(by='index').reset_index(drop=True)

index_filter = 7779
vocab_df_filtered = vocab_df[vocab_df['index'] == index_filter]
vocab_df_filtered

| | token | index |
|---|---|---|
| 7779 | visual | 7779 |


**Question:** How many dimensions do the embeddings have?

In [26]:
X.shape

(500, 7978)

Now, let's use the embeddings to look for articles similar to a search query.

Apply the vectorizer you fit earlier to this query string to get an embedding. 

**Hint:** The vectorizer's `transform` method expects an iterable of documents, not a bare string, so pass a list containing the query string.
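The hint can be illustrated on a small scale. The sketch below uses a hypothetical three-document toy corpus (the names `toy_docs`, `toy_vectorizer`, and `toy_query` are made up for illustration and are not part of the exercise data); the same pattern should carry over to the vectorizer fitted on the abstracts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the abstracts
toy_docs = [
    "vector databases store embeddings",
    "retrieval augmented generation uses a vector index",
    "diffusion models generate images",
]

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_docs)

toy_query = "vector databases for retrieval"

# transform() expects an iterable of documents, so wrap the string in a list
toy_query_vec = toy_vectorizer.transform([toy_query])

# One row for the single query, one column per vocabulary token
print(toy_query_vec.shape)

# Words that were not seen during fit (here "for") are ignored,
# so only the three known query words are counted
print(toy_query_vec.sum())
```

Note that query words missing from the fitted vocabulary contribute nothing to the query vector, which limits what a bag-of-words search can match.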

In [None]:
query = "vector databases for retrieval augmented generation"

# Your code to transform the search query

Now, we need to find the similarity between our query embedding and each vectorized article.

For this, you can use the [cosine similarity function from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).

Calculate the similarity between the query embedding and each article embedding and save the result to a variable named `similarity_scores`.
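As a rough sketch of how `cosine_similarity` behaves, the example below again uses a hypothetical toy corpus rather than the exercise data. All `toy_*` names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the abstracts
toy_docs = [
    "vector databases store embeddings",
    "retrieval augmented generation uses a vector index",
    "diffusion models generate images",
]

toy_vectorizer = CountVectorizer().fit(toy_docs)
toy_X = toy_vectorizer.transform(toy_docs)
toy_query_vec = toy_vectorizer.transform(["vector databases for retrieval"])

# cosine_similarity accepts sparse matrices and returns a dense array
# with one row per query and one column per document
toy_scores = cosine_similarity(toy_query_vec, toy_X)
print(toy_scores.shape)

# Flatten to a 1-D array holding one score per document
toy_scores = toy_scores.ravel()
print(toy_scores)
```

With the exercise data, the same call on a `(1, 7978)` query vector and the `(500, 7978)` matrix `X` would give a `(1, 500)` array, so flattening it can make the later sorting step simpler.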

In [None]:
# Your Code Here

Now, we need to find the most similar results. To help with this, we can use the [argsort function from numpy](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html), which will give the indices sorted by value. 

Use the argsort function to find the indices of the 5 most similar articles, then inspect their titles and abstracts. **Warning:** argsort sorts from smallest to largest, so the most similar articles are at the end of the result.
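One possible way to handle the ascending order is to reverse the sorted indices before slicing. The sketch below uses made-up scores (`toy_scores` is a hypothetical array, not output from the exercise).

```python
import numpy as np

# Hypothetical similarity scores for six documents
toy_scores = np.array([0.10, 0.75, 0.00, 0.42, 0.90, 0.33])
k = 3

# argsort returns indices in ascending order of score,
# so reverse the result and keep the first k entries
top_k = np.argsort(toy_scores)[::-1][:k]

print(top_k)              # [4 1 3]
print(toy_scores[top_k])  # [0.9  0.75 0.42]
```

With the exercise data, indices obtained this way could be used to select rows, for example with `articles.loc[top_k, "title"]`.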

In [None]:
# Your Code Here

Now repeat the search using a `TfidfVectorizer` in place of the `CountVectorizer`. How do the results compare?

In [None]:
# Your Code Here

### Method 2: Using a Pretrained Embedding Model

Now, let's compare how we do using the [all-MiniLM-L6-v2 embedding model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

This will create a 384-dimensional dense embedding of each sentence.

In [None]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
sentences = ["This is an example sentence", "Each sentence is converted"]

embeddings = embedder.encode(sentences)
print(embeddings)

Use this new embedder to vectorize the abstracts and then find the most similar to the query. How do the results compare to the other methods?

**Warning:** Creating embeddings for all of the articles may take a while.

In [None]:
# Your Code Here

### FAISS

[Faiss](https://faiss.ai/index.html) is a library for efficient similarity search and clustering of dense vectors. It can be used to automate the process of finding the most similar abstracts.

To search by cosine similarity in Faiss, we use an inner-product index on vectors that have been normalized to length 1. For unit-length vectors, the inner product is equal to the cosine similarity.

Use the [normalize function](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html) to normalize both the abstract vectors and the query vector.
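The relationship between normalization, the inner product, and cosine similarity can be checked numerically. The sketch below uses random toy vectors (the `toy_*` names and the sizes are arbitrary assumptions, not the exercise embeddings).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(4, 8))  # four hypothetical document vectors
toy_query = rng.normal(size=(1, 8))    # one hypothetical query vector

# normalize() defaults to norm="l2" and axis=1: each row is scaled to length 1
toy_vectors_unit = normalize(toy_vectors)
toy_query_unit = normalize(toy_query)

# Inner product of unit vectors versus cosine similarity of the raw vectors
inner = toy_query_unit @ toy_vectors_unit.T
cos = cosine_similarity(toy_query, toy_vectors)

print(np.allclose(inner, cos))  # True
```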

In [None]:
# Your Code Here

Now, create an [IndexFlatIP object](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#summary-of-methods) that has dimensions equal to the dimensionality of your vectors. Then add your normalized abstract vectors.

**Hint:** You can mimic the example [here](https://github.com/facebookresearch/faiss/wiki/Getting-started#building-an-index-and-adding-the-vectors-to-it), but substitute in the IndexFlatIP class.

In [None]:
# Your Code Here

Finally, use the [search function](https://github.com/facebookresearch/faiss/wiki/Getting-started#searching) on your index object to find the 5 most similar articles.

In [None]:
# Your Code Here