# 🔍 Research Paper Semantic Search Engine

![Architecture](https://i.imgur.com/95idYbN.jpeg)

A production-grade semantic search engine for exploring arXiv research papers using advanced information retrieval techniques. This project implements a **hybrid retrieval architecture** combining dense vector search, lexical matching, and neural re-ranking to deliver highly relevant research paper recommendations.

The system indexes 8,000+ NLP papers from arXiv and provides an interactive web interface powered by Streamlit, enabling researchers to discover papers using natural language queries with real-time evaluation metrics.

**Key Innovation:** Multi-stage retrieval pipeline with intelligent fusion of semantic and lexical signals, followed by cross-encoder re-ranking for optimal result quality.

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud platform), you need to **uncomment and run** the code block below to install the dependencies for this notebook:

---

💡 **NOTE**: A GPU is not required; a CPU will suffice. If you still want to run the examples in this Google Colab notebook on a GPU, go to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [None]:
# %%capture
# !pip install datasets langchain langchain_community ddgs faiss-cpu rank_bm25 sentence-transformers chromadb

## Getting data archive

In [2]:
from datasets import load_dataset

# load data from huggingface
dataset = load_dataset("MaartenGr/arxiv_nlp")["train"]


In [3]:
dataset

Dataset({
    features: ['Titles', 'Abstracts', 'Years', 'Categories'],
    num_rows: 44949
})

In [4]:
# Extract metadata
titles = list(dataset["Titles"])
abstracts = list(dataset["Abstracts"])

In [5]:
# cleaning up the abstracts
abstracts = [a.strip(' \n').replace("\n", "") for a in abstracts]
titles = [t.strip(' \n').replace("\n", "") for t in titles]
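One thing to watch in the cleanup step: `replace("\n", "")` deletes line breaks outright, so if the raw text wraps a sentence with a bare newline, the two neighboring words end up fused. Below is a minimal sketch of a whitespace-normalizing alternative. `normalize_whitespace` is a hypothetical helper name, and the sample string is made up for illustration rather than taken from the dataset.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with " " yields one clean line.
    return " ".join(text.split())


# Made-up sample that mimics a line-wrapped abstract (an assumption, not real data).
sample = "  Parameter-efficient fine-tuning\nof large language\n  models.\n"
cleaned = normalize_whitespace(sample)
print(cleaned)
```

If you swap this into the notebook, the text being embedded changes slightly, so the scores may differ a little from the ones shown in the outputs here.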

## Embedding the Abstract Chunks

We will use a Sentence Transformer as the embedding model. Feel free to use either an open-source model, as I do here, or a closed-source one. In the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/), the authors use the [Cohere](https://docs.cohere.com/) API as a closed-source embedding model.

Browse other Sentence Transformer models here: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html <br>

Model Used in the notebook: **all-MiniLM-L6-v2** <br>
![Model Details](https://i.imgur.com/Y4gH8yC.png)


In [6]:
from sentence_transformers import SentenceTransformer

# loading embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")


In [7]:
# embedding the abstracts
embeddings = embedding_model.encode(abstracts)

# checking the dimensions
embeddings.shape

(44949, 384)

## Dense Retrieval with FAISS

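`IndexFlatIP` ranks results by inner product, which equals cosine similarity only when every vector has unit length. We therefore L2-normalize the embeddings before indexing (and the query at search time). Without this step, the scores are plain inner products that also depend on vector magnitude, not cosine similarities. This step is easy to overlook and is missing in many GitHub repositories.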

In [8]:
import numpy as np

embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
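To make the reason for normalizing concrete, here is a small standard-library sketch showing that the inner product of two vectors matches their cosine similarity only after both are scaled to unit length. The vectors and the helper names `dot` and `l2_normalize` are made up for illustration and are not part of the notebook.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def l2_normalize(u):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(dot(u, u))
    return [x / norm for x in u]


# Two toy vectors, each of length 5 (made-up values).
a = [3.0, 4.0]
b = [4.0, 3.0]

# Raw inner product: grows with vector length, so it is not a cosine.
raw_ip = dot(a, b)

# Cosine similarity straight from its definition.
cosine = raw_ip / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Inner product after normalization: agrees with the cosine similarity.
normalized_ip = dot(l2_normalize(a), l2_normalize(b))

print(raw_ip, cosine, normalized_ip)
```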

### Step 1: Build FAISS Index (Cosine Similarity)

In [9]:
import faiss

dimension = embeddings.shape[1]

index = faiss.IndexFlatIP(dimension)  # Inner Product = cosine (after normalization)
index.add(embeddings)

print("Total vectors indexed:", index.ntotal)

Total vectors indexed: 44949


### Step 2: Define Search Function (Retriever)

In [10]:
def dense_search(query, top_k=5):
    query_embedding = embedding_model.encode([query])
    query_embedding = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)

    scores, indices = index.search(query_embedding, top_k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        idx = int(idx)  # FAISS returns NumPy integers; cast to a plain int

        results.append({
            "title": titles[idx],
            "abstract": abstracts[idx],
            "score": float(score)
        })
    return results

### Step 3: Test Retrieval

In [11]:
# inspection
query = "parameter efficient fine tuning of large language models"
results = dense_search(query, top_k=5)

for r in results:
    print(r["score"], r["title"])

0.8596878051757812 Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning
0.7623370885848999 Scaling Sparse Fine-Tuning to Large Language Models
0.7545766830444336 Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques  for LLMs
0.7505697011947632 Compacter: Efficient Low-Rank Hypercomplex Adapter Layers
0.7484837770462036 Towards Better Parameter-Efficient Fine-Tuning for Large Language  Models: A Position Paper


## Sparse Retrieval (BM25)

### Step 1: Tokenize Abstracts

In [12]:
from rank_bm25 import BM25Okapi

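# Tokenization: lowercase and split on whitespace
# (loop variable renamed so it does not shadow the built-in abs())
tokenized_abstracts = [abstract.lower().split() for abstract in abstracts]
bm25 = BM25Okapi(tokenized_abstracts)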

### Step 2: BM25 Search

In [13]:
def bm25_search(query, top_k=5):
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)

    top_indices = np.argsort(scores)[-top_k:][::-1]

    results = []
    for idx in top_indices:
        idx = int(idx)  # argsort yields NumPy integers; cast to a plain int

        results.append({
            "title": titles[idx],
            "abstract": abstracts[idx],
            "score": float(scores[idx])
        })
    return results
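Whitespace tokenization keeps punctuation attached to words, so a token such as `models.` will not match the query term `models`. The sketch below shows one possible alternative using a regular expression. `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names, and the choice to keep only lowercase letters and digits is an assumption that will not suit every corpus (for example, it splits hyphenated terms).

```python
import re

# Assumption: a token is any run of ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())


sentence = "Fine-tuning of large language models."

# What the notebook's approach produces for comparison.
whitespace_tokens = sentence.lower().split()

# What the regex-based sketch produces.
regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens)
print(regex_tokens)
```

If you try this, use the same tokenizer for both the abstracts and the query so that the two vocabularies line up.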

### Step 3: Compare Dense vs Sparse

In [14]:
query = "corpus evaluation of French determiner grammar shows high accuracy"

print("Dense Search:")
for r in dense_search(query):
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{r['title']} -> {r['score']}")

print("\nBM25 Search:")
for r in bm25_search(query):
    print(f"{r['title']} -> {r['score']}")

Dense Search:
Evaluation of a Grammar of French Determiners -> 0.73871248960495
Verbal chunk extraction in French using limited resources -> 0.6437157392501831
Int\'egration des donn\'ees d'un lexique syntaxique dans un analyseur  syntaxique probabiliste -> 0.6364638209342957
\'Evaluation de lexiques syntaxiques par leur int\'egartion dans  l'analyseur syntaxiques FRMG -> 0.6359631419181824
Constructions d\'efinitoires des tables du Lexique-Grammaire -> 0.6027039289474487

BM25 Search:
Treating clitics with minimalist grammars -> 23.559089982317808
A Lexicalized Tree Adjoining Grammar for English -> 22.7408266461012
Evaluation Of Word Embeddings From Large-Scale French Web Content -> 21.240541110452288
\'Evaluation de lexiques syntaxiques par leur int\'egartion dans  l'analyseur syntaxiques FRMG -> 20.51329370856521
Un syst\`eme modulaire d'acquisition automatique de traductions \`a  partir du Web -> 20.136017588698476


## Re-Ranking

### Step 1: Candidate Pool (Hybrid)

In [15]:
def hybrid_search(query, dense_k=10, bm25_k=10):
    dense_results = dense_search(query, dense_k)
    bm25_results = bm25_search(query, bm25_k)

    combined = {r["abstract"]: r for r in dense_results + bm25_results}
    return list(combined.values())
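Note that the dictionary comprehension keeps the last entry seen for each abstract, so a paper returned by both retrievers ends up carrying only its BM25 score. The re-ranker ignores these scores, so this does not affect the final ranking, but if you want to inspect both signals, a variant along the following lines could work. `merge_candidates`, the `dense_score` and `bm25_score` keys, and the toy results are all hypothetical and not part of the notebook.

```python
def merge_candidates(dense_results, bm25_results):
    """Union two result lists, keyed by abstract, keeping both scores."""
    merged = {}
    labeled = (("dense_score", dense_results), ("bm25_score", bm25_results))
    for score_key, results in labeled:
        for r in results:
            # Create the entry on first sight, then attach this retriever's score.
            entry = merged.setdefault(
                r["abstract"],
                {"title": r["title"], "abstract": r["abstract"]},
            )
            entry[score_key] = r["score"]
    return list(merged.values())


# Toy results (made-up titles and scores) where "Paper B" appears in both lists.
dense_toy = [
    {"title": "Paper A", "abstract": "abstract a", "score": 0.81},
    {"title": "Paper B", "abstract": "abstract b", "score": 0.74},
]
bm25_toy = [
    {"title": "Paper B", "abstract": "abstract b", "score": 19.2},
    {"title": "Paper C", "abstract": "abstract c", "score": 17.5},
]

merged_toy = merge_candidates(dense_toy, bm25_toy)
for m in merged_toy:
    print(m["title"], m.get("dense_score"), m.get("bm25_score"))
```

Each merged entry still has an `"abstract"` key, so the output keeps the shape that a re-ranker working on abstracts would need.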

### Step 2: Cross-Encoder Re-Ranker

In [16]:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


### Step 3: Re-ranking Logic

In [17]:
def rerank(query, candidates, top_k=5):
    pairs = [(query, c["abstract"]) for c in candidates]
    scores = reranker.predict(pairs)

    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)

    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)[:top_k]

### Step 4: Full Test

In [18]:
query = "efficient adaptation of large language models for low data regimes"

candidates = hybrid_search(query)
final_results = rerank(query, candidates)

for r in final_results:
    print(r["rerank_score"], r["title"])


6.318161487579346 Efficiently Adapting Pretrained Language Models To New Languages
6.294178485870361 Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques  for LLMs
5.941547870635986 SuryaKiran at MEDIQA-Sum 2023: Leveraging LoRA for Clinical Dialogue  Summarization
5.693998336791992 Zero-Shot Adaptive Transfer for Conversational Language Understanding
5.4996113777160645 Efficient MDI Adaptation for n-gram Language Models
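The re-rank scores above are unbounded, which suggests they are raw model outputs rather than probabilities. If a score between 0 and 1 is easier to work with (for example, for thresholding), one option is to pass each score through a logistic sigmoid. This is a sketch under that assumption, not something the notebook does; the `sigmoid` helper is hypothetical, and the input values are copied from the output above.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Scores copied from the re-ranking output above.
raw_scores = [6.318161487579346, 6.294178485870361, 5.941547870635986]

bounded_scores = [sigmoid(s) for s in raw_scores]
print(bounded_scores)
```

Because the sigmoid is strictly increasing, the ranking order is unchanged; only the scale differs.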


FAISS is a vector index rather than a database, which makes it handy for quick experiments without setting up a vector database. It provides fast nearest-neighbor search over an in-memory index, but it does not provide document storage, metadata filtering, an API server, or automatic persistence.

## Migrating from FAISS to Chroma

### 1. Import Chroma

In [19]:
import chromadb
from chromadb.config import Settings

### 2. Create a Persistent Chroma Client

In [20]:
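# PersistentClient (Chroma 0.4+) stores the collection on disk under `path`;
# chromadb.Client(...) creates an in-memory client in those versions.
chroma_client = chromadb.PersistentClient(
    path="./chroma_arxiv",
    settings=Settings(anonymized_telemetry=False),
)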

### 3. Create Collection (Vector Store)

In [21]:
collection = chroma_client.get_or_create_collection(
    name="arxiv_abstracts",
    metadata={"hnsw:space": "cosine"}
)

### 4. Prepare Data for Chroma (Very Important)
Chroma expects:
- `documents`: text
- `metadatas`: extra info
- `ids`: strings

In [22]:
documents = abstracts
metadatas = [{"title": titles[i]} for i in range(len(titles))]
ids = [str(i) for i in range(len(abstracts))]

### 5. Add Embeddings to Chroma

Note: Add the data in batches. Otherwise you will get `InternalError: ValueError: Batch size of 44949 is greater than max batch size of 5461`.

In [23]:
BATCH_SIZE = 1000  # safe & fast

for i in range(0, len(documents), BATCH_SIZE):
    collection.add(
        documents=documents[i:i+BATCH_SIZE],
        metadatas=metadatas[i:i+BATCH_SIZE],
        embeddings=embeddings[i:i+BATCH_SIZE].tolist(),
        ids=ids[i:i+BATCH_SIZE]
    )
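The slicing pattern in the loop above can be checked in isolation. The sketch below uses a hypothetical `iter_batches` helper to list the index ranges that a batch size of 1000 produces for 44,949 items, including the shorter final batch.

```python
def iter_batches(n_items: int, batch_size: int):
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, n_items)


spans = list(iter_batches(44949, 1000))

print(len(spans))   # number of batches
print(spans[0])     # first batch
print(spans[-1])    # final, shorter batch
```

Python slices already clamp at the end of a list, so the `min(...)` is only there to make the reported end index exact.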

In [24]:
# Older Chroma versions (before 0.4) needed an explicit persist() call.
# Newer versions write to disk automatically when a persistent client is used.
# chroma_client.persist()

### 6. Dense Search Function (Chroma Version)

In [25]:
import numpy as np

def chroma_dense_search(query, top_k=5):
    query_embedding = embedding_model.encode([query])
    query_embedding = query_embedding / np.linalg.norm(
        query_embedding, axis=1, keepdims=True
    )

    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=top_k
    )

    output = []
    for i in range(len(results["documents"][0])):
        output.append({
            "title": results["metadatas"][0][i]["title"],
            "abstract": results["documents"][0][i],
            "score": results["distances"][0][i]  # cosine distance: lower means closer
        })
    return output

### 7. Sanity Check

In [26]:
query = "multichannel end-to-end attention speech recognizer with HLSTM improves distant speech recognition"

results = chroma_dense_search(query, top_k=5)

for r in results:
    print(r["score"], r["title"])


0.2532505393028259 End-to-end attention-based distant speech recognition with Highway LSTM
0.3964526653289795 Multi-encoder multi-resolution framework for end-to-end speech  recognition
0.4003249406814575 An Online Attention-based Model for Speech Recognition
0.41325873136520386 On the limit of English conversational speech recognition
0.4296720623970032 On using 2D sequence-to-sequence models for speech recognition


Works like a charm!
![LSTM Paper](https://i.imgur.com/2tF2lI9.png)
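Unlike the FAISS results, these scores increase down the list. With `"hnsw:space": "cosine"`, Chroma reports cosine distance, which is 1 minus cosine similarity, so smaller values mean closer matches. A minimal sketch of the conversion follows; `cosine_distance_to_similarity` is a hypothetical helper, and the distances are copied from the output above.

```python
def cosine_distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance (1 - cosine similarity) back to a similarity."""
    return 1.0 - distance


# Distances copied from the sanity-check output above.
distances = [0.2532505393028259, 0.3964526653289795, 0.4003249406814575]

similarities = [cosine_distance_to_similarity(d) for d in distances]
print(similarities)
```

Converting this way puts the Chroma scores on the same "higher is better" scale as the FAISS inner-product scores.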

## Use LangChain To Build Agent (Pseudo Agent)

![LangChain Banner](https://i.imgur.com/WC4fZSg.png)

We will use the DuckDuckGo web search tool provided by LangChain. Check out the docs [here](https://docs.langchain.com/oss/python/integrations/tools).

This is not a true agent, but adding web search gives the project an agent-like feel. Feel free to build on it to make the project more agentic.

In [28]:
from langchain_community.tools import DuckDuckGoSearchResults
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper

wrapper = DuckDuckGoSearchAPIWrapper(max_results=10)

search = DuckDuckGoSearchResults(api_wrapper=wrapper, output_format="list")

results = search.invoke(
    "Bio-linguistic transition and Baldwin effect in an evolutionary" # The top re-ranked paper's title goes here
)

arxiv_only = [
    r for r in results
    if "arxiv.org" in r.get("link", "")
]

arxiv_only

[{'snippet': 'View a PDF of the paper titled Bio - linguistic transition and Baldwin effect in an evolutionary naming-game model, by Adam Lipowski and Dorota Lipowska.',
  'title': '[0710.0009] Bio - linguistic transition and Baldwin effect in an ...',
  'link': 'https://arxiv.org/abs/0710.0009'}]

In [29]:
arxiv_only[0]['link']

'https://arxiv.org/abs/0710.0009'

**Replacing "abs" with "pdf" in this link makes it open the PDF version of the paper directly.**
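As a small sketch of that substitution, the hypothetical helper `abs_to_pdf_url` below swaps the `/abs/` path segment for `/pdf/`. It assumes the link follows the `https://arxiv.org/abs/<id>` pattern seen in the search result above.

```python
def abs_to_pdf_url(abs_url: str) -> str:
    """Turn an arXiv abstract-page URL into the matching PDF URL."""
    # Replace only the first "/abs/" path segment, leaving the paper ID untouched.
    return abs_url.replace("/abs/", "/pdf/", 1)


pdf_link = abs_to_pdf_url("https://arxiv.org/abs/0710.0009")
print(pdf_link)
```

Matching the full `/abs/` segment rather than the bare string `abs` avoids accidentally rewriting other parts of a URL that happen to contain those letters.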

**If you want to build a full-fledged, product-level project, this is the time to migrate from Google Colab to VS Code (or any other IDE). Colab is only well suited to research.**