---

### Problem Statement for Ex9

**Use Case Title:**
**"Research Paper Selector using Retrieve-and-Rerank RAG (R\&R-RAG)"**

**Problem Statement:**
In academic research, retrieving the most relevant scholarly papers for a specific topic (e.g., "few-shot learning techniques") is difficult: the corpus is large and keyword-based results are noisy. Basic embedding-based semantic search finds related papers quickly, but it often ranks them poorly because it does not capture fine-grained semantic differences between the query and each paper.

To improve the **accuracy and relevance** of retrieved results, this project implements a **hybrid RAG pipeline** using:

* A **bi-encoder (SentenceTransformer)** for fast semantic retrieval using FAISS.
* A **cross-encoder (MS MARCO TinyBERT)** for reranking the top retrieved results using deeper interaction modeling between query and document.

This approach enhances **information retrieval quality** for NLP/NLU-based academic literature search.
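
The two-stage idea can be sketched without any models. In this minimal, illustrative example both scorers are simple stand-ins (word-count cosine for the fast stage, query-token coverage for the careful stage), and all function names and toy documents are assumptions rather than part of the notebook's pipeline:

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()


def cheap_score(query, doc):
    """Stage 1 stand-in: cosine similarity of bag-of-words counts."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0


def careful_score(query, doc):
    """Stage 2 stand-in: fraction of query tokens found in the document."""
    q_tokens, d_tokens = set(tokenize(query)), set(tokenize(doc))
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def retrieve_then_rerank(query, docs, n_candidates, top_k):
    # Stage 1: keep the n_candidates best documents under the cheap scorer.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: rescore only those candidates with the slower scorer.
    rescored = [(careful_score(query, d), d) for d in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:top_k]


toy_docs = [
    "few-shot learning with prompt tuning",
    "meta-learning for few-shot classification",
    "a survey on transformers in vision",
    "contrastive learning for representation learning",
]
top = retrieve_then_rerank("few-shot learning", toy_docs, n_candidates=3, top_k=2)
```

The expensive scorer only ever sees `n_candidates` documents, which is what keeps the second stage affordable on a large corpus.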

---

### What's New in Ex9 Compared to Ex8?

| Feature           | Ex8 (Hybrid Clause Finder) | Ex9 (Research Paper Reranker)    | What's New in Ex9                            |
| ----------------- | -------------------------- | -------------------------------- | -------------------------------------------- |
| Input Type        | Legal contract PDFs        | Academic papers in CSV           | CSV-based structured document handling       |
| Search Type       | BM25 + FAISS hybrid        | FAISS + CrossEncoder hybrid      | CrossEncoder reranking added                 |
| Index             | Semantic + BM25            | Semantic only for initial recall | Uses CrossEncoder for deep reranking         |
| LLM               | FLAN-T5 via pipeline       | Not used for generation          | Focus is on ranking, not answering           |
| Evaluation Output | Answer + source chunks     | Ranked documents with scores     | Reranked paper results with relevance scores |

---

### Practical Significance

This pattern (R\&R-RAG) is used in:

* Academic search engines like Semantic Scholar, Arxiv-Sanity.
* Legal document analysis for ranking contract clauses by importance.
* Patent retrieval and systematic literature reviews in NLP pipelines.

---



In [2]:
# Install dependencies
!pip install sentence-transformers faiss-cpu pandas tqdm -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pycaret 3.3.2 requires scikit-learn>1.4.0, but you have scikit-learn 1.3.0 which is incompatible.
umap-learn 0.5.9.post2 requires scikit-learn>=1.6, but you have scikit-learn 1.3.0 which is incompatible.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
imbalanced-learn 0.13.0 requires scikit-learn<2,>=1.3.2, but you have scikit-learn 1.3.0 which is incompatible.

In [None]:
# Imports
import pandas as pd
import faiss
import numpy as np
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, CrossEncoder

# SentenceTransformer
# Loads a bi-encoder model.
# A bi-encoder converts a single piece of text (a paper's title + abstract, or a user query) into a vector (embedding).
# The paper vectors are stored in FAISS for fast semantic retrieval.
# SentenceTransformer("all-MiniLM-L6-v2") is used here to:
# - Convert every academic paper (title + abstract) into a vector.
# - Convert the user query into a vector so the top-k most similar papers can be found.

# CrossEncoder
# Loads a cross-encoder model.
# A cross-encoder takes a pair of texts (e.g., query + one paper) and processes them together.
# It returns one relevance score per pair. In this notebook's run the scores fall between 0 and 1;
# other library versions or models may return unbounded scores.
# CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-6") is used here to:
# - Score each (query, document) pair among the retrieved candidates.
# - Rerank the candidates by these scores for better final results.
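
A bounded relevance score is typically a raw model output passed through a sigmoid. The sketch below shows that mapping on made-up numbers (the logits are hypothetical, not outputs of this notebook's model); whether the library applies a sigmoid by default may depend on the installed version and the model's configuration, so treat that as an assumption to verify:

```python
import math


def sigmoid(logit):
    """Map an unbounded score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical raw outputs for three (query, paper) pairs.
raw_logits = [2.2, 0.9, -8.0]
probabilities = [sigmoid(x) for x in raw_logits]

# Sigmoid is monotonic, so sorting by either value gives the same order.
order_by_logit = sorted(range(3), key=lambda i: raw_logits[i], reverse=True)
order_by_prob = sorted(range(3), key=lambda i: probabilities[i], reverse=True)
```

Because the ordering is unchanged, the choice between raw and squashed scores matters for thresholds and readability, not for the ranking itself.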

In [4]:
# Step 1: Load academic papers dataset
# Example format: papers.csv with 'title' and 'abstract'
df = pd.read_csv("papers.csv")  # <- Replace with your corpus
df

Unnamed: 0,title,abstract
0,Few-Shot Learning via Prompt Tuning with LLMs,We propose a prompt-tuning strategy using larg...
1,Meta-Learning for Efficient Few-Shot Classific...,Meta-learning frameworks have shown promising ...
2,A Survey on Transformers in Vision,This paper surveys the use of Transformer arch...
3,Contrastive Learning for Representation Learning,We explore contrastive learning approaches tha...
4,Neural Scaling Laws in Large Language Models,This work investigates how performance scales ...
5,Efficient Retrieval Techniques for Long Documents,We introduce retrieval mechanisms to efficient...
6,Reinforcement Learning with Human Feedback,We discuss reinforcement learning with human f...
7,An Empirical Study of LLM Prompt Sensitivity,This empirical study examines how LLMs respond...
8,Self-Supervised Pretraining in NLP,This paper analyzes self-supervised learning o...
9,Benchmarks for Multimodal Understanding,We present a collection of benchmarks and eval...
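
For readers without a corpus at hand, a small stand-in file with the same two columns can be generated. This is only a sketch: the helper name and the `papers_sample.csv` file name are illustrative, and the two rows are the only papers whose abstracts appear in full in this notebook's printed results.

```python
import csv
import io

# Titles and abstracts copied from the results printed at the end of the notebook.
SAMPLE_PAPERS = [
    {
        "title": "Few-Shot Learning via Prompt Tuning with LLMs",
        "abstract": (
            "We propose a prompt-tuning strategy using large language models "
            "for adapting to few-shot settings in NLP tasks. Our method reduces "
            "the need for fine-tuning by leveraging prompt engineering."
        ),
    },
    {
        "title": "Meta-Learning for Efficient Few-Shot Classification",
        "abstract": (
            "Meta-learning frameworks have shown promising results in few-shot "
            "classification by optimizing the initialization of neural networks "
            "across tasks."
        ),
    },
]


def write_sample_csv(handle):
    """Write SAMPLE_PAPERS to an open text handle as title/abstract columns."""
    writer = csv.DictWriter(handle, fieldnames=["title", "abstract"])
    writer.writeheader()
    writer.writerows(SAMPLE_PAPERS)


# Demonstrated on an in-memory buffer so the sketch leaves no files behind.
buffer = io.StringIO()
write_sample_csv(buffer)
csv_text = buffer.getvalue()
```

To write a real file instead, open it with `open("papers_sample.csv", "w", newline="", encoding="utf-8")` and pass the handle to `write_sample_csv`.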


In [5]:
# Combine title and abstract into a single content field
# This is done to create a single text representation of each paper.
df['content'] = df['title'] + ". " + df['abstract']
df

Unnamed: 0,title,abstract,content
0,Few-Shot Learning via Prompt Tuning with LLMs,We propose a prompt-tuning strategy using larg...,Few-Shot Learning via Prompt Tuning with LLMs....
1,Meta-Learning for Efficient Few-Shot Classific...,Meta-learning frameworks have shown promising ...,Meta-Learning for Efficient Few-Shot Classific...
2,A Survey on Transformers in Vision,This paper surveys the use of Transformer arch...,A Survey on Transformers in Vision. This paper...
3,Contrastive Learning for Representation Learning,We explore contrastive learning approaches tha...,Contrastive Learning for Representation Learni...
4,Neural Scaling Laws in Large Language Models,This work investigates how performance scales ...,Neural Scaling Laws in Large Language Models. ...
5,Efficient Retrieval Techniques for Long Documents,We introduce retrieval mechanisms to efficient...,Efficient Retrieval Techniques for Long Docume...
6,Reinforcement Learning with Human Feedback,We discuss reinforcement learning with human f...,Reinforcement Learning with Human Feedback. We...
7,An Empirical Study of LLM Prompt Sensitivity,This empirical study examines how LLMs respond...,An Empirical Study of LLM Prompt Sensitivity. ...
8,Self-Supervised Pretraining in NLP,This paper analyzes self-supervised learning o...,Self-Supervised Pretraining in NLP. This paper...
9,Benchmarks for Multimodal Understanding,We present a collection of benchmarks and eval...,Benchmarks for Multimodal Understanding. We pr...


In [6]:
# Convert DataFrame content to a list of documents
# This list will be used for embedding and retrieval.
# Each document is a string combining the title and abstract of a paper.
documents = df['content'].tolist()
print(f"Loaded {len(documents)} documents.")

Loaded 10 documents.


In [7]:
# Step 2: Embed documents using SentenceTransformer (bi-encoder)
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = bi_encoder.encode(documents, show_progress_bar=True)



In [8]:
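# Step 3: Create FAISS index
# IndexFlatL2 performs exact (brute-force) search using squared L2 distance.
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.asarray(doc_embeddings, dtype="float32"))  # FAISS expects float32 vectors
print("FAISS index built.")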

FAISS index built.
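
A flat L2 index compares the query against every stored vector and keeps the closest ones. The pure-Python sketch below mirrors that behaviour on toy 2-D vectors; the function names and data are illustrative, and FAISS itself does the same work in optimized native code:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Compare the query with every stored vector and return the k closest.

    Returns (distances, indices), both ordered from nearest to farthest.
    """
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
```

Smaller distances mean closer matches, which is why the retrieval stage sorts in ascending order while the reranking stage sorts scores in descending order.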


In [None]:
# Step 4: Cross-Encoder for reranking (query-doc pairs)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-6", max_length=512)

# This cell loads a cross-encoder model.
# The model takes a pair of inputs:
# a. A user query (like "few-shot learning methods")
# b. A document (like a research paper's content)
# and scores how relevant the document is to the query.
# These scores are used to rerank the top documents retrieved earlier from FAISS.

# cross-encoder/ms-marco-TinyBERT-L-6
# A small, fast transformer-based cross-encoder fine-tuned on MS MARCO for ranking text pairs (like question + passage). It is loaded through the sentence-transformers library.

# max_length=512
# Caps the combined input (query + document) at 512 tokens.
# Longer inputs are truncated, so text beyond the limit is never seen by the model.
# 512 is the usual maximum sequence length for BERT-style models such as TinyBERT.
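
To make the effect of a length cap concrete, here is a toy sketch of pair truncation. It assumes a "trim the longer side first" strategy, treats tokens as plain strings, and reserves three positions for BERT-style special tokens (`[CLS]` and two `[SEP]`); the exact strategy used by the installed library version is an assumption to verify, and `truncate_pair` is a hypothetical helper:

```python
def truncate_pair(query_tokens, doc_tokens, max_length, n_special=3):
    """Trim a token pair so the pair plus special tokens fits max_length.

    Drops one token at a time from the end of whichever side is currently
    longer (ties trim the document side, an assumption of this sketch).
    """
    query_tokens, doc_tokens = list(query_tokens), list(doc_tokens)
    budget = max(max_length - n_special, 0)
    while len(query_tokens) + len(doc_tokens) > budget:
        if len(query_tokens) > len(doc_tokens):
            query_tokens.pop()
        else:
            doc_tokens.pop()
    return query_tokens, doc_tokens


short_query = ["few-shot", "learning", "methods"]
long_doc = [f"tok{i}" for i in range(20)]
kept_query, kept_doc = truncate_pair(short_query, long_doc, max_length=12)
```

With a short query and a long abstract, only the tail of the abstract is lost, so information placed late in a long document does not influence the score.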

In [10]:
# Step 5: Retrieval + Reranking pipeline
# This function will:
# First retrieve many candidate papers (top_k × 5, capped at the corpus size),
# Then rerank them using a cross-encoder,
# And finally return the top_k most relevant papers.

def retrieve_and_rerank(query, top_k=10):
    # Step 1: Vector search (fast recall)
    # Convert the query into a vector (embedding) using the bi-encoder model.
    query_embedding = bi_encoder.encode([query])
    # Never request more neighbours than the index holds: FAISS pads missing
    # results with index -1, which would silently select the last document.
    n_candidates = min(top_k * 5, index.ntotal)
    # Perform semantic search in the FAISS index using the query vector.
    distances, indices = index.search(np.array(query_embedding), n_candidates)
    initial_results = [(documents[i], df.iloc[i]['title'], df.iloc[i]['abstract']) for i in indices[0] if i != -1]
    # For each of the matched document indices, it extracts:
    # >The full content (title + abstract),
    # >The title alone,
    # >The abstract alone.
    # These are stored as tuples in a list called initial_results.

    # Step 2: Cross-encoder reranking
    rerank_pairs = [[query, doc] for doc, _, _ in initial_results]
    scores = cross_encoder.predict(rerank_pairs)
    reranked = sorted(zip(scores, initial_results), key=lambda x: x[0], reverse=True)

    return reranked[:top_k]
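
Because the function always returns `top_k` items, weakly related papers can still appear at the bottom of the list. One possible follow-up, sketched here under the assumption that scores lie between 0 and 1, is a simple cutoff. The helper name and the 0.5 threshold are arbitrary illustrations; the scores and titles are the ones printed in this notebook's sample run:

```python
def filter_by_score(reranked, min_score):
    """Keep (score, item) pairs whose score is at least min_score."""
    return [(score, item) for score, item in reranked if score >= min_score]


# Scores and titles as printed in this notebook's sample run.
sample_reranked = [
    (0.9026, "Few-Shot Learning via Prompt Tuning with LLMs"),
    (0.7122, "Meta-Learning for Efficient Few-Shot Classification"),
    (0.0003, "Contrastive Learning for Representation Learning"),
]
confident = filter_by_score(sample_reranked, min_score=0.5)  # 0.5 is an arbitrary cutoff
```

A suitable threshold depends on the model and corpus, so it is best chosen by inspecting scores on a handful of known-relevant and known-irrelevant queries.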

In [11]:
# Test it
query = "What are the latest techniques in few-shot learning?"
results = retrieve_and_rerank(query)

print(f"\n🔍 Top results for: {query}\n")
for score, (doc, title, abstract) in results:
    print(f"📝 Title: {title}")
    print(f"📊 Score: {score:.4f}")
    print(f"📄 Abstract: {abstract[:300]}...")
    print("-" * 80)

🔍 Top results for: What are the latest techniques in few-shot learning?

📝 Title: Few-Shot Learning via Prompt Tuning with LLMs
📊 Score: 0.9026
📄 Abstract: We propose a prompt-tuning strategy using large language models for adapting to few-shot settings in NLP tasks. Our method reduces the need for fine-tuning by leveraging prompt engineering....
--------------------------------------------------------------------------------
📝 Title: Meta-Learning for Efficient Few-Shot Classification
📊 Score: 0.7122
📄 Abstract: Meta-learning frameworks have shown promising results in few-shot classification by optimizing the initialization of neural networks across tasks....
--------------------------------------------------------------------------------
📝 Title: Contrastive Learning for Representation Learning
📊 Score: 0.0003
📄 Abstract: We explore contrastive learning approaches that learn useful representations by pulling semantically similar instances together and p

# Happy Learning