# RAG Pipeline Demo - arXiv Papers
## Week 4: Retrieval-Augmented Generation

This notebook demonstrates the retrieval stage of the RAG pipeline:
1. Load processed chunks and FAISS index
2. Perform semantic search queries
3. Analyze retrieval results
4. Generate retrieval report

In [1]:
# Import libraries
import json
import numpy as np
from embedding_indexer import EmbeddingIndexer
import pandas as pd
from IPython.display import display, HTML, Markdown

## 1. Load the Index

In [2]:
# Initialize and load the indexer
indexer = EmbeddingIndexer(model_name='all-MiniLM-L6-v2')
indexer.load_index('faiss_index.bin', 'index_metadata.pkl')

# Display stats
stats = indexer.get_stats()
print("Index Statistics:")
print("=" * 50)
for key, value in stats.items():
    print(f"{key}: {value}")

Loading embedding model: all-MiniLM-L6-v2
Model loaded. Embedding dimension: 384
FAISS index loaded from: faiss_index.bin
Index contains 998 vectors
Metadata loaded from: index_metadata.pkl
Index Statistics:
total_chunks: 998
dimension: 384
total_papers: 50
avg_chunk_tokens: 500.1743486973948


## 2. Define Test Queries

We'll test with five queries covering various NLP topics.

In [3]:
test_queries = [
    "What are attention mechanisms in transformer models?",
    "How does fine-tuning work for large language models?",
    "What is the role of tokenization in NLP?",
    "Explain zero-shot and few-shot learning approaches",
    "What are the challenges in machine translation?"
]

print(f"Test Queries ({len(test_queries)}):")
for i, query in enumerate(test_queries, 1):
    print(f"{i}. {query}")

Test Queries (5):
1. What are attention mechanisms in transformer models?
2. How does fine-tuning work for large language models?
3. What is the role of tokenization in NLP?
4. Explain zero-shot and few-shot learning approaches
5. What are the challenges in machine translation?


## 3. Perform Searches and Collect Results

In [4]:
# Search parameters
k = 3  # Top-k results

# Store all results
all_results = []

for query in test_queries:
    results = indexer.search(query, k=k)
    all_results.append({
        'query': query,
        'results': results
    })

print(f"✅ Completed {len(all_results)} searches")

✅ Completed 5 searches
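
The internals of `indexer.search` are not shown in this notebook, so the following is only a minimal sketch of what a top-k lookup does conceptually. It assumes chunks are ranked by squared L2 distance between the query embedding and each chunk embedding. The function name `top_k_by_l2` and the toy 2-dimensional vectors are hypothetical; the `rank` and `distance` keys mirror the fields used in the cells of this notebook.

```python
import heapq

def top_k_by_l2(query_vec, chunk_vecs, k=3):
    """Return the k chunks closest to query_vec by squared L2 distance."""
    scored = []
    for chunk_id, vec in chunk_vecs.items():
        # Squared L2 distance: sum of squared coordinate differences
        dist = sum((q - v) ** 2 for q, v in zip(query_vec, vec))
        scored.append((dist, chunk_id))
    # Smallest distance first, so rank 1 is the nearest chunk
    nearest = heapq.nsmallest(k, scored)
    return [
        {"rank": rank, "chunk_id": chunk_id, "distance": dist}
        for rank, (dist, chunk_id) in enumerate(nearest, 1)
    ]

# Hypothetical toy "index" of three chunk embeddings
toy_chunks = {
    "a_chunk_0": [1.0, 0.0],
    "a_chunk_1": [0.0, 1.0],
    "b_chunk_0": [0.6, 0.8],
}
toy_results = top_k_by_l2([1.0, 0.0], toy_chunks, k=2)
for r in toy_results:
    print(r["rank"], r["chunk_id"], round(r["distance"], 4))
```

A real FAISS index avoids this Python loop, but the ranking logic it exposes is the same: lower distance means a closer match.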


## 4. Display Results for Each Query

In [5]:
for idx, item in enumerate(all_results, 1):
    query = item['query']
    results = item['results']
    
    print("\n" + "=" * 80)
    print(f"Query {idx}: {query}")
    print("=" * 80)
    
    for result in results:
        print(f"\nRank {result['rank']} (Distance: {result['distance']:.4f})")
        print(f"Paper ID: {result['paper_id']}")
        print(f"Chunk ID: {result['chunk_id']}")
        print("\nText Preview (first 300 chars):")
        print(result['text'][:300] + "...")
        print("-" * 80)


Query 1: What are attention mechanisms in transformer models?

Rank 1 (Distance: 0.8744)
Paper ID: 2511.12832
Chunk ID: 2511.12832_chunk_19

Text Preview (first 300 chars):
to the ’realism’ diagnostic task. Figure 5: Layer-wise attention head contributions to the ’counter offer’ diagnostic task. Causal Influence of Attention Heads on Responses Countering an Offer Attribution Map: Layer-wise Head Contributions Attention Head Index (0-31) Transformer Layer (0-31) Transfo...
--------------------------------------------------------------------------------

Rank 2 (Distance: 0.9693)
Paper ID: 2511.12874
Chunk ID: 2511.12874_chunk_17

Text Preview (first 300 chars):
processing pre-trained on a large corpus of text. GPT-2 Generative Pre-trained Transformer 2. An autoregressive language model that uses unidirec- tional attention (each token can only attend to previous tokens). It contains 124 million parameters in its base version and was pre-trained on a larger ...
------------------

## 5. Analyze Retrieval Quality

In [6]:
# Calculate average distances
distances_by_query = []

for item in all_results:
    distances = [r['distance'] for r in item['results']]
    avg_distance = np.mean(distances)
    distances_by_query.append({
        'query': (item['query'][:50] + '...') if len(item['query']) > 50 else item['query'],
        'avg_distance': avg_distance,
        'min_distance': min(distances),
        'max_distance': max(distances)
    })

# Create DataFrame
df_distances = pd.DataFrame(distances_by_query)
print("\nRetrieval Quality Metrics:")
print("=" * 80)
display(df_distances)

print(f"\nOverall Average Distance: {df_distances['avg_distance'].mean():.4f}")


Retrieval Quality Metrics:


| | query | avg_distance | min_distance | max_distance |
|---|---|---|---|---|
| 0 | What are attention mechanisms in transformer m... | 0.962566 | 0.87436 | 1.044035 |
| 1 | How does fine-tuning work for large language m... | 0.725145 | 0.637259 | 0.7788 |
| 2 | What is the role of tokenization in NLP? | 1.125757 | 1.118935 | 1.135897 |
| 3 | Explain zero-shot and few-shot learning approa... | 1.109308 | 1.097714 | 1.117514 |
| 4 | What are the challenges in machine translation? | 0.857553 | 0.841021 | 0.866049 |



Overall Average Distance: 0.9561
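
Raw distances are easier to interpret as similarities. Under two assumptions that this notebook does not confirm (the embeddings are L2-normalized, and the index returns squared L2 distances, as FAISS `IndexFlatL2` does), the identity $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$ gives $\cos = 1 - d/2$. The helper name `l2_squared_to_cosine` is hypothetical; the input values are the per-query averages from the metrics above.

```python
def l2_squared_to_cosine(distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    Only valid if both vectors have unit length (an assumption here).
    """
    return 1.0 - distance / 2.0

# Per-query average distances copied from the retrieval quality metrics
avg_distances = {
    "attention mechanisms": 0.962566,
    "fine-tuning": 0.725145,
    "tokenization": 1.125757,
    "zero-shot / few-shot": 1.109308,
    "machine translation": 0.857553,
}

for topic, dist in avg_distances.items():
    print(f"{topic}: cosine ~ {l2_squared_to_cosine(dist):.3f}")
```

If those assumptions hold, the fine-tuning query averages roughly 0.64 cosine similarity while the tokenization query averages roughly 0.44, which matches the impression that the corpus covers fine-tuning better than tokenization.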


## 6. Paper Coverage Analysis

In [7]:
# Analyze which papers appear in results
paper_frequency = {}

for item in all_results:
    for result in item['results']:
        paper_id = result['paper_id']
        paper_frequency[paper_id] = paper_frequency.get(paper_id, 0) + 1

# Sort by frequency
sorted_papers = sorted(paper_frequency.items(), key=lambda x: x[1], reverse=True)

print("Most Frequently Retrieved Papers:")
print("=" * 50)
for paper_id, count in sorted_papers[:10]:
    print(f"{paper_id}: {count} times")

print(f"\nTotal unique papers in results: {len(paper_frequency)}")

Most Frequently Retrieved Papers:
2511.13180: 3 times
2511.12832: 2 times
2511.13467: 2 times
2511.12874: 1 times
2511.12991: 1 times
2511.13368: 1 times
2511.13182: 1 times
2511.12573: 1 times
2511.12630: 1 times
2511.13152: 1 times

Total unique papers in results: 11
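
The manual dictionary counting above can also be written with `collections.Counter` from the standard library. This is a sketch of an equivalent approach, not a change to the results; `count_papers` and the miniature `sample_results` data are hypothetical stand-ins for the real `all_results` list.

```python
from collections import Counter

def count_papers(all_results):
    """Count how often each paper_id appears across all query results."""
    return Counter(
        result["paper_id"]
        for item in all_results
        for result in item["results"]
    )

# Hypothetical miniature of the all_results structure used in this notebook
sample_results = [
    {"query": "q1", "results": [{"paper_id": "P1"}, {"paper_id": "P2"}]},
    {"query": "q2", "results": [{"paper_id": "P1"}, {"paper_id": "P3"}]},
]

paper_counts = count_papers(sample_results)
# most_common sorts by descending count, like the sorted() call above
for paper_id, count in paper_counts.most_common(10):
    print(f"{paper_id}: {count} time{'s' if count != 1 else ''}")
print(f"Total unique papers: {len(paper_counts)}")
```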


## 7. Generate Retrieval Report

In [8]:
# Create a comprehensive report
report = {
    'metadata': {
        'total_queries': len(all_results),
        'results_per_query': k,
        'index_stats': stats
    },
    'queries_and_results': []
}

for item in all_results:
    query_report = {
        'query': item['query'],
        'results': [
            {
                'rank': r['rank'],
                'distance': r['distance'],
                'paper_id': r['paper_id'],
                'chunk_id': r['chunk_id'],
                'text': r['text']
            }
            for r in item['results']
        ]
    }
    report['queries_and_results'].append(query_report)

# Save report
with open('retrieval_report.json', 'w', encoding='utf-8') as f:
    json.dump(report, f, indent=2, ensure_ascii=False)

print("✅ Retrieval report saved to: retrieval_report.json")

✅ Retrieval report saved to: retrieval_report.json
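
Once saved, the report can be read back for later analysis. The sketch below assumes the JSON layout built in the cell above (`queries_and_results`, each entry holding a `query` and a list of `results` with a `distance`). The helper `summarize_report` and the small `sample_report` written to a temporary directory are hypothetical, used only so the example runs on its own.

```python
import json
import os
import tempfile

def summarize_report(path):
    """Load a retrieval report and return (query, best_distance) pairs."""
    with open(path, encoding="utf-8") as f:
        report = json.load(f)
    summary = []
    for entry in report["queries_and_results"]:
        # The best hit is the one with the smallest distance
        best = min(r["distance"] for r in entry["results"])
        summary.append((entry["query"], best))
    return summary

# Hypothetical stand-in with the same shape as retrieval_report.json
sample_report = {
    "metadata": {"total_queries": 1, "results_per_query": 2},
    "queries_and_results": [
        {
            "query": "example query",
            "results": [
                {"rank": 1, "distance": 0.5},
                {"rank": 2, "distance": 0.7},
            ],
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "retrieval_report.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample_report, f, indent=2, ensure_ascii=False)
    report_summary = summarize_report(sample_path)

print(report_summary)
```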


## 8. Interactive Query Test

Try your own custom queries!

In [10]:
# Interactive search function
def interactive_search(query, k=3):
    """Search and display results nicely"""
    results = indexer.search(query, k=k)
    
    print("\n" + "=" * 80)
    print(f"Query: {query}")
    print("=" * 80)
    
    for result in results:
        print(f"\n📄 Rank {result['rank']} | Distance: {result['distance']:.4f}")
        print(f"Paper: {result['paper_id']}")
        print(f"\n{result['text'][:400]}...")
        print("-" * 80)

interactive_search("What does the paper Attention is all you need talk about?", k=3)
# Example: Try your own query
# interactive_search("What is BERT and how does it work?", k=3)


Query: What does the paper Attention is all you need talk about?

📄 Rank 1 | Distance: 1.1351
Paper: 2511.13505

and a diversity of perspectives in such cases should be actively sought rather than normalized through strict majority-capping. Figure 8: Performance of each model using the CoT + Prompt Chaining prompt averaged across 3 runs. F Prompting Experiment Prompts F.1 CoT F.1.1 CoT System Prompt Your task is to annotate a public narrative speech according to a specific codebook developed by Dr. Marshall ...
--------------------------------------------------------------------------------

📄 Rank 2 | Distance: 1.1397
Paper: 2511.13505

Applying Large Language Models to Characterize Public Narratives Elinor Poole-Dayan* MIT elinorpd@mit.edu Daniel T. Kessler∗ MIT kessler1@mit.edu Hannah Chiou Wellesley College Margaret Hughes MIT Emily S. Lin Harvard University Marshall Ganz Harvard University Deb Roy MIT Abstract Public Narratives (PNs) are key tools for lead- ership develop
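
Both hits shown for this query come from a paper on public narratives, and their distances (above 1.13) are higher than every per-query average in section 5, which suggests the corpus may not contain a good match for it. One way to surface this is a distance cutoff. The following is a minimal sketch: the helper name `filter_by_distance` and the threshold of 1.0 are hypothetical, and a usable threshold would need tuning against queries with known answers.

```python
def filter_by_distance(results, max_distance=1.0):
    """Keep only results whose distance is at or below max_distance."""
    return [r for r in results if r["distance"] <= max_distance]

# Distances copied from outputs shown in this notebook
weak_hits = [
    {"paper_id": "2511.13505", "distance": 1.1351},
    {"paper_id": "2511.13505", "distance": 1.1397},
]
strong_hit = [{"paper_id": "2511.12832", "distance": 0.8744}]

# With the assumed cutoff, the weak hits are dropped and the strong hit survives
print(filter_by_distance(weak_hits))
print(filter_by_distance(strong_hit))
```

An empty result list can then be reported as "no relevant chunk found" rather than passing unrelated text on to a generator.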

## Summary

This notebook demonstrated:
- Loading and using the FAISS index
- Performing semantic search on arXiv papers
- Analyzing retrieval quality
- Generating a comprehensive report

**Next Steps:**
1. Test the FastAPI service (`python main.py`)
2. Explore different chunking strategies
3. Experiment with different embedding models
4. Add hybrid search (BM25 + dense embeddings)
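
For the hybrid search step, one common way to combine a BM25 ranking with a dense-embedding ranking is reciprocal rank fusion (RRF), which scores each chunk by summing `1 / (rrf_k + rank)` over the lists it appears in. This is a sketch of one possible fusion method, not necessarily the one this project will adopt. The function name, the chunk IDs, and both input rankings are hypothetical; `rrf_k=60` is the constant customarily used with RRF.

```python
def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, 1):
            # Chunks ranked highly in any list receive a larger contribution
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    # Sort by descending score; ties are broken by chunk ID for determinism
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical rankings from a keyword retriever and a dense retriever
bm25_ranking = ["c2", "c1", "c4"]
dense_ranking = ["c1", "c3", "c2"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and embedding distances live on different scales.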