# RAG Pipeline Demo: arXiv cs.CL Papers

This notebook demonstrates a complete RAG (Retrieval-Augmented Generation) pipeline for searching arXiv cs.CL (Computation and Language) papers.

## Pipeline Steps:
1. **PDF Text Extraction** - Extract text from 50 PDF papers
2. **Text Chunking** - Split papers into meaningful segments (250-512 tokens)
3. **Embedding Generation** - Create dense vector embeddings using sentence-transformers
4. **FAISS Indexing** - Build searchable index of embeddings
5. **Retrieval** - Query the index to find relevant passages

In [None]:
import sys
import os
sys.path.append('./src')

import json
import numpy as np
import pandas as pd
from pdf_processor import PDFProcessor
from chunker import TextChunker
from embedder import EmbeddingIndexer
from retriever import RAGRetriever

print("All modules imported successfully!")

## Step 1: PDF Text Extraction

In [None]:
# Process PDFs if not already done
if not os.path.exists('./data/processed_documents.json'):
    print("Processing PDF files...")
    processor = PDFProcessor(
        pdf_directory="./PDFs",
        output_directory="./data"
    )
    documents = processor.process_all_pdfs()
else:
    print("Loading existing processed documents...")
    with open('./data/processed_documents.json', 'r', encoding='utf-8') as f:
        documents = json.load(f)

print(f"Loaded {len(documents)} documents")
print(f"Sample document keys: {list(documents.keys())[:5]}")
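
The cell above follows a load-or-build pattern that recurs in the chunking and indexing steps: reuse a cached JSON file if it exists, otherwise compute the result. A minimal sketch of that pattern as a reusable helper is shown below. The names `load_or_build` and `fake_build` are hypothetical and not part of the project's `src` modules, and the helper writes the cache itself, whereas `PDFProcessor` is assumed to save its own output.

```python
import json
import os
import tempfile


def load_or_build(cache_path, build_fn):
    """Return cached JSON data if cache_path exists; otherwise build, save, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = build_fn()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
    return data


# Usage with a throwaway directory and a stand-in build function (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_documents.json")
    calls = []

    def fake_build():
        calls.append(1)
        return {"paper_001": "extracted text"}

    first = load_or_build(path, fake_build)   # builds and writes the cache
    second = load_or_build(path, fake_build)  # reads the cache, no rebuild
    build_count = len(calls)
```

One caveat of this pattern: the cache is never invalidated, so adding new PDFs requires deleting the cached file before rerunning the cell.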

## Step 2: Text Chunking

In [None]:
# Create chunks if not already done
if not os.path.exists('./data/chunks.json'):
    print("Creating text chunks...")
    chunker = TextChunker(chunk_size=512, overlap_size=50)
    chunks = chunker.process_documents(documents, "./data")
else:
    print("Loading existing chunks...")
    with open('./data/chunks.json', 'r', encoding='utf-8') as f:
        chunks = json.load(f)

print(f"Total chunks: {len(chunks)}")
print(f"Sample chunk: {chunks[0]['chunk_id']}")
print(f"Sample text (first 200 chars): {chunks[0]['text'][:200]}...")
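
To make the `chunk_size=512, overlap_size=50` settings concrete, here is a rough sketch of overlapping sliding-window chunking over a list of tokens. It is an illustration under assumptions, not the actual `TextChunker` implementation: the function name `chunk_tokens` is hypothetical, and the real chunker may use a model tokenizer, respect sentence boundaries, or enforce the 250-token minimum mentioned earlier, none of which this sketch does.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_size=50):
    """Split a token list into windows of up to chunk_size tokens.

    Each window shares overlap_size tokens with the previous one.
    """
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# A 1000-token stand-in document (hypothetical data).
tokens = [f"tok{i}" for i in range(1000)]
chunks_demo = chunk_tokens(tokens, chunk_size=512, overlap_size=50)
sizes = [len(c) for c in chunks_demo]  # [512, 512, 76]
```

With a step of 512 - 50 = 462 tokens, a 1000-token document yields windows starting at tokens 0, 462, and 924. The short final window shows why a real chunker might merge or drop small tails to stay within a minimum size.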

## Step 3: Embedding Generation and FAISS Indexing

In [None]:
# Create embeddings and index if not already done
if not os.path.exists('./data/faiss_index.bin'):
    print("Creating embeddings and FAISS index...")
    indexer = EmbeddingIndexer(output_directory="./data")
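    # Use chunk_metadata (as in the else branch) so the chunks list from Step 2 is not overwritten
    index, chunk_metadata = indexer.build_index_from_chunks("./data/chunks.json")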
    print(f"Index created with {index.ntotal} vectors")
else:
    print("FAISS index already exists!")
    indexer = EmbeddingIndexer(output_directory="./data")
    index, chunk_metadata = indexer.load_index_and_metadata()
    print(f"Loaded index with {index.ntotal} vectors")
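
Conceptually, an exact vector index compares a query embedding against every stored embedding and returns the closest ones. The sketch below does this by brute force with cosine similarity on toy vectors. It is a plain-Python stand-in rather than FAISS itself, and it assumes a similarity-based metric. The notebook does not state which FAISS index type or metric `EmbeddingIndexer` uses. The helper names `normalize` and `top_k` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query, vectors, k=3):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    q = normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real sentence embeddings typically have hundreds of dimensions.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
hits = top_k([1.0, 0.05, 0.0], toy_vectors, k=2)
```

Here the query points almost exactly along the first axis, so vectors 0 and 1 rank highest. A dedicated index library performs the same kind of comparison far more efficiently on large collections.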

## Step 4: RAG Retrieval Demo

In [None]:
# Initialize retriever
retriever = RAGRetriever(data_directory="./data")
print("RAG Retriever initialized successfully!")

## Interactive Search Demo

In [None]:
def search_and_display(query, k=3):
    print(f"\n🔍 Searching for: '{query}'")
    print("=" * 70)
    
    results = retriever.search(query, k=k)
    
    for i, result in enumerate(results, 1):
        print(f"\n📄 Result {i}")
        print(f"   Document: {result['document_id']}")
        print(f"   Chunk: {result['chunk_id']}")
        print(f"   Similarity Score: {result['similarity_score']:.4f}")
        print(f"   Token Count: {result['token_count']}")
        print("\n   Text:")
        text = result['text']
        if len(text) > 300:
            print(f"   {text[:300]}...")
        else:
            print(f"   {text}")
        print("-" * 50)

In [None]:
# Example queries for cs.CL (Computation and Language) papers
example_queries = [
    "transformer architecture",
    "attention mechanism",
    "natural language processing",
    "machine translation",
    "language models"
]

# Run searches for all example queries
for query in example_queries:
    search_and_display(query, k=3)

## Custom Query Interface

In [None]:
# Interactive search - you can modify this query
custom_query = "BERT embeddings"
search_and_display(custom_query, k=3)

## Dataset Statistics

In [None]:
# Load chunks for analysis
with open('./data/chunks.json', 'r', encoding='utf-8') as f:
    chunks_data = json.load(f)

# Create DataFrame for analysis
df = pd.DataFrame(chunks_data)

print("📊 Dataset Statistics:")
print(f"Total Documents: {df['document_id'].nunique()}")
print(f"Total Chunks: {len(df)}")
print(f"Average Chunks per Document: {len(df) / df['document_id'].nunique():.1f}")
print(f"Average Token Count: {df['token_count'].mean():.1f}")
print(f"Min Token Count: {df['token_count'].min()}")
print(f"Max Token Count: {df['token_count'].max()}")

# Distribution of chunks per document
chunks_per_doc = df.groupby('document_id').size()
print("\nChunks per document distribution:")
print(chunks_per_doc.describe())

## Conclusion

This notebook demonstrated a complete RAG pipeline for searching through arXiv cs.CL papers:

✅ **PDF Processing**: Extracted text from 50 research papers  
✅ **Text Chunking**: Created overlapping chunks of 250-512 tokens  
✅ **Embedding Generation**: Used sentence-transformers to create dense vectors  
✅ **FAISS Indexing**: Built efficient search index  
✅ **Retrieval**: Implemented semantic search with similarity scoring  

The system can now find the passages in the paper collection that are most relevant to a question about computational linguistics research.