# End-to-End RAG Demo: Academic Paper Q&A

This notebook demonstrates a complete RAG (Retrieval-Augmented Generation) pipeline for answering questions about research papers.

## Pipeline Overview
1. **PDF Extraction** → Extract text from PDF
2. **Text Cleaning** → Remove artifacts and boilerplate
3. **Adaptive Chunking** → Split into semantic chunks
4. **Hybrid Retrieval** → Dense (embeddings) + Sparse (BM25)
5. **Answer Generation** → LLM synthesizes answer from retrieved context

**Expected Runtime:** 5-10 minutes (first run downloads models)

## 1. Installation

In [1]:
!pip install -q pymupdf sentence-transformers faiss-cpu transformers accelerate nltk

In [2]:
import nltk
nltk.download('punkt_tab', quiet=True)
print("Dependencies installed")

Dependencies installed


## 2. Import Modules

In [3]:
import sys
import os

# Add src to path
sys.path.append('../src')  # Adjust if running from different directory

from academic_rag_system.preprocessing.pdf_cleaner import clean_academic_text
from academic_rag_system.preprocessing.adaptive_chunker import AcademicPaperChunker
from academic_rag_system.retrieval.hybrid_retriever import HybridRetriever
from academic_rag_system.generation.answer_generator import RAGAnswerGenerator

print("Modules imported")

Modules imported


## 3. Load and Process PDF

In [4]:
import fitz  # PyMuPDF

# Path to your PDF (change this to point at your own file)
# pdf_path = r"path/to/your/paper.pdf"
pdf_path = r"C:\Users\Abu.Sikder\OneDrive - FDA\data_move_june_2025\FDA\EduRAG\dumps\s13031-018-0151-3.pdf"

# Extract text
print(f"Loading PDF: {pdf_path}")
doc = fitz.open(pdf_path)
text = "\n".join([page.get_text() for page in doc])


print(f"Extracted {len(text)} characters from {doc.page_count} pages")
doc.close()

Loading PDF: C:\Users\Abu.Sikder\OneDrive - FDA\data_move_june_2025\FDA\EduRAG\dumps\s13031-018-0151-3.pdf
Extracted 53402 characters from 13 pages


In [5]:
# Clean text (remove PDF artifacts, boilerplate)
print("Cleaning text...")
text = clean_academic_text(text)

print(f"Cleaned text: {len(text)} characters")
print(f"\nFirst 300 chars:\n{text[:300]}...")

Cleaning text...
Cleaned text: 40240 characters

First 300 chars:
RESEARCH
Open Access
Water, sanitation, and hygiene access in southern Syria: analysis of survey data and recommendations for response
Mustafa Sikder1*, Umar Daraz2, Daniele Lantagne1 and Roberto Saltori2
Abstract
Background: Water, sanitation, and hygiene (WASH) are immediate priorities for human s...
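
The internals of `clean_academic_text` are not shown in this notebook. As a rough illustration of the kind of cleanup this step performs, the sketch below rejoins words hyphenated across line breaks, drops page-number lines, and collapses whitespace. The function name `clean_text_sketch` and every rule in it are assumptions made for illustration, not the project's implementation.

```python
import re


def clean_text_sketch(raw: str) -> str:
    """Illustrative cleaner; NOT the implementation of clean_academic_text."""
    # Rejoin words split by a hyphen at a line break: "sani-\ntation" -> "sanitation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

    kept = []
    for line in text.split("\n"):
        # Drop lines that hold nothing but a short number (likely a page number)
        if re.fullmatch(r"\s*\d{1,4}\s*", line):
            continue
        # Drop "Page X of Y" running footers
        if re.fullmatch(r"\s*Page \d+ of \d+\s*", line):
            continue
        kept.append(line)
    text = "\n".join(kept)

    # Collapse runs of spaces/tabs, then runs of three or more newlines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Water sani-\ntation access\n12\nPage 3 of 13\nin  southern Syria"
cleaned = clean_text_sketch(sample)
print(cleaned)
```

Rules like these have trade-offs: joining on a line-break hyphen also merges genuinely hyphenated compounds (the token `Communitylevel` in later outputs is consistent with that effect, though this is an inference), and dropping number-only lines can remove table cells.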


## 4. Adaptive Chunking

In [6]:
# Initialize chunker
chunker = AcademicPaperChunker(
    target_chunk_size=600,
    min_chunk_size=200,
    max_chunk_size=1000,
    overlap_sentences=1,
    fallback_to_paragraphs=True
)

# Create chunks
print("Creating chunks...")
chunk_objects = chunker.chunk_paper(text)
chunks = chunker.chunks_to_list(chunk_objects)

# Print statistics
chunker.print_statistics(chunk_objects)

Creating chunks...

CHUNKING STATISTICS
Total chunks: 50
Average chunk size: 1090 chars
Min chunk size: 324 chars
Max chunk size: 3999 chars
Average sentences per chunk: 4.6

Chunks by section:
  Abstract: 3 chunks
  Introduction: 8 chunks
  Methods: 5 chunks
  Results: 20 chunks
  Discussion: 14 chunks
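
To make the `target_chunk_size` and `overlap_sentences` parameters concrete, here is a minimal sketch of greedy sentence packing with sentence overlap. It is an assumption about the general technique, not the code inside `AcademicPaperChunker`: it uses a naive regex sentence splitter instead of NLTK, ignores section detection, and treats the size target as a soft limit.

```python
import re
from typing import List


def chunk_sentences_sketch(text: str, target_size: int = 600,
                           overlap_sentences: int = 1) -> List[str]:
    """Greedy sentence packing (illustrative only, hypothetical helper)."""
    # Naive split after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: List[str] = []
    current: List[str] = []
    for sent in sentences:
        candidate_len = len(" ".join(current + [sent]))
        if current and candidate_len > target_size:
            # Close the current chunk and carry its last sentences forward
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_chunks = chunk_sentences_sketch("A one. B two. C three. D four.",
                                     target_size=14, overlap_sentences=1)
for c in demo_chunks:
    print(len(c), repr(c))
```

In this sketch the overlap sentence plus the next sentence can exceed `target_size`, so chunk sizes are only approximately bounded. The statistics above (maximum 3999 characters with `max_chunk_size=1000`) suggest the real chunker's limits are also soft in some cases, but that is an inference from the output.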



In [7]:
# Inspect first few chunks
print("\nSample chunks:\n")
for i, chunk in enumerate(chunks[:3]):
    section = chunk_objects[i].section
    print(f"Chunk {i+1} [Section: {section}]:")
    print(f"{chunk[:150]}...\n")
    print("-" * 80)


Sample chunks:

Chunk 1 [Section: abstract]:
Background: Water, sanitation, and hygiene (WASH) are immediate priorities for human survival and dignity in emergencies. In 2010, > 90% of Syrians ha...

--------------------------------------------------------------------------------
Chunk 2 [Section: abstract]:
Results: In 2016 and 2017, 1281 and 1360 surveys were conducted. Piped water as the main water source declined from
22.0% to 15.3% over this time. Hou...

--------------------------------------------------------------------------------
Chunk 3 [Section: abstract]:
Conclusions: The private sector has effectively replaced decaying infrastructure in Syria, although at high cost and uncertain quality. Allowing marke...

--------------------------------------------------------------------------------


## 5. Index Documents (Hybrid Retrieval)

In [8]:
# Initialize hybrid retriever (dense + sparse)
retriever = HybridRetriever(
    embedding_model_name='sentence-transformers/multi-qa-mpnet-base-cos-v1',
    use_cosine=True,
    bm25_k1=1.5,
    bm25_b=0.75
)

# Index documents
print("Indexing documents...")
retriever.index_documents(chunks, show_progress=True)

Loading embedding model: sentence-transformers/multi-qa-mpnet-base-cos-v1
Initializing BM25...
Indexing documents...

Indexing 50 documents...
Creating embeddings...


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Normalizing embeddings for cosine similarity...
Dense index created with 50 vectors
Fitting BM25 on corpus...
BM25 fitted successfully

✅ Indexing complete!
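
For readers unfamiliar with the sparse half of the index, the following is a small, dependency-free sketch of Okapi BM25 scoring using the same `k1=1.5` and `b=0.75` values passed to `HybridRetriever`. The whitespace tokenizer, the non-negative IDF variant, and the function name `bm25_scores` are assumptions; the retriever's own BM25 code may differ.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25 (sketch).

    Assumes `docs` is non-empty and tokenizes on whitespace.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5))
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # k1 controls term-frequency saturation, b controls length normalization
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "water quality testing survey",
    "household income and expenditure",
    "survey of water sources",
]
scores = bm25_scores("water survey", docs)
print([round(s, 4) for s in scores])
```

Raising `k1` lets repeated terms keep adding to the score for longer, while `b` closer to 1 penalizes long chunks more strongly.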


## 6. Initialize Answer Generator

In [9]:
# Initialize LLM for answer generation
generator = RAGAnswerGenerator(
    # model_name="google/long-t5-tglobal-base",  # alternative for long contexts
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # small chat model; 2048-token input limit
    device=-1  # CPU (use 0 for GPU)
)

Loading model: TinyLlama/TinyLlama-1.1B-Chat-v1.0


Device set to use cpu


✅ Model loaded (text-generation). Max input tokens: 2048


## 7. Ask Questions!

In [None]:
def ask_question(query, k=3, method='hybrid', show_chunks=True):
    """
    Ask a question about the paper.
    
    Args:
        query: Your question
        k: Number of chunks to retrieve
        method: 'hybrid', 'dense', or 'sparse'
        show_chunks: Whether to display retrieved chunks
    """
    print(f"\n{'='*80}")
    print(f"QUESTION: {query}")
    print(f"{'='*80}\n")
    
    # 1. Retrieve relevant chunks
    retrieved_chunks, scores, indices = retriever.search(
        query,
        k=k,
        method=method,
        dense_weight=0.6,
        sparse_weight=0.4,
        fusion_method='weighted'
    )
    
    # 2. Display retrieved chunks (optional)
    if show_chunks:
        print("RETRIEVED CHUNKS:\n")
        for i, (chunk, score, idx) in enumerate(zip(retrieved_chunks, scores, indices)):
            section = chunk_objects[idx].section
            preview = chunk[:100] + "..." if len(chunk) > 100 else chunk
            print(f"  {i+1}. [Section: {section}, Score: {score:.4f}, Index: {idx}]")
            print(f"     {preview}\n")
    
    # 3. Generate answer
    print("GENERATING ANSWER...\n")
    answer = generator.generate_answer(
        query,
        retrieved_chunks,
        max_new_tokens=200,
        do_sample=False,
        ensure_complete=True, 
        display_truncation_message=True
    )
    
    # 4. Display answer
    # print(f"ANSWER:\n")
    # print(f"{answer}\n")
    # print(f"{'='*80}\n")
    
    return answer, retrieved_chunks, scores, indices
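
The `dense_weight=0.6`, `sparse_weight=0.4`, and `fusion_method='weighted'` arguments imply a weighted combination of the two score lists. The sketch below shows one common way to do that: min-max normalize each list, then take the weighted sum. This is an assumption about how `retriever.search` behaves, and the helper names are hypothetical. Fused scores of exactly 1.0000, 0.6000, and 0.4000 in the outputs are consistent with this scheme, but that is an inference.

```python
from typing import List, Tuple


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_fusion(dense: List[float], sparse: List[float],
                    dense_weight: float = 0.6, sparse_weight: float = 0.4,
                    k: int = 3) -> List[Tuple[int, float]]:
    """Return the top-k (document index, fused score) pairs (sketch)."""
    d_norm, s_norm = min_max(dense), min_max(sparse)
    fused = [dense_weight * d + sparse_weight * s
             for d, s in zip(d_norm, s_norm)]
    ranked = sorted(enumerate(fused), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy scores for four chunks: cosine similarities and raw BM25 scores
dense_scores = [0.32, 0.30, 0.10, 0.05]
sparse_scores = [1.0, 0.0, 4.0, 0.5]
top = weighted_fusion(dense_scores, sparse_scores)
for idx, score in top:
    print(f"chunk {idx}: {score:.4f}")
```

Normalizing first matters because cosine similarities and BM25 scores live on different scales; without it, the raw BM25 values would dominate the sum regardless of the weights.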

### Example Questions

In [21]:
# Question 1: Main research question
answers = ask_question("What is the main research question or objective of this study?")



QUESTION: What is the main research question or objective of this study?

RETRIEVED CHUNKS:

  1. [Section: discussion, Score: 0.7609, Index: 44]
     The limitations of this analysis include: 1) southern Syria is not representative of all Syria, espe...

  2. [Section: results, Score: 0.4000, Index: 21]
     Overall, 4.1% of households reported leaving their garbage in the open, and 35.4% were in communitie...

  3. [Section: discussion, Score: 0.3963, Index: 43]
     Communitylevel WSP programming also reaches a scale that allows cost-effective use of limited resour...

GENERATING ANSWER...

Input: 851 tokens | Generating up to 200 tokens
⚠️  Truncated at '

Question:' to prevent over-generation
ANSWER:

The main research question or objective of this study is to assess the effectiveness of community-level water safety plan (WSP) programming in improving water supply and hygiene practices in southern Syria.




In [22]:
# Question 2: Methods
answers = ask_question("What methods were used to collect data in this study?")


QUESTION: What methods were used to collect data in this study?

RETRIEVED CHUNKS:

  1. [Section: methods, Score: 1.0000, Index: 14]
     Please note the survey tool is available upon request from the corresponding author. Water quality t...

  2. [Section: methods, Score: 0.4861, Index: 13]
     Enumerators were trained on ethical survey administration; local community councils were informed, a...

  3. [Section: methods, Score: 0.2723, Index: 12]
     Sample size was calculated using the Krejcie and
Morgan model [18]; set for 95% confidence with 10%
...

GENERATING ANSWER...

Input: 690 tokens | Generating up to 200 tokens
⚠️  Truncated at '

Question:' to prevent over-generation
ANSWER:

The survey tool is available upon request from the corresponding author.




In [23]:
# Question 3: Results
answers = ask_question("What were the most significant results or findings?")


QUESTION: What were the most significant results or findings?

RETRIEVED CHUNKS:

  1. [Section: results, Score: 0.7553, Index: 31]
     IncomeSpent was significant, with OR near one (1.03, 1.02–1.04). These four variables remained signi...

  2. [Section: methods, Score: 0.6000, Index: 14]
     Please note the survey tool is available upon request from the corresponding author. Water quality t...

  3. [Section: results, Score: 0.3475, Index: 30]
     In 2016, 10 of the 15 variables were significantly associated with diarrhea in children < 5 in univa...

GENERATING ANSWER...

Input: 1143 tokens | Generating up to 200 tokens
⚠️  Truncated at '

Question:' to prevent over-generation
ANSWER:

The most significant results or findings were that the protective factor was FunctionalToilet (mixed effect OR (mOR): 0.62 (95% CI 0.46–0.82)), and risk factors included AdequateWater (2.14, 1.62–2.84) and SeparateWater (2.03, 1.52–2.72).




In [24]:
# Question 4: Limitations
answers = ask_question("What limitations or weaknesses does the study acknowledge?")


QUESTION: What limitations or weaknesses does the study acknowledge?

RETRIEVED CHUNKS:

  1. [Section: discussion, Score: 0.8842, Index: 44]
     The limitations of this analysis include: 1) southern Syria is not representative of all Syria, espe...

  2. [Section: discussion, Score: 0.6589, Index: 43]
     Communitylevel WSP programming also reaches a scale that allows cost-effective use of limited resour...

  3. [Section: discussion, Score: 0.4226, Index: 48]
     The lesson from the Syria WASH response is that allowing market forces to manage services and quanti...

GENERATING ANSWER...

Input: 696 tokens | Generating up to 200 tokens
ℹ️  Truncated to last complete sentence
ANSWER:

The study acknowledges the following limitations: 1) Southern Syria is not representative of all Syria, especially for power supply, and therefore water network, availability. 2) No microbiological water quality data was collected, although FCR presence is an indicator of no/low bacterial contaminatio
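
The truncation messages in the outputs above ("Truncated at '\n\nQuestion:'" and "Truncated to last complete sentence") point to a post-processing step controlled by `ensure_complete=True`. A minimal sketch of such a step follows. The function `postprocess_answer`, its default stop marker, and the sentence-end heuristic are assumptions for illustration, not the code in `RAGAnswerGenerator`.

```python
import re
from typing import Sequence


def postprocess_answer(generated: str,
                       stop_markers: Sequence[str] = ("\n\nQuestion:",)) -> str:
    """Cut at the first stop marker, then trim to the last full sentence (sketch)."""
    text = generated

    # Cut at the earliest stop marker so the model's follow-on text is dropped
    positions = [text.find(m) for m in stop_markers]
    positions = [p for p in positions if p != -1]
    if positions:
        text = text[:min(positions)]
    text = text.strip()

    # If the text stops mid-sentence, keep everything up to the last ., ! or ?
    # that is followed by whitespace or the end of the string
    if text and text[-1] not in ".!?":
        ends = [m.end() for m in re.finditer(r"[.!?](?=\s|$)", text)]
        if ends:
            text = text[:ends[-1]]
    return text


raw = ("The survey covered 1281 households. Water quality was tes"
       "\n\nQuestion: What next?")
answer_text = postprocess_answer(raw)
print(answer_text)
```

The lookahead keeps decimal points such as `22.0%` from being treated as sentence ends, but abbreviations like "e.g." followed by a space would still be matched, so a production version would need a proper sentence tokenizer.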

In [None]:
# Try your own question!
my_question = "Your question here"  # ← Change this
answers = ask_question(my_question)

## 8. Compare Retrieval Methods

In [25]:
# Compare dense, sparse, and hybrid retrieval
test_query = "What methods were used?"

print(f"\n{'='*80}")
print(f"COMPARING RETRIEVAL METHODS")
print(f"Query: {test_query}")
print(f"{'='*80}\n")

retriever.compare_methods(test_query, k=3)


COMPARING RETRIEVAL METHODS
Query: What methods were used?


QUERY: What methods were used?

DENSE RETRIEVAL:
--------------------------------------------------------------------------------

Rank 1 | Score: 0.3212 | Chunk ID: 14
Text: Please note the survey tool is available upon request from the corresponding author. Water quality testing
During the survey, drinking water samples f...

Rank 2 | Score: 0.3057 | Chunk ID: 13
Text: Enumerators were trained on ethical survey administration; local community councils were informed, and household consent was obtained before conductin...

Rank 3 | Score: 0.2849 | Chunk ID: 10
Text: Depending on the situation, WSPs vary in complexity. In case of southern Syria, WSP implementation involved conducing a risk assessment at three level...


SPARSE RETRIEVAL:
--------------------------------------------------------------------------------

Rank 1 | Score: 3.9904 | Chunk ID: 40
Text: 3) How can community-level WASH interventions, such as WSPs, be s

## Summary

1. Extracted and cleaned text from a PDF
2. Created adaptive semantic chunks
3. Built a hybrid retrieval index (dense + sparse)
4. Generated answers using TinyLlama (`TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
5. Asked questions about the paper

**Next Steps:**
- Try different papers
- Experiment with retrieval weights (`dense_weight`, `sparse_weight`)
- Test different LLM models (`google/flan-t5-xl`, `microsoft/Phi-3-mini-4k-instruct`)
- Evaluate on the 20 generic questions

**For production use:**
- See `demo/gradio_app.py` for a web interface