[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oviya-raja/ist-402/blob/main/learning-path/W08/W8_pdf_Q_A.ipynb)

---

# PDF Q&A RAG System 

## Overview
This notebook implements a **Retrieval-Augmented Generation (RAG)** system for answering questions from PDF documents.

## Architecture
1. **Document Processing**: Extract text from uploaded PDFs with pypdf
2. **Text Chunking**: Split each document into chunks of 1000 characters with a 200-character overlap
3. **Embedding**: Convert chunks to vectors using all-MiniLM-L6-v2
4. **Vector Store**: Build a FAISS index for fast similarity search
5. **Question Answering**: Retrieve the most relevant chunks and generate an answer with FLAN-T5-Large (FLAN-T5-Base as a fallback)
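
To make the chunking step concrete, here is a minimal sketch of overlapping character windows. It is a simplified stand-in, not the notebook's splitter: the app uses LangChain's `RecursiveCharacterTextSplitter`, which also tries to break on paragraph and sentence boundaries. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 800 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.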

## Key Fixes in This Version
- Improved prompt template for better FLAN-T5 comprehension
- Context preprocessing that cleans figure/table noise out of the extracted text
- Fixed generation parameters (removed the problematic `min_length`)
- Increased chunk size (1000 characters) for more context per chunk
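
As an illustration of the context-cleaning idea, the sketch below drops lines that are mostly digits or symbols, such as axis ticks left behind by a figure. It is a simplified variant under assumptions: the function name `drop_noisy_lines` and its exact rule are hypothetical, and the app's own `clean_text` applies additional regex passes.

```python
import re


def drop_noisy_lines(text: str, min_alpha_ratio: float = 0.3) -> str:
    """Keep lines that are mostly alphabetic; drop axis-tick rows and similar residue."""
    kept = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()  # collapse whitespace within the line
        if not line:
            continue
        alpha_ratio = sum(c.isalpha() for c in line) / len(line)
        if alpha_ratio > min_alpha_ratio:
            kept.append(line)
    return "\n".join(kept)


page = "Results are shown below.\n0 200 400 600 800 1000\nFigure 3: Accuracy over time"
print(drop_noisy_lines(page))
```

The filter works line by line, so it has to run before newlines are collapsed; once a page is flattened into a single line, a ratio check can only keep or drop the whole page.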

## Usage
1. Set `NGROK_AUTHTOKEN` (Colab secret, `.env` file, or environment variable), then run the cell below to install dependencies and launch the app
2. Upload PDF files in the Streamlit interface
3. Click "Build / Rebuild Index" to process documents
4. Ask questions and get answers grounded in your documents

In [2]:
# =========================
# 📚 PDF Q&A RAG - Launcher (FIXED VERSION)
# =========================
# This cell installs dependencies and launches the Streamlit app
# type: ignore

# 1) Install dependencies
# Note: requests==2.32.4 required for Google Colab compatibility
%pip install -q streamlit langchain-community faiss-cpu sentence-transformers \
                transformers accelerate safetensors pypdf pyngrok python-dotenv requests==2.32.4

# 2) Create the Streamlit application
app_code: str = '''
import os
import io
import re
import torch
import streamlit as st

# ---- LangChain & friends ----
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# ---- PDF parsing ----
from pypdf import PdfReader

# ---- Local LLM (FLAN-T5) ----
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

st.set_page_config(page_title="PDF Q&A (Local RAG)", page_icon="📚")
st.title("📚 PDF Q&A Chatbot - Local RAG (Fixed)")

st.markdown(
    "Upload one or more PDFs. We\'ll chunk + embed them (MiniLM), build a FAISS index, "
    "then answer questions using retrieved chunks and a local FLAN-T5-Large model (better quality than base)."
)

# ---------------------------
# CACHED MODELS
# ---------------------------

@st.cache_resource
def load_embeddings():
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    return HuggingFaceEmbeddings(model_name=model_name)

@st.cache_resource
def load_flan():
    # Using flan-t5-large for better answer quality (780M params vs 250M in base)
    # Falls back to base if large fails to load (memory constraints)
    name = "google/flan-t5-large"
    try:
        tok = AutoTokenizer.from_pretrained(name)
        dtype = torch.float16 if torch.cuda.is_available() else torch.float32
        model = AutoModelForSeq2SeqLM.from_pretrained(name, torch_dtype=dtype)
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model.to(device)
        print(f"✅ Loaded {name} successfully")
    except Exception as e:
        print(f"⚠️ Failed to load {name}, falling back to flan-t5-base: {e}")
        name = "google/flan-t5-base"
        tok = AutoTokenizer.from_pretrained(name)
        dtype = torch.float16 if torch.cuda.is_available() else torch.float32
        model = AutoModelForSeq2SeqLM.from_pretrained(name, torch_dtype=dtype)
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model.to(device)
    return tok, model, device

embeddings = load_embeddings()
tokenizer, flan, device = load_flan()

# ---------------------------
# HELPER FUNCTIONS
# ---------------------------

def clean_text(text):
    """
    Clean extracted text to remove noise that confuses the model.
    """
    # Remove runs of four or more numbers (like figure axis labels: "0 200 400 600...")
    # Using [0-9] instead of \\d to avoid escape sequence warnings
    text = re.sub(r"([0-9]+\\s+){4,}", "", text)

    # Strip the numbering from figure/table labels ("Figure 3:" becomes "Figure: ")
    text = re.sub(r"Figure\\s*[0-9]+[.:]", "Figure: ", text)
    text = re.sub(r"Table\\s*[0-9]+[.:]", "Table: ", text)

    # Drop lines that are mostly numbers/symbols.
    # This must run before newlines are collapsed; otherwise splitlines()
    # returns the whole page as a single line and the filter never applies.
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse excessive whitespace within the line
        line = re.sub(r"\\s+", " ", line).strip()
        # Keep the line if it has enough alphabetic content or is very short
        alpha_ratio = sum(c.isalpha() for c in line) / max(len(line), 1)
        if alpha_ratio > 0.3 or len(line) < 10:
            cleaned_lines.append(line)

    # Join the surviving lines with a newline (chr(10) avoids escaping issues)
    return chr(10).join(cleaned_lines).strip()

def read_pdfs(files):
    texts = []
    for f in files:
        data = f.read()
        reader = PdfReader(io.BytesIO(data))
        content = []
        for page in reader.pages:
            try:
                page_text = page.extract_text() or ""
                # Clean each page
                page_text = clean_text(page_text)
                content.append(page_text)
            except Exception:
                content.append("")
        full_text = chr(10).join(content).strip()
        if full_text:
            texts.append(full_text)
    return texts

def build_vectorstore(raw_texts):
    # Increased chunk size for better context
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,     # Larger chunks = more context
        chunk_overlap=200,   # More overlap to preserve continuity
        length_function=len,
    )
    docs = []
    for t in raw_texts:
        docs.extend(splitter.create_documents([t]))
    vs = FAISS.from_documents(docs, embeddings)
    return vs

def make_prompt(question, contexts):
    """
    Create a clear, structured prompt for FLAN-T5.
    Key improvements:
    - Cleaner instruction format
    - Context limited to avoid truncation issues
    - Explicit instruction to answer from context
    """
    # Limit total context length to avoid truncation
    max_context_chars = 1500
    combined_context = ""
    for ctx in contexts:
        if len(combined_context) + len(ctx) < max_context_chars:
            combined_context += ctx + chr(10) + chr(10)
        else:
            # Add partial context if space allows
            remaining = max_context_chars - len(combined_context)
            if remaining > 100:
                combined_context += ctx[:remaining] + "..."
            break
    
    combined_context = combined_context.strip()
    
    # FLAN-T5 works better with explicit instruction-following format
    prompt = f"""Based on the following context, answer the question. If the answer is not in the context, say "I don't know".

Context:
{combined_context}

Question: {question}

Answer:"""
    
    return prompt

def generate_answer(prompt, max_new_tokens=256):
    """
    Generate answer using FLAN-T5 with fixed parameters.
    Key fixes:
    - Removed min_length (was causing garbage output)
    - Better temperature settings
    - Proper handling of edge cases
    """
    # Tokenize with proper truncation
    inputs = tokenizer(
        prompt, 
        return_tensors="pt", 
        truncation=True,
        max_length=512,
        padding=False
    ).to(device)
    
    with torch.no_grad():
        output = flan.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            # No min_length: forcing a minimum length was causing garbage output
            temperature=0.5,          # Lower temperature for more focused answers
            do_sample=True,           # Sample from the distribution instead of taking the argmax
            top_p=0.95,               # Nucleus sampling
            top_k=50,                 # Sample only from the 50 most likely tokens
            repetition_penalty=1.1,   # Mildly penalize repeated tokens
            no_repeat_ngram_size=2,   # Prevent any 2-gram from repeating
            early_stopping=False,     # Keep the default beam-search stopping heuristic
            num_beams=3,              # With do_sample=True this is beam-search sampling
        )
    
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    answer = answer.strip()
    
    # Remove common prefixes that FLAN-T5 might add
    prefixes_to_remove = ["answer:", "answer is:", "the answer is:", "a:"]
    for prefix in prefixes_to_remove:
        if answer.lower().startswith(prefix):
            answer = answer[len(prefix):].strip()
            break
    
    # Post-process: only reject clearly invalid answers
    if not answer:
        return "I couldn\'t generate an answer from the provided context."
    
    # Very short answers (like "Yes", "No", "3") can still be valid,
    # so no minimum length is enforced beyond the empty check above.
    
    # Check for suspiciously numeric-only answers (but allow short numeric answers)
    if len(answer) > 15:  # Only check longer answers
        alpha_ratio = sum(c.isalpha() for c in answer) / max(len(answer), 1)
        if alpha_ratio < 0.2:  # More lenient threshold
            # Answer is mostly numbers/symbols - likely garbage
            return "I couldn\'t find a clear answer in the provided context. Please try rephrasing your question or check if the document contains relevant information."
    
    return answer

# ---------------------------
# UI
# ---------------------------
st.subheader("📤 Upload PDFs")
uploaded = st.file_uploader("Upload one or more PDFs", type=["pdf"], accept_multiple_files=True)

if "vectorstore" not in st.session_state:
    st.session_state.vectorstore = None

col_a, col_b = st.columns([1,1])
with col_a:
    build_btn = st.button("🔧 Build / Rebuild Index")
with col_b:
    clear_btn = st.button("🗑️ Clear Index")

if clear_btn:
    st.session_state.vectorstore = None
    st.success("Cleared vector index.")

if build_btn:
    if not uploaded:
        st.warning("Please upload at least one PDF.")
    else:
        with st.spinner("Reading PDFs and building FAISS index..."):
            texts = read_pdfs(uploaded)
            if not any(texts):
                st.error("No extractable text found in the PDFs.")
            else:
                st.session_state.vectorstore = build_vectorstore(texts)
                st.success(f"Index ready! Processed {len(texts)} document(s). Ask questions below.")

st.divider()
st.subheader("❓ Ask a Question")

q = st.text_input("Your question", placeholder="Enter your question here...")

k = st.slider("Top-k chunks", 2, 8, 4)
max_tokens = st.slider("Max new tokens (answer length)", 64, 512, 256, step=32)

# Always show the button, but disable it if no index or no question
button_disabled = st.session_state.vectorstore is None or not q.strip()

if st.session_state.vectorstore is None:
    st.warning("⚠️ Please upload PDFs and click **Build / Rebuild Index** first!")

# Show the button always
if st.button("🔍 Retrieve & Answer", disabled=button_disabled, type="primary"):
    if st.session_state.vectorstore is None:
        st.error("❌ No index found! Please upload PDFs and build the index first.")
    elif not q.strip():
        st.error("❌ Please enter a question first.")
    else:
        with st.spinner("Retrieving relevant chunks..."):
            docs = st.session_state.vectorstore.similarity_search(q, k=k)
            contexts = [d.page_content for d in docs]
        
        st.write("**Retrieved Chunks:**")
        for i, c in enumerate(contexts, 1):
            with st.expander(f"Chunk {i}"):
                st.write(c)

        with st.spinner("Generating answer with FLAN-T5..."):
            prompt = make_prompt(q, contexts)
            ans = generate_answer(prompt, max_new_tokens=max_tokens)
        
        st.success("**Answer:**")
        st.write(ans)
        
        # Debug info (collapsible)
        with st.expander("🔧 Debug: View prompt sent to model"):
            st.code(prompt, language="text")
'''

# Write app.py with error handling
try:
    with open("app.py", "w", encoding="utf-8") as f:
        f.write(app_code)
    print("✅ app.py generated successfully")
except Exception as e:
    print(f"❌ Failed to write app.py: {e}")
    raise

# 3) Setup ngrok for public URL
from pyngrok import ngrok
import os
import time

# ==============================================
# AGGRESSIVE NGROK CLEANUP (MUST RUN FIRST)
# ==============================================
print("🧹 Killing ALL existing ngrok processes...")
try:
    # Kill ngrok at OS level first (most aggressive)
    os.system('pkill -9 ngrok 2>/dev/null || true')
    os.system('killall ngrok 2>/dev/null || true')
    time.sleep(1)  # Wait for processes to die
    
    # Then use pyngrok's kill
    ngrok.kill()
    time.sleep(1)  # Wait again
    
    print("✅ ngrok processes killed")
except Exception as e:
    print(f"   Note: {e}")

# Load ngrok token from environment variables
# Supports both Google Colab (userdata) and local (.env file)
NGROK_TOKEN = None

# Try Google Colab first
try:
    from google.colab import userdata
    NGROK_TOKEN = userdata.get('NGROK_AUTHTOKEN')
    if NGROK_TOKEN:
        print("✅ Loaded ngrok token from Google Colab userdata")
except ImportError:
    # Not in Colab, try local .env file
    try:
        from dotenv import load_dotenv
        load_dotenv()  # Load .env file if it exists
        NGROK_TOKEN = os.getenv('NGROK_AUTHTOKEN')
        if NGROK_TOKEN:
            print("✅ Loaded ngrok token from .env file")
    except ImportError:
        # dotenv not installed, try environment variable directly
        NGROK_TOKEN = os.getenv('NGROK_AUTHTOKEN')
        if NGROK_TOKEN:
            print("✅ Loaded ngrok token from environment variable")
    except Exception as e:
        print(f"⚠️ Could not load .env file: {e}")

# Fallback to environment variable if still not found
if not NGROK_TOKEN:
    NGROK_TOKEN = os.getenv('NGROK_AUTHTOKEN')

if not NGROK_TOKEN:
    print("\n❌ ERROR: NGROK_AUTHTOKEN not found!")
    print("\n📝 How to set it:")
    print("   For Google Colab:")
    print("   1. Go to: Runtime → Manage secrets")
    print("   2. Add secret: NGROK_AUTHTOKEN = your_token_here")
    print("   3. Get token from: https://dashboard.ngrok.com/get-started/your-authtoken")
    print("\n   For Local (Jupyter/Local Python):")
    print("   1. Create a .env file in this directory")
    print("   2. Add: NGROK_AUTHTOKEN=your_token_here")
    print("   3. Get token from: https://dashboard.ngrok.com/get-started/your-authtoken")
    print("\n   Or set environment variable:")
    print("   export NGROK_AUTHTOKEN=your_token_here")
    raise SystemExit("NGROK_AUTHTOKEN not configured")

try:
    ngrok.set_auth_token(NGROK_TOKEN)
    print("✅ ngrok token configured successfully")
except Exception as e:
    print(f"⚠️ Warning: Could not set ngrok token: {e}")
    print("   Continuing without ngrok (local access only)")

# Disconnect any remaining tunnels via API
print("🔌 Disconnecting any remaining tunnels...")
try:
    tunnels = ngrok.get_tunnels()
    for tunnel in tunnels:
        ngrok.disconnect(tunnel.public_url)
        print(f"   Disconnected: {tunnel.public_url}")
    if tunnels:
        time.sleep(2)  # Wait for disconnections to complete
    print("✅ All tunnels disconnected")
except Exception as e:
    print(f"   Note: {e}")

# 4) Start Streamlit locally
import subprocess
import sys

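# Free port 8501 if a previous Streamlit process is still bound to it
try:
    if os.name == 'nt':  # Windows: this only lists the process holding the port; stop it manually if needed
        os.system('netstat -ano | findstr :8501')
    else:  # macOS/Linux: kill whatever is listening on the port
        os.system('lsof -ti:8501 | xargs kill -9 2>/dev/null || true')
except Exception:
    pass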

# Start Streamlit
print("\n🚀 Starting Streamlit...")
try:
    if sys.platform.startswith('win'):
        subprocess.Popen(
            [sys.executable, "-m", "streamlit", "run", "app.py", "--server.port", "8501", "--server.headless", "true"],
            creationflags=subprocess.CREATE_NEW_CONSOLE
        )
    else:
        subprocess.Popen(
            ["streamlit", "run", "app.py", "--server.port", "8501", "--server.headless", "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True
        )
    
    time.sleep(5)  # Give Streamlit time to start
    print("✅ Streamlit started!")

except Exception as e:
    print(f"⚠️ Error starting Streamlit: {e}")
    print("   You can start it manually with: streamlit run app.py")

# Create ngrok tunnel
print("\n🌐 Creating public URL with ngrok...")
try:
    # ngrok.connect returns an NgrokTunnel object; print its public_url, not the object
    tunnel = ngrok.connect(8501)
    public_url = tunnel.public_url
    print("\n" + "="*60)
    print("✅ SUCCESS! Your app is running!")
    print("="*60)
    print("\n🌐 Public URL (share this):")
    print(f"   {public_url}")
    print("\n🏠 Local URL:")
    print("   http://localhost:8501")
    print("\n📌 Tips:")
    print("   • Keep this notebook running")
    print("   • Upload PDFs and build the index")
    print("   • Ask questions to get answers from your documents")
    print("\n" + "="*60)
    
except Exception as e:
    error_msg = str(e)
    print(f"\n⚠️ Could not create ngrok tunnel: {e}")

    # Check for session limit error (ERR_NGROK_108)
    if "ERR_NGROK_108" in error_msg or "3 simultaneous" in error_msg or "agent sessions" in error_msg:
        print("\n💡 Issue: You've reached ngrok's free account limit (3 simultaneous sessions)")
        print("   These sessions are running on OTHER machines (not this one)")
        print("\n🔧 How to fix:")
        print("   1. Go to: https://dashboard.ngrok.com/agents")
        print("   2. MANUALLY disconnect all active agent sessions")
        print("   3. Then re-run this cell")
        print("\n📌 App is running locally at: http://localhost:8501")
        print("   (You can still use the app locally without ngrok)")
    elif "ERR_NGROK_334" in error_msg or "already online" in error_msg:
        print("\n💡 Issue: An ngrok endpoint is already registered to your account")
        print("   This happens when a previous session didn't close properly")
        print("\n🔧 How to fix:")
        print("   1. Go to: https://dashboard.ngrok.com/agents")
        print("   2. MANUALLY disconnect all active agent sessions")
        print("   3. Wait 30 seconds")
        print("   4. Go to: Runtime → Restart runtime (in Colab menu)")
        print("   5. Re-run this cell")
        print("\n📌 App is running locally at: http://localhost:8501")
        print("   (You can still use the app locally without ngrok)")
    else:
        print("\n📌 App is running locally at: http://localhost:8501")
        print("   (ngrok tunnel failed, but local access works)")
        print("\n🔧 Troubleshooting:")
        print("   1. Check your ngrok token is correct")
        print("   2. Try restarting the runtime and running again")

Note: you may need to restart the kernel to use updated packages.
✅ app.py generated successfully
🧹 Killing ALL existing ngrok processes...
✅ ngrok processes killed
✅ Loaded ngrok token from .env file
✅ ngrok token configured successfully
🔌 Disconnecting any remaining tunnels...
✅ All tunnels disconnected

🚀 Starting Streamlit...
✅ Streamlit started!

🌐 Creating public URL with ngrok...

✅ SUCCESS! Your app is running!

🌐 Public URL (share this):
   NgrokTunnel: "https://unrivalable-lenna-soothfastly.ngrok-free.dev" -> "http://localhost:8501"

🏠 Local URL:
   http://localhost:8501

📌 Tips:
   • Keep this notebook running
   • Upload PDFs and build the index
   • Ask questions to get answers from your documents



t=2025-12-14T22:26:40-0500 lvl=warn msg="Stopping forwarder" name=http-8501-ee0e24ec-3c3a-4b1c-a537-acb163ecb549 acceptErr="failed to accept connection: Listener closed"
t=2025-12-14T22:26:40-0500 lvl=warn msg="Error restarting forwarder" name=http-8501-ee0e24ec-3c3a-4b1c-a537-acb163ecb549 err="failed to start tunnel: session closed"


## Example Usage

### Step 1: Upload PDFs
Upload one or more PDF documents using the file uploader in the Streamlit interface.

### Step 2: Build Index
Click "Build / Rebuild Index" to:
- Extract text from PDFs
- Split into chunks
- Generate embeddings
- Build FAISS vector index
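
The embedding and search steps can be pictured with a toy retriever. This is only a sketch under loud assumptions: word counts stand in for all-MiniLM-L6-v2 embeddings, a sorted list stands in for the FAISS index, and cosine similarity is an illustrative choice of metric. The names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a sentence-embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "the model was trained on public data",
    "limitations include small sample size",
    "we thank our reviewers",
]
print(top_k("what are the limitations", chunks, k=1))
```

A real embedding model places paraphrases close together even when they share no words, which this word-count version cannot do; the ranking logic is the part that carries over.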

### Step 3: Ask Questions
Enter your question and adjust parameters:
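- **Top-k chunks**: Number of relevant chunks to retrieve (2-8, default: 4). The prompt keeps at most about 1500 characters of context, so with 1000-character chunks the later chunks are truncated or dropped.
- **Max new tokens**: Upper bound on the answer length in tokens (64-512, default: 256)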
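
To see how the top-k setting interacts with the prompt's context budget, the sketch below mirrors the assembly logic in the app's `make_prompt`, using the same 1500-character cap and 100-character minimum for a partial chunk. The function name `assemble_context` and the synthetic chunks are assumptions for illustration.

```python
def assemble_context(chunks: list[str], max_chars: int = 1500, min_partial: int = 100) -> str:
    """Concatenate chunks in rank order until the character budget is used up."""
    combined = ""
    for chunk in chunks:
        if len(combined) + len(chunk) < max_chars:
            combined += chunk + "\n\n"
        else:
            # Add a truncated piece of the next chunk only if meaningful space remains
            remaining = max_chars - len(combined)
            if remaining > min_partial:
                combined += chunk[:remaining] + "..."
            break
    return combined.strip()


# Four synthetic 1000-character chunks, as if top-k were set to 4
retrieved = ["a" * 1000, "b" * 1000, "c" * 1000, "d" * 1000]
context = assemble_context(retrieved)
print(context.count("a"), context.count("b"), context.count("c"))  # 1000 498 0
```

With full-size chunks, only the best-ranked chunk and roughly half of the second reach the model, so raising top-k mainly changes what is shown under "Retrieved Chunks" unless the chunks are short.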

### Example Questions:
- "What is the main topic of this document?"
- "Summarize the key findings"
- "Explain the methodology used"
- "What are the limitations mentioned?"

### Understanding the Output:
- **Retrieved Chunks**: Shows the document sections used to answer
- **Answer**: Generated response grounded in the retrieved context