# Retrieval-Augmented Generation (RAG) Pipeline Demo

This Jupyter Notebook implements a minimal Retrieval-Augmented Generation (RAG) pipeline for a take-home interview project. The system answers user queries using content from two PDF datasets: `QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf` and `NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf`. It handles varied queries (e.g., leadership roles, product lists, technical specifications) with a lightweight, CPU-friendly setup suitable for Google Colab.

## Objective
- **Purpose**: Combine retrieval and generative AI to provide accurate, context-grounded answers from PDF content.
- **Resource Efficiency**: Use small models (`lightonai/GTE-ModernColBERT-v1` for embeddings, `google/flan-t5-base` for generation) to ensure compatibility with CPU environments.
- **Post-Processing**: Apply minimal regex-based post-processing for role-based queries (e.g., extracting CEO names) and product queries (e.g., listing product names), with deduplication to ensure clean outputs.
- **Interactivity**: Support an interactive query interface for demo purposes, with example queries to showcase functionality.

## Architecture
The pipeline follows a modular RAG design:
- **Knowledge Base**: PDFs are loaded using `PyPDFLoader` and split into chunks (300 characters, 50-character overlap) with `RecursiveCharacterTextSplitter`. Chunks are stored in a dictionary mapping document IDs to text, with source tracking for company-specific filtering.
- **Semantic Layer**: Text chunks and queries are embedded with `lightonai/GTE-ModernColBERT-v1`, a ColBERT-style model that produces one dense vector per token rather than a single vector per text, for semantic similarity comparison.
- **Retrieval System**: `retrieve.ColBERT` fetches the top 15 relevant chunks based on query embeddings, which are reranked to the top 3 using `rank.rerank` for improved relevance.
- **Augmentation**: The top 3 chunks (up to 600 characters) are combined with the query via a `PromptTemplate` to create a contextualized input for the generative model.
- **Generation**: `google/flan-t5-base` produces concise answers, with post-processing to extract names for role queries (e.g., CEO), list products for product queries, or deduplicate comma-separated lists.
- **Fixes Implemented**:
  - **Role Extraction**: Improved regex to handle formatting variations and case sensitivity for reliable name extraction (e.g., 'Dr. Elena Ruiz' for CEO).
  - **Product Extraction**: Refined regex to target quantum-related products and filter out non-product terms (e.g., 'Compliance').
  - **Company Filtering**: Added source tracking to filter chunks by company (QuantumCore or NeoCompute) based on query keywords.

## Setup
- **Dependencies**: Requires `pylate`, `langchain`, `langchain-community`, `transformers`, `google-colab`, `pypdf`, and `hf_xet` for PDF processing, embedding, retrieval, and generation.
- **Environment**: Designed for Google Colab with CPU, ensuring accessibility without GPU requirements.
- **Datasets**: Processes `QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf` (quantum computing company details) and `NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf` (assumed similar content).

## Instructions
1. **Cell 1**: Install required Python libraries to set up the environment.
2. **Cell 2**: Import libraries and suppress warnings for cleaner output.
3. **Cell 3**: Define the RAG pipeline functions (`run_rag_pipeline` and `query_rag`) with improved logic.
4. **Cell 4**: Load and process the PDFs, initializing the pipeline with models and indexes.
5. **Cell 5**: Run an interactive query interface to test the pipeline with example or custom queries.

The pipeline combines chunks from both PDFs into a single knowledge base but filters by company when specified in queries, ensuring relevant responses.
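
The company filter described above can be sketched in isolation as follows. This is an illustrative sketch under stated assumptions, not the notebook's exact code: the helper names (`source_from_path`, `detect_company`, `filter_by_company`) and the three chunk IDs are hypothetical, and the company tag is assumed to be the part of the file name before the first underscore.

```python
import os


def source_from_path(pdf_path):
    """Take the company tag from the file name prefix, e.g. 'QuantumCore_..._v1.pdf' -> 'QuantumCore'."""
    return os.path.basename(pdf_path).split("_")[0]


def detect_company(query, companies=("QuantumCore", "NeoCompute")):
    """Return the first known company named in the query, or None."""
    lowered = query.lower()
    for company in companies:
        if company.lower() in lowered:
            return company
    return None


def filter_by_company(doc_ids, document_sources, company):
    """Keep only chunk IDs from the requested company; keep everything if none was named."""
    if company is None:
        return list(doc_ids)
    return [doc_id for doc_id in doc_ids if document_sources.get(doc_id) == company]


# Hypothetical chunk IDs and sources, for illustration only
document_sources = {
    "0": source_from_path("/data/QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf"),
    "1": source_from_path("/data/NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf"),
    "2": source_from_path("/data/NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf"),
}
company = detect_company("What are the products offered by NeoCompute Technologies?")
kept = filter_by_company(["0", "1", "2"], document_sources, company)
print(company, kept)
```

Because the tag comes from the file name, this approach only works when each uploaded PDF keeps its company name as the prefix before the first underscore.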

## Cell 1: Install Dependencies

This cell installs the necessary Python libraries for the RAG pipeline. It ensures compatibility in a clean Google Colab environment by installing `pylate` (for ColBERT embeddings and retrieval), `langchain` (for document loading and splitting), `transformers` (for the FLAN-T5 model), `google-colab` (for Colab utilities), and additional dependencies (`langchain-community`, `pypdf`, `hf_xet`) for PDF processing and Hugging Face integration.

In [None]:
# Install core libraries for RAG pipeline (pylate for ColBERT, langchain for document processing, transformers for generation)
!pip install pylate langchain transformers google-colab
# Install additional dependencies for PDF loading and Hugging Face integration
!pip install -U langchain-community pypdf hf_xet

## Cell 2: Import Libraries

This cell imports the required Python libraries for the pipeline and suppresses warnings to ensure cleaner output in Colab. Key libraries include:
- `pylate` for ColBERT-based embedding and retrieval (`models`, `indexes`, `retrieve`, `rank`).
- `langchain` for PDF loading (`PyPDFLoader`), text splitting (`RecursiveCharacterTextSplitter`), and prompt creation (`PromptTemplate`).
- `transformers` for the FLAN-T5 model (`pipeline`).
- `google.colab.files` for handling file uploads in Colab.
- `os`, `re` for file path handling and regex post-processing.
- Warnings from `pypdf` are suppressed to avoid cluttering the output.

In [None]:
import warnings
from pylate import models, indexes, retrieve, rank  # For ColBERT embedding, indexing, and retrieval
from langchain_community.document_loaders import PyPDFLoader  # For loading PDF documents
from langchain_text_splitters import RecursiveCharacterTextSplitter  # For splitting text into chunks
from langchain_core.prompts import PromptTemplate  # For creating prompt templates
from google.colab import files  # For file uploads in Colab
import os  # For file path handling
from transformers import pipeline  # For FLAN-T5 text generation
import re  # For regex-based post-processing

# Suppress warnings from pypdf for cleaner output
warnings.filterwarnings('ignore', category=UserWarning, module='pypdf._reader')
warnings.filterwarnings('ignore', category=DeprecationWarning, module='pypdf._reader')

## Cell 3: Define RAG Pipeline

This cell defines the core functions of the RAG pipeline:
- **`run_rag_pipeline`**: Processes PDFs by loading, chunking, embedding, and indexing them, then initializes the retriever and generator. It now tracks the source PDF for each chunk to enable company-specific filtering.
- **`query_rag`**: Handles user queries by encoding them, retrieving and reranking relevant chunks, augmenting the query with context, generating an answer, and applying post-processing. Fixes include:
  - **Company Filtering**: Filters chunks by company (QuantumCore or NeoCompute) based on query keywords.
  - **Role Extraction**: Uses an improved regex to extract names for roles (e.g., CEO) reliably.
  - **Product Extraction**: Uses a refined regex to target quantum-related products and filters out non-product terms.
  - **Robust Post-Processing**: Ensures accurate deduplication and fallback to raw generated answers when needed.

Both functions catch exceptions and print an error message instead of raising: `run_rag_pipeline` returns `None` on failure, and `query_rag` returns `(None, 'Error processing query.')`. String literals such as the prompt template and regex patterns are escaped so that the notebook file remains valid JSON.
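
For intuition about the retrieve-then-rerank step in `query_rag`, the sketch below shows the late-interaction (MaxSim) scoring that ColBERT-style models are generally built around: each query token vector is matched to its most similar document token vector, and those maxima are summed. This is a conceptual illustration in plain Python with toy 2-dimensional vectors, not PyLate's implementation; the names `maxsim_score` and `rerank_candidates` are hypothetical.

```python
def maxsim_score(query_vectors, doc_vectors):
    """For each query token vector, take its best dot product against any
    document token vector, then sum those maxima."""
    total = 0.0
    for q in query_vectors:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vectors)
    return total


def rerank_candidates(query_vectors, candidates, top_n=3):
    """Sort (doc_id, doc_vectors) pairs by MaxSim score, highest first, and keep top_n."""
    scored = [(doc_id, maxsim_score(query_vectors, vectors)) for doc_id, vectors in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy token embeddings (hypothetical values, not real model output)
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("a", [[1.0, 0.0], [0.0, 1.0]]),  # matches both query tokens
    ("b", [[1.0, 0.0], [1.0, 0.0]]),  # matches only the first query token
    ("c", [[0.4, 0.4]]),              # weak partial match for both tokens
]
ranking = rerank_candidates(query_vectors, candidates, top_n=2)
print(ranking)
```

Scoring at the token level is what lets a short chunk that contains the exact role or product name outrank a longer chunk that is only loosely on topic.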

In [None]:
def run_rag_pipeline(pdf_paths):
    """Initialize the RAG pipeline by processing PDFs, creating embeddings, and setting up models."""
    try:
        # Initialize storage for document texts, IDs, and sources
        all_document_texts = []  # Store text chunks from all PDFs
        all_document_ids = []  # Store unique IDs for each chunk
        document_map = {}  # Map IDs to text chunks
        current_doc_id = 0  # Track ID increments across PDFs
        document_sources = {}  # Track source PDF (QuantumCore or NeoCompute) for each chunk

        # Initialize text splitter (300 chars, 50-char overlap) for chunking PDFs
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
        for pdf_path in pdf_paths:
            print(f'Processing PDF: {pdf_path}')
            # Load PDF using PyPDFLoader
            loader = PyPDFLoader(pdf_path)
            documents = loader.load()
            # Split documents into chunks
            chunks = text_splitter.split_documents(documents)
            document_texts = [chunk.page_content for chunk in chunks]
            # Assign unique IDs to chunks
            document_ids = [str(i + current_doc_id) for i in range(len(document_texts))]
            # Update document map with ID-to-text mapping
            document_map.update(dict(zip(document_ids, document_texts)))
            # Track source PDF (e.g., 'QuantumCore' or 'NeoCompute') for filtering
            source_name = os.path.basename(pdf_path).split('_')[0]
            document_sources.update({doc_id: source_name for doc_id in document_ids})
            # Extend lists with current PDF's chunks and IDs
            all_document_texts.extend(document_texts)
            all_document_ids.extend(document_ids)
            current_doc_id += len(document_texts)
            print(f'Created {len(document_texts)} chunks from {pdf_path}')

        print(f'Total chunks created: {len(all_document_texts)}')

        # Load ColBERT model for embedding
        model_name = 'lightonai/GTE-ModernColBERT-v1'
        model = models.ColBERT(model_name_or_path=model_name)

        # Initialize Voyager index for storing embeddings
        index_folder = 'pylate-index'
        index_name = 'pdf_index'
        index = indexes.Voyager(index_folder=index_folder, index_name=index_name, override=True)

        # Embed document chunks using ColBERT
        documents_embeddings = model.encode(
            all_document_texts,
            batch_size=32,
            is_query=False,
            show_progress_bar=True
        )
        # Add embeddings to index with corresponding IDs
        index.add_documents(all_document_ids, documents_embeddings=documents_embeddings)

        # Initialize ColBERT retriever
        retriever = retrieve.ColBERT(index=index)

        # Initialize FLAN-T5 model for text generation
        generator = pipeline('text2text-generation', model='google/flan-t5-base', max_length=300)

        # Define prompt template for answer generation (properly escaped for JSON)
        prompt_template = "Using only the provided text, answer the user's question with a concise and accurate response. For questions about specific roles (e.g., CEO, CTO, CFO), return only the full name of the individual in that role. For questions about lists (e.g., products), return all items as a comma-separated list of names only. Exclude any details not directly relevant to the question, such as technical specifications, unless explicitly requested. If the answer is not in the text, respond with 'The answer could not be found in the text.'\n\nText: {context}\n\nQuestion: {question}\n\nAnswer:"
        PROMPT = PromptTemplate(template=prompt_template, input_variables=['context', 'question'])

        # Return initialized components, including document_sources for company filtering
        return model, index, retriever, generator, PROMPT, document_map, document_sources

    except Exception as e:
        print(f'Error processing PDFs: {e}')
        return None

def query_rag(model, index, retriever, generator, PROMPT, document_map, document_sources, query):
    """Process a user query by retrieving relevant chunks, generating an answer, and applying post-processing."""
    try:
        queries = [query]

        # Encode query using ColBERT
        query_embedding = model.encode(
            queries,
            batch_size=32,
            is_query=True,
            show_progress_bar=True
        )

        # Retrieve top 15 documents based on query embedding
        top_k_initial = 15
        initial_results = retriever.retrieve(queries_embeddings=query_embedding, k=top_k_initial)
        retrieved_doc_ids = [result['id'] for result in initial_results[0]]

        # Filter documents by company if specified in query (e.g., 'QuantumCore' or 'NeoCompute')
        company = None
        if 'quantumcore' in query.lower():
            company = 'QuantumCore'
        elif 'neocompute' in query.lower():
            company = 'NeoCompute'
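        if company:
            retrieved_doc_ids = [doc_id for doc_id in retrieved_doc_ids if document_sources.get(doc_id) == company]
        if not retrieved_doc_ids:
            # Nothing left to rerank (e.g., no retrieved chunk matched the company filter)
            return '', 'The answer could not be found in the text.'
        retrieved_documents = [document_map[doc_id] for doc_id in retrieved_doc_ids]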

        # Rerank documents to select top 3 most relevant
        reranked_results = rank.rerank(
            documents_ids=[retrieved_doc_ids],
            queries_embeddings=query_embedding,
            documents_embeddings=[model.encode(retrieved_documents, is_query=False)]
        )

        # Extract reranked document IDs
        reranked_doc_ids = []
        if reranked_results and isinstance(reranked_results[0], list):
            for result in reranked_results[0]:
                if isinstance(result, dict) and 'id' in result:
                    reranked_doc_ids.append(result['id'])
                elif isinstance(result, str):
                    reranked_doc_ids.append(result)
        else:
            reranked_doc_ids = retrieved_doc_ids

        reranked_documents = [document_map[doc_id] for doc_id in reranked_doc_ids]

        # Create context from top 3 reranked documents (max 600 characters)
        max_context_length = 600
        context = '\n'.join(reranked_documents[:3])[:max_context_length]
        prompt_text = PROMPT.format(context=context, question=query)

        # Generate answer using FLAN-T5
        response = generator(prompt_text)[0]['generated_text']
        answer = response.strip()

        # Post-processing for role, product, or list-based queries
        non_product_terms = {'Compliance', 'Cooling', 'Features', 'Storage', 'Networking', 'Frameworks', 'Uptime', 'Encryption', 'Certifications', 'Software'}
        role_match = re.search(r'\b(ceo|cto|cfo)\b', query, re.IGNORECASE)
        if role_match:
            # Match the role as a whole word so that, e.g., 'director' does not trigger 'cto'
            role = role_match.group(1).upper()
            if role:
                # Improved regex: non-greedy, handles spaces, case-insensitive
                match = re.search(r'- ([^,]+?),\s*'+role+r'\s*:', context, re.IGNORECASE)
                if match:
                    answer = match.group(1).strip()
                else:
                    answer = 'The answer could not be found in the text.'
        elif 'product' in query.lower():
            # Extract product names with regex targeting quantum-related terms
            product_names = re.findall(r'- (\w+): (?:Quantum|Cloud-based|processing unit|platform|cryptographic security)', context, re.IGNORECASE)
            # Filter out non-product terms
            product_names = [name for name in product_names if name not in non_product_terms]
            if product_names:
                # Sort and deduplicate product names
                answer = ', '.join(sorted(set(product_names)))
            else:
                answer = 'The answer could not be found in the text.'
        elif ', ' in answer:
            # Handle comma-separated lists (e.g., compliance standards)
            items = set(answer.split(', '))
            # Filter out non-product terms
            items = [item for item in items if item not in non_product_terms]
            answer = ', '.join(sorted(items)) if items else 'The answer could not be found in the text.'

        # Debug logging (commented out for production)
        # print(f'Debug: Query={query}, Context={context[:100]}..., Raw Answer={response}, Final Answer={answer}')

        return context, answer

    except Exception as e:
        print(f'Error processing query: {e}')
        return None, 'Error processing query.'

## Cell 4: Process PDFs

This cell specifies the paths to the PDF datasets (`QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf` and `NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf`) and initializes the RAG pipeline by calling `run_rag_pipeline`. If running locally, ensure the PDFs are in the `/data` directory. In Google Colab, the cell checks for missing files and prompts the user to upload them. The pipeline is initialized with the ColBERT model, Voyager index, retriever, FLAN-T5 generator, prompt template, document map, and source tracking for company-specific filtering.

In [None]:
# Define PDF paths (modify as needed for local environment)
pdf_paths = [
    '/data/QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf',
    '/data/NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf'
]

# Check if PDFs exist; if not, prompt for upload in Colab
if not all(os.path.exists(pdf_path) for pdf_path in pdf_paths):
    print('Please upload your PDF files (QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf and/or NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf):')
    uploaded = files.upload()
    pdf_paths = list(uploaded.keys())

# Initialize the RAG pipeline
result = run_rag_pipeline(pdf_paths)
if result:
    # Unpack pipeline components, including document_sources
    model, index, retriever, generator, PROMPT, document_map, document_sources = result
    print('RAG pipeline initialized successfully.')
else:
    print('Failed to initialize RAG pipeline.')

## Cell 5: Interactive Querying

This cell provides an interactive interface to query the RAG system. Users can enter custom queries or use provided examples. The system retrieves relevant chunks, generates an answer, and displays both the context and response. Example queries test various aspects of the pipeline:
- Role queries (e.g., CEO name) use regex to extract full names.
- Product queries list product names, with fixes to exclude non-products.
- Specification queries (e.g., qubit count) extract specific details.
- Compliance queries return lists of standards, deduplicated and filtered.
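
The role-extraction idea can be tried on its own with the sketch below. It assumes the PDF text lists leaders as `- <name>, <ROLE>: ...`; that layout, the sample context, and the function name `extract_role_name` are assumptions for illustration (only the CEO name comes from the example queries in this notebook). Unlike the pattern in Cell 3, this variant also excludes newlines from the name so a match cannot span two lines.

```python
import re


def extract_role_name(context, role):
    """Find '- <name>, <ROLE>:' in the context and return the name, or None."""
    pattern = r"-\s*([^,\n]+?),\s*" + re.escape(role) + r"\s*:"
    match = re.search(pattern, context, re.IGNORECASE)
    return match.group(1).strip() if match else None


# Hypothetical context; the line layout is an assumption about the PDF text
context = (
    "Leadership Team\n"
    "- Dr. Elena Ruiz, CEO: sets company strategy\n"
    "- Jane Placeholder, CTO: leads engineering\n"
)
ceo = extract_role_name(context, "CEO")
cto = extract_role_name(context, "cto")
missing = extract_role_name(context, "CFO")
print(ceo, cto, missing)
```

Returning `None` when no match is found leaves the caller free to decide between a fixed "not found" message and the raw generated answer.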

**Example Queries**:
- Who is the CEO of QuantumCore Solutions? → Expected: 'Dr. Elena Ruiz'
- What are the products offered by NeoCompute Technologies? → Expected: 'QubitCore, QuantumNet' (based on prior context)
- What is the qubit count of QubitCore? → Expected: '50-qubit superconducting architecture'
- What compliance standards does NeoCompute follow? → Expected: 'FIPS 140-3, GDPR' (based on prior context)
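
List-style answers such as the compliance example are cleaned up by deduplicating and filtering the comma-separated output. The following is a minimal sketch of that step, assuming a raw model output with a repeated item; `clean_list_answer` is a hypothetical helper, and the excluded set is only a subset of the one defined in Cell 3.

```python
# Subset of the non-product terms used in Cell 3
NON_PRODUCT_TERMS = {"Compliance", "Cooling", "Features", "Storage", "Networking"}


def clean_list_answer(answer, excluded=NON_PRODUCT_TERMS):
    """Deduplicate a comma-separated answer, drop excluded terms, and sort the rest."""
    items = {item.strip() for item in answer.split(",") if item.strip()}
    kept = sorted(item for item in items if item not in excluded)
    return ", ".join(kept)


# Hypothetical raw model output with a duplicate and a non-product term
raw = "GDPR, FIPS 140-3, GDPR, Compliance"
cleaned = clean_list_answer(raw)
print(cleaned)
```

Sorting makes the output deterministic, at the cost of discarding the order in which the model listed the items.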

In [None]:
def interactive_query():
    """Run an interactive query loop to test the RAG pipeline."""
    print('Enter your query (or type "exit" to quit):')
    while True:
        query = input('Query: ')
        if query.lower() == 'exit':
            print('Exiting query interface.')
            break
        if not result:
            print('RAG pipeline not initialized. Please run Cell 4 first.')
            break
        # Process query and display context and answer
        context, answer = query_rag(model, index, retriever, generator, PROMPT, document_map, document_sources, query)
        print('\n**Context Retrieved**:\n', context)
        print('\n**Answer**:\n', answer, '\n')

# Start the interactive query interface
interactive_query()