# Retrieval-Augmented Generation (RAG) Pipeline Demo (TAKE_HOME_PROJECT)

This Jupyter Notebook implements a minimal Retrieval-Augmented Generation (RAG) pipeline for a take-home project interview. The system answers user queries by leveraging content from two PDF datasets: `QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf` and `NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf`. It demonstrates versatility in handling varied queries (e.g., leadership roles, product lists, technical specifications) using a lightweight, CPU-friendly setup suitable for Google Colab.

## Objective
- **Purpose**: Combine retrieval and generative AI to provide accurate, context-grounded answers from PDF content.
- **Resource Efficiency**: Use small models (`lightonai/GTE-ModernColBERT-v1` for embeddings, `google/flan-t5-base` for generation) to ensure compatibility with CPU environments.
- **Post-Processing (moved to prompt)**: Answer formatting for role-based queries (e.g., extracting CEO names) and product queries (e.g., listing product names) is handled by the few-shot prompt instead of regex-based post-processing and deduplication.
- **Interactivity**: Support an interactive query interface for demo purposes, with example queries to showcase functionality.

## Architecture
The pipeline follows a modular RAG design:
- **Knowledge Base**: PDFs are loaded using `PyPDFLoader` and split into chunks (200 characters, 25-character overlap) with `RecursiveCharacterTextSplitter`. Chunks are stored in a dictionary mapping document IDs to text, with source tracking for company-specific filtering.
- **Semantic Layer**: Text chunks and queries are embedded into dense vectors using `lightonai/GTE-ModernColBERT-v1` for semantic similarity comparison.
- **Retrieval System**: `retrieve.ColBERT` fetches the top 15 relevant chunks based on query embeddings, which are reranked to the top 3 using `rank.rerank` for improved relevance.
- **Augmentation**: The top 3 chunks (truncated to 500 characters in total) are combined with the query via a few-shot `PromptTemplate` to create a contextualized input for the generative model.
- **Generation**: `google/flan-t5-base` produces concise answers. The few-shot prompt instructs it to return full names for role queries (e.g., CEO) and comma-separated lists for product queries; an empty output is replaced with a fixed fallback message.
- **Fixes Implemented**:
  - Added source tracking to filter chunks by company based on query keywords.
  - Implemented few-shot prompting with fictional examples.

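To make the chunking parameters concrete, here is a minimal sketch of fixed-size splitting with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraphs and spaces, so real chunk boundaries will differ. The function name `split_with_overlap` is hypothetical and is not part of the notebook.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=25):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; this is not the LangChain splitter's algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 450
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # [200, 200, 100]
```

Each window starts 175 characters after the previous one, so the last 25 characters of one chunk reappear at the start of the next. That repetition is what lets a sentence cut at a boundary still appear whole in at least one chunk.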

## Setup
- **Dependencies**: Requires `pylate`, `langchain`, `transformers`, `google-colab`, `pypdf`, `hf_xet` for PDF processing, embedding, retrieval, and generation.
- **Environment**: Designed for Google Colab with CPU, ensuring accessibility without GPU requirements.
- **Datasets**: Processes `QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf` (quantum computing company details) and `NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf` (assumed similar content).

## Instructions
1. **Cell 1**: Install required Python libraries to set up the environment.
2. **Cell 2**: Import libraries and suppress warnings for cleaner output.
3. **Cell 3**: Define the RAG pipeline functions (`run_rag_pipeline` and `query_rag`) with improved logic.
4. **Cell 4**: Load and process the PDFs, initializing the pipeline with models and indexes.
5. **Cell 5**: Run an interactive query interface to test the pipeline with example or custom queries.

The pipeline combines chunks from both PDFs into a single knowledge base but filters by company when specified in queries, ensuring relevant responses.
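The company filter described above can be sketched in isolation as follows. This is a standalone illustration, not the notebook's exact code: the helper name `filter_by_company` and the sample IDs are hypothetical, and the source labels assume that each chunk is tagged with the first underscore-separated part of its PDF filename.

```python
def filter_by_company(query, doc_ids, doc_sources,
                      companies=("QuantumCore", "NeoCompute")):
    """Keep only chunk IDs whose source matches a company named in the query."""
    lowered = query.lower()
    # Pick the first known company whose name appears in the query, if any.
    company = next((c for c in companies if c.lower() in lowered), None)
    if company is None:
        return doc_ids  # no company mentioned: keep every retrieved chunk
    matching = [d for d in doc_ids if doc_sources.get(d) == company]
    # Fall back to the unfiltered list if the filter removed everything.
    return matching or doc_ids


sources = {"0": "QuantumCore", "1": "NeoCompute", "2": "NeoCompute"}
print(filter_by_company("Who is the CEO of NeoCompute?", ["0", "1", "2"], sources))
# ['1', '2']
```

One consequence of keyword matching is that a query naming both companies is filtered to whichever company is checked first, and a query naming neither is answered from the combined knowledge base.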

## Cell 1: Install Dependencies

This cell installs the necessary Python libraries for the RAG pipeline. It ensures compatibility in a clean Google Colab environment by installing `pylate` (for ColBERT embeddings and retrieval), `langchain` (for document loading and splitting), `transformers` (for the FLAN-T5 model), `google-colab` (for Colab utilities), and additional dependencies (`langchain-community`, `pypdf`, `hf_xet`) for PDF processing and Hugging Face integration.
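Before installing, it can help to check which modules are already importable. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical. Note that standard-library modules such as `os` ship with Python and are never installed with `pip`, and that a package's import name can differ from its `pip` name.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Standard-library modules are always present, so nothing is reported here.
print(missing_modules(["os", "re", "contextlib"]))  # []
```

In the notebook, the same helper could be called with the import names of the third-party dependencies to decide whether the install cell needs to run.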

In [6]:
# Install core libraries for RAG pipeline (pylate for ColBERT, langchain for document processing, transformers for generation)
!pip install pylate langchain transformers google-colab
# Install additional dependencies for PDF loading and Hugging Face integration
!pip install -U langchain-community pypdf hf_xet

ERROR: Could not find a version that satisfies the requirement os (from versions: none)
ERROR: No matching distribution found for os

## Cell 2: Import Libraries

This cell imports the required Python libraries for the pipeline and suppresses warnings to ensure cleaner output in Colab. Key libraries include:
- `pylate` for ColBERT-based embedding and retrieval (`models`, `indexes`, `retrieve`, `rank`).
- `langchain` for PDF loading (`PyPDFLoader`), text splitting (`RecursiveCharacterTextSplitter`), and prompt creation (`PromptTemplate`).
- `transformers` for the FLAN-T5 model (`pipeline`).
- `google.colab.files` for handling file uploads in Colab.
- `os` for file path handling, plus `contextlib` and `io` for redirecting stderr.
- Warnings from `pypdf` are suppressed to avoid cluttering the output.

In [7]:
# Import required libraries for the RAG pipeline
import warnings
from pylate import models, indexes, retrieve, rank  # Pylate modules for embedding, indexing, and retrieval
from langchain.document_loaders import PyPDFLoader  # Load PDFs
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Split text into chunks
from langchain.prompts import PromptTemplate  # Create prompt templates
from google.colab import files  # Handle file uploads in Colab
import os  # File system operations
from transformers import pipeline  # Hugging Face pipeline for text generation
import contextlib  # Redirect stderr for clean output
import io  # StringIO for stderr redirection


# Suppress PDF reader warnings to avoid cluttering output
warnings.filterwarnings('ignore', category=UserWarning, module='pypdf._reader')
warnings.filterwarnings('ignore', category=DeprecationWarning, module='pypdf._reader')

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


## Cell 3: Define RAG Pipeline

This cell defines the core functions of the RAG pipeline:
- **`run_rag_pipeline`**: Processes PDFs by loading, chunking, embedding, and indexing them, then initializes the retriever and generator. It tracks the source PDF for each chunk to enable company-specific filtering and creates a prompt template with fictional examples to guide answer extraction.
- **`query_rag`**: Handles user queries by encoding them, retrieving and reranking relevant chunks, augmenting the query with context, and generating an answer. Fixes include:
  - **Company Filtering**: Filters chunks by company (QuantumCore or NeoCompute) based on query keywords, falling back to all retrieved chunks if none match.
  - **Role Extraction**: Relies on the few-shot prompt to return names for roles (e.g., CEO); the earlier regex-based extraction was removed.
  - **Robust Post-Processing**: Returns the raw generated answer, substituting a fixed fallback message when the output is empty.

The pipeline is designed to be robust: both functions catch exceptions, print a clear message, and return `None` (or an error string) instead of raising. All strings (e.g., the prompt template) are properly escaped to ensure valid JSON.
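For intuition about what the retrieval and reranking steps compute, ColBERT-style models score a query against a document by late interaction: each query token embedding is matched to its most similar document token embedding, and those maxima are summed (often called MaxSim). The sketch below is pure Python with tiny hand-made vectors. It is an illustration of the scoring idea only; the notebook delegates the real work to `pylate`, and the function names here are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(query_tokens, docs):
    """Return (doc_id, score) pairs sorted from most to least relevant."""
    scored = [(doc_id, maxsim_score(query_tokens, toks)) for doc_id, toks in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Two query tokens and two toy documents, each with two token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # covers both query tokens
    "b": [[1.0, 0.0], [1.0, 0.0]],  # covers only the first query token
}
print(rank_documents(query, docs))  # [('a', 2.0), ('b', 1.0)]
```

Because every query token contributes its own best match, a document that covers all parts of the query outranks one that matches a single term very well.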

In [8]:
def run_rag_pipeline(pdf_paths):
    """Initialize the RAG pipeline by processing PDFs, creating embeddings, and setting up models."""
    try:
        # --- Initialize Data Structures ---
        # Lists to store document texts and IDs
        all_document_texts = []
        all_document_ids = []
        # Dictionary to map document IDs to their text content
        document_map = {}
        # Counter for generating unique document IDs
        current_doc_id = 0
        # Dictionary to track the source (QuantumCore/NeoCompute) of each document
        document_sources = {}

        # --- Process PDFs ---
        # Initialize text splitter with 200-char chunks and 25-char overlap
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=25)
        for pdf_path in pdf_paths:
            print(f'Processing PDF: {pdf_path}')
            # Check if PDF file exists
            if not os.path.exists(pdf_path):
                print(f'PDF not found: {pdf_path}')
                continue
            # Load PDF content
            loader = PyPDFLoader(pdf_path)
            documents = loader.load()
            # Verify that content was loaded
            if not documents:
                print(f'No content loaded from {pdf_path}')
                continue
            # Split documents into chunks
            chunks = text_splitter.split_documents(documents)
            document_texts = [chunk.page_content for chunk in chunks]
            # Generate unique IDs for chunks
            document_ids = [str(i + current_doc_id) for i in range(len(document_texts))]
            # Map IDs to texts
            document_map.update(dict(zip(document_ids, document_texts)))
            # Extract source name (e.g., QuantumCore or NeoCompute) from filename
            source_name = os.path.basename(pdf_path).split('_')[0]
            # Associate IDs with source
            document_sources.update({doc_id: source_name for doc_id in document_ids})
            # Add texts and IDs to main lists
            all_document_texts.extend(document_texts)
            all_document_ids.extend(document_ids)
            # Update ID counter
            current_doc_id += len(document_texts)
            print(f'Created {len(document_texts)} chunks from {pdf_path}')

        # --- Validate Document Processing ---
        if not all_document_texts:
            print('No documents processed.')
            return None

        print(f'Total chunks created: {len(all_document_texts)}')

        # --- Set Up Embedding Model ---
        # Initialize ColBERT model for embeddings
        model_name = 'lightonai/GTE-ModernColBERT-v1'
        model = models.ColBERT(model_name_or_path=model_name)

        # --- Create Index for Retrieval ---
        # Set up Voyager index to store embeddings
        index_folder = 'pylate-index'
        index_name = 'pdf_index'
        index = indexes.Voyager(index_folder=index_folder, index_name=index_name, override=True)

        # --- Generate and Store Embeddings ---
        # Encode document texts into embeddings, suppressing stderr for clean output
        with contextlib.redirect_stderr(io.StringIO()):
            documents_embeddings = model.encode(
                all_document_texts,
                batch_size=32,
                is_query=False,
                show_progress_bar=False
            )
        # Add embeddings to the index
        index.add_documents(all_document_ids, documents_embeddings=documents_embeddings)

        # --- Initialize Retriever ---
        # Set up ColBERT retriever for fetching relevant chunks
        retriever = retrieve.ColBERT(index=index)

        # --- Initialize Generator ---
        # Set up FLAN-T5 model for text generation
        generator = pipeline('text2text-generation', model='google/flan-t5-base', max_length=300)

        # --- Define Few-Shot Prompt ---
        # Create a prompt template with fictional examples to guide answer extraction
        prompt_template = r"""
You are an expert assistant answering questions based solely on the provided text. Follow these rules:
1. For roles (e.g., CEO), return the full name.
2. For lists (e.g., products), return a comma-separated list, sorted alphabetically.
3. For details (e.g., specifications), return the exact detail.
4. For other questions, provide a brief answer.
5. If no answer is found, return: "The answer could not be found in the text."

**Examples**:
- Text: "Jane Smith, CEO." Question: Who is the CEO? Answer: Jane Smith
- Text: "CloudPeak, SecureVault." Question: What products? Answer: CloudPeak, SecureVault
- Text: "AlphaCore: 100 qubits." Question: Qubit count? Answer: 100 qubits

**Text**: {context}

**Question**: {question}

**Answer**:
"""
        # Create PromptTemplate object with input variables
        PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

        # --- Return Pipeline Components ---
        # Return all components needed for querying
        return model, index, retriever, generator, PROMPT, document_map, document_sources

    except Exception as e:
        # Handle any errors during pipeline setup
        print(f'Error processing PDFs: {e}')
        return None

def query_rag(model, index, retriever, generator, PROMPT, document_map, document_sources, query):
    """Process a user query with retrieval and generation, relying on Few-Shot prompt."""
    try:
        # --- Prepare Query ---
        # Convert query to a list for batch processing
        queries = [query]

        # --- Encode Query ---
        print('Encoding query...')
        # Encode the query into an embedding, suppressing stderr
        with contextlib.redirect_stderr(io.StringIO()):
            query_embedding = model.encode(
                queries,
                batch_size=32,
                is_query=True,
                show_progress_bar=False
            )

        # --- Retrieve Documents ---
        print('Retrieving documents...')
        # Retrieve top 15 relevant document chunks
        with contextlib.redirect_stderr(io.StringIO()):
            top_k_initial = 15
            initial_results = retriever.retrieve(queries_embeddings=query_embedding, k=top_k_initial)

        # Check if any documents were retrieved
        if not initial_results or not initial_results[0]:
            print('No documents retrieved.')
            return None, 'No relevant documents found.'

        # Extract document IDs from results
        retrieved_doc_ids = [result['id'] for result in initial_results[0] if 'id' in result]
        if not retrieved_doc_ids:
            print('No document IDs after retrieval.')
            return None, 'No relevant documents found.'

        # --- Filter by Company ---
        # Determine company based on query keywords
        company = None
        filtered_doc_ids = retrieved_doc_ids
        if 'quantumcore' in query.lower():
            company = 'QuantumCore'
        elif 'neocompute' in query.lower():
            company = 'NeoCompute'
        if company:
            # Filter documents by company source
            filtered_doc_ids = [doc_id for doc_id in retrieved_doc_ids if document_sources.get(doc_id) == company]
            if not filtered_doc_ids:
                # Fallback to all documents if no company-specific chunks found
                print(f'No documents found for company: {company}. Falling back to all documents.')
                filtered_doc_ids = retrieved_doc_ids

        # --- Retrieve Document Texts ---
        # Get text content for filtered document IDs
        retrieved_documents = [document_map.get(doc_id, '') for doc_id in filtered_doc_ids]
        retrieved_documents = [doc for doc in retrieved_documents if doc]

        # Check if any valid documents remain
        if not retrieved_documents:
            print('No valid documents after filtering.')
            return None, 'No relevant documents found.'

        # --- Rerank Documents ---
        print('Reranking documents...')
        # Rerank documents to select top 3 most relevant
        with contextlib.redirect_stderr(io.StringIO()):
            reranked_results = rank.rerank(
                documents_ids=[filtered_doc_ids],
                queries_embeddings=query_embedding,
                documents_embeddings=[model.encode(retrieved_documents, is_query=False, show_progress_bar=False)]
            )

        # Extract reranked document IDs
        reranked_doc_ids = []
        if reranked_results and isinstance(reranked_results[0], list):
            for result in reranked_results[0]:
                if isinstance(result, dict) and 'id' in result:
                    reranked_doc_ids.append(result['id'])
                elif isinstance(result, str):
                    reranked_doc_ids.append(result)
        else:
            # Fallback to top 3 filtered IDs if reranking fails
            reranked_doc_ids = filtered_doc_ids[:3]

        # Check if reranked IDs are valid
        if not reranked_doc_ids:
            print('No document IDs after reranking.')
            return None, 'No relevant documents found.'

        # Get texts for reranked documents
        reranked_documents = [document_map.get(doc_id, '') for doc_id in reranked_doc_ids]
        reranked_documents = [doc for doc in reranked_documents if doc]

        # --- Build Context ---
        # Combine top 3 documents into context, limiting to 500 characters
        max_context_length = 500
        context = '\n'.join(reranked_documents[:3])[:max_context_length]
        if not context:
            print('No context generated.')
            return None, 'No relevant context found.'

        # --- Generate Prompt ---
        # Format the prompt with context and question
        prompt_text = PROMPT.format(context=context, question=query)

        # --- Generate Answer ---
        print('Generating answer...')
        # Use FLAN-T5 to generate the answer
        response = generator(prompt_text)[0]['generated_text']
        answer = response.strip()

        # Handle empty or invalid responses
        if not answer or answer.lower() == 'none':
            answer = 'The answer could not be found in the text.'

        # Return context and answer
        return context, answer

    except Exception as e:
        # Handle any errors during query processing
        print(f'Error processing query: {e}')
        return None, 'Error processing query.'
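The context-building step can be examined on its own. The sketch below mirrors the join-then-truncate logic with plain string formatting instead of `PromptTemplate`; the shortened template and the helper names `build_context` and `build_prompt` are hypothetical stand-ins for illustration.

```python
# Shortened stand-in for the notebook's few-shot template (assumption).
PROMPT_TEMPLATE = "Text: {context}\n\nQuestion: {question}\n\nAnswer:"


def build_context(chunks, top_n=3, max_context_length=500):
    """Join the first top_n chunks with newlines and cap the total length."""
    return "\n".join(chunks[:top_n])[:max_context_length]


def build_prompt(chunks, question):
    """Fill the template with the capped context and the user question."""
    return PROMPT_TEMPLATE.format(context=build_context(chunks), question=question)


# Four full-size 200-character chunks; only the first three are considered.
chunks = ["a" * 200, "b" * 200, "c" * 200, "d" * 200]
context = build_context(chunks)
print(len(context))  # 500
```

With 200-character chunks, three chunks joined by newlines can reach 602 characters, so the 500-character cap can cut the third chunk short. If the answer tends to sit in the lowest-ranked chunk, raising the cap or lowering the chunk size is worth testing.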

## Cell 4: Process PDFs

This cell specifies the paths to the PDF datasets (`QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf` and `NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf`) and initializes the RAG pipeline by calling `run_rag_pipeline`. If running locally, ensure the PDFs are in the `/data` directory. In Google Colab, the cell checks for missing files and prompts the user to upload them. The pipeline is initialized with the ColBERT model, Voyager index, retriever, FLAN-T5 generator, prompt template, document map, and source tracking for company-specific filtering.
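Source tracking depends on the PDF filenames: the pipeline takes the part of the basename before the first underscore as the company label. A small sketch of that rule, using a hypothetical helper name `source_from_path`:

```python
import os


def source_from_path(pdf_path):
    """Return the filename prefix before the first underscore."""
    return os.path.basename(pdf_path).split("_")[0]


print(source_from_path("/content/QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf"))
# QuantumCore
```

A file whose name has no underscore yields the whole filename (extension included) as its label, so company filtering only works for uploads that keep the `QuantumCore_...` or `NeoCompute_...` prefix.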

In [9]:
# Define paths to PDF datasets
pdf_paths = [
    '/data/QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf',
    '/data/NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf'
]

# --- Handle Missing PDFs ---
# Check if PDFs exist; prompt for upload if not found
if not all(os.path.exists(pdf_path) for pdf_path in pdf_paths):
    print('Please upload your PDF files (QuantumCore_v1.pdf and/or NeoCompute_v2.pdf):')
    # Allow user to upload PDFs in Colab
    uploaded = files.upload()
    # Update paths to uploaded files
    pdf_paths = [f'/content/{name}' for name in uploaded.keys()]

# --- Initialize Pipeline ---
# Run the RAG pipeline with the PDF paths
result = run_rag_pipeline(pdf_paths)
if result:
    # Unpack pipeline components if successful
    model, index, retriever, generator, PROMPT, document_map, document_sources = result
    print('RAG pipeline initialized successfully.')
else:
    # Report failure if pipeline initialization fails
    print('Failed to initialize RAG pipeline.')

Please upload your PDF files (QuantumCore_v1.pdf and/or NeoCompute_v2.pdf):


Saving QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf to QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf
Saving NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf to NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf
Processing PDF: /content/QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf
Created 22 chunks from /content/QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf
Processing PDF: /content/NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf
Created 16 chunks from /content/NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf
Total chunks created: 38


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:00<00:00,  7.57it/s]


Device set to use cpu


RAG pipeline initialized successfully.


## Cell 5: Interactive Querying

This cell provides an interactive interface to query the RAG system. Users can enter custom queries or use provided examples. The system retrieves relevant chunks, generates an answer, and displays both the context and response. Example queries test various aspects of the pipeline:
- Role queries (e.g., CEO name) return the person's full name.
- Product queries return a comma-separated list of product names.
- Specification queries (e.g., qubit count) extract specific details.
- Compliance queries return a comma-separated list of standards.

**Example Queries**:
- Who is the CEO of QuantumCore Solutions?
- What are the products offered by NeoCompute Technologies?
- What is the qubit count of QubitCore?
- What compliance standards does NeoCompute follow?

In [12]:
def interactive_query():
    """Run an interactive query loop with formatted output for user queries."""
    # --- Display Interface Header ---
    # Print a clean header for the query interface
    print('=====================================')
    print('Interactive RAG Query Interface')
    print('=====================================')
    print('Enter your query (or type "exit" to quit):')

    # --- Query Loop ---
    while True:
        # Prompt user for a query
        query = input('Query: ').strip()
        # Check if user wants to exit
        if query.lower() == 'exit':
            print('\nExiting query interface.')
            break
        # Verify that pipeline is initialized
        if not result:
            print('\nError: RAG pipeline not initialized. Please run Cell 4 first.')
            break

        # --- Process Query ---
        # Run the query through the RAG pipeline
        context, answer = query_rag(model, index, retriever, generator, PROMPT, document_map, document_sources, query)

        # --- Display Results ---
        # Print formatted query results
        print('\n====================================')
        print(f'Query: {query}')
        print('====================================')
        print('\n**Context Retrieved**:\n')
        if context is None:
            # Handle case where no context was retrieved
            print('    Error: No context retrieved.')
        else:
            # Format context with indentation for readability
            indented_context = context.replace('\n', '\n    ')
            print(f'    {indented_context}')
        print('\n---')
        print('\n**Answer**:\n')
        # Display the generated answer
        print(f'    {answer}')
        print('\n====================================\n')
        print('Enter your next query (or type "exit" to quit):')

# --- Run Interactive Query Interface ---
#interactive_query()

## Cell 6: Interactive Querying



**Example Queries**:
- Who is the CEO of QuantumCore Solutions? → 'Dr. Elena Ruiz'
- What are the products offered by NeoCompute Technologies? → 'NeoCloud, NeoSecure'
- Who is the CIO of NeoCompute Technologies? → 'The answer could not be found in the text.'
- Who is the CEO of NeoCompute Technologies? → 'The answer could not be found in the text.'
- What is the qubit count of QubitCore? → '50-qubit superconducting architecture'
- What compliance standards does NeoCompute follow? → 'ISO/IEC 27001, SOC 2 Type II'
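The expected answers listed above can double as a small regression check. The sketch below compares answers by exact match after normalizing case and whitespace. The helper names are hypothetical, and the `actual` dictionary holds made-up values for illustration only; they are not outputs of the pipeline.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def score_answers(expected, actual):
    """Return the fraction of questions whose answer matches the expected one."""
    if not expected:
        return 0.0
    hits = sum(
        1 for question, want in expected.items()
        if normalize(actual.get(question, "")) == normalize(want)
    )
    return hits / len(expected)


expected = {
    "Who is the CEO of QuantumCore Solutions?": "Dr. Elena Ruiz",
    "What are the products offered by NeoCompute Technologies?": "NeoCloud, NeoSecure",
}
# Illustrative values only (assumption): one match, one partial answer.
actual = {
    "Who is the CEO of QuantumCore Solutions?": "dr. elena  ruiz",
    "What are the products offered by NeoCompute Technologies?": "NeoSecure",
}
print(score_answers(expected, actual))  # 0.5
```

In the notebook, `actual` could be filled by calling `query_rag` once per question and keeping the second element of the returned tuple. Exact match is strict, so a partially correct list counts as a miss.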

In [13]:
# --- Run Interactive Query Interface ---
interactive_query()

Interactive RAG Query Interface
Enter your query (or type "exit" to quit):
Query: Who is the CIO of NeoCompute Technologies?
Encoding query...
Retrieving documents...
Reranking documents...
Generating answer...

Query: Who is the CIO of NeoCompute Technologies?

**Context Retrieved**:

    NeoCompute Technologies - Product & Company Overview
    1. Company Overview
    NeoCompute Technologies is a next-generation AI hardware and software company focused on high-performance edge
    NeoCompute products are used in finance, healthcare, autonomous vehicles, and smart cities, supporting both private
    and public sector clients.
    2. Contact Information
    - SLA: 99.99% Uptime, 24/7 Global Support
    4. Compliance and Security
    NeoCompute follows strict compliance frameworks to meet enterprise and government

---

**Answer**:

    The answer could not be found in the text.


Enter your next query (or type "exit" to quit):


KeyboardInterrupt: Interrupted by user