# Retrieval-Augmented Generation (RAG) Pipeline Demo

This Jupyter Notebook implements a minimal Retrieval-Augmented Generation (RAG) pipeline for a take-home interview project. The system answers user queries from two PDF datasets, `QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf` and `NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf`, to show that a single pipeline can handle queries across different documents.

## Objective
- Combine retrieval and generative AI to answer queries grounded in PDF content.
- Use small, CPU-friendly models (`lightonai/GTE-ModernColBERT-v1`, `google/flan-t5-base`) to minimize resource usage.
- Keep post-processing minimal: regex extraction for role and product queries, plus deduplication of list answers.
- Support interactive querying for demo purposes, with example queries provided.

## Architecture
- **Knowledge Base**: PDFs are loaded using `PyPDFLoader`, split into chunks with `RecursiveCharacterTextSplitter` (300 characters, 50 overlap), and stored in memory as a dictionary mapping document IDs to text chunks.
- **Semantic Layer**: Chunks and queries are embedded using `lightonai/GTE-ModernColBERT-v1` for semantic comparison. As a ColBERT-style model, it produces one embedding per token rather than a single vector per text.
- **Retrieval System**: `retrieve.ColBERT` fetches the top 15 chunks for the query embeddings. These candidates are then reranked with `rank.rerank`, and the top 3 are kept.
- **Augmentation**: The top 3 chunks (up to 600 characters total) are combined with the query using a `PromptTemplate` to create a contextualized input for generation.
- **Generation**: `google/flan-t5-base` generates concise answers. Regex-based post-processing then extracts the person's name for role queries (e.g., CEO, CTO) and the product names for product queries; list answers are deduplicated via sets.
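
The chunking step in the Knowledge Base bullet can be illustrated with a simplified sliding window. This is only a sketch: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, whereas the hypothetical `chunk_text` helper below cuts at fixed character offsets. It is meant to show how a 300-character chunk size and a 50-character overlap interact, not to reproduce the library's algorithm.

```python
def chunk_text(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap (simplified sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # with the defaults, each window starts 250 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 700
print([len(chunk) for chunk in chunk_text(sample)])  # [300, 300, 200]
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, which reduces the chance that a fact is split across a chunk boundary.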

## Setup
- **Dependencies**: `pylate`, `langchain`, `langchain-community`, `transformers`, `google-colab`, `pypdf`, `hf_xet`.
- **Environment**: Designed for Google Colab on CPU, so no GPU is required.
- **Datasets**: Processes `QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf` and `NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf`.

## Instructions
1. Run Cell 1 to install dependencies.
2. Run Cell 2 to import libraries.
3. Run Cell 3 to define the RAG pipeline functions.
4. Run Cell 4 to process the PDFs and initialize the pipeline.
5. Run Cell 5 to interactively query the system with example queries or custom inputs.

The pipeline processes both PDFs sequentially, combining their chunks into a single knowledge base for querying.
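
A minimal sketch of how chunks from several PDFs can share one ID space is shown here. The helper name `build_knowledge_base` and the toy file names are illustrative assumptions; the idea, mirrored from the pipeline, is that a running offset keeps chunk IDs unique across documents.

```python
def build_knowledge_base(chunks_per_pdf: dict[str, list[str]]) -> dict[str, str]:
    """Map globally unique string IDs to chunks, processing PDFs in order."""
    document_map = {}
    next_id = 0  # running offset so IDs never collide across PDFs
    for pdf_name, chunks in chunks_per_pdf.items():
        for offset, chunk in enumerate(chunks):
            document_map[str(next_id + offset)] = chunk
        next_id += len(chunks)
    return document_map


# Toy input standing in for the chunk lists of two PDFs (hypothetical names)
toy = {"first.pdf": ["alpha", "beta"], "second.pdf": ["gamma"]}
print(build_knowledge_base(toy))  # {'0': 'alpha', '1': 'beta', '2': 'gamma'}
```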

## Cell 1: Install Dependencies

Install required libraries for the RAG pipeline. This ensures compatibility in a clean Google Colab environment.

In [None]:
!pip install pylate langchain transformers google-colab
!pip install -U langchain-community pypdf hf_xet

## Cell 2: Import Libraries

Import necessary libraries and suppress warnings for cleaner output.
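
As a small, self-contained illustration of how warning filters behave, the sketch below records warnings inside a `catch_warnings` block and ignores only one category. The function name is an assumption made for this example. The notebook additionally passes a `module` argument, which the standard library treats as a regular expression matched against the name of the module that issues the warning.

```python
import warnings


def collect_unfiltered_warnings():
    """Return the category names of warnings that pass an 'ignore UserWarning' filter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record everything by default
        warnings.filterwarnings("ignore", category=UserWarning)  # then drop UserWarning
        warnings.warn("noisy parser message", UserWarning)
        warnings.warn("old API", DeprecationWarning)
    return [w.category.__name__ for w in caught]


print(collect_unfiltered_warnings())  # ['DeprecationWarning']
```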

In [None]:
import warnings
from pylate import models, indexes, retrieve, rank
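# Import from the split LangChain packages; the older langchain.* paths are deprecated
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate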
from google.colab import files
import os
from transformers import pipeline
import re

# Suppress warnings for cleaner output (the function name is all lowercase: filterwarnings)
warnings.filterwarnings('ignore', category=UserWarning, module='pypdf._reader')
warnings.filterwarnings('ignore', category=DeprecationWarning, module='pypdf._reader')

## Cell 3: Define RAG Pipeline

Define the RAG pipeline functions to process PDFs and handle queries:
- `run_rag_pipeline`: Loads and chunks PDFs, embeds chunks using ColBERT, indexes embeddings in Voyager, and initializes the retriever and generator.
- `query_rag`: Encodes queries, retrieves and reranks relevant chunks, augments the query with context, and generates answers with post-processing for role and product queries.
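
To give an intuition for the reranking step, here is a pure-Python sketch of ColBERT-style late-interaction (MaxSim) scoring. It is not PyLate's implementation, which works on tensors; the helper names and the tiny two-dimensional "token embeddings" are assumptions for illustration only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank(query_tokens, candidates, top_n=3):
    """Sort candidate chunk IDs by descending MaxSim score and keep the top_n."""
    scored = [(doc_id, maxsim_score(query_tokens, toks)) for doc_id, toks in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy per-token embeddings: one list of vectors per query or chunk
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "0": [[1.0, 0.0], [0.0, 1.0]],  # covers both query tokens
    "1": [[1.0, 0.0], [1.0, 0.0]],  # covers only the first query token
    "2": [[0.0, 0.0]],              # covers neither
}
print(rerank(query, candidates, top_n=2))  # [('0', 2.0), ('1', 1.0)]
```

Because every query token looks for its own best match, a chunk scores highly only when it covers all parts of the question, which is why per-token embeddings are kept instead of one pooled vector.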

In [None]:
def run_rag_pipeline(pdf_paths):
    try:
        # Initialize combined document storage
        all_document_texts = []
        all_document_ids = []
        document_map = {}
        current_doc_id = 0

        # Load and Chunk PDFs
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
        for pdf_path in pdf_paths:
            print(f'Processing PDF: {pdf_path}')
            loader = PyPDFLoader(pdf_path)
            documents = loader.load()
            chunks = text_splitter.split_documents(documents)
            document_texts = [chunk.page_content for chunk in chunks]
            document_ids = [str(i + current_doc_id) for i in range(len(document_texts))]
            document_map.update(dict(zip(document_ids, document_texts)))
            all_document_texts.extend(document_texts)
            all_document_ids.extend(document_ids)
            current_doc_id += len(document_texts)
            print(f'Created {len(document_texts)} chunks from {pdf_path}')

        print(f'Total chunks created: {len(all_document_texts)}')

        # Load ColBERT Model
        model_name = 'lightonai/GTE-ModernColBERT-v1'
        model = models.ColBERT(model_name_or_path=model_name)

        # Initialize Voyager Index
        index_folder = 'pylate-index'
        index_name = 'pdf_index'
        index = indexes.Voyager(index_folder=index_folder, index_name=index_name, override=True)

        # Create and Index Embeddings
        documents_embeddings = model.encode(
            all_document_texts,
            batch_size=32,
            is_query=False,
            show_progress_bar=True
        )
        index.add_documents(all_document_ids, documents_embeddings=documents_embeddings)

        # Initialize Retriever
        retriever = retrieve.ColBERT(index=index)

        # Initialize FLAN-T5 Generator
        generator = pipeline('text2text-generation', model='google/flan-t5-base', max_length=300)

        # Define Prompt Template
        prompt_template = '''Using only the provided text, answer the user's question with a concise and accurate response. For questions about specific roles (e.g., CEO, CTO, CFO), return only the full name of the individual in that role. For questions about lists (e.g., products), return all items as a comma-separated list of names only. Exclude any details not directly relevant to the question, such as technical specifications, unless explicitly requested. If the answer is not in the text, respond with 'The answer could not be found in the text.'

Text: {context}

Question: {question}

Answer:''' 
        PROMPT = PromptTemplate(template=prompt_template, input_variables=['context', 'question'])

        return model, index, retriever, generator, PROMPT, document_map

    except Exception as e:
        print(f'Error processing PDFs: {e}')
        return None

def query_rag(model, index, retriever, generator, PROMPT, document_map, query):
    try:
        queries = [query]

        # Encode Query
        query_embedding = model.encode(
            queries,
            batch_size=32,
            is_query=True,
            show_progress_bar=True
        )

        # Retrieve Top Documents
        top_k_initial = 15
        initial_results = retriever.retrieve(queries_embeddings=query_embedding, k=top_k_initial)
        retrieved_doc_ids = [result['id'] for result in initial_results[0]]
        retrieved_documents = [document_map[doc_id] for doc_id in retrieved_doc_ids]

        # Rerank Documents
        reranked_results = rank.rerank(
            documents_ids=[retrieved_doc_ids],
            queries_embeddings=query_embedding,
            documents_embeddings=[model.encode(retrieved_documents, is_query=False)]
        )

        # Get Reranked Documents
        reranked_doc_ids = []
        if reranked_results and isinstance(reranked_results[0], list):
            for result in reranked_results[0]:
                if isinstance(result, dict) and 'id' in result:
                    reranked_doc_ids.append(result['id'])
                elif isinstance(result, str):
                    reranked_doc_ids.append(result)
        else:
            reranked_doc_ids = retrieved_doc_ids

        reranked_documents = [document_map[doc_id] for doc_id in reranked_doc_ids]

        # Create Context
        max_context_length = 600
        context = '\n'.join(reranked_documents[:3])[:max_context_length]
        prompt_text = PROMPT.format(context=context, question=query)

        # Generate Answer
        response = generator(prompt_text)[0]['generated_text']
        answer = response.strip()

        # Post-processing
        if ', ' in answer:
            items = set(answer.split(', '))
            answer = ', '.join(sorted(items)) if items else 'The answer could not be found in the text.'
        # Match the role as a whole word so that e.g. 'director' does not trigger the CTO branch
        role_match = re.search(r'\b(ceo|cto|cfo)\b', query.lower())
        if role_match:
            role = role_match.group(1).upper()
            # Assumes the PDFs list roles as '- <Name>, <ROLE>: ...'
            match = re.search(rf'- ([^,]+), {role}:', context)
            if match:
                answer = match.group(1).strip()
            else:
                answer = 'The answer could not be found in the text.'
        if 'product' in query.lower():
            product_names = re.findall(r'- (\w+):', context)
            if product_names:
                answer = ', '.join(sorted(set(product_names)))
            else:
                answer = 'The answer could not be found in the text.'

        return context, answer

    except Exception as e:
        print(f'Error processing query: {e}')
        return None, 'Error processing query.'

## Cell 4: Process PDFs

Specify the paths to the PDFs (`QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf`, `NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf`) and initialize the RAG pipeline. If running locally, ensure the PDFs are in the specified directory (e.g., `/data`). In Colab, upload the PDFs manually when prompted.
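
Before initializing the pipeline, it can help to see exactly which files are missing. The sketch below uses only the standard library; `find_missing_pdfs` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def find_missing_pdfs(pdf_paths):
    """Return the subset of paths that do not point to an existing file."""
    return [path for path in pdf_paths if not Path(path).is_file()]


expected = [
    "/data/QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf",
    "/data/NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf",
]
missing = find_missing_pdfs(expected)
if missing:
    print("Missing PDFs:", missing)
else:
    print("All PDFs found.")
```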

In [None]:
# Specify PDF paths (modify as needed for your environment)
pdf_paths = [
    '/data/QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf',
    '/data/NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf'
]

# If any PDF is missing, prompt for an upload (files.upload() works only in Colab)
if not all(os.path.exists(pdf_path) for pdf_path in pdf_paths):
    print('Please upload your PDF files (QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf and/or NeoCompute_Technologies_RAG_Demo_Dataset_v2.pdf):')
    uploaded = files.upload()
    pdf_paths = list(uploaded.keys())

# Run the RAG pipeline
result = run_rag_pipeline(pdf_paths)
if result:
    model, index, retriever, generator, PROMPT, document_map = result
    print('RAG pipeline initialized successfully.')
else:
    print('Failed to initialize RAG pipeline.')

## Cell 5: Interactive Querying

Run this cell to interactively query the RAG system. Example queries are provided based on the PDFs. Enter a custom query or use one of the examples below. The system will retrieve relevant chunks, generate an answer, and display both the context and response.
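
The context displayed alongside each answer is the joined top 3 chunks, cut to 600 characters. The following sketch shows that assembly with a shortened prompt; `build_prompt` and the placeholder chunks are assumptions for illustration, and the real notebook uses a longer instruction inside a `PromptTemplate`.

```python
def build_prompt(ranked_chunks, question, top_n=3, max_context_length=600):
    """Join the best chunks, truncate to the character budget, and fill a prompt."""
    context = "\n".join(ranked_chunks[:top_n])[:max_context_length]
    prompt = f"Text: {context}\n\nQuestion: {question}\n\nAnswer:"
    return context, prompt


# Four full-size placeholder chunks, already ordered by relevance
chunks = ["a" * 300, "b" * 300, "c" * 300, "d" * 300]
context, prompt = build_prompt(chunks, "What is shown?")
print(len(context))    # 600
print("c" in context)  # False
```

One consequence worth noting: when the first two chunks are close to the 300-character limit, the 600-character cap leaves little or no room for the third chunk.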

**Example Queries**:
- Who is the CEO of QuantumCore Solutions?
- What are the products offered by NeoCompute Technologies?
- What is the qubit count of QubitCore?
- What compliance standards does NeoCompute follow?
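
For role queries such as the first example, the answer is taken from the retrieved context with a regex, not from the generator's free text. The sketch below assumes the PDFs list leadership as `- <Name>, <ROLE>: ...`, which is inferred from the pattern used in the pipeline. The names in `sample_context` are made-up placeholders, not content from the datasets, and both helper functions are hypothetical.

```python
import re

# Whole-word match so that words like "director" do not count as "CTO"
ROLE_PATTERN = re.compile(r"\b(ceo|cto|cfo)\b", re.IGNORECASE)


def extract_role_holder(query, context):
    """Return the name listed for the role mentioned in the query, or None."""
    role_match = ROLE_PATTERN.search(query)
    if role_match is None:
        return None
    role = role_match.group(1).upper()
    name_match = re.search(rf"- ([^,]+), {role}:", context)
    return name_match.group(1).strip() if name_match else None


def dedupe_list_answer(answer):
    """Remove duplicate items from a comma-separated answer and sort them."""
    return ", ".join(sorted(set(answer.split(", "))))


sample_context = (
    "Leadership\n"
    "- Jane Placeholder, CEO: sets strategy\n"
    "- John Placeholder, CTO: leads engineering"
)
print(extract_role_holder("Who is the CEO of QuantumCore Solutions?", sample_context))  # Jane Placeholder
print(dedupe_list_answer("Beta, Alpha, Beta"))  # Alpha, Beta
```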

In [None]:
def interactive_query():
    print('Enter your query (or type "exit" to quit):')
    while True:
        query = input('Query: ')
        if query.lower() == 'exit':
            print('Exiting query interface.')
            break
        if not result:
            print('RAG pipeline not initialized. Please run Cell 4 first.')
            break
        context, answer = query_rag(model, index, retriever, generator, PROMPT, document_map, query)
        print('\n**Context Retrieved**:\n', context)
        print('\n**Answer**:\n', answer, '\n')

# Run interactive query interface
interactive_query()