# Retrieval-Augmented Generation (RAG) Pipeline Demo

This Jupyter Notebook demonstrates a minimal Retrieval-Augmented Generation (RAG) pipeline built for a take-home interview project.

The system answers user queries based on two PDF datasets: `QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf` and `NeoCompute_Technologies_RAG_Demo_Dataset_v3.pdf`, showcasing versatility for varied queries.

## Objective
- Combine retrieval and generative AI to answer queries grounded in PDF content.
- Use small, CPU-friendly models (`lightonai/GTE-ModernColBERT-v1`, `google/flan-t5-base`).
- Keep post-processing minimal: regex for roles/products plus deduplication.
- Support interactive querying for demo purposes.

## Architecture
- **Knowledge Base**: PDFs are loaded (`PyPDFLoader`), split into chunks (`RecursiveCharacterTextSplitter`), and stored in memory.
- **Semantic Layer**: Chunks and queries are embedded using `lightonai/GTE-ModernColBERT-v1` for semantic comparison; chunk embeddings are indexed with `indexes.Voyager`.
- **Retrieval System**: `retrieve.ColBERT` fetches top 15 chunks, reranked to top 3 (`rank.rerank`).
- **Augmentation**: Retrieved chunks (600-char limit) are combined with the query via `PromptTemplate`.
- **Generation**: `google/flan-t5-base` generates answers, post-processed with regex and deduplication.
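
To make the chunking step concrete, here is a minimal sketch of fixed-size splitting with overlap, using the notebook's settings (`chunk_size=300`, `chunk_overlap=50`). It is a simplified stand-in: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph and line breaks, which this version ignores. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=300, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: the real splitter also tries to break
    on natural separators before falling back to raw character counts.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "x" * 700
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

Each chunk repeats the last 50 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk in most cases.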

## Setup
- **Dependencies**: `pylate`, `langchain`, `langchain-community`, `transformers`, `google-colab`, `pypdf`, `hf_xet`.
- **Environment**: Google Colab with CPU.
- **Datasets**: PDFs in `/data` folder or uploaded manually.

Run the cells below to set up and test the pipeline.

## Cell 1: Install Dependencies

Install required libraries for the RAG pipeline. This ensures the notebook runs in a clean Colab environment.

In [18]:
!pip install pylate langchain transformers google-colab
!pip install -U langchain-community pypdf hf_xet



## Cell 2: Import Libraries

This cell imports the required libraries and suppresses noisy `pypdf` warnings.

In [19]:
import warnings
from pylate import models, indexes, retrieve, rank
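# Import from the split LangChain packages; the old `langchain.*` paths are deprecated
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate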
from google.colab import files
import os
from transformers import pipeline
import re

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore', category=UserWarning, module='pypdf._reader')
warnings.filterwarnings('ignore', category=DeprecationWarning, module='pypdf._reader')

## Cell 2 (continued): Define RAG Pipeline

This cell defines the RAG pipeline. The pipeline:
- Loads and chunks PDFs.
- Embeds chunks using ColBERT.
- Indexes embeddings in Voyager.
- Retrieves and reranks chunks for queries.
- Augments queries with context.
- Generates answers with `flan-t5-base`, applying minimal post-processing.
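
ColBERT-style models produce one embedding per token rather than one per text, and score a chunk against a query by late interaction (often called MaxSim): each query token takes its best-matching document token, and those maxima are summed. The sketch below illustrates that scoring rule in plain Python with toy 2-dimensional vectors. It is not PyLate's implementation, and the function name `maxsim_score` is hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the best match.

    Both arguments are lists of token embeddings (lists of floats).
    """
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy embeddings: two query tokens, two candidate chunks.
query = [[1.0, 0.0], [0.0, 1.0]]
chunk_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
chunk_b = [[1.0, 0.0], [0.7, 0.0]]              # matches only the first

scores = {"a": maxsim_score(query, chunk_a), "b": maxsim_score(query, chunk_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Reranking the retrieved candidates amounts to recomputing this kind of score for each one and sorting by it.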

In [20]:


def run_rag_pipeline(pdf_path):
    try:
        # Load and Chunk PDF
        print(f'Processing PDF: {pdf_path}')
        loader = PyPDFLoader(pdf_path)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
        chunks = text_splitter.split_documents(documents)
        document_texts = [chunk.page_content for chunk in chunks]
        document_ids = [str(i) for i in range(len(document_texts))]
        document_map = dict(zip(document_ids, document_texts))
        print(f'Created {len(document_texts)} chunks')

        # Load ColBERT Model
        model_name = 'lightonai/GTE-ModernColBERT-v1'
        model = models.ColBERT(model_name_or_path=model_name)

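        # Initialize Voyager Index. The index name is derived from the PDF file name
        # so that processing a second PDF does not override the first PDF's index.
        index_folder = 'pylate-index'
        index_name = re.sub(r'\W+', '_', os.path.splitext(os.path.basename(pdf_path))[0])
        index = indexes.Voyager(index_folder=index_folder, index_name=index_name, override=True)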

        # Create and Index Embeddings
        documents_embeddings = model.encode(
            document_texts,
            batch_size=32,
            is_query=False,
            show_progress_bar=True
        )
        index.add_documents(document_ids, documents_embeddings=documents_embeddings)

        # Initialize Retriever
        retriever = retrieve.ColBERT(index=index)

        # Initialize FLAN-T5 Generator
        generator = pipeline('text2text-generation', model='google/flan-t5-base', max_length=300)

        # Define Prompt Template
        prompt_template = '''Using only the provided text, answer the user's question with a concise and accurate response. For questions about specific roles (e.g., CEO, CTO, CFO), return only the full name of the individual in that role. For questions about lists (e.g., products), return all items as a comma-separated list of names only. Exclude any details not directly relevant to the question, such as technical specifications, unless explicitly requested. If the answer is not in the text, respond with 'The answer could not be found in the text.'

Text: {context}

Question: {question}

Answer:'''
        PROMPT = PromptTemplate(template=prompt_template, input_variables=['context', 'question'])

        return model, index, retriever, generator, PROMPT, document_map

    except Exception as e:
        print(f'Error processing PDF: {e}')
        return None

def query_rag(model, index, retriever, generator, PROMPT, document_map, query):
    try:
        queries = [query]

        # Encode Query
        query_embedding = model.encode(
            queries,
            batch_size=32,
            is_query=True,
            show_progress_bar=True
        )

        # Retrieve Top Documents
        top_k_initial = 15
        initial_results = retriever.retrieve(queries_embeddings=query_embedding, k=top_k_initial)
        retrieved_doc_ids = [result['id'] for result in initial_results[0]]
        retrieved_documents = [document_map[doc_id] for doc_id in retrieved_doc_ids]

        # Rerank Documents
        reranked_results = rank.rerank(
            documents_ids=[retrieved_doc_ids],
            queries_embeddings=query_embedding,
            documents_embeddings=[model.encode(retrieved_documents, is_query=False)]
        )

        # Get Reranked Documents
        reranked_doc_ids = []
        if reranked_results and isinstance(reranked_results[0], list):
            for result in reranked_results[0]:
                if isinstance(result, dict) and 'id' in result:
                    reranked_doc_ids.append(result['id'])
                elif isinstance(result, str):
                    reranked_doc_ids.append(result)
        else:
            reranked_doc_ids = retrieved_doc_ids

        reranked_documents = [document_map[doc_id] for doc_id in reranked_doc_ids]

        # Create Context
        max_context_length = 600
        context = '\n'.join(reranked_documents[:3])[:max_context_length]
        prompt_text = PROMPT.format(context=context, question=query)

        # Generate Answer
        response = generator(prompt_text)[0]['generated_text']
        answer = response.strip()

        # Post-processing
        if ', ' in answer:
            items = set(answer.split(', '))
            answer = ', '.join(sorted(items)) if items else 'The answer could not be found in the text.'
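        # Match the role as a whole word so substrings (e.g. 'cto' in 'director') are
        # ignored, and so queries without the phrase 'who is' no longer raise IndexError.
        role_match = re.search(r'\b(ceo|cto|cfo)\b', query.lower())
        if role_match:
            role = role_match.group(1).upper()
            match = re.search(rf'- ([^,]+), {role}:', context)
            if match:
                answer = match.group(1).strip()
            else:
                answer = 'The answer could not be found in the text.'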
        if 'product' in query.lower():
            product_names = re.findall(r'- (\w+):', context)
            if product_names:
                answer = ', '.join(sorted(set(product_names)))
            else:
                answer = 'The answer could not be found in the text.'

        return context, answer

    except Exception as e:
        print(f'Error processing query: {e}')
        return None, 'Error processing query.'
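
One caveat about the deduplication step: `sorted(set(...))` removes duplicates but also reorders the items, which is harmless for a list of names but scrambles a prose answer that merely contains commas. A minimal order-preserving alternative, offered as a sketch rather than as part of the pipeline above (the helper name `dedupe_preserving_order` is hypothetical), could look like this:

```python
def dedupe_preserving_order(answer, sep=", "):
    """Drop repeated items from a separator-delimited answer, keeping order."""
    items = [item.strip() for item in answer.split(sep) if item.strip()]
    # dict keys are unique and keep insertion order (Python 3.7+)
    return sep.join(dict.fromkeys(items))


print(dedupe_preserving_order("QubitCore, QuantumNet, QubitCore"))
print(dedupe_preserving_order("pharmaceuticals, logistics, and defense"))
```

The first call drops the repeated `QubitCore`; the second leaves the sentence fragment in its original order.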

## Cell 3: Upload or Specify PDFs

Upload the PDFs (`QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf`, `NeoCompute_Technologies_RAG_Demo_Dataset_v3.pdf`) or specify their paths if they are already in `/data`. This cell prepares the knowledge base.

In [23]:
# Option 1: Upload PDFs
print('Please upload your PDF files (QuantumCore_v1.pdf and/or NeoCompute_v3.pdf):')
uploaded = files.upload()
pdf_paths = list(uploaded.keys())

# Option 2: Specify pre-uploaded PDFs
# pdf_paths = ['/data/QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf', '/data/NeoCompute_Technologies_RAG_Demo_Dataset_v3.pdf']

if not pdf_paths:
    print('No PDFs provided. Please upload or specify paths.')
else:
    print(f'PDFs to process: {pdf_paths}')

Please upload your PDF files (QuantumCore_v1.pdf and/or NeoCompute_v3.pdf):


Saving QuantumCore_Solutions_RAG_Demo_Dataset_v1.pdf to QuantumCore_Solutions_RAG_Demo_Dataset_v1 (2).pdf
PDFs to process: ['QuantumCore_Solutions_RAG_Demo_Dataset_v1 (2).pdf']


## Cell 4: Process PDFs and Initialize Pipeline

Process each PDF to create chunks, embeddings, and an index, then initialize the RAG components for querying. This cell builds one pipeline per uploaded PDF; in the run shown below, only the QuantumCore PDF was uploaded.

In [24]:
pipelines = {}
for pdf_path in pdf_paths:
    result = run_rag_pipeline(pdf_path)
    if result:
        model, index, retriever, generator, PROMPT, document_map = result
        pipelines[pdf_path] = {
            'model': model,
            'index': index,
            'retriever': retriever,
            'generator': generator,
            'PROMPT': PROMPT,
            'document_map': document_map
        }
        print(f'Pipeline initialized for {pdf_path}')
    else:
        print(f'Failed to initialize pipeline for {pdf_path}')

Processing PDF: QuantumCore_Solutions_RAG_Demo_Dataset_v1 (2).pdf
Created 15 chunks


Device set to use cpu


Pipeline initialized for QuantumCore_Solutions_RAG_Demo_Dataset_v1 (2).pdf


## Cell 5: Test Sample Queries for QuantumCore_v1

Run sample queries for `QuantumCore_v1.pdf` to demonstrate the pipeline. Queries test varied scenarios (roles, lists, descriptive answers).

In [25]:
quantumcore_pdf = next((p for p in pdf_paths if 'QuantumCore' in p), None)
if quantumcore_pdf and quantumcore_pdf in pipelines:
    print(f'\nTesting sample queries for {quantumcore_pdf}')
    # Named `rag` rather than `pipeline` to avoid shadowing transformers.pipeline from Cell 2
    rag = pipelines[quantumcore_pdf]
    queries = [
        'what is the company name',
        'who is CTO',
        'What are the products offered by the company',
        'What is the company’s case study about'
    ]
    for query in queries:
        context, answer = query_rag(
            rag['model'],
            rag['index'],
            rag['retriever'],
            rag['generator'],
            rag['PROMPT'],
            rag['document_map'],
            query
        )
        print(f'\nQuestion: {query!r}')
        print(f'Context: {context}')
        print(f'Answer: {answer}')
else:
    print('QuantumCore PDF not found or pipeline not initialized.')


Testing sample queries for QuantumCore_Solutions_RAG_Demo_Dataset_v1 (2).pdf





Question: 'what is the company name'
Context: QuantumCore Solutions - Company & Technology
Profile
A. Company Profile
QuantumCore Solutions is a leading quantum computing innovator, delivering advanced quantum hardware and
software platforms for scientific research, cryptography, and optimization. Founded in 2022, our goal is to
- James Lee, VP of Partnerships: Drives collaborations with academic and industry leaders.
QuantumCore Solutions - Company & Technology
Profile
D. Product Specifications
**QubitCore Quantum Module:**
- Qubits: 50-qubit superconducting architecture
- Coherence Time: 120 microseconds
- Gate Fidelity: 99.95% (2-qubit
Answer: QuantumCore Solutions.





Question: 'who is CTO'
Context: Phone: +1-888-QUANTUM9
C. Team Leadership
- Dr. Elena Ruiz, CEO: 25+ years in quantum computing, former director at IBM Quantum.
- Dr. Amit Khan, CTO: Expert in quantum algorithms, holds 12 patents.
- Laura Kim, CFO: Specialist in tech startups, led funding rounds for quantum firms.
- James Lee, VP of Partnerships: Drives collaborations with academic and industry leaders.
- Leadership: CEO, CTO, CFO, VP of Partnerships set strategic goals.
- Research: Quantum Hardware, Software, and Algorithm teams drive innovation.
- Client Services: Support, Training, and Consulting ensure client success.
- 
Answer: Dr. Amit Khan





Question: 'What are the products offered by the company'
Context: industries like pharmaceuticals, logistics, and defense. Our vision is to unlock quantum advantages for global
challenges.
Key products:
- QubitCore: Quantum processing unit for research labs.
- QuantumNet: Cloud-based quantum simulation and optimization platform.
QuantumCore Solutions - Company & Technology
Profile
D. Product Specifications
**QubitCore Quantum Module:**
- Qubits: 50-qubit superconducting architecture
- Coherence Time: 120 microseconds
- Gate Fidelity: 99.95% (2-qubit gates)
- Cooling: Dilution refrigerator, 10mK operation
- Storage: 500TB quantum-encrypted SSD
- Networking: 4
Answer: Cooling, Networking, QuantumNet, QubitCore, Qubits, Storage





Question: 'What is the company’s case study about'
Context: - ISO 9001 certified for quality management
- GDPR compliant for EU operations
- NIST SP 800-53 aligned for government clients
- Quantum-resistant encryption and annual security assessments
F. Case Study: PharmaQuantum Drug Discovery
industries like pharmaceuticals, logistics, and defense. Our vision is to unlock quantum advantages for global
challenges.
Key products:
- QubitCore: Quantum processing unit for research labs.
- QuantumNet: Cloud-based quantum simulation and optimization platform.
QuantumCore Solutions - Company & Technology
Profile
A. Company Profile
QuantumCore Solutions is a le
Answer: PharmaQuantum Drug Discovery industries like pharmaceuticals, and defense. Our vision is to unlock quantum advantages for global challenges., logistics


## Cell 6: Test Sample Queries for NeoCompute_v3

Run sample queries for `NeoCompute_v3.pdf` to demonstrate versatility across datasets. Queries test roles, lists, and contact info.

In [13]:
neocompute_pdf = next((p for p in pdf_paths if 'NeoCompute' in p), None)
if neocompute_pdf and neocompute_pdf in pipelines:
    print(f'\nTesting sample queries for {neocompute_pdf}')
    # Named `rag` rather than `pipeline` to avoid shadowing transformers.pipeline from Cell 2
    rag = pipelines[neocompute_pdf]
    queries = [
        'what is the company vision',
        'what are the products',
        'What is the company’s contact email',
        'who is CTO'
    ]
    for query in queries:
        context, answer = query_rag(
            rag['model'],
            rag['index'],
            rag['retriever'],
            rag['generator'],
            rag['PROMPT'],
            rag['document_map'],
            query
        )
        print(f'\nQuestion: {query!r}')
        print(f'Context: {context}')
        print(f'Answer: {answer}')
else:
    print('NeoCompute PDF not found or pipeline not initialized.')

NeoCompute PDF not found or pipeline not initialized.


## Cell 7: Interactive Query Interface

Allow interactive querying for either PDF. Select a PDF and enter queries to test the pipeline live during the demo.

In [26]:
print('\nInteractive Query Interface')
print('Available PDFs:', list(pipelines.keys()))
pdf_choice = input('Select a PDF (enter full path or partial name): ')
selected_pdf = next((p for p in pipelines if pdf_choice in p), None)

if selected_pdf:
    # Named `rag` rather than `pipeline` to avoid shadowing transformers.pipeline from Cell 2
    rag = pipelines[selected_pdf]
    while True:
        query = input('Enter your question (or type "exit" to quit): ')
        if query.lower() == 'exit':
            break
        if not query.strip():
            print('Empty query. Please enter a valid question.')
            continue
        context, answer = query_rag(
            rag['model'],
            rag['index'],
            rag['retriever'],
            rag['generator'],
            rag['PROMPT'],
            rag['document_map'],
            query
        )
        print(f'\nQuestion: {query!r}')
        print(f'Context: {context}')
        print(f'Answer: {answer}')
else:
    print('Invalid PDF selection.')


Interactive Query Interface
Available PDFs: ['QuantumCore_Solutions_RAG_Demo_Dataset_v1 (2).pdf']
Select a PDF (enter full path or partial name): QuantumCore_Solutions_RAG_Demo_Dataset_v1 (2).pdf
Enter your question (or type "exit" to quit): what is the company name





Question: 'what is the company name'
Context: QuantumCore Solutions - Company & Technology
Profile
A. Company Profile
QuantumCore Solutions is a leading quantum computing innovator, delivering advanced quantum hardware and
software platforms for scientific research, cryptography, and optimization. Founded in 2022, our goal is to
- James Lee, VP of Partnerships: Drives collaborations with academic and industry leaders.
QuantumCore Solutions - Company & Technology
Profile
D. Product Specifications
**QubitCore Quantum Module:**
- Qubits: 50-qubit superconducting architecture
- Coherence Time: 120 microseconds
- Gate Fidelity: 99.95% (2-qubit
Answer: QuantumCore Solutions.
Enter your question (or type "exit" to quit): exit


## Results and Observations

### QuantumCore_v1 Results
- **Accuracy**: 5/10 correct (name, workers, CTO, CEO, headquarters).
- **Issues**: Incomplete products (`QubitCore` only), truncated case study, missed compliance standards, incorrect partners, and goal/vision confusion.
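- **Fixes**: Regex for roles returns the correct name when the leadership list is retrieved (see `who is CTO` in Cell 5). The product regex is less reliable: in Cell 5 it also matched specification labels (`Cooling`, `Networking`, `Qubits`, `Storage`).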
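
A possible tightening of the product regex, sketched under the assumption that products are always listed under a `Key products:` header as in the QuantumCore PDF, is to read only the bullets that directly follow that header, so specification labels such as `Qubits` or `Cooling` are not picked up. The function name `extract_key_products` is hypothetical.

```python
import re


def extract_key_products(context):
    """Return product names from the bullets directly under 'Key products:'."""
    products = []
    in_section = False
    for line in context.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("key products:"):
            in_section = True
            continue
        if in_section:
            match = re.match(r"-\s*(\w+):", stripped)
            if match is None:
                break  # first non-bullet line ends the product list
            products.append(match.group(1))
    return products


context = """Key products:
- QubitCore: Quantum processing unit for research labs.
- QuantumNet: Cloud-based quantum simulation and optimization platform.
D. Product Specifications
- Qubits: 50-qubit superconducting architecture
- Cooling: Dilution refrigerator, 10mK operation"""

print(extract_key_products(context))
```

This only helps when the `Key products:` chunk is actually retrieved; if it is missing from the context, the function returns an empty list and the caller can fall back to the "not found" message.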

### NeoCompute_v3 Results
- **Accuracy**: 2/9 correct (vision, employees).
- **Issues**: Incorrect CTO, incomplete products, garbled mission, missed NeoCloud specs.
- **Fixes**: Regex improves roles/products; retrieval misses persist.

### Challenges
- `flan-t5-base` struggles with lists, complex queries, and prompt adherence.
- Retrieval misses (e.g., partners, specs) due to ranking.
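- Truncation for case studies: the retrieved context is cut at 600 characters (`max_context_length`), and in the Cell 5 run only the case-study heading made it into the context; the generator's `max_length=300` caps the answer length on top of that.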
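
Part of the truncation comes from how the context is built: the top chunks are joined and then sliced at a fixed character count, which can cut a chunk mid-sentence. A hedged alternative is to add whole chunks until the budget is used up. The sketch below assumes a character budget like the notebook's 600; the helper name `build_context` is hypothetical.

```python
def build_context(chunks, max_chars=600, sep="\n"):
    """Join ranked chunks in order, keeping only whole chunks within budget."""
    selected = []
    used = 0
    for chunk in chunks:
        extra = len(chunk) + (len(sep) if selected else 0)
        if used + extra > max_chars:
            break  # the next chunk would overflow, so stop here
        selected.append(chunk)
        used += extra
    return sep.join(selected)


ranked = ["a" * 290, "b" * 290, "c" * 290]
context = build_context(ranked)
print(len(context))
```

If even the first chunk exceeds the budget, this version returns an empty string; a real pipeline would probably fall back to slicing that chunk instead.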

### Improvements
- Use `flan-t5-large` for better generation.
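- Increase `top_k_initial` to 20 for better retrieval on longer documents (the QuantumCore PDF yields only 15 chunks, so the initial retrieval can already return all of them).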
- Add regex for specifications to handle queries like QubitCore specs.
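
As an illustration of the last item, specification lines in the QuantumCore PDF follow a `- Label: value` pattern, so a small regex can turn them into a lookup table. This is a sketch under the assumption that every specification sits on its own bulleted line; `parse_specs` is a hypothetical helper, not part of the notebook.

```python
import re

SPEC_LINE = re.compile(r"^-\s*([^:]+):\s*(.+)$")


def parse_specs(context):
    """Map each '- Label: value' line in the context to {label: value}."""
    specs = {}
    for line in context.splitlines():
        match = SPEC_LINE.match(line.strip())
        if match:
            specs[match.group(1).strip()] = match.group(2).strip()
    return specs


context = """**QubitCore Quantum Module:**
- Qubits: 50-qubit superconducting architecture
- Coherence Time: 120 microseconds
- Cooling: Dilution refrigerator, 10mK operation"""

specs = parse_specs(context)
print(specs["Coherence Time"])
```

Product and leadership bullets use the same `- Label: value` shape, so in practice the input would need to be limited to the specification section before parsing.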

## Conclusion

This RAG pipeline meets the interview requirements by:
- Using small, CPU-based models.
- Supporting varied queries across two PDFs.
- Keeping post-processing minimal (regex and deduplication).
- Providing an interactive demo.

For production, consider larger models or additional regex for specifications. Test the interactive interface above for live demo purposes.