# Building Conversational RAG with OpenAI and Chroma DB: No LangChain or LlamaIndex Required!

## Table of Contents
1. Document Processing and Indexing
2. Setting Up ChromaDB
3. Inserting Data into ChromaDB
4. Semantic Search on ChromaDB
5. Combining ChromaDB and Gemini for RAG
6. Creating Conversational RAG with Memory


### What is Retrieval Augmented Generation (RAG)?

RAG pairs a language model with a retrieval system. At query time, relevant passages are fetched from an external knowledge source and passed to the model as context, so its answers can draw on information that was not part of its training data.
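
The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not the pipeline built later in this notebook: it assumes a hypothetical in-memory `KNOWLEDGE` list, ranks passages by word overlap instead of embeddings, and stops at building the prompt rather than calling a model.

```python
import re

# Hypothetical mini knowledge base (illustration only)
KNOWLEDGE = [
    "Chroma is a vector database that stores embeddings.",
    "Sentence transformers turn text into dense vectors.",
    "PDF files can be parsed page by page.",
]

def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_retrieve(query: str, passages: list, k: int = 1) -> list:
    """Rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & tokenize(p)),
        reverse=True,
    )
    return ranked[:k]

def build_augmented_prompt(query: str, passages: list) -> str:
    """Place the retrieved passages in front of the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

question = "What is a vector database?"
retrieved = toy_retrieve(question, KNOWLEDGE)
print(build_augmented_prompt(question, retrieved))
```

The rest of the notebook replaces the word-overlap ranking with embedding search in ChromaDB and sends the prompt to a model.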

## Installing Necessary Libraries

In [None]:
#!pip install -qU chromadb pypdf2 python-docx sentence-transformers openai

In [1]:
#!pip install google-genai
from google import genai

## Document Processing and Indexing

### Functions to read file contents

In [2]:
#import docx
import PyPDF2
import os
def read_text_file(file_path: str):
    """Read content from a txt file"""
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

def read_pdf_file(file_path: str):
    """Read content from a PDF file"""
    text = ""
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text += (page.extract_text() or "") + "\n"  # guard against pages with no extractable text
    return text

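def read_docx_file(file_path: str):
    """Read content from a docx file (requires python-docx)"""
    import docx  # imported lazily so the notebook runs without python-docx unless a .docx is read
    doc = docx.Document(file_path)
    return "\n".join([paragraph.text for paragraph in doc.paragraphs])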

In [3]:
def read_document(file_path: str):
    """Read document content based on file extension"""
    _, file_extension = os.path.splitext(file_path)
    file_extension = file_extension.lower()

    if file_extension == '.txt':
        return read_text_file(file_path)
    elif file_extension == '.pdf':
        return read_pdf_file(file_path)
    elif file_extension == '.docx':
        return read_docx_file(file_path)
    else:
        raise ValueError(f"Unsupported file format: {file_extension}")

text = read_document(r"H:\01_Training\ContentSlides_AgenticAI\Code\Deployment_RAG\docs\ERP-2008-chapter4.pdf")
#text = read_document(r"/content/ERP-2008-chapter4.pdf")  # use this path in Colab
print(text)

97CHAPTER 4
The Importance of Health and 
Health Care
The American health care system is an engine for innovation that develops 
and broadly disseminates advanced, life-enhancing treatments and offers 
a wide set of choices for consumers of health care. The current health care system provides enormous benefits, but there are substantial opportunities for reforms that would reduce costs, increase access, enhance quality, and improve the health of Americans.
An individual’s health can be maintained or improved in many ways, 
including through changes in personal behavior and through the appropriate 
consumption of health care services. While there is substantial health care spending in the United States, the importance of health does provide a strong rationale for this level of spending. But because health care financing and delivery are often inefficient, there are opportunities to advance health and access to health care services without further growth in spending. To improve the effic

### Chunking

In [4]:
def split_text(text: str, chunk_size: int = 500):
    """Split text into chunks"""
    sentences = text.replace('\n', ' ').split('. ')
    chunks = []
    current_chunk = []
    current_size = 0
    print(f"no of sentences:  {len(sentences)}")
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue

        if not sentence.endswith('.'):
            sentence += '.'

        sentence_size = len(sentence)

        if current_size + sentence_size > chunk_size and current_chunk:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_size = sentence_size
        else:
            current_chunk.append(sentence)
            current_size += sentence_size

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

chunks = split_text(text)
print(chunks[2])

no of sentences:  271
But because health care financing and delivery are often inefficient, there are opportunities to advance health and access to health care services without further growth in spending. To improve the efficiency of health care financing and delivery, the Administration has pursued policies that would increase incentives for individuals to purchase consumer-directed health insurance plans.


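In [6]:
# Chunk sizes are measured in characters, as returned by len().
# A separate variable is used so the document text is not overwritten.
sample = "Hi hello how are"

len(sample)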

16

In [7]:
len(chunks)


115

## Setting Up ChromaDB

In [5]:
import chromadb
from chromadb.utils import embedding_functions
# import textwrap

In [6]:
client = chromadb.PersistentClient(path="./chroma_db")

# Use sentence-transformer embeddings for embedding our data
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.get_or_create_collection(name="documents_collection", embedding_function=sentence_transformer_ef)

In [11]:
# from sentence_transformers import SentenceTransformer
# sentences = ["This is an example sentence", "Each sentence is converted"]

# model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# embeddings = model.encode(sentences)
# print(embeddings)
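
To make the idea of comparing embeddings concrete, here is a minimal sketch with toy 3-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions). The vectors are invented for illustration. The sketch computes both squared L2 distance and cosine similarity; which metric a Chroma collection actually uses depends on its configuration, so treat the choice here as an assumption.

```python
import math

def squared_l2(a: list, b: list) -> float:
    """Squared Euclidean distance: smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings"
query_vec = [1.0, 0.0, 0.0]
doc_close = [0.8, 0.6, 0.0]   # points in nearly the same direction as the query
doc_far = [0.0, 0.0, 1.0]     # orthogonal to the query

for name, vec in [("doc_close", doc_close), ("doc_far", doc_far)]:
    print(name, squared_l2(query_vec, vec), cosine_similarity(query_vec, vec))
```

A semantic search returns the stored chunks whose vectors are nearest to the query vector under the chosen metric.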

## Inserting Data into ChromaDB

In [7]:
def process_document(file_path: str):
    """Process a single document and prepare it for ChromaDB"""
    try:
        # Read the document
        content = read_document(file_path)

        # Split into chunks
        chunks = split_text(content)

        # Prepare metadata
        file_name = os.path.basename(file_path)
        metadatas = [{"source": file_name, "chunk": i} for i in range(len(chunks))]
        ids = [f"{file_name}_chunk_{i}" for i in range(len(chunks))]

        return ids, chunks, metadatas
    except Exception as e:
        print(f"Error processing {file_path}: {str(e)}")
        return [], [], []

def add_to_collection(collection, ids, texts, metadatas):
    """Add documents to collection in batches"""
    if not texts:
        return

    batch_size = 100
    for i in range(0, len(texts), batch_size):
        end_idx = min(i + batch_size, len(texts))
        collection.add(
            documents=texts[i:end_idx],
            metadatas=metadatas[i:end_idx],
            ids=ids[i:end_idx]
        )

def process_and_add_documents(collection, folder_path: str):
    """Process every file in a folder and add its chunks to the collection"""
    files = [
        os.path.join(folder_path, file)
        for file in os.listdir(folder_path)
        if os.path.isfile(os.path.join(folder_path, file))
    ]

    for file_path in files:
        print(f"Processing {os.path.basename(file_path)}...")
        ids, texts, metadatas = process_document(file_path)
        add_to_collection(collection, ids, texts, metadatas)
        print(f"Added {len(texts)} chunks to collection")

In [8]:
process_and_add_documents(collection, "./docs")

Processing ERP-2008-chapter4.pdf...
no of sentences:  271
Added 115 chunks to collection
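
The batching in `add_to_collection` can be checked in isolation. The sketch below reproduces only the index arithmetic, with a hypothetical helper name `batch_ranges`, and shows how the 115 chunks from this document are split with a batch size of 100. It also rebuilds the ID pattern used by `process_document`.

```python
def batch_ranges(total: int, batch_size: int = 100):
    """Yield (start, end) index pairs the way add_to_collection slices its inputs."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

def make_ids(file_name: str, n_chunks: int) -> list:
    """Build one ID per chunk using the same pattern as process_document."""
    return [f"{file_name}_chunk_{i}" for i in range(n_chunks)]

ranges = list(batch_ranges(115))
print(ranges)

ids = make_ids("ERP-2008-chapter4.pdf", 115)
print(ids[0], ids[-1])
```

Because the IDs are derived from the file name and chunk index, processing the same file again produces the same IDs.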


## Semantic Search on ChromaDB

In [9]:
def semantic_search(collection, query: str, n_results: int = 2):
    """Perform semantic search on the collection"""
    return collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["embeddings", "documents", "metadatas", "distances"]
    )

In [10]:
query = "What is American Health System?"
results = semantic_search(collection, query)
results

{'ids': [['ERP-2008-chapter4.pdf_chunk_0', 'ERP-2008-chapter4.pdf_chunk_4']],
 'embeddings': [array([[ 1.02494424e-02,  5.28266728e-02, -4.75636497e-02,
          -6.64891452e-02, -8.49574339e-03,  8.73280838e-02,
           8.17720592e-03,  1.68928429e-02, -3.63396928e-02,
          -6.38996717e-03, -5.63424639e-02,  7.73676932e-02,
          -4.25983630e-02, -1.09591126e-01, -2.81010717e-02,
          -9.08016786e-02,  1.25139374e-02, -5.23579568e-02,
          -4.40422595e-02,  5.34197092e-02, -4.98924814e-02,
           1.07772149e-01,  4.95683104e-02,  2.36799791e-02,
          -5.44510707e-02,  4.41576242e-02, -5.35479188e-02,
          -5.71012162e-02, -3.05823386e-02, -1.78930424e-02,
           1.18035667e-01, -9.30198468e-03,  1.04846932e-01,
          -1.59087740e-02, -1.08672947e-01,  3.43072432e-04,
           7.24217519e-02, -3.85458358e-02, -6.22753277e-02,
           2.62599662e-02, -4.09665518e-02, -4.74672578e-02,
          -2.35220604e-02,  1.20599084e-01,  2.7475038

In [11]:
def print_search_results(results):
    """Print formatted search results"""
    print("\nSearch Results:\n" + "-" * 50)

    for i in range(len(results['documents'][0])):
        doc = results['documents'][0][i]
        meta = results['metadatas'][0][i]
        print(f"\nResult {i + 1}: Source: {meta['source']}, Chunk {meta['chunk']}")
        print(f"Content: {doc}\n")

print_search_results(results)


Search Results:
--------------------------------------------------

Result 1: Source: ERP-2008-chapter4.pdf, Chunk 0
Content: 97CHAPTER 4 The Importance of Health and  Health Care The American health care system is an engine for innovation that develops  and broadly disseminates advanced, life-enhancing treatments and offers  a wide set of choices for consumers of health care. The current health care system provides enormous benefits, but there are substantial opportunities for reforms that would reduce costs, increase access, enhance quality, and improve the health of Americans.


Result 2: Source: ERP-2008-chapter4.pdf, Chunk 4
Content: The key points in this chapter are: • Health can be improved not only through the consumption of health  care services, but also through individual behavior and lifestyle choices  such as quitting smoking, eating more nutritious foods, and getting more exercise. • Health care has enhanced the health of our population; greater efficiency  in the healt
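
The result of `collection.query` is a dictionary of nested lists, with one inner list per query text. The sketch below flattens that shape into one row per hit so that each chunk sits next to its distance. `mock_results` is a hand-written stand-in with made-up texts and distance values, and `flatten_results` is a hypothetical helper name.

```python
# Hand-written stand-in for a query result; texts and distances are made up
mock_results = {
    "ids": [["doc.pdf_chunk_0", "doc.pdf_chunk_4"]],
    "documents": [["first chunk text", "second chunk text"]],
    "metadatas": [[{"source": "doc.pdf", "chunk": 0}, {"source": "doc.pdf", "chunk": 4}]],
    "distances": [[0.71, 0.93]],
}

def flatten_results(results: dict, query_index: int = 0) -> list:
    """Turn the nested result lists for one query into a list of row dicts."""
    rows = []
    for doc_id, doc, meta, dist in zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
        results["distances"][query_index],
    ):
        rows.append({
            "id": doc_id,
            "source": meta["source"],
            "chunk": meta["chunk"],
            "distance": dist,
            "text": doc,
        })
    return rows

for row in flatten_results(mock_results):
    print(f"{row['distance']:.2f}  {row['id']}  {row['text']}")
```

Rows like these make it easy to filter out weak matches with a distance threshold before building the context.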

In [12]:
def get_context_with_sources(results):
    """Get a combined context and formatted sources from search results."""
    # Combine the document chunks into a single context
    context = "\n\n".join(results['documents'][0])

    # Format the sources with metadata information
    sources = [f"{meta['source']} (chunk {meta['chunk']})" for meta in results['metadatas'][0]]

    return context, sources

context, sources = get_context_with_sources(results)
print(context)

97CHAPTER 4 The Importance of Health and  Health Care The American health care system is an engine for innovation that develops  and broadly disseminates advanced, life-enhancing treatments and offers  a wide set of choices for consumers of health care. The current health care system provides enormous benefits, but there are substantial opportunities for reforms that would reduce costs, increase access, enhance quality, and improve the health of Americans.

The key points in this chapter are: • Health can be improved not only through the consumption of health  care services, but also through individual behavior and lifestyle choices  such as quitting smoking, eating more nutritious foods, and getting more exercise. • Health care has enhanced the health of our population; greater efficiency  in the health care system, however, could yield even greater health for Americans without increasing health care spending.


## Combining ChromaDB and Gemini for RAG

In [14]:
def get_prompt(query: str, context: str):
    """Prompt for Response Generation"""
    prompt = f"""Based on the following context, please answer the question.
    If the answer cannot be derived from the context, say "I cannot answer this based on the provided context."

    Context:
    {context}

    Question: {query}

    Answer:"""

    return prompt

In [15]:
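import os
from openai import OpenAI

# Gemini exposes an OpenAI-compatible endpoint, so the OpenAI SDK can call it.
# The key is read from the GEMINI_API_KEY environment variable rather than hardcoded.
client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)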

In [16]:
def generate_response_openai_style(query: str, context: str):
    """Generate a response from Gemini through the OpenAI-compatible client"""

    prompt = get_prompt(query, context)
    print(prompt)
    
    response = client.chat.completions.create(
        model="gemini-2.5-flash",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        max_tokens=500
    )

    return response.choices[0].message.content

In [21]:
def generate_response_gemini_new(query: str, context: str):
    """Generate a response using the google-genai SDK"""

    prompt = get_prompt(query, context)
    #print(prompt)

    # Read the API key from the environment instead of hardcoding it
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt
    )

    return response.text

In [18]:
def rag_query(collection, query: str, n_chunks: int = 2):
    """Perform RAG query: retrieve relevant chunks and generate answer"""
    # Get relevant chunks
    results = semantic_search(collection, query, n_chunks)
    context, sources = get_context_with_sources(results)

    # Generate response. One call is enough; to use the google-genai SDK instead,
    # swap in generate_response_gemini_new(query, context).
    response = generate_response_openai_style(query, context)

    return response, sources

The functions above build the prompt and fetch the model's response.
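
API keys are safer outside the notebook, since a key pasted into a cell ends up in every shared copy. A minimal sketch of loading the key from an environment variable follows. The variable name `GEMINI_API_KEY` and the helper `load_api_key` are assumptions for illustration, not names required by either SDK.

```python
import os

def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in an environment variable, or fail with a clear message.

    The variable name is an assumption; use whatever name you export in your shell.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client.")
    return key

# Usage (commented out so the cell does not fail when the variable is unset):
# api_key = load_api_key()
```

Failing early with a clear message is easier to debug than an authentication error returned by the first API call.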

## Perform RAG query

In [None]:
query = "What is the demand for health"
response, sources = rag_query(collection, query)

# Print results
#print("\nQuery:", query)
print("\nAnswer:", response)
#print("\nSources used:")
# for source in sources:
#     print(f"- {source}")

Based on the following context, please answer the question.
    If the answer cannot be derived from the context, say "I cannot answer this based on the provided context."

    Context:
    97CHAPTER 4 The Importance of Health and  Health Care The American health care system is an engine for innovation that develops  and broadly disseminates advanced, life-enhancing treatments and offers  a wide set of choices for consumers of health care. The current health care system provides enormous benefits, but there are substantial opportunities for reforms that would reduce costs, increase access, enhance quality, and improve the health of Americans.

The key points in this chapter are: • Health can be improved not only through the consumption of health  care services, but also through individual behavior and lifestyle choices  such as quitting smoking, eating more nutritious foods, and getting more exercise. • Health care has enhanced the health of our population; greater efficiency  in the h