##### **Retrieval-Augmented Medical Chatbot - High-Fidelity Prototype**

---

**<span style="color: #A1AEB1;">Retrieval-Augmented Medical Chatbot Pipeline Architecture</span>**

| Module                                        | Description                                                                                           | Inputs               | Outputs                           |
| --------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -------------------- | --------------------------------- |
| **Document Ingestion**                        | Loads documents, extracts text, filenames, timestamps, and page-level metadata.                       | PDF / DOCX / HTML    | Clean Text + Metadata             |
| **Preprocessing & Chunking**                  | Normalizes text, removes boilerplate, creates overlapping chunks with rich metadata.                  | Page-level Text      | Text Chunks + Metadata            |
| **Embedding & Vectorization**                 | Converts chunks to embeddings using OpenAI Embedding model, then stores in vector database.           | Text Chunks          | Vector Store Entries              |
| **Retriever & Re-Ranking**                    | Performs semantic search, applies metadata filters, reranks results using cross-encoder.              | Query + Vector Store | Top-K Passages + Relevance Scores |
| **RAG Prompting**                             | Constructs RAG prompt with system rules, retrieved passages, citations, and the user query.           | Top-K Passages       | Prompt for LLM                    |
| **LLM Generation**                            | Produces draft medical response with citations and confidence.                                        | Structured Prompt    | Generated Answer                  |
| **Safety & Clinical Guardrails**              | Checks for hallucination, unsafe content, missing citations, or clinical risks; may trigger fallback. | Generated Answer     | Safe Final Answer                 |
| **Audit Logging & Monitoring**                | Logs queries, retrievals, sources, and model responses for compliance.                                | System Events        | Audit Logs                        |
| **Evaluation & Feedback Loop**                | Measures accuracy, safe-fail rates, and retrieval quality to improve pipeline.                        | Logs + Responses     | Metrics & Model Improvements      |

<br />

<div style="text-align: justify;">

<span style="color: #D0312D;">**Note: The Retrieval-Augmented Generation (RAG) pipeline is currently being implemented as an early prototype. It does not yet contain every element specified in the final architecture, even though it illustrates the essential workflow from document ingestion to response generation. Later stages of development will include features like PHI de-identification, safety and compliance layers, improved retrieval optimisation, and thorough assessment systems.**</span>

</div>

**<span style="color: #A1AEB1;">Retrieval-Augmented Generation Framework Research Summary</span>**

<div style="text-align: justify;">

Throughout the research into open-source Retrieval-Augmented Generation (RAG) frameworks, I evaluated several leading tools including LangChain, LlamaIndex, Haystack, FlexRAG, and UltraRAG based on factors such as ecosystem maturity, retrieval performance, orchestration flexibility, integration support, scalability, and suitability for medical domain requirements.

**LlamaIndex** stood out for its data-centric architecture, offering efficient document ingestion, structured indexing, and advanced query engines such as Tree and Graph indexes. It is excellent for rapid prototyping and scenarios where flexible data connectors are needed. However, it provides less orchestration control for multi-step conversational flows or agent-like reasoning.

**Haystack** demonstrated strengths in production-grade pipelines, robust hybrid retrieval (sparse + dense), and a modular architecture that fits enterprise settings well. Its pipeline API is powerful but can be heavier to configure and less ideal for agent-based interactions or dynamic tool-calling workflows.

**FlexRAG** and **UltraRAG** offer innovative retrieval strategies and automation capabilities. FlexRAG focuses on dynamic retrieval orchestration, while UltraRAG emphasizes pipeline optimization and evaluation tooling. Despite their promise, both frameworks are still relatively new, with smaller ecosystems, limited documentation, and fewer real-world examples.

<span style="color: #74B72E;">**LangChain** turned out to be the most sensible and adaptable alternative for this project when all the choices had been evaluated. It provides flexible workflow orchestration, strong support for tools, agents, and memory components, and broad integration with vector stores and LLM providers. LangChain also enables large-scale ingestion, distributed processing, detailed control over embedding/vectorstore layers, and excellent OpenAI integration, which is important for medical RAG systems that require reliability and strict compliance handling. While LangChain may not include the most optimized retriever internally, its compatibility with external retrieval engines makes it the best fit for building a medically safe, extensible, and future-proof RAG-based medical chatbot.</span>

</div>

**<span style="color: #A1AEB1;">Retrieval-Augmented Generation Large Language Model (LLM) Research Summary</span>**

##### **Retrieval-Augmented Medical Chatbot - Implemented Code**

---

In [1]:
# Constant Variables
VECTOR_DATABASE_PATH = "./vector_database/db_faiss"

**<span style="color: #A1AEB1;">Document Ingestion (Document Loader)</span>**

<div style="text-align: justify;">

During the research phase, I noticed that medical datasets often come in multiple formats, such as PDF guidelines, DOCX clinical reports, TXT notes, and HTML webpages. Each file type requires different preprocessing steps, and handling them individually can introduce errors, inconsistencies, and maintenance challenges. A unified approach was therefore necessary to streamline document loading while ensuring that all formats are processed correctly and consistently, and a **Document Loader Factory** was implemented to address this need. In brief, it abstracts away file-specific parsing logic while ensuring that every document is transformed into a standardised structure suitable for chunking, embedding, and retrieval. This makes the overall RAG pipeline more robust, extensible, and easier to maintain.

Overall, the key reasons for implementing it are reliable processing of current and future file formats, extensibility, consistent downstream processing, a cleaner architecture, and reusability. It is also a foundational component that prepares unstructured medical data for high-quality retrieval in the RAG medical chatbot.

</div>
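
The factory idea described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's implementation: `register_loader`, `load_plain_text`, `load_any`, and the `{"text", "metadata"}` record shape are hypothetical names chosen for the example.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical registry: file suffix -> function returning a list of text records.
LoaderFn = Callable[[Path], List[dict]]
_LOADERS: Dict[str, LoaderFn] = {}


def register_loader(suffix: str, loader: LoaderFn) -> None:
    """Associate a file suffix (e.g. '.txt') with a loader function."""
    _LOADERS[suffix.lower()] = loader


def load_plain_text(path: Path) -> List[dict]:
    """Read a UTF-8 text file into the shared {'text', 'metadata'} structure."""
    text = path.read_text(encoding="utf-8")
    return [{"text": text, "metadata": {"source": str(path)}}]


def load_any(path: Path) -> List[dict]:
    """Dispatch on the file suffix and fail loudly for unregistered formats."""
    loader = _LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"No loader registered for '{path.suffix}' files: {path}")
    return loader(path)


register_loader(".txt", load_plain_text)
# Supporting another format later is one registration; load_any stays unchanged.
register_loader(".md", load_plain_text)
```

Because every loader returns the same record shape, downstream chunking and embedding code does not need to know which format a document came from.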

In [2]:
# Framework Libraries
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader
import glob



In [3]:
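# Document Loader
class DocumentLoader:
    """Loads every supported file under the dataset folder into LangChain documents."""

    def __init__(self):
        self.dataset_path = "./data"
        # Map each glob pattern to the loader class that parses that format
        self.docs_loaders = {"*.pdf": PyPDFLoader, "*.docx": Docx2txtLoader, "*.txt": TextLoader}

    def load_documents(self):
        docs = []

        for file_type, loader_cls in self.docs_loaders.items():
            # sorted() keeps the load order deterministic across runs
            file_paths = sorted(glob.glob(f"{self.dataset_path}/{file_type}"))

            for file_path in file_paths:
                docs_loader = loader_cls(file_path)
                docs.extend(docs_loader.load())

        return docs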

In [4]:
# Dataset Preparation
docs_loaders = DocumentLoader()
medical_docs = docs_loaders.load_documents()

# Dataset Checking
print(f"Number of PDF pages: {len(medical_docs)}")

Number of PDF pages: 759


**<span style="color: #A1AEB1;">Preprocessing & Chunking</span>**

<div style="text-align: justify;">

In comparison to more naive or fixed-window chunking methods, the **RecursiveCharacterTextSplitter** produces segments that better preserve semantic continuity, reduce unnatural text breakpoints, and align effectively with the processing requirements of downstream embedding models. Its recursive, rule-based splitting mechanism enables the generation of structurally coherent chunks while maintaining compatibility with LangChain’s document processing pipeline.

<span style="color: #74B72E;">Based on the evaluation across multiple chunking strategies applied to medical text corpora, a configuration of **500-character chunk size** with a **50-character overlap** emerged as the most suitable for this RAG implementation. A chunk size of approximately 500 characters provides a strong balance between semantic density and processing efficiency, typically encapsulating a few clinically meaningful sentences, roughly one short paragraph. This reduces the likelihood of fragmenting medically significant concepts, thereby improving embedding representativeness and enhancing retrieval precision.</span>

The 50-character overlap further addresses boundary effects by ensuring that clinical statements or sentence fragments located near chunk edges are preserved across adjacent segments. This approach minimizes semantic loss without introducing excessive redundancy in the vector store.

In short, this chunking configuration offers an effective trade-off: it maintains semantic coherence, optimizes embedding quality, and supports efficient storage and retrieval, making it well aligned with best practices for retrieval-augmented generation (RAG) in the medical domain.

</div>
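
To make the overlap arithmetic concrete, the sketch below uses a plain fixed-window splitter. It is deliberately simpler than `RecursiveCharacterTextSplitter`, which also looks for natural separators, and is only meant to show how a 500-character window with a 50-character overlap advances 450 characters at a time. The function name `sliding_window_chunks` and the synthetic sample text are assumptions made for the example.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-size windows, each repeating the last
    `chunk_overlap` characters of the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # 450 new characters per chunk by default
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += stride
    return chunks


# Synthetic 1,200-character text, used only to show the arithmetic.
sample = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])           # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: both chunks share the boundary region
```

A sentence that straddles position 500 is cut in the first chunk, but its leading characters reappear at the start of the second, which is the boundary effect the overlap is meant to soften.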

In [5]:
# Framework Libraries
from langchain_text_splitters import RecursiveCharacterTextSplitter                                 # Splits the loaded documents into smaller text chunks

In [6]:
# Create Text Chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 50)

dataset = text_splitter.split_documents(medical_docs)
dataset[:10]

[Document(metadata={'producer': 'GPL Ghostscript 9.10', 'creator': '', 'creationdate': '2017-05-01T10:37:35-07:00', 'moddate': '2017-05-01T10:37:35-07:00', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'source': './data/The-Glae-Encyclopedia-of-Medicine.pdf', 'total_pages': 759, 'page': 0, 'page_label': '1'}, page_content='The GALE\nENCYCLOPEDIA\nof MEDICINE\nSECOND EDITION'),
 Document(metadata={'producer': 'GPL Ghostscript 9.10', 'creator': '', 'creationdate': '2017-05-01T10:37:35-07:00', 'moddate': '2017-05-01T10:37:35-07:00', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'source': './data/The-Glae-Encyclopedia-of-Medicine.pdf', 'total_pages': 759, 'page': 1, 'page_label': '2'}, page_content='The G ALE\nENCYCLOPEDIA\nof MEDICINE\nSECOND EDITION\nJACQUELINE L. LONGE, EDITOR\nDEIRDRE S. BLANCHFIELD, ASSOCIATE EDITOR\nVOLUME\nC-F\n2'),
 Document(metadata={'producer': 'GPL Ghostscript 9.10', 'creator': '', 'creationdate': '2017-05-01T10:37:35-07:00', 'moddate': '

**<span style="color: #A1AEB1;">Embedding & Vectorization</span>**

<div style="text-align: justify;">

As noted earlier, OpenAI’s services were selected for both embedding generation and the Large Language Model (LLM) because of their demonstrated performance, operational stability, and seamless compatibility with the LangChain framework. The comparative analysis therefore concentrates on three embedding models within the OpenAI ecosystem (text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002) and evaluates their relative effectiveness and suitability for downstream retrieval-augmented generation tasks.

OpenAI’s newest lightweight embedding model, **text-embedding-3-small**, is intended to provide strong semantic performance with a small computing footprint. It provides significantly higher accuracy than the older ada-002 model at the same default dimensionality (1,536), which is half that of text-embedding-3-large (3,072), so FAISS indexes stay compact and similarity search stays fast. Empirical testing indicates that this model achieves strong semantic alignment for medical terminology, clinical statements, and multi-sentence medical narratives, making it suitable for large-scale ingestion pipelines.

The **text-embedding-3-large** model is engineered for high-precision retrieval in scenarios where semantic fidelity and recall are prioritized over computational cost. It consistently yields stronger performance in tasks involving long-form medical documents, subtle clinical distinctions, and multi-hop medical reasoning. This makes it particularly valuable for safety-sensitive or high-recall applications such as diagnostic guidance, literature retrieval, and guideline summarization. However, its larger vector dimensionality (3,072 by default) increases memory usage and index size.

Compared with the text-embedding-3 series, the predecessor **text-embedding-ada-002** model previously served as OpenAI’s standard embedding solution. While ada-002 remains functional, it generally exhibits lower semantic resolution, weaker contextual understanding, and higher variance in retrieval quality, especially for domain-specific text such as medical literature and clinical guidelines. Its performance gap is evident across semantic similarity benchmarks and multi-paragraph retrieval tasks. Consequently, ada-002 is no longer preferred for systems requiring high accuracy or domain sensitivity.

<span style="color: #74B72E;">Based on the empirical evaluation, **text-embedding-3-small** was chosen for this study because it strikes the best balance between cost-effectiveness, computational efficiency, and semantic quality. It keeps the index small and offers sufficiently high retrieval accuracy for medical-domain RAG tasks, making it appropriate for iterative prototyping and deployment at scale. Text-embedding-3-large remains a feasible upgrade path for situations needing maximum recall or handling intricate clinical narratives. OpenAI embeddings are compatible with any LLM, because retrieved passages reach the model as plain text; pairing them with OpenAI LLMs mainly simplifies integration by keeping both components under a single provider and API.</span>

</div>
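
The memory trade-off between the models can be estimated with simple arithmetic. The sketch below assumes an uncompressed float32 index (4 bytes per dimension) and counts raw vector storage only, ignoring metadata and index overhead. The dimensions are OpenAI's documented defaults; `num_chunks` is a hypothetical corpus size, not a figure measured in this notebook.

```python
BYTES_PER_FLOAT32 = 4

# Default output dimensions for each candidate model.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def flat_index_mebibytes(num_vectors: int, dims: int) -> float:
    """Raw vector storage for an uncompressed float32 index, in MiB."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / (1024 ** 2)


num_chunks = 10_000  # hypothetical number of chunks, chosen only for illustration
for model, dims in EMBEDDING_DIMS.items():
    print(f"{model}: {flat_index_mebibytes(num_chunks, dims):.1f} MiB")
# text-embedding-3-small: 58.6 MiB
# text-embedding-3-large: 117.2 MiB
# text-embedding-ada-002: 58.6 MiB
```

Storage grows linearly with dimensionality, so moving to text-embedding-3-large doubles the raw index size for the same number of chunks.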

In [7]:
# Framework Libraries
from langchain_openai import OpenAIEmbeddings                                            
from langchain_community.vectorstores import FAISS                                                  # Used to store, index, and search large collections of vector embeddings efficiently

from dotenv import load_dotenv

In [10]:
# Load API Key from .env file
load_dotenv()

# Create Vector Embeddings
embedding_model = OpenAIEmbeddings(model = "text-embedding-3-small")

<div style="text-align: justify;">

Throughout the development of the RAG pipeline, several vector databases and ANN (Approximate Nearest Neighbor) indexing frameworks were evaluated, including Pinecone, Weaviate, Milvus, ChromaDB, and HNSWlib, in addition to FAISS. Each solution offers a different balance of performance, scalability, operational complexity, and integration support.

**Pinecone** provides a highly scalable, fully managed vector database that delivers strong performance, hybrid search capabilities, and robust metadata filtering. Its primary limitations lie in its proprietary ecosystem and relatively high operational cost, making it less suitable for early-stage experimentation or local development environments.

**Weaviate** offers an open-source, schema-driven vector store with built-in hybrid search, modular storage backends, and cloud deployment options. While flexible, it introduces additional configuration overhead and resource consumption, which can be excessive for smaller projects or prototype settings.

**Milvus**, designed for distributed vector storage at scale, demonstrates excellent performance for large datasets and production-grade workloads. However, its operational complexity, which often requires container orchestration, specialized storage, and dedicated cluster management, makes it heavyweight for lightweight RAG prototypes or academic experimentation.

**ChromaDB** serves as a lightweight, developer-friendly local vector store with a simple API and effective metadata filtering. Although suitable for small to medium applications, it lacks the raw search performance and index configurability required for more advanced or computationally demanding retrieval tasks.

**HNSWlib**, though extremely fast and easy to use, is limited by its in-memory design. Beyond saving and loading an index file, it lacks broader database features such as filtering or integrated metadata search, making it less suitable as a standalone vector database in structured RAG pipelines.

<span style="color: #74B72E;">Compared with alternatives, **FAISS** emerged as the most appropriate choice for the current stage of the project. FAISS offers high-performance vector indexing, GPU acceleration, and fine-grained control over ANN algorithms, while remaining simple to deploy in a local development environment. It is widely used in research and prototyping due to its reliability, mature codebase, and strong ecosystem support. Notably, FAISS integrates seamlessly with the LangChain framework, which allows embeddings, chunk metadata, and retrieval logic to be managed efficiently using LangChain’s vector store abstractions. This compatibility reduces implementation effort and ensures that the RAG pipeline remains modular, extensible, and aligned with industry best practices.</span>

In summary, FAISS provides the best balance between performance, simplicity, configurability, and ecosystem support for the current development goals. While future iterations of the medical RAG chatbot may require migration to a distributed or managed vector database, FAISS is well-suited for prototyping, testing retrieval strategies, and validating the overall architecture.

</div>
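
As a reference point for what a vector index computes, the sketch below performs exact (exhaustive) nearest-neighbour search by squared L2 distance in pure Python. To my understanding this is the behaviour of a flat FAISS index, which ANN algorithms then approximate for speed. The function names and toy vectors are illustrative assumptions and are not part of FAISS or LangChain.

```python
import heapq
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(query: Sequence[float],
                vectors: Sequence[Sequence[float]],
                k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, squared distance) for the k nearest vectors, closest first."""
    scored = ((i, squared_l2(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy 2-dimensional "embeddings"; real embeddings have hundreds or thousands of dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
nearest = exact_top_k([0.9, 0.1], toy_vectors, k=2)
print([index for index, _ in nearest])  # [1, 0]
```

Exhaustive search compares the query against every stored vector, so its cost grows linearly with the corpus; that cost is the reason approximate indexes become attractive as the vector store grows.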

In [12]:
# Store Embeddings in FAISS
vector_database = FAISS.from_documents(dataset, embedding_model)
vector_database.save_local(VECTOR_DATABASE_PATH)

**<span style="color: #A1AEB1;">Retrieval-Augmented Generation (RAG) Prompting</span>**

<div style="text-align: justify;">



</div>