## Medical Assistant: Problem Statement (NLP with GenAI)

### Business Context

The healthcare industry is rapidly evolving, and professionals face increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. Quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

To address these challenges, healthcare centers can focus on integrating systems that streamline access to medical knowledge, provide tools to support quick decision-making and enhance efficiency. Leveraging centralized knowledge platforms and ensuring healthcare providers have continuous access to reliable resources can significantly improve patient care and operational effectiveness.

### Objective

As an AI specialist, your task is to develop a RAG-based AI solution using renowned medical manuals to address healthcare challenges. The objective is to understand information overload, apply AI techniques to streamline decision-making, analyze their impact on diagnostics and patient outcomes, evaluate their potential to standardize care practices, and create a functional prototype demonstrating feasibility and effectiveness.

### Questions to Answer

1. What is the protocol for managing sepsis in a critical care unit?
2. What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?
3. What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?
4. What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?
5. What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

### Data Dictionary

The Merck Manuals are medical references published by the American pharmaceutical company Merck & Co. They cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899, when Merck & Co. was still a subsidiary of the German company Merck.
The manual used here is a PDF with over 4,000 pages divided into 23 sections.

---

📂 **Folder Structure of current task**

> **Note:** This folder structure is included for informational purposes only 🔭.
> It outlines how the current task is organized across modules, data files, and generated files 🧐.

```bash
.
├── evaluation_results.json
├── llm_only_responses.json
├── prompt_engineering_results.json
├── rag_optimization_results.json
├── rag_responses.json
├── raw_text_backup.pkl
├── merck_manual_faiss_index
│   ├── index.faiss
│   └── index.pkl
├── merck_manual.pdf
├── .env
├── requirements.txt
└── notebook.ipynb
```

---

📊 **LLM + RAG Healthcare Task — Flow Diagram (Pseudo Flow)** 📌

> **Note:** This document outlines the overall flow of the RAG-based medical assistant solution. It complements the detailed step-by-step implementation plan.

##### 🔄 High-Level Task Flow

```mermaid
flowchart TD

    A["Start: Environment Setup"] --> B["Data Preparation (Merck Manual)"]
    B --> C["Text Splitting + Metadata Creation"]
    C --> D["Generate Embeddings (Titan)"]
    D --> E["Store in Vector DB (FAISS)"]

    E --> F["Basic LLM Q&A Setup (Titan Lite)"]
    F --> G["Prompt Engineering (5+ variations)"]

    G --> H["RAG Pipeline Setup"]
    H --> I["Test RAG on Healthcare Questions"]

    I --> J["Optimize RAG Configs (chunks, k, prompts, etc.)"]
    J --> K["Evaluation Framework: Groundedness + Relevance"]

    K --> L["Compare LLM-only vs. RAG"]
    L --> M["Insights + Recommendations"]
    M --> N["Documentation + Presentation"]
    N --> O["End"]
```

---

### Environment Setup

- Install the required libraries listed in `requirements.txt` (in the root folder):
  > pip install -r requirements.txt

### Imports

In [35]:
import boto3
import json
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from pypdf import PdfReader
from typing import List, Dict, Any

# LangChain imports
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_aws import BedrockEmbeddings
# NOTE: `langchain.llms.Bedrock` was used initially because of an issue with langchain_aws
# (ref: https://github.com/langchain-ai/langchain/issues/2828).
# Prefer `langchain_aws.BedrockLLM`; `langchain.llms.Bedrock` is deprecated.
from langchain.llms import Bedrock

from dotenv import load_dotenv

In [97]:
import pickle
import time
import datetime
from pprint import pprint

In [40]:
# Load environment variables from .env file
load_dotenv(override=True)

True

In [104]:
# Access environment variables
aws_access_key_id = os.getenv('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')
aws_session_token = os.getenv('AWS_SESSION_TOKEN')
aws_region = os.getenv('AWS_REGION')

In [42]:
# AWS Bedrock setup
def get_bedrock_client():
    """Initialize and return the Bedrock client."""
    bedrock_client = boto3.client(
        service_name="bedrock-runtime",
        region_name=aws_region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        aws_session_token=aws_session_token
    )
    return bedrock_client

bedrock_client = get_bedrock_client()
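
Before creating the client, it can help to confirm that the `.env` file actually supplied the credentials. The following is a minimal sketch, not part of the original notebook: the helper name `missing_env_vars` and the `REQUIRED_VARS` list are hypothetical, and it assumes `AWS_SESSION_TOKEN` is optional because it is only needed for temporary credentials.

```python
import os

# Hypothetical list of variables treated as mandatory for this notebook.
REQUIRED_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Toy mappings used purely for illustration (not real credentials).
complete = {"AWS_ACCESS_KEY_ID": "x", "AWS_SECRET_ACCESS_KEY": "y", "AWS_REGION": "us-east-1"}
partial = {"AWS_ACCESS_KEY_ID": "x"}

print(missing_env_vars(REQUIRED_VARS, complete))  # nothing missing
print(missing_env_vars(REQUIRED_VARS, partial))   # secret key and region missing
```

In the notebook, calling `missing_env_vars(REQUIRED_VARS)` right after `load_dotenv(...)` would surface a misconfigured `.env` before any Bedrock call is attempted.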

### Data Preparation for RAG

#### 🔧 RAG Data Preparation & Vector Store Creation

This section prepares data for the Retrieval-Augmented Generation (RAG) pipeline.  

It involves two main steps:

1. **Create embeddings** for each text chunk using an embedding model.  
2. **Store the embeddings** along with their corresponding text chunks and metadata in a **FAISS vector store** for efficient retrieval.

> ⚠️ **Note:** 
>  
> This step is computationally intensive and time-consuming, so it is run first.
>
> The resulting FAISS index is saved locally to avoid repeating this step in future runs.


In [43]:
# Helper Functions

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF file."""
    try:
        reader = PdfReader(pdf_path)
        text = ""
        for page in tqdm(reader.pages, desc="Extracting PDF text"):
            text += page.extract_text() + "\n"
        return text
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return ""

def split_text(text: str) -> List[Document]:
    """Split text into chunks with metadata."""
    try:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", ". ", " ", ""],
            length_function=len
        )

        chunks = text_splitter.create_documents([text])

        # Add metadata to track chunk positions
        for i, chunk in enumerate(chunks):
            chunk.metadata = {
                "chunk_id": i,
                "source": "Merck Manual",
            }

        # Return only after every chunk has been tagged
        return chunks
    except Exception as e:
        print(f"Error splitting text: {e}")
        return []
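
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch. It is an assumption-laden illustration: `sliding_chunks` is a hypothetical fixed-window splitter, whereas `RecursiveCharacterTextSplitter` splits on the listed separators and merges pieces, so its real chunks vary in length. The overlap idea is the same: neighbouring chunks share some characters so that a sentence cut at a boundary still appears whole in one of them.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, which mirrors what the 200-character overlap does at the scale of 1000-character chunks.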


> ⚠️ **Warning:**  
> The function below extracts text from a PDF file. This is a time-intensive operation.  
> To avoid repeated processing, the extracted results are saved in a `.pkl` (pickle) file.  
> **Do not run this cell multiple times unnecessarily.** Reuse the saved `.pkl` file instead.


In [16]:
# TODO: Add a check to see if the file exists, if it does, load it, if it doesn't, run the function and likewise assign the `raw_text` variable

# Step 0: Load and process the Merck Manual
# NOTE: HEAVY TASK
pdf_path = "merck_manual.pdf"  # Update with your actual file path
raw_text = extract_text_from_pdf(pdf_path)
print(f"Extracted {len(raw_text)} characters from PDF")

Extracting PDF text: 100%|██████████| 4114/4114 [03:08<00:00, 21.83it/s]

Extracted 13786695 characters from PDF





In [27]:
# Below are helper functions to save and load the raw text (As Backup)

RAW_TEXT_FILE_PATH = "raw_text_backup.pkl"
def save_raw_text(raw_text, file_path=RAW_TEXT_FILE_PATH):
    """
    Save the extracted raw text to a pickle file.

    Args:
        raw_text (str): The raw text extracted from the PDF
        file_path (str): Path where to save the text
    """
    with open(file_path, 'wb') as f:
        pickle.dump(raw_text, f)
    print(f"Raw text saved to {file_path} ({len(raw_text)} characters)")

def load_raw_text(file_path=RAW_TEXT_FILE_PATH):
    """
    Load the raw text from a pickle file.

    Args:
        file_path (str): Path from where to load the text

    Returns:
        str: The raw text, or None if file doesn't exist
    """
    if not os.path.exists(file_path):
        print(f"Raw text backup not found at {file_path}")
        return None

    with open(file_path, 'rb') as f:
        raw_text = pickle.load(f)
    print(f"Loaded raw text from {file_path} ({len(raw_text)} characters)")
    return raw_text
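
The TODO above (load the backup if it exists, otherwise run the extraction) can be sketched as a small load-or-compute helper. This is a minimal sketch rather than the notebook's own code: `load_or_compute` is a hypothetical name, and skipping the save for empty results is an assumption made because `extract_text_from_pdf` returns `""` on failure.

```python
import os
import pickle

def load_or_compute(cache_path, compute_fn):
    """Return the pickled value at cache_path, computing and saving it on first use."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    value = compute_fn()
    # Do not cache empty results, so a failed extraction is retried next time
    if value:
        with open(cache_path, "wb") as f:
            pickle.dump(value, f)
    return value

# Possible use in this notebook (names taken from the surrounding cells):
# raw_text = load_or_compute(RAW_TEXT_FILE_PATH, lambda: extract_text_from_pdf(pdf_path))
```

With this pattern the heavy extraction cell becomes safe to re-run, because the PDF is parsed only when no backup file is present.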

In [18]:
save_raw_text(raw_text)

Raw text saved to raw_text_backup.pkl (13786695 characters)


In [44]:
raw_text = load_raw_text()

Loaded raw text from raw_text_backup.pkl (13786695 characters)


In [45]:
# Step 1: Split text into chunks
chunks = split_text(raw_text)
print(f"Created {len(chunks)} text chunks")

Created 18212 text chunks


#### Embedding Generation

In [46]:
# constants
BATCH_SIZE = 10
CHECKPOINT_FILE = "embedding_progress.pkl"
STORE_PATH = "merck_manual_faiss_index"
# Named separately from the text-generation MODEL_ID defined later in the notebook,
# so reloading the vector store after that cell still uses the embedding model.
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"
EMBEDDING_DIMENSIONS = 512

In [47]:
def get_embeddings_model():
    '''embeddings model that is used to create embeddings'''
    return BedrockEmbeddings(
        client=bedrock_client,
        model_id=EMBEDDING_MODEL_ID,
        model_kwargs={"dimensions": EMBEDDING_DIMENSIONS}
    )

In [48]:
# Function to create embeddings with checkpointing

def create_embeddings_with_checkpoints(chunks, batch_size=BATCH_SIZE, checkpoint_file=CHECKPOINT_FILE):
    """
    Generate embeddings for text chunks with batching and checkpointing support.

    This function processes document chunks in batches to generate embeddings efficiently.
    It includes checkpoint functionality to save progress and resume from the last successful
    batch in case of interruption or failure.

    Args:
        chunks (List[Document]): List of Document objects containing text to embed
        batch_size (int): Number of chunks to process in each batch (default: BATCH_SIZE, i.e. 10)
        checkpoint_file (str): File path to save/load checkpoint data (default: "embedding_progress.pkl")

    Returns:
        tuple: (texts, embeddings, metadatas) where:
            - texts (List[str]): List of text content from chunks
            - embeddings (List[List[float]]): List of embedding vectors
            - metadatas (List[dict]): List of metadata dictionaries

    Example:
        >>> texts, embeddings, metadatas = create_embeddings_with_checkpoints(chunks)
        >>> vector_store = FAISS.from_embeddings(
        >>>     text_embeddings=list(zip(texts, embeddings)),
        >>>     embedding=embeddings_model,
        >>>     metadatas=metadatas
        >>> )
    """

    texts = [chunk.page_content for chunk in chunks]
    metadatas = [chunk.metadata for chunk in chunks]

    # Initialize embedding model
    embeddings_model = get_embeddings_model()

    # Process embeddings with checkpointing
    embeddings = []
    start_idx = 0

    # Load checkpoint if exists
    if os.path.exists(checkpoint_file):
        try:
            with open(checkpoint_file, 'rb') as f:
                checkpoint_data = pickle.load(f)
                embeddings = checkpoint_data['embeddings']
                start_idx = checkpoint_data['next_idx']
                print(f"Resuming from checkpoint at index {start_idx}/{len(texts)}")
        except Exception as e:
            print(f"Error loading checkpoint: {e}. Starting from beginning.")

    try:
        # Process in batches with progress bar
        for i in tqdm(range(start_idx, len(texts), batch_size),
                      desc="Generating embeddings",
                      # Ceiling division: exact batch count even when the remainder is zero
                      total=(len(texts) - start_idx + batch_size - 1) // batch_size):
            # Get current batch
            end_idx = min(i + batch_size, len(texts))
            batch = texts[i:end_idx]

            # Generate embeddings for batch
            batch_embeddings = embeddings_model.embed_documents(batch)
            embeddings.extend(batch_embeddings)

            # Save checkpoint after each batch
            temp_file = checkpoint_file + '.tmp'
            with open(temp_file, 'wb') as f:
                pickle.dump({
                    'embeddings': embeddings,
                    'next_idx': end_idx
                }, f)

            # Atomically replace the old checkpoint (ensures a valid checkpoint file always exists)
            os.replace(temp_file, checkpoint_file)

            # Optional: Add a small delay to avoid rate limiting
            time.sleep(0.1)

    except Exception as e:
        print(f"Error occurred: {e}")
        print(f"Progress saved at index {len(embeddings)}")
        # No need to re-raise, we've saved progress

    print(f"Completed embedding generation: {len(embeddings)}/{len(texts)} chunks")
    return texts, embeddings, metadatas

In [49]:
def process_and_store_embeddings(chunks, vector_store_path=STORE_PATH, batch_size=BATCH_SIZE):
    """
    Process document chunks and create a vector store, but only if it doesn't already exist.
    Verifies complete embedding generation before creating the vector store.

    Args:
        chunks (List[Document]): List of Document objects to process
        vector_store_path (str): Path where vector store will be saved
        batch_size (int): Batch size for embedding generation

    Returns:
        FAISS or None: The vector store if successful, None if embeddings are incomplete
    """
    # Check if vector store already exists
    if os.path.exists(vector_store_path):
        print(f"Vector store already exists at {vector_store_path}. Loading from disk...")

        # Initialize embedding model (needed for loading)
        embeddings_model = get_embeddings_model()

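        # Load existing vector store
        # (recent langchain_community releases require this explicit opt-in to unpickle index.pkl;
        #  drop the argument if your installed version does not accept it)
        vector_store = FAISS.load_local(
            vector_store_path,
            embeddings_model,
            allow_dangerous_deserialization=True
        )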
        print(f"Loaded vector store with {vector_store.index.ntotal} vectors")
        return vector_store

    else:
        print("Vector store not found. Creating new embeddings and vector store...")

        # Generate embeddings with checkpointing
        texts, embeddings, metadatas = create_embeddings_with_checkpoints(
            chunks,
            batch_size=batch_size,
            checkpoint_file=CHECKPOINT_FILE
        )

        # Verify embedding generation is complete
        if len(embeddings) < len(chunks):
            print(f"WARNING: Embedding generation is incomplete! Generated {len(embeddings)}/{len(chunks)} embeddings.")
            print("Please run the embedding generation process again to complete. (ie process_and_store_embeddings())")
            print("The progress has been saved and will resume from where it left off.")
            return None

        print(f"Embedding generation complete: {len(embeddings)}/{len(chunks)} embeddings created.")

        # Initialize embedding model
        embeddings_model = get_embeddings_model()

        # Create vector store from embeddings
        vector_store = FAISS.from_embeddings(
            text_embeddings=list(zip(texts, embeddings)),
            embedding=embeddings_model,
            metadatas=metadatas
        )

        # Save the vector store
        vector_store.save_local(vector_store_path)
        print(f"Created and saved vector store with {len(embeddings)} vectors")

        # Clean up the checkpoint file
        if os.path.exists(CHECKPOINT_FILE):
            os.remove(CHECKPOINT_FILE)
            print("Removed checkpoint file")

        return vector_store

In [50]:
# Step2 :- Create vector store (from Chunks)
try:
    vector_store = process_and_store_embeddings(chunks)
    if vector_store is None:
        print("Vector store creation failed. Please run the process again to complete embedding generation.")
    else:
        print("Vector store ready for use in RAG pipeline.")
except Exception as e:
    print(f"Error occurred during vector store creation: {e}")


Vector store not found. Creating new embeddings and vector store...


Generating embeddings: 100%|██████████| 1822/1822 [1:48:53<00:00,  3.59s/it]  


Completed embedding generation: 18212/18212 chunks
Embedding generation complete: 18212/18212 embeddings created.
Created and saved vector store with 18212 vectors
Removed checkpoint file
Vector store ready for use in RAG pipeline.
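
The figures in this output can be cross-checked with a little arithmetic. The sketch below only uses numbers printed above (18212 chunks, a batch size of 10, and the 1:48:53 elapsed time from the progress bar); the variable names are illustrative.

```python
import math

num_chunks = 18212                            # "Created 18212 text chunks"
batch_size = 10                               # BATCH_SIZE
elapsed_seconds = 1 * 3600 + 48 * 60 + 53     # 1:48:53 from the progress bar

# Number of embedding batches is the ceiling of chunks / batch size
num_batches = math.ceil(num_chunks / batch_size)
seconds_per_batch = elapsed_seconds / num_batches

print(num_batches)                   # 1822
print(round(seconds_per_batch, 2))   # 3.59
```

Both values match the progress bar (`1822/1822`, `3.59s/it`). Since each batch takes about 3.6 seconds, the 0.1-second `time.sleep` is only a small share of the total run time.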


In [53]:
def save_embeddings_backup(texts, embeddings, metadatas, file_path="chunks_embeddings.pkl"):
    """
    Save the generated embeddings data to a backup file.

    Args:
        texts (List[str]): List of text chunks
        embeddings (List[List[float]]): List of embedding vectors
        metadatas (List[dict]): List of metadata dictionaries
        file_path (str): Path where to save the backup
    """
    backup_data = {
        "texts": texts,
        "embeddings": embeddings,
        "metadatas": metadatas,
        "count": len(embeddings),
        "timestamp": datetime.datetime.now().isoformat()
    }

    with open(file_path, 'wb') as f:
        pickle.dump(backup_data, f)

    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
    print(f"Embeddings backup saved to {file_path}")
    print(f"Saved {len(embeddings)} embeddings ({file_size_mb:.2f} MB)")

def load_embeddings_backup(file_path="chunks_embeddings.pkl"):
    """
    Load embeddings data from a backup file.

    Args:
        file_path (str): Path from where to load the backup

    Returns:
        tuple: (texts, embeddings, metadatas) or (None, None, None) if file doesn't exist
    """
    if not os.path.exists(file_path):
        print(f"Embeddings backup not found at {file_path}")
        return None, None, None

    with open(file_path, 'rb') as f:
        backup_data = pickle.load(f)

    texts = backup_data["texts"]
    embeddings = backup_data["embeddings"]
    metadatas = backup_data["metadatas"]

    print(f"Loaded embeddings backup from {file_path}")
    print(f"Loaded {len(embeddings)} embeddings (created on {backup_data['timestamp']})")

    return texts, embeddings, metadatas

# Usage example - after creating embeddings
# save_embeddings_backup(texts, embeddings, metadatas)

# Usage example - to load embeddings instead of generating them
# texts, embeddings, metadatas = load_embeddings_backup()
# if embeddings is not None:
#     # Create vector store from loaded embeddings
#     vector_store = FAISS.from_embeddings(
#         text_embeddings=list(zip(texts, embeddings)),
#         embedding=embeddings_model,
#         metadatas=metadatas
#     )

In [None]:
def extract_embeddings_from_faiss(vector_store_path="merck_manual_faiss_index"):
    """
    Extract embeddings, texts, and metadata from a saved FAISS vector store.

    Args:
        vector_store_path (str): Path to the saved FAISS vector store

    Returns:
        tuple: (texts, embeddings, metadatas) or (None, None, None) if extraction fails
    """
    try:
        # Initialize embedding model (needed for loading); dimensions match the stored index
        embeddings_model = BedrockEmbeddings(
            client=bedrock_client,
            model_id="amazon.titan-embed-text-v2:0",
            model_kwargs={"dimensions": EMBEDDING_DIMENSIONS}
        )

        # Load the vector store
        # (recent langchain_community releases require this explicit opt-in to unpickle index.pkl;
        #  drop the argument if your installed version does not accept it)
        vector_store = FAISS.load_local(
            vector_store_path,
            embeddings_model,
            allow_dangerous_deserialization=True
        )

        # Extract texts and metadata.
        # index.pkl stores a (docstore, index_to_docstore_id) tuple, so use the loaded
        # store's own mapping and walk it in index order; this keeps each text aligned
        # with its embedding row.
        texts = []
        metadatas = []
        for position in range(vector_store.index.ntotal):
            doc_id = vector_store.index_to_docstore_id[position]
            doc = vector_store.docstore.search(doc_id)
            texts.append(doc.page_content)
            metadatas.append(doc.metadata)

        # Extract embeddings from the FAISS index
        # This gets the raw numpy array of all embeddings
        embeddings = vector_store.index.reconstruct_n(0, vector_store.index.ntotal)

        print(f"Extracted {len(texts)} texts, {len(embeddings)} embeddings, and {len(metadatas)} metadata entries")

        # Convert numpy arrays to lists for easier serialization
        embeddings_list = [emb.tolist() for emb in embeddings]

        return texts, embeddings_list, metadatas

    except Exception as e:
        print(f"Error extracting data from FAISS index: {e}")
        return None, None, None

# Usage
# texts, embeddings, metadatas = extract_embeddings_from_faiss()
# if embeddings is not None:
#     # Save as backup
#     save_embeddings_backup(texts, embeddings, metadatas)

### Basic LLM Implementation


In [66]:
def print_results(llm_response_map):
    """Print the results of the LLM responses."""
    for i, (_, data) in enumerate(llm_response_map.items(), start=1):
        print("=" * 60)
        print(f"Q{i}: {data['question']}\n")
        print(f"A{i}: {data['response']}\n")

In [54]:
# constants
MODEL_ID = "amazon.titan-text-lite-v1"

> /var/folders/50/fm7gl9594dsfr5pgv64fl4fh0000gn/T/ipykernel_46379/4153466583.py:20: LangChainDeprecationWarning: The class `Bedrock` was deprecated in LangChain 0.0.34 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-aws package and should be used instead. To use it run `pip install -U :class:`~langchain-aws` and import as `from :class:`~langchain_aws import BedrockLLM``.
  self.llm = Bedrock(

Because of this warning, the cells below use `BedrockLLM` from `langchain_aws` instead.

In [74]:
from langchain_aws import BedrockLLM

class LLMHelper:
    """Helper class to initialize and interact with a Bedrock LLM."""

    def __init__(self, model_id=MODEL_ID, **model_kwargs):
        """Initialize the LLM with specified model ID and parameters."""
        default_params = {
            # Low temperature keeps responses factual, stable, and reliable, and
            # discourages hallucination (0.2 was suggested; 0.3 is used here).
            "temperature": 0.3,
            # Medical explanations often need comprehensive context. 1024 tokens
            # (approximately 750-800 words) leaves room to explain concepts, protocols,
            # and treatments fully. If length is a concern, ask for concision in the
            # prompt rather than lowering this cap.
            "maxTokenCount": 1024,
            # Allows some diversity in wording, but not too much: responses stay
            # on-topic while permitting natural variation in phrasing.
            "topP": 0.88,
        }
        params = {**default_params, **model_kwargs}

        self.llm = BedrockLLM(
            client=bedrock_client,
            model_id=model_id,
            model_kwargs=params,
        )

    def generate_response(self, question: str) -> str:
        """Generate a response using the default prompt."""

        prompt = f"""You are a helpful medical assistant with expertise in healthcare.
Please answer the following medical question accurately and comprehensively:

Question: {question}

Answer:"""
        return self.llm.invoke(prompt)

    def generate_response_for_prompt(self, prompt: str) -> str:
        """Generate a response to a custom prompt."""
        return self.llm.invoke(prompt)
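
The line `params = {**default_params, **model_kwargs}` relies on dictionary unpacking order: keys from the later dictionary win. The sketch below isolates that behaviour; `merge_params` is a hypothetical helper written only for this illustration, and the override values are the ones later used in the prompt-variation cell.

```python
default_params = {"temperature": 0.3, "maxTokenCount": 1024, "topP": 0.88}

def merge_params(defaults, **overrides):
    """Return a new dict in which overrides replace matching default keys."""
    return {**defaults, **overrides}

merged = merge_params(default_params, temperature=0.7, maxTokenCount=1500)
print(merged)
# {'temperature': 0.7, 'maxTokenCount': 1500, 'topP': 0.88}
```

One consequence is that a call such as `LLMHelper(temperature=0.7, maxTokenCount=1500)` still keeps `topP` at 0.88, because only the keys that are passed get overridden.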


In [61]:
# Initialize the LLM with default parameters
llm_manager = LLMHelper()


⚠️ Bedrock Deprecation Warning in LangChain

You may see this warning when using `langchain.llms.Bedrock`:

```
LangChainDeprecationWarning: The class `Bedrock` was deprecated in LangChain 0.0.34 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-aws package and should be used instead. To use it run `pip install -U :class:`~langchain-aws` and import as `from :class:`~langchain_aws import BedrockLLM``.
```

👀
`BedrockLLM` is essentially the deprecated `Bedrock` class under a new name in the `langchain_aws` package; the functionality is the same.


In [62]:
# Test questions from the problem statement
questions = [
    "What is the protocol for managing sepsis in a critical care unit?",
    "What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
]

In [63]:
# Generate responses for each question
llm_only_responses = {}
for i, question in enumerate(questions):
    print(f"Generating response for question {i+1}...")
    response = llm_manager.generate_response(question)
    llm_only_responses[f"Question {i+1}"] = {
        "question": question,
        "response": response
    }
    print(f"Response: {response[:100]}...\n")

Generating response for question 1...
Response: 
Septic shock is a life-threatening medical emergency that requires immediate treatment. The protoco...

Generating response for question 2...
Response: 
Appendicitis is a common condition that affects the appendix, a small, tube-like structure at the b...

Generating response for question 3...
Response: 
Here are some effective treatments or solutions for addressing sudden patchy hair loss, commonly se...

Generating response for question 4...
Response: 
Here are some treatments that may be recommended for a person who has sustained a physical injury t...

Generating response for question 5...
Response: 
Here are the necessary precautions and treatment steps for a person who has fractured their leg dur...



In [64]:
# ! Run only when you want to generate responses to be saved in a file for future references

# Save responses to file
with open("llm_only_responses.json", "w") as f:
    json.dump(llm_only_responses, f, indent=2)

👀 Observations 

It took around 3 minutes to generate responses to all 5 questions.

1. **Sepsis**: A severe infection that can spread throughout the body, requiring immediate medical care. Treatment involves antibiotics, fluids, and possibly surgery or intensive care.

2. **Appendicitis**: An inflamed appendix causing severe abdominal pain. The usual treatment is antibiotics and surgery to remove the appendix if necessary.

3. **Patchy Hair Loss**: Sudden hair loss might be caused by various factors like genetics or health issues. Treatments can include medication, hair transplants, or lifestyle changes, depending on the cause.

4. **Brain Injury**: A physical injury to the brain needs urgent care and may require therapies (physical, speech, etc.) and medications. In severe cases, surgery might be necessary, followed by rehab for recovery.

5. **Leg Fracture**: A broken leg requires immediate care like pain management and immobilization, with possible surgery. Rehabilitation and follow-up care are important for recovery.

These are all serious health conditions, but with proper treatment and quick medical attention, many can be managed effectively.

### Prompt Engineering for LLM

In [75]:
def test_prompt_variations(question: str):
    """Test different prompt variations and LLM parameters."""
    prompt_variations = [
        # Variation 1: Basic prompt
        {
            "template": "Answer the following medical question: {question}",
            "params": {"temperature": 0.7, "maxTokenCount": 1024}
        },
        # Variation 2: Detailed context
        {
            "template": """You are a medical expert with access to the Merck Manual.
            Provide a detailed, accurate, and comprehensive answer to the following question.
            Include relevant medical terminology and procedures where appropriate.

            Question: {question}

            Answer:""",
            "params": {"temperature": 0.5, "maxTokenCount": 1500}
        },
        # Variation 3: Step-by-step reasoning
        {
            "template": """As a healthcare professional, answer the following medical question.
            First, identify the key medical concepts in the question.
            Then, provide a step-by-step explanation with relevant medical information.
            Finally, summarize your answer concisely.

            Question: {question}

            Answer:""",
            "params": {"temperature": 0.3, "maxTokenCount": 1200}
        },
        # Variation 4: Evidence-based approach
        {
            "template": """Based on standard medical guidelines and evidence-based practice:

            Question: {question}

            Provide a comprehensive answer that includes:
            1. Definition and background
            2. Key diagnostic criteria or symptoms
            3. Standard treatment protocols
            4. Important considerations for healthcare providers

            Answer:""",
            "params": {"temperature": 0.4, "maxTokenCount": 1300}
        },
        # Variation 5: Patient-friendly explanation
        {
            "template": """Provide a clear, accurate medical answer that would be appropriate for both
            healthcare professionals and informed patients. Use plain language where possible
            while maintaining medical accuracy.

            Question: {question}

            Answer:""",
            "params": {"temperature": 0.6, "maxTokenCount": 1100}
        }
    ]

    print(f"Testing prompt variations for question: {question}")

    results = []
    for i, variation in enumerate(prompt_variations):
        print(f"Testing prompt variation {i+1}...")

        # Format the prompt
        formatted_prompt = variation["template"].format(question=question)

        # Initialize LLM with specific parameters
        test_llm = LLMHelper(**variation["params"])

        # Generate response (with custom prompt ie not using the default one)
        response = test_llm.generate_response_for_prompt(formatted_prompt)

        print(f"Response {i+1}: {response[:125]}...\n")

        results.append({
            "variation": i+1,
            "prompt": variation["template"],
            "parameters": variation["params"],
            "response": response
        })

    return results
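
One detail worth noting about the templates above: variations 2 to 5 are triple-quoted strings whose continuation lines are indented in the source, so that indentation becomes part of the prompt sent to the model. The sketch below shows one way to strip it. It is an assumption that this matters for output quality, and `normalize_template` is a hypothetical helper, not part of the notebook. `textwrap.dedent` would not help here because the first line of each template has no indentation, so there is no common prefix to remove.

```python
def normalize_template(template):
    """Strip leading and trailing whitespace from every line of a prompt template."""
    return "\n".join(line.strip() for line in template.strip().splitlines())

# A shortened template written the same way as the variations above
indented_template = """You are a medical expert.
            Question: {question}

            Answer:"""

prompt = normalize_template(indented_template).format(question="What is sepsis?")
print(prompt)
```

Applying the helper before `.format(question=question)` would leave the wording of each variation unchanged while removing the stray leading spaces.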

In [76]:
# Test prompt variations for each question
prompt_engineering_results = {}

for i, question in enumerate(questions):
    print(f"\nTesting prompt variations for question {i+1}...")

    results = test_prompt_variations(question)
    prompt_engineering_results[f"Question {i+1}"] = {
        "question": question,
        "variations": results
    }


Testing prompt variations for question 1...
Testing prompt variations for question: What is the protocol for managing sepsis in a critical care unit?
Testing prompt variation 1...
Response 1: 
Sepsis is a serious medical condition that occurs when the body's response to an infection injures its own tissues and organ...

Testing prompt variation 2...
Response 2:  The protocol for managing sepsis in a critical care unit involves a combination of early detection, rapid diagnosis, and agg...

Testing prompt variation 3...
Response 3: 
            The protocol for managing sepsis in a critical care unit involves early identification, rapid assessment, and pr...

Testing prompt variation 4...
Response 4: 
Sepsis is a serious medical condition that requires immediate medical attention. It is a life-threatening condition that occ...

Testing prompt variation 5...
Response 5:  Sepsis is a life-threatening condition that requires immediate medical attention. In a critical care unit, sepsis mana

In [77]:
# Save results to file
with open("prompt_engineering_results.json", "w") as f:
    json.dump(prompt_engineering_results, f, indent=2)

👀 **Observations**

It took around 5 minutes to run every prompt variation for all questions.

**Prompt Engineering Results - Quick Observations**

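1. **Response Length**: Higher temperature settings (0.6-0.7) produced longer answers; lower temperatures (0.3-0.4) gave more concise responses. The variations also differ in `maxTokenCount`, so length cannot be attributed to temperature alone.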

2. **Structure Matters**: Prompts that asked for step-by-step explanations produced more organized answers.

3. **Role-Playing Works**: When the AI was told to be a "medical expert," it used more technical language.

4. **Best Performers**:
   - For clear structure: Variation 3 (temp 0.3)
   - For technical detail: Variation 2 (temp 0.5)
   - For easy-to-understand language: Variation 5 (temp 0.6)

5. **Content Quality**: The core medical information stayed consistent across all variations, just presented differently.
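
The length observation can be checked against the saved results rather than by eye. The sketch below assumes the list shape returned by `test_prompt_variations` (dicts with `variation`, `parameters`, and `response` keys); the helper name `summarize_lengths` and the sample data are hypothetical and only illustrate the idea.

```python
def summarize_lengths(results):
    """Return (variation, temperature, maxTokenCount, word_count) per result."""
    summary = []
    for item in results:
        params = item["parameters"]
        word_count = len(item["response"].split())
        summary.append(
            (item["variation"], params["temperature"], params["maxTokenCount"], word_count)
        )
    return summary


# Hypothetical sample in the same shape as one question's `results` list
sample_results = [
    {
        "variation": 1,
        "parameters": {"temperature": 0.3, "maxTokenCount": 1200},
        "response": "Give fluids early.",
    },
    {
        "variation": 2,
        "parameters": {"temperature": 0.6, "maxTokenCount": 1100},
        "response": "Sepsis needs fast recognition and treatment.",
    },
]

length_summary = summarize_lengths(sample_results)
for variation, temperature, max_tokens, words in length_summary:
    print(f"variation {variation}: temp={temperature}, max={max_tokens}, words={words}")
```

Listing the token limit next to the temperature makes it easier to see which setting a length difference might come from.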


### RAG Implementation

In [82]:
from langchain_aws import BedrockLLM

# TODO: create a custom helper method to get llm instance (maybe via langchain_aws or langchain.llms.Bedrock)

def initialize_llm_for_rag(model_id=MODEL_ID, **model_kwargs):
    """Initialize the Bedrock LLM with specified parameters."""
    default_params = {
        # ! even lower temperature because the model is now grounding its responses in retrieved text rather than generating from its parameters
        "temperature": 0.25,
        # *increase max tokens slightly because RAG responses often need to synthesize information from multiple retrieved chunks. The additional tokens allow for more comprehensive integration of the retrieved medical information.
        "maxTokenCount": 1500,
        # ? a slightly lower top_p with RAG because the retrieved context already provides the necessary information. A lower top_p helps the model stay more focused on the most probable tokens derived from the retrieved medical text.
        "topP": 0.8,
    }

    # Update default parameters with any provided ones
    params = {**default_params, **model_kwargs}

    #llm = Bedrock(
    llm = BedrockLLM(
        client=bedrock_client,
        model_id=model_id,
        model_kwargs=params
    )

    return llm

In [81]:
def setup_rag_pipeline(vector_store, llm):
    """Set up a RAG pipeline with the vector store and LLM."""

    # Create retriever
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5}
    )

    # Create RAG prompt template
    rag_prompt_template = """You are a medical assistant with expertise in healthcare.
    Use the following context from the Merck Manual to answer the question accurately.
    If the context doesn't contain the answer, say so and provide general medical knowledge.

    Context:
    {context}

    Question: {question}

    Answer:"""

    rag_prompt = PromptTemplate.from_template(rag_prompt_template)

    # Create RAG chain
    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | rag_prompt
        | llm
        | StrOutputParser()
    )

    return rag_chain

In [83]:
# Set up RAG pipeline
llm = initialize_llm_for_rag()
rag_chain = setup_rag_pipeline(vector_store, llm)

In [84]:
# Generate RAG responses for each question
rag_responses = {}
for i, question in enumerate(questions):
    print(f"Generating RAG response for question {i+1}...")
    response = rag_chain.invoke(question)
    rag_responses[f"Question {i+1}"] = {
        "question": question,
        "response": response
    }
    print(f"Response: {response[:120]}...\n")

Generating RAG response for question 1...
Response:  Fluid resuscitation with 0.9% saline should be given until CVP reaches 8 mm Hg (10 cm H2O) or PAOP reaches 12 to 15 mm ...

Generating RAG response for question 2...
Response:  Common symptoms of appendicitis include abdominal pain, nausea, vomiting, anorexia, and low-grade fever. While appendic...

Generating RAG response for question 3...
Response:  The effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on ...

Generating RAG response for question 4...
Response:  The cornerstone of management for all patients is maintenance of adequate ventilation, oxygenation, and brain perfusion...

Generating RAG response for question 5...
Response:  The necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what s...



In [85]:
# Save responses to file
with open("rag_responses.json", "w") as f:
    json.dump(rag_responses, f, indent=2)

👀 **Takeaways**

This RAG setup combines deep medical knowledge with AI's language ability to generate helpful, factual, and structured answers. It mirrors how a trained doctor might respond—often detailed and cautious, but sometimes abrupt or overly technical.

> **It took only 35 seconds to generate the answers for all 5 questions** (i.e., pretty fast 🤔)

**Summary of RAG Responses**

This file contains 5 medical questions with their corresponding AI-generated answers. The quality and completeness of these responses vary significantly:

1. **Sepsis management**: Very detailed response with specific treatments, medications, and dosages.

2. **Appendicitis**: Extremely brief response that contradicts itself (says it "can be cured via medicine" but then immediately states it "is a surgical condition").

3. **Hair loss**: Lists many possible causes but doesn't clearly explain treatments despite the question asking for both.

4. **Brain injury**: Very brief response focusing only on basic management principles.

5. **Leg fracture**: Brief response with basic care instructions but missing important emergency first aid information relevant to a hiking scenario.

**Observations**

- **Inconsistent detail level**: Some answers are comprehensive while others are minimal.
- **Accuracy issues**: The appendicitis answer contains contradictory information.
- **Missing information**: Several responses don't fully address the questions asked.
- **Format inconsistency**: Some responses use bullet points while others are paragraphs.
- **Context awareness**: The leg fracture response doesn't address the hiking emergency context.

These inconsistencies suggest the RAG system might need improvement in how it retrieves and synthesizes medical information to provide more consistent, accurate, and complete responses.
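
One candidate improvement concerns how retrieved chunks reach the prompt. In `setup_rag_pipeline`, the retriever output is piped straight into `{context}`, so the prompt appears to receive the list of `Document` objects rather than clean text. A minimal sketch of a formatting step follows; `format_docs` is a hypothetical helper, and the `Doc` dataclass only stands in for LangChain's `Document` so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is used here."""
    page_content: str


def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into one context string."""
    return separator.join(doc.page_content.strip() for doc in docs)


docs = [
    Doc("Sepsis requires early fluids. "),
    Doc("Give broad-spectrum antibiotics."),
]
context = format_docs(docs)
print(context)
```

In the chain, such a step would sit between the retriever and the prompt (for example `retriever | format_docs`, assuming the same pipe syntax the notebook already uses).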

#### RAG Optimization

In [86]:
def create_vector_store(chunks: List[Document]):
    """Create a vector store from document chunks."""
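    # Initialize the Titan Embeddings model.
    # NOTE: MODEL_ID is also the default text-generation model in
    # initialize_llm_for_rag; verify that it holds an embeddings model ID here.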
    embeddings = BedrockEmbeddings(
        client=bedrock_client,
        model_id=MODEL_ID
    )

    # Create vector store
    vector_store = FAISS.from_documents(
        documents=chunks,
        embedding=embeddings
    )

    return vector_store

In [88]:
def optimize_rag_pipeline(vector_store, question):
    """Test different RAG configurations for a question without re-chunking.

    Why? Re-chunking (i.e., changing the chunk size and overlap) is costly because
    every chunk has to be re-embedded, so it is avoided where possible.
    """

    # Configuration variations to test (only retriever_k and llm_params)
    configurations = [
        # Variation 1: Baseline configuration (temperature 0.5, not the 0.25 default of initialize_llm_for_rag)
        {
            "retriever_k": 5,
            "llm_params": {"temperature": 0.5, "maxTokenCount": 1500}
        },
        # Variation 2: More retrieval context
        {
            "retriever_k": 8,
            "llm_params": {"temperature": 0.4, "maxTokenCount": 1500}
        },
        # Variation 3: Less retrieval context
        {
            "retriever_k": 3,
            "llm_params": {"temperature": 0.5, "maxTokenCount": 1800}
        },
        # Variation 4: More creative generation
        {
            "retriever_k": 5,
            "llm_params": {"temperature": 0.7, "maxTokenCount": 1500}
        },
        # Variation 5: More precise generation
        {
            "retriever_k": 5,
            "llm_params": {"temperature": 0.2, "maxTokenCount": 1500}
        }
    ]

    print(f"Testing RAG configurations for question: {question} --->")

    results = []
    for i, config in enumerate(configurations):
        print(f"Testing RAG configuration {i+1}...")

        # Create retriever with specific k value
        test_retriever = vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": config["retriever_k"]}
        )

        # Initialize LLM with specific parameters
        test_llm = initialize_llm_for_rag(**config["llm_params"])

        # Create RAG prompt
        rag_prompt_template = """You are a medical assistant with expertise in healthcare.
        Use the following context from the Merck Manual to answer the question accurately.
        If the context doesn't contain the answer, say so and provide general medical knowledge.

        Context:
        {context}

        Question: {question}

        Answer:"""

        rag_prompt = PromptTemplate.from_template(rag_prompt_template)

        # Create RAG chain
        test_rag_chain = (
            {"context": test_retriever, "question": RunnablePassthrough()}
            | rag_prompt
            | test_llm
            | StrOutputParser()
        )

        # Generate response
        response = test_rag_chain.invoke(question)

        print(f"Response {i+1}: {response[:120]}...\n")

        results.append({
            "configuration": i+1,
            "settings": config,
            "response": response
        })

    return results

In [90]:
# Test RAG optimization for each question
rag_optimization_results = {}
for i, question in enumerate(questions):
    print(f"\nOptimizing RAG for question {i+1}...")
    results = optimize_rag_pipeline(vector_store, question)
    rag_optimization_results[f"Question {i+1}"] = {
        "question": question,
        "configurations": results
    }


Optimizing RAG for question 1...
Testing RAG configurations for question: What is the protocol for managing sepsis in a critical care unit? --->
Testing RAG configuration 1...
Response 1:  The first step is to keep the patient warm. Then, hemorrhage is controlled, airway and ventilation are checked, and res...

Testing RAG configuration 2...
Response 2:  Fluid resuscitation with 0.9% saline should be given until CVP reaches 8 mm Hg (10 cm H2O) or PAOP reaches 12 to 15 mm ...

Testing RAG configuration 3...
Response 3:  Fluid resuscitation, broad-spectrum antibiotics, drainage of abscesses, and normalization of blood glucose levels....

Testing RAG configuration 4...
Response 4:  Fluid resuscitation with 0.9% saline should be given until CVP reaches 8 mm Hg (10 cm H2O) or PAOP reaches 12 to 15 mm ...

Testing RAG configuration 5...
Response 5:  Fluid resuscitation with 0.9% saline should be given until CVP reaches 8 mm Hg (10 cm H2O) or PAOP reaches 12 to 15 mm ...


Optimizing RAG for

In [91]:
# Save results to file
with open("rag_optimization_results.json", "w") as f:
    json.dump(rag_optimization_results, f, indent=2)

🧐 **Summary** 

This experiment tested different configurations for a medical question-answering system, varying two key parameters:

- **Retriever_k**: The number of documents retrieved (3-8)
- **Temperature**: Controls randomness in responses (0.2-0.7)

**NOTE**:
- The run took around **3 minutes**.
- It finished in less time than the standalone LLM responses (i.e., without tuning).
- This suggests that providing context somehow helps the LLM produce responses faster, though the shorter RAG answers may also account for the difference.

**Key Findings**:

1. **More retrieval isn't always better**: Configuration 5 (k=5, temp=0.2) consistently produced the most detailed and accurate responses.

2. **Lower temperature (0.2-0.4)** generally produced more comprehensive and factual answers compared to higher temperatures.

3. **Response quality varied dramatically** across questions - some received detailed answers while others got minimal or incorrect information.

4. **System limitations**: For some medical questions (particularly appendicitis), most configurations failed completely.

5. **Best balance**: The optimal configuration appears to be moderate retrieval (k=5) with low temperature (0.2) for medical information retrieval.

This suggests that fine-tuning these parameters is crucial for reliable medical information retrieval, with quality being more important than quantity of retrieved documents.
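
The timing notes above are wall-clock impressions. A small helper can record the elapsed time of each call so that configurations are compared on equal terms. This is a sketch only: `timed_call` is a hypothetical name, and `fake_chain` stands in for something like `test_rag_chain.invoke`.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_chain(question):
    """Placeholder for a real chain call; sleeps briefly to simulate latency."""
    time.sleep(0.01)
    return f"answer to: {question}"


answer, seconds = timed_call(fake_chain, "What is sepsis?")
print(f"{seconds:.3f}s -> {answer}")
```

Storing `seconds` alongside each entry in `results` would let latency be compared per configuration and per question instead of per run.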

### Evaluation Framework

In [99]:
def print_evaluation_results(result_map):
    pprint(result_map)

In [92]:
from langchain_aws import BedrockLLM

# TODO: create a custom helper method to get llm instance (maybe via langchain_aws or langchain.llms.Bedrock)

def initialize_llm_for_eval(model_id=MODEL_ID, **model_kwargs):
    """Initialize the Bedrock LLM with specified parameters."""
    default_params = {
        # ! low temperature so that repeated evaluations produce consistent ratings
        "temperature": 0.2,
        # * fewer max tokens than the RAG LLM: the evaluator only returns a rating and a brief explanation
        "maxTokenCount": 1000,
    }

    # Update default parameters with any provided ones
    params = {**default_params, **model_kwargs}

    #llm = Bedrock(
    llm = BedrockLLM(
        client=bedrock_client,
        model_id=model_id,
        model_kwargs=params
    )

    return llm

In [93]:
def evaluate_groundedness(question, response, context=None):
    """Evaluate the groundedness of a response."""

    if context:
        evaluation_prompt = f"""You are an objective evaluator assessing the groundedness of a medical response.

        Question: {question}

        Response to evaluate: {response}

        Context from source material: {context}

        Evaluate the groundedness of the response on a scale of 1-10, where:
        1 = Not grounded at all, contains information contradicting the source material
        5 = Partially grounded, some information is accurate but contains unsupported claims
        10 = Completely grounded, all information is supported by the source material

        Provide your rating and a brief explanation of your assessment.
        """
    else:
        evaluation_prompt = f"""You are an objective evaluator assessing the groundedness of a medical response.

        Question: {question}

        Response to evaluate: {response}

        Evaluate the groundedness of the response on a scale of 1-10, where:
        1 = Not grounded at all, contains information that contradicts medical knowledge
        5 = Partially grounded, some information is accurate but contains questionable claims
        10 = Completely grounded, all information aligns with established medical knowledge

        Provide your rating and a brief explanation of your assessment.
        """

    # Use a lower temperature for evaluation to get more consistent results
    eval_llm = initialize_llm_for_eval()
    evaluation = eval_llm.invoke(evaluation_prompt)

    return evaluation

In [94]:
def evaluate_relevance(question, response):
    """Evaluate the relevance of a response to the question."""

    evaluation_prompt = f"""You are an objective evaluator assessing the relevance of a medical response.

    Question: {question}

    Response to evaluate: {response}

    Evaluate the relevance of the response on a scale of 1-10, where:
    1 = Not relevant at all, does not address the question
    5 = Partially relevant, addresses some aspects of the question but misses key points
    10 = Completely relevant, directly and comprehensively addresses all aspects of the question

    Provide your rating and a brief explanation of your assessment.
    """

    # Use a lower temperature for evaluation to get more consistent results
    eval_llm = initialize_llm_for_eval()
    evaluation = eval_llm.invoke(evaluation_prompt)

    return evaluation

In [95]:
# Evaluate all responses
evaluation_results = {
    "llm_only": {},
    "rag": {}
}

In [96]:
# Evaluate LLM-only responses
for question_key, data in llm_only_responses.items():
    question = data["question"]
    response = data["response"]

    groundedness = evaluate_groundedness(question, response)
    relevance = evaluate_relevance(question, response)

    evaluation_results["llm_only"][question_key] = {
        "groundedness": groundedness,
        "relevance": relevance
    }

In [100]:
print("LLM-only Evaluation Results: \n")
print_evaluation_results(evaluation_results["llm_only"])

LLM-only Evaluation Results: 

{'Question 1': {'groundedness': '10 - Completely grounded, all information '
                                'aligns with established medical knowledge. '
                                'Sepsis is a life-threatening medical '
                                'emergency that requires immediate treatment. '
                                'The protocol for managing sepsis in a '
                                'critical care unit involves a combination of '
                                'interventions, including rapid assessment and '
                                'diagnosis, fluid resuscitation, antibiotics, '
                                'vasopressors, inotropes, mechanical '
                                'ventilation, nutrition, pain management, '
                                'fluid restriction, and blood transfusion. The '
                                'healthcare team should work closely with the '
                                "patient

In [101]:
# Evaluate RAG responses
print('Evaluating RAG responses...')
# Build the retriever once, with the same settings as the RAG chain (similarity, k=5),
# so the evaluator sees the same context the chain retrieved
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})

for question_key, data in rag_responses.items():
    question = data["question"]
    response = data["response"]

    context_docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in context_docs])

    groundedness = evaluate_groundedness(question, response, context)
    relevance = evaluate_relevance(question, response)

    evaluation_results["rag"][question_key] = {
        "groundedness": groundedness,
        "relevance": relevance
    }
print("Evaluation Completed !!")

Evaluating RAG responses...
Evaluation Completed !!


In [102]:
print_evaluation_results(evaluation_results["rag"])

{'Question 1': {'groundedness': '10 - Completely grounded, all information is '
                                'supported by the source material.',
                'relevance': '10 - Completely relevant, directly and '
                             'comprehensively addresses all aspects of the '
                             'question.'},
 'Question 2': {'groundedness': '10 - Completely grounded, all information is '
                                'supported by the source material. The '
                                'response accurately describes the common '
                                'symptoms of appendicitis, including abdominal '
                                'pain, nausea, vomiting, anorexia, and '
                                'low-grade fever, and the surgical procedure '
                                'required to treat it. The response also '
                                'explains that appendicitis is a surgical '
                                'condition that

In [103]:
with open("evaluation_results.json", "w") as f:
    json.dump(evaluation_results, f, indent=2)
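
The evaluator returns free text, which makes LLM-only and RAG scores hard to compare side by side. The replies shown above begin with the rating (for example `10 - Completely grounded...`), so a numeric score can be pulled out with a small parser. This is a sketch under that assumption: `extract_rating` and `mean_rating` are hypothetical helpers, and a reply that does not start with a number from 1 to 10 yields `None` rather than a guess.

```python
import re
from statistics import mean

# Matches a leading rating from 1 to 10, allowing leading whitespace
_RATING = re.compile(r"\s*(10|[1-9])\b")


def extract_rating(evaluation_text):
    """Return the leading 1-10 rating as an int, or None if absent."""
    match = _RATING.match(evaluation_text)
    return int(match.group(1)) if match else None


def mean_rating(results, metric):
    """Average the parseable ratings for one metric across questions."""
    scores = [extract_rating(entry[metric]) for entry in results.values()]
    scores = [score for score in scores if score is not None]
    return mean(scores) if scores else None


# Hypothetical sample in the same shape as evaluation_results["rag"]
sample = {
    "Question 1": {
        "groundedness": "10 - Completely grounded.",
        "relevance": "10 - Completely relevant.",
    },
    "Question 2": {
        "groundedness": " 5 - Partially grounded.",
        "relevance": "No rating given.",
    },
}

print(mean_rating(sample, "groundedness"), mean_rating(sample, "relevance"))
```

Unparseable replies are skipped rather than counted as zero, so the number of parsed scores is worth reporting next to each average.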

### Actionable Insights and Recommendations

**Key Findings**

#### LLM-only vs. RAG Performance
- **The RAG-based approach ran faster in these tests** (about 35 seconds for all five questions). The notebook did not isolate the cause; RAG still generates every token of the answer, so the shorter, context-bound responses are a likely factor.
- RAG outperformed LLM-only approaches in some cases
- RAG responses contained more specific medical details and protocols
- LLM-only responses sometimes contained hallucinations or generalized information

#### Prompt Engineering Impact
- Structured prompts with medical context significantly improved response quality
- Step-by-step reasoning prompts led to more comprehensive answers
- Evidence-based prompts resulted in more clinically relevant information

#### RAG Optimization Insights
- A chunk size of 1000 was chosen for the medical domain so as not to lose granular context
- Retrieving 5 relevant passages provided the best balance of context and focus
- Lower temperature settings (0.2-0.3) produced more reliable medical information
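
To make the chunk-size trade-off concrete, the sketch below splits text into fixed-size character windows with overlap. It is not the splitter used in this notebook; `chunk_text` is a hypothetical helper, and the overlap of 100 characters is an assumed value chosen only for illustration.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = chunk_text("a" * 2500)
print([len(chunk) for chunk in chunks])
```

Smaller chunks give the retriever more precise targets but may cut a protocol or dosage table in half; the overlap is one way to soften that boundary effect.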

#### Business Recommendations

1. **Implement RAG for Clinical Decision Support**
   - Deploy the optimized RAG system to provide quick access to medical knowledge
   - Integrate with existing healthcare systems for seamless workflow

2. **Customize for Specific Medical Departments**
   - Create specialized versions for different medical specialties
   - Fine-tune retrieval parameters based on department-specific needs

3. **Establish Continuous Evaluation Framework**
   - Implement regular evaluation of system responses by medical professionals
   - Create feedback loops to improve system performance over time

4. **Expand Knowledge Sources**
   - Incorporate additional trusted medical resources beyond the Merck Manual
   - Consider adding recent research papers and clinical guidelines

5. **Develop User-Friendly Interface**
   - Create intuitive interfaces for healthcare professionals to interact with the system
   - Provide transparency about information sources and confidence levels

#### Conclusion
The RAG-based system demonstrates significant potential for addressing information overload in healthcare settings. By providing quick access to reliable medical knowledge, it can support healthcare professionals in making informed decisions and ultimately improve patient outcomes.


#### ✅ Final Outcomes

* Efficient QA system for healthcare using Merck Manual
* Faster and more relevant inference via RAG approach
* Observed benefit of context injection using RAG over standalone LLM
* Insights into chunk sizing, prompt structure, and retrieval tuning