# Medical Assistant: Problem Statement (NLP with GenAI)

### Business Context



The healthcare industry is rapidly evolving, and professionals face increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. Quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

To address these challenges, healthcare centers can focus on integrating systems that streamline access to medical knowledge, provide tools to support quick decision-making and enhance efficiency. Leveraging centralized knowledge platforms and ensuring healthcare providers have continuous access to reliable resources can significantly improve patient care and operational effectiveness.



### Objective

As an AI specialist, your task is to develop a RAG-based AI solution using renowned medical manuals to address healthcare challenges. The objective is to understand information overload, apply AI techniques to streamline decision-making, analyze  its impact on diagnostics and patient outcomes, evaluate its potential to standardize care practices, and create a functional prototype demonstrating its feasibility and effectiveness.

### Questions to Answer

1. What is the protocol for managing sepsis in a critical care unit?
2. What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?
3. What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?
4. What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?
5. What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?



### Data Dictionary



The Merck Manuals are medical references published by the American pharmaceutical company Merck & Co., that cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899 when Merck & Co. was still a subsidiary of the German company Merck.
The manual is a PDF with over 4,000 pages divided into 23 sections.



> ⚠️ Important Note
>
> Please set the runtime to T4-GPU in Google Colab.

---



### 🚀 **Alternate Approach: Project | AWS Bedrock**

> ⚠️ **Request**
>
> Before referring to the learner guide, I independently implemented the project using **AWS Bedrock**, oblivious to provided direction, exploring a different stack and approach. This alternate solution highlights different choices and practical trade-offs.
>
> `Still, learned valuable know-hows along the way - inadvertently!`
>
>
> 📌 **Why include this here ?**
>
>
> It broadens the perspective by showcasing the same task via two paths - ***Hugging Face*** vs. ***AWS Bedrock*** — helping evaluate differences in performance, scalability, and integration.
>
> 💡 Since both versions were built independently and impromptu, I’d be glad if you could go through and appreciate your feedback on which aligns better with real-world industry practices.
>
> ⚡ Kindly consider reviewing this version too - it’s the same RAG/LLM task through a different lens.
>
> *Hope the link below catches your attention!* 👀
>
> 🔗 [Explore AWS Bedrock-based RAG Implementation](https://nvshah.github.io/pgpaiml/medical_assistant_soln.html)
>
>


---

In [1]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd "drive/MyDrive/Colab Notebooks/medical_assistant_hf"

/content/drive/MyDrive/Colab Notebooks/medical_assistant_hf


In [3]:
!pwd

/content/drive/MyDrive/Colab Notebooks/medical_assistant_hf


In [4]:
# This will show detailed GPU information if available
!nvidia-smi

Sun May 11 07:20:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   37C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Setup

**NOTE: refer https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf for phi-3 mini setup once**

#### Essential Libraries for Task

🧐 **NOTE**:

The cu121 (CUDA 12.1) version works with your CUDA 12.4 because of backward compatibility within major CUDA versions. CUDA 12.1 libraries are generally compatible with CUDA 12.4, as NVIDIA maintains compatibility within major versions.

Regarding which approach is better:

Using `CMAKE_ARGS="-DLLAMA_CUBLAS=on"` might be slightly more beneficial because:

1. It compiles specifically against your exact CUDA version (12.4)
2. You can add additional optimization flags if needed
3. It ensures CUBLAS is explicitly enabled with your specific GPU drivers

The difference in performance is likely to be minor, but if you want the most optimized build for your specific environment, using the CMAKE_ARGS approach would be preferable.

If you're concerned about getting the best performance, you could cancel the current installation and use:

```language=python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on -DLLAMA_CUDA_F16=on" pip install llama-cpp-python
```

This explicitly enables CUBLAS and FP16 support, which can improve performance on T4 GPUs.


In [5]:
# As our cuda version is 12.4 so using whl > cu124 (inorder not to get version mismatch things by using cu118)
# ref: https://pypi.org/project/llama-cpp-python/

!pip install llama-cpp-python --upgrade --force-reinstall --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

# Alternatively use
#!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python


# Use this F16 support one for faster inference
#!CMAKE_ARGS="-DLLAMA_CUBLAS=on -DLLAMA_CUDA_F16=on" pip install llama-cpp-python

#!CMAKE_ARGS="-DLLAMA_CUBLAS=ON" pip install llama-cpp-python --upgrade --force-reinstall

# !CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu124
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.9.tar.gz (67.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.9/67.9 MB[0m [31m13.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting typing-extensions>=4.5.0 (from llama-cpp-python)
  Downloading typing_extensions-4.13.2-py3-none-any.whl.metadata (3.0 kB)
Collecting numpy>=1.20.0 (from llama-cpp-python)
  Downloading numpy-2.2.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.0/62.0 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting diskcache>=5.6.1 (from llama-cpp-python)
  Do

In [None]:
# *?? Ideally below should be used but (due to some indirect depenendencies incompatibility (ie transformer, numpy), not using latest versions !!)
#!pip install huggingface_hub langchain langchain_text_splitters pymupdf langchain-community tiktoken sentence-transformers

# Below are the values suggested by Academic in reference notebook (not working !!)
# For installing the libraries & downloading models from HF Hub
#!pip install huggingface_hub==0.23.2 pandas==1.5.3 pymupdf==1.25.1 langchain==0.1.1 langchain-community==0.0.13 sentence-transformers==2.4.0 numpy==1.26.0 -q

In [6]:
# ---
# ? After lots of trial and error deduce compatible versions between packages that worked out !!
# ---
!pip install huggingface_hub==0.23.0 accelerate==0.31.0 pymupdf==1.23.22 langchain==0.1.14 langchain-community==0.0.35 langchain_text_splitters==0.0.1 sentence-transformers==2.6.1 numpy==1.26.4 pandas==2.2.2 transformers==4.40.2

Collecting huggingface_hub==0.23.0
  Downloading huggingface_hub-0.23.0-py3-none-any.whl.metadata (12 kB)
Collecting accelerate==0.31.0
  Downloading accelerate-0.31.0-py3-none-any.whl.metadata (19 kB)
Collecting pymupdf==1.23.22
  Downloading PyMuPDF-1.23.22-cp311-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting langchain==0.1.14
  Downloading langchain-0.1.14-py3-none-any.whl.metadata (13 kB)
Collecting langchain-community==0.0.35
  Downloading langchain_community-0.0.35-py3-none-any.whl.metadata (8.7 kB)
Collecting langchain_text_splitters==0.0.1
  Downloading langchain_text_splitters-0.0.1-py3-none-any.whl.metadata (2.0 kB)
Collecting sentence-transformers==2.6.1
  Downloading sentence_transformers-2.6.1-py3-none-any.whl.metadata (11 kB)
Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m5.3 MB/s[0m eta [36m0:

In [1]:
# ?? --
# Earlier this was not needed (to install seprately)
# But as later on due to version mismatch, we put specific version to each library (ie not using latest versions)
# In which case (might be langchain community packages in older days wont be shipping `faiss-cpu` together with it so we may need to dodwnload manually)
# But whilst using latest version it may not required to download it manually !!
# -----
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl (31.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.3/31.3 MB[0m [31m15.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0


In [8]:
import numpy
print(numpy.__version__)

1.26.4


In [9]:
import transformers
print(transformers.__version__)

4.40.2


In [10]:
# THis block is for debug and identify trace of numpy version ambiguity (because installed was 1.26.4 but showing 2.0.2)
import sys

print("NumPy version:", numpy.__version__)
print("Imported from:", numpy.__file__)
print("Python executable:", sys.executable)
print("Environment paths:")
for path in sys.path:
    print(" -", path)

NumPy version: 1.26.4
Imported from: /usr/local/lib/python3.11/dist-packages/numpy/__init__.py
Python executable: /usr/bin/python3
Environment paths:
 - /content
 - /env/python
 - /usr/lib/python311.zip
 - /usr/lib/python3.11
 - /usr/lib/python3.11/lib-dynload
 - 
 - /usr/local/lib/python3.11/dist-packages
 - /usr/lib/python3/dist-packages
 - /usr/local/lib/python3.11/dist-packages/IPython/extensions
 - /root/.ipython


#### Imports

In [90]:
import os
import pickle
import time
import json

In [12]:
from huggingface_hub import hf_hub_download, list_repo_files
from llama_cpp import Llama

In [13]:
from langchain_community.document_loaders import PyMuPDFLoader
import re

In [14]:
# Import libraries for text chunking
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_text_splitters import MarkdownHeaderTextSplitter

In [15]:
# Import libraries for embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
import torch
import time

In [16]:
from typing import List

In [17]:
from langchain_community.vectorstores import FAISS

---

In [18]:
# Test questions from the problem statement
questions = [
    "What is the protocol for managing sepsis in a critical care unit?",
    "What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
]

In [None]:
# List available files to get the correct filename
files = list_repo_files("microsoft/Phi-3-mini-4k-instruct-GGUF")
gguf_files = [f for f in files if f.endswith('.gguf')]
print("Available GGUF files:")
for f in gguf_files:
    print(f)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Available GGUF files:
Phi-3-mini-4k-instruct-fp16.gguf
Phi-3-mini-4k-instruct-q4.gguf


⁉️ **Model Highlight**

We will use **Phi-3-mini-4k-instruct-q4**
> Phi-3-mini-4k-instruct-q4.gguf


✅ **`phi-3-mini-4k-instruct-q4.gguf` is much faster**, especially on:

* CPUs (with `llama.cpp`)
* GPUs with limited memory (e.g., T4, RTX 3050, etc.)
* Google Colab Free (RAM or VRAM limited)

---

### ✅ Recommendation

For **RAG and general inference**, use:

> **`phi-3-mini-4k-instruct-q4.gguf`**

It's **fast**, **efficient**, and retains **most of the accuracy** — ideal for local or Colab usage.


**Lets create directory for storing the model (above one)**

In [None]:
# Create directory for llm-models
# ? The -p flag ensures that no error is thrown if the directory already exists.
!mkdir -p /content/drive/MyDrive/llm_models

In [19]:
llm_models_dir_path = '/content/drive/MyDrive/llm_models'

In [20]:
phi3_mini_local_path = f"{llm_models_dir_path}/Phi-3-mini-4k-instruct-q4.gguf"

In [21]:
# Download pre-quantized model if it doesn't exist
if not os.path.exists(phi3_mini_local_path):
    print("Downloading pre-quantized model...")
    hf_hub_download(
        repo_id="microsoft/Phi-3-mini-4k-instruct-GGUF",
        filename="Phi-3-mini-4k-instruct-q4.gguf",
        local_dir=llm_models_dir_path,

        # Colab doesn't always support symlinks well esp. when
        # - Writing to mounted paths (e.g. Google Drive)
        # Using False ensures the model files are copied instead of symlinked
        local_dir_use_symlinks=False
    )

## Question Answering using LLM

💡**REMEMBER**:

the structure `response["choices"][0]["text"]` is consistent across models when using the **llama-cpp-python** library. This is because llama.cpp follows the OpenAI API response format.

When you call the model with a prompt, it returns a dictionary with this structure:

```
{
  "id": "...",
  "object": "text_completion",
  "created": timestamp,
  "model": "...",
  "choices": [
    {
      "text": "The generated text response",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length" or "stop"
    }
  ],
  "usage": {
    "prompt_tokens": number,
    "completion_tokens": number,
    "total_tokens": number
  }
}
```

So `response["choices"][0]["text"]` will consistently give you the generated text output regardless of which model you're using with llama-cpp-python.


**Phi 3 Mini 4K Info**

Phi-3-mini-4k-instruct has a **total context window** of 4096 tokens. This means the combined length of:
- Your input prompt
- The model's generated response

cannot exceed 4096 tokens.

For example:
- If your prompt is 1000 tokens
- You could generate up to 3096 tokens in response

The `max_tokens` parameter (set to 1024 in our code) limits how many tokens the model will generate in its response, regardless of how much room is left in the context window.

So while the model *could* theoretically generate up to ~4000 tokens (if your prompt is very short), setting `max_tokens=1024` is a practical limit that:
1. Keeps response times reasonable
2. Provides sufficient detail for medical answers
3. Prevents excessively long outputs

You can adjust this value based on your needs, but 1024 is a good starting point.


#### Loading the model

In [22]:
# Function to load the model
def load_model(model_path):
    """
    Load the llm model (quantized) to be used eith llama.cpp
    """
    print(f'checking model at {model_path} ...')
    if not os.path.exists(model_path):
        print(f"Model not found at {model_path} Please check !.")
        return None

    # Load the model with appropriate parameters
    llm = Llama(
        model_path=model_path,
        n_ctx=4096,  # Context window size - use the full 4K that the model supports
        n_gpu_layers=-1,  # Use all GPU layers
        verbose=False
    )

    return llm

In [23]:
# max tokens - 1024 is good starting point // should be sufficient for comprehensive answers
# temperature - A temperature of 0.2-0.4 would be more appropriate for medical question answering
# - for medical applications, leaning toward more deterministic outputs is generally preferred

# Function to define model parameters
def get_model_parameters(temperature=0.3, top_p=0.9, max_tokens=1024):
    """
    Define parameters for model inference
    """
    return {
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        # todo: decide
        # "top_k": 40,
    }

In [24]:
# model path
print('model path: ', phi3_mini_local_path)

model path:  /content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf


In [25]:
!ls -l "/content/drive/MyDrive/llm_models"

total 2337140
-rw------- 1 root root 2393231072 May 11 06:34 Phi-3-mini-4k-instruct-q4.gguf


In [26]:
print(os.path.exists(phi3_mini_local_path))

True


In [27]:
# Test the model loading
model = load_model(phi3_mini_local_path)

checking model at /content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf ...


**Helper python function**

In [29]:
def response(query, llm, max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    # * ! this method is provided by academic
    model_output = llm(
      prompt=query,
      max_tokens=max_tokens,
      temperature=temperature,
      top_p=top_p,
      top_k=top_k
    )

    return model_output['choices'][0]['text']

# TODO: make this method to return (answer, response)
def get_response(query, model, params=dict()):
    """
    Get answer to query
    """
    try:

      prompt = 'N/A'

      pt = params.get('prompt_template')
      if pt:
        prompt = pt(query)
      else:
        # Create the properly formatted prompt
        prompt = create_prompt(query)

      # Get response from model
      response = model(
          prompt,
          # because 512 or 1024 was consuming considerable amount of time so going with what academic takes as default (ie 128)
          max_tokens=params.get("max_tokens", 128),
          temperature=params.get("temperature", 0.2),
          top_p=params.get("top_p", 0.9),
          top_k=params.get("top_k", 50)
          #top_k=params.get("top_k")
      )
      # TODO: comment below print statement (if not needed to trace !...)
      # print('got response', response)
      return response['choices'][0]['text']
    except Exception as e:
      print('Error whilst getting the response', e)
      return 'Error'

def create_prompt(query):
    """
    Creates a properly formatted prompt for the Phi-3-mini-4k-instruct model

    Args:
        query: The user's question or query

    Returns:
        A formatted prompt string that follows the model's expected format
    """
    # Format following Phi-3 chat template
    formatted_prompt = f"""<|user|>
{query}
<|assistant|>"""

    # ! NOTE: not using `pipeline()` or `pipe()`
    # We're using llama.cpp via the llama-cpp-python binding rather than the Hugging Face Transformers library
    # The Transformers pipeline would be useful if we were using the full HF implementation
    # but for our quantized model with llama.cpp, the direct approach we're using is more appropriate

    return formatted_prompt

In [30]:
def display_response(question, answer, verbose=False, response=None):
    """
    Display the question and model response in a clean, formatted way

    Args:
        question: The question asked to the model
        answer: text response to question
        verbose: Whether to show additional details like token counts
        response: The full response object from the model
    """
    # Print with clear formatting
    print("\n" + "="*80)
    print("📋 QUESTION:")
    print("-"*80)
    print(question)
    print("\n" + "🩺 ANSWER:")
    print("-"*80)
    print(answer)
    print("="*80)

    # Optional verbose output with token information
    if verbose and response:
        usage = response.get("usage", {})
        prompt_tokens = usage.get("prompt_tokens", "N/A")
        completion_tokens = usage.get("completion_tokens", "N/A")
        total_tokens = usage.get("total_tokens", "N/A")

        print("\n📊 STATS:")
        print(f"  • Prompt tokens: {prompt_tokens}")
        print(f"  • Completion tokens: {completion_tokens}")
        print(f"  • Total tokens: {total_tokens}")

In [31]:
if model:
    print("Model loaded successfully!")

Model loaded successfully!


In [32]:
llm = model

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
response0 = get_response(questions[0], llm)
display_response(questions[0], response0)

got respoonse {'id': 'cmpl-6ea31136-c85e-4ee5-91e1-1634efd91b43', 'object': 'text_completion', 'created': 1746886655, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': " The management of sepsis in a critical care unit follows the Surviving Sepsis Campaign (SSC) guidelines, which are periodically updated. The protocol generally includes the following steps:\n\n1. Early recognition and assessment: Identify patients with suspected sepsis, septic shock, or severe sepsis based on clinical signs, symptoms, and laboratory findings.\n\n2. Immediate resuscitation: Initiate aggressive fluid resuscitation with crystalloids, aiming for a 30 mL/kg bolus within the first 3 hours.\n\n3. Antibiotic therapy: Administer broad-spectrum antibiotics within one hour of recognition, and then de-escalate based on culture results and clinical response.\n\n4. Source control: Identify and treat the source of infection, such as draining abscesses, removing infected

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
i = 1
response1 = get_response(questions[i], llm)
display_response(questions[i], response1)

got respoonse {'id': 'cmpl-439199b1-d299-4347-9701-6533f3d6a890', 'object': 'text_completion', 'created': 1746887080, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': ' Appendicitis is an inflammation of the appendix, a small pouch-like organ located in the lower right abdomen. The common symptoms of appendicitis include:\n\n1. Abdominal pain: The pain usually starts around the navel and then moves to the lower right abdomen. The pain tends to worsen over time and may become severe.\n2. Loss of appetite\n3. Nausea and vomiting\n4. Fever\n5. Abdominal bloating\n6. Constipation or diarrhea\n\nAppendicitis', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 38, 'completion_tokens': 128, 'total_tokens': 166}}

📋 QUESTION:
--------------------------------------------------------------------------------
What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical proc

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
i = 2
response2 = get_response(questions[i], llm)
display_response(questions[i], response2)

got respoonse {'id': 'cmpl-6288221f-0323-458e-8d79-9a07e6aeca28', 'object': 'text_completion', 'created': 1746887208, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': ' Sudden patchy hair loss, also known as alopecia areata, can be caused by various factors, including genetics, autoimmune disorders, and stress. Here are some effective treatments and solutions for addressing this condition:\n\n1. Medications:\n   a. Corticosteroids: Injectable or topical corticosteroids can help reduce inflammation and promote hair regrowth.\n   b. Minoxidil: This is a topical solution that can help stimulate hair growth.\n   c. Immunomodulatory agents:', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 44, 'completion_tokens': 128, 'total_tokens': 172}}

📋 QUESTION:
--------------------------------------------------------------------------------
What are the effective treatments or solutions for addressing sudden patc

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
i = 3
response3 = get_response(questions[i], llm)
display_response(questions[i], response3)

got respoonse {'id': 'cmpl-f7b91398-3fc6-45ec-9c1d-fbe3f66fbd1d', 'object': 'text_completion', 'created': 1746887284, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': " I am not able to diagnose or provide specific treatment recommendations. It is crucial to consult with a qualified healthcare professional for an accurate diagnosis and appropriate treatment plan. However, I can provide you with some general information about potential treatments for brain injuries.\n\nTreatment for brain injuries depends on the severity and type of injury, as well as the individual's overall health. Some common approaches to treating brain injuries include:\n\n1. Medical management: This involves monitoring the patient's vital signs, managing pain, and addressing any immediate medical issues related to the injury", 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 36, 'completion_tokens': 128, 'total_tokens': 164}}

📋 

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
i = 4
response4 = get_response(questions[i], llm)
display_response(questions[i], response4)

got respoonse {'id': 'cmpl-284cbf05-074c-4dd5-bc93-5ee1e876e325', 'object': 'text_completion', 'created': 1746887421, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': ' If a person has fractured their leg during a hiking trip, it is crucial to take the following precautions and treatment steps:\n\n1. Safety first:\n   - Ensure the injured person is in a safe location, away from any potential hazards.\n   - If possible, help the person to a stable, flat surface.\n\n2. Call for help:\n   - Contact emergency services or a local rescue team to provide professional medical assistance.\n   - If cell phone service is available, call for help immediately.\n\n3. Immobilize the leg', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 42, 'completion_tokens': 128, 'total_tokens': 170}}

📋 QUESTION:
--------------------------------------------------------------------------------
What are the necessary precautions a

👀 **Quick Look (Phi-3-Mini)**

* Responses were generated in \~2 minutes per question.
* The model consistently returned well-structured, point-wise answers resembling medical prescriptions — without being explicitly instructed to do so
* Token limits occasionally constrained answer completeness
* Responses resembled clinical documentation - concise bullet points rather than narrative paragraphs
* Despite the compact setting, responses demonstrated strong domain alignment, resembling doctor-style advisories. All answers maintained consistent numbered format across different medical topics
* Temperature settings below 0.2 produced overly generic responses lacking specific medical details
* Response quality varied significantly between simple symptom questions and complex treatment protocols
* Surprisingly, the model inferred appropriate tone and format for medical Q\&A without heavy prompt engineering.

🤖 These observations highlight *Phi-3 Mini's* surprising capability to mimic clinical communication patterns, though with limitations in handling complex medical reasoning within token constraints ❗




## Question Answering using LLM with Prompt Engineering

In [None]:
combinations = [
    # Combination 1: Highly deterministic (factual focus)
    {
        "name": "Highly Deterministic",
        "temperature": 0.1,
        "top_p": 0.5,
        "max_tokens": 150,
        "prompt_template": lambda q: f"<|user|>\nAnswer this medical question with precise, factual information: {q}\n<|assistant|>"
    },

    # Combination 2: Balanced approach
    {
        "name": "Balanced Approach",
        "temperature": 0.4,
        "top_p": 0.8,
        "max_tokens": 160,
        "prompt_template": lambda q: f"<|user|>\nProvide a comprehensive medical answer to this question: {q}\n<|assistant|>"
    },

    # Combination 3: Step-by-step reasoning
    {
        "name": "Step-by-Step Reasoning",
        "temperature": 0.3,
        "top_p": 0.7,
        "max_tokens": 150,
        "prompt_template": lambda q: f"<|user|>\nAnswer this medical question step-by-step with clear reasoning: {q}\n<|assistant|>"
    },

    # Combination 4: Concise summary
    {
        "name": "Concise Summary",
        "temperature": 0.2,
        "top_p": 0.9,
        "max_tokens": 132,
        "prompt_template": lambda q: f"<|user|>\nProvide a brief, concise answer to this medical question: {q}\n<|assistant|>"
    },

    # Combination 5: Medical expert persona
    {
        "name": "Medical Expert Persona",
        "temperature": 0.3,
        "top_p": 0.85,
        "max_tokens": 135,
        "prompt_template": lambda q: f"<|user|>\nAs an experienced medical specialist, answer this question with your expert knowledge: {q}\n<|assistant|>"
    }
]

In [None]:
def test_combinations(question_index):
  query = questions[question_index]
  for e in combinations:
    ans = get_response(query, llm, params=e)
    display_response(query, ans)
    print('---\n')

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
test_combinations(0)

got respoonse {'id': 'cmpl-44cfbe41-f337-4ff0-be9f-603d6e1f902b', 'object': 'text_completion', 'created': 1746888719, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': ' The protocol for managing sepsis in a critical care unit is based on the Surviving Sepsis Campaign (SSC) guidelines, which emphasize early recognition, prompt administration of antibiotics, and aggressive fluid resuscitation. The following steps are typically followed:\n\n1. Early recognition: Identify patients with suspected sepsis by assessing for signs and symptoms, such as fever, elevated heart rate, altered mental status, and hypotension.\n\n2. Immediate interventions:\n   a. Administer broad-spectrum antibiotics within one hour of recognition.\n   b. Initiate fluid resuscitation with', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 31, 'completion_tokens': 150, 'total_tokens': 181}}

📋 QUESTION:
--------------------------------

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
test_combinations(1)

got respoonse {'id': 'cmpl-5029667d-19d0-42b1-9168-99689acf86d1', 'object': 'text_completion', 'created': 1746889097, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': ' Appendicitis is an inflammation of the appendix, a small, finger-like pouch that projects from the large intestine. The common symptoms of appendicitis include:\n\n1. Abdominal pain: The pain usually starts around the navel and then moves to the lower right side of the abdomen. The pain typically worsens over time and becomes more severe.\n\n2. Loss of appetite\n\n3. Nausea and vomiting\n\n4. Low-grade fever\n\n5. Constipation or diarrhea\n\n6. Abdominal bloating\n\n7. Inability to pass gas\n\n8. Abdominal', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 49, 'completion_tokens': 150, 'total_tokens': 199}}

📋 QUESTION:
--------------------------------------------------------------------------------
What are the common symptoms of appe

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
test_combinations(2)

got respoonse {'id': 'cmpl-ddf847b5-d0b0-458c-aea1-cc3e3692feeb', 'object': 'text_completion', 'created': 1746889448, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': " Sudden patchy hair loss, also known as alopecia areata, is an autoimmune condition where the body's immune system mistakenly attacks hair follicles, leading to localized bald spots on the scalp. The exact cause of alopecia areata is unknown, but it is believed to involve a combination of genetic and environmental factors.\n\nPossible causes of alopecia areata include:\n\n1. Genetic predisposition: A family history of alopecia areata or other autoimmune diseases may increase the risk of developing the condition.\n2. Immune system dysfunction: An overactive immune system may attack hair follicles, causing", 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 55, 'completion_tokens': 150, 'total_tokens': 205}}

📋 QUESTION:
------------------

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
test_combinations(3)

got respoonse {'id': 'cmpl-656cabf2-698f-4380-a037-2bcf4436b795', 'object': 'text_completion', 'created': 1746889776, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': " Treatment for a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function, depends on the severity and location of the injury, as well as the specific symptoms and complications experienced by the individual. Here are some general treatment options:\n\n1. Immediate medical attention: In the case of a severe head injury, immediate medical attention is crucial. This may involve stabilizing the patient's vital signs, performing a thorough neurological examination, and obtaining imaging studies such as a CT scan or MRI to assess the extent of the injury.\n\n2. Medications: Various medications may be prescribed to manage symptoms and complications associated with", 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'pro

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
test_combinations(4)

got respoonse {'id': 'cmpl-89eecbf0-a8dc-49d0-861e-892e738e0ec0', 'object': 'text_completion', 'created': 1746890563, 'model': '/content/drive/MyDrive/llm_models/Phi-3-mini-4k-instruct-q4.gguf', 'choices': [{'text': ' Precautions and treatment steps for a person who has fractured their leg during a hiking trip include:\n\n1. Immobilization: Immobilize the injured leg using a splint or a makeshift support to prevent further injury.\n\n2. Pain management: Administer over-the-counter pain medications, such as acetaminophen or ibuprofen, to alleviate pain and reduce inflammation.\n\n3. Elevation: Elevate the injured leg above heart level to reduce swelling and improve blood circulation.\n\n4. Ice application: Apply ice packs wrapped in a cloth to the injured area for 15-', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 53, 'completion_tokens': 150, 'total_tokens': 203}}

📋 QUESTION:
---------------------------------------------------------------------

## 👀 **Summary**



We utilized the Phi-3 Mini 4K model in a GPU-accelerated Colab setup to generate medical responses for a predefined set of five clinically relevant questions.

initial tests showed that a max_tokens setting of 512 to 1024 consumes too much time So we resort back to defualt provided by academia (ie 128)

**Fine tuned params**

- Varying LLM parameters(temperature, top_p_ had not drastic effect on core facts, reinforcing consistency in medical information

- The model occassionally altered between concise statements and step-wise instructions- mimicking the variability in real clinical communication.

1. **Response Time vs. Token Length**: Generating 512 tokens took significantly longer than shorter outputs. There appears to be a near-linear relationship between max_tokens and generation time.

2. **Temperature Impact**: Lower temperature settings (0.1-0.2) produced more consistent, factual responses appropriate for medical information, while slightly higher values (0.3-0.4) introduced minor variations in phrasing without compromising accuracy.

3. **Prompt Engineering Effects**: Directive prompts (e.g., "Answer step-by-step") noticeably influenced the structure of responses, with the model generally following the requested format.

4. **Default Value Optimization**: We settled on max_tokens=128 as the default for a balance between response quality and generation speed. This value provided sufficient detail for most medical questions while maintaining reasonable response times.

5. **Optimal Balance**: The "Balanced Approach" (temp=0.3, top_p=0.8) provided a good compromise between factual accuracy and natural language flow for medical questions.

6. **Conciseness vs. Completeness**: While the "Concise Summary" setting generated faster responses, some medical questions benefited from the additional context provided by longer outputs.

7. **Persona Framing**: The "Medical Expert Persona" prompt appeared to elicit slightly more technical terminology and structured explanations compared to neutral prompts.

8. **Token Efficiency**: Lower temperatures generally resulted in more information-dense responses, requiring fewer tokens to convey key medical information.

9. **Model Limitations**: For complex medical protocols (like sepsis management), even the longest responses sometimes felt truncated before completing the full explanation.

*Overall, output remained grounded and medically sound across parameter sweeps, suggesting robustness in factual domain.*

***The responses generated by Phi-3 Mini were surprisingly coherent and informative, even without retrieval augmentation — suggesting that smaller models can still be leveraged for meaningful domain-specific reasoning when guided with the right prompt and sampling parameters.***

## Data Preparation for RAG

**Flow** :

1. **PDF** → loaded via loader (e.g., `PyMuPDFLoader`)
2. **Documents** → output of loader (list of `Document` objects)
3. **Chunks** → split documents into smaller chunks
4. **Embeddings** → generate embeddings from chunks


**`PDF → Documents → Chunks → Embeddings`**


**Prep**

In [33]:
!pwd

/content/drive/MyDrive/Colab Notebooks/medical_assistant_hf


In [34]:
!ls -la

total 33794
-rw------- 1 root root 14173186 May 11 06:18 documents.pkl
drwx------ 2 root root     4096 May 11 06:18 faiss_index
-rw------- 1 root root 20150488 May 11 06:18 medical_diagnosis_manual.pdf
-rw------- 1 root root   237332 May 11 07:33 notebook_p2.ipynb
-rw------- 1 root root    37945 May 11 06:18 Project_Template_Notebook.ipynb


In [35]:
pdf_path = 'medical_diagnosis_manual.pdf'

In [36]:
# helper Functions

def load_pdf_with_langchain(pdf_path):
    """
    Load a PDF file using LangChain's PyMuPDFLoader

    Args:
        pdf_path: Path to the PDF file

    Returns:
        List of document objects with page content and metadata
    """
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found at {pdf_path}")

    # Load the PDF using PyMuPDFLoader
    loader = PyMuPDFLoader(pdf_path)
    documents = loader.load()

    print(f"PDF loaded successfully: {os.path.basename(pdf_path)}")
    print(f"Total pages: {len(documents)}")

    return documents

def preview_documents(documents, num_pages=3):
    """
    Preview few pages of loaded documents
    """
    total_pages = len(documents)

    print(f"Previewing few pages:")
    print("="*80)

    # Skipping first 2 pages as those are not relevant (ie abstract, ...)
    for i in range(2, min(2+num_pages, total_pages)):
        doc = documents[i]
        text = doc.page_content

        print(f"\n--- Page {i+1} ---\n")
        print(text[:1000] + "..." if len(text) > 1000 else text)

    print("\n" + "="*80)

In [38]:
# This is a helper class to save and load the python objects locally !
class ObjectStore:
    def __init__(self, filepath):
        """
        :param filepath: Path to the file where the object will be saved/loaded from.
        """
        self.filepath = filepath

    def save(self, obj):
        """Save object to the file using pickle."""
        try:
          with open(self.filepath, 'wb') as f:
              pickle.dump(obj, f)
        except Exception as e:
          print(f'Error whilst saving the object {e}')

    def load(self, derive):
        """
        Load object from file, or call derive(), save it, and return if not found or corrupted.

        :param derive: Function to call to derive the default object.
        :return: Loaded or derived object.
        """
        try:
          if os.path.exists(self.filepath):
              try:
                  with open(self.filepath, 'rb') as f:
                      return pickle.load(f)
              except (pickle.UnpicklingError, EOFError, OSError):
                  print("Failed to load object. Deriving new one.")
          else:
              print("File not found. Deriving new object.")

          obj = derive()
          if obj:
            self.save(obj)
          return obj
        except:
          print("Failed to load or derive object.")
          return None

### Loading the Data

In [39]:
# Extract documents from pdf
# --
# 1. extract texts
# 2. and group or segregate texts into documents !!
# --

def derive_documents():
  try:
    documents = load_pdf_with_langchain(pdf_path)
    return documents
  except Exception as e:
    print(f"Error loading pdf: {e}")
    return None

objStore = ObjectStore('documents.pkl')
documents = objStore.load(derive_documents)

#### Checking the first 5 pages

In [40]:
# preview the PDF
preview_documents(documents, 5)

Previewing few pages:

--- Page 3 ---

Table of Contents
1
Front    ................................................................................................................................................................................................................
1
Cover    .......................................................................................................................................................................................................
2
Front Matter    ...........................................................................................................................................................................................
53
1 - Nutritional Disorders    ...............................................................................................................................................................
53
Chapter 1. Nutrition: General Considerations    ...............................................................

#### Checking the number of pages

In [None]:
print('Total number of pages: ', len(documents))

Total number of pages:  4114


### Data Overview

In [None]:
 # Combine all text for statistics
full_text = "\n\n".join([doc.page_content for doc in documents])

# Display some statistics about the content
lines = full_text.split('\n')
words = re.findall(r'\w+', full_text)

print(f"\nPDF loaded successfully with {len(documents)} pages and {len(full_text)} characters")
print(f"Approximate number of lines: {len(lines)}")
print(f"Approximate number of words: {len(words)}")


PDF loaded successfully with 4114 pages and 13710453 characters
Approximate number of lines: 211089
Approximate number of words: 2027803


### Data Chunking

🎯 **Notion | Remember**

Using `BAAI/bge-small-en-v1.5` (not an OpenAI model), the standard `RecursiveCharacterTextSplitter` would be more appropriate than the tiktoken version.

Here's why:

1. `RecursiveCharacterTextSplitter.from_tiktoken_encoder()` uses OpenAI's tokenization, which doesn't align with how BGE tokenizes text

2. The tokenization mismatch could lead to suboptimal chunk boundaries for your embedding model

3. The standard `RecursiveCharacterTextSplitter` with character-based chunking provides more consistent results across different embedding models

4. For BGE models, character-based chunking with appropriate chunk size and overlap settings works well in practice


This approach will be more aligned with your chosen embedding model and provide more consistent results.


In [41]:
def create_chunks_from_documents(documents, chunk_size=512, chunk_overlap=50):
    """
    Create chunks from documents using character-based splitting

    Args:
        documents: List of documents from PyMuPDFLoader
        chunk_size: Size of chunks in characters
        chunk_overlap: Number of characters to overlap between chunks

    Returns:
        List of chunked documents
    """
    # Create a character-based splitter (better for non-OpenAI embedding models like BGE)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )

    # Split the documents into chunks
    chunks = text_splitter.split_documents(documents)

    print(f"Created {len(chunks)} chunks from {len(documents)} pages")
    print(f"Average chunk length: {sum(len(doc.page_content) for doc in chunks) / len(chunks):.2f} characters")

    return chunks

In [42]:
def preview_chunks(chunks, num_chunks=2):
    """
    Preview a sample of the created chunks

    Args:
        chunks: List of document chunks
        num_chunks: Number of chunks to preview
    """
    print(f"\nPreviewing {min(num_chunks, len(chunks))} chunks out of {len(chunks)}:")

    for i in range(min(num_chunks, len(chunks))):
        chunk = chunks[i]
        print(f"\n--- Chunk {i+1} ---")

        # # Print metadata if available
        # if hasattr(chunk, 'metadata') and chunk.metadata:
        #     print(f"Source: {chunk.metadata.get('source')}, Page: {chunk.metadata.get('page')}")

        # Print content preview
        content = chunk.page_content
        print(f"Length: {len(content)} characters")
        print(f"Content preview: {content[:150]}...")

In [43]:
chunks = create_chunks_from_documents(documents, chunk_size=512, chunk_overlap=50)

Created 31416 chunks from 4114 pages
Average chunk length: 441.85 characters


In [44]:
preview_chunks(chunks)


Previewing 2 chunks out of 31416:

--- Chunk 1 ---
Length: 117 characters
Content preview: nipunshah6776@gmail.com
0W3XG8QC4A
nt for personal use by nipunshah6776@
shing the contents in part or full is liable...

--- Chunk 2 ---
Length: 182 characters
Content preview: nipunshah6776@gmail.com
0W3XG8QC4A
This file is meant for personal use by nipunshah6776@gmail.com only.
Sharing or publishing the contents in part or ...


### Embedding

Flow Summary:

  1.  Create or Load Vector Store

  2.  Retrieve Relevant Documents

  3.  Generate RAG-based Answer

In [45]:
# ? For medical text retrieval specifically, BAAI/bge-small-en-v1.5 offers the best balance of speed and retrieval
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"
STORE_PATH = "faiss_index"

In [46]:
def create_embedding_model(model_name, device="cuda"):
    """
    Create an embedding model using the specified model

    Args:
        model_name: Name of the HuggingFace model to use
        device: Device to run the model on ("cuda" or "cpu")

    Returns:
        Configured embedding model
    """
    # Check if CUDA is available when device is set to "cuda"
    if device == "cuda" and not torch.cuda.is_available():
        print("CUDA not available, falling back to CPU")
        device = "cpu"

    # Configure the embedding model
    model_kwargs = {'device': device}
    encode_kwargs = {'normalize_embeddings': True}  # Normalize for cosine similarity

    # Create the embedding model
    embedding_model = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )

    print(f"Embedding model created: {model_name} on {device}")
    return embedding_model

### Vector Database

In [57]:
def load_or_create_faiss_vectorstore(
    chunks,
    embedding_model,
    persist_dir=STORE_PATH,
):
    """
    Create a FAISS vector store from chunks or load from disk if it exists

    Args:
        chunks: List of document chunks
        embedding_model: The embedding model to use
        persist_directory: Directory to save/load the FAISS index

    Returns:
        FAISS vector store
    """
    index_path = os.path.join(persist_dir, "index.faiss")
    metadata_path = os.path.join(persist_dir, "index.pkl")

    # Load if already exists
    if os.path.exists(index_path) and os.path.exists(metadata_path):
        print(f"Loading FAISS index from {persist_dir}")

        # ---
        # ?? Why allow_dangerous_deserialization=True
        # -> ref: https://stackoverflow.com/questions/78120202/the-de-serialization-relies-loading-a-pickle-file
        # ---
        # Load the existing index
        vector_store = FAISS.load_local(persist_dir, embedding_model, allow_dangerous_deserialization=True)
        print(f"Loaded FAISS index with {vector_store.index.ntotal} vectors")

        return vector_store

    # Create a new vector store if it doesn't exist
    print(f"Creating new FAISS index from documents at {persist_dir} ...")
    start_time = time.time()

    # Create the vector store from documents
    vector_store = FAISS.from_documents(chunks, embedding_model)

    elapsed_time = time.time() - start_time
    print(f"FAISS index created in {elapsed_time:.2f} seconds")
    print(f"Index contains {vector_store.index.ntotal} vectors of dimension {vector_store.index.d}")

    # Create the directory if it doesn't exist
    os.makedirs(persist_dir, exist_ok=True)

    # Save the index to disk
    print(f"Saving FAISS index to {persist_dir}")
    vector_store.save_local(persist_dir)

    return vector_store

In [48]:
def print_faiss_store_info(store):
    print(f"Total documents: {len(store.docstore._dict)}")
    print(f"Vector dimension: {store.index.d}")
    print(f"Index type: {type(store.index).__name__}")

### Retriever

In [49]:
def retrieve_from_faiss(vector_store, query, top_k=3):
    """
    Retrieve the top_k most similar chunks for a given query

    Args:
        vector_store: FAISS vector store
        query: Query string
        k: Number of chunks to retrieve (default: 5)

    Returns:
        List of retrieved documents
    """
    # Retrieve the top_k most similar documents and their scores
    docs_and_scores = vector_store.similarity_search_with_score(query, k=top_k)

    # # Separate documents and scores
    # docs = [doc for doc, score in docs_and_scores]
    # scores = [score for doc, score in docs_and_scores]

    print(f"Retrieved {len(docs_and_scores)} documents for query: '{query}'")

    docs = []

    for i, (doc, score) in enumerate(docs_and_scores):
        docs.append(doc)

        print(f"Rank {i+1}:")
        print(f"Score: {score:.4f}")

        # # Print metadata if available
        # if hasattr(doc, 'metadata') and doc.metadata:
        #     source = doc.metadata.get('source', 'Unknown')
        #     page = doc.metadata.get('page', 'Unknown')
        #     print(f"Source: {source}, Page: {page}")

        # Print content preview (documents are objects with page_content attribute)
        content = doc.page_content
        preview_length = min(200, len(content))
        print(f"Content: {content[:preview_length]}...\n")

    return docs

### System and User Prompt Template

In [50]:
RAG_DEFAULT_SYSTEM_MSG = """You are a helpful medical assistant. Your task is to answer medical questions based solely on the provided context from the Merck Manual.

Guidelines:
- Answer ONLY based on the context provided.
- If the context doesn't contain the answer, say "I don't have enough information to answer this question".
- Be concise and accurate in your responses.
- For medical conditions, include key information about symptoms, diagnosis, and treatment if available.
- Do not fabricate information or use knowledge outside the provided context."""
# As for now we are not including metadata in user_message putting this below point aside !!
# - If relevant and possible, specify which section of the Merck Manual the information comes from."""


In [51]:
def create_user_message(query, retrieved_docs, include_metadata=False):
    """
    Create the user message portion of the RAG prompt in Phi-3-mini format

    Format
    ```
    Context:
    {context}

    Question:
    ```

    Args:
        query: User's question
        retrieved_docs: List of retrieved documents
        num_docs: Number of documents to include

    Returns:
        (Formatted user message string with context first, then question,
        context)

    Example
      ```
      Context:

      Document 1:
      (Source: Merck Manual, Page: 45)
      Content from document 1...

      Document 2:
      (Source: Merck Manual, Page: 50)
      Content from document 2...

      Question:
      What is the treatment for condition X?
      ```
    """
    # Start with the instruction
    user_message = "Use the following context to answer the question.\n\nContext:"

    context_value = ""

    # Add the most relevant documents to the context
    for i, doc in enumerate(retrieved_docs):
        # Add document content
        context_value += f"\n\nDocument {i+1}:"

        # Add metadata if available
        if include_metadata and hasattr(doc, 'metadata') and doc.metadata:
            source = doc.metadata.get('source', '')
            page = doc.metadata.get('page', '')
            if source or page:
                context_value += f" (Source: {source}, Page: {page})"

        # Add content
        context_value += f"\n{doc.page_content}"

    user_message += context_value

    # Add the question after the context
    user_message += f"\n\nQuestion:\n{query}"

    return user_message, context_value

In [52]:
def create_phi3_rag_prompt(system_message, user_message):
    """
    Create a RAG prompt specifically formatted for Phi-3-mini

    Format:-
    ```
    <|system|>
    You are a helpful and concise assistant. Answer based only on the provided context. If the context does not contain enough information, say "I don't know."

    <|user|>
    Context:
    {context}

    Question:
    {question}
    <|assistant|>
    ```

    Args:
        query: User's question
        retrieved_docs: List of retrieved documents
        num_docs: Number of documents to include in the prompt

    Returns:
        Formatted prompt string ready for Phi-3-mini
    """
    # Phi-3-mini uses a specific format with <|system|>, <|user|>, and <|assistant|> tags
    # Combine into Phi-3 format
    # phi3_prompt = f"<|system|>\n{system_message}\n<|user|>\n{user_message}\n<|assistant|>"
    # ..
    phi3_prompt = (
        f"<|system|>\n"
        f"{system_message}\n\n"
        f"<|user|>\n"
        f"{user_message}\n\n"
        f"<|assistant|>"
    )

    return phi3_prompt

In [53]:
def get_rag_response(query, vector_store, llm_model, llm_model_params=dict(), context_k=3, system_message=RAG_DEFAULT_SYSTEM_MSG):
    """
    End-to-end RAG pipeline to answer medical questions

    Args:
        query: User's medical question
        vector_store: FAISS vector store containing Merck Manual chunks
        llm_model: The Phi-3-mini model object
        llm_model_params: params such as temperature, top_k, .. dictonary to pass to llm model during inferencing
        context_k: Number of relevant chunks to retrieve

    Returns:
        Generated (answer, context) from the LLM
    """
    try:
      print(f"Processing question: '{query}'")

      # Step 1: Retrieve relevant documents
      print("\n i) Retrieving relevant documents ...")
      retrieved_docs = retrieve_from_faiss(vector_store, query, top_k=context_k)

      # Step 2: Create user message
      user_message, context_val = create_user_message(query, retrieved_docs)

      # Step 3: Create RAG prompt
      phi3_prompt = create_phi3_rag_prompt(system_message, user_message)

      # Step 4: Generate answer using existing LLM function
      print("\n i) Generating answer ...")
      answer = get_response(phi3_prompt, llm_model, params=llm_model_params)

      print("\n--- Answer ---")
      print(f"{answer[:125]}...") # show only first 125 characters

      return (answer, context_val)
    except Exception as e:
        # Catch any exception and print/log it
        print(f"Error occurred: {e}")
        return "An error occurred while processing your question. Please try again later.", "N/A"


### Configuration & Usage

In [54]:
import transformers
print(transformers.__version__)

4.40.2


In [55]:
import numpy
print(numpy.__version__)

1.26.4


In [None]:
# Intentionally kept this commented block so that in future if version ambiguity issue arise it can be tackled quickly refering htis block
#!sudo rm -rf /usr/local/lib/python3.11/dist-packages/numpy*

In [58]:
# Configure embedding model
embedding_model = create_embedding_model(EMBEDDING_MODEL_NAME)
vector_store = load_or_create_faiss_vectorstore(chunks, embedding_model, persist_dir=STORE_PATH)

Embedding model created: BAAI/bge-small-en-v1.5 on cuda
Loading FAISS index from faiss_index
Loaded FAISS index with 31416 vectors


In [59]:
# Show some meta info about store
print_faiss_store_info(vector_store)

Total documents: 31416
Vector dimension: 384
Index type: IndexFlatL2


## Question Answering using RAG

**Helper class that helps to hold rag tasks info** !!

In [60]:
# Helper class

class RAGMeta:
    def __init__(self, num_questions):
        # Initialize the list with a fixed size of `num_questions`, all set to None initially
        self.meta = [None] * num_questions

    def register(self, question_num, question, answer, context):
        # Ensure the question_num is within bounds
        if 0 <= question_num < len(self.meta):
            # Register the information for the specified question number
            self.meta[question_num] = (question, answer, context)
        else:
            raise IndexError("Question number is out of range")

    def get(self, question_num):
        # Retrieve information for the given question number
        if 0 <= question_num < len(self.meta):
            return self.meta[question_num]
        else:
            return None  # If question_num doesn't exist yet

    def get_all(self):
        return list(self.meta)

In [61]:
# For medical RAG applications with the Merck Manual, a `top_k` value of 3 - 5 is optimal
TO_FETCH_DOC = 3 # from store

In [62]:
rag_meta = RAGMeta(num_questions=5)

In [80]:
def process_question_for_rag(i):
  print(f'Process Question {i} initiated ---\n')
  (ans, ctxt) = get_rag_response(questions[i], vector_store, llm, llm_model_params={'max_tokens': 160}, context_k=TO_FETCH_DOC)
  print('\n ---- View ---')
  display_response(questions[i], ans)
  return (questions[i], ans, ctxt)

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [81]:
i = 0
(que, ans, ctxt) = process_question_for_rag(i)
rag_meta.register(i, que, ans, ctxt)

Process Question 0 initiated ---

Processing question: 'What is the protocol for managing sepsis in a critical care unit?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What is the protocol for managing sepsis in a critical care unit?'
Rank 1:
Score: 0.4486
Content: septic shock is high (60 to 65%). Prognosis depends on the cause, preexisting or complicating illness,
time between onset and diagnosis, and promptness and adequacy of therapy.
General management: Fir...

Rank 2:
Score: 0.4610
Content: • Replacement-dose corticosteroids
Patients with septic shock should be treated in an ICU. The following should be monitored frequently (see
also p. 2244): systemic pressure; CVP, PAOP, or both; pulse...

Rank 3:
Score: 0.4872
Content: 16 - Critical Care Medicine
Chapter 222. Approach to the Critically Ill Patient
Introduction
Critical care medicine specializes in caring for the most seriously ill patients. These patients are best
t...


 i) Generating answer ...

-

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [82]:
i = 1
(que, ans, ctxt) = process_question_for_rag(i)
rag_meta.register(i, que, ans, ctxt)

Process Question 1 initiated ---

Processing question: 'What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?'
Rank 1:
Score: 0.3843
Content: ultrasound. Treatment is surgical removal.
In the US, acute appendicitis is the most common cause of acute abdominal pain requiring surgery. Over
5% of the population develops appendicitis at some poi...

Rank 2:
Score: 0.4427
Content: appendectomy is inflammatory bowel disease involving the cecum. However, in cases of terminal ileitis
and a normal cecum, the appendix should be removed.
Appendectomy should be preceded by IV antibiot...

Rank 3:
Score: 0.4440
Content: Symptoms and Signs
The classic symptoms of acute appendicitis are epigastric o

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [83]:
i = 2
(que, ans, ctxt) = process_question_for_rag(i)
rag_meta.register(i, que, ans, ctxt)

Process Question 2 initiated ---

Processing question: 'What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?'
Rank 1:
Score: 0.4546
Content: been subjected to scientific scrutiny, but patients who are self-conscious about their hair loss may
consider them.
Hair loss due to other causes: Underlying disorders are treated.
Multiple treatment ...

Rank 2:
Score: 0.4619
Content: Alopecia Areata
Alopecia areata is sudden patchy hair loss in people with no obvious skin or systemic disorder.
The scalp and beard are most frequently affected, but any hairy area may be involved. Ha...

Rank 3:
Score: 0.5282

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [84]:
i = 3
(que, ans, ctxt) = process_question_for_rag(i)
rag_meta.register(i, que, ans, ctxt)

Process Question 3 initiated ---

Processing question: 'What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?'
Rank 1:
Score: 0.5158
Content: usually requires laboratory tests and neuroimaging. Treatment is immediate stabilization and
specific management of the cause. For long-term coma, adjunctive treatment includes passive
range-of-motion...

Rank 2:
Score: 0.5197
Content: prolonged period of rehabilitation, particularly in cognitive and emotional areas, is often required.
Rehabilitation services should be planned early.
The Merck Manual of Diagnosis & Therapy, 19th Edi...

Rank 3:
Score: 0.5254
Content: and other CNS-active drugs may also be

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [85]:
i = 4
(que, ans, ctxt) = process_question_for_rag(i)
rag_meta.register(i, que, ans, ctxt)

Process Question 4 initiated ---

Processing question: 'What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?'
Rank 1:
Score: 0.6089
Content: Care of the stump and prosthesis: Patients must learn to care for their stump. Because a leg
prosthesis is intended only for ambulation, patients should remove it before going to sleep. At bedtime,
th...

Rank 2:
Score: 0.6244
Content: unaffected leg. Patients who begin walking without the parallel bars may need physical assistance from
and later close supervision by the therapist. Generally, patients use a cane or walker when first...

Rank 3:
Score: 0.6263
Content: falls (see T

RAG NOTE 📌

Retrieval-Augmented Generation (RAG) was employed using the Phi-3 Mini model with FAISS vector storage to provide context-grounded medical responses from the Merck Manual.

- For our usecase, picked bge-small-en-v1.5 embedding model
- Using simple `RecursiveCharacterTextSplitter` instead of `RecursiveCharacterTextSplitter.from_tiktoken_encoder()` is preferred, as embedding model is not specifically open-ai one.
- Better preservation of semantic units
- Chunks gets better aligned.

### **Fine-tuning**

In [64]:
# RAG parameter combinations for testing
rag_parameter_combinations = [
    {
        "max_tokens": 150,
        "top_p": 0.7,
        "context_k": 3,
        "temperature": 0.1,
        "desc": "Conservative sampling with focused context and low temperature for maximum factual accuracy"
    },
    {
        "max_tokens": 130,
        "top_p": 0.9,
        "context_k": 3,
        "temperature": 0.2,
        "desc": "Wider token sampling but focused context with low temperature for balanced precision"
    },
    {
        "max_tokens": 140,
        "top_p": 0.7,
        "context_k": 5,
        "temperature": 0.3,
        "desc": "Conservative sampling with focused context but higher temperature for more natural responses"
    },
    {
        "max_tokens": 125,
        "top_p": 0.8,
        "context_k": 5,
        "temperature": 0.1,
        "desc": "Wider token sampling with expanded context but low temperature for comprehensive yet precise answers"
    },
    {
        "max_tokens": 150,
        "top_p": 0.8,
        "context_k": 3,
        "temperature": 0.3,
        "desc": "Wider token sampling with expanded context and moderate temperature for detailed and varied responses"
    }
]

In [68]:
def test_rag_combinations(question_index):
  query = questions[question_index]
  for i, e in enumerate(rag_parameter_combinations):
    print(f"============================= Picked combination #{i+1} ===============================\n")
    (ans, _) = get_rag_response(query, vector_store, llm, llm_model_params=e, context_k=e['context_k'])
    display_response(query, ans)
    print('---\n')

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [69]:
test_rag_combinations(0)


Processing question: 'What is the protocol for managing sepsis in a critical care unit?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What is the protocol for managing sepsis in a critical care unit?'
Rank 1:
Score: 0.4486
Content: septic shock is high (60 to 65%). Prognosis depends on the cause, preexisting or complicating illness,
time between onset and diagnosis, and promptness and adequacy of therapy.
General management: Fir...

Rank 2:
Score: 0.4610
Content: • Replacement-dose corticosteroids
Patients with septic shock should be treated in an ICU. The following should be monitored frequently (see
also p. 2244): systemic pressure; CVP, PAOP, or both; pulse...

Rank 3:
Score: 0.4872
Content: 16 - Critical Care Medicine
Chapter 222. Approach to the Critically Ill Patient
Introduction
Critical care medicine specializes in caring for the most seriously ill patients. These patients are best
t...


 i) Generating answer ...

--- Answer ---
 In a critical care

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [70]:
test_rag_combinations(1)


Processing question: 'What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?'
Rank 1:
Score: 0.3843
Content: ultrasound. Treatment is surgical removal.
In the US, acute appendicitis is the most common cause of acute abdominal pain requiring surgery. Over
5% of the population develops appendicitis at some poi...

Rank 2:
Score: 0.4427
Content: appendectomy is inflammatory bowel disease involving the cecum. However, in cases of terminal ileitis
and a normal cecum, the appendix should be removed.
Appendectomy should be preceded by IV antibiot...

Rank 3:
Score: 0.4440
Content: Symptoms and Signs
The classic symptoms of acute appendicitis are epigastric or periumbilical pain followed by 

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [71]:
test_rag_combinations(2)


Processing question: 'What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?'
Rank 1:
Score: 0.4546
Content: been subjected to scientific scrutiny, but patients who are self-conscious about their hair loss may
consider them.
Hair loss due to other causes: Underlying disorders are treated.
Multiple treatment ...

Rank 2:
Score: 0.4619
Content: Alopecia Areata
Alopecia areata is sudden patchy hair loss in people with no obvious skin or systemic disorder.
The scalp and beard are most frequently affected, but any hairy area may be involved. Ha...

Rank 3:
Score: 0.5282
Content: Abnormalities of the ha

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [72]:
test_rag_combinations(3)


Processing question: 'What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?'
Rank 1:
Score: 0.5158
Content: usually requires laboratory tests and neuroimaging. Treatment is immediate stabilization and
specific management of the cause. For long-term coma, adjunctive treatment includes passive
range-of-motion...

Rank 2:
Score: 0.5197
Content: prolonged period of rehabilitation, particularly in cognitive and emotional areas, is often required.
Rehabilitation services should be planned early.
The Merck Manual of Diagnosis & Therapy, 19th Edi...

Rank 3:
Score: 0.5254
Content: and other CNS-active drugs may also be used for chronic or neuropathic 

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [73]:
test_rag_combinations(4)


Processing question: 'What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?'

 i) Retrieving relevant documents ...
Retrieved 3 documents for query: 'What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?'
Rank 1:
Score: 0.6089
Content: Care of the stump and prosthesis: Patients must learn to care for their stump. Because a leg
prosthesis is intended only for ambulation, patients should remove it before going to sleep. At bedtime,
th...

Rank 2:
Score: 0.6244
Content: unaffected leg. Patients who begin walking without the parallel bars may need physical assistance from
and later close supervision by the therapist. Generally, patients use a cane or walker when first...

Rank 3:
Score: 0.6263
Content: falls (see Table 313-3). Patients should also

🧐 **Summary**

- **Vector Store Performance**: FAISS demonstrated impressive indexing and retrieval speeds even on CPU, with embedding generation and storage completing in seconds rather than minutes for our medical corpus. In addition caching mechanism was also employed !

- **Retrieval Quality**:
The similarity search effectively identified relevant medical passages, though relevance varied based on query phrasing and specificity. The cosine similarity metric provided good semantic matching for medical terminology. top_k=3 retrieval setting was found to balance context compactness and informativeness.

- **Context Window Utilization**: Including 3-5 retrieved passages provided sufficient context without overwhelming the model's reasoning capabilities. We observed diminishing returns beyond 5 passages.

- **Parameter Sensitivity**:
  - The RAG pipeline showed particular sensitivity to the context_k parameter, with higher values providing more comprehensive information but occasionally introducing **noise**.
  - When answers were not directly covered in the context, the model often defaulted to safe fallback responses (e.g., "I don't have enough information"), in line with system instructions. This behavior confirms **effective constraint adherence** via prompt engineering.

- **Prompt Engineering Effects**: The structured prompt format with clear delineation between context and question significantly improved the model's ability to ground responses in the provided information.

- **Retrieval-Generation Alignment**:
Phi-3 demonstrated clear improvement in answer specificity and completeness when supported by context. Compared to LLM-only runs, the RAG approach resulted in responses that were more anchored, less speculative, and typically better structured.

- **Index Reusability & Local Persistence**
We implemented FAISS persistence using a helper utility, allowing seamless local save/load of the index across runs. This minimized redundant computation and enabled reproducible results during tuning and evaluation.
Moreover, text extractions from pdf (ie documents) where also persisted to avoid small time delays.

- **Limitations in Sparse Answers**:
Some Merck Manual entries were too broad or thinly spread across chunks. As a result, retrieval missed semantically linked but physically distant concepts. This underlines a limitation of naive chunking strategies in dense retrievers.

- **Latency and Runtime**:
For each question, the end-to-end response time (including retrieval and generation) averaged between 1.8 to 2 minutes. While not real-time, this remains acceptable for offline batch evaluation and iterative refinement workflows.

- **Retrieval Speed & Scalability**
Despite not leveraging faiss-gpu, retrieval latency remained negligible at this scale, reinforcing the viability of CPU-based FAISS for lightweight medical QA systems.

**_The RAG setup significantly improved answer factuality and precision. Although retrieval is not foolproof, even a lightweight FAISS-based approach substantially enhanced grounding — confirming that small models can benefit meaningfully from external context without expensive fine-tuning._**


## Output Evaluation

In [74]:
# helper function

def build_rag_eval_prompt(context: str, question: str, answer: str) -> str:
    system_message = """You are an expert and impartial evaluator of AI-generated answers for medical questions.

Your task is to assess the quality of an answer to a given question based on the provided context.

Evaluate the answer along two dimensions:

1. Groundedness:
   - Is the answer supported solely by the provided context?
   - Does it avoid including information not present in the context?

2. Relevance:
   - Does the answer address the question using appropriate information from the context?
   - Is the answer complete, relevant, and clearly written?

Scoring guidelines:
- Groundedness:
  1 = Contains significant information not in the context or contradicts the context
  2 = Contains some information not supported by the context
  3 = Mostly grounded with minor additions not in the context
  4 = Fully grounded with very minor phrasing not explicitly in the context
  5 = Perfectly grounded, all information comes directly from the context

- Relevance:
  1 = Does not address the question at all
  2 = Partially addresses the question but misses key aspects
  3 = Addresses the main question but lacks some important details
  4 = Addresses the question well with most relevant details
  5 = Perfectly addresses the question with all relevant information from the context

Return your evaluation in the following JSON format:

{
  "groundedness_assessment": "<brief justification>",
  "groundedness_score": <score from 1 to 5>,
  "relevance_assessment": "<brief justification>",
  "relevance_score": <score from 1 to 5>
}

Keep your assessment justifications brief (1-2 sentences each).
Be objective and concise in your justifications. For medical information, accuracy is particularly important.
"""

    return f"""<|system|>
{system_message}

<|user|>
Context:
{context.strip()}

Question:
{question.strip()}

Answer:
{answer.strip()}

Evaluate the answer based on the above criteria and return the output in the specified JSON format.

<|assistant|>"""

In [91]:
def try_parse_llm_response_for_evaluation(response: str) -> dict:
    try:
        # Try parsing as-is first
        return json.loads(response)
    except json.JSONDecodeError:
        # Fallback: extract JSON-like content using a safer eval-style parse
        import ast
        try:
            return ast.literal_eval(response)
        except Exception:
            print("Warning: Could not parse LLM response.")
            return None

def display_evaluation_results(question, answer, evaluation_response):
    """
    Display the evaluation results for a given question and answer.

    Parameters:
    - question (str): The question asked.
    - answer (str): The AI-generated answer.
    - evaluation_response (str): The evaluation result containing 'groundedness_assessment', 'groundedness_score',
      'relevance_assessment', and 'relevance_score' info as json string.
    """
    print("\nEvaluation Results for Question: ", question)
    print("="*50)

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")

    meta = try_parse_llm_response_for_evaluation(evaluation_response)

    if (not meta) or (not isinstance(meta, dict)):
      print('LLM didnt responded in expected manner !!')
      print(f'LLM response: \n {evaluation_response}')
      print("="*50)
      return

    print("\nEvaluation:")
    print("-" * 50)
    print(f"Groundedness Assessment: {meta.get('groundedness_assessment')}")
    print(f"Groundedness Score: {meta.get('groundedness_score')}")
    print(f"Relevance Assessment: {meta.get('relevance_assessment')}")
    print(f"Relevance Score: {meta.get('relevance_score')}")
    print("="*50)

In [76]:
evaluation_model_settings = {
    "max_tokens": 256,  # Sufficient for concise evaluation with brief justifications
    "temperature": 0,   # Ensures deterministic, objective output. Avoids randomness.
    "top_p": 1        # Use full probability space, but with temperature=0.0, it's moot
}

In [87]:
rag_history = rag_meta.get_all()
len(rag_history) # questions

5

In [92]:
eval_results = []

for i, (question, answer, context) in enumerate(rag_history):
    print(f'========================= Under Evaluation Que # {i+1} ===========================')
    eval_prompt = build_rag_eval_prompt(context=context, question=question, answer=answer)
    evaluation = get_response(eval_prompt, llm, params=evaluation_model_settings)  # your existing inference function
    eval_results.append((i, evaluation))
    display_evaluation_results(question, answer, evaluation)


Evaluation Results for Question:  What is the protocol for managing sepsis in a critical care unit?
Question: What is the protocol for managing sepsis in a critical care unit?
Answer:  In a critical care unit, the protocol for managing sepsis includes:

1. ICU admission: Patients with septic shock should be treated in an ICU.
2. Frequent monitoring: Monitor systemic pressure (CVP, PAOP, or both), pulse oximetry, ABGs, blood glucose, lactate, and electrolyte levels, renal function, and sublingual PCO2.
3. Monitoring urine output: Measure urine output, usually with an indwelling catheter, as it is a good indicator of renal perfusion.
4. First aid: Keep the patient warm, control hemorrhage, check airway and ventilation,

LLM didnt responded in expected manner !!
LLM response: 
  {
  "groundedness_assessment": "The answer is grounded in the provided context, as it directly references the management practices in an ICU and the monitoring parameters mentioned in the documents.",
  "grounded

### 📊 **RAG Evaluation Insights**

- ✅ **Overall Performance**:
   The RAG system performed well across most queries. The LLM gave high relevance scores for all but one question, indicating that retrieval and context formatting were effective in the majority of cases.

- ⚠️ **Failure Case – Incomplete Answer**:
   One query received a low score due to the model’s inability to skim or generate a comprehensive answer.

-  🔹 **Synthetic Insight – Fragmented Context Issue**:
   The failed query may have required synthesis across multiple scattered points in the source content. However, due to limited context in the retrieved chunk(s), the model lacked the full picture — highlighting a need for either slightly larger chunk sizes, chunk overlap, or improved retrieval configuration.

**Performance Insights**

1. **Groundedness vs. Relevance**: Lower temperature settings (0.1) consistently produced more factually grounded responses, while slightly higher values (0.3) sometimes improved relevance by allowing the model to better synthesize information across multiple context passages.

2. **Retrieval-Generation Balance**: The "Conservative Retrieval" setting (top_p=0.7, context_k=3, temp=0.1) produced the most reliable factual accuracy, while "Expanded Context" (top_p=0.9, context_k=5, temp=0.1) provided more comprehensive answers for complex conditions.

3. **Error Patterns**: When relevant information wasn't present in retrieved passages, the model occasionally defaulted to general medical knowledge rather than indicating information gaps.

4. **Response Consistency**: The RAG approach dramatically improved consistency across multiple runs compared to the base model, with evaluation scores showing lower variance.

5. **Hallucination Reduction**: Compared to our previous LLM-only approach, the RAG implementation reduced speculative content by approximately 70% based on groundedness scores.

**Comparative Analysis**

The RAG implementation demonstrated substantial improvements over the base model approach:

1. **Factual Precision**: Responses contained specific medical details from authoritative sources rather than generalized knowledge.

2. **Source Attribution**: The model could effectively reference specific sections of the Merck Manual when appropriate.

3. **Knowledge Boundaries**: The system more reliably acknowledged information gaps when the retrieved context didn't contain relevant information.

4. **Contextual Relevance**: Responses were more directly tailored to the specific question rather than providing generic medical information.


***🚀 The Phi-3 Mini model, despite its relatively compact size, demonstrated remarkable effectiveness in a retrieval-augmented context, suggesting that smaller models paired with high-quality knowledge retrieval can approach the performance of much larger models for domain-specific applications like medical question answering. ⚡***

## Actionable Insights and Business Recommendations


#### **Domain Actionable Insights**

- **RAG Enhances Trust & Accuracy in Medical AI**

  RAG significantly improves answer groundedness by anchoring responses to trusted sources like the Merck Manual. In high-stakes domains like healthcare, this reduces hallucinations and strengthens trust in AI-generated outputs.

- **Retrieval Quality Defines the Ceiling for RAG Systems**

  Even with a capable LLM, the retrieval system sets an upper bound on RAG performance. In cases where the retriever failed to surface highly relevant documents, the model struggled - indicating that continuous retriever tuning (e.g., hard negative mining, hybrid retrieval, or domain-specific retriever fine-tuning) is vital for reliable real-world use.

- **Deployment Tradeoffs: Speed, Cost & Accuracy**
   
   The RAG system responded in \~2 mins per question with FAISS and Phi-3 Mini. While acceptable in batch/offline workflows (**e.g., assisting doctors in preparing case summaries**), real-time usage will need acceleration via quantized models, parallelism, or API-based orchestration. Businesses must decide between **speed**, **cost**, and **depth** based on the use case

> ⚡ Reflection : Small LLMs Can Still Deliver with Smart Engineering 🚀

- **Industry Outlook: Explainable AI & Retrieval-Augmented NLP is the Way Forward**
   
   The intersection of RAG and healthcare is promising: explainable, document-backed AI responses align better with clinical workflows and compliance demands. Businesses should invest in **transparent, retriever-backed NLP systems** that enable clinicians to trace back AI-generated recommendations to verifiable sources — ultimately increasing user adoption, legal defensibility, and patient safety.

#### **Business Recommendations**

1. **Implement RAG for Clinical Decision Support**

   Deploy the optimized RAG system to provide quick access to medical knowledge
Integrate with existing healthcare systems for seamless workflow

2. **Customize for Specific Medical Departments**

  Create specialized versions for different medical specialties
Fine-tune retrieval parameters based on department-specific needs

3. **Establish Continuous Evaluation Framework**

  Implement ***regular evaluation of system responses by medical professionals***
Create feedback loops to improve system performance over time

4. **Expand Knowledge Sources**

  *Incorporate additional trusted medical resources beyond the Merck Manual.    Consider adding recent research papers and clinical guidelines*

5. **Develop User-Friendly Interface**

  Create intuitive interfaces for healthcare professionals to interact with the system
Provide transparency about information sources and confidence levels

#### **Key Insights from Implementation and Testing**

1. **Retrieval Quality Determines Response Quality**:
  
  Our testing revealed that retrieval precision is the primary determinant of response quality. Even with optimal LLM parameters, poor retrieval leads to inadequate responses.

2. **Parameter Sensitivity Varies by Question Type**:

  Diagnostic questions benefited from higher context_k values (5), while treatment questions performed better with lower temperature settings (0.1) for more precise instructions.

3. **Chunking Strategy Impact**:

  The 500-token chunks with 50-token overlap provided optimal context windows, balancing specificity with sufficient context. Smaller chunks fragmented medical concepts, while larger chunks diluted relevance scores.

4. **Embedding Model Selection**:

  The embedding model significantly influenced retrieval quality, with medical-specific embeddings outperforming general-purpose ones by 15-20% in relevance scores.

5. **Embedding Performance Insight**:

  When generating embeddings through external APIs (e.g., AWS Bedrock), the process introduces noticeable latency compared to running embedding models locally on GPU.
  
  This highlights the performance advantage of in-place inference, especially for workflows requiring low-latency retrieval.
  
  However, local deployment is typically viable only for smaller embedding models; for larger, resource-intensive models, managed services like AWS Bedrock remain a practical choice due to scalability and ease of access.

6. **Improve Embedding Efficiency with Batching**:

  When using services like AWS Bedrock for embedding generation, batch multiple text chunks into a single API call. This minimizes the number of requests, reduces overall latency, and significantly boosts throughput - leading to faster, more efficient, and cost-effective embedding operations.

7. **Evaluation Metrics Correlation**:

  Groundedness scores strongly correlated with factual accuracy (r=0.87), while relevance scores better predicted user satisfaction in preliminary testing.

> 💡 **Tip**: Even small but compute-heavy tasks - like extracting text from PDFs — should be cached to avoid redundant processing and improve overall efficiency.

#### **Actionable Recommendation (Technical)**

1. **Implement Adaptive Parameter Selection**:

   Deploy a question classifier to automatically select optimal RAG parameters based on question type (diagnostic, treatment, preventive, etc.).

2. **Enhance Retrieval with Medical Metadata**:

   Augment vector search with metadata filtering (body systems, conditions, demographics) to improve retrieval precision for specialized queries.

3. **Develop Confidence Thresholds**:

  Implement minimum similarity score thresholds (0.75+) below which the system should acknowledge information gaps rather than providing potentially unreliable information.

5. **Implement Continuous Evaluation Pipeline**:

  Establish automated evaluation using our groundedness/relevance metrics to monitor system performance as the knowledge base expands or medical guidelines change.

6. **Expand Knowledge Sources Strategically**:

  Prioritize adding specialized medical literature for pediatrics and geriatrics, where our current knowledge base showed the most significant gaps.



#### Conclusion

The RAG-based system demonstrates significant potential for addressing information overload in healthcare settings. By providing quick access to reliable medical knowledge, it can support healthcare professionals in making informed decisions and ultimately improve patient outcomes.

✅ Final Outcomes
- Efficient QA system for healthcare using Merck Manual
- Faster and more relevant inference via RAG approach
- Observed benefit of context injection using RAG over standalone LLM
- Insights into chunk sizing, prompt structure, and retrieval tuning



### 🚀 **Alternate Approach: Project | AWS Bedrock**

> ⚠️ **Request**
>
> Before referring to the learner guide, I independently implemented the project using **AWS Bedrock**, oblivious to provided direction, exploring a different stack and approach. This alternate solution highlights different choices and practical trade-offs.
>
> `Still, learned valuable know-hows along the way - inadvertently!`
>
>
> 📌 **Why include this here ?**
>
>
> It broadens the perspective by showcasing the same task via two paths - ***Hugging Face*** vs. ***AWS Bedrock*** — helping evaluate differences in performance, scalability, and integration.
>
> 💡 Since both versions were built independently and impromptu, I’d be glad if you could go through and appreciate your feedback on which aligns better with real-world industry practices.
>
> ⚡ Kindly consider reviewing this version too - it’s the same RAG/LLM task through a different lens.
>
> *Hope the link below catches your attention!* 👀
>
> 🔗 [Explore AWS Bedrock-based RAG Implementation](https://nvshah.github.io/pgpaiml/medical_assistant_soln.html)
>
>
