## Problem Statement

### Business Context

The healthcare industry is rapidly evolving, with professionals facing increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. The need for quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

To address these challenges, healthcare centers can focus on integrating systems that streamline access to medical knowledge, provide tools to support quick decision-making, and enhance efficiency. Leveraging centralized knowledge platforms and ensuring healthcare providers have continuous access to reliable resources can significantly improve patient care and operational effectiveness.

**Common Questions to Answer**

**1. Diagnostic Assistance**: "What are the common symptoms and treatments for pulmonary embolism?"

**2. Drug Information**: "Can you provide the trade names of medications used for treating hypertension?"

**3. Treatment Plans**: "What are the first-line options and alternatives for managing rheumatoid arthritis?"

**4. Specialty Knowledge**: "What are the diagnostic steps for suspected endocrine disorders?"

**5. Critical Care Protocols**: "What is the protocol for managing sepsis in a critical care unit?"

### Objective

As an AI specialist, your task is to develop a RAG-based AI solution using renowned medical manuals to address healthcare challenges. The objective is to **understand** issues like information overload, **apply** AI techniques to streamline decision-making, **analyze** its impact on diagnostics and patient outcomes, **evaluate** its potential to standardize care practices, and **create** a functional prototype demonstrating its feasibility and effectiveness.

### Data Description

The **Merck Manuals** are medical references published by the American pharmaceutical company Merck & Co. They cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899, when Merck & Co. was still a subsidiary of the German company Merck.

The manual is provided as a PDF with over 4,000 pages divided into 23 sections.

## Installing and Importing Necessary Libraries and Dependencies

In [None]:
# Installation for GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

# Installation for CPU llama-cpp-python
# Uncomment and run the following code in case GPU is not being used
# !CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

# For installing the libraries & downloading models from HF Hub
# Explicitly specifying versions to avoid conflicts
!pip install \
    huggingface_hub==0.23.2 \
    pandas==1.5.3 \
    tiktoken==0.6.0 \
    pymupdf==1.24.1 \
    langchain==0.1.1 \
    langchain-community==0.0.13 \
    sentence-transformers==2.7.0 \
    numpy==1.26.4 \
    scipy==1.12.0 \
    transformers==4.40.0 \
    packaging==23.2 \
    langchain-core==0.1.23


In [None]:
# Suppress warning messages to keep the notebook output clean.
import warnings
warnings.filterwarnings('ignore')

# Libraries for data processing
import json, os
import pandas as pd
import tiktoken # Token counting

# Libraries from the LangChain framework for building the RAG pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter  # To split long texts into smaller chunks
from langchain_community.document_loaders import PyMuPDFLoader      # To load data from PDF files
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings # To create numerical representations (embeddings) of text
from langchain_community.vectorstores import Chroma                   # The vector database to store and retrieve embeddings

# Libraries for downloading and interacting with the LLM
from huggingface_hub import hf_hub_download  # To download the model file from Hugging Face
from llama_cpp import Llama                  # The LlamaCPP class to load and run the model

## Question Answering using LLM

First, we'll see how the model answers questions on its own, without prompt engineering and without RAG retrieval from the Merck Manuals.

This will give us a baseline.

#### Downloading and Loading the model

In [None]:
# Define the model ID and filename from Hugging Face Hub
# We are using Mistral-7B-Instruct, a powerful model for instruction following.
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
# Corrected filename based on the repository content
model_file = "mistral-7b-instruct-v0.2.Q5_K_M.gguf" # Using Q5_K_M as it is available and a good balance of quality and performance

print(f"Downloading model: {model_file} from {model_id}...")

# Import the userdata module to access Colab secrets
from google.colab import userdata

# Get the Hugging Face token from Colab secrets
# Make sure you have added a secret named 'HF_TOKEN' in the Colab Secrets Manager
hf_token = userdata.get('HF_TOKEN')

# Download the model from the Hugging Face Hub, using the token for authentication
model_path = hf_hub_download(
    repo_id=model_id,
    filename=model_file,
    token=hf_token # Pass the Hugging Face token here
)
print("Model download complete.")

# Load the downloaded model into Llama-cpp-python
print("Loading model into memory...")
llm = Llama(
    model_path=model_path,
    n_gpu_layers=0, # Set to 0 for CPU-only. Set to -1 to offload all possible layers to GPU.
    n_batch=512,    # Batch size for processing prompts (can be adjusted based on VRAM).
    n_ctx=4096,     # The maximum context size for the model, increased to 4096
    verbose=True   # Changed to True to potentially work around the fileno error.
)
print("Model loaded successfully.")

Downloading model: mistral-7b-instruct-v0.2.Q5_K_M.gguf from TheBloke/Mistral-7B-Instruct-v0.2-GGUF...
Model download complete.
Loading model into memory...
Model loaded successfully.


AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


#### Response

In [None]:
# Define a function to generate a response from the LLM
def response(query, max_tokens=128, temperature=0, top_p=0.95, top_k=50):
    """
    Generates a response from the base LLM.
    """
    model_output = llm(
        prompt=query,                # User input/Query
        max_tokens=max_tokens,       # Output max token length
        temperature=temperature,     # Controls output randomness
        top_p=top_p,                 # Filters token probability pool
        top_k=top_k,                 # Likely tokens considered
        echo=False                   # We don't want the model to repeat our prompt in its answer
    )

    # Only return the generated text from the model's output
    return model_output['choices'][0]['text'].strip()


### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
user_input1 = "What is the protocol for managing sepsis in a critical care unit?"
print(response(user_input1))

Sepsis is a life-threatening condition that can arise from an infection, and it requires prompt recognition and aggressive management in a critical care unit. The following are general steps for managing sepsis in a critical care unit:

1. Early recognition: Recognize the signs and symptoms of sepsis early and initiate treatment as soon as possible. Sepsis can present with various clinical features, including fever or hypothermia, tachycardia or bradycardia, altered mental status, respiratory distress, and lactic acidosis.
2. Resusc


Observations:

The base model provides a general, plausible-sounding answer about sepsis management. However, it lacks specific details, dosages, or alternative treatments that would likely be found in a medical manual. The answer is based on the model's generalized training, not a definitive source.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
user_input2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
print(response(user_input2))

Llama.generate: prefix-match hit


Appendicitis is a medical condition characterized by inflammation of the appendix, a small pouch that extends from the large intestine. The following are common symptoms of appendicitis:

1. Abdominal pain: The pain is typically located in the lower right side of the abdomen and may start as a mild discomfort that gradually worsens over time. It may be constant or come and go, and it may be accompanied by cramping or bloating.
2. Loss of appetite: People with appendicitis often lose their appetite due to the pain and discomfort in


Observations:

The model correctly identifies common symptoms such as abdominal pain and loss of appetite. However, the response is cut off by the 128-token limit before it addresses whether medication can cure appendicitis or which surgical procedure applies. This information is common knowledge and likely well represented in the model's training data. The answer is generally accurate but not verifiable.


### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
user_input3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
print(response(user_input3))

Llama.generate: prefix-match hit


Sudden patchy hair loss, also known as alopecia areata, is a common autoimmune disorder that affects the hair follicles. It can result in round or oval bald patches on the scalp, but it can also occur on other parts of the body such as the beard area, eyebrows, and eyelashes.

The exact cause of alopecia areata is not known, but it's believed to be related to a combination of genetic and environmental factors that trigger an immune response against the hair follicles. Some possible triggers include stress, viral in


Observations:

The model correctly identifies the condition as alopecia areata and correctly states that the exact cause is unknown but believed to be autoimmune.

This appears to be a good general overview but lacks the depth a healthcare professional would need, such as different classes of treatments or success rates.

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
user_input4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
print(response(user_input4))

Llama.generate: prefix-match hit


There is no one-size-fits-all answer to this question as the specific treatment recommendations for a person with a brain injury depend on various factors such as the severity and location of the injury, the age and overall health of the individual, and the extent of the resulting impairments. However, I can provide an overview of some common treatments and interventions that may be recommended for individuals with brain injuries.

1. Acute care: In the immediate aftermath of a brain injury, the focus is on providing acute care to stabilize the patient's condition and prevent further damage. This may


Observations:

The answer notes that treatment depends on the specific injury and gives a broad overview of acute care.

The information is high-level and not specific enough for clinical decision-making, as it gets cut off by the token limit before providing substantial detail.
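
The cut-off noted above can be detected in code rather than by reading the output. This is a minimal sketch, assuming the completion dictionary returned by `llm(...)` follows the OpenAI-style layout, where each choice carries a `finish_reason` field set to `"length"` when the `max_tokens` cap was reached. The helper name `was_truncated` is hypothetical, and the dictionary below is a hand-written stand-in, not real model output.

```python
def was_truncated(model_output):
    """Return True when generation stopped because max_tokens was reached."""
    choice = model_output["choices"][0]
    # Assumption: the backend reports "length" for a token-cap stop and
    # "stop" for a natural end of the answer.
    return choice.get("finish_reason") == "length"


# Hand-written stand-in for a completion result (not real model output).
example_output = {
    "choices": [{"text": "1. Acute care: ... This may", "finish_reason": "length"}]
}

if was_truncated(example_output):
    print("Answer was cut off; consider raising max_tokens.")
```

A check like this could be added to `response` so that truncated answers are flagged before they are evaluated.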

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
user_input5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
print(response(user_input5))

Llama.generate: prefix-match hit


First and foremost, if you suspect that someone has fractured their leg while hiking, it's essential to ensure their safety and prevent further injury. Here are some necessary precautions:

1. Keep the person calm and still: Encourage them to remain as still as possible to minimize pain and prevent worsening the injury.
2. Assess the situation: Check for any signs of severe bleeding, open wounds, or other injuries. If you suspect that the fracture is open (compound), do not attempt to move the person without medical assistance.
3. Immobilize the


Observations:

The model gives sound, practical advice that aligns with general first-aid principles, such as keeping the person still, assessing the situation, and immobilizing the injury, demonstrating the model's strong foundation of common knowledge.

### **Comments on Basic Q&A Baseline LLM:**

* **Strengths:** The base LLM has a broad knowledge base and can provide generally accurate, high-level answers to common medical questions. It is fast and requires no additional data preparation.

* **Limitations:** The answers are not verifiable and lack specific details from an authoritative source. There is a risk of "hallucination" (inventing facts) on more niche topics, and the information cannot be trusted for clinical use.



## Question Answering using LLM with Prompt Engineering

* We can often get better answers from an LLM by adding context or instructions to the prompt itself. A "system prompt" tells the model how to behave or what role to adopt.

* Next, we define a **system prompt** that asks the model to act as a helpful medical assistant, in line with our problem statement and objectives.

In [None]:
# Define a system prompt to guide the LLM's response
# Using capital letters can help emphasize key instructions for the LLM.
system_prompt = """You are an AI medical assistant TASKED to provide ACCURATE and TIMELY medical information to healthcare professionals. Do not invent information if you don't know the answer."""

def response_with_system_prompt(system_prompt, user_query, max_tokens=128, temperature=0, top_p=0.95, top_k=50):
    """
    Generates a response from the loaded LLM using a system prompt.
    """
    # Mistral Instruct has no dedicated system role, so the system prompt is sent here
    # as a first [INST] ... [/INST] turn, closed with </s>, followed by the user query
    # as a second turn: <s>[INST] {system_prompt} [/INST]</s>[INST] {user_query} [/INST]
    # <s> and </s> mark the beginning and end of a sequence.
    # [INST] and [/INST] are instruction tokens that wrap user input.
    prompt = f"<s>[INST] {system_prompt} [/INST]</s>[INST] {user_query} [/INST]"

    model_output = llm(
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        echo=False
    )
    # The return statement navigates the structured output of the LLM.
    # ['choices']: Accesses the list of possible responses.
    # [0]: Selects the first and most probable response.
    # ['text']: Extracts the text content from the chosen response object.
    return model_output['choices'][0]['text'].strip()
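
A common alternative layout folds the system instructions into a single `[INST] ... [/INST]` turn instead of sending them as a separate turn. The sketch below shows that variant only for comparison; it is not the format used for the recorded outputs in this notebook. The helper name `build_single_turn_prompt` is hypothetical, and leaving out `<s>` rests on the assumption that the runtime adds the beginning-of-sequence token itself when it tokenizes the prompt.

```python
def build_single_turn_prompt(system_prompt, user_query):
    """Fold the system instructions and the question into one [INST] turn."""
    # Assumption: no literal <s> is needed because the runtime prepends the
    # beginning-of-sequence token during tokenization.
    return f"[INST] {system_prompt.strip()}\n\n{user_query.strip()} [/INST]"


demo_prompt = build_single_turn_prompt(
    "You are an AI medical assistant.",
    "What is the protocol for managing sepsis in a critical care unit?",
)
print(demo_prompt)
```

Whether this layout improves answers over the two-turn format would need to be checked by rerunning the five queries with both.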

Now we'll ask the same five questions, but this time, the LLM will have our system_prompt to guide its tone and style.  

We'll also play with some parameters to see how they affect the output.

  Parameters:
  * **temperature**
  * **max_tokens**
  * **top_k**
  * **top_p**
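
To make the sampling parameters concrete, the sketch below applies temperature, top-k, and top-p filtering to a toy set of logits. It is an illustration under stated assumptions, not the sampler used by llama.cpp: the function name `filtered_distribution`, the toy tokens, and the order in which the three filters are applied are all choices made here for clarity, and the real implementation may order or combine them differently. `max_tokens` is not shown because it only caps the number of generated tokens.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the best token.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top-k: keep only the k most likely tokens (0 means no limit).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(prob for _, prob in kept)
    return {tok: prob / kept_total for tok, prob in kept}


# Toy logits, invented for illustration only.
toy_logits = {"sepsis": 2.0, "fever": 1.0, "shock": 0.5, "banana": -3.0}
greedy = filtered_distribution(toy_logits, temperature=0.0)
sampled = filtered_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
print(greedy)
print(sampled)
```

Lower temperature and smaller `top_k` or `top_p` values shrink the pool of candidate tokens, which is why the more restrictive combinations below tend to give more repeatable answers.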



### Retry the queries with prompt engineering and parameter tuning.

### Query 1: What is the protocol for managing sepsis in a critical care unit?

Combination 1 (Baseline with System Prompt):

* temperature=0.1 (slight variability but still heavily steered toward the factual)
* max_tokens=250
* top_k=40
* top_p=0.9

In [None]:
user_input_1 = "What is the protocol for managing sepsis in a critical care unit?"
answer_1_pe_c1 = response_with_system_prompt(system_prompt, user_input_1, max_tokens=250, temperature=0.1, top_k=40, top_p=0.9)
print(f"Query 1 (PE Combo 1): {user_input_1}")
print(f"Answer 1 (PE Combo 1):\n{answer_1_pe_c1}")

Llama.generate: prefix-match hit


Query 1 (PE Combo 1): What is the protocol for managing sepsis in a critical care unit?
Answer 1 (PE Combo 1):
Sepsis is a life-threatening condition caused by the body's response to an infection, and managing it in a critical care unit requires prompt recognition and aggressive treatment. Here are some general steps for managing sepsis in a critical care unit:

1. Early Recognition: Identify sepsis early based on clinical suspicion, laboratory findings, and severity of illness using tools like the Sequential Organ Failure Assessment (SOFA) score or Quick Sequential Organ Failure Assessment (qSOFA) score.
2. Resuscitation: Begin resuscitation as soon as possible with intravenous fluids to maintain adequate blood pressure and organ perfusion. Consider administering oxygen to maintain adequate oxygenation.
3. Antibiotics: Administer broad-spectrum antibiotics based on the suspected source of infection and local microbiology data.
4. Source Control: Identify and address the source of infe

Observations:

By providing a system prompt, the model's response is significantly improved. The answer is now well-organized into a clear, numbered list, which is highly appropriate for a question about a "protocol."

It correctly identifies key pillars of sepsis management, such as early recognition, fluid resuscitation, antibiotics, and source control.

The tone is more direct and professional, aligning with the requested 'AI medical assistant' persona.

However, it's crucial to note that this information is still drawn from the model's general training data and is not yet grounded in our specific medical manual.






### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

Combination 2 (More Creative/Varied):

* temperature=0.7
* max_tokens=300
* top_k=70
* top_p=0.95

In [None]:
user_input_2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
answer_2_pe_c2 = response_with_system_prompt(system_prompt, user_input_2, max_tokens=300, temperature=0.7, top_k=70, top_p=0.95)
print(f"\nQuery 2 (PE Combo 2): {user_input_2}")
print(f"Answer 2 (PE Combo 2):\n{answer_2_pe_c2}")

Llama.generate: prefix-match hit



Query 2 (PE Combo 2): What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?
Answer 2 (PE Combo 2):
The common symptoms of appendicitis include:

1. Sudden pain in the lower right abdomen that starts near the navel and moves downwards
2. Loss of appetite
3. Nausea and vomiting
4. Fever, which may be low-grade at first but can rise as high as 101°F (38.3°C) or more
5. Inability to pass gas or have a bowel movement
6. Feeling sick or uneasy
7. Diarrhea or constipation
8. Swelling, redness, and tenderness in the abdomen
9. Abdominal pain that worsens when coughing, sneezing, or walking
10. Loss of energy and general malaise

Appendicitis cannot be cured with medicine alone. If left untreated, the appendix can burst, leading to a severe infection called peritonitis. Surgery is the only effective treatment for appendicitis. The standard surgical procedure used to remove the appendix is called an appen

Observations:

With a higher temperature (0.7), the answer is more verbose and uses slightly different phrasing compared to a zero-temperature response. The core information remains the same, but the model's "creativity" in expression is higher. This is generally not desirable for factual medical Q&A where consistency is paramount.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

Combination 3 (Very Focused & Deterministic):

* temperature=0.0
* max_tokens=200 (a lower cap on output length)
* top_k=30
* top_p=0.85

In [None]:
user_input_3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
answer_3_pe_c3 = response_with_system_prompt(system_prompt, user_input_3, max_tokens=200, temperature=0.0, top_k=30, top_p=0.85)
print(f"\nQuery 3 (PE Combo 3): {user_input_3}")
print(f"Answer 3 (PE Combo 3):\n{answer_3_pe_c3}")

Llama.generate: prefix-match hit



Query 3 (PE Combo 3): What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?
Answer 3 (PE Combo 3):
Sudden patchy hair loss, also known as alopecia areata, is an autoimmune disorder that results in the loss of hair from specific areas of the scalp or other parts of the body. The exact cause of alopecia areata is not known, but it's believed to be related to a problem with the immune system.

There are several treatments for addressing sudden patchy hair loss:

1. Corticosteroids: These medications can be applied directly to the affected area or taken orally to reduce inflammation and suppress the immune response that causes hair loss. Topical corticosteroids include foams, gels, lotions, and solutions, while oral corticosteroids are available in various forms such as tablets, capsules, and liquids.
2. Immunotherapies: These treatments aim to modify the i

Observations:

With temperature=0.0 the output is deterministic, and the answer addresses the query directly without extra conversational elements. Note that max_tokens only caps the output length: it does not make the model more concise, and the answer above is cut off mid-sentence at the 200-token limit.


### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

Combination 4 (Balanced):

* temperature=0.3
* max_tokens=350 (allowing for more detail)
* top_k=50
* top_p=0.95

In [None]:
user_input_4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
answer_4_pe_c4 = response_with_system_prompt(system_prompt, user_input_4, max_tokens=350, temperature=0.3, top_k=50, top_p=0.95)
print(f"\nQuery 4 (PE Combo 4): {user_input_4}")
print(f"Answer 4 (PE Combo 4):\n{answer_4_pe_c4}")

Llama.generate: prefix-match hit



Query 4 (PE Combo 4): What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?
Answer 4 (PE Combo 4):
For a person with a brain injury, the specific treatment recommendations depend on the severity and location of the injury. Here are some common interventions:

1. Emergency care: For severe brain injuries, immediate medical attention is necessary to prevent further damage or complications. This may include surgery for hematomas or skull fractures, airway management, and medication to control intracranial pressure.
2. Rehabilitation: Rehabilitation programs can help individuals regain lost skills and improve overall function. This may include physical therapy, occupational therapy, speech therapy, and cognitive rehabilitation.
3. Medications: Depending on the symptoms, various medications may be prescribed to manage conditions such as seizures, pain, depression, or difficulty wit

Observations:

The model prefaces its answer by noting that treatment recommendations depend on the severity and location of the injury, rather than giving a single prescriptive answer. This is consistent with the "do not invent information" rule in our system prompt.

The balanced parameters allow for a detailed but still factual overview of treatment categories, such as emergency care, rehabilitation, and medications.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

Combination 5 (Slightly More Restrictive Sampling):

* temperature=0.2
* max_tokens=270
* top_k=35
* top_p=0.92

In [None]:
user_input_5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
answer_5_pe_c5 = response_with_system_prompt(system_prompt, user_input_5, max_tokens=270, temperature=0.2, top_k=35, top_p=0.92)
print(f"\nQuery 5 (PE Combo 5): {user_input_5}")
print(f"Answer 5 (PE Combo 5):\n{answer_5_pe_c5}")

Llama.generate: prefix-match hit



Query 5 (PE Combo 5): What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?
Answer 5 (PE Combo 5):
A fractured leg is a serious injury that requires prompt medical attention. Here are some general precautions and treatment steps for a person with a fractured leg:

1. Immobilize the injured leg: Use a splint, sling, or cast to keep the bone in place and prevent any further movement. This will help reduce pain and swelling and prevent complications such as nerve damage or improper healing.
2. Control bleeding: Apply direct pressure to the injury site with a clean cloth to control any bleeding. Elevate the leg above heart level if possible to reduce swelling and promote blood flow.
3. Pain management: Use over-the-counter pain medications, such as acetaminophen or ibuprofen, as directed by a healthcare professional. Prescription pain medication may be necessary in so

Observations:

The restrictive sampling parameters (top_k, top_p) combined with a low temperature produce a stable response that appears to be of high quality. The steps are logical and well organized, and read like a reliable first-aid guide covering immobilization, bleeding control, and pain management. They sound correct, but they are still general, unverifiable medical advice that is not grounded in any specific source.

### **Comments on Prompt Engineering:**

Prompt engineering gives us significant control over the model's tone, style, and adherence to rules. By providing a clear persona and constraints via a system prompt, we can elicit more professional and reliable responses, even without changing the model's underlying knowledge.

However, it does not solve the core problem of verifiability.

## Data Preparation for RAG

To make our Merck Manuals PDF useful for the RAG system, we need to:

1. **Load the Data:** Read the content of the PDF.
2. **Chunk the Data:** Break the text of each page into smaller, digestible pieces.
3. **Embed the Chunks:** Convert these text chunks into numerical representations (embeddings).
4. **Store in a Vector Database:** Put these embeddings into a special database that allows for efficient similarity searches.
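
The four steps above can be sketched end to end with toy stand-ins. In this sketch a word-count vector plays the role of the embedding model and a plain Python list plays the role of the vector database; the helper names (`chunk_words`, `embed`, `cosine`, `retrieve`) and the placeholder pages are hypothetical and are not taken from the manual or from any library.

```python
import math
from collections import Counter


def chunk_words(text, size=8):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(word.strip(".,?").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 1. Load: placeholder pages, invented for illustration.
pages = [
    "Placeholder page about fever. Fever is a raised body temperature.",
    "Placeholder page about fractures. A fracture is a broken bone.",
]
# 2. Chunk each page separately.
chunks = [chunk for page in pages for chunk in chunk_words(page)]
# 3. Embed each chunk and 4. store (chunk, vector) pairs.
store = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, k=1):
    """Return the k stored chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


print(retrieve("What is a fracture?"))
```

In the notebook itself these roles are filled by the loader, text splitter, embedding class, and vector store imported earlier; the toy version only shows how the pieces fit together.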

### Loading the Data

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# The uploaded file is 'medical_diagnosis_manual.pdf'
merck_pdf_path = 'medical_diagnosis_manual.pdf'

pdf_loader = PyMuPDFLoader(merck_pdf_path) # Reads the PDF and extracts text content along with metadata (like page number).
manual = pdf_loader.load() # Loads all pages into a list of Document objects

Saving medical_diagnosis_manual.pdf to medical_diagnosis_manual.pdf
User uploaded file "medical_diagnosis_manual.pdf" with length 20151775 bytes


Alternate Load

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# merck_pdf_path = '/content/sample_data/medical_diagnosis_manual.pdf'

# pdf_loader = PyMuPDFLoader(merck_pdf_path) # Reads the PDF and extracts text content along with metadata (like page number).
# manual = pdf_loader.load() # Loads all pages into a list of Document objects

### Data Overview

#### Checking the first 5 pages

In [None]:
for i in range(min(5, len(manual))): # Ensure we don't go out of bounds if PDF has < 5 pages
    print(f"\n--- Page {i+1} (metadata: {manual[i].metadata}) ---")
    print(manual[i].page_content[:500],end="\n") # Print first 500 characters of the page


--- Page 1 (metadata: {'source': 'medical_diagnosis_manual.pdf', 'file_path': 'medical_diagnosis_manual.pdf', 'page': 0, 'total_pages': 4114, 'format': 'PDF 1.7', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Atop CHM to PDF Converter', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'creationDate': 'D:20120615054440Z', 'modDate': 'D:20250525161424Z', 'trapped': ''}) ---
mwill.itmission@gmail.com
SMJEDP569T
ant for personal use by mwill.itmission@g
shing the contents in part or full is liable 


--- Page 2 (metadata: {'source': 'medical_diagnosis_manual.pdf', 'file_path': 'medical_diagnosis_manual.pdf', 'page': 1, 'total_pages': 4114, 'format': 'PDF 1.7', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Atop CHM to PDF Converter', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'creationDate': 'D:20120615054440Z'

Observations:

The loader is extracting text and metadata from the manual. The initial pages, including the table of contents, are printed above, which confirms that the PDF is being loaded and read correctly.

#### Checking the number of pages

In [None]:
print(f"The manual has {len(manual)} pages.")

The manual has 4114 pages.


### Data Chunking

In [None]:
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base', # Uses the same encoding as many OpenAI models, good for token counting
    chunk_size=512,              # Size of each chunk (in tokens)
    chunk_overlap=64             # Number of tokens to overlap between chunks
)

In [None]:
# Split the loaded documents into smaller chunks
document_chunks = text_splitter.split_documents(manual)

# Check the number of chunks created
print(f"\nNumber of document chunks: {len(document_chunks)}")


Number of document chunks: 8769


In [None]:
# Verify a few chunks to see the content and overlap
if len(document_chunks) > 3:
    print(f"\n--- Chunk 0 (metadata: {document_chunks[0].metadata}) ---")
    # The 'metadata' dictionary contains info like the source file and page number.
    print(document_chunks[0].page_content) # This is the actual text content of the chunk.
    print(f"\n--- Chunk 1 (metadata: {document_chunks[1].metadata}) ---")
    print(document_chunks[1].page_content)

    # To better see overlap, we can compare the end of chunk 0 with the start of chunk 1.
    if document_chunks[0].metadata.get('page') == document_chunks[1].metadata.get('page'):
        print("\nOverlap illustration (If from same page):")
        print(f"End of Chunk 0: ...{document_chunks[0].page_content[-100:]}")
        print(f"Start of Chunk 1: {document_chunks[1].page_content[:100]}...")


--- Chunk 0 (metadata: {'source': 'medical_diagnosis_manual.pdf', 'file_path': 'medical_diagnosis_manual.pdf', 'page': 0, 'total_pages': 4114, 'format': 'PDF 1.7', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Atop CHM to PDF Converter', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'creationDate': 'D:20120615054440Z', 'modDate': 'D:20250525161424Z', 'trapped': ''}) ---
mwill.itmission@gmail.com
SMJEDP569T
ant for personal use by mwill.itmission@g
shing the contents in part or full is liable

--- Chunk 1 (metadata: {'source': 'medical_diagnosis_manual.pdf', 'file_path': 'medical_diagnosis_manual.pdf', 'page': 1, 'total_pages': 4114, 'format': 'PDF 1.7', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Atop CHM to PDF Converter', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'creationDate': 'D:20120615054440Z'

Comments:

* **RecursiveCharacterTextSplitter.from_tiktoken_encoder(...)**:
We use `from_tiktoken_encoder` so that `chunk_size` is measured in tokens (what the LLM sees) rather than characters. `cl100k_base` is the tokenizer used by many OpenAI models; other models tokenize text differently, so the counts are approximate for them.

* **chunk_size=512**:
Each chunk will be roughly 512 tokens long. This size needs to be small enough to fit into our LLM's context window along with the question and prompt, but large enough to contain meaningful information.

* **chunk_overlap=64**: Consecutive chunks share up to 64 tokens: the end of one chunk is repeated at the start of the next. This helps maintain context across splits.

* **text_splitter.split_documents(manual)**: This applies the splitting logic to all our loaded pages.

Seeing the content of a few chunks helps us understand what these pieces of text look like. The metadata is preserved for each chunk, which is very helpful for referencing later. Note that the `page` value is zero-based: the first page of the PDF has `'page': 0`.
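To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-stride sliding window over a token list. It is a simplification: `RecursiveCharacterTextSplitter` also prefers to split on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The function name `sliding_window_chunks` and the integer stand-in "tokens" are illustrative assumptions, not part of the notebook.

```python
def sliding_window_chunks(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token sequence into fixed-size windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Stand-in "tokens": plain integers instead of real tokenizer output.
tokens = list(range(1200))
chunks = sliding_window_chunks(tokens)

print(len(chunks))                         # number of windows produced
print(chunks[0][-64:] == chunks[1][:64])   # True: adjacent windows share 64 tokens
```

With a stride of 512 - 64 = 448 tokens, each new chunk starts 448 tokens after the previous one, which is why a larger overlap produces more chunks for the same document.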

### Embedding

Computers don't understand words directly; they understand numbers.  
Embeddings convert text chunks into lists of numbers (vectors) such that chunks with similar meanings have vectors that are numerically "close" to each other.  
A sentence-transformer model is trained specifically for this task.
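As a rough illustration of what "numerically close" means, the sketch below computes cosine similarity between small hand-made vectors. The three-dimensional vectors and their labels are invented for the example; real embeddings from the model used here have 384 dimensions and come from the model, not from hand-picked values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings of three short texts.
fever = [0.9, 0.1, 0.0]
high_temperature = [0.8, 0.2, 0.1]
fracture = [0.0, 0.2, 0.9]

# Texts with related meanings should score higher than unrelated ones.
print(cosine_similarity(fever, high_temperature) > cosine_similarity(fever, fracture))  # True
```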

In [None]:
# !pip install numpy==1.26.4 scipy==1.12.0 transformers==4.40.0 sentence-transformers langchain-community==0.0.38
from langchain_community.embeddings import SentenceTransformerEmbeddings

# We'll use a pre-trained model from Hugging Face. "all-MiniLM-L6-v2" is a good quality and relatively small/fast.
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"

embedding_model = SentenceTransformerEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={'device': 'cuda'} # Specify 'cpu' or 'cuda' for execution
)


In [None]:
# Test the embedding model on a couple of chunks
if len(document_chunks) > 1:
    embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)
    embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)

    print(f"\nDimension of the embedding vector: {len(embedding_1)}")
    print(f"Are dimensions of embedding_1 and embedding_2 the same? {len(embedding_1) == len(embedding_2)}")
else:
    print("Not enough document chunks to create embeddings for this example.")


Dimension of the embedding vector: 384
Are dimensions of embedding_1 and embedding_2 the same? True


### Vector Database

Now that we can turn text chunks into embeddings (vectors), we need a place to **store** them and efficiently **search** through them.
This is where a **Vector Database** comes in. We will be using **Chroma**.
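Conceptually, the search step of a vector store can be sketched as an exhaustive nearest-neighbour scan, shown below. This is only an illustration under simplifying assumptions: the function `top_k`, the toy vectors, and the choice of Euclidean distance are made up for the example, and real vector databases generally use approximate indexes rather than comparing the query against every stored vector.

```python
import math


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors closest to `query_vec`.

    `store` is a list of (text, vector) pairs; closeness is Euclidean distance.
    """
    ranked = sorted(store, key=lambda item: math.dist(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "database": three texts with invented 3-dimensional vectors.
store = [
    ("chunk about fever", [0.9, 0.1, 0.0]),
    ("chunk about fractures", [0.0, 0.2, 0.9]),
    ("chunk about infection", [0.7, 0.3, 0.1]),
]

print(top_k([0.8, 0.2, 0.0], store, k=2))
```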

In [None]:
# Define a directory to persist (save) the vector database
out_dir = 'medical_db_merck'

# Create the directory if it doesn't exist
os.makedirs(out_dir, exist_ok=True)

In [None]:
# Create the Chroma vector store from our document chunks and embedding model.
# This process takes each chunk, creates an embedding, and stores it.
# This builds the knowledge base the RAG system will use.
# Note: This can take a few minutes for a large document.
!pip install chromadb

print("\nCreating vector store. This might take a while...")

# Define a batch size that is less than the max batch size reported by Chroma
batch_size = 5000 # Using 5000, which is less than the max of 5461

# Initialize the Chroma vector store. Because persist_directory is set, it is saved to disk rather than kept only in memory.
# Documents are added in batches below.
vectorstore = Chroma(
    embedding_function=embedding_model,
    persist_directory=out_dir
)

# Iterate through document chunks in batches and add to Chroma
total_batches = -(-len(document_chunks) // batch_size)  # Ceiling division, correct even for exact multiples
for i in range(0, len(document_chunks), batch_size):
    batch_chunks = document_chunks[i:i + batch_size]
    print(f"Adding batch {i // batch_size + 1} of {total_batches}...")
    vectorstore.add_documents(documents=batch_chunks)

print(f"Vector store created and persisted in '{out_dir}'.")


Creating vector store. This might take a while...
Adding batch 1 of 2...
Adding batch 2 of 2...
Vector store created and persisted in 'medical_db_merck'.
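The batch arithmetic is easy to get subtly wrong: a floor-division-plus-one count overstates the total whenever the number of items is an exact multiple of the batch size. The sketch below shows one way to generate batches with a correct total, using plain integers in place of document chunks. The helper name `iter_batches` is an assumption made for this illustration.

```python
import math


def iter_batches(items, batch_size):
    """Yield (batch_number, total_batches, batch) for consecutive slices of `items`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = math.ceil(len(items) / batch_size)  # ceiling, so exact multiples are not overcounted
    for number, start in enumerate(range(0, len(items), batch_size), start=1):
        yield number, total, items[start:start + batch_size]


# Stand-in for the 8769 document chunks created earlier.
fake_chunks = list(range(8769))
for number, total, batch in iter_batches(fake_chunks, 5000):
    print(f"Adding batch {number} of {total} ({len(batch)} items)")
```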


In [None]:
# In future sessions, to save time, you can load the existing database instead of rebuilding it.
# To do this, skip the batch-insertion loop in the previous cell and uncomment the line below:
# vectorstore = Chroma(persist_directory=out_dir, embedding_function=embedding_model)

# Test a similarity search to verify the vector store is working.
test_query_for_vectorstore = "symptoms of fever"
print(f"\nTesting similarity search for: '{test_query_for_vectorstore}'")

searched_docs = vectorstore.similarity_search(test_query_for_vectorstore, k=2) # Get top 2 similar docs

for i, doc in enumerate(searched_docs):
    print(f"--- Similar Doc {i+1} (Page: {doc.metadata.get('page', 'N/A')}) ---") # Chroma uses 'page' in metadata
    print(doc.page_content[:300] + "...")


Testing similarity search for: 'symptoms of fever'
--- Similar Doc 1 (Page: 1296) ---
• Identifying any localizing symptoms (eg, headache, cough): These symptoms help narrow the range of
possible causes. The localizing symptom may be part of the patient's chief complaint or identified only
by specific questioning.
• Determining whether the patient is seriously or chronically ill (par...
--- Similar Doc 2 (Page: 2896) ---
Fever plays an integral role in fighting infection and, although it is uncomfortable, does not necessitate
treatment in an otherwise healthy child. Some studies even indicate that lowering the temperature can
prolong some illnesses. However, fever increases the metabolic rate and the demands on the
...


Observations:

The similarity search is working: both returned chunks relate to the query, and the second discusses fever directly.

### Retriever

The vector store we created can perform similarity searches. We wrap it in a LangChain retriever object, which is the standard way to fetch relevant documents for a RAG chain.

In [None]:
# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_type='similarity', # We want to find chunks similar to the query.
    search_kwargs={'k': 3}    # This sets the number of relevant chunks to fetch to 3.
)

In [None]:
# Test the retriever
test_query_for_retriever = "treatment for high blood pressure"
print(f"\nTesting retriever for: '{test_query_for_retriever}'")
relevant_docs = retriever.get_relevant_documents(test_query_for_retriever)

print(f"\nRetrieved {len(relevant_docs)} documents:")
for i, doc in enumerate(relevant_docs):
    print(f"--- Relevant Doc {i+1} (Page: {doc.metadata.get('page', 'N/A')}) ---")
    print(doc.page_content[:300] + "...")


Testing retriever for: 'treatment for high blood pressure'

Retrieved 3 documents:
--- Relevant Doc 1 (Page: 2590) ---
usually < 1 g/day (it is occasionally higher and in the nephrotic range).
Ultrasonography is done only if other causes of renal failure must be excluded. It may show that kidney
size is reduced. Renal biopsy is done only if the diagnosis remains unclear.
Prognosis
Prognosis usually depends on adequa...
--- Relevant Doc 2 (Page: 2230) ---
Chapter 208. Arterial Hypertension
Introduction
Hypertension is sustained elevation of resting systolic BP (≥ 140 mm Hg), diastolic BP (≥ 90 mm
Hg), or both. Hypertension with no known cause (primary; formerly, essential hypertension) is
most common. Hypertension with an identified cause (secondary ...
--- Relevant Doc 3 (Page: 2268) ---
needed, is highly effective but can depress respiration, can reduce myocardial contractility, and is a
potent venous vasodilator. Hypotension and bradycardia secondary to morphine can usually be overc

The retriever returned three chunks, which vary in how directly they address high blood pressure:
* Page 2590: renal testing (ultrasonography, biopsy) and prognosis
* Page 2230: Chapter 208, Arterial Hypertension (the most directly relevant)
* Page 2268: drug effects, including hypotension secondary to morphine

### System and User Prompt Template

Now we combine the retriever (which fetches relevant information from our PDF) with the LLM (which generates answers).

This is the advanced prompt that uses XML tags, explicit rules, and a step-by-step process to guide the LLM's behavior more precisely.

### Why This Prompt Is Effective:

This optimized system prompt is designed to be much clearer and more forceful for the AI model. It incorporates several modern prompt engineering techniques to improve the reliability, accuracy, and formatting of the RAG system's responses.

* **Structure (`<tags>`):** The XML-style tags help the model parse the instructions more effectively. This clearly separates the AI's persona, its rules, the required process, and helpful examples, reducing ambiguity.

* **Clarity and Forcefulness (MUST):** Using strong, direct imperatives like "MUST" reduces the chance of the model deviating from critical instructions, especially the rules about using only the provided context and citing sources.

* **Step-by-Step Process:** Outlining a clear, logical process encourages better internal reasoning from the model. It guides the AI to first analyze the question, then scrutinize the context, and finally synthesize the answer, ensuring all constraints are applied correctly.

* **Embedded Example (One-Shot Learning):** An `<example>` block that demonstrates the desired output, including the exact citation format (`[Page: X]`) and professional tone, is often the most significant enhancement, because showing the model what to do tends to be more effective than describing it. The prompt below does not include such a block, so the citation format is conveyed only through the rules.

* **Specificity:** The prompt is more specific about how to handle missing information (by stating an exact phrase) and the precise format for citations, leaving less room for misinterpretation by the model.

In [None]:
# System prompt for the RAG chain. This is the master set of instructions.
qna_system_message = """
<instructions>
  <persona>
    You are an expert medical AI assistant. Your purpose is to provide accurate, concise answers based on the provided medical text.
  </persona>

  <rules>
    1.  **Exclusive Context:** You MUST answer the question based exclusively on the information within the provided `<context>`. Do not use any external knowledge.
    2.  **No Answer if Not in Context:** If the information required to answer the question is not in the `<context>`, you MUST respond with "The provided context does not contain the information needed to answer this question." Do not attempt to guess.
    3.  **Mandatory Citation:** For every piece of information you extract from the context, you MUST cite its source page number. The page number is provided at the start of each context chunk (e.g., 'Source Page: X').
    4.  **Citation Format:** Citations must be in the format `[Page: X]` and placed immediately after the sentence or clause they support.
    5.  **Conciseness:** Provide a direct answer to the question without unnecessary conversational fluff.
  </rules>

  <process>
    Follow these steps to generate your answer:
    Step 1: Carefully read and understand the user's `<question>`.
    Step 2: Thoroughly analyze the `<context>` to locate all relevant information that can help answer the question.
    Step 3: Synthesize the located information into a direct and professional answer.
    Step 4: As you write the answer, ensure every claim from the context is followed by its correct citation as defined in the rules.
  </process>
</instructions>
"""

Using this advanced system prompt, you no longer need a separate qna_user_message_template variable. The logic for the user's part of the prompt will be handled directly inside our response function.

### Response Function

This function orchestrates the RAG process:

1. Take a user's question.
2. Run a similarity search on our Merck Manual vector store to get the `k` most relevant document chunks.
3. Combine these chunks, each prefixed with its source page, into a single context string.
4. Combine `qna_system_message` with a user message built from the context and the question to form the final prompt for the LLM.
5. Get the answer from the LLM.

In [None]:
# This is the function that orchestrates the entire RAG process.
def generate_rag_response(query: str, vectorstore, llm, system_prompt: str, k: int = 5) -> str:
    """
    Generates a RAG response, aligning with the new structured system prompt.
    """
    # 1. Retrieve relevant context from the vector store
    similar_chunks = vectorstore.similarity_search(query, k=k)

    # 2. Format the retrieved context with page numbers
    formatted_context = "\n\n".join(
        [f"Source Page: {chunk.metadata.get('page', 'N/A')}\n{chunk.page_content}" for chunk in similar_chunks]
    )

    # 3. Dynamically create the user's part of the prompt
    user_part = f"""
<context>
{formatted_context}
</context>
<question>
{query}
</question>
<answer>
"""
    # 4. Combine the system instructions and the user part into a final prompt
    final_prompt = f"[INST] {system_prompt} [/INST]\n[INST] {user_part} [/INST]"

    # 5. Call the LLM to generate the final answer
    output = llm(
        prompt=final_prompt,
        max_tokens=512,
        temperature=0.0,
        stop=["</answer>"] # Important: tells the model when to stop
    )

    # 6. Return the generated text
    return output['choices'][0]['text'].strip()

## Question Answering using RAG

We now run the full RAG system on five questions.

Each query also uses a different value of `k` (the number of retrieved chunks) to see how the amount of context affects the answer.

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
user_input_1 = "What is the protocol for managing sepsis in a critical care unit?"
rag_answer_1_c1 = generate_rag_response(user_input_1, vectorstore, llm, qna_system_message, k=5)
print(f"Query 1 (RAG Combo 1, k=5): {user_input_1}")
print(f"RAG Answer 1 (Combo 1):\n{rag_answer_1_c1}")

Llama.generate: prefix-match hit


Query 1 (RAG Combo 1, k=5): What is the protocol for managing sepsis in a critical care unit?
RAG Answer 1 (Combo 1):
The protocol for managing sepsis in a critical care unit involves rapid empiric antibiotic therapy, general supportive measures, and later adjusting antibiotics according to sensitivities and the site of infection. If bacterial cultures show no growth by 48 hours and the neonate appears well, antibiotics are stopped [Page: 1182]. General supportive measures include respiratory and hemodynamic management [Page: 1182]. Antibiotic selection requires an educated guess based on the suspected source, clinical setting, knowledge or suspicion of causative organisms and of sensitivity patterns common to that specific inpatient unit, and previous culture results [Page: 2456]. Commonly used antibiotics include gentamicin or tobramycin plus a third-generation cephalosporin (cefotaxime or ceftriaxone), ceftazidime plus a fluoroquinolone, or monotherapy with maximal therapeutic doses

Observations:

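The RAG-generated answer is highly specific and clinical. It identifies key interventions, such as rapid empiric antibiotic therapy and general supportive measures (including respiratory and hemodynamic management), and includes page references for verification. Note that the sentence about stopping antibiotics after 48 hours refers to neonates, so some of the retrieved context concerns neonatal sepsis rather than adult critical care.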

This is a significant improvement over the generalized answers from the base LLM, providing actionable, source-based information.




### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
user_input_2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
rag_answer_2_c2 = generate_rag_response(user_input_2, vectorstore, llm, qna_system_message, k=6)
print(f"\nQuery 2 (RAG Combo 2, k=6): {user_input_2}")
print(f"RAG Answer 2 (Combo 2):\n{rag_answer_2_c2}")

Llama.generate: prefix-match hit



Query 2 (RAG Combo 2, k=6): What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?
RAG Answer 2 (Combo 2):
The common symptoms of appendicitis include epigastric or periumbilical pain followed by brief nausea, vomiting, and anorexia; shifting pain to the right lower quadrant that increases with cough and motion; direct and rebound tenderness at McBurney's point; and additional signs like Rovsing sign, psoas sign, or obturator sign [Page: 173]. Low-grade fever is also common. However, these classic findings appear in less than 50% of patients, and symptoms may vary, especially among infants, children, elderly patients, and pregnant women [Page: 173].

Appendicitis cannot be cured via medicine alone; surgical removal is the standard treatment [Page: 164]. Open or laparoscopic appendectomy is performed, with antibiotics administered before surgery to reduce infection risk. If the appendix is perfora

Observations:

The model correctly states that symptoms include epigastric or periumbilical pain followed by nausea and vomiting, and that surgical removal is the standard treatment. The answer also includes specific clinical nuances, such as the note that the classic findings appear in fewer than 50% of patients, along with source page numbers for quick verification. This kind of specific, source-backed detail is a key benefit of RAG and is valuable in a healthcare setting.






### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
user_input_3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
rag_answer_3_c3 = generate_rag_response(user_input_3, vectorstore, llm, qna_system_message, k=3)
print(f"\nQuery 3 (RAG Combo 3, k=3): {user_input_3}")
print(f"RAG Answer 3 (Combo 3):\n{rag_answer_3_c3}")

Llama.generate: prefix-match hit



Query 3 (RAG Combo 3, k=3): What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?
RAG Answer 3 (Combo 3):
Sudden patchy hair loss, also known as alopecia areata, is a common autoimmune disorder that affects genetically susceptible individuals exposed to unclear environmental triggers [1:858]. The scalp and beard are most frequently affected areas, but any hairy part of the body may be involved [1:849].

The treatment options for alopecia areata include:

1. Topical corticosteroids: These medications can stimulate hair growth and help reduce inflammation in the follicles [1:858].
2. Minoxidil (topical): This medication can promote hair regrowth by increasing blood flow to the affected area [1:858].
3. Immunotherapy: Diphencyprone or squaric acid dibutylester can be used for severe cases of alopecia areata [1:858].
4. Oral antimalarials, corticosteroids, 

Observations:

The response provides a well-structured list of treatments for alopecia areata, identifying topical corticosteroids, minoxidil, and immunotherapy as options. It also synthesizes information about other potential causes, such as infections and drugs, from different retrieved chunks, demonstrating the model's ability to consolidate related facts. The citations appear in the format `[1:858]`, which presumably means document 1, page 858.

This output indicates that the RAG process is working: the model ties its answer to specific pages of the provided context instead of answering without sources. However, while it followed the instruction to cite the source, it deviated from the exact `[Page: X]` format requested in the prompt. This is a common behavior in LLMs and highlights the nuances of prompt engineering.
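Format deviations like this can be detected automatically. The sketch below uses regular expressions to separate citations that follow the requested `[Page: X]` format from other bracketed references. The function name `check_citations` and the sample sentence (assembled from citation styles seen in the outputs) are assumptions made for illustration, and the pattern also accepts page ranges such as `[Page: 857-858]`.

```python
import re

# Matches the requested format, e.g. "[Page: 858]" or "[Page: 857-858]".
PAGE_CITATION = re.compile(r"\[Page:\s*(\d+(?:-\d+)?)\]")
# Matches any bracketed reference, conforming or not.
ANY_BRACKET = re.compile(r"\[[^\[\]]+\]")


def check_citations(answer):
    """Return (valid_pages, nonconforming) for the bracketed citations in `answer`."""
    valid_pages = PAGE_CITATION.findall(answer)
    nonconforming = [
        ref for ref in ANY_BRACKET.findall(answer)
        if not PAGE_CITATION.fullmatch(ref)
    ]
    return valid_pages, nonconforming


sample = (
    "Alopecia areata is an autoimmune disorder [1:858]. "
    "Treatment includes topical corticosteroids [Page: 858]."
)
print(check_citations(sample))  # (['858'], ['[1:858]'])
```

A check of this kind could flag answers for a retry or for post-processing, although it only validates the citation format, not whether the cited page actually supports the claim.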


### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
user_input_4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
rag_answer_4_c4 = generate_rag_response(user_input_4, vectorstore, llm, qna_system_message, k=4)
print(f"\nQuery 4 (RAG Combo 4, k=4): {user_input_4}")
print(f"RAG Answer 4 (Combo 4):\n{rag_answer_4_c4}")

Llama.generate: prefix-match hit



Query 4 (RAG Combo 4, k=4): What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?
RAG Answer 4 (Combo 4):
For a person with a brain injury resulting in temporary or permanent impairment of brain function, early intervention by rehabilitation specialists is crucial for maximal functional recovery [Page: 3647]. Rehabilitation includes prevention of secondary disabilities such as pressure ulcers and joint contractures, prevention of pneumonia, and family education [Page: 3647]. For severe cognitive dysfunction, extensive cognitive therapy is often begun immediately after injury and continued for months or years [Page: 3647]. The specific rehabilitation therapy varies depending on the patient's abnormalities and the level and extent of the injury [Page: 325]. Supportive care is essential, including preventing systemic complications due to immobilization, providing good nutrition, 

Observations:

The model's answer centers on rehabilitation: it states that "early intervention by rehabilitation specialists" is recommended and adds prevention of secondary disabilities, cognitive therapy, and supportive care. This narrow focus suggests that the retrieved context covered rehabilitation but may not have held extensive details about other treatment modalities, and the model limited its response to what was provided.

Again, the pages are cited as explicitly requested in our system prompt. This time the page numbers are in the expected `[Page: X]` format.



### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
user_input_5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
rag_answer_5_c5 = generate_rag_response(user_input_5, vectorstore, llm, qna_system_message, k=6)
print(f"\nQuery 5 (RAG Combo 5, k=6): {user_input_5}")
print(f"RAG Answer 5 (Combo 5):\n{rag_answer_5_c5}")

Llama.generate: prefix-match hit



Query 5 (RAG Combo 5, k=6): What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?
RAG Answer 5 (Combo 5):
For a person who has fractured their leg during a hiking trip, the following precautions and treatment steps are necessary:

1. Immediate immobilization of the affected limb by splinting to prevent further injury and decrease pain [Page: 3390]. Pain is typically treated with opioids [Page: 3390].
2. Definitive treatment, such as reduction, may be required and usually involves analgesia or sedation [Page: 3390]. Closed reduction is maintained by casting, while open reduction is maintained by various surgical hardware like pins, screws, plates, or external fixators [Page: 3390].
3. RICE (rest, ice, compression, elevation) therapy should be applied to soft-tissue injuries for faster healing and pain relief [Page: 3377].
4. For patients with a lower-limb prosthesi

Observations:

This answer is comprehensive and demonstrates the system's strength in synthesizing a protocol from multiple pieces of information, with each step citing its source page. The answer outlines steps from immediate immobilization and RICE therapy to definitive treatment and rehabilitation. It also pulls in related considerations for patients with prostheses or diabetes, showing its ability to connect relevant adjacent topics found in the retrieved context.



In [None]:
# Store all RAG answers for evaluation
rag_answers = {
    "Query 1 (Sepsis)": rag_answer_1_c1,
    "Query 2 (Appendicitis)": rag_answer_2_c2,
    "Query 3 (Hair Loss)": rag_answer_3_c3,
    "Query 4 (Brain Injury)": rag_answer_4_c4,
    "Query 5 (Leg Fracture)": rag_answer_5_c5,
}
user_inputs = {
    "Query 1 (Sepsis)": user_input_1,
    "Query 2 (Appendicitis)": user_input_2,
    "Query 3 (Hair Loss)": user_input_3,
    "Query 4 (Brain Injury)": user_input_4,
    "Query 5 (Leg Fracture)": user_input_5,
}

### Fine-tuning

"Fine-tuning" in a RAG context often refers to optimizing the pipeline's parameters rather than retraining the LLM's weights.

One of the most impactful parameters is the chunking strategy. A different chunk_size can change the quality of the context retrieved.

Let's test this by using a larger chunk size.

In [None]:
print("Demonstrating pipeline tuning with a new chunking strategy...")

# 1. Define a new text splitter with a larger chunk size
text_splitter_tuned = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=1024, # Increased from 512
    chunk_overlap=150  # Increased overlap to maintain context
)

# 2. Create new document chunks with the new strategy
document_chunks_tuned = text_splitter_tuned.split_documents(manual)
print(f"\nNumber of chunks with tuned strategy: {len(document_chunks_tuned)}")


# 3. Create a new, separate vector store for the tuned chunks
out_dir_tuned = 'medical_db_merck_tuned'
os.makedirs(out_dir_tuned, exist_ok=True)

print(f"\nCreating a new vector store for the tuned chunks...")
vectorstore_tuned = Chroma.from_documents(
    documents=document_chunks_tuned,
    embedding=embedding_model,
    persist_directory=out_dir_tuned
)
print("Tuned vector store created.")

# 4. Re-run a query using the new "tuned" vector store
print("\n--- Testing Query 3 with the tuned RAG pipeline (chunk_size=1024) ---")
tuned_answer = generate_rag_response(
    user_input_3,
    vectorstore_tuned, # Using the new vector store
    llm,
    qna_system_message,
    k=3
)
print(f"Tuned RAG Answer:\n{tuned_answer}")

Demonstrating pipeline tuning with a new chunking strategy...

Number of chunks with tuned strategy: 4544

Creating a new vector store for the tuned chunks...
Tuned vector store created.

--- Testing Query 3 with the tuned RAG pipeline (chunk_size=1024) ---


Llama.generate: prefix-match hit


Tuned RAG Answer:
The treatment for sudden patchy hair loss, also known as alopecia areata, includes topical or intralesional corticosteroids, topical minoxidil, topical anthralin, topical immunotherapy (diphencyprone or squaric acid dibutylester), or psoralen plus ultraviolet A (PUVA) [Page: 858]. Hormonal modulators such as oral contraceptives or spironolactone may be useful for female-pattern hair loss associated with hyperandrogenemia [Page: 858]. The underlying cause of alopecia areata is believed to be an autoimmune disorder affecting genetically susceptible people exposed to unclear environmental triggers [Page: 858]. Other possible causes of sudden patchy hair loss include infection, drugs (including chemotherapeutic agents), and systemic illnesses [Page: 856]. A thorough evaluation including a history, physical examination, and possibly microscopic hair examination or scalp biopsy may be required for definitive diagnosis [Page: 857-858].


Observations:

The tuned answer for Query 3 is more compact than the original RAG answer: it is a single paragraph instead of a numbered list, and it uses the requested `[Page: X]` citation format. By increasing `chunk_size` to 1024, each retrieved chunk contains more information. In this case, the retriever was likely able to find a highly relevant chunk that summarized the main treatments (most citations point to page 858), leading to a more focused response.

This demonstrates the trade-off: larger chunks can provide more complete context but might sometimes be less specific than smaller, more targeted chunks.

The optimal strategy depends on the nature of the document and the types of queries expected.
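One practical constraint on this choice is the prompt budget: the retrieved context grows roughly as `k * chunk_size` tokens, and it must fit in the LLM's context window together with the instructions and the generated answer. The sketch below works through that arithmetic. The function names, the 4096-token context window, and the 400-token overhead are hypothetical values chosen for illustration; the chunk token counts are also only estimates, since `cl100k_base` is not necessarily the tokenizer the LLM itself uses.

```python
def estimated_prompt_tokens(k, chunk_size, overhead_tokens=0):
    """Upper-bound estimate: k retrieved chunks plus fixed prompt overhead."""
    return k * chunk_size + overhead_tokens


def fits_context(k, chunk_size, context_window, max_new_tokens, overhead_tokens=0):
    """True if the estimated prompt plus the generated answer fits in the window."""
    prompt = estimated_prompt_tokens(k, chunk_size, overhead_tokens)
    return prompt + max_new_tokens <= context_window


# Compare the two chunking strategies at k=3.
for size in (512, 1024):
    print(size, estimated_prompt_tokens(k=3, chunk_size=size))

# Hypothetical 4096-token window, 400 tokens of instructions, 512 new tokens.
print(fits_context(k=3, chunk_size=1024, context_window=4096,
                   max_new_tokens=512, overhead_tokens=400))
print(fits_context(k=6, chunk_size=1024, context_window=4096,
                   max_new_tokens=512, overhead_tokens=400))
```

Doubling `chunk_size` at a fixed `k` doubles the estimated context, so a larger chunk size usually has to be paired with a smaller `k` or a larger context window.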



## Output Evaluation

We have responses from our RAG system. But how good are they?

To evaluate them, we use an **"LLM-as-a-judge"** approach to rate our system's output on specific criteria.

We'll focus on two key metrics:

* **Groundedness** (or Faithfulness/Accuracy): Does the answer stick to the information provided in the retrieved context? It shouldn't invent facts or use outside knowledge.
* **Relevance** (or Answer Relevance): Is the answer relevant to the user's actual question? Does it directly address what they wanted to know?

### Defining Evaluation Prompts
To perform this evaluation, we need to create prompts that instruct our "judge" LLM on how to perform the assessment and score the results.

In [None]:
# System prompt for the "Groundedness" rater
groundedness_rater_system_message = """You are an expert evaluator. Your task is to assess if the provided 'Answer' is well-grounded in the given 'Context'. The 'Answer' should only contain information that can be directly verified from the 'Context'. It should not add any external information or make assumptions beyond the 'Context'.
Score the "Answer" on a scale of 1 to 5 for groundedness, where:
1: The Answer is mostly not based on the Context or contains significant external information.
2: The Answer has some basis in the Context but also includes noticeable external information or assumptions.
3: The Answer is generally based on the Context but might have minor unverified details or slight deviations.
4: The Answer is well-supported by the Context with only very minor or no external information.
5: The Answer is entirely and accurately based on the Context, with no external information or assumptions.

Provide your score as a number (e.g., "Score: 5") followed by a brief justification for your rating."""

In [None]:
# System prompt for the "Relevance" rater
relevance_rater_system_message = """You are an expert evaluator. Your task is to assess the relevance of the provided 'Answer' to the given 'Question', considering it was generated based on the 'Context'. The 'Answer' should directly address the 'Question'. It should be pertinent and focused on the user's query.
Score the 'Answer' on a scale of 1 to 5 for relevance, where:
1: The Answer is not relevant to the Question at all.
2: The Answer is only slightly relevant or significantly misses the main point of the Question.
3: The Answer is moderately relevant but could be more direct or focused.
4: The Answer is highly relevant and directly addresses the Question well.
5: The Answer is perfectly relevant, concise, and fully addresses all aspects of the Question based on the context.

Provide your score as a number (e.g., "Score: 5") followed by a brief justification for your rating."""

In [None]:
# User message template for the evaluators
evaluation_user_message_template = """
### Question:
{question}

### Context:
{context}

### Answer:
{answer}

Based on the instructions given to you as an evaluator, please provide your score and justification."""

* We have two distinct system prompts, one for each evaluation aspect. They clearly define the criteria and the scoring scale.

* The **evaluation_user_message_template** provides the structure to feed the question, the context that was used to generate the answer, and the actual answer itself to the judge LLM.

### Function to Generate Evaluations

This function takes a user query and its RAG answer, then gets evaluations for groundedness and relevance.

In [None]:
# This function is modified to handle potential context window errors.
def get_evaluations(user_input, rag_answer, k_retriever_for_context=3, max_tokens_eval=150, temp_eval=0.0):
    """
    Gets groundedness and relevance evaluations for a given RAG answer.
    Includes a fix to prevent context window overflow errors.
    """
    # 1. Retrieve the context
    current_retriever = vectorstore.as_retriever(search_kwargs={'k': k_retriever_for_context})
    relevant_document_chunks = current_retriever.get_relevant_documents(query=user_input)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_evaluation = "\n\n---\n\n".join(context_list)

    # **ERROR FIX**: Truncate the context for evaluation to prevent exceeding the model's context window.
    # The total prompt includes instructions, context, question, and answer. Truncating the context
    # is a reliable way to ensure the entire payload fits. 3000 characters is a safe limit here.
    if len(context_for_evaluation) > 3000:
        context_for_evaluation = context_for_evaluation[:3000] + "\n...[CONTEXT TRUNCATED]..."

    # 2. Evaluate Groundedness
    eval_user_message_filled = evaluation_user_message_template.format(
        question=user_input,
        context=context_for_evaluation,
        answer=rag_answer
    )
    groundedness_prompt = f"<s>[INST] {groundedness_rater_system_message} [/INST]</s>[INST] {eval_user_message_filled} [/INST]"

    try:
        groundedness_output = llm(prompt=groundedness_prompt, max_tokens=max_tokens_eval, temperature=temp_eval, echo=False)
        groundedness_eval = groundedness_output['choices'][0]['text'].strip()
    except Exception as e:
        groundedness_eval = f"Error during groundedness evaluation: {e}"

    # 3. Evaluate Relevance
    relevance_prompt = f"<s>[INST] {relevance_rater_system_message} [/INST]</s>[INST] {eval_user_message_filled} [/INST]"
    try:
        relevance_output = llm(prompt=relevance_prompt, max_tokens=max_tokens_eval, temperature=temp_eval, echo=False)
        relevance_eval = relevance_output['choices'][0]['text'].strip()
    except Exception as e:
        relevance_eval = f"Error during relevance evaluation: {e}"

    return groundedness_eval, relevance_eval
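
The character-based cut above can stop in the middle of a sentence. An alternative, sketched below under stated assumptions, is to keep whole chunks in ranked order until a character budget is used up. `build_context` is a hypothetical helper and is not part of the notebook; the 3000-character default simply mirrors the limit used in `get_evaluations`.

```python
def build_context(chunks, max_chars=3000, separator="\n\n---\n\n"):
    """Join whole chunks in ranked order until adding another would exceed max_chars."""
    selected = []
    used = 0
    for chunk in chunks:
        # The separator is only counted when the chunk is not the first one.
        extra = len(chunk) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    if not selected and chunks:
        # Even the top-ranked chunk is too long: fall back to a hard cut
        # so the judge never receives an empty context.
        return chunks[0][:max_chars]
    return separator.join(selected)
```

Inside `get_evaluations`, this could replace the join-then-truncate step with `build_context(context_list)`. The design choice is to drop the lowest-ranked chunks entirely rather than pass the judge a partial chunk.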

### Evaluating the RAG Responses for our 5 Queries

Loop through the RAG answers generated previously and get evaluations for them.

In [None]:
# The 'k' value used for the retriever for each answer was:
k_values_used = {
    "Query 1 (Sepsis)": 5,
    "Query 2 (Appendicitis)": 6,
    "Query 3 (Hair Loss)": 3,
    "Query 4 (Brain Injury)": 4,
    "Query 5 (Leg Fracture)": 6, # k=6 (as in Query 2) retrieves the most context, so it is the most prone to context window errors.
}

evaluation_results = {}

print("\n--- Output Evaluations ---")
for query_key, rag_answer_text in rag_answers.items():
    original_question = user_inputs[query_key]
    k_for_context = k_values_used[query_key]

    print(f"\n--- Evaluating: {query_key} ---")
    print(f"Question: {original_question}")
    print(f"RAG Answer: {rag_answer_text[:200]}...")

    ground_eval, rel_eval = get_evaluations(
        user_input=original_question,
        rag_answer=rag_answer_text,
        k_retriever_for_context=k_for_context
    )
    evaluation_results[query_key] = {
        "groundedness": ground_eval,
        "relevance": rel_eval
    }
    print(f"Groundedness Evaluation:\n{ground_eval}")
    print(f"Relevance Evaluation:\n{rel_eval}")


--- Output Evaluations ---

--- Evaluating: Query 1 (Sepsis) ---
Question: What is the protocol for managing sepsis in a critical care unit?
RAG Answer: The protocol for managing sepsis in a critical care unit involves rapid empiric antibiotic therapy, general supportive measures, and later adjusting antibiotics according to sensitivities and the site...


Groundedness Evaluation:
Score: 5

The Answer is entirely and accurately based on the Context, with no external information or assumptions added. The Answer directly quotes and summarizes the information from the Context regarding the protocol for managing sepsis in a critical care unit, including rapid empiric antibiotic therapy, supportive measures, antibiotic selection, and surgical intervention when necessary. The Answer also includes accurate references to specific pages in the Context for verification.
Relevance Evaluation:
Score: 5

The Answer directly addresses the Question by providing specific details about the protocol for managing sepsis in a critical care unit. It mentions the importance of rapid empiric antibiotic therapy, general supportive measures, and later adjusting antibiotics according to sensitivities and the site of infection. The Answer also provides information on antibiotic selection, common antibiotics used, and additional treatments such as drainage of absce

Groundedness Evaluation:
Score: 5

The answer is entirely and accurately based on the context, with no external information or assumptions added. The symptoms described in the answer are directly taken from the context, and the statement about the treatment being surgical removal is also explicitly stated in the context. Additionally, the mention of antibiotics before and after surgery is also supported by the context.
Relevance Evaluation:
Score: 5

The answer directly addresses both parts of the question by providing a detailed explanation of the common symptoms of appendicitis and stating that surgical removal is the standard treatment for appendicitis. The answer also mentions the specific procedures used to treat appendicitis, such as open or laparoscopic appendectomy and antibiotic administration. The context provided in the text was effectively utilized to generate an accurate and comprehensive response.

--- Evaluating: Query 3 (Hair Loss) ---
Question: What are the effective t

Groundedness Evaluation:
Score: 5

The Answer is entirely and accurately based on the Context, with no external information or assumptions added. The Answer summarizes the information from the Context regarding alopecia areata, its causes, and treatment options. It also correctly identifies that a thorough evaluation is necessary to determine the underlying cause of sudden patchy hair loss. The references provided in the Answer ([1:849, 855, 858]) correspond to the locations within the Context where the information was obtained.
Relevance Evaluation:
Score: 5

The provided answer directly addresses the question by discussing the causes of sudden patchy hair loss (alopecia areata) and its effective treatments. The answer also mentions other possible causes of hair loss and their respective treatments. The information is relevant, concise, and fully covers all aspects of the question based on the context provided.

--- Evaluating: Query 4 (Brain Injury) ---
Question: What treatments are 

Groundedness Evaluation:
Score: 5

Justification: The Answer is entirely and accurately based on the Context, with no external information or assumptions added. It summarizes the key points from the provided text regarding rehabilitation for individuals with brain injuries, including the importance of early intervention, various aspects of rehabilitation therapy, and supportive care.
Relevance Evaluation:
Score: 5

The Answer directly addresses the Question by discussing the importance of early intervention by rehabilitation specialists for a person with a brain injury and the various aspects of rehabilitation therapy, including prevention of secondary disabilities, cognitive therapy, and supportive care. The Answer also mentions that the specific rehabilitation therapy varies depending on the patient's abnormalities and the level and extent of the injury, making it relevant to the context provided in the Question. Additionally, the Answer touches upon the prognosis for patients in a p

Groundedness Evaluation:
Score: 5

The Answer is entirely and accurately based on the Context, with no external information or assumptions added. The Answer directly quotes and summarizes the necessary precautions and treatment steps for a person who has fractured their leg from the provided Context.
Relevance Evaluation:
Score: 5

The answer directly addresses the question by providing a comprehensive list of necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip. The information is relevant and focused on the user's query, as it covers immobilization, pain management, RICE therapy, prosthetic considerations, rehabilitation, activity restrictions, wound care, and infection prevention. The answer is also concise and fully addresses all aspects of the question based on the context provided.


### Interpreting Evaluation Results:
We look for high scores (4 or 5) in both groundedness and relevance.

* **High Groundedness**: The RAG system sticks to the facts in the Merck Manual and does not make things up.

* **High Relevance**: The answers are useful and address the user's query.

If scores are low, the justifications provided by the judge LLM can give us clues on what to improve (e.g., better chunking, different retriever settings, prompt adjustments, etc.).

This LLM-as-a-judge method isn't perfect (the judge itself can have biases or make mistakes), but it's a very useful and scalable way to get automated feedback on the quality of our RAG system.
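
To act on the scores programmatically, the judge's free-text output has to be parsed. The sketch below assumes the judge follows the `Score: N` format requested in the rater prompts; `parse_score` and `flag_low_scores` are hypothetical helpers, and the threshold of 4 follows the "4 or 5" guideline above.

```python
import re

# Matches "Score: 5" style output; only single digits 1-5 are accepted.
SCORE_PATTERN = re.compile(r"Score:\s*([1-5])\b")


def parse_score(evaluation_text):
    """Return the 1-5 score found in a judge response, or None if no score is present."""
    match = SCORE_PATTERN.search(evaluation_text)
    return int(match.group(1)) if match else None


def flag_low_scores(evaluation_results, threshold=4):
    """List (query, metric, score) entries whose score is missing or below the threshold."""
    flagged = []
    for query_key, evals in evaluation_results.items():
        for metric, text in evals.items():
            score = parse_score(text)
            if score is None or score < threshold:
                flagged.append((query_key, metric, score))
    return flagged
```

Calling `flag_low_scores(evaluation_results)` on the dictionary built in the loop above would return the entries worth reading in full, including any where the judge call failed and no score could be parsed.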

Observations:

The LLM-as-a-judge evaluation ran successfully across all five queries. The results on this small test set are positive:

* Consistently High Scores: The "judge" model awarded the highest possible score (Score: 5) for both Groundedness and Relevance on all queries where a detailed answer was provided.

* Groundedness: The perfect scores for Groundedness suggest that the RAG system is adhering to its most critical instruction: to answer exclusively from the provided context of the Merck Manual. The justifications note that the answers are "entirely and accurately based on the Context, with no external information or assumptions added," which is the ideal outcome for a fact-based system.

* Relevance: The perfect scores for Relevance indicate that the answers address the questions asked. This in turn suggests that the retriever is fetching document chunks pertinent to each question and that the generator is using that context to construct a comprehensive and focused response.

* Robustness of the Evaluation Function: The evaluation process completed without the context window overflow error seen in previous attempts. This indicates that the context truncation logic added to the `get_evaluations` function works as intended, so the evaluation pipeline can handle queries that retrieve a large amount of text.

The automated evaluation shows that the RAG pipeline is accurate and reliable on these test queries, which supports taking it forward as a clinical decision support tool.

## Actionable Insights and Business Recommendations

### Actionable Insights:

1. **RAG Reliably Translates Manuals into Q&A Systems:**

      This project demonstrates that the RAG architecture is effective for converting dense, domain-specific documents like the Merck Manual into an interactive, accurate, and verifiable question-answering system. The final RAG responses were consistently superior to the base LLM's answers from generic knowledge.


2.  **Advanced Prompting is Key to Safe and Effective Use:**

      The structured XML prompt with explicit rules was the most critical factor in controlling the model's behavior. It allowed us to enforce mandatory source citation and to make the model safely refuse to answer when the context was insufficient, a non-negotiable feature for a medical application.

3. **Pipeline Tuning is a Balancing Act:**

      The chunk-size tuning experiment showed that RAG performance is a system of trade-offs. Changing `chunk_size` directly affects context quality: larger chunks can provide more comprehensive answers but may reduce retrieval precision. The optimal configuration is not universal and must be tuned to the document's structure and the expected user queries.

4. **Context Window Management can be an Engineering Hurdle:**

      The context window overflow error first encountered during the evaluation phase is a common practical challenge in building RAG systems. It shows that the entire pipeline, from the number of documents retrieved (k) to the length of the prompts, must be managed. Implementing robust strategies like context truncation is essential for system stability.

5. **Automated Evaluation Provides Scalable Quality Assurance:**

      The LLM-as-a-judge approach provides a scalable and consistent way to score the RAG system's output on key metrics like Groundedness and Relevance. This methodology is useful for regression testing, comparing different pipeline configurations, and ensuring ongoing quality control.
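
As an illustration of insight 5, the sketch below compares the average judge scores of two pipeline configurations and reports any metric that got worse. `summarize` and `find_regressions` are hypothetical helpers, and the scores shown are invented for the example; they are not results from this notebook.

```python
from statistics import mean


def summarize(scores_by_query):
    """Average each metric over queries; scores_by_query maps query -> {metric: int score}."""
    metrics = {}
    for scores in scores_by_query.values():
        for metric, score in scores.items():
            metrics.setdefault(metric, []).append(score)
    return {metric: mean(values) for metric, values in metrics.items()}


def find_regressions(baseline, candidate, tolerance=0.0):
    """Return metrics whose candidate average dropped below the baseline by more than tolerance."""
    base_avg, cand_avg = summarize(baseline), summarize(candidate)
    return {
        metric: (base_avg[metric], cand_avg[metric])
        for metric in base_avg
        if metric in cand_avg and cand_avg[metric] < base_avg[metric] - tolerance
    }


# Hypothetical scores for illustration only.
baseline = {"Q1": {"groundedness": 5, "relevance": 5}, "Q2": {"groundedness": 5, "relevance": 4}}
candidate = {"Q1": {"groundedness": 4, "relevance": 5}, "Q2": {"groundedness": 5, "relevance": 4}}
print(find_regressions(baseline, candidate))
```

An empty result means no metric regressed. With only a handful of queries, a single changed score moves the average noticeably, so a larger query set makes this kind of check more meaningful.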

### Business Recommendations

1.  **Initiate a Controlled Clinical Pilot Program:**

      The prototype has successfully demonstrated technical feasibility and high performance on test queries. The next logical step is to deploy this system in a controlled pilot for a specific clinical group (e.g., medical residents, a specialized department). The goal is to gather feedback on real-world usability, identify the most common query types, and build user trust.

2.  **Invest in a User-Centric Interface for Verifiability:**

      For clinicians to adopt this tool, it must be transparent and trustworthy. We recommend developing a user interface that displays the generated answer alongside the retrieved context snippets. Each citation `[Page: X]` should be a clickable link that highlights the source sentence in the context, allowing for instant verification.

3.  **Establish a "Living Document" Update Cycle:**

      Medical knowledge is not static. A formal process must be created to update the system's knowledge base. When a new edition of the Merck Manual (or other source material) is released, the data preparation pipeline (chunking, embedding, and vector store creation) must be re-run to ensure the AI assistant is always providing the most current medical information.

4.  **Strategically Expand the Knowledge Base:**

      The value of the system will grow as more high-quality data is added. We recommend strategically expanding the knowledge base to include other trusted sources, such as internal hospital treatment protocols, pharmaceutical formularies, and key clinical practice guidelines from recognized medical societies. The RAG architecture can be scaled to search across multiple data sources simultaneously.

5.  **Frame and Train as a "Decision Support Assistant":**

      All internal marketing, training, and documentation must consistently position this tool as an "AI Co-Pilot" or "Decision Support Assistant." It is designed to augment and accelerate a professional's ability to access information, not to replace their clinical judgment. This framing is essential for user adoption, managing expectations, and ensuring patient safety.
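
For recommendation 2, a first step toward clickable citations is extracting the cited pages from an answer. The sketch below assumes citations follow the `[Page: X]` or `[Page: X-Y]` format seen in the RAG answers; `extract_cited_pages` is a hypothetical helper, and the sample answer is a shortened illustration.

```python
import re

# Matches "[Page: 858]" and ranges such as "[Page: 857-858]".
CITATION_PATTERN = re.compile(r"\[Page:\s*(\d+)(?:\s*-\s*(\d+))?\]")


def extract_cited_pages(answer):
    """Return the sorted, de-duplicated page numbers cited in an answer."""
    pages = set()
    for match in CITATION_PATTERN.finditer(answer):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        pages.update(range(start, end + 1))
    return sorted(pages)


answer = (
    "Treatment includes topical minoxidil [Page: 858]. "
    "Diagnosis may require a scalp biopsy [Page: 857-858]."
)
print(extract_cited_pages(answer))
```

A user interface could map each extracted page to the retrieved chunk from the same page and highlight it next to the answer.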

