## Problem Statement

### Business Context

The healthcare industry is rapidly evolving, with professionals facing increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. The need for quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

To address these challenges, healthcare centers can focus on integrating systems that streamline access to medical knowledge, provide tools to support quick decision-making, and enhance efficiency. Leveraging centralized knowledge platforms and ensuring healthcare providers have continuous access to reliable resources can significantly improve patient care and operational effectiveness.

**Common Questions to Answer**

**1. Diagnostic Assistance**: "What are the common symptoms and treatments for pulmonary embolism?"

**2. Drug Information**: "Can you provide the trade names of medications used for treating hypertension?"

**3. Treatment Plans**: "What are the first-line options and alternatives for managing rheumatoid arthritis?"

**4. Specialty Knowledge**: "What are the diagnostic steps for suspected endocrine disorders?"

**5. Critical Care Protocols**: "What is the protocol for managing sepsis in a critical care unit?"

### Objective

As an AI specialist, your task is to develop a RAG-based AI solution using renowned medical manuals to address healthcare challenges. The objective is to **understand** issues like information overload, **apply** AI techniques to streamline decision-making, **analyze** its impact on diagnostics and patient outcomes, **evaluate** its potential to standardize care practices, and **create** a functional prototype demonstrating its feasibility and effectiveness.

### Data Description

The **Merck Manuals** are medical references published by the American pharmaceutical company Merck & Co. that cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899, when Merck & Co. was still a subsidiary of the German company Merck.

The manual is provided as a PDF with over 4,000 pages divided into 23 sections.

## Installing and Importing Necessary Libraries and Dependencies

In [None]:
# Installation for GPU llama-cpp-python
# uncomment and run the following code in case GPU is being used
#!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

# Installation for CPU llama-cpp-python
# uncomment and run the following code in case GPU is not being used
!CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in ***this notebook***.

In [None]:
# For installing the libraries & downloading models from HF Hub
!pip install huggingface_hub==0.23.2 pandas==1.5.3 tiktoken==0.6.0 pymupdf==1.25.1 langchain==0.1.1 langchain-community==0.0.13 chromadb==0.4.22 sentence-transformers==2.3.1 numpy==1.25.2 -q

**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in ***this notebook***.

In [None]:
!pip install --upgrade numpy pandas scipy scikit-learn  # Add other relevant packages

In [None]:
# Libraries for processing dataframes and text
import json
import os
import tiktoken
import pandas as pd

#Libraries for Loading Data, Chunking, Embedding, and Vector Databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

#Libraries for downloading and loading the llm
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

## Question Answering using LLM

#### Downloading and Loading the model

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load PDF
loader = PyMuPDFLoader("/content/drive/MyDrive/Colab Notebooks/Module 5 - NLP/Proj 5 - Merck Manual/medical_diagnosis_manual.pdf")
docs = loader.load()

# Chunking
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Preview
print(f"Loaded {len(docs)} pages.")
print(f"Created {len(chunks)} text chunks. First chunk:\n\n{chunks[0].page_content[:500]}")

Loaded 4114 pages.
Created 18001 text chunks. First chunk:

kristi.esta@gmail.com
HNU4PTYZGQ
eant for personal use by kristi.esta@gm
shing the contents in part or full is liable


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Choose model
model_name = "google/flan-t5-large"
# Create tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#### Response

In [None]:
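def response(query, max_tokens=128, temperature=0.0, top_p=None):
    """Generate an answer to `query` with the flan-t5 model loaded above.

    Uses greedy decoding when temperature is 0; otherwise samples with the
    given temperature and top_p (defaulting to 0.95).
    """
    # Reuse the tokenizer and model loaded in the previous cell instead of
    # re-loading the tokenizer on every call.
    input_ids = tokenizer(query, return_tensors="pt").input_ids

    generate_args = {
        "input_ids": input_ids,
        "max_new_tokens": max_tokens,
    }

    # Sampling parameters only take effect when do_sample=True.
    if temperature > 0:
        generate_args.update({
            "do_sample": True,
            "temperature": temperature,
            "top_p": top_p if top_p is not None else 0.95,
        })

    output = model.generate(**generate_args)
    return tokenizer.decode(output[0], skip_special_tokens=True)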

### Query 1: What is the protocol for managing sepsis in a critical care unit?

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
# List of your clinical questions
questions = [
    "What are the common symptoms and treatments for pulmonary embolism?",
    "Can you provide the trade names of medications used for treating hypertension?",
    "What are the first-line options and alternatives for managing rheumatoid arthritis?",
    "What are the diagnostic steps for suspected endocrine disorders?",
    "What is the protocol for managing sepsis in a critical care unit?"
]

# Generate answers
for i, q in enumerate(questions, start=1):
    answer = response(q)
    print(f"Q{i}: {q}\nA{i}: {answer}\n{'-'*60}")

Q1: What are the common symptoms and treatments for pulmonary embolism?
A1: Symptoms include chest pain, shortness of breath, shortness of breath, shortness of breath, and shortness of breath.
------------------------------------------------------------
Q2: Can you provide the trade names of medications used for treating hypertension?
A2: sulfonylureas
------------------------------------------------------------
Q3: What are the first-line options and alternatives for managing rheumatoid arthritis?
A3: NSAIDs
------------------------------------------------------------
Q4: What are the diagnostic steps for suspected endocrine disorders?
A4: Identify the cause of the disease.
------------------------------------------------------------
Q5: What is the protocol for managing sepsis in a critical care unit?
A5: Sepsis in Critical Care
------------------------------------------------------------


## Observations on Baseline LLM Answers (Without RAG)
#### General Performance

- The responses demonstrate surface-level understanding and lack depth, detail, and accuracy.

- The answers often appear hallucinated or overly vague, with minimal medical grounding.

### Question-by-Question Analysis

#### Q1 (Pulmonary Embolism Symptoms/Treatment)

- Repetition: “shortness of breath” is mentioned 4 times.

- Incompleteness: No mention of key signs like tachycardia, hemoptysis, or hypoxia. No treatment info (e.g., anticoagulants).

- The LLM offers partial symptoms but lacks clinical depth and treatment context.

#### Q2 (Trade Names for Hypertension Meds)

- Incorrect output: "sulfonylureas" are a drug class used for diabetes, not hypertension.

- Trade name not given: The prompt specifically asked for commercial/trade names (e.g., Norvasc, Lopressor), which were missed.

- Poor drug class mapping; possible confusion due to ambiguous training data.

#### Q3 (RA Management)

- Irrelevant: "NSAIDs" is not a first-line or well-supported alternative per major guidelines. While NSAIDs can be used for pain relief, they do not slow down disease progression.

- Missing detail: Should include DMARDs (e.g., methotrexate) and biologics.

- Indicates lack of access to trusted guidelines or current standard-of-care knowledge.

#### Q4 (Endocrine Diagnosis)

- Circular logic: “identify the cause of the disease” is vague and uninformative.

- No mention: Doesn’t include hormonal assays, imaging, or differential diagnosis.

- A generic and placeholder-like answer lacking specificity.

#### Q5 (Sepsis Protocol)

- "Sepsis in Critical Care" looks like a heading and definitely does not answer the question.

- No mention of steps: No reference to fluids, antibiotics, vasopressors, or SOFA scoring.

- Suggests lack of internal knowledge on medical workflows without external retrieval.

### Insight

- The LLM appears to hallucinate or default to general phrases when unsure.

- Demonstrates the importance of domain-specific retrieval for clinical QA tasks.

- Establishes a clear baseline for comparison against later prompt engineering and RAG-enhanced responses.
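
One way to make this baseline repeatable is a simple keyword-coverage check. The sketch below is an assumption, not part of the original notebook: the helper name `keyword_coverage` and the checklist are illustrative (the terms come from the Q1 observations above), and substring matching is only a crude proxy for clinical correctness.

```python
def keyword_coverage(answer, expected_terms):
    """Return (fraction of expected terms found, list of missing terms).

    Matching is a case-insensitive substring search, so it only checks that
    a term is mentioned, not that it is used correctly.
    """
    text = answer.lower()
    missing = [term for term in expected_terms if term.lower() not in text]
    found = len(expected_terms) - len(missing)
    score = found / len(expected_terms) if expected_terms else 0.0
    return score, missing


# Baseline answer A1, copied from the output above.
a1 = ("Symptoms include chest pain, shortness of breath, shortness of breath, "
      "shortness of breath, and shortness of breath.")

# Hypothetical checklist built from the terms named in the Q1 observations.
# "anticoagula" is a stem meant to match "anticoagulant" or "anticoagulation".
pe_terms = ["chest pain", "shortness of breath", "tachycardia",
            "hemoptysis", "hypoxia", "anticoagula"]

score, missing = keyword_coverage(a1, pe_terms)
print(f"coverage={score:.2f}, missing={missing}")
```

Running the same check on later prompt-engineered and RAG answers would give a rough, comparable number per question, alongside the manual review.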

## Question Answering using LLM with Prompt Engineering

This section explores the impact of prompt design and parameter tuning on the quality of LLM responses. For each clinical question, we apply five prompt variations (direct, role-based, instructional, format or step-based, and few-shot) to a Hugging Face model (flan-t5-large), each paired with a different temperature and top-p setting. The focus is on improving completeness, clarity, and factual relevance compared to the baseline results.
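
For intuition about the two sampling parameters varied in this section, here is a toy, pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering over a made-up four-token distribution. The function names and logits are hypothetical; the real filtering happens inside `model.generate`, and this sketch only mirrors the general idea rather than the library's exact implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise the surviving tokens so they sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (0.3, 0.9):
    probs = softmax_with_temperature(toy_logits, t)
    kept = top_p_filter(probs, top_p=0.9)
    print(f"temperature={t}: top-token p={probs[0]:.2f}, tokens kept={sorted(kept)}")
```

At the lower temperature almost all probability sits on one token, so the nucleus shrinks to that token and sampling behaves much like greedy decoding; at the higher temperature more candidates survive the filter.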

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
# Create list of prompts to use on the model

# Direct instruction (baseline reworded)
prompt1 = "What is the protocol for managing sepsis in a critical care unit?"

# Role prompting
prompt2 = (
    "As an ICU physician, describe the key clinical steps to manage sepsis in a critical care setting, "
    "based on medical best practices."
)

# Instructional style
prompt3 = (
    "Summarize the standard clinical guidelines for managing sepsis in ICU settings. "
    "Include diagnosis, interventions, and treatment steps."
)

# Format prompt
prompt4 = (
    "List the critical steps involved in managing sepsis in an ICU. "
    "Use bullet points and be medically accurate."
)

# Few-shot (light scaffolding)
prompt5 = (
    "Example:\n"
    "Q: What are the symptoms of pneumonia?\n"
    "A: Fever, cough, chest pain, and difficulty breathing.\n\n"
    "Q: What is the protocol for managing sepsis in a critical care unit?\n"
    "A:"
)

In [None]:
# Create lists for variables - prompts, temperature, top-p
prompts = [prompt1, prompt2, prompt3, prompt4, prompt5]
temps = [0.0, 0.7, 0.9, 0.3, 0.5]
tops = [None, 0.95, 0.9, 0.8, 1.0]

# Execute for loop to present various prompts, temperatures and top-p variables to the response function
for i, (p, t, tp) in enumerate(zip(prompts, temps, tops), start=1):
    print(f"--- Variation {i} ---")
    print(response(p, max_tokens=256, temperature=t, top_p=tp))
    print()

--- Variation 1 ---
Sepsis in Critical Care

--- Variation 2 ---
Septic shock is a condition in which the body's natural defenses fail to protect the body from pathogens. The body's natural defenses fail to protect the body from pathogens and the pathogens attack the body's natural defenses, causing internal bleeding and other symptoms.

--- Variation 3 ---
Sit patient on bed and provide a full body rest and make sure they receive IV fluids and medication. Do not use fluids in the ED.

--- Variation 4 ---
Sepsis is a severe form of infection that can be treated with antibiotics. The first step is to determine if you have sepsis. If you have sepsis, you should be treated with antibiotics.

--- Variation 5 ---
Sepsis protocol is based on the protocol for ward sepsis.



## Observations for Prompt Engineering + LLM Tuning - Query 1
#### Query: "What is the protocol for managing sepsis in a critical care unit?"

- Shallow medical knowledge: The model lacked access to medical-specific context, and no variation covered the standard of care as a whole. A complete answer should include IV fluids, early broad-spectrum antibiotics, vasopressors if hypotension persists, lactate monitoring, urine output tracking, and the SOFA score for organ dysfunction.

- Prompt framing slightly helped: Variations 3 and 4 produced more plausible outputs, but they were still incomplete.

- Few-shot format (Variation 5) confused the model more than it helped.

- Temperature changes had limited effect at this stage, due to lack of detailed training knowledge or retrieval grounding.

#### Prompt/Response Variations
1. Direct question - too short and vague; likely a summary heading.
2. Role-based prompt - no protocol steps; overly descriptive and generic.
3. Instructional - some structure, but lacks clinical relevance.
4. Step-based - reasonable start, but lacks the full protocol (fluids, etc.).
5. Few-shot style - circular; defers to an unspecified "ward sepsis" protocol and does not answer the protocol query.

#### Insight
These results reinforce the need for a retrieval-augmented approach (RAG) to ground the model in trusted medical reference content (i.e., Merck Manual) and avoid hallucinations.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
#Create list of prompts to use on the model

q2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"

# Direct
p1 = q2
# Role-based
p2 = "As a medical expert, explain the symptoms of appendicitis, and whether it can be cured with medicine. If not, what surgery is typically required?"

# Instructional
p3 = "Describe the symptoms of appendicitis and the appropriate medical or surgical treatments."

# Step-by-step
p4 = "List the common symptoms of appendicitis and outline whether it can be treated with medicine. If surgery is required, explain the procedure."

# Contextual few-shot
p5 = (
    "Q: What is the first-line treatment for pneumonia?\n"
    "A: Antibiotics and supportive care.\n\n"
    f"Q: {q2}\nA:"
)

In [None]:
# Create lists for variables - prompts, temperature, top-p
prompts = [p1, p2, p3, p4, p5]
temps = [0.0, 0.7, 0.9, 0.3, 0.5]
tops = [None, 0.95, 0.9, 0.8, 1.0]

# Execute for loop to present various prompts, temperatures and top-p variables to the response function
for i, (prompt, t, tp) in enumerate(zip(prompts, temps, tops), start=1):
    print(f"--- Variation {i} ---")
    print(response(prompt, max_tokens=256, temperature=t, top_p=tp))
    print()

--- Variation 1 ---
a painful swollen appendix

--- Variation 2 ---
The symptoms of appendicitis are abdominal pain or discomfort , often associated with a swollen or painful appendix. The appendix is a small tube that extends from the urethra and is connected to the cervix through a small opening.

--- Variation 3 ---
The following symptoms of appendicitis are most common in young men. They typically start in the lower thigh and can progress to the lower thigh and abdomen. They are usually accompanied by a high fever, a low spleen and abdominal pain. They can also develop into painful appendicitis in the femur or iliac crest. The symptoms of appendicitis include: Pain in the pelvic region The spleen is painful and the appendix may bulge out. This may be accompanied by the urge to urinate. The protruding uterus, the opening of the lower thigh, and the lower abdomen are signs of appendicitis. The spleen is generally swollen and may bulge out. The protruding uterus may protrude outward f

## Observations for Prompt Engineering + LLM Tuning - Query 2
#### Query: "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"

- Variability in accuracy: Only Variation 5 gave a complete and medically reasonable response. Others ranged from vague (p1), to anatomically wrong (p2), to truncated or shallow (p3 & p4).

- LLM hallucination is a risk: Variation 2 is a textbook example of LLM hallucination with fabricated anatomical connections.

- Few-shot prompting (p5) showed clear value in improving answer structure and relevance.

- Lack of procedural detail: None of the responses mentioned the name of the standard procedure (laparoscopic appendectomy), indicating a gap in model grounding.

#### Prompt/Response Variations
1. Direct - oversimplified and vague; does not answer the full question.
2. Role-based - completely hallucinated anatomy; biologically incorrect.
3. Instructional - closer to correct; mentions antibiotics, but no surgical procedure; confused causation (appendicitis is not viral).
4. Step-by-Step - incomplete; answer is cut-off, but there is some relevance.
5. Few-shot - best answer; most complete and realistic; mentions key symptoms and surgical escalation.
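
Because the few-shot variation is rebuilt by hand for every query, a small helper can keep its format consistent. The sketch below is illustrative: `few_shot_prompt` is a hypothetical name, and the example pair is the one used for Query 2 above.

```python
def few_shot_prompt(examples, question):
    """Format (question, answer) example pairs followed by the open question."""
    parts = [f"Q: {q}\nA: {a}\n" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)


q2 = ("What are the common symptoms for appendicitis, and can it be cured via medicine? "
      "If not, what surgical procedure should be followed to treat it?")
examples = [("What is the first-line treatment for pneumonia?",
             "Antibiotics and supportive care.")]

p5 = few_shot_prompt(examples, q2)
print(p5)
```

Passing more than one pair in `examples` extends the prompt without changing the layout, which makes it easier to test whether additional or more closely related examples change the answer.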

#### Insight
Despite light improvements from prompt phrasing and tuning, the LLM lacks the specialized medical knowledge necessary to deliver clinically correct, complete answers. This again highlights the need for Retrieval-Augmented Generation (RAG) to supplement the model with trusted content from medical references like the Merck Manual.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
#Create list of prompts to use on the model
q3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"

# Prompt 1: Direct
p1 = q3

# Prompt 2: Role-based
p2 = "As a dermatologist, explain the possible causes and recommended treatments for sudden patchy hair loss on the scalp."

# Prompt 3: Instructional
p3 = "Describe the causes and treatment options for sudden patchy hair loss, especially when it appears as bald spots on the scalp."

# Prompt 4: Step-by-step
p4 = "List the possible causes of localized bald spots on the scalp and provide effective treatments or therapies."

# Prompt 5: Few-shot format
p5 = (
    "Q: What causes dandruff and how can it be treated?\n"
    "A: Dandruff is often caused by excess oil, skin irritation, or fungal overgrowth. It is treated with medicated shampoos such as those containing zinc pyrithione, ketoconazole, or selenium sulfide.\n\n"
    "Q: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?\n"
    "A:"
)

In [None]:
# Create lists for variables - prompts, temperature, top-p
prompts = [p1, p2, p3, p4, p5]
temps = [0.0, 0.7, 0.9, 0.3, 0.5]
tops = [None, 0.95, 0.9, 0.8, 1.0]

# Execute for loop to present various prompts, temperatures and top-p variables to the response function
for i, (prompt, t, tp) in enumerate(zip(prompts, temps, tops), start=1):
    print(f"--- Variation {i} ---")
    print(response(prompt, max_tokens=256, temperature=t, top_p=tp))
    print()

--- Variation 1 ---
Hair loss can be caused by a number of causes, including hormonal imbalances, hormonal imbalances, and hormonal imbalances.

--- Variation 2 ---
Causes of hair loss include: Hormonal imbalances. Medications.

--- Variation 3 ---
Causes are: Head lice (Chironomidae). It may occur when there are no follicles, and the hair loss has affected the scalp. Common causes of hair loss are:

--- Variation 4 ---
Causes of localized bald spots on the scalp include:

--- Variation 5 ---
The bald spots are most likely caused by hormonal changes in the scalp.



## Observations for Prompt Engineering + LLM Tuning - Query 3
#### Query:
"What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"

#### Prompt/Response Variations
1. Direct - very vague and possibly misleading; no causes or specific treatments
2. Role-based - incomplete; looks like it forgot to answer.
3. Instructional - slightly closer, but no mention of cause or treatment.
4. Step-Style - too short to be meaningful; basically a restatement of the input.
5. Few-Shot - performance dropped drastically here; it misunderstood the context and fabricated mechanisms.


#### Insight
- No mention of alopecia areata, the most common clinical diagnosis tied to sudden patchy hair loss.

- No treatments listed such as corticosteroids (injections/topicals), minoxidil, or platelet-rich plasma (PRP).

- No reference to autoimmune or stress-related causes, which are essential for accuracy.

- The model fails to provide medically meaningful answers about hair loss, despite varied prompts. This is a prime example of LLM limitations without knowledge grounding.

- The hallucination in prompt 5 (“split strand in the center of the hair shaft”) demonstrates the risk of relying on LLMs without clinical reinforcement.

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
#Create list of prompts to use on the model
q4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"

# Prompt 1: Direct
p1 = q4

# Prompt 2: Role-based
p2 = "As a neurologist, explain the treatment options for someone with a traumatic brain injury that has led to impaired brain function."

# Prompt 3: Instructional
p3 = "Describe how brain injuries are treated when they result in temporary or permanent cognitive or physical impairment."

# Prompt 4: Step-by-step
p4 = "List the types of treatments used to address brain tissue injuries and how each supports recovery from temporary or permanent impairment."

# Prompt 5: Few-shot format (neurology relevant)
p5 = (
    "Q: How is a stroke typically treated?\n"
    "A: Stroke treatment often includes clot-busting drugs, blood thinners, and rehabilitation therapy.\n\n"
    f"Q: {q4}\nA:"
)

In [None]:
# Create lists for variables - prompts, temperature, top-p
prompts = [p1, p2, p3, p4, p5]
temps = [0.0, 0.7, 0.9, 0.3, 0.5]
tops = [None, 0.95, 0.9, 0.8, 1.0]

# Execute for loop to present various prompts, temperatures and top-p variables to the response function
for i, (prompt, t, tp) in enumerate(zip(prompts, temps, tops), start=1):
    print(f"--- Variation {i} ---")
    print(response(prompt, max_tokens=256, temperature=t, top_p=tp))
    print()

--- Variation 1 ---
Medications

--- Variation 2 ---
Symptoms of a traumatic brain injury are often not noticed until the injury has been severe, and if the injury is not treated or cured, it can become a permanent disability.

--- Variation 3 ---
If a brain injury results in permanent cognitive or physical impairment, the patient is often placed under medical management.

--- Variation 4 ---
Surgical techniques include amputations, amputations of the skull, and amputations of the cranial nerves.

--- Variation 5 ---
Brain injury treatment often includes clot-busting drugs, blood thinners, and rehabilitation therapy.



## Observations for Prompt Engineering + LLM Tuning - Query 4
#### Query:
"What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"

- No mention of standard treatments like:

    1. Physical, occupational, or speech therapy

    2. Cognitive rehabilitation

    3. Anticonvulsants, anticoagulants, or surgical decompression if applicable

- No mention of prognosis staging or neuroplasticity support, which are central to TBI care plans.

- Role-based prompt (V2) had a better tone, but still lacked medical specificity.

#### Prompt/Response Variations
1. Direct - single-word answer ("Medications"); names no drug or therapy.
2. Role-Based - better language structure, but describes symptoms and disability rather than treatment.
3. Instructional - partially valid but overly vague; no specific therapies or examples given.
4. Step-by-Step - hallucinated; "amputations of the skull" and "amputations of the cranial nerves" are not real procedures; critical factual error.
5. Few Shot - repeats the stroke example's answer almost verbatim; a clear few-shot anchoring effect.

#### Insight
The LLM hallucinated non-existent procedures and failed to provide medically acceptable responses, even with varied prompt structures. This reinforces the critical need for Retrieval-Augmented Generation (RAG) to ensure grounded, factual answers in clinical scenarios like brain trauma.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
#Create list of prompts to use on the model
q5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"

# Prompt 1: Direct
p1 = q5

# Prompt 2: Role-based
p2 = "As an emergency medical provider, explain the treatment steps and precautions for someone who fractured their leg while hiking."

# Prompt 3: Instructional
p3 = "Describe how to treat a leg fracture sustained during a hike and what steps are needed for proper recovery and care."

# Prompt 4: Step-by-step
p4 = "List the steps to treat a leg fracture in the wilderness and explain what to consider for recovery afterward."

# Prompt 5: Few-shot (outdoor trauma example)
p5 = (
    "Q: How should a dislocated shoulder be treated during a camping trip?\n"
    "A: Immobilize the arm, apply a cold compress, and seek medical help. Do not try to relocate the shoulder yourself.\n\n"
    f"Q: {q5}\nA:"
)

In [None]:
# Create lists for variables - prompts, temperature, top-p
prompts = [p1, p2, p3, p4, p5]
temps = [0.0, 0.7, 0.9, 0.3, 0.5]
tops = [None, 0.95, 0.9, 0.8, 1.0]

# Execute for loop to present various prompts, temperatures and top-p variables to the response function
for i, (prompt, t, tp) in enumerate(zip(prompts, temps, tops), start=1):
    print(f"--- Variation {i} ---")
    print(response(prompt, max_tokens=256, temperature=t, top_p=tp))
    print()

--- Variation 1 ---
If you have fractured your leg during a hiking trip, you should take precautions to avoid slipping and slipping on ice. If you have fractured your leg during a hiking trip, you should take precautions to avoid slipping and slipping on ice. If you have fractured your leg during a hiking trip, you should take precautions to avoid slipping and slipping on ice. If you have fractured your leg during a hiking trip, you should take precautions to avoid slipping and slipping on ice. If you have fractured your leg during a hiking trip, you should take precautions to avoid slipping and slipping on ice. If you have fractured your leg during a hiking trip, you should take precautions to avoid slipping and slipping on ice. If you have fractured your leg during a hiking trip, you should take precautions to avoid slipping and slipping on ice. If you have fractured your leg during a hiking trip, you should take precautions to avoid slipping and slipping on ice. If you have fracture

## Observations: Prompt Engineering + LLM Tuning - Query 5
#### Query:
"What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"

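- Repetition bug: Prompt variations 1 and 4 demonstrate classic autoregressive looping, a known failure mode of greedy or low-temperature decoding (temperature 0.0 and 0.3 here) when no repetition penalty or stopping criterion is applied. A large max_tokens budget only lets the loop run longer.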

- Some clinical elements: Prompt 2 showed the most relevant medical content (RICE: Rest, Ice, Compression, Elevation), but no mention of imaging (X-ray), orthopedic referral, or rehab protocols.

- Misunderstood question scope: Variation 5 focused on prevention instead of answering the treatment and recovery part.

- Hallucinations persist: Femoral artery restoration via terrain navigation (Prompt 3) is medically incorrect.
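
The looping seen in Variations 1 and 4 can be flagged automatically before a response is shown to a user. The following is a minimal sketch, assuming a hypothetical helper `has_repetition_loop` (not part of this notebook) and an arbitrary threshold of three repeats:

```python
import re
from collections import Counter

def has_repetition_loop(text, min_repeats=3):
    """Return True if any sentence appears at least `min_repeats` times.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    # Split on sentence-ending punctuation followed by whitespace
    sentences = [s.strip().lower() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return False
    most_common_count = Counter(sentences).most_common(1)[0][1]
    return most_common_count >= min_repeats

looped = "Apply ice to the leg. " * 5
normal = "Immobilize the leg. Apply a splint. Seek medical help."
print(has_repetition_loop(looped))
print(has_repetition_loop(normal))
```

On the generation side, Hugging Face `generate` accepts `repetition_penalty` and `no_repeat_ngram_size`, which may reduce this kind of loop; neither was tested in this notebook.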

#### Prompt/Response Variations
1. Direct - fully glitched; entered a repetition loop about slipping on ice with no mention of fracture care.
2. Role-based - plausible advice: ice + NSAIDs + splint; lacks long-term recovery but finally gives something useful.
3. Instructional - implausible sequence involving “moving leg down/up a hill” and femoral artery “rebooting” — anatomically and procedurally confused.
4. Step-by-Step - starts reasonable (ice, compression) but spirals into a repetition loop of the same instruction.
5. Few Shot - preventative rather than responsive. Focuses on warming legs and wearing a splint while hiking — doesn't address post-injury care.

#### Insight
Across all five questions, prompt engineering slightly improved output tone but consistently failed to ensure accuracy, safety, and completeness. In high-risk domains like healthcare, especially for emergency or clinical cases, retrieval-augmented generation (RAG) is essential to ground LLM outputs in trusted sources like the Merck Manual.

## Data Preparation for RAG

### Loading the Data

In [None]:
# Checking to see if data is still loaded
print(f"Loaded {len(docs)} pages. Created {len(chunks)} chunks.")

Loaded 4114 pages. Created 18001 chunks.


### Data Overview

#### Checking the first 5 pages

In [None]:
# Preview first 5 pages of manual
for i, doc in enumerate(docs[:5], start=1):
    print(f"--- Page {i} ---")
    print(doc.page_content[:500])  # Show only first 500 characters
    print()

--- Page 1 ---
kristi.esta@gmail.com
HNU4PTYZGQ
eant for personal use by kristi.esta@gm
shing the contents in part or full is liable 


--- Page 2 ---
kristi.esta@gmail.com
HNU4PTYZGQ
This file is meant for personal use by kristi.esta@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.


--- Page 3 ---
Table of Contents
1
Front    ................................................................................................................................................................................................................
1
Cover    .......................................................................................................................................................................................................
2
Front Matter    .................................

--- Page 4 ---
491
Chapter 44. Foot & Ankle Disorders    ..............................................................................................

#### Checking the number of pages

In [None]:
# Check Page Count
print(f"Total number of pages loaded: {len(docs)}")

Total number of pages loaded: 4114


### Data Chunking

In [None]:
# Performed chunking earlier in the notebook so we will look at a few chunks to ensure expectations are met
for i, chunk in enumerate(chunks[:3], start=1):
    print(f"--- Chunk {i} ---\n{chunk.page_content[:500]}\n")

--- Chunk 1 ---
kristi.esta@gmail.com
HNU4PTYZGQ
eant for personal use by kristi.esta@gm
shing the contents in part or full is liable

--- Chunk 2 ---
kristi.esta@gmail.com
HNU4PTYZGQ
This file is meant for personal use by kristi.esta@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.

--- Chunk 3 ---
Table of Contents
1
Front    ................................................................................................................................................................................................................
1
Cover    .......................................................................................................................................................................................................
2
Front Matter    .................................



Data Chunking:
Used RecursiveCharacterTextSplitter with a chunk_size of 1000 and chunk_overlap of 200. This resulted in 18,001 text chunks across the 4,114-page Merck Manual. Printed the first few chunks for visual confirmation.
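
To make the effect of chunk_size and chunk_overlap concrete, here is a simplified sketch of overlapping character windows. It is a stand-in for illustration only: the hypothetical `chunk_text` function does not reproduce RecursiveCharacterTextSplitter, which also tries to break on paragraph and sentence separators.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in: it ignores separators such as paragraph breaks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # toy 2,500-character document
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

With these settings, the last 200 characters of one window are repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk.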

### Embedding

In [None]:
# Loading the embedding model
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# Test on one chunk
embedding = embedding_model.embed_query(chunks[0].page_content)
print(f"Embedding vector length: {len(embedding)}")

Embedding vector length: 384


### Vector Database

In [None]:
!pip install numpy==1.26.4 --quiet
!pip install chromadb --quiet

In [None]:
# Create the Chroma Vector Store
# Create the embedding function
persist_directory = "/content/medical_diagnosis_manual/chroma_db"
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Create and persist the vectorstore
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_function,
    persist_directory=persist_directory
)

vectorstore.persist()
print("✅ Vectorstore successfully created and saved.")

ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


✅ Vectorstore successfully created and saved.


In [None]:
# Checking vector store loaded
print(f"Total number of documents in vectorstore: {vectorstore._collection.count()}")

Total number of documents in vectorstore: 18001


### Retriever

### System and User Prompt Template

In [None]:
# Create retriever from vectorstore
retriever = vectorstore.as_retriever(
    search_type="similarity",  # you can also test "mmr" later
    search_kwargs={"k": 4}
)

In [None]:
# Test retrieval
query = "What is the protocol for managing sepsis in a critical care unit?"
results = retriever.get_relevant_documents(query)

for i, doc in enumerate(results, 1):
    print(f"\n--- Chunk {i} ---\n{doc.page_content[:500]}")

ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given



--- Chunk 1 ---
16 - Critical Care Medicine
Chapter 222. Approach to the Critically Ill Patient
Introduction
Critical care medicine specializes in caring for the most seriously ill patients. These patients are best
treated in an ICU staffed by experienced personnel. Some hospitals maintain separate units for special
populations (eg, cardiac, surgical, neurologic, pediatric, or neonatal patients). ICUs have a high
nurse:patient ratio to provide the necessary high intensity of service, including treatment and mon

--- Chunk 2 ---
septic shock is high (60 to 65%). Prognosis depends on the cause, preexisting or complicating illness,
time between onset and diagnosis, and promptness and adequacy of therapy.
General management: First aid involves keeping the patient warm. Hemorrhage is controlled, airway
and ventilation are checked, and respiratory assistance is given if necessary. Nothing is given by mouth,
and the patient's head is turned to one side to avoid aspiration if emesis occurs.
T

### Response Function

In [None]:
# Defining prompt pieces for the run_rag function
def build_prompt(context, question):
    system = "You are a helpful medical assistant. Use the provided medical context to answer the question accurately and clearly."
    user = f"""
Context:
{context}

Question:
{question}

Answer:"""
    return system, user


In [None]:
def run_rag(prompt, params_list, k=4, max_tokens=256, top_k=50, print_context=False):
    print(f"\n🔎 Prompt:\n{prompt}\n")

    for i, p in enumerate(params_list, 1):
        print(f"--- Variation {i} ---")
        try:
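            # Retrieve top k chunks. Whether a per-call k overrides the retriever's
            # search_kwargs depends on the LangChain version; the retriever above was built with k=4.
            results = retriever.get_relevant_documents(prompt, k=k)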
            context_chunks = [doc.page_content for doc in results]
            context = " ".join(context_chunks)

            # Optional context printout
            if print_context:
                print(f"\n📘 Retrieved Context:\n{context[:750]}...\n")

            # Relevance check (simple keyword match).
            # NOTE: this keyword list is sepsis-specific, so the warning is only meaningful for Query 1.
            keywords = ["sepsis", "broad-spectrum", "antibiotic", "vasopressor", "fluid", "lactate", "SOFA", "ICU", "organ"]
            if not any(kw.lower() in context.lower() for kw in keywords):
                print("⚠️ Warning: Retrieved chunks may not be relevant to the prompt.\n")

            # Build prompt
            system, user_prompt = build_prompt(context, prompt)
            full_prompt = system + "\n" + user_prompt

            # Call wrapper
            response = llm(
                prompt=full_prompt,
                max_tokens=max_tokens,
                temperature=p["temperature"],
                top_p=p["top_p"],
                top_k=top_k,
                do_sample=True  # Ensures temperature/top_p apply
            )

            print(response['choices'][0]['text'].strip(), end="\n\n")

        except Exception as e:
            print(f"❌ Error: {e}\n")


In [None]:
# Wrapper function for the LLM call inside run_rag and generate_ground_relevance_response
def llm(prompt, max_tokens=512, temperature=0.1, top_p=0.95, top_k=50, do_sample=None):
    if do_sample is None:
        do_sample = temperature > 0  # Sampling only when temperature > 0

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(
        input_ids,
        max_new_tokens=max_tokens,
        do_sample=do_sample,
        top_p=top_p,
        top_k=top_k,
        temperature=temperature
    )
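    # Note: for decoder-only (causal) models, output[0] also contains the prompt tokens,
    # so the decoded text echoes the prompt; slicing output[0][input_ids.shape[-1]:]
    # would return only the completion. Encoder-decoder models return only new tokens.
    return {'choices': [{'text': tokenizer.decode(output[0], skip_special_tokens=True)}]}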

## Question Answering using RAG

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
# Prompt Set for RAG evaluation

sepsis_prompt = "As a medical professional, explain the clinical steps for managing sepsis in a critical care unit."

params = [
    {"temperature": 0.1, "top_p": 0.95},
    {"temperature": 0.3, "top_p": 0.9},
    {"temperature": 0.7, "top_p": 0.85},
    {"temperature": 0.9, "top_p": 0.8},
    {"temperature": 0.5, "top_p": 0.95},
]

run_rag(sepsis_prompt, params, k=4, print_context=True)


🔎 Prompt:
As a medical professional, explain the clinical steps for managing sepsis in a critical care unit.

--- Variation 1 ---

📘 Retrieved Context:
16 - Critical Care Medicine
Chapter 222. Approach to the Critically Ill Patient
Introduction
Critical care medicine specializes in caring for the most seriously ill patients. These patients are best
treated in an ICU staffed by experienced personnel. Some hospitals maintain separate units for special
populations (eg, cardiac, surgical, neurologic, pediatric, or neonatal patients). ICUs have a high
nurse:patient ratio to provide the necessary high intensity of service, including treatment and monitoring
of physiologic parameters.
Supportive care for the ICU patient includes provision of adequate nutrition (see p. 21) and prevention of
infection, stress ulcers and gastritis (see p. 131), and pulmonary embolism (see p. 1920). Because 15 to
25%...

First aid involves keeping the patient warm. Hemorrhage is controlled, airway and ventilatio

### Observations - Query 1 (Sepsis in a Critical Care Unit)

##### Variation 1–3 Observations
- All three retrieved the same general context focused on ICU protocols and airway/ventilation basics.

- Responses were medically accurate but not tailored to sepsis, failing to mention hallmarks like antibiotics, fluid resuscitation, lactate measurement, etc.

- Grounded in text, but low clinical utility for this specific prompt.

##### Variation 4 Observation
- This was the most relevant of all 5.

- Retrieved text included hallmark sepsis interventions: fluids, antibiotics, surgical drainage.

- The model likely generated a strong answer, closely aligned with clinical best practices.

##### Variation 5 Observation
- Similar to Variations 1–3.

- General shock treatment mentioned (O2, intubation), but lacks infection control detail.

- Could help support a broader discussion of sepsis but not sufficient alone.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
# Prompt Set for RAG evaluation
appendicitis_prompt = "As a medical expert, explain the symptoms of appendicitis, and whether it can be cured with medicine. If not, what surgery is typically required?"

params = [
    {"temperature": 0.1, "top_p": None},
    {"temperature": 0.3, "top_p": 0.9},
    {"temperature": 0.7, "top_p": 0.85},
    {"temperature": 0.9, "top_p": 0.8},
    {"temperature": 0.5, "top_p": 1.0},
]

run_rag(appendicitis_prompt, params, k=4, print_context=True)


🔎 Prompt:
As a medical expert, explain the symptoms of appendicitis, and whether it can be cured with medicine. If not, what surgery is typically required?

--- Variation 1 ---

📘 Retrieved Context:
• Evaluating the quality and validity of the evidence
• Deciding how to apply the evidence to the care of a given patient
Formulating a clinical question: Questions must be specific. Specific questions are most likely to be
addressed in the medical literature. A well-designed question specifies the population, intervention
(diagnostic test, treatment), comparison (treatment A vs treatment B), and outcome. "What is the best way
to evaluate someone with abdominal pain?" is not a good question. A better, more specific question may
be "Is CT or ultrasonography preferable for diagnosing acute appendicitis in a 30-yr-old male with acute
lower abdominal pain?"
Gathering evidence to answer the question: A broad selection of relevant studies is obta...

Appendicitis is acute inflammation of the ver

### Observations - Query 2 (Appendicitis)
##### Retrieved Context Summary (All Variations):
All five variations retrieved the same passage from the Merck Manual, focused on:

- Clinical question formulation related to appendicitis diagnosis.

- A standard definition of appendicitis and its symptoms.

- The common diagnostic approach (CT/ultrasound).

- The standard treatment: surgical removal (appendectomy).

##### LLM Responses (Summarized):
All responses were nearly identical, stating:

- Symptoms: Abdominal pain, anorexia, and abdominal tenderness.

- Diagnosis: Clinical evaluation, supported by CT or ultrasound.

- Treatment: Surgical removal of the appendix.

Note (Variation 4 only): Added prevalence (~5% of population) and most common age group (teens to 20s).

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
# Prompt Set for RAG evaluation
hair_loss_prompt = "As a dermatologist, explain the possible causes and recommended treatments for sudden patchy hair loss on the scalp."

params = [
    {"temperature": 0.1, "top_p": None},
    {"temperature": 0.3, "top_p": 0.95},
    {"temperature": 0.7, "top_p": 0.9},
    {"temperature": 0.9, "top_p": 0.8},
    {"temperature": 0.5, "top_p": 1.0},
]

run_rag(hair_loss_prompt, params, k=4, print_context=True)


🔎 Prompt:
As a dermatologist, explain the possible causes and recommended treatments for sudden patchy hair loss on the scalp.

--- Variation 1 ---

📘 Retrieved Context:
a major role.
Other common causes of hair loss are
• Drugs (including chemotherapeutic agents)
• Infection
The Merck Manual of Diagnosis & Therapy, 19th Edition
Chapter 86. Hair Disorders
846
kristi.esta@gmail.com
HNU4PTYZGQ
This file is meant for personal use by kristi.esta@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action. hair loss associated with hyperandrogenemia.
Surgical options include follicle transplant, scalp flaps, and alopecia reduction. Few procedures have
been subjected to scientific scrutiny, but patients who are self-conscious about their hair loss may
consider them.
Hair loss due to other causes: Underlying disorders are treated.
Multiple treatment options for alopecia areata exi...

Alopecia Areata Alopecia areata is sudden patchy hair loss in people with 

### Observations - Query 3 (Hair Loss)

##### Retrieved Context Summary (All Variations):
The retrieved content across all five variations focused on:

- Alopecia areata, described as sudden patchy hair loss.

- Possible autoimmune causes, with references to environmental triggers and genetic susceptibility.

- A list of treatments, including:
  1. Topical and intralesional corticosteroids
  2. Minoxidil
  3. Topical immunotherapy
  4. PUVA
  5. Surgical interventions (e.g., hair transplant)

- Additional causes mentioned: drugs, infections, and hyperandrogenemia.

##### LLM Responses (Summarized):
- The model correctly identified alopecia areata as a primary cause, the autoimmune nature of the condition, and recommended treatments that depend on severity and patient preferences.

- In some variations, cosmetic surgical options were discussed.

### Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
# Prompt Set for RAG evaluation
brain_injury_prompt = "As a neurologist, explain the treatment options for someone with a traumatic brain injury that has led to impaired brain function."

params = [
    {"temperature": 0.1, "top_p": None},
    {"temperature": 0.3, "top_p": 0.95},
    {"temperature": 0.7, "top_p": 0.9},
    {"temperature": 0.9, "top_p": 0.8},
    {"temperature": 0.5, "top_p": 1.0},
]

run_rag(brain_injury_prompt, params, k=4, print_context=True)


🔎 Prompt:
As a neurologist, explain the treatment options for someone with a traumatic brain injury that has led to impaired brain function.

--- Variation 1 ---

📘 Retrieved Context:
is common.
Early intervention by rehabilitation specialists is indispensable for maximal functional recovery (see also
p. 3231). Such intervention includes prevention of secondary disabilities (eg, pressure ulcers, joint
contractures), prevention of pneumonia, and family education. As early as possible, rehabilitation
specialists should evaluate patients to establish baseline findings. Later, before starting rehabilitation
therapy, patients should be reevaluated; these findings are compared with baseline findings to help
prioritize treatment. Patients with severe cognitive dysfunction require extensive cognitive therapy, which
is often begun immediately after injury and continued for months or years.
Spinal cord injury: Specific rehabilitat...

Initial treatment consists of ensuring a reliable airway and

### Observations - Query 4 (Traumatic Brain Injury)

##### Retrieved Context Summary (All Variations):
Each variation retrieved the same pair of chunks, which together addressed:

- Initial stabilization measures: airway management; ventilation, oxygenation, and blood pressure stabilization.

- Surgical intervention (if needed) for: intracranial pressure monitoring, hematoma removal, brain decompression.

- Rehabilitation and Long-Term Therapy: early evaluation by rehabilitation specialists, prevention of complications (e.g., pneumonia, pressure ulcers), cognitive therapy (initiated early and continued long-term).

##### LLM Responses (Summarized):
All five responses:

- Accurately outlined the acute phase interventions.

- Mentioned surgical monitoring and decompression.

- Described the rehabilitation phase for impaired brain function, including cognitive therapy.


### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
# Prompt Set for RAG evaluation
bad_trip_prompt = "As an emergency medical provider, explain the treatment steps and precautions for someone who fractured their leg while hiking."

params = [
    {"temperature": 0.1, "top_p": None},
    {"temperature": 0.3, "top_p": 0.9},
    {"temperature": 0.7, "top_p": 0.85},
    {"temperature": 0.9, "top_p": 0.8},
    {"temperature": 0.5, "top_p": 1.0},
]

run_rag(bad_trip_prompt, params, k=4, print_context=True)


🔎 Prompt:
As an emergency medical provider, explain the treatment steps and precautions for someone who fractured their leg while hiking.

--- Variation 1 ---

📘 Retrieved Context:
strains, ecchymosis, moderate to severe swelling, and poor muscle function caused by pain and
weakness are present.
Treatment
• Rest, ice, and compression
• Stretching, then strengthening exercises
Ice and compression with use of a thigh sleeve should begin as soon as possible. NSAIDs and analgesics
are prescribed as necessary, and crutches may be required initially if walking is painful.
The Merck Manual of Diagnosis & Therapy, 19th Edition
Chapter 338. Exercise & Sports Injury
3502
kristi.esta@gmail.com
HNU4PTYZGQ
This file is meant for personal use by kristi.esta@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action. Chapter 323. Fractures, Dislocations, and Sprains
Introduction
Fractures, joint di...

Rest, ice, compression, and elevation (RICE) • Usually immobili

### Observations - Query 5 (Fractured Leg while Hiking)

##### Retrieved Context Summary (All Variations):
- Each variation retrieved the same content, combining material from two Merck Manual chapters: "Exercise & Sports Injury" and "Fractures, Dislocations, and Sprains".

- The context included:
  - Immediate care: rest, ice, compression, elevation (RICE).
  - Immobilization as the usual course of action.
  - Use of NSAIDs, analgesics, thigh sleeves, and crutches when necessary.
  - Suggestions to begin stretching/strengthening exercises during recovery.

##### LLM Responses (Summarized):
- All variations gave answers along these lines: Begin with RICE protocol immediately post-injury; Provide immobilization for the fracture; Use pain management: NSAIDs and analgesics; If weight-bearing is painful, crutches may be needed; Recovery may involve physical therapy.

### Fine-tuning


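Throughout the project, I performed multiple rounds of parameter tuning to improve the performance of both the base LLM and the RAG pipeline. "Fine-tuning" here refers to adjusting generation, retrieval, and prompt settings; no model weights were updated.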

LLM Generation Settings: I tested five combinations of temperature and top_p. Lower temperatures (e.g., 0.1-0.3) resulted in more stable, consistent answers, while higher values introduced some variation but no major improvements. Ultimately, moderate settings (temperature 0.5, top_p 0.95) produced the most balanced results.
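
To illustrate what these two settings do to the next-token distribution, here is a small sketch on toy logits. The function names and logit values are assumptions for illustration; they do not reproduce the sampling code inside the model library.

```python
import math

def apply_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.1]  # toy values for three candidate tokens
for t in (0.1, 0.5, 0.9):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], top_p_filter(probs, 0.95))
```

Lower temperatures concentrate probability on the top token, which is consistent with the more stable answers observed at 0.1 to 0.3; top_p then trims the low-probability tail before sampling.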

Retriever k Value: I used k=4 to retrieve multiple context chunks per query. This provided sufficient context diversity without overwhelming the prompt window. Future work could experiment with dynamic k based on query type.
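
The role of k can be sketched with a toy nearest-neighbor search. This is an illustration under stated assumptions: the vectors are made-up 3-dimensional examples (the embeddings in this notebook have 384 dimensions), `top_k` is a hypothetical helper, and cosine similarity is used for clarity even though the vector store's actual distance metric may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Toy document vectors standing in for chunk embeddings
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
query_vec = [1.0, 0.05, 0.0]
print(top_k(query_vec, doc_vecs, k=2))
print(top_k(query_vec, doc_vecs, k=4))
```

Raising k pulls in progressively less similar chunks, which is the trade-off between context diversity and prompt length described above.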

Prompt Engineering: Prompts were refined for clinical clarity and specificity, often including a professional role (e.g., “As a neurologist…”). This helped anchor responses in the intended context and improve relevance.

Chunking: Used a standard text splitter with moderate chunk size and overlap. Chunk size was not fine-tuned in this iteration but could be explored further to enhance retrieval precision.

Overall, tuning focused on ensuring clinical relevance, groundedness, and fluency in RAG-generated responses.

## Output Evaluation

Let us now use the LLM-as-a-judge method to check the quality of the RAG system on two parameters: retrieval and generation. We illustrate this evaluation using the answers generated for the questions from the previous section.

- We are using the same Mistral model for evaluation, so the LLM is effectively rating its own performance on the task.

In [None]:
# System prompt for initial QA
qna_system_message = """You are a helpful medical assistant. Use the provided medical context to answer the user's medical question clearly and accurately."""

# User prompt template for QA
qna_user_message_template = """
Context:
{context}

Question:
{question}

Answer:"""

# System prompt for groundedness evaluation
groundedness_rater_system_message = """You are a medical evaluator. Your job is to assess how well an answer is grounded in the provided context. Only evaluate whether the answer reflects the context accurately, not whether it is medically correct overall.

Rate on a scale from 1 to 5 and provide a short justification.
"""

# System prompt for relevance evaluation
relevance_rater_system_message = """You are a medical evaluator. Your job is to assess how well an answer addresses the original question. Do not consider if it's based on context—only judge whether the answer is relevant to the question.

Rate on a scale from 1 to 5 and provide a short justification.
"""

# User message used for both groundedness and relevance rating
user_message_template = """
Question:
{question}

Context:
{context}

Answer:
{answer}

Please rate the response and provide a justification.
"""

In [None]:
def generate_ground_relevance_response(user_input, k=3, max_tokens=128, temperature=0.0, top_p=0.95, top_k=50):
    global qna_system_message, qna_user_message_template
    global groundedness_rater_system_message, relevance_rater_system_message, user_message_template

    # Retrieve relevant document chunks
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input, k=k)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)

    # --- Step 1: Generate the Answer ---
    qna_prompt = f"""[INST]{qna_system_message}
    user: {qna_user_message_template.format(context=context_for_query, question=user_input)}
    [/INST]"""

    # Let llm() decide do_sample based on temperature
    response = llm(
        prompt=qna_prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k
    )

    answer = response["choices"][0]["text"]

    # --- Step 2: Evaluate Groundedness ---
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}
    user: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
    [/INST]"""

    groundedness_response = llm(
        prompt=groundedness_prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k
    )

    # --- Step 3: Evaluate Relevance ---
    relevance_prompt = f"""[INST]{relevance_rater_system_message}
    user: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
    [/INST]"""

    relevance_response = llm(
        prompt=relevance_prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k
    )

    return groundedness_response['choices'][0]['text'], relevance_response['choices'][0]['text']

### Query 1: What is the protocol for managing sepsis in a critical care unit?

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

### Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
# Define the 5 clinical questions
questions = [
    "What is the protocol for managing sepsis in a critical care unit?",
    "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
]

# Store results for table
eval_results = []

# Loop and evaluate each question
for i, question in enumerate(questions, 1):
    print(f"Evaluating Question {i}: {question[:60]}...")

    groundedness_score, relevance_score = generate_ground_relevance_response(
        user_input=question,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        k=3
    )

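    # Parse scores from model responses (grab the number and explanation).
    # NOTE: only lines containing the word "score" are searched for a digit. In the
    # output below the score columns are empty and the digits appear in the
    # justification columns, so the judge's replies evidently did not contain that word.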
    def parse_score_and_reason(text):
        lines = text.strip().splitlines()
        score = None
        reason = ""
        for line in lines:
            if "score" in line.lower():
                digits = [s for s in line if s.isdigit()]
                if digits:
                    score = int(digits[0])
            else:
                reason += line.strip() + " "
        return score, reason.strip()

    g_score, g_reason = parse_score_and_reason(groundedness_score)
    r_score, r_reason = parse_score_and_reason(relevance_score)

    eval_results.append({
        "Question #": i,
        "Prompt": question,
        "Groundedness (1-5)": g_score,
        "Groundedness Justification": g_reason,
        "Relevance (1-5)": r_score,
        "Relevance Justification": r_reason
    })

# Create DataFrame
df_eval = pd.DataFrame(eval_results)

# Display in a clean table
from IPython.display import display
display(df_eval)

Evaluating Question 1: What is the protocol for managing sepsis in a critical care ...
Evaluating Question 2: What are the common symptoms for appendicitis, and can it be...
Evaluating Question 3: What are the effective treatments or solutions for addressin...
Evaluating Question 4: What treatments are recommended for a person who has sustain...
Evaluating Question 5: What are the necessary precautions and treatment steps for a...


| | Question # | Prompt | Groundedness (1-5) | Groundedness Justification | Relevance (1-5) | Relevance Justification |
|---|---|---|---|---|---|---|
| 0 | 1 | What is the protocol for managing sepsis in a ... | | 3 | | 4 |
| 1 | 2 | What are the common symptoms for appendicitis,... | | 4 | | 4 |
| 2 | 3 | What are the effective treatments or solutions... | | 3 | | 3 |
| 3 | 4 | What treatments are recommended for a person w... | | 3 | | 4 |
| 4 | 5 | What are the necessary precautions and treatme... | | 1 | | 5 |
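
In the table above, the two score columns are empty and the digits appear in the justification columns, which suggests the judge replied with a bare number rather than a line containing the word "score". A more tolerant parser could look like the sketch below. `extract_rating` is a hypothetical helper, not the function used to produce the table, and it assumes the first standalone digit from 1 to 5 in the reply is the rating.

```python
import re

def extract_rating(text):
    """Return (score, justification) from a judge reply.

    Hypothetical helper: the first standalone digit from 1 to 5 is taken
    as the score, and the full reply is kept as the justification.
    """
    match = re.search(r"\b([1-5])\b", text)
    if match is None:
        return None, text.strip()
    return int(match.group(1)), text.strip()

print(extract_rating("3"))
print(extract_rating("Rating: 4/5. The answer reflects the context."))
print(extract_rating("No rating given."))
```

This heuristic would misread a reply such as "On a scale of 1 to 5, I rate this 4", so asking the judge for a fixed output format in the rater prompt is likely the more reliable fix.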


### Interpretation of Scores
Groundedness was generally mid-range (3 to 4), except for the leg fracture case, which scored 1, likely due to overly broad or misaligned retrieved passages.

Relevance remained strong across most questions (4 to 5, with a 3 for hair loss), suggesting the answers usually addressed the question even when they were not deeply informative.

The best overall performance came from the appendicitis query (4 for both groundedness and relevance), followed by the sepsis and brain injury queries (3 and 4, respectively).

These results reinforce the importance of improving document chunking and retrieval specificity to increase groundedness for clinical QA tasks.
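
As a rough illustration of how the scores discussed above could be summarized, the sketch below averages each metric and flags answers with low groundedness. The score values are copied from the results table; the cutoff of 2 and the helper name `flag_low_groundedness` are assumptions made for this example and are not part of the original notebook.

```python
from statistics import mean

# Scores copied from the evaluation table (Questions 1-5).
groundedness = {1: 3, 2: 4, 3: 3, 4: 3, 5: 1}
relevance = {1: 4, 2: 4, 3: 3, 4: 4, 5: 5}


def flag_low_groundedness(scores, threshold=2):
    """Return question numbers scoring at or below the (assumed) threshold."""
    return [question for question, score in scores.items() if score <= threshold]


avg_groundedness = mean(groundedness.values())
avg_relevance = mean(relevance.values())
low = flag_low_groundedness(groundedness)

print(f"Mean groundedness: {avg_groundedness:.1f}")
print(f"Mean relevance: {avg_relevance:.1f}")
print(f"Questions needing retrieval review: {low}")
```

With five questions the averages are only indicative, but the gap between the two metrics matches the pattern described above: retrieval is on-topic more often than the answer is fully supported by it.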

### Note on Evaluation Scores
Across the five clinical queries, relevance scores were consistently high (3 to 5), and groundedness held at 3 to 4 when the retrieved context was topically aligned with the query. Lower scores emerged when retrieval introduced off-topic or overly broad information (e.g., diagnostic philosophy instead of treatment steps). This underscores the importance of tight retrieval granularity and query-specific tuning in future iterations.

## Actionable Insights and Business Recommendations

#### 1. High-Quality Retrieval Is Critical
Retrieval quality directly impacted the usefulness and accuracy of generated answers. Even well-tuned models struggled when the context was irrelevant or vague. Future systems should prioritize advanced chunking strategies (e.g., semantic overlap, context window tuning) to improve RAG performance.
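
One way to picture the overlap idea is a sliding window over words, where each chunk repeats the tail of the previous one so that a sentence split at a boundary still appears whole in at least one chunk. This is a minimal sketch only: the function name `chunk_with_overlap` and the default sizes are assumptions, and the notebook's own splitter may work differently (for example, on tokens rather than words).

```python
def chunk_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks that share `overlap` words with their neighbor."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny example with artificially small sizes so the overlap is visible.
sample = "one two three four five six seven eight nine ten"
small_chunks = chunk_with_overlap(sample, chunk_size=4, overlap=2)
for chunk in small_chunks:
    print(chunk)
```

Larger overlaps raise the chance that a retrieved chunk is self-contained, at the cost of more chunks to embed and store.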

#### 2. Medical QA Systems Require Guardrails
Even with clinical documents like the Merck Manual, responses occasionally included generic or partial answers. A production system should include answer validation layers—such as confidence scoring, citation checks, or expert review triggers—to reduce medical risk.
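
A validation layer of this kind could start with something as simple as a lexical support check: for each sentence in the answer, measure how many of its words appear in the retrieved context, and flag the answer when too few sentences are supported. The sketch below is a crude proxy under stated assumptions, not a clinical validation method; the function names, the 0.5 overlap cutoff, the 0.8 review threshold, and the toy strings are all hypothetical.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_ratio(answer, context, min_overlap=0.5):
    """Fraction of answer sentences whose tokens mostly occur in the context."""
    context_tokens = _tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        tokens = _tokens(sentence)
        if tokens and len(tokens & context_tokens) / len(tokens) >= min_overlap:
            supported += 1
    return supported / len(sentences)


def needs_review(answer, context, threshold=0.8):
    """Flag the answer for expert review when support falls below the threshold."""
    return support_ratio(answer, context) < threshold


# Toy strings for illustration only.
toy_context = "Sepsis is treated with fluids and antibiotics."
toy_answer = "Sepsis is treated with antibiotics. Patients should drink herbal tea daily."
print(support_ratio(toy_answer, toy_context), needs_review(toy_answer, toy_context))
```

Word overlap cannot detect a sentence that reuses the context's vocabulary to say something different, so a check like this would complement the LLM-based groundedness score rather than replace it.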

#### 3. Prompt Engineering Drives Clarity
Prompts that clearly specified the desired format (symptoms, cause, treatment) yielded better-structured responses. Clinical AI tools should adopt consistent prompt templates tailored to practitioner needs—triage checklists, diagnosis trees, or treatment protocols.
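
A consistent template might look like the sketch below, which fixes the three output sections named above (symptoms, cause, treatment) and instructs the model to stay within the retrieved context. The exact wording, the name `CLINICAL_TEMPLATE`, and the helper `build_prompt` are assumptions for illustration, not the prompts used in this notebook.

```python
# Hypothetical template; the section names follow the format described above.
CLINICAL_TEMPLATE = """Answer using only the context below.
If the context does not contain the answer, reply "Not found in context."

Context:
{context}

Question:
{question}

Respond in exactly three sections:
Symptoms:
Cause:
Treatment:
"""


def build_prompt(context, question):
    """Fill the shared template so every query receives the same structure."""
    return CLINICAL_TEMPLATE.format(context=context.strip(), question=question.strip())


example_prompt = build_prompt(
    context="(retrieved passages go here)",
    question="What are the common symptoms and treatments for pulmonary embolism?",
)
print(example_prompt)
```

Keeping the template in one place makes it straightforward to maintain variants for other practitioner needs, such as a triage checklist, without changing the retrieval code.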

#### 4. Redundancy May Signal Strength or Weakness
Many outputs were nearly identical across temperature/top-p variations, which could indicate strong model certainty—or weak retrieval variety. Business applications should balance this by tuning for informative diversity without sacrificing consistency.
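
To put a number on "nearly identical", one option is the mean pairwise similarity of the answers generated under different sampling settings. The sketch below uses `difflib.SequenceMatcher` from the standard library as a simple character-level measure; the function name and the toy strings are assumptions, and an embedding-based similarity would be a reasonable alternative.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(outputs):
    """Average SequenceMatcher ratio over all pairs of outputs (1.0 means identical)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to compare
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Toy strings standing in for answers produced at different temperature/top-p settings.
identical_runs = ["Give fluids early.", "Give fluids early.", "Give fluids early."]
varied_runs = ["Give fluids early.", "Start broad antibiotics."]

print(mean_pairwise_similarity(identical_runs))
print(mean_pairwise_similarity(varied_runs))
```

A value near 1.0 across settings does not by itself distinguish model certainty from narrow retrieval; comparing it against the variety of retrieved chunks would help separate the two.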

#### 5. Use Human-in-the-Loop for Edge Cases
For complex, rare, or high-stakes medical queries (e.g., brain trauma or sepsis management), automated RAG systems should escalate to human review or provide disclaimers. This protects patient safety and limits organizational liability.
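
An escalation rule along these lines can be sketched as a small routing function: send the answer to a human when the question mentions a high-stakes topic or when the groundedness score is missing or low. The term list (taken from the two examples above), the cutoff of 3, and the route labels are assumptions for illustration only; a real deployment would need a clinically reviewed policy.

```python
# Assumed list, seeded with the examples mentioned above.
HIGH_STAKES_TERMS = ("sepsis", "brain trauma")


def route_answer(question, groundedness_score, min_groundedness=3):
    """Return 'human_review' or 'auto_with_disclaimer' for a generated answer."""
    lowered = question.lower()
    if any(term in lowered for term in HIGH_STAKES_TERMS):
        return "human_review"  # high-stakes topic: always escalate
    if groundedness_score is None or groundedness_score < min_groundedness:
        return "human_review"  # unscored or weakly grounded answer
    return "auto_with_disclaimer"


print(route_answer("What is the protocol for managing sepsis in a critical care unit?", 3))
print(route_answer("What are the common symptoms for appendicitis?", 4))
```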

#### 6. Opportunities for Commercialization
There is strong business potential in adapting RAG systems to serve as clinical assistants for medical education, patient FAQs, or intake triage support. A fine-tuned system with structured prompts and curated documents can reduce practitioner burden and improve patient access.


