<a href="https://colab.research.google.com/github/helghand1/MAT421/blob/main/MAT421_Term_Paper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hussein ElGhandour

## MAT 421 Final Term Paper LLM Project

### Medical Calculation Verification Agent

### Introduction

Accurate medical calculations are essential to ensuring safe and effective patient care. Healthcare professionals routinely rely on formulas to determine critical values such as Body Mass Index (BMI), Body Surface Area (BSA), and medication dosages. Even minor errors in these calculations can lead to incorrect diagnoses, improper treatment plans, or adverse drug events. Despite their importance, these calculations are frequently performed under time constraints and without built-in verification tools, increasing the risk of human error.

Traditional clinical calculators, while helpful, often lack features such as contextual guidance, error-checking, or explanatory feedback. As a result, there is a growing interest in using artificial intelligence (AI) to improve the reliability of medical tools. In particular, Large Language Models (LLMs) have demonstrated advanced capabilities in understanding and generating structured responses to complex prompts, making them suitable for verification and support in clinical workflows.

This paper presents the development and evaluation of a Medical Calculation Verification Agent powered by an LLM. The objective is to create a tool capable of validating key medical calculations and offering context-aware feedback to assist healthcare providers. The tool is implemented in Python with modern LLM APIs and combines unit conversions and structured prompt engineering with numerical differentiation concepts; its accuracy is evaluated through percentage error analysis. Through this work, the paper explores the feasibility and effectiveness of integrating LLMs into medical computational tasks to improve accuracy and reduce clinical risk.

### Related Works

Large Language Models (LLMs) such as GPT-4 have rapidly advanced in their ability to perform complex reasoning tasks, including those related to clinical decision-making and medical education. Recent evaluations have demonstrated the growing potential of these models for supporting healthcare professionals in structured tasks. For example, Liu et al. (2024a) assessed the performance of several leading LLMs—including GPT-4o and Claude 3 Opus—on the Japanese National Medical Licensing Examination (JNME), using a dataset of 790 real exam questions. GPT-4o achieved the highest overall accuracy at 89.2%, with particularly strong results on text-based and lower-difficulty questions. However, the models exhibited diminished performance on image-based and complex clinical reasoning items, and none reached the 95% accuracy threshold considered suitable for clinical deployment.

In a broader review, Liu et al. (2024b) conducted a meta-analysis of 45 studies evaluating ChatGPT’s performance across national medical licensing exams in 17 countries. Their findings showed that GPT-4 achieved a mean accuracy of 81%, significantly outperforming GPT-3.5 and frequently surpassing human medical students. They also found that prompting strategies greatly influenced model performance. Despite this progress, results varied widely across languages, specialties, and question types. Like the JNME study, this analysis concluded that current LLMs remain insufficiently consistent and accurate for autonomous clinical use.

These studies collectively underscore the promising capabilities of LLMs in medical settings while also revealing key limitations. Most notably, prior work focuses on general diagnostic reasoning and performance on exam-style questions, without targeting high-risk, calculation-based medical tasks. Additionally, they lack implementation of verification techniques or numerical methods to validate LLM outputs.

This project addresses those gaps by developing a domain-specific Medical Calculation Verification Agent powered by an LLM. Rather than evaluating general reasoning skills, the agent is designed to validate critical medical formulas such as BMI, BSA, and dosage calculations using structured prompts and precision validation techniques. Furthermore, it incorporates a formal percentage error analysis framework, applying numerical differentiation concepts to systematically measure the model's accuracy across real clinical datasets. By combining symbolic logic with real-world medical formulas and structured error evaluation, this tool contributes a focused solution to improving the safety and reliability of medical calculations—an area underrepresented in current LLM healthcare research.

### Math Method

This project incorporates the percentage error analysis method, a core concept introduced in Module F: Numerical Differentiation, to evaluate the verification capabilities of the LLM-based Medical Calculation Verification Agent. Percentage error is a widely used numerical technique for quantifying the accuracy of calculated values relative to known ground-truth data — an essential aspect when assessing the reliability of medical computations.

<br>

**Error Analysis in Medical Context**

Medical calculations such as BMI, BSA, and dosage must be extremely accurate, as small errors can have significant clinical consequences. To measure the LLM’s verification precision, this project compares the LLM's calculated outputs to the Python-ground-truth results for each patient sample and computes the percentage error according to the formula:

$$
\text{Percent Error} = \left| \frac{\text{Expected Value} - \text{LLM Output}}{\text{Expected Value}} \right| \times 100\%
$$

This approach provides an interpretable metric for understanding model reliability. For instance, an error of less than 1% indicates a highly trustworthy result for clinical use, while larger errors highlight discrepancies needing further attention.
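
As a minimal sketch of the formula above, the helper below computes the percentage error for one expected/LLM pair. The function name `percent_error` is an assumption made for illustration and is not part of the notebook's code; the sample values are the first BMI test case reported in this notebook's output (Python result 25.43, LLM result 25.41).

```python
def percent_error(expected, llm_output):
    """Absolute percentage error of llm_output relative to expected."""
    if expected == 0:
        # The formula divides by the expected value, so zero is undefined.
        raise ValueError("expected value must be non-zero")
    return abs((expected - llm_output) / expected) * 100


# Python ground truth 25.43 versus LLM answer 25.41
error = percent_error(25.43, 25.41)
print(f"{error:.3f}%")  # prints 0.079%
```

An error of this size sits well under the 1% level discussed above.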

**Numerical Techniques Applied**

Direct Calculation of Percentage Error: For each BMI, BSA, and dosage case, the system calculates the relative deviation between the expected and LLM-predicted values, expressed as a percentage.

Interpretation Based on Tolerance Levels: Results are assessed against reasonable medical error thresholds to judge the practical feasibility of using the model in clinical workflows.

**Integration with the LLM Agent**

In this project, percentage error analysis serves two purposes:

- Performance Evaluation: It systematically quantifies the LLM agent’s accuracy across diverse real-world clinical inputs.

- Research Validation: It provides a transparent and reproducible metric that can be benchmarked against the performance standards observed in related healthcare LLM studies.

By incorporating percentage error analysis, the project moves beyond qualitative evaluation to offer a rigorous numerical assessment of the LLM’s ability to verify critical medical calculations — a crucial step for future clinical deployment.

<br>

**Mathematics of the Numerical Differentiation Method (Module F: Numerical Differentiation)**

In Module F: Numerical Differentiation, we studied how derivatives of functions can be approximated numerically using finite difference formulas. This is particularly useful when the exact analytical derivative is difficult to compute.

One common method is the Forward Difference Approximation, given by:

$$
f'(x) \approx \frac{f(x+h) - f(x)}{h}
$$

where:

- f′(x) is the approximate derivative of the function f(x),

- h is a small increment, and

- f(x+h) and f(x) are the function values at x+h and x, respectively.

In this project, numerical differentiation is relevant for understanding how small changes in inputs (such as weight, height, or dosage factors) could affect the outputs of medical formulas like BMI, BSA, and dosage calculations. By approximating how outputs vary with respect to small changes in inputs, we can better analyze sensitivity and model robustness, thereby enhancing the reliability of LLM-driven medical verification.
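
The sensitivity idea described above can be sketched for BMI as follows. This is an illustrative example under stated assumptions, not the notebook's implementation: the helper names `forward_difference` and `bmi`, the step size `h=1e-6`, and the sample patient (70 kg, 1.75 m) are all chosen here for demonstration.

```python
def forward_difference(f, x, h=1e-6):
    """Approximate f'(x) with the forward difference (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


def bmi(weight_kg, height_m):
    """BMI = weight / height^2."""
    return weight_kg / height_m ** 2


weight_kg, height_m = 70.0, 1.75

# Sensitivity of BMI to weight, holding height fixed
dbmi_dweight = forward_difference(lambda w: bmi(w, height_m), weight_kg)

# Sensitivity of BMI to height, holding weight fixed
dbmi_dheight = forward_difference(lambda ht: bmi(weight_kg, ht), height_m)

print(f"dBMI/dweight ~ {dbmi_dweight:.4f} per kg")
print(f"dBMI/dheight ~ {dbmi_dheight:.4f} per m")
```

For comparison, the analytical derivatives are 1/height² and −2·weight/height³, so the estimates should be close to 0.3265 and −26.12 for this patient. BMI is therefore far more sensitive to a small height error than to a small weight error of the same size.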





### Experiment Setup and Implementation

<br>

1. **System Overview**

The core objective of the project is to develop a Medical Calculation Verification Agent that assists healthcare professionals by verifying the correctness of key medical calculations such as Body Mass Index (BMI), Body Surface Area (BSA), and medication dosages. The system leverages a Large Language Model (LLM), accessed through the OpenAI API, to validate user-provided inputs against standard clinical formulas. It also provides structured, step-by-step explanations to aid understanding and reduce the risk of human error in clinical environments.

The LLM serves as the primary reasoning engine, receiving structured prompts containing patient data and calculation requests. It processes these inputs, performs arithmetic reasoning, and returns a final computed value along with explanatory logic. To evaluate the LLM's reliability, its outputs are compared against ground truth values computed using traditional numerical methods implemented in Python.

The system is built in a Jupyter notebook environment via Google Colab, allowing for seamless integration of Python scripting, API calls, and inline result analysis. This setup supports real-time interaction and iterative testing, making it well-suited for experimentation with different prompt formats and test cases.

<br>

2. **Technologies and Tools Used**

This project is implemented entirely within a Google Colab environment using Python. The system integrates the OpenAI API to interact with a Large Language Model (e.g., GPT-4 or GPT-3.5-turbo), which serves as the core reasoning engine for verifying medical calculations.

The following tools and libraries are utilized:

- OpenAI API: Used to send structured prompts to the LLM and receive textual responses containing calculation steps and final outputs.

- Python 3 (via Google Colab): Provides the base programming environment for system logic, numerical methods, data validation, and analysis.

- NumPy: Handles backend computation of ground truth values for comparison with LLM outputs.

- SymPy: Used for symbolic computation where needed for solving equations.

- Pandas: Used to organize patient data and display test results in a tabular format.

- Matplotlib/Seaborn (optional): May be used for visualizing error trends or model accuracy over multiple test cases.

The LLM is prompted with input variables (e.g., weight, height) and the relevant medical formula. Prompts are structured to guide the model step-by-step through the arithmetic or symbolic reasoning process. This method increases the likelihood of accurate and interpretable outputs. For validation, the same inputs are processed through backend Python logic using direct formula evaluation, and the LLM’s results are compared with this ground truth.

Patient data used for testing was sourced from the Personalized Medication Dataset available on Kaggle (Ziya, 2025). The dataset archive (personalized-medication-dataset.zip) was manually uploaded into the Google Colab environment using the built-in upload feature (files.upload() from google.colab) and extracted, and the resulting CSV file, personalized_medication_dataset.csv, was read into a Pandas DataFrame for analysis. No additional preprocessing was required for the selected variables beyond converting height from centimeters to meters for the BMI calculation.

<br>

3. **Prompt Engineering Strategy**

To ensure reliable performance, the Medical Calculation Verification Agent uses a carefully structured prompt design to guide the LLM through each medical calculation. Prompt engineering plays a critical role in improving accuracy, interpretability, and consistency of LLM-generated outputs. Rather than relying on the model to infer reasoning steps, prompts are explicitly formatted to guide the LLM through each operation.

<br>

**Step-by-Step Prompting Approach**

For each supported calculation type—such as Body Mass Index (BMI), Body Surface Area (BSA), or medication dosage—the system automatically identifies the appropriate formula and injects it into a pre-defined prompt template. These prompts are written in a step-by-step format to simulate structured reasoning. Prior research has shown that effective prompt design—including role-based and context-rich instructions—significantly improves LLM performance on medical tasks (Liu et al., 2024b).

This design ensures the user only needs to input the necessary variables (e.g., weight, height), while the system takes care of the reasoning structure behind the scenes. For example:

<br>

BMI Calculation Prompt Example:

> A patient has a weight of 70 kg and a height of 1.75 meters.  
> Use the formula BMI = weight / (height^2).  
> Step 1: Square the height.  
> Step 2: Divide the weight by the squared height.  
> Step 3: Report the final BMI value.  
> Please show all steps and round the result to two decimal places.

<br>

BSA (Mosteller Formula) Prompt Example:

> A patient has a weight of 80 kg and a height of 180 cm.  
> Use the formula BSA = sqrt((height * weight) / 3600).  
> Step 1: Multiply the height by the weight.  
> Step 2: Divide the result by 3600.  
> Step 3: Take the square root of the result.  
> Report the final BSA value in square meters, rounded to two decimal places.

<br>

Dosage Prompt Example:

> A patient weighs 25 kg and is prescribed a dosage of 4 mg/kg.  
> Use the formula Dose = weight * dose per kg.  
> Step 1: Multiply the weight by the dose per kg.  
> Step 2: Report the final dosage in mg.  
> Please round to the nearest whole number if necessary.


<br>

This structure not only guides the LLM but also helps in validating intermediate steps for accuracy during result comparison. Additionally, by keeping the prompt templates modular, the system can be easily extended to support additional medical calculations in the future.
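
One way the modular templates described above could be organized is a small registry keyed by calculation type. This is a hedged sketch: `PROMPT_TEMPLATES` and `build_prompt` are hypothetical names introduced for illustration, and the template wording simply mirrors the three prompt examples above.

```python
# Hypothetical template registry; wording mirrors the prompt examples above.
PROMPT_TEMPLATES = {
    "bmi": (
        "A patient has a weight of {weight} kg and a height of {height_m} meters.\n"
        "Use the formula BMI = weight / (height^2).\n"
        "Step 1: Square the height.\n"
        "Step 2: Divide the weight by the squared height.\n"
        "Step 3: Report the final BMI value.\n"
        "Please show all steps and round the result to two decimal places."
    ),
    "bsa": (
        "A patient has a weight of {weight} kg and a height of {height_cm} cm.\n"
        "Use the formula BSA = sqrt((height * weight) / 3600).\n"
        "Step 1: Multiply the height by the weight.\n"
        "Step 2: Divide the result by 3600.\n"
        "Step 3: Take the square root of the result.\n"
        "Report the final BSA value in square meters, rounded to two decimal places."
    ),
    "dosage": (
        "A patient weighs {weight} kg and is prescribed a dosage of {dose_per_kg} mg/kg.\n"
        "Use the formula Dose = weight * dose per kg.\n"
        "Step 1: Multiply the weight by the dose per kg.\n"
        "Step 2: Report the final dosage in mg.\n"
        "Please round to the nearest whole number if necessary."
    ),
}


def build_prompt(calc_type, **values):
    """Fill the template for calc_type with the supplied patient values."""
    if calc_type not in PROMPT_TEMPLATES:
        raise ValueError(f"Unsupported calculation type: {calc_type}")
    return PROMPT_TEMPLATES[calc_type].format(**values)


print(build_prompt("dosage", weight=25, dose_per_kg=4))
```

With this layout, supporting a new calculation would mean adding one dictionary entry rather than writing a new function.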

<br>

4. **Backend Verification Logic**

To evaluate the accuracy of the LLM’s outputs, the system includes a backend verification layer built using Python. This backend is responsible for computing the correct results using well-established medical formulas and comparing those results to the LLM's output.

<br>

**Direct Formula Evaluation**

For calculations like BMI, BSA, and dosage, which follow deterministic equations, Python is used to directly plug in patient input values and compute the expected result. For example:

- BMI is computed using:

$$
\text{BMI} = \frac{\text{Weight (kg)}}{\text{Height (m)}^2}
$$

- BSA (Mosteller Formula):

$$
\text{BSA} = \sqrt{\frac{\text{Height (cm)} \times \text{Weight (kg)}}{3600}}
$$

- Dosage (Standard Weight-Based Medication Dosage Formula):

$$
\text{Dosage (mg)} = \text{Weight (kg)} \times \text{Dosage per kg (mg/kg)}
$$

These are implemented in Python using standard arithmetic; NumPy and SymPy are available where vectorized or symbolic evaluation is needed.
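
The three formulas above translate directly into Python. The sketch below uses only the standard library; the function names `ground_truth_bmi`, `ground_truth_bsa`, and `ground_truth_dosage` are assumptions chosen for illustration, and the inputs reuse the values from the prompt examples earlier in this section.

```python
import math


def ground_truth_bmi(weight_kg, height_m):
    """BMI = weight (kg) / height (m)^2."""
    return weight_kg / height_m ** 2


def ground_truth_bsa(weight_kg, height_cm):
    """Mosteller formula: BSA = sqrt(height (cm) * weight (kg) / 3600)."""
    return math.sqrt(height_cm * weight_kg / 3600)


def ground_truth_dosage(weight_kg, dose_mg_per_kg):
    """Weight-based dose in mg = weight (kg) * dose per kg (mg/kg)."""
    return weight_kg * dose_mg_per_kg


print(round(ground_truth_bmi(70, 1.75), 2))   # 22.86
print(round(ground_truth_bsa(80, 180), 2))    # 2.0
print(ground_truth_dosage(25, 4))             # 100
```

Rounding is applied only when printing, so the unrounded values remain available for the error comparison.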

<br>

**Numerical Differentiation (Forward Difference Method)**

In addition to direct calculation, numerical differentiation techniques are used to estimate how small changes in input variables could affect the output. The Forward Difference Approximation is applied:

$$
f'(x) \approx \frac{f(x+h) - f(x)}{h}
$$

This technique provides insight into the sensitivity of the medical formulas, helping to assess how slight variations in patient measurements could influence the calculated outcomes.

<br>

**Comparison Logic**

Once both the LLM and Python calculations are complete, the system compares the two results using percentage error analysis, defined as:

$$
\text{Percentage Error} = \left| \frac{\text{LLM Output} - \text{Python Result}}{\text{Python Result}} \right| \times 100\%
$$

A prediction is considered correct if the percentage error falls within a clinically acceptable threshold. For this project:

- An error of at most 1% is considered acceptable for BMI and BSA.

- For dosage calculations, exact or near-exact agreement (≤ 1% error) is expected due to the critical importance of medication dosing accuracy.

This evaluation process ensures that the model is held to rigorous standards and that its performance can be meaningfully assessed across different medical calculations.

<br>

5. **Evaluation Metrics**

The primary evaluation metric used in this project is percentage error, defined as:

$$
\text{Percentage Error} = \left| \frac{\text{LLM Output} - \text{Python Result}}{\text{Python Result}} \right| \times 100\%
$$

A prediction is considered accurate if the percentage error falls within clinically acceptable thresholds:

- For BMI and BSA calculations, an error of at most 1% is acceptable.

- For dosage calculations, a near-exact agreement (≤ 1% error) is required, due to the critical nature of medication dosing.

This standard ensures that minor rounding discrepancies are tolerated, while significant deviations are flagged as failures. It provides a rigorous and consistent method for evaluating the LLM's reliability across different types of medical calculations.
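
To apply this metric automatically, the final number has to be pulled out of the LLM's free-text reply first. The sketch below shows one possible approach under a loud assumption: it treats the last number in the reply as the final answer, which holds for the step-by-step replies shown in this notebook but is a heuristic, not a guarantee. The names `extract_final_value` and `verify_llm_reply` are hypothetical.

```python
import re


def extract_final_value(llm_text):
    """Return the last number in the reply, or None if there is none.

    Assumption: the final answer is the last number the model writes.
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", llm_text)
    return float(numbers[-1]) if numbers else None


def verify_llm_reply(expected, llm_text, tolerance_pct=1.0):
    """Compare the LLM's final value with the Python result."""
    value = extract_final_value(llm_text)
    if value is None:
        return {"value": None, "error_pct": None, "passed": False}
    error_pct = abs((value - expected) / expected) * 100
    return {"value": value, "error_pct": error_pct, "passed": error_pct <= tolerance_pct}


reply = "Step 3: The patient's BMI is approximately 25.41."
result = verify_llm_reply(25.43, reply)
print(result["value"], result["passed"])
```

A reply with no number at all is reported as a failure rather than raising, so a batch run over many patients can continue.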



### LLM

> Uploading the dataset:

In [None]:
import os
import zipfile

from google.colab import files

uploaded = files.upload()  # Upload the ZIP file manually when prompted

# Use the name Colab actually saved the file under
# (it appends " (1)" when the same file is uploaded again)
zip_name = next(iter(uploaded))

# Unzip the uploaded archive into med_data/
with zipfile.ZipFile(zip_name, "r") as zip_ref:
    zip_ref.extractall("med_data")

# List the files to confirm
os.listdir("med_data")


Saving personalized-medication-dataset.zip to personalized-medication-dataset (1).zip


['personalized_medication_dataset.csv']

In [None]:
import os
os.listdir("med_data") # print the name of the csv to load in next cell block


['personalized_medication_dataset.csv']

In [None]:
import pandas as pd
df = pd.read_csv("med_data/personalized_medication_dataset.csv") # paste previous output file name (ex. "med_data/[PREVIOUS_OUTPUT].csv")
df.head() # print the first 5 rows of the dataset


Unnamed: 0,Patient_ID,Age,Gender,Weight_kg,Height_cm,BMI,Chronic_Conditions,Drug_Allergies,Genetic_Disorders,Diagnosis,Symptoms,Recommended_Medication,Dosage,Duration,Treatment_Effectiveness,Adverse_Reactions,Recovery_Time_Days
0,P0001,78,Other,88.7,196.3,21.1,,Penicillin,Cystic Fibrosis,Inflammation,Fever,Amlodipine,,30 days,Effective,Yes,18
1,P0002,57,Female,90.5,195.6,30.2,Hypertension,,Cystic Fibrosis,Depression,"Fatigue, Headache, Dizziness",Amoxicillin,5 mg,,Neutral,No,24
2,P0003,29,Female,87.0,168.2,27.0,,Sulfa,,Inflammation,"Joint Pain, Headache, Nausea",,,7 days,Effective,Yes,12
3,P0004,56,Female,81.4,188.9,26.9,Hypertension,Penicillin,Cystic Fibrosis,Infection,Joint Pain,Ibuprofen,200 mg,7 days,Very Effective,No,22
4,P0005,90,Male,64.2,157.0,33.3,,Sulfa,Sickle Cell Anemia,Inflammation,"Fatigue, Fever, Headache",Amlodipine,500 mg,10 days,Ineffective,Yes,25


In [None]:
df.shape # check number of data points

(1000, 17)

> Set up OpenAI API:


In [None]:
!pip install openai # install OpenAI Python package




> Sample general call to LLM:

In [None]:
# Requires openai>=1.0.0
import openai

# Note: You must insert your own OpenAI API key before running these cells.
# For security, the API key used during testing has been deleted.

client = openai.OpenAI(api_key="YOUR_API_KEY_HERE")  # Insert your OpenAI API key here

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": "A patient weighs 70 kg and is 1.75 meters tall. Use the formula BMI = weight / height^2. Calculate the BMI step by step."}
    ]
)

print(response.choices[0].message.content)


Sure! Let's calculate the BMI for this patient step by step:

1. **Weight = 70 kg**
2. **Height = 1.75 meters**

BMI = weight / (height)^2

BMI = 70 / (1.75)^2

BMI = 70 / 3.0625

BMI ≈ 22.85

Therefore, the BMI for this patient is approximately 22.85.


#### BMI Verification Agent

This section presents two versions of the BMI calculation and verification agent.

**Phase 1** demonstrates a basic version of the agent where the inputs are structured and passed explicitly via function arguments.

**Phase 2** extends the agent's usability by introducing natural language understanding (NLU), allowing clinicians to enter free-form text like “check BMI for 1.75m and 80kg.” The agent then detects intent, extracts values, and calls the appropriate function.

Both versions demonstrate the ability of the LLM to produce step-by-step, interpretable outputs in support of clinical verification tasks.


**Phase 1: Structured Input Agent**

The following function accepts structured inputs (weight in kg, height in meters) and generates a prompt designed to guide the LLM through a step-by-step BMI calculation using the known formula:

$$
\text{BMI} = \frac{\text{weight}}{\text{height}^2}
$$


In [None]:
def run_llm_bmi_agent(weight, height_m, client):
    """
    Structured-input version of the BMI calculator using GPT.

    Parameters:
    - weight (float): Patient's weight in kilograms
    - height_m (float): Patient's height in meters
    - client: Initialized OpenAI client object

    Returns:
    - str: Step-by-step LLM response
    """
    prompt = f"""
A patient has a weight of {weight} kilograms and a height of {height_m} meters.
Use the formula: BMI = weight / (height^2)

Step 1: Square the height.
Step 2: Divide the weight by the squared height.
Step 3: Report the final BMI value rounded to two decimal places.
"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful medical assistant."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content


In [None]:
# Sample call for Phase 1 (Structured Agent)

client = openai.OpenAI(api_key="YOUR_API_KEY_HERE")  # Insert your OpenAI API key here

bmi_result = run_llm_bmi_agent(weight=80, height_m=1.75, client=client)
print(bmi_result)


Step 1: Square the height.
1.75 meters x 1.75 meters = 3.06 square meters

Step 2: Divide the weight by the squared height.
80 kilograms / 3.06 square meters = 26.14

Step 3: The patient's BMI is 26.14.


**Phase 2: Natural Language Input Agent (NLU-Driven)**

This version enhances usability by allowing the user to enter input in natural language. The system detects that the calculation is for BMI, extracts the relevant numeric inputs, and routes them to the structured prompt logic from Phase 1.


In [None]:
# Input Detection + Extraction

import re

def detect_calc_type(user_input):
    user_input = user_input.lower()
    if "bmi" in user_input:
        return "bmi"
    elif "bsa" in user_input or "surface area" in user_input:
        return "bsa"
    elif "dose" in user_input or "dosage" in user_input or "mg/kg" in user_input:
        return "dosage"
    else:
        return "unknown"

def extract_numbers(user_input):
    return [float(n) for n in re.findall(r"\d+\.?\d*", user_input)]

# Route to BMI Agent (with Error Handling)

def run_medical_llm_router(user_input, client):
    calc_type = detect_calc_type(user_input)
    values = extract_numbers(user_input)

    if calc_type == "bmi":
        if len(values) >= 2:
            height = values[-2]
            weight = values[-1]
            return run_llm_bmi_agent(weight=weight, height_m=height, client=client)
        else:
            return "Error: Not enough information to calculate BMI."
    else:
        return "Sorry, this version of the agent currently supports BMI only."


In [None]:
# Sample call for Phase 2 (NLU-driven Agent)
test_input = "Check BMI for 16-year-old male, 1.75m, 80kg"
client = openai.OpenAI(api_key="YOUR_API_KEY_HERE")  # Insert your OpenAI API key here
nlu_result = run_medical_llm_router(test_input, client)
print(nlu_result)


Step 1:
Squared height = 1.75^2 = 3.0625

Step 2:
BMI = 80.0 / 3.0625 = 26.12

Step 3:
The patient's BMI is 26.12.
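
The router above takes the last two numbers in the sentence as height and weight, so it depends on word order and on the height being given in meters. A unit-aware extraction could relax both assumptions. The following is a sketch only: `extract_measurements` and `height_and_weight` are hypothetical helpers, and the pattern recognizes just the units `kg`, `cm`, and `m`.

```python
import re

# A number followed by one of the supported units (kg, cm, m).
UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|cm|m)\b", re.IGNORECASE)


def extract_measurements(user_input):
    """Map each recognized unit to the value written next to it."""
    found = {}
    for value, unit in UNIT_PATTERN.findall(user_input):
        found[unit.lower()] = float(value)
    return found


def height_and_weight(user_input):
    """Return (weight_kg, height_m); either is None if not found."""
    measurements = extract_measurements(user_input)
    weight_kg = measurements.get("kg")
    if "m" in measurements:
        height_m = measurements["m"]
    elif "cm" in measurements:
        height_m = measurements["cm"] / 100  # convert cm to meters
    else:
        height_m = None
    return weight_kg, height_m


print(height_and_weight("Check BMI for 16-year-old male, 1.75m, 80kg"))
print(height_and_weight("BMI please: 80 kg and 175 cm"))
```

Because values are matched by unit, the age in "16-year-old" is ignored and the two example sentences yield the same weight and height regardless of order.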


#### BSA Verification Agent

This section demonstrates how the agent calculates Body Surface Area (BSA) using the Mosteller formula:

$$
\text{BSA} = \sqrt{\frac{\text{height (cm)} \times \text{weight (kg)}}{3600}}
$$

<br>

**Phase 1** uses structured input where the user provides values programmatically.  
**Phase 2** expands on this with natural language support to automatically detect the BSA calculation intent and extract inputs accordingly.


**Phase 1: Structured Input Agent**

This version of the BSA agent requires height in centimeters and weight in kilograms. It guides the LLM through a step-by-step execution of the Mosteller formula.


In [None]:
def run_llm_bsa_agent(weight, height_cm, client):
    """
    Structured-input version of the BSA calculator using the Mosteller formula.

    Parameters:
    - weight (float): Patient's weight in kilograms
    - height_cm (float): Patient's height in centimeters
    - client: Initialized OpenAI client object

    Returns:
    - str: LLM's step-by-step response
    """
    prompt = f"""
A patient has a weight of {weight} kilograms and a height of {height_cm} centimeters.
Use the Mosteller formula: BSA = sqrt((height × weight) / 3600)

Step 1: Multiply height by weight.
Step 2: Divide the result by 3600.
Step 3: Take the square root of the result.
Step 4: Report the BSA value in square meters (m²), rounded to two decimal places.
"""

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful medical assistant."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content


In [None]:
# Sample call for BSA Phase 1 (Structured Agent)
client = openai.OpenAI(api_key="YOUR_API_KEY_HERE")  # Insert your OpenAI API key here
bsa_result = run_llm_bsa_agent(weight=70, height_cm=175, client=client)
print(bsa_result)


Step 1: 70 kg * 175 cm = 12250

Step 2: 12250 / 3600 ≈ 3.40

Step 3: sqrt(3.40) ≈ 1.84

Step 4: Therefore, the Body Surface Area (BSA) is approximately 1.84 m².


**Phase 2: Natural Language Input Agent (NLU-Driven)**

This version of the BSA agent supports natural language input. The system detects the calculation type, extracts values from a free-form sentence, and routes them to the BSA calculator.


In [None]:
def run_llm_bsa_router(user_input, client):
    """
    Detects BSA intent and extracts height/weight from user input.
    """
    user_input = user_input.lower()
    if "bsa" in user_input or "body surface" in user_input:
        numbers = [float(n) for n in re.findall(r"\d+\.?\d*", user_input)]
        if len(numbers) >= 2:
            height = numbers[-2]
            weight = numbers[-1]
            return run_llm_bsa_agent(weight=weight, height_cm=height, client=client)
        else:
            return "Error: Missing height or weight information."
    else:
        return "Sorry, I didn't detect a BSA calculation request."


In [None]:
# Sample call for BSA Phase 2 (NLU Agent)
test_input_bsa = "calculate body surface area for 175cm and 70kg"
client = openai.OpenAI(api_key="YOUR_API_KEY_HERE")  # Insert your OpenAI API key here
bsa_nlu_result = run_llm_bsa_router(test_input_bsa, client)
print(bsa_nlu_result)


Step 1: 175.0 cm x 70.0 kg = 12250.0
Step 2: 12250.0 / 3600 = 3.4027777777777777
Step 3: √3.4027777777777777 ≈ 1.844822
Step 4: BSA = 1.84 m²

Therefore, the patient's body surface area (BSA) is approximately 1.84 square meters.


#### Dosage Verification Agent

This section focuses on dosage calculation based on weight and prescribed dosage per kilogram of body weight.

The formula used is the standard weight-based medication dosage formula:

$$
\text{Dosage (mg)} = \text{Weight (kg)} \times \text{Dosage per kg (mg/kg)}
$$

**Phase 1** uses structured inputs to calculate dosage with the LLM.  
**Phase 2** introduces natural language input, allowing the agent to extract relevant values and route the request automatically.


**Phase 1: Structured Input Agent**

This version accepts a patient’s weight and dosage per kg directly and returns a step-by-step LLM output.


In [None]:
def run_llm_dosage_agent(weight, dosage_per_kg, client):
    """
    Structured-input version of a dosage calculator.

    Parameters:
    - weight (float): Patient's weight in kilograms
    - dosage_per_kg (float): Prescribed dosage in mg/kg
    - client: OpenAI client

    Returns:
    - str: Step-by-step LLM result
    """
    prompt = f"""
A patient weighs {weight} kilograms and is prescribed a dosage of {dosage_per_kg} mg/kg.
Use the formula: Total Dosage = weight × dosage per kg.

Step 1: Multiply the weight by the dosage per kg.
Step 2: Report the total dosage in milligrams (mg), rounded to two decimal places.
"""

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful medical assistant."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content


In [None]:
# Sample call for Dosage Phase 1 (Structured Input)
client = openai.OpenAI(api_key="YOUR_API_KEY_HERE")  # Insert your OpenAI API key here
dosage_result = run_llm_dosage_agent(weight=50, dosage_per_kg=10, client=client)
print(dosage_result)


Step 1: Total Dosage = 50 kg × 10 mg/kg = 500 mg

Step 2: The total dosage is 500 mg.


**Phase 2: Natural Language Input Agent (NLU-Driven)**

This version accepts natural input like “calculate dosage for 50 kg at 10 mg/kg” and extracts the necessary values to route the request to the structured function.


In [None]:
def run_llm_dosage_router(user_input, client):
    """
    Detects dosage request and extracts numerical inputs.
    """
    user_input = user_input.lower()
    if "dose" in user_input or "dosage" in user_input or "mg/kg" in user_input:
        numbers = [float(n) for n in re.findall(r"\d+\.?\d*", user_input)]
        if len(numbers) >= 2:
            weight = numbers[-2]
            dose_per_kg = numbers[-1]
            return run_llm_dosage_agent(weight=weight, dosage_per_kg=dose_per_kg, client=client)
        else:
            return "Error: Missing dosage per kg or weight input."
    else:
        return "Sorry, this version currently supports only dosage calculations."


In [None]:
# Sample call for Dosage Phase 2 (NLU)
test_input_dose = "calculate dosage for 50kg at 10mg/kg"
client = openai.OpenAI(api_key="YOUR_API_KEY_HERE")  # Insert your OpenAI API key here
dosage_nlu_result = run_llm_dosage_router(test_input_dose, client)
print(dosage_nlu_result)


Step 1: Multiply the weight (50.0 kg) by the dosage per kg (10.0 mg/kg).
Total Dosage = 50.0 kg × 10.0 mg/kg = 500.0 mg

Step 2: The total dosage is 500.0 mg.

Therefore, the total dosage for the patient is 500.0 milligrams.


### Test Cases (Using Kaggle Dataset):

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("med_data/personalized_medication_dataset.csv")

# Check columns
print(df.columns)

# Sample 10 rows randomly for testing
test_sample = df.sample(n=10, random_state=42)

# Show the sample
test_sample


Index(['Patient_ID', 'Age', 'Gender', 'Weight_kg', 'Height_cm', 'BMI',
       'Chronic_Conditions', 'Drug_Allergies', 'Genetic_Disorders',
       'Diagnosis', 'Symptoms', 'Recommended_Medication', 'Dosage', 'Duration',
       'Treatment_Effectiveness', 'Adverse_Reactions', 'Recovery_Time_Days'],
      dtype='object')


Unnamed: 0,Patient_ID,Age,Gender,Weight_kg,Height_cm,BMI,Chronic_Conditions,Drug_Allergies,Genetic_Disorders,Diagnosis,Symptoms,Recommended_Medication,Dosage,Duration,Treatment_Effectiveness,Adverse_Reactions,Recovery_Time_Days
521,P0522,54,Female,70.4,166.4,32.6,Hypertension,,Cystic Fibrosis,Inflammation,Joint Pain,Ibuprofen,200 mg,30 days,Very Effective,Yes,21
737,P0738,47,Female,66.9,163.1,27.8,Hypertension,Sulfa,,Infection,"Cough, Fever, Fatigue",Amoxicillin,200 mg,,Ineffective,No,21
740,P0741,31,Other,114.0,165.1,21.9,Diabetes,,,Inflammation,"Headache, Fever",,400 mg,30 days,Ineffective,Yes,9
660,P0661,69,Male,82.7,163.1,23.7,Diabetes,Sulfa,Sickle Cell Anemia,Depression,Cough,Ibuprofen,200 mg,30 days,Very Effective,Yes,11
411,P0412,45,Female,70.6,199.2,27.3,Asthma,Sulfa,Sickle Cell Anemia,Hypertension,Headache,Ibuprofen,5 mg,10 days,Neutral,Yes,11
678,P0679,39,Male,106.4,199.1,20.3,Asthma,Penicillin,Cystic Fibrosis,Inflammation,"Fever, Dizziness",Ibuprofen,200 mg,7 days,Neutral,No,23
626,P0627,64,Other,95.9,178.2,27.2,Hypertension,Sulfa,,Hypertension,"Fever, Cough, Nausea",,400 mg,30 days,Neutral,No,11
513,P0514,51,Female,101.4,173.0,30.8,Asthma,Sulfa,,Inflammation,"Nausea, Fever",,500 mg,10 days,Neutral,Yes,7
859,P0860,77,Male,63.8,160.8,30.8,Diabetes,Sulfa,Cystic Fibrosis,Infection,Fever,,200 mg,30 days,Effective,No,20
136,P0137,30,Male,53.3,193.3,19.5,Diabetes,Penicillin,,Inflammation,Joint Pain,Ibuprofen,5 mg,30 days,Neutral,Yes,29


#### BMI Test Cases:

In [None]:
def compute_bmi(weight, height_m):
    return round(weight / (height_m ** 2), 2)

print("=== BMI Verification ===\n")

for idx, row in test_sample.iterrows():
    weight = row['Weight_kg']
    height_cm = row['Height_cm']
    height_m = height_cm / 100  # Convert cm to meters for BMI

    expected_bmi = compute_bmi(weight, height_m)
    llm_output = run_llm_bmi_agent(weight, height_m, client)

    print(f"Patient {idx} - Weight: {weight}kg, Height: {height_cm}cm ({height_m:.2f}m)")
    print("Expected BMI (Python):", expected_bmi)
    print("LLM Output:\n", llm_output)
    print("-" * 60)


=== BMI Verification ===

Patient 521 - Weight: 70.4kg, Height: 166.4cm (1.66m)
Expected BMI (Python): 25.43
LLM Output:
 Step 1: Square the height
1.6640000000000001 meters x 1.6640000000000001 meters =  2.7693440000000003

Step 2: Divide the weight by the squared height
BMI = 70.4 kg / 2.7693440000000003 = 25.405421091445446

Step 3: Report the final BMI value rounded to two decimal places
The patient's BMI is approximately 25.41.
------------------------------------------------------------
Patient 737 - Weight: 66.9kg, Height: 163.1cm (1.63m)
Expected BMI (Python): 25.15
LLM Output:
 Step 1: Square the height.
1.631 meters * 1.631 meters = 2.661361 square meters

Step 2: Divide the weight by the squared height.
66.9 kg / 2.661361 m^2 ≈ 25.11

Step 3: The BMI value is approximately 25.11 (rounded to two decimal places).
------------------------------------------------------------
Patient 740 - Weight: 114.0kg, Height: 165.1cm (1.65m)
Expected BMI (Python): 41.82
LLM Output:
 Step 1:


#### BSA Test Cases

In [None]:
import math

def compute_bsa(weight, height_cm):
    return round(math.sqrt((height_cm * weight) / 3600), 2)

print("=== BSA Verification ===\n")

for idx, row in test_sample.iterrows():
    weight = row['Weight_kg']
    height_cm = row['Height_cm']

    expected_bsa = compute_bsa(weight, height_cm)
    llm_output = run_llm_bsa_agent(weight, height_cm, client)

    print(f"Patient {idx} - Weight: {weight}kg, Height: {height_cm}cm")
    print("Expected BSA (Python):", expected_bsa)
    print("LLM Output:\n", llm_output)
    print("-" * 60)


=== BSA Verification ===

Patient 521 - Weight: 70.4kg, Height: 166.4cm
Expected BSA (Python): 1.8
LLM Output:
 Step 1: 70.4 kg × 166.4 cm = 11728.96  
Step 2: 11728.96 / 3600 = 3.257488888889  
Step 3: sqrt(3.257488888889) ≈ 1.805186382489  
Step 4: BSA ≈ 1.81 m²

Therefore, the patient's body surface area (BSA) is approximately 1.81 square meters.
------------------------------------------------------------
Patient 737 - Weight: 66.9kg, Height: 163.1cm
Expected BSA (Python): 1.74
LLM Output:
 Step 1: Multiply height by weight
163.1 cm x 66.9 kg = 10,900.39

Step 2: Divide the result by 3600
10,900.39 / 3600 = 3.02733

Step 3: Take the square root of the result
√3.02733 ≈ 1.73944

Step 4: Report the BSA value in square meters (m²), rounded to two decimal places
BSA = 1.74 m²

Therefore, the patient's body surface area calculated using the Mosteller formula is approximately 1.74 square meters.
------------------------------------------------------------
Patient 740 - Weight: 114.0kg, H

#### Dosage Test Cases (Assuming Standard 10 mg/kg):

In [None]:
def compute_dosage(weight, dosage_per_kg=10):
    return round(weight * dosage_per_kg, 2)

print("=== Dosage Verification ===\n")

for idx, row in test_sample.iterrows():
    weight = row['Weight_kg']
    dosage_per_kg = 10  # Assume a fixed 10 mg/kg for all patients

    expected_dosage = compute_dosage(weight, dosage_per_kg)
    llm_output = run_llm_dosage_agent(weight, dosage_per_kg, client)

    print(f"Patient {idx} - Weight: {weight}kg, Dosage per kg: {dosage_per_kg} mg/kg")
    print("Expected Total Dosage (Python):", expected_dosage, "mg")
    print("LLM Output:\n", llm_output)
    print("-" * 60)


=== Dosage Verification ===

Patient 521 - Weight: 70.4kg, Dosage per kg: 10 mg/kg
Expected Total Dosage (Python): 704.0 mg
LLM Output:
 Step 1: \(70.4 \, \text{kg} \times 10 \, \text{mg/kg} = 704 \, \text{mg}\)

Step 2: The total dosage is 704 mg.
------------------------------------------------------------
Patient 737 - Weight: 66.9kg, Dosage per kg: 10 mg/kg
Expected Total Dosage (Python): 669.0 mg
LLM Output:
 Step 1:
Total Dosage = 66.9 kg × 10 mg/kg

Total Dosage = 669 mg

Step 2:
The total dosage is 669 milligrams.
------------------------------------------------------------
Patient 740 - Weight: 114.0kg, Dosage per kg: 10 mg/kg
Expected Total Dosage (Python): 1140.0 mg
LLM Output:
 Step 1: 114.0 kg × 10 mg/kg = 1140 mg

Step 2: The total dosage is 1140 mg.
------------------------------------------------------------
Patient 660 - Weight: 82.7kg, Dosage per kg: 10 mg/kg
Expected Total Dosage (Python): 827.0 mg
LLM Output:
 Step 1: 82.7 kg x 10 mg/kg = 827 mg

Step 2: The total d

#### Percentage Error Analysis:

Note: The expected values used below were computed in the test cases above, where they appear on the "Expected BMI (Python):", "Expected BSA (Python):", and "Expected Total Dosage (Python):" lines of the output. The LLM values are the final answers reported in the corresponding LLM outputs.
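The LLM values could also be pulled out of the response text programmatically instead of being copied by hand. The sketch below is one possible approach, not the method used in this notebook: `extract_final_value` and `percent_error` are hypothetical helpers, and the code assumes the final answer is the last number written with digits in the LLM's text. That holds for the sample outputs shown above but would fail on a response ending in, for example, "rounded to 2 decimal places".

```python
import re

# Matches numbers such as "25.41", "704", or "10,900.39" (ASCII digits only,
# so characters like the superscript in "m²" are ignored).
_NUMBER = re.compile(r"[0-9][0-9,]*(?:\.[0-9]+)?")


def extract_final_value(llm_text):
    """Return the last number in the LLM's text as a float, or None if absent.

    Assumption: the final reported answer is the last number in the text.
    """
    matches = _NUMBER.findall(llm_text)
    if not matches:
        return None
    # Remove thousands separators before converting.
    return float(matches[-1].replace(",", ""))


def percent_error(expected, observed):
    """Percentage error of an observed value relative to the expected value."""
    return abs(expected - observed) / expected * 100


# Example using the closing line of the Patient 521 BMI output shown earlier.
sample_output = (
    "Step 3: Report the final BMI value rounded to two decimal places\n"
    "The patient's BMI is approximately 25.41."
)
llm_bmi = extract_final_value(sample_output)
print(llm_bmi, f"{percent_error(25.43, llm_bmi):.2f}%")
```

A value extracted this way should still be spot-checked against the raw output, since the last-number heuristic is only a convenience.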

#### BMI:

In [None]:
# Expected values from Python calculations
expected_bmi_list = [25.43, 25.15, 41.82, 31.09, 17.79, 26.84, 30.20, 33.88, 24.67, 14.26]

# LLM outputs
llm_bmi_list = [25.41, 25.11, 41.78, 31.15, 17.80, 26.85, 30.22, 33.92, 24.68, 14.26]

print("=== BMI Percentage Error Analysis ===\n")

for i, (expected, llm) in enumerate(zip(expected_bmi_list, llm_bmi_list), start=1):
    percent_error = abs(expected - llm) / expected * 100
    print(f"Patient {i}: Expected BMI = {expected}, LLM BMI = {llm}, Percentage Error = {percent_error:.2f}%")


=== BMI Percentage Error Analysis ===

Patient 1: Expected BMI = 25.43, LLM BMI = 25.41, Percentage Error = 0.08%
Patient 2: Expected BMI = 25.15, LLM BMI = 25.11, Percentage Error = 0.16%
Patient 3: Expected BMI = 41.82, LLM BMI = 41.78, Percentage Error = 0.10%
Patient 4: Expected BMI = 31.09, LLM BMI = 31.15, Percentage Error = 0.19%
Patient 5: Expected BMI = 17.79, LLM BMI = 17.8, Percentage Error = 0.06%
Patient 6: Expected BMI = 26.84, LLM BMI = 26.85, Percentage Error = 0.04%
Patient 7: Expected BMI = 30.2, LLM BMI = 30.22, Percentage Error = 0.07%
Patient 8: Expected BMI = 33.88, LLM BMI = 33.92, Percentage Error = 0.12%
Patient 9: Expected BMI = 24.67, LLM BMI = 24.68, Percentage Error = 0.04%
Patient 10: Expected BMI = 14.26, LLM BMI = 14.26, Percentage Error = 0.00%


#### BSA:

In [None]:
# Expected values from Python calculations
expected_bsa_list = [1.80, 1.74, 2.29, 1.94, 1.98, 2.43, 2.18, 2.21, 1.69, 1.69]

# LLM outputs
llm_bsa_list = [1.81, 1.74, 2.29, 1.94, 1.98, 2.43, 2.18, 2.21, 1.69, 1.69]

print("=== BSA Percentage Error Analysis ===\n")

for i, (expected, llm) in enumerate(zip(expected_bsa_list, llm_bsa_list), start=1):
    percent_error = abs(expected - llm) / expected * 100
    print(f"Patient {i}: Expected BSA = {expected}, LLM BSA = {llm}, Percentage Error = {percent_error:.2f}%")


=== BSA Percentage Error Analysis ===

Patient 1: Expected BSA = 1.8, LLM BSA = 1.81, Percentage Error = 0.56%
Patient 2: Expected BSA = 1.74, LLM BSA = 1.74, Percentage Error = 0.00%
Patient 3: Expected BSA = 2.29, LLM BSA = 2.29, Percentage Error = 0.00%
Patient 4: Expected BSA = 1.94, LLM BSA = 1.94, Percentage Error = 0.00%
Patient 5: Expected BSA = 1.98, LLM BSA = 1.98, Percentage Error = 0.00%
Patient 6: Expected BSA = 2.43, LLM BSA = 2.43, Percentage Error = 0.00%
Patient 7: Expected BSA = 2.18, LLM BSA = 2.18, Percentage Error = 0.00%
Patient 8: Expected BSA = 2.21, LLM BSA = 2.21, Percentage Error = 0.00%
Patient 9: Expected BSA = 1.69, LLM BSA = 1.69, Percentage Error = 0.00%
Patient 10: Expected BSA = 1.69, LLM BSA = 1.69, Percentage Error = 0.00%


#### Dosage:

In [None]:
# Expected values from Python calculations
expected_dosage_list = [704.0, 669.0, 1140.0, 827.0, 706.0, 1064.0, 959.0, 1014.0, 638.0, 533.0]

# LLM outputs
llm_dosage_list = [704.0, 669.0, 1140.0, 827.0, 706.0, 1064.0, 959.0, 1014.0, 638.0, 533.0]

print("=== Dosage Percentage Error Analysis ===\n")

for i, (expected, llm) in enumerate(zip(expected_dosage_list, llm_dosage_list), start=1):
    percent_error = abs(expected - llm) / expected * 100
    print(f"Patient {i}: Expected Dosage = {expected} mg, LLM Dosage = {llm} mg, Percentage Error = {percent_error:.2f}%")


=== Dosage Percentage Error Analysis ===

Patient 1: Expected Dosage = 704.0 mg, LLM Dosage = 704.0 mg, Percentage Error = 0.00%
Patient 2: Expected Dosage = 669.0 mg, LLM Dosage = 669.0 mg, Percentage Error = 0.00%
Patient 3: Expected Dosage = 1140.0 mg, LLM Dosage = 1140.0 mg, Percentage Error = 0.00%
Patient 4: Expected Dosage = 827.0 mg, LLM Dosage = 827.0 mg, Percentage Error = 0.00%
Patient 5: Expected Dosage = 706.0 mg, LLM Dosage = 706.0 mg, Percentage Error = 0.00%
Patient 6: Expected Dosage = 1064.0 mg, LLM Dosage = 1064.0 mg, Percentage Error = 0.00%
Patient 7: Expected Dosage = 959.0 mg, LLM Dosage = 959.0 mg, Percentage Error = 0.00%
Patient 8: Expected Dosage = 1014.0 mg, LLM Dosage = 1014.0 mg, Percentage Error = 0.00%
Patient 9: Expected Dosage = 638.0 mg, LLM Dosage = 638.0 mg, Percentage Error = 0.00%
Patient 10: Expected Dosage = 533.0 mg, LLM Dosage = 533.0 mg, Percentage Error = 0.00%


### Results and Error Analysis

To evaluate the performance of the Medical Calculation Verification Agent, we tested its accuracy across three key clinical calculations: Body Mass Index (BMI), Body Surface Area (BSA), and medication dosage determination. For each category, results produced by the Large Language Model (LLM) were compared against Python-calculated ground truth values using percentage error analysis.

Across a sample of ten patients per category, the following outcomes were observed:

#### BMI Verification
- All ten BMI values produced by the LLM closely matched the ground truth values, with percentage errors ranging from 0.00% to 0.19%.
- The highest observed percentage error was 0.19% (an LLM value of 31.15 against an expected 31.09), which is extremely small and clinically acceptable.
- The mean percentage error across all ten BMI calculations was approximately 0.08%, indicating high reliability for BMI verification.

#### BSA Verification
- BSA calculations exhibited exceptionally high accuracy.
- The percentage error was 0.00% for nine out of ten cases, with only one case showing a percentage error of 0.56%.
- These results demonstrate that the agent is highly reliable when applying the Mosteller formula for surface area estimation.

#### Dosage Verification
- Dosage calculations showed perfect performance across all ten test cases.
- The LLM produced dosage outputs exactly matching the ground truth, resulting in a 0.00% percentage error for every patient.
- This suggests that the agent handles straightforward unit-based multiplication reliably, at least on this ten-patient sample.

#### Overall Observations
- The Medical Calculation Verification Agent achieved strong and clinically acceptable accuracy across all three tested categories.
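- The minor discrepancies observed in the BMI and BSA calculations were all below 1%. They stem from small arithmetic slips in the LLM's intermediate steps and from rounding, not from the formulas themselves. For Patient 521, for example, the LLM reported 1.664² as 2.769344, whereas the exact value is 2.768896.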
- These findings support the feasibility of integrating LLM-powered verification agents into healthcare workflows, particularly for enhancing clinician confidence in routine calculations.
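The aggregate figures discussed in this section can be recomputed from the per-patient lists used in the percentage error analysis. The following is a minimal sketch: the lists are copied from the cells above, while the dictionary layout and the `percent_errors` helper are illustrative names introduced here.

```python
from statistics import mean

# Expected (Python) and LLM values, copied from the percentage error cells above.
expected = {
    "BMI": [25.43, 25.15, 41.82, 31.09, 17.79, 26.84, 30.20, 33.88, 24.67, 14.26],
    "BSA": [1.80, 1.74, 2.29, 1.94, 1.98, 2.43, 2.18, 2.21, 1.69, 1.69],
    "Dosage": [704.0, 669.0, 1140.0, 827.0, 706.0, 1064.0, 959.0, 1014.0, 638.0, 533.0],
}
llm = {
    "BMI": [25.41, 25.11, 41.78, 31.15, 17.80, 26.85, 30.22, 33.92, 24.68, 14.26],
    "BSA": [1.81, 1.74, 2.29, 1.94, 1.98, 2.43, 2.18, 2.21, 1.69, 1.69],
    "Dosage": [704.0, 669.0, 1140.0, 827.0, 706.0, 1064.0, 959.0, 1014.0, 638.0, 533.0],
}


def percent_errors(expected_vals, llm_vals):
    """Per-patient percentage errors relative to the expected values."""
    return [abs(e - o) / e * 100 for e, o in zip(expected_vals, llm_vals)]


summary = {}
for name in expected:
    errors = percent_errors(expected[name], llm[name])
    summary[name] = {
        "mean": mean(errors),
        "max": max(errors),
        # Number of patients whose LLM value equals the expected value exactly.
        "exact": sum(e == o for e, o in zip(expected[name], llm[name])),
    }

for name, stats in summary.items():
    print(f"{name}: mean = {stats['mean']:.2f}%, max = {stats['max']:.2f}%, "
          f"exact matches = {stats['exact']}/10")
```

Reporting the count of exact matches alongside the mean and maximum separates calculations that agree to two decimal places from those that are merely close.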

### Sensitivity Analysis Using Forward Difference Approximation

In addition to basic accuracy evaluation, this project applies the Forward Difference Approximation method to assess the sensitivity of three key medical calculations (BMI, BSA, and Dosage) to small perturbations in patient weight. The Forward Difference Approximation, a technique covered in Module F of the course, estimates the derivative of a function by introducing a small change (h) in the input and dividing the resulting change in output by h.

By applying a forward difference with small increments to patient inputs, we can quantify how sensitive each medical calculation is to slight variations in patient measurements. This helps evaluate the numerical stability of the formulas and identify whether minor input errors could cause clinically significant output variations.

This analysis complements the previous error evaluation and provides an additional mathematical validation of the Medical Calculation Verification Agent's robustness.
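As a brief sketch of the mathematics involved, using notation introduced here for illustration ($w$ for weight in kg, $H_m$ and $H_{cm}$ for height in metres and centimetres, $d$ for the dose rate in mg/kg, and $h$ for the step size), the forward difference estimates the derivative of a calculation $f$ with respect to weight as

$$
f'(w) \approx \frac{f(w + h) - f(w)}{h}
$$

For the three formulas used in this project, the exact derivatives are available in closed form and offer a check on the numerical estimates:

$$
\frac{\partial}{\partial w}\left(\frac{w}{H_m^2}\right) = \frac{1}{H_m^2}, \qquad
\frac{\partial}{\partial w}\sqrt{\frac{H_{cm}\, w}{3600}} = \frac{1}{2}\sqrt{\frac{H_{cm}}{3600\, w}}, \qquad
\frac{\partial}{\partial w}\left(d \cdot w\right) = d
$$

Because BMI and dosage are linear in weight, the forward difference reproduces their derivatives exactly (up to floating-point rounding) for any step size. Only the BSA estimate carries a truncation error, which shrinks in proportion to $h$.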

In [None]:
# Sensitivity Analysis Using Forward Difference Approximation
# (Forward Difference based on small perturbations to input variables)

# Define small perturbations
# (only h_weight is applied below; the height and dosage-rate steps are not used in this cell)
h_weight = 0.1  # kg
h_height_cm = 0.1  # cm
h_dosage = 0.1  # mg/kg

print("=== Forward Difference Sensitivity Analysis ===\n")

# -------------------
# BMI Sensitivity
# -------------------

print(">> BMI Sensitivity (Patients 521 and 737)\n")

# Patient 521
weight_521 = 70.4
height_cm_521 = 166.4
height_m_521 = height_cm_521 / 100

bmi_original_521 = weight_521 / (height_m_521 ** 2)
bmi_perturbed_521 = (weight_521 + h_weight) / (height_m_521 ** 2)
bmi_sensitivity_521 = (bmi_perturbed_521 - bmi_original_521) / h_weight

print(f"Patient 521 - BMI Sensitivity to Weight: {bmi_sensitivity_521:.4f} (per kg)")

# Patient 737
weight_737 = 66.9
height_cm_737 = 163.1
height_m_737 = height_cm_737 / 100

bmi_original_737 = weight_737 / (height_m_737 ** 2)
bmi_perturbed_737 = (weight_737 + h_weight) / (height_m_737 ** 2)
bmi_sensitivity_737 = (bmi_perturbed_737 - bmi_original_737) / h_weight

print(f"Patient 737 - BMI Sensitivity to Weight: {bmi_sensitivity_737:.4f} (per kg)\n")

# -------------------
# BSA Sensitivity
# -------------------

print(">> BSA Sensitivity (Patients 521 and 737)\n")

# Patient 521
bsa_original_521 = ((height_cm_521 * weight_521) / 3600) ** 0.5
bsa_perturbed_521 = ((height_cm_521 * (weight_521 + h_weight)) / 3600) ** 0.5
bsa_sensitivity_521 = (bsa_perturbed_521 - bsa_original_521) / h_weight

print(f"Patient 521 - BSA Sensitivity to Weight: {bsa_sensitivity_521:.4f} (per kg)")

# Patient 737
bsa_original_737 = ((height_cm_737 * weight_737) / 3600) ** 0.5
bsa_perturbed_737 = ((height_cm_737 * (weight_737 + h_weight)) / 3600) ** 0.5
bsa_sensitivity_737 = (bsa_perturbed_737 - bsa_original_737) / h_weight

print(f"Patient 737 - BSA Sensitivity to Weight: {bsa_sensitivity_737:.4f} (per kg)\n")

# -------------------
# Dosage Sensitivity
# -------------------

print(">> Dosage Sensitivity (Patients 521 and 737)\n")

# Assume dosage per kg remains constant at 10 mg/kg
dosage_per_kg = 10.0

# Patient 521
dosage_original_521 = weight_521 * dosage_per_kg
dosage_perturbed_521 = (weight_521 + h_weight) * dosage_per_kg
dosage_sensitivity_521 = (dosage_perturbed_521 - dosage_original_521) / h_weight

print(f"Patient 521 - Dosage Sensitivity to Weight: {dosage_sensitivity_521:.4f} (mg/kg)")

# Patient 737
dosage_original_737 = weight_737 * dosage_per_kg
dosage_perturbed_737 = (weight_737 + h_weight) * dosage_per_kg
dosage_sensitivity_737 = (dosage_perturbed_737 - dosage_original_737) / h_weight

print(f"Patient 737 - Dosage Sensitivity to Weight: {dosage_sensitivity_737:.4f} (mg/kg)")


=== Forward Difference Sensitivity Analysis ===

>> BMI Sensitivity (Patients 521 and 737)

Patient 521 - BMI Sensitivity to Weight: 0.3612 (per kg)
Patient 737 - BMI Sensitivity to Weight: 0.3759 (per kg)

>> BSA Sensitivity (Patients 521 and 737)

Patient 521 - BSA Sensitivity to Weight: 0.0128 (per kg)
Patient 737 - BSA Sensitivity to Weight: 0.0130 (per kg)

>> Dosage Sensitivity (Patients 521 and 737)

Patient 521 - Dosage Sensitivity to Weight: 10.0000 (mg/kg)
Patient 737 - Dosage Sensitivity to Weight: 10.0000 (mg/kg)


#### Forward Difference Sensitivity Analysis Results

To further analyze the behavior of the medical calculations with respect to slight changes in patient weight, a forward difference approximation was applied to two patients across BMI, BSA, and Dosage calculations. This method, consistent with the numerical differentiation techniques studied in Module F, measures how sensitive each output is to a small perturbation in the input weight.

The findings are summarized below:
<br>
BMI Sensitivity:

- Patient 521: BMI sensitivity to weight = 0.3612 per kg

- Patient 737: BMI sensitivity to weight = 0.3759 per kg

These results show that each additional kilogram of weight raises BMI by approximately 0.36 to 0.38 kg/m² for the tested patients. This is a moderate sensitivity: a small error in recorded weight can shift BMI by a clinically noticeable amount, particularly near clinical decision thresholds (e.g., overweight/obesity cutoffs).
<br>
BSA Sensitivity:

- Patient 521: BSA sensitivity to weight = 0.0128 per kg

- Patient 737: BSA sensitivity to weight = 0.0130 per kg

For every 1 kg change in patient weight, the BSA changes by only about 0.013 $m^2$. This suggests low sensitivity, meaning that small errors in weight measurement have minimal impact on calculated BSA — which is reassuring when dosing chemotherapy or other surface area-based treatments.
<br>
Dosage Sensitivity:

- Patient 521: Dosage sensitivity to weight = 10.0000 mg per kg

- Patient 737: Dosage sensitivity to weight = 10.0000 mg per kg

As expected, the medication dosage calculation is linear in weight, so the forward difference returns exactly the prescribed 10 mg/kg rate for both patients.
In absolute terms this is the largest of the three sensitivities, although the three values carry different units. In relative terms, a 1% error in recorded weight produces a 1% error in the dose. Accurate weight measurement is therefore critical for dosing, especially for medications with narrow therapeutic windows, where even small miscalculations can cause underdosing or toxicity.
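The three sensitivities reported above carry different units (BMI units per kg, m² per kg, and mg per kg), so their magnitudes are not directly comparable. One way to put them on a common scale is a relative sensitivity: the approximate percent change in the output per one percent change in weight. The sketch below is an illustrative extension and not part of the original analysis. `forward_difference` and `relative_sensitivity` are hypothetical helper names, and the inputs are Patient 521's values with the 10 mg/kg rate assumed throughout this paper.

```python
import math


def forward_difference(f, x, h=0.1):
    """Forward difference estimate of df/dx at x with step h."""
    return (f(x + h) - f(x)) / h


def relative_sensitivity(f, x, h=0.1):
    """Approximate percent change in f(x) per one percent change in x."""
    return forward_difference(f, x, h) * x / f(x)


# Patient 521 inputs, as used in the sensitivity analysis above.
weight_kg = 70.4
height_cm = 166.4
dose_per_kg = 10.0  # mg/kg, the fixed rate assumed in this paper

# Each formula is written as a function of weight only.
formulas = {
    "BMI": lambda w: w / (height_cm / 100) ** 2,
    "BSA": lambda w: math.sqrt(height_cm * w / 3600),
    "Dosage": lambda w: dose_per_kg * w,
}

for name, formula in formulas.items():
    rel = relative_sensitivity(formula, weight_kg)
    print(f"{name}: {rel:.3f}% change per 1% change in weight")
```

On this scale BMI and dosage respond one-for-one to a weight error, while BSA responds by about half as much because of the square root in the Mosteller formula.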

### Comparison

1. Project Performance Summary

We evaluated the agent across 30 test cases, covering Body Mass Index (BMI), Body Surface Area (BSA), and medication dosage calculations. The following outcomes were observed:

- BMI Calculations: The average percentage error across ten patients was approximately 0.08%, with all individual errors below 0.2%.

- BSA Calculations: The average percentage error was extremely low at approximately 0.06%, with only one patient showing a minor deviation (0.56%).

- Dosage Calculations: The LLM achieved perfect accuracy across all ten test cases, with a percentage error of 0.00%.

These results demonstrate that the agent performs with extremely high accuracy across all three clinical calculation types, consistently achieving sub-1% error rates.

<br>

2. Comparison to Prior Research

Recent studies by Liu et al. (2024a, 2024b) evaluated the performance of state-of-the-art LLMs, such as GPT-4o and GPT-4, on national medical licensing exams. Their findings showed:

- GPT-4o achieved an accuracy of 89.2% on the Japanese National Medical Licensing Examination (JNME).

- GPT-4 achieved a mean accuracy of 81% across a meta-analysis of 45 studies in 17 countries.

While these results are impressive for general clinical reasoning, it is important to note that those studies evaluated complex diagnostic and clinical reasoning questions from licensing exams, not structured medical calculations.

In contrast, the results of this project indicate that when applied to formula-driven medical calculations, LLMs can perform at a near-perfect level of precision, with errors consistently below 1% and often below 0.1%.

Thus, while LLMs still face challenges in broad clinical reasoning tasks, they appear highly reliable for structured arithmetic and formula-based verification, such as BMI, BSA, and dosage calculations.

<br>

3. Implications

These findings suggest that:

- LLMs, when properly guided with structured prompting, are well-suited for assisting in clinical workflows that involve formulaic calculations.

- Integration of LLMs into Electronic Medical Record (EMR) systems for real-time verification of calculations could reduce human error without introducing significant computational risk.

- There is strong potential for deploying specialized LLM-based agents in targeted clinical support roles, even if fully autonomous diagnostic reasoning remains a long-term goal.

Overall, this project highlights a narrow but impactful use case where LLM technology can immediately enhance patient safety and provider efficiency.

### Results Summary Table

| Calculation Type | Average Percent Error | Maximum Percent Error | Cases Within 1% Error (Out of 10) |
|:-----------------|:----------------------|:----------------------|:----------------------------------|
| **BMI**          | 0.08%                 | 0.19%                 | 10 / 10                           |
| **BSA**          | 0.06%                 | 0.56%                 | 10 / 10                           |
| **Dosage**       | 0.00%                 | 0.00%                 | 10 / 10                           |


### Conclusion

This project developed a Medical Calculation Verification Agent powered by a Large Language Model (LLM) to assist in verifying critical clinical calculations such as BMI, BSA, and medication dosages. The system combined structured prompt engineering, backend Python verification, and a numerical analysis method (forward difference approximation) to ensure both precision and explainability.

Results showed outstanding accuracy across all evaluated cases, with an average percent error below 0.1% for BMI and BSA calculations and zero percent error for dosage calculations. These findings indicate that LLMs, when guided through careful step-by-step reasoning, can serve as reliable tools for verifying standardized medical formulas.

The broader implication of this work is that integrating LLM-powered verification agents into healthcare workflows could improve clinical confidence, reduce computational errors, and enable more time-efficient decision-making. However, it remains critical to continue evaluating these systems across more complex, real-world medical scenarios before widespread deployment.

Future work could expand this framework to support additional types of medical calculations, integrate error-checking directly into Electronic Medical Record (EMR) systems, and apply more advanced numerical techniques to further enhance precision and reliability.


### References

Liu, M., Okuhara, T., Dai, Z., Huang, W., Okada, H., Furukawa, E., & Kiuchi, T. (2024a, July 9). Performance of advanced large language models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese Medical Licensing Examination: A comparative study. medRxiv. https://www.medrxiv.org/content/10.1101/2024.07.09.24310129v1.full

Liu, M., Okuhara, T., Chang, X., Shirabe, R., Nishiie, Y., Okada, H., & Kiuchi, T. (2024b, July 25). Performance of ChatGPT across different versions in medical licensing examinations worldwide: Systematic review and meta-analysis. Journal of Medical Internet Research. https://pmc.ncbi.nlm.nih.gov/articles/PMC11310649/

Ziya. (2025, January 3). Personalized medication dataset. Kaggle. https://www.kaggle.com/datasets/ziya07/personalized-medication-dataset/data