<a href="https://colab.research.google.com/github/jaynetra/AIForHealthCare_Mimic3/blob/main/LLMAssignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Medical Note Summarization with Gemini and Prompt Engineering

This notebook explores summarizing medical notes with Google Gemini models
and prompt engineering techniques.
Using synthetic clinical notes and an iterative refinement process,
it aims to produce concise, accurate summaries of medical records
that make patient information easier to access and understand.

**Key Features**

* **Prompt Engineering:** Employs iterative prompt refinement to guide the model towards producing accurate and informative summaries.

* **Synthetic Data:** Uses synthetic medical data from Hugging Face to develop and evaluate the prompts, so no real patient data is exposed.

**Workflow:**

1. **Data Acquisition:** Load synthetic medical data from Hugging Face Datasets.

4. **Iterative Refinement:**  Continuously evaluate the generated summaries and refine the prompts based on performance analysis.
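
The refinement step above can be sketched as a small loop that scores the summary each candidate prompt produces and keeps the best one. This is a minimal sketch under stated assumptions: `keyword_coverage` and `pick_best_prompt` are hypothetical names, the `summarize` callable stands in for a real model call, and keyword coverage is only a crude placeholder metric, not the evaluation method used in this notebook.

```python
from typing import Callable, Iterable, Tuple


def keyword_coverage(summary: str, required_terms: Iterable[str]) -> float:
    """Fraction of required terms found in the summary (case-insensitive)."""
    terms = [term.lower() for term in required_terms]
    if not terms:
        return 0.0
    text = summary.lower()
    return sum(term in text for term in terms) / len(terms)


def pick_best_prompt(
    prompts: Iterable[str],
    summarize: Callable[[str, str], str],
    note: str,
    required_terms: Iterable[str],
) -> Tuple[str, float]:
    """Summarize the note with each prompt and return (best_prompt, best_score)."""
    required = list(required_terms)
    best_prompt, best_score = None, -1.0
    for prompt in prompts:
        summary = summarize(note, prompt)  # hypothetical hook for the model call
        score = keyword_coverage(summary, required)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

In practice the score would come from a stronger signal (clinician review, or a rubric-based check), but the loop structure stays the same.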

**TODO**
* **Streamlit Integration:** Deploy the trained model and prompt store within a Streamlit application, enabling users to interact with the system.


* **Interactive Interface:** Provides a user-friendly Streamlit application for querying the model with medical notes and viewing generated summaries.

In [5]:
# pip installs
!pip install datasets
!pip install huggingface_hub

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.w

In [6]:
# Install the Google Gen AI SDK used to call the Gemini models

!pip install -U -q "google-genai==1.7.0"

In [None]:
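# Read the Google API key from Colab secrets
# (assign the result so the cell does not echo the key as its output)
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_KEY')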


In [7]:
# prompt: get some medical notes from hugging face project

from datasets import load_dataset
from google.colab import userdata

# Load the synthetic clinical notes dataset from the Hugging Face Hub
dataset = load_dataset("starmpcc/Asclepius-Synthetic-Clinical-Notes")

# To access every note at once (the column is 'note', as used below):
# medical_notes = dataset['train']['note']



README.md:   0%|          | 0.00/3.07k [00:00<?, ?B/s]

synthetic.csv:   0%|          | 0.00/402M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/158114 [00:00<?, ? examples/s]

In [8]:
# Select the first medical note for the experiments below
noteToProcess = dataset['train'][0]['note']


In [10]:
print(noteToProcess)

Discharge Summary:

Patient: 60-year-old male with moderate ARDS from COVID-19

Hospital Course:

The patient was admitted to the hospital with symptoms of fever, dry cough, and dyspnea. During physical therapy on the acute ward, the patient experienced coughing attacks that induced oxygen desaturation and dyspnea with any change of position or deep breathing. To avoid rapid deterioration and respiratory failure, a step-by-step approach was used for position changes. The breathing exercises were adapted to avoid prolonged coughing and oxygen desaturation, and with close monitoring, the patient managed to perform strength and walking exercises at a low level. Exercise progression was low initially but increased daily until hospital discharge to a rehabilitation clinic on day 10.

Clinical Outcome:

The patient was discharged on day 10 to a rehabilitation clinic making satisfactory progress with all symptoms resolved.

Follow-up:

The patient will receive follow-up care at the rehabilita

In [11]:
from google import genai
from google.genai import types

from IPython.display import HTML, Markdown, display

In [12]:
from google.api_core import retry

# Retry generate_content automatically when the API reports rate limiting (429)
# or temporary unavailability (503).
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)
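
For readers new to retry wrappers, the idea is to call a function again after a pause whenever a predicate says the error is transient. The block below is a minimal standard-library sketch of that idea, not the `google.api_core` implementation, whose backoff details differ. `call_with_retry` and its parameters are hypothetical names chosen for illustration.

```python
import time


def call_with_retry(fn, is_retriable, max_attempts=5,
                    initial_delay=1.0, multiplier=2.0, sleep=time.sleep):
    """Call fn(); on a retriable error wait, grow the delay, and try again."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or when the error is not transient.
            if attempt == max_attempts or not is_retriable(exc):
                raise
            sleep(delay)
            delay *= multiplier  # exponential backoff
```

The `sleep` parameter is injectable so the behaviour can be checked without actually waiting.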

In [13]:
client = genai.Client(api_key=userdata.get('GOOGLE_KEY'))



In [14]:
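# Zero-shot prompting: a bare instruction with no examples
# NOTE: model_config is defined here but never passed to generate_content,
# so the request below runs with the API's default settings. To apply it,
# pass config=model_config and raise max_output_tokens (5 tokens would cut
# the summary off almost immediately).
model_config = types.GenerateContentConfig(
    temperature=0.1,
    top_p=1,
    max_output_tokens=5,
)


prompt = "Summarize the clinical notes"
response = client.models.generate_content(
    model="gemini-1.5-flash",
    contents=[
        types.Part.from_bytes(
            data=noteToProcess.encode('utf-8'),
            mime_type='text/plain',
        ),
        prompt])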


print(response.text)

A 60-year-old male with moderate COVID-19-induced ARDS was hospitalized for 10 days.  He presented with fever, dry cough, and dyspnea.  Physical therapy was cautiously implemented due to coughing-induced hypoxemia.  Gradual progress was made in breathing exercises and mobility, culminating in discharge to a rehabilitation clinic where continued care and monitoring will ensure full recovery.  The patient responded well to treatment and achieved satisfactory progress.
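
The cells in this notebook repeat the same request shape: one note plus one instruction. A minimal sketch of a helper that wraps that call and forwards an optional generation config is shown below. `ask_about_note` is a hypothetical name, and the sketch assumes the SDK accepts plain strings in `contents` and a `config` keyword on `generate_content`.

```python
def ask_about_note(client, note, prompt, model="gemini-1.5-flash", config=None):
    """Send one clinical note plus one instruction and return the reply text."""
    response = client.models.generate_content(
        model=model,
        contents=[note, prompt],  # assumption: plain strings are accepted as parts
        config=config,            # None leaves the API defaults in place
    )
    return response.text
```

With the objects defined in this notebook, a call might look like `ask_about_note(client, noteToProcess, "Summarize the clinical notes", config=types.GenerateContentConfig(temperature=0.1))`. Because the client is passed in, the helper can also be exercised with a stub object instead of a live API key.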



In [15]:
# Few-shot style prompt (here with a single example, i.e. one-shot)
fs_prompt = """
Parse the notes into a valid JSON
example : json
{
  "patient": {
    "age": 60,
    "sex": "male",
    "diagnosis": "Moderate ARDS from COVID-19"
  }
}
"""
response = client.models.generate_content(
  model="gemini-1.5-flash",
  contents=[
      types.Part.from_bytes(
        data=noteToProcess.encode('utf-8'),
        mime_type='text/plain',
      ),
      fs_prompt])
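
The reply to a request like this one typically arrives as text wrapped in a Markdown `json` fence, as the printed output in this notebook shows, so it has to be unwrapped before it can be used as data. The block below is a minimal sketch of that step using only the standard library. `parse_json_reply` is a hypothetical helper name, and the fence is written as a backtick repeated three times to keep the example readable inside this document.

```python
import json
import re

# Matches an optional Markdown fence (three backticks, optional "json" tag)
# around the whole reply and captures what is inside it.
_FENCE = re.compile(r"^`{3}(?:json)?\s*\n(.*?)\n`{3}\s*$", re.DOTALL)


def parse_json_reply(text):
    """Strip a surrounding Markdown code fence, if any, and parse the JSON."""
    stripped = text.strip()
    match = _FENCE.match(stripped)
    payload = match.group(1) if match else stripped
    # Raises json.JSONDecodeError if the reply is truncated or not valid JSON.
    return json.loads(payload)
```

Catching `json.JSONDecodeError` around this call is a simple way to detect replies that were cut off or that drifted from the requested format.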


In [16]:
print(response.text)

```json
{
  "patient": {
    "age": 60,
    "sex": "male",
    "diagnosis": "Moderate ARDS from COVID-19"
  },
  "hospitalCourse": {
    "admissionSymptoms": ["fever", "dry cough", "dyspnea"],
    "physicalTherapy": {
      "challenges": "coughing attacks inducing oxygen desaturation and dyspnea with position changes or deep breathing",
      "approach": "step-by-step approach for position changes, adapted breathing exercises to avoid prolonged coughing and oxygen desaturation",
      "exercises": ["strength exercises", "walking exercises"],
      "exerciseProgression": "low initially, increased daily"
    },
    "lengthOfStay": "10 days"
  },
  "clinicalOutcome": {
    "dischargeLocation": "rehabilitation clinic",
    "progress": "satisfactory",
    "symptoms": "resolved"
  },
  "followUp": {
    "location": "rehabilitation clinic",
    "plan": "regular monitoring of progress and further rehabilitation exercises until full recovery",
    "instructions": "report any new symptoms or con
```

In [None]:
fs_prompt = "Parse the notes into a valid JSON and title it Discharge Summary"
response = client.models.generate_content(
  model="gemini-1.5-flash",
  contents=[
      types.Part.from_bytes(
        data=noteToProcess.encode('utf-8'),
        mime_type='text/plain',
      ),
      fs_prompt])
print(response.text)

```json
{
  "Discharge Summary": {
    "Patient": {
      "age": 60,
      "sex": "male",
      "diagnosis": "Moderate ARDS from COVID-19"
    },
    "Hospital Course": {
      "admissionSymptoms": ["fever", "dry cough", "dyspnea"],
      "physicalTherapyChallenges": "Coughing attacks induced oxygen desaturation and dyspnea with position changes or deep breathing.",
      "treatmentApproach": "Step-by-step approach for position changes, adapted breathing exercises to avoid prolonged coughing and oxygen desaturation, close monitoring.",
      "exerciseProgression": "Low initially, increased daily until discharge.",
      "lengthOfStay": "10 days"
    },
    "Clinical Outcome": {
      "dischargeLocation": "Rehabilitation clinic",
      "dischargeStatus": "Satisfactory progress, all symptoms resolved"
    },
    "Follow-up": {
      "location": "Rehabilitation clinic",
      "plan": "Regular monitoring of progress and further rehabilitation exercises until full recovery.",
      "instruc
```

In [None]:
# let us ask some questions
qnPrompt = 'What is the major diagnosis for the patient?'
response = client.models.generate_content(
  model="gemini-1.5-flash",
  contents=[
      types.Part.from_bytes(
        data=noteToProcess.encode('utf-8'),
        mime_type='text/plain',
      ),
      qnPrompt])
print(response.text)

The major diagnosis is **ARDS (Acute Respiratory Distress Syndrome) secondary to COVID-19**.



In [None]:
# Chain-of-thought prompt
cotPrompt = 'What is the major diagnosis for the patient? Think step by step'
response = client.models.generate_content(
  model="gemini-1.5-flash",
  contents=[
      types.Part.from_bytes(
        data=noteToProcess.encode('utf-8'),
        mime_type='text/plain',
      ),
      cotPrompt])
Markdown(response.text)

Step 1: Identify the patient's presenting symptoms.  The patient presented with fever, dry cough, and dyspnea.

Step 2: Identify the condition diagnosed. The summary explicitly states the patient had "moderate ARDS from COVID-19".

Step 3: Determine the primary diagnosis.  While the patient had COVID-19, the major diagnosis impacting his hospital stay and treatment was **moderate Acute Respiratory Distress Syndrome (ARDS)** secondary to COVID-19.  The ARDS is the primary focus of the treatment and rehabilitation described.


In [None]:
# Thinking mode: stream the answer from an experimental reasoning model
import io
from IPython.display import Markdown, clear_output


response = client.models.generate_content_stream(
    model='gemini-2.0-flash-thinking-exp',
    contents=[
      types.Part.from_bytes(
        data=noteToProcess.encode('utf-8'),
        mime_type='text/plain',
      ),
      cotPrompt])


buf = io.StringIO()
for chunk in response:
    buf.write(chunk.text)
    # Display the response as it is streamed
    print(chunk.text, end='')

# And then render the finished response as formatted markdown.
clear_output()
Markdown(buf.getvalue())


The major diagnosis for the patient is **moderate ARDS from COVID-19**.

**Explanation:**

The discharge summary explicitly states in the first line: "Patient: 60-year-old male with moderate ARDS from COVID-19".  This directly indicates the patient's primary medical condition that led to hospitalization and treatment.

* **ARDS** stands for Acute Respiratory Distress Syndrome, a severe lung condition.
* **COVID-19** specifies the cause of the ARDS, which is the viral infection.
* **Moderate** describes the severity of the ARDS.

Therefore, the major diagnosis is clearly identified as moderate ARDS resulting from COVID-19.
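
The streaming loop used for thinking mode can be factored into a small helper that collects the chunks and skips any that carry no text. This is a minimal sketch: `collect_stream` is a hypothetical name, and it assumes, as in the loop above, that each chunk exposes a `text` attribute, which may be empty or `None` for some chunks.

```python
import io


def collect_stream(chunks, on_text=None):
    """Concatenate the text of streamed chunks, ignoring chunks without text."""
    buf = io.StringIO()
    for chunk in chunks:
        text = getattr(chunk, "text", None)
        if not text:
            continue  # nothing to show for this chunk
        buf.write(text)
        if on_text is not None:
            on_text(text)  # e.g. print the partial output as it arrives
    return buf.getvalue()
```

Passing `on_text=lambda t: print(t, end='')` reproduces the live printing, and the returned string can then be rendered with `Markdown(...)`.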

In [17]:
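# Experiment: tree of thought
# Ask Gemini to explain what a tree of thought is.
# NOTE: the clinical note is attached to this request too, so the model
# answers about the note instead of the concept (see the output below).
# Send the prompt on its own to get the conceptual explanation.
prompt = "Explain what is a tree of thought for a beginner learning LLMs"
response = client.models.generate_content(
    model="gemini-1.5-flash",
    contents=[
        types.Part.from_bytes(
            data=noteToProcess.encode('utf-8'),
            mime_type='text/plain',
        ),
        prompt])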

In [18]:
print(response.text)

The discharge summary describes a successful treatment of a 60-year-old male with moderate ARDS (Acute Respiratory Distress Syndrome) caused by COVID-19.  The key points are:

* **Severe Illness:** The patient presented with classic COVID-19 symptoms and developed ARDS, a serious lung condition requiring significant respiratory support.
* **Cautious Rehabilitation:** Due to the risk of further respiratory distress, his physical therapy was carefully managed, progressing slowly from minimal exercise to more strenuous activity as tolerated.  The focus was on preventing further oxygen desaturation and respiratory distress.
* **Successful Rehabilitation:**  Despite the initial severity, the patient made satisfactory progress and was discharged to a rehabilitation clinic after 10 days.
* **Ongoing Care:**  His recovery will continue at the rehabilitation clinic with ongoing monitoring and therapy.

This summary highlights the importance of a multidisciplinary approach to managing severe COV

In [None]:
# explore some questions with reasoning
prompt = 'What is the major diagnosis for the patient? Use tree of thought to conclude and tell how you arrived'
response = client.models.generate_content(
  model="gemini-1.5-flash",
  contents=[
      types.Part.from_bytes(
        data=noteToProcess.encode('utf-8'),
        mime_type='text/plain',
      ),
      prompt])
Markdown(response.text)

**Tree of Thought to Determine Major Diagnosis:**

1. **Initial Symptoms:** The discharge summary highlights fever, dry cough, and dyspnea (shortness of breath) as the initial presentation.  These are common symptoms of various respiratory illnesses, including COVID-19.

2. **Diagnosis Confirmation (Implied):**  The summary states the patient had "moderate ARDS from COVID-19."  This implies that COVID-19 was confirmed through testing (e.g., PCR test) and that the ARDS (Acute Respiratory Distress Syndrome) was a direct consequence of the infection.

3. **ARDS as the Complication:** ARDS is a severe lung injury leading to respiratory failure.  While the patient initially presented with COVID-19 symptoms, the ARDS significantly impacted his hospital course and treatment. The focus of the hospital stay was managing the ARDS.

4. **Hospital Course Focused on ARDS Management:** The description of the patient's physical therapy, focusing on carefully managed positioning and breathing exercises to prevent oxygen desaturation, directly points to the management of ARDS.  The slow progression of exercises underscores the severity of the lung condition.

5. **Discharge to Rehabilitation:**  The discharge to a rehabilitation clinic, rather than directly home, indicates the ongoing need for respiratory and physical rehabilitation to address the lingering effects of ARDS.

**Conclusion:**

Therefore, while the underlying cause was COVID-19 infection, the **major diagnosis** during the hospital stay and the primary reason for admission was **moderate ARDS secondary to COVID-19**.  The COVID-19 infection itself is a contributing factor but not the focus of treatment and outcome description in this summary.  ARDS, with its complexities and requiring specialized management, is the central diagnostic feature highlighted throughout the discharge summary.


In [19]:
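# Ask for the difference between chain of thought and tree of thought, using the attached note as the example.
# The model only sees the note text sent in `contents`; the variable name mentioned in the prompt means nothing to it.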
prompt = 'What is the difference between chain of thought and tree of thought and use the data in notetoprocess variable to explain'
response = client.models.generate_content(
  model="gemini-1.5-flash",
  contents=[
      types.Part.from_bytes(
        data=noteToProcess.encode('utf-8'),
        mime_type='text/plain',
      ),
      prompt])
Markdown(response.text)

The provided discharge summary doesn't offer any information related to "chain of thought" or "tree of thought" reasoning processes.  These are concepts from the field of artificial intelligence, specifically in large language models (LLMs), and relate to how the models generate text or solve problems. They are not medical concepts.

**Chain of thought prompting:** encourages the model to break down a complex problem into a sequence of intermediate reasoning steps before arriving at a final answer.  It's linear.

**Tree of thought prompting:** extends chain of thought by allowing the model to explore multiple reasoning paths simultaneously, similar to branching out in a tree. It considers different possible solutions or intermediate steps in parallel.


The discharge summary details a patient's journey through ARDS treatment and recovery.  There's no application of these AI reasoning methods within the medical information given.  To illustrate the difference using a *hypothetical* example relevant to the patient's case:

**Chain of Thought (example):**

* **Problem:** Patient has low oxygen saturation during position changes.
* **Step 1:** Assess the patient's tolerance to movement.
* **Step 2:** Start with minimal position changes.
* **Step 3:** Gradually increase the range and frequency of position changes as tolerated.
* **Step 4:** Monitor oxygen saturation continuously.
* **Step 5:** Adjust the therapy based on the patient's response.


**Tree of Thought (example):**

The problem remains the same: low oxygen saturation during position changes.  But instead of a linear approach, a tree of thought might explore different interventions concurrently:


* **Branch 1:**  Minimal position changes + supplemental oxygen.
* **Branch 2:**  Positioning techniques + breathing exercises.
* **Branch 3:**  Medication to reduce coughing + gradual position changes.

The model would then evaluate the effectiveness of each branch and potentially combine successful strategies.  This is a more exploratory and potentially more efficient problem-solving approach, especially when dealing with uncertainty (which is common in medical situations).

In short, the discharge summary contains clinical data;  chain of thought and tree of thought are AI methodologies that aren't directly present in that medical record.
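
The contrast described in the model's answer can be made concrete with a toy sketch: a chain applies steps along one path, while a tree expands several candidate paths, scores them, and keeps only the most promising. This is a simplified illustration, not how Gemini reasons internally. `chain_of_thought`, `tree_of_thought`, `expand`, `score`, and `beam_width` are hypothetical names, and any scores fed to it are placeholders rather than clinical evidence.

```python
def chain_of_thought(steps, state):
    """Apply reasoning steps one after another along a single path."""
    for step in steps:
        state = step(state)
    return state


def tree_of_thought(root, expand, score, depth, beam_width=2):
    """Explore several paths level by level, keeping the best few each time."""
    frontier = [root]
    for _ in range(depth):
        # Branch: every surviving path proposes its possible next steps.
        candidates = [child for node in frontier for child in expand(node)]
        if not candidates:
            break
        # Evaluate and prune: keep only the highest-scoring partial paths.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)
```

With an LLM, `expand` would ask the model for alternative next thoughts and `score` would ask it (or a separate check) to rate each partial path. In this sketch both are ordinary Python callables.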
