## Quick start
- Run the `pip install` cell.
- Get a free API key from [Google AI Studio](https://aistudio.google.com/app/apikey).
- Set your `GOOGLE_API_KEY` in the code below.

In [None]:
# Install dependencies
%pip install -q -U google-generativeai pandas


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [None]:
import os
from io import StringIO
import pandas as pd
import google.generativeai as genai

# Setup Gemini API
# Get your free key: https://aistudio.google.com/app/apikey
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY") or "PASTE_YOUR_KEY_HERE"

if GOOGLE_API_KEY == "PASTE_YOUR_KEY_HERE":
    print("⚠️ Please paste your Google API Key in the variable above.")
else:
    genai.configure(api_key=GOOGLE_API_KEY)
    print("✅ Gemini API configured")

✅ Gemini API configured


## Create mock medical reports
We define two sample radiology reports containing protected health information (PHI). The reports are deliberately long and include subtle identifiers, such as employer names, device IDs, and specific dates and locations, alongside the obvious ones.
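
To illustrate why subtle identifiers are harder to catch than labeled ones, here is a minimal sketch of a regex baseline. The patterns and the `sample` text are assumptions made up for illustration, not a complete PHI detector and not part of the notebook's pipeline. It finds the labeled ID and the numeric date but has no way to recognize an employer name.

```python
import re

# Hypothetical patterns for illustration only; real PHI detection needs far more.
PATTERNS = {
    "date": re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b"),
    "labeled_id": re.compile(r"\b(?:MRN|PATIENT ID|ACCESSION):\s*([\w-]+)"),
}

# Short made-up excerpt in the style of the reports below.
sample = (
    "PATIENT ID: 8822119\n"
    "HISTORY: 54-year-old male, software engineer at Oracle. "
    "Results communicated on 11/14/2024."
)


def find_explicit_identifiers(text):
    """Return (label, value) pairs for every pattern match in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the captured group when the pattern has one, else the whole match.
            hits.append((label, match.group(match.lastindex or 0)))
    return hits


hits = find_explicit_identifiers(sample)
print(hits)
```

The employer ("Oracle") never shows up in `hits`, which is the kind of gap the model-based approach in this notebook is meant to cover.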

In [7]:
reports = [
    """
EXAMINATION: MRI BRAIN WITHOUT AND WITH CONTRAST

ACCESSION: 23910-2231
DATE: November 14, 2024
PATIENT ID: 8822119

HISTORY: 54-year-old male, software engineer at Oracle, presenting with chronic headaches and visual disturbances. Patient states headaches began after returning from a trip to Cabo San Lucas last month.

TECHNIQUE: Multiplanar multisequence imaging of the brain was performed before and after the administration of 15 mL of Dotarem.

COMPARISON: CT Head from October 12, 2024, performed at Mercy General.

FINDINGS:
There is a 2.3 x 1.8 cm enhancing mass within the right frontal lobe with surrounding vasogenic edema. Mass effect is noted with 4 mm of midline shift to the left. The ventricles are otherwise symmetric. No acute hemorrhage or territorial infarct. 
The sella and pituitary appear normal. The major intracranial flow voids are preserved. 
Paranasal sinuses are clear. Mastoid air cells are well aerated.
Incidental note is made of a small venous angioma in the left cerebellar hemisphere, unchanged from prior MRI in 2019.

IMPRESSION:
1. Right frontal lobe mass, suspicious for high-grade glioma.
2. Mass effect with midline shift.
3. Stable left cerebellar venous angioma.

RECOMMENDATION: Neurosurgical consultation. Dr. P. Sharma has been notified of these results at 14:30 on 11/14/2024.

Report generated by: VoiceDictation v4.2 (Device ID: VD-992)
""",

"""
PROCEDURE: CT ABDOMEN AND PELVIS WITH CONTRAST

MRN: 445-22-99
DOB: 02/14/1980
DATE: 2024-11-20

INDICATION: Right lower quadrant pain. Rule out appendicitis. Patient is a 44-year-old female.

FINDINGS:
LOWER CHEST: Visualized lung bases are clear.
LIVER: Normal in size and contour. No focal lesions.
GALLBLADDER: Surgically absent. Surgical clips are present in the RUQ.
SPLEEN: Normal.
PANCREAS: Normal.
KIDNEYS: Bilateral non-obstructing renal calculi. The largest on the right measures 4mm.
ADRENALS: Normal.
BOWEL: The appendix is dilated to 1.2 cm with surrounding fat stranding and a calcified appendicolith. No free air or fluid.
BLADDER: Distended, wall is smooth.
UTERUS/ADNEXA: IUD in place. Trace free fluid in the cul-de-sac.

BONES: No aggressive osseous lesions. Mild degenerative changes in the lumbar spine.
SOFT TISSUES: Unremarkable.

IMPRESSION:
1. Acute appendicitis with appendicolith.
2. Bilateral non-obstructing renal stones.
3. Status post cholecystectomy.

Electronically signed by: Sarah Miller, MD on 11/20/2024 10:15 AM
Transcribed by: T. Johnson
"""
]

for i, report in enumerate(reports):
    print(f"--- Report {i+1} ---")
    print(report)
    print("\n")

--- Report 1 ---

EXAMINATION: MRI BRAIN WITHOUT AND WITH CONTRAST

ACCESSION: 23910-2231
DATE: November 14, 2024
PATIENT ID: 8822119

HISTORY: 54-year-old male, software engineer at Oracle, presenting with chronic headaches and visual disturbances. Patient states headaches began after returning from a trip to Cabo San Lucas last month.

TECHNIQUE: Multiplanar multisequence imaging of the brain was performed before and after the administration of 15 mL of Dotarem.

COMPARISON: CT Head from October 12, 2024, performed at Mercy General.

FINDINGS:
There is a 2.3 x 1.8 cm enhancing mass within the right frontal lobe with surrounding vasogenic edema. Mass effect is noted with 4 mm of midline shift to the left. The ventricles are otherwise symmetric. No acute hemorrhage or territorial infarct. 
The sella and pituitary appear normal. The major intracranial flow voids are preserved. 
Paranasal sinuses are clear. Mastoid air cells are well aerated.
Incidental note is made of a small venous ang

## Task 1: Identify PHI
We ask the model to analyze each radiology report and list every piece of protected health information (PHI) it finds.

**Prompt:** "Analyze the radiology report and give me a list of all the personal health information."

In [8]:
model = genai.GenerativeModel('models/gemini-flash-lite-latest')
# Alternatively, use a larger preview model
# (availability on the free tier is limited):
# model = genai.GenerativeModel('models/gemini-3-pro-preview')

prompt_identify = """
Analyze the radiology report and give me a list of all the personal health information (PHI) that should be anonymized. 
This includes names, dates, locations, IDs, and any other identifiable information. 
This does not include medical information or findings.

Format the response as a markdown table with two columns: 'Type of Information' and 'Details'.
For example:

| Type of Information | Details    |
|---------------------|------------|
| Name                | John Doe   |
| Date of Birth       | 01/01/1970 |


"""
# One model response (a markdown PHI table) per report, in the same order as `reports`
processed_reports = []

for report in reports:
    response = model.generate_content(f"{prompt_identify}\n\nReport:\n{report}")
    processed_reports.append(response.text)

In [9]:
def extract_table_from_response(response_text):
    """Return the first contiguous block of lines that start with '|'."""
    lines = response_text.split('\n')
    table_lines = []
    in_table = False
    for line in lines:
        if line.strip().startswith('|'):
            in_table = True
            table_lines.append(line)
        elif in_table:
            # The first non-table line after the table ends the block
            break
    return '\n'.join(table_lines)
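
As a quick check of the extraction logic, the sketch below runs the same function on a hypothetical model reply that wraps the table in extra commentary. The `sample_response` text is made up for illustration; real replies will vary.

```python
def extract_table_from_response(response_text):
    """Return the first contiguous block of lines that start with '|'."""
    table_lines = []
    in_table = False
    for line in response_text.split("\n"):
        if line.strip().startswith("|"):
            in_table = True
            table_lines.append(line)
        elif in_table:
            break
    return "\n".join(table_lines)


# Hypothetical reply with text before and after the table.
sample_response = """Here is the PHI I found:

| Type of Information | Details    |
|---------------------|------------|
| Name                | John Doe   |

Let me know if you need anything else."""

table = extract_table_from_response(sample_response)
print(table)
```

Only the three pipe-delimited lines are kept. If a reply contained a second table after a blank line, this function would ignore it, since it stops at the first non-table line.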

In [10]:
def convert_markdown_table_to_dataframe(markdown_table):
    """Parse a markdown pipe table into a DataFrame with stripped headers and cells."""
    # If the response had no table, read_csv would raise on the empty string
    if not markdown_table.strip():
        return pd.DataFrame()

    table_io = StringIO(markdown_table)
    df = pd.read_csv(table_io, sep='|', skipinitialspace=True, engine='python')
    
    # Drop empty columns created by leading/trailing pipes
    df = df.dropna(axis=1, how='all')

    # Remove the separator row (e.g., "---|---|---")
    # We check if the first row contains dashes
    if len(df) > 0 and df.iloc[0].astype(str).str.contains('---').any():
        df = df.iloc[1:]

    # Clean up whitespace in column names and data
    df.columns = df.columns.str.strip()
    df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

    # Reset the index
    df = df.reset_index(drop=True)
    
    return df
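
For comparison, here is a minimal dependency-free sketch that parses the same kind of table by splitting on pipes instead of going through `pd.read_csv`. This is an alternative technique, not the notebook's method; `parse_markdown_table` is a hypothetical helper, and it assumes no cell contains a literal `|`.

```python
def parse_markdown_table(markdown_table):
    """Parse a simple pipe table into a list of dicts, one per data row."""
    rows = []
    for line in markdown_table.strip().splitlines():
        # Drop the outer pipes, then split the remaining text into cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]
    # Skip the separator row, whose cells contain only dashes and colons.
    body = [r for r in body if not all(c and set(c) <= set("-:") for c in r)]
    return [dict(zip(header, r)) for r in body]


# Table in the same shape as the example in the prompt above.
sample_table = """| Type of Information | Details    |
|---------------------|------------|
| Name                | John Doe   |
| Date of Birth       | 01/01/1970 |"""

records = parse_markdown_table(sample_table)
print(records)
```

The resulting list of dicts can be passed to `pd.DataFrame(records)` if a DataFrame is still wanted for display.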

In [11]:
for report in processed_reports:
    table_md = extract_table_from_response(report)
    df = convert_markdown_table_to_dataframe(table_md)
    display(df)

|   | Type of Information | Details |
|---|---------------------|---------|
| 0 | Accession Number | 23910-2231 |
| 1 | Date of Examination | November 14, 2024 |
| 2 | Patient ID | 8822119 |
| 3 | Age | 54-year-old (Age is generally considered PHI i... |
| 4 | Occupation | software engineer |
| 5 | Employer Name | Oracle |
| 6 | Travel Location | Cabo San Lucas |
| 7 | Date of Prior Exam | October 12, 2024 |
| 8 | Facility Name | Mercy General |
| 9 | Date of Prior MRI | 2019 (Year only, but linked to a specific find... |


|   | Type of Information | Details |
|---|---------------------|---------|
| 0 | Medical Record Number (MRN) | 445-22-99 |
| 1 | Date of Birth (DOB) | 02/14/1980 |
| 2 | Examination Date | 2024-11-20 |
| 3 | Age/Demographic Info | 44-year-old female (often considered PHI in co... |
| 4 | Physician Name | Sarah Miller, MD |
| 5 | Signature Date/Time | 11/20/2024 10:15 AM |
| 6 | Transcriber Name | T. Johnson |


## Task 2: Anonymize the Report
Now we ask the model to rewrite each report with the PHI removed or replaced. The PHI table produced in Task 1 is included in the prompt so the model knows which items to redact.

**Prompt:** "Anonymize all the potential personal health information on the radiology report."
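
Once an anonymized report is available, a simple sanity check is to confirm that none of the values from the Task 1 `Details` column still appear verbatim. The sketch below is an assumption about how such a check could look; `find_leaked_values` is a hypothetical helper and the sample strings are made up. It only catches exact (case-insensitive) matches, so reworded or partially redacted identifiers would slip through.

```python
def find_leaked_values(anonymized_text, phi_values):
    """Return the PHI values that still appear verbatim in the text."""
    lowered = anonymized_text.lower()
    # Skip empty strings, which would trivially match any text.
    return [v for v in phi_values if v and v.lower() in lowered]


# Hypothetical values, as they might appear in the 'Details' column.
phi_values = ["8822119", "Oracle", "Cabo San Lucas"]

# Hypothetical anonymized output where one identifier was missed.
anonymized = (
    "PATIENT ID: [ANONYMIZED ID]\n"
    "HISTORY: [ANONYMIZED AGE] male, [ANONYMIZED OCCUPATION] at Oracle."
)

leaked = find_leaked_values(anonymized, phi_values)
print(leaked)
```

An empty list means no listed value survived word for word; it does not prove the report is fully de-identified.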

In [12]:
prompt_anonymize = """

Anonymize all the potential personal health information on the radiology report. 

Here is the list of PHI to anonymize:
%s

Only return the anonymized report without any additional commentary.
"""

for report, processed_report in zip(reports, processed_reports):
    print("--- Report Anonymization ---")
    _prompt = prompt_anonymize % processed_report
    response = model.generate_content(f"{_prompt}\n\nReport:\n{report}")
    print(response.text)
    print("\n" + "="*50 + "\n")

--- Report Anonymization ---
EXAMINATION: MRI BRAIN WITHOUT AND WITH CONTRAST

ACCESSION: [ANONYMIZED ACCESSION NUMBER]
DATE: [ANONYMIZED DATE]
PATIENT ID: [ANONYMIZED ID]

HISTORY: [ANONYMIZED AGE] male, [ANONYMIZED OCCUPATION] at [ANONYMIZED EMPLOYER], presenting with chronic headaches and visual disturbances. Patient states headaches began after returning from a trip to [ANONYMIZED LOCATION] last month.

TECHNIQUE: Multiplanar multisequence imaging of the brain was performed before and after the administration of 15 mL of Dotarem.

COMPARISON: CT Head from [ANONYMIZED PRIOR DATE], performed at [ANONYMIZED FACILITY].

FINDINGS:
There is a 2.3 x 1.8 cm enhancing mass within the right frontal lobe with surrounding vasogenic edema. Mass effect is noted with 4 mm of midline shift to the left. The ventricles are otherwise symmetric. No acute hemorrhage or territorial infarct. 
The sella and pituitary appear normal. The major intracranial flow voids are preserved. 
Paranasal sinuses are cl