<a href="https://colab.research.google.com/github/ekrombouts/GenCareAI/blob/main/scripts/data_generation/100_GenerateCareReportsColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GenCare AI: Synthetic Healthcare Data Generation

**Author:** Isabel Ronan (original work by [Eva Rombouts](https://github.com/ekrombouts))  
**Date:** 2024-06-03  
**Updated:** 2024-06-03  
**Version:** 1.0

### Description
This notebook generates synthetic healthcare data for NLP experiments.  
It uses OpenAI's models to create realistic, fictitious nurse notes that mimic real-world client records from nursing homes, for use in machine learning and data analysis.

### Setup
- Ensure your [OpenAI API key](https://platform.openai.com/docs/quickstart?context=python) is available: in Colab, store it as a secret named `open_ai_key`; elsewhere, set it in the environment or a configuration file.
- Google Drive must be mounted for file operations if running in Colab.
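
Outside Colab the key has to come from somewhere else. The following is a minimal sketch, assuming the key is stored in an environment variable called `OPENAI_API_KEY`. The helper name `load_api_key` and the environment variable name are illustrative and not part of the original notebook; `open_ai_key` is the Colab secret name the notebook itself uses.

```python
import os


def load_api_key(secret_name="open_ai_key", env_var="OPENAI_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(secret_name)
            if key:
                return key
        except Exception:
            # Secret missing or notebook access not granted: fall back to the environment.
            pass

    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key found in Colab secrets or in ${env_var}")
    return key
```

Raising early when no key is found makes a misconfigured runtime fail before any requests are attempted.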

In [None]:
# installation needed for Colab
!pip install -q openai

In [None]:
import pandas as pd
import os
from openai import OpenAI
import re

In [None]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('open_ai_key')  # Set the key in the Secrets panel in Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Constants / variables
SEED = 1
MODEL = 'gpt-3.5-turbo-0125'
TEMPERATURE = 1.1

NUM_COMPLETIONS = 100 # Number of completions per query

DATA_DIR = '/content/drive/MyDrive/Colab Notebooks/GenCareAI/GenCareAI/data'
FN_REPORTS = os.path.join(DATA_DIR, 'TEST_care_reports.csv')

In [None]:
# System role: describes the assistant's behavior in a specific context.
# This role has been modified for an Irish nursing home context.
SYSTEM_ROLE = '''
You are a specialist in generating fictitious data for natural language processing projects in healthcare.

You speak the language of a nurse in an Irish nursing home. Namely, you speak Hiberno-English.
'''

In [None]:
# Dictionary of user roles for different topics of the reports
USER_ROLES = {
        'unmet_notes':
        '''

This is an example of a nurse note for a patient in a day:
"John was assisted with a shower this morning. He had his lunch in the canteen, he ate half a bowl of soup, half a chicken salad and a full bowl of ice cream along with 2 cups of tea. Johns sister Mary came to see him today, he was in good form. He experienced some nausea in the evening and was given Zofran (anti emetic) with good relief from same. John was given his regular analgesia and has no complaints of pain with same. John’s sister has expressed her concern that his mobility is reduced and an appointment with the GP has been scheduled for tomorrow."

Other reports may include: washing, dressing, brushing teeth, getting ready for the day, getting ready for the night, showering, cleaning dental prostheses, or assistance after incontinence.
Other reports could include: what the client has or has not eaten, what help is needed with eating (full help, encouragement, adapted cutlery or cup), choking, keeping hydration and nutrition lists.
Other reports could include: Organised activities, getting visitors, browsing through a magazine, interacting with fellow residents. Keep in mind that these are reports from people in a nursing home, with severe disabilities, so social interaction and activities are limited. Usually it involves sociability, but not always.
Other reports may include, for example: oedema, pressure ulcers, peeling, redness and itching of the skin. Nails that are too long, blemishes.
Other reports could include, for example: care plan discussions, minor medical complaints, family requests, ordering medication.
Reports can be, for example, about: restlessness and wandering at night, sleeping well, going to the toilet at night, phoning, lying crookedly in bed.
Reports may include: agitation, restlessness, apathy, confusion; usually the confusion is subtle, but sometimes more intense.
Reports may include, for example: pain, tightness of breath, nausea, diarrhoea, back pain, palliative care; usually the complaints are subtle, but sometimes more severe.
Other reports can be about, for example: walking aids, the wheelchair, falls, fall incidents, transfers, lifts.
Most reports are about everyday things, so not everything is a serious incident.

Make up 25 such reports for 25 residents with unmet palliative care needs. Return only the reports separated by "\n---". Vary the sentence structure and style.

        ''',
        'met_notes':
        '''

This is an example of a nurse note for a patient in a day:
"John was assisted with a shower this morning. He had his lunch in the canteen, he ate half a bowl of soup, half a chicken salad and a full bowl of ice cream along with 2 cups of tea. Johns sister Mary came to see him today, he was in good form. He experienced some nausea in the evening and was given Zofran (anti emetic) with good relief from same. John was given his regular analgesia and has no complaints of pain with same. John’s sister has expressed her concern that his mobility is reduced and an appointment with the GP has been scheduled for tomorrow."

Other reports may include: washing, dressing, brushing teeth, getting ready for the day, getting ready for the night, showering, cleaning dental prostheses, or assistance after incontinence.
Other reports could include: what the client has or has not eaten, what help is needed with eating (full help, encouragement, adapted cutlery or cup), choking, keeping hydration and nutrition lists.
Other reports could include: Organised activities, getting visitors, browsing through a magazine, interacting with fellow residents. Keep in mind that these are reports from people in a nursing home, with severe disabilities, so social interaction and activities are limited. Usually it involves sociability, but not always.
Other reports may include, for example: oedema, pressure ulcers, peeling, redness and itching of the skin. Nails that are too long, blemishes.
Other reports could include, for example: care plan discussions, minor medical complaints, family requests, ordering medication.
Reports can be, for example, about: restlessness and wandering at night, sleeping well, going to the toilet at night, phoning, lying crookedly in bed.
Reports may include: agitation, restlessness, apathy, confusion; usually the confusion is subtle, but sometimes more intense.
Reports may include, for example: pain, tightness of breath, nausea, diarrhoea, back pain, palliative care; usually the complaints are subtle, but sometimes more severe.
Other reports can be about, for example: walking aids, the wheelchair, falls, fall incidents, transfers, lifts.
Most reports are about everyday things, so not everything is a serious incident.

Make up 25 such reports for 25 residents with met palliative care needs. Return only the reports separated by "\n---". Vary the sentence structure and style.
        '''
}
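
The two prompts above are identical except for the word "unmet" versus "met" in the final instruction. As a sketch of how that duplication could be avoided, the dictionary can be built from one shared guidance text plus a templated instruction. `INSTRUCTION_TEMPLATE` and `build_user_roles` are hypothetical names introduced here, not part of the original notebook.

```python
# Shared closing instruction; {status} is the only part that differs between topics.
# As in the original prompt strings, "\n" here is a real newline character.
INSTRUCTION_TEMPLATE = (
    'Make up {n} such reports for {n} residents with {status} palliative care needs. '
    'Return only the reports separated by "\n---". Vary the sentence structure and style.'
)


def build_user_roles(shared_guidance, statuses=("unmet", "met"), n=25):
    """Return a dict shaped like USER_ROLES, keyed '<status>_notes'."""
    return {
        f"{status}_notes": (
            shared_guidance.strip()
            + "\n\n"
            + INSTRUCTION_TEMPLATE.format(n=n, status=status)
        )
        for status in statuses
    }


guidance = "Most reports are about everyday things, so not everything is a serious incident."
roles = build_user_roles(guidance)
print(list(roles))  # ['unmet_notes', 'met_notes']
```

With this shape, adding a third condition or changing the number of reports per completion is a one-argument change rather than a copy of the whole prompt.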


In [None]:
def setup_openai_client():
    """Create an OpenAI client instance."""
    try:
        client = OpenAI(api_key=OPENAI_API_KEY)
        return client
    except Exception as e:
        print(f"Failed to set up OpenAI client: {str(e)}")
        return None

In [None]:
def generate_healthcare_data(client, system_role, user_role, model, seed=None, temperature=1, n=1):
    """Generate synthetic healthcare data using the OpenAI API."""
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_role},
                {"role": "user", "content": user_role}
            ],
            seed=seed,
            temperature=temperature,
            n=n
        )
        return completion
    except Exception as e:
        print(f"An error occurred during data generation: {str(e)}")
        return None
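
The function above reports an error once and gives up. For transient failures such as rate limits or timeouts, a retry with increasing delays is a common alternative. The following is a generic sketch, not tied to any OpenAI-specific exception type; `call_with_retries` and its parameters are hypothetical names.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it returns without raising, waiting longer after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as e:  # broad on purpose, mirroring the function above
            last_error = e
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Because `generate_healthcare_data` catches exceptions itself and returns `None`, the wrapper would need to go around the raw `client.chat.completions.create(...)` call (for example via a `lambda`) to see the failure.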

In [None]:
def parse_and_clean_reports(completions):
    """Parse and clean reports from OpenAI completions."""
    reports = []
    for choice in completions.choices:
        # Each completion holds several reports separated by "\n---"
        for chunk in choice.message.content.split('\n---'):
            # Reports are usually wrapped in double quotes; splitting on them
            # separates the report text from labels such as 'Report 1: '
            for fragment in chunk.split('"'):
                # Drop fragments with a colon (labels) and fragments without letters
                if ':' not in fragment and re.search('[a-zA-Z]', fragment):
                    reports.append(fragment)

    return reports
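
To see what the parsing rules keep and drop without calling the API, they can be run against a stand-in object. This sketch restates the same rules under the hypothetical name `parse_reports_sketch` and builds a fake completion with `SimpleNamespace`; the sample texts are invented for illustration.

```python
import re
from types import SimpleNamespace


def parse_reports_sketch(completion):
    """Same filtering rules as parse_and_clean_reports, written flat."""
    reports = []
    for choice in completion.choices:
        for chunk in choice.message.content.split('\n---'):
            for fragment in chunk.split('"'):
                if ':' not in fragment and re.search('[a-zA-Z]', fragment):
                    reports.append(fragment)
    return reports


def fake_completion(*contents):
    """Build a stand-in object with the .choices[i].message.content shape."""
    return SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(content=c)) for c in contents]
    )


sample = fake_completion(
    'Report 1: "Mary slept well."\n---\n"Tom ate half his dinner."',
    '"Visit planned for 14:00 tomorrow."',
)
print(parse_reports_sketch(sample))
# ['Mary slept well.', 'Tom ate half his dinner.']
```

The label `Report 1: ` is dropped as intended, but so is the third note, because it contains a colon in `14:00`. Any report that mentions a time or uses a colon elsewhere is filtered out by the same rule.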

In [None]:
def save_reports(df, filename):
    """Save the DataFrame to a CSV file and return it for further handling."""
    try:
        df.to_csv(filename, index=False)
        print(f"Reports saved successfully to {filename}")
    except Exception as e:
        print(f"Failed to save reports: {str(e)}")
    return df  # Returned in both cases so the caller can keep working with it

In [None]:
def generate_and_collect_reports():
    """Main function to orchestrate data generation."""
    client = setup_openai_client()
    if not client:
        print("OpenAI client setup failed. Exiting...")
        return []  # Keep the return type consistent with the success path

    all_reports = []
    for topic, user_role in USER_ROLES.items():
        print(f"Generating reports for topic: {topic}")
        completion = generate_healthcare_data(
            client, SYSTEM_ROLE, user_role, MODEL,
            seed=SEED, temperature=TEMPERATURE, n=NUM_COMPLETIONS
        )
        if completion:
            reports = parse_and_clean_reports(completion)
            df = pd.DataFrame(reports, columns=['report'])
            df['topic'] = topic
            all_reports.append(df)
        else:
            print(f"Failed to generate reports for topic: {topic}")

    return all_reports

In [None]:
all_reports = generate_and_collect_reports()

Generating reports for topic: unmet_notes
Generating reports for topic: met_notes


In [None]:
df_all_reports = pd.concat(all_reports, ignore_index=True)
saved_df = save_reports(df_all_reports, FN_REPORTS)

Reports saved successfully to /content/drive/MyDrive/Colab Notebooks/GenCareAI/GenCareAI/data/TEST_care_reports.csv
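
Before using the saved file, it can help to check how many usable reports each topic produced. A minimal sketch, assuming the `report` and `topic` columns created above; `tidy_reports` is a hypothetical helper and the sample rows are invented.

```python
import pandas as pd


def tidy_reports(df):
    """Strip whitespace, drop empty and duplicate reports, and count reports per topic."""
    out = df.copy()
    out["report"] = out["report"].str.strip()
    out = out[out["report"] != ""].drop_duplicates(subset=["report", "topic"])
    out = out.reset_index(drop=True)
    counts = out["topic"].value_counts().to_dict()
    return out, counts


example = pd.DataFrame({
    "report": [" A slept well. ", "A slept well.", "B ate lunch.", "  "],
    "topic": ["met_notes", "met_notes", "unmet_notes", "unmet_notes"],
})
tidy, per_topic = tidy_reports(example)
print(per_topic)
```

Large differences between the per-topic counts can point to completions that did not follow the requested separator format.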


In [None]:
# FN_REPORTS_TO_CONCAT = os.path.join(DATA_DIR, 'gcai_reports2.csv')
# FN_REPORTS_CONCATENATED = os.path.join(DATA_DIR, 'gcai_reports_concatenated.csv')

# df_reports_to_concat = pd.read_csv(FN_REPORTS_TO_CONCAT)

# # Combine the DataFrames
# df_reports_concatenated = pd.concat([df_all_reports, df_reports_to_concat])

# # Save the concatenated DataFrame to a CSV file
# df_reports_concatenated.to_csv(FN_REPORTS_CONCATENATED, index=False)