<a href="https://colab.research.google.com/github/ekrombouts/GenCareAI/blob/main/scripts/data_generation/100_GenerateCareReportsColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GenCare AI: Synthetic Healthcare Data Generation

**Author:** Isabel Ronan (original work by [Eva Rombouts](https://github.com/ekrombouts))  
**Date:** 2024-06-03  
**Updated:** 2024-06-03  
**Version:** 1.0

### Description
This script generates synthetic healthcare data for NLP experiments.  
It uses OpenAI models to create realistic, fictitious records that mimic the client records of nursing care homes, for use in machine learning and data analysis.  

### Setup
- Ensure an [OpenAI API key](https://platform.openai.com/docs/quickstart?context=python) is set in the environment or in a configuration file.
- Google Drive must be mounted for file operations when running in Colab.
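
The first setup step can be sketched as follows. This is a minimal sketch, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; the helper name `get_api_key` is hypothetical and not part of this notebook, which reads the key from Colab's secrets instead.

```python
import os


def get_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of a later authentication error
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return key
```

Failing at lookup time keeps a missing key from surfacing only when the first request is sent.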

In [12]:
# install the packages needed in Google Colab
!pip install -q langchain langchain_core langchain_openai

In [13]:
from typing import List

from langchain_core.prompts import PromptTemplate
from langchain.chains import SequentialChain, LLMChain
from langchain_openai import ChatOpenAI

import os
import re  # used by parse_and_clean_reports

import pandas as pd

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('open_ai_project_nurse_notes')  # set the key in the Secrets panel in Colab

In [16]:
# variables
SEED = 1
MODEL = 'gpt-3.5-turbo-0125'
TEMPERATURE = 1.1
note_number = 0
NUM_COMPLETIONS = 1 # number of completions per query
NUM_REPORTS = 25 # number of notes to generate per query
LANGUAGE = "hiberno-English"
NATIONALITY = "Irish"

DATA_DIR = '/content/drive/MyDrive/Colab Notebooks/GenCareAI/GenCareAI/data'
FN_REPORTS = os.path.join(DATA_DIR, f'{note_number}_care_reports.csv')

In [17]:
# create role for model based on an Irish nursing home context
SYSTEM_ROLE_PROMPT = '''
You are a specialist in generating fictitious data for natural language processing projects in healthcare.
You speak the language of a nurse in an {nationality} nursing home. Namely, you speak {language}.
'''

In [18]:
# create note query prompt template text
NOTE_QUERY_PROMPT = '''
This is an example of a nurse note for a patient in a day:
"{example_note}"

Other reports may include: washing, dressing, brushing teeth, getting ready for the day, getting ready for the night, showering, cleaning dental prostheses, or assistance after incontinence.
Other reports could include: what the client has or has not eaten, what help is needed with eating (full help, encouragement, adapted cutlery or cup), choking, keeping hydration and nutrition lists.
Other reports could include: Organised activities, getting visitors, browsing through a magazine, interacting with fellow residents. Keep in mind that these are reports from people in a nursing home, with severe disabilities, so social interaction and activities are limited. Usually it involves sociability, but not always.
Other reports may include, for example: oedema, pressure ulcers, peeling, redness and itching of the skin. Nails that are too long, blemishes.
Other reports could include, for example: care plan discussions, minor medical complaints, family requests, ordering medication.
Reports can be, for example, about: restlessness and wandering at night, sleeping well, going to the toilet at night, phoning, lying crookedly in bed.
Reports may include: agitation, restlessness, apathy, confusion; usually the confusion is subtle, but sometimes more intense.
Reports may include, for example: pain, tightness of breath, nausea, diarrhoea, back pain, palliative care; usually the complaints are subtle, but sometimes more severe.
Other reports can be about, for example: walking aids, the wheelchair, falls, fall incidents, transfers, lifts.
Most reports are about everyday things, so not everything is a serious incident.

Make up {number_of_reports} such reports for {number_of_reports} residents with {needs} palliative care needs. Return only the reports separated by "\n---". Vary the sentence structure and style.
'''
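
To make the placeholders concrete, here is a minimal sketch of how the variables in braces get filled. It uses plain `str.format` on a shortened stand-in template (`SHORT_NOTE_PROMPT` is a hypothetical name, and the example note is made up); `PromptTemplate` with its default f-string format should behave in much the same way for simple placeholders like these.

```python
# Shortened stand-in for NOTE_QUERY_PROMPT (the real template is much longer)
SHORT_NOTE_PROMPT = '''
This is an example of a nurse note for a patient in a day:
"{example_note}"

Make up {number_of_reports} such reports for {number_of_reports} residents with {needs} palliative care needs.
'''

# The same placeholder may appear more than once; each occurrence is replaced
filled = SHORT_NOTE_PROMPT.format(
    example_note="Mary slept well and had tea with her daughter.",
    number_of_reports=3,
    needs="unmet",
)
print(filled)
```

Any literal braces in such a template would need to be doubled (`{{` and `}}`) so they are not treated as placeholders.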


In [19]:
# initialize OpenAI model
llm = ChatOpenAI(api_key=OPENAI_API_KEY, temperature=TEMPERATURE, model=MODEL)

In [20]:
# create prompts
role_prompt = PromptTemplate(template=SYSTEM_ROLE_PROMPT, input_variables=["nationality", "language"])
note_prompt = PromptTemplate(template=NOTE_QUERY_PROMPT, input_variables=["example_note", "number_of_reports", "needs"])

# create LLMChains
role_chain = LLMChain(llm=llm, prompt=role_prompt, output_key="intermediate_output")
note_chain = LLMChain(llm=llm, prompt=note_prompt, output_key="output")

# create SequentialChain: runs role_chain first, then note_chain
sequential_chain = SequentialChain(
    chains=[role_chain, note_chain],
    input_variables=["nationality", "language", "example_note", "number_of_reports", "needs"],
    output_variables=["output"],
)
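
As written, `note_prompt` has no `{intermediate_output}` placeholder, so the reply from `role_chain` does not appear to reach the note request. One possible alternative, sketched here under that assumption, is to send the role text as a system message alongside the note request, which is the message shape that `generate_healthcare_data` further down expects. The helper name `build_messages` and the shortened templates are hypothetical.

```python
from typing import Dict, List

# Shortened stand-ins for SYSTEM_ROLE_PROMPT and NOTE_QUERY_PROMPT
SYSTEM_TEMPLATE = (
    "You speak the language of a nurse in an {nationality} nursing home. "
    "Namely, you speak {language}."
)
USER_TEMPLATE = (
    "Make up {number_of_reports} reports for residents with {needs} palliative care needs."
)


def build_messages(system_template: str, user_template: str, **variables) -> List[Dict[str, str]]:
    """Fill both templates and return them as chat messages (hypothetical helper)."""
    return [
        # str.format ignores keyword arguments a template does not use
        {"role": "system", "content": system_template.format(**variables).strip()},
        {"role": "user", "content": user_template.format(**variables).strip()},
    ]


messages = build_messages(
    SYSTEM_TEMPLATE,
    USER_TEMPLATE,
    nationality="Irish",
    language="hiberno-English",
    number_of_reports=25,
    needs="unmet",
)
```

With this shape the role description and the note request travel in a single call, rather than in two separate model calls.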

In [21]:
input_data = {
    "nationality": NATIONALITY,
    "language": LANGUAGE,
    "example_note": (
        "John was assisted with a shower this morning. He had his lunch in the canteen, "
        "he ate half a bowl of soup, half a chicken salad and a full bowl of ice cream "
        "along with 2 cups of tea. John's sister Mary came to see him today and he was in good form."
    ),
    "number_of_reports": NUM_REPORTS,
    "needs": "unmet",
}

In [22]:
results = []
for _ in range(NUM_COMPLETIONS):
  response = sequential_chain.invoke(input_data)
  results.append(response)

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

In [None]:
results

In [None]:
def generate_healthcare_data(client, system_role, user_role, model, seed=None, temperature=1, n=1):
    """Generate synthetic healthcare data using the OpenAI API."""
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_role},
                {"role": "user", "content": user_role}
            ],
            seed=seed,
            temperature=temperature,
            n=n
        )
        return completion
    except Exception as e:
        print(f"An error occurred during data generation: {str(e)}")
        return None

In [None]:
def parse_and_clean_reports(completions):
    """Parse and clean reports from OpenAI completions."""
    reports = []
    for choice in completions.choices:
        # reports are separated by a newline followed by '---'
        for split in choice.message.content.split('\n---'):
            # splitting on double quotes removes the quotes around quoted reports
            for fragment in split.split('"'):
                # skip label fragments (containing ':') and fragments without letters
                if ":" not in fragment and re.search('[a-zA-Z]', fragment):
                    reports.append(fragment)

    return reports
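
The filtering rules are easier to follow on a small input. The sketch below mirrors the logic on one made-up completion; `split_reports`, the sample text, and the `SimpleNamespace` stand-in for the API response are all hypothetical, and the `.strip()` call is an addition that the notebook function does not make.

```python
import re
from types import SimpleNamespace
from typing import List


def split_reports(text: str) -> List[str]:
    """Split one completion into reports, mirroring the filtering above."""
    reports = []
    for chunk in text.split("\n---"):
        for fragment in chunk.split('"'):
            # Drop label fragments (containing ':') and fragments without letters
            if ":" not in fragment and re.search("[a-zA-Z]", fragment):
                # .strip() is an addition in this sketch; the notebook keeps the whitespace
                reports.append(fragment.strip())
    return reports


# Made-up completion text and a stand-in for the API response object
sample_text = 'Report 1: "Mary slept well."\n---\nTom ate all his dinner.\n---\n'
fake_completion = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content=sample_text))]
)

reports = [
    report
    for choice in fake_completion.choices
    for report in split_reports(choice.message.content)
]
print(reports)  # ['Mary slept well.', 'Tom ate all his dinner.']
```

One consequence of the colon rule is that a whole report is dropped if it contains a colon anywhere, for instance in a time such as 10:30, so it may be worth checking how many generated notes are lost this way.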

In [None]:
def save_reports(df, filename):
    """Save the DataFrame to a CSV file.

    Returns None on success. If saving fails, returns the DataFrame so the
    caller can still handle the unsaved data.
    """
    try:
        df.to_csv(filename, index=False)
        print(f"Reports saved successfully to {filename}")
    except Exception as e:
        print(f"Failed to save reports: {e}")
        return df  # hand the unsaved data back to the caller

In [None]:
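def generate_and_collect_reports():
    """Main function to orchestrate data generation.

    Relies on setup_openai_client, USER_ROLES and SYSTEM_ROLE, which are not
    defined in the cells above and must be defined before this cell runs.
    """
    client = setup_openai_client()
    if not client:
        print("OpenAI client setup failed. Exiting...")
        return []  # keep the return type consistent with the success path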

    all_reports = []
    for topic, user_role in USER_ROLES.items():
        print(f"Generating reports for topic: {topic}")
        completion = generate_healthcare_data(client, SYSTEM_ROLE, user_role, MODEL, SEED, TEMPERATURE, NUM_COMPLETIONS)
        if completion:
            reports = parse_and_clean_reports(completion)
            df = pd.DataFrame(reports, columns=['report'])
            df['topic'] = topic
            all_reports.append(df)
        else:
            print(f"Failed to generate reports for topic: {topic}")

    return all_reports
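
The loop above can be tried offline by swapping the API call for a stand-in. This is a minimal sketch, assuming that `USER_ROLES` maps a topic name to a prompt string, as the `.items()` loop suggests; `collect_reports`, `EXAMPLE_USER_ROLES` and `fake_generate` are hypothetical names, and the topics and reports are made up.

```python
from typing import Callable, Dict, List, Optional


def collect_reports(
    user_roles: Dict[str, str],
    generate: Callable[[str], Optional[List[str]]],
) -> List[Dict[str, str]]:
    """Run `generate` once per topic and collect one row per report."""
    rows = []
    for topic, prompt in user_roles.items():
        reports = generate(prompt)
        if reports is None:
            # Mirror the notebook: report the failure and move on to the next topic
            print(f"Failed to generate reports for topic: {topic}")
            continue
        rows.extend({"report": report, "topic": topic} for report in reports)
    return rows


# Hypothetical topics, plus an offline stand-in for the API call
EXAMPLE_USER_ROLES = {
    "eating": "Write reports about eating and drinking.",
    "sleep": "Write reports about the night.",
}


def fake_generate(prompt: str) -> Optional[List[str]]:
    if "night" in prompt:
        return None  # simulate a failed request
    return ["Ate a full dinner.", "Needed encouragement with fluids."]


rows = collect_reports(EXAMPLE_USER_ROLES, fake_generate)
```

A list of row dictionaries like this can be passed directly to `pd.DataFrame`, which gives the same `report` and `topic` columns as the per-topic frames built above.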

In [None]:
all_reports = generate_and_collect_reports()

In [None]:
df_all_reports = pd.concat(all_reports, ignore_index=True)
saved_df = save_reports(df_all_reports, FN_REPORTS)

In [None]:
# FN_REPORTS_TO_CONCAT = os.path.join(DATA_DIR, 'gcai_reports2.csv')
# FN_REPORTS_CONCATENATED = os.path.join(DATA_DIR, 'gcai_reports_concatenated.csv')

# df_reports_to_concat = pd.read_csv(FN_REPORTS_TO_CONCAT)

# # Combine the DataFrames
# df_reports_concatenated = pd.concat([df_all_reports, df_reports_to_concat])

# # Save the concatenated DataFrame to a CSV file
# df_reports_concatenated.to_csv(FN_REPORTS_CONCATENATED, index=False)