<a href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/create_synthetic_data_with_tabular_dp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Installation and instructions

This notebook walks through using Gretel Tabular DP to generate synthetic healthcare data with differential privacy. It also trains Gretel LSTM and Gretel ACTGAN models on the same data for comparison.

In [None]:
%%capture
! pip install numpy pandas
! pip install -U gretel-client

## Log in to Gretel using your API key

In [None]:
from gretel_client import configure_session
configure_session(api_key="prompt", validate=True, clear=True)

## Load data

This dataset contains information about the readmission of hospital patients with diabetes. Most of the 43 variables are categorical. Only a handful, such as `time_in_hospital`, `num_lab_procedures`, `num_procedures`, `num_medications`, `number_outpatient`, `number_emergency`, `number_inpatient`, and `number_diagnoses`, contain numeric values.

In [None]:
import pandas as pd
DATA_PATH = "https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/uci_diabetes_readmission_data.csv"
df = pd.read_csv(DATA_PATH)

In [None]:
pd.set_option("display.max_columns", 50)
df.head()

## Train Tabular DP with epsilon = 0.5
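The privacy parameter epsilon is set to 0.5. Smaller values of epsilon give a stronger privacy guarantee, typically at some cost in synthetic data accuracy. The privacy parameter delta is set automatically based on dataset size.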
See https://docs.gretel.ai/reference/synthetics/models/gretel-tabular-dp#model-creation for more information on setting these parameters.
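
To build intuition for what changing epsilon does, here is a minimal sketch of the textbook Laplace mechanism applied to a single count. It illustrates the general privacy/accuracy trade-off only. It is not a description of how Gretel Tabular DP works internally, and the helper names (`laplace_scale`, `laplace_noise`, `noisy_count`) and the example count of 100 are made up for this illustration.

```python
import random
import statistics


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b for the Laplace mechanism: b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count (sensitivity 1) with epsilon-differential privacy."""
    return true_count + laplace_noise(laplace_scale(1.0, epsilon), rng)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
for eps in (0.5, 1.0):
    # Hypothetical true count of 100; measure how far the noisy answers stray.
    errors = [abs(noisy_count(100, eps, rng) - 100) for _ in range(10_000)]
    print(
        f"epsilon={eps}: noise scale={laplace_scale(1.0, eps)}, "
        f"mean abs error={statistics.mean(errors):.2f}"
    )
```

Halving epsilon doubles the noise scale, so the mean absolute error should come out close to 2 for epsilon = 0.5 and close to 1 for epsilon = 1.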

In [None]:
from gretel_client.projects import create_or_get_unique_project

# set up a project 
project = create_or_get_unique_project(name="hospital-readmission-tabular-dp")

# upload data source
data_source_identifier = project.upload_artifact(DATA_PATH)

In [None]:
from gretel_client.projects.models import read_model_config
from pprint import pprint
from gretel_client.helpers import poll

# Create a new model configuration.
config = read_model_config("synthetics/tabular-differential-privacy")
config["models"][0]["tabular_dp"]["params"]["epsilon"] = 0.5
config["name"] = "hospital-readmission-tabular-dp-epsilon-0.5"

# view config
pprint(config)

# create and submit the model for training
model = project.create_model_obj(model_config=config, data_source=data_source_identifier)
model.submit_cloud()
poll(model)

# view the synthetic data generated
synthetic = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")
display(synthetic.head())

# get quick information on synthetic data quality
pprint(model.get_report_summary())
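
Only `epsilon` and `name` differ between Tabular DP runs in this notebook, so the configuration step can be factored into a small helper. The sketch below is one possible way to do that. `with_epsilon` is a hypothetical helper, not part of `gretel_client`, and `base` is a stand-in dict that reproduces only the keys this notebook touches, so the block runs without a Gretel account.

```python
import copy


def with_epsilon(base_config: dict, epsilon: float, name: str) -> dict:
    """Return a copy of a Tabular DP config with epsilon and name set."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    config = copy.deepcopy(base_config)  # leave the template untouched
    config["models"][0]["tabular_dp"]["params"]["epsilon"] = epsilon
    config["name"] = name
    return config


# Stand-in for the dict returned by read_model_config(...). Only the keys
# used in this notebook are reproduced; the values are placeholders.
base = {"name": "template", "models": [{"tabular_dp": {"params": {"epsilon": 1}}}]}

configs = {
    eps: with_epsilon(base, eps, f"hospital-readmission-tabular-dp-epsilon-{eps}")
    for eps in (0.5, 1)
}
print(configs[0.5]["name"])
```

In the notebook itself, `base` would be replaced by `read_model_config("synthetics/tabular-differential-privacy")`, and each resulting config passed to `project.create_model_obj(...)` as in the cell above.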

## Train Tabular DP with epsilon = 1
The privacy parameter epsilon is set to 1, a weaker privacy guarantee than epsilon = 0.5. The privacy parameter delta is set automatically based on dataset size.
See https://docs.gretel.ai/reference/synthetics/models/gretel-tabular-dp#model-creation for more information on setting these parameters.

In [None]:
# Create a new model configuration.
config2 = read_model_config("synthetics/tabular-differential-privacy")
config2["models"][0]["tabular_dp"]["params"]["epsilon"] = 1
config2["name"] = "hospital-readmission-tabular-dp-epsilon-1"

# view config
pprint(config2)

# create and submit the model for training
model2 = project.create_model_obj(model_config=config2, data_source=data_source_identifier)
model2.submit_cloud()
poll(model2)

# view the synthetic data generated
synthetic2 = pd.read_csv(model2.get_artifact_link("data_preview"), compression="gzip")
display(synthetic2.head())

# get quick information on synthetic data quality
pprint(model2.get_report_summary())

## Train other Gretel models for comparison

* Gretel LSTM
* Gretel ACTGAN

In [None]:
# Gretel LSTM 

# Create a new model configuration.
config3 = read_model_config("synthetics/tabular-lstm")
config3["name"] = "hospital-readmission-tabular-lstm"

pprint(config3)

# create and submit the model for training
model3 = project.create_model_obj(model_config=config3, data_source=data_source_identifier)
model3.submit_cloud()
poll(model3)

# view the synthetic data generated
synthetic3 = pd.read_csv(model3.get_artifact_link("data_preview"), compression="gzip")
display(synthetic3.head())

# get quick information on synthetic data quality
pprint(model3.get_report_summary())

In [None]:
# Gretel ACTGAN 

# Create a new model configuration.
config5 = read_model_config("synthetics/tabular-actgan")
config5["name"] = "hospital-readmission-tabular-actgan"

pprint(config5)

# create and submit the model for training
model5 = project.create_model_obj(model_config=config5, data_source=data_source_identifier)
model5.submit_cloud()
poll(model5)

# view the synthetic data generated
synthetic5 = pd.read_csv(model5.get_artifact_link("data_preview"), compression="gzip")
display(synthetic5.head())

# get quick information on synthetic data quality
pprint(model5.get_report_summary())
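
Each cell above prints its report summary separately, which makes the four models awkward to compare. The sketch below shows one way to line the summaries up side by side. It assumes that `get_report_summary()` returns a dict shaped like `{"summary": [{"field": ..., "value": ...}, ...]}`; check the `pprint` output above before relying on that. The helpers `summary_to_row` and `comparison_table` are hypothetical, and the inputs are placeholders, not real results.

```python
def summary_to_row(report_summary: dict) -> dict:
    """Flatten {"summary": [{"field": ..., "value": ...}, ...]} into {field: value}."""
    return {item["field"]: item["value"] for item in report_summary.get("summary", [])}


def comparison_table(summaries: dict) -> str:
    """Render one row per model and one column per reported field."""
    rows = {name: summary_to_row(s) for name, s in summaries.items()}
    fields = sorted({field for row in rows.values() for field in row})
    lines = [" | ".join(["model"] + fields)]
    for name, row in rows.items():
        # Fields a model did not report are shown as "n/a".
        lines.append(" | ".join([name] + [str(row.get(f, "n/a")) for f in fields]))
    return "\n".join(lines)


# Placeholder inputs, NOT real results. The assumed dict shape is described above.
example = {
    "model-a": {"summary": [{"field": "score", "value": 1}]},
    "model-b": {"summary": [{"field": "score", "value": 2}, {"field": "other", "value": 3}]},
}
print(comparison_table(example))
```

In this notebook, the input would instead be built from the trained models, for example `{"tabular-dp eps=0.5": model.get_report_summary(), "tabular-dp eps=1": model2.get_report_summary(), "lstm": model3.get_report_summary(), "actgan": model5.get_report_summary()}`.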