# Gretel Hybrid on AWS

This notebook walks you through creating synthetic data using Gretel Hybrid on AWS. Before you can use it, you need a Gretel Hybrid cluster set up in your AWS environment.

To set up Gretel Hybrid on AWS, please see our documentation:

https://docs.gretel.ai/guides/environment-setup/running-gretel-hybrid

In [1]:
%%capture

# Install the Gretel Client and AWS dependencies
!pip install -U gretel-client boto3

In [2]:
from getpass import getpass

# Set the following variables.

# This bucket will store:
# 1) Training data, which will be uploaded directly from the Gretel Client
# 2) Artifacts such as the generated synthetic data, reports, and logs

# NOTE: This bucket is the same as the SINK BUCKET from this Hybrid setup step: https://docs.gretel.ai/guides/environment-setup/running-gretel-hybrid/aws-setup#create-s3-buckets
S3_BUCKET = "gretel-hybrid-platform-env-us-west-2-sink-bucket"

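# NOTE: Set this to the name or ID of an existing Gretel project.
# As written, the `get_project` call below does not create a project unless `create=True` is passed.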
GRETEL_PROJECT = "proj_2ULi8qV3snDTm8sBtxV8ByhAygg"

# Set which Gretel model you want to use
# https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics
GRETEL_MODEL = "synthetics/tabular-differential-privacy"

In [3]:
import os

# Pass in AWS Credentials at runtime
AWS_ACCESS_KEY = getpass(prompt="AWS Access Key:")
AWS_SECRET_KEY = getpass(prompt="AWS Secret Key:")
AWS_SESSION_TOKEN = getpass(prompt="AWS Session Token:")

os.environ["AWS_ACCESS_KEY_ID"] = AWS_ACCESS_KEY
os.environ["AWS_SECRET_ACCESS_KEY"] = AWS_SECRET_KEY
os.environ["AWS_SESSION_TOKEN"] = AWS_SESSION_TOKEN

AWS Access Key:··········
AWS Secret Key:··········
AWS Session Token:··········
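
A session token is only needed for temporary credentials (for example, those issued by AWS STS). The following is a minimal sketch, assuming a hypothetical helper named `export_aws_credentials`, that exports only the non-empty values so that a blank token prompt does not leave an empty `AWS_SESSION_TOKEN` in the environment. The credential values shown are placeholders.

```python
import os
from typing import MutableMapping, Optional


def export_aws_credentials(
    access_key: str,
    secret_key: str,
    session_token: Optional[str] = None,
    env: MutableMapping[str, str] = os.environ,
) -> None:
    """Hypothetical helper: write AWS credentials into `env`, skipping a blank token."""
    if not access_key or not secret_key:
        raise ValueError("Both the access key and the secret key are required.")
    env["AWS_ACCESS_KEY_ID"] = access_key
    env["AWS_SECRET_ACCESS_KEY"] = secret_key
    if session_token:
        env["AWS_SESSION_TOKEN"] = session_token
    else:
        # Long-lived IAM user keys have no session token; drop any stale value.
        env.pop("AWS_SESSION_TOKEN", None)


# Example with placeholder values, written to a plain dict instead of os.environ.
demo_env = {"AWS_SESSION_TOKEN": "stale-token"}
export_aws_credentials("AKIAEXAMPLE", "example-secret", "", env=demo_env)
print(sorted(demo_env))
```

The same idea may apply when building the `boto3.Session`: passing `AWS_SESSION_TOKEN or None` avoids handing an empty string to the session.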


# Authenticate with AWS

Authenticate with the provided credentials and ensure the provided bucket is accessible.

In [4]:
import boto3

session = boto3.Session(
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    aws_session_token=AWS_SESSION_TOKEN
)

sts = session.client('sts')
try:
    # A successful STS call confirms that the credentials are valid.
    sts.get_caller_identity()
    print("Provided AWS credentials are valid. Session created successfully.")
except Exception as err:
    print(f"ERROR: Provided AWS credentials are not valid: {err}")

s3 = session.client('s3')

try:
    # head_bucket succeeds only if the bucket exists and is accessible.
    s3.head_bucket(Bucket=S3_BUCKET)
    print(f"Bucket {S3_BUCKET} exists and is accessible.")
except Exception as err:
    print(f"ERROR: Bucket {S3_BUCKET} does not exist or we do not have access to it: {err}")

Provided AWS credentials are valid. Session created successfully.
Bucket gretel-hybrid-platform-env-us-west-2-sink-bucket exists and is accessible.
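
When the bucket check fails, `head_bucket` raises a `botocore.exceptions.ClientError` whose error code can help tell a missing bucket from a permissions problem. The sketch below assumes a hypothetical helper, `explain_head_bucket_error`, that takes the `response` dictionary carried by that exception. The message wording and the exact set of codes handled are illustrative assumptions.

```python
from typing import Any, Mapping


def explain_head_bucket_error(response: Mapping[str, Any], bucket: str) -> str:
    """Hypothetical helper: turn a ClientError.response dict into a readable message."""
    code = str(response.get("Error", {}).get("Code", ""))
    if code in ("404", "NoSuchBucket", "NotFound"):
        return f"Bucket {bucket} does not exist."
    if code in ("403", "AccessDenied", "Forbidden"):
        return f"Access to bucket {bucket} was denied for these credentials."
    return f"Unexpected error for bucket {bucket}: {code or 'unknown'}"


# In the notebook this could be called as follows (requires boto3/botocore):
#   try:
#       s3.head_bucket(Bucket=S3_BUCKET)
#   except botocore.exceptions.ClientError as err:
#       print(explain_head_bucket_error(err.response, S3_BUCKET))

# Stand-in response dictionaries shaped like botocore's error payload.
print(explain_head_bucket_error({"Error": {"Code": "404"}}, "my-bucket"))
print(explain_head_bucket_error({"Error": {"Code": "403"}}, "my-bucket"))
```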


# Authenticate with Gretel Cloud

This step will configure your Gretel Client to submit job _requests_ to Gretel Cloud. Once a job _request_ is sent to Gretel Cloud, the Hybrid cluster will download the job request _metadata_ and schedule the job to run on the Hybrid cluster in AWS.

In [5]:
from gretel_client import configure_session

S3_BUCKET_WITH_PROTOCOL = f"s3://{S3_BUCKET}"

configure_session(
  api_key="prompt",  # prompt for the API key interactively (useful in notebook environments)
  endpoint="https://api.gretel.cloud",
  validate=True,
  clear=True,
  default_runner="hybrid",
  artifact_endpoint=S3_BUCKET_WITH_PROTOCOL
)

Gretel Api Key··········
Using endpoint https://api.gretel.cloud
Logged in as ben+awshybrid@gretellabs.com ✅
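
Because `artifact_endpoint` is built by prepending `s3://` to the bucket name, a malformed name or an already-prefixed value would produce an unusable endpoint. As a rough sketch, a hypothetical helper `to_s3_uri` could normalize and sanity-check the value first. The regular expression covers only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit) and is not a complete validator.

```python
import re

# Basic shape of an S3 bucket name; deliberately not exhaustive.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def to_s3_uri(bucket: str) -> str:
    """Hypothetical helper: return 's3://<bucket>' after a basic name check."""
    name = bucket.strip()
    if name.startswith("s3://"):
        # Tolerate a value that already includes the protocol.
        name = name[len("s3://"):]
    name = name.rstrip("/")
    if not _BUCKET_RE.match(name):
        raise ValueError(f"{bucket!r} does not look like a valid S3 bucket name.")
    return f"s3://{name}"


print(to_s3_uri("gretel-hybrid-platform-env-us-west-2-sink-bucket"))
```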


# Create a Gretel Model

This step requests a model creation job and queues it in Gretel Cloud. The Gretel Hybrid cluster in AWS downloads the request metadata and begins training the model.

In [6]:
import pandas as pd

from gretel_client import get_project
from gretel_client.helpers import poll

gretel_project = get_project(name=GRETEL_PROJECT)

In [7]:
training_df = pd.read_csv("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv")
training_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,33,Private,229051,Some-college,10,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,52,United-States,<=50K
1,38,Local-gov,91711,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,>50K
2,56,Private,282023,HS-grad,9,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,<=50K
3,32,Private,209538,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,55,United-States,>50K
4,34,Self-emp-inc,215382,Masters,14,Separated,Prof-specialty,Not-in-family,White,Female,4787,0,40,United-States,>50K


In [10]:
gretel_model = gretel_project.create_model_obj(model_config=GRETEL_MODEL, data_source=training_df)
gretel_model = gretel_model.submit_hybrid()
print(f"Gretel Model ID submitted for Hybrid, see project here: {gretel_project.get_console_url()}")

Gretel Model ID submitted for Hybrid, see project here: https://console.gretel.ai/proj_2ULi8qV3snDTm8sBtxV8ByhAygg


In [11]:
poll(gretel_model)

{
    "uid": "650891f35e598f341bf1ea6f",
    "guid": "model_2Va3fYxMNAQq2gZLMuKdV17jEI4",
    "model_name": "tabular-differential-privacy",
    "runner_mode": "manual",
    "user_id": "64dd01d5bff6210b83545f5d",
    "user_guid": "user_2U4j0gAdS2DhKwmJlkJMm0GVWal",
    "billing_domain": "c607654170f8449ea5cfa7647e663292.gretel",
    "billing_domain_guid": "domain_2U4d40JrNgKVhM4hz26AYQYwHso",
    "project_id": "64e4ef705bcda7b5d4e4b64a",
    "project_guid": "proj_2ULi8qV3snDTm8sBtxV8ByhAygg",
    "status_history": {
        "created": "2023-09-18T18:07:47.866475Z"
    },
    "last_modified": "2023-09-18T18:07:48.089726Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "annotations": null,
    "provenance": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/models/tabular_dp@sha256:153535a2cf05fad62eead2698d9cbecee8579b98e95bad30b8c60dd837b48ce4",
    "conta

INFO: Starting poller
INFO: Status is created. Model creation has been queued.
INFO: Status is pending. A worker is being allocated to begin model creation.
INFO: Status is active. A worker has started creating your model!
2023-09-18T18:12:30.072795Z  Analyzing input data and checking for auto-params...
2023-09-18T18:12:30.074276Z  Parameter `delta` was automatically set to 6e-07 based on the number of records, n. Note that n was not determined with differential privacy.
2023-09-18T18:12:30.074706Z  Found 1 auto-params that were set based on input data.
{
    "delta": 6.03681610520369e-07
}
2023-09-18T18:12:30.075207Z  Using updated model configuration: 
{
    "schema_version": "1.0",
    "name": "tabular-differential-privacy",
    "models": [
        {
            "tabular_dp": {
                "data_source": [
                    "s3://gretel-hybrid-platform-env-us-west-2-sink-bucket/sources/64e4ef705bcda7b5d4e4b64a/gretel_809584fc1f694f1e893eea1b13caaffe_dataframe-83c3ed7d-5344-4dc

# Preview Synthetic Data
As part of the model training process, a sample of synthetic data is created. You can load and explore that sample as follows.

In [12]:
# If you ever need to restore your Gretel Model object, you can do so like this:

# gretel_model = gretel_project.get_model("64de615d5c7248c58cc50247")

# Next we look at the data that was generated as part of model training
with gretel_model.get_artifact_handle("data_preview") as remote_file:
  syn_df = pd.read_csv(remote_file)
syn_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,27,Private,330269,HS-grad,9,Never-married,Craft-repair,Not-in-family,White,Male,702,0,40,United-States,<=50K
1,35,Private,161378,Bachelors,13,Never-married,Sales,Not-in-family,Black,Female,7131,0,60,United-States,<=50K
2,54,Private,14331,Bachelors,13,Married-civ-spouse,Craft-repair,Husband,Other,Male,289,0,40,United-States,>50K
3,46,Private,280834,Some-college,10,Never-married,Craft-repair,Own-child,Black,Male,7692,0,40,United-States,<=50K
4,51,Private,268390,HS-grad,9,Divorced,Machine-op-inspct,Own-child,White,Female,194,0,40,United-States,<=50K


# Explore the Synthetic Quality Report
This will download the full HTML of the Gretel Synthetic Quality Report.

In [13]:
from IPython.display import display, HTML

with gretel_model.get_artifact_handle("report") as fin:
    html_contents = fin.read().decode()
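
To keep a copy of the report outside the notebook, the decoded HTML can also be written to disk. This is a minimal sketch in which the `save_report` helper and the file name are hypothetical, and a short stand-in string takes the place of `html_contents`.

```python
import tempfile
from pathlib import Path


def save_report(html: str, path: Path) -> Path:
    """Hypothetical helper: write report HTML to `path` as UTF-8 and return the path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


# Stand-in HTML; in the notebook this would be `html_contents`.
sample_html = "<html><body><h1>Synthetic Quality Report</h1></body></html>"
out_path = save_report(sample_html, Path(tempfile.mkdtemp()) / "gretel_report.html")
print(out_path.name)
```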

In [14]:
display(HTML(html_contents), metadata=dict(isolated=True))

How to interpret your SQS,Excellent,Good,Moderate,Poor,Very Poor
Suitable for machine learning or statistical analysis,,,,,
Suitable for balancing or augmenting machine learning data sources,,,,,
Suitable for pre-production testing environments,,,,,
Suitable for demo environments or mock data,,,,,
Improve your model using our tips and advice,,,,,
Significant tuning required to improve model,,,,,

Data Sharing Use Case,Excellent,Very Good,Good,Normal,Poor
"Internally, within the same team",,,,,
"Internally, across different teams",,,,,
"Externally, with trusted partners",,,,,
"Externally, public availability",,,,,

Unnamed: 0,Training Data,Synthetic Data
Row Count,5000,5000
Column Count,15,15
Training Lines Duplicated,--,0

Default Privacy Protections,Advanced Protections

Field,Unique,Missing,Ave. Length,Type,Distribution Stability
native_country,41,0,12.23,Categorical,Excellent
capital_loss,53,0,1.13,Numeric,Excellent
fnlwgt,4556,0,5.83,Numeric,Excellent
age,69,0,2.0,Numeric,Excellent
capital_gain,82,0,1.28,Numeric,Excellent
occupation,14,0,12.2,Categorical,Excellent
hours_per_week,78,0,1.98,Numeric,Excellent
education_num,16,0,1.55,Numeric,Excellent
education,16,0,8.45,Categorical,Excellent
marital_status,7,0,14.42,Categorical,Excellent


# Generate Synthetic Data

Now that the model is trained, synthetic data can be generated.

In [19]:
# Set the number of records to generate
NUM_SYNTHETIC_RECORDS_TO_GENERATE = 5000

record_handler = gretel_model.create_record_handler_obj(
    params={"num_records": NUM_SYNTHETIC_RECORDS_TO_GENERATE}
)
record_handler.submit_hybrid()
print(f"Gretel job submitted for Hybrid, see project here: {gretel_project.get_console_url()}")

Gretel job submitted for Hybrid, see project here: https://console.gretel.ai/proj_2ULi8qV3snDTm8sBtxV8ByhAygg


In [20]:
poll(record_handler)

INFO: Starting poller
INFO: Status is created. A job has been queued.


{
    "uid": "650899508236a9e7680c0354",
    "guid": "model_run_2Va7UTnHgt2vBptXkTQaxmGdw89",
    "model_name": null,
    "runner_mode": "manual",
    "user_id": "64dd01d5bff6210b83545f5d",
    "user_guid": "user_2U4j0gAdS2DhKwmJlkJMm0GVWal",
    "billing_domain": "c607654170f8449ea5cfa7647e663292.gretel",
    "billing_domain_guid": "domain_2U4d40JrNgKVhM4hz26AYQYwHso",
    "project_id": "64e4ef705bcda7b5d4e4b64a",
    "project_guid": "proj_2ULi8qV3snDTm8sBtxV8ByhAygg",
    "status_history": {
        "created": "2023-09-18T18:39:12.238000Z"
    },
    "last_modified": "2023-09-18T18:39:12.425000Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "annotations": null,
    "provenance": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/models/tabular_dp@sha256:153535a2cf05fad62eead2698d9cbecee8579b98e95bad30b8c60dd837b48ce4",
    "container_image_version": "

INFO: Status is pending. A worker is being allocated to begin running.
INFO: Status is active. A worker has started!
2023-09-18T18:40:00.796683Z  Loading model to worker
2023-09-18T18:40:13.918224Z  Loading Tabular DP model...
2023-09-18T18:40:13.919514Z  Sampling 5000 records...
2023-09-18T18:40:14.439231Z  Uploading artifacts to your object store...
2023-09-18T18:40:14.699033Z  Upload to your object store is completed.
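
The worker log lines carry UTC timestamps, so the time a job spent on the worker can be estimated from the first and last entries. The sketch below assumes a hypothetical `parse_log_timestamp` helper and uses two of the log lines shown above as sample input.

```python
from datetime import datetime, timezone


def parse_log_timestamp(line: str) -> datetime:
    """Hypothetical helper: parse the leading timestamp of a worker log line."""
    stamp = line.split(maxsplit=1)[0]
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The trailing "Z" marks UTC, so attach that timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


log_lines = [
    "2023-09-18T18:40:00.796683Z  Loading model to worker",
    "2023-09-18T18:40:14.699033Z  Upload to your object store is completed.",
]

elapsed = parse_log_timestamp(log_lines[-1]) - parse_log_timestamp(log_lines[0])
print(f"Elapsed on worker: {elapsed.total_seconds():.2f} s")
```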


For convenience, the following cell lists the output artifacts of the completed job. The `data` artifact contains the generated synthetic records.

In [21]:
artifacts = record_handler.get_artifacts()
for a in artifacts:
  print(f"Artifact: {a[0]}, Location: {a[1]}")

Artifact: run_report_json, Location: s3://gretel-hybrid-platform-env-us-west-2-sink-bucket/64e4ef705bcda7b5d4e4b64a/run/650899508236a9e7680c0354/run_report_json.json.gz
Artifact: data, Location: s3://gretel-hybrid-platform-env-us-west-2-sink-bucket/64e4ef705bcda7b5d4e4b64a/run/650899508236a9e7680c0354/data.gz
Artifact: run_logs, Location: s3://gretel-hybrid-platform-env-us-west-2-sink-bucket/64e4ef705bcda7b5d4e4b64a/run/650899508236a9e7680c0354/logs.json.gz
Artifact: output_files, Location: s3://gretel-hybrid-platform-env-us-west-2-sink-bucket/64e4ef705bcda7b5d4e4b64a/run/650899508236a9e7680c0354/output_files.tar.gz
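
Each artifact location is an `s3://` URI in your own bucket. To fetch an artifact directly with the `boto3` client, the URI first has to be split into a bucket and a key. A small sketch, assuming a hypothetical helper named `split_s3_uri`:

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Hypothetical helper: split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://gretel-hybrid-platform-env-us-west-2-sink-bucket/"
    "64e4ef705bcda7b5d4e4b64a/run/650899508236a9e7680c0354/data.gz"
)
print(bucket)
print(key)

# With the boto3 client created earlier, the object could then be fetched with:
#   s3.download_file(bucket, key, "data.gz")
```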


In [24]:
# Next we look at the data that was generated as a result of our synthetics job using the trained model
with record_handler.get_artifact_handle("data") as remote_file:
  syn_df = pd.read_csv(remote_file)
syn_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,26,Private,659248,Assoc-voc,11,Never-married,Protective-serv,Other-relative,White,Male,285,0,40,United-States,<=50K
1,45,Private,177605,10th,6,Married-civ-spouse,Other-service,Husband,Black,Male,11763,0,20,United-States,<=50K
2,32,Private,191367,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,20,0,40,United-States,>50K
3,21,Private,40220,Bachelors,13,Never-married,Craft-repair,Own-child,White,Male,649,0,40,United-States,<=50K
4,55,Private,381480,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,White,Male,517,0,40,United-States,<=50K


In [27]:
print(f"Successfully generated {syn_df.shape[0]} synthetic records.")

Successfully generated 5000 synthetic records.
