# Gretel Hybrid on Microsoft Azure

This Notebook walks you through creating synthetic data using Gretel Hybrid on Microsoft Azure. Before you can use this Notebook, you need a Gretel Hybrid cluster set up in your Microsoft Azure environment.

To set up Gretel Hybrid on Microsoft Azure, see our documentation:

https://docs.gretel.ai/guides/environment-setup/running-gretel-hybrid

In [1]:
%%capture

# Install Gretel Client and Microsoft Azure dependencies
!pip install -U gretel-client azure-storage-blob requests

In [2]:
import os
from getpass import getpass

# Set the following variables.
# NOTE: This container is the same as the SINK CONTAINER from this Hybrid setup step: https://docs.gretel.ai/guides/environment-setup/running-gretel-hybrid/azure-setup#create-an-azure-blob-container
#
# This container will store:
# 1) Training data, which will be uploaded directly from the Gretel Client
# 2) Artifacts such as the generated synthetic data, reports, and logs
AZURE_SINK_CONTAINER = "your-container-name"

# NOTE: The connection string for the storage account hosting the containers can be found at https://portal.azure.com/
# Navigate to "Storage Accounts", select your storage account, then open "Access Keys" to find the connection string
CONNECTION_STRING = getpass(prompt="Connection String")
os.environ["AZURE_STORAGE_CONNECTION_STRING"] = CONNECTION_STRING

# This project should have already been created in Gretel
GRETEL_PROJECT = "your-gretel-project-name"

# Set which Gretel model you want to use
# https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics
# You can set the filename of any blueprint template below with a "synthetics/" prefix.
GRETEL_MODEL = "synthetics/tabular-actgan"


Connection String··········


# Authenticate with Microsoft Azure

Authenticate using the connection string and confirm that the artifact container can be accessed.

In [3]:
# Only BlobServiceClient is needed for the container check in this cell
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)

# Check if the container exists
container_client = blob_service_client.get_container_client(AZURE_SINK_CONTAINER)
if container_client.exists():
    print(f"Access to container '{AZURE_SINK_CONTAINER}' is successful.")
else:
    print(f"Container '{AZURE_SINK_CONTAINER}' does not exist or access was denied.")

Access to container 'my-gretel-sink' is successful.
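
The check above only prints a message, so later cells still run when the container is unreachable. A minimal sketch of a fail-fast variant follows. `require_container` is a hypothetical helper, not part of the Azure SDK; it relies only on the `exists()` method already called in the cell above.

```python
def require_container(container_client, container_name: str) -> None:
    """Raise instead of printing when the sink container is not reachable.

    `container_client` is any object with an `exists()` method, such as the
    ContainerClient created in the cell above.
    """
    if not container_client.exists():
        raise RuntimeError(
            f"Container '{container_name}' does not exist or access was denied."
        )


# In this Notebook (assumed usage):
# require_container(container_client, AZURE_SINK_CONTAINER)
```

Raising an exception stops a "Run All" execution at this cell, before any job is submitted against a container that cannot receive artifacts.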


# Authenticate with Gretel Cloud

This step will configure your Gretel Client to submit job _requests_ to Gretel Cloud. Once a job _request_ is sent to Gretel Cloud, the Hybrid cluster will download the job request _metadata_ and schedule the job to run on the Hybrid cluster in Microsoft Azure.

In [4]:
from gretel_client import configure_session

configure_session(
    api_key="prompt",  # for Notebook environments
    validate=True,
    default_runner="hybrid",
    artifact_endpoint=f"azure://{AZURE_SINK_CONTAINER}",
)

Gretel Api Key··········
Using endpoint https://api.gretel.cloud
Logged in as ilgin+azurehybrid@gretellabs.com ✅
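
The `artifact_endpoint` passed to `configure_session` is the `azure://` scheme followed by the sink container name. As a rough sketch, the helper below builds that string and rejects names that break Azure's container naming rules (3 to 63 characters; lowercase letters, digits, and single hyphens; starting and ending with a letter or digit). `build_artifact_endpoint` is a hypothetical name introduced here, not part of the Gretel Client.

```python
import re

# Azure container naming rules: 3-63 characters, lowercase letters, digits,
# and hyphens; each hyphen must sit between two letters or digits.
_CONTAINER_NAME = re.compile(r"(?=.{3,63}\Z)[a-z0-9]+(?:-[a-z0-9]+)*")


def build_artifact_endpoint(container_name: str) -> str:
    """Return the azure:// endpoint string for a valid container name."""
    if not _CONTAINER_NAME.fullmatch(container_name):
        raise ValueError(f"Invalid Azure container name: {container_name!r}")
    return f"azure://{container_name}"


print(build_artifact_endpoint("my-gretel-sink"))
```

Validating the name locally catches a typo before the Gretel Client or the Hybrid cluster tries to use the endpoint.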


# Create a Gretel Model

This step will request a model creation job and queue it in Gretel Cloud. The Gretel Hybrid cluster in Microsoft Azure will download the request metadata and begin training the model.
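
The log output in this Notebook shows a job moving through the `created`, `pending`, and `active` statuses while the Notebook waits. The sketch below illustrates that waiting pattern in plain Python. It is not how `gretel_client.helpers.poll` is implemented; `wait_for_job`, `get_status`, and the set of terminal statuses are assumptions made for illustration.

```python
import time
from typing import Callable, List

# Assumed terminal statuses; only created, pending, and active appear in
# this Notebook's logs.
TERMINAL_STATUSES = frozenset({"completed", "error", "cancelled", "lost"})


def wait_for_job(
    get_status: Callable[[], str],
    interval_seconds: float = 5.0,
    max_checks: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until a terminal status is seen; return distinct statuses in order."""
    seen: List[str] = []
    for _ in range(max_checks):
        status = get_status()
        # Report each status once, when it first changes.
        if not seen or seen[-1] != status:
            seen.append(status)
            print(f"Status is {status}.")
        if status in TERMINAL_STATUSES:
            return seen
        sleep(interval_seconds)
    last = seen[-1] if seen else "unknown"
    raise TimeoutError(f"Job still '{last}' after {max_checks} checks")
```

The `max_checks` bound matters in a Hybrid deployment: if no worker in the cluster picks up the job, an unbounded loop would wait forever on `created` or `pending`.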

In [5]:
import pandas as pd

from gretel_client import get_project
from gretel_client.helpers import poll

gretel_project = get_project(name=GRETEL_PROJECT)

In [6]:
training_df = pd.read_csv("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv")
training_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,33,Private,229051,Some-college,10,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,52,United-States,<=50K
1,38,Local-gov,91711,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,>50K
2,56,Private,282023,HS-grad,9,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,<=50K
3,32,Private,209538,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,55,United-States,>50K
4,34,Self-emp-inc,215382,Masters,14,Separated,Prof-specialty,Not-in-family,White,Female,4787,0,40,United-States,>50K


In [7]:
gretel_model = gretel_project.create_model_obj(model_config=GRETEL_MODEL, data_source=training_df)
gretel_model = gretel_model.submit()
print(f"Gretel Model ID submitted for Hybrid, see project here: {gretel_project.get_console_url()}")

Gretel Model ID submitted for Hybrid, see project here: https://console.gretel.ai/proj_2U41emWEV2jNBCcoRSS1WbTXGtz


In [8]:
poll(gretel_model)

INFO: Starting poller
INFO: Status is created. Model creation has been queued.


{
    "uid": "64e633f1d09777174aa98522",
    "guid": "model_2UOQWQScc7MKLped95ZdiMvBhgG",
    "model_name": "tabular-actgan",
    "runner_mode": "manual",
    "user_id": "64da5ce1bff621343a255193",
    "user_guid": "user_2Tz3kcNgpfCmNUZLpheDB4Gqmwu",
    "billing_domain": "314057facd594564a6851d88cebfde28.gretel",
    "billing_domain_guid": "domain_2Tr90oeZVxLjxutjs73jPY6eqhm",
    "project_id": "64dcae48849335fea60c22a2",
    "project_guid": "proj_2U41emWEV2jNBCcoRSS1WbTXGtz",
    "status_history": {
        "created": "2023-08-23T16:29:37.337218Z"
    },
    "last_modified": "2023-08-23T16:29:37.521801Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "annotations": null,
    "provenance": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/models/actgan@sha256:ed3d1e4a9c591e707a829b11b6d508624bacb5087688a5ebc9f15462fa11c518",
    "container_image_version

INFO: Status is pending. A worker is being allocated to begin model creation.
INFO: Status is active. A worker has started creating your model!
2023-08-23T16:40:31.868697Z  Analyzing input data and checking for auto-params...
2023-08-23T16:40:31.905363Z  Found 2 auto-params that were set based on input data.
{
    "epochs": 600,
    "batch_size": 600
}
2023-08-23T16:40:31.905885Z  Using updated model configuration: 
{
    "schema_version": "1.0",
    "name": "tabular-actgan",
    "models": [
        {
            "actgan": {
                "privacy_filters": {
                    "outliers": null,
                    "similarity": null,
                    "max_iterations": 10
                },
                "data_source": [
                    "azure://my-gretel-sink/sources/64dcae48849335fea60c22a2/gretel_16e109c4481d46b4b02dcfd6d2135c40_dataframe-f7641998-04ac-403c-ae73-0fea0c91bfe3.csv"
                ],
                "ref_data": {},
                "params": {
             

# Preview Synthetic Data
As part of the model training process, a sample of synthetic data is created. You can load and explore that sample directly.

In [9]:
# If you ever need to restore your Gretel Model object, you can do so like this:

# gretel_model = gretel_project.get_model("64de615d5c7248c58cc50247")

# Next we look at the data that was generated as part of model training
with gretel_model.get_artifact_handle("data_preview") as remote_file:
    syn_df = pd.read_csv(remote_file)
syn_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,21,Private,175633,10th,4,Never-married,Other-service,Own-child,White,Female,0,0,22,United-States,<=50K
1,24,Private,44523,Bachelors,12,Divorced,Prof-specialty,Not-in-family,White,Female,2851,0,40,United-States,<=50K
2,35,Federal-gov,372254,Bachelors,13,Never-married,Prof-specialty,Not-in-family,White,Male,15182,0,50,?,>50K
3,18,Private,84451,10th,6,Never-married,Transport-moving,Own-child,White,Female,0,0,14,United-States,<=50K
4,27,Private,320433,HS-grad,9,Never-married,Sales,Own-child,White,Male,0,0,35,United-States,<=50K


# Explore the Synthetic Quality Report
This will download the full HTML of the Gretel Synthetic Quality Report.

In [10]:
from IPython.display import display, HTML

with gretel_model.get_artifact_handle("report") as fin:
    html_contents = fin.read().decode()
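
Because `html_contents` holds the whole report as a string, it can also be written to disk and opened in a browser or shared with colleagues. A minimal sketch, where `save_report` and the default file name are hypothetical:

```python
from pathlib import Path


def save_report(html_contents: str, path: str = "gretel_report.html") -> Path:
    """Write the report HTML to `path` and return the resulting Path."""
    target = Path(path)
    target.write_text(html_contents, encoding="utf-8")
    return target


# In this Notebook (assumed usage):
# report_path = save_report(html_contents)
```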

In [11]:
display(HTML(html_contents), metadata=dict(isolated=True))

How to interpret your SQS,Excellent,Good,Moderate,Poor,Very Poor
Suitable for machine learning or statistical analysis,,,,,
Suitable for balancing or augmenting machine learning data sources,,,,,
Suitable for pre-production testing environments,,,,,
Suitable for demo environments or mock data,,,,,
Improve your model using our tips and advice,,,,,
Significant tuning required to improve model,,,,,

Data Sharing Use Case,Excellent,Very Good,Good,Normal,Poor
"Internally, within the same team",,,,,
"Internally, across different teams",,,,,
"Externally, with trusted partners",,,,,
"Externally, public availability",,,,,

Unnamed: 0,Training Data,Synthetic Data
Row Count,5000,5000
Column Count,15,15
Training Lines Duplicated,--,0

Default Privacy Protections,Advanced Protections

Field,Unique,Missing,Ave. Length,Type,Distribution Stability
hours_per_week,78,0,1.98,Numeric,Good
age,69,0,2.0,Numeric,Good
occupation,14,0,12.2,Categorical,Good
fnlwgt,4556,0,5.83,Numeric,Good
education_num,16,0,1.55,Numeric,Excellent
race,5,0,5.6,Categorical,Excellent
relationship,6,0,9.12,Categorical,Excellent
income_bracket,2,0,4.77,Binary,Excellent
capital_loss,53,0,1.13,Numeric,Excellent
workclass,8,0,7.87,Categorical,Excellent


# Generate More Data

With the Gretel Model created, you can run inference from that model as many times as you wish. You may either request a total number of records to generate or, depending on the model, use conditioning. Conditioning lets you provide partial values as an input dataset, and the model then completes the remainder of each record.

In [12]:
# Generate more records based on record count

model_run = gretel_model.create_record_handler_obj(params=dict(num_records=142))
model_run.submit()
poll(model_run)

INFO: Starting poller


{
    "uid": "64e638ad9f88c36acff55b11",
    "guid": "model_run_2UOSyhLDweOva45H0odTXGjuc5h",
    "model_name": null,
    "runner_mode": "manual",
    "user_id": "64da5ce1bff621343a255193",
    "user_guid": "user_2Tz3kcNgpfCmNUZLpheDB4Gqmwu",
    "billing_domain": "314057facd594564a6851d88cebfde28.gretel",
    "billing_domain_guid": "domain_2Tr90oeZVxLjxutjs73jPY6eqhm",
    "project_id": "64dcae48849335fea60c22a2",
    "project_guid": "proj_2U41emWEV2jNBCcoRSS1WbTXGtz",
    "status_history": {
        "created": "2023-08-23T16:49:49.687000Z"
    },
    "last_modified": "2023-08-23T16:49:49.824000Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "annotations": null,
    "provenance": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/models/actgan@sha256:ed3d1e4a9c591e707a829b11b6d508624bacb5087688a5ebc9f15462fa11c518",
    "container_image_version": "2.10

INFO: Status is created. A job has been queued.
INFO: Status is pending. A worker is being allocated to begin running.
INFO: Status is active. A worker has started!
2023-08-23T16:55:56.991840Z  Loading model to worker
2023-08-23T16:56:08.743978Z  Loading ACTGAN model...
2023-08-23T16:56:08.752879Z  Sampling 142 records...
2023-08-23T16:56:08.962301Z  Preparing privacy filters
2023-08-23T16:56:08.963097Z  Loaded 0 privacy filters
2023-08-23T16:56:08.963426Z  Starting privacy filtering
2023-08-23T16:56:08.963675Z  Privacy filtering complete.
2023-08-23T16:56:08.966691Z  Uploading artifacts to your object store...
2023-08-23T16:56:09.015910Z  Upload to your object store is completed.


In [13]:
# You can always retrieve a model run with the below:

# model_run = gretel_model.get_record_handler("64df7fb5f62d5b782416f0d2")

# Retrieve newly generated data:

with model_run.get_artifact_handle("data") as fin:
    syn_df = pd.read_csv(fin)

print(f"Total records generated: {len(syn_df)}")
syn_df.head()

Total records generated: 142


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,19,Private,264946,HS-grad,9,Never-married,Other-service,Own-child,White,Female,4,0,40,United-States,<=50K
1,31,Private,218293,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,7299,0,40,United-States,>50K
2,71,Private,98185,9th,4,Never-married,Other-service,Not-in-family,White,Female,0,0,17,United-States,<=50K
3,60,Private,97918,7th-8th,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
4,19,Private,212571,Assoc-voc,9,Never-married,Tech-support,Own-child,Black,Male,20,0,31,United-States,<=50K


# Generate Records With Conditioning

In this mode of generation, you provide a dataset of partial records, and the model completes each record for you. If you provide a file of 10 partial records, you will receive 10 complete records at the end of the job. This mode of generation is only available with the Tabular ACTGAN model.

In [14]:
# First create a dataset of partial records that you want the model to complete.

partial_records_df = pd.DataFrame(
    ["Private"] * 5 + ["Local-gov"] * 5,
    columns=["workclass"]
)

partial_records_df

Unnamed: 0,workclass
0,Private
1,Private
2,Private
3,Private
4,Private
5,Local-gov
6,Local-gov
7,Local-gov
8,Local-gov
9,Local-gov


In [15]:
# Next run the model, providing the conditioning DF as the input data source

model_run = gretel_model.create_record_handler_obj(data_source=partial_records_df)
model_run.submit()
poll(model_run)

INFO: Starting poller
INFO: Status is created. A job has been queued.


{
    "uid": "64e63aba237b9a64ffa2f1af",
    "guid": "model_run_2UOU2gXxo3RfFr1t3wpWxZGaEsY",
    "model_name": null,
    "runner_mode": "manual",
    "user_id": "64da5ce1bff621343a255193",
    "user_guid": "user_2Tz3kcNgpfCmNUZLpheDB4Gqmwu",
    "billing_domain": "314057facd594564a6851d88cebfde28.gretel",
    "billing_domain_guid": "domain_2Tr90oeZVxLjxutjs73jPY6eqhm",
    "project_id": "64dcae48849335fea60c22a2",
    "project_guid": "proj_2U41emWEV2jNBCcoRSS1WbTXGtz",
    "status_history": {
        "created": "2023-08-23T16:58:34.361000Z"
    },
    "last_modified": "2023-08-23T16:58:34.487000Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "annotations": null,
    "provenance": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/models/actgan@sha256:ed3d1e4a9c591e707a829b11b6d508624bacb5087688a5ebc9f15462fa11c518",
    "container_image_version": "2.10

INFO: Status is pending. A worker is being allocated to begin running.
INFO: Status is active. A worker has started!
2023-08-23T16:58:54.188347Z  Loading model to worker
2023-08-23T16:59:05.435489Z  Loading ACTGAN model...
2023-08-23T16:59:05.443313Z  Sampling 10 records from conditioning input...
2023-08-23T16:59:05.903361Z  Preparing privacy filters
2023-08-23T16:59:05.903885Z  Loaded 0 privacy filters
2023-08-23T16:59:05.904087Z  Starting privacy filtering
2023-08-23T16:59:05.904296Z  Privacy filtering complete.
2023-08-23T16:59:05.905805Z  Uploading artifacts to your object store...
2023-08-23T16:59:05.952403Z  Upload to your object store is completed.


In [16]:
# Access the completed records. Note that the conditioned column, "workclass",
# contains the exact values we submitted.

with model_run.get_artifact_handle("data") as fin:
    syn_df = pd.read_csv(fin)

syn_df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,37,Private,119905,12th,9,Divorced,Other-service,Not-in-family,White,Female,2,1,35,United-States,<=50K
1,31,Private,95445,9th,4,Widowed,Other-service,Unmarried,Black,Female,0,0,13,United-States,<=50K
2,23,Private,395494,Some-college,10,Never-married,Handlers-cleaners,Other-relative,White,Male,0,0,30,United-States,<=50K
3,38,Private,100086,Some-college,10,Married-civ-spouse,Tech-support,Own-child,White,Female,27,0,36,United-States,<=50K
4,47,Private,91320,HS-grad,9,Widowed,Other-service,Not-in-family,White,Female,0,0,40,United-States,<=50K
5,57,Local-gov,88539,Some-college,14,Married-civ-spouse,Transport-moving,Husband,White,Male,26,0,67,Mexico,>50K
6,56,Local-gov,184208,Some-college,10,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,<=50K
7,32,Local-gov,103284,Some-college,10,Separated,Tech-support,Unmarried,White,Female,3369,0,40,United-States,<=50K
8,43,Local-gov,168403,Assoc-voc,11,Separated,Prof-specialty,Not-in-family,White,Female,0,1,40,United-States,>50K
9,37,Local-gov,105275,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,30,United-States,>50K
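
The two properties described for conditioning (one completed record per partial record, and seed values preserved exactly) can be checked programmatically instead of by eye. The sketch below works on plain lists of dictionaries; `check_conditioning` is a hypothetical helper, and it assumes the completed records come back in the same order as the partial records, as they do in the output above.

```python
from typing import Dict, List, Sequence


def check_conditioning(
    partial: Sequence[Dict[str, object]],
    completed: Sequence[Dict[str, object]],
) -> List[str]:
    """Return a list of problems; an empty list means the seeds were honored."""
    problems: List[str] = []
    if len(partial) != len(completed):
        problems.append(f"expected {len(partial)} records, got {len(completed)}")
    # Compare row by row (assumes output order matches input order).
    for i, (seed, row) in enumerate(zip(partial, completed)):
        for column, value in seed.items():
            if row.get(column) != value:
                problems.append(
                    f"row {i}: {column}={row.get(column)!r}, expected {value!r}"
                )
    return problems


seeds = [{"workclass": "Private"}, {"workclass": "Local-gov"}]
rows = [
    {"age": 37, "workclass": "Private"},
    {"age": 57, "workclass": "Local-gov"},
]
print(check_conditioning(seeds, rows))
```

With the DataFrames from this Notebook, the same check could be called as `check_conditioning(partial_records_df.to_dict("records"), syn_df.to_dict("records"))`.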
