# Gretel Hybrid on Google Cloud

This notebook walks you through creating synthetic data using Gretel Hybrid on Google Cloud. Before you can use this notebook, you need a Gretel Hybrid cluster set up in your Google Cloud environment.

To set up Gretel Hybrid on Google Cloud, please see our documentation:

https://docs.gretel.ai/guides/environment-setup/running-gretel-hybrid

In [2]:
%%capture

!pip install -U "gretel-client[gcp]"

In [None]:
# Set the following variables.


# NOTE: This bucket is the same as the SINK BUCKET from this Hybrid setup step: 
# https://docs.gretel.ai/guides/environment-setup/running-gretel-hybrid/gcp-setup#create-gcs-buckets
#
# This bucket will store:
# 1) Training data, which will be uploaded directly from the Gretel Client
# 2) Artifacts such as the generated synthetic data, reports, and logs
GCS_BUCKET = "gs://your-bucket-name"

# Set the name of your Google Cloud Project
GOOGLE_PROJECT = "your-gcp-project-name"

# This project should have already been created in Gretel
GRETEL_PROJECT = "your-gretel-project-name"

# Set which Gretel model you want to use
# https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics
# You can set the filename of any blueprint template below with a "synthetics/" prefix.
GRETEL_MODEL = "synthetics/tabular-actgan"

# If using a GCP service account for GCS access, set the absolute path to the JSON file here
GOOGLE_CREDS = "/path/to/gcp/creds.json"
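
A quick sanity check on the values above can catch typos before any job is submitted. The sketch below is illustrative only: `check_settings` is a hypothetical helper, not part of the Gretel client, and its rules simply mirror the comments in the cell above (a `gs://` bucket URI, a `synthetics/` model prefix, and an absolute credentials path when one is used).

```python
import os
from typing import List, Optional


def check_settings(
    gcs_bucket: str,
    gretel_model: str,
    google_creds: Optional[str] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means the values look plausible.

    Hypothetical helper: it only checks the shape of the strings and does not
    contact Google Cloud or Gretel Cloud.
    """
    problems = []
    # The bucket is expected as a full URI with a non-empty bucket name.
    if not gcs_bucket.startswith("gs://") or len(gcs_bucket) <= len("gs://"):
        problems.append("GCS_BUCKET should look like gs://your-bucket-name")
    # Blueprint templates are referenced with a "synthetics/" prefix.
    if not gretel_model.startswith("synthetics/"):
        problems.append('GRETEL_MODEL should start with "synthetics/"')
    # The credentials path only matters when a service account key is used.
    if google_creds is not None and not os.path.isabs(google_creds):
        problems.append("GOOGLE_CREDS should be an absolute path")
    return problems


# Example with the placeholder values from this notebook:
for problem in check_settings("gs://your-bucket-name", "synthetics/tabular-actgan"):
    print(problem)
```

In the notebook you would call `check_settings(GCS_BUCKET, GRETEL_MODEL, GOOGLE_CREDS)` and fix anything it reports before continuing.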

# Authenticate with Google Cloud

*NOTE*: If you create a service account, the steps below also grant it Vertex AI permissions so that the generated synthetic data can easily be used with Vertex AI APIs.

If you are using *Vertex Notebooks*, your environment is already authenticated. Skip this step.

If you are using *Colab*, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

Otherwise, follow these steps:

- In the Cloud Console, go to the Create service account key page.

- Click Create service account.

- In the Service account name field, enter a name, and click "Create and Continue".

- In the Grant this service account access to project section, click the Role drop-down list. Type "Vertex AI" into the filter box, and select Vertex AI Administrator.

- Type "Storage Object Admin" into the filter box, and select Storage Object Admin.

- Click Create. A JSON file that contains your key downloads to your local environment.

Set the `GOOGLE_CREDS` variable at the top of this notebook to the path of your service account key, then run the cell below, which exports that path as the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, this branch exports the service
    # account key path stored in GOOGLE_CREDS (set earlier in this notebook) so
    # that Google Cloud client libraries can authenticate your GCP account.
    elif not os.getenv("IS_TESTING"):
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = GOOGLE_CREDS
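
When authenticating with a service account key, it can help to confirm that the downloaded file really is a key before relying on it. The following is a minimal sketch, assuming the standard key layout in which the JSON object carries `"type": "service_account"` along with `client_email` and `private_key` fields; `describe_service_account_key` is a hypothetical helper and it does not contact Google Cloud.

```python
import json


def describe_service_account_key(path: str) -> str:
    """Return the service account email found in a key file, or raise ValueError.

    Hypothetical helper: it inspects the JSON structure only and does not
    verify that the key is still valid.
    """
    with open(path, encoding="utf-8") as fin:
        key = json.load(fin)
    if not isinstance(key, dict) or key.get("type") != "service_account":
        raise ValueError(f"{path} does not look like a service account key")
    missing = [name for name in ("client_email", "private_key") if name not in key]
    if missing:
        raise ValueError(f"{path} is missing fields: {', '.join(missing)}")
    return key["client_email"]
```

For example, `print(describe_service_account_key(GOOGLE_CREDS))` would show which service account the notebook is about to use.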

Next, verify that we can access the GCS bucket:

In [None]:
!gcloud config set project {GOOGLE_PROJECT}
!gsutil ls -al $GCS_BUCKET

# Authenticate with Gretel Cloud

This step will configure your Gretel Client to submit job _requests_ to Gretel Cloud. Once a job _request_ is sent to Gretel Cloud, the Hybrid cluster will download the job request _metadata_ and schedule the job to run on the Hybrid cluster in Google Cloud.

In [None]:
from gretel_client import configure_session

configure_session(
  api_key="prompt", # for Notebook environments
  validate=True,
  clear=True,
  default_runner="hybrid",
  artifact_endpoint=GCS_BUCKET
)
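
For unattended runs, where an interactive prompt is not practical, one option is to read the API key from an environment variable and fall back to `"prompt"` otherwise. This is a sketch under assumptions: `resolve_api_key` is a hypothetical helper, and the variable name `GRETEL_API_KEY` is used here as an illustrative default that you can change.

```python
import os


def resolve_api_key(env_var: str = "GRETEL_API_KEY") -> str:
    """Return the key stored in env_var, or "prompt" to ask for it interactively.

    Hypothetical helper; the environment variable name is an assumption.
    """
    value = os.environ.get(env_var, "").strip()
    return value or "prompt"


# Possible use in place of the hard-coded "prompt" in the cell above:
# configure_session(
#     api_key=resolve_api_key(),
#     validate=True,
#     clear=True,
#     default_runner="hybrid",
#     artifact_endpoint=GCS_BUCKET,
# )
```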

# Create a Gretel Model

This step requests a model creation job and queues it in Gretel Cloud. The Gretel Hybrid cluster in Google Cloud then downloads the request metadata and begins training the model.

In [None]:
import pandas as pd

from gretel_client import get_project
from gretel_client.helpers import poll

gretel_project = get_project(name=GRETEL_PROJECT)

In [None]:
training_df = pd.read_csv("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv")
training_df.head()

In [None]:
gretel_model = gretel_project.create_model_obj(model_config=GRETEL_MODEL, data_source=training_df)
gretel_model = gretel_model.submit()
print(f"Gretel Model ID submitted for Hybrid, see project here: {gretel_project.get_console_url()}")

In [None]:
poll(gretel_model)

# Preview Synthetic Data

As part of the model training process, a sample of synthetic data is created. You can explore that data in the cell below.

In [None]:
# If you ever need to restore your Gretel Model object, you can do so like this:

# gretel_model = gretel_project.get_model("64de615d5c7248c58cc50247")

# Next we look at the data that was generated as part of model training
with gretel_model.get_artifact_handle("data_preview") as fin:
    syn_df = pd.read_csv(fin)
    
syn_df.head()

# Explore the Synthetic Quality Report

This will download the full HTML of the Gretel Synthetic Quality Report.

In [None]:
from IPython.display import display, HTML

with gretel_model.get_artifact_handle("report") as fin:
    html_contents = fin.read().decode()

In [None]:
display(HTML(html_contents), metadata=dict(isolated=True))
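
If you want to keep the report or share it outside the notebook, the HTML string can also be written to disk. A minimal sketch follows; `save_report` and the default file name `gretel_report.html` are hypothetical choices, not part of the Gretel client.

```python
from pathlib import Path


def save_report(html: str, directory: str = ".", filename: str = "gretel_report.html") -> Path:
    """Write the report HTML to directory/filename and return the resulting path.

    Hypothetical helper; the default file name is an arbitrary choice.
    """
    target = Path(directory) / filename
    # Create the target directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target
```

Calling `save_report(html_contents)` after the cells above would write the report next to the notebook, where it can be opened in any browser.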

# Generate More Data

With the Gretel Model created, you can run inference from that model as many times as you wish. You may either request a total number of records to generate or, depending on the model, use conditioning. Conditioning allows you to provide partial values as an input dataset, and the model then completes the remainder of each record.

In [None]:
# Generate more records based on record count

model_run = gretel_model.create_record_handler_obj(params=dict(num_records=142))
model_run.submit()
poll(model_run)

In [None]:
# You can always retrieve a model run with the below:

# model_run = gretel_model.get_record_handler("64df7fb5f62d5b782416f0d2")

# Retrieve newly generated data:

with model_run.get_artifact_handle("data") as fin:
    syn_df = pd.read_csv(fin)

print(f"Total records generated: {len(syn_df)}")
syn_df.head()

# Generate Records With Conditioning

In this mode of generation, you provide a dataset of partial records, and the model completes each record for you. If you provide a file of 10 partial records, you will receive 10 complete records at the end of the job. This mode of generation is only available with the Tabular ACTGAN model.

In [None]:
# First create a dataset of partial records that you want the model to complete.

partial_records_df = pd.DataFrame(
    ["Private"] * 5 + ["Local-gov"] * 5,
    columns=["workclass"]
)

partial_records_df

In [None]:
# Next run the model, providing the conditioning DF as the input data source

model_run = gretel_model.create_record_handler_obj(data_source=partial_records_df)
model_run.submit()
poll(model_run)

In [None]:
# Access the completed records. Note that the conditioned column, "workclass",
# contains the exact values we submitted.

with model_run.get_artifact_handle("data") as fin:
    syn_df = pd.read_csv(fin)
    
syn_df
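
The two properties described above (one completed record per partial record, and unchanged values in the conditioned column) can be checked programmatically. The sketch below is illustrative: `seed_values_preserved` is a hypothetical helper that works on plain lists of dictionaries and compares value counts, so it makes no assumption about the order in which records are returned.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def seed_values_preserved(
    seed_rows: List[Dict[str, Any]],
    completed_rows: List[Dict[str, Any]],
    columns: Iterable[str],
) -> bool:
    """Return True if every seed value reappears in the completed records.

    Hypothetical helper: it checks that both inputs have the same number of
    rows and that each conditioned column holds the same values with the same
    frequencies, regardless of row order.
    """
    if len(seed_rows) != len(completed_rows):
        return False
    for column in columns:
        submitted = Counter(row[column] for row in seed_rows)
        returned = Counter(row.get(column) for row in completed_rows)
        if submitted != returned:
            return False
    return True


# With the DataFrames from this notebook, one possible call is:
# seed_values_preserved(
#     partial_records_df.to_dict("records"),
#     syn_df.to_dict("records"),
#     ["workclass"],
# )
```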