# Gretel Synthetics Walkthrough

Welcome to the Gretel Synthetics walkthrough! In this tutorial we will take you through the steps of extracting data from Gretel, building a training dataset, creating synthetic data, and validating the new data!

This tutorial assumes you have already created and uploaded data to a [Gretel project](https://console.gretel.cloud).

Let's get started!

## Configuration

- If using Google Colab, we recommend you change to a GPU runtime. From the menu, choose "Runtime" and then choose "Change runtime type".

- Input your Gretel URI string. Just run the cell below (no need to change its contents) and then enter your Gretel URI in the pop-up box when it appears.

In [None]:
import getpass
import os

gretel_uri = os.getenv("GRETEL_URI") or getpass.getpass("Your Gretel URI")

## Steps to create a synthetic dataset


In the code below, we will:
* Install Gretel packages and dependencies
* Optionally connect to the Gretel API and download source data from the project stream
* Automatically build a record validator from the source data
* Train a synthetic model (neural network) on the source data
* Generate `gen_lines` synthetic data records that pass validation
* Create a synthetic data performance report to compare the source and synthetic datasets

In [None]:
%%capture

!pip install -U gretel-client

# NOTE: if you need synthetics, but already have TensorFlow installed (like in Colab) install below
!pip install gretel-synthetics

# NOTE: if you need synthetics AND TensorFlow, use the below
# !pip install gretel-synthetics[tf]

In [None]:
from gretel_client import project_from_uri

project = project_from_uri(gretel_uri)
project.client.install_packages()

## Create Training DataFrame

Here you have one of three options:

1) Download records from Gretel (using your Gretel URI from before)
2) You can provide an absolute path to your own CSV
3) You can create your own code to generate your own DataFrame however you like

By default, we suggest filtering fields based on percent unique and percent missing. We recommend using fields that have no more than 80% uniqueness and are missing no more than 20% of the time. Feel free to adjust these parameters.

If you wish to use all fields, you can omit the returned ``include_fields`` list from the synthetic bundle creation below.
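
To make the two thresholds concrete, here is a minimal pandas sketch of this kind of field filter. It is not the `create_df` implementation: the `filter_fields` helper and the example columns are hypothetical, and it assumes "percent unique" means distinct non-missing values divided by total rows.

```python
import pandas as pd


def filter_fields(df, max_unique_percent=80, max_missing_percent=20):
    """Return the column names that pass both thresholds (illustrative only)."""
    keep = []
    total = len(df)
    if total == 0:
        return keep
    for col in df.columns:
        # Share of rows where the value is missing (NaN / None)
        missing_pct = df[col].isna().sum() / total * 100
        # Assumption: uniqueness = distinct non-missing values / total rows
        unique_pct = df[col].nunique(dropna=True) / total * 100
        if unique_pct <= max_unique_percent and missing_pct <= max_missing_percent:
            keep.append(col)
    return keep


example_df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],               # 100% unique, filtered out
    "state": ["CA", "CA", "NY", "NY", "CA"],  # 40% unique, 0% missing, kept
    "notes": [None, None, "a", None, None],   # 80% missing, filtered out
})

include_fields = filter_fields(example_df)
print(include_fields)
```

Setting both thresholds to 100 in this sketch keeps every column, which mirrors the "set to 100 to include all columns" notes in the cell below.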


In [None]:
from gretel_helpers.synthetics import create_df, SyntheticDataBundle

# NOTE: You can change the first argument to a CSV file of your choice
# to load your own data outside of a Gretel Project. If you want to use the
# entire CSV in this case, set ``num_rows`` to ``None``.

training_df = create_df(
    gretel_uri,  # This can be changed to a CSV path (local Filesystem, S3, etc)
    num_rows=5000,  # set to ``None`` to include all records
    max_unique_percent=80,  # set to 100 to include all columns
    max_missing_percent=20  # set to 100 to include all columns
)

In [None]:
# This is the DataFrame that will be used for training. You can manipulate it before building the bundle.
training_df.head()

## Create a Gretel Synthetic Bundle

Next, we run our bundle automation process. This automates the following actions:

- Automatically detect a field delimiter to be used for the Gretel Synthetics library
- Automatically detect correlations between columns and create batches of column headers for synthesis
- Build data validators that ensure generated records are within a range of boundaries learned from your training data
- Build neural network models
- Utilize AI models to create synthetic data


## Synthetic Configuration

- See [our documentation](https://gretel-synthetics.readthedocs.io/en/stable/api/config.html) for additional config options

In [None]:
# Create the Gretel Synthetics Training / Model Configuration
from pathlib import Path

checkpoint_dir = str(Path.cwd() / "checkpoints")

config_template = {
    "checkpoint_dir": checkpoint_dir,
    "dp": True, # enable differential privacy in training
    "epochs": 15,
    "gen_lines": 100,
    "overwrite": True,
    "save_all_checkpoints": False,
    "vocab_size": 20000
}

In [None]:
bundle = SyntheticDataBundle(
    training_df=training_df,
    delimiter=None, # if ``None``, the delimiter is detected automatically, otherwise you can set it yourself
    auto_validate=True, # build record validators that learn per-column, these are used to ensure generated records have the same composition as the original
    synthetic_config=config_template, # the config for Synthetics
    # sample_cutoff=100000, # if the training DF has more rows than this, we will use this value to sample records for delim detection, header clustering, and data validation
)

In [None]:
bundle.build()

In [None]:
bundle.train()

In [None]:
# optional params:
#
# num_lines=500 will override the synthetic config ``gen_lines``, set whatever number you need
# max_invalid=5000 will override the default invalid line limit that terminates execution, set whatever number you need

bundle.generate()
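
The interplay between `num_lines` and `max_invalid` can be pictured as a loop that keeps sampling until enough valid records are collected, and gives up once too many are rejected. The sketch below is a simplified model of that behavior, not the bundle's actual code. `generate_valid`, `make_sampler`, and `age_in_range` are hypothetical stand-ins, and the exact stopping condition in the library may differ.

```python
from itertools import cycle


def generate_valid(sample_record, is_valid, num_lines, max_invalid):
    """Collect num_lines valid records, giving up after too many rejects."""
    valid, invalid_count = [], 0
    while len(valid) < num_lines:
        record = sample_record()
        if is_valid(record):
            valid.append(record)
        else:
            invalid_count += 1
            # Assumption: stop once the reject count exceeds the limit
            if invalid_count > max_invalid:
                raise RuntimeError(f"Stopped after {invalid_count} invalid records")
    return valid


def make_sampler():
    """Stand-in for a trained model: cycles through fixed candidate records."""
    candidates = cycle([{"age": 150}, {"age": 34}, {"age": -2}, {"age": 61}])
    return lambda: next(candidates)


def age_in_range(record):
    """Stand-in validator: accept ages between 18 and 90."""
    return 18 <= record["age"] <= 90


records = generate_valid(make_sampler(), age_in_range, num_lines=3, max_invalid=10)
print(records)
```

With a low `max_invalid`, the same loop raises instead of returning, which is why raising the limit can help when a model produces many rejected records early on.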

In [None]:
bundle.get_synthetic_df()

## Performance Report

The Performance Report compares the training data to the newly created synthetic data and assesses their statistical similarity. It shows you, both quantitatively and graphically, any differences in within-field distributions as well as in cross-field correlations.
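
To give a feel for the two kinds of comparison, here is a small standard-library sketch. It is not how the Gretel report is computed: the choice of total variation distance and Pearson correlation, the helper names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def distribution(values):
    """Relative frequency of each distinct value in a field."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(source_values, synthetic_values):
    """0.0 means identical distributions, 1.0 means no overlap at all."""
    p, q = distribution(source_values), distribution(synthetic_values)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric fields."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy data standing in for the training and synthetic DataFrames
source = {"state": ["CA", "CA", "NY", "TX"], "age": [25, 32, 47, 51], "income": [40, 52, 75, 80]}
synthetic = {"state": ["CA", "NY", "NY", "TX"], "age": [27, 30, 45, 55], "income": [43, 50, 70, 88]}

# Within-field difference for a categorical field
state_distance = total_variation_distance(source["state"], synthetic["state"])
print(state_distance)

# Cross-field correlation, compared between the two datasets
source_corr = pearson(source["age"], source["income"])
synthetic_corr = pearson(synthetic["age"], synthetic["income"])
print(abs(source_corr - synthetic_corr))
```

Small values for both numbers suggest the synthetic data preserves the field's distribution and the relationship between the two numeric fields.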

In [None]:
bundle.generate_report()