<a target="_blank" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/gretel-tuner-intro-tutorial.ipynb"> 
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

# 🧹 Hyperparameter Sweeps with **Gretel Tuner**


<br>

<center><img src="https://gretel-public-website.s3.us-west-2.amazonaws.com/misc/sweep_the_params.jpg" alt="Gretel" width="500"/></center>

<br>

In this tutorial, we will demonstrate how to tune the hyperparameters of a Gretel Synthetics model using **Gretel Tuner**.

## 💿 Installation

- The tuner requires additional dependencies beyond the minimal requirements of [gretel_client](https://github.com/gretelai/gretel-python-client).

- To install the tuner along with the client, add the `[tuner]` option to the pip install command:

In [None]:
%%capture
!pip install "gretel-client[tuner]"

## 🛜 Configure your Gretel session

- The [`Gretel` object](https://docs.gretel.ai/guides/high-level-sdk-interface/the-gretel-object) provides a high-level interface for streamlining interactions with Gretel's APIs.

- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).

- Running the cell below will prompt you for your Gretel API key, which you can retrieve from the [Gretel Console](https://console.gretel.ai/users/me/key).

- With `validate=True`, your login credentials will be validated immediately at instantiation.

In [None]:
from gretel_client import Gretel

gretel = Gretel(
    project_name="tuner-intro-tutorial",
    api_key="prompt",
    validate=True,
)

In [None]:
# @title 🗂️ Pick a tabular data source 👇 { display-mode: "form" }
# @markdown Run this cell to set the `data_source` path.


dataset_path_dict = {
    "adult income in the USA (14000 records, 15 fields)": "https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv",
    "hospital length of stay (9999 records, 18 fields)": "https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-synthetic-healthcare.csv",
    "customer churn (7032 records, 21 fields)": "https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/monthly-customer-payments.csv"
}

data_source = "adult income in the USA (14000 records, 15 fields)" # @param ["adult income in the USA (14000 records, 15 fields)", "hospital length of stay (9999 records, 18 fields)", "customer churn (7032 records, 21 fields)"]
data_source = dataset_path_dict[data_source]

## ⚙️ Define the tuner configuration

* The tuner's main settings live in a single config, which can be passed as a YAML string, a YAML file path, or a dict.
* The tuner config follows the same format as the model section of the associated Gretel model config, with the following differences:
  * A `base_config` parameter is required to define the model and its default parameters. Its value can be a name from the [gretel-blueprints](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics) repo or a config file path.
  * An optional `metric` parameter selects the Gretel metric to optimize during the hyperparameter sweeps. For tabular models, valid metrics are:
    * `synthetic_data_quality_score` (default)
    * `field_correlation_stability`
    * `principal_component_stability`
    * `field_distribution_stability`
  * Instead of setting the model parameter values directly, you specify how the tuner should sample them. Sampling options include:
    * `choices` (sample from a discrete list of choices)
    * `int_range` (sample integers over a uniform range)
    * `float_range` (sample floats over a uniform range)
    * `log_range` (sample floats over a log-uniform range)
    * `fixed` (explicitly fix the parameter value)

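To make the sampling options concrete, the sketch below imitates each one with Python's standard `random` module. It only illustrates the distributions and is not how the tuner is implemented: the helper `sample_param` is hypothetical (it is not part of `gretel_client`), and treating `int_range` as inclusive at both ends is an assumption.

```python
import math
import random


def sample_param(spec, rng):
    """Draw one value for a spec such as {"log_range": [1e-5, 1e-3]}.

    Hypothetical helper for illustration only; not part of gretel_client.
    """
    kind, arg = next(iter(spec.items()))  # each spec holds one sampling option
    if kind == "fixed":
        return arg
    if kind == "choices":
        return rng.choice(arg)
    low, high = arg
    if kind == "int_range":
        return rng.randint(low, high)  # assumption: both ends are included
    if kind == "float_range":
        return rng.uniform(low, high)
    if kind == "log_range":
        # Uniform in log space, so each order of magnitude is equally likely.
        return math.exp(rng.uniform(math.log(low), math.log(high)))
    raise ValueError(f"unknown sampling option: {kind}")


rng = random.Random(0)
params = {
    "batch_size": {"fixed": 500},
    "epochs": {"choices": [100, 500]},
    "generator_lr": {"log_range": [0.00001, 0.001]},
}
sampled = {name: sample_param(spec, rng) for name, spec in params.items()}
print(sampled)
```

A log-uniform range suits learning rates because values such as 0.00001 and 0.0001 are then as likely to be explored as values near 0.001.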
In [None]:
tuner_config = """
base_config: tabular-actgan

metric: synthetic_data_quality_score

params:

    batch_size:
        fixed: 500

    epochs:
        choices: [100, 500]

    generator_lr:
        log_range: [0.00001, 0.001]

    discriminator_lr:
        log_range: [0.00001, 0.001]

    embedding_dim:
        choices: [64, 128, 256]

    generator_dim:
        choices:
            - [512, 512, 512, 512]
            - [1024, 1024]
            - [1024, 1024, 1024]
            - [2048, 2048]
            - [2048, 2048, 2048]

    discriminator_dim:
        choices:
            - [512, 512, 512, 512]
            - [1024, 1024]
            - [1024, 1024, 1024]
            - [2048, 2048]
            - [2048, 2048, 2048]
"""

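Since the tuner config can also be passed as a dict, the same settings can be written in Python directly. The sketch below mirrors the YAML config key for key (with fewer layer-size choices for brevity). The name `tuner_config_dict` is only illustrative, and the one-to-one mapping between YAML keys and dict keys is an assumption.

```python
# Dict form of the YAML tuner config, written key for key.
# `tuner_config_dict` is an illustrative name, not a gretel_client identifier.
layer_choices = [[512, 512, 512, 512], [1024, 1024], [2048, 2048]]

tuner_config_dict = {
    "base_config": "tabular-actgan",
    "metric": "synthetic_data_quality_score",
    "params": {
        "batch_size": {"fixed": 500},
        "epochs": {"choices": [100, 500]},
        "generator_lr": {"log_range": [0.00001, 0.001]},
        "discriminator_lr": {"log_range": [0.00001, 0.001]},
        "embedding_dim": {"choices": [64, 128, 256]},
        "generator_dim": {"choices": layer_choices},
        "discriminator_dim": {"choices": layer_choices},
    },
}

# Sanity check: every parameter declares exactly one sampling option.
for name, spec in tuner_config_dict["params"].items():
    assert len(spec) == 1, f"{name} must have exactly one sampling option"
print(sorted(tuner_config_dict["params"]))
```

A dict is convenient when the search space is built programmatically, for example when the same list of layer sizes is reused for several parameters.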
## 🏃‍♂️ Run Gretel Tuner

- The [Gretel object](https://docs.gretel.ai/guides/high-level-sdk-interface/the-gretel-object) has a convenience `run_tuner` method, which will run the parameter sweeps in a single command.

- The optional `use_temporary_project` argument (default `False`) is useful if you plan to run a very large number of trials, each of which trains a model. If you use this option, be sure to save the best config, since the project (and hence the best model) is deleted upon completion.

- The tuner submits training jobs to Gretel with different model configurations. While the submitted jobs run remotely in the cloud, the tuner operates locally, starting new jobs as previous training jobs complete.

- Here, we use `n_trials=4`, which is typically insufficient for finding an optimal model. For thorough hyperparameter tuning, we recommend approximately 20-50 trials, depending on how quickly your metric score converges.

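The scheduling pattern described above, where a new job starts as soon as a running one finishes, can be imitated locally with a thread pool. This is a toy simulation, not the tuner's implementation: `fake_training_job` is a hypothetical stand-in that sleeps instead of training, its scores are made up, and `n_jobs` is assumed to cap the number of jobs running at once. It also ignores how each trial's parameters are chosen.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_training_job(trial_number):
    """Stand-in for a remote training job (hypothetical; it only sleeps)."""
    time.sleep(0.05 * (trial_number % 2 + 1))
    return trial_number, 50.0 + 10.0 * trial_number  # made-up score


n_trials, n_jobs = 4, 2
scores = {}

# At most n_jobs jobs run at once; a queued trial starts when a worker frees up.
with ThreadPoolExecutor(max_workers=n_jobs) as pool:
    futures = [pool.submit(fake_training_job, t) for t in range(n_trials)]
    for future in as_completed(futures):
        trial_number, score = future.result()
        scores[trial_number] = score

best_trial = max(scores, key=scores.get)
print(f"best trial: {best_trial} (score {scores[best_trial]})")
```

Under this assumption, raising `n_jobs` shortens the wall-clock time of a sweep without changing the total number of models trained.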
In [None]:
# This call should take ~5-15 minutes to complete.
tuner_results = gretel.run_tuner(
    tuner_config,
    n_trials=4,
    n_jobs=2,
    data_source=data_source
)

## 📈 Visualize the experiment results

- Under the hood, Gretel Tuner uses [Optuna](https://optuna.readthedocs.io/en/stable/index.html) to drive the sampling of hyperparameters.

- This means we can use Optuna's excellent visualization tools to better understand our tuning experiments.

In [None]:
import optuna.visualization as viz

# Plot the optimization metric as a function of trial number.
viz.plot_optimization_history(tuner_results.study)

In [None]:
# Compare the importances of the sampled hyperparameters.
viz.plot_param_importances(tuner_results.study)

## 🧐 Inspect the tuner results

- The tuner returns a results object with information about the best model/config, as well as log data for all trials.
- Note that `best_model_id` will be `None` if you set `use_temporary_project=True`, since the project and its models will be deleted when the tuning job is finished. In this case, you should save the best config, which is stored as a dict in the `tuner_results.best_config` attribute, and use it to train a new model.

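Because `best_config` is a dict, one way to keep it is to write it to a JSON file. The sketch below uses a hypothetical stand-in dict and file name so that it runs on its own. In the notebook you would use `tuner_results.best_config` instead, assuming all of its values are JSON-serializable.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for tuner_results.best_config (illustrative values only).
best_config = {"params": {"epochs": 100, "batch_size": 500, "embedding_dim": 128}}

# Illustrative location; choose a path that outlives the notebook session.
config_path = Path(tempfile.gettempdir()) / "best_tuner_config.json"
config_path.write_text(json.dumps(best_config, indent=2))

# Later, for example in a new session, load the config back as a dict.
restored = json.loads(config_path.read_text())
print(restored == best_config)
```

Saving the config this way matters most with `use_temporary_project=True`, where the config is the only artifact that survives the tuning run.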
In [None]:
tuner_results

In [None]:
# The best model config is the most important attribute.
tuner_results.best_config

In [None]:
# The trial data is stored in a pandas DataFrame.
tuner_results.trial_data

In [None]:
# Fetch the best model's training job results.
# This requires `best_model_id` to be set, so it does not work with `use_temporary_project=True`.
trained = gretel.fetch_train_job_results(tuner_results.best_model_id)

In [None]:
# Inspect the data used to generate Gretel's synthetic data quality report.
df_synth = trained.fetch_report_synthetic_data()
df_synth

In [None]:
# Inspect the full report from the best model.
trained.report.display_in_notebook()