# Gretel Synthetics Tuning With Optuna
* This notebook lets you tune the Gretel synthetic model hyperparameters for several datasets at once.
* It is also set up to run multiple Optuna trials at once using an SQLite database (SQLite comes prepackaged with most operating systems).
* This notebook makes use of our Python module Optuna_Trials.py. In most cases, you won't need to modify this module. It is configured with all the relevant synthetic model hyperparameters and their ranges. If you'd like to change which parameters are tuned or the range of values to tune over, you will need to modify that module.
* This notebook works seamlessly on Linux (for example Ubuntu), but not on a Mac.

## First specify all the options needed in this notebook

In [None]:
# First you'll need to specify the location of a file that lists all the datasets (i.e. training
# filenames) you'd like to tune synthetic models on. The file is a CSV with a "filename" column.
# Here we use a list containing eight popular Kaggle datasets.

dataset_list = "datasets/dataset_list.csv"

In [None]:
# Now you'll need to specify how many processes you'd like to run in parallel for each dataset. As we also process
# datasets in parallel, the total number of processes running in parallel will be dataset count x trial_job_cnt,
# which in this case comes to 8 x 6 = 48. Note that each process has very low CPU impact, as we will be using
# Gretel SDK calls to the cloud to train each model.

trial_job_cnt = 6

In [None]:
# Now you'll need to specify how many trials (i.e. sets of hyperparameters to test out) you'd like each process
# to run. Here we're setting trials_per_job to 5, which means the overall number of trials per dataset will be
# 6 x 5 = 30. This is typically enough trials for Optuna to narrow in on a good hyperparameter set.

trials_per_job = 5
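
The arithmetic behind these two settings can be sanity-checked before launching anything. The sketch below is only illustrative: `dataset_cnt` is a name introduced here, set to the eight datasets mentioned in the text, and the other two values repeat the settings from the cells above.

```python
# Settings repeated from the cells above. dataset_cnt is an illustrative
# name (an assumption), set to the eight datasets mentioned in the text.
dataset_cnt = 8
trial_job_cnt = 6
trials_per_job = 5

# Every dataset gets trial_job_cnt worker processes, all running at once.
total_processes = dataset_cnt * trial_job_cnt

# Each worker runs trials_per_job trials, so one study sees this many trials.
trials_per_dataset = trial_job_cnt * trials_per_job

print(f"{total_processes} parallel processes, {trials_per_dataset} trials per dataset")
```

If you change the number of datasets or either setting, rerunning this gives a quick estimate of how many cloud training jobs will be in flight at once.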

In [None]:
# Now set a base name you'd like to use for each study. There will be a total of eight studies, since we have
# eight datasets. Later, we'll set each study name to the base name you've chosen followed by the dataset number.

study_base_name = "Optuna_Tuning"

In [None]:
# Now you'll need to specify the database location where your trials will be stored. This will enable the running
# of trials in the same study in parallel. Here we're using SQLite as it comes preinstalled with most operating 
# systems. If the database name you specify doesn't already exist, Optuna will create it for you.

storage = "sqlite:///tuning.db" 
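
Because the storage is an ordinary SQLite file, you can confirm that it exists and has been populated using only Python's standard library. The helper below is a minimal sketch, not part of Optuna or Gretel: `list_tables` is a hypothetical name, and it assumes the storage URL uses the `sqlite:///path` form shown above.

```python
import sqlite3

def list_tables(storage_url):
    """Return the table names in the SQLite file behind a sqlite:/// URL."""
    prefix = "sqlite:///"
    if not storage_url.startswith(prefix):
        raise ValueError("expected a URL of the form sqlite:///path/to/file.db")
    db_path = storage_url[len(prefix):]
    # Open read-only so a typo in the path does not create an empty database.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables(storage)` once the first study has been created should return a non-empty list. If the file does not exist yet, the read-only open raises `sqlite3.OperationalError` instead of silently creating it.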

In [None]:
# Now specify the optimization algorithm you'd like Optuna to use. Here, we're choosing Optuna's default, the TPE
# (Tree-structured Parzen Estimator) algorithm. If you're just starting out, a good rule of thumb is: if you have
# a lot of computing resources, use Random Search; otherwise use TPE. You can read more about Optuna sampling
# algorithms here:
# https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/003_efficient_optimization_algorithms.html#sphx-glr-tutorial-10-key-features-003-efficient-optimization-algorithms-py
from optuna.samplers import TPESampler

sampler = TPESampler()

## Install necessary packages

In [None]:
%%capture
!pip install -U gretel-client

In [None]:
# Install Optuna
!pip install optuna

## Specify your Gretel API key

In [None]:
from getpass import getpass

api_key = getpass(prompt="Enter Gretel API key: ")

## Load the plethora of visualization options available in Optuna

In [None]:
from optuna.visualization import plot_contour
from optuna.visualization import plot_edf
from optuna.visualization import plot_intermediate_values
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_slice

## Grab the default synthetic config file

In [None]:
from smart_open import open
import yaml

with open("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/config_templates/gretel/synthetics/default.yml", 'r') as stream:
    config = yaml.safe_load(stream) 
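
The loaded config is a nested dictionary, and the study-creation function later in this notebook reads the hyperparameters from `config['models'][0]['synthetics']['params']`. The sketch below shows that lookup on a stand-in dictionary. The helper name `default_trial_params`, the `TUNED_PARAMS` tuple, and every value in `demo_config` are illustrative assumptions; the values are placeholders, not the real Gretel defaults.

```python
# Names of the hyperparameters this notebook enqueues as its first trial.
TUNED_PARAMS = (
    "vocab_size", "reset_states", "rnn_units",
    "learning_rate", "gen_temp", "dropout_rate",
)

def default_trial_params(config, names=TUNED_PARAMS):
    """Pick the named hyperparameters out of a synthetics config dict."""
    params = config["models"][0]["synthetics"]["params"]
    missing = [name for name in names if name not in params]
    if missing:
        raise KeyError(f"config is missing hyperparameters: {missing}")
    return {name: params[name] for name in names}

# Stand-in config with placeholder values (NOT the real Gretel defaults),
# shaped like the nested dict the notebook indexes into.
demo_config = {
    "models": [{"synthetics": {"params": {
        "vocab_size": 1000, "reset_states": False, "rnn_units": 64,
        "learning_rate": 0.01, "gen_temp": 1.0, "dropout_rate": 0.2,
        "epochs": 10,
    }}}]
}

demo_params = default_trial_params(demo_config)
print(demo_params)
```

Checking for missing keys up front gives a clearer error than a bare `KeyError` deep inside a study if the template's layout ever changes.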


## Define our function that will initiate an Optuna study
* This function uses Optuna's enqueue_trial method to queue up the Gretel synthetics default set of model hyperparameters. This is a good starting point for Optuna. You can use this method to queue up as many parameter settings as you'd like. Optuna will first try the hyperparameter sets you've queued up and then move on to using its optimization algorithm to search for other potential sets.

In [None]:
import subprocess
import sys

import optuna

def create_study(study_name, dataset, trial_job_cnt, trials_per_job, api_key, storage, sampler):

    study = optuna.create_study(study_name=study_name, storage=storage, sampler=sampler, direction="maximize")

    # Tell Optuna to start with our default config settings. This will be your Trial 0.
    # (config is the default synthetic config loaded earlier in this notebook.)
    default_params = config['models'][0]['synthetics']['params']
    tuned = ["vocab_size", "reset_states", "rnn_units", "learning_rate", "gen_temp", "dropout_rate"]
    study.enqueue_trial({name: default_params[name] for name in tuned})

    # Now initiate "trial_job_cnt" processes for this study, each running "trials_per_job" trials.
    # sys.executable runs the workers with the same Python interpreter as this notebook.
    trial_cnt = str(trials_per_job)
    for _ in range(trial_job_cnt):
        subprocess.Popen([sys.executable, "src/Optuna_Trials.py", study_name, trial_cnt, dataset, api_key, storage])

    return study
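
The function above starts its worker processes and returns without waiting for them. If you would rather know when the workers have exited, one option is to keep the `Popen` handles and wait on them. The sketch below uses only the standard library; `launch_workers`, `wait_for_workers`, and the stand-in `demo_cmd` are hypothetical names, and the stand-in command simply exits immediately instead of running Optuna_Trials.py.

```python
import subprocess
import sys

def launch_workers(cmd, worker_cnt):
    """Start worker_cnt copies of cmd and return their Popen handles."""
    return [subprocess.Popen(cmd) for _ in range(worker_cnt)]

def wait_for_workers(workers):
    """Block until every worker exits and return the exit codes in order."""
    return [worker.wait() for worker in workers]

# Stand-in command so the sketch runs anywhere; in this notebook it would be
# the Optuna_Trials.py command line built inside create_study.
demo_cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]

workers = launch_workers(demo_cmd, worker_cnt=3)
exit_codes = wait_for_workers(workers)
print(exit_codes)
```

A non-zero exit code points to a worker that crashed, which is worth checking before you read too much into a study with fewer completed trials than expected.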
    

## Read in your datasets and start tuning!

In [None]:
import pandas as pd

datasets = pd.read_csv(dataset_list)

studies = []

for i in range(len(datasets)):
    dataset = datasets.loc[i]["filename"]
    study_name = study_base_name + str(i)
    studies.append(create_study(study_name, dataset, trial_job_cnt, trials_per_job, api_key, storage, sampler))

## Monitor your tuning as it progresses

In [None]:
# Track each trial's state and SQS as trials complete

for i, study in enumerate(studies):
    print("Study " + str(i))
    for trial in study.trials:
        # state is an enum member; .name gives e.g. "COMPLETE" or "RUNNING"
        state = trial.state.name
        if state == "COMPLETE":
            sqs = trial.values[0]
            print("\tTrial " + str(trial.number) + " has state " + state + " and SQS " + str(sqs))
        else:
            print("\tTrial " + str(trial.number) + " has state " + state)
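
With many trials per study, a per-state tally is often easier to scan than one line per trial. Optuna trial states are enum members, so their `name` attribute gives the label. The sketch below counts states with `collections.Counter`; `count_states`, `DemoState`, and `demo_trials` are hypothetical stand-ins so the example runs without a live study. With a real study you would pass `study.trials` instead.

```python
from collections import Counter
from enum import Enum
from types import SimpleNamespace

def count_states(trials):
    """Count trials by the name of their state (e.g. COMPLETE, RUNNING)."""
    return Counter(trial.state.name for trial in trials)

# Stand-ins so the sketch runs without a live study. DemoState and
# demo_trials are hypothetical; they only mimic objects with a .state enum.
class DemoState(Enum):
    RUNNING = 0
    COMPLETE = 1
    FAIL = 3

demo_trials = [
    SimpleNamespace(state=DemoState.COMPLETE),
    SimpleNamespace(state=DemoState.COMPLETE),
    SimpleNamespace(state=DemoState.RUNNING),
    SimpleNamespace(state=DemoState.FAIL),
]

state_counts = count_states(demo_trials)
print(dict(state_counts))
```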
            

In [None]:
# You can look graphically at the optimization history of a study while waiting for it to complete.

study = studies[0]
plot_optimization_history(study)

## Analyze your results

In [None]:
# When your trials are complete, you can use this cell to gather all results into a dataframe.
# Remember: trial 0 is the Gretel synthetic default config.

rows = []

# Loop through each study (dataset)
for i, study in enumerate(studies):
    best_trial = study.best_trial.number
    dataset = datasets.loc[i]["filename"]

    # Loop through each trial in the study
    for trial in study.trials:
        params = trial.params
        rows.append({
            "study": i,
            "trial": trial.number,
            "best": trial.number == best_trial,
            "state": trial.state.name,
            # Trials that did not complete have no objective value; record 0 for them
            "sqs": trial.values[0] if trial.values else 0,
            "vocab_size": params['vocab_size'],
            "rnn_units": params['rnn_units'],
            "dropout_rate": round(params['dropout_rate'], 4),
            "gen_temp": round(params['gen_temp'], 4),
            "learning_rate": round(params['learning_rate'], 4),
            "reset_states": params['reset_states'],
            "dataset": dataset,
        })

# Gather all results into a dataframe
df_results_studies = pd.DataFrame(rows)

# Show trial state counts for each study. Note, it's typical to have a few Optuna errors.
df_results_studies.groupby(['study', 'state']).size()

In [None]:
# Just look at the best run for each study (i.e. dataset)
df_results_studies[df_results_studies["best"]]

In [None]:
# Look at the top scoring runs for a specific study
df_results_studies[df_results_studies["study"] == 1].sort_values(by='sqs', ascending=False).head(30)

In [None]:
# Look at the parameter importance for a specific study
study = studies[0]
plot_param_importances(study)

In [None]:
# Plot the parameter relationships in a study as a slice plot.
# This shows the trial number as the color, so you can see the tuning homing in on a range.

study = studies[0]
plot_slice(study)

In [None]:
# You can use the contour plot to look at the relationship between parameters and the objective value

study = studies[0]
plot_contour(study)

In [None]:
# The parallel coordinate plot can be insightful. Each line is a trial.
# Note: because vocab_size is categorical in Optuna, its values aren't shown in numeric order.

study = studies[0]
plot_parallel_coordinate(study)

In [None]:
# The params field is optional but useful if you want to see two params side by side

study = studies[0]
plot_parallel_coordinate(study, params=["rnn_units", "dropout_rate"])

In [None]:
# Plot a study's empirical distribution function (EDF)

study = studies[0]
plot_edf(study)

In [None]:
# Here's how you look at the nitty gritty of what comes back from Optuna

study = studies[0]
study.get_trials()

In [None]:
# A quick way to get a study's best trial
study.best_trial.number

In [None]:
# Look at a specific trial's state
study.trials[5].state

In [None]:
# Look at a specific trial's params
study.trials[20].params

In [None]:
# Look at a specific trial's SQS (optimization function) value (note it's a list)
study.trials[20].values

In [None]:
# Examples of how to access a trial's params
study.trials[0].params['vocab_size']
study.trials[0].params['rnn_units']
study.trials[0].params['dropout_rate']
study.trials[0].params['gen_temp']
study.trials[0].params['learning_rate']
study.trials[0].params['reset_states']

In [None]:
# Here's an example of how you would delete a study if you ever need to,
# but don't do this until you're fully done analyzing it.

optuna.delete_study(study_name="Optuna_Tuning0", storage=storage)