# How to use this notebook
Use this notebook to understand the quality and performance of your synthetic or augmented data on downstream machine learning regression tasks. 

# Installation
Install the Gretel Client to use Gretel's synthetic models and the Gretel Evaluate Regression model. To configure your session, you'll need your API key from the Gretel Console dashboard.

In [None]:
!pip install -U gretel-client

In [None]:
from gretel_client import configure_session

# Use the production API endpoint; api_key="prompt" asks for the key interactively
configure_session(endpoint="https://api.gretel.cloud", api_key="prompt", cache="yes")

# Generate synthetic data, then evaluate it on regression models against real-world data
First, we'll generate synthetic data from a publicly available Dow Jones stock dataset. The prediction target is the percentage return a stock will have in the following week (the "percent_change_next_weeks_price" column). We'll use Gretel's DGAN model to train on the real-world data and generate the synthetic data.

To use the Gretel Evaluate Regression model, you must specify the target column. Optionally, you can change the test holdout, a float giving the fraction of real-world data to hold out for testing the downstream regression models. You can also optionally select which models to use and which metric to optimize for.
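
To make the holdout parameter concrete, here is a minimal sketch of what a holdout fraction of 0.2 means: 20% of the real-world rows are set aside for testing. The `split_holdout` helper is hypothetical and is not part of gretel-client, and it is not a description of how Gretel performs the split internally; it only illustrates the arithmetic.

```python
import random


def split_holdout(rows, holdout=0.2, seed=0):
    """Split rows into (remaining, held_out) using a holdout fraction.

    Hypothetical helper for illustration only; not part of gretel-client.
    """
    if not 0.0 < holdout < 1.0:
        raise ValueError("holdout must be a float strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_holdout = round(len(shuffled) * holdout)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# 750 is an arbitrary row count chosen for the example
remaining, held_out = split_holdout(range(750), holdout=0.2)
print(len(remaining), len(held_out))
```

A larger holdout gives a more stable test score but leaves fewer real rows for everything else.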

In [None]:
#### SUPPORTED MODELS AND METRICS ####
## To restrict the AutoML library to certain regression models, pass a subset of regression_models below.
## By default, all models are used in the AutoML training.
## To change the metric that the regression models optimize for, select one metric from regression_metrics below. The default metric is R2.

regression_models = [
    "lr",
    "lasso",
    "ridge",
    "en",
    "lar",
    "llar",
    "omp",
    "br",
    "ard",
    "par",
    "ransac",
    "tr",
    "huber",
    "kr",
    "svm",
    "knn",
    "dt",
    "rf",
    "et",
    "ada",
    "gbr",
    "mlp",
    "xgboost",
    "lightgbm",
    "dummy"
]

regression_metrics = [
    "mae",
    "mse",
    "rmse",
    "r2",
    "rmsle",
    "mape"
]
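
For readers unfamiliar with the metric names above, the following sketch computes four of them (MAE, MSE, RMSE, and R2) from their standard textbook definitions. The `regression_scores` function is a hypothetical helper written for this illustration. It is an assumption that the AutoML library behind Gretel Evaluate computes these metrics in exactly the same way.

```python
import math


def regression_scores(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R2 for two equal-length lists of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 is undefined when the true values are constant (ss_tot == 0)
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}


# Made-up values, used only to exercise the formulas
scores = regression_scores([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(scores)
```

Lower is better for MAE, MSE, and RMSE, while R2 is better when closer to 1, so the choice of metric changes how results are ranked.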

In [None]:
# Create a project with a name that describes this use case
from gretel_client.projects import create_or_get_unique_project

project = create_or_get_unique_project(name="dow-jones-synthetic-data-downstream-regression-evaluation")

In [None]:
from gretel_client.helpers import poll
from gretel_client.projects.models import read_model_config

# We'll import the Dow Jones stock price dataset from Gretel's public S3 bucket
# You can modify this to select a dataset of your choice
dataset_path = "https://gretel-datasets.s3.amazonaws.com/dow_jones_index/data.csv" 

# Modify the default config to add an extra downstream task.
# We do this by adding an evaluate stanza to our config.
# Regression example, uncomment the additional params to change from defaults.
config = read_model_config("synthetics/time-series")

config["models"][0]["synthetics"]["evaluate"] = {
    # Available downstream tasks are "classification" or "regression"
    "task": "regression",
    # Set to the target you wish to predict -- Change this if you try a different data set!
    "target": "percent_change_next_weeks_price",  # target column for regression prediction
    # "holdout": 0.2,  # default holdout value
    # "models": regression_models,  # default set of models
    # "metric": "r2",  # default metric used for sorting results, choose one
}
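
If you build the evaluate stanza in several places, a small helper can keep the optional settings consistent. The sketch below is one possible approach: `build_evaluate_stanza` and `SUPPORTED_METRICS` are hypothetical names, and the dictionary keys simply mirror the stanza shown above. Options left as `None` are omitted so that the service defaults apply.

```python
SUPPORTED_METRICS = {"mae", "mse", "rmse", "r2", "rmsle", "mape"}


def build_evaluate_stanza(target, holdout=None, models=None, metric=None):
    """Build the evaluate dict, including only the options that were set.

    Hypothetical helper; the keys mirror the stanza shown in this notebook.
    """
    if not target:
        raise ValueError("target column is required")
    stanza = {"task": "regression", "target": target}
    if holdout is not None:
        stanza["holdout"] = float(holdout)
    if models is not None:
        stanza["models"] = list(models)
    if metric is not None:
        if metric not in SUPPORTED_METRICS:
            raise ValueError(f"unsupported metric: {metric!r}")
        stanza["metric"] = metric
    return stanza


stanza = build_evaluate_stanza(
    "percent_change_next_weeks_price", models=["lr", "rf"], metric="rmse"
)
print(stanza)
```

Validating the metric name locally catches typos before a training job is submitted.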

In [None]:
## Train and run the model
## Note: this will both train and run the model to generate synthetic data as well as 
## run the downstream metrics evaluation immediately after

model = project.create_model_obj(
    model_config=config, 
    data_source=dataset_path
)

model.submit_cloud()

poll(model)

# Save all artifacts
model.download_artifacts("/tmp")

# Option 2: BYO synthetic or augmented data to evaluate downstream metrics against real-world data
Already have your synthetic or augmented data? You can use your own CSV or JSON(L) data files in the Gretel Evaluate Regression model. 
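
Before submitting your own files, it can help to confirm that the target column exists in both the synthetic and the real data. The sketch below checks the header row of a local CSV file using only the standard library. `csv_has_column` is a hypothetical helper, it assumes a local file path rather than a URL, and the demo file contents are made up.

```python
import csv
import os
import tempfile


def csv_has_column(path, column):
    """Return True if the header row of a local CSV file contains column."""
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return column in header


# Demo with a throwaway file; the rows below are invented for the example.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("stock,percent_change_next_weeks_price\nAA,1.5\n")
    demo_path = tmp.name

has_target = csv_has_column(demo_path, "percent_change_next_weeks_price")
has_other = csv_has_column(demo_path, "not_a_column")
os.remove(demo_path)
print(has_target, has_other)
```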

In [None]:
# Use the Evaluate SDK with your own data and settings
from gretel_client.evaluation.downstream_regression_report import DownstreamRegressionReport

# Params
# Synthetic data, REQUIRED for evaluate model
data_source = ""  # TODO: set to the path or URL of your synthetic data

# Real data, REQUIRED for evaluate model
ref_data = "https://gretel-datasets.s3.amazonaws.com/dow_jones_index/data.csv" 

# Target to predict, REQUIRED for evaluate model
target = "percent_change_next_weeks_price"  # numeric column in the Dow Jones data; change this for a different dataset

# Default holdout value
# test_holdout = 0.2

# Supply a subset if you do not want all of these, default is to use all of them
# models = regression_models

# Metric used to order results; the default for regression is "r2" (R2).
# metric = "r2"

# Create a downstream regression report
evaluate = DownstreamRegressionReport(
    # project=None,  # Create a temp project
    target=target, 
    data_source=data_source, 
    ref_data=ref_data,
    # holdout=test_holdout,
    # models=models,
    # metric=metric,
    # output_dir=None,
    # runner_mode="cloud",
)

evaluate.run() # this will wait for the job to finish

# This will return the full report JSON details.
evaluate.as_dict

# This will return the full HTML contents of the report.
evaluate.as_html

# Returns a dictionary representation of the top level report scores.
evaluate.peek()
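
To keep the results after the notebook session ends, you may want to write the report to disk. The following is a minimal sketch: `save_report` is a hypothetical helper, the file names are arbitrary, and it assumes the report dictionary is JSON-serializable.

```python
import json
from pathlib import Path


def save_report(report_dict, report_html, out_dir="."):
    """Write the report JSON and HTML to out_dir and return both paths.

    Hypothetical helper; the file names are arbitrary.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / "regression_report.json"
    html_path = out / "regression_report.html"
    json_path.write_text(json.dumps(report_dict, indent=2), encoding="utf-8")
    html_path.write_text(report_html, encoding="utf-8")
    return json_path, html_path
```

After `evaluate.run()` completes, a call such as `save_report(evaluate.as_dict, evaluate.as_html, "reports")` would store both formats side by side.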