# Evaluate synthetic vs. real data on classification models

### How to use this notebook
Use this notebook to compare how machine learning classifiers perform when they are trained on your synthetic data vs. real data.

This notebook gives you two options: use a Gretel model to generate synthetic data, or bring your own (BYO) synthetic data. In either case, you'll use a Gretel Evaluate task to perform the training and evaluation. After the task completes, you'll see a Gretel Synthetic Data Utility Report, which provides the metrics for each model and a comparison of synthetic vs. real data.

Interested in evaluation on regression models? Check out [the regression notebook](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/downstream_machine_learning_regression_evaluation.ipynb).


### A low-code alternative
You can also try the `Synthesize data + evaluate ML performance` flow in the [Gretel Console](https://console.gretel.ai/use_cases/cards/use-case-downstream-accuracy/projects). This is a low-code alternative that will walk you step-by-step through the evaluation. You can find the Synthetic Data Utility Report at the end of the process in [your Projects list](https://console.gretel.ai/projects).




### Installation
Install the Gretel Client to use Gretel's synthetic models as well as the Gretel Evaluate Classification model. You'll need your API key from the [Gretel Console](https://www.console.gretel.ai) to configure your session.

In [None]:
# Install the latest Gretel Client
%pip install -U gretel-client

In [None]:
# Configure your Gretel session - enter your API key when prompted
from gretel_client import configure_session

configure_session(endpoint="https://api.gretel.cloud", api_key="prompt", cache="yes")


## Try: Generate synthetic data, then evaluate the synthetic data on classifiers against real-world data
First, we'll generate synthetic data using a smaller version of the publicly available [bank marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing), where the task is to predict whether a client will subscribe to a term deposit (yes/no in column "y"). We'll use Gretel's ACTGAN model to train on the real-world data and generate the synthetic data.

To use the Gretel Evaluate model, you must specify the target column. Optionally, you can change the test holdout, a float giving the fraction of real-world data held out for testing the downstream classifiers (the default is 0.2). You can also optionally select which classifiers to use and which metric to optimize for.
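
To make the holdout parameter concrete: a value of 0.2 means that 20% of the real-world rows are set aside and used only for testing. The sketch below illustrates that idea in plain Python. `split_holdout` is a hypothetical helper written for this illustration, and it is not a description of how Gretel Evaluate performs the split internally.

```python
import random


def split_holdout(rows, holdout=0.2, seed=0):
    """Illustrative only: shuffle rows and set aside a `holdout` fraction for testing."""
    if not 0.0 < holdout < 1.0:
        raise ValueError("holdout must be a float strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * holdout))
    # First n_test shuffled rows become the test set, the rest are for training
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 real-world records
train_rows, test_rows = split_holdout(rows, holdout=0.2)
print(len(train_rows), len(test_rows))  # 80 20
```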

In [None]:
#### SUPPORTED MODELS AND METRICS ####
## By default, the autoML library trains all of the classification models listed below.
## To use only some of them, pass a subset of this list instead.
## To change the metric that the classifiers optimize for, select one metric from classification_metrics below. The default metric is acc (accuracy).

classification_models = [
    "lr", 
    "knn", 
    "nb", 
    "dt", 
    "svm", 
    "rbfsvm", 
    "gpc", 
    "mlp", 
    "ridge", 
    "rf", 
    "qda", 
    "ada", 
    "gbc", 
    "lda", 
    "et", 
    "xgboost", 
    "lightgbm", 
    "dummy"
]

shorter_list_classification_models = [
    "nb", 
    "ridge",
    "rbfsvm",
    "knn",
    "xgboost", 
    "ada", 
    "gbc", 
    "mlp",
    "dummy"
]

classification_metrics = [
    "acc",
    "auc",
    "recall",
    "precision",
    "f1",
    "kappa",
    "mcc"
]

First, create a project on Gretel Cloud using the following example project name. Then, notice that the config includes both the synthetic data model and the evaluation model. Note that we're using the Gretel ACTGAN model configuration in the following code.
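
The evaluation part of the config is a small mapping with the keys `task`, `target`, `holdout`, `models`, and `metric`. As a sketch, a local helper could assemble that mapping and catch typos in model or metric names before a job is submitted. `build_evaluate_section` is a hypothetical name, the name sets are copied from the lists in this notebook, and the 0-to-1 bound on `holdout` is an assumption rather than a documented rule.

```python
# Names copied from classification_models and classification_metrics in this notebook
SUPPORTED_MODELS = {
    "lr", "knn", "nb", "dt", "svm", "rbfsvm", "gpc", "mlp", "ridge",
    "rf", "qda", "ada", "gbc", "lda", "et", "xgboost", "lightgbm", "dummy",
}
SUPPORTED_METRICS = {"acc", "auc", "recall", "precision", "f1", "kappa", "mcc"}


def build_evaluate_section(target, holdout=None, models=None, metric=None):
    """Hypothetical helper: build the evaluate mapping, leaving unset options at their defaults."""
    section = {"task": "classification", "target": target}
    if holdout is not None:
        # Assumption: the holdout is a fraction strictly between 0 and 1
        if not 0.0 < holdout < 1.0:
            raise ValueError("holdout must be a float strictly between 0 and 1")
        section["holdout"] = holdout
    if models is not None:
        unknown = sorted(set(models) - SUPPORTED_MODELS)
        if unknown:
            raise ValueError(f"unsupported models: {unknown}")
        section["models"] = list(models)
    if metric is not None:
        if metric not in SUPPORTED_METRICS:
            raise ValueError(f"unsupported metric: {metric!r}")
        section["metric"] = metric
    return section


evaluate_section = build_evaluate_section(
    "y", holdout=0.2, models=["nb", "xgboost", "dummy"], metric="f1"
)
print(evaluate_section)
```

The returned mapping has the same shape as the one assigned to `config["models"][0]["actgan"]["evaluate"]` in the next cell.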

In [None]:
# Create a project with a name that describes this use case
from gretel_client.projects import create_or_get_unique_project

project = create_or_get_unique_project(name="bank-marketing-classification-notebook")

In [None]:
from gretel_client.helpers import poll
from gretel_client.projects.models import read_model_config

# We'll import the bank_marketing_small dataset from Gretel's public S3 bucket
# You can modify this to select a dataset of your choice
dataset_path = "https://gretel-datasets.s3.amazonaws.com/bank_marketing_small.csv"

# We will modify the config for Gretel synthetic models to add an extra downstream Evaluate model and task
# Uncomment the additional params to change from defaults.
config = read_model_config("synthetics/tabular-actgan")

config["models"][0]["actgan"]["evaluate"] = {
    # Available downstream tasks are "classification" or "regression"
    "task": "classification",
    # Set to the target you wish to predict -- Change this if you try a different data set!
    "target": "y",  # yes/no to subscriptions, use a categorical column for classification
    # "holdout": 0.2,  # default holdout value
    # "models": classification_models,  # default set of models
    # "metric": "acc",  # default metric used for sorting results, choose one
}



Now we'll train and run the model. When the job completes, you can find the Gretel Synthetic Data Utility Report in your local `/tmp` folder, or log in to [the Gretel Console](https://console.gretel.ai/projects) and open the `bank-marketing-classification-notebook` project to get all the downloads and see more about the model you trained.
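
Once the download step in the next cell has finished, it can help to check what actually landed in the folder. The sketch below lists files by extension only, because the exact artifact file names are not stated in this notebook. `list_files` and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_files(directory, suffixes=(".html", ".json", ".csv", ".gz")):
    """Return regular files in `directory` whose names end with one of `suffixes`, sorted by name."""
    return sorted(
        (p for p in Path(directory).iterdir()
         if p.is_file() and p.name.endswith(tuple(suffixes))),
        key=lambda p: p.name,
    )


# Demo on a throwaway folder with placeholder files (not real Gretel artifact names)
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("placeholder_report.html", "placeholder_data.csv", "notes.txt"):
        (Path(demo_dir) / name).write_text("demo")
    found_names = [p.name for p in list_files(demo_dir)]

print(found_names)  # ['placeholder_data.csv', 'placeholder_report.html']
```

After `model.download_artifacts("/tmp")` has run, `list_files("/tmp")` would show the matching files in that folder.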

In [None]:
## Train and run the model
## Note: this will both train and run the model to generate synthetic data as well as 
## run the downstream metrics evaluation immediately after

model = project.create_model_obj(
    model_config=config, 
    data_source=dataset_path
)

model.submit_cloud()

poll(model)

# Save all artifacts
model.download_artifacts("/tmp")

## Or: BYO synthetic or augmented data to evaluate downstream classification against real-world data
Already have your synthetic or augmented data? You can use your own CSV or JSON(L) data files in the Gretel Evaluate Classification model. 
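
Before submitting your own files, a quick local check can confirm that the target column exists in both datasets and show where the column sets differ. This is a minimal sketch for CSV input only. `check_csv_headers` is a hypothetical helper, the rows are toy data, and whether Gretel Evaluate treats mismatched columns as an error is not stated here, so the function only reports differences.

```python
import csv
import io


def check_csv_headers(synthetic_file, real_file, target):
    """Hypothetical pre-flight check: compare the header rows of two open CSV files."""
    synthetic_cols = next(csv.reader(synthetic_file))
    real_cols = next(csv.reader(real_file))
    problems = []
    if target not in synthetic_cols:
        problems.append(f"target {target!r} missing from synthetic data")
    if target not in real_cols:
        problems.append(f"target {target!r} missing from real data")
    only_real = sorted(set(real_cols) - set(synthetic_cols))
    only_synthetic = sorted(set(synthetic_cols) - set(real_cols))
    if only_real:
        problems.append(f"columns only in real data: {only_real}")
    if only_synthetic:
        problems.append(f"columns only in synthetic data: {only_synthetic}")
    return problems


# Toy in-memory files standing in for the synthetic and real-world CSVs
synthetic = io.StringIO("age,job,y\n41,admin,no\n")
real = io.StringIO("age,job,balance,y\n35,technician,1200,yes\n")
problems = check_csv_headers(synthetic, real, target="y")
print(problems)  # ["columns only in real data: ['balance']"]
```

With files on disk, the same function can be called on handles opened with `open(path, newline="")`.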

In [None]:
# Use the Evaluate SDK with your custom config
from gretel_client.evaluation.downstream_classification_report import DownstreamClassificationReport

# Create a project with a name that describes this use case
# When you go to your Gretel Console, you can find this project and also download the report after the evaluation finishes
from gretel_client.projects import create_or_get_unique_project
project = create_or_get_unique_project(name="evaluate-bank-classification-notebook-2") 

# Params
# This is the synthetic data, REQUIRED for evaluate model
# Download this sample bank marketing synthetic dataset: https://drive.google.com/uc?export=download&id=1s9nT7be3NFC1HrpEIgIj2tKib8ftoAC_
# And make sure your file path is correct
data_source = "/Users/[MY_USERNAME]/Downloads/bank-marketing-synthetic.csv"

# This is the real-world data, REQUIRED for evaluate model
ref_data = "https://gretel-datasets.s3.amazonaws.com/bank_marketing_small.csv"

# Target to predict, REQUIRED for evaluate model
target = 'y'  # prediction field for whether a user will opt in

# Default holdout value
# test_holdout = 0.2

# Supply a subset if you do not want all of these, default is to use all of them
# models = classification_models

# Metric to use for ordering results, default is "acc" (Accuracy) for classification
# metric = "acc"

# Create a downstream classification report
evaluate = DownstreamClassificationReport(
    project=project,
    target=target, 
    data_source=data_source, 
    ref_data=ref_data,
    # holdout=test_holdout,
    # models=models,
    # metric=metric,
    # output_dir = '/tmp',
    # runner_mode="cloud",
)


In [None]:
## Run and view the Evaluate Classification Report

evaluate.run() # this will wait for the job to finish


In [None]:
# This will return the full HTML contents of the report.
evaluate.as_html

In [None]:
# This will return the full report JSON details.
evaluate.as_dict

In [None]:
# Returns a dictionary representation of how well the top 3 models trained on synthetic data performed against the 
# top 3 models trained on real-world data. 'Value' is the synthetic or augmented data's performance against real-world data (averaged)
evaluate.peek()

## Results
To see the Gretel Synthetic Data Utility Report and the results of your evaluation, go to your [Projects list](https://console.gretel.ai/projects) and look for the projects titled `bank-marketing-classification-notebook` or `evaluate-bank-classification-notebook-2`. There you can download both the Gretel Synthetic Data Quality Report and the Synthetic Data Utility Report.
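
To keep a local copy next to the Console downloads, the values returned by `evaluate.as_html` and `evaluate.as_dict` could be written to disk. The sketch below assumes the HTML is a string and the details are JSON-serializable. `save_report`, the default file stem, and the demo values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_report(html, details, output_dir, stem="utility_report"):
    """Hypothetical helper: write report HTML and JSON details into `output_dir`."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    html_path = out / f"{stem}.html"
    json_path = out / f"{stem}.json"
    html_path.write_text(html, encoding="utf-8")
    json_path.write_text(json.dumps(details, indent=2), encoding="utf-8")
    return html_path, json_path


# Demo with placeholder content instead of a real report
with tempfile.TemporaryDirectory() as demo_dir:
    html_path, json_path = save_report(
        "<html><body>placeholder</body></html>", {"placeholder": True}, demo_dir
    )
    saved_names = sorted(p.name for p in Path(demo_dir).iterdir())
    round_trip = json.loads(json_path.read_text(encoding="utf-8"))

print(saved_names)  # ['utility_report.html', 'utility_report.json']
```

In this notebook the call would look like `save_report(evaluate.as_html, evaluate.as_dict, "/tmp")`, once `evaluate.run()` has finished.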

You can also check out more model details, such as the configuration and model stats, or keep synthesizing or augmenting your data to get the best results for your use case.
Happy synthesizing!