# Gretel Trainer

This notebook helps train synthetic models on complex datasets with high row and column counts. It divides a dataset into smaller datasets of correlated columns, trains a model on each of them in parallel, and then stitches the results back together.
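The split-and-stitch idea can be illustrated with plain pandas. This is a minimal sketch, not the trainer's actual implementation: the toy columns and the `header_clusters` value are made up for illustration, and the "training" step is replaced by a simple copy. It assumes each header cluster is a list of column names.

```python
import pandas as pd

# Toy frame standing in for a wide dataset (hypothetical columns).
df = pd.DataFrame({
    "cpu_user": [10, 20, 30],
    "cpu_sys": [1, 2, 3],
    "mem_used": [100, 200, 300],
    "test_name": ["nginx", "nginx", "stress"],
})

# Hypothetical clusters of related columns, assumed to be lists of column names.
header_clusters = [["cpu_user", "cpu_sys"], ["mem_used", "test_name"]]

# Split: one narrower frame per cluster of columns.
partitions = [df[cols] for cols in header_clusters]

# Placeholder for "train a model and generate data" on each partition.
synthetic_parts = [part.copy() for part in partitions]

# Stitch: put the per-cluster frames back side by side.
stitched = pd.concat(synthetic_parts, axis=1)
print(stitched.shape)
```

Because each partition carries only a subset of the columns, the per-partition jobs are independent of one another, which is what allows them to run in parallel.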

In [4]:
import strategy
import runner

from gretel_client.projects import create_or_get_unique_project
from gretel_client.projects.models import read_model_config
from gretel_client.projects.jobs import Status
from gretel_synthetics.utils.header_clusters import cluster

import pandas as pd

ModuleNotFoundError: No module named 'category_encoders'

In [2]:
CACHE_FILE = "runner_cache.json"

PROJECT = create_or_get_unique_project(name="cpu-dataset")

In [3]:
# Run this cell when creating a strategy for the first time

DF = pd.read_csv("cpu_states.csv", low_memory=False)
HEADER_CLUSTERS = cluster(DF)

# Use the default Synthetics model config for now; it is loaded from GitHub automatically
CONFIG = read_model_config("synthetics/mostly-numeric-data")

In [4]:
CONFIG["models"][0]["synthetics"]["params"]["epochs"] = 200
# Set both privacy filters to None
CONFIG["models"][0]["synthetics"]["privacy_filters"] = {"outliers": None, "similarity": None}

In [5]:
# Faster loading if re-using an existing strategy and all models have started

# DF = None
# HEADER_CLUSTERS = None

# No model config is loaded in this case; CONFIG is left as None
# CONFIG = None

# CACHE_FILE = "runner_cache.json"

# PROJECT = create_or_get_unique_project(name="cpu-dataset")

In [6]:
# First, create some constraints for the partition strategy. They are used to create
# the specific partitions under the hood.
#
# Params:
# - header_clusters: The header clusters to use; if omitted, all headers are used
# - max_row_partitions: The max number of row "clusters" to use; mutually exclusive with `max_row_count`
# - max_row_count: The max number of records to include in a row cluster

constraints = strategy.PartitionConstraints(
    header_clusters=HEADER_CLUSTERS, 
    max_row_partitions=1, # Setting this to 1 will use all rows per partition, so just rely on header clustering
    max_row_count=None
)
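The relationship between the two row limits can be sketched with a small stand-in class. `ToyConstraints` and `row_partition_count` are hypothetical names used only for illustration; they are not part of the `strategy` module, and the behavior when neither limit is set (a single row partition) is an assumption. The row count of 5000 is likewise just an example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConstraints:
    """Hypothetical stand-in for the row-related constraint parameters."""

    max_row_partitions: Optional[int] = None
    max_row_count: Optional[int] = None

    def __post_init__(self) -> None:
        # The two row limits are mutually exclusive.
        if self.max_row_partitions is not None and self.max_row_count is not None:
            raise ValueError("max_row_partitions and max_row_count are mutually exclusive")

    def row_partition_count(self, total_rows: int) -> int:
        # With a per-partition row cap, use as many partitions as needed.
        if self.max_row_count is not None:
            return max(1, math.ceil(total_rows / self.max_row_count))
        # Otherwise use the requested number (assumption: default to 1).
        return self.max_row_partitions or 1


whole = ToyConstraints(max_row_partitions=1)
chunked = ToyConstraints(max_row_count=2000)
print(whole.row_partition_count(5000))    # 1
print(chunked.row_partition_count(5000))  # 3
```

With `max_row_partitions=1`, as in the cell above, every partition sees all rows and only the header clusters divide the data.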

In [7]:
# Create the runner. Some notes on caching:
# - If the cache already exists, an existing strategy is loaded from it (any provided constraints are ignored)
# - You can force the runner to create a new strategy (losing previous state tracking) with the `cache_overwrite` param
# - If the cache does not exist, a new one is created with the provided constraints

run = runner.StrategyRunner(
    strategy_id="foo",
    df=DF,
    cache_file=CACHE_FILE,
    cache_overwrite=True,  # True starts fresh; False tries to load an existing cache and resume from it
    model_config=CONFIG,
    partition_constraints=constraints,
    project=PROJECT
)
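The caching rules in the comments above can be sketched as a small load-or-create function. `load_or_create_strategy` and the JSON layout are hypothetical and only mirror the three notes; the real cache format used by `StrategyRunner` may differ.

```python
import json
import os
import tempfile


def load_or_create_strategy(cache_file, constraints, cache_overwrite=False):
    """Hypothetical helper: load cached state unless told to overwrite it."""
    if os.path.exists(cache_file) and not cache_overwrite:
        # Existing cache wins; the provided constraints are ignored.
        with open(cache_file) as fh:
            return json.load(fh), "loaded"
    # No cache, or overwrite requested: start fresh with the given constraints.
    state = {"constraints": constraints, "partitions": []}
    with open(cache_file, "w") as fh:
        json.dump(state, fh)
    return state, "created"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "runner_cache.json")
    first = load_or_create_strategy(path, {"max_row_partitions": 1})
    second = load_or_create_strategy(path, {"max_row_partitions": 4})
    third = load_or_create_strategy(path, {"max_row_partitions": 4}, cache_overwrite=True)

print(first[1], second[1], third[1])
```

The second call shows why stale caches can be surprising: the new constraints are silently dropped in favor of the cached ones, which is the case `cache_overwrite=True` is meant to handle.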

In [8]:
# run = runner.StrategyRunner.from_completed(PROJECT, CACHE_FILE)

In [9]:
run.train_all_partitions()

Processing 32 partitions
Partition 0 is new, starting model creation
Removing artifact not belonging to this Strategy...
Started model: 61f097a0014a41e84378a70d
Partition 1 is new, starting model creation
Removing artifact not belonging to this Strategy...
Started model: 61f097a2014a41e84378a70e
Partition 0 status change from created to pending
Partition 2 is new, starting model creation
Removing artifact not belonging to this Strategy...
Started model: 61f097a4014a41e84378a70f
Partition 3 is new, starting model creation
Removing artifact not belonging to this Strategy...
Started model: 61f097a6014a41e84378a710
Partition 1 status change from created to pending
Partition 2 status change from created to pending
Partition 3 status change from created to pending
Partition 4 is new, starting model creation
Removing artifact not belonging to this Strategy...
Started model: 61f097a9014a41e84378a711
Partition 4 status change from created to pending
Partition 5 is new, starting model creation
R

Partition 28 status change from created to pending
At active capacity, waiting for more...
At active capacity, waiting for more...
Partition 11 status change from active to completed
Partition 22 status change from pending to active
Partition 29 is new, starting model creation
Attempting to remove artifact: gretel_103370b63be34e659cd6d388d7e66b6f_foo-11.csv
Started model: 61f09a286de685c2c17fdb39
Partition 29 status change from created to pending
At active capacity, waiting for more...
At active capacity, waiting for more...
Partition 12 status change from active to completed
Partition 17 status change from pending to active
Partition 30 is new, starting model creation
Attempting to remove artifact: gretel_648aed425e95424ab98f760053950cb7_foo-12.csv
Started model: 61f09a65014a41e84378a71f
Partition 22 status change from active to completed
Partition 30 status change from created to pending
Partition 31 is new, starting model creation
Attempting to remove artifact: gretel_b11af8c37a4d4d
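The "At active capacity, waiting for more..." messages suggest that the runner stops starting new models once a limit on in-flight jobs is reached, and resumes as jobs complete. The following toy simulation shows that pattern. It is a sketch under assumptions, not the runner's actual logic: `simulate_training`, `max_active`, and `ticks_per_job` are hypothetical names, and every job is assumed to take the same number of polling rounds.

```python
from collections import deque


def simulate_training(num_partitions, max_active, ticks_per_job=2):
    """Toy scheduler: start jobs until max_active is reached, then wait."""
    pending = deque(range(num_partitions))
    active = {}      # partition index -> remaining polling rounds
    completed = []
    log = []
    while pending or active:
        # Fill any free slots with new jobs.
        while pending and len(active) < max_active:
            idx = pending.popleft()
            active[idx] = ticks_per_job
            log.append(f"start {idx}")
        if pending:
            log.append("at capacity")
        # One polling round: every active job makes progress.
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:
                del active[idx]
                completed.append(idx)
                log.append(f"done {idx}")
    return completed, log


completed, log = simulate_training(num_partitions=4, max_active=2)
print(log)
```

Bounding the number of active jobs keeps the client from exceeding whatever concurrency the backend allows, while still keeping all available slots busy.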

In [10]:
syn_df = run.get_training_synthetic_data()

Re-assembling data for 32 header clusters
Fetching artifact data for partition 0
Fetching artifact data for partition 17
Fetching artifact data for partition 8
Fetching artifact data for partition 1
Fetching artifact data for partition 18
Fetching artifact data for partition 22
Fetching artifact data for partition 26
Fetching artifact data for partition 9
Fetching artifact data for partition 4
Fetching artifact data for partition 28
Fetching artifact data for partition 23
Fetching artifact data for partition 24
Fetching artifact data for partition 16
Fetching artifact data for partition 27
Fetching artifact data for partition 2
Fetching artifact data for partition 6
Fetching artifact data for partition 14
Fetching artifact data for partition 31
Fetching artifact data for partition 7
Fetching artifact data for partition 25
Fetching artifact data for partition 29
Fetching artifact data for partition 20
Fetching artifact data for partition 13
Fetching artifact data for partition 15
Fetchi

In [11]:
syn_df

Unnamed: 0,ps2,class,ps479,ps478,ps203,ps204,ps205,ps480,ps477,ps346,...,ps271,ps277,ps263,ps279,ps265,ps273,ps259,ps261,ps280,test_name
0,0,B,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,nginx
1,0,B,0,0,0,0,0,0,0,0,...,286,0,0,0,0,0,0,286,0,nginx_7
2,0,B,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,nginx
3,0,B,0,0,0,0,0,0,0,0,...,286,0,0,0,0,0,0,286,0,nginx_7
4,0,B,0,0,0,0,0,0,0,0,...,7,15,7,15,7,7,8,0,0,stress_i
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,0,B,0,0,0,0,0,0,0,0,...,960,0,0,0,0,0,0,960,0,php-fpm7.1
4996,0,B,0,0,0,0,0,0,0,0,...,0,0,0,6651,0,0,6636,0,0,nginx_4
4997,0,B,0,0,0,0,0,0,0,0,...,0,0,0,6651,0,0,6636,0,0,nginx_4
4998,0,B,0,0,0,0,0,0,0,0,...,0,0,0,6651,0,0,6636,0,0,nginx_4


In [12]:
# run.cancel_all()