# DataFrame Batch Training

This notebook explores the new batch training feature in Gretel Synthetics. The interface creates N synthetic training configurations, one for each batch of column names. The source DataFrame is broken into smaller DataFrames that each keep the same number of rows but contain only a subset of the columns.
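
As a rough illustration of the column batching described above, the sketch below splits a list of column names into consecutive groups. The helper `split_columns` and the column names are hypothetical and not part of gretel-synthetics, and the library may choose its groups differently.

```python
from typing import List


def split_columns(columns: List[str], batch_size: int) -> List[List[str]]:
    """Split column names into consecutive groups of at most batch_size (illustrative helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [columns[i:i + batch_size] for i in range(0, len(columns), batch_size)]


# Hypothetical column names, used only for this example.
columns = [f"col{i}" for i in range(15)]
batches = split_columns(columns, 6)
print([len(batch) for batch in batches])  # [6, 6, 3]
```

Each group of names would then select one narrow DataFrame with all of the original rows.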

In [None]:
# If you are using Colab, you may wish to mount your Google Drive. Once that is done, you can create a symlinked
# directory in which to store the checkpoint directories.
#
# For this example we use a small US Adult Income dataset that can be learned and trained relatively quickly.
#
# NOTE: Gretel Synthetics paths must NOT contain whitespace, which is why we symlink to a more local directory
# in /content. Unfortunately, Google Drive mounts contain whitespace in either the "My Drive" or "Shared drives"
# portion of the path. In the command below, use whichever of "Shared drives" or "My Drive" applies to you.
#
# !ln -s "/content/drive/Shared drives[My Drive]/YOUR_TARGET_DIRECTORY" checkpoints
#
# !pip install -U gretel-synthetics

In [40]:
import pandas as pd
from gretel_synthetics.batch import DataFrameBatch, PATH_HOLDER

source_df = pd.read_csv("https://gretel-public-website.s3-us-west-2.amazonaws.com/tests/synthetics/data/USAdultIncome14K.csv")



In [2]:
source_df.shape

(14000, 15)

In [None]:
# Here we create a dict with our config params. These are identical to the params used when creating a LocalConfig object.
#
# NOTE: We do not specify an ``input_data_path`` because it is created automatically for each batch

In [26]:
from pathlib import Path

checkpoint_dir = str(Path.cwd() / "test-model-2")

config_template = {
    "epochs": 100,
    "max_line_len": 2048,
    "vocab_size": 200000,
    "field_delimiter": ",",
    "overwrite": True,
    "checkpoint_dir": checkpoint_dir
}
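
Because the note in the first cell says paths must not contain whitespace, it may be worth checking `checkpoint_dir` before building the batcher. This is a minimal sketch. The helper `has_whitespace` and the example paths are illustrative assumptions, not part of gretel-synthetics.

```python
def has_whitespace(path: str) -> bool:
    """Return True if any character in the path is whitespace (illustrative helper)."""
    return any(ch.isspace() for ch in str(path))


# Hypothetical paths, used only for this example.
print(has_whitespace("/content/drive/My Drive/checkpoints"))  # True
print(has_whitespace("/content/checkpoints"))  # False
```

A call such as `has_whitespace(checkpoint_dir)` would flag a Google Drive path before any directories are created.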

In [34]:
from gretel_synthetics.tokenizers import CharTokenizerTrainer
from gretel_synthetics.config import LocalConfig

# Create our batch handler. During construction, checkpoint directories are automatically created
# based on the configured batch size
batcher = DataFrameBatch(df=source_df, config=config_template, batch_size=30)

# Optionally, you can provide your own batches as a list of lists of column names:
#
# my_batches = [["col1", "col2"], ["col3", "col4", "col5"]]
# batcher = DataFrameBatch(df=source_df, batch_headers=my_batches, config=config_template)

2022-02-04 14:51:41,305 : MainThread : INFO : Creating directory structure for batch jobs...
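
When supplying your own `batch_headers`, a quick sanity check can catch a column that was left out or listed twice. The sketch below is a precaution written for this walkthrough. `check_batch_headers` is a hypothetical helper, and whether `DataFrameBatch` performs a similar validation itself is not covered here.

```python
from collections import Counter
from typing import Dict, List


def check_batch_headers(all_columns: List[str], batches: List[List[str]]) -> Dict[str, List[str]]:
    """Report columns that are missing from, duplicated in, or unknown to the batches."""
    counts = Counter(col for batch in batches for col in batch)
    return {
        "missing": [c for c in all_columns if c not in counts],
        "duplicated": sorted(c for c, n in counts.items() if n > 1),
        "unknown": sorted(c for c in counts if c not in all_columns),
    }


# Hypothetical column names matching the commented-out example above.
columns = ["col1", "col2", "col3", "col4", "col5"]
my_batches = [["col1", "col2"], ["col3", "col4", "col5"]]
report = check_batch_headers(columns, my_batches)
print(report)  # {'missing': [], 'duplicated': [], 'unknown': []}
```

With a real DataFrame, `list(source_df.columns)` would take the place of `columns`.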


In [35]:
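# Build a LocalConfig from the batcher's config and attach a character-level tokenizer trainer.
# PATH_HOLDER stands in for ``input_data_path``, which is created separately for each batch.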
config_obj = LocalConfig(input_data_path=PATH_HOLDER, **batcher.config)
batcher.tokenizer = CharTokenizerTrainer(config=config_obj)

In [36]:
# Next, we generate our actual training DataFrames and training text files
#
# Each batch directory will now have its own "train.csv" file
# Each Batch object now has a ``training_df`` associated with it
batcher.create_training_data()

2022-02-04 14:51:46,144 : MainThread : INFO : Generating training DF and CSV for batch 0
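
The training log in this notebook shows batch 0 reading its data from `test-model-2/batch_0/train.csv`. Assuming later batches follow the same `batch_<n>` naming (an inference from that single log line, not a documented guarantee), the expected file locations can be sketched as follows. `train_csv_paths` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def train_csv_paths(checkpoint_dir: str, n_batches: int) -> List[Path]:
    """Build the assumed train.csv location for each batch directory."""
    base = Path(checkpoint_dir)
    return [base / f"batch_{i}" / "train.csv" for i in range(n_batches)]


paths = train_csv_paths("test-model-2", 2)
print([p.as_posix() for p in paths])
# ['test-model-2/batch_0/train.csv', 'test-model-2/batch_1/train.csv']
```

Filtering the result with `p.exists()` is one way to confirm the files were written.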


In [37]:
# Now we can trigger each batch to train
batcher.train_all_batches()

2022-02-04 14:51:48,218 : MainThread : INFO : Loading input data from /Users/jtm/gretel/gretel-synthetics/examples/test-model-2/batch_0/train.csv
2022-02-04 14:51:48,295 : MainThread : INFO : Tokenizing input data
2022-02-04 14:51:48.295627: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-04 14:51:48.295715: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)






100%|█████████████████████████████████████████████████████████████████████████| 14000/14000 [00:00<00:00, 147239.78it/s]
2022-02-04 14:51:48,392 : MainThread : INFO : Shuffling input data
2022-02-04 14:51:49,624 : MainThread : INFO : Creating validation dataset
2022-02-04 14:51:4

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (64, None, 256)           16896     
                                                                 
 dropout_18 (Dropout)        (64, None, 256)           0         
                                                                 
 lstm_12 (LSTM)              (64, None, 256)           525312    
                                                                 
 dropout_19 (Dropout)        (64, None, 256)           0         
                                                                 
 lstm_13 (LSTM)              (64, None, 256)           525312    
                                                                 
 dropout_20 (Dropout)        (64, None, 256)           0         
                                                                 
 dense_6 (Dense)             (64, None, 66)           

2022-02-04 14:51:50.365430: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-04 14:51:51.220122: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-04 14:51:51.401295: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-04 14:51:51.578959: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-04 14:51:51.881668: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.


    187/Unknown - 23s 109ms/step - loss: 3.5484 - accuracy: 0.1182

2022-02-04 14:52:12.582478: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-04 14:52:13.326126: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-04 14:52:13.483553: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.


Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100


2022-02-04 14:59:12,370 : MainThread : INFO : Saving model history to model_history.csv
2022-02-04 14:59:12,371 : MainThread : INFO : Saving model to /Users/jtm/gretel/gretel-synthetics/examples/test-model-2/batch_0/synthetic


In [38]:
# Next, we can trigger all batched models to create output. This loops over each model and attempts to generate
# ``num_lines`` valid lines per model. The method returns a dictionary of bools, indexed by batch number,
# that tells us whether each batch was able to generate the requested number of valid lines
status = batcher.generate_all_batch_lines(num_lines=2000)

Valid record count :   0%|          | 0/2000 [00:00<?, ?it/s]

Invalid record count :   0%|          | 0/1000 [00:00<?, ?it/s]
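
Since the returned dictionary maps batch numbers to bools, the batches that fell short can be picked out directly. The values below are made up for illustration, and `failed_batches` is a hypothetical helper.

```python
from typing import Dict, List


def failed_batches(status: Dict[int, bool]) -> List[int]:
    """Return the batch numbers whose generation did not reach the requested line count."""
    return sorted(idx for idx, ok in status.items() if not ok)


# Made-up status values; a real run returns one entry per batch.
example_status = {0: True, 1: False, 2: True}
print(failed_batches(example_status))  # [1]
```

An empty list means every batch produced the requested number of valid lines.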

In [39]:
# Finally, we can re-assemble all synthetic batches into our new synthetic DF
batcher.batches_to_df()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,7v,W0,W0,W0,32,B0,40,U2,47,U6,W0,tn,40,,0ct<=50K
1,,49,14,Wg,W2,A0,20,40,,n,,0,10,60,IdiUnianiUniasMaveishiWrichicki2ditk
2,tKaitKDtveiteiteiteiteite,te,10,,0itniWnilyilhiWrirdieriedeed-sl-msoEsoul-is,ih,10,10,Ud,40,W0,W0,22,40,102U0vurbunbE7K
3,vginm,ly,W0,6h,W4,7g,14,91,82,W5,W9,W3,Wh,,niUniPhilhiehiergeeceedeesersedl-spos-inoindid...
4,51teoterterierieriedied,ed,Wh,Un,W0,42,40,40,Wa,M5,35,11,W6,W6aCni<lieriedierieriediedierued,edlApoc-svplirrilhinri<ralrarralroir-in-inhinhitK
5,a4dat-ateateageapedne-opsoenoecoeclecneineis,is,10,3e,WeithitniUniUhithivniPsipmivci,ag,10,nh3An-os-oS-orsoedphlrpprserlecl-asealeandanba...,2g,10,40,40,tn,20,W0-Um-EpeinhirhilhicKiPhithatnHgdvIsundalm
6,,75,27,,0,,0,,0,30,30,Wn,W0,W0,60-Hn-op--lhcnocn-IppsnpuspicyinK
7,87gfi70ish,6d,W0iUhiteithi,no,W2,W6,,h,Wh,9h,W0,W0,t0,40,6nmIcpnp-mp-esle-le-seeleesprsuroinhi3-ichichi...
8,asit,W7,W6,40,40,40,30,4l,W0,40,U74UnianitlatnKivri-hiPKiacialiarianianialialm...,3d,BriPhickiWhiphiphicdiWkivriadtasoasmanbaBbaedF...,ed,Wlocl-mo-oo-oohsr=vnoEs-Inecseuseaseanhind5ld
9,19eaSeaseuSeaTeiWhiWhichishi<hisrier,ed,Wr,4h,Wn,W0ithitniUhi2hitnitniUnitriteateatedced-erperv...,le,,h,t0,40,50iUniahitld,Wd,Wl,IeiWhidhicd
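
Conceptually, re-assembly places the columns of each batch side by side, row by row. The sketch below shows that idea in plain Python on made-up rows. It is not the library's implementation, and truncating to the shortest batch is an assumption of this sketch. With pandas, `pd.concat(frames, axis=1)` expresses a similar column-wise join.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_batches(batches: List[List[Row]]) -> List[Row]:
    """Join per-batch rows column-wise, truncating to the shortest batch (illustrative)."""
    if not batches:
        return []
    n_rows = min(len(rows) for rows in batches)
    merged: List[Row] = []
    for i in range(n_rows):
        row: Row = {}
        for rows in batches:
            row.update(rows[i])  # add this batch's columns to the combined row
        merged.append(row)
    return merged


# Made-up rows for two column batches.
batch_0 = [{"age": "39", "workclass": "Private"}, {"age": "50", "workclass": "State-gov"}]
batch_1 = [{"income_bracket": "<=50K"}, {"income_bracket": ">50K"}]
merged = merge_batches([batch_0, batch_1])
print(merged[0])  # {'age': '39', 'workclass': 'Private', 'income_bracket': '<=50K'}
```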


# Read-only mode

If you have already created one or more models and simply want to load them to generate more lines, you can use the read-only mode of the batch interface. No input DataFrame is required, and the interface automatically tries to load model information from a primary checkpoint directory.

You can also control the number of lines to generate with the ``num_lines`` parameter. This option exists in write mode as well, and it overrides the number of lines specified in the synthetic config that was used.

In [None]:
read_batch = DataFrameBatch(mode="read", checkpoint_dir=checkpoint_dir)

In [None]:
read_batch.generate_all_batch_lines(num_lines=5)

In [None]:
read_batch.batches_to_df()