# DataFrame Batch Training

This notebook explores the new batch training feature in Gretel Synthetics. The interface creates N synthetic training configurations, one for each batch of column names. The source DataFrame is split into smaller DataFrames that each keep every row but hold only a subset of the columns.
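
The column split described above can be pictured with plain pandas. This is only a rough sketch, not the library's own implementation: the toy columns and the `split_columns` helper are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy frame standing in for the real source data (hypothetical columns).
toy_df = pd.DataFrame({
    "age": [30, 41, 25],
    "workclass": ["Private", "?", "Private"],
    "education": ["11th", "HS-grad", "Bachelors"],
    "race": ["White", "Black", "White"],
    "income": ["<=50K", ">50K", "<=50K"],
})


def split_columns(df, batch_size):
    """Return a list of DataFrames, each holding at most ``batch_size`` columns."""
    columns = list(df.columns)
    return [
        df[columns[start:start + batch_size]]
        for start in range(0, len(columns), batch_size)
    ]


column_batches = split_columns(toy_df, batch_size=2)
for index, part in enumerate(column_batches):
    # Every part keeps all rows; only the column subset differs.
    print(index, list(part.columns), part.shape)
```

With the 15-column dataset used in this notebook and a batch size of 5, the same arithmetic gives three batches, which matches the batch 0, 1 and 2 entries in the logs further down.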

In [None]:
# If you are using Colab, you may wish to mount your Google Drive. Once that is done, you can create a symlinked
# directory to store the checkpoint directories in.
#
# For this example we are using some Google data that can be learned and trained relatively quickly
#
# NOTE: Gretel Synthetics paths must NOT contain whitespace, which is why we have to symlink to a more local directory
# in /content. Unfortunately, Google Drive mounts contain whitespace in either the "My drive" or "Shared drives" portion
# of the path
#
# !ln -s "/content/drive/Shared drives[My Drive]/YOUR_TARGET_DIRECTORY" checkpoints
#
# !pip install -U gretel-synthetics
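
Since the note above says paths must not contain whitespace, a quick check before training can save a failed run. The helper below is a hypothetical convenience, not part of Gretel Synthetics, and the two example paths are made up for illustration.

```python
def has_whitespace(path):
    """Return True if any character in the path string is whitespace."""
    return any(ch.isspace() for ch in str(path))


# Example paths (assumed for illustration only).
drive_path = "/content/drive/My Drive/YOUR_TARGET_DIRECTORY"
linked_path = "/content/checkpoints"

for candidate in (drive_path, linked_path):
    print(candidate, "->", "contains whitespace" if has_whitespace(candidate) else "ok")
```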

In [1]:
import pandas as pd
from gretel_synthetics.batch import DataFrameBatch

source_df = pd.read_csv("https://gretel-public-website.s3-us-west-2.amazonaws.com/tests/synthetics/data/USAdultIncome14K.csv")

In [2]:
source_df.shape

(14000, 15)

In [None]:
# Here we create a dict with our config params. These are identical to the ones used when creating a LocalConfig object
#
# NOTE: We do not specify an ``input_data_path`` as this is automatically created for each batch

In [8]:
from pathlib import Path

checkpoint_dir = str(Path.cwd() / "test-model-2")

config_template = {
    "epochs": 10,
    "max_line_len": 2048,
    "vocab_size": 200000,
    "field_delimiter": ",",
    "overwrite": True,
    "checkpoint_dir": checkpoint_dir
}

In [9]:
# Create our batch handler. During construction, checkpoint directories are automatically created
# based on the configured batch size
batcher = DataFrameBatch(df=source_df, config=config_template, batch_size=5)

# Optionally, you can also provide your own batches, which can be a list of lists of strings:
#
# my_batches = [["col1", "col2"], ["col3", "col4", "col5"]]
# batcher = DataFrameBatch(df=source_df, batch_headers=my_batches, config=config_template)

2021-02-04 12:58:50,096 : MainThread : INFO : Creating directory structure for batch jobs...
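
For the custom batches option mentioned in the cell above, the list of lists can be built programmatically rather than typed by hand. The sketch below groups columns by dtype, which is just one possible grouping and an assumption of this example; the toy columns are hypothetical. It covers each column exactly once, although whether that is required is not stated here.

```python
import pandas as pd

# Toy frame with hypothetical column names.
toy_df = pd.DataFrame({
    "age": [30, 41],
    "hours_per_week": [40, 50],
    "workclass": ["Private", "?"],
    "education": ["11th", "HS-grad"],
})

# Numeric columns go in one batch, everything else in another.
numeric_cols = list(toy_df.select_dtypes(include="number").columns)
other_cols = [col for col in toy_df.columns if col not in numeric_cols]

# Drop empty groups so the result is a clean list of lists of strings.
my_batches = [cols for cols in (numeric_cols, other_cols) if cols]
print(my_batches)
```

The resulting `my_batches` has the same shape as the hand-written example in the comment, so it could be passed as `batch_headers` in the same way.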


In [10]:
# Next we generate our actual training DataFrames and training text files
#
# Each batch directory will now have its own "train.csv" file
# Each Batch object now has a ``training_df`` associated with it
batcher.create_training_data()

2021-02-04 12:58:51,411 : MainThread : INFO : Generating training DF and CSV for batch 0
2021-02-04 12:58:51,454 : MainThread : INFO : Generating training DF and CSV for batch 1
2021-02-04 12:58:51,500 : MainThread : INFO : Generating training DF and CSV for batch 2


In [11]:
# Now we can trigger each batch to train
batcher.train_all_batches()

2021-02-04 12:58:53,708 : MainThread : INFO : Loading training data from /Users/jtm/gretel/gretel-synthetics/examples/test-model-2/batch_0/train.csv
2021-02-04 12:58:53,739 : MainThread : INFO : Training SentencePiece tokenizer
2021-02-04 12:58:54,116 : MainThread : INFO : Loading tokenizer from: m.model
2021-02-04 12:58:54,127 : MainThread : INFO : Tokenizer model vocabulary size: 5536 tokens
2021-02-04 12:58:54,128 : MainThread : INFO : Mapping first line of training data

'30<d>?<d>157289<d>11th<d>7<n>'
 ---- sample tokens mapped to pieces ---- > 
▁, 3, 0, <d>, ?, <d>, 1, 5, 72, 89, <d>, 1, 1, th, <d>, 7, <n>

2021-02-04 12:58:54,128 : MainThread : INFO : Mapping first line of training data

'30<d>?<d>157289<d>11th<d>7<n>'
 ---- sample tokens mapped to int ---- > 
25, 47, 4, 55, 4, 1140, 13, 4, 50, 41, 4, 34, 3

2021-02-04 12:58:54,132 : MainThread : INFO : Tokenizing training data
100%|██████████| 14000/14000 [00:00<00:00, 46378.37it/s]
2021-02-04 12:58:54,436 : MainThread : INFO :

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (64, None, 256)           1417216   
_________________________________________________________________
dropout_3 (Dropout)          (64, None, 256)           0         
_________________________________________________________________
lstm_2 (LSTM)                (64, None, 256)           525312    
_________________________________________________________________
dropout_4 (Dropout)          (64, None, 256)           0         
_________________________________________________________________
lstm_3 (LSTM)                (64, None, 256)           525312    
_________________________________________________________________
dropout_5 (Dropout)          (64, None, 256)           0         
_________________________________________________________________
dense_1 (Dense)              (64, None, 5536)         

2021-02-04 13:06:03,825 : MainThread : INFO : Saving model history to model_history.csv
2021-02-04 13:06:03,828 : MainThread : INFO : Saving model to /Users/jtm/gretel/gretel-synthetics/examples/test-model-2/batch_0/synthetic
2021-02-04 13:06:03,833 : MainThread : INFO : Loading training data from /Users/jtm/gretel/gretel-synthetics/examples/test-model-2/batch_1/train.csv
2021-02-04 13:06:03,864 : MainThread : INFO : Training SentencePiece tokenizer
2021-02-04 13:06:04,262 : MainThread : INFO : Loading tokenizer from: m.model
2021-02-04 13:06:04,265 : MainThread : INFO : Tokenizer model vocabulary size: 118 tokens
2021-02-04 13:06:04,266 : MainThread : INFO : Mapping first line of training data

'Never-married<d>?<d>Unmarried<d>White<d>Male<n>'
 ---- sample tokens mapped to pieces ---- > 
▁, N, e, ve, r, -, ma, r, ried, <d>, ?, <d>, U, n, m, arried, <d>, W, hi, t, e, <d>, M, ale, <n>

2021-02-04 13:06:04,267 : MainThread : INFO : Mapping first line of training data

'Never-married<d>?<

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (64, None, 256)           30208     
_________________________________________________________________
dropout_6 (Dropout)          (64, None, 256)           0         
_________________________________________________________________
lstm_4 (LSTM)                (64, None, 256)           525312    
_________________________________________________________________
dropout_7 (Dropout)          (64, None, 256)           0         
_________________________________________________________________
lstm_5 (LSTM)                (64, None, 256)           525312    
_________________________________________________________________
dropout_8 (Dropout)          (64, None, 256)           0         
_________________________________________________________________
dense_2 (Dense)              (64, None, 118)          

2021-02-04 13:12:01,853 : MainThread : INFO : Saving model history to model_history.csv
2021-02-04 13:12:01,855 : MainThread : INFO : Saving model to /Users/jtm/gretel/gretel-synthetics/examples/test-model-2/batch_1/synthetic
2021-02-04 13:12:01,857 : MainThread : INFO : Loading training data from /Users/jtm/gretel/gretel-synthetics/examples/test-model-2/batch_2/train.csv
2021-02-04 13:12:01,883 : MainThread : INFO : Training SentencePiece tokenizer
2021-02-04 13:12:02,124 : MainThread : INFO : Loading tokenizer from: m.model
2021-02-04 13:12:02,127 : MainThread : INFO : Tokenizer model vocabulary size: 265 tokens
2021-02-04 13:12:02,128 : MainThread : INFO : Mapping first line of training data

'0<d>0<d>40<d>United-States<d><=50K<n>'
 ---- sample tokens mapped to pieces ---- > 
▁, 0, <d>, 0, <d>, 4, 0, <, d, >, U, ni, te, d, -, S, t, ate, s, <, d, >, <, =, 50, K, <n>

2021-02-04 13:12:02,129 : MainThread : INFO : Mapping first line of training data

'0<d>0<d>40<d>United-States<d><=50K

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (64, None, 256)           67840     
_________________________________________________________________
dropout_9 (Dropout)          (64, None, 256)           0         
_________________________________________________________________
lstm_6 (LSTM)                (64, None, 256)           525312    
_________________________________________________________________
dropout_10 (Dropout)         (64, None, 256)           0         
_________________________________________________________________
lstm_7 (LSTM)                (64, None, 256)           525312    
_________________________________________________________________
dropout_11 (Dropout)         (64, None, 256)           0         
_________________________________________________________________
dense_3 (Dense)              (64, None, 265)          

2021-02-04 13:15:40,605 : MainThread : INFO : Saving model history to model_history.csv
2021-02-04 13:15:40,607 : MainThread : INFO : Saving model to /Users/jtm/gretel/gretel-synthetics/examples/test-model-2/batch_2/synthetic


In [None]:
# Next, we can trigger all batched models to create output. This loops over each model and attempts to generate
# the requested number of valid lines (``num_lines`` here, which overrides ``gen_lines`` from the config). The method
# returns a dictionary of bools, indexed by batch number, that tells us whether each batch generated that many valid lines
status = batcher.generate_all_batch_lines(num_lines=2000)

In [None]:
batcher.batches[2].gen_data_stream.getvalue()

In [None]:
status
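
Because the status is described as a dictionary of bools indexed by batch number, picking out the batches that fell short is a one-liner. The `example_status` values below are invented for illustration and are not output from this notebook.

```python
# Hypothetical result shaped like the dictionary described above:
# batch number -> whether the requested number of valid lines was generated.
example_status = {0: True, 1: True, 2: False}

failed_batches = sorted(idx for idx, ok in example_status.items() if not ok)

if failed_batches:
    print("Batches that did not reach the requested line count:", failed_batches)
else:
    print("All batches generated the requested number of valid lines")
```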

In [None]:
# We can grab a DataFrame for each batch index
batcher.batch_to_df(0)

In [None]:
# Finally, we can re-assemble all synthetic batches into our new synthetic DF
batcher.batches_to_df()
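
Conceptually, re-assembly is the reverse of the column split: the per-batch frames are placed side by side. The sketch below shows that idea with `pd.concat` on two hypothetical batch frames. It is an illustration of the concept, not a description of how `batches_to_df` is implemented.

```python
import pandas as pd

# Two hypothetical per-batch synthetic frames with the same number of rows.
batch_0_df = pd.DataFrame({"age": [30, 41], "workclass": ["Private", "?"]})
batch_1_df = pd.DataFrame({"income": ["<=50K", ">50K"]})

# axis=1 joins on the row index, so columns from each batch sit side by side.
combined_df = pd.concat([batch_0_df, batch_1_df], axis=1)
print(combined_df.shape)
print(list(combined_df.columns))
```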

# Read-only mode

If you have already trained one or more models and simply want to load them to generate more lines, you can use the read-only mode of the batch interface. No input DataFrame is required; the interface automatically tries to load model information from a primary checkpoint directory.

You can also control how many lines to generate with the ``num_lines`` parameter. This option exists in write mode as well, and it overrides the number of lines specified in the synthetic config that was used.
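
Before switching to read-only mode it can help to confirm which batch directories exist under the checkpoint directory. The sketch below assumes the `batch_N` sub-directory naming seen in the training logs above; `list_batch_dirs` is a hypothetical helper, and the temporary directory only stands in for a real checkpoint directory.

```python
import re
import tempfile
from pathlib import Path


def list_batch_dirs(root):
    """Return ``batch_N`` sub-directories of ``root``, ordered by N."""
    pattern = re.compile(r"batch_(\d+)")
    found = []
    for child in Path(root).iterdir():
        match = pattern.fullmatch(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    return [path for _, path in sorted(found)]


# Build a throwaway layout to demonstrate the helper.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("batch_1", "batch_10", "batch_0"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "notes.txt").write_text("not a batch directory")
    batch_names = [path.name for path in list_batch_dirs(tmp)]

print(batch_names)
```

Sorting on the parsed integer rather than the directory name keeps `batch_10` after `batch_1` instead of between `batch_1` and `batch_2`.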

In [None]:
read_batch = DataFrameBatch(mode="read", checkpoint_dir=checkpoint_dir)

In [None]:
read_batch.generate_all_batch_lines(num_lines=5)

In [None]:
read_batch.batches_to_df()