# Generating synthetic data

This notebook walks through training a probabilistic, generative RNN model
on a rental scooter location dataset, and then generating a synthetic
dataset with greater privacy guarantees. 

Both training and generation are driven by a single ``LocalConfig`` instance
from the ``config.py`` module, which holds all the attributes needed for both
activities.

In the example below, we create a config that works on a CPU. A GPU is
recommended for more complex settings.

In [None]:
# pip install gretel-synthetics

In [None]:
import os

from gretel_synthetics.config import LocalConfig

# Use the current working directory as the base for data and checkpoints
CURR_DIR = os.getcwd()

# Create a config that we can use for both training and generating, with CPU-friendly settings
# The default values for ``max_chars`` and ``epochs`` are better suited for GPUs

config = LocalConfig(
    max_chars=100000,  # friendly towards CPUs
    epochs=5,  # friendly towards CPUs
    gen_chars=0, # the maximum number of characters per generated line of text (0 means no limit)
    gen_lines=100, # the number of generated text lines
    rnn_units=256, # dimensionality of LSTM output space
    batch_size=64, # batch size
    buffer_size=1000, # buffer size to shuffle the dataset
    dropout_rate=0.2, # fraction of the inputs to drop
    dp=True, # let's use differential privacy
    dp_learning_rate=0.015, # learning rate
    dp_noise_multiplier=1.1, # control how much noise is added to gradients
    dp_l2_norm_clip=1.0, # bound optimizer's sensitivity to individual training points
    dp_microbatches=256, # split batches into microbatches for parallelism
    checkpoint_dir=os.path.join(CURR_DIR, 'checkpoints'),
    training_data=os.path.join(CURR_DIR, 'data/uber_scooter_rides_1day.csv')
)
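
The roles of ``dp_l2_norm_clip`` and ``dp_noise_multiplier`` can be illustrated with a small, library-free sketch of the general differentially private SGD recipe: clip each microbatch gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled by the clip value, then average. This is an illustration under stated assumptions, not the code that ``gretel_synthetics`` runs internally. The helper names ``clip_to_l2_norm`` and ``dp_average_gradient`` and the toy gradients are hypothetical.

```python
import math
import random


def clip_to_l2_norm(grad, l2_norm_clip):
    """Scale a gradient down so its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(microbatch_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each microbatch gradient, sum, add Gaussian noise, then average.

    Hypothetical helper for illustration only.
    """
    clipped = [clip_to_l2_norm(g, l2_norm_clip) for g in microbatch_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation grows with both the multiplier and the clip value
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [n / len(clipped) for n in noisy]


# Toy gradients, one per microbatch (made-up values)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
rng = random.Random(0)

clipped = [clip_to_l2_norm(g, 1.0) for g in grads]
noisy_avg = dp_average_gradient(grads, l2_norm_clip=1.0, noise_multiplier=1.1, rng=rng)
print(clipped)
print(noisy_avg)
```

Clipping caps how much any single microbatch can move the model, and the added noise masks what remains of its contribution. A larger noise multiplier gives stronger privacy at the cost of noisier updates.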

In [None]:
# Train a model
# The training function only requires our config as a single arg
from gretel_synthetics.train import train_rnn

train_rnn(config)

In [None]:
# Let's generate some text!
#
# The ``generate_text`` function is a generator that yields lines of
# predicted text, one at a time, with the number of lines set by
# ``gen_lines`` in your config.
#
# By default there is no limit on line length: with proper training, your
# model should learn where newlines generally occur. However, if you want
# to cap the number of characters per line, you may set the ``gen_chars``
# attribute in your config object.
from gretel_synthetics.generate import generate_text

# Optionally, when generating text, you can provide a callable that takes the
# generated line as a single arg. If this function raises an exception, the
# line will fail validation and will not be returned. The exception message
# will be provided as an ``explain`` field in the resulting dict that gets
# created by ``generate_text``.
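def validate_record(line):
    # Each record is expected to have exactly 6 comma-separated fields
    rec = line.split(", ")
    if len(rec) != 6:
        raise ValueError('record not 6 parts')
    # These conversions raise ValueError if a field is malformed
    int(rec[0])
    for field in rec[2:6]:
        float(field)
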

for line in generate_text(config, line_validator=validate_record):
    print(line)
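
To see how a line validator behaves before running a full generation pass, it can be exercised on a few hand-written lines. The sketch below repeats a validator like the one above and runs it through ``check_lines``, a hypothetical stand-in that is not part of ``gretel_synthetics``. It records each exception message under an ``explain`` key; the ``text`` and ``valid`` keys and the sample records are assumptions made for illustration. Unlike the behavior described above, the stand-in keeps failed lines so that their messages can be inspected.

```python
def validate_record(line):
    # Expect 6 fields: an integer, an unchecked field, then four floats
    rec = line.split(", ")
    if len(rec) != 6:
        raise ValueError("record not 6 parts")
    int(rec[0])
    for field in rec[2:6]:
        float(field)


def check_lines(lines, line_validator):
    """Hypothetical stand-in: run a validator and capture any error message."""
    results = []
    for text in lines:
        try:
            line_validator(text)
            results.append({"text": text, "valid": True, "explain": None})
        except Exception as err:
            results.append({"text": text, "valid": False, "explain": str(err)})
    return results


# Made-up sample records, not taken from the real dataset
sample_lines = [
    "12, 2019-05-01 10:15:00, 33.99, -118.47, 34.01, -118.49",
    "12, 2019-05-01 10:15:00, 33.99, -118.47",
    "abc, 2019-05-01 10:15:00, 33.99, -118.47, 34.01, -118.49",
]

results = check_lines(sample_lines, validate_record)
for result in results:
    print(result["valid"], result["explain"])
```

Only the first sample passes. The second has too few fields, and the third fails the integer conversion on its first field.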
