# Generating synthetic data

This notebook walks through training a probabilistic, generative RNN model<br>
on a rental scooter location dataset, and then generating a synthetic<br>
dataset with greater privacy guarantees. 

Training and generation are both driven by a single ``LocalConfig`` instance<br>
from the ``config.py`` module, which holds all the attributes that the two<br>
steps need.

In [None]:
# Google Colab support
# Note: Click "Runtime->Change Runtime Type" and set Hardware Accelerator to "GPU"
# Note: Use pip install gretel-synthetics[tf] to install tensorflow if necessary
# 
#!pip install gretel-synthetics

In [None]:
from pathlib import Path

from gretel_synthetics.config import LocalConfig

# Create a config that we can use for both training and generating, with CPU-friendly settings
# The default values for ``max_lines`` and ``epochs`` are better suited for GPUs

config = LocalConfig(
    max_lines=0, # maximum number of training lines to use. Set to 0 (zero) to train on all lines in the dataset
    epochs=15, # 15-50 epochs with GPU for best performance
    vocab_size=15000, # tokenizer model vocabulary size
    character_coverage=1.0, # tokenizer model character coverage percent
    gen_chars=0, # the maximum number of characters per generated line of text; 0 means no limit
    gen_lines=100, # the number of generated text lines
    rnn_units=256, # dimensionality of LSTM output space
    batch_size=64, # batch size
    buffer_size=1000, # buffer size to shuffle the dataset
    dropout_rate=0.2, # fraction of the inputs to drop
    dp=True, # let's use differential privacy
    dp_learning_rate=0.015, # learning rate
    dp_noise_multiplier=1.1, # control how much noise is added to gradients
    dp_l2_norm_clip=1.0, # bound optimizer's sensitivity to individual training points
    dp_microbatches=256, # split batches into microbatches for parallelism
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    field_delimiter=",",  # if the training text is structured
    # overwrite=True,  # enable this if you want to keep training models to the same checkpoint location
    input_data_path="https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/uber_scooter_rides_1day.csv" # filepath or S3
)
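The ``dp_l2_norm_clip`` and ``dp_noise_multiplier`` comments above describe the two knobs of a differentially private gradient step. The following is a minimal sketch of a generic DP-SGD update, not the library's own implementation: the helper names ``clip_to_l2_norm`` and ``dp_average_gradient`` and the sample gradients are hypothetical, and it assumes the usual scheme of clipping each microbatch gradient, summing, adding Gaussian noise scaled by the clip bound, and averaging.

```python
import math
import random


def clip_to_l2_norm(grad, l2_norm_clip):
    """Scale grad down so its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(microbatch_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each microbatch gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_to_l2_norm(g, l2_norm_clip) for g in microbatch_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Noise standard deviation grows with both the clip bound and the multiplier
    stddev = l2_norm_clip * noise_multiplier
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [n / len(microbatch_grads) for n in noisy]


# Hypothetical gradients for three microbatches over two parameters
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
rng = random.Random(0)  # seeded only to make the sketch repeatable
update = dp_average_gradient(grads, l2_norm_clip=1.0, noise_multiplier=1.1, rng=rng)
print(update)
```

Clipping bounds how much any single microbatch can move the update, and the noise hides what remains of its contribution. A larger ``dp_noise_multiplier`` gives stronger privacy at the cost of noisier training.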

In [None]:
# Train a model
# The training function only requires our config as a single arg
from gretel_synthetics.train import train_rnn

train_rnn(config)
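After training, it can be useful to confirm that something was written to the checkpoint location set in the config. This is a small sketch using only the standard library. The helper ``list_checkpoint_files`` is hypothetical, and it makes no assumption about which files the library writes; it simply lists whatever is there.

```python
from pathlib import Path


def list_checkpoint_files(checkpoint_dir):
    """Return file paths under checkpoint_dir, relative to it and sorted.

    Returns an empty list if the directory does not exist.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Same location as ``checkpoint_dir`` in the config above
for name in list_checkpoint_files(Path.cwd() / "checkpoints"):
    print(name)
```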

In [None]:
# Let's generate some text!
#
# The ``generate_text`` function is a generator that yields one line of
# predicted text at a time, up to the ``gen_lines`` setting in your
# config.
#
# By default there is no limit on line length: with proper training, your
# model should learn where newlines generally occur. However, if you want to
# specify a maximum character length for each line, you may set the
# ``gen_chars`` attribute in your config object
from gretel_synthetics.generate import generate_text

# Optionally, when generating text, you can provide a callable that takes the 
# generated line as a single arg. If this function raises any errors, the 
# line will fail validation and will not be returned.  The exception message
# will be provided as an ``explain`` field in the resulting dict that gets
# created by ``generate_text``
def validate_record(line):
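    # Split on the same delimiter as ``field_delimiter`` in the config;
    # float() and int() tolerate any surrounding whitespace
    rec = line.split(",")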
    if len(rec) == 6:
        float(rec[5])
        float(rec[4])
        float(rec[3])
        float(rec[2])
        int(rec[0])
    else:
        raise Exception('record not 6 parts')
        
for line in generate_text(config, line_validator=validate_record):
    print(line)
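A validator can be checked on its own before it is passed to ``generate_text``. The sketch below is a variation on the validator above and runs without a trained model. The helper ``passes`` and the sample lines are hypothetical and not taken from the dataset; ``passes`` only mimics the idea that any exception means the line failed validation.

```python
def validate_record(line):
    """Raise if the line is not a six-field record with numeric fields."""
    rec = line.split(",")
    if len(rec) != 6:
        raise ValueError("record not 6 parts")
    int(rec[0])
    for field in rec[2:6]:
        float(field)


def passes(validator, line):
    """Return (ok, explanation), treating any exception as a failed validation."""
    try:
        validator(line)
    except Exception as err:
        return False, str(err)
    return True, None


# Hypothetical sample lines, not taken from the dataset
samples = [
    "17,abc123,34.05,-118.25,34.06,-118.24",  # well-formed
    "17,abc123,34.05,-118.25",  # too few fields
    "17,abc123,north,-118.25,34.06,-118.24",  # non-numeric coordinate
]
for sample in samples:
    print(sample, "->", passes(validate_record, sample))
```

Testing the validator this way catches mistakes such as splitting on the wrong delimiter, which would otherwise cause every generated line to be rejected.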