Jm/cleanup (#69)
* Update tokenizer demo notebook

* readme updates, DP notebook

* dp updates

* int tests update

* use all procs for ITs

* update netflix example for differential privacy

* Update tokenizer documentation

* Research papers and Slack badge

* bugfix

* add datetime validator
johntmyers committed Nov 17, 2020
1 parent 5a58a06 commit e15479e
Showing 13 changed files with 186 additions and 181 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -136,4 +136,6 @@ venv*
 checkpoints
 examples/checkpoints.zip

-docs/_build
+docs/_build
+examples/tokenizer_demo/
+dp-checkpoints
50 changes: 41 additions & 9 deletions README.md
@@ -11,22 +11,14 @@
[![Python](https://img.shields.io/pypi/pyversions/gretel-synthetics.svg)](https://github.com/gretelai/gretel-synthetics)
[![Downloads](https://pepy.tech/badge/gretel-synthetics)](https://pepy.tech/project/gretel-synthetics)
[![GitHub stars](https://img.shields.io/github/stars/gretelai/gretel-synthetics?style=social)](https://github.com/gretelai/gretel-synthetics)
[![Slack](https://img.shields.io/badge/Slack%20Workspace-Join%20now!-36C5F0?logo=slack)](https://gretel.ai/slackinvite)

## Documentation
* [Get started with gretel-synthetics](https://gretel-synthetics.readthedocs.io/en/stable/)
* [Configuration](https://gretel-synthetics.readthedocs.io/en/stable/api/config.html)
* [Train your model](https://gretel-synthetics.readthedocs.io/en/stable/api/train.html)
* [Generate synthetic records](https://gretel-synthetics.readthedocs.io/en/stable/api/generate.html)

## Overview

This package allows developers to quickly get immersed in synthetic data generation through the use of neural networks. The more complex pieces of working with libraries like TensorFlow and differential privacy are bundled into friendly Python classes and functions.


**NOTE**: The settings in our Jupyter Notebook examples are optimized to run on a GPU, which you can experiment with
for free in Google Colaboratory. If you're running on a CPU, you might want to grab a cup of coffee,
or lower `max_lines` and `epochs` to 5000 and 10, respectively. This code is developed for TensorFlow 2.3.X and above.


## Try it out now!
If you want to quickly discover gretel-synthetics, simply click the button below and follow the tutorials!
@@ -62,3 +54,43 @@ $ jupyter notebook

When the UI launches in your browser, navigate to `examples/synthetic_records.ipynb` and get generating!


## Overview

This package allows developers to quickly get immersed in synthetic data generation through the use of neural networks. The more complex pieces of working with libraries like TensorFlow and differential privacy are bundled into friendly Python classes and functions. There are two high-level modes that can be used.

### Simple Mode

The simple mode trains line-by-line on an input file of text. When generating data, the generator will yield a custom object that can be used in a variety of ways based on your use case. [This notebook](https://github.com/gretelai/gretel-synthetics/blob/master/examples/tensorflow/simple-character-model.ipynb) demonstrates this mode.
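
A minimal sketch of this mode, assuming a local `lines_of_text.txt` with one training example per line (the linked notebook remains the authoritative example):

```python
from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.generate import generate_text
from gretel_synthetics.train import train

# Illustrative settings; see the configuration docs for the full set
config = TensorFlowConfig(
    input_data_path="lines_of_text.txt",  # assumed: one record per line
    checkpoint_dir="checkpoints",
)

# Train line-by-line on the input file using the default tokenizer
train(config)

# Each yielded object carries the generated text and a validity flag
for line in generate_text(config):
    print(line.text)
```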

### DataFrame Mode

This library supports CSV / DataFrames natively using the DataFrame "batch" mode. This module provides a wrapper around our simple mode that is geared toward working with tabular data. Additionally, it is capable of handling a high number of columns by breaking the input DataFrame up into "batches" of columns and training a model on each batch. [This notebook](https://github.com/gretelai/gretel-synthetics/blob/master/examples/dataframe_batch.ipynb) shows an overview of using this library with DataFrames natively.
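
A rough sketch of batch mode, assuming a wide input CSV; the method names below reflect the `DataFrameBatch` API, and the config template values are illustrative:

```python
import pandas as pd

from gretel_synthetics.batch import DataFrameBatch

source_df = pd.read_csv("wide_table.csv")  # assumed input file

# Config parameters applied to the model trained for each column batch
config_template = {
    "epochs": 30,  # illustrative value
    "checkpoint_dir": "batch-checkpoints",
}

batcher = DataFrameBatch(df=source_df, config=config_template)
batcher.create_training_data()  # split columns into batches, write training files
batcher.train_all_batches()     # one model per batch

batcher.generate_all_batch_lines()
synthetic_df = batcher.batches_to_df()  # reassemble batches into one DataFrame
```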

### Components

There are four primary components to be aware of when using this library.

1) Configurations. Configurations are classes specific to the underlying ML engine used to train and generate data. An example would be using `TensorFlowConfig` to create all the necessary parameters to train a model with TensorFlow. `LocalConfig` is aliased to `TensorFlowConfig` for backwards compatibility with older versions of the library. A model is saved to a designated directory, which can optionally be archived and used later.

2) Tokenizers. Tokenizers convert input text into integer-based IDs that are used by the underlying ML engine. A tokenizer can be created and passed in at training time; this is optional, and if none is specified a default one will be used. You can find [an example](https://github.com/gretelai/gretel-synthetics/blob/master/examples/tensorflow/batch-df-char-tokenizer.ipynb) that uses a simple char-by-char tokenizer to build a model from an input CSV. When training in a non-differentially-private mode, we suggest the default `SentencePiece` tokenizer, an unsupervised tokenizer that learns subword units (e.g., **byte-pair encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and the **unigram language model** [[Kudo](https://arxiv.org/abs/1804.10959)]) for faster training and increased accuracy of the synthetic model.

3) Training. Training combines the configuration and tokenizer to build a model, stored in the designated directory, that can be used to generate new records.

4) Generation. Once a model is trained, any number of new lines or records can be generated. Optionally, a record validator can be provided to ensure that the generated data meets any necessary constraints. See our notebooks for examples of validators; a minimal end-to-end sketch follows this list.
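
The four components in sequence, as a minimal sketch (the input file path and the 4-column CSV record format are assumptions for illustration; the differential privacy notebook added in this commit follows the same pattern):

```python
import datetime
from pathlib import Path

from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.generate import generate_text
from gretel_synthetics.tokenizers import CharTokenizerTrainer
from gretel_synthetics.train import train

# 1) Configuration: parameters for the TensorFlow engine
config = TensorFlowConfig(
    input_data_path="training_data.txt",  # assumed local file
    checkpoint_dir=(Path.cwd() / "checkpoints").as_posix(),
)

# 2) Tokenizer: explicit char tokenizer; omit to use the SentencePiece default
tokenizer = CharTokenizerTrainer(config=config)

# 3) Training: builds the model and stores it in checkpoint_dir
train(config, tokenizer)

# 4) Generation: a validator raises to flag a generated line as invalid
def validate_record(line):
    fields = line.split(",")
    if len(fields) != 4:  # assumption: 4-column CSV records
        raise ValueError("record not valid")
    datetime.datetime.strptime(fields[3], "%Y-%m-%d")

for line in generate_text(config, line_validator=validate_record):
    if line.valid:
        print(line.text)
```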

## Differential Privacy

Differential privacy support for our TensorFlow mode is built on the great work being done by the Google TF team and their [TensorFlow Privacy library](https://github.com/tensorflow/privacy).

When utilizing DP, we currently recommend using the character tokenizer, as it creates a vocabulary of individual characters only and removes the risk of sensitive data being memorized as whole tokens that could be replayed during generation.

There are also a few notable configuration options; a combined sketch follows this list:

- `predict_batch_size` should be set to 1
- `dp` should be enabled
- `learning_rate`, `dp_noise_multiplier`, `dp_l2_norm_clip`, and `dp_microbatches` can be adjusted to achieve various epsilon values.
- `reset_states` should be disabled
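
Put together, a DP-oriented configuration might look like this sketch (the values come from the notebook below and are starting points, not recommendations; the input path is an assumption):

```python
from gretel_synthetics.config import TensorFlowConfig

dp_config = TensorFlowConfig(
    dp=True,               # enable DP training via TensorFlow Privacy
    predict_batch_size=1,
    reset_states=False,
    learning_rate=0.0015,  # tune together with the dp_* knobs below
    dp_noise_multiplier=0.2,
    dp_l2_norm_clip=1.0,
    dp_microbatches=1,
    input_data_path="data.txt",  # assumed input
    checkpoint_dir="dp-checkpoints",
)
```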

Please see our [example Notebook](https://github.com/gretelai/gretel-synthetics/blob/master/examples/tensorflow/diff_privacy.ipynb) for training a DP model based on the [Netflix Prize](https://en.wikipedia.org/wiki/Netflix_Prize) dataset.

77 changes: 0 additions & 77 deletions examples/generate_as_module.py

This file was deleted.

125 changes: 125 additions & 0 deletions examples/tensorflow/diff_privacy.ipynb
@@ -0,0 +1,125 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "synthetics_dp.py",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "code",
"metadata": {
"id": "TY5zsaXme67e"
},
"source": [
"from pathlib import Path\n",
"\n",
"from gretel_synthetics.config import TensorFlowConfig\n",
"from gretel_synthetics.tokenizers import CharTokenizerTrainer\n",
"from gretel_synthetics.train import train"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "CNTeiB24e4TJ"
},
"source": [
"# This config will utilize TensorFlow Privacy to inject noised data into \n",
"# the model during training. Adjust the dp_* parameters to balance\n",
"# privacy vs. accuracy for a synthetic model. \n",
"\n",
"config = TensorFlowConfig(\n",
" gen_lines=1000,\n",
" max_lines=1e5,\n",
" dp=True,\n",
" predict_batch_size=1,\n",
" rnn_units=256,\n",
" batch_size=16,\n",
" learning_rate=0.0015,\n",
" dp_noise_multiplier=0.2,\n",
" dp_l2_norm_clip=1.0,\n",
" dropout_rate=0.5,\n",
" dp_microbatches=1,\n",
" reset_states=False,\n",
" overwrite=True,\n",
" checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),\n",
" # The \"Netflix Challenge\", dataset\n",
" input_data_path='https://gretel-public-website.s3.amazonaws.com/datasets/netflix/netflix.txt'\n",
")\n",
"\n",
"# Initialize the tokenizer\n",
"tokenizer = CharTokenizerTrainer(config=config)\n",
"\n",
"# Train the model\n",
"train(config, tokenizer)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "EzAlijMSlSJz"
},
"source": [
"from collections import Counter\n",
"import datetime\n",
"import pandas as pd\n",
"import json\n",
"\n",
"from gretel_synthetics.generate import generate_text\n",
"\n",
"\n",
"# extract training params\n",
"def get_privacy_guarantees():\n",
" df = pd.read_csv(f\"{config.checkpoint_dir}/model_history.csv\")\n",
" epsilon = df[df['best'] == 1]['epsilon'].values[0]\n",
" delta = df[df['best'] == 1]['delta'].values[0]\n",
" return {\n",
" \"epsilon\": epsilon,\n",
" \"delta\": delta,\n",
" }\n",
"\n",
"# Build a validator\n",
"def validate_record(line):\n",
" rec = line.split(\",\")\n",
" if len(rec) == 4:\n",
" datetime.datetime.strptime(rec[3], '%Y-%m-%d')\n",
" int(rec[2])\n",
" int(rec[1])\n",
" int(rec[0])\n",
" else:\n",
" raise Exception('record not valid')\n",
"\n",
"\n",
"# Print differential privacy epsilon and delta values\n",
"print(json.dumps(get_privacy_guarantees(), indent=2))\n",
"\n",
"# Print CSV header and synthetic lines\n",
"counter = 0\n",
"print(\"movie_id,user_id,rating,date\")\n",
"for line in generate_text(config, \n",
" line_validator=validate_record, \n",
" max_invalid=1e5):\n",
" if line.valid:\n",
" print(f\"{line.text}\")\n",
" counter += 1\n",
" if counter > config.gen_lines:\n",
" break\n"
],
"execution_count": null,
"outputs": []
}
]
}
Binary file removed examples/tokenizer_demo/char2idx.p
Binary file removed examples/tokenizer_demo/idx2char.p
Binary file removed examples/tokenizer_demo/m.model
72 changes: 0 additions & 72 deletions examples/tokenizer_demo/m.vocab

This file was deleted.

1 change: 0 additions & 1 deletion examples/tokenizer_demo/tokenizer_params.json

This file was deleted.

6 changes: 0 additions & 6 deletions examples/tokenizer_demo/training_data.txt

This file was deleted.
