```python
import numpy as np
import pandas as pd
```

## Source

Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni (2019). Modeling Tabular Data using Conditional GAN. arXiv:1907.00503. https://arxiv.org/abs/1907.00503

As we will get into below, this paper introduces the conditional tabular generative adversarial network (CTGAN), a deep learning model that learns from real tabular data and generates new tabular data resembling it.

## Project & Introduction to Synthetic Data

My modus operandi is to start projects by finding interesting datasets and building out the other details from there. In this project, we will instead start with an application of deep learning and see if we can get it working.

We will try to build a deep learning model that creates synthetic data. Synthetic data has been in the air lately because of its proposed use to feed ever greater amounts of data into burgeoning artificial intelligence models. This use of synthetic data is focused on increasing the amount of data that is available. So, if we have 1,000,000 rows of data, we can use a model to learn the ins and outs of that data and then create as many new rows as we like to supplement the real ones.

But there is another use for synthetic data that I am actually more interested in, one that is relevant to real-world data systems that I oversee and work with. We can use synthetic data to protect privacy.

If we have a dataset that includes personally identifiable information (PII) -- names, social security numbers or other unique identifiers, addresses, emails, dates of birth, etc. -- our first step would be to anonymize or drop that data. One approach here would be to develop an algorithm that assigns an anonymized unique identifier to each unique individual based on the PII. We can feed the PII into a model, let the model cluster records by unique individual, and then have the model assign an anonymized identifier to each cluster. We could then drop all PII from the dataset and replace it with the anonymized identifier.
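As a rough sketch of that first step, the snippet below assigns one identifier per individual and then drops the PII. It is only an illustration under simplifying assumptions: the toy records and column names (`name`, `dob`, `gpa`) are made up, and exact matching on normalized PII stands in for the clustering model described above, which would also need to handle typos and partial matches.

```python
import pandas as pd

# Hypothetical toy records; the column names are assumptions for illustration.
records = pd.DataFrame({
    "name": ["Ana Diaz", "ana diaz ", "Ben Lee", "Ben Lee"],
    "dob": ["2001-04-02", "2001-04-02", "2000-11-30", "2000-11-30"],
    "gpa": [3.4, 3.4, 2.9, 3.1],
})
pii_cols = ["name", "dob"]

# Normalize the PII so trivially different spellings land in the same group.
normalized = records[pii_cols].apply(lambda col: col.str.strip().str.lower())

# Exact matching on normalized PII stands in for the clustering step:
# every distinct (name, dob) combination gets one integer identifier.
records["anon_id"] = normalized.groupby(pii_cols).ngroup()

# Drop the PII, keeping only the anonymized identifier and the analysis columns.
anonymized = records.drop(columns=pii_cols)
print(anonymized)
```

The identifier here is just a group number, so it carries no information about the original PII, but anyone holding the mapping could still link rows back to people.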

This is a good first step in that it removes data that can identify an individual directly. But we still have a problem. Someone may be able to combine, for instance, high school graduation year, high school grade point average, courses taken in high school, race and ethnicity, and gender to work out who a given row describes. We cannot anonymize all data, including non-PII data, since we would have nothing left to run analysis on. But we can synthesize data.
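One way to see this risk is to count how many rows share each combination of these indirect identifiers. The sketch below does that on a made-up table; the column names and values are assumptions for illustration, and the smallest group size is what the privacy literature calls the dataset's k in k-anonymity.

```python
import pandas as pd

# Hypothetical anonymized rows; columns and values are made up for illustration.
students = pd.DataFrame({
    "grad_year": [2019, 2019, 2019, 2020, 2020],
    "gender": ["F", "F", "M", "F", "F"],
    "gpa_band": ["3.0-3.5", "3.0-3.5", "3.5-4.0", "2.5-3.0", "2.5-3.0"],
})
quasi_identifiers = ["grad_year", "gender", "gpa_band"]

# Number of rows sharing each observed combination of quasi-identifiers.
group_sizes = students.groupby(quasi_identifiers).size()

# A combination held by a single row points at exactly one person.
at_risk = group_sizes[group_sizes == 1]

k = int(group_sizes.min())
print(f"k = {k}, combinations unique to one row: {len(at_risk)}")
```

In this toy table one combination belongs to a single row, so that person could be singled out even though no PII remains.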

We would again build a deep learning model that learns the patterns of the real anonymized data. We then have the model create new rows for us. If the model has learned the right patterns in the data, then the new rows should show statistical relationships that are close enough to those found in the real data. At the same time, every row we work with is generated by the model, so there is no actual person to tie any row back to.

It is this second use case of synthetic data that I am exploring for work purposes, so what better way is there to become fluent with synthetic data than trying to build my own deep learning model to create it?
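To make "close enough" a little more concrete, here is one simple check we could run once we have synthetic rows: compare the pairwise correlations of the real and synthetic tables. This is a sketch under assumptions, not a full evaluation. The helper name `correlation_gap` is hypothetical, the "real" data is randomly generated, and the "synthetic" data comes from a fitted Gaussian that merely stands in for a trained generator.

```python
import numpy as np
import pandas as pd


def correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Largest absolute difference between the two Pearson correlation matrices."""
    diff = real.corr() - synthetic[real.columns].corr()
    return float(diff.abs().to_numpy().max())


rng = np.random.default_rng(0)

# Toy "real" data: two correlated numeric features (randomly generated).
x = rng.normal(size=5000)
real = pd.DataFrame({"gpa": x, "test_score": 0.8 * x + 0.6 * rng.normal(size=5000)})

# Stand-in generator, NOT a GAN: sample from a Gaussian fitted to the real data.
synthetic = pd.DataFrame(
    rng.multivariate_normal(real.mean().to_numpy(), real.cov().to_numpy(), size=5000),
    columns=real.columns,
)

gap = correlation_gap(real, synthetic)
print(f"largest correlation difference: {gap:.3f}")
```

A small gap only says that linear pairwise relationships were preserved. It says nothing about marginal distributions, categorical features, or whether any synthetic row is a near copy of a real one, so it would be one check among several.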


## Strategy

I will admit up front that there are pieces of building a model to create synthetic data that I do not really understand at this point. Instead of figuring those out ahead of time, I am hoping to use this project to explore modeling strategies and to lean on the academic papers in the References section below to overcome hurdles. It may be that in the end we follow one of those papers closely, likely the one in the Source section above.

There are a couple of items that stand out in the early exploratory phase.

- We are looking to generate synthetic data, so we need a generative model. This points to us using some version of a GAN. (Please see the following section for a description of GANs.)
- We can start by looking at generating data for one feature, but we will need to expand to generating data for entire rows. This means we need a way to treat rows in their entirety as inputs and outputs.
- We may want to explore rows as sequential data with one feature following another. I am skeptical of this as of now, but it would open up using recurrent neural networks (RNNs) as GAN generators with the assumption that there is some level of correlation between features as we move forward in the sequence of one feature after the other.
- We are purposely avoiding transformer-based models such as the generative AI models that have been in the news in the past few years. I would rather see how far the building blocks we have worked on in this course can take us in creating synthetic data.
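On the point about treating entire rows as inputs and outputs, one possible starting place is to flatten each row into a single numeric vector. The sketch below min-max scales the numeric columns and one-hot encodes the categorical ones. The table and column names are made up, and this is a deliberate simplification: the CTGAN paper uses a more involved scheme for continuous columns (mode-specific normalization).

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names and values are assumptions for illustration.
rows = pd.DataFrame({
    "grad_year": [2019, 2020, 2021],
    "gpa": [3.2, 2.8, 3.9],
    "gender": ["F", "M", "F"],
})
numeric_cols = ["grad_year", "gpa"]
categorical_cols = ["gender"]

# Min-max scale each numeric column to [0, 1] so features share a range.
mins = rows[numeric_cols].min()
spans = rows[numeric_cols].max() - mins
scaled = (rows[numeric_cols] - mins) / spans

# One-hot encode each categorical column into 0/1 indicator columns.
one_hot = pd.get_dummies(rows[categorical_cols], dtype=float)

# One flat vector per row: this is what a network would consume and produce.
encoded = pd.concat([scaled, one_hot], axis=1)
row_vectors = encoded.to_numpy(dtype=np.float32)
print(encoded.columns.tolist())
print(row_vectors.shape)
```

To turn a generated vector back into a row, we would reverse the scaling with the stored `mins` and `spans` and take the largest value within each block of one-hot columns as the chosen category.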


## Generative Adversarial Networks (GANs)

## References

Academic papers
- https://www2.stat.duke.edu/~jerry/Papers/jos03.pdf 
- https://www2.stat.duke.edu/~jerry/Papers/sm04.pdf
- https://arxiv.org/abs/2303.01230v3
- https://dl.acm.org/doi/10.1145/3636424
- https://arxiv.org/abs/1907.00503
- https://arxiv.org/abs/1609.05473
- https://arxiv.org/abs/1611.04051
- https://arxiv.org/abs/1810.06640
- https://arxiv.org/abs/2403.04190v1
- https://arxiv.org/abs/1811.11264

Web articles
- https://medium.com/@aldolamberti/synthetic-data-101-synthetic-data-vs-real-dummy-data-237d790433a9
- https://machinelearningmastery.com/mostly-generate-synethetic-data-machine-learning-why/
- https://towardsdatascience.com/generative-ai-synthetic-data-generation-with-gans-using-pytorch-2e4dde8a17dd
- https://becominghuman.ai/generative-adversarial-networks-for-text-generation-part-1-2b886c8cab10
- https://becominghuman.ai/generative-adversarial-networks-for-text-generation-part-3-non-rl-methods-70d1be02350b
- https://towardsdatascience.com/how-to-generate-tabular-data-using-ctgans-9386e45836a6
- https://medium.com/analytics-vidhya/a-step-by-step-guide-to-generate-tabular-synthetic-dataset-with-gans-d55fc373c8db

Source code
- https://github.com/sdv-dev/TGAN
- https://github.com/sdv-dev/CTGAN

Videos
- https://www.youtube.com/watch?v=yujdA46HKwA (GANs for Tabular Synthetic Data Generation)
- https://www.youtube.com/watch?v=Ei0klF38CNs (Synthetic data generation with CTGAN)
- https://www.youtube.com/watch?v=ROLugVqjf00 (Generation of Synthetic Financial Time Series with GANs - Casper Hogenboom)
- https://www.youtube.com/watch?v=HIusawrGBN4 (What is Synthetic Data? No, It's Not "Fake" Data)
- https://www.youtube.com/watch?v=FLTWjkx0kWE (Generate Synthetic Tabular Data with GANs)