In [None]:
import numpy as np
import pandas as pd

## Source

1. Noseong Park and Mahmoud Mohammadi and Kshitij Gorde and Sushil Jajodia and Hongkyu Park and Youngmin Kim (2018). Data Synthesis Based on Generative Adversarial Networks. [https://www.vldb.org/pvldb/vol11/p1071-park.pdf]. VLDB

    This paper introduces table-GAN, a deep learning approach that creates synthetic tabular data using convolutional neural networks (CNNs) and generative adversarial networks (GANs). It was published in the same year as the next paper on TGAN, though by different authors, and source 2 below references this paper and its table-GAN approach.
    
2. Lei Xu and Kalyan Veeramachaneni (2018). Synthesizing Tabular Data using Generative Adversarial Networks. [https://arxiv.org/abs/1811.11264]. arXiv

    This paper introduces tabular generative adversarial networks (TGANs), a precursor to the CTGAN in the next paper and similar to the previous paper but utilizing recurrent neural networks (RNNs) in place of CNNs.
     
3. Lei Xu and Maria Skoularidou and Alfredo Cuesta-Infante and Kalyan Veeramachaneni (2019). Modeling Tabular Data using Conditional GAN. [https://arxiv.org/abs/1907.00503]. arXiv

    This paper introduces conditional tabular generative adversarial networks (CTGAN), a deep learning conditional GAN that improves, according to the authors, on their approach in the previous paper by introducing a conditional generator to the architecture.

4. Emiliano De Cristofaro (2024). Synthetic Data: Methods, Use Cases, and Risks. [https://arxiv.org/abs/2303.01230]. arXiv

    This paper is a good resource for understanding synthetic data generally.

5. Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.

    This dataset comes from the University of California Irvine Machine Learning Repository, is based on 1994 Census data, and allows us to classify if given individuals make more than $50,000 per year. While gleaning insights out of the dataset is not the focus of this project, we do still need a dataset to work with as we test out generating synthetic data.

6. Alec Radford and Luke Metz and Soumith Chintala (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. [https://arxiv.org/abs/1511.06434]. arXiv 
    
    This paper introduces deep convolutional generative adversarial networks (DCGAN), a GAN architecture that [1] relies on.



## Project & Introduction to Synthetic Data

My modus operandi is to start projects by finding interesting datasets and building out the other details from there. In this project, we will instead start with an application of deep learning and see if we can get it working on a tried-and-true machine learning dataset. We will try to build a deep learning model that creates synthetic data. Synthetic data is "artificially generated [data] that resemble[s] the actual data -- more precisely, having similar statistical properties" (De Cristofaro, 2024, p. 1).

Note that it is unclear at the outset of the project just how far into creating synthetic data models we will be able to get in this project. As we will see in the next section, the three research papers above that we will focus on for implementation are pretty complicated. My preference is to tackle that complexity head-on and see how far we can get, but I do want to recognize that each paper has numerous authors with terminal degrees, and this research may be part of some of the authors' theses, so I want to be careful in assuming that I may be able to recreate all of their work, even if I have their research papers to guide me. It may be that we get part of the modeling working, but we do not get a full version of the models from any of the papers. My read on the assignment is that progressing through the stages of building up synthetic data models fits with expectations, and I am really interested in tackling a project with this kind of complexity of theory and modeling, so we will see how far we can get in terms of building one of these models.

Now, back to synthetic data.

Synthetic data has been in the air lately due to its proposed use to feed even greater amounts of data into burgeoning artificial intelligence models. This use of synthetic data is focused on increasing the amount of data that is available. If we have 1,000,000 rows of data, we can use a model to learn the ins and outs of that data and then create new synthetic rows to supplement the real ones. The same approach can also re-balance classes in the data by generating extra rows for underrepresented classes.

We can also use synthetic data to protect privacy, the use case I am interested in.

If we have a dataset that includes personally identifiable information (PII) -- names, social security numbers (SSNs) or other unique identifiers, addresses, emails, dates of birth, etc. -- our first step would be to anonymize that data. One approach here would be to develop an algorithm for assigning an anonymized unique identifier for each unique individual based on the PII. We can feed PII into the model, let the model cluster by unique individuals it identifies based on the PII, and then have the model assign the anonymized identifier to each cluster. We could then drop all PII from the dataset and replace it with the anonymized identifier.

This is a good first step in that it removes data that can identify an individual directly, but we still have a problem. Someone may be able to use, for instance, high school graduation year, high school grade point average, courses taken in high school, race and ethnicity, and gender to reverse engineer who someone is. We cannot anonymize all data, including non-PII data, to reduce this risk since we would have nothing left to run analysis on. But we can synthesize data. Instead of supplementing real data with synthetic data to increase the number of observations in the dataset, this time we can create a dataset that is made up entirely of synthetic data. We can share that synthetic dataset for analysis since it matches the real data's statistical properties, and it protects privacy since we are not releasing any rows with real data, so there is no real person to tie any row back to.

This project will focus on "[g]enerative machine models [that] learn how a dataset is generated using a probabilistic model and [that] create synthetic data by sampling from the learned distribution" (De Cristofaro, 2024, p. 2). Relying on models we learned in class, generative adversarial networks (GANs) seem like the obvious choice for this use case. (See the GAN section below for a deeper dive into GANs.) The Monet-image Kaggle competition leverages CycleGANs, and CycleGANs can in a sense be thought of as autoencoders, another model that we may want to look at here. Autoencoders convert an input to a reduced latent space (encoder) and then convert that latent space back into the original space (decoder). Once we train the autoencoder, we can set the encoder aside, pass random noise directly into the latent space, and let the decoder translate that noise into the original input space, which in theory gives us the synthetic output we are looking for.

There are risks with synthetic data when it comes to privacy. While sharing completely synthetic data provides extra layers of protection, the statistical trends in the data may be strong enough that a nefarious actor could reverse engineer real identities from the synthetic data. One example of this is any value that shows up infrequently. The synthetic data should pick up on that and recreate that value with a similarly low frequency, and that can lead to the common issues that come with small n counts in shared datasets. One attempt to address this is l-diversity, which requires each group of records that share the same values for their indirectly identifying attributes to contain at least l distinct values of the sensitive attribute (Park et al., 2018, pp. 1072-1073).

We also need to worry about very strong correlations between features. The synthetic data will pick up on these correlations, and, again combined with small n concerns, this broadens the information a bad actor can use to try to re-identify individuals from the synthetic data (De Cristofaro, 2024, p. 4, Attribute Disclosure section).

Another concern is re-identification attacks where the attacker uses other sources combined with the synthetic data to re-identify individuals. Fields like SSNs or names are direct identifiers. Fields like race, ethnicity, and gender are quasi-identifiers (QIDs). Attackers can use these QIDs to supplement statistical trends in the synthetic data to piece together who an individual is (Park et al., 2018, p. 1072). We can control the synthetic data we share, but we cannot control what other dataset or outside information someone may have access to.
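One way to make the small-n concern concrete is to count how many records share each combination of quasi-identifier values. The sketch below uses made-up records, hypothetical field names (`race`, `gender`, `grad_year`), and a hypothetical threshold `k`; it only illustrates the idea and is not a method taken from the cited papers.

```python
from collections import Counter

# Made-up records; the field names are hypothetical quasi-identifiers
records = [
    {"race": "A", "gender": "F", "grad_year": 2010},
    {"race": "A", "gender": "F", "grad_year": 2010},
    {"race": "A", "gender": "F", "grad_year": 2010},
    {"race": "B", "gender": "M", "grad_year": 2011},
    {"race": "B", "gender": "M", "grad_year": 2011},
    {"race": "C", "gender": "F", "grad_year": 2012},
]
qids = ("race", "gender", "grad_year")
k = 3  # hypothetical minimum group size

# Count records per combination of quasi-identifier values
group_sizes = Counter(tuple(r[q] for q in qids) for r in records)

# Combinations shared by fewer than k records are easier to re-identify
small_groups = {combo: n for combo, n in group_sizes.items() if n < k}
print(small_groups)
```

Any combination that lands in `small_groups` is a candidate for the kind of re-identification described above, since few individuals match it.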

I do want to point out that the De Cristofaro paper that I am mostly referencing in this section calls out that synthetic data to protect privacy as we will look at it in this project is not necessarily as safe as it may at first seem, providing "little additional protection compared to anonymization techniques, with privacy-utility trade-offs being even harder to predict" (De Cristofaro, 2024, p. 5). The paper recommends looking at differential privacy as a better option on top of synthetic data. Differential privacy introduces noise into the data that is shared, muddying up what is real or not. This highlights the tension between privacy and utility of synthetic data since adding in noise to protect privacy inherently reduces the statistical accuracy of the synthetic data compared to the real data (De Cristofaro, 2024, p. 1). And, regardless of what solution we use, there really is no way to get both privacy and utility for underrepresented values or classes in the data since any upsampling to protect privacy by necessity alters the statistical accuracy of the synthetic data (De Cristofaro, 2024, p. 6).

The issues raised here are not exhaustive, but they hopefully provide an introductory sampling of the kinds of issues we are concerned with and how synthetic data attempts a response to those issues. Zooming out and acknowledging the complexities and concerns mentioned here and referenced in the cited papers, we do need to start somewhere. The goal of this project is to understand how deep learning models for synthetic data work, trying to build a model up while following along with the research papers. We need a solid foundation in synthetic data before adding in too many open issues that are yet unsolved for synthetic data modeling. The next step is to plan out how we will work with the research papers to try to build our own synthetic data model.

## Neural Network Background

This section covers background material that readers are likely to know having gone through this deep learning course, but we will still provide a brief summary.

### Generative Adversarial Networks (GANs)

GANs are machine learning architectures (networks) that use competing neural networks (adversarial) that generate (generative), in this case generating synthetic data. The competing networks are the generator and discriminator. The generator takes in a noise vector as input and, initially, generates random noise. The discriminator takes in the output of the generator along with real data and distinguishes between the two, passing its labeling of real and synthetic data back to the generator. This cycle repeats again and again, each time the generator getting better at generating real-looking synthetic data and the discriminator becoming more discerning between real and synthetic data. The goal is that the generator will eventually win out and produce synthetic data that a well-refined discriminator cannot discern as different from real data. At this point, we have a generator model that we can use to produce real-enough synthetic data.

### Convolutional Neural Networks (CNNs)

CNNs are feed-forward neural networks where we apply kernels or filters -- rectangular windows -- that slide along the dimensions of the input matrices. For tabular data, we transform an input row into a square matrix, zero padding as necessary to fill out the square matrices, so we can think of the input for tabular data as appearing like an image with a width and a height and with values at each pixel. For example, if we have an input row with 13 features, we can transform that 1x13 vector into a 4x4 matrix and pad out the 2nd, 3rd, and 4th columns in the final row since the original row does not have enough features to fill those cells in the transformed matrix.
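The 13-feature example can be sketched in NumPy as follows. The helper names `row_to_square` and `square_to_row` are hypothetical, and the feature values are made up; this is a minimal illustration of the padding and reshaping step, not code from [1].

```python
import math
import numpy as np

def row_to_square(row):
    """Zero-pad a 1-D feature vector and reshape it into the smallest square matrix that holds it."""
    row = np.asarray(row, dtype=float)
    side = math.ceil(math.sqrt(row.size))  # 13 features -> side length 4
    padded = np.zeros(side * side)
    padded[:row.size] = row                # remaining cells stay 0
    return padded.reshape(side, side)

def square_to_row(matrix, n_features):
    """Invert row_to_square by flattening and dropping the padding cells."""
    return np.asarray(matrix).reshape(-1)[:n_features]

record = np.arange(1, 14)       # 13 made-up feature values
matrix = row_to_square(record)
print(matrix)
```

The final row of `matrix` holds the 13th feature followed by three padded zeros, and `square_to_row(matrix, 13)` recovers the original record.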

When we focused on images as inputs in a previous week's project, the CNN kernels summarized or abstracted elements of the images. If we added multiple kernels per layer, then each kernel would pick up on different aspects of the image. When we added multiple layers, each layer picked up on higher levels of abstraction. Note that in the CNN project we started with rectangular images with three channels and reduced those down with convolutional layers. In [1], we use "deconvolutions" or fractionally strided convolutions, and we then follow those with convolutions, the opposite direction of what we saw with the images in the previous project. We will go into the specifics for [1] below.

One aspect of CNNs to highlight is striding. Striding occurs when we slide a filter across a matrix but skip a certain number of positions at each step. For example, if we have a 10x10 input matrix and a 2x2 filter, without striding (a stride of 1) we would start with the filter in the upper-left of the matrix, calculate, and then shift the filter to the right by one column. With a stride of 2, when we move that filter to the right, we skip a column and move to the second column to the right. A stride of 3 would skip two columns and jump to the third to the right.
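To see how the stride changes the output size for the 10x10 input and 2x2 filter, here is a naive NumPy sketch. It assumes a square input, a square filter, and no padding; `conv_output_size` and `conv2d` are hypothetical helper names, and the loop-based implementation is for illustration rather than speed.

```python
import numpy as np

def conv_output_size(n, k, stride=1):
    """Output length along one dimension for a k-wide filter with no padding."""
    return (n - k) // stride + 1

def conv2d(x, kernel, stride=1):
    """Naive 2-D sliding-window filter (no padding) on a square input."""
    k = kernel.shape[0]
    out = conv_output_size(x.shape[0], k, stride)
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            r, c = i * stride, j * stride  # top-left corner of the window
            result[i, j] = np.sum(x[r:r + k, c:c + k] * kernel)
    return result

x = np.arange(100, dtype=float).reshape(10, 10)
kernel = np.ones((2, 2))
for s in (1, 2, 3):
    print(s, conv2d(x, kernel, stride=s).shape)
```

With these inputs the output shrinks from 9x9 at stride 1 to 5x5 at stride 2 and 3x3 at stride 3.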

For fractionally strided convolutions, we use strides that are less than 1. Strides greater than 1 reduce the output dimensions, so strides less than 1 increase the output dimensions. If we stride by 1/2, we add rows and columns of 0s between the rows and columns of the original matrix. This roughly doubles the dimensions of the input (an n x n matrix becomes (2n - 1) x (2n - 1) before any padding). We can then apply a convolution layer to the upsampled output with additional padding but with no pooling or striding to keep the dimensions of the upsampled output the same but to fill in the 0s. This process shows up as "deconvolutions" but is implemented with fractional striding in order to increase instead of decrease the dimensions of the input to that convolutional layer.
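The zero-insertion view of a stride of 1/2 can be sketched in NumPy. The 4x4 kernel of ones and the padding of 2 are assumptions chosen so that a 4x4 input comes out as 8x8; a trained layer would learn its kernel weights, and `zero_insert` and `fractionally_strided_conv` are hypothetical helper names.

```python
import numpy as np

def zero_insert(x, factor=2):
    """Place (factor - 1) rows and columns of zeros between neighbouring entries of x."""
    n, m = x.shape
    up = np.zeros(((n - 1) * factor + 1, (m - 1) * factor + 1), dtype=x.dtype)
    up[::factor, ::factor] = x
    return up

def fractionally_strided_conv(x, kernel, factor=2, pad=2):
    """Zero-insert, zero-pad the border, then slide the kernel with stride 1."""
    up = np.pad(zero_insert(x, factor), pad)
    k = kernel.shape[0]
    out = up.shape[0] - k + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = np.sum(up[i:i + k, j:j + k] * kernel)
    return result

x = np.arange(1, 17, dtype=float).reshape(4, 4)
upsampled = zero_insert(x)                               # 7x7 with zeros in between
output = fractionally_strided_conv(x, np.ones((4, 4)))   # 8x8
print(upsampled.shape, output.shape)
```

Zero insertion alone turns the 4x4 input into 7x7; the border padding plus the stride-1 filter is what brings the result to exactly double the input size in this setup.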

### Deep Convolutional GAN (DCGAN)

This model is explained in [6] and shows up in the architecture for [1]. DCGANs are GANs that use specific CNN architectures for both the generator and discriminator. Page 3 of the DCGAN paper has a summary list of the details for DCGAN that we will copy over here:

- Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
- Use batchnorm in both the generator and the discriminator.
- Remove fully connected hidden layers for deeper architectures.
- Use ReLU activation in generator for all layers except for the output, which uses Tanh.
- Use LeakyReLU activation in the discriminator for all layers. [Radford et al., 2016, p. 3]

### Recurrent Neural Networks (RNNs)

RNNs introduce recurrence, a deviation from feed-forward network architectures. CNNs are one example of feed-forward networks where the inputs get passed forward through the network but do not move sideways or backwards. In RNNs, we have recurrent layers that look sort of like hidden layers in other artificial neural networks but that are able to pass their output back into themselves, creating a loop in place of feed-forward hidden layers. Technically the RNN has one recurrent hidden layer, but we can think of that repeating hidden layer as unfolding into sequential hidden layers that take inputs from the previous hidden layers as well as new inputs from the tabular data. Note that the weights for the recurrent layer are shared across all of these unfolded steps, unlike hidden layers in other architectures, which each have their own separate weights.

We will cover RNNs in [2] in more detail below, but the general idea is that, for a given row, we will pass the first feature into the RNN, and then we will pass the hidden state produced from that first feature back into the RNN along with the second feature, then we will pass the resulting hidden state back into the RNN along with the third feature, and so on. The focus here is that we are treating features in a row like sequential inputs where each input can receive information from previous inputs. This idea of features as sequential inputs for tabular data is interesting in opening up the possibility of using RNNs to generate tabular synthetic data.
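A minimal NumPy sketch of this feature-by-feature recurrence is below. It is a plain tanh recurrent cell with randomly initialised weights and a hypothetical hidden size of 4, meant only to show the shared weights and the hidden state being fed back in; it is not the architecture from [2].

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 4  # hypothetical size

# One set of weights, reused at every step of the sequence
W_x = rng.normal(scale=0.5, size=(hidden_size, 1))            # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)

def rnn_over_features(row):
    """Feed the features of one row through the cell one at a time."""
    h = np.zeros(hidden_size)  # initial hidden state
    states = []
    for value in row:
        # The new state depends on the current feature and the previous state
        h = np.tanh(W_x @ np.array([value]) + W_h @ h + b)
        states.append(h)
    return np.stack(states)

row = np.array([0.2, -1.0, 0.5])  # made-up feature values
states = rnn_over_features(row)
print(states.shape)
```

Each row of `states` is the hidden state after one more feature has been read, so the last row has seen the whole record.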

### Word Embeddings (do we need these?)

Only [2] references embedding vectors, but we will introduce them here in case we end up needing them. Machine learning models work with numbers, not with text, so we need a way to turn text into numbers. One approach could be to identify all the unique words in a feature -- the vocabulary for that feature -- and then one-hot encode each individual word, creating a one-hot encoded matrix per feature. Skipping over some of the cons of this approach, next we could instead create a matrix with one row per row in the base dataset and one column for each word in the combined vocabulary, and then we could put 1s for each word that shows up in a given row for a feature. A third approach could be to create a list of the vocabulary for each feature, and then we can replace each word that shows up in that feature with the index of that word in the vocabulary list.
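The three simple encodings described here can be sketched with a made-up text feature. The values and variable names are hypothetical, and splitting on whitespace is an assumption about how words are identified.

```python
import numpy as np

feature = ["private", "state gov", "private", "self employed"]  # made-up values

# Vocabulary: the sorted unique words that appear in the feature
vocab = sorted({word for value in feature for word in value.split()})
index = {word: i for i, word in enumerate(vocab)}

# Approach 1: one-hot encode each word, giving one small matrix per value
identity = np.eye(len(vocab), dtype=int)
one_hot = [identity[[index[w] for w in value.split()]] for value in feature]

# Approach 2: one row per record, with a 1 for every vocabulary word present
multi_hot = np.zeros((len(feature), len(vocab)), dtype=int)
for r, value in enumerate(feature):
    for w in value.split():
        multi_hot[r, index[w]] = 1

# Approach 3: replace each word with its position in the vocabulary list
as_indices = [[index[w] for w in value.split()] for value in feature]

print(vocab)
print(multi_hot)
print(as_indices)
```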

From here, there are more sophisticated approaches such as TF-IDF, Word2Vec, and GloVe. We will not go into the specifics of these, but each has a different way of converting text into numbers.

I am surprised that word embeddings do not show up more often across all three papers. When we review the papers in detail below, we will keep an eye out for what might be going on here or if the researchers assume that implementations will handle embeddings without their needing to direct how.


## Research Paper Architecture Review

The three papers at the start of the notebook present different approaches to generating tabular synthetic data, tabular data being data in a table format with rows and columns. All three use GANs for the generative aspect of their architectures. [1] and [2] come earlier than [3] and have slightly simpler architectures, though they still introduce layers of complexity that are hard to sort through. [3] comes later and looks more effective than [1] and [2], but it introduces an extra layer of complexity with a conditional GAN.

For terminology, we will refer to the model in [1] as table-GAN, in [2] as TGAN, and in [3] as CTGAN.

Note that we are only covering new info not discussed above. Each paper has some level of review of other methods for preserving privacy and utility. The papers cover many areas related to risk, exposure, and competing methods. To keep the scope down, we will focus on the core of the approaches and architectures that each paper uses and that we can try to implement ourselves.

### [1] Data Synthesis Based on Generative Adversarial Networks

- Trains machine learning models on real and synthetic tables and shows that performance is similar -- called model compatibility or the concept that the synthetic table can replace the real table [Park et al., 2018, p. 1071]
- table-GAN can handle real tabular data that includes categorical, discrete, and continuous values and leaves other types out for now [Park et al., 2018, p. 1071]
- Consists of three ANNs: generator, discriminator, and classifier [Park et al., 2018, p. 1072]
- Classifier increases the "semantic integrity" of the synthetic data by learning the semantics in the real data, meaning the classifier learns what combinations of values in different features are legitimate so we do not end up with something like most recent course being 8th grade but having a high school graduation date [Park et al., 2018, p. 1072]
- Includes three loss functions: 1) the standard objective function for GANs with the minimax between the generator and discriminator; 2) information loss that matches the mean and standard deviation across row features, making sure that synthetic rows have the same statistical properties of real rows (based on the paper, maybe maximum-margin in hinge loss); and 3) classification loss that maintains semantic integrity, adding much more complexity (this is a step we can add in once we are ready) [Park et al., 2018, p. 1072]
- Uses DCGAN as basis for table-GAN [Park et al., 2018, p. 1073]
 
Overall workflow [1074]
- Convert records into square matrices with zero padding as needed
- Train table-GAN on square matrices (see details below)
- table-GAN generates synthetic square matrices that we convert into records and combine into a table
- Train models and run analysis on synthetic table
- Evaluate statistics and performance on synthetic table for evaluation purposes [Park et al., 2018, p. 1074]
 
table-GAN architecture [1074-5]
- DCGAN as explained above along with an additional classifier model (classifier can be an additional layer of complexity we add in after getting the remaining modeling working)
- Discriminator
    - CNN with multiple layers including batchnorm and leaky ReLU
    - Final layer is a sigmoid layer that predicts 1 for real data and 0 for synthetic data
- Generator
    - Also a CNN with multiple de-convolutional layers
    - Latent space input is a tensor with each value in the range of [-1,1]
    - De-convolutional layers convert input into a 2D matrix that matches the dimensions for the records in the synthetic table
- Classifier
    - Same architecture as the discriminator
    - Trained by ground-truth labels in the real table
    - Can teach the generator whether the records it produces are semantically correct, meaning whether the synthetic values form legitimate combinations
    - Semantically incorrect synthetic records are likely to be flagged as fake by the discriminator, though the discriminator's main goal is not semantic integrity

Loss functions
- Generator uses all three losses, discriminator uses DCGAN loss, and classifier uses classification loss [1075]
- Original loss is standard GAN loss [1075]
    - Discriminator maximizes this loss while the generator minimizes it [1076]
- Information loss is the discrepancy between two statistics of synthetic and real records [1075]
    - We pull these statistics just before the sigmoid activation of the discriminator predicts the record as real or fake [1076]
    - We want this loss for both mean and standard deviation to be 0, indicating that the discriminator may not be able to distinguish between them [1076]
    - Third generator information loss lets us control deltas to set privacy levels where smaller deltas mean the synthetic data is closer to the original data. We can raise delta if we need to share synthetic data with less trustworthy recipients and want to create greater differences between the real and synthetic data. [1076]
    - These deltas are hyperparameters.
- Classification loss is the discrepancy between the label predicted by the classifier and the synthesized label [1075]

Training algorithm
- Use minibatch stochastic gradient descent (SGD) [1076]
- One issue with using SGD is that we cannot calculate the mean and standard deviation of all records for a given feature for information loss [1076]
- We use an exponentially weighted moving average to approximate the mean and standard deviation for each feature. The weight should be close to 1 for stability (paper uses 0.99) [1076-7]
- 1) train discriminator with GAN loss; 2) train classifier with classification loss for classifier; and 3) train generator with GAN loss + information loss for generator + classification loss for generator [1077]
- Once trained, we pass in latent vector z to create one synthetic record [1077]
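The exponentially weighted moving average from the training notes can be sketched as below. This assumes the update form `w * old + (1 - w) * batch`, zero initial values, and made-up Gaussian minibatches; the function name `update_moving_stats` is hypothetical, and the exact bookkeeping in [1] may differ.

```python
import numpy as np

def update_moving_stats(moving_mean, moving_sd, batch_features, w=0.99):
    """Blend the running per-feature mean and SD with one minibatch's statistics."""
    batch_mean = batch_features.mean(axis=0)
    batch_sd = batch_features.std(axis=0)
    new_mean = w * moving_mean + (1 - w) * batch_mean
    new_sd = w * moving_sd + (1 - w) * batch_sd
    return new_mean, new_sd

rng = np.random.default_rng(0)
moving_mean = np.zeros(3)  # assumed starting values
moving_sd = np.zeros(3)
for _ in range(1000):
    # Made-up minibatch of 64 records with 3 features
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    moving_mean, moving_sd = update_moving_stats(moving_mean, moving_sd, batch)

print(moving_mean, moving_sd)
```

Because `w` is close to 1, each minibatch only nudges the running values, which keeps the estimates stable even though no single batch sees all records.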

Evaluation
- Distance to the closest record (DCR), statistical comparisons, and comparing classification or regression performance between models trained on the synthetic and the real tables [1078]
     
    
### [2] Synthesizing Tabular Data using Generative Adversarial Networks
     
### [3] Modeling Tabular Data using Conditional GAN

- Modeling the probability distribution of rows in tabular data
- Generate synthetic data from that row-based probability distribution
- Tabular data has a mix of discrete and continuous columns
- Designed CTGAN with a conditional generator to address challenges
- Need to model discrete data, continuous data, and imbalanced categorical data
- CTGAN
  - mode-specific normalization
  - architectural changes
  - conditional generator and training-by-example to address data imbalance
- Before CTGAN, model each column as a random variable and model a joint probability distribution, and then sample from that distribution to generate synthetic data
  - Limited by type of distributions
  - Limited by computational issues
  - Both hinder the synthetic data's fidelity
- CTGAN
  - Each column is a random variable
  - Continuous columns and discrete columns
  - These random variables follow an unknown joint probability distribution
  - One row is one observation from this joint probability distribution
  - Evaluation
    - $T_{train}$, $T_{test}$, $T_{syn}$
    - Do the columns in $T_{syn}$ follow the same joint distribution as $T_{train}$
    - If we train a classifier or regressor for one column based on remaining columns, does a model trained on $T_{syn}$ achieve similar performance on $T_{test}$ as a model trained on $T_{train}$
- Mixed data types: CTGAN must apply softmax and tanh on the output
- Non-Gaussian distributions: continuous values in tabular data are often non-Gaussian
- Multi-modal distributions: GANs struggle to model multi-modal continuous columns
- Sparse one-hot encodings: generative model is trained to generate a probability distribution over all categories using softmax while the real data is represented in a one-hot vector (check on what this means)
- Highly imbalanced categorical columns: creates mode collapse and other issues

Mode-Specific Normalization
- Previous models used min-max normalization to shift continuous features to [-1,1]
- Use Gaussian mixture models in place of kmeans clustering to determine modes for each continuous feature
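For reference, the min-max normalization that earlier models relied on can be sketched as follows. The column values are made up, `min_max_scale` is a hypothetical helper, and the sketch assumes the column is not constant; it shows only the baseline that mode-specific normalization is meant to improve on, not the mode-specific method itself.

```python
import numpy as np

def min_max_scale(column):
    """Linearly rescale a continuous column to the range [-1, 1]."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    return 2 * (column - lo) / (hi - lo) - 1  # assumes hi > lo

ages = np.array([17, 30, 45, 90])  # made-up continuous feature
scaled = min_max_scale(ages)
print(scaled)
```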




## Losses

GAN Loss

$$\min_{G}\max_{D}\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] +  \mathbb{E}_{z\sim p_{z}(z)}[\log{(1 - D(G(z)))}]$$
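As a small numeric sketch, the GAN objective can be estimated on a minibatch from the discriminator's outputs, using the standard form of the second term, log(1 - D(G(z))). The function name `gan_value`, the clipping constant `eps`, and the probabilities below are made up for illustration.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Minibatch estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    d_real = np.clip(d_real, eps, 1 - eps)  # avoid log(0)
    d_fake = np.clip(d_fake, eps, 1 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

# Discriminator that separates real from synthetic well
confident = gan_value(np.array([0.9, 0.9]), np.array([0.1, 0.1]))

# Discriminator that cannot tell them apart
fooled = gan_value(np.array([0.5, 0.5]), np.array([0.5, 0.5]))

print(confident, fooled)
```

The discriminator pushes this value up (toward 0), while the generator pushes it down; when the discriminator outputs 0.5 everywhere the value is 2 log(0.5).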

Information Loss

\begin{align*}
    L_{mean} &= || \mathbb{E}[\mathbf{f}_x]_{x\sim p_{\text{data}}(x)} - \mathbb{E}[\mathbf{f}_{G(z)}]_{z\sim p_{z}} ||_2 \\
    L_{sd} &= || \mathbb{SD}[\mathbf{f}_x]_{x\sim p_{\text{data}}(x)} - \mathbb{SD}[\mathbf{f}_{G(z)}]_{z\sim p_{z}} ||_2 \\
    L^G_{\text{info}} &= \max(0, L_{mean} - \delta_{mean}) + \max(0, L_{sd} - \delta_{sd}) \\
\end{align*}
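A NumPy sketch of the information loss on a single minibatch is below. It takes the means and standard deviations directly over the batch rather than using moving averages, and the feature arrays are made up; `information_loss` is a hypothetical helper name.

```python
import numpy as np

def information_loss(f_real, f_fake, delta_mean=0.0, delta_sd=0.0):
    """Hinge-style loss on the gap between real and synthetic feature statistics."""
    l_mean = np.linalg.norm(f_real.mean(axis=0) - f_fake.mean(axis=0))
    l_sd = np.linalg.norm(f_real.std(axis=0) - f_fake.std(axis=0))
    return max(0.0, l_mean - delta_mean) + max(0.0, l_sd - delta_sd)

rng = np.random.default_rng(0)
f_real = rng.normal(size=(32, 4))  # made-up discriminator features for real records
f_fake = f_real + 1.0              # same spread, mean shifted by 1 in each of 4 features

print(information_loss(f_real, f_fake))                  # about 2.0
print(information_loss(f_real, f_fake, delta_mean=5.0))  # about 0.0
```

Raising `delta_mean` or `delta_sd` makes the loss tolerate a larger gap between the real and synthetic statistics, which is how the deltas act as a privacy control.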

Classification Loss

\begin{align*}
    L^C_{\text{class}} &= \mathbb{E}[|\ell(x) - C(\text{remove}(x))|]_{x \sim p_{\text{data}}(x)} \\
    L^G_{\text{class}} &= \mathbb{E}[|\ell(G(z)) - C(\text{remove}(G(z)))|]_{z \sim p_{z}(z)}
\end{align*}
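The classification loss can be sketched with a toy stand-in for the classifier. Here the label is assumed to be the last column, `remove` is implemented with `np.delete`, and `toy_classifier` is a made-up rule rather than a trained network.

```python
import numpy as np

def classification_loss(records, classifier, label_col=-1):
    """Mean absolute gap between each record's label and the classifier's prediction."""
    labels = records[:, label_col]
    features = np.delete(records, label_col, axis=1)  # remove(x): drop the label
    return np.mean(np.abs(labels - classifier(features)))

def toy_classifier(features):
    """Made-up rule: predict 1 when the first feature exceeds 0.5."""
    return (features[:, 0] > 0.5).astype(float)

records = np.array([
    [0.9, 0.1, 1.0],  # label agrees with the rule
    [0.2, 0.3, 0.0],  # label agrees with the rule
    [0.8, 0.5, 0.0],  # label disagrees with the rule
])
loss = classification_loss(records, toy_classifier)
print(loss)
```

The same function covers both losses: passing real records gives the classifier's loss, and passing generated records gives the generator's loss.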




## Evaluation

Distance to Closest Record (DCR) introduced in [1] calculates the Euclidean distance between a synthetic record and the closest real record. DCR = 0 leaks information since the synthetic data is the same as a real record. We apply attribute-wise normalization so that each attribute contributes equally to the distance. [Park et al., 2018, p. 1078]
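DCR can be sketched in NumPy as follows. Min-max scaling by the real table's ranges is an assumption about what attribute-wise normalization looks like, the tables are made up, and the pairwise distance array is only practical for small tables.

```python
import numpy as np

def distance_to_closest_record(real, synthetic):
    """Euclidean distance from each synthetic row to its nearest real row."""
    lo, hi = real.min(axis=0), real.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    real_n = (real - lo) / span             # attribute-wise normalization
    syn_n = (synthetic - lo) / span
    diffs = syn_n[:, None, :] - real_n[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

real = np.array([[0.0, 0.0], [10.0, 100.0]])
synthetic = np.array([[0.0, 0.0], [5.0, 50.0]])
dcr = distance_to_closest_record(real, synthetic)
print(dcr)
```

The first synthetic row copies a real record, so its DCR is 0; the second sits halfway between the two real records on both attributes.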

Statistical comparison introduced in [1] computes per attribute metrics and compares those between real and synthetic data attributes. We want these to be very close. [Park et al., 2018, p. 1078]

Machine learning score similarity introduced in [1] trains separate classifiers or regressors on the real and synthetic tables and then checks whether the F1 score for classification or the mean relative error (MRE) for regression is similar between the models trained on the real and the synthetic data. [Park et al., 2018, p. 1078]
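The two scores can be computed with plain NumPy. The predictions below are made up to stand in for a model trained on the real table and one trained on the synthetic table, and MRE is taken here as the mean of |true - predicted| / |true|, which is an assumption about the exact definition used in [1].

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 score for binary labels coded as 0 and 1."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mean_relative_error(y_true, y_pred):
    """Mean of |true - predicted| / |true| (assumes no true value is 0)."""
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

y_true = np.array([1, 1, 0, 0, 1])
pred_from_real = np.array([1, 1, 0, 1, 1])       # made-up predictions
pred_from_synthetic = np.array([1, 0, 0, 1, 1])  # made-up predictions

f1_real = f1_score(y_true, pred_from_real)
f1_syn = f1_score(y_true, pred_from_synthetic)
mre = mean_relative_error(np.array([100.0, 200.0]), np.array([110.0, 180.0]))
print(f1_real, f1_syn, abs(f1_real - f1_syn), mre)
```

The smaller the gap between `f1_real` and `f1_syn`, the better the synthetic table stands in for the real one on this task.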




## References

Academic papers (core to project)
- https://arxiv.org/abs/1907.00503 (Modeling Tabular Data using Conditional GAN)
- https://arxiv.org/abs/1811.11264 (Synthesizing Tabular Data using Generative Adversarial Networks)
- https://www.vldb.org/pvldb/vol11/p1071-park.pdf (Data Synthesis based on Generative Adversarial Networks)
- https://arxiv.org/abs/2303.01230v3 (Synthetic Data: Methods, Use Cases, and Risks)

Academic papers (references for project)
- https://www2.stat.duke.edu/~jerry/Papers/jos03.pdf (Multiple Imputation for Statistical Disclosure Limitation)
- https://www2.stat.duke.edu/~jerry/Papers/sm04.pdf (Simultaneous Use of Multiple Imputation for Missing Data and Disclosure Limitation)
- https://dl.acm.org/doi/10.1145/3636424 (A Survey of Generative Adversarial Networks for Synthesizing Structured Electronic Health Records)
- https://arxiv.org/abs/1609.05473 (SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient)
- https://arxiv.org/abs/1611.04051 (GANs for Sequences of Discrete Elements with the Gumbel-softmax Distribution)
- https://arxiv.org/abs/1810.06640 (Adversarial Text Generation Without Reinforcement Learning)
- https://arxiv.org/abs/2403.04190v1 (Generative AI for Synthetic Data Generation: Methods, Challenges, and the Future)
- https://arxiv.org/abs/1511.06434 (Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks)

Web articles
- https://medium.com/@aldolamberti/synthetic-data-101-synthetic-data-vs-real-dummy-data-237d790433a9
- https://machinelearningmastery.com/mostly-generate-synethetic-data-machine-learning-why/
- https://towardsdatascience.com/generative-ai-synthetic-data-generation-with-gans-using-pytorch-2e4dde8a17dd
- https://becominghuman.ai/generative-adversarial-networks-for-text-generation-part-1-2b886c8cab10
- https://becominghuman.ai/generative-adversarial-networks-for-text-generation-part-3-non-rl-methods-70d1be02350b
- https://towardsdatascience.com/how-to-generate-tabular-data-using-ctgans-9386e45836a6
- https://medium.com/analytics-vidhya/a-step-by-step-guide-to-generate-tabular-synthetic-dataset-with-gans-d55fc373c8db
- https://towardsdatascience.com/gaussian-mixture-model-clearly-explained-115010f7d4cf

Source code
- https://github.com/sdv-dev/TGAN
- https://github.com/sdv-dev/CTGAN

Videos
- https://www.youtube.com/watch?v=yujdA46HKwA (GANs for Tabular Synthetic Data Generation)
- https://www.youtube.com/watch?v=Ei0klF38CNs (Synthetic data generation with CTGAN)
- https://www.youtube.com/watch?v=ROLugVqjf00 (Generation of Synthetic Financial Time Series with GANs - Casper Hogenboom)
- https://www.youtube.com/watch?v=HIusawrGBN4 (What is Synthetic Data? No, It's Not "Fake" Data)
- https://www.youtube.com/watch?v=FLTWjkx0kWE (Generate Synthetic Tabular Data with GANs)
- https://www.youtube.com/watch?v=zC3_kM9Qwo0 (QuantUniversity Summer School 2020 | Generating Synthetic Data with (GANs))
