How are synthetic datasets generated? #252
-
Oftentimes synthetic datasets are generated with GANs (generative adversarial networks). GANs are independent of differential privacy. They consist of two sub-networks, a generator and a discriminator, where the generator feeds into the discriminator.

The generator has a small number of input nodes and widens over multiple layers to the number of columns in your dataset. The generator learns the distribution of the underlying data such that, when you feed it a batch of noise, it emits a batch of synthetic data.

The discriminator is the opposite shape: it has just as many inputs as the generator has outputs, and narrows down to a single output node that discriminates whether the input is fake or real. You train the discriminator by giving it a mixture of real data and synthetic data emitted from the generator, so it learns to distinguish between real and fake. The generator, on the other hand, is trained by maximizing the probability that the discriminator fails. To train the overall GAN, alternate training the two sub-networks to keep them evenly matched adversaries. Eventually, you can discard the discriminator and feed noise into the generator to get synthetic data. (There's a minimal sketch of this at the end of this reply.)

Training is done via gradient descent:

`θ_{t+1} = θ_t − η · grad_t`

Judging by the above formula, each of the parameters in a model trained with gradient descent can be decomposed into the sum of some random noise (the initialization `θ_0`) and the history of gradient updates. So if each gradient update is computed in a differentially private way, the final parameters, and anything sampled from them, are just post-processing of DP releases:

`θ_{t+1} = θ_t − η · DPgrad_t`

Now we just need to get `DPgrad`. Each training step, PyTorch/TensorFlow/etc. derives a transformation from a batch of sensitive observations to a batch of gradients (instance-level gradients) for each parameter in the network. `DPgrad` is straightforward: for each parameter, compute the DP mean of its instance-level gradients. To make this concrete, if your batch size is 10, just compute a DP mean over the instance gradients with n = 10.

You should now be able to backtrack up through each paragraph to get a DP-GAN. Sample from the generator (by feeding it noise) to get your synthetic dataset. In practice, we reduce privacy budgets by computing vectorized aggregations (yay!), and destroy performance by computing gradients for each input observation separately (nay).

There are many other methods of making synthetic data, but this is the most popular right now. Ultimately, there is no difference between DP summary statistics and DP-GAN synthetic data, except that DP-GAN synthetic data consists of more summary statistics and a lot more post-processing. Everything is summary statistics.
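To make the architecture concrete, here's a minimal, non-DP PyTorch sketch of the two sub-networks and the alternating training step described above. The column count, noise dimension, layer widths, and learning rates are all made-up placeholders, not recommendations:

```python
import torch
import torch.nn as nn

n_cols = 5      # hypothetical number of columns in the sensitive dataset
noise_dim = 8   # size of the generator's noise input

# Generator: small noise input, widening to one output per dataset column.
G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, n_cols))

# Discriminator: the opposite shape, narrowing to a single real/fake logit.
D = nn.Sequential(nn.Linear(n_cols, 32), nn.ReLU(), nn.Linear(32, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    n = real_batch.shape[0]
    noise = torch.randn(n, noise_dim)

    # Train the discriminator on a mixture of real and synthetic rows.
    fake_batch = G(noise).detach()  # detach: don't update G on this pass
    d_loss = bce(D(real_batch), torch.ones(n, 1)) + \
             bce(D(fake_batch), torch.zeros(n, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Train the generator to maximize the probability the discriminator fails.
    g_loss = bce(D(G(noise)), torch.ones(n, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

# After alternating train_step over many batches, discard D and sample:
# synthetic = G(torch.randn(1000, noise_dim)).detach()
```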
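And here's a rough NumPy sketch of what a DP mean of instance-level gradients might look like for a single parameter, via the usual clip-then-noise recipe. The clip norm and noise scale are illustrative; calibrating the noise to a specific (ε, δ) budget is a separate step I'm glossing over:

```python
import numpy as np

def dp_grad(instance_grads, clip_norm=1.0, noise_scale=1.0, rng=None):
    """Noisy mean of per-instance gradients for one parameter.

    instance_grads: array of shape (batch_size, *param_shape), one
    gradient per sensitive observation in the batch.
    """
    rng = rng or np.random.default_rng()
    n = instance_grads.shape[0]

    # Clip each instance gradient to an L2 norm of at most clip_norm,
    # which bounds each individual's influence (the sensitivity).
    norms = np.linalg.norm(instance_grads.reshape(n, -1), axis=1)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = instance_grads * factors.reshape(n, *([1] * (instance_grads.ndim - 1)))

    # Sum, add Gaussian noise scaled to the clip norm, divide by n.
    noise = rng.normal(0.0, noise_scale * clip_norm, size=instance_grads.shape[1:])
    return (clipped.sum(axis=0) + noise) / n

# e.g. a batch of 10 instance gradients for a parameter of shape (32,):
# dp_grad(np.random.randn(10, 32))
```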
-
@Shoeboxam Thanks Mike! This is super clear and helpful. To your last example, an extended question would be how to explain to the friend that the non-DP mean computed on the synthetic data sampled from the DP-GAN is not the actual mean of the private data. I wonder how this is handled in practice so as to cause the least confusion.
-
I would like to point out a different approach to generating synthetic data with differential privacy. The core idea is to use n-way marginals. How? Imagine a table with columns col1 through col5, where each column is tagged either num (numeric data type) or cat (categorical data type).

One can compute the joint probability distribution of all the categorical columns and sample data from it (the n-way marginal). The counts obtained along the way must first be "noised" with a Laplace or Gaussian distribution. Since the function used here is a count, the sensitivity is 1, so the Laplace noise is Lap(1/ε). Working with the joint distribution preserves the correlation among the columns.

Similarly, to preserve correlation for the numeric data, it can be divided into a finite number of bins, and then the n-way marginal can be calculated over the binned fields together with the categorical fields. The bins can later be replaced by randomly sampled values. You can also add a small amount of noise to your numeric data before binning. There's a toy sketch of the categorical part below.

Let me know your thoughts. Maybe I am late to the party, nevertheless :)
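A toy NumPy sketch of the categorical part, in case it's useful. The full joint histogram is noised cell-by-cell with Lap(1/ε); negative noisy counts are clamped to zero and the table renormalized as post-processing before sampling (ε and the example inputs are placeholders):

```python
import numpy as np

def dp_marginal_synth(cat_columns, epsilon, n_synth, rng=None):
    """Sample synthetic rows from a Laplace-noised full joint histogram
    (the n-way marginal) over a list of categorical columns."""
    rng = rng or np.random.default_rng()

    # Encode each column's categories as integer codes.
    codes, levels = [], []
    for col in cat_columns:
        lv, code = np.unique(col, return_inverse=True)
        levels.append(lv)
        codes.append(code)

    # Build the joint contingency table over all columns.
    shape = tuple(len(lv) for lv in levels)
    counts = np.zeros(shape)
    np.add.at(counts, tuple(codes), 1)

    # Adding/removing one row changes one cell by 1, so the sensitivity
    # is 1 and Lap(1/epsilon) noise per cell suffices.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=shape)

    # Post-process into a sampling distribution, then draw synthetic rows.
    probs = np.clip(noisy, 0, None).ravel()
    probs /= probs.sum()
    cells = rng.choice(probs.size, size=n_synth, p=probs)
    idx = np.unravel_index(cells, shape)
    return [lv[i] for lv, i in zip(levels, idx)]  # one array per column

# e.g.:
# col_a = np.array(["x", "y", "x", "y"])
# col_b = np.array(["p", "p", "q", "q"])
# synth_a, synth_b = dp_marginal_synth([col_a, col_b], epsilon=1.0, n_synth=100)
```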