How are synthetic datasets generated? #252
-
Oftentimes synthetic datasets are generated with GANs (generative adversarial networks). GANs are independent of differential privacy. They consist of two sub-networks, a generator and a discriminator, where the generator feeds into the discriminator.

The generator has a small number of input nodes and widens over multiple layers to the number of columns in your dataset. The generator learns the distribution of the underlying data such that, when you feed it a batch of noise, it emits a batch of synthetic data.

The discriminator is the opposite shape: it has just as many inputs as the generator has outputs, and narrows down to a single output node that discriminates whether the input is fake or real. You train the discriminator by giving it a mixture of real data and synthetic data emitted from the generator, so it learns to distinguish between real and fake. The generator, on the other hand, is trained by maximizing the probability that the discriminator fails. To train the overall GAN, alternate training the two sub-networks to keep them evenly matched adversaries. Eventually, you can discard the discriminator and feed noise into the generator to get synthetic data. (There's a minimal sketch of this at the end of this reply.)

Training is done via gradient descent:

`θ_{t+1} = θ_t − η · grad_t`

Judging by the above formula, each of the parameters in a model trained with gradient descent can be decomposed into the sum of some random noise (the initialization `θ_0`) and the history of gradient updates. So if each gradient update is computed in a differentially private way, the final parameters, and anything sampled from them, are just post-processing of DP releases:

`θ_{t+1} = θ_t − η · DPgrad_t`

Now we just need to get `DPgrad`. Each training step, PyTorch/TensorFlow/etc. derives a transformation from a batch of sensitive observations to a batch of gradients (instance-level gradients) for each parameter in the network. `DPgrad` is straightforward: for each parameter, compute the DP mean of its instance-level gradients. To make this concrete, if your batch size is 10, just compute a DP mean over the instance gradients with n = 10.

You should now be able to backtrack up through each paragraph to get a DP-GAN. Sample from the generator (by feeding it noise) to get your synthetic dataset. In practice, we reduce privacy budgets by computing vectorized aggregations (yay!), and destroy performance by computing gradients for each input observation separately (nay).

There are many other methods of making synthetic data, but this is the most popular right now. Ultimately, there is no difference between DP summary statistics and DP-GAN synthetic data, except that DP-GAN synthetic data consists of more summary statistics and a lot more post-processing. Everything is summary statistics.
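To make the architecture concrete, here's a minimal, non-DP PyTorch sketch of the two sub-networks and the alternating training step described above. The column count, noise dimension, layer widths, and learning rates are all made-up placeholders, not recommendations:

```python
import torch
import torch.nn as nn

n_cols = 5      # hypothetical number of columns in the sensitive dataset
noise_dim = 8   # size of the generator's noise input

# Generator: small noise input, widening to one output per dataset column.
G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, n_cols))

# Discriminator: the opposite shape, narrowing to a single real/fake logit.
D = nn.Sequential(nn.Linear(n_cols, 32), nn.ReLU(), nn.Linear(32, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    n = real_batch.shape[0]
    noise = torch.randn(n, noise_dim)

    # Train the discriminator on a mixture of real and synthetic rows.
    fake_batch = G(noise).detach()  # detach: don't update G on this pass
    d_loss = bce(D(real_batch), torch.ones(n, 1)) + \
             bce(D(fake_batch), torch.zeros(n, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Train the generator to maximize the probability the discriminator fails.
    g_loss = bce(D(G(noise)), torch.ones(n, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

# After alternating train_step over many batches, discard D and sample:
# synthetic = G(torch.randn(1000, noise_dim)).detach()
```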
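And here's a rough NumPy sketch of what a DP mean of instance-level gradients might look like for a single parameter, via the usual clip-then-noise recipe. The clip norm and noise scale are illustrative; calibrating the noise to a specific (ε, δ) budget is a separate step I'm glossing over:

```python
import numpy as np

def dp_grad(instance_grads, clip_norm=1.0, noise_scale=1.0, rng=None):
    """Noisy mean of per-instance gradients for one parameter.

    instance_grads: array of shape (batch_size, *param_shape), one
    gradient per sensitive observation in the batch.
    """
    rng = rng or np.random.default_rng()
    n = instance_grads.shape[0]

    # Clip each instance gradient to an L2 norm of at most clip_norm,
    # which bounds each individual's influence (the sensitivity).
    norms = np.linalg.norm(instance_grads.reshape(n, -1), axis=1)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = instance_grads * factors.reshape(n, *([1] * (instance_grads.ndim - 1)))

    # Sum, add Gaussian noise scaled to the clip norm, divide by n.
    noise = rng.normal(0.0, noise_scale * clip_norm, size=instance_grads.shape[1:])
    return (clipped.sum(axis=0) + noise) / n

# e.g. a batch of 10 instance gradients for a parameter of shape (32,):
# dp_grad(np.random.randn(10, 32))
```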
-
@Shoeboxam Thanks Mike! This is super clear and helpful. To your last example, an extended question would be how to explain to the friend that the non-DP mean computed on the synthetic data sampled from the DP-GAN is not the actual mean of the private data. I wonder how this is handled in practice so as to cause the least confusion.
-
I would like to point out a different approach to generating synthetic data with differential privacy. The core idea is to use n-way marginals. How? Imagine a table with columns col1 through col5, where each column is tagged either num (numeric data type) or cat (categorical data type).

One can compute the joint probability distribution of all the categorical columns and sample data from it (the n-way marginal). The counts obtained along the way must first be "noised" with a Laplace or Gaussian distribution. Since the function used here is a count, the sensitivity is 1, so the Laplace noise is Lap(1/ε). Working with the joint distribution preserves the correlation among the columns.

Similarly, to preserve correlation for the numeric data, it can be divided into a finite number of bins, and then the n-way marginal can be calculated over the binned fields together with the categorical fields. The bins can later be replaced by randomly sampled values. You can also add a small amount of noise to your numeric data before binning. There's a toy sketch of the categorical part below.

Let me know your thoughts. Maybe I am late to the party, nevertheless :)
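A toy NumPy sketch of the categorical part, in case it's useful. The full joint histogram is noised cell-by-cell with Lap(1/ε); negative noisy counts are clamped to zero and the table renormalized as post-processing before sampling (ε and the example inputs are placeholders):

```python
import numpy as np

def dp_marginal_synth(cat_columns, epsilon, n_synth, rng=None):
    """Sample synthetic rows from a Laplace-noised full joint histogram
    (the n-way marginal) over a list of categorical columns."""
    rng = rng or np.random.default_rng()

    # Encode each column's categories as integer codes.
    codes, levels = [], []
    for col in cat_columns:
        lv, code = np.unique(col, return_inverse=True)
        levels.append(lv)
        codes.append(code)

    # Build the joint contingency table over all columns.
    shape = tuple(len(lv) for lv in levels)
    counts = np.zeros(shape)
    np.add.at(counts, tuple(codes), 1)

    # Adding/removing one row changes one cell by 1, so the sensitivity
    # is 1 and Lap(1/epsilon) noise per cell suffices.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=shape)

    # Post-process into a sampling distribution, then draw synthetic rows.
    probs = np.clip(noisy, 0, None).ravel()
    probs /= probs.sum()
    cells = rng.choice(probs.size, size=n_synth, p=probs)
    idx = np.unravel_index(cells, shape)
    return [lv[i] for lv, i in zip(levels, idx)]  # one array per column

# e.g.:
# col_a = np.array(["x", "y", "x", "y"])
# col_b = np.array(["p", "p", "q", "q"])
# synth_a, synth_b = dp_marginal_synth([col_a, col_b], epsilon=1.0, n_synth=100)
```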