What scenario does the simple attack represent? #245

Kratos-zzz · 2021-08-21T11:43:50Z

Hi, I have been exploring differential privacy and trying out the simple attack.

I am wondering about the significance of the simple attacks presented in the attack notebook at opendp/smartnoise-samples/attacks/, and what real-life scenario they could represent. How could they be expanded into a more sophisticated attack?

Thanks

Shoeboxam · 2021-08-25T15:41:03Z

Maintainer

Hi Kratos,
A very common example for the use of differential privacy is the census. There's a recent news article about this here, and this paper goes into more mathy detail. There's also an article on protecting privacy for genomic databases here. There are a lot of applications around sharing geolocation data. A few more specific examples are CrisisReady and the broadband coverage dataset. I've personally been involved with an effort to train a differentially private speech recognition system.

There are many more resources out there that bring this technique into more specialized niches, and it seems like everyone has a use-case once they have access to private data. There are also a ton of sequestered datasets out there that would have collective benefits if people with access to the data and access to the right DP tooling were to create differentially private releases.

You also asked about more sophisticated attacks. This paper from Cynthia Dwork et al. discusses some additional attacks. In addition, this paper talks about membership inference attacks specifically for machine learning.

tombisho · 2021-08-25T17:22:59Z

I'd like to join this discussion if I may!

I am familiar with the example in the "simple attack" notebook. In the notebook you allow the query to be run 10k times to demonstrate that even after that many examples, the attacker doesn't get the true POI salary. In the comments you say that the attacker in fact could only run the query once (using their epsilon of 1).

How do you prevent them from running the query more than once? I guess my question is how do you actually enforce the budget? For example I might ask a very slightly different query each time (e.g. removing some other person from the data each time). How do you monitor that? It feels like I am missing a fundamental part of how SmartNoise works!

Thanks and congratulations on the great work so far

Tom

7 replies

tombisho Aug 26, 2021

Hi Mike, thanks for this reply. I am an engineer not a mathematician so the references to the Dwork/Roth book are challenging for me. I will spend some more time digesting: maybe I have not fully understood how your response answers my question, but at the moment it feels like it is not quite what I was looking for. To clarify, my attack is simply to rerun the query 10k times - the part about running a slightly different query each time is an additional nuance.

In your first edit you suggested that my question is about the security model, and actually that is correct. I understand that in the real world the user wouldn't have direct access to the data. I can see that within an execution of with sn.Analysis() as analysis: there is a privacy budget which limits what can be done within that group of queries. In the example the same group of queries is re-run 10k times to show that in this case you get close to recovering the POI salary. Yet the text says:

In practice, they would see the result of only one simulation.

How is that enforced? It seems that if it is not, then the privacy budget is equally not enforceable.

I feel like I am missing a fundamental point somewhere. Maybe the assumption is that someone is not trying to break the privacy and so won't do the 10k queries?

As some background, I am a user of DataSHIELD, which has the familiar goal of allowing users to use the data without having full access. I'd be interested in your views on how DataSHIELD works compared to SmartNoise. Generally DataSHIELD relies more on k-anonymity and other restrictions on the functions that can be applied to data.

Shoeboxam Aug 26, 2021
Maintainer

The clarification helps to narrow the question! Your budget is charged for each and every query you release, even if the query has the same blueprint as a previous query. Based on that Dwork/Roth reference, it is completely valid to run and be charged for that same query multiple times. In fact, this is something a legitimate analyst may want to do. If an analyst wants to make one of their prior queries more accurate, they can release the same query again and average the results. In your example, you are charged 10k times.
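To make the charging concrete, here is a minimal sketch of repeated releases under basic sequential composition. It is not SmartNoise code: the helper names, the true value and the sensitivity are all hypothetical, and the floating-point Laplace sampler is for illustration only and should not be used for real releases.

```python
import random
import statistics


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_release(true_value, sensitivity, epsilon, rng):
    # Laplace mechanism: noise scale is sensitivity / epsilon.
    return true_value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)

TRUE_VALUE = 50_000.0      # hypothetical statistic the attacker is after
SENSITIVITY = 100_000.0    # hypothetical sensitivity of the query
EPSILON_PER_QUERY = 1.0
N_QUERIES = 10_000

# Every release is a fresh draw, and every release is charged.
releases = [
    noisy_release(TRUE_VALUE, SENSITIVITY, EPSILON_PER_QUERY, rng)
    for _ in range(N_QUERIES)
]

# Basic sequential composition: the epsilons add up.
total_epsilon = N_QUERIES * EPSILON_PER_QUERY

single_error = abs(releases[0] - TRUE_VALUE)
averaged_error = abs(statistics.fmean(releases) - TRUE_VALUE)

print(f"total epsilon charged: {total_epsilon}")
print(f"error of one release: {single_error:.1f}")
print(f"error of the average: {averaged_error:.1f}")
```

Averaging n releases shrinks the noise standard deviation by a factor of sqrt(n), but the price is n times the per-query epsilon. An analyst holding a total budget of 1 could not afford the second release, let alone the ten-thousandth.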

In SmartNoise, each statistic you add to your analysis is only ever computed once, regardless of how many times you call release, because the result is cached. From the user perspective, if you want to re-release a prior query, you need to explicitly add it to the analysis again. From the library perspective, the overall budget usage is computed with the simplifying assumption that each statistic in the analysis is computed exactly once, and it's completely fine if these statistics share the same blueprints. This should show soundness up to each instance of an analysis.
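The caching and accounting behaviour described above can be sketched with a toy class. This is not the SmartNoise API: ToyAnalysis, add_dp_sum and BudgetExceededError are hypothetical names, and the sketch assumes a clamped sum under add/remove-one-record neighbouring datasets, so the sensitivity is max(|lower|, |upper|).

```python
import random


class BudgetExceededError(Exception):
    """Raised when adding a statistic would overspend the budget."""


class ToyAnalysis:
    # Hypothetical stand-in for one analysis instance; not SmartNoise code.
    def __init__(self, data, budget, seed=None):
        self._data = list(data)
        self.budget = budget
        self.spent = 0.0
        self._rng = random.Random(seed)
        self._statistics = []   # one (lower, upper, epsilon) per added statistic
        self._cache = {}        # handle -> released value

    def add_dp_sum(self, lower, upper, epsilon):
        # Budget is charged when a statistic is added, even if an
        # identical statistic (same blueprint) was added before.
        if self.spent + epsilon > self.budget:
            raise BudgetExceededError("not enough privacy budget left")
        self.spent += epsilon
        self._statistics.append((lower, upper, epsilon))
        return len(self._statistics) - 1

    def release(self, handle):
        # Each statistic is computed once; later calls return the cache,
        # so calling release repeatedly yields no fresh noise draws.
        if handle not in self._cache:
            lower, upper, epsilon = self._statistics[handle]
            clamped_sum = sum(min(max(x, lower), upper) for x in self._data)
            scale = max(abs(lower), abs(upper)) / epsilon
            noise = (self._rng.expovariate(1.0 / scale)
                     - self._rng.expovariate(1.0 / scale))
            self._cache[handle] = clamped_sum + noise
        return self._cache[handle]


analysis = ToyAnalysis([30_000, 45_000, 60_000], budget=2.0, seed=0)

first = analysis.add_dp_sum(0, 100_000, epsilon=1.0)
same_value_twice = analysis.release(first) == analysis.release(first)

# Re-releasing requires adding the statistic again, which is charged again.
second = analysis.add_dp_sum(0, 100_000, epsilon=1.0)

try:
    analysis.add_dp_sum(0, 100_000, epsilon=1.0)
    third_rejected = False
except BudgetExceededError:
    third_rejected = True
```

In this toy model, re-running release in a loop gives the attacker nothing new, and the only way to obtain a new noisy draw is to pay for it. The guarantee holds per instance only, so creating a fresh ToyAnalysis over the same data resets the accounting, which is exactly the gap the trust model has to cover.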

In an ideal world, an analyst should only ever edit and view one analysis instance. There are some practical issues with this, but I won't go into that yet.

Finally getting to the trust model- if you have data in-hand, then we need to trust that you only ever edit and view one analysis instance. If you don't have data in-hand, then we trust that an access management system only lets an analyst edit and view one analysis instance.

Shoeboxam Aug 26, 2021
Maintainer

DataSHIELD looks like a project with the same goals, but built on a fundamentally different theory (k-anonymity) that provides weaker protections than differential privacy.

tombisho Aug 27, 2021

Ok great, thank you, it is beginning to make sense!

I am still trying to get straight in my head how SmartNoise can then work for open-ended analyses. This article hints that it is a problem not yet solved:

Eventually, Bird hopes that differential privacy will extend to allowing researchers to make dynamic queries against data sets "to advance the state of the art for society but not reveal private information." That's the most challenging scenario, however.

"You need to be able to optimise the queries automatically and find the right point in the trade-off space between accuracy and privacy and computational efficiency. Then you also need dynamic budget tracking governance around who gets how much of what budget, and do you actually retire the data set?" she said.

Is that an accurate summary of the current situation?

I am thinking that I would like to try to rewrite a DataSHIELD function to use SmartNoise. For example ds.mean currently relies on k-anonymity, so it would be nice to have a DP version. There are already many projects using DataSHIELD so it would be a way to increase exposure for SmartNoise.

The current DataSHIELD architecture means that the data of interest is held in an R session. This leads to the questions:

it looks like the bindings are available in Python at the moment, not R, but it is planned for R. Is that correct?
if the bindings did exist in R, then one could write code analogous to that in the sample Python notebooks? i.e. if the data are in an R session
it looks like it is non-trivial to write the R bindings? 😄

Shoeboxam Aug 27, 2021
Maintainer

I'm not sure how to interpret "open-ended analyses," so I'll pin down the space DP does carve out.

The theory behind differential privacy makes a pretty strong statement that the privacy protections apply simultaneously for any input dataset in the domain of all datasets. So the analyses are "open-ended" in terms of the dataset you feed into the computation, because the proof is agnostic to the choice of dataset within a very large set of potential databases.
The theory behind differential privacy also allows for really flexible access patterns, where analysts can choose queries adaptively based on information released in prior queries. So the analyses are "open-ended" in terms of the statistics that they are composed of.

So from a theory perspective, analyses can be extremely open-ended. In practice, we need to implement algorithms on finite computers with observable state and side-effects. It takes a lot of work, but we can account for this in a trust model where the analyst is curious but honest. The part that's clearly not solved is creating an access management system for sensitive data that is open to the internet. There will always need to be some level of trust not to collude with others who have their own budgets, and there is still significant work involved in making the algorithms less susceptible to side-channels like floating-point irregularities, or measuring timings and resource usage.

Extending DataSHIELD sounds like a great way to support both projects! I wouldn't use the SmartNoise-Core library for this, as we're currently working towards replacing it with this OpenDP library. We just merged a PR that adds notebooks that demonstrate how to use the OpenDP library to do this. R bindings are on the roadmap. They will take a significant effort, but will piggyback on the FFI layer we've already built for Python.
