
Strawman: Target privacy constraints #17

Closed
benjaminsavage opened this issue Jun 22, 2022 · 15 comments

@benjaminsavage

benjaminsavage commented Jun 22, 2022

Strawman target privacy constraints

A private measurement API should only return aggregated, anonymous information. That means this information:

  1. Cannot be linked to a specific individual
  2. Cannot be "de-anonymized"

The implications of this are that:

  1. Some amount of random noise must be added to the information to provide a differential privacy guarantee.
  2. (tentative) Each breakdown key b in the output should be an aggregation across at least K (strawman: 100) people. Specifically, the cardinality of the set of unique match keys appearing across the set of source-events with breakdown key equal to b should be at least 100 (see the sketch below).

NOTE: implication 2 is marked as (tentative) because it does not improve privacy in the worst case. This is because an adversary could generate a group of 100 source events comprised of 1 authentic event and 99 fake source events, or 1 authentic event and 99 real source events they know will not generate matches. However, this type of constraint would make it much simpler to explain the system to people (e.g. "Each breakdown is aggregated across a group of at least 100 people, after which a small amount of random noise is added to further protect privacy").
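As a rough sketch of what implication 2 would mean operationally (the field names `match_key` and `breakdown_key` follow the terminology above but are otherwise illustrative, not part of the proposal):

```python
from collections import defaultdict

K_MIN = 100  # strawman threshold from implication 2


def releasable_breakdowns(source_events, k_min=K_MIN):
    """Return the breakdown keys whose source events span at least k_min
    unique match keys, i.e. that aggregate over at least k_min distinct people."""
    match_keys_by_breakdown = defaultdict(set)
    for event in source_events:
        match_keys_by_breakdown[event["breakdown_key"]].add(event["match_key"])
    return {
        b
        for b, keys in match_keys_by_breakdown.items()
        if len(keys) >= k_min
    }
```

Only breakdown keys passing this check would be eligible for release, and the random noise from implication 1 would still be added on top.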

Regarding information leakage over time

In each "epoch" (strawman: each week), a private measurement API should provide some upper bound, or limit, on the total amount of cross-site/cross-app information a caller can learn about a given person. That limit should be low enough that it does not in effect leak browsing history.

@martinthomson
Collaborator

Thanks for the writeup, Ben.

I note that your information leakage is limited pairwise to sites. One thing I liked about what we worked out for IPA is that it wasn't pairwise, which can ultimately provide a stronger privacy guarantee (which you might also be able to trade for better utility through a larger $\varepsilon$). I'm OK with what you have here, but I want to note that this is a baseline.

About the "budget" notion more generally as opposed to simply stating that there is an upper bound on the amount of information that a site can gain about activity on other sites in each epoch. Again, whether that is an upper bound for each site, or a global bound is something I am happy to leave open.

Friendly amendment, add "single" here under security constraints:

> If any single entity involved in operating a private measurement API

That might have been implied, but I think it is important. Again, we might offer stronger assurances, but this is a reasonable baseline. No doubt some will object to this constraint (I note that any system that relies exclusively on TEE cannot pass the proposed test), but ... well, we can have that debate when it comes time.

Regarding the open-source requirement on client code: I think that most of the participating browsers will have no problem there, but I don't know if that is universally true, and I'm not sure the requirement is necessary at this stage. What steps a browser vendor takes to allow users to trust that their browser is good are out of scope for standardization. However, if there are cases where code is run by browsers on behalf of others, then maybe this is a fine requirement. It might be premature to add that now.

The server-side stuff probably needs a little more development. The best we might reasonably say right now is that there will need to be a process by which server operators are authorized to operate the service. That process probably involves browsers certifying particular operators, but we'll need to get into that more as we get further into this.

@benjaminsavage
Author

Thank you @martinthomson for the feedback! I've updated my issue to reflect your comments.

  • I did not intend to imply a pairwise privacy budget. I intended to say a global limit across all sites. I clearly failed to communicate this well =). I've tried to re-phrase it to do a better job saying what I meant to say.
  • I've also left it more open ended in terms of how this "upper bound" is achieved.
  • I've added single to that sentence. It was indeed what I intended, and I think you're right that adding single makes this intention more clear.
  • I've taken your wording regarding the server-side authorisation and certification process.

The only place I disagree is about the Client-side code. I think for both privacy AND competition we need to have something.

If we are going to be consistent in our application of the "3 Cs" framework, we need to consider the browser as a potential point of compromise. If TEEs are not acceptable because the TEE manufacturer or cloud operator is a single point of failure which can be compelled to break privacy, then why is that not ALSO the case for the browser / OS? I thought "open source code" was a pretty low bar to aim for, which as you say is already the case for most participating browsers (although NOT the case for iOS).

I also think it would be really good to have something to banish the spectre of doubt that stems from competition concerns. If everyone can be confident that Google / Microsoft / Apple are all using the same private measurement API everyone else is, and everyone can validate that there isn't some privileged side-channel Chrome / Edge / Safari are also running in addition to said private measurement API, that will help build trust in the ecosystem.

@martinthomson
Collaborator

Working for a browser-maker, I feel obligated to defend our ability to defend data that doesn't leave the browser.

Protecting browsing history is on the list of things that browsers have had to do forever. We also protect cookies and passwords and other much more sensitive stuff. The distinction that I think is relevant here is between the treatment of data on a device that the user controls¹ and data that leaves that space. As Luke mentioned yesterday, things are most challenging where data leaves a zone the user controls (where we have a well-established understanding, or at least expectations, about how data is treated), which might make that data available to others. Additional scrutiny on data that exits user-controlled space, particularly when it involves data from multiple people, is entirely appropriate. Systems that aggregate private information from many people are something of a novelty here. But I don't see it as within our remit to talk about treatment within browsers, especially for such a narrow domain.

I understand the competition angle (I would be OK with having a bigger discussion about that; is it worth a separate thread?), but I think that we should limit our discussion there to data that leaves the user's device. We are best not talking about the issues of self-preferencing that might occur within larger companies, not limited to browser vendors². The W3C is not even the right place to have that conversation. Various competition regulators are taking a keen interest, for instance.

Footnotes

  1. Obligatory footnote about cross-device synchronization features. These exist and often involve data living on servers operated by a browser vendor. There are some rather fundamentally different approaches taken in the market here. I consider those to be within the established envelope for data sharing as long as data is only synchronized. I realize that this is not always the case, which muddies things considerably. Again though, this is not something unique to this particular domain and I would rather we didn't add that problem to our workload.

  2. This is a problem of scale: Mozilla and other smaller browser vendors aren't immune to those pressures; we just have a narrower product portfolio and less opportunity to gain advantage.

@alextcone

First, I believe @benjaminsavage's strawperson, with the updates suggested by @martinthomson incorporated, is a very defensible position to begin with. Second, like @benjaminsavage, I believe browser makers should make some affordance here to demonstrate that the system they are a part of (if/when these APIs are generally available, and hopefully powering a multi-billion dollar/euro/etc. industry) is a transparent actor. The ask @benjaminsavage is making here is, as he says, "a pretty low bar," but I think the concession is an important one. We don't have to talk about self-preferencing in browser or OS companies to stipulate that the standard we are developing here offers openness to all parties involved in making the API(s) work, for software developers and ultimately for the better data protection and privacy of end users.

@csharrison
Collaborator

Quick initial thoughts reading this proposal:

High level comment: I would prefer if we split out security and privacy constraints, mostly because I think we can have meaningful discussions about them in relative isolation without mixing things.

I have strong concerns about enforcing k = 100, since for some advertisers conversions can be quite rare events, and even a relatively tight epsilon should give good data for many values of k < 100 (e.g. eps = 1 will yield only ~15% error on counts of 10).
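For reference, the ~15% figure is consistent with the standard Laplace mechanism at sensitivity $\Delta = 1$ (an assumption here; the comment does not name a mechanism), reading "error" as one standard deviation of the noise:

$$
b = \frac{\Delta}{\varepsilon} = \frac{1}{1} = 1,
\qquad
\sigma = \sqrt{2}\,b \approx 1.41,
\qquad
\frac{\sigma}{10} \approx 14\%\ \text{error on a count of } 10 .
$$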

Regarding privacy unit / privacy grain, I think what is written now is stronger even than IPA, which has a privacy unit of user x site. Did you intend to propose full user-level privacy here: "total amount of cross-site/cross-app information a caller can learn about a given person"? We should try to be very precise about this.

@alextcone

> I have strong concerns about enforcing k = 100, since for some advertisers conversions can be quite rare events, and even a relatively tight epsilon should give good data for many values of k < 100 (e.g. eps = 1 will yield only ~15% error on counts of 10).

This is a very good point, @csharrison.

@benjaminsavage
Author

> High level comment: I would prefer if we split out security and privacy constraints, mostly because I think we can have meaningful discussions about them in relative isolation without mixing things.

You're right. I think splitting this conversation into two will be helpful in making more rapid progress and having more focused conversations. Per your suggestion I've moved the security model into a separate issue: #18

I apologize to all the other commenters for this, as it leaves your comments looking confusing now that the sections of the post they were referencing are no longer visible here. Sorry! Please feel free to copy-paste your comments over to the new issue if you'd like.

@benjaminsavage
Author

> I have strong concerns about enforcing k = 100, since for some advertisers conversions can be quite rare events, and even a relatively tight epsilon should give good data for many values of k < 100 (e.g. eps = 1 will yield only ~15% error on counts of 10).

Two thoughts:

Firstly, I agree that conversions are rare. The vast majority of Facebook advertisers have only a handful of conversions to measure each week. I care a great deal about supporting small businesses and I want to develop an API that can support their measurement needs.

But just to be clear, I am proposing K=100 applies to the impressions, not to the conversions. Even the smallest advertisers who just spend a few dollars wind up getting at least 100 impressions.

Just to give an example: if an advertiser spent $5 on ads and got 200 impressions, which led to just 3 conversions, that would pass this proposed bar. The API would still need to add some random noise (e.g. Gaussian noise) to the count of 3, but so long as those 3 conversions originated from a group of more than 100 people, the bar is met.

The thinking here is that we can say: "Yeah, there were roughly 3 conversions. We do not know which of these 100 people they came from." This feels like a privacy story that is pretty simple to communicate. Blending in with a crowd of 100+ people is something all of us have experience with every day.
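A toy version of that $5 example, with Gaussian noise whose scale is picked arbitrarily for illustration (a real API would calibrate it to its privacy parameters):

```python
import random

K_MIN = 100
unique_people_reached = 150  # the 200 impressions were shown to well over 100 distinct people
true_conversions = 3

if unique_people_reached >= K_MIN:
    noise_sigma = 1.0  # illustrative scale only
    reported = true_conversions + random.gauss(0.0, noise_sigma)
    print(f"roughly {reported:.1f} conversions, from a group of {unique_people_reached} people")
else:
    print("breakdown withheld: fewer than 100 distinct people")
```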

@csharrison
Collaborator

@benjaminsavage thanks for clarifying. I see indeed you mentioned "unique match keys appearing across the set of source-events with breakdown key equal to b". I will need to think about this constraint more to see if I am comfortable with it, but yeah it's definitely better than counting attributed convs.

@benjaminsavage
Author

> Regarding privacy unit / privacy grain, I think what is written now is stronger even than IPA, which has a privacy unit of user x site. Did you intend to propose full user-level privacy here: "total amount of cross-site/cross-app information a caller can learn about a given person"? We should try to be very precise about this.

This is super difficult to explain. I agree we should try to be very precise. Let me try again and see if I can do better on try 3 =).

So here is what I wrote:

In each "epoch" (strawman: each week), a private measurement API should provide some upper bound, or limit, on the total amount of cross-site/cross-app information a caller can learn about a given person. That limit should be low enough that it does not in effect leak browsing history.

I used the term "a caller" without defining it. I think this is where I need to be more precise.

  • Let's assume that in order to utilize a private measurement API, entities have to "sign up", perhaps providing a payment instrument to pay for the processing their queries consume. Let's call each entity that "signs up" an "API User".
  • Let's assume that each app/website is allocated a per-user privacy budget.
  • Let's assume each app/website can either opt to make "privacy preserving measurement queries" on their own, or can contract with some measurement partner, delegating this responsibility to them. The choice is up to the app / website.

In the paragraph I wrote, I had each "app / website" in mind when I wrote the words "a caller".

This is one way in which we could achieve an upper bound on the total per-user information leakage to a given app/website.

As @martinthomson alluded to above, an alternative way to achieve the same goal would be to do a data analysis to see how many apps / websites the P95 user actually interacts with, and based on that decide on a pairwise privacy budget (i.e. a separate budget per source-site x trigger-site combo). This seems to me like it would incur more noise due to the uneven distribution of sites visited per user.
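To make the contrast concrete, here is an illustrative (not prescriptive) sketch of the two accounting schemes; all names are hypothetical:

```python
from collections import defaultdict


class PerSiteBudget:
    """One per-user allowance per app/website per epoch, shared across all of
    that site's queries (the scheme described above)."""

    def __init__(self, epsilon_per_epoch: float):
        self.remaining = defaultdict(lambda: epsilon_per_epoch)  # keyed by (user, site)

    def try_spend(self, user, site, epsilon):
        if self.remaining[(user, site)] < epsilon:
            return False
        self.remaining[(user, site)] -= epsilon
        return True


class PairwiseBudget:
    """A separate per-user allowance per source-site x trigger-site combo (the
    pairwise alternative). If total per-user leakage must stay bounded across
    many possible pairs, each pair's share is smaller, which is one way the
    extra noise mentioned above can arise."""

    def __init__(self, epsilon_per_pair: float):
        self.remaining = defaultdict(lambda: epsilon_per_pair)  # keyed by (user, source_site, trigger_site)

    def try_spend(self, user, source_site, trigger_site, epsilon):
        key = (user, source_site, trigger_site)
        if self.remaining[key] < epsilon:
            return False
        self.remaining[key] -= epsilon
        return True
```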

@alextcone

> But just to be clear, I am proposing K=100 applies to the impressions, not to the conversions. Even the smallest advertisers who just spend a few dollars wind up getting at least 100 impressions.

Helpful clarification, @benjaminsavage. I too thought you were talking about conversions. K = 100 impressions seems like an OK lower bound.

@jalbertoroman

Regarding the minimum number of events to be aggregated: there is the concept of k-anonymity. The lower bound recommended in Europe is 15, meaning each cohort contains at least 15 individuals.
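Stated in the terms used in this thread (my formalization, not a quote from any European guidance), that recommendation is the constraint

$$
\bigl|\{\text{distinct individuals in cohort } c\}\bigr| \;\geq\; k = 15
\quad \text{for every released cohort } c .
$$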

@dmarti

dmarti commented Jun 27, 2022

"Cannot be linked to a specific individual" is not enough. Any widely available system will be used by adversaries to carry out attacks on users (Microtargeting as Information Warfare) not just by legit advertisers making win-win offers.

Group size will have to depend on the adversary's capabilities and goals. Instead of "cannot be linked", the (fraction of targeted users in the group) * (cost of attacking one target) needs to be lower than whatever value the adversary places on a successful attack.

@AramZS
Contributor

AramZS commented Jul 18, 2023

Would we like to present these in an upcoming meeting?

@benjaminsavage
Author

I think we've already covered this discussion. @csharrison's presentation on "private measurement of individual events" answered the "(tentative)" part of this original proposal in the negative and we've captured this as an area of consensus here: https://github.com/patcg/docs-and-reports/blob/main/design-dimensions/Dimensions-with-General-Agreement.md#private-measurement-of-single-events

From my perspective I think we can close this issue.
