
Impact to Privacy of Usage of Cross Domain Signals in Opaque Processing with DP/K Output Constraints #26

thegreatfatzby opened this issue Aug 16, 2023 · 8 comments


@thegreatfatzby

thegreatfatzby commented Aug 16, 2023

As I've gotten deeper into this, I've been pondering something: what would the impact on this core privacy model be if user bidding signals were:

  • Partitioned in any untrusted or persistent environment
  • Viewable and deletable by a user on their browser
  • But could be viewed together in a transient process by a function in an opaque environment such as a TEE, provided the output of that process still had to have DP and k-anonymity ("K") enforced.

I haven't had the chance to try to work through the math here (some serious cobwebs to dust off for any proofing), but I wonder whether this would still meet the privacy model laid out here from a "happy path" perspective (meaning the impact on "re-identification across contexts"), with the full understanding that any hack of that environment would incur a worse privacy loss than a hack of a single-partition process would.
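To make the shape of what I'm imagining concrete, here is a minimal sketch. Everything in it is invented for illustration (the function names, the threshold and budget values, and the suppress-then-add-Laplace-noise release gate are assumptions, not any real Privacy Sandbox API):

```python
# Hypothetical sketch only: per-context signals stay partitioned at rest, may be
# viewed together inside a transient opaque process (e.g. a TEE), and the only
# thing allowed out is an aggregate gated by a k-anonymity threshold plus
# Laplace noise for epsilon-differential privacy. Names and values are made up.
import random

K_THRESHOLD = 50   # assumed k-anonymity floor on any released bucket
EPSILON = 1.0      # assumed privacy budget per release

def combine_in_opaque_process(partitioned_signals):
    """partitioned_signals: {context_origin: [signal, ...]} for one user.
    Inside the opaque process the partitions may be read together; the merged
    view never leaves the process in this form."""
    return [s for signals in partitioned_signals.values() for s in signals]

def gated_release(bucket_counts):
    """bucket_counts: mapping from output bucket to how many users fell in it,
    aggregated across many users. Only buckets with at least K_THRESHOLD users
    are released, each with Laplace(1/EPSILON) noise added (built here as the
    difference of two exponential draws)."""
    released = {}
    for bucket, count in bucket_counts.items():
        if count < K_THRESHOLD:
            continue  # suppress small, potentially identifying buckets
        noise = random.expovariate(EPSILON) - random.expovariate(EPSILON)
        released[bucket] = count + noise
    return released
```

The question I'm asking is whether letting combine_in_opaque_process exist at all changes the privacy story, given that only gated_release-style outputs can ever escape.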

@michaelkleber
Owner

I agree, this is an interesting question. It's definitely one of the areas where this high-level document doesn't get into enough detail to offer an opinion one way or the other. I tried to highlight this sort of grey area when I wrote

there is room to allow sufficiently useful information to flow in a privacy-respecting way. Both "sufficiently useful" and "privacy-respecting" must be evaluated on a case-by-case basis.

In various conversations I've had in the years of Privacy Sandbox development, some people are skeptical of any system which has the ability to build a user profile based on data from many different contexts, even if that pool of data is under the user's control and can only be used for targeting inside an opaque environment:

  • On the technical side, some people have pointed out that if the output is used for ad selection, then an ad click inherently means some amount of leakage of information out of the environment. This makes it much harder than protecting measurement use cases, for example.

  • On the philosophical side, combining data across multiple contexts inherently leads to the ability to make inferences that wouldn't be possible based just on behavior in a single context. This possibility of "novel inferences" is itself a line that some people do not want to cross, even if those inferences are "just" used to pick an ad you see on your own screen. (And of course any leakage vector, including the previous bullet point, means that "just" is somewhat suspect.)

All this is a long-winded way of saying that I don't think there is consensus on the question you're asking, and the document's ambiguity reflects that.

@lbdvt

lbdvt commented Aug 25, 2023

Interesting thoughts.

On the technical side, some people have pointed out that if the output is used for ad selection, then an ad click inherently means some amount of leakage of information out of the environment. This makes it much harder than protecting measurement use cases, for example.

Could you please provide an example of when that would be an issue?

A perhaps naïve assumption is that when clicking on an ad, the user expects the ad information (the product the user clicked on) to be leaked out of the environment, even if that ad information comes from different contexts.

For example, if a user, in different contexts, shows interest in healthy drinks and sport, and that user is shown an ad for a healthy sport drink, and clicks on that ad, the user would surely expect to land on a site selling healthy sport drinks.

@michaelkleber
Owner

Examples come naturally from the combination of the "novel inferences" and "click-time leakage" risks.

I think the canonical "novel inferences" example is the famous "Target knows you're pregnant" story. If the on-device ad selection model can be based on items browsed or purchases made across many unrelated sites, it facilitates picking an ad which inherently means "the model believes this person is pregnant".

The chosen ad might not be for something obviously pregnancy-related at all. If, as that NYTimes article says, Target thinks it's really valuable to get the bit of information that you are pregnant, then they could show you a good coupon for something unrelated, but with a click-through URL that says &probablyPregnant=true.

[Note that the NYTimes article is paywalled. Click here instead to get the story non-paywalled / "gifted" from my subscription... which, of course, means this link now contains tracking information!]
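Purely to illustrate the click-time mechanics, here is a toy sketch; the URL, parameter name, and function are invented for the example and not taken from any real system:

```python
# Toy illustration of click-time leakage: two creatives that look identical to
# the user, differing only in the click-through URL. One click moves one bit of
# cross-context inference into the destination site's first-party logs.
def pick_coupon_ad(model_thinks_pregnant: bool) -> str:
    base_url = "https://shop.example/coupon?item=towels"   # unrelated product
    flag = "true" if model_thinks_pregnant else "false"
    return f"{base_url}&probablyPregnant={flag}"           # inference rides along on the click
```

One kind of countermeasure is to require the full rendered URL itself to be shared by many users before it can be shown, which limits how much any one click-through URL can encode, at the cost of flexibility.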

@lbdvt

lbdvt commented Aug 25, 2023

Ok, I see: the ad would be hiding the novel inference in order to pass the information along without the user being in a position to know that it is being passed...

@thegreatfatzby
Author

thegreatfatzby commented Sep 4, 2023

Scope

Some of what I'm chewing on here has to do with the wording of the Attestation, but I'm curious about your thoughts on the model more abstractly. So I'll say explicitly that here I'm not asking for comment on the Attestation, what it requires, etc.

Definition

So, thinking about it more, the Novel Inference thing is interesting. I think that the idea/concept/threat of "Novel Inference" isn't identical (no pun intended) to the idea/concept/threat of "Re-identification across Contexts". It seems like Novel Inference certainly can include a "complete re-identification" between contexts A and B (max inference), but that B learning something that is tied to your identity in A doesn't imply re-identification.

So, just definitionally, does that seem right?

Threat Model

I can still see how a user (including me) would want to avoid Novel Inference of particular sets of their characteristics from Context A in Context B, like the pregnancy case or if I don't want my insurance company to know about my arthritis. It would be worse if that Novel Inference (arthritis) could be exported and queried in a permanent and clear way attached to my identity in B (insurance industry)...but even if that was transient (say it prevents me from getting a better insurance quote in my stubborn browser) that is bad.

So, then two questions:

  • Does this Privacy Model intend to include Novel Inference? All or any?
  • Does the Attestation intend to prevent Novel Inference? All or any? (Feel free to point me to the attestation repo for this one).

Philosophizing

To dive in a little bit, I'd like to kick a hackysack around on the quad (or toss a frisbee, your choice) with you and ask your thoughts on what it means to:

  1. Join identities across contexts (from Bullet 2 here).
  2. Re-identify across contexts (from Attestation).
  3. Join data across contexts.

On one side of the line, I definitely see Persisting a Graph of Unique Identifiers to an ACID store with RAID 10 disks and global replication via Quantum Entanglement as quite clearly "joining identities": you can see the result of the join at your leisure, use it to do further ID-based lookups in different contexts, and know how to repeat it in the future.

I think, on the other side of the line, I can see a transient process mixing data attached to each ID into a single list, assuming that list does not contain the IDs or any deterministic derivative of them, and that the list never leaves the strongly gated process...that one is tougher. If the output of the join does not contain unique identifiers, I couuulld argue you've not joined IDs, re-identification can't happen directly (especially given k-anon output gates), and we've just joined user data.
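To put the two sides of that line next to each other, a purely hypothetical contrast (all names invented, not a real API):

```python
# Hypothetical contrast between the two ends of the spectrum.

# (1) "Joining identities": a persisted cross-context ID graph. The result of
# the join is durable, queryable, and reusable for future ID-based lookups.
id_graph: dict = {}   # global_key -> {context_origin: per_context_id}

def persist_identity_join(global_key, context_origin, per_context_id):
    id_graph.setdefault(global_key, {})[context_origin] = per_context_id

# (3) "Joining data": a transient merge that deliberately never copies the
# per-context IDs (or any deterministic derivative of them) into the output.
def transient_data_join(partitions):
    """partitions: {context_origin: {"id": ..., "attrs": [...]}}"""
    merged = []
    for record in partitions.values():
        merged.extend(record["attrs"])   # the data crosses contexts...
        # ...but record["id"] is never read into the output
    return merged                        # still subject to DP/k gates downstream
```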

I think that:

  • (3) implies Novel Inference Threat, even with DP/K constraints.
  • However, (3) doesn't imply RE-ID Threat, and I would think RE-ID Threat can be driven down with higher DP/K constraints.
  • (1) implies the ability to RE-ID.
  • RE-ID can occur without (1), but, similar to above, I would think the threat of RE-ID can be driven down with higher DP/K constraints (a short note on why is sketched below).

Does (1) mean being able to operate on the IDs in a common process in any way? Observing the output of the join, rather than just the input? Can I assume Chrome does not want to take a position on this? :)
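On the "driven down with higher DP/K constraints" claim in the list above: the standard Bayesian reading of DP makes it precise, under the (big) assumption that whatever leaves the opaque process really does satisfy ε-differential privacy with respect to the presence of any one user's cross-context record. If D and D' are neighboring inputs (identical except for that one record) and an observer sees output o from mechanism M, Bayes' rule plus the DP guarantee gives

$$\frac{\Pr[D \mid o]}{\Pr[D' \mid o]} \;=\; \frac{\Pr[M(D)=o]}{\Pr[M(D')=o]} \cdot \frac{\Pr[D]}{\Pr[D']} \;\le\; e^{\varepsilon}\,\frac{\Pr[D]}{\Pr[D']}.$$

So each gated release can shift an observer's odds about any one person's record by at most a factor of e^ε (and the budget composes across releases), which is the sense in which tighter DP/K settings drive the RE-ID threat down. It says nothing, by itself, about the Novel Inference threat, which is about what the released output legitimately encodes rather than about whose record produced it.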

Dangerous Philosophizing

OK, this might be going too far from the quad, but like, what does "join", even, like, mean, man? Are we referring to:

  • The observable outputs of the join
  • The inputs to the join
  • Just like, the entire process encompassed by the join.

@michaelkleber
Owner

It seems like Novel Inference certainly can include a "complete re-identification" between contexts A and B (max inference), but that B learning something that is tied to your identity in A doesn't imply re-identification.

First, I certainly agree that "B learning something that is tied to your identity in A doesn't imply re-identification." This is what I was getting at in my privacy model doc when I wrote "A per-first-party identity can only be associated with small amounts of cross-site information" as one of its big three points. Of course this is where all of the hard questions end up — just to quote my 2019 self a little more:

The fuzziness of "small amounts of information" recognizes the balancing act that browsers need to perform between user privacy and web platform usability. Potential use cases must respect the invariant that it remain hard to join identity across first parties, but subject to that limit, there is room to allow sufficiently useful information to flow in a privacy-respecting way. Both "sufficiently useful" and "privacy-respecting" must be evaluated on a case-by-case basis.

However, in the previous discussion in this issue, I was trying to use the term "novel inference" to mean something a little different: some probabilistic belief about a person that was not being made based on their behavior on any single site, but rather was made only by drawing on information from multiple contexts. That is, it's not about "B learning something that is tied to your identity in A", but rather "B learning something based on your behavior on A1, A2, A3, etc., which was not derivable from your identity on any one of the Ai alone." The fact that it was not previously tied to any of your partitioned-by-site identities is what makes it "novel".

Again, this is surely a different question from that of "joining identity". We know that we don't want someone to be able to join identities across sites — once that happens, the game is lost, and the browser no longer has any hope of preserving the user's privacy or restricting the flow of information across contexts. But if identity is not joined, then the browser does have a chance to be opinionated about the cross-context flow of information. All these other questions are trying to figure out what that opinion ought to be.

