
IPA Sharding with Sensitivity Bound #35

Open
bmcase opened this issue Feb 22, 2023 · 18 comments

@bmcase
Collaborator

bmcase commented Feb 22, 2023

IPA Sharding with Sensitivity Bound

Thus far, sharding IPA has proven difficult because of the inability to bound the sensitivity of a matchkey’s contribution to shard size. An attacker could request a large number N of encryptions of a matchkey on two websites and then (colluding with a Helper Party) see whether one shard increased by 2N (meaning it was the same user across the two websites). Several ideas for sharding were presented at the last PATCG and referred to this paper for differentially private histograms, but for these to be applied, the matchkeys would first need to have bounded multiplicity.

As a step towards bounding the sensitivity of a matchkey it was suggested (@benjaminsavage mentioned discussing this with several folks at the last PATCG) that the device can cache the encryption of the matchkey per domain/app (or report collector) that is requesting it. Then, when a website requests N encryptions of a matchkey from a user, it will get back the exact same authenticated ciphertext N times. The Helper Party Network processing these reports in a query will be able to filter out the duplicates or reject the query as malicious.
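As a rough illustration of how a Helper Party could enforce this filtering (not part of the proposal above; the report structure and the "matchkey_ciphertext" field name are hypothetical), deduplication only needs to compare the authenticated ciphertext bytes:

def dedupe_reports(reports):
    # Keep only the first report seen for each distinct matchkey ciphertext.
    # `reports` and the "matchkey_ciphertext" field are illustrative placeholders.
    seen = set()
    unique = []
    for report in reports:
        ct = report["matchkey_ciphertext"]  # authenticated ciphertext bytes
        if ct not in seen:
            seen.add(ct)
            unique.append(report)
    return unique

A query containing many repeated ciphertexts could instead be rejected outright as malicious.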

If a Match Key Provider is the one colluding with a Helper Party to launch this attack, they can bypass the sensitivity capping for a matchkey by directly producing as many different encryptions of it as they like. This seems to allow the same style of attacks to continue.

However, we claim that the technique of caching encryptions on the device to get a bound on the sensitivity of unknown matchkeys is sufficient to provide protection against cardinality attacks in sharding, even if the Matchkey Provider is able to add arbitrarily many encryptions of known matchkeys into the sharding process. The reason is that even though a MKP can create an arbitrary number N of encryptions of a known matchkey, the attack only works by observing the sum of N and the number M of matchkeys that come from a user’s actual devices. If we can bound M with some reasonable sensitivity (say 100 = (# different devices) * (max # domains visited per device)), then we can add non-negative DP noise such that the sum N + M + noise is differentially private.

(This may have been clear to some discussing this at the PATCG, but overall I've found it helpful to write out the approach more fully.)

Amount of Noise Needed

For DP parameters that are a couple orders of magnitude less than the overall DP budget, we can get by with adding around 4k dummy records per shard. This is a relatively small amount compared to the overall query sizes and does not depend on the size of the total query or the number of shards. The number of dummy records that need to be added only depends on the sensitivity bound for unknown matchkeys and the epsilon and delta DP parameters.

More specifically, if the sensitivity of a matchkey is 100, then we can make the value N + M visible to any attacker differentially private with epsilon = 0.01 and delta = 10^{-8} by adding noise from a Truncated Geometric distribution with an expected size of 1827 (see the appendix section for details). For this to be the amount of noise from the perspective of every party, we will probably need at least two of the Helper Parties to each add this many dummy records (in expectation). So we can expect to add fewer than 4k dummy rows per shard.
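As a sketch (an illustration, not a vetted implementation) of how a Helper Party might draw its dummy-record count, the truncated double geometric on {0, ..., 2n} can be sampled directly from its probability weights; with the parameters above (epsilon = 0.01, n = 1827), the expected draw is n = 1827:

import math
import random

def sample_dummy_count(n, epsilon):
    # Truncated double geometric on {0, ..., 2n}, centered at n:
    # Pr(x) is proportional to exp(-epsilon * |n - x|).
    support = range(2 * n + 1)
    weights = [math.exp(-epsilon * abs(n - x)) for x in support]
    return random.choices(support, weights=weights)[0]

print(sample_dummy_count(1827, 0.01))  # number of dummy records this helper adds to a shard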

Two stage sharding

To enforce the removal of identical encryptions of matchkeys, we will need to use an extra sharding stage, since we don’t assume the full input set could fit on any one shard for deduplication. To be maliciously secure, we most likely need each Helper Party to enforce the deduplication on its own set of encrypted shares.

Stage 1: Ciphertext Sharding: First we will shard based on the value of the ciphertext itself, specifically the ciphertexts that encrypt the matchkey. This cannot leak any additional information since the ciphertext values are public to all parties. In the case where the matchkey is encrypted as replicated shares (6 ciphertexts in all), the Helpers could agree on a public sharding function, such as hashing all the ciphertexts in a certain order, interpreting those bits as an integer y, and then computing the shard id as i = y mod (number of shards).
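A minimal sketch of such a public sharding function (the choice of SHA-256 and of simply concatenating the ciphertexts in an agreed order are assumptions for illustration):

import hashlib

def stage1_shard(ciphertexts, num_shards):
    # `ciphertexts`: the 6 matchkey ciphertexts of one report (as bytes), in an agreed order.
    # Every party sees the same ciphertext bytes, so every party computes the same shard id.
    digest = hashlib.sha256(b"".join(ciphertexts)).digest()
    y = int.from_bytes(digest, "big")
    return y % num_shards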

Stage 2: PRF Sharding: Then each shard will decrypt and compute a PRF of the matchkey using MPC. Here each of these shards for a particular Helper Party will need to communicate with the paired shards of the other Helpers.

(Figure: IPA two-stage sharding diagram)

Optimized encryption for shard id computation

When the device encrypts the matchkey towards the helper parties, it may be a useful tradeoff to let the matchkey be encrypted in two ways:

  1. As replicated shares to be used in the MPC for sorting
  2. With some other encryption to be used specifically for the computation of a PRF. For example, the PRF referenced here could begin by having the UA encrypt the matchkey using threshold ElGamal encryption. To get malicious security we will want to use a maliciously secure PRF such as the Dodis-Yampolskiy PRF (a toy sketch of this PRF in the clear is shown below). Daniel Masney and some academic collaborators have an efficient protocol for computing this in the honest majority setting that we can share more on.
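For intuition only, here is a toy, in-the-clear evaluation of the Dodis-Yampolskiy PRF F_k(x) = g^{1/(k+x)} over a small prime-order subgroup. The group parameters are tiny illustrative placeholders; in IPA the evaluation would happen inside a maliciously secure MPC on a secret-shared key and matchkey, never in the clear.

# Toy parameters: p = 2q + 1 with p, q prime; a real deployment would use a large
# prime-order group (e.g. an elliptic curve group).
p = 1019      # group modulus
q = 509       # prime order of the subgroup
g = 4         # generator of the order-q subgroup (a quadratic residue mod p)

def dy_prf(k, x):
    # F_k(x) = g^(1 / (k + x)), with the inverse taken mod the group order q.
    exponent = pow((k + x) % q, -1, q)  # modular inverse (Python 3.8+)
    return pow(g, exponent, p)

print(dy_prf(k=123, x=45))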

Sharding messes up the presort on timestamps

One optimization we’ve been hoping to use is to have the report collector presort on timestamps in the clear before submitting a query. However, in the computation of the shard id with the PRF we must shuffle all the rows. This breaks the previous sort on timestamps. It is an open problem to figure out if there is a way to maintain this presorting optimization in some way.

An old approach revisited

An earlier approach that we considered for IPA was to perform the join by applying an OPRF to all of the matchkeys, instead of using sorting. This approach is vulnerable to known cardinality attacks on the matchkeys and for this reason we moved away from it. It is worth noting here though that bounding the cardinality of unknown matchkeys is not sufficient to let us go back to this approach.

First consider the direction we would like to go in to make the OPRF approach have differentially private leakage. If the sensitivity bound on a user’s matchkeys coming off real devices is M = 100, there will be some number of matchkeys with cardinality 1, some with cardinality 2, ..., and some with cardinality 100. For each of these histogram buckets we will add non-negative DP noise by adding sets of dummy reports with that number of identical matchkeys. We could add DP noise to each of the buckets with sensitivity 1. Here the bound on the user’s number of matchkeys gives us the number of buckets in the histogram.
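To make this direction concrete before discussing why it breaks down, here is a rough sketch of the per-cardinality histogram with non-negative noise added to each bucket; the inputs and the noise sampler are illustrative placeholders (e.g. the truncated double geometric sketch from earlier):

from collections import Counter

def noisy_cardinality_histogram(matchkeys, max_cardinality, noise_sampler):
    # Count how many reports each matchkey appears in, then bucket matchkeys by that cardinality.
    per_key_counts = Counter(matchkeys)
    histogram = Counter(per_key_counts.values())
    # Add non-negative DP noise to each bucket by (conceptually) inserting
    # noise_sampler() sets of dummy reports with that many identical dummy matchkeys.
    # This only works if real cardinalities are bounded by max_cardinality, which is
    # exactly the assumption the next paragraph shows a malicious MKP can break.
    return {card: histogram.get(card, 0) + noise_sampler()
            for card in range(1, max_cardinality + 1)}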

The flaw here is that we cannot actually bound the number of times a user’s matchkey is submitted. A malicious MKP can submit a targeted user’s matchkey many times (e.g. in a source-fan-out query there could be many copies of a matchkey with source events or split across multiple sets of trigger reports from fake trigger sites).

As a result the histogram has an unbounded number of buckets that we would need to add noise to. To use the notation above: even though M is bounded, N is unbounded, and to make this approach work there would have to be a way to bound N + M.

Truncated Geometric/Discrete Laplace with large sensitivity

We can use the techniques for non-negative DP from our other paper. In that paper we focused on the case where the sensitivity is $\Delta=1$. We now show, first for the truncated geometric / discrete Laplace, how to determine the parameters of the noise distribution for larger values of $\Delta$. Other distributions could also be used. This extends Section 3.2 of the paper.

In that section we consider a DP mechanism using the following distribution:
$M(X) = f(X) + z; \quad z \sim p_{DoubleGeometric}(n)$
$\textrm{Pr}_{DoubleGeometric}(x|n) = Ae^{-\epsilon|n-x|}; \qquad x\in \{0,\ldots,2n\}$
For some normalizing constant $0< A< 1$ and some $n \in \mathbb{N}$.

As a probability this must sum to 1, which lets us solve for $A$. Letting $r=e^{-\epsilon}$, we get (see equation 10 in paper)
$A=(1-r)/(1+r-2r^{n+1})$

If we have a sensitivity $\Delta$ and desired $\epsilon$ and $\delta$, then we need to cover the tail as
$\delta \geq A \sum_{k=n-\Delta+1}^{n} e^{-k \epsilon}.$

For $\Delta$ larger than 1, we give a simple script to compute the smallest $n$ such that the above inequality holds for a given $\Delta, \epsilon, \delta$.

import math

def RHS(n, Delta, epsilon):
    # Computes A * \sum_{k = n - Delta + 1}^{n} e^{-k * epsilon}
    # where A = (1 - r) / (1 + r - 2 * r^{n+1}) and r = e^{-epsilon}.
    # Since e^{-k * epsilon} = r^k, the sum is A * \sum_{k = n - Delta + 1}^{n} r^k.
    r = math.exp(-epsilon)
    A = (1 - r) / (1 + r - 2 * r**(n + 1))
    result = 0
    for k in range(n - Delta + 1, n + 1):
        result += r**k
    return A * result

def find_smallest_n(Delta, epsilon, small_delta):
    # Find the smallest positive integer n such that small_delta >= RHS(n, Delta, epsilon).
    # Lemma: the smallest n will always be larger than Delta, so start the search there.
    for n in range(Delta, 2**32):
        if small_delta >= RHS(n, Delta, epsilon):
            print("n =", n, "RHS =", RHS(n, Delta, epsilon))
            return n

assert find_smallest_n(1, 0.5, 10**(-6)) == 25  # example in the paper

find_smallest_n(100, 0.01, 10**(-8))  # prints n = 1827, matching the expected noise size cited above

Thanks @benjaminsavage, @eriktaubeneck, @akoshelev, and @danielmasney for lots of discussions on this.

@martinthomson
Collaborator

Stage 1: Ciphertext Sharding

Note that each helper will receive a different ciphertext. Thus, the sharding here will need to be based on having one helper pick and the others follow, or it might involve some communication in order to agree on the shard allocation for each event.

If one helper picks and it is malicious, then it can send events with the same ciphertext to multiple shards, which might be used to bust any user-level capping. This only works if there are sufficient events for a user to be able to get a significant contribution on multiple shards, but that is a condition that a malicious site can engineer. This suggests that we would need something more sophisticated for the allocation process.

@csharrison
Contributor

cc @marianapr who has been looking at malicious secure protocols for dummy insertion. One thing we need to be careful of here is malicious helpers amplifying/duplicating unknown match keys from within the protocol, which could violate the sensitivity bounds.

As a step towards bounding the sensitivity of a matchkey it was suggested (@benjaminsavage mentioned discussing this with several folks at the last PATCG) that the device can cache the encryption of the matchkey per domain/app (or report collector) that is requesting it. Then, when a website requests N encryptions of a matchkey from a user, it will get back the exact same authenticated ciphertext N times. The Helper Party Network processing these reports in a query will be able to filter out the duplicates or reject the query as malicious.

Is there a reason this is better than just encrypting a dummy record (null match key) on the client when a threshold of match keys M have been extracted? This avoids the need for a dedupe stage. I have some concerns that caching the ciphertext constrains the amount we could allow match keys to change, as rotating a match key would have to invalidate the cache (and therefore leak that the match key changed)

@bmcase
Collaborator Author

bmcase commented Feb 22, 2023

Note that each helper will receive a different ciphertext. Thus, the sharding here will need to be based on having one helper pick and the others follow, or it might involve some communication in order to agree on the shard allocation for each event.

@martinthomson, or we could just send all ciphertexts to all parties for this step. They can then use all the values consistently to shard, which hopefully avoids depending on any one party to do it. This shouldn't reveal anything, since each party can only decrypt its own ciphertexts.

cc @marianapr who has been looking at malicious secure protocols for dummy insertion.

@csharrison, that's great; the mechanics of adding dummy records are still something we need to flesh out. I think the best place to do it is in the shuffle that happens before computing a PRF to get the shard id in Stage 2 sharding.

One thing we need to be careful of here is malicious helpers amplifying/duplicating unknown match keys from within the protocol, which could violate the sensitivity bounds.

Yes, I think I see your point. You could try to duplicate an unknown matchkey (to increase M above) while also inserting a known matchkey to have a large N, and check whether they collide or not in sharding. Maybe when inserting the dummy records there needs to be a way to prove you've added them from some space of dummy records disjoint from the matchkey space (may only need to check a couple of bits in the MPC if we go with matchkeys like #34).

Is there a reason this is better than just encrypting a dummy record (null match key) on the client when a threshold of match keys M have been extracted? This avoids the need for a dedupe stage.

It could be better in terms of utility if we would ever process null matchkeys instead of real ones. I'm not sure it would avoid needing a dedupe stage either, as even if you just call get_encrypted_matchkey() one time you could use that encrypted matchkey to submit multiple reports.

I have some concerns that caching the ciphertext constrains the amount we could allow match keys to change, as rotating a match key would have to invalidate the cache (and therefore leak that the match key changed)

That is a good point I haven't thought about. At first it doesn't seem too bad. If we limit rotating matchkeys to once per epoch anyway (as needed to maintain the DP capping against malicious MKPs) and update to the new one at the start of each epoch if the MKP requested before that time, then we could also just reset all the caching at the start of every epoch. Since the epoch id is in the associated data encrypted with the matchkey, it should give a way for the Helper Parties to have visibility into how the sensitivity bound of a matchkey has changed and to use a different sensitivity bound in a cross-epoch query.

@akoshelev

or we could just send all ciphertexts to all parties for this step.

Does this contradict the statement above?

To enforce the removal of identical encryptions of matchkeys, we will need to use an extra sharding stage, since we don’t assume the full input set could fit on any one shard for deduplication.

I was hoping that with 1B events we could actually fit a few hundred GB onto a single host for de-duplication purposes, so multicasting ciphertexts to the helper parties may be a plausible mitigation.

@bmcase
Collaborator Author

bmcase commented Feb 22, 2023

@akoshelev, sorry, I wasn't very clear. I'm suggesting each of the three Helper Parties could receive all the ciphertexts -- "all" as in all reports, and for each report the encryptions of all 6 replicated shares. If all of those do not fit onto a single host run by each Helper for de-duplication, then each Helper would need to do Stage 1 sharding for them.

@csharrison
Contributor

I'm not sure it would avoid needing a dedupe stage either, as even if you just call get_encrypted_matchkey() one time you could use that encrypted matchkey to submit multiple reports.

Yup, I missed that. We can't remove the dedupe stage :)

That is a good point I haven't thought about. At first it doesn't seem too bad. If we limit rotating matchkeys to once per epoch anyway (as needed to maintain the DP capping against malicious MKPs) and update to the new one at the start of each epoch if the MKP requested before that time, then we could also just reset all the caching at the start of every epoch.

Yes, I agree that with the current design this is benign, so it is probably fine, but it makes me a bit nervous for match key epochs to be load-bearing in two different ways. If we came up with a new and improved server-side protocol which made it OK for match keys to change arbitrarily, we might not want to continue being burdened by this constraint. That being said, it's a minor issue.

@eriktaubeneck
Contributor

If a Match Key Provider is the one colluding with a Helper Party to launch this attack, they can bypass the sensitivity capping for a matchkey by directly producing as many different encryptions of it as they like. This seems to allow the same style of attacks to continue.

I believe we can address this, though it will require an extra step when a Match Key Provider is involved in the query (causing a slightly higher cost). An honest Match Key Provider should be able to provide a consistent encryption for each match key (I am assuming their first-party cookie / storage should persist just as long as a match key set by them).

We could then require that Match Key Providers provide consistent encryption for each match key they provide, and then have the helper parties, as a new step, calculate both the number of distinct encryptions and (via MPC) the cardinality of distinct match keys. If these two values differ, the helper parties would abort.

Given these match keys come from an individual website/app, we could potentially use an optimized approach to calculate this unique cardinality, such as an OPRF.
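In the clear, this consistency check amounts to something like the following sketch; in the real protocol the matchkey cardinality would be computed under MPC (e.g. via an OPRF), and the field names here are hypothetical:

def consistent_encryptions(reports):
    # Abort the query if the number of distinct ciphertexts does not match
    # the number of distinct underlying matchkeys.
    distinct_ciphertexts = len({r["matchkey_ciphertext"] for r in reports})
    distinct_matchkeys = len({r["matchkey"] for r in reports})  # plaintext view, for illustration only
    return distinct_ciphertexts == distinct_matchkeys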

@bmcase
Collaborator Author

bmcase commented Feb 23, 2023

@eriktaubeneck, I think I follow where you're going, but I'm not quite sure what you mean by "consistent encryptions for each matchkey".

We could then require that Match Key Providers provide consistent encryption for each match key they provide

I think the issue here is that we can't necessarily know which matchkeys were encrypted by a MKP and which by a UA. I was discussing this with @benjaminsavage and tried to include it in the section on "An old approach revisited":

A malicious MKP can submit a targeted user’s matchkey many times (e.g. in a source-fan-out query there could be many copies of a matchkey with source events or split across multiple sets of trigger reports from fake trigger sites).

I believe what you're suggesting can help with the case where a MKP is a source site, letting us check that the set of source events they submit doesn't have duplicate matchkeys. But I don't think it would prevent a malicious MKP from submitting duplicate matchkeys as if they were trigger reports, because for real trigger reports we can't expect uniqueness of matchkeys to hold.

@bmcase
Collaborator Author

bmcase commented Feb 23, 2023

The other thing I was thinking about in terms of deduplicating encrypted reports is that it could break a real use case where a website simply has multiple different events (source or trigger) in an epoch for a user and wants to include all of them in a query. Since the encryption of the matchkey is cached, the website wouldn't be able to get results from all of those events if they were removed by deduplication.

I see a couple possible ways around this:

  1. Deduplication can remove duplicates only above a certain cap (e.g. we allow up to 5 duplicate encryptions of a matchkey on different reports). We can factor this cap into the sensitivity bound of unknown matchkeys to still add the right amount of DP noise (see the sketch after this list).

  2. A report with a single matchkey can contain the data for multiple events of that user (mk, event1, event2, event3). We would need to cap and pad all reports to have the same number of events so they would be indistinguishable when sorted. After sorting we could flatten these vectors of events out before doing attribution.

  3. Some optimized combination of 1) and 2) depending on the sparsity of users having many different events.
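A rough sketch of option 1, deduplicating only above a per-ciphertext cap (the cap value and report fields are illustrative):

from collections import Counter

def dedupe_above_cap(reports, cap=5):
    # Keep at most `cap` reports per identical matchkey ciphertext; the cap then has to be
    # folded into the sensitivity bound for unknown matchkeys when sizing the DP noise.
    kept, seen = [], Counter()
    for report in reports:
        ct = report["matchkey_ciphertext"]
        if seen[ct] < cap:
            seen[ct] += 1
            kept.append(report)
    return kept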

@marianapr

Hello,

Correct me if I am wrong, but my understanding here was that the only harmful way in which the DP sensitivity bound can be violated is if one party manages to duplicate the contributions of a certain client, which would be reflected in a noticeable way in one of the released aggregates because the DP noise will not be sufficient. One thing that I am still hazy about is why this will be a problem only in the PRF-based approaches and not in the sorting-circuit-based MPC (in the latter case the DP bound violation will be reflected in the released counts, right?)

Going back to solving the issue, it seems that as long as we prevent the malicious actor from being able to duplicate records, this should be resolved. This can be guaranteed with respect to parties that collect encrypted data and then run the helper network to obtain the aggregate output. The remaining challenge for mechanisms that rely on record duplication in the DP privacy mechanism is to prove that the duplication was done according to an honestly sampled DP noise distribution. Though this is more about proving that dummy records were generated correctly, never contributing anything but a 0 value.

I saw above some discussion about multiple match keys encrypting the same value. I think you can do something about this by providing a PRF value of the encrypted key, proving that the match keys encrypt the same values as the ones on which you evaluated the PRF, and then making sure that different match keys do not encrypt the same value. Attacks where the client is tricked into submitting more contributions with the same match key than in honest executions should be stopped at the client, I guess.

@bmcase
Collaborator Author

bmcase commented Feb 28, 2023

Hello @marianapr

Correct me if I am wrong, but my understanding here was that the only harmful way in which the DP sensitivity bound can be violated is if one party manages to duplicate the contributions of a certain client, which would be reflected in a noticeable way in one of the released aggregates because the DP noise will not be sufficient.

If by "released aggregates" you mean the final outputs of the protocol, I actually don't think duplicating will be an issue there because we have the per matchkey sensitivity capping that happens in the MPC. What we are mainly concerned about here is leaking information from sharding (shard sizes) if we can't maintain a sensitiving for unknow matchkeys.

@benjaminsavage
Collaborator

I actually don't think duplicating will be an issue there because we have the per matchkey sensitivity capping that happens in the MPC

I'll just elaborate on this a bit because I think this is different from the ARA proposal and might be unfamiliar to folks reading this comment thread.

In the IPA protocol there are a few stages. After the "attribution" stage, where trigger events are attributed to source events, there is a "user contribution capping" stage. This is one of our more expensive stages. In MPC we compute the total contribution of each user to the measurement result (across all histogram buckets). If this contribution exceeds the per-user cap, we drop their contribution down to the cap.

In our current implementation this requires doing a bit-decomposition of their "total contribution" value, then performing a secure comparison protocol on those bits to get a sharing of 0 or 1 indicating if they've exceeded the cap.

@marianapr

OK, I see, maybe I missed the contribution capping (you are bounding the L1 norm of user contributions). Does the current prototype already do the capping?

@bmcase
Collaborator Author

bmcase commented Feb 28, 2023

That's right, our capping is essentially bounding the L1 norm of a user's contribution. Yes, we have it implemented in both MP-SPDZ and Rust. As of the most recent PATCG Rust numbers, capping was the second most expensive stage in terms of network (~30%).

@benjaminsavage
Collaborator

benjaminsavage commented Apr 19, 2023

This approach (caching ciphertexts) will result in some amount of information leakage to the report collector and helper nodes.

In the case of a source-site that delegates measurement to a different entity that acts as the report collector, it would reveal something akin to their total unique user count, and their ad-load per user.

While this is not cross-site data, it is information a business might want to keep confidential and could be concerned about the report collector (or potentially even the helper parties) being exposed to.

An alternative design to caching ciphertexts would be to cache the same values for the secret-shares, but have the browser generate a different encryption of these shares each time. In this fashion, no information would be leaked to a delegated report collector, but the helper parties could still enforce a sensitivity bound by looking for identical share values (after decryption).

This would only work in a large field where the chances of two shares (in a replicated secret sharing) having identical values for two different people is very small. Assuming a 32-bit prime, the odds of two different people randomly having share 1 and share 2 the same would be 1 out of 2^64, which seems sufficiently small.

This would entail identical storage requirements in the browser, but a different value would be cached: shares instead of ciphertexts.

This approach does not address the issue of information leakage to the helper parties. I don't have any good idea of how to solve that problem.

@csharrison
Contributor

This would only work in a large field where the chances of two shares (in a replicated secret sharing) having identical values for two different people is very small. Assuming a 32-bit prime, the odds of two different people randomly having share 1 and share 2 the same would be 1 out of 2^64, which seems sufficiently small.

Isn't this better modeled by the birthday paradox, with N users and looking at the expected # of collisions?

@bmcase
Collaborator Author

bmcase commented Apr 19, 2023

Yeah, the birthday paradox seems like the right direction for the analysis. If we're doing on-device-generated matchkeys, we probably need those to be 64 bits or more anyway (for birthday paradox implications with the actual matching), so with replicated shares of that value we would be looking for collisions in a space of size 2^128.
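A quick back-of-the-envelope for the expected number of collisions, assuming the pair of 64-bit shares a helper sees is uniform and independent across users:

def expected_collisions(num_users, space_bits=128):
    # Birthday bound: expected number of colliding pairs is about C(N, 2) / 2^space_bits.
    return num_users * (num_users - 1) / 2 / 2**space_bits

print(expected_collisions(10**9))  # ~1.5e-21 for a billion users in a 2^128 space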

@bmcase
Collaborator Author

bmcase commented Apr 25, 2023

Was chatting with Daniel, and he pointed out one other detail on the DP analysis for sharding: the total amount of noise (across all shards) will be known. It doesn't seem like it is directly an issue, but we should figure out how to factor it into the formal analysis of the privacy parameters.
