Collect without interval #273
In this case the aggregators no longer need to worry about |
This seems similar to my suggestion in #218 of having a |
Yes, but I think allowing advanced filters can open up privacy concerns. When slicing the batches, one not only has to make sure the sliced batch meets all privacy guarantees, but also all the deltas with previously collected slices. We may find a way to do this with time intervals, but adding other metadata like region will make it extremely hard. This metadata will likely be related to client measurements, so I think it's better to encode such information in the measurement, or to design different tasks for it. There are use cases where drill-down or fine-grained filtering is not needed; the user simply wants aggregate results with a privacy guarantee.
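The slicing concern above can be sketched as a toy check (entirely my own construction, not from the draft): a requested slice is only safe to collect if the slice itself *and* its delta against every previously collected slice meet `min_batch_size`. Slices are modeled as sets of report IDs, and the `MIN_BATCH_SIZE` constant is a hypothetical task parameter.

```python
# Toy validity check for batch slicing (not from the DAP draft).
# A new slice is only safe if the slice itself AND its delta against
# every previously collected slice contain >= MIN_BATCH_SIZE reports.

MIN_BATCH_SIZE = 100  # hypothetical task parameter


def slice_is_safe(new_slice, collected_slices):
    """Return True if collecting new_slice leaks no small aggregate.

    new_slice and each element of collected_slices are sets of report IDs.
    """
    if len(new_slice) < MIN_BATCH_SIZE:
        return False
    for old in collected_slices:
        delta = new_slice ^ old  # symmetric difference of the two slices
        # If the delta is non-empty but small, subtracting the two
        # aggregates reveals an aggregate over < MIN_BATCH_SIZE reports.
        if 0 < len(delta) < MIN_BATCH_SIZE:
            return False
    return True
```

The point is that doing this check against time intervals alone is already subtle, and every extra filterable dimension (region, etc.) multiplies the set of deltas that must be checked.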
Agreed; previous discussion on this topic is mainly here: #195
I think this would be useful. Let me check if I understand the protocol changes that are required:
I think exact |
Per discussion in IETF 114, let's aim to support this in the next version of the draft, with a simpler (but equivalent) version of the existing query validation.
Looking at the implementation strategy for #297, I think there is a different approach which would be less disruptive to the existing design, provide better unification between the different query types, and allow better precision w.r.t. the number of reports per batch in the new fixed-size query type.

The high-level idea is: the current approach in #297 for
I touch on a few more points in the original PR comment, but this captures the big idea.

[1] #297 includes

[2] Somewhat off-topic from the issue at hand, but I think allowing collect requests to be made before a batch interval is complete is valuable for performance reasons, too. Tasks using a VDAF with a trivial (i.e.
@branlwyd I think it's time to flesh out your idea in a PR. I would suggest cribbing off of #297, i.e., creating a new branch based on the current one. My main requirements are as follows:
@branlwyd @cjpatton Choosing the batch interval dynamically is not ideal because:

1. As @junyechen1996 and others have mentioned, the coarse-grained timestamp would prevent aggregators from effectively defining intervals in
2. In the
I think this can be worked out by the leader itself. It is the leader's responsibility to create a

On a more general note, I'd prefer we support the different collect requirements explicitly and separately at this stage; once we know more about their behaviour and the challenges in implementation, we can consider unifying them if necessary. Otherwise we may find ourselves unable to support use cases of one type without breaking the other.
@cjpatton we should define what exactly "(3.) a query determines a unique batch" means. From the PR:
This is a valid concern, but since the leader can still do this with pure interval-based collection, I don't see what addressing it would improve. Also worth pointing out: if the task is protected by differential privacy, and the client is sending information like user-agent in the clear, then the task should accept that grouping batches by the same user-agent is still privacy-preserving as long as
Thanks for your feedback; I have sent #300 which hopefully makes the approach I am suggesting clear. To respond to a few of the concerns raised in this issue:
The randomness is used as a timestamp tie-breaker, so a
This only works if the batch intervals are allowed to overlap, as you note later in your comment. Other than that, this is somewhat inefficient: a few failed report aggregations can delay a batch from being collectable until another aggregation job is run, and the Leader will only realize it needs to run another aggregation job once the prior aggregation jobs have completed. This "extra" aggregation job is also fairly likely to be quite small compared to the normal batch sizes in use.
The current approach in #297 (IMO) effectively creates two similar-but-separate protocols embedded inside the top-level DAP protocol. The approach I suggest in #300 achieves the same overall goal (i.e. generate batches as soon as
The Leader can only do this if they throw away any reports that don't meet their criteria. Open issue #130 may end up with the client sending report shares directly to the leader & helper; if so, at that point the Helper will be able to detect if the Leader is dropping reports.
How large is the timestamp rounding (i.e. truncation) duration? If it has to be too large, the delay would be problematic; I'm not sure what is generally considered large enough to protect privacy.
The difference is that with interval-based collection the leader is restricted to splitting into:
With fixed-size batches the leader can just insert
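The distinction being drawn here can be made concrete with a toy model (entirely my own construction, not from the thread or the draft), assuming reports are plain integers and the "aggregate" is their sum: with a fixed-size batch, a malicious leader can pad one target report with fake reports whose contributions it knows, then subtract them back out.

```python
# Toy model of the attack surface (my own construction): reports are
# integers and the aggregate is their sum. A malicious leader fills a
# fixed-size batch with one honest target report plus fakes it controls.

MIN_BATCH_SIZE = 5  # hypothetical task parameter


def malicious_fixed_size_batch(target_report):
    """Batch containing one honest report plus leader-controlled fakes."""
    fakes = [0] * (MIN_BATCH_SIZE - 1)  # leader knows these sum to 0
    return [target_report] + fakes


def recover_target(aggregate, known_fake_sum=0):
    """Subtract the leader's known contributions from the aggregate."""
    return aggregate - known_fake_sum
```

With an interval query, by contrast, any honest reports that happen to fall in the chosen window are mixed in, so the leader must either drop them or tolerate the noise they add.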
For our use case it'll have to be hours.
@simon-friedberger What is the attack in this case? If we are talking about a Sybil attack, then I don't see the difference between the two types: for an interval query, the leader can generate fake reports with timestamps in
One reason to support fixed-size collection is to deliver an aggregation result as soon as the batch-size condition is met. If the leader has to wait, that might defeat the purpose. In my example above, if I know the task will usually aggregate
True, but my point was that the interval doesn't matter, so there's no real "overlap". Besides, I think it's better for privacy if the chronological order of reports is not preserved inter- or intra-batch, when slicing on a time window is not required. We can work around the inefficiency above by allowing the leader to collect slightly more than
That's indeed my preference; the advantage is that adopters who are only interested in one query type don't need to worry about the other one so much. I'm not against unifying implementation details, but I think it's better to have two distinct sub-protocols that support the two use cases well than one sub-protocol that has to compromise. At this point, I think I should go read your PR :)
@wangshan I don't think you are. It's a minor difference in the power of the attacker. In the first case, other reports in the interval either have to be dropped or will add noise; in the second, the attack can be executed more precisely.
Is this coherent? Even if the timestamp granularity is 1 hour for privacy, the correct timestamp is probably in the last 10 minutes...
At least one privacy violation I can think of requires knowing the clock skew between client & aggregator. Specifically, a malicious leader could bucket reports by the apparent clock skew between the client timestamp & the server timestamp. Then, if they learn a client device's clock skew, they can correlate the reports in the relevant buckets as likely having been generated by that device. Rounding timestamps to a large enough degree protects against this attack, since all reports will fall into the same bucket. (Though I think if the malicious actor could control when a client submitted reports, they might be able to figure out bounds on the client's clock skew by triggering a report upload near a rounding boundary and seeing how the client rounds the timestamp.) And even if we know the report arrived at the server in the last 10 minutes, that doesn't tell us what the clock skew between client & aggregator is.

That said, I'm curious if there are other privacy violations to consider here -- has the threat model around the client timestamp been discussed somewhere already?
The main concern about timestamps -- or any other report metadata -- is that it creates a fingerprinting surface that a Helper can use to control the reports that end up in the batch. Of course it should be noted that the Leader also has this capability, and more. My primary concerns about #300 (as it's currently spelled) are more operational:
@branlwyd I originally brought this up in this issue: #274

The threat I was referring to is the following: there is a proxy between client and aggregator (like the one described in #294), and the proxy is owned by the leader (e.g. the leader's edge server). Assuming we encrypt the report from the client to the proxy and on to the aggregator (so no one other than the leader sees the timestamp), if the timestamp is precise, then an attacker at the leader with access to the proxy's input can look up when packets arrived and figure out which client a report came from (maybe with the help of other info, like packet size). Rounding the timestamp makes this a lot harder. In the fixed-size query case, the main purpose of the timestamp is to allow aggregators to filter out very old (or far-future) reports.
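The rounding-plus-freshness-filtering idea above could look something like this sketch; all constants are hypothetical choices of mine, not values from the draft. Clients round timestamps down to a coarse precision, and aggregators use the rounded timestamp only to reject reports that are very old or too far in the future.

```python
# Sketch of timestamp rounding and freshness filtering (constants are
# hypothetical, not from the DAP draft).

TIME_PRECISION = 3600        # round to the hour, per the "hours" comment above
MAX_REPORT_AGE = 7 * 86400   # hypothetical freshness bound (one week)
MAX_CLOCK_SKEW = 300         # hypothetical tolerance for future timestamps


def round_timestamp(ts: int, precision: int = TIME_PRECISION) -> int:
    """Truncate a Unix timestamp down to a multiple of `precision`."""
    return ts - (ts % precision)


def report_is_fresh(report_ts: int, now: int) -> bool:
    """Accept a report only if its (rounded) timestamp is plausible."""
    if report_ts > round_timestamp(now) + MAX_CLOCK_SKEW:
        return False  # too far in the future
    return now - report_ts <= MAX_REPORT_AGE  # and not too old
```

Coarser `TIME_PRECISION` gives more privacy (bigger anonymity buckets) at the cost of a blunter freshness check, which is the trade-off the question above is probing.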
If the collector is not interested in time and interval, but simply wants to collect an aggregation in a batch that meets `min_batch_size`, can the protocol support a 'batch-based collection' instead of the current interval-based one?

Consider a Prio use case: when `max_batch_lifetime == 1`, the collector only needs to collect the aggregation so far, with a batch size B >= `min_batch_size`. This can be orchestrated by the leader, which can implement a counting service to track the batch size for a task; once it reaches `min_batch_size`, the leader sends AggregateShareReq to collect the helper's `aggregate_share` and returns it to the collector.

This requires a new id to tie the agg-flow to the agg-share-flow. For example, in addition to `agg_job_id`, the leader can send a unique `batch_id` in every AggregateReq. At collect time, the leader uses the same `batch_id` to collect the `output_share` from the helper. (The helper can still proactively aggregate output_shares into an `aggregate_share`; since there are no more batch windows, the helper can store the `aggregate_share` by `agg_job_id`, or accumulate all aggregation jobs' `output_share`s into one `aggregate_share` and store it with the `batch_id`.) Illustrated as follows (diagram omitted):

Here [T0, Tm] is the time it takes to aggregate `min_batch_size` reports; it has no relationship with `min_batch_duration` or the collector's interval.

As #195 pointed out, to avoid a privacy leak each batch window must contain at least `min_batch_size` reports, otherwise an attacker can find ways to slice intervals that break the privacy guarantee. But if the protocol does require each batch window to meet `min_batch_size`, then the collect interval itself is no longer useful, since the duration it takes to meet `min_batch_size` is the smallest duration that can be queried. Therefore, it seems to make sense to base collection entirely on batch size, not on interval.
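The counting-service orchestration described above could look something like the following sketch; the class and method names are hypothetical, not from the draft. The leader counts reports per task as they finish aggregation, and once the count reaches `min_batch_size` it mints a fresh `batch_id` that ties the aggregation flow to the collect flow.

```python
# Hypothetical leader-side counting service (names are mine, not from
# the draft): group reports into fixed-size batches identified by a
# batch_id, independent of any time interval.

import secrets
from collections import defaultdict

MIN_BATCH_SIZE = 100  # hypothetical task parameter


class LeaderBatchTracker:
    def __init__(self, min_batch_size=MIN_BATCH_SIZE):
        self.min_batch_size = min_batch_size
        self.pending = defaultdict(list)  # task_id -> pending report ids

    def on_report_aggregated(self, task_id, report_id):
        """Record a report that finished aggregation for this task.

        Returns (batch_id, reports) once the batch-size condition is met,
        else None.
        """
        self.pending[task_id].append(report_id)
        if len(self.pending[task_id]) >= self.min_batch_size:
            batch_id = secrets.token_hex(16)  # ties agg-flow to collect-flow
            reports = self.pending.pop(task_id)
            return batch_id, reports
        return None
```

When a batch is ready, the leader would send AggregateShareReq to the helper keyed by this `batch_id`; the elapsed time [T0, Tm] is simply however long it takes to accumulate `min_batch_size` reports.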