Vectorized aggregation #1009
Conversation
There are plenty of changes coming from #1008 that add some noise; can you rebase this on top of main after #1008 is merged?
Codecov Report. Attention: Patch coverage is

```
@@            Coverage Diff             @@
##             main    #1009      +/-   ##
==========================================
+ Coverage   90.24%   90.43%   +0.19%
==========================================
  Files         172      173       +1
  Lines       25727    26390     +663
==========================================
+ Hits        23218    23867     +649
- Misses       2509     2523      +14
```

☔ View full report in Codecov by Sentry.
```diff
@@ -272,7 +261,7 @@ impl<'a, B: ShardBinding, F: ExtendableField> UpgradedContext<F> for Upgraded<'a
     T: Send,
     UpgradeContext<'a, Self, F>: UpgradeToMalicious<'a, T, M>,
 {
-    UpgradeContext::new(self.narrow(&UpgradeStep::UpgradeSemiHonest), NoRecord)
+    UpgradeContext::new(self.clone(), NoRecord)
```
I don't really understand what's going on here (even given the comments above), and I'm not sure this change is appropriate -- but it did resolve issues I had with compact gate after removing modulus conversion from OPRF IPA.
iirc, it was implemented because an early version of compact gate had to use an ignorelist for steps that don't trigger communications, and tracked active steps through `send`. #711 may shed some light on why it was done this way.
I could be wrong, but I believe compact gate uses `narrow` to track steps now, which may be why you're not getting a panic. If you could share the errors you're getting from modulus conversion, we can dig deeper into this. For now it seems that it should just work.
you may want to fix the comment too (lines 254..258)
The errors were `error: ipa_macros::step expects an enum with variants that match the steps in steps.txt` and ``error[E0277]: the trait bound `compact::Compact: StepNarrow<UpgradeStep>` is not satisfied`` (i.e., the usual errors when a step is not exercised in the steps.txt generation).
Part of why I was unsure what to do here is that these methods were not being exercised -- so I wasn't sure there would be test coverage for whatever change I made.
But with the benefit of your comment and some more thinking, I have a proposal here that I'm more confident in. I'm going to open a separate PR for it (to be considered in advance of this one).
#1030. After this PR is merged / rebased to include that, it will need to stub these impls based on `cfg(descriptive-gate)`, because it removes the last place we invoke share upgrades in the protocol (modulus conversion). They'll later need to come back when we add MAC-based malicious security for PRF evaluation and shuffle.
```rust
pub type AggResult<const B: usize> = Result<BitDecomposed<Replicated<Boolean, B>>, Error>;

pub async fn aggregate_values<'fut, C, OV, const B: usize>(
```
Maybe nice to add a test for just this `aggregate_values` function.
```rust
let bit_futures = index_contribution
    .zip(repeat((bit_of_bdkey, bucket_c.clone())))
    .enumerate()
    .map(|(i, (a, (b, bucket_c)))| {
        a.multiply(b, bucket_c.narrow(&BitOpStep::Bit(i)), record_id)
    });
```
This got a lot more complex looking.
Does it no longer work to multiply a multi-bit value with a single bit? That's too bad. Could we migrate from `BitDecomposed` to a `BooleanArray` and use `select` instead?
Doesn't need to be in this PR - this PR is too long already =).
The issue here is that we're working with a 2-D array. The type `BitDecomposed<AdditiveShare<Boolean, N>>` actually does contain a boolean array -- the vector type in `AdditiveShare<Boolean, N>` is the boolean array type of length `N`.
We could flatten everything into a boolean array of size N * HV_BITS, but then we'd have to do a bunch of 2-D array index calculations, which I think would make things even more complicated.
I do like the idea of using select, though. I'll see if I can do something with that.
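For intuition, the index arithmetic that the flattened `N * HV_BITS` layout would require can be sketched in plain Rust. This is a toy illustration only: `HV_BITS`, `flat_index`, and `unflatten` are assumed names, not the actual ipa types.

```rust
// Toy sketch of the flattened N * HV_BITS layout discussed above.
// HV_BITS is an assumed histogram-value width, not a real constant.
const HV_BITS: usize = 8;

/// Bit `bit` of record `record` lands at this offset in the flat array.
fn flat_index(record: usize, bit: usize) -> usize {
    record * HV_BITS + bit
}

/// Inverse mapping, recovering (record, bit) from a flat offset.
fn unflatten(idx: usize) -> (usize, usize) {
    (idx / HV_BITS, idx % HV_BITS)
}

fn main() {
    assert_eq!(flat_index(3, 5), 29);
    assert_eq!(unflatten(29), (3, 5));
}
```

Every 2-D access in the protocol would need one of these conversions, which is the extra bookkeeping the comment above refers to.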
After looking at it some, I don't think it makes sense to do in this PR... it's a trade of roughly 100 lines elsewhere for 10 lines here. That may be the right tradeoff if it improves the modularity of the code, but I don't think it's the right tradeoff if those 100 lines are getting added to this PR.
The concrete options I see are:
- Implementing `BooleanArrayMul` for `BitDecomposed<AdditiveShare<Boolean, N>>`. This in turn requires either adding some more `Foo: SecureMul<C>` bounds in various places, or else adding a `C` type parameter to the `BooleanArrayMul` trait.
- Converting `select` into a trait so we can implement it for arbitrary types.
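For the second option, a minimal sketch of what a `select` trait might look like. The trait shape and the plaintext `Vec<u8>` stand-in are hypothetical, not the actual ipa API; a real MPC implementation would compute `b * t + (1 - b) * f` with a vectorized secure multiplication rather than a plaintext branch.

```rust
// Hypothetical generalization of `select` into a trait, per the discussion
// above. The Vec<u8> impl is a plaintext stand-in for a share vector.
trait Select {
    /// Returns `if_true` when the bit is set, `if_false` otherwise.
    fn select(bit: bool, if_true: &Self, if_false: &Self) -> Self;
}

impl Select for Vec<u8> {
    fn select(bit: bool, if_true: &Self, if_false: &Self) -> Self {
        if_true
            .iter()
            .zip(if_false)
            .map(|(t, f)| if bit { *t } else { *f })
            .collect()
    }
}

fn main() {
    let t: Vec<u8> = vec![1, 2, 3];
    let f: Vec<u8> = vec![4, 5, 6];
    assert_eq!(<Vec<u8> as Select>::select(true, &t, &f), vec![1, 2, 3]);
    assert_eq!(<Vec<u8> as Select>::select(false, &t, &f), vec![4, 5, 6]);
}
```

The benefit of the trait form is exactly what the thread suggests: arbitrary container types (including `BitDecomposed`) could opt in without a blanket `BooleanArrayMul` impl.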
I'm totally supportive of landing this PR as is and working on this in a future PR.
My plan is to defer this and saturating addition to a follow-up. I am working on the (rather gruesome) merge conflicts presently, but I'm hoping this can be ready tonight or tomorrow.
```diff
 // bdbit_contribution is either zero or equal to row_contribution. So it
 // is okay to do a carryless "subtraction" here.
 for (r, b) in row_contribution[left_index]
     .iter_mut()
     .zip(bdbit_contribution.iter())
 {
     *r -= b;
 }
 if right_index < breakdown_count {
-    row_contribution[right_index] = bdbit_contribution;
+    for (r, b) in row_contribution[right_index]
+        .iter_mut()
+        .zip(bdbit_contribution)
+    {
+        *r = b;
+    }
```
This part also seems like it could be simplified if we stopped using `BitDecomposed` and used `BooleanArray` instead.
This is a very long PR... I really wish it were split into smaller, more digestible chunks...
I mostly focused my review on the protocols. It's pretty awesome how you were able to eliminate the conversions to the prime field. I LOVE the way you are able to just chunk streams now (very nice!), and the highly efficient bitwise addition that only deals with as many digits as actually necessary is great.
This PR actually does about 1/2 the work I was trying to do in the `feature_label_dot_product` protocol... I should have read this PR first before starting to implement!
So I'm keen to get this landed so that I can build upon this work. I know @bmcase is as well.
The only concern I have is with the wrapping addition (not saturating). I would suggest a follow-up PR to just make it saturating.
```rust
if a.len() < usize::try_from(OV::BITS).unwrap() {
    sum.push(carry);
}
```
So this implements wrapping addition, basically addition modulo `2^OV::BITS`. This could potentially yield unexpected results (like zero) when the sum wraps around.
An alternative would be to implement a "saturating addition".
We could implement that with something like:

```rust
let sum = if /* it could potentially overflow on this step */ {
    integer_sat_add(/* stuff */).await?
} else {
    let (mut sum, carry) = integer_add(/* stuff */).await?;
    sum.push(carry);
    sum
};
```
Do we want to move this circuit to a separate function? Seeing `wrapping_integer_add` or `sat_integer_add` here makes the intent more obvious imo.
The old version didn't have saturating math either, but I agree that now that we can do it relatively cheaply, it's probably worth doing. Is there any interaction with DP?
The reason we never had such logic before is that we were moving into a huge field, and assumed we would never encounter a sum that reached 2^32. Now that we wrap around at a configurable value, it's possible to select a number that's fairly low.
I think it might be fine to just document this behavior and warn everyone to please select an output type that's sufficiently large so that there is no overflow. That would keep the code simpler. I don't think there's any impact on DP. If there is wrapping, it will essentially decrease the overall contribution of some people to the query output. But it's not predictable. Some people will get their contribution removed and others will not. So if we take a "worst case" scenario type of approach, we probably can't say anything definitive and cannot benefit from this.
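To make the tradeoff concrete, here is a plaintext (non-MPC) illustration of the two behaviors at a configurable output width. The function names and widths are illustrative only, not the protocol's circuits.

```rust
// Plaintext illustration of wrapping vs. saturating addition at a
// configurable output width `bits` (standing in for OV::BITS).
fn wrapping_add(a: u64, b: u64, bits: u32) -> u64 {
    // Addition modulo 2^bits: keep only the low `bits` bits.
    (a + b) & ((1u64 << bits) - 1)
}

fn saturating_add(a: u64, b: u64, bits: u32) -> u64 {
    // Clamp the sum to the largest representable value instead of wrapping.
    let max = (1u64 << bits) - 1;
    (a + b).min(max)
}

fn main() {
    // With an 8-bit output value, 200 + 100 wraps to 44 but saturates to 255.
    assert_eq!(wrapping_add(200, 100, 8), 44);
    assert_eq!(saturating_add(200, 100, 8), 255);
}
```

The wrapped result (44) is smaller than either input, which is the "unexpected results" failure mode described above; saturation at least preserves "this bucket was large".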
```rust
None => {
    // Input stream ended at a chunk boundary. Unlike the partial chunk case, we return
    // None, so we shouldn't be polled again regardless of dummy_fn signalling, but
    // might as well make this a fused stream since it's easy.
```
Yeah, shouldn't we just fuse the inner stream? I haven't looked closely into what would happen, but it appears that it may work too. Maybe there is a better argument name for `dummy_fn` too (`padder`?).
The case where fusing the inner stream doesn't work very well is when we terminate due to an error -- we'd have to drain it.
```rust
/// Process stream through a function that operates on chunks.
///
/// Processes `stream` by collecting chunks of `N` items into `buffer`, then calling `process_fn`
/// for each chunk. If there is a partial chunk at the end of the stream, `dummy_fn` is called
```
Is it enough for our use case to require `T: Default`?
Yes, I considered that -- it would require implementing `Default` on several layers, down to `AdditiveShare`. There is also a semantic requirement in some places this is used that isn't exactly `Default` -- the dummy records need to contribute nothing to the overall result.
I think I would slightly prefer the version with `Default`, but I didn't want to presume to make that change given the impact.
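For reference, the padding behavior under discussion can be sketched with plain vectors. This is a toy model, not the real adapter (which operates on streams); the names `process_fn` and `dummy_fn` follow the doc comment quoted above, and `dummy_fn` is the caller-supplied "contribute nothing" record factory that a `T: Default` bound would replace.

```rust
// Toy model of chunked processing with a caller-supplied padding function.
fn process_by_chunks<T: Clone, R>(
    items: Vec<T>,
    n: usize,
    mut process_fn: impl FnMut(&[T]) -> R,
    mut dummy_fn: impl FnMut() -> T,
) -> Vec<R> {
    let mut out = Vec::new();
    for chunk in items.chunks(n) {
        if chunk.len() == n {
            out.push(process_fn(chunk));
        } else {
            // Partial final chunk: pad with dummy records that must
            // contribute nothing to the overall result.
            let mut padded = chunk.to_vec();
            while padded.len() < n {
                padded.push(dummy_fn());
            }
            out.push(process_fn(&padded));
        }
    }
    out
}

fn main() {
    let sums = process_by_chunks(vec![1, 2, 3, 4, 5], 2, |c| c.iter().sum::<i32>(), || 0);
    assert_eq!(sums, vec![3, 7, 5]);
}
```

With `T: Default` the last argument would disappear, but as noted above, "contributes nothing" is a semantic property that `Default` does not guarantee in general.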
```diff
-    get_bits::<Fp32BitPrime>(breakdown_key.try_into().unwrap(), Gf8Bit::BITS);
-    let value = Fp32BitPrime::truncate_from(VALUE);
+async fn move_to_bucket(count: usize, breakdown_key: usize, robust: bool) -> Vec<BA8> {
+    let breakdown_key_bits = BitDecomposed::decompose(Gf8Bit::BITS, |i| {
```
nit: it may be worth explaining why an 8-bit value is enough here. (I believe it is related to `MAX_BREAKDOWN_COUNT`, and if that's true, maybe a static assertion is possible too.)
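If it is indeed tied to `MAX_BREAKDOWN_COUNT`, a compile-time assertion along these lines could encode the relationship. The constant's value here is an assumption for illustration, not the real one.

```rust
// Hypothetical static assertion that an 8-bit breakdown key (BA8 / Gf8Bit)
// can address every breakdown bucket. 256 is an assumed value.
const MAX_BREAKDOWN_COUNT: usize = 256;

// Evaluated at compile time: the build fails if the breakdown count
// ever outgrows what 8 bits can represent.
const _: () = assert!(
    MAX_BREAKDOWN_COUNT <= 1 << 8,
    "8-bit breakdown keys cannot represent MAX_BREAKDOWN_COUNT buckets"
);

fn main() {
    println!("static assertion held at compile time");
}
```

`assert!` in a `const` item is stable Rust, so no extra dependency is needed for this kind of check.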
```rust
        .set_total_records(num_outputs),
    flattened_stream,
    0..<FV as SharedValue>::BITS,
let flattened_stream = Box::pin(
```
Does this now work w/o multithreading, or do we still rely on it being on for aggregation?
I believe it is still required, perhaps even more so. We can add the buffered stream adapter if we really want things to work without multi-threading.
```diff
-        32 => oprf_ipa::<C, BA8, BA3, BA20, BA5, F>(ctx, input, aws).await,
-        64 => oprf_ipa::<C, BA8, BA3, BA20, BA6, F>(ctx, input, aws).await,
-        128 => oprf_ipa::<C, BA8, BA3, BA20, BA7, F>(ctx, input, aws).await,
+        8 => oprf_ipa::<C, BA8, BA3, HV, BA20, BA3, 256>(ctx, input, aws).await,
```
Is it worth having a generic histogram value parameter with everything else being hardcoded, or am I misunderstanding what's going on here?
I think that's a fair question -- I was modeling it after what was already there (`HV` is the bitwise equivalent of `F`). Eventually all of these (maybe excluding `SS`) need to be runtime-configurable (#953). I don't have a good sense of how similar that implementation will look to the type-generic version.
Yeah, this is an unsolved problem. We never made anything runtime configurable. We just re-built the code for every different combination of query params =).
Capturing some open items:
Double-posting a comment I made in a thread here for better visibility: My plan is to defer this [using
I think it's fine to defer the saturating addition as well.
(They are unused for the moment now that modulus conversion is gone, but they will come back at some point.)
#1046 has the promised follow-up changes.
This changes aggregation to work with shares in $\mathbb{F}_2$ rather than $\mathbb{F}_p$ (required to support DZKP-based malicious security), and vectorizes it.