Vectorized aggregation #1009

Merged: 12 commits into private-attribution:main from vec-agg on May 2, 2024

Conversation

andyleiserson
Collaborator

@andyleiserson andyleiserson commented Apr 11, 2024

This changes aggregation to work with shares in $\mathbb{F}_2$ rather than $\mathbb{F}_p$ (required to support DZKP-based malicious security), and vectorizes it.
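
For intuition about the $\mathbb{F}_2$ change: the cleartext analogue of summing values here is a ripple-carry addition over bit-decomposed integers rather than addition in a large prime field. A minimal sketch in plain Rust (illustrative only; the protocol evaluates the same circuit on secret-shared bits and vectorizes it across many records at once):

// Ripple-carry addition of two little-endian bit vectors, truncated to `width` bits,
// i.e. wrapping addition modulo 2^width. Each step needs only XORs and ANDs, which is
// what makes the shared F_2 version cheap.
fn add_bits(a: &[bool], b: &[bool], width: usize) -> Vec<bool> {
    let mut out = Vec::with_capacity(width);
    let mut carry = false;
    for i in 0..width {
        let x = a.get(i).copied().unwrap_or(false);
        let y = b.get(i).copied().unwrap_or(false);
        out.push(x ^ y ^ carry);
        carry = (x & y) | (carry & (x ^ y));
    }
    out
}

The truncation to width bits is the wrapping behavior discussed further down in this thread.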

@akoshelev
Collaborator

There are plenty of changes coming from #1008 that add some noise - can you rebase this on top of main after #1008 is merged?


codecov bot commented Apr 17, 2024

Codecov Report

Attention: Patch coverage is 97.11934% with 28 lines in your changes missing coverage. Please review.

Project coverage is 90.43%. Comparing base (762393b) to head (3a4c449).

Files Patch % Lines
ipa-core/src/helpers/stream/chunks.rs 95.69% 12 Missing ⚠️
ipa-core/src/protocol/context/mod.rs 63.63% 4 Missing ⚠️
ipa-core/src/test_fixture/ipa.rs 55.55% 4 Missing ⚠️
ipa-core/src/test_fixture/world.rs 81.25% 3 Missing ⚠️
ipa-core/src/protocol/ipa_prf/aggregation/mod.rs 99.42% 2 Missing ⚠️
ipa-core/src/secret_sharing/vector/transpose.rs 97.97% 2 Missing ⚠️
ipa-core/src/query/executor.rs 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1009      +/-   ##
==========================================
+ Coverage   90.24%   90.43%   +0.19%     
==========================================
  Files         172      173       +1     
  Lines       25727    26390     +663     
==========================================
+ Hits        23218    23867     +649     
- Misses       2509     2523      +14     


@@ -272,7 +261,7 @@ impl<'a, B: ShardBinding, F: ExtendableField> UpgradedContext<F> for Upgraded<'a
T: Send,
UpgradeContext<'a, Self, F>: UpgradeToMalicious<'a, T, M>,
{
UpgradeContext::new(self.narrow(&UpgradeStep::UpgradeSemiHonest), NoRecord)
UpgradeContext::new(self.clone(), NoRecord)
Collaborator Author

@andyleiserson andyleiserson Apr 17, 2024

I don't really understand what's going on here (even given the comments above), and I'm not sure this change is appropriate -- but it did resolve issues I had with compact gate after removing modulus conversion from OPRF IPA.

Collaborator

iirc, it was implemented because an early version of compact gate had to use an ignorelist for steps that don't trigger communications and tracked active steps through send. #711 may shed some light on why it was done this way.

I could be wrong, but I believe compact gate uses narrow to track steps now, so that may be the reason why you're not getting a panic. If you could share the errors you're getting from modulus conversion, we can dig deeper into this. For now it seems that it should just work.

you may want to fix the comment too (lines 254..258)

Collaborator Author

The errors were "error: ipa_macros::step expects an enum with variants that match the steps in steps.txt" and "error[E0277]: the trait bound compact::Compact: StepNarrow<UpgradeStep> is not satisfied" (i.e. the usual errors when a step is not exercised in the steps.txt generation).

Part of why I was unsure what to do here is that these methods were not being exercised -- so I wasn't sure there would be test coverage for whatever change I made.

But with the benefit of your comment and some more thinking, I have a proposal here that I'm more confident in. I'm going to open a separate PR for it (to be considered in advance of this one).

Collaborator Author

#1030. After this PR is merged / rebased to include that, these impls will need to be stubbed based on cfg(descriptive-gate), because this PR removes the last place we invoke share upgrades in the protocol (modulus conversion). They'll need to come back later when we add MAC-based malicious security for PRF evaluation and shuffle.

@andyleiserson andyleiserson mentioned this pull request Apr 22, 2024

pub type AggResult<const B: usize> = Result<BitDecomposed<Replicated<Boolean, B>>, Error>;

pub async fn aggregate_values<'fut, C, OV, const B: usize>(
Member

It might be nice to add a test for just this aggregate_values function.

Comment on lines +127 to +132
let bit_futures = index_contribution
.zip(repeat((bit_of_bdkey, bucket_c.clone())))
.enumerate()
.map(|(i, (a, (b, bucket_c)))| {
a.multiply(b, bucket_c.narrow(&BitOpStep::Bit(i)), record_id)
});
Collaborator

This got a lot more complex looking.

Does it no longer work to multiply a multi-bit value with a single bit? That's too bad. Could we migrate from BitDecomposed to a BooleanArray and use select instead?

Doesn't need to be in this PR - this PR is too long already =).

Collaborator Author

The issue here is that we're working with a 2-D array. The type BitDecomposed<AdditiveShare<Boolean, N>> actually does contain a boolean array -- the vector type in AdditiveShare<Boolean, N> is the boolean array type of length N.

We could flatten everything into a boolean array of size N * HV_BITS, but then we'd have to do a bunch of 2-D array index calculations, which I think would make things even more complicated.

I do like the idea of using select, though. I'll see if I can do something with that.
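
As context for the select idea: select picks one of two shared values based on a shared bit. A cleartext sketch of the semantics (illustrative only, not the ipa-core implementation), which shows why it costs one multiplication (AND) per output bit when evaluated on shares:

// select(c, t, f) = f + c * (t - f); over F_2 this is f XOR (c AND (t XOR f)).
fn select_bits(condition: bool, if_true: &[bool], if_false: &[bool]) -> Vec<bool> {
    if_true
        .iter()
        .zip(if_false)
        .map(|(&t, &f)| f ^ (condition & (t ^ f)))
        .collect()
}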

Collaborator Author

After looking at it some, I don't think it makes sense to do in this PR... it's a trade of 100 lines elsewhere for 10 lines here. That may be the right tradeoff if it improves the modularity of the code, but I don't think it's the right tradeoff if those 100 lines are getting added to this PR.

The concrete options I see are:

  • Implementing BooleanArrayMul for BitDecomposed<AdditiveShare<Boolean, N>>. This in turn requires either adding some more Foo: SecureMul<C> bounds in various places, or else adding a C type parameter to the BooleanArrayMul trait.
  • Converting select into a trait so we can implement it for arbitrary types.

Collaborator

I'm totally supportive of landing this PR as is and working on this in a future PR.

Collaborator Author

My plan is to defer this and saturating addition to a follow-up. I am working on the (rather gruesome) merge conflicts presently, but I'm hoping this can be ready tonight or tomorrow.

Comment on lines +143 to +157
// bdbit_contribution is either zero or equal to row_contribution. So it
// is okay to do a carryless "subtraction" here.
for (r, b) in row_contribution[left_index]
.iter_mut()
.zip(bdbit_contribution.iter())
{
*r -= b;
}
if right_index < breakdown_count {
row_contribution[right_index] = bdbit_contribution;
for (r, b) in row_contribution[right_index]
.iter_mut()
.zip(bdbit_contribution)
{
*r = b;
}
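
To spell out why the carryless subtraction is safe: with boolean ($\mathbb{F}_2$) shares, subtraction is a bitwise XOR with no borrow propagation, and since bdbit_contribution is either zero or exactly equal to row_contribution[left_index], each bit position computes either $r \oplus 0 = r$ (left bucket unchanged) or $r \oplus r = 0$ (the contribution is cleared from the left bucket).
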
Collaborator

This part also seems like it could be simplified if we stopped using BitDecomposed and used BooleanArray instead.

Collaborator

@benjaminsavage benjaminsavage left a comment

This is a very long PR... I really wish it was split into smaller, more digestible chunks...

I mostly focused my review on the protocols. It's pretty awesome how you were able to eliminate the conversions to the prime field. I LOVE the way you are able to just chunk streams now (very nice!), and the highly efficient bitwise addition that only deals with as many digits as actually necessary is great.

This PR actually does about 1/2 the work I was trying to do in the feature_label_dot_product protocol... I should have read this PR first before starting to implement!

So I'm keen to get this landed so that I can build upon this work. I know @bmcase is as well.

The only concern I have is with the wrapping addition (not saturating). I would suggest a follow-up PR to just make it saturating.

Comment on lines +236 to +238
if a.len() < usize::try_from(OV::BITS).unwrap() {
sum.push(carry);
}
Collaborator

So this implements wrapping addition, basically addition modulo 2^OV::BITS. This could potentially yield unexpected results (like zero) when the sum wraps around.

An alternative would be to implement a "saturating addition".

We could implement that with:

let sum = if /* it could potentially overflow on this step */ {
    integer_sat_add(/* stuff */).await?
} else {
    let (mut sum, carry) = integer_add(/* stuff */).await?;
    sum.push(carry);
    sum
};
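
For intuition, the cleartext difference between the two behaviors, shown with Rust's primitive integer methods (illustrative; the protocol computes the same thing on shares with boolean circuits):

fn main() {
    let a: u8 = 200;
    let b: u8 = 100;
    assert_eq!(a.wrapping_add(b), 44);    // 300 mod 2^8 wraps around to a small value
    assert_eq!(a.saturating_add(b), 255); // saturating addition clamps at u8::MAX
}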

Collaborator

do we want to move this circuit to a separate function? Seeing wrapping_integer_add or sat_integer_add here makes the intent more obvious imo

Collaborator Author

The old version didn't have saturating math either, but I agree that now that we can do it relatively cheaply, it's probably worth doing. Is there any interaction with DP?

Collaborator

The reason we never had such logic before is that we were moving into a huge field, and assumed we would never encounter a sum that reached 2^32. Now that we wrap around at a configurable value, it's possible to select a number that's fairly low.

I think it might be fine to just document this behavior and warn everyone to please select an output type that's sufficiently large so that there is no overflow. That would keep the code simpler. I don't think there's any impact on DP. If there is wrapping, it will essentially decrease the overall contribution of some people to the query output. But it's not predictable. Some people will get their contribution removed and others will not. So if we take a "worst case" scenario type of approach, we probably can't say anything definitive and cannot benefit from this.

None => {
// Input stream ended at a chunk boundary. Unlike the partial chunk case, we return
// None, so we shouldn't be polled again regardless of dummy_fn signalling, but
// might as well make this a fused stream since it's easy.
Collaborator

Yeah, shouldn't we just fuse the inner stream? I haven't looked closely into what would happen, but it appears that it may work too. Maybe there is a better argument name for dummy_fn too (padder?)

Collaborator Author

The case where fusing the inner stream doesn't work very well is when we terminate due to an error -- we'd have to drain it.

/// Process stream through a function that operates on chunks.
///
/// Processes `stream` by collecting chunks of `N` items into `buffer`, then calling `process_fn`
/// for each chunk. If there is a partial chunk at the end of the stream, `dummy_fn` is called
Collaborator

is it enough for our use case to require T: Default?

Collaborator Author

Yes, I considered that -- it would require implementing Default on several layers, down to AdditiveShare. There is also a semantic requirement in some places this is used that isn't exactly Default -- the dummy records need to contribute nothing to the overall result.

I think I would slightly prefer the version with Default, but I didn't want to presume to make that change given the impact.
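
For illustration, a cleartext sketch of the chunk-and-pad idea using the futures crate (a hypothetical helper, not the actual process_stream_by_chunks API); the zero padding plays the role of the dummy records and must contribute nothing to the result:

use futures::{Stream, StreamExt};

// Buffer the stream into fixed-size chunks, padding the trailing partial chunk with
// zeros so the per-chunk function always sees exactly N items.
async fn sum_in_chunks<const N: usize>(items: impl Stream<Item = u64>) -> Vec<u64> {
    items
        .chunks(N)
        .map(|mut chunk| {
            chunk.resize(N, 0); // stand-in for dummy_fn: pad with no-op records
            chunk.iter().sum::<u64>() // stand-in for process_fn on a full chunk
        })
        .collect()
        .await
}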

get_bits::<Fp32BitPrime>(breakdown_key.try_into().unwrap(), Gf8Bit::BITS);
let value = Fp32BitPrime::truncate_from(VALUE);
async fn move_to_bucket(count: usize, breakdown_key: usize, robust: bool) -> Vec<BA8> {
let breakdown_key_bits = BitDecomposed::decompose(Gf8Bit::BITS, |i| {
Collaborator

nit: it may be worth explaining why an 8-bit value is enough here. (I believe it is related to MAX_BREAKDOWN_COUNT, and if that's true, maybe a static assertion is possible too)
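
A static assertion along the lines suggested could look like the following (hypothetical sketch; the MAX_BREAKDOWN_COUNT value and placement are illustrative, not the actual ipa-core definitions):

const MAX_BREAKDOWN_COUNT: usize = 256; // assumed value, for illustration only
// Fail compilation if an 8-bit breakdown key cannot index every breakdown bucket.
const _: () = assert!(MAX_BREAKDOWN_COUNT <= 1 << 8);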

.set_total_records(num_outputs),
flattened_stream,
0..<FV as SharedValue>::BITS,
let flattened_stream = Box::pin(
Collaborator

Does this now work w/o multithreading, or do we still rely on it being enabled for aggregation?

Collaborator Author

I believe it is still required, perhaps even more so. We can add the buffered stream adapter if we really want things to work without multi-threading.

32 => oprf_ipa::<C, BA8, BA3, BA20, BA5, F>(ctx, input, aws).await,
64 => oprf_ipa::<C, BA8, BA3, BA20, BA6, F>(ctx, input, aws).await,
128 => oprf_ipa::<C, BA8, BA3, BA20, BA7, F>(ctx, input, aws).await,
8 => oprf_ipa::<C, BA8, BA3, HV, BA20, BA3, 256>(ctx, input, aws).await,
Collaborator

Is it worth having a generic histogram value parameter with everything else being hardcoded, or am I misunderstanding what's going on here?

Collaborator Author

@andyleiserson andyleiserson Apr 29, 2024

I think that's a fair question -- I was modeling it after what was already there (HV is the bitwise equivalent of F). Eventually all of these (maybe excluding SS) need to be runtime-configurable (#953). I don't have a good sense of how similar that implementation will look to the type-generic version.

Collaborator

Yeah, this is an unsolved problem. We never made anything runtime configurable. We just re-built the code for every different combination of query params =).

@andyleiserson
Collaborator Author

andyleiserson commented Apr 30, 2024

Capturing some open items:

  • Test for aggregate_values
  • Saturating addition (edit: deferred to follow-up PR)
  • dummy_fn changes (Default? rename?)
  • Aggregation transpose test

@andyleiserson
Collaborator Author

Double-posting a comment I made in a thread here for better visibility:

My plan is to defer this [using select in move_value_to_bucket] and saturating addition to a follow-up. I am working on the (rather gruesome) merge conflicts presently, but I'm hoping this can be ready tonight or tomorrow.

@benjaminsavage
Collaborator

I think it's fine to defer the saturating addition as well.

(They are unused for the moment now that modulus conversion is gone, but they will come back at some point.)
@andyleiserson andyleiserson merged commit 656ec18 into private-attribution:main May 2, 2024
11 checks passed
@andyleiserson andyleiserson deleted the vec-agg branch May 2, 2024 18:02
@andyleiserson
Collaborator Author

#1046 has the promised follow-up changes.
