Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

P-54 Binary search for holding time assertion #2401

Merged

Conversation

grumpygreenguy
Copy link
Contributor

@grumpygreenguy grumpygreenguy commented Jan 11, 2024

Context

Reduce the number of data provider requests needed to implement the "holding time" assertion.

Resolves P-54

How (Optional)

The basic idea is to do a binary search over the range of relevant dates, testing every relevant address on each iteration, and discarding any addresses that prove to be irrelevant, that is, where we know for sure that the longest holding time is not held by that address.

An address is proven irrelevant if:

  • The data provider query returns false for that address on a given date
  • The data provider query returns true for another address on the same date

Note on Error Handling

Since the goal is to find the earliest holding date (and not necessarily to identify the corresponding address), it is OK to ignore certain errors as long as the search can proceed. In a nutshell, the search can only get "stuck" (i.e. we don't know whether to choose the earlier or the later half of the date range) if:

  • No data provider query returns true
  • At least one query returned an error.

In that case, the actual value of the failed query may have been true or false, so we don't know in which direction the search should continue.

If at least one query returned true, we know that the search continues in the earlier half; at worst we are failing to remove some irrelevant addresses that will be discarded in the subsequent step when the query returns false for them.

TODO

  • Pass CI

@grumpygreenguy grumpygreenguy added the rust Pull requests that update Rust code label Jan 11, 2024
Copy link

linear bot commented Jan 11, 2024

Copy link
Collaborator

@Kailai-Wang Kailai-Wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, the intention is quite clear!

Have you considered partition_point? It applies the binary search under the hood too.

We can treat the date array as "sorted"/ partitioned because holding_time(index) == true implies holding_time(index + 1) == true (but not vice versa). If that's not the case in practice, then data provider malfunctions and you won't get the right result anyway.

Comment on lines 193 to 200
if outcomes.iter().any(is_positive) {
let new_accounts = accounts
.into_iter()
.zip(outcomes.iter())
.filter_map(|(account, outcome)| (!is_negative(outcome)).then_some(account))
.collect();
return (Ok(true), new_accounts)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge it into the accounts.iter() traverse above? I feel it should be possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but I'm not sure it's worth it. We only do the filtering if any of the queries returned true in that iteration; if we want to merge the two loops we'd have to speculatively build the filtered array in advance, and I think that would make the logic messier to follow. In terms of runtime, I don't think the added loop to construct the filtered list makes much of a difference; what dominates here is in any case the query provider requests.

@grumpygreenguy
Copy link
Contributor Author

@Kailai-Wang

Thank you, the intention is quite clear!

Thanks!

Have you considered partition_point? It applies the binary search under the hood too.

Didn't know about it; will have a look! Thanks

We can treat the date array as "sorted"/ partitioned because holding_time(index) == true implies holding_time(index + 1) == true (but not vice versa). If that's not the case in practice, then data provider malfunctions and you won't get the right result anyway.

Exactly; that's the entire justification for the search algorithm.

@Kailai-Wang
Copy link
Collaborator

partition_point is basically a binary search by predicate, applying it would make the code simpler (so better readability) and less error-prone (as it's a library func) IMO

@grumpygreenguy
Copy link
Contributor Author

partition_point is basically a binary search by predicate, applying it would make the code simpler (so better readability) and less error-prone (as it's a library func) IMO

On further look, there's a problem with error handling -- the predicate arg to partition_point must return bool, but the predicate depends on data provider queries which could fail, so the search logic needs to be able to handle that (at the very least, to bubble the error without panicking)

@Kailai-Wang
Copy link
Collaborator

We should be able to use additional flag/storage to tackle that 🤔

@grumpygreenguy
Copy link
Contributor Author

We should be able to use additional flag/storage to tackle that 🤔

Care to elaborate on that? Or, do you have a link with some info? Because from the API docs (and some cursory search) I haven't found any way to work around that (without reimplementing the search algorithm) 🤔

@grumpygreenguy grumpygreenguy marked this pull request as ready for review January 23, 2024 12:04
Copy link
Collaborator

@Kailai-Wang Kailai-Wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partition_point might produce more readable code, but let's leave it to another PR even if that's the case

@outofxxx
Copy link
Contributor

Hi, @grumpygreenguy, It's good to have this try, thanks. and i have a question, how much can requests be reduced in the worst-case scenario after this optimization?

@grumpygreenguy
Copy link
Contributor Author

@zhouhuitian

Hi, @grumpygreenguy, It's good to have this try, thanks. and i have a question, how much can requests be reduced in the worst-case scenario after this optimization?

The complexity goes from linear in the number of dates to logarithmic in the same; specifically, if we're checking for a single address, the number of requests in the worst case goes from 15 to 5.
If there are multiple addresses to check, the complexity is still in the worst case linear in the number of addresses as well, but depending on how quickly the search can spot and eliminate irrelevant addresses, the cost of each iteration can be drastically reduced. Would need to have a better idea of the distribution of holding times between multiple accounts of real users to make an estimate of the average runtime, tho.

Comment on lines 243 to 248
let mut pred = |date: &&str| {
let (outcome, new_accounts) =
holding_time_search_step(&mut client, q_min_balance, accounts, date);
holding_time_search_step(&mut client, q_min_balance, accounts.clone(), *date);
accounts = new_accounts;
outcome.map(|is_holding| !is_holding) // negated to match the partition_point API
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Kailai-Wang for fixing this issue! Still gotta wrap my head around some of the subtleties here, apparently :D

In this case the extra clone shouldn't make a huge difference overall, since the cost is in any case dominated by the HTTP requests; in general though it does feel a bit wasteful to have to clone the array each time, only to replace the original anyway after the call to holding_time_search_step. What would (in theory) have been the better approach here? (Not for this PR to be sure, just as an overall learning)

@grumpygreenguy grumpygreenguy enabled auto-merge (squash) January 26, 2024 13:15
@grumpygreenguy grumpygreenguy merged commit f120d87 into dev Jan 26, 2024
32 checks passed
@grumpygreenguy grumpygreenguy deleted the p-54-reduce-number-of-requests-in-holder-assertion-building branch January 27, 2024 10:02
kziemianek pushed a commit that referenced this pull request Jan 28, 2024
Co-authored-by: Kailai Wang <kailai.wang@trustcomputing.de>
Co-authored-by: Zhouhui Tian <zhouhui@liteng.io>
Kailai-Wang added a commit that referenced this pull request Jan 30, 2024
* bitacross init

* add ci fmt

* fmt

* comment out docker upload job

* comment out docker image based jobs

* more adjustments

* remove parachain dependency on tee (#2433)

* adjust crate name

* P-54 Binary search for holding time assertion (#2401)

Co-authored-by: Kailai Wang <kailai.wang@trustcomputing.de>
Co-authored-by: Zhouhui Tian <zhouhui@liteng.io>

* clippy fix

---------

Co-authored-by: Zhouhui Tian <125243011+zhouhuitian@users.noreply.github.com>
Co-authored-by: Ariel Birnbaum <ariel@litentry.com>
Co-authored-by: Kailai Wang <kailai.wang@trustcomputing.de>
Co-authored-by: Zhouhui Tian <zhouhui@liteng.io>
Co-authored-by: Kai <7630809+Kailai-Wang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rust Pull requests that update Rust code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants