Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend query support in leveled #433

Open
martinsumner opened this issue Apr 11, 2024 · 3 comments
Open

Extend query support in leveled #433

martinsumner opened this issue Apr 11, 2024 · 3 comments

Comments

@martinsumner
Copy link
Owner

Currently in Riak/leveled query support is based on range queries with the application of regular expressions to allow filtering of additional attributes which have been concatenated to the term.

The following extensions are proposed:

  • Allow for a capture_filter where a conditions can be applied to a captured output, in particular >, >=, <, <= comparisons. These are very tricky to achieve in regular expressions - so instead capture in the regular expression and apply an additional capture filter to the captured output if it exists.
  • Extend the return options, so in the future there could be return_terms, return_count, return_count_unique and return_group_count. The return_count would return the count of matches (without de-duplication on key), whereas return_count_unique would return the count of unique key matches. The return_group_count would return the count by unique items in a given set of captured outputs (i.e. the regular expression could capture YearOfBirth and Status - and return the count by Status and YearOfBirth).
  • Allow for a pre-query, where the results returned from the main query are only those that have a key that is returned from the pre-query, and where the pre-query and main query are run on the same snapshot.

The dependency on regular expressions raises potential issues with regards to performance. Regular expression queries should be added to perf_SUITE. There is a proposal to change the erlang re library to google re2 in OTP 28, with some potentially significant performance benefits. It may be worth experimenting with a RE2 NIF in preparation.

@martinsumner
Copy link
Owner Author

#434 - extends performance testing to understand the impact of potential changes.

For regex queries, within this test the time to apply the regular expression dominates:

Profile regex_query:

FUNCTION                                          CALLS        %      TIME  [uS / CALLS]
--------                                          -----  -------      ----  [----------]
prim_file:pread_nif/3                              5938     0.51    109603  [     18.46]
leveled_penciller:'-find_nextkeys/6-fun-1-'/1   5111790     0.89    190579  [      0.04]
leveled_codec:'-accumulate_index/2-fun-2-'/6    5276501     1.77    380060  [      0.07]
leveled_codec:maybe_accumulate/5                5441603     1.87    401815  [      0.07]
maps:update_with/3                              5276892     2.09    449656  [      0.09]
zstd:quick_decompress/1                          105536     4.66   1002559  [      9.50]
erlang:binary_to_term/1                          105536     8.25   1775847  [     16.83]
leveled_penciller:find_nextkeys/6              22701169     9.57   2059858  [      0.09]
re:run/2                                        5276501    67.90  14619748  [      2.77]
---------------------------------------------  --------  -------  --------  [----------]
Total:                                         52969030  100.00%  21531323  [      0.41]

The regex query test will filter out and return 1:170 results. On M1 Mac:

Fetch of 125073 index entries by regex in 423 queries took 31764 ms

Thats an average of 75ms per query to return about 300 results having scanned 50K entries.

@martinsumner
Copy link
Owner Author

martinsumner commented Apr 12, 2024

Comparison if re2 is plugged into leveled:

Profile regex_query:

FUNCTION                                          CALLS        %      TIME  [uS / CALLS]
--------                                          -----  -------      ----  [----------]
leveled_penciller:'-find_nextkeys/6-fun-1-'/1   4986634     1.96    196044  [      0.04]
leveled_codec:'-accumulate_index/2-fun-2-'/6    5147292     3.70    370345  [      0.07]
leveled_codec:maybe_accumulate/5                5308330     4.04    404005  [      0.08]
maps:update_with/3                              5147672     4.25    425160  [      0.08]
zstd:quick_decompress/1                          102708    10.27   1027339  [     10.00]
erlang:binary_to_term/1                          102708    17.00   1700697  [     16.56]
leveled_penciller:find_nextkeys/6              20524284    18.39   1839717  [      0.09]
re2:run/2                                       5147292    33.67   3368657  [      0.65]
---------------------------------------------  --------  -------  --------  [----------]
Total:                                         49960989  100.00%  10003893  [      0.20]

So re2 at 0.65 microseconds per call is 75% faster than PCRE regex in OTP 26.

Fetch of 125252 index entries by regex in 428 queries took 20453 ms

Thats an average of 48ms per query to return about 300 results having scanned 50K entries.

@martinsumner
Copy link
Owner Author

#435 - switches to use a NIF of. the re2 library taken from https://github.com/dukesoferl/re2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant