Added support to read and use bloom filters #99

Merged
merged 1 commit into from Mar 16, 2022

jorgecarleitao
Owner

@jorgecarleitao jorgecarleitao commented Mar 16, 2022

This PR adds support for reading bloom filters and using them to skip row groups. See more details here, as well as the addition to the example, where a bloom filter from a parquet file is used to check whether a value is (not) in a column of a row group.

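Bloom filters are essentially an improvement over statistics for skipping row groups: statistics can only exclude values outside a row group's min/max range, while a bloom filter can also exclude values inside that range.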
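As a rough illustration of how a bloom filter complements statistics when skipping row groups, here is a minimal, dependency-free sketch. It is not the API added by this PR: `SplitBlockFilter`, `RowGroup`, `hash_value` and `row_groups_to_read` are hypothetical names, and std's `DefaultHasher` stands in for the xxHash64 that the Parquet format specifies. The block layout (256-bit blocks of eight salted words) follows the Parquet split-block bloom filter specification as I understand it.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Salt constants as listed in the Parquet split-block bloom filter spec.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

// Assumption: stand-in hash. Parquet specifies xxHash64 (seed 0) instead.
fn hash_value(value: i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

// Hypothetical filter: a sequence of 256-bit blocks of eight u32 words.
struct SplitBlockFilter {
    blocks: Vec<[u32; 8]>,
}

impl SplitBlockFilter {
    fn new(num_blocks: usize) -> Self {
        assert!(num_blocks > 0, "at least one block is required");
        Self { blocks: vec![[0u32; 8]; num_blocks] }
    }

    // The upper 32 bits of the hash pick the block.
    fn block_index(&self, hash: u64) -> usize {
        (((hash >> 32) * self.blocks.len() as u64) >> 32) as usize
    }

    // The lower 32 bits pick one bit in each of the eight words.
    fn mask(hash: u64) -> [u32; 8] {
        let key = hash as u32;
        let mut mask = [0u32; 8];
        for (word, salt) in mask.iter_mut().zip(SALT.iter()) {
            *word = 1u32 << (key.wrapping_mul(*salt) >> 27);
        }
        mask
    }

    fn insert(&mut self, hash: u64) {
        let index = self.block_index(hash);
        let mask = Self::mask(hash);
        for (word, bit) in self.blocks[index].iter_mut().zip(mask.iter()) {
            *word |= *bit;
        }
    }

    // `false` means definitely absent; `true` means possibly present.
    fn might_contain(&self, hash: u64) -> bool {
        let block = &self.blocks[self.block_index(hash)];
        let mask = Self::mask(hash);
        block.iter().zip(mask.iter()).all(|(word, bit)| (word & bit) != 0)
    }
}

// Hypothetical per-row-group summary of a single i64 column.
struct RowGroup {
    min: i64,
    max: i64,
    filter: SplitBlockFilter,
}

impl RowGroup {
    fn from_values(values: &[i64]) -> Self {
        let mut filter = SplitBlockFilter::new(1);
        for value in values {
            filter.insert(hash_value(*value));
        }
        Self {
            min: values.iter().copied().min().expect("non-empty row group"),
            max: values.iter().copied().max().expect("non-empty row group"),
            filter,
        }
    }

    // Statistics can only exclude values outside [min, max].
    fn skip_by_statistics(&self, value: i64) -> bool {
        value < self.min || value > self.max
    }

    // The filter can also exclude in-range values that were never inserted.
    fn skip_by_bloom_filter(&self, value: i64) -> bool {
        !self.filter.might_contain(hash_value(value))
    }
}

// Indices of the row groups that still have to be read for `column == value`.
fn row_groups_to_read(row_groups: &[RowGroup], value: i64) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| !rg.skip_by_statistics(value) && !rg.skip_by_bloom_filter(value))
        .map(|(index, _)| index)
        .collect()
}
```

For row groups built from `[1, 50, 100]` and `[200, 300]`, a query for `50` only needs the first row group. A query for `42` passes the statistics check of the first row group, since it lies within its min/max range, so only the bloom filter has a chance of ruling it out, subject to the filter's false-positive rate.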

This PR also adds the algorithms needed to write bloom filters, but does not (yet) provide an API to do so. Bloom filters are usually created while serializing pages, which requires a larger change: we currently only receive compressed pages, from which we cannot build a bloom filter.
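To make the constraint concrete, here is a hedged sketch of where hashing would have to happen on the write path. `serialize_page` and `hash_value` are hypothetical names, not functions from this crate, and std's `DefaultHasher` again stands in for xxHash64. The point is only that hashes must be collected while the plain values are still visible, before any compression is applied.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Assumption: stand-in hash. Parquet specifies xxHash64 (seed 0) instead.
fn hash_value(value: i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

// Hypothetical page serializer: plain-encodes the values as little-endian
// bytes and records one hash per value in the same pass.
fn serialize_page(values: &[i64], hashes: &mut Vec<u64>) -> Vec<u8> {
    let mut buffer = Vec::with_capacity(values.len() * 8);
    for value in values {
        // The value is only available here; once the buffer is compressed,
        // it could not be hashed without decompressing and decoding again.
        hashes.push(hash_value(*value));
        buffer.extend_from_slice(&value.to_le_bytes());
    }
    buffer
}
```

The collected hashes could then be inserted into the column's bloom filter once the row group is complete, while the returned buffer goes on to compression.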

Close #98

@jorgecarleitao jorgecarleitao added the enhancement New feature or request label Mar 16, 2022

codecov-commenter commented Mar 16, 2022

Codecov Report

Merging #99 (3e033fc) into main (60f279c) will increase coverage by 0.58%.
The diff coverage is 98.55%.

@@            Coverage Diff             @@
##             main      #99      +/-   ##
==========================================
+ Coverage   67.50%   68.08%   +0.58%     
==========================================
  Files          69       72       +3     
  Lines        3840     3910      +70     
==========================================
+ Hits         2592     2662      +70     
  Misses       1248     1248              
| Impacted Files | Coverage Δ |
|---|---|
| src/lib.rs | 78.71% <ø> (+0.08%) ⬆️ |
| src/bloom_filter/split_block.rs | 97.82% <97.82%> (ø) |
| src/bloom_filter/hash.rs | 100.00% <100.00%> (ø) |
| src/bloom_filter/mod.rs | 100.00% <100.00%> (ø) |
| src/statistics/binary.rs | 0.00% <0.00%> (-5.56%) ⬇️ |
| src/types.rs | 50.90% <0.00%> (+3.63%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao jorgecarleitao merged commit e499570 into main Mar 16, 2022
dantengsky pushed a commit to datafuse-extras/parquet2 that referenced this pull request Apr 1, 2022
@jorgecarleitao jorgecarleitao added feature A new feature and removed enhancement New feature or request labels Apr 15, 2022
@jorgecarleitao jorgecarleitao deleted the bloom_filter branch April 15, 2022 07:50
Development

Successfully merging this pull request may close these issues.

Added support for bloom filters
2 participants