
feat: use hyperloglog for cardinality estimation for dictionary encoding #2555

Open — niyue wants to merge 1 commit into main
Conversation

@niyue (Contributor) commented Jul 3, 2024

Currently, to determine whether dictionary encoding should be applied, we use a HashSet for exact cardinality counting. However, I believe perfect accuracy isn't necessary in this context, so we could use HyperLogLog for a rough cardinality estimate instead, which saves memory and may speed up the cardinality check.

  • HyperLogLog uses a fixed amount of memory, determined by the precision set in the code, while the HashSet uses roughly threshold × per_item_string_size bytes; if individual items are large, the HashSet may use a non-trivial amount of memory.
  • HyperLogLog has an error rate (about 1.56%, derived from precision 12), while the HashSet is exact.
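To make the trade-off concrete, here is a minimal, illustrative HyperLogLog sketch using only the Rust standard library (this is a simplified stand-in, not the implementation or crate actually used in this PR; the struct and method names are mine). Each item is hashed once, the top `p` bits pick one of `2^p` single-byte registers, and the register keeps the maximum leading-zero rank seen, so memory stays fixed at `2^p` bytes regardless of item size:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal illustrative HyperLogLog sketch (not the PR's actual code).
struct HyperLogLog {
    p: u32,             // precision: number of index bits
    registers: Vec<u8>, // 2^p one-byte registers, fixed memory
}

impl HyperLogLog {
    fn new(p: u32) -> Self {
        Self { p, registers: vec![0u8; 1 << p] }
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h = hasher.finish();
        // Top `p` bits select the register; the rest feed the rank.
        let idx = (h >> (64 - self.p)) as usize;
        let rank = (h << self.p).leading_zeros() as u8 + 1;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let sum: f64 = self
            .registers
            .iter()
            .map(|&r| 2f64.powi(-(r as i32)))
            .sum();
        let alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant for m >= 128
        let raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        let zeros = self.registers.iter().filter(|&&r| r == 0).count() as f64;
        if raw <= 2.5 * m && zeros > 0.0 {
            m * (m / zeros).ln()
        } else {
            raw
        }
    }
}
```

With `p = 12` this uses 4 KiB of registers no matter how many or how large the inserted strings are, which is where the memory savings over a HashSet of owned strings comes from.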

github-actions bot commented Jul 3, 2024

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.


Review comment on `let num_total_rows = arrays.iter().map(|arr| arr.len()).sum::<usize>();`
@niyue (Contributor, Author) commented Jul 3, 2024

I'm not sure whether I'm misunderstanding something, but num_total_rows doesn't seem very useful, since the unique_values calculated below is guaranteed to be less than or equal to num_total_rows, so I simply removed it.

A reviewer (Contributor) commented:

We can short-circuit the check when num_total_rows <= get_dict_encoding_threshold(), but I guess it's also fine to just remove it.

A reviewer (Contributor) commented:

I think @westonpace suggested that we want strict inequality (#2409 (comment))

@niyue (Contributor, Author) commented:

OK, got it. I added a short-circuit check now, so that empty arrays are handled correctly (previously an empty array would be dictionary encoded, which I think is not expected, since we don't even dictionary encode arrays smaller than the threshold).

I also added several more unit tests to verify the behavior for this part.
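The short-circuit described above might look roughly like the following sketch (the function, closure parameter, and threshold constant are all hypothetical names for illustration, not the PR's actual code). Returning early when the total row count is at or below the threshold both skips the cardinality estimation work and makes empty input trivially not dictionary encoded:

```rust
// Hypothetical threshold value for illustration only.
const DICT_ENCODING_THRESHOLD: usize = 100;

/// Sketch of the decision: short-circuit small/empty inputs, otherwise
/// compare an (estimated) unique count against the threshold.
fn should_dict_encode<F>(array_lengths: &[usize], estimate_unique: F) -> bool
where
    F: FnOnce() -> usize,
{
    let num_total_rows: usize = array_lengths.iter().sum();
    // Short circuit: empty or smaller-than-threshold inputs are never
    // dictionary encoded, so skip cardinality estimation entirely.
    if num_total_rows <= DICT_ENCODING_THRESHOLD {
        return false;
    }
    // Strict inequality, per the review discussion above.
    estimate_unique() < DICT_ENCODING_THRESHOLD
}
```

Passing the unique-count computation as a closure makes the short circuit visible in the types: the (possibly expensive) estimation simply never runs for small inputs.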

@codecov-commenter commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.00%. Comparing base (aadb38e) to head (e8f8c37).
Report is 5 commits behind head on main.

| Files | Patch % | Lines |
|-------|---------|-------|
| rust/lance-encoding/src/encoder.rs | 98.11% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #2555      +/-   ##
==========================================
- Coverage   80.01%   80.00%   -0.01%
==========================================
  Files         209      209
  Lines       59821    59883      +62
  Branches    59821    59883      +62
==========================================
+ Hits        47866    47911      +45
- Misses       9132     9144      +12
- Partials     2823     2828       +5
```

| Flag | Coverage Δ |
|------|------------|
| unittests | 80.00% <98.11%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.


@niyue niyue changed the title Use hyperloglog for cardinality estimation when checking if dictionary encoding could be applied or not feat: Use hyperloglog for cardinality estimation when checking if dictionary encoding could be applied or not Jul 3, 2024
@github-actions github-actions bot added the enhancement New feature or request label Jul 3, 2024
@niyue (Contributor, Author) commented Jul 3, 2024

I wrote a very simple test case to check whether the performance of create_array_encoder is affected, using code similar to test_jumbo_string:

```rust
#[test_log::test(tokio::test)]
async fn test_creating_string_encoder() {
    let timer = std::time::Instant::now();
    let mut string_builder = StringBuilder::new();
    // a 1 MiB string
    let giant_string = String::from_iter((0..(1024 * 1024)).map(|_| '0'));
    for _ in 0..1000 {
        string_builder.append_option(Some(&giant_string));
    }
    let giant_array = Arc::new(string_builder.finish()) as ArrayRef;
    let arrs = vec![giant_array];
    let data_ready_duration = timer.elapsed();
    println!(
        "Time elapsed in data preparation is: {:?}",
        data_ready_duration
    );
    let encoding_strategy = CoreArrayEncodingStrategy {};
    let maybe_encoder = encoding_strategy.create_array_encoder(&*arrs);
    assert!(maybe_encoder.is_ok());
    let duration = timer.elapsed() - data_ready_duration;
    println!("Time elapsed in creating array encoder is: {:?}", duration);
}
```

I ran the test against both the HashSet implementation and the HyperLogLog implementation, and create_array_encoder is roughly 10% faster after switching to HLL in my local environment (from 350ms to 315ms).

@niyue (Contributor, Author) commented Jul 3, 2024

@westonpace I looked into dictionary encoding today and found that its cardinality calculation could be improved, so I submitted this PR. Could you please help review? Thanks.

@niyue niyue changed the title feat: Use hyperloglog for cardinality estimation when checking if dictionary encoding could be applied or not feat: use hyperloglog for cardinality estimation for dictionary encoding Jul 3, 2024
Labels: enhancement (New feature or request)
4 participants