feat: use hyperloglog for cardinality estimation for dictionary encoding #2555
base: main
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, inspect the "PR Title Check" action.
```rust
let num_total_rows = arrays.iter().map(|arr| arr.len()).sum::<usize>();
```
I am not sure if I misunderstand, but `num_total_rows` doesn't seem very useful: the `unique_values` calculated below is guaranteed to be less than or equal to `num_total_rows`, so I simply removed it.
We could short-circuit the check when `num_total_rows` <= `get_dict_encoding_threshold()`, but I guess it's also fine to just remove it.
I think @westonpace suggested that we want strict inequality (#2409 (comment))
Ok, got it. I added a short-circuit check now so that empty arrays are handled correctly (previously an empty array would be dictionary-encoded, which I think is unexpected, since we don't even dictionary-encode arrays smaller than the threshold).
I also added several more unit tests to verify the behavior of this part.
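For context, here is a minimal sketch of the short-circuit discussed above, using the `get_dict_encoding_threshold` name from the review comments; the surrounding code in the PR is structured differently, and the threshold value here is a placeholder.

```rust
/// Placeholder threshold for illustration only; the real value comes
/// from the encoder's configuration.
fn get_dict_encoding_threshold() -> usize {
    100
}

/// Short-circuit: unique_values <= num_total_rows always holds, so if
/// the total row count doesn't exceed the threshold, the cardinality
/// can't either, and there is no point estimating it. The strict
/// inequality also means empty arrays are never dictionary-encoded.
fn worth_checking_cardinality(num_total_rows: usize) -> bool {
    num_total_rows > get_dict_encoding_threshold()
}
```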
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2555      +/-   ##
==========================================
- Coverage   80.01%   80.00%    -0.01%
==========================================
  Files         209      209
  Lines       59821    59883      +62
  Branches    59821    59883      +62
==========================================
+ Hits        47866    47911      +45
- Misses       9132     9144      +12
- Partials     2823     2828       +5
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed from c5154d9 to df33daf
I wrote a very simple test case to verify the performance of creating the string encoder:

```rust
// Assumed imports for the snippet below; the exact module path of
// CoreArrayEncodingStrategy inside lance_encoding may differ.
use std::sync::Arc;

use arrow_array::{builder::StringBuilder, ArrayRef};
use lance_encoding::encoder::{ArrayEncodingStrategy, CoreArrayEncodingStrategy};

#[test_log::test(tokio::test)]
async fn test_creating_string_encoder() {
    let timer = std::time::Instant::now();
    let mut string_builder = StringBuilder::new();
    // a 1 MiB string
    let giant_string = String::from_iter((0..(1024 * 1024)).map(|_| '0'));
    for _ in 0..1000 {
        string_builder.append_option(Some(&giant_string));
    }
    let giant_array = Arc::new(string_builder.finish()) as ArrayRef;
    let arrs = vec![giant_array];
    let data_ready_duration = timer.elapsed();
    println!(
        "Time elapsed in data preparation is: {:?}",
        data_ready_duration
    );
    let encoding_strategy = CoreArrayEncodingStrategy {};
    let maybe_encoder = encoding_strategy.create_array_encoder(&*arrs);
    assert!(maybe_encoder.is_ok());
    let duration = timer.elapsed() - data_ready_duration;
    println!("Time elapsed in creating array encoder is: {:?}", duration);
}
```

I ran the test for …
@westonpace I looked into dictionary encoding today and found that its cardinality calculation could be improved, so I submitted this PR. Could you please help review it? Thanks.
Force-pushed from df33daf to 10ac87f
…y encoding should be applied or not.
Force-pushed from 10ac87f to e8f8c37
Currently, to determine whether dictionary encoding should be applied, we use a `HashSet` for exact cardinality calculation. However, I believe perfect accuracy isn't necessary in this context, so we could use HyperLogLog for a rough cardinality estimate instead, which might save memory and potentially speed up the cardinality check.

- `HyperLogLog` uses a fixed amount of memory, determined by the precision set in the code. The `HashSet` uses roughly `threshold * each_item_string_size` of memory, so if individual items are large, the `HashSet` may consume a non-trivial amount of memory.
- `HyperLogLog` has an error rate (1.56%, translated from precision 12), while the `HashSet` is exact.
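For illustration, here is a minimal sketch of the HyperLogLog-based check, assuming the `hyperloglogplus` crate; the exact crate, API, and wiring into the encoder used by this PR may differ. At precision 12 the sketch has 2^12 = 4096 registers, and the relative error is roughly 1/√4096 ≈ 1.56%, which is where the figure above comes from.

```rust
use std::collections::hash_map::RandomState;

use hyperloglogplus::{HyperLogLog, HyperLogLogPlus};

/// Estimate the number of distinct strings with a fixed-size sketch.
/// Memory stays at ~4096 registers no matter how large the strings are,
/// whereas a HashSet must keep every distinct value it has seen.
fn estimate_cardinality(values: &[&str]) -> f64 {
    let mut hll: HyperLogLogPlus<str, RandomState> =
        HyperLogLogPlus::new(12, RandomState::new()).unwrap();
    for v in values {
        hll.insert(*v);
    }
    hll.count()
}

/// Hypothetical decision helper: dictionary-encode only when the
/// estimated cardinality is below the threshold. A ~1.56% error is
/// acceptable for a yes/no encoding decision.
fn should_dict_encode(values: &[&str], threshold: f64) -> bool {
    estimate_cardinality(values) < threshold
}
```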