feat: use hyperloglog for cardinality estimation for dictionary encoding #2555
base: main
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, inspect the "PR Title Check" action.
```rust
let num_total_rows = arrays.iter().map(|arr| arr.len()).sum::<usize>();
```
I am not sure if I misunderstand, but `num_total_rows` doesn't seem very useful: the `unique_values` calculated below is guaranteed to be less than or equal to `num_total_rows`, so I simply removed it.
We could short-circuit the check when `num_total_rows` <= `get_dict_encoding_threshold()`, but I guess it's also fine to just remove it.
I think @westonpace suggested that we want strict inequality (#2409 (comment))
Ok, got it. I added a short-circuit check now so that empty arrays are handled correctly (previously an empty array would be dictionary-encoded, which I think is unexpected, since we don't even dictionary-encode arrays smaller than the threshold).
I also added several more unit tests to verify the behavior of this part.
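For context, here is a minimal sketch of the short-circuit discussed above, using the `get_dict_encoding_threshold` name from the review comments; the surrounding code in the PR is structured differently, and the threshold value here is a placeholder.

```rust
/// Placeholder threshold for illustration only; the real value comes
/// from the encoder's configuration.
fn get_dict_encoding_threshold() -> usize {
    100
}

/// Short-circuit: unique_values <= num_total_rows always holds, so if
/// the total row count doesn't exceed the threshold, the cardinality
/// can't either, and there is no point estimating it. The strict
/// inequality also means empty arrays are never dictionary-encoded.
fn worth_checking_cardinality(num_total_rows: usize) -> bool {
    num_total_rows > get_dict_encoding_threshold()
}
```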
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2555      +/-   ##
==========================================
- Coverage   80.01%   80.00%    -0.01%
==========================================
  Files         209      209
  Lines       59821    59883      +62
  Branches    59821    59883      +62
==========================================
+ Hits        47866    47911      +45
- Misses       9132     9144      +12
- Partials     2823     2828       +5
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed from c5154d9 to df33daf
I wrote a very simple test case to verify the performance of creating the string encoder:

```rust
// Assumed imports for the snippet below; the exact module path of
// CoreArrayEncodingStrategy inside lance_encoding may differ.
use std::sync::Arc;

use arrow_array::{builder::StringBuilder, ArrayRef};
use lance_encoding::encoder::{ArrayEncodingStrategy, CoreArrayEncodingStrategy};

#[test_log::test(tokio::test)]
async fn test_creating_string_encoder() {
    let timer = std::time::Instant::now();
    let mut string_builder = StringBuilder::new();
    // a 1 MiB string
    let giant_string = String::from_iter((0..(1024 * 1024)).map(|_| '0'));
    for _ in 0..1000 {
        string_builder.append_option(Some(&giant_string));
    }
    let giant_array = Arc::new(string_builder.finish()) as ArrayRef;
    let arrs = vec![giant_array];
    let data_ready_duration = timer.elapsed();
    println!(
        "Time elapsed in data preparation is: {:?}",
        data_ready_duration
    );
    let encoding_strategy = CoreArrayEncodingStrategy {};
    let maybe_encoder = encoding_strategy.create_array_encoder(&*arrs);
    assert!(maybe_encoder.is_ok());
    let duration = timer.elapsed() - data_ready_duration;
    println!("Time elapsed in creating array encoder is: {:?}", duration);
}
```

I ran the test for …
@westonpace I looked into dictionary encoding today and found that its cardinality calculation could be improved, so I submitted this PR. Could you please help review it? Thanks.
Force-pushed from df33daf to 10ac87f
…y encoding should be applied or not.
Force-pushed from 10ac87f to e8f8c37
Currently, to determine whether dictionary encoding should be applied, we use a `HashSet` for exact cardinality calculation. However, I believe perfect accuracy isn't necessary in this context, so we could use HyperLogLog for a rough cardinality estimate instead, which might save memory and potentially speed up the cardinality check.

- `HyperLogLog` uses a fixed amount of memory, determined by the precision set in the code. The `HashSet` uses roughly `threshold * each_item_string_size` of memory, so if individual items are large, the `HashSet` may consume a non-trivial amount of memory.
- `HyperLogLog` has an error rate (1.56%, translated from precision 12), while the `HashSet` is exact.
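For illustration, here is a minimal sketch of the HyperLogLog-based check, assuming the `hyperloglogplus` crate; the exact crate, API, and wiring into the encoder used by this PR may differ. At precision 12 the sketch has 2^12 = 4096 registers, and the relative error is roughly 1/√4096 ≈ 1.56%, which is where the figure above comes from.

```rust
use std::collections::hash_map::RandomState;

use hyperloglogplus::{HyperLogLog, HyperLogLogPlus};

/// Estimate the number of distinct strings with a fixed-size sketch.
/// Memory stays at ~4096 registers no matter how large the strings are,
/// whereas a HashSet must keep every distinct value it has seen.
fn estimate_cardinality(values: &[&str]) -> f64 {
    let mut hll: HyperLogLogPlus<str, RandomState> =
        HyperLogLogPlus::new(12, RandomState::new()).unwrap();
    for v in values {
        hll.insert(*v);
    }
    hll.count()
}

/// Hypothetical decision helper: dictionary-encode only when the
/// estimated cardinality is below the threshold. A ~1.56% error is
/// acceptable for a yes/no encoding decision.
fn should_dict_encode(values: &[&str], threshold: f64) -> bool {
    estimate_cardinality(values) < threshold
}
```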