
feat: add nullability and u64 support to list codec #2255

Merged · 8 commits · Apr 30, 2024

Conversation

westonpace (Contributor)

This adds support for nulls in lists, and adds support for lists that contain more than i32::MAX items (note: this is a lot more common than it sounds, since we write offset pages much more slowly than item pages)

westonpace (Contributor Author)

Leaving in draft as I still need:

  • Tests for u64 (in particular, test the case where the decode returns a shorter-than-expected array; I'm pretty certain this doesn't work)
  • Revisit comments in the list encoder

@westonpace marked this pull request as ready for review April 25, 2024 14:34
// contain `base + len(list) + first_invalid_offset` where `base`
// is defined the same as above.
//
// When reading a list at index i only only needs the offsets at
Contributor

Is "only only" a typo?

Contributor Author

Good catch, fixed.

wjones127 (Contributor) left a comment

Some initial questions. I don't think I understand the new offsets scheme.

Comment on lines 83 to 86
// If the list at index i is not null then offsets[i] will
// contain `base + len(list)` where `base` is defined as:
// i == 0: 0
// i > 0: (offsets[i-1] % first_invalid_offset)
Contributor

IIUC, the process for generating these indices is:

(Example array: [[1, 2], [], [3], null, [4, 5, 6]])

  1. Compute the sizes of each array ([2, 0, 1, 0, 3])
  2. Transform the sizes to cumulative sum ([2, 2, 3, 3, 7])
  3. Choose a value for first_invalid_offset that's greater than any existing offset (e.g. 100)
  4. Add the first_invalid_offset to any slot that is null ([2, 2, 3, 103, 7])

Is that correct?

And then on the read side, seems like we can do:

  1. If i == 0, get the offset.
    1. If the offset is first_invalid_offset, then the slot is null
    2. Otherwise, the range for the list is 0, offsets[0].
  2. If i > 0, get the offsets at i and i - 1.
    1. If offsets[i] >= first_invalid_offset, then the slot is null
    2. Otherwise, the range is offsets[i - 1] % first_invalid_offset, offsets[i]

Contributor

It would be nice to have an example.

Contributor Author

> Compute the sizes of each array ([2, 0, 1, 0, 3])
> Transform the sizes to cumulative sum ([2, 2, 3, 3, 7])

Yes. If there are no nulls then this is exactly Arrow's list encoding, except that the leading 0 has been removed (it's redundant, though I could be convinced to add it back; the complexity is about the same either way, so I figured we might as well save a few bytes).

That being said, I think you want [2, 2, 3, 3, 6], as the last list has 3 items, not 4. The Arrow offsets would be [0, 2, 2, 3, 3, 6].

> Choose a value for first_invalid_offset that's greater than any existing offset (e.g. 100)

No. You can't pick just any invalid value. It needs to be the last valid offset + 1. In other words, given the example of [2, 2, 3, 3, 7], it must be 8.

> Add the first_invalid_offset to any slot that is null ([2, 2, 3, 103, 7])

Yes.

> And then on the read side, seems like we can do:

Yes, more or less.

> Otherwise, the range for the list is 0, offsets[0].

Not if offsets[0] >= first_invalid_offset, which will happen if the first list is null.

> Otherwise, the range is offsets[i - 1] % first_invalid_offset, offsets[i]

It's not exactly %. I thought it was originally, but I was wrong. If we used 7 as first_invalid_offset then it would be %, but unfortunately we can't do that, because then we couldn't tell whether the last list is null or not. Thanks for pointing this out. I will clean this up. It should be...

if offsets[i - 1] >= first_invalid_offset { offsets[i - 1] - first_invalid_offset } else { offsets[i - 1] }, offsets[i]
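To make the repair concrete, here is a minimal sketch of this intermediate read-side logic (the function names are illustrative, not the PR's actual API; the thread later replaces this with a single %). Since a null slot stores its running end offset plus first_invalid_offset, subtracting the marker back out recovers the underlying offset:

```rust
/// Sketch of the read-side repair at this stage of the design
/// (illustrative names). A null slot stores `end + first_invalid_offset`,
/// so stripping the marker recovers the underlying end offset.
fn repair(offset: u64, first_invalid_offset: u64) -> u64 {
    if offset >= first_invalid_offset {
        offset - first_invalid_offset
    } else {
        offset
    }
}

/// Item range (start, end) for the list at index `i`; offsets have no leading 0.
fn list_range(offsets: &[u64], first_invalid_offset: u64, i: usize) -> (u64, u64) {
    let start = if i == 0 {
        0
    } else {
        repair(offsets[i - 1], first_invalid_offset)
    };
    (start, repair(offsets[i], first_invalid_offset))
}
```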

Contributor Author

> It would be nice to have an example.

I will add one. There is an example in the decoder that I walk through but I agree we need one here too.

Contributor Author

Now I'm second guessing my answer above. Let me step through real quick.

Contributor Author

Ok. The reason I was thinking it needed to be the first invalid offset (and not any arbitrary invalid offset) is that we sometimes want to know the number of items referenced by the list before we've actually loaded any offsets (for scheduling purposes). I think, however, there are other ways the scheduler could get this information (e.g. pages have a num_rows, so the list scheduler could add up the num_rows on all the assigned item pages), so this may not be a valid requirement any more.

However, from a performance perspective, there is another reason we want to use the first invalid offset: by using it we force the range of encoded values into [0, 2N], which means we are much more likely to get good results from bit packing.
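For illustration (the numbers here are hypothetical): with N = 1,000,000 items, valid offsets fall in [0, 1,000,000] and null-marked offsets in [1,000,001, 2,000,001], so every stored value fits in 21 bits. If nulls were instead marked with an arbitrary large sentinel (say, something near u64::MAX), the same page would need the full 64 bits per offset.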

Also, I see that in some places I call it the last_valid_offset and in others I call it the first_invalid_offset 🤦. Those are not even the same thing. I will clean this up as well.

Contributor Author

And I can see how "first valid offset" might be construed as "first offset where a list item is valid (not null)". Maybe I will call it "null offset adjustment" instead

Contributor Author

At this point it's probably easiest to ignore everything I said in this entire thread 😆. I've added an example to the .proto file, so start there instead.

Comment on lines 105 to 106
// All valid offsets will be less than this value.
uint64 first_invalid_offset = 2;
Contributor

Why is this called first? Isn't the same offset applied to all nulls? I wonder if we can just call this invalid_offset.

Contributor Author

It's basically num_items + 1 but I don't know what to call that. I used first_invalid_offset because this is "the first number that will never appear as an offset in the offsets array".

Contributor Author

This is no longer required (but it is still recommended for performance reasons).

let mut items_end = offsets_values[num_offsets - 1];
// Repair any null value
if items_end > last_valid_offset {
    items_end = items_end - last_valid_offset - 1;
}

Contributor

Why do we subtract 1 here? Wouldn't that make [10, 20, 120, ... into [10, 20, 19, ...?

Contributor Author

The last valid offset is 99, so it is 120 - 99 - 1 = 20.

Contributor Author

This has changed from...

if items_end > last_valid_offset { ... } else { ... }

to

let items_end = offset_values[num_offsets - 1] % null_offset_adjustment;
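A quick sanity check of why the single % suffices (a sketch; the concrete numbers are assumed for illustration): every valid offset is strictly less than null_offset_adjustment, and every null-marked offset is less than twice it, so one modulo leaves valid slots unchanged and strips the marker from null slots.

```rust
// Sketch: one modulo replaces the branch-and-subtract repair.
let null_offset_adjustment: u64 = 7; // num_items + 1 for a 6-item page (assumed)
let valid_end: u64 = 3;
let null_end = valid_end + null_offset_adjustment; // how a null slot is stored
assert_eq!(valid_end % null_offset_adjustment, 3); // valid slot: unchanged
assert_eq!(null_end % null_offset_adjustment, 3); // null slot: marker stripped
```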

westonpace (Contributor Author)

@wjones127 Ok, I think this is ready for another look. The first_invalid_offset / last_valid_offset has become null_offset_adjustment; it is recommended to be num_items + 1, but we no longer rely on that. In addition, I was able to move back to using % and get rid of the if branches and the -1 all over the place. I have also added more details to the .proto around how the encoding works.
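For reference, a minimal end-to-end sketch of the final scheme as summarized here (the function names and the Option-based input are illustrative, not the PR's actual API):

```rust
/// Sketch: encode list offsets (no leading 0). A null list is stored as
/// its running end offset plus `null_offset_adjustment` (num_items + 1).
fn encode_offsets(lists: &[Option<Vec<u64>>]) -> (Vec<u64>, u64) {
    let num_items: u64 = lists.iter().flatten().map(|l| l.len() as u64).sum();
    let null_offset_adjustment = num_items + 1;
    let mut end = 0u64;
    let offsets = lists
        .iter()
        .map(|list| match list {
            Some(items) => {
                end += items.len() as u64;
                end
            }
            None => end + null_offset_adjustment, // null marker, length 0
        })
        .collect();
    (offsets, null_offset_adjustment)
}

/// Sketch: recover (start, end, is_null) for the list at index `i`,
/// repairing null-marked slots with a single modulo.
fn decode_offset(offsets: &[u64], adj: u64, i: usize) -> (u64, u64, bool) {
    let is_null = offsets[i] >= adj;
    let start = if i == 0 { 0 } else { offsets[i - 1] % adj };
    (start, offsets[i] % adj, is_null)
}

fn main() {
    // The thread's example: [[1, 2], [], [3], null, [4, 5, 6]]
    let lists = vec![
        Some(vec![1, 2]),
        Some(vec![]),
        Some(vec![3]),
        None,
        Some(vec![4, 5, 6]),
    ];
    let (offsets, adj) = encode_offsets(&lists);
    assert_eq!(adj, 7); // 6 items + 1
    assert_eq!(offsets, vec![2, 2, 3, 10, 6]); // null slot stores 3 + 7
    assert_eq!(decode_offset(&offsets, adj, 3), (3, 3, true)); // the null list
    assert_eq!(decode_offset(&offsets, adj, 4), (3, 6, false)); // items [4, 5, 6]
}
```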

codecov-commenter

Codecov Report

Attention: Patch coverage is 82.33083%, with 94 lines in your changes missing coverage. Please review.

Project coverage is 80.89%. Comparing base (fb43192) to head (32b66fc).
Report is 6 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| rust/lance-encoding/src/encodings/logical/list.rs | 81.88% | 67 Missing and 8 partials ⚠️ |
| .../lance-encoding/src/encodings/logical/primitive.rs | 80.39% | 10 Missing ⚠️ |
| rust/lance-encoding/src/testing.rs | 73.91% | 6 Missing ⚠️ |
| rust/lance-encoding/src/decoder.rs | 93.18% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2255      +/-   ##
==========================================
- Coverage   81.17%   80.89%   -0.29%     
==========================================
  Files         187      190       +3     
  Lines       54598    55866    +1268     
  Branches    54598    55866    +1268     
==========================================
+ Hits        44319    45191     +872     
- Misses       7784     8159     +375     
- Partials     2495     2516      +21     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 80.89% <82.33%> (-0.29%) ⬇️ |


wjones127 (Contributor) left a comment

Thank you for providing the example in the docs. Makes much more sense now.

@westonpace merged commit 098f730 into lancedb:main Apr 30, 2024
17 checks passed
@wjones127 added the enhancement (New feature or request) label on May 1, 2024