
feat: add nullability and u64 support to list codec #2255

Merged · 8 commits · Apr 30, 2024

Conversation

westonpace (Contributor)

This adds support for nulls in lists, and adds support for lists that contain more than i32::MAX items (note: this is a lot more common than it sounds, since we write offset pages much more slowly than item pages)

westonpace (Contributor Author)

Leaving in draft as I still need:

  • Tests for u64 (in particular, test the case where the decode returns a shorter-than-expected array; I'm pretty certain this doesn't work)
  • Revisit comments in the list encoder

@westonpace marked this pull request as ready for review April 25, 2024 14:34
// contain `base + len(list) + first_invalid_offset` where `base`
// is defined the same as above.
//
// When reading a list at index i only only needs the offsets at
Contributor

Is "only only" a typo?

Contributor Author

Good catch, fixed.

wjones127 (Contributor) left a comment

Some initial questions. I don't think I understand the new offsets scheme.

Comment on lines 83 to 86
// If the list at index i is not null then offsets[i] will
// contain `base + len(list)` where `base` is defined as:
// i == 0: 0
// i > 0: (offsets[i-1] % first_invalid_offset)
Contributor

IIUC, the process for generating these indices is:

(Example array: [[1, 2], [], [3], null, [4, 5, 6]])

  1. Compute the sizes of each array ([2, 0, 1, 0, 3])
  2. Transform the sizes to cumulative sum ([2, 2, 3, 3, 7])
  3. Choose a value for first_invalid_offset that's greater than any existing offset (e.g. 100)
  4. Add the first_invalid_offset to any slot that is null ([2, 2, 3, 103, 7])

Is that correct?

And then on the read side, seems like we can do:

  1. If i == 0, get the offset.
    1. If the offset is first_invalid_offset, then the slot is null
    2. Otherwise, the range for the list is 0, offsets[0].
  2. If i > 0, get the offsets at i and i - 1.
    1. If offsets[i] >= first_invalid_offset, then the slot is null
    2. Otherwise, the range is offsets[i - 1] % first_invalid_offset, offsets[i]

Contributor

It would be nice to have an example.

Contributor Author

> Compute the sizes of each array ([2, 0, 1, 0, 3])
> Transform the sizes to cumulative sum ([2, 2, 3, 3, 7])

Yes. If there are no nulls then this is exactly Arrow's list encoding, except that the leading 0 has been removed (it's redundant, though I could be convinced to add it back; the complexity is about the same either way, so I figured we might as well save a few bytes).

That being said, I think you want [2, 2, 3, 3, 6], as the last list has 3 items, not 4. The Arrow offsets would be [0, 2, 2, 3, 3, 6].

> Choose a value for first_invalid_offset that's greater than any existing offset (e.g. 100)

No. You can't pick just any invalid value. It needs to be the last valid offset + 1. In other words, given the example of [2, 2, 3, 3, 7], it must be 8.

> Add the first_invalid_offset to any slot that is null ([2, 2, 3, 103, 7])

Yes.

> And then on the read side, seems like we can do:

Yes, more or less.

> Otherwise, the range for the list is 0, offsets[0].

Not if offsets[0] >= first_invalid_offset, which will happen if the first list is null.

> Otherwise, the range is offsets[i - 1] % first_invalid_offset, offsets[i]

It's not exactly %. I thought it was originally, but I was wrong. If we used 7 as first_invalid_offset then it would be %, but unfortunately we can't do that, because then we couldn't tell whether the last list is null or not. Thanks for pointing this out. I will clean this up. It should be...

if offsets[i - 1] >= first_invalid_offset { offsets[i - 1] - first_invalid_offset } else { offsets[i - 1] }, offsets[i]
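To make the repair concrete, here is a minimal sketch of this intermediate read-side logic (the function names are illustrative, not the PR's actual API; the thread later replaces this with a single %). Since a null slot stores its running end offset plus first_invalid_offset, subtracting the marker back out recovers the underlying offset:

```rust
/// Sketch of the read-side repair at this stage of the design
/// (illustrative names). A null slot stores `end + first_invalid_offset`,
/// so stripping the marker recovers the underlying end offset.
fn repair(offset: u64, first_invalid_offset: u64) -> u64 {
    if offset >= first_invalid_offset {
        offset - first_invalid_offset
    } else {
        offset
    }
}

/// Item range (start, end) for the list at index `i`; offsets have no leading 0.
fn list_range(offsets: &[u64], first_invalid_offset: u64, i: usize) -> (u64, u64) {
    let start = if i == 0 {
        0
    } else {
        repair(offsets[i - 1], first_invalid_offset)
    };
    (start, repair(offsets[i], first_invalid_offset))
}
```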

Contributor Author

> It would be nice to have an example.

I will add one. There is an example in the decoder that I walk through but I agree we need one here too.

Contributor Author

Now I'm second guessing my answer above. Let me step through real quick.

Contributor Author

Ok. The reason I was thinking it needed to be the first invalid offset (and not any arbitrary invalid offset) is that we sometimes want to know the number of items referenced by the list before we've actually loaded any offsets (for scheduling purposes). I think, however, there are other ways the scheduler could get this information (e.g. pages have a num_rows, so the list scheduler could add up the num_rows on all the assigned item pages), so this may not be a valid requirement any more.

However, from a performance perspective, there is another reason we want to use the first invalid offset: by using it we force the range of encoded values into [0, 2N], which means we are much more likely to get good results from bit packing.
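For illustration (the numbers here are hypothetical): with N = 1,000,000 items, valid offsets fall in [0, 1,000,000] and null-marked offsets in [1,000,001, 2,000,001], so every stored value fits in 21 bits. If nulls were instead marked with an arbitrary large sentinel (say, something near u64::MAX), the same page would need the full 64 bits per offset.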

Also, I see that in some places I call it the last_valid_offset and in others I call it the first_invalid_offset 🤦. Those are not even the same thing. I will clean this up as well.

Contributor Author

And I can see how "first valid offset" might be construed as "first offset where a list item is valid (not null)". Maybe I will call it "null offset adjustment" instead

Contributor Author

At this point it's probably easiest to ignore everything I said in this entire thread 😆. I've added an example to the .proto file, so start there instead.

Comment on lines 105 to 106
// All valid offsets will be less than this value.
uint64 first_invalid_offset = 2;
Contributor

Why is this called first? Isn't the same offset applied to all nulls? I wonder if we can just call this invalid_offset.

Contributor Author

It's basically num_items + 1 but I don't know what to call that. I used first_invalid_offset because this is "the first number that will never appear as an offset in the offsets array".

Contributor Author

This is no longer required (but it is still recommended for performance reasons).

let mut items_end = offsets_values[num_offsets - 1];
// Repair any null value
if items_end > last_valid_offset {
    items_end = items_end - last_valid_offset - 1;
}

Contributor

Why do we subtract 1 here? Wouldn't that make [10, 20, 120, ... into [10, 20, 19, ...?

Contributor Author

The last valid offset is 99, so it is 120 - 99 - 1 = 20.

Contributor Author

This has changed from...

if items_end > last_valid_offset { ... } else { ... }

to

let items_end = offset_values[num_offsets - 1] % null_offset_adjustment;
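A quick sanity check of why the single % suffices (a sketch; the concrete numbers are assumed for illustration): every valid offset is strictly less than null_offset_adjustment, and every null-marked offset is less than twice it, so one modulo leaves valid slots unchanged and strips the marker from null slots.

```rust
// Sketch: one modulo replaces the branch-and-subtract repair.
let null_offset_adjustment: u64 = 7; // num_items + 1 for a 6-item page (assumed)
let valid_end: u64 = 3;
let null_end = valid_end + null_offset_adjustment; // how a null slot is stored
assert_eq!(valid_end % null_offset_adjustment, 3); // valid slot: unchanged
assert_eq!(null_end % null_offset_adjustment, 3); // null slot: marker stripped
```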

westonpace (Contributor Author)

@wjones127 Ok, I think this is ready for another look. The first_invalid_offset / last_valid_offset has become null_offset_adjustment; it is recommended to be num_items + 1, but we no longer rely on that. In addition, I was able to move back to using % and get rid of the if branches and the -1 all over the place. I have also added more details to the .proto around how the encoding works.
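For reference, a minimal end-to-end sketch of the final scheme as summarized here (the function names and the Option-based input are illustrative, not the PR's actual API):

```rust
/// Sketch: encode list offsets (no leading 0). A null list is stored as
/// its running end offset plus `null_offset_adjustment` (num_items + 1).
fn encode_offsets(lists: &[Option<Vec<u64>>]) -> (Vec<u64>, u64) {
    let num_items: u64 = lists.iter().flatten().map(|l| l.len() as u64).sum();
    let null_offset_adjustment = num_items + 1;
    let mut end = 0u64;
    let offsets = lists
        .iter()
        .map(|list| match list {
            Some(items) => {
                end += items.len() as u64;
                end
            }
            None => end + null_offset_adjustment, // null marker, length 0
        })
        .collect();
    (offsets, null_offset_adjustment)
}

/// Sketch: recover (start, end, is_null) for the list at index `i`,
/// repairing null-marked slots with a single modulo.
fn decode_offset(offsets: &[u64], adj: u64, i: usize) -> (u64, u64, bool) {
    let is_null = offsets[i] >= adj;
    let start = if i == 0 { 0 } else { offsets[i - 1] % adj };
    (start, offsets[i] % adj, is_null)
}

fn main() {
    // The thread's example: [[1, 2], [], [3], null, [4, 5, 6]]
    let lists = vec![
        Some(vec![1, 2]),
        Some(vec![]),
        Some(vec![3]),
        None,
        Some(vec![4, 5, 6]),
    ];
    let (offsets, adj) = encode_offsets(&lists);
    assert_eq!(adj, 7); // 6 items + 1
    assert_eq!(offsets, vec![2, 2, 3, 10, 6]); // null slot stores 3 + 7
    assert_eq!(decode_offset(&offsets, adj, 3), (3, 3, true)); // the null list
    assert_eq!(decode_offset(&offsets, adj, 4), (3, 6, false)); // items [4, 5, 6]
}
```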

codecov-commenter

Codecov Report

Attention: Patch coverage is 82.33083%, with 94 lines in your changes missing coverage. Please review.

Project coverage is 80.89%. Comparing base (fb43192) to head (32b66fc).
Report is 6 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| rust/lance-encoding/src/encodings/logical/list.rs | 81.88% | 67 Missing and 8 partials ⚠️ |
| .../lance-encoding/src/encodings/logical/primitive.rs | 80.39% | 10 Missing ⚠️ |
| rust/lance-encoding/src/testing.rs | 73.91% | 6 Missing ⚠️ |
| rust/lance-encoding/src/decoder.rs | 93.18% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2255      +/-   ##
==========================================
- Coverage   81.17%   80.89%   -0.29%     
==========================================
  Files         187      190       +3     
  Lines       54598    55866    +1268     
  Branches    54598    55866    +1268     
==========================================
+ Hits        44319    45191     +872     
- Misses       7784     8159     +375     
- Partials     2495     2516      +21     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 80.89% <82.33%> (-0.29%) ⬇️ |


wjones127 (Contributor) left a comment

Thank you for providing the example in the docs. Makes much more sense now.

@westonpace merged commit 098f730 into lancedb:main Apr 30, 2024
17 checks passed
@wjones127 added the enhancement (New feature or request) label on May 1, 2024