
feat: update lance file format to support per-page encoding #1857

Closed
westonpace wants to merge 3 commits

Conversation

westonpace (Contributor):

This is a draft proposal for a format that supports:

  • Per-page encodings
  • More flexible encoding descriptions
  • Files without row groups
  • Wide schemas

There is no rush to merge this in. I am beginning to create a file reader and writer that will support this format as part of adding support for nullability. Until the reader/writer are ready these format changes are useless on their own. However, I am creating this PR as a proposal for initial discussion.


ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

Comment on lines 225 to 227
// This message is optional. If a file is not self-describing then this must
// point to the start of the metadata block (e.g. representing a manifest with
// size 0)
Contributor:

What would we think of making the FileDescriptor not optional for Lance v2? IMO it would make the format easier to debug if individual files were readable independently.

westonpace (Contributor Author):

That's fine with me. It shouldn't be too large and readers don't have to read it so I don't see any downside to making it mandatory.

westonpace (Contributor Author):

I've changed the language to reinforce that this is required in v2 files.

Comment on lines 183 to 186
// The file offset to the start of the page data
uint64 offset = 3;
// The size (in bytes) of the page data
uint64 size = 4;
Contributor:

Is this redundant? Could we just say the offset is the first buffer_offsets? And the size is the sum of buffer_sizes?
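The derivation suggested here could be sketched as follows. This is a hypothetical helper, not format code: it assumes a page's buffers are laid out contiguously in file order, which is exactly the assumption that would make the explicit offset/size fields redundant.

```python
def page_extent(buffer_offsets: list[int], buffer_sizes: list[int]) -> tuple[int, int]:
    """Derive a page's (offset, size) from its buffers.

    Assumes the buffers are contiguous in the file with no padding between
    them; under that assumption the page starts at the first buffer's offset
    and spans the sum of the buffer sizes.
    """
    offset = buffer_offsets[0]
    size = sum(buffer_sizes)
    return offset, size
```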

westonpace (Contributor Author):

Good point, this is redundant.

westonpace (Contributor Author):

I've removed these fields.

// | (optional padding) |
// | Page 1, Buffer 0 |
// | ... |
// | Page N2, Buffer N3 |
Contributor:

The labels N2 and N3 are confusing. Perhaps they should be different labels, described in text above. For example:

M columns, N[m] pages for column m in 0..M.

westonpace (Contributor Author):

I'll think this over. I started with variables, but since there were 5 of them I had N, M, X, Y, and Z, so I wasn't sure that was very readable either. I wanted to avoid implying that pages have the same number of buffers or that there are the same number of metadata pages as there are column pages, etc.

westonpace (Contributor Author):

I added a header explaining the lettering a bit and changed to something kind of like your suggestion. What do you think?

Comment on lines +171 to +185
// These offsets might point to the column's data section or they
// might point to the column's metadata section.
//
Contributor:

What goes in the data vs. the metadata section?

westonpace (Contributor Author):

Data section is data that is unique to a page. Metadata section is data that is common across all pages in a column. E.g. if you wanted to have a single dictionary for all dictionary encoded pages in a column you could put it in the metadata section.
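The distinction described here could be modeled roughly as below. This is purely an illustrative data model with hypothetical names (`PageInfo`, `ColumnInfo`), not the actual format messages: per-page buffers live in the data section, while buffers shared by all pages (such as a single dictionary) live in the column's metadata section and are referenced by index.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class PageInfo:
    """Data unique to one page; its bytes live in the data section."""
    data_offset: int
    data_size: int
    # Index into the column's shared (metadata-section) buffers, if any,
    # e.g. a dictionary shared by all dictionary-encoded pages.
    dictionary_buffer: Optional[int] = None

@dataclass
class ColumnInfo:
    """Data common to all pages; its buffers live in the metadata section."""
    shared_buffers: List[Tuple[int, int]] = field(default_factory=list)  # (pos, size)
    pages: List[PageInfo] = field(default_factory=list)
```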

// arrays is equal to the number of pages for the column. The number of
// statistics pages may be smaller than the number of column pages (in most
// cases there will only be one page of statistics)
message ColumnStatistics {
Contributor:

Where do these go?

westonpace (Contributor Author):

The buffers will be in the metadata section.

westonpace (Contributor Author):

I realized I forgot an actual reference to this message. I've added that in ColumnMetadata:

  repeated Page pages = 1;
  ColumnStatistics statistics = 2;

Comment on lines 42 to 46
// | Column 0 Metadata Position |
// | Column 0 Metadata Size |
// | ... |
// | Column N5 Metadata Position |
// | Column N5 Metadata Size |
Contributor:

This is the "column metadata offsets array"?

westonpace (Contributor Author):

Yes

Comment on lines 25 to 27
// +------------------------------+
// | Page 0, Buffer 0 |
// | ... |
Contributor:

Might be nice to label some sections:

(https://asciiflow.com/ is useful for this)

 // ┌───────────────────────────────┐
 // │Data Pages                     │
 // │  Page 0, Buffer 0             │
 // │  ...                          │
 // │  Page 0, Buffer N             │
 // │  (optional padding)           │
 // │  Page 1, Buffer 0             │
 // │  ...                          │
 // │  Page N2, Buffer N3           │
 // │  (optional padding)           │
 // ├───────────────────────────────┤
 // │Column Metadata                │
 // │  Column 0 Metadata            │
 // │  Column 0 Metadata Buffer 0   │
 // │  ...                          │
 // │  Column 0 Metadata Buffer N4  │
 // │  (optional padding)           │
 // │  Column 1 Metadata            │
 // │  ...                          │
 // │  Column N5 Metadata Buffer N6 │
 // │  (optional padding)           │
 // ├───────────────────────────────┤
 // │Column Metadata Offset Table   │
 // │  Column 0 Metadata Position   │
 // │  Column 0 Metadata Size       │
 // │  ...                          │
 // │  Column N5 Metadata Position  │
 // │  Column N5 Metadata Size      │
 // ├───────────────────────────────┤
 // │FileDescriptor                 │
 // ├───────────────────────────────┤
 // │Metadata                       │
 // ├───────────────────────────────┤
 // │Footer                         │
 // │  i64: Metadata position       │
 // │  u16: Major version           │
 // │  u16: Minor version           │
 // │  "LANC"                       │
 // └───────────────────────────────┘
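The footer at the bottom of this diagram could be parsed like so. This is only a sketch under stated assumptions: little-endian integers and a footer that is exactly i64 + u16 + u16 + 4-byte magic (16 bytes); the real layout is defined by the proto comments, not this code.

```python
import struct

MAGIC = b"LANC"

def parse_footer(file_bytes: bytes) -> tuple[int, int, int]:
    """Parse the trailing 16-byte footer sketched in the diagram above.

    Returns (metadata_position, major_version, minor_version).
    Assumes little-endian layout: i64 + u16 + u16 + 4-byte magic.
    """
    footer = file_bytes[-16:]
    metadata_position, major, minor, magic = struct.unpack("<qHH4s", footer)
    if magic != MAGIC:
        raise ValueError(f"not a Lance file: bad magic {magic!r}")
    return metadata_position, major, minor
```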

westonpace (Contributor Author):

Added labels

@westonpace westonpace changed the title feat: Lance file format v2 feat: update lance file format to support per-page encoding Jan 30, 2024
@westonpace westonpace marked this pull request as ready for review January 30, 2024 18:59
Comment on lines +324 to +331
// The start of the column metadata section
//
// If column projection is not needed, or if the goal is to cache all column
// metadata, then this field can be used to quickly load the entire column metadata
// section in a single read without referring to the column metadata offsets array
//
// This field is ignored in Lance version 1 files
uint64 column_metadata_start = 7;
Contributor:

This gives us the start, how do we know where the end is?

Contributor:

I suppose the intention was to have it end at manifest_position - (16 * num_columns)

Comment on lines +304 to +323
// The number of columns in the file (including inner columns when there
// are nested fields)
//
// This can be used to access the column metadata offsets array which is
// stored immediately before manifest_position.
//
// Given N columns the column metadata positions and sizes are stored in a
// contiguous buffer of 2*N uint64 values immediately preceding the file
// descriptor (or the metadata if the file is not self describing).
//
// If we let `column_offsets_pos` be:
// manifest_position - (16 * num_columns)
//
// Then the metadata for column x starts at the uint64:
// file[column_offsets_pos + (16 * x)]
// The size of the metadata for column x is given by the uint64:
// file[column_offsets_pos + (16 * x) + 8]
//
// This field is ignored in Lance version 1 files (the page table is used instead)
uint32 num_columns = 6;
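The arithmetic described in this comment could be sketched as follows. A hypothetical reader helper, assuming the offsets array is stored as little-endian u64 pairs and that `manifest_position` and `num_columns` were already read from the file's trailing metadata:

```python
import struct

def read_column_metadata_locations(file_bytes: bytes, manifest_position: int,
                                   num_columns: int) -> list[tuple[int, int]]:
    """Return (position, size) of each column's metadata.

    The offsets array is 2 * num_columns u64 values (16 bytes per column)
    stored immediately before the manifest / file descriptor.
    """
    column_offsets_pos = manifest_position - (16 * num_columns)
    locations = []
    for x in range(num_columns):
        entry = file_bytes[column_offsets_pos + 16 * x:
                           column_offsets_pos + 16 * x + 16]
        position, size = struct.unpack("<QQ", entry)
        locations.append((position, size))
    return locations
```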
Contributor:

I think this solves problem 1 described in #1809, but I'm not sure if it solves problem 2. Could you describe how x is derived? Do we skip fields with children that have no buffers of their own (struct, fsl in the future)?

westonpace (Contributor Author):

I think there are two parts:

  • Gaps in field ids

I'm not sure yet why we needed to fill these gaps in the old approach. It's very possible I'm missing something. I was thinking, when we read the file, the column metadata at index X corresponds to the field at index X (in dfs-pre-order) of the schema in the file descriptor. Field IDs are a table concern and not something that the file format bothers with (other than storing and loading them faithfully)

  • Transient columns (structs / lists)

It's probably true that lists will never have a validity buffer (since we can always use negative values as sentinels) but we still need a flag in the metadata telling us if we need to post-process the sentinels or not. Also, (non-fixed-size) lists do have an offsets buffer, and there are many different ways that buffer could be encoded.

Struct might have a nullability bitmap or they might not. If the page doesn't have any nulls we can skip the buffer. If the page has nulls but doesn't have any all-null-structs then we can use a sentinel. If the page has nulls and all-null-structs then we need a bitmap.

We could also encode structs using a sparse bitmap like they do in procella. In this case there is one validity bitmap buffer AND an offsets buffer per child.

The same goes with fixed-size-list. There may be a need for a validity buffer or we may choose to do some kind of sparse encoding of the FSL (basically turning it into a variable size list where zero-size lists represent null) if most of the data is null.
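The struct validity decision described two paragraphs up (skip the buffer, use a sentinel, or use a bitmap) could be sketched as below. The names are hypothetical, and the inputs are assumed to be per-page counts already gathered by the writer:

```python
from enum import Enum

class StructValidity(Enum):
    NONE = "no validity buffer"     # page has no nulls at all
    SENTINEL = "sentinel encoding"  # nulls, but no all-null structs
    BITMAP = "validity bitmap"      # all-null structs present

def choose_struct_validity(num_nulls: int, num_all_null_structs: int) -> StructValidity:
    """Pick a validity representation for one page of a struct column."""
    if num_nulls == 0:
        return StructValidity.NONE
    if num_all_null_structs == 0:
        return StructValidity.SENTINEL
    return StructValidity.BITMAP
```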

Contributor:

I'm not sure yet why we needed to fill these gaps in the old approach.

In the old way, the page table is represented in memory as a BTreeMap<FieldId, (Offset, Length)>. That BTreeMap is written to disk as an array, where the ith element has field id field_ids_offset + i. So as-is, the page table is coupled to the field ids.

I was thinking, when we read the file, the column metadata at index X corresponds to the field at index X (in dfs-pre-order) of the schema in the file descriptor.

This is fine, we just need to be really specific about what the dfs-pre-order is. Right now FSL doesn't have child fields, and thus the current page table ignores them, even though it does have entries for structs. However, later we want to change schemas to include child fields of FSL, so they are aligned with Arrow schemas. We can either do that now, in our definition of the dfs-pre-order, or find some compatible way to make that change in the future.

I think my main concern at this point is the transition plan for FSL children. The status quo that we have column metadata for lists and structs makes sense.

westonpace (Contributor Author):

I think I follow your concern better now.

The pragmatic answer to your question is that the only thing that matters for nesting is struct columns. FSL, regular lists, and maps do not have "depth" in the proposal (the protobuf encoding is recursive and does have depth but this is true even for primitive columns).

The outside-the-box new-age answer is that the file format does not have any depth at all. There is just a flat list of columns. "nesting" is an in-memory / arrow concept and we have to map that concept onto the file format in the reader/writer. For example, we could choose to map arrow struct arrays into a single "packed" column. We could choose to map arrow struct arrays into a single "shredded" column (the list of children, instead of being at the "column" level, would be instead nested into the encoding). Both of these approaches are probably bad ideas since they prevent any kind of nested projection pushdown (though the "packed struct" may still be useful in certain cases).

We could even nest other things. We could create a special kind of FSL, let's call it a tuple, where we encode each index into its own "lance column". This would allow us to do nested tuple projection. Although, again, being practical, it would be less work for everyone else (e.g. arrow, etc.) if we implemented such a thing as a struct.

Comment on lines +314 to +315
// If we let `column_offsets_pos` be:
// manifest_position - (16 * num_columns)
Contributor:

I wonder if we should just have a direct offset here. That would make it possible to have another optional section later.

westonpace (Contributor Author):

I'm not sure I understand.

wjones127 (Contributor), Jan 30, 2024:

As in column_offsets_pos should be an actual field, and not just an implicit computed one. Then we could add a new section between the column metadata and the file metadata.

More generally, just feeling a bit squeamish about having to make these kind of computations to get the offset and size of a section.

westonpace (Contributor Author):

Ok, I get it now. I'll add that in.

wjones127 (Contributor) left a comment:

Would like to see one example of a schema (with struct, list, fsl) and how that translates to what num_columns means and what each value 0..num_columns references.

// Null values are present but the bytes are garbage and should be
// ignored.
// * Validity - A boolean array representing whether each value is null or not
message MaskedEncoding {
Contributor:

BTW are you going to mention Sentinel encoding?

westonpace (Contributor Author):

Closing in favor of #1965

@westonpace westonpace closed this Feb 16, 2024