The page table has several flaws, which needs to be addressed at some point.
Issues with the current design
Problem 1: The length of the page table is poorly defined
The page table is a table of (offset, length) pairs pointing to pages in the file. It is written as an flat array, where each field has it’s pages written contiguously.
<field 0 page 0>...<field 0 page N><field 1 page 0>...<field 1 page N>...
The page table is pointed to by a single offset in the manifest: Metadata.page_table_position. This presents the first problem: The length of the page table is poorly defined. The Lance reader needs to infer the correct length of the page table for a given file. Right now, it takes the minimum of the field ids in the schema, the maximum of the field ids, and subtracts those to get the number of fields. Then it multiplies by the number of batches to get the total size of the array. If anything in this calculation is wrong, then its easy to accidentally read past the end of the file, or just read invalid data.
Problem 2: The page table assumes field ids are contiguous and always contain buffers
In Lance, schemas may start out with a contiguous set of field ids that start at 0. However, as schemas evolve, columns may be dropped and holes will develop in the field ids.
Some field types in Lance don’t contain buffers. For example, struct fields themselves don’t contain any buffers. Their children do, but those are separate fields. But because every field needs to have entries in the page table, their entries are filled with (0, 0) (zero offset, zero length).
The page table assumes field ids are contiguous and always contain buffers. In practice, it often needs to fill the table with zeros to handle these fields, wasting significant space.
Problem 3: The page table assumes all fields have same number of pages
The page table assumes that each column has the same number of pages and that pages in each position are equal in length. This same assumption is made in other parts of the format, such as the layout of the page statistics.
However, this presents a problem for optimal performance in Lance: for small columns, such as primitive ones, the optimal page size is large. There is a 5x performance difference between pages of size 10240 and 1024 when scanning. Meanwhile, for large columns, it’s better to have smaller pages, both for scan performance and to reduce the amount of data that must be buffered.
De-coupling the page sizes of different columns would involve more than just changes to the page table, but the current design of the page table notably does not allow this.
Proposed solution
First, make the Metadata also store the length of the page table. This will make it more robust to detect the size of it.
Second, reformat the page table as a protobuf message, so that it can be adapted in the future. Each of the entries should also store the specific field_id it corresponds to, removing the guess work of which field they correspond to.
message PageTable {
repeated FieldPages fields = 1;
}
message FieldPages {
int32 field_id = 1;
repeated pages PageInfo = 2;
}
message PageInfo {
int64 offset = 1;
int64 length = 2;
}
The structure of the messages means this can be larger than the flag encoding, however this is offset by the fact that:
- We will no longer be filing zeros for fields that have no pages (struct fields and missing fields)
- We can compress this message when serializing, which may bring additional IO savings.
For now, this could have the constraint that each of the fields must have the same number of pages. But it does allow for that to be lifted in the future.
The page table has several flaws, which needs to be addressed at some point.
Issues with the current design
Problem 1: The length of the page table is poorly defined
The page table is a table of
(offset, length)pairs pointing to pages in the file. It is written as an flat array, where each field has it’s pages written contiguously.The page table is pointed to by a single offset in the manifest:
Metadata.page_table_position. This presents the first problem: The length of the page table is poorly defined. The Lance reader needs to infer the correct length of the page table for a given file. Right now, it takes the minimum of the field ids in the schema, the maximum of the field ids, and subtracts those to get the number of fields. Then it multiplies by the number of batches to get the total size of the array. If anything in this calculation is wrong, then its easy to accidentally read past the end of the file, or just read invalid data.Problem 2: The page table assumes field ids are contiguous and always contain buffers
In Lance, schemas may start out with a contiguous set of field ids that start at 0. However, as schemas evolve, columns may be dropped and holes will develop in the field ids.
Some field types in Lance don’t contain buffers. For example, struct fields themselves don’t contain any buffers. Their children do, but those are separate fields. But because every field needs to have entries in the page table, their entries are filled with
(0, 0)(zero offset, zero length).The page table assumes field ids are contiguous and always contain buffers. In practice, it often needs to fill the table with zeros to handle these fields, wasting significant space.
Problem 3: The page table assumes all fields have same number of pages
The page table assumes that each column has the same number of pages and that pages in each position are equal in length. This same assumption is made in other parts of the format, such as the layout of the page statistics.
However, this presents a problem for optimal performance in Lance: for small columns, such as primitive ones, the optimal page size is large. There is a 5x performance difference between pages of size 10240 and 1024 when scanning. Meanwhile, for large columns, it’s better to have smaller pages, both for scan performance and to reduce the amount of data that must be buffered.
De-coupling the page sizes of different columns would involve more than just changes to the page table, but the current design of the page table notably does not allow this.
Proposed solution
First, make the
Metadataalso store the length of the page table. This will make it more robust to detect the size of it.Second, reformat the page table as a protobuf message, so that it can be adapted in the future. Each of the entries should also store the specific
field_idit corresponds to, removing the guess work of which field they correspond to.The structure of the messages means this can be larger than the flag encoding, however this is offset by the fact that:
For now, this could have the constraint that each of the fields must have the same number of pages. But it does allow for that to be lifted in the future.