Replace the page table #1809
Comments
Originally, storing the page table directly on disk was so that we could quickly do an O(1) 16-byte disk read to fetch one offset. Can we take into account that the new design should be friendly to very sparse takes (i.e., open a file with 1000+ groups and take 1 row out) as well? Another minor thing is whether we store the offset and length contiguously, like:

```protobuf
Field {
  int32 field_id = 1;
  repeated int64 offset = 2;
  repeated int64 length = 3;
}
```

When we have a fixed …
That's good to know. I hadn't considered that part, since I didn't see that in the codebase yet.
Btw, you might want to consider compression and nullability support as well. Depending on the final design, there might be another level below the "page" level needed to get decent random IOs. cc @westonpace
Yes, I don't think we can replace the page table unless we know what our plan is for nullability and encodings. |
The page table has several flaws that need to be addressed at some point.
Issues with the current design
Problem 1: The length of the page table is poorly defined
The page table is a table of (offset, length) pairs pointing to pages in the file. It is written as a flat array, where each field has its pages written contiguously. The page table is pointed to by a single offset in the manifest: Metadata.page_table_position.
This presents the first problem: the length of the page table is poorly defined. The Lance reader needs to infer the correct length of the page table for a given file. Right now, it takes the minimum and maximum of the field ids in the schema and subtracts one from the other to get the number of fields. Then it multiplies by the number of batches to get the total size of the array. If anything in this calculation is wrong, it is easy to accidentally read past the end of the file, or simply read invalid data.
Problem 2: The page table assumes field ids are contiguous and always contain buffers
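This inference can be sketched in Python (a simplified illustration, not Lance's actual reader code; the helper name and the exact off-by-one handling are assumptions):

```python
# Simplified sketch of the length inference described above; this is
# illustrative, not Lance's actual reader code.
def inferred_page_table_bytes(field_ids, num_batches, entry_size=16):
    # Field count is inferred from the spread of field ids in the schema.
    num_fields = max(field_ids) - min(field_ids) + 1
    # Each entry is an (offset, length) pair of two int64s: 16 bytes.
    return num_fields * num_batches * entry_size

# Contiguous ids 0..9 with 4 batches: 10 * 4 * 16 = 640 bytes.
print(inferred_page_table_bytes(range(10), 4))  # 640
# A wrong batch count silently yields a wrong table size, so the
# reader can read past the end of the table or read invalid data.
print(inferred_page_table_bytes(range(10), 5))  # 800
```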
In Lance, schemas may start out with a contiguous set of field ids that start at 0. However, as schemas evolve, columns may be dropped and holes will develop in the field ids.
Some field types in Lance don’t contain buffers. For example, struct fields themselves don’t contain any buffers; their children do, but those are separate fields. Because every field needs to have entries in the page table, entries for these fields are filled with (0, 0) (zero offset, zero length). In other words, the page table assumes field ids are contiguous and always contain buffers, and in practice it often has to be padded with zeros to handle holes and buffer-less fields, wasting significant space.
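To make the waste concrete, here is a rough sketch (the helper and all of the numbers are hypothetical, not taken from a real Lance file) of how holes and buffer-less fields inflate the flat table:

```python
# Rough, hypothetical illustration of zero-fill waste when field ids
# have holes (dropped columns) or belong to buffer-less fields.
def zero_filled_bytes(field_ids, bufferless_ids, num_batches, entry_size=16):
    lo, hi = min(field_ids), max(field_ids)
    all_slots = set(range(lo, hi + 1))
    # Slots that are holes, plus slots for buffer-less fields such as
    # structs, still get (0, 0) entries in the flat page table.
    wasted_slots = (all_slots - set(field_ids)) | set(bufferless_ids)
    return len(wasted_slots) * num_batches * entry_size

# An evolved schema: 40 sparse live ids spread over 0..96, of which
# the first 5 are struct (buffer-less) fields, with 10 batches.
live = [i for i in range(100) if i % 5 < 2]
print(zero_filled_bytes(live, live[:5], 10))  # 9920 bytes of zeros
```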
Problem 3: The page table assumes all fields have same number of pages
The page table assumes that each column has the same number of pages and that pages in each position are equal in length. This same assumption is made in other parts of the format, such as the layout of the page statistics.
However, this presents a problem for optimal performance in Lance: for small columns, such as primitive ones, the optimal page size is large. There is a 5x performance difference between pages of size 10240 and 1024 when scanning. Meanwhile, for large columns, it’s better to have smaller pages, both for scan performance and to reduce the amount of data that must be buffered.
Decoupling the page sizes of different columns would involve more than just changes to the page table, but the current design of the page table notably does not allow it.
Proposed solution
First, make the Metadata also store the length of the page table. This makes the table's size explicit and robust to detect, rather than inferred.
Second, reformat the page table as a protobuf message, so that it can be adapted in the future. Each entry should also store the specific field_id it corresponds to, removing the guesswork about which field it belongs to. The structure of the messages means this can be larger than the flat encoding, but that overhead is offset by no longer needing zero-filled entries for holes and buffer-less fields.
For now, this could have the constraint that each of the fields must have the same number of pages. But it does allow for that to be lifted in the future.
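As a concrete sketch, the proposed table might look something like the following protobuf (message names and field numbers here are illustrative assumptions, not a committed design):

```protobuf
syntax = "proto3";

// Pages for a single field. Storing field_id explicitly means ids
// need not be contiguous, and fields without buffers (e.g. structs)
// can simply be omitted rather than zero-filled.
message FieldPages {
  int32 field_id = 1;
  repeated int64 offset = 2;  // page offsets within the file
  repeated int64 length = 3;  // page lengths, parallel to offset
}

message PageTable {
  repeated FieldPages fields = 1;
}
```

Under this sketch, Metadata would carry both page_table_position and the serialized length of the PageTable message, so readers never have to infer it.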