Lance File Format Version 2 (technically v0.3) #1929
The motivation and bigger picture are covered in more detail in #1929. This PR builds on top of #1918 and #1964 to create a new version of the Lance file format. There is still much to do, but this end-to-end MVP should provide the overall structure for the work. It can currently read and write primitive columns and list columns, and it supports some very basic encodings.
Hey @westonpace, I'm intrigued by the v2 format and I'm looking into adding support for general compression. I'd like to explore the possibility of encoding each page's buffer with zstd compression, similar to Arrow IPC's record batch body buffer compression. However, Lance's v2 format seems to offer more flexibility, as different fields may use different page sizes. I've taken a look at the code and glanced over the current implementation. It seems that logical encoders like …
Hello again @niyue :)
You're right that there is a piece missing at the moment. There will need to be some kind of "encoding picker" API that will need to be extensible. This component often calculates some basic statistics to figure out which encoding would be best to apply. For example, encodings like RLE are often only applied if there is a small range of possible values. I think we will also want some kind of mechanism for user configuration but I'm not entirely sure what shape that will take yet (maybe field metadata). For now, I think we can probably choose whether or not to apply general compression based on an environment variable. Then we can hook it into the configuration mechanism later, once it's been developed. So, if the environment variable is set, all value buffers will have general compression applied. If it is not set, no buffers will.
There will be some changes coming up. I had been planning on inviting others to help with encodings a little bit later (in maybe about a month). However, I think the actual API for physical encodings is pretty stable. If you want to make an attempt at adding compression I think it would be fine.
I think you are right. The scheduler will be a bit interesting because we cannot determine the exact range to read when using general compression. So the page scheduler will simply need to load the entire range, and send the requested range to the decoder. Then, the decoder, after it applies the decompression, can select the parts that were asked for.

Longer term (can be a future PR) I would like a general compression encoding to be able to utilize a metadata buffer for a skip table. For example, even though we have an 8MB page we can compress it in 32KB chunks. We can record the number of values per chunk in the encoding description (e.g. if this is int32 we would have 8K values per chunk). For each chunk we can record the compressed size of the chunk. This would give us 256 sizes which should all be 16-bit values. We can then store this 512-byte buffer in one of the column metadata buffers. Then, during scheduling, if the user is asking for a specific row or small range of rows we can use this metadata buffer to figure out exactly which chunks we need to load, reducing the I/O for a small (512 byte) metadata cost.

There are some pieces needed (the ability to store metadata buffers) that are not yet ready for this longer term feature. I will be working on pushdown filtering soon and I expect the pieces we need will get developed then. However, I wanted to share where my thinking was on this.
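The skip-table arithmetic above can be checked with a small worked example. This is a sketch under the numbers from the comment (8 MiB page, 32 KiB chunks, int32 values); `chunks_for_rows` and the uniform chunk sizes are illustrative, not Lance APIs.

```rust
// Numbers from the comment above: an 8 MiB page of int32 values,
// compressed in 32 KiB chunks.
const PAGE_BYTES: usize = 8 * 1024 * 1024;
const CHUNK_BYTES: usize = 32 * 1024;
const VALUE_BYTES: usize = 4; // int32
const VALUES_PER_CHUNK: usize = CHUNK_BYTES / VALUE_BYTES; // 8192 values

// Map a requested row range to the chunk indices that must be loaded
// (illustrative helper, not a Lance API).
fn chunks_for_rows(first_row: usize, last_row: usize) -> (usize, usize) {
    (first_row / VALUES_PER_CHUNK, last_row / VALUES_PER_CHUNK)
}

fn main() {
    let num_chunks = PAGE_BYTES / CHUNK_BYTES; // 256 chunks
    // One u16 compressed size per chunk => a 512-byte skip table.
    assert_eq!(VALUES_PER_CHUNK, 8192);
    assert_eq!(num_chunks, 256);
    assert_eq!(num_chunks * std::mem::size_of::<u16>(), 512);

    // Hypothetical per-chunk compressed sizes; a prefix sum over the
    // skip table gives the compressed byte offset of each chunk.
    let chunk_sizes = vec![1000u16; num_chunks];
    let (first, last) = chunks_for_rows(20_000, 20_010);
    let offset: usize = chunk_sizes[..first].iter().map(|&s| s as usize).sum();
    println!("rows 20000..=20010 need chunks {first}..={last} at compressed offset {offset}");
}
```

The point of the example: a lookup of a dozen rows only needs one 32 KiB chunk decompressed rather than the whole 8 MiB page, at the cost of a 512-byte metadata buffer.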
Thanks for the great suggestions.
Do we have a rough roadmap for when this might be developed? I'll follow your suggestion to start with an environment variable-controlled approach. However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.
This is essentially what I'd like to achieve. Initially, I thought it could be accomplished by having different …
Currently I was planning on adding pushdown predicates and robustness testing this month, with the hope of making lance v2 the default for lance datasets by the end of the month. After that I was planning on making the encodings more extensible, so that others could start developing encodings. I think adding configuration would be part of this work. So I would estimate it should be ready by the end of June.
This is our goal as well :) Since most of our queries on the inference path are vector searches, we need to do a lot of "select X rows by offset", so point lookups are important. However, we want to balance this with full scans since those are very common in the training path. My thinking is that bitpacking, frame of reference, and delta are good first encodings. It's pretty cheap to determine if they will be beneficial and there is no effect on random access. RLE, FSST, dictionary, and general compression are the next set. These do have some effect on random access but, if the chunk sizes are small enough, hopefully it won't be too significant. I also think various sentinel encodings are important too because they avoid an IOP during a point lookup. I have others that are starting to help me on this encodings work, so it will probably happen in parallel with the things I mentioned above. Bitpacking was just opened today: #2333
Do you think field metadata would be a good tool for users to specify this configuration? Or do you have any other ideas?
Thanks for the insight.
Experimenting with general compression is useful in my scenario, especially since it can be applied to all types of data, whether integer, float, or string. This flexibility could make Lance a viable format for my project even without additional encodings. Currently, we utilize dictionary encoding for low cardinality fields, and I may explore incorporating dictionary encoding later on. I also experimented with FSST previously, as documented here, but it seems more suited for short strings and specific application domains.
Using field metadata to specify configuration seems like a useful approach. In my project, we currently utilize Arrow IPC with multiple record batches to store a portion of the data. We aim to support both point queries and analytical queries that involve scanning large amounts of data. Currently, we chunk a field in an IPC file into multiple record batches, dynamically calculating the chunk size based on the average size of the field. To ensure the file is self-contained, we store the chunk size in the IPC file as custom metadata, which the IPC format natively supports, allowing readers to access the file without additional external metadata.

The Lance v2 format appears more flexible, and I'm considering leveraging it to enable multiple fields to have different chunk sizes, thus enhancing the efficiency of randomly accessing these fields. This is particularly crucial as some fields are large, while others are trivial in size.
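The per-field chunk-size heuristic described above can be sketched in a few lines. This is a minimal illustration of the idea, not code from either project; the function name, target size, and example value sizes are all assumptions.

```rust
// Derive rows-per-chunk for a field from a target chunk byte size and
// that field's observed average value size (illustrative heuristic).
fn rows_per_chunk(target_chunk_bytes: usize, avg_value_bytes: usize) -> usize {
    (target_chunk_bytes / avg_value_bytes.max(1)).max(1)
}

fn main() {
    // A large field (e.g. a 4 KiB embedding) gets few rows per chunk...
    println!("large field: {} rows/chunk", rows_per_chunk(32 * 1024, 4096));
    // ...while a small int32 field gets many, so each field's chunks stay
    // near the target size and random access cost stays balanced.
    println!("int32 field: {} rows/chunk", rows_per_chunk(32 * 1024, 4));
}
```

Storing the resulting per-field chunk size in file metadata (as the comment describes for IPC) keeps the file self-describing for readers.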
Regarding …
Well, boolean is difficult because we usually represent them as bits, so there's no value available other than 0 or 1 to serve as a sentinel.
I think during the encoding process we'll collect statistics for arrays, such as min, max, null count, and distinct count. These will be saved for page skipping, but also used to decide how to encode the page. An easy way to find a sentinel value would be …
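One plausible way min/max statistics could yield a sentinel is sketched below: any value outside `[min, max]` never occurs in the page, so it can stand in for null without a separate validity buffer. The function name and fallback behavior are assumptions for illustration, not the actual Lance heuristic.

```rust
// Hypothetical sentinel picker based on page statistics (min, max).
// Returns None when the full i32 range is in use, in which case the
// writer would fall back to a validity bitmap instead.
fn pick_sentinel(min: i32, max: i32) -> Option<i32> {
    if max < i32::MAX {
        Some(max + 1) // one past the max never occurs in the data
    } else if min > i32::MIN {
        Some(min - 1) // one below the min never occurs either
    } else {
        None // entire range in use; no sentinel available
    }
}

fn main() {
    println!("sentinel for [0, 100]: {:?}", pick_sentinel(0, 100));
    println!("sentinel for full range: {:?}", pick_sentinel(i32::MIN, i32::MAX));
}
```

This also shows why the boolean case above is hard: with only 0 and 1 representable, there is never a spare value to pick.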
I have drafted a PR (#2368) to add support for compressing the value page buffer. Could you please review it to see if it fits well? And please let me know if a new issue should be opened for this PR. As I am relatively new to Lance and Rust, there might be some mistakes in the PR. Please excuse any oversights. I am also uncertain if the current solution is the best fit for Lance. If it isn't, feel free to reject this PR. I am open to suggestions and willing to give it another try if we can figure out a better approach to address this issue. Thanks.
I have been investigating potential changes to the Lance file format. These changes are for a number of reasons but the highlights are:
This change will not be a single PR. I'm creating a parent task to track the work that needs to be done.
Initially, these changes will not be accessible at all (e.g. nothing will use a v2 writer by default)
Ready (new file format can do everything the old format can do with equal or better perf)
- More thorough testing
- Struct<List<...>> and Struct<Struct<...>>
- List<Struct<...>>
- Benchmarking
- The above, but with filter pushdown (e.g. statistics) (statistics will be post-MVP)

Switchover
- Switch use_experimental_writer to use_legacy_writer (#2393)

Beyond MVP (extra things to enable / investigate later)
- FragmentCreateBuilder