Lance File Format Version 2 (technically v0.3) #1929
The motivation and bigger picture are covered in more detail in #1929. This PR builds on top of #1918 and #1964 to create a new version of the Lance file format. There is still much to do, but this end-to-end MVP should provide the overall structure for the work. It can currently read and write primitive columns and list columns and supports some very basic encodings.
Hey @westonpace, I'm intrigued by the v2 format and I'm looking into adding support for general compression. I'd like to explore the possibility of encoding each page's buffer with zstd compression, similar to Arrow IPC's record batch body buffer compression. However, Lance's v2 format seems to offer more flexibility, as different fields may use different page sizes. I've taken a look at the code and glanced over the current implementation. It seems that logical encoders like …
Hello again @niyue :)
You're right that there is a piece missing at the moment. There will need to be some kind of "encoding picker" API that will need to be extensible. This component often calculates some basic statistics to figure out which encoding would be best to apply. For example, encodings like RLE are often only applied if there is a small range of possible values. I think we will also want some kind of mechanism for user configuration but I'm not entirely sure what shape that will take yet (maybe field metadata). For now, I think we can probably choose whether or not to apply general compression based on an environment variable. Then we can hook it into the configuration mechanism later, once it's been developed. So, if the environment variable is set, all value buffers will have general compression applied. If it is not set, no buffers will.
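For illustration, a minimal sketch of the environment-variable approach described above, assuming a hypothetical variable name (`LANCE_FILE_COMPRESSION`) and placeholder types; this is not the actual Lance API:

```rust
use std::env;

/// Placeholder for the compression choice applied to value buffers.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BufferCompression {
    None,
    Zstd,
}

/// If the (hypothetical) environment variable is set, compress all value
/// buffers; otherwise compress none of them.
fn pick_buffer_compression() -> BufferCompression {
    match env::var("LANCE_FILE_COMPRESSION") {
        Ok(v) if !v.is_empty() => BufferCompression::Zstd,
        _ => BufferCompression::None,
    }
}

fn main() {
    println!("compression = {:?}", pick_buffer_compression());
}
```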
There will be some changes coming up. I had been planning on inviting others to help with encodings a little bit later (in maybe about a month). However, I think the actual API for physical encodings is pretty stable. If you want to make an attempt at adding compression I think it would be fine.
I think you are right. The scheduler will be a bit interesting because we cannot determine the exact range to read when using general compression. So the page scheduler will simply need to load the entire range, and send the requested range to the decoder. Then, the decoder, after it applies the decompression, can select the parts that were asked for. Longer term (can be a future PR) I would like a general compression encoding to be able to utilize a metadata buffer for a skip table. For example, even though we have an 8MB page we can compress it in 32KB chunks. We can record the number of values per chunk in the encoding description (e.g. if this is int32 we would have 8K values per chunk). For each chunk we can record the compressed size of the chunk. This would give us 256 sizes which should all be 16-bit values. We can then store this 512-byte buffer in one of the column metadata buffers. Then, during scheduling, if the user is asking for a specific row or small range of rows we can use this metadata buffer to figure out exactly which chunks we need to load, reducing the I/O for a small (512 byte) metadata cost. There are some pieces needed (the ability to store metadata buffers) that are not yet ready for this longer term feature. I will be working on pushdown filtering soon and I expect the pieces we need will get developed then. However, I wanted to share where my thinking was on this.
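As a sketch of the skip-table idea, assuming the layout from the example above (a fixed number of values per chunk and one u16 compressed size per chunk); the function names are illustrative, not the Lance scheduler API:

```rust
/// Translate a requested row range into (chunk indices, byte range) within a
/// compressed page, using the per-chunk compressed sizes as a skip table.
fn chunks_to_read(
    values_per_chunk: u64,
    compressed_chunk_sizes: &[u16],
    row_start: u64,
    row_end: u64, // exclusive
) -> (std::ops::Range<usize>, std::ops::Range<u64>) {
    let first_chunk = (row_start / values_per_chunk) as usize;
    let last_chunk = ((row_end + values_per_chunk - 1) / values_per_chunk) as usize;

    // Byte offset of the first needed chunk is the sum of all earlier chunk sizes.
    let byte_start: u64 = compressed_chunk_sizes[..first_chunk]
        .iter()
        .map(|s| *s as u64)
        .sum();
    let byte_len: u64 = compressed_chunk_sizes[first_chunk..last_chunk]
        .iter()
        .map(|s| *s as u64)
        .sum();

    (first_chunk..last_chunk, byte_start..byte_start + byte_len)
}

fn main() {
    // 256 chunks of 8K int32 values each, i.e. the 8 MiB page from the example.
    let sizes = vec![20_000u16; 256];
    let (chunks, bytes) = chunks_to_read(8 * 1024, &sizes, 100_000, 100_010);
    println!("read chunks {:?} -> bytes {:?}", chunks, bytes);
}
```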
Thanks for the great suggestions.
Do we have a rough roadmap for when this might be developed? I'll follow your suggestion to start with an environment variable-controlled approach. However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.
This is essentially what I'd like to achieve. Initially, I thought it could be accomplished by having different …
Currently I was planning on adding pushdown predicates and robustness testing this month, with the hope of making lance v2 the default for lance datasets by the end of the month. After that I was planning on making the encodings more extensible, so that others could start developing encodings. I think adding configuration would be part of this work. So I would estimate it should be ready by the end of June.
This is our goal as well :) Since most of our queries on the inference path are vector searches this means we need to do a lot of "select X rows by offset" and so point lookups are important. However, we want to balance this with full scans since those are very common in the training path. My thinking is that bitpacking, frame of reference, and delta are good first encodings. It's pretty cheap to determine if they will be beneficial and there is no effect on random access. RLE, FSST, dictionary, and general compression are the next set. These do have some effect on random access but, if the chunk sizes are small enough, hopefully it won't be too significant. I also think various sentinel encodings are important because they avoid an IOP during a point lookup. I have others that are starting to help me on this encodings work and so it will probably happen in parallel with the things I mentioned above. Bitpacking was just opened today: #2333
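To illustrate why the first group is cheap to evaluate, here is a rough benefit check for frame-of-reference plus bitpacking (illustrative only, not Lance's actual heuristic):

```rust
/// After subtracting the page minimum, how many bits does each value need?
/// If this is much smaller than 32, frame-of-reference + bitpacking will help.
fn frame_of_reference_bits(values: &[i32]) -> Option<u32> {
    let min = *values.iter().min()?;
    let max = *values.iter().max()?;
    let range = (max as i64 - min as i64) as u64;
    // Bits needed to represent `range` (at least 1).
    Some(64 - range.leading_zeros().min(63))
}

fn main() {
    let page = [1000, 1003, 1001, 1010, 1007];
    if let Some(bits) = frame_of_reference_bits(&page) {
        println!("{} bits per value instead of 32", bits);
    }
}
```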
Do you think field metadata will be a good tool for users to specify this configuration? Or do you have any other idea?
Thanks for the insight.
Experimenting with general compression is useful in my scenario, especially since it can be applied to all types of data, whether integer, float, or string. This flexibility could make Lance a viable format for my project, even without additional encodings. Currently, we utilize dictionary encoding for low cardinality fields, and I may explore incorporating dictionary encoding later on. I also experimented with FSST previously, as documented here, but it seems more suited for short strings and has specific application domains.
Using field metadata to specify configuration seems like a useful approach. In my project, we currently utilize Arrow IPC with multiple record batches to store a portion of the data. We aim to support both point queries and analytical queries that involve scanning large amounts of data. Currently, we chunk a field in an IPC file into multiple record batches, dynamically calculating the chunk size based on the average size of the field. To ensure the file is self-contained, we store the chunk size in the IPC file as custom metadata, which the IPC format natively supports, allowing readers to access the file without additional external metadata. The Lance v2 format appears more flexible, and I'm considering leveraging it to enable multiple fields to have different chunk sizes, thus enhancing the efficiency of randomly accessing these fields. This is particularly crucial as some fields are large, while others are trivial in size.
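If field metadata does end up being the configuration mechanism, it could look something like the following sketch using arrow-rs; the metadata keys here are made up for illustration and are not an agreed-upon Lance convention:

```rust
use std::collections::HashMap;

use arrow_schema::{DataType, Field};

fn main() {
    // Hypothetical per-field settings carried in Arrow field metadata.
    let log_body = Field::new("body", DataType::Utf8, true).with_metadata(HashMap::from([
        ("lance:compression".to_string(), "zstd".to_string()),
        ("lance:chunk-size".to_string(), "32768".to_string()),
    ]));
    // A small field that needs neither compression nor special chunking.
    let timestamp = Field::new("ts", DataType::Int64, false);

    println!("{:?}\n{:?}", log_body, timestamp);
}
```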
regarding …
Well, booleans are difficult because we usually represent them as bits, so there's no value other than 0 or 1 left over to use as a sentinel.
I think during the encoding process we'll collect statistics for arrays, such as min, max, null count, and distinct count. These will be saved for page skipping, but also be used to decide how to encode the page. An easy way to find a sentinel value would be …
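One possible way to derive a sentinel from those statistics (a sketch assuming i32 data; not the statistics API Lance actually uses): any value outside [min, max] cannot collide with real data.

```rust
/// Page-level statistics of the kind described above.
struct PageStats {
    min: i32,
    max: i32,
    null_count: usize,
}

fn compute_stats(values: &[Option<i32>]) -> Option<PageStats> {
    let non_null: Vec<i32> = values.iter().flatten().copied().collect();
    Some(PageStats {
        min: *non_null.iter().min()?,
        max: *non_null.iter().max()?,
        null_count: values.len() - non_null.len(),
    })
}

/// Pick a value guaranteed not to appear in the page, if one exists.
fn pick_sentinel(stats: &PageStats) -> Option<i32> {
    if stats.max < i32::MAX {
        Some(stats.max + 1)
    } else if stats.min > i32::MIN {
        Some(stats.min - 1)
    } else {
        None // the full i32 range is in use; fall back to a validity bitmap
    }
}

fn main() {
    let page = [Some(3), None, Some(7), Some(5)];
    let stats = compute_stats(&page).unwrap();
    println!("nulls={} sentinel={:?}", stats.null_count, pick_sentinel(&stats));
}
```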
I have drafted a PR (#2368) to add support for compressing the value page buffer. Could you please review it to see if it fits well? And please let me know if a new issue should be opened for this PR. As I am relatively new to Lance and Rust, there might be some mistakes in the PR. Please excuse any oversights. I am also uncertain if the current solution is the best fit for Lance. If it isn't, feel free to reject this PR. I am open to suggestions and willing to give it another try if we can figure out a better approach to address this issue. Thanks.
I've cleaned up the task list, removing completed items and restructuring it a bit. We have a pretty solid set of basic encodings. There are a few "completion tasks" that need to be done to round out the capabilities. At the same time I have come up with a design for new struct/list encodings that better support random access. I plan to be working on this over the next month or two. I'd appreciate any feedback on the document: https://docs.google.com/document/d/19QNZq7A-797CXt8Z5pCrDEcxcRxEE8J0_sw4goqqgIY/edit?usp=sharing CC @niyue / @broccoliSpicy who may be interested.
so excited to see your ideas on struct/list encodings! @westonpace |
a few thoughts about the doc: …
sorry, after rethinking this, I think this is not feasible using only …
No worries. If we come up with a good algorithm at any point we can always plug it in.
Unfortunately, I think we'd need to store two copies of the data. Because, even if we have the compressed null bitmap we still need to read the data and it would have the null bit attached to it. I do think we might not take the zipped nulls approach for integer / fp data. For example, if you have integers and you have bitpacking then, in most cases, I expect you will be able to store 1024 integers AND the compressed null bitmap for that block in less than one 4KB disk sector. So I expect zipped nulls will be most useful for larger data types and the overhead of unzipping should be fairly minor.
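A quick back-of-the-envelope check of that claim, assuming a plain one-bit-per-value validity bitmap (real numbers will vary with the compression used):

```rust
/// Bytes for a block of bitpacked values plus its validity bitmap.
fn block_bytes(num_values: usize, bits_per_value: usize) -> usize {
    let packed = (num_values * bits_per_value + 7) / 8;
    let validity = (num_values + 7) / 8;
    packed + validity
}

fn main() {
    // 1024 integers that bitpack to 20 bits each: 2560 + 128 = 2688 bytes,
    // comfortably inside a single 4 KiB sector.
    let total = block_bytes(1024, 20);
    println!("{} bytes, fits in 4 KiB: {}", total, total <= 4096);
}
```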
Can you expand on this?
sorry for the delayed response, I will find some time to read it
@westonpace …
@niyue Yes, this is still a goal. However, I don't know if I'm going to get to it until closer to the end of the year (which, as you noticed, is well behind the schedule I had originally hoped for). The main reason is that I think we want more confidence in the traits before making an SDK. Right now, encoders & decoders need to worry about both scheduling AND encoding/decoding. I'm slowly working on splitting the encoders/decoders into three types: structural encodings, compressive encodings, and custom indices.
Do your custom encodings fit nicely into one of those three categories? The current implementation is missing "custom indices" and the traits for "structural" and "compressive" encodings are wrong: …
However, if you wanted to start now, I think that's fine, it will just be more complicated. I can also draft up what the traits should look like if you let me know which of the three categories you are most interested in.
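Purely as a strawman of that three-way split (these are not the actual Lance traits, which the comment above says are still being reworked):

```rust
use std::ops::Range;

// Compressive: turns a buffer of values into a (hopefully smaller) buffer and back.
trait CompressiveEncoding {
    fn encode(&self, values: &[u8]) -> Vec<u8>;
    fn decode(&self, encoded: &[u8]) -> Vec<u8>;
}

// Structural: decides how an array maps onto buffers/pages and which byte
// ranges must be scheduled to satisfy a row range.
trait StructuralEncoding {
    fn ranges_to_read(&self, rows: Range<u64>) -> Vec<Range<u64>>;
}

// Custom index: extra metadata written alongside the column that can narrow a
// query down to candidate rows without a full scan.
trait CustomIndex {
    fn candidate_rows(&self, predicate: &str) -> Vec<Range<u64>>;
}

fn main() {}
```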
@westonpace thanks so much for the detailed info.

Sorted Index Encoding: I have a numeric field (specifically a timestamp field) that is sorted. This sorted field can serve as an index for other fields, enabling efficient lookups. The typical access pattern would be to query a time range, use this sorted field to determine the corresponding range of row IDs, and then access the rows within that range. Since this field is part of the dataset, I'd prefer not to duplicate it as a dedicated index file outside of Lance. It would be ideal if Lance's custom encoding could unlock this potential by utilizing the field both as data and as an index. While it seems related to the custom indices category, the fact that this field is embedded in the data suggests there may be a need for specialized loading, decoding, or scheduling.

CLP Encoding: CLP (Compressed Log Processing) is a domain-specific encoding designed for compressing logs (refer to the OSDI 2021 paper). It breaks a log message like …
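For the sorted-timestamp case, the core lookup is simple once the column is known to be sorted: two binary searches map a time range onto a contiguous row-id range. A sketch (names illustrative):

```rust
/// Map a [start, end] time range onto the row-id range of a sorted timestamp column.
fn row_range_for_time_range(sorted_ts: &[i64], start: i64, end: i64) -> std::ops::Range<usize> {
    // First row with ts >= start.
    let lo = sorted_ts.partition_point(|&ts| ts < start);
    // First row with ts > end (exclusive upper bound).
    let hi = sorted_ts.partition_point(|&ts| ts <= end);
    lo..hi
}

fn main() {
    let ts = [10, 20, 20, 30, 40, 55, 70];
    // Rows whose timestamp falls in [20, 40] -> rows 1..5
    println!("{:?}", row_range_for_time_range(&ts, 20, 40));
}
```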
Both of those sound cool to have.
This should be enabled by "pushdown filtering", which is something I was working on last week in #2913. There are a few challenges I am working through; I have some basic solutions for them but I don't love them, so it will still be rough around the edges even when that PR gets merged. There are probably some bugs and we need some end-to-end testing before the feature is ready to be turned on. One of the challenges is that the Lance package will need to adopt an expression language that it can use for filter expressions. There is https://crates.io/crates/datafusion-expr but I had been avoiding it as the datafusion package is rather large (in the PR I work around this by making the encoding an "extension" in the …).

The "zone maps encoding" works like this: …
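In its generic form (not necessarily how the Lance PR implements it), zone-map pruning keeps per-zone min/max statistics and only reads zones whose range could satisfy the filter:

```rust
/// Per-zone min/max statistics for a fixed number of rows per zone.
struct ZoneMap {
    rows_per_zone: u64,
    min: Vec<i64>,
    max: Vec<i64>,
}

impl ZoneMap {
    /// Row ranges that might contain values in [lo, hi]; all other zones are skipped.
    fn candidate_row_ranges(&self, lo: i64, hi: i64) -> Vec<std::ops::Range<u64>> {
        let mut out = Vec::new();
        for (i, (zmin, zmax)) in self.min.iter().zip(self.max.iter()).enumerate() {
            if *zmax >= lo && *zmin <= hi {
                let start = i as u64 * self.rows_per_zone;
                out.push(start..start + self.rows_per_zone);
            }
        }
        out
    }
}

fn main() {
    let zones = ZoneMap {
        rows_per_zone: 1000,
        min: vec![0, 500, 900],
        max: vec![600, 950, 2000],
    };
    // A filter like `value BETWEEN 700 AND 800` only needs the middle zone.
    println!("{:?}", zones.candidate_row_ranges(700, 800));
}
```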
This sounds great! What sorts of searches would you want to satisfy?
Wouldn't a large percentage of rows satisfy this query?
I have been investigating potential changes to the Lance file format. These changes are for a number of reasons but the highlights are: …
This change will not be a single PR. I'm creating a parent task to track the work that needs to be done.
Initially, these changes will not be accessible at all (e.g. nothing will use a v2 writer by default)
Complete implementation
Switchover
Columnar Encodings for Random Access
Design: https://docs.google.com/document/d/19QNZq7A-797CXt8Z5pCrDEcxcRxEE8J0_sw4goqqgIY/edit?usp=sharing
Benchmarking
Low Priority