Lance File Format Version 2 (technically v0.3) #1929

Open · 29 of 77 tasks · Tracked by #2079
westonpace opened this issue Feb 8, 2024 · 9 comments
westonpace (Contributor) commented Feb 8, 2024

I have been investigating potential changes to the Lance file format. There are a number of reasons for these changes, but the highlights are:

  • Allow encodings to change on a per-page basis
  • Get rid of row groups because there is no way to specify an ideal row group size when working with large data
  • Allow for writing data one column at a time in addition to one row group at a time
  • Allow columns to have different sizes (this will make it possible to use Lance files in places we can't use them today, such as shuffling)
  • Allow more flexibility in where metadata is stored

This change will not be a single PR. I'm creating a parent task to track the work that needs to be done.

Initially, these changes will not be accessible at all (e.g. nothing will use a v2 writer by default)

Ready (new file format can do everything the old format can do with equal or better perf)

More thorough testing

Benchmarking

  • perf: benchmarking encodings (read time, write time) #2383
  • Python level take scan parameterized by (# columns, batch size, data types, metadata cached)
  • The above, but with very large items
  • The above, but with filter pushdown (e.g. statistics) (statistics will be post-MVP)
  • Few columns from very many columns (using true column projection)

Switchover

Beyond MVP (extra things to enable / investigate later)

  • Projection support in python FileReader
  • Add out-of-batch coalescing #1960
  • Implementation of out-of-core shuffle based on new format
  • Add support for "schedule_all" to the encodings
  • Add support for dictionary encoded fields to the v2 reader/writer #2347
  • Add support for scheduling partial pages
  • Add backpressure to decoder scheduler #1957
  • Union type
  • Add support for fixed size list as a logical encoding
  • Support for pushdown scans
  • Test write of 0 batches
  • Very large items (e.g. single items larger than a page)
  • New encodings
    • Sentinel encoding for nulls
    • Dense null encoding
    • Compressed bitmap encoding
    • Per-page dictionary encoding
    • FOR encoding
    • Delta encoding
    • feat: add bitpack encoding for LanceV2 #2333
    • FSST encoding
    • General compression (e.g. zlib)
  • Investigate metadata-caching friendly parameters
  • Add support for statistics / zone maps
  • Make it possible for users to supply their own encodings
  • Add example showing how to create a custom encoding
  • Add support for backpressure (I/O faster than decoder)
  • Primitive columns where individual items are more than 2^32 bytes
  • Add v2 as an option in FragmentCreateBuilder
  • Allow specifying readahead as bytes instead of rows
wjones127 added this to the (WIP) Lance Roadmap milestone Mar 12, 2024
wjones127 mentioned this issue Mar 15, 2024
westonpace added a commit that referenced this issue Apr 9, 2024
The motivation and bigger picture are covered in more detail in
#1929

This PR builds on top of #1918 and
#1964 to create a new version of
the Lance file format.

There is still much to do, but this end-to-end MVP should provide the
overall structure for the work.

It can currently read and write primitive columns and list columns and
supports some very basic encodings.
westonpace changed the title from "Lance File Format Version 0.2" to "Lance File Format Version 2 (technically v0.3)" on Apr 17, 2024
niyue (Contributor) commented May 13, 2024

Hey @westonpace, I'm intrigued by the v2 format and I'm looking into adding support for general compression. I'd like to explore the possibility of encoding each page's buffer with zstd compression, similar to Arrow IPC's record batch body buffer compression. However, Lance's v2 format seems to offer more flexibility, as different fields may use different page sizes.

I've taken a look at the code and glanced over the current implementation. It seems that logical encoders like PrimitiveFieldEncoder are hardcoded to use the physical encoder ValueEncoder internally. I believe the "General compression" encoder would be a type of physical encoder, but I'm unsure how to integrate this new physical encoder into ValueEncoder. Do you have any guidance on this? Additionally, do you think it's the right time to pursue such an enhancement, considering that this part of the codebase is still actively being developed? Thanks for any insights you can provide.

westonpace (Contributor, Author) commented

Hello again @niyue :)

It seems that logical encoders like PrimitiveFieldEncoder are hardcoded to use the physical encoder ValueEncoder internally

You're right that there is a piece missing at the moment. There will need to be some kind of "encoding picker" API that will need to be extensible. This component often calculates some basic statistics to figure out which encoding would be best to apply. For example, encodings like RLE are often only applied if there is a small range of possible values. I think we will also want some kind of mechanism for user configuration but I'm not entirely sure what shape that will take yet (maybe field metadata). For now, I think we can probably choose whether or not to apply general compression based on an environment variable. Then we can hook it into the configuration mechanism later, once it's been developed. So, if the environment variable is set, all value buffers will have general compression applied. If it is not set, no buffers will.
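
For illustration, here is a minimal sketch of that interim approach, assuming a hypothetical environment variable name and a made-up encoding enum rather than Lance's actual encoder traits:

```rust
use std::env;

// Hypothetical knob: the variable name is an assumption, not an existing Lance setting.
fn use_general_compression() -> bool {
    env::var("LANCE_USE_GENERAL_COMPRESSION").is_ok()
}

// Stand-in for whatever the real "encoding picker" would produce.
enum BufferEncoding {
    Plain,
    Zstd { level: i32 },
}

// If the environment variable is set, all value buffers get general compression;
// otherwise none do.
fn pick_value_encoding() -> BufferEncoding {
    if use_general_compression() {
        BufferEncoding::Zstd { level: 3 }
    } else {
        BufferEncoding::Plain
    }
}
```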

Additionally, do you think it's the right time to pursue such an enhancement, considering that this part of the codebase is still actively being developed?

There will be some changes coming up. I had been planning on inviting others to help with encodings a little bit later (in maybe about a month). However, I think the actual API for physical encodings is pretty stable. If you want to make an attempt at adding compression I think it would be fine.

I believe the "General compression" encoder would be a type of physical encoder

I think you are right. The scheduler will be a bit interesting because we cannot determine the exact range to read when using general compression. So the page scheduler will simply need to load the entire range, and send the requested range to the decoder. Then, the decoder, after it applies the decompression, can select the parts that were asked for.
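
As a rough sketch of that decode path (assuming the `zstd` crate and a page of plain little-endian i32 values; the function and its signature are illustrative, not Lance's decoder API):

```rust
use std::ops::Range;

// Decompress the whole page (general compression is opaque, so we can't seek
// into it), then select only the rows the caller asked for.
fn decode_rows(compressed_page: &[u8], rows: Range<usize>) -> std::io::Result<Vec<i32>> {
    let decompressed = zstd::decode_all(compressed_page)?;
    // Each value is 4 bytes; callers are expected to pass an in-bounds range.
    let bytes = &decompressed[rows.start * 4..rows.end * 4];
    Ok(bytes
        .chunks_exact(4)
        .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect())
}
```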

Longer term (can be a future PR) I would like a general compression encoding to be able to utilize a metadata buffer for a skip table. For example, even though we have an 8MB page we can compress it in 32KB chunks. We can record the number of values per chunk in the encoding description (e.g. if this is int32 we would have 8K values per chunk). For each chunk we can record the compressed size of the chunk. This would give us 256 sizes which should all be 16-bit values. We can then store this 512-byte buffer in one of the column metadata buffers. Then, during scheduling, if the user is asking for a specific row or small range of rows we can use this metadata buffer to figure out exactly which chunks we need to load, reducing the I/O for a small (512 byte) metadata cost.
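
A sketch of how such a skip table could be consulted during scheduling; the types and layout here are assumptions, since the metadata buffer support does not exist yet:

```rust
/// Hypothetical skip table for one page: each chunk holds `values_per_chunk`
/// values, and `compressed_sizes[i]` is the compressed byte length of chunk i
/// (e.g. 256 u16 entries = 512 bytes for an 8 MiB page in 32 KiB chunks).
struct SkipTable {
    values_per_chunk: u64,
    compressed_sizes: Vec<u16>,
}

impl SkipTable {
    /// Map a requested row range (inclusive) to the byte range of chunks that
    /// must be fetched, so small reads don't pull in the whole page.
    fn byte_range_for_rows(&self, first_row: u64, last_row: u64) -> (u64, u64) {
        let first_chunk = (first_row / self.values_per_chunk) as usize;
        let last_chunk = (last_row / self.values_per_chunk) as usize;
        let start: u64 = self.compressed_sizes[..first_chunk]
            .iter()
            .map(|&s| u64::from(s))
            .sum();
        let len: u64 = self.compressed_sizes[first_chunk..=last_chunk]
            .iter()
            .map(|&s| u64::from(s))
            .sum();
        (start, start + len)
    }
}
```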

There are some pieces needed (the ability to store metadata buffers) that are not yet ready for this longer term feature. I will be working on pushdown filtering soon and I expect the pieces we need will get developed then. However, I wanted to share where my thinking was on this.

niyue (Contributor) commented May 14, 2024

Thanks for the great suggestions.

we can hook it into the configuration mechanism later, once it's been developed

Do we have a rough roadmap for when this might be developed? I'll follow your suggestion to start with an environment variable-controlled approach. However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.

I would like a general compression encoding to be able to utilize a metadata buffer for a skip table

even though we have an 8MB page we can compress it in 32KB chunks

This is essentially what I'd like to achieve. Initially, I thought it could be accomplished by having a different data_cache_bytes write option for different columns, resulting in pages of varying sizes. However, your suggestion of an in-page chunking approach has me reconsidering. In my scenario, I aim to accelerate random access while maintaining reasonable compression. Sometimes it's challenging to determine the optimal compression method, so having the option of general compression could be beneficial.

westonpace (Contributor, Author) commented May 14, 2024

Do we have a rough roadmap for when this might be developed? I'll follow your suggestion to start with an environment variable-controlled approach. However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.

Currently I am planning on adding pushdown predicates and robustness testing this month, with the hope of making Lance v2 the default for Lance datasets by the end of the month.

After that I was planning on making the encodings more extensible, so that others could start developing encodings. I think adding configuration would be part of this work. So I would estimate it should be ready by the end of June.

I aim to accelerate random access while maintaining reasonable compression.

This is our goal as well :) Since most of our queries on the inference path are vector searches, we need to do a lot of "select X rows by offset" operations, so point lookups are important. However, we want to balance this with full scans, since those are very common in the training path.

My thinking is that bitpacking, frame-of-reference, and delta are good first encodings. It's pretty cheap to determine whether they will be beneficial, and there is no effect on random access. RLE, FSST, dictionary, and general compression are the next set. These do have some effect on random access but, if the chunk sizes are small enough, hopefully it won't be too significant. I also think the various sentinel encodings are important because they avoid an IOP during a point lookup.
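
To show why the first group is cheap to evaluate, here is a rough sketch that decides from a page's min/max statistics alone whether frame-of-reference plus bit-packing would help (an illustrative heuristic, not Lance's actual logic):

```rust
// Bits needed per value once everything is rebased against the page minimum.
fn packed_bits_after_for(min: i64, max: i64) -> u32 {
    // Widen to i128 so the subtraction cannot overflow for extreme inputs.
    let range = (i128::from(max) - i128::from(min)) as u64;
    64 - range.leading_zeros()
}

// Beneficial whenever the packed width is narrower than the natural width
// (e.g. 64 bits for i64, 32 bits for i32).
fn for_bitpacking_helps(min: i64, max: i64, natural_bits: u32) -> bool {
    packed_bits_after_for(min, max) < natural_bits
}
```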

I have others that are starting to help me on this encodings work and so it will probably happen in parallel with the things I mentioned above. Bitpacking was just opened today: #2333

westonpace (Contributor, Author) commented

However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.

Do you think field metadata will be a good tool for users to specify this configuration? Or do you have any other idea?
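
For concreteness, this would look something like the following with arrow-rs (assuming a recent version where `Field::with_metadata` takes a `HashMap<String, String>`); the metadata key and value shown are hypothetical, not an agreed-upon Lance convention:

```rust
use std::collections::HashMap;

use arrow_schema::{DataType, Field};

// Attach a per-field hint that a writer could inspect when picking encodings.
fn compressed_string_field(name: &str) -> Field {
    Field::new(name, DataType::Utf8, true).with_metadata(HashMap::from([(
        "lance-encoding:compression".to_string(), // hypothetical key
        "zstd".to_string(),
    )]))
}
```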

niyue (Contributor) commented May 14, 2024

Thanks for the insight.

RLE, FSST, dictionary, and general compression are the next set

Experimenting with general compression is useful in my scenario, especially since it can be applied to all types of data, whether integer, float, or string. This flexibility could make Lance a viable format for my project, even without additional encodings. Currently, we utilize dictionary encoding for low-cardinality fields, and I may explore incorporating dictionary encoding later on. I also experimented with FSST previously, as documented here, but it seems more suited to short strings and specific application domains.

Do you think field metadata will be a good tool for users to specify this configuration?

Using field metadata to specify configuration seems like a useful approach. In my project, we currently use Arrow IPC with multiple record batches to store a portion of the data. We aim to support both point queries and analytical queries that scan large amounts of data. Currently, we chunk a field in an IPC file into multiple record batches, dynamically calculating the chunk size based on the average size of the field. To keep the file self-contained, we store the chunk size in the IPC file as custom metadata, which the IPC format natively supports, allowing readers to access the file without additional external metadata. The Lance v2 format appears more flexible, and I'm considering leveraging it to let different fields have different chunk sizes, thus improving the efficiency of randomly accessing those fields. This is particularly important because some fields are large, while others are trivial in size.

broccoliSpicy (Contributor) commented May 16, 2024

Regarding sentinel encoding for nulls:
For the boolean datatype, I guess we can choose whatever value is not false or true.
For datatypes like Timestamp, Date32, Date64, Time32, Time64, Duration, and Interval, since these types use signed integers underneath and valid values are always non-negative, we can choose a negative value as the sentinel.
But for other datatypes like int, uint, and float, how can we pick a sentinel value?
Any insights @westonpace @niyue?

wjones127 (Contributor) commented

for datatype boolean, i guess we can chose whatever value that is not false, true

Well, booleans are difficult because we usually represent them as bits, so there's no value other than 0 or 1.

how can we pick a sentinel value for them?

I think during the encoding process we'll collect statistics for arrays, such as min, max, null count, and distinct count. These will be saved for page skipping, but will also be used to decide how to encode the page. An easy way to find a sentinel value would be max+1 or min-1, if these don't overflow. If that doesn't give a match, we can either scan for an unused value or simply choose a bitmap null encoding.
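
A small sketch of that selection logic for an i32 page, using the collected min/max (illustrative only):

```rust
enum NullStrategy {
    Sentinel(i32),
    Bitmap,
}

// Try max+1, then min-1; if both overflow we would have to scan for an unused
// value or fall back to a bitmap null encoding (the bitmap fallback is shown here).
fn pick_null_strategy(min: i32, max: i32) -> NullStrategy {
    if let Some(s) = max.checked_add(1) {
        NullStrategy::Sentinel(s)
    } else if let Some(s) = min.checked_sub(1) {
        NullStrategy::Sentinel(s)
    } else {
        NullStrategy::Bitmap
    }
}
```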

niyue (Contributor) commented May 22, 2024

@westonpace

I have drafted a PR (#2368) to add support for compressing the value page buffer. Could you please review it to see if it fits well? And please let me know if a new issue should be opened for this PR.

As I am relatively new to Lance and Rust, there might be some mistakes in the PR. Please excuse any oversights. I am also uncertain if the current solution is the best fit for Lance. If it isn't, feel free to reject this PR. I am open to suggestions and willing to give it another try if we can figure out a better approach to address this issue. Thanks.
