Lance File Format Version 2 (technically v0.3) #1929

Open
12 of 30 tasks
Tracked by #2079
westonpace opened this issue Feb 8, 2024 · 19 comments

westonpace (Contributor) commented Feb 8, 2024

I have been investigating potential changes to the Lance file format. There are a number of motivations for these changes, but the highlights are:

  • Allow encodings to change on a per-page basis
  • Get rid of row groups, because there is no way to specify an ideal row group size when working with large data
  • Allow for writing data one column at a time in addition to one row group at a time
  • Allow columns to be different sizes (this will make it possible to use Lance files in places we can't use them today, such as to help with shuffling)
  • Allow more flexibility in where metadata is stored

This change will not be a single PR. I'm creating a parent task to track the work that needs to be done.

Initially, these changes will not be accessible at all (e.g. nothing will use a v2 writer by default).

Complete implementation

Switchover

  • Introduce concept of "max writer version" to lance datasets
  • Add writer version to v2 writer which will control what encodings are used
  • Add a migration command that will allow a dataset to switch over to v2 files and/or upgrade the writer version
  • Clear documentation around version capabilities and benefits
  • Change the writer default to write v2 files #2394

Columnar Encodings for Random Access

Design: https://docs.google.com/document/d/19QNZq7A-797CXt8Z5pCrDEcxcRxEE8J0_sw4goqqgIY/edit?usp=sharing

  • Breakdown into list of tasks

Benchmarking

Low Priority

  • Add out-of-batch coalescing #1960
  • Union type
  • Add support for fixed size list as a logical encoding
  • Potentially new encodings
    • Compressed bitmap encoding
    • Per-column dictionary encoding
    • Delta encoding
  • Make it possible for users to supply their own encodings
  • Add example showing how to create a custom encoding
  • Allow specifying readahead as bytes instead of rows
wjones127 added this to the (WIP) Lance Roadmap milestone Mar 12, 2024
wjones127 mentioned this issue Mar 15, 2024
westonpace added a commit that referenced this issue Apr 9, 2024
The motivation and bigger picture are covered in more detail in
#1929

This PR builds on top of #1918 and
#1964 to create a new version of
the Lance file format.

There is still much to do, but this end-to-end MVP should provide the
overall structure for the work.

It can currently read and write primitive columns and list columns and
supports some very basic encodings.
westonpace changed the title from Lance File Format Version 0.2 to Lance File Format Version 2 (technically v0.3) on Apr 17, 2024
niyue (Contributor) commented May 13, 2024

Hey @westonpace, I'm intrigued by the v2 format and I'm looking into adding support for general compression. I'd like to explore the possibility of encoding each page's buffer with zstd compression, similar to Arrow IPC's record batch body buffer compression. However, Lance's v2 format seems to offer more flexibility, as different fields may use different page sizes.

I've taken a look at the code and glanced over the current implementation. It seems that logical encoders like PrimitiveFieldEncoder are hardcoded to use the physical encoder ValueEncoder internally. I believe the "General compression" encoder would be a type of physical encoder, but I'm unsure how to integrate this new physical encoder into ValueEncoder. Do you have any guidance on this? Additionally, do you think it's the right time to pursue such an enhancement, considering that this part of the codebase is still actively being developed? Thanks for any insights you can provide.

westonpace (Contributor, Author) commented

Hello again @niyue :)

It seems that logical encoders like PrimitiveFieldEncoder are hardcoded to use the physical encoder ValueEncoder internally

You're right that there is a piece missing at the moment. There will need to be some kind of "encoding picker" API that will need to be extensible. This component often calculates some basic statistics to figure out which encoding would be best to apply. For example, encodings like RLE are often only applied if there is a small range of possible values. I think we will also want some kind of mechanism for user configuration but I'm not entirely sure what shape that will take yet (maybe field metadata). For now, I think we can probably choose whether or not to apply general compression based on an environment variable. Then we can hook it into the configuration mechanism later, once it's been developed. So, if the environment variable is set, all value buffers will have general compression applied. If it is not set, no buffers will.
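
A minimal sketch of the environment-variable gate described above, assuming the `zstd` crate; the variable name `LANCE_GENERAL_COMPRESSION` and the helper function are illustrative, not the actual Lance API:

```rust
use std::env;

/// Hypothetical helper standing in for the point where a value encoder emits a
/// page buffer: compress with zstd only if the (assumed) env var is set.
fn maybe_compress(page_buffer: &[u8]) -> std::io::Result<(Vec<u8>, bool)> {
    match env::var("LANCE_GENERAL_COMPRESSION") {
        Ok(val) if val != "0" => {
            // zstd level 3 as a reasonable default; the returned flag tells the
            // decoder whether the buffer needs decompression.
            let compressed = zstd::encode_all(page_buffer, 3)?;
            Ok((compressed, true))
        }
        _ => Ok((page_buffer.to_vec(), false)),
    }
}
```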

Additionally, do you think it's the right time to pursue such an enhancement, considering that this part of the codebase is still actively being developed?

There will be some changes coming up. I had been planning on inviting others to help with encodings a little bit later (in maybe about a month). However, I think the actual API for physical encodings is pretty stable. If you want to make an attempt at adding compression I think it would be fine.

I believe the "General compression" encoder would be a type of physical encoder

I think you are right. The scheduler will be a bit interesting because we cannot determine the exact range to read when using general compression. So the page scheduler will simply need to load the entire range, and send the requested range to the decoder. Then, the decoder, after it applies the decompression, can select the parts that were asked for.
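
A rough sketch of that "load the whole page, decompress, then slice" flow, assuming zstd-compressed little-endian i32 values; this is illustrative, not the actual scheduler/decoder traits:

```rust
/// Decompress an entire compressed page, then return only the requested rows.
/// The scheduler cannot map row offsets to byte offsets inside a generally
/// compressed buffer, so selection has to happen after decompression.
fn decode_range_i32(
    compressed_page: &[u8],
    requested: std::ops::Range<usize>, // row offsets within the page
) -> std::io::Result<Vec<i32>> {
    let decompressed = zstd::decode_all(compressed_page)?;
    let values: Vec<i32> = decompressed
        .chunks_exact(4)
        .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    Ok(values[requested].to_vec())
}
```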

Longer term (can be a future PR) I would like a general compression encoding to be able to utilize a metadata buffer for a skip table. For example, even though we have an 8MB page we can compress it in 32KB chunks. We can record the number of values per chunk in the encoding description (e.g. if this is int32 we would have 8K values per chunk). For each chunk we can record the compressed size of the chunk. This would give us 256 sizes which should all be 16-bit values. We can then store this 512-byte buffer in one of the column metadata buffers. Then, during scheduling, if the user is asking for a specific row or small range of rows we can use this metadata buffer to figure out exactly which chunks we need to load, reducing the I/O for a small (512 byte) metadata cost.
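
A sketch of how a reader could use such a skip table, under the assumptions above (a fixed number of values per chunk, per-chunk compressed sizes stored as u16s in a column metadata buffer); the function and layout are illustrative:

```rust
/// Map a requested row range to the chunk indices and the compressed byte range
/// that must be read, using the per-chunk compressed sizes as a skip table.
fn chunks_for_rows(
    chunk_compressed_sizes: &[u16], // e.g. 256 entries for an 8 MiB page of 32 KiB chunks
    values_per_chunk: u64,          // e.g. 8192 for int32 values in a 32 KiB chunk
    rows: std::ops::Range<u64>,     // requested row offsets within the page
) -> (std::ops::Range<u64>, std::ops::Range<usize>) {
    let first_chunk = (rows.start / values_per_chunk) as usize;
    let last_chunk = ((rows.end + values_per_chunk - 1) / values_per_chunk) as usize;

    // The byte offset of a chunk is the running sum of the compressed sizes
    // before it, so a prefix sum over a ~512-byte buffer replaces a large read.
    let start_byte: u64 = chunk_compressed_sizes[..first_chunk]
        .iter()
        .map(|&s| s as u64)
        .sum();
    let len: u64 = chunk_compressed_sizes[first_chunk..last_chunk]
        .iter()
        .map(|&s| s as u64)
        .sum();

    (start_byte..start_byte + len, first_chunk..last_chunk)
}
```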

There are some pieces needed (the ability to store metadata buffers) that are not yet ready for this longer term feature. I will be working on pushdown filtering soon and I expect the pieces we need will get developed then. However, I wanted to share where my thinking was on this.

niyue (Contributor) commented May 14, 2024

Thanks for the great suggestions.

we can hook it into the configuration mechanism later, once it's been developed

Do we have a rough roadmap for when this might be developed? I'll follow your suggestion to start with an environment variable-controlled approach. However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.

I would like a general compression encoding to be able to utilize a metadata buffer for a skip table

even though we have an 8MB page we can compress it in 32KB chunks

This is essentially what I'd like to achieve. Initially, I thought it could be accomplished by using a different data_cache_bytes write option for different columns, resulting in pages of varying sizes. However, your suggestion of an in-page chunking approach has me reconsidering. In my scenario, I aim to accelerate random access while maintaining reasonable compression. Sometimes it's challenging to determine the optimal compression method, so having the option of general compression could be beneficial.

westonpace (Contributor, Author) commented May 14, 2024

Do we have a rough roadmap for when this might be developed? I'll follow your suggestion to start with an environment variable-controlled approach. However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.

Currently, I am planning on adding pushdown predicates and robustness testing this month, with the hope of making Lance v2 the default for Lance datasets by the end of the month.

After that I was planning on making the encodings more extensible, so that others could start developing encodings. I think adding configuration would be part of this work. So I would estimate it should be ready by the end of June.

I aim to accelerate random access while maintaining reasonable compression.

This is our goal as well :) Since most of our queries on the inference path are vector searches, we need to do a lot of "select X rows by offset", so point lookups are important. However, we want to balance this with full scans, since those are very common in the training path.

My thinking is that bitpacking, frame of reference, and delta are good first encodings. It's pretty cheap to determine if they will be beneficial and there is no effect on random access. RLE, FSST, dictionary, and general compression are the next set. These do have some effect on random access, but if the chunk sizes are small enough, hopefully it won't be too significant. I also think various sentinel encodings are important because they avoid an IOP during a point lookup.
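
As an illustration of why these first encodings keep random access cheap, here is a minimal frame-of-reference sketch (not Lance code): decoding value i never requires its neighbors.

```rust
/// Store the minimum once and keep only the offsets from it; in a real encoder
/// the offsets would then be bitpacked to their maximum bit width.
fn for_encode(values: &[i64]) -> (i64, Vec<u64>) {
    let reference = values.iter().copied().min().unwrap_or(0);
    let offsets = values.iter().map(|&v| (v - reference) as u64).collect();
    (reference, offsets)
}

/// O(1) point lookup: value i is just reference + offsets[i].
fn for_decode_one(reference: i64, offsets: &[u64], idx: usize) -> i64 {
    reference + offsets[idx] as i64
}
```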

I have others that are starting to help me on this encodings work and so it will probably happen in parallel with the things I mentioned above. Bitpacking was just opened today: #2333

westonpace (Contributor, Author) commented

However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.

Do you think field metadata will be a good tool for users to specify this configuration? Or do you have any other idea?

niyue (Contributor) commented May 14, 2024

Thanks for the insight.

RLE, FSST, dictionary, and general compression are the next set

Experimenting with general compression is useful in my scenario, especially since it can be applied to all types of data, whether integer, float, or string. This flexibility could make Lance a viable format for my project, even without additional encodings. Currently, we utilize dictionary encoding for low-cardinality fields, and I may explore incorporating dictionary encoding later on. I also experimented with FSST previously, as documented here, but it seems more suited for short strings and has specific application domains.

Do you think field metadata will be a good tool for users to specify this configuration?

Using field metadata to specify configuration seems like a useful approach. In my project, we currently utilize Arrow IPC with multiple record batches to store a portion of the data. We aim to support both point queries and analytical queries that involve scanning large amounts of data. Currently, we chunk a field in an IPC file into multiple record batches, dynamically calculating the chunk size based on the average size of the field. To ensure the file is self-contained, we store the chunk size in the IPC file as custom metadata, which the IPC format natively supports, allowing readers to access the file without additional external metadata. The Lance v2 format appears more flexible, and I'm considering leveraging it to enable multiple fields to have different chunk sizes, thus enhancing the efficiency of randomly accessing these fields. This is particularly crucial as some fields are large, while others are trivial in size.

broccoliSpicy (Contributor) commented May 16, 2024

Regarding sentinel encoding for nulls:
For the boolean datatype, I guess we can choose whatever value is not false or true.
For datatypes like timestamp, Date32, Date64, Time32, Time64, Duration, and Interval, since these types use signed integers underneath and valid values are always non-negative, we can choose a negative value as the sentinel.
But for other datatypes like int, uint, float, etc., how can we pick a sentinel value for them?
Any insights, @westonpace @niyue?

wjones127 (Contributor) commented

For the boolean datatype, I guess we can choose whatever value is not false or true.

Well, boolean is difficult because we usually represent booleans as bits, so there's no value other than 0 or 1.

how can we pick a sentinel value for them?

I think during the encoding process we'll collect statistics for arrays, such as min, max, null count, and distinct count. These will be saved for page skipping, but also used to decide how to encode the page. An easy way to find a sentinel value is max+1 or min-1, if these don't overflow. If that doesn't give a match, we can either scan for an unused value or simply choose a bitmap null encoding.
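
A tiny sketch of that sentinel-picking rule (illustrative only); a fuller version would also consult the distinct count or scan for an unused value before falling back to a bitmap encoding:

```rust
/// Try max+1 first, then min-1; `None` means no cheap sentinel exists and a
/// bitmap (or other) null encoding should be chosen instead.
fn pick_sentinel_i32(min: i32, max: i32) -> Option<i32> {
    max.checked_add(1).or_else(|| min.checked_sub(1))
}
```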

niyue (Contributor) commented May 22, 2024

@westonpace

I have drafted a PR (#2368) to add support for compressing the value page buffer. Could you please review it to see if it fits well? And please let me know if a new issue should be opened for this PR.

As I am relatively new to Lance and Rust, there might be some mistakes in the PR. Please excuse any oversights. I am also uncertain if the current solution is the best fit for Lance. If it isn't, feel free to reject this PR. I am open to suggestions and willing to give it another try if we can figure out a better approach to address this issue. Thanks.

westonpace (Contributor, Author) commented

I've cleaned up the task list a bit, removing completed items and restructuring it. We have a pretty solid set of basic encodings. There are a few "completion tasks" that need to be done to round out the capabilities. At the same time, I have come up with a design for new struct/list encodings that better support random access. I plan to be working on this over the next month or two. I'd appreciate any feedback on the document: https://docs.google.com/document/d/19QNZq7A-797CXt8Z5pCrDEcxcRxEE8J0_sw4goqqgIY/edit?usp=sharing

CC @niyue / @broccoliSpicy who may be interested.

broccoliSpicy (Contributor) commented

so excited to see your ideas on struct/list encodings! @westonpace

broccoliSpicy (Contributor) commented

a few thoughts about the doc:

  1. During integer encoding, if we group integers in blocks of 1024, then when the bit width is >10 bits, it is guaranteed that a sentinel can be found with a constant * 1024 runs of XOR operations.
  2. Is there any reason we don't want to store two copies of the null bitmap? One in another page for the whole column, for full-scan queries, and one in the same page as the page data, for random access. One benefit of doing so is that during compression and decompression we don't need to mask out the extra bit in front of each value or change the data layout to accommodate it.
  3. For structs with only fixed-width fields, ideas like row groups may be applied inside the struct to accelerate queries like select struct.a + 3.14 from table.

broccoliSpicy (Contributor) commented

During integer encoding, if we group integers in blocks of 1024, then when the bit width is >10 bits, it is guaranteed that a sentinel can be found with a constant * 1024 runs of XOR operations.

Sorry, after rethinking this, I don't think it is feasible using only a constant * 1024 runs of XOR operations; there might be many missing numbers.

westonpace (Contributor, Author) commented

Sorry, after rethinking this, I don't think it is feasible using only a constant * 1024 runs of XOR operations; there might be many missing numbers.

No worries. If we come up with a good algorithm at any point we can always plug it in.

Is there any reason we don't want to store two copies of the null bitmap? One in another page for the whole column, for full-scan queries, and one in the same page as the page data, for random access. One benefit of doing so is that during compression and decompression we don't need to mask out the extra bit in front of each value or change the data layout to accommodate it.

Unfortunately, I think we'd need to store two copies of the data, because even if we have the compressed null bitmap, we still need to read the data, and it would have the null bit attached to it.

I do think we might not take the zipped-nulls approach for integer/fp data. For example, if you have integers and you have bitpacking then, in most cases, I expect you will be able to store 1024 integers AND the compressed null bitmap for that block in less than one 4KB disk sector. So I expect zipped nulls will be most useful for larger data types, and the overhead of unzipping should be fairly minor.
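
The arithmetic behind that expectation, written out as a small sketch (assumptions: blocks of 1024 values, an uncompressed 1024-bit null bitmap, and a 4 KiB sector):

```rust
/// Worst-case size of a block of 1024 bitpacked values plus its null bitmap.
/// For int32 data this stays under 4096 bytes whenever the packed width is
/// 31 bits or less, e.g. block_size_bytes(20) == 2688.
fn block_size_bytes(bits_per_value: u32) -> usize {
    let packed_values = 1024 * bits_per_value as usize / 8; // 128 * bits
    let null_bitmap = 1024 / 8;                             // 128 bytes
    packed_values + null_bitmap
}
```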

For structs with only fixed-width fields, ideas like row groups may be applied inside the struct to accelerate queries like select struct.a + 3.14 from table.

Can you expand on this?

broccoliSpicy (Contributor) commented

Sorry for the delayed response; I will find some time to read the doc a few more times and get back to you.

niyue (Contributor) commented Sep 18, 2024

@westonpace
I am currently exploring some special, domain-specific encodings for experimental purposes (which might not be suitable as default options within Lance), and I recall seeing a mention of a custom encodings SDK in a previous blog post about the Lance v2 beta (https://blog.lancedb.com/lance-v2-is-now-in-beta/). I am very interested in the potential of the custom encodings SDK for my experiments. I would like to know if the development of this SDK is still on the roadmap (it was slated for July in the blog post), and whether there have been any recent updates or advancements regarding it. Thanks.

westonpace (Contributor, Author) commented

@niyue Yes, this is still a goal. However, I don't know if I'm going to get to it until closer to the end of the year (which, as you noticed, is well behind the schedule I had originally hoped for). The main reason is that I think we want more confidence in the traits before making an SDK.

Right now, encoders & decoders need to worry about both scheduling AND encoding/decoding. I'm slowly working on splitting the encoders/decoders into three types:

  • Custom indices (e.g. zone maps / pushdown statistics) - these are involved early in the scheduling process to refine the request but don't actually load / decode / schedule any data.
  • Structural encodings (e.g. struct / list / primitive) - these are currently called "field encoders". They are mostly involved in scheduling.
  • Compressive encodings (e.g. gzip / bitpacking / frame of reference / etc) - these are currently called "array encoders" or "page encoders".

Do your custom encodings fit nicely into one of those three categories?

The current implementation is missing "custom indices" and the traits for "structural" and "compressive" encodings are wrong:

  • Compressive encodings are still involved in scheduling and shouldn't be
  • The structural encodings don't allow for "chunking"
  • The structural encodings don't yet handle nullability or many levels of nesting correctly

However, if you wanted to start now, I think that's fine, it will just be more complicated. I can also draft up what the traits should look like if you let me know which of the three categories you are most interested in.
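
To make the three categories a bit more concrete, here is a purely hypothetical sketch of how the responsibilities could be divided; these are not the actual Lance traits, just an illustration of the split described above:

```rust
use std::ops::Range;

/// Custom indices (e.g. zone maps / pushdown statistics): refine the requested
/// row ranges early, without loading or decoding any data themselves.
trait CustomIndex {
    fn refine(&self, requested: Vec<Range<u64>>, filter: &[u8]) -> Vec<Range<u64>>;
}

/// Structural encodings (struct / list / primitive): mostly scheduling, i.e.
/// deciding which byte ranges must be fetched for a set of rows.
trait StructuralEncoding {
    fn schedule(&self, rows: &[Range<u64>]) -> Vec<Range<u64>>;
}

/// Compressive encodings (gzip / bitpacking / frame of reference / ...): pure
/// buffer-to-buffer transforms with no scheduling responsibilities.
trait CompressiveEncoding {
    fn encode(&self, values: &[u8]) -> Vec<u8>;
    fn decode(&self, buffer: &[u8]) -> Vec<u8>;
}
```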

niyue (Contributor) commented Sep 20, 2024

@westonpace thanks so much for the detailed info.
Here are the two encodings I would like to experiment with:

Sorted Index Encoding:

I have a numeric field (specifically a timestamp field) that is sorted. This sorted field can serve as an index for other fields, enabling efficient lookups. The typical access pattern would be to query a time range, use this sorted field to determine the corresponding range of row IDs, and then access the rows within that range. Since this field is part of the dataset, I’d prefer not to duplicate it as a dedicated index file outside of Lance. It would be ideal if Lance's custom encoding could unlock this potential by utilizing the field both as data and as an index. While it seems related to the custom indices category, the fact that this field is embedded in the data suggests there may be a need for specialized loading, decoding, or scheduling.

CLP Encoding:

CLP (Compressed Log Processing) is a domain-specific encoding designed for compressing logs (refer to the OSDI 2021 paper). It breaks a log message like [timestamp] the request finishes in 30ms into several fields, such as a template string field the request finishes in {}ms and a variable numeric field 30. The template and variable fields can then be further encoded using techniques like dictionary encoding and delta encoding.
A key innovation in CLP is that encoded log data can be searched without decompression. For instance, the template string field (when dictionary-encoded) allows a search query to check the dictionary to identify which rows use a particular template—acting somewhat like an inverted index. The template field itself can also be used to reconstruct the log message. CLP encoding requires sophisticated scheduling, and likely falls into the structural encoding category. However, it’s unclear to me yet how well the search-without-decompressing feature can be addressed.
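
A toy sketch of the CLP splitting step (illustrative only, not Lance or CLP code): numeric tokens become variables and the remainder becomes a dictionary-friendly template.

```rust
/// Split a log line into a template with `{}` placeholders plus the extracted
/// numeric variables, e.g. "the request finishes in 30ms" becomes
/// ("the request finishes in {}ms", [30]).
fn clp_split(message: &str) -> (String, Vec<i64>) {
    let mut template = String::new();
    let mut variables = Vec::new();
    for token in message.split_whitespace() {
        // Treat tokens with a leading number (e.g. "30ms") as variables.
        let digits: String = token.chars().take_while(|c| c.is_ascii_digit()).collect();
        if digits.is_empty() {
            template.push_str(token);
        } else {
            variables.push(digits.parse().unwrap());
            template.push_str("{}");
            template.push_str(&token[digits.len()..]);
        }
        template.push(' ');
    }
    (template.trim_end().to_string(), variables)
}
```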

westonpace (Contributor, Author) commented

Both of those sound cool to have.

I have a numeric field (specifically a timestamp field) that is sorted. This sorted field can serve as an index for other fields, enabling efficient lookups.

This should be enabled by "pushdown filtering", which is something I was working on last week in #2913. There are a few challenges I am working through; I have some basic solutions for them, but I don't love them. So it's still rough around the edges even once that PR gets merged. There are probably still some bugs, and we need some end-to-end testing before the feature is ready to be turned on.

One of the challenges is that the Lance package will need to adopt an expression language that it can use for filter expressions. There is https://crates.io/crates/datafusion-expr but I had been avoiding it, as the datafusion package is rather large (in the PR I work around this by making the encoding an "extension" in the lance-encoding-datafusion crate). Inside of lance-encoding the filter expression is just Bytes, and I am using Substrait to encode the filter expression as bytes. This means we can do something like: in Python the user builds a filter with pyarrow compute expressions, we convert it to Substrait, Lance passes it on to the encodings, and the encoding decodes the expression from Substrait into DataFusion and applies it. This integration has a few rough edges at the moment, but the DataFusion project seems pretty committed to supporting Substrait for serialization, so I think it can work.

The "zone maps encoding" works like this:

  • During encoding, we wrap each encoder with something that calculates the statistics of the column (IIRC, we use DataFusion's min/max accumulators to do this) every 10,000 rows (this will be configurable; nothing special about 10k). Once the file is done being written, these statistics are written out (as tiny Lance files) into the main file somewhere, and their location is recorded in the column metadata for the file.
  • During a read, we insert a top-level decoder to process the statistics. It loads these stats (in the future it should cache them, so it should be cheap to do many searches on one open LanceFileReader) and then uses them to refine the range that is being read. So you say something like "read the entire file with this filter" and it translates that into "read these 5 blocks of 10k rows because they are the only 5 blocks that match the filter".
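
A rough sketch of that pruning step (illustrative, not the actual decoder): given per-zone min/max stats and a range filter, keep only the row ranges whose zones can possibly match.

```rust
struct ZoneStats {
    min: i64,
    max: i64,
}

/// Keep the row ranges whose zone statistics overlap a `lo <= value <= hi`
/// filter; everything else is skipped without any data I/O.
fn prune_zones(
    zones: &[ZoneStats],
    rows_per_zone: u64, // e.g. 10_000
    lo: i64,
    hi: i64,
) -> Vec<std::ops::Range<u64>> {
    zones
        .iter()
        .enumerate()
        .filter(|(_, z)| z.max >= lo && z.min <= hi)
        .map(|(i, _)| {
            let start = i as u64 * rows_per_zone;
            start..start + rows_per_zone
        })
        .collect()
}
```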

CLP (Compressed Log Processing) is a domain-specific encoding designed for compressing logs (refer to OSDI 2021 paper). It breaks a log message like [timestamp] the request finishes in 30ms into several fields, such as a template string field the request finishes in {}ms and a variable numeric field 30. The template and variable fields can then be further encoded using techniques like dictionary encoding and delta encoding.

This sounds great! What sorts of searches would you want to satisfy?

allows a search query to check the dictionary to identify which rows use a particular template

Wouldn't a large percentage of rows satisfy this query?
