Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add projection support to the v2 format #2296

Merged
merged 5 commits into from
May 8, 2024

Conversation

westonpace
Copy link
Contributor

@westonpace westonpace commented May 3, 2024

We can simplify the projection a bit more down the line. With this PR you must specify both:

  • The column indices to load
  • What data types you want to load into

This is because the file format doesn't technically have a type system. Each column is just a bunch of data encoded in various ways. So, for example, you could interpret a column as List<u8>, or String, or LargeString, or Binary or LargeBinary or StringView or BinaryView or ...

However, at some point we may want to create a "pick a default column to decode into based on the type" extension point that can be extended by users (this needs to be extensible since encodings are extensible) which picks an appropriate type based on the encoding. This is slightly more complicated than it seems because there can be multiple encodings for a single column and because encodings may be user-defined. This way users will only need to provide the column indices to load.

In the short/medium term this is not too urgent. The Lance table format has a type system and stores the schema and so it can easily figure out both pieces of required information.

@github-actions github-actions bot added the enhancement New feature or request label May 3, 2024
@westonpace
Copy link
Contributor Author

westonpace commented May 3, 2024

Leaving in draft until #2291 is resolved

@westonpace westonpace marked this pull request as draft May 3, 2024 21:24
@westonpace westonpace marked this pull request as ready for review May 6, 2024 15:29
@@ -207,19 +208,50 @@ impl GenericFileReader for FileReader {
}
}

#[derive(Debug, Clone)]
struct V2Reader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. can this move to lance-io?
  2. lets just use a mod v2 { struct Reader } so that V2 will not leave in the codebase forever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. No, lance-io is too low level. It doesn't know about the file readers. I could maybe move it to lance-file but at the moment I'd rather keep it here. This type exists to adapt the v2 reader to GenericFileReader and GenericFileReader is currently "the methods that a lance dataset expects from a file reader" so it makes sense to me that the trait / impl / adapters exist at the dataset level.
  2. Done.

@westonpace westonpace requested a review from eddyxu May 7, 2024 16:11
@westonpace
Copy link
Contributor Author

The test_optimize_index_recall failure does not use the v2 path and so is unrelated

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 81.36483% with 71 lines in your changes are missing coverage. Please review.

Project coverage is 80.68%. Comparing base (0bfa3af) to head (372adbd).

Files Patch % Lines
rust/lance/src/dataset/fragment.rs 62.13% 33 Missing and 6 partials ⚠️
rust/lance-file/src/v2/reader.rs 92.40% 9 Missing and 9 partials ⚠️
.../lance-encoding/src/encodings/logical/primitive.rs 0.00% 7 Missing ⚠️
...ust/lance-encoding/src/encodings/logical/struct.rs 85.71% 3 Missing ⚠️
rust/lance-encoding/src/encoder.rs 0.00% 1 Missing ⚠️
...ust/lance-encoding/src/encodings/logical/binary.rs 0.00% 1 Missing ⚠️
...-encoding/src/encodings/logical/fixed_size_list.rs 0.00% 1 Missing ⚠️
rust/lance-encoding/src/encodings/logical/list.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2296      +/-   ##
==========================================
+ Coverage   80.65%   80.68%   +0.03%     
==========================================
  Files         191      191              
  Lines       56437    56656     +219     
  Branches    56437    56656     +219     
==========================================
+ Hits        45520    45714     +194     
- Misses       8365     8382      +17     
- Partials     2552     2560       +8     
Flag Coverage Δ
unittests 80.68% <81.36%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@westonpace westonpace merged commit 8623007 into lancedb:main May 8, 2024
17 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants