feat: add projection support to the v2 format #2296
rust/lance/src/dataset/fragment.rs
Outdated
@@ -207,19 +208,50 @@ impl GenericFileReader for FileReader {
    }
}

#[derive(Debug, Clone)]
struct V2Reader {
- Can this move to `lance-io`?
- Let's just use a `mod v2 { struct Reader }` so that `V2` will not live in the codebase forever?
- No, `lance-io` is too low level. It doesn't know about the file readers. I could maybe move it to `lance-file`, but at the moment I'd rather keep it here. This type exists to adapt the v2 reader to `GenericFileReader`, and `GenericFileReader` is currently "the methods that a lance dataset expects from a file reader", so it makes sense to me that the trait / impl / adapters exist at the dataset level.
- Done.
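For readers following the thread, here is a minimal sketch of the adapter shape being discussed, using the suggested `mod v2 { struct Reader }` layout. Everything in it is an assumption made for illustration: the `GenericFileReader` trait below is a two-method stand-in (not the real trait in `fragment.rs`), and `LowLevelV2FileReader` is a hypothetical placeholder for the v2 file reader.

```rust
use std::ops::Range;

/// Hypothetical stand-in for "the methods that a dataset expects from a
/// file reader". The real trait in the PR has a different, larger surface.
trait GenericFileReader {
    fn num_rows(&self) -> usize;
    fn read_range(&self, range: Range<usize>) -> Vec<i32>;
}

/// Hypothetical stand-in for a lower-level v2 file reader that knows
/// nothing about datasets or about `GenericFileReader`.
struct LowLevelV2FileReader {
    values: Vec<i32>,
}

impl LowLevelV2FileReader {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn slice(&self, offset: usize, count: usize) -> &[i32] {
        &self.values[offset..offset + count]
    }
}

/// Keeping the version in the module path (`v2::Reader`) rather than in the
/// type name (`V2Reader`) means no `V2` identifier is baked into the type.
mod v2 {
    use super::{GenericFileReader, LowLevelV2FileReader};
    use std::ops::Range;

    /// Adapter: wraps the low-level reader and exposes the dataset-level trait.
    pub struct Reader {
        inner: LowLevelV2FileReader,
    }

    impl Reader {
        pub fn new(inner: LowLevelV2FileReader) -> Self {
            Self { inner }
        }
    }

    impl GenericFileReader for Reader {
        fn num_rows(&self) -> usize {
            self.inner.len()
        }

        fn read_range(&self, range: Range<usize>) -> Vec<i32> {
            // Translate the dataset-level call into the low-level API.
            self.inner
                .slice(range.start, range.end - range.start)
                .to_vec()
        }
    }
}

fn main() {
    let reader = v2::Reader::new(LowLevelV2FileReader {
        values: vec![10, 20, 30, 40],
    });
    // Dataset code only ever sees the trait, not the concrete reader.
    let generic: &dyn GenericFileReader = &reader;
    println!(
        "{} rows, rows 1..3 = {:?}",
        generic.num_rows(),
        generic.read_range(1..3)
    );
}
```

Because the trait and the adapter sit next to each other, the low-level reader stays free of any dataset-level dependency, which is the argument for keeping this at the dataset level.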
The |
… sizes do not line up cleanly
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2296 +/- ##
==========================================
+ Coverage 80.65% 80.68% +0.03%
==========================================
Files 191 191
Lines 56437 56656 +219
Branches 56437 56656 +219
==========================================
+ Hits 45520 45714 +194
- Misses 8365 8382 +17
- Partials 2552 2560 +8
We can simplify the projection a bit more down the line. With this PR you must specify both:
This is because the file format doesn't technically have a type system. Each column is just a bunch of data encoded in various ways. So, for example, you could interpret a column as `List<u8>`, or `String`, or `LargeString`, or `Binary`, or `LargeBinary`, or `StringView`, or `BinaryView`, or ...
However, at some point we may want to create a "pick a default type to decode a column into" extension point, which picks an appropriate type based on the column's encoding. It needs to be extensible by users because encodings are extensible. This is slightly more complicated than it seems, because there can be multiple encodings for a single column and because encodings may be user-defined. With such an extension point, users would only need to provide the column indices to load.
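To make the "no type system" point concrete, here is a rough sketch of one untyped column being decoded into different types on request, plus a toy version of the default-type extension point. All names (`RawColumn`, `DataType`, `Decoded`, `DefaultTypeRegistry`) are hypothetical and not part of the Lance API, and the registry assumes a single named encoding per column, which sidesteps the multiple-encodings complication mentioned above.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the types a caller might ask for.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    ListU8,
    Utf8,
    Binary,
}

/// A column as the file sees it: encoded bytes with no attached type.
struct RawColumn {
    encoding: String,    // name of the encoding (assumed: one per column)
    offsets: Vec<usize>, // value i spans data[offsets[i]..offsets[i + 1]]
    data: Vec<u8>,
}

#[derive(Debug, PartialEq)]
enum Decoded {
    ListU8(Vec<Vec<u8>>),
    Utf8(Vec<String>),
    Binary(Vec<Vec<u8>>),
}

/// The same bytes decode into whichever type the caller asks for.
fn decode(col: &RawColumn, ty: DataType) -> Result<Decoded, std::str::Utf8Error> {
    let values: Vec<&[u8]> = col
        .offsets
        .windows(2)
        .map(|w| &col.data[w[0]..w[1]])
        .collect();
    Ok(match ty {
        DataType::ListU8 => Decoded::ListU8(values.iter().map(|v| v.to_vec()).collect()),
        DataType::Binary => Decoded::Binary(values.iter().map(|v| v.to_vec()).collect()),
        DataType::Utf8 => {
            let mut out = Vec::with_capacity(values.len());
            for v in values {
                // Only this interpretation can fail: the bytes must be valid UTF-8.
                out.push(std::str::from_utf8(v)?.to_string());
            }
            Decoded::Utf8(out)
        }
    })
}

/// Toy extension point: maps an encoding name to a default type.
/// Users with custom encodings would register their own entries.
struct DefaultTypeRegistry {
    defaults: HashMap<String, DataType>,
}

impl DefaultTypeRegistry {
    fn new() -> Self {
        Self { defaults: HashMap::new() }
    }

    fn register(&mut self, encoding: &str, ty: DataType) {
        self.defaults.insert(encoding.to_string(), ty);
    }

    fn pick(&self, col: &RawColumn) -> Option<DataType> {
        self.defaults.get(&col.encoding).copied()
    }
}

fn main() {
    let col = RawColumn {
        encoding: "binary".to_string(),
        offsets: vec![0, 2, 5],
        data: b"hiyou".to_vec(),
    };
    println!("{:?}", decode(&col, DataType::Utf8));
    println!("{:?}", decode(&col, DataType::ListU8));

    let mut registry = DefaultTypeRegistry::new();
    registry.register("binary", DataType::Utf8);
    println!("{:?}", registry.pick(&col));
}
```

With a registry like this, a caller could pass only column indices and let `pick` supply the type; without it, both the index and the type have to come from the caller.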
In the short/medium term this is not too urgent. The Lance table format has a type system and stores the schema and so it can easily figure out both pieces of required information.
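As a sketch of why the table format has an easy time here: given a stored schema, a projection by field name can be resolved into both column indices and target types in one pass. The types below (`Schema`, `Field`, `FileProjection`) and the function `project_by_name` are hypothetical, and the sketch assumes a flat schema where each field maps to exactly one column, which nested types would not satisfy.

```rust
/// Hypothetical subset of logical types stored in a table schema.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    Binary,
}

#[derive(Debug, Clone)]
struct Field {
    name: String,
    data_type: DataType,
}

/// Hypothetical table-level schema: field order matches column order
/// (assumes one column per field, i.e. no nested types).
struct Schema {
    fields: Vec<Field>,
}

/// The two pieces of information the file reader needs for a projection.
#[derive(Debug, PartialEq)]
struct FileProjection {
    column_indices: Vec<usize>,
    data_types: Vec<DataType>,
}

/// Resolve field names into column indices and the types to decode into.
fn project_by_name(schema: &Schema, names: &[&str]) -> Result<FileProjection, String> {
    let mut column_indices = Vec::with_capacity(names.len());
    let mut data_types = Vec::with_capacity(names.len());
    for name in names {
        let (idx, field) = schema
            .fields
            .iter()
            .enumerate()
            .find(|(_, f)| f.name == *name)
            .ok_or_else(|| format!("no field named {name}"))?;
        column_indices.push(idx);
        data_types.push(field.data_type.clone());
    }
    Ok(FileProjection { column_indices, data_types })
}

fn main() {
    let schema = Schema {
        fields: vec![
            Field { name: "id".to_string(), data_type: DataType::Int32 },
            Field { name: "text".to_string(), data_type: DataType::Utf8 },
            Field { name: "blob".to_string(), data_type: DataType::Binary },
        ],
    };
    println!("{:?}", project_by_name(&schema, &["blob", "id"]));
    println!("{:?}", project_by_name(&schema, &["missing"]));
}
```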