
feat: mvp for lance version 0.2 reader / writer #1965

Merged
merged 4 commits into lancedb:main from feat/v2-read-write on Apr 9, 2024

Conversation

westonpace
Contributor

The motivation and bigger picture are covered in more detail in #1929

This PR builds on top of #1918 and #1964 to create a new version of the Lance file format.

There is still much to do, but this end-to-end MVP should provide the overall structure for the work.

It can currently read and write primitive columns and list columns, and it supports some very basic encodings.

Comment on lines 97 to 103
// | Global Buffers                  |
// |  |C| Global Meta Buffer 0*      |
// |      ...                        |
// |      Global Meta Buffer GN*     |
// ├─────────────────────────────────┤
// | Global Buffers Offset Table     |
// |  |D| Global Buffer 0 Position*  |
// |      Global Buffer 0 Size       |
// |      ...                        |
// |      Global Buffer GN Position  |
// |      Global Buffer GN Size      |
Contributor

My initial thought here is that it's unclear why this is preferable to just having a protobuf message for the global data. Protobuf (or some other serialization framework) gives us an easy way to describe messages and handle changes to them. I worry we are taking on some engineering overhead for adding new types of metadata to the format.
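
Below is a minimal, hypothetical sketch (not the actual Lance implementation) of how a reader might parse the fixed-width global buffer offset table laid out above, assuming each entry is a little-endian u64 position followed by a u64 size. The struct and function names are invented for illustration.

struct GlobalBufferEntry {
    position: u64,
    size: u64,
}

// Parses `num_global_buffers` (position, size) pairs of little-endian u64s
// from the bytes of the offset table. The 16-byte entry width is an assumption.
fn parse_global_buffer_offsets(bytes: &[u8], num_global_buffers: usize) -> Vec<GlobalBufferEntry> {
    bytes
        .chunks_exact(16)
        .take(num_global_buffers)
        .map(|chunk| GlobalBufferEntry {
            position: u64::from_le_bytes(chunk[0..8].try_into().unwrap()),
            size: u64::from_le_bytes(chunk[8..16].try_into().unwrap()),
        })
        .collect()
}

A table like this can be sliced and indexed straight out of the footer without a deserialization framework, which is presumably the benefit being weighed against protobuf's easier message evolution.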

@westonpace force-pushed the feat/v2-read-write branch 6 times, most recently from cc32490 to d6eb25f on April 4, 2024 at 21:01
codecov-commenter commented Apr 4, 2024

Codecov Report

Attention: Patch coverage is 76.30662%, with 136 lines in your changes missing coverage. Please review.

Project coverage is 81.08%. Comparing base (8df64a6) to head (ea495d2).
Report is 3 commits behind head on main.

Files                              Patch %   Lines
rust/lance-file/src/v2/reader.rs   77.86%    55 Missing and 32 partials ⚠️
rust/lance-file/src/v2/writer.rs   72.92%    9 Missing and 40 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1965      +/-   ##
==========================================
+ Coverage   81.02%   81.08%   +0.05%     
==========================================
  Files         179      181       +2     
  Lines       50933    51357     +424     
  Branches    50933    51357     +424     
==========================================
+ Hits        41271    41645     +374     
+ Misses       7277     7263      -14     
- Partials     2385     2449      +64     
Flag Coverage Δ
unittests 81.08% <76.30%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.


Comment on lines 46 to 48
// We don't use this today because we get here by subtracting backward from global_buf_start
#[allow(dead_code)]
column_meta_offsets_start: u64,
Contributor

Is there a reason not to use this? Is there any risk this becomes a bug later?

Contributor Author

I can update this comment. We will need this if/when we want to support "true column projection" where we don't load any of the metadata for columns we aren't interested in.

I'm not sure we will want this anytime soon, because:

  • If you are not using cached metadata then you are adding at least one extra IOP (first read the footer, then read the column metadata offsets table, then read the column metadata of interest; this last step is only one IOP if you coalesce, which is non-trivial).
  • If you are using cached metadata then your metadata caching gets trickier (what happens if you have cached metadata for the file for columns X and Z and the user now wants column Y?). You would need to cache on a per-column basis, and I wanted to avoid that complexity for now.
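
For context, here is a hypothetical sketch of the "subtracting backward from global_buf_start" derivation the quoted comment refers to; the 16-byte entry size (a u64 position plus a u64 size per column) and the table's placement are assumptions for illustration, not necessarily the real on-disk layout.

// Assumption: the column metadata offsets table sits immediately before the
// global buffers section and has one fixed-width entry per column.
const COLUMN_META_OFFSET_ENTRY_SIZE: u64 = 16; // assumed: u64 position + u64 size

// Derives the table's start position rather than reading the stored
// `column_meta_offsets_start` field, which is why that field is dead code today.
fn derived_column_meta_offsets_start(global_buf_start: u64, num_columns: u64) -> u64 {
    global_buf_start - num_columns * COLUMN_META_OFFSET_ENTRY_SIZE
}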

let (tx, rx) = mpsc::unbounded_channel();

let scheduler = self.scheduler.clone() as Arc<dyn EncodingsIo>;
// FIXME: spawn this, change this method to sync
Contributor

What's your philosophy here on what should be sync and what should be async?

Contributor Author

Decode happens by polling the returned stream. This gives a Rust user "what they expect" when it comes to parallelism / readahead. If the user does not call buffered then there is one decode thread and no "batch parallelism". If the user calls buffered then there are N decode threads / batch parallelism.

Right now the decode won't start until the scheduling has completely finished, which is not ideal but not terrible (scheduling is pretty fast). It'll be more of an issue once there is a chance of backpressure being needed (e.g. the decoder falls behind I/O).

If this method is async then it means some kind of "potentially slow" task is happening before the decode thread starts, and I can't think of any valid reason for that.

Also, the schedule_range method itself is roughly synchronous. Ideally it would be entirely synchronous. Scheduling should never have to pause and wait for I/O (we go to great lengths to avoid this in the list scheduler), because any pause here runs the risk of starving our I/O parallelism.

I haven't yet concluded whether I can safely make schedule_range a synchronous method, though. My only remaining concern is backpressure: the scheduler may need to pause if it gets too far ahead of the decoder. On the other hand, I don't think this is very important, because the size of "all scheduled tasks" should be roughly equivalent to the size of the file metadata (i.e. not too large), so backpressure is more of a concern for the I/O scheduler than it is for the scheduling thread.
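
For illustration, a minimal, self-contained sketch (using only the futures crate, not the actual Lance reader API) of the buffered behavior described above: each stream item is a future standing in for the decode of one batch, and buffered(4) keeps up to four of them in flight while preserving output order.

use futures::executor::block_on;
use futures::stream::{self, StreamExt};

fn main() {
    let decoded: Vec<u32> = block_on(
        stream::iter(0..8u32)
            // Stand-in for the per-batch decode work the stream would do.
            .map(|batch_id| async move { batch_id * 2 })
            // Without this, the caller polls one decode future at a time;
            // with it, up to 4 are driven concurrently ("batch parallelism").
            .buffered(4)
            .collect(),
    );
    assert_eq!(decoded, vec![0, 2, 4, 6, 8, 10, 12, 14]);
}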

@westonpace merged commit 3ac0074 into lancedb:main on Apr 9, 2024
17 checks passed
westonpace added a commit that referenced this pull request Apr 10, 2024
chebbyChefNEQ pushed a commit that referenced this pull request Apr 10, 2024
3 participants