
feat: mvp for lance version 0.2 reader / writer #1965

Merged
merged 4 commits into lancedb:main from feat/v2-read-write on Apr 9, 2024

Conversation

westonpace
Contributor

The motivation and bigger picture are covered in more detail in #1929

This PR builds on top of #1918 and #1964 to create a new version of the Lance file format.

There is still much to do, but this end-to-end MVP should provide the overall structure for the work.

It can currently read and write primitive columns and list columns, and it supports some very basic encodings.

Comment on lines 97 to 103
// | Global Buffers                  |
// |  |C| Global Meta Buffer 0*      |
// |      ...                        |
// |      Global Meta Buffer GN*     |
// ├─────────────────────────────────┤
// | Global Buffers Offset Table     |
// |  |D| Global Buffer 0 Position*  |
// |      Global Buffer 0 Size       |
// |      ...                        |
// |      Global Buffer GN Position  |
// |      Global Buffer GN Size      |
Contributor

My initial thought here is that it's unclear why this is preferable to just having a protobuf message for the global data. Protobuf (or some other serialization framework) gives us an easy way to describe messages and handle changes to them. I worry we are taking on some engineering overhead for adding new types of metadata to the format.
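
Below is a minimal, hypothetical sketch (not the actual Lance implementation) of how a reader might parse the fixed-width global buffer offset table laid out above, assuming each entry is a little-endian u64 position followed by a u64 size. The struct and function names are invented for illustration.

struct GlobalBufferEntry {
    position: u64,
    size: u64,
}

// Parses `num_global_buffers` (position, size) pairs of little-endian u64s
// from the bytes of the offset table. The 16-byte entry width is an assumption.
fn parse_global_buffer_offsets(bytes: &[u8], num_global_buffers: usize) -> Vec<GlobalBufferEntry> {
    bytes
        .chunks_exact(16)
        .take(num_global_buffers)
        .map(|chunk| GlobalBufferEntry {
            position: u64::from_le_bytes(chunk[0..8].try_into().unwrap()),
            size: u64::from_le_bytes(chunk[8..16].try_into().unwrap()),
        })
        .collect()
}

A table like this can be sliced and indexed straight out of the footer without a deserialization framework, which is presumably the benefit being weighed against protobuf's easier message evolution.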

@westonpace force-pushed the feat/v2-read-write branch 6 times, most recently from cc32490 to d6eb25f on April 4, 2024 at 21:01
codecov-commenter commented Apr 4, 2024

Codecov Report

Attention: Patch coverage is 76.30662%, with 136 lines in your changes missing coverage. Please review.

Project coverage is 81.08%. Comparing base (8df64a6) to head (ea495d2).
Report is 3 commits behind head on main.

Files                              Patch %   Lines
rust/lance-file/src/v2/reader.rs   77.86%    55 Missing and 32 partials ⚠️
rust/lance-file/src/v2/writer.rs   72.92%    9 Missing and 40 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1965      +/-   ##
==========================================
+ Coverage   81.02%   81.08%   +0.05%     
==========================================
  Files         179      181       +2     
  Lines       50933    51357     +424     
  Branches    50933    51357     +424     
==========================================
+ Hits        41271    41645     +374     
+ Misses       7277     7263      -14     
- Partials     2385     2449      +64     
Flag Coverage Δ
unittests 81.08% <76.30%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.


Comment on lines 46 to 48
// We don't use this today because we get here by subtracting backward from global_buf_start
#[allow(dead_code)]
column_meta_offsets_start: u64,
Contributor

Is there a reason not to use this? Is there any risk this becomes a bug later?

Contributor Author

I can update this comment. We will need this if/when we want to support "true column projection" where we don't load any of the metadata for columns we aren't interested in.

I'm not sure we will want this anytime soon, because:

  • If you are not using cached metadata then you are adding at least one extra IOP (first read the footer, then read the column metadata offsets table, then read the column metadata of interest; this last step is only one IOP if you coalesce, which is non-trivial).
  • If you are using cached metadata then your metadata caching gets trickier (what happens if you have cached metadata for the file for columns X and Z and the user now wants column Y?). You would need to cache on a per-column basis, and I wanted to avoid that complexity for now.
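
For context, here is a hypothetical sketch of the "subtracting backward from global_buf_start" derivation the quoted comment refers to; the 16-byte entry size (a u64 position plus a u64 size per column) and the table's placement are assumptions for illustration, not necessarily the real on-disk layout.

// Assumption: the column metadata offsets table sits immediately before the
// global buffers section and has one fixed-width entry per column.
const COLUMN_META_OFFSET_ENTRY_SIZE: u64 = 16; // assumed: u64 position + u64 size

// Derives the table's start position rather than reading the stored
// `column_meta_offsets_start` field, which is why that field is dead code today.
fn derived_column_meta_offsets_start(global_buf_start: u64, num_columns: u64) -> u64 {
    global_buf_start - num_columns * COLUMN_META_OFFSET_ENTRY_SIZE
}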

let (tx, rx) = mpsc::unbounded_channel();

let scheduler = self.scheduler.clone() as Arc<dyn EncodingsIo>;
// FIXME: spawn this, change this method to sync
Contributor

What's your philosophy here on what should be sync and what should be async?

Contributor Author

Decode happens by polling the returned stream. This gives a Rust user "what they expect" when it comes to parallelism / readahead. If the user does not call buffered then there is one decode thread and no "batch parallelism". If the user calls buffered then there are N decode threads / batch parallelism.

Right now the decode won't start until the scheduling has completely finished, which is not ideal but not terrible (scheduling is pretty fast). It'll be more of an issue once there is a chance of backpressure being needed (e.g. the decoder falls behind I/O).

If this method is async then it means some kind of "potentially slow" task is happening before the decode thread starts, and I can't think of any valid reason for that.

Also, the schedule_range method itself is roughly synchronous. Ideally it would be entirely synchronous. Scheduling should never have to pause and wait for I/O (we go to great lengths to avoid this in the list scheduler), because any pause here runs the risk of starving our I/O parallelism.

I haven't yet concluded whether I can safely make schedule_range a synchronous method, though. My only remaining concern is backpressure: the scheduler may need to pause if it gets too far ahead of the decoder. On the other hand, I don't think this is very important, because the size of "all scheduled tasks" should be roughly equivalent to the size of the file metadata (i.e. not too large), so backpressure is more of a concern for the I/O scheduler than it is for the scheduling thread.
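
For illustration, a minimal, self-contained sketch (using only the futures crate, not the actual Lance reader API) of the buffered behavior described above: each stream item is a future standing in for the decode of one batch, and buffered(4) keeps up to four of them in flight while preserving output order.

use futures::executor::block_on;
use futures::stream::{self, StreamExt};

fn main() {
    let decoded: Vec<u32> = block_on(
        stream::iter(0..8u32)
            // Stand-in for the per-batch decode work the stream would do.
            .map(|batch_id| async move { batch_id * 2 })
            // Without this, the caller polls one decode future at a time;
            // with it, up to 4 are driven concurrently ("batch parallelism").
            .buffered(4)
            .collect(),
    );
    assert_eq!(decoded, vec![0, 2, 4, 6, 8, 10, 12, 14]);
}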

@westonpace merged commit 3ac0074 into lancedb:main on Apr 9, 2024
17 checks passed
westonpace added a commit that referenced this pull request Apr 10, 2024
chebbyChefNEQ pushed a commit that referenced this pull request Apr 10, 2024
3 participants