-
Notifications
You must be signed in to change notification settings - Fork 181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: rework decoding to fix bugs in nested struct decoding #2337
fix: rework decoding to fix bugs in nested struct decoding #2337
Conversation
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2337 +/- ##
==========================================
- Coverage 80.80% 80.77% -0.04%
==========================================
Files 192 192
Lines 56189 56343 +154
Branches 56189 56343 +154
==========================================
+ Hits 45403 45509 +106
- Misses 8117 8169 +52
+ Partials 2669 2665 -4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
… waiting for scheduled decoders and waiting for IO into two different phases. See PR for more details.
f7ed387
to
7698a10
Compare
current_top_level_row - top_level_row | ||
); | ||
context | ||
.sink |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we have a similar method on context like emit
to send a ScanLine to the sink?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of intentionally hid this because encodings shouldn't be able to send scan lines. The struct encoding is a "special" encoding because we will always have one struct encoding that lives at the top of the decoder tree (even if there is only one column). In the next PR I make this even more obvious. So, it is somewhat intentional here that the struct encoding has to peek into the private impl of the scheduler context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [lance](https://togithub.com/lancedb/lance) | dependencies | minor | `0.10.16` -> `0.12.0` | --- ### Release Notes <details> <summary>lancedb/lance (lance)</summary> ### [`v0.12.1`](https://togithub.com/lancedb/lance/releases/tag/v0.12.1) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.12.0...v0.12.1) <!-- Release notes generated using configuration in .github/release.yml at v0.12.1 --> #### What's Changed ##### Bug Fixes 🐛 - fix: incorrect chunking was making lance datasets use too much RAM by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2438 **Full Changelog**: lancedb/lance@v0.12.0...v0.12.1 ### [`v0.12.0`](https://togithub.com/lancedb/lance/releases/tag/v0.12.0) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.11.1...v0.12.0) <!-- Release notes generated using configuration in .github/release.yml at v0.12.0 --> #### What's Changed ##### Breaking Changes 🛠 - feat: change dataset uri to return full qualified url instead of object store path by [@​eddyxu](https://togithub.com/eddyxu) in [lancedb/lance#2416 ##### New Features 🎉 - feat: new shuffler by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2404 - feat: new index builder by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2401 - feat: stable row id manifest changes by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2363 - feat: once a table has been created with v1 or v2 format then it should always use that format by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2435 ##### Bug Fixes 🐛 - fix: fix file writer which was not writing page buffers in the correct order by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2413 ##### Other Changes - refactor: refactor logical decoders into "field decoders" by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2407 - refactor: rename use_experimental_writer to use_legacy_format by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2433 - refactor: minor refactor to allow I/O scheduler to be cloned in page schedulers by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2432 **Full Changelog**: lancedb/lance@v0.11.1...v0.12.0 ### [`v0.11.1`](https://togithub.com/lancedb/lance/releases/tag/v0.11.1) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.11.0...v0.11.1) <!-- Release notes generated using configuration in .github/release.yml at v0.11.1 --> #### What's Changed ##### New Features 🎉 - feat(java): support jdk8 by [@​LuQQiu](https://togithub.com/LuQQiu) in [lancedb/lance#2362 - feat: support kmode with hamming distance by [@​eddyxu](https://togithub.com/eddyxu) in [lancedb/lance#2366 - feat: row id index structures (experimental) by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2303 - feat: update merge_insert to add statistics for inserted, updated, deleted rows by [@​raunaks13](https://togithub.com/raunaks13) in [lancedb/lance#2357 - feat: define Flat index as a scan over VectorStorage by [@​chebbyChefNEQ](https://togithub.com/chebbyChefNEQ) in [lancedb/lance#2380 - feat: add some schema utility methods to the v2 reader/writer by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2389 - feat: general compression for value page buffer by [@​niyue](https://togithub.com/niyue) in [lancedb/lance#2368 - feat: make the index cache size (in bytes) available by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2381 - feat: add special uri scheme to use CloudFileReader for local fs by [@​chebbyChefNEQ](https://togithub.com/chebbyChefNEQ) in [lancedb/lance#2402 - feat: add encoder utilities for pushdown by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2388 ##### Bug Fixes 🐛 - fix: concat batches before writing to avoid small IO slow down by [@​chebbyChefNEQ](https://togithub.com/chebbyChefNEQ) in [lancedb/lance#2384 - fix: low recall if the num partitions is more than num rows by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2386 - fix: f32 reduce_min for x86 by [@​heiher](https://togithub.com/heiher) in [lancedb/lance#2385 - fix: fix incorrect validation logic in updater by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2408 ##### Performance Improvements 🚀 - perf: make VectorStorage and DistCalculator static to generate better code by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2355 - perf: optimize IO path for reading manifest by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2396 ##### Other Changes - refactor: make proto conversion fallible and not copy by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2371 - refactor: separate take and schema evolution impls to own files by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2372 - Revert "fix: concat batches before writing to avoid small IO slow down ([#​2384](https://togithub.com/lancedb/lance/issues/2384))" by [@​chebbyChefNEQ](https://togithub.com/chebbyChefNEQ) in [lancedb/lance#2387 - refactor: shuffle around v2 metadata sections to allow read-on-demand statistics by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2400 #### New Contributors - [@​niyue](https://togithub.com/niyue) made their first contribution in [lancedb/lance#2368 - [@​heiher](https://togithub.com/heiher) made their first contribution in [lancedb/lance#2385 **Full Changelog**: lancedb/lance@v0.11.0...v0.11.1 ### [`v0.11.0`](https://togithub.com/lancedb/lance/releases/tag/v0.11.0) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.10.18...v0.11.0) <!-- Release notes generated using configuration in .github/release.yml at v0.11.0 --> #### What's Changed ##### Breaking Changes 🛠 - feat(rust)!: use BoxedError in Error::IO by [@​broccoliSpicy](https://togithub.com/broccoliSpicy) in [lancedb/lance#2329 ##### New Features 🎉 - feat: add v2 support to fragment merge / update paths by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2311 - feat: add priority to I/O scheduler by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2315 - feat: add take_rows operation to the v2 file reader's python bindings by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2331 - feat: added example for reading and writing dataset in rust by [@​raunaks13](https://togithub.com/raunaks13) in [lancedb/lance#2349 - feat: new HNSW implementation by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2353 - feat: add fragment take / fixed-size-binary support to v2 format by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2354 ##### Bug Fixes 🐛 - fix: recognize a simple expression like 'is_foo' as a scalar index query by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2356 - fix: rework list encoder to handle list-struct by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2344 - fix: minor bug fixes for v2 by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2361 ##### Documentation 📚 - docs: clearify comments in table.proto -> message DataFragment -> physical_rows by [@​broccoliSpicy](https://togithub.com/broccoliSpicy) in [lancedb/lance#2346 ##### Performance Improvements 🚀 - perf: use the file metadata cache in scalar indices by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2330 ##### Other Changes - chore: remove `m_max` and `use_heuristic` params from HNSW builder by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2336 - fix(java): fix JNI jar loader issue by [@​LuQQiu](https://togithub.com/LuQQiu) in [lancedb/lance#2340 - ci: fix labeler permissions by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2348 - fix: rework decoding to fix bugs in nested struct decoding by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2337 #### New Contributors - [@​broccoliSpicy](https://togithub.com/broccoliSpicy) made their first contribution in [lancedb/lance#2346 - [@​raunaks13](https://togithub.com/raunaks13) made their first contribution in [lancedb/lance#2349 **Full Changelog**: lancedb/lance@v0.10.18...v0.11.0 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/spiraldb/vortex). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zOTMuMCIsInVwZGF0ZWRJblZlciI6IjM3LjM5My4wIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCIsImxhYmVscyI6W119--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [lance](https://togithub.com/lancedb/lance) | dependencies | minor | `0.10.16` -> `0.12.0` | --- ### Release Notes <details> <summary>lancedb/lance (lance)</summary> ### [`v0.12.1`](https://togithub.com/lancedb/lance/releases/tag/v0.12.1) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.12.0...v0.12.1) <!-- Release notes generated using configuration in .github/release.yml at v0.12.1 --> #### What's Changed ##### Bug Fixes 🐛 - fix: incorrect chunking was making lance datasets use too much RAM by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2438 **Full Changelog**: lancedb/lance@v0.12.0...v0.12.1 ### [`v0.12.0`](https://togithub.com/lancedb/lance/releases/tag/v0.12.0) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.11.1...v0.12.0) <!-- Release notes generated using configuration in .github/release.yml at v0.12.0 --> #### What's Changed ##### Breaking Changes 🛠 - feat: change dataset uri to return full qualified url instead of object store path by [@​eddyxu](https://togithub.com/eddyxu) in [lancedb/lance#2416 ##### New Features 🎉 - feat: new shuffler by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2404 - feat: new index builder by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2401 - feat: stable row id manifest changes by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2363 - feat: once a table has been created with v1 or v2 format then it should always use that format by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2435 ##### Bug Fixes 🐛 - fix: fix file writer which was not writing page buffers in the correct order by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2413 ##### Other Changes - refactor: refactor logical decoders into "field decoders" by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2407 - refactor: rename use_experimental_writer to use_legacy_format by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2433 - refactor: minor refactor to allow I/O scheduler to be cloned in page schedulers by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2432 **Full Changelog**: lancedb/lance@v0.11.1...v0.12.0 ### [`v0.11.1`](https://togithub.com/lancedb/lance/releases/tag/v0.11.1) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.11.0...v0.11.1) <!-- Release notes generated using configuration in .github/release.yml at v0.11.1 --> #### What's Changed ##### New Features 🎉 - feat(java): support jdk8 by [@​LuQQiu](https://togithub.com/LuQQiu) in [lancedb/lance#2362 - feat: support kmode with hamming distance by [@​eddyxu](https://togithub.com/eddyxu) in [lancedb/lance#2366 - feat: row id index structures (experimental) by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2303 - feat: update merge_insert to add statistics for inserted, updated, deleted rows by [@​raunaks13](https://togithub.com/raunaks13) in [lancedb/lance#2357 - feat: define Flat index as a scan over VectorStorage by [@​chebbyChefNEQ](https://togithub.com/chebbyChefNEQ) in [lancedb/lance#2380 - feat: add some schema utility methods to the v2 reader/writer by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2389 - feat: general compression for value page buffer by [@​niyue](https://togithub.com/niyue) in [lancedb/lance#2368 - feat: make the index cache size (in bytes) available by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2381 - feat: add special uri scheme to use CloudFileReader for local fs by [@​chebbyChefNEQ](https://togithub.com/chebbyChefNEQ) in [lancedb/lance#2402 - feat: add encoder utilities for pushdown by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2388 ##### Bug Fixes 🐛 - fix: concat batches before writing to avoid small IO slow down by [@​chebbyChefNEQ](https://togithub.com/chebbyChefNEQ) in [lancedb/lance#2384 - fix: low recall if the num partitions is more than num rows by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2386 - fix: f32 reduce_min for x86 by [@​heiher](https://togithub.com/heiher) in [lancedb/lance#2385 - fix: fix incorrect validation logic in updater by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2408 ##### Performance Improvements 🚀 - perf: make VectorStorage and DistCalculator static to generate better code by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2355 - perf: optimize IO path for reading manifest by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2396 ##### Other Changes - refactor: make proto conversion fallible and not copy by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2371 - refactor: separate take and schema evolution impls to own files by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2372 - Revert "fix: concat batches before writing to avoid small IO slow down ([#​2384](https://togithub.com/lancedb/lance/issues/2384))" by [@​chebbyChefNEQ](https://togithub.com/chebbyChefNEQ) in [lancedb/lance#2387 - refactor: shuffle around v2 metadata sections to allow read-on-demand statistics by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2400 #### New Contributors - [@​niyue](https://togithub.com/niyue) made their first contribution in [lancedb/lance#2368 - [@​heiher](https://togithub.com/heiher) made their first contribution in [lancedb/lance#2385 **Full Changelog**: lancedb/lance@v0.11.0...v0.11.1 ### [`v0.11.0`](https://togithub.com/lancedb/lance/releases/tag/v0.11.0) [Compare Source](https://togithub.com/lancedb/lance/compare/v0.10.18...v0.11.0) <!-- Release notes generated using configuration in .github/release.yml at v0.11.0 --> #### What's Changed ##### Breaking Changes 🛠 - feat(rust)!: use BoxedError in Error::IO by [@​broccoliSpicy](https://togithub.com/broccoliSpicy) in [lancedb/lance#2329 ##### New Features 🎉 - feat: add v2 support to fragment merge / update paths by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2311 - feat: add priority to I/O scheduler by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2315 - feat: add take_rows operation to the v2 file reader's python bindings by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2331 - feat: added example for reading and writing dataset in rust by [@​raunaks13](https://togithub.com/raunaks13) in [lancedb/lance#2349 - feat: new HNSW implementation by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2353 - feat: add fragment take / fixed-size-binary support to v2 format by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2354 ##### Bug Fixes 🐛 - fix: recognize a simple expression like 'is_foo' as a scalar index query by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2356 - fix: rework list encoder to handle list-struct by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2344 - fix: minor bug fixes for v2 by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2361 ##### Documentation 📚 - docs: clearify comments in table.proto -> message DataFragment -> physical_rows by [@​broccoliSpicy](https://togithub.com/broccoliSpicy) in [lancedb/lance#2346 ##### Performance Improvements 🚀 - perf: use the file metadata cache in scalar indices by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2330 ##### Other Changes - chore: remove `m_max` and `use_heuristic` params from HNSW builder by [@​BubbleCal](https://togithub.com/BubbleCal) in [lancedb/lance#2336 - fix(java): fix JNI jar loader issue by [@​LuQQiu](https://togithub.com/LuQQiu) in [lancedb/lance#2340 - ci: fix labeler permissions by [@​wjones127](https://togithub.com/wjones127) in [lancedb/lance#2348 - fix: rework decoding to fix bugs in nested struct decoding by [@​westonpace](https://togithub.com/westonpace) in [lancedb/lance#2337 #### New Contributors - [@​broccoliSpicy](https://togithub.com/broccoliSpicy) made their first contribution in [lancedb/lance#2346 - [@​raunaks13](https://togithub.com/raunaks13) made their first contribution in [lancedb/lance#2349 **Full Changelog**: lancedb/lance@v0.10.18...v0.11.0 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/spiraldb/vortex). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zOTMuMCIsInVwZGF0ZWRJblZlciI6IjM3LjM5My4wIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCIsImxhYmVscyI6W119--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This PR modifies both scheduling and the decode stream. These modifications should mainly affect the logical encodings (list / struct). The changes to the other encodings are fairly minimal.
In the scheduler we now track where we are in the field tree. This is both for debugging purposes (so we can log in trace messages the current path) and because we now need to send the path as part of the message we send to the decode stream.
The decoder changes are more significant. Previously, we combined waiting for I/O and waiting for additional encoders into the same phase. This logic was more complex, but more importantly, it also assumed that the decoder could recreate the order in which the scheduler scheduled pages. For example, if we scan through the columns and encounter one that needs more data, then we grab exactly one page, and continue the column scan, grabbing the next page when we come by in another pass. This works in many cases but doesn't work in others. For example, the scheduler might schedule the many pages in a row for the same column if that column is wide or the page size is small. The decoder would get out of sync in these cases.
Now, the logic is simpler, and more correct. The decoder first waits for enough scheduling to be done that a batch can be delivered. During this pass we drain encoders from the scheduler and insert them into the current decoders not by "current position in the decode" but by the path that is now included with the decoder.
Once enough scheduling has been done we then wait for I/O.
Once I/O is done we then drain the decoders in the same fashion we did before.