fix: use physical scan stream for update #6741

Merged
wjones127 merged 1 commit into lance-format:main from wojiaodoubao:fix-update on May 13, 2026

@wojiaodoubao wojiaodoubao commented May 12, 2026

Update rewrites affected rows using the dataset physical schema. Avoid wrapping the scan output in DatasetRecordBatchStream, which may convert internal JSON columns from lance.json/LargeBinary to arrow.json/Utf8 for user-facing reads and cause schema mismatches during rewrite. Add coverage for updating both regular and JSON columns.
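
To make the failure mode concrete, here is a minimal standard-library sketch of the mechanism described above. It is not Lance code: `Field`, `to_user_facing`, and `rewrite_rows` are hypothetical stand-ins for a schema field, the conversion applied on user-facing reads, and the schema check in the rewrite step. Only the type and extension names come from the error in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical, simplified schema field: name, storage type, extension name."""
    name: str
    data_type: str
    extension: Optional[str] = None


def to_user_facing(schema: List[Field]) -> List[Field]:
    """Stand-in for the user-facing read conversion (assumed behavior):
    lance.json/LargeBinary columns are presented as arrow.json/Utf8."""
    return [
        Field(f.name, "Utf8", "arrow.json") if f.extension == "lance.json" else f
        for f in schema
    ]


def rewrite_rows(dataset_schema: List[Field], stream_schema: List[Field]) -> str:
    """Stand-in for the rewrite step, which expects the physical schema."""
    if stream_schema != dataset_schema:
        raise ValueError(
            f"Expected schema {dataset_schema} but got {stream_schema}"
        )
    return "ok"


physical = [
    Field("name", "Utf8"),
    Field("files", "LargeBinary", "lance.json"),
    Field("speaker_id", "Utf8"),
]

# Feeding the converted (user-facing) stream into the rewrite fails.
try:
    wrapped_result = rewrite_rows(physical, to_user_facing(physical))
except ValueError:
    wrapped_result = "schema mismatch"

# Feeding the physical stream directly passes the check.
physical_result = rewrite_rows(physical, physical)

print(wrapped_result, physical_result)
```

In this model, regular columns pass through the conversion unchanged, so the mismatch only appears once the dataset contains a JSON column.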

Error message

Traceback (most recent call last):
  File "/Users/bytedance/Project/emr/jinglun-lance-hello-python/main.py", line 298, in <module>
    chunsheng_debug()
    ~~~~~~~~~~~~~~~^^
  File "/Users/bytedance/Project/emr/jinglun-lance-hello-python/main.py", line 291, in chunsheng_debug
    ds.update({'speaker_id': '"SPEAKER_9172"'}, where="name='沈逸'")
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/miniconda3/lib/python3.13/site-packages/lance/dataset.py", line 2577, in update
    return self._ds.update(updates, where, conflict_retries, retry_timeout)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: Encountered internal error. Please file a bug report at https://github.com/lance-format/lance/issues. Expected schema Schema { fields: [Field { name: "id", data_type: Int64, nullable: true }, Field { name: "users_id", data_type: Int64, nullable: true }, Field { name: "user_id", data_type: Utf8, nullable: true }, Field { name: "name", data_type: Utf8, nullable: true }, Field { name: "files", data_type: LargeBinary, nullable: true, metadata: {"ARROW:extension:metadata": "", "ARROW:extension:name": "lance.json"} }, Field { name: "user_tags", data_type: LargeBinary, nullable: true, metadata: {"ARROW:extension:name": "lance.json", "ARROW:extension:metadata": ""} }, Field { name: "output_text", data_type: Utf8, nullable: true }, Field { name: "ai_tags", data_type: LargeBinary, nullable: true, metadata: {"ARROW:extension:name": "lance.json", "ARROW:extension:metadata": ""} }, Field { name: "created_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "updated_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "del_flag", data_type: Int32, nullable: true }, Field { name: "speaker_id", data_type: Utf8, nullable: true }], metadata: {} } but got Schema { fields: [Field { name: "id", data_type: Int64, nullable: true }, Field { name: "users_id", data_type: Int64, nullable: true }, Field { name: "user_id", data_type: Utf8, nullable: true }, Field { name: "name", data_type: Utf8, nullable: true }, Field { name: "files", data_type: Utf8, nullable: true, metadata: {"ARROW:extension:metadata": "", "ARROW:extension:name": "arrow.json"} }, Field { name: "user_tags", data_type: Utf8, nullable: true, metadata: {"ARROW:extension:name": "arrow.json", "ARROW:extension:metadata": ""} }, Field { name: "output_text", data_type: Utf8, nullable: true }, Field { name: "ai_tags", data_type: Utf8, nullable: true, metadata: {"ARROW:extension:name": "arrow.json", "ARROW:extension:metadata": ""} }, 
Field { name: "created_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "updated_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "del_flag", data_type: Int32, nullable: true }, Field { name: "speaker_id", data_type: Utf8, nullable: true }], metadata: {} }, /Users/runner/work/lance/lance/rust/lance/src/dataset/write/update.rs:274:24

Process finished with exit code 1
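
The two schemas in the message are long, so the differing fields are easy to miss. The sketch below transcribes the field names, types, and extension names from the error above into plain tuples and diffs them. `diff_fields` is a hypothetical helper written for this illustration, not part of Lance.

```python
# Field layout transcribed from the "Expected schema" part of the error above:
# (name, data_type, extension name or None).
TS = 'Timestamp(Microsecond, Some("Asia/Shanghai"))'
expected = [
    ("id", "Int64", None),
    ("users_id", "Int64", None),
    ("user_id", "Utf8", None),
    ("name", "Utf8", None),
    ("files", "LargeBinary", "lance.json"),
    ("user_tags", "LargeBinary", "lance.json"),
    ("output_text", "Utf8", None),
    ("ai_tags", "LargeBinary", "lance.json"),
    ("created_time", TS, None),
    ("updated_time", TS, None),
    ("del_flag", "Int32", None),
    ("speaker_id", "Utf8", None),
]

# The "but got" schema in the error is identical except that the JSON columns
# arrive as Utf8 with the arrow.json extension name.
got = [
    (name, "Utf8", "arrow.json") if ext == "lance.json" else (name, dtype, ext)
    for name, dtype, ext in expected
]


def diff_fields(expected, got):
    """Hypothetical helper: list (name, expected, actual) for fields that differ."""
    return [(e[0], e[1:], g[1:]) for e, g in zip(expected, got) if e != g]


mismatches = diff_fields(expected, got)
for name, want, have in mismatches:
    print(f"{name}: expected {want}, got {have}")
```

Only `files`, `user_tags`, and `ai_tags` differ, and all three are the JSON columns. The updated column, `speaker_id`, is a plain Utf8 column, so the failure is independent of which column the update targets.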

Closes #6329

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 12, 2026
@wojiaodoubao
Contributor Author

I ran into an error when updating a table that has JSON-typed columns. Hi @Xuanwo @majin1102, could you help review this when you have time? Thanks very much.

codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 98.57143% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/dataset/write/update.rs | 98.41% | 0 Missing and 1 partial ⚠️ |

@wjones127
Contributor

Does this solve #6329?

Contributor

@wjones127 wjones127 left a comment

Looks good! Thank you!

@wjones127 wjones127 merged commit f5ebba1 into lance-format:main May 13, 2026
30 checks passed
Development

Successfully merging this pull request may close these issues.

Bug(rust): dataset.update() fails with schema mismatch on datasets containing pa.json_() columns
