Hi, I have a Lance dataset that was written with pylance 2.0. I read its contents and wrote them to a new dataset with pylance 6.0. The write completes, but reading the new dataset back with pylance 6.0 crashes.
The original 2.0 dataset has a nested struct schema. The crash only happens when we set data_storage_version="2.2". With the default data_storage_version, everything works fine.
Here is the code to reproduce and the crash log. The Lance dataset is also attached:
internal_road_maneuver.lance.zip
#!/usr/bin/env python3
import argparse
import csv
import json as pyjson
import lance
import logging
import os
import sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    stream=sys.stdout,
)
logger = logging.getLogger(__name__)


def main():
    # Read the dataset that was written with pylance 2.0.
    input_dir = "~/Downloads/lance_2_0/internal_road_maneuver.lance"
    ds = lance.dataset(input_dir)
    table = ds.to_table()

    # Rewrite it with pylance 6.0 using storage version 2.2.
    output_dir = "~/Downloads/lance_6_0/internal_road_maneuver.lance"
    lance.write_dataset(
        table,
        output_dir,
        mode="overwrite",
        data_storage_version="2.2",
    )


if __name__ == "__main__":
    main()
Traceback (most recent call last):
  File "/workspace/yi.guo@xiaopeng.com/fm_deploy/xpilot_vision/ai_foundation/projects/e2e/e2e/data/datalake/lance_version_test.py", line 42, in <module>
    main()
  File "/workspace/yi.guo@xiaopeng.com/fm_deploy/xpilot_vision/ai_foundation/projects/e2e/e2e/data/datalake/lance_version_test.py", line 35, in main
    table = ds.to_table()
            ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lance/dataset.py", line 1342, in to_table
    ).to_table()
      ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lance/dataset.py", line 5824, in to_table
    return self.to_reader().read_all()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/ipc.pxi", line 762, in pyarrow.lib.RecordBatchReader.read_all
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: External error: Encountered internal error. Please file a bug report at https://github.com/lance-format/lance/issues. Error decoding batch: Invalid user input: Invalid argument error: Incorrect array length for StructArray field "remaining_dist", expected 363 got 209, /home/runner/work/lance/lance/rust/lance-encoding/src/encodings/logical/struct.rs:394:26, /home/runner/work/lance/lance/rust/lance-encoding/src/decoder.rs:2704:26