pylance 6.0: reading nested struct data written with data storage version 2.2 crashes #6776

@GUOYI1

Hi, I have a Lance dataset that was written with pylance 2.0. I read its contents and wrote them to a new dataset with pylance 6.0. The write completes, but reading the new dataset back with pylance 6.0 crashes.

The original 2.0 dataset has a nested struct schema. The crash only happens when we set data_storage_version="2.2". With the default data_storage_version, everything works fine.

Here is the code to reproduce the problem, followed by the crash log. The Lance dataset is also attached:

internal_road_maneuver.lance.zip

#!/usr/bin/env python3
import argparse
import csv
import json as pyjson
import lance
import logging
import os
import sys


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    stream=sys.stdout,
)
logger = logging.getLogger(__name__)


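def main():
    # Expand "~" explicitly; it is not guaranteed to be resolved for us.
    input_dir = os.path.expanduser("~/Downloads/lance_2_0/internal_road_maneuver.lance")
    # Read the dataset originally written with pylance 2.0.
    ds = lance.dataset(input_dir)
    table = ds.to_table()
    output_dir = os.path.expanduser("~/Downloads/lance_6_0/internal_road_maneuver.lance")
    # Rewrite with storage version 2.2; the default version does not trigger the crash.
    lance.write_dataset(
        table,
        output_dir,
        mode="overwrite",
        data_storage_version="2.2",
    )


if __name__ == "__main__":
    main()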

Traceback (most recent call last):
  File "/workspace/yi.guo@xiaopeng.com/fm_deploy/xpilot_vision/ai_foundation/projects/e2e/e2e/data/datalake/lance_version_test.py", line 42, in <module>
    main()
  File "/workspace/yi.guo@xiaopeng.com/fm_deploy/xpilot_vision/ai_foundation/projects/e2e/e2e/data/datalake/lance_version_test.py", line 35, in main
    table = ds.to_table()
            ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lance/dataset.py", line 1342, in to_table
    ).to_table()
      ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lance/dataset.py", line 5824, in to_table
    return self.to_reader().read_all()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/ipc.pxi", line 762, in pyarrow.lib.RecordBatchReader.read_all
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: External error: Encountered internal error. Please file a bug report at https://github.com/lance-format/lance/issues. Error decoding batch: Invalid user input: Invalid argument error: Incorrect array length for StructArray field "remaining_dist", expected 363 got 209, /home/runner/work/lance/lance/rust/lance-encoding/src/encodings/logical/struct.rs:394:26, /home/runner/work/lance/lance/rust/lance-encoding/src/decoder.rs:2704:26
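
For anyone reading the log: the message says the decoded child array "remaining_dist" has 209 entries while its parent struct expects 363. Arrow requires every child of a struct array to have the same length as the struct itself. Below is a minimal pure-Python sketch of that invariant. It does not use Lance or pyarrow, `StructColumn` and `find_length_mismatches` are hypothetical names made up for illustration, and the `maneuver_type` field is invented; only the field name `remaining_dist` and the two lengths come from the log.

```python
from dataclasses import dataclass, field


@dataclass
class StructColumn:
    """Toy stand-in for a struct array: a row count plus named child columns."""
    length: int
    children: dict = field(default_factory=dict)


def find_length_mismatches(column, path=""):
    """Return (field_path, expected, got) for each child whose length differs from its parent."""
    mismatches = []
    for name, child in column.children.items():
        child_path = f"{path}.{name}" if path else name
        # A child is either a nested StructColumn or a plain list of values.
        child_len = child.length if isinstance(child, StructColumn) else len(child)
        if child_len != column.length:
            mismatches.append((child_path, column.length, child_len))
        if isinstance(child, StructColumn):
            # Recurse so that mismatches deeper in the nesting are reported too.
            mismatches.extend(find_length_mismatches(child, child_path))
    return mismatches


# Hypothetical data shaped like the error: the parent has 363 rows,
# but one child was decoded with only 209 values.
maneuver = StructColumn(
    length=363,
    children={
        "remaining_dist": [0.0] * 209,  # lengths taken from the crash log
        "maneuver_type": [0] * 363,     # invented field, for contrast
    },
)
print(find_length_mismatches(maneuver))
```

This only illustrates the check that fails. It says nothing about why the 2.2 decoder produces a short child array for this dataset.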
