Offset overflow errors can be confusing for users #2775

@westonpace

When using binary or string columns, a single batch cannot contain more than 2GiB of data, because these types address their values with 32-bit offsets. Users will either need to use large_binary and large_string (which use 64-bit offsets) or make sure to set a custom batch size when reading this data.
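
As a rough illustration of the batch size workaround, here is a minimal sketch of how one might pick a row count that keeps a string/binary column under the 32-bit offset limit. The helper name `max_safe_batch_size` and the `safety_factor` parameter are hypothetical and not part of Lance or Arrow; the average value size is something the user would have to estimate for their own data.

```python
# Largest byte offset representable by the 32-bit offsets of string/binary arrays.
INT32_MAX = 2**31 - 1


def max_safe_batch_size(avg_value_bytes: int, safety_factor: float = 0.5) -> int:
    """Hypothetical helper: rows per batch that keep one column under the limit.

    avg_value_bytes is the caller's estimate of the average value size.
    safety_factor leaves headroom for values larger than the average.
    """
    if avg_value_bytes <= 0:
        raise ValueError("avg_value_bytes must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    budget = int(INT32_MAX * safety_factor)
    # Always allow at least one row per batch.
    return max(1, budget // avg_value_bytes)


if __name__ == "__main__":
    # Example: values of roughly 1 MiB each.
    print(max_safe_batch_size(1024 * 1024))
```

The result would then be passed as the batch size when reading. If individual values vary a lot in size, a smaller `safety_factor` is the safer choice, since the limit applies to the actual total bytes in the batch and not to the average.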

However, the error they run into, an "offset overflow" error, is a panic (not great) and very confusing. It is not obvious that the solution is to reduce the batch size:

thread 'lance_background_thread' panicked at .../arrow-data-52.2.0/src/transform/utils.rs:42:56:
offset overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at .../rust/lance-encoding/src/decoder.rs:1267:65:
called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(12814), ...)

Ideally we should be returning an Err here (not panic) and the message should say something like "Could not create array with more than 2GiB of string/binary data. Please try reducing the batch_size."
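
To make the proposed behavior concrete, here is a small language-agnostic sketch, written in Python, of checking the running offset while an array is assembled and raising a descriptive error instead of crashing. This is not the Lance or arrow-rs implementation; `build_offsets` and `OffsetOverflowError` are hypothetical names, and the real fix would return an `Err` from the Rust decoder.

```python
# Largest byte offset representable by the 32-bit offsets of string/binary arrays.
INT32_MAX = 2**31 - 1


class OffsetOverflowError(ValueError):
    """Hypothetical error type standing in for a returned Err."""


def build_offsets(lengths):
    """Build the offsets for values with the given byte lengths.

    Raises OffsetOverflowError with an actionable message as soon as the
    running total no longer fits in a 32-bit offset.
    """
    offsets = [0]
    total = 0
    for length in lengths:
        total += length
        if total > INT32_MAX:
            raise OffsetOverflowError(
                "Could not create array with more than 2GiB of string/binary "
                "data. Please try reducing the batch_size."
            )
        offsets.append(total)
    return offsets


if __name__ == "__main__":
    print(build_offsets([3, 0, 5]))
    try:
        build_offsets([2**30, 2**30])
    except OffsetOverflowError as err:
        print(f"error: {err}")
```

The point of the sketch is only that the overflow is detectable at the moment the offset is computed, so the caller can be told what to change.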
