Support u128 bitpacking for decimal128 columns #6857

@LuciferYang

Background

The miniblock encoder picks an inline bitpacking width by walking {8, 16, 32, 64} and choosing the narrowest one that fits the column's value range. Anything wider (i.e. decimal128) falls through to raw 128-bit storage today, even when the actual values comfortably fit in 32 or 64 bits. In practice, real-world decimal128 columns almost never use the full 128-bit range: money, prices, taxes, and accounting figures all sit in narrow ranges, so the decimal128 portion of a dataset is typically stored at a much wider bit-width than its values need.

Impact

I measured this on TPC-DS SF=100 store_sales (288 M rows, 23 columns including 12 × decimal128(7,2)):

| | Without u128 bitpacking | With u128 bitpacking |
|---|---|---|
| On-disk size | 34 GiB | 15.873 GiB |
| Bytes per row | ~127 | ~59 |

That is a ~53 % reduction, coming entirely from the decimal128 columns, with schema, row count, and file format version (v2.1) all unchanged. This isn't a TPC-DS quirk: decimal(7,2) only needs ~24 bits of actual range, and that pattern repeats across most real decimal columns I've seen.
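To make the ~24-bit figure concrete, here is a minimal sketch of the arithmetic. The helper names `max_unscaled` and `bits_needed` are hypothetical and not part of Lance. It also assumes the unscaled values are non-negative, so only the magnitude is counted.

```rust
/// Largest unscaled magnitude for a decimal with `precision` digits: 10^precision - 1.
/// (Hypothetical helper for illustration; valid for precision <= 38.)
fn max_unscaled(precision: u32) -> u128 {
    10u128.pow(precision) - 1
}

/// Bits needed to hold `v` as an unsigned integer (at least 1 bit, so 0 still occupies a slot).
fn bits_needed(v: u128) -> u32 {
    (128 - v.leading_zeros()).max(1)
}
```

For decimal(7,2), `max_unscaled(7)` is 9,999,999, which lies between 2^23 and 2^24, so `bits_needed` returns 24. The scale (the `2` in `(7,2)`) does not affect the stored integer's range; only the precision does.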

Proposal

Extend the miniblock bitpacking chooser to also consider bits = 128 and route that case through a scalar BitPacking kernel for u128. The picked width is still chosen the same way (narrowest that fits), so a pathological column that genuinely uses the full 128-bit range stays at 128 bits and pays nothing extra.
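As a rough illustration of what a scalar u128 pack/unpack pair could look like, here is a deliberately simple bit-at-a-time sketch. It is not the proposed kernel and not the layout Lance uses. The function names, the little-endian bit order, and the assumption of non-negative values are all mine.

```rust
/// Narrowest width (in bits) that fits every value; at least 1.
/// Assumes values are already non-negative and viewed as u128.
fn required_width(values: &[u128]) -> u32 {
    values
        .iter()
        .map(|v| 128 - v.leading_zeros())
        .max()
        .unwrap_or(0)
        .max(1)
}

/// Pack the low `width` bits of each value into a contiguous bit stream
/// (least-significant bit first). Scalar and unoptimized on purpose.
fn pack_u128(values: &[u128], width: u32) -> Vec<u8> {
    assert!((1..=128).contains(&width));
    let total_bits = values.len() * width as usize;
    let mut out = vec![0u8; (total_bits + 7) / 8];
    let mut bit_pos = 0usize;
    for &v in values {
        for i in 0..width {
            if (v >> i) & 1 == 1 {
                out[bit_pos / 8] |= 1u8 << (bit_pos % 8);
            }
            bit_pos += 1;
        }
    }
    out
}

/// Inverse of `pack_u128`: read `count` values of `width` bits each.
fn unpack_u128(packed: &[u8], width: u32, count: usize) -> Vec<u128> {
    assert!((1..=128).contains(&width));
    let mut out = Vec::with_capacity(count);
    let mut bit_pos = 0usize;
    for _ in 0..count {
        let mut v = 0u128;
        for i in 0..width {
            let bit = (packed[bit_pos / 8] >> (bit_pos % 8)) & 1u8;
            v |= (bit as u128) << i;
            bit_pos += 1;
        }
        out.push(v);
    }
    out
}
```

With four decimal(7,2)-style values whose maximum is 9,999,999, `required_width` yields 24 and the packed buffer is 12 bytes instead of 64. A column containing `u128::MAX` yields a width of 128, so packing it saves nothing, which matches the "stays at 128 bits" case described above.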

Scope of the change is narrow:

  • rust/compression/bitpacking/ — add the scalar u128 kernel.
  • rust/lance-encoding/src/encodings/physical/bitpacking.rs — wire u128 through the miniblock encode/decode path.
  • rust/lance-encoding/src/compression.rs — extend the chooser to match bits ∈ {8, 16, 32, 64, 128}.
  • rust/lance-encoding/src/statistics.rs — minor stat-related plumbing.

No new public API and no on-wire format change beyond a bit-width that v2.1 readers already accept (the encoder just didn't previously emit it). There is also no FastLanes-transposed kernel for u128 in this change: it is scalar only for now, and the transposed kernel is a natural follow-up.

Limitations

  • Decimal128-only. Other types are unchanged.
  • Range-dependent. Columns whose values genuinely span the full 128-bit range won't compress and fall through to raw 128-bit storage, just like today.
