feat: support bf16 from pytorch dataset by eddyxu · Pull Request #6342 · lance-format/lance

eddyxu · 2026-03-30T15:54:48Z

Summary

Support round-tripping bf16 data with PyTorch.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-30T15:56:34Z

PR Review

P0 Bug: `uint64_as_int64` conversion broken for numeric types

The insertion of the `elif _is_bfloat16_type(arr.type):` branch has accidentally moved the `uint64_as_int64` guard into the wrong branch.

Before (base):

        elif (
            pa.types.is_integer(arr.type)
            or pa.types.is_floating(arr.type)
            or pa.types.is_boolean(arr.type)
        ):
            tensor = torch.from_numpy(arr.to_numpy(zero_copy_only=False))

            if uint64_as_int64 and tensor.dtype == torch.uint64:  # ← inside numeric branch ✓
                tensor = tensor.to(torch.int64)

After (this PR):

        elif (
            pa.types.is_integer(arr.type)
            ...
        ):
            tensor = torch.from_numpy(arr.to_numpy(zero_copy_only=False))
        elif _is_bfloat16_type(arr.type):
            tensor = _bf16_to_tensor(arr)

            if uint64_as_int64 and tensor.dtype == torch.uint64:  # ← now inside bf16 branch ✗
                tensor = tensor.to(torch.int64)

The `uint64_as_int64` conversion no longer runs for numeric uint64 columns (a regression), and it is dead code inside the bf16 branch (bf16 tensors will never have `torch.uint64` dtype).

Fix: Move the `if uint64_as_int64` block back into (or after) the numeric `elif`, e.g. by dedenting it to run after all branches, or by keeping it inside the numeric branch:

        ):
            tensor = torch.from_numpy(arr.to_numpy(zero_copy_only=False))

            if uint64_as_int64 and tensor.dtype == torch.uint64:
                tensor = tensor.to(torch.int64)
        elif _is_bfloat16_type(arr.type):
            tensor = _bf16_to_tensor(arr)
        elif hf_converter is not None:
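
The effect of the guard's placement can be reproduced with a toy model of the branch structure. This is only a sketch: the string tags stand in for the real `pa.types.*` checks and tensor dtypes, and `convert_buggy` / `convert_fixed` are hypothetical names, not functions from the PR.

```python
# Toy model of the if/elif chain. String tags replace real Arrow types and
# torch dtypes; both function names are hypothetical illustrations.
NUMERIC_KINDS = ("uint64", "int32", "float32", "bool")


def convert_buggy(kind, uint64_as_int64=True):
    """Mirrors the 'After (this PR)' layout: the guard sits in the bf16 branch."""
    if kind in NUMERIC_KINDS:
        dtype = kind
    elif kind == "bfloat16":
        dtype = "bfloat16"
        # Dead code: dtype is always "bfloat16" here, never "uint64".
        if uint64_as_int64 and dtype == "uint64":
            dtype = "int64"
    else:
        raise TypeError(f"unsupported kind: {kind}")
    return dtype


def convert_fixed(kind, uint64_as_int64=True):
    """Mirrors the suggested fix: the guard stays in the numeric branch."""
    if kind in NUMERIC_KINDS:
        dtype = kind
        if uint64_as_int64 and dtype == "uint64":
            dtype = "int64"
    elif kind == "bfloat16":
        dtype = "bfloat16"
    else:
        raise TypeError(f"unsupported kind: {kind}")
    return dtype


print(convert_buggy("uint64"))  # uint64 (conversion silently skipped)
print(convert_fixed("uint64"))  # int64
```

A regression test along these lines (one uint64 column, `uint64_as_int64=True`) would likely have caught the misplaced guard.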

Minor: No null handling in `_bf16_to_tensor`

If a bf16 array contains nulls, `_bf16_to_tensor` will silently produce garbage values for null slots. This is consistent with the existing numeric path (`to_numpy(zero_copy_only=False)` also fills nulls with 0), so not blocking, but worth a brief doc comment noting the assumption.

Overall the approach (reinterpreting uint16 storage bytes as bfloat16) is sound and the test coverage for the happy path is good.
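
For readers unfamiliar with why the reinterpretation works: a bfloat16 value has the same bit layout as the upper 16 bits of an IEEE 754 float32. The sketch below shows that relationship with the standard library only. The helper names are hypothetical, and the PR itself does the reinterpretation with a torch dtype view rather than this code.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Widen a 16-bit bfloat16 pattern to float32 by appending 16 zero bits."""
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("expected a 16-bit pattern")
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def float_to_bf16_bits(value: float) -> int:
    """Keep the upper 16 bits of the float32 encoding (truncation, no rounding)."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


print(bf16_bits_to_float(0x3F80))       # 1.0
print(bf16_bits_to_float(0xC000))       # -2.0
print(hex(float_to_bf16_bits(1.0)))     # 0x3f80
```

Because no bits are changed, only relabelled, storing the 16-bit patterns as uint16 and viewing them as bfloat16 on load is lossless.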

westonpace

Just some nits

westonpace · 2026-03-31T16:20:34Z

+    Null values are replaced with NaN.
+    """
+    storage = arr.storage if isinstance(arr.type, pa.ExtensionType) else arr
+    buf = storage.buffers()[1]


Should we do a sanity check that the data type of storage is a 16-bit type at this point?

westonpace · 2026-03-31T16:22:43Z

+    buf = storage.buffers()[1]
+    offset = storage.offset * 2  # 2 bytes per bf16 value
+    np_uint16 = np.frombuffer(buf, dtype=np.uint16, count=len(storage), offset=offset)
+    tensor = torch.from_numpy(np_uint16.copy()).view(torch.bfloat16)


Is the copy here so that the resulting buffer can be mutable?
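
The mutability question can be illustrated without Arrow or NumPy. The sketch below assumes the data buffer behaves like immutable `bytes` and uses a `memoryview` as a stand-in for the zero-copy `np.frombuffer` view; `offset_values` and `length` are hypothetical stand-ins for `storage.offset` and `len(storage)`.

```python
import array

# Stand-in for an immutable data buffer holding five 16-bit values.
raw = array.array("H", [10, 11, 12, 13, 14]).tobytes()

offset_values = 2                 # hypothetical, plays the role of storage.offset
length = 3                        # hypothetical, plays the role of len(storage)
byte_offset = offset_values * 2   # 2 bytes per 16-bit value

# Zero-copy view: shares memory with `raw` and inherits its read-only flag.
view = memoryview(raw)[byte_offset : byte_offset + length * 2].cast("H")
print(view.tolist(), view.readonly)  # [12, 13, 14] True

# Copy: owns its own memory, so in-place writes are allowed.
copied = array.array("H", view.tolist())
copied[0] = 99
print(copied.tolist())  # [99, 13, 14]
```

Under that assumption, a copy is what makes in-place edits safe: writing through `view` raises `TypeError`, while writing to `copied` leaves the original buffer untouched.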

westonpace · 2026-03-31T16:23:12Z

+    np_uint16 = np.frombuffer(buf, dtype=np.uint16, count=len(storage), offset=offset)
+    tensor = torch.from_numpy(np_uint16.copy()).view(torch.bfloat16)
+    if arr.null_count > 0:
+        null_mask = torch.from_numpy(arr.is_null().to_numpy(zero_copy_only=False))


Seems like there should be a way to do this without a copy but maybe not.

Asked Claude / Codex to double-check; the copy is now avoided opportunistically.
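
The docstring in the diff says null values are replaced with NaN. A minimal sketch of that masking step on raw 16-bit patterns follows; it assumes a boolean null mask like the one `arr.is_null()` produces, and `mask_nulls`, `decode`, and `BF16_NAN_BITS` are hypothetical names, not the PR's code.

```python
import math
import struct

# One quiet-NaN bit pattern in bfloat16: exponent all ones, mantissa non-zero.
BF16_NAN_BITS = 0x7FC0


def mask_nulls(bits, is_null):
    """Return a new list in which every null slot holds the bf16 NaN pattern."""
    if len(bits) != len(is_null):
        raise ValueError("bits and is_null must have the same length")
    return [BF16_NAN_BITS if null else b for b, null in zip(bits, is_null)]


def decode(bits):
    """Widen a bf16 bit pattern to a Python float via float32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


# The middle slot is null; its stored bits (0x1234) are arbitrary garbage.
masked = mask_nulls([0x3F80, 0x1234, 0x4000], [False, True, False])
decoded = [decode(b) for b in masked]
print(decoded[0], math.isnan(decoded[1]), decoded[2])  # 1.0 True 2.0
```

Writing NaN into null slots needs a writable buffer, which is one reason a no-copy path can only be taken opportunistically, for example when there are no nulls.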

westonpace · 2026-03-31T16:23:46Z

-            if uint64_as_int64 and tensor.dtype == torch.uint64:
+            if (
+                uint64_as_int64 and tensor.dtype == torch.uint64
+            ):  # ← inside numeric branch ✓


Kind of a strange comment. I'm not really sure what it means.

github-actions Bot added the enhancement (New feature or request) and A-python (Python bindings) labels Mar 30, 2026

eddyxu requested a review from westonpace March 30, 2026 16:22

westonpace approved these changes Mar 31, 2026

github-actions Bot added the A-java (Java bindings + JNI) label Mar 31, 2026

eddyxu added 2 commits April 1, 2026 12:15

support bf16 from torch

60e8da4

m;

0d48a7c

eddyxu force-pushed the lei/torch_bf16 branch from ffffd9b to 0d48a7c Compare April 1, 2026 19:15

eddyxu and others added 4 commits April 1, 2026 12:57

only opportunitically not copy data

d3996c9

reduce number of memory cpies

0c155a6

fix ruff

0123dea

Merge branch 'main' into lei/torch_bf16

595e7d4

eddyxu merged commit 21d830a into main Apr 1, 2026
12 checks passed

eddyxu deleted the lei/torch_bf16 branch April 1, 2026 20:41

eddyxu merged 6 commits into main from lei/torch_bf16
