
Implement Starfix byte-level specification with type canonicalization #6

Closed
eywalker wants to merge 7 commits into main from claude/fix-logical-hashing-tests-Z52tP

Conversation

eywalker (Contributor) commented Mar 5, 2026

Summary

This PR implements the complete Starfix byte-level hashing specification as documented in docs/byte-layout-spec.md. The changes ensure deterministic, language-agnostic hashing of Apache Arrow schemas and record batches by precisely specifying every byte fed into SHA-256.

Key Changes

  • Type Canonicalization: Implement logical type equivalence classes:

    • Binary and LargeBinary both canonicalize to "LargeBinary"
    • Utf8 and LargeUtf8 both canonicalize to "LargeUtf8"
    • List and LargeList both canonicalize to {"LargeList": ...}
    • Dictionary arrays are resolved to their plain value type before hashing
  • Schema Serialization: Generate canonical JSON with:

    • Field names sorted alphabetically (via BTreeMap)
    • Nested object keys sorted recursively
    • Struct fields serialized with name, data_type, and nullable keys
    • List element types serialized without Arrow-internal field names
  • Field Digest Finalization: Implement proper nullable field serialization:

    • Bit count as usize little-endian (8 bytes)
    • Validity words as usize big-endian (8 bytes per word)
    • Data digest as SHA-256 finalized (32 bytes)
  • Dictionary Array Handling: Cast dictionary arrays to their value type before hashing to ensure consistent results regardless of encoding

  • Comprehensive Test Suite: Add 10 worked examples (example_a through example_j) that manually compute expected SHA-256 hashes and verify library conformance:

    • Two-column tables with mixed nullable/non-nullable fields
    • Boolean arrays with bit-packing (Msb0 for data, Lsb0 for validity)
    • Fixed-size and variable-length types
    • Column-order independence
    • Type equivalence verification
    • Multi-batch streaming
    • Empty tables
  • Schema Validation: Update record batch schema validation to compare canonical serializations rather than exact object equality, allowing batches with reordered columns to be accepted
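The equivalence classes above can be sketched as a small recursive canonicalization function. This is a minimal illustration in Python; the function name and JSON-like type shapes are assumptions for the sketch, not the crate's actual API:

```python
def canonicalize(dtype):
    """Map equivalent Arrow logical types onto one canonical form (sketch)."""
    if dtype in ("Binary", "LargeBinary"):
        return "LargeBinary"
    if dtype in ("Utf8", "LargeUtf8"):
        return "LargeUtf8"
    if isinstance(dtype, dict):
        # List and LargeList collapse to LargeList, recursing on the element type
        for key in ("List", "LargeList"):
            if key in dtype:
                return {"LargeList": canonicalize(dtype[key])}
    return dtype

print(canonicalize({"List": "Utf8"}))  # {'LargeList': 'LargeUtf8'}
```

Because canonicalization runs before any bytes are hashed, a `Utf8` column and a `LargeUtf8` column with the same values produce the same digest.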

Notable Implementation Details

  • Validity BitVec uses Lsb0 (least significant bit first) for storage and Msb0 (most significant bit first) for Boolean data packing
  • All length prefixes for variable-length types use u64 little-endian (8 bytes), regardless of Arrow's offset type
  • Null elements in nullable fields are skipped entirely—no bytes are fed to the data digest
  • Fields are always processed in alphabetical order by path (e.g., address/city before address/zip)
  • Output format is 35 bytes: 3-byte version prefix (0x00 0x00 0x01) + 32-byte SHA-256 digest
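The finalization and framing bullets above can be sketched byte-for-byte with only the standard library. Helper names here are illustrative; the authoritative layout is docs/byte-layout-spec.md:

```python
import hashlib

def finalize_field(bit_count, validity_words, data_digest):
    """bit count (8-byte LE) || validity words (8-byte BE each) || data digest (32 bytes)."""
    buf = bit_count.to_bytes(8, "little")
    for word in validity_words:
        buf += word.to_bytes(8, "big")
    return buf + data_digest

def frame_output(digest):
    """35-byte output: version prefix 0x00 0x00 0x01 + 32-byte SHA-256 digest."""
    return b"\x00\x00\x01" + digest

field_bytes = finalize_field(3, [0b101], hashlib.sha256(b"example data").digest())
out = frame_output(hashlib.sha256(field_bytes).digest())
assert len(out) == 35 and out[:3] == b"\x00\x00\x01"
```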

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX

claude added 2 commits March 5, 2026 12:05
Address all 7 design-spec issues to make starfix produce identical
hashes for logically equivalent Arrow tables regardless of column order,
struct field order, encoding, or type variant.

Core implementation changes (src/arrow_digester_core.rs):
- Issue 1: Sort struct fields alphabetically in data_type_to_value
- Issue 2: Apply sort_json_value recursively for deterministic JSON
- Issue 3: Use u64 (not usize) for binary length prefixes
- Issue 4: Remove NULL_BYTES sentinel from binary/string nullable paths
- Issue 5: Canonicalize Binary→LargeBinary, Utf8→LargeUtf8, List→LargeList
- Issue 6: Resolve dictionary arrays to plain arrays before hashing
- Issue 7: Use logical schema comparison in update() (canonical serialization)

Also improved schema JSON format for cross-language stability by dropping
Arrow-internal field names (e.g. "item") from List element serialization.

All 13 previously-ignored tests now pass. Updated golden hash values and
golden schema JSON to reflect the new canonical serialization.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
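The deterministic-JSON fixes (Issues 1 and 2 above) can be mimicked in Python, where sorting the field list by name plus `json.dumps(sort_keys=True)` sorts object keys at every nesting level, analogous to applying `sort_json_value` recursively. The schema value below is illustrative:

```python
import json

schema = {
    "fields": [
        {"nullable": True, "name": "b", "data_type": "Int32"},
        {"data_type": "LargeUtf8", "name": "a", "nullable": False},
    ]
}
# Sort fields alphabetically by name, then sort object keys recursively.
canonical = json.dumps(
    {"fields": sorted(schema["fields"], key=lambda f: f["name"])},
    sort_keys=True,
    separators=(",", ":"),
)
print(canonical)
```

Two schemas that differ only in field order or key order serialize to the same string, which is what lets the canonical serialization double as a logical equality check.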
Add docs/byte-layout-spec.md describing the exact byte-level serialization
for schema JSON, fixed-size types, booleans, variable-length types, lists,
validity bitmaps, and the final combining digest. Every byte fed into
SHA-256 is specified, making cross-language reimplementation possible.

Add 10 verification tests in tests/digest_bytes.rs that manually construct
the expected SHA-256 hash from raw bytes and assert equality with the
library output. Covers:
- Example A: two-column record batch (Int32 + nullable LargeUtf8)
- Example B: boolean array with nulls (Msb0 bit packing)
- Example C: non-nullable Int32 array
- Example D: binary array with type canonicalization (Binary→LargeBinary)
- Example E: column-order independence proof
- Example F: Utf8/LargeUtf8 type equivalence proof
- Example G: nullable Int32 with nulls
- Example H: nullable string array with nulls and type canonicalization
- Example I: empty table (schema only, no data)
- Example J: multi-batch streaming equals single combined batch

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
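Example B's bit packing hinges on two different bit orders: Msb0 for Boolean data, Lsb0 for validity bitmaps. A stdlib-only sketch of the distinction (the helper names are ours, not the library's):

```python
def pack_msb0(bits):
    """Pack booleans most-significant-bit first (Boolean data packing)."""
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)

def pack_lsb0(bits):
    """Pack booleans least-significant-bit first (validity bitmap storage)."""
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

assert pack_msb0([1, 0, 1]) == b"\xa0"  # 1010_0000
assert pack_lsb0([1, 0, 1]) == b"\x05"  # 0000_0101
```

The same three booleans yield different bytes under the two orders, which is why the spec has to pin each one down explicitly.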
@eywalker eywalker requested review from Synicix and Copilot March 5, 2026 16:43

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

claude added 5 commits March 5, 2026 17:28
…t examples

Implement DataType::Struct in array_digest_update for composite hashing
of struct arrays (previously todo!()). Struct children are sorted
alphabetically, each gets an independent digest that is finalized into
the parent's data stream. Struct-level nulls propagate to children via
combined validity buffers to avoid hashing undefined data.

Add finalize_child_into_data helper for writing child digest bytes into
a parent's data stream. Add four new manual verification tests (Examples
K-N) covering struct columns in record batches, hash_array on structs
with and without nulls, and list-of-struct columns. Update byte-layout
spec with corresponding worked examples and updated Section 3.5.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
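A minimal sketch of the composite scheme this commit describes: each struct child maintains an independent digest, and the finalized child digests are written into the parent's data stream in alphabetical field order. Names (`child_digests`, `parent`) and inputs are illustrative, not the crate's API:

```python
import hashlib

child_digests = {
    "zip": hashlib.sha256(b"zip-bytes"),
    "city": hashlib.sha256(b"city-bytes"),
}
parent = hashlib.sha256()
for name in sorted(child_digests):               # alphabetical: city before zip
    parent.update(child_digests[name].digest())  # finalize child into parent stream
struct_hash = parent.hexdigest()
print(struct_hash)
```

Sorting children alphabetically is what makes the struct hash independent of the field order in which the struct was declared.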
Refactor DigestBufferType from enum to struct with optional `structural`
digest field. For list columns, element counts (sizes) now accumulate in
a separate SHA-256 stream from leaf data, producing: null_bits ||
structural_digest || leaf_digest at finalization. This cleanly separates
structure from data, making collision prevention easier to reason about
while preserving streaming compatibility. Non-list types are unchanged.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
List types now separate element counts into a dedicated structural SHA-256
digest stream, while leaf data flows into the data digest. This ensures
differently-grouped lists (e.g. [[1,2],[3]] vs [[1],[2,3]]) produce
different hashes even when their leaf values are identical.

Updated sections: field digest buffer description (Section 3), list types
(Section 3.4), struct composite children (Section 3.5), finalization
(Section 4), hash_array API (Section 6), and Example N.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
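The grouping sensitivity described above can be sketched as two parallel digest streams: element counts feed a structural digest while leaf values feed the data digest. This is an illustrative reduction; the real spec additionally folds in null bits and framing:

```python
import hashlib

def hash_list_column(lists):
    """Sketch: counts go to a structural stream, leaf values to a data stream."""
    structural = hashlib.sha256()
    data = hashlib.sha256()
    for sub in lists:
        structural.update(len(sub).to_bytes(8, "little"))      # element count
        for v in sub:
            data.update(v.to_bytes(4, "little", signed=True))  # leaf bytes
    return structural.digest() + data.digest()

a = hash_list_column([[1, 2], [3]])
b = hash_list_column([[1], [2, 3]])
assert a != b            # grouping differs, so the combined hashes differ
assert a[32:] == b[32:]  # the leaf data streams are byte-identical
```

Without the separate structural stream, both inputs would feed identical leaf bytes and collide; the count stream is exactly what distinguishes them.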
Add clippy expects for similar_names, redundant_clone, and absolute_paths
in digest_bytes tests. Run cargo fmt to fix all formatting issues across
source and test files.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Add four examples that had tests but were missing from the spec:
- Example G: Nullable Int32 array with nulls (hash_array API)
- Example H: Nullable String array with nulls and type canonicalization
- Example I: Empty table with no data batches
- Example J: Multi-batch streaming batch-split independence

All 14 byte-level spec tests (A-N) now have corresponding worked examples
in the documentation.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
eywalker (Contributor, Author) commented Mar 7, 2026

Closing this PR as it has been superseded by PR #9, which includes all changes from this PR plus additional improvements.

@eywalker eywalker closed this Mar 7, 2026

3 participants