
Implement Starfix byte-level specification with type canonicalization #6

Closed
eywalker wants to merge 7 commits into main from claude/fix-logical-hashing-tests-Z52tP

Conversation

eywalker (Contributor) commented Mar 5, 2026

Summary

This PR implements the complete Starfix byte-level hashing specification as documented in docs/byte-layout-spec.md. The changes ensure deterministic, language-agnostic hashing of Apache Arrow schemas and record batches by precisely specifying every byte fed into SHA-256.

Key Changes

  • Type Canonicalization: Implement logical type equivalence classes:

    • Binary and LargeBinary both canonicalize to "LargeBinary"
    • Utf8 and LargeUtf8 both canonicalize to "LargeUtf8"
    • List and LargeList both canonicalize to {"LargeList": ...}
    • Dictionary arrays are resolved to their plain value type before hashing
  • Schema Serialization: Generate canonical JSON with:

    • Field names sorted alphabetically (via BTreeMap)
    • Nested object keys sorted recursively
    • Struct fields serialized with name, data_type, and nullable keys
    • List element types serialized without Arrow-internal field names
  • Field Digest Finalization: Implement proper nullable field serialization:

    • Bit count as usize little-endian (8 bytes)
    • Validity words as usize big-endian (8 bytes per word)
    • Data digest as SHA-256 finalized (32 bytes)
  • Dictionary Array Handling: Cast dictionary arrays to their value type before hashing to ensure consistent results regardless of encoding

  • Comprehensive Test Suite: Add 10 worked examples (example_a through example_j) that manually compute expected SHA-256 hashes and verify library conformance:

    • Two-column tables with mixed nullable/non-nullable fields
    • Boolean arrays with bit-packing (Msb0 for data, Lsb0 for validity)
    • Fixed-size and variable-length types
    • Column-order independence
    • Type equivalence verification
    • Multi-batch streaming
    • Empty tables
  • Schema Validation: Update record batch schema validation to compare canonical serializations rather than exact object equality, allowing batches with reordered columns to be accepted
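The equivalence classes above can be sketched as a small recursive canonicalization function. This is a minimal illustration in Python; the function name and JSON-like type shapes are assumptions for the sketch, not the crate's actual API:

```python
def canonicalize(dtype):
    """Map equivalent Arrow logical types onto one canonical form (sketch)."""
    if dtype in ("Binary", "LargeBinary"):
        return "LargeBinary"
    if dtype in ("Utf8", "LargeUtf8"):
        return "LargeUtf8"
    if isinstance(dtype, dict):
        # List and LargeList collapse to LargeList, recursing on the element type
        for key in ("List", "LargeList"):
            if key in dtype:
                return {"LargeList": canonicalize(dtype[key])}
    return dtype

print(canonicalize({"List": "Utf8"}))  # {'LargeList': 'LargeUtf8'}
```

Because canonicalization runs before any bytes are hashed, a `Utf8` column and a `LargeUtf8` column with the same values produce the same digest.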

Notable Implementation Details

  • Validity BitVec uses Lsb0 (least significant bit first) for storage and Msb0 (most significant bit first) for Boolean data packing
  • All length prefixes for variable-length types use u64 little-endian (8 bytes), regardless of Arrow's offset type
  • Null elements in nullable fields are skipped entirely—no bytes are fed to the data digest
  • Fields are always processed in alphabetical order by path (e.g., address/city before address/zip)
  • Output format is 35 bytes: 3-byte version prefix (0x00 0x00 0x01) + 32-byte SHA-256 digest
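The finalization and framing bullets above can be sketched byte-for-byte with only the standard library. Helper names here are illustrative; the authoritative layout is docs/byte-layout-spec.md:

```python
import hashlib

def finalize_field(bit_count, validity_words, data_digest):
    """bit count (8-byte LE) || validity words (8-byte BE each) || data digest (32 bytes)."""
    buf = bit_count.to_bytes(8, "little")
    for word in validity_words:
        buf += word.to_bytes(8, "big")
    return buf + data_digest

def frame_output(digest):
    """35-byte output: version prefix 0x00 0x00 0x01 + 32-byte SHA-256 digest."""
    return b"\x00\x00\x01" + digest

field_bytes = finalize_field(3, [0b101], hashlib.sha256(b"example data").digest())
out = frame_output(hashlib.sha256(field_bytes).digest())
assert len(out) == 35 and out[:3] == b"\x00\x00\x01"
```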

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX

claude added 2 commits March 5, 2026 12:05
Address all 7 design-spec issues to make starfix produce identical
hashes for logically equivalent Arrow tables regardless of column order,
struct field order, encoding, or type variant.

Core implementation changes (src/arrow_digester_core.rs):
- Issue 1: Sort struct fields alphabetically in data_type_to_value
- Issue 2: Apply sort_json_value recursively for deterministic JSON
- Issue 3: Use u64 (not usize) for binary length prefixes
- Issue 4: Remove NULL_BYTES sentinel from binary/string nullable paths
- Issue 5: Canonicalize Binary→LargeBinary, Utf8→LargeUtf8, List→LargeList
- Issue 6: Resolve dictionary arrays to plain arrays before hashing
- Issue 7: Use logical schema comparison in update() (canonical serialization)

Also improved schema JSON format for cross-language stability by dropping
Arrow-internal field names (e.g. "item") from List element serialization.

All 13 previously-ignored tests now pass. Updated golden hash values and
golden schema JSON to reflect the new canonical serialization.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
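The deterministic-JSON fixes (Issues 1 and 2 above) can be mimicked in Python, where sorting the field list by name plus `json.dumps(sort_keys=True)` sorts object keys at every nesting level, analogous to applying `sort_json_value` recursively. The schema value below is illustrative:

```python
import json

schema = {
    "fields": [
        {"nullable": True, "name": "b", "data_type": "Int32"},
        {"data_type": "LargeUtf8", "name": "a", "nullable": False},
    ]
}
# Sort fields alphabetically by name, then sort object keys recursively.
canonical = json.dumps(
    {"fields": sorted(schema["fields"], key=lambda f: f["name"])},
    sort_keys=True,
    separators=(",", ":"),
)
print(canonical)
```

Two schemas that differ only in field order or key order serialize to the same string, which is what lets the canonical serialization double as a logical equality check.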
Add docs/byte-layout-spec.md describing the exact byte-level serialization
for schema JSON, fixed-size types, booleans, variable-length types, lists,
validity bitmaps, and the final combining digest. Every byte fed into
SHA-256 is specified, making cross-language reimplementation possible.

Add 10 verification tests in tests/digest_bytes.rs that manually construct
the expected SHA-256 hash from raw bytes and assert equality with the
library output. Covers:
- Example A: two-column record batch (Int32 + nullable LargeUtf8)
- Example B: boolean array with nulls (Msb0 bit packing)
- Example C: non-nullable Int32 array
- Example D: binary array with type canonicalization (Binary→LargeBinary)
- Example E: column-order independence proof
- Example F: Utf8/LargeUtf8 type equivalence proof
- Example G: nullable Int32 with nulls
- Example H: nullable string array with nulls and type canonicalization
- Example I: empty table (schema only, no data)
- Example J: multi-batch streaming equals single combined batch

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
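Example B's bit packing hinges on two different bit orders: Msb0 for Boolean data, Lsb0 for validity bitmaps. A stdlib-only sketch of the distinction (the helper names are ours, not the library's):

```python
def pack_msb0(bits):
    """Pack booleans most-significant-bit first (Boolean data packing)."""
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)

def pack_lsb0(bits):
    """Pack booleans least-significant-bit first (validity bitmap storage)."""
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

assert pack_msb0([1, 0, 1]) == b"\xa0"  # 1010_0000
assert pack_lsb0([1, 0, 1]) == b"\x05"  # 0000_0101
```

The same three booleans yield different bytes under the two orders, which is why the spec has to pin each one down explicitly.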
@eywalker eywalker requested review from Synicix and Copilot March 5, 2026 16:43

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

claude added 5 commits March 5, 2026 17:28
…t examples

Implement DataType::Struct in array_digest_update for composite hashing
of struct arrays (previously todo!()). Struct children are sorted
alphabetically, each gets an independent digest that is finalized into
the parent's data stream. Struct-level nulls propagate to children via
combined validity buffers to avoid hashing undefined data.

Add finalize_child_into_data helper for writing child digest bytes into
a parent's data stream. Add four new manual verification tests (Examples
K-N) covering struct columns in record batches, hash_array on structs
with and without nulls, and list-of-struct columns. Update byte-layout
spec with corresponding worked examples and updated Section 3.5.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
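A minimal sketch of the composite scheme this commit describes: each struct child maintains an independent digest, and the finalized child digests are written into the parent's data stream in alphabetical field order. Names (`child_digests`, `parent`) and inputs are illustrative, not the crate's API:

```python
import hashlib

child_digests = {
    "zip": hashlib.sha256(b"zip-bytes"),
    "city": hashlib.sha256(b"city-bytes"),
}
parent = hashlib.sha256()
for name in sorted(child_digests):               # alphabetical: city before zip
    parent.update(child_digests[name].digest())  # finalize child into parent stream
struct_hash = parent.hexdigest()
print(struct_hash)
```

Sorting children alphabetically is what makes the struct hash independent of the field order in which the struct was declared.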
Refactor DigestBufferType from enum to struct with optional `structural`
digest field. For list columns, element counts (sizes) now accumulate in
a separate SHA-256 stream from leaf data, producing: null_bits ||
structural_digest || leaf_digest at finalization. This cleanly separates
structure from data, making collision prevention easier to reason about
while preserving streaming compatibility. Non-list types are unchanged.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
List types now separate element counts into a dedicated structural SHA-256
digest stream, while leaf data flows into the data digest. This ensures
differently-grouped lists (e.g. [[1,2],[3]] vs [[1],[2,3]]) produce
different hashes even when their leaf values are identical.

Updated sections: field digest buffer description (Section 3), list types
(Section 3.4), struct composite children (Section 3.5), finalization
(Section 4), hash_array API (Section 6), and Example N.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
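The grouping sensitivity described above can be sketched as two parallel digest streams: element counts feed a structural digest while leaf values feed the data digest. This is an illustrative reduction; the real spec additionally folds in null bits and framing:

```python
import hashlib

def hash_list_column(lists):
    """Sketch: counts go to a structural stream, leaf values to a data stream."""
    structural = hashlib.sha256()
    data = hashlib.sha256()
    for sub in lists:
        structural.update(len(sub).to_bytes(8, "little"))      # element count
        for v in sub:
            data.update(v.to_bytes(4, "little", signed=True))  # leaf bytes
    return structural.digest() + data.digest()

a = hash_list_column([[1, 2], [3]])
b = hash_list_column([[1], [2, 3]])
assert a != b            # grouping differs, so the combined hashes differ
assert a[32:] == b[32:]  # the leaf data streams are byte-identical
```

Without the separate structural stream, both inputs would feed identical leaf bytes and collide; the count stream is exactly what distinguishes them.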
Add clippy expects for similar_names, redundant_clone, and absolute_paths
in digest_bytes tests. Run cargo fmt to fix all formatting issues across
source and test files.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Add four examples that had tests but were missing from the spec:
- Example G: Nullable Int32 array with nulls (hash_array API)
- Example H: Nullable String array with nulls and type canonicalization
- Example I: Empty table with no data batches
- Example J: Multi-batch streaming batch-split independence

All 14 byte-level spec tests (A-N) now have corresponding worked examples
in the documentation.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
eywalker (Contributor, Author) commented Mar 7, 2026

Closing this PR as it has been superseded by PR #9, which includes all changes from this PR plus additional improvements.

@eywalker eywalker closed this Mar 7, 2026

3 participants