Phase 6 ships three things:
* a unified `amn convert` dispatch that covers every format pair
  available in v0.6.0 through a single CLI subcommand;
* `write_gguf`, the format-symmetric inverse of `parse_gguf` (scalar
  dtypes today; quantised dtypes are reserved for Phase 7.5 through the
  same scaffold);
* the end-to-end BF16 -> BnB-NF4 safetensors path with the four-tensor
  companion layout (`weight`, `weight.absmax`, `weight.quant_map`,
  `weight.quant_state.bitsandbytes__nf4`).
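For readers unfamiliar with the layout, the sketch below shows roughly how blockwise NF4 quantisation yields the packed `weight` bytes, the per-block `weight.absmax` scales and the 16-entry `weight.quant_map`. It is an illustration under stated assumptions, not this crate's implementation: `BLOCK_SIZE = 64` is an assumed stand-in for `NF4_BLOCK_SIZE`, the high-nibble-first packing order is assumed to follow bitsandbytes, and `quantize_nf4` / `dequantize_nf4` are hypothetical names.

```rust
/// 16-entry NF4 codebook as published with bitsandbytes; this is the
/// table that would be stored as `weight.quant_map`.
const NF4_CODEBOOK: [f32; 16] = [
    -1.0,
    -0.6961928009986877,
    -0.5250730514526367,
    -0.39491748809814453,
    -0.28444138169288635,
    -0.18477343022823334,
    -0.09105003625154495,
    0.0,
    0.07958029955625534,
    0.16093020141124725,
    0.24611230194568634,
    0.33791524171829224,
    0.44070982933044434,
    0.5626170039176941,
    0.7229568362236023,
    1.0,
];

/// Assumed block size; hypothetical stand-in for the crate's `NF4_BLOCK_SIZE`.
const BLOCK_SIZE: usize = 64;

/// Index of the codebook entry closest to `x` (expected to lie in [-1, 1]).
fn nearest_code(x: f32) -> u8 {
    let mut best = 0usize;
    let mut best_dist = f32::INFINITY;
    for (i, &c) in NF4_CODEBOOK.iter().enumerate() {
        let d = (x - c).abs();
        if d < best_dist {
            best_dist = d;
            best = i;
        }
    }
    best as u8
}

/// Quantises `values` block by block.
/// Returns (packed 4-bit codes, one absmax scale per block).
fn quantize_nf4(values: &[f32]) -> (Vec<u8>, Vec<f32>) {
    assert!(values.len() % BLOCK_SIZE == 0, "sketch requires whole blocks");
    let mut packed = Vec::with_capacity(values.len() / 2);
    let mut absmax = Vec::with_capacity(values.len() / BLOCK_SIZE);
    for block in values.chunks(BLOCK_SIZE) {
        // Per-block scale: the largest magnitude in the block.
        let scale = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        absmax.push(scale);
        let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
        for pair in block.chunks(2) {
            // Assumption: the first element of each pair goes in the high nibble.
            let hi = nearest_code(pair[0] * inv);
            let lo = nearest_code(pair[1] * inv);
            packed.push((hi << 4) | lo);
        }
    }
    (packed, absmax)
}

/// Inverse lookup: codebook entry times the owning block's absmax.
fn dequantize_nf4(packed: &[u8], absmax: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for (i, &byte) in packed.iter().enumerate() {
        let scale = absmax[(i * 2) / BLOCK_SIZE];
        out.push(NF4_CODEBOOK[(byte >> 4) as usize] * scale);
        out.push(NF4_CODEBOOK[(byte & 0x0F) as usize] * scale);
    }
    out
}
```

The fourth tensor, `weight.quant_state.bitsandbytes__nf4`, carries the quantisation metadata and is not modelled in this sketch.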
New library helpers:
* `write_gguf` / `write_gguf_to_writer` / `GgufWriteTensor`
* `npz_to_safetensors` / `npz_to_safetensors_bytes`
* `write_bnb_nf4_safetensors` / `write_bnb_nf4_safetensors_bytes`
* `BnbWriteInput` / `BnbNf4WriteStats` / `is_eligible_for_nf4` / `classify_inputs`
* `NF4_BLOCK_SIZE`
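
As background for the `*_bytes` helpers, the following is a minimal sketch of the safetensors container framing: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape and data offsets, then the raw tensor bytes. `RawTensor` and `to_safetensors_bytes` are hypothetical names for illustration and are not this crate's API. The sketch omits details such as header padding and name escaping, so it should not be expected to match any particular writer byte for byte.

```rust
/// Hypothetical minimal tensor description, used by this sketch only.
struct RawTensor<'a> {
    name: &'a str,
    dtype: &'a str,      // safetensors dtype tag, e.g. "F32" or "BF16"
    shape: &'a [usize],
    data: &'a [u8],      // little-endian, row-major element bytes
}

/// Serialises tensors as: u64 LE header length, JSON header, data bytes.
fn to_safetensors_bytes(tensors: &[RawTensor<'_>]) -> Vec<u8> {
    let mut entries = Vec::new();
    let mut offset = 0usize;
    for t in tensors {
        let shape: Vec<String> = t.shape.iter().map(|d| d.to_string()).collect();
        let end = offset + t.data.len();
        // Offsets are relative to the start of the data section.
        entries.push(format!(
            "\"{}\":{{\"dtype\":\"{}\",\"shape\":[{}],\"data_offsets\":[{},{}]}}",
            t.name,
            t.dtype,
            shape.join(","),
            offset,
            end
        ));
        offset = end;
    }
    let header = format!("{{{}}}", entries.join(","));

    let mut out = Vec::with_capacity(8 + header.len() + offset);
    out.extend_from_slice(&(header.len() as u64).to_le_bytes());
    out.extend_from_slice(header.as_bytes());
    for t in tensors {
        out.extend_from_slice(t.data);
    }
    out
}
```
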
New CLI surface:
* `amn convert <input> --to {safetensors|gguf|bnb-nf4}`
13 byte-exact integration tests in `tests/cross_validation_convert.rs`
cover every v0.6.0 conversion pair, in both directions where reversible,
plus a size-matched perf comparison (`t14`, `#[ignore]`d, opt-in via
`--ignored`) against six checked-in Python sidecars
(numpy / safetensors-py / torch.load + safetensors.torch / gguf-py /
bitsandbytes-CPU, plus two PyTorch-CPU equivalents for the non-PyTorch
paths).
Measured CPU performance vs Python at 4096x4096 (release build,
`target-cpu=native`; speedup in parentheses):
  npz -> safetensors           11.2 ms vs 75.7 ms numpy (6.75x) / 92.5 ms torch (8.24x)
  pth -> safetensors            5.7 ms vs 29.6 ms torch (5.18x)
  safetensors-BF16 -> GGUF     13.6 ms vs 15.1 ms gguf-py (1.11x) / 29.6 ms torch+ggufpy (2.17x)
  safetensors-BF16 -> BnB-NF4   141 ms vs 377 ms bnb-CPU (2.67x)
Quantised GGUF emit (`gguf-q4km`, etc.) lands at v0.7.5 / Phase 7.5
via the same dispatch.