Skip to content

josefs/lexcode

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

lexcode logo

lexcode

A serde-based binary serialization format that preserves lexicographic ordering. When two values satisfy a < b, their serialized byte representations satisfy bytes(a) < bytes(b) under standard byte-wise comparison.

Preserving lexicographic ordering is useful when storing keys in ordered key-value stores and databases (LSM trees, B-trees, etc.) where sort order must be maintained at the byte level.

Note that lexcode does not support preserving the ordering of explicit implementations of Ord or PartialOrd. The ordering lexcode will use for e.g. a struct is to first compare the first field, then the second field etc., regardless of any Ord and PartialOrd implementation.

Usage

Serialize and deserialize any serde-compatible type:

use serde::{Serialize, Deserialize};

#[derive(Debug, PartialEq, Serialize, Deserialize)]
struct Record {
    category: u32,
    name: String,
    score: i64,
}

let record = Record { category: 1, name: "alice".into(), score: -42 };

// Serialize
let bytes = lexcode::to_bytes(&record).unwrap();

// Deserialize
let decoded: Record = lexcode::from_bytes(&bytes).unwrap();
assert_eq!(record, decoded);

The encoded bytes preserve ordering — records sort first by category, then by name, then by score, matching the struct field order:

let a = lexcode::to_bytes(&Record { category: 1, name: "alice".into(), score: 10 }).unwrap();
let b = lexcode::to_bytes(&Record { category: 1, name: "bob".into(),   score: 5  }).unwrap();
let c = lexcode::to_bytes(&Record { category: 2, name: "alice".into(), score: 99 }).unwrap();

assert!(a < b); // same category, "alice" < "bob"
assert!(b < c); // category 1 < category 2

Supported Types

Type Encoding
bool 1 byte (0x00 / 0x01)
u8, u16, u32, u64, u128 Variable-length unsigned varint
i8, i16, i32, i64, i128 Variable-length signed varint (sign bit + magnitude)
f32, f64 IEEE 754 with sign-aware bit flipping (fixed-width)
char Variable-length unsigned varint of code point
String / &str Sentinel-escaped with 0x00 terminator
&[u8] Sentinel-escaped with 0x7F terminator
FixedBytes<N> Raw bytes, exactly N bytes (zero overhead)
Option<T> 0x00 for None, 0x01 + value for Some
Vec<T>, sequences 0x01 + element per entry, 0x00 terminator
Maps 0x01 + key + value per entry, 0x00 terminator
Tuples, structs Fields concatenated in order (fixed-length)
Enums Varint variant index + variant data
(), unit structs Zero bytes

Encoding Details

Integers

Integers use a variable-length encoding that preserves lexicographic ordering while compressing small values into fewer bytes.

Unsigned integers

The number of leading 1-bits in the first byte (and optionally a second header byte for large values) determines how many additional data bytes follow. A 0-bit terminates the unary prefix. The remaining header bits plus the extra bytes store the value in big-endian.

First byte pattern Total bytes Data bits Value range
0xxxxxxx 1 7 0 – 127
10xxxxxx + 1 byte 2 14 128 – 16,511
110xxxxx + 2 bytes 3 21 16,512 – 2,113,663
1110xxxx + 3 bytes 4 28
11110xxx + 4 bytes 5 35
111110xx + 5 bytes 6 42
1111110x + 6 bytes 7 49
11111110 + 7 bytes 8 56

When the first byte is 0xFF, a second header byte extends the scheme in the same way, supporting up to 18 total bytes for the full u128 range.

Within each level, values are stored as an offset from the level's base, ensuring that all values at a shorter encoding are strictly less than all values at a longer encoding.

Signed integers

Bit 7 of the first byte is the sign bit (1 = non-negative, 0 = negative). The remaining 7 bits begin the same unary-prefix scheme for the magnitude:

  • Non-negative values encode the magnitude directly after the sign bit.
  • Negative values encode |v| − 1 as the magnitude, then bitwise-complement all bytes (except the sign bit stays 0). This reverses the ordering so that more-negative values produce lexicographically smaller byte sequences.

Small values near zero (both positive and negative) are encoded compactly in 1 byte. For example, values −64 to 63 fit in a single byte.

Floats

IEEE 754 floats use sign-aware bit manipulation: positive floats have their sign bit flipped; negative floats have all bits flipped. This produces a total ordering over all non-NaN values.

Strings and Byte Slices

Variable-length types use sentinel encoding to allow unambiguous termination without length prefixes. Each occurrence of the sentinel byte within the data is escaped by doubling it (sentinel, 0x01), and the sequence is terminated with sentinel, 0x00.

Sequences and Maps

Variable-length collections prefix each element with 0x01 and end with 0x00. This preserves element-wise lexicographic comparison.

Fixed-Size Byte Arrays (FixedBytes<N>)

For encoding a known number of bytes without encoding overhead, lexcode provides the type FixedBytes<N> — the output is exactly N bytes, identical to the input array. This is ideal for hashes, UUIDs, and other fixed-size binary data where space efficiency matters.

Unlike [u8; N] (which varint-encodes each byte, averaging ~1.5× the size) or &[u8] / Vec<u8> (which uses sentinel encoding with a 2-byte terminator), FixedBytes<N> has zero overhead.

use lexcode::FixedBytes;

#[derive(serde::Serialize, serde::Deserialize)]
struct Record {
    id: u64,
    hash: FixedBytes<32>,   // exactly 32 bytes in output
}

Testing

The test suite uses proptest for property-based testing, verifying both roundtrip correctness and order preservation across all supported types.

cargo test

Data Migration

Because integers use a variable-length encoding that always promotes to u128/i128 internally, the same numeric value produces identical bytes regardless of the Rust integer width. For example, 42u8, 42u16, and 42u64 all serialize to the same bytes. This makes several type changes forward- and backward-compatible.

Compatible changes

Integer widening (e.g. u8u32, i16i64): Always safe. Old data decodes correctly into the wider type since the byte encoding is identical.

Integer narrowing (e.g. u32u16, i64i32): Safe if all stored values fit in the narrower type. Out-of-range values produce an overflow error on deserialization rather than silent truncation.

char ↔ unsigned integers: char uses the same unsigned varint encoding as u8u128, so these types are interchangeable. charu32/u64/u128 always works. u32char works if the value is a valid Unicode code point.

Newtype struct wrapping/unwrapping: struct Wrapper(T) serializes identically to T, so adding or removing a newtype wrapper is always safe.

Widening a field inside a struct or container: Since field encodings are width-independent, changing a struct field from u8 to u32 (or an element type from i16 to i64, etc.) is safe — old data decodes correctly.

Appending enum variants: Adding new variants at the end of an enum is safe for reading old data, since existing variant indices are unchanged.

Incompatible changes

Unsigned ↔ signed: u8u128 and i8i128 use structurally different varint schemes (different header formats), so changing signedness silently produces wrong values.

Float width changes: f32 and f64 use fixed-width encodings (4 and 8 bytes respectively), so they are not interchangeable.

Integer ↔ float: Integers use variable-length varint; floats use fixed-width bit-flipped IEEE 754. These are completely different encodings.

String&[u8]: Strings use sentinel byte 0x00; byte slices use 0x7F. Data encoded as one cannot be decoded as the other.

TOption<T>: Option adds a 0x00/0x01 prefix byte, so wrapping or unwrapping Option changes the encoding.

Adding or removing struct/tuple fields: Fields are concatenated without framing, so changing the field count causes subsequent fields to be misaligned.

Reordering or removing enum variants: Variants are identified by their positional index, so reordering or removing variants changes the meaning of existing encoded data.

[u8; N]&[u8] / Vec<u8>: Fixed-size byte arrays are serialized as tuples (each element varint-encoded, concatenated without framing), while byte slices use sentinel encoding. These are completely different formats.

Limitations

  • deserialize_any is not supported. Since lexcode is a non-self-describing binary format, the deserializer must know the expected type at compile time. This means dynamically-typed values like serde_json::Value cannot be deserialized from lexcode.

Comparison with Similar Crates

See COMPARISON.md for a detailed comparison with bytekey2, ordcode, memcomparable, and other binary serialization formats.

Disclaimer

This library is in no way associated with Lex Fridman. This library cannot conduct a 4-hour podcast interview with your data structures.

License

Licensed under the MIT License.

About

serde-based (de)serializer that preserves lexicographic ordering

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages