feat: add MSET "hashing" function spec #272

Closed
44 changes: 44 additions & 0 deletions specs/mset.md
@@ -0,0 +1,44 @@
# MSET is a "hashing" function that encodes repeating sets of bytes

The name is inspired by the memset function.

Like identity, it is not a hash but a complete representation of the data.

MSET is not intended as a general-purpose data compression scheme.
Its goal is only to compress the trivial case of data padding.

## Digest decoding

```
<varuint count - 2><pattern>
```

First, read a varuint: this is the number of times the pattern must be repeated, minus two (so add two to get the true count).

Everything left in your buffer is the pattern to repeat.

If there is nothing left in the buffer (meaning that the count is the ONLY thing in the digest), then the pattern is `0x00`.

The varuint count MUST be minimal and complete; if it is not, the MSET hash is invalid.

The pattern size SHOULD be a power of two (implementations can then likely use faster vectorized loops).
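
The decoding steps above can be sketched as follows. This is only an illustration, not a reference implementation: it assumes the varuint is the LEB128-style unsigned varint (7 payload bits per byte, least-significant group first, high bit as continuation flag), which is consistent with the `0x800242` example in this spec. The names `decode_varuint` and `mset_decode` are hypothetical, and no upper bound on the varuint length or output size is enforced here.

```python
from typing import Tuple


def decode_varuint(buf: bytes) -> Tuple[int, int]:
    """Decode a LEB128-style unsigned varint (assumed format).

    Returns (value, number_of_bytes_consumed).
    """
    value = 0
    shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            # Minimality check: a multi-byte varuint must not end in 0x00,
            # otherwise a shorter encoding of the same value exists.
            if i > 0 and byte == 0:
                raise ValueError("non-minimal varuint")
            return value, i + 1
        shift += 7
    # Ran out of bytes while the continuation bit was still set
    # (this also covers an empty digest).
    raise ValueError("incomplete varuint")


def mset_decode(digest: bytes) -> bytes:
    """Expand an MSET digest into the data it represents."""
    count_minus_two, consumed = decode_varuint(digest)
    # Everything after the count is the pattern; an empty pattern means 0x00.
    pattern = digest[consumed:] or b"\x00"
    return pattern * (count_minus_two + 2)
```

A real implementation would probably also want to cap the decoded size, since a few digest bytes can expand to an arbitrarily large buffer.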

## Examples

- `0x0242` -> `0x42424242`; repeat a `uint8` equal to `0x42` 4 times
- `0x001234` -> `0x12341234`; repeat a `uint16` equal to `0x1234` 2 times
- `0x7e42` -> `0x42 * 128`; repeat a `uint8` equal to `0x42` 128 times
- `0x7f1234` -> `0x1234 * 129`; repeat a `uint16` equal to `0x1234` 129 times
- `0x800242` -> `0x42 * 258`; repeat a `uint8` equal to `0x42` 258 times
- `0x01123456` -> `0x123456123456123456`; repeat a `uint24` equal to `0x123456` 3 times
- `0x03` -> `0x0000000000`; zerofill 5 bytes.
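
For illustration, the digests in the examples above can be produced by an encoder along these lines. This is a sketch under assumptions: the varuint is taken to be the LEB128-style unsigned varint, `encode_varuint` and `mset_encode` are hypothetical names, and the choice to emit a single `0x00` pattern as an empty pattern is a design decision of this sketch rather than something the spec mandates. It does not canonicalize equivalent inputs (for example, `0x0000` repeated twice versus `0x00` repeated four times).

```python
def encode_varuint(value: int) -> bytes:
    """Encode a non-negative integer as a minimal LEB128-style varint (assumed format)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # last group, continuation bit clear
            return bytes(out)


def mset_encode(pattern: bytes, count: int) -> bytes:
    """Build an MSET digest for `pattern` repeated `count` times."""
    if count < 2:
        raise ValueError("counts below 2 are not representable; use identity instead")
    if not pattern:
        raise ValueError("pattern must not be empty")
    prefix = encode_varuint(count - 2)
    # Assumption of this sketch: prefer the one-byte-shorter zerofill form.
    return prefix if pattern == b"\x00" else prefix + pattern


print(mset_encode(b"\x42", 258).hex())  # 800242
print(mset_encode(b"\x00", 5).hex())    # 03
```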

## Rationale

- Varuint count minus two.
The varuint count is stored minus two because counts of 0 and 1 are better served by identity CIDs, so it doesn't make sense to encode them here.
Subtracting two also allows counts that are powers of 128 to be stored in one fewer byte.
- No pattern means zerofill.
An empty pattern would multiply the size by 0 and yield empty data, which is better served by identity CIDs.
We instead reuse this shorter encoding for something more useful.
NUL bytes are the most common padding value in most apps, so it makes sense to grant them this one-byte saving.
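
The byte saving from the minus-two offset can be checked with a little arithmetic. The snippet below is a sketch that assumes a varint carrying 7 payload bits per byte; `varuint_len` is a hypothetical helper, not part of any multiformats library.

```python
def varuint_len(value: int) -> int:
    """Bytes needed to store `value` as a varint with 7 payload bits per byte."""
    return max(1, (value.bit_length() + 6) // 7)


# Compare the encoded size of the raw count with the size of (count - 2).
for count in (128, 129, 16384, 16385):
    print(count, varuint_len(count), varuint_len(count - 2))
```

For each of these counts the offset form is one byte shorter, because `count - 2` falls back under the previous power of 128.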
1 change: 1 addition & 0 deletions table.csv
@@ -152,6 +152,7 @@ ed25519-priv, key, 0x1300, draft, Ed255
secp256k1-priv, key, 0x1301, draft, Secp256k1 private key
x25519-priv, key, 0x1302, draft, Curve25519 private key
kangarootwelve, multihash, 0x1d01, draft, KangarooTwelve is an extendable-output hash function based on Keccak-p
mset, multihash, 0x3488, draft, MSET "hashing" function; see specs/mset.md file
sm3-256, multihash, 0x534d, draft,
blake2b-8, multihash, 0xb201, draft, Blake2b consists of 64 output lengths that give different hashes
blake2b-16, multihash, 0xb202, draft,