This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

wip: hash consistent sorted trees #332

Open · wants to merge 10 commits into master

Conversation

@mikeal (Contributor) commented Nov 11, 2020

still in a very early draft state, but i’ve written enough in this web editor that i’m no longer comfortable not having it committed to a branch :)

@mikeal (Contributor, Author) commented Nov 11, 2020

Now that I’m staring at it all written out I’m realizing that there are some even better approaches we can take.

I’ve been using these simple tail addresses, which are great because they’re 100% consistent so they don’t have much churn on branch merging. The problem is that they’re super vulnerable to attack. @mikolalysenko solved this by using a floating fingerprint which is where all these ideas originated.

But once you spell out the problem the chunker is solving, it’s a lot simpler than this. You’re already feeding it fingerprints, so you have a randomized distribution you can convert to an integer. Since the chunker is a state machine, all you need to do is introduce a few hard limits on max chunk sizes and reset the state machine whenever you close a chunk. You’d get the consistency benefits of my approach and the safety of the entropy you get from the fingerprinter.
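A rough sketch of that state machine, just to make the idea concrete (the helper names, the `maxSize` rule, and reading a Uint32 from the tail of the fingerprint are illustrative assumptions, not anything specified here):

```js
// Illustrative sketch only: `fingerprint` is whatever hash the tree already
// computes per entry; the chunker just reuses its tail as a random integer.
const MAX_UINT32 = 4294967295
const tailToUint32 = f =>
  new DataView(f.buffer, f.byteOffset + f.byteLength - 4, 4).getUint32(0)

const createChunker = ({ targetSize, maxSize }) => {
  const threshold = MAX_UINT32 / targetSize
  let length = 0                   // the state machine's only state: entries in the open chunk
  // returns true when the current entry closes the chunk
  return fingerprint => {
    length++
    if (tailToUint32(fingerprint) < threshold || length >= maxSize) {
      length = 0                   // reset the state machine whenever a chunk closes
      return true
    }
    return false
  }
}
```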

@mikeal (Contributor, Author) commented Nov 11, 2020

@vmx i just re-wrote the whole doc. you can hold off on reading it for a bit, there’s a lot of churn right now.

@mikeal (Contributor, Author) commented Nov 12, 2020

@ribasushi what do you think about this:

  • We drop all the settings to the chunker except for a TARGET_SIZE and base all of our rules on that.
  • We always pull a Uint32 from the tail of the hash to keep things simple.
  • When a chunk reaches twice the target size we start applying overflow protection.

In overflow protection we still close on the same entries as we would otherwise, but we also compute a *sequence identity*. This is calculated by applying the hash function to the prior TARGET_SIZE hashes and the current hash. This gives us a unique identifier for each entry based on its placement. We convert the tail of this new hash to a Uint32 and compare it against the OVERFLOW_LIMIT.

The OVERFLOW_LIMIT is an integer that increases by an equal amount from 0 on every entry until it reaches MAX_UINT32. The increase in OVERFLOW_LIMIT on each entry is `Math.ceil(MAX_UINT32 / TARGET_SIZE)`.

This makes generating sequential data that will keep the chunk open highly difficult given a sufficient TARGET_SIZE, and still produces relatively (although not completely) consistent chunk splits in overflow protection.
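A minimal sketch of that overflow-protection rule under a few assumptions (the `hash` parameter stands in for whatever hash function the tree already uses, and the exact comparison and windowing details are guesses at the intent, not the spec):

```js
// Sketch only: `hash` is whatever hash function the tree already uses.
const MAX_UINT32 = 4294967295

const concat = chunks => {
  const out = new Uint8Array(chunks.reduce((size, c) => size + c.byteLength, 0))
  let offset = 0
  for (const c of chunks) { out.set(c, offset); offset += c.byteLength }
  return out
}
const tailToUint32 = h =>
  new DataView(h.buffer, h.byteOffset + h.byteLength - 4, 4).getUint32(0)

const overflowCheck = (hash, TARGET_SIZE) => {
  const step = Math.ceil(MAX_UINT32 / TARGET_SIZE)
  // priorHashes: the previous TARGET_SIZE entry hashes
  // current: the current entry's hash
  // entriesInOverflow: entries seen since overflow protection began
  return (priorHashes, current, entriesInOverflow) => {
    // sequence identity: hash over the prior TARGET_SIZE hashes plus the current
    // one, so the value depends on the entry's placement, not just its content
    const sequenceIdentity = hash(concat([...priorHashes, current]))
    // OVERFLOW_LIMIT ramps up from 0 toward MAX_UINT32, one step per entry
    const OVERFLOW_LIMIT = Math.min(step * entriesInOverflow, MAX_UINT32)
    // close the chunk once the tail of the sequence identity falls under the limit
    return tailToUint32(sequenceIdentity) < OVERFLOW_LIMIT
  }
}
```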

@ribasushi (Contributor) commented Nov 12, 2020

UPDATE: below note was based on a misunderstanding, see second point below.

Do we need the complication of OVERFLOW_LIMIT?

  • We always deduplicate (entries are unique)
  • ~~We always sort~~ detail I was missing: we do sort, but in a user-defined way, NOT based on the content hash
  • We use cryptographic hashing

This means that inserting two entries next to each other is exceedingly difficult (nearly preimage-attack level, albeit with lowered bit-ness).

Therefore: keep track of TARGET_SIZE, and whenever we cross it, instead of just taking the original hash of the node, byte-append the hash of the previous node and get the hash of that (cheap, since you can reinit state in sha2). You then get a new unpredictable hash for the same leaf, and the leaves after it, until you break out of the attack, and things go back to normal.
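As a sketch of that idea (the `sha256` parameter and the `chunkLength` bookkeeping are assumptions about how it would be wired in, not part of the proposal itself):

```js
// Sketch only: once a chunk has crossed TARGET_SIZE, stop feeding the chunker
// the node's own hash and chain in the previous node's hash instead.
const effectiveHash = (sha256, TARGET_SIZE) => (nodeHash, prevNodeHash, chunkLength) => {
  if (chunkLength <= TARGET_SIZE) return nodeHash
  // byte-append the previous node's hash and re-hash; an attacker who controls
  // the entries can no longer predict the value the chunker actually sees
  return sha256(new Uint8Array([...nodeHash, ...prevNodeHash]))
}
```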

@mikeal (Contributor, Author) commented Nov 12, 2020

You have to keep in mind how large the address space is for not matching. If the target size is 1000, you have a 99.9% chance of not closing under the defined limit on any given entry. It’s actually not impossible to find a sequence in which every rolling hash stays within these boundaries, especially if the target size is low, which we do want to support. Increasing the threshold for a match slowly closes the available address space for an attack, and since even a sequence that entirely matches would be evenly distributed, it still makes matches fairly consistent after a small number of inserts. A large number of inserts is less of a concern because the overflow into the next node will push new data into the sequence, making any prior sequential attack highly likely to be cut off by new overflow matches at the front of the sequence.
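For concreteness, the rough arithmetic behind the 99.9% figure, assuming the close rule is comparing the tail Uint32 against MAX_UINT32 / TARGET_SIZE:

```js
const MAX_UINT32 = 4294967295
const TARGET_SIZE = 1000
const threshold = MAX_UINT32 / TARGET_SIZE   // ≈ 4,294,967
// probability that one uniformly distributed tail value closes the chunk
const pClose = threshold / MAX_UINT32        // = 1 / TARGET_SIZE = 0.001
const pStayOpen = 1 - pClose                 // = 0.999, the 99.9% mentioned above
```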

@warpfork (Contributor)

What's the path forward on this?

If it's a tractable amount of work / there's time for it, can it be finished soon (i.e. in the next week or two)?

If it's not something that is likely to get carried much further forward at this time (or in the next couple weeks)... can we mark it very clearly as a draft (with a comment about when it's likely to be expanded on) and merge it that way?

@warpfork (Contributor)

Pinging to see if we can make this mergeable.
