This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

wip: hash consistent sorted trees #332

Open · wants to merge 10 commits into master

Conversation

@mikeal (Contributor) commented Nov 11, 2020

still in a very early draft state, but i’ve written enough in this web editor that i’m no longer comfortable not having it committed to a branch :)

@mikeal (Contributor, Author) commented Nov 11, 2020

Now that I’m staring at it all written out I’m realizing that there are some even better approaches we can take.

I’ve been using these simple tail addresses, which are great because they’re 100% consistent so they don’t have much churn on branch merging. The problem is that they’re super vulnerable to attack. @mikolalysenko solved this by using a floating fingerprint which is where all these ideas originated.

But once you spell out the problem the chunker is solving, it’s a lot simpler than this. You’re already feeding it fingerprints, so you have a randomized distribution you can convert to an integer. Since the chunker is a state machine, all you need to do is introduce a few hard limits on max chunk sizes and reset the state machine whenever you close a chunk. You’d get the consistency benefits of my approach and the safety of the entropy you get from the fingerprinter.
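A rough sketch of that state machine, just to make the idea concrete (the helper names, the `maxSize` rule, and reading a Uint32 from the tail of the fingerprint are illustrative assumptions, not anything specified here):

```js
// Illustrative sketch only: `fingerprint` is whatever hash the tree already
// computes per entry; the chunker just reuses its tail as a random integer.
const MAX_UINT32 = 4294967295
const tailToUint32 = f =>
  new DataView(f.buffer, f.byteOffset + f.byteLength - 4, 4).getUint32(0)

const createChunker = ({ targetSize, maxSize }) => {
  const threshold = MAX_UINT32 / targetSize
  let length = 0                   // the state machine's only state: entries in the open chunk
  // returns true when the current entry closes the chunk
  return fingerprint => {
    length++
    if (tailToUint32(fingerprint) < threshold || length >= maxSize) {
      length = 0                   // reset the state machine whenever a chunk closes
      return true
    }
    return false
  }
}
```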

@mikeal (Contributor, Author) commented Nov 11, 2020

@vmx i just re-wrote the whole doc. you can hold off on reading it for a bit, there’s a lot of churn right now.

@mikeal (Contributor, Author) commented Nov 12, 2020

@ribasushi what do you think about this:

  • We drop all the settings to the chunker except for a TARGET_SIZE and base all of our rules on that.
  • We always pull a Uint32 from the tail of the hash to keep things simple.
  • When a chunk reaches twice the target size we start applying overflow protection.

In overflow protection we still close on the same entries as we would otherwise, but we also compute a *sequence identity*. This is calculated by applying the hash function to the prior TARGET_SIZE hashes and the current hash. This gives us a unique identifier for each entry based on its placement. We convert the tail of this new hash to a Uint32 and compare it against the OVERFLOW_LIMIT.

The OVERFLOW_LIMIT is an integer that increases by an equal amount from 0 on every entry until it reaches MAX_UINT32. The increase in OVERFLOW_LIMIT on each entry is `Math.ceil(MAX_UINT32 / TARGET_SIZE)`.

This makes generating sequential data that will keep the chunk open highly difficult given a sufficient TARGET_SIZE, and still produces relatively (although not completely) consistent chunk splits in overflow protection.
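A minimal sketch of that overflow-protection rule under a few assumptions (the `hash` parameter stands in for whatever hash function the tree already uses, and the exact comparison and windowing details are guesses at the intent, not the spec):

```js
// Sketch only: `hash` is whatever hash function the tree already uses.
const MAX_UINT32 = 4294967295

const concat = chunks => {
  const out = new Uint8Array(chunks.reduce((size, c) => size + c.byteLength, 0))
  let offset = 0
  for (const c of chunks) { out.set(c, offset); offset += c.byteLength }
  return out
}
const tailToUint32 = h =>
  new DataView(h.buffer, h.byteOffset + h.byteLength - 4, 4).getUint32(0)

const overflowCheck = (hash, TARGET_SIZE) => {
  const step = Math.ceil(MAX_UINT32 / TARGET_SIZE)
  // priorHashes: the previous TARGET_SIZE entry hashes
  // current: the current entry's hash
  // entriesInOverflow: entries seen since overflow protection began
  return (priorHashes, current, entriesInOverflow) => {
    // sequence identity: hash over the prior TARGET_SIZE hashes plus the current
    // one, so the value depends on the entry's placement, not just its content
    const sequenceIdentity = hash(concat([...priorHashes, current]))
    // OVERFLOW_LIMIT ramps up from 0 toward MAX_UINT32, one step per entry
    const OVERFLOW_LIMIT = Math.min(step * entriesInOverflow, MAX_UINT32)
    // close the chunk once the tail of the sequence identity falls under the limit
    return tailToUint32(sequenceIdentity) < OVERFLOW_LIMIT
  }
}
```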

@ribasushi (Contributor) commented Nov 12, 2020

UPDATE: below note was based on a misunderstanding, see second point below.

Do we need the complication of OVERFLOW_LIMIT?

  • We always deduplicate (entries are unique)
  • ~~We always sort~~ detail I was missing: we do sort, but in a user-defined way, NOT based on the content hash
  • We use cryptographic hashing

This means that inserting two entries next to each other is exceedingly difficult (nearly preimage-attack level, albeit with lowered bit-ness).

Therefore: keep track of TARGET_SIZE, and whenever we cross it, instead of just taking the original hash of the node, byte-append the hash of the previous node and get the hash of that (cheap, since you can reinit state in sha2). You then get a new unpredictable hash for the same leaf, and the leaves after it, until you break out of the attack, and things go back to normal.
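As a sketch of that idea (the `sha256` parameter and the `chunkLength` bookkeeping are assumptions about how it would be wired in, not part of the proposal itself):

```js
// Sketch only: once a chunk has crossed TARGET_SIZE, stop feeding the chunker
// the node's own hash and chain in the previous node's hash instead.
const effectiveHash = (sha256, TARGET_SIZE) => (nodeHash, prevNodeHash, chunkLength) => {
  if (chunkLength <= TARGET_SIZE) return nodeHash
  // byte-append the previous node's hash and re-hash; an attacker who controls
  // the entries can no longer predict the value the chunker actually sees
  return sha256(new Uint8Array([...nodeHash, ...prevNodeHash]))
}
```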

@mikeal (Contributor, Author) commented Nov 12, 2020

You have to keep in mind how large the address space is for not matching. If the target size is 1000, you have a 99.9% chance of not closing under the defined limit on any given entry. It’s actually not impossible to find a sequence in which every rolling hash stays within these boundaries, especially if the target size is low, which we do want to support. Increasing the threshold for a match slowly closes the available address space for an attack, and since even a sequence that entirely matches would be evenly distributed, it still makes matches fairly consistent after a small number of inserts. A large number of inserts is less of a concern because the overflow into the next node will push new data into the sequence, making any prior sequential attack highly likely to be cut off by new overflow matches at the front of the sequence.
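For concreteness, the rough arithmetic behind the 99.9% figure, assuming the close rule is comparing the tail Uint32 against MAX_UINT32 / TARGET_SIZE:

```js
const MAX_UINT32 = 4294967295
const TARGET_SIZE = 1000
const threshold = MAX_UINT32 / TARGET_SIZE   // ≈ 4,294,967
// probability that one uniformly distributed tail value closes the chunk
const pClose = threshold / MAX_UINT32        // = 1 / TARGET_SIZE = 0.001
const pStayOpen = 1 - pClose                 // = 0.999, the 99.9% mentioned above
```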

@warpfork (Contributor)

What's the path forward on this?

If it's a tractable amount of work / there's time for it, can it be finished soon (i.e. in the next week or two)?

If it's not something that is likely to get carried much further forward at this time (or in the next couple weeks)... can we mark it very clearly as a draft (with a comment about when it's likely to be expanded on) and merge it that way?

@warpfork (Contributor)

Pinging to see if we can make this mergeable.
