wip: hash consistent sorted trees #332
base: master
Conversation
Now that I’m staring at it all written out I’m realizing that there are some even better approaches we can take. I’ve been using these simple tail addresses, which are great because they’re 100% consistent, so they don’t have much churn on branch merging. The problem is that they’re super vulnerable to attack. @mikolalysenko solved this by using a floating fingerprint, which is where all these ideas originated. But once you spell out the problem the chunker is solving, it’s a lot simpler than this. You’re already feeding it fingerprints, so you have a randomized distribution you can convert to an integer. Since the chunker is a state machine, all you need to do is introduce a few hard limits on max chunk sizes and reset the state machine whenever you close a chunk. You’d get the consistency benefits of my approach and the safety of the entropy you get from the fingerprinter.
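A minimal sketch of what that state machine could look like, in TypeScript. Everything here is an assumption for illustration (`TARGET_SIZE`, `MAX_SIZE`, and the `Chunker` class are hypothetical names, not the project's actual API): close a chunk on a fingerprint match or at a hard size limit, and reset all state on every close.

```ts
// Hypothetical sketch only; names and sizes are illustrative assumptions.
const TARGET_SIZE = 1000          // desired average number of entries per chunk
const MAX_SIZE = 4 * TARGET_SIZE  // hard limit so a chunk can never grow unbounded

class Chunker {
  private length = 0

  // `fingerprint` is assumed to be a uniformly distributed integer derived
  // from the hashes the tree is already computing for each entry.
  push (fingerprint: number): boolean {
    this.length++
    // Close with probability ~1/TARGET_SIZE per entry, or unconditionally
    // once the hard limit is reached.
    const close = fingerprint % TARGET_SIZE === 0 || this.length >= MAX_SIZE
    if (close) this.length = 0 // reset the state machine on every chunk close
    return close
  }
}
```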
@vmx i just re-wrote the whole doc. you can hold off on reading it for a bit, there’s a lot of churn right now. |
@ribasushi what do you think about this:
UPDATE: the note below was based on a misunderstanding, see the second point below. Do we need the complication of OVERFLOW_LIMIT?
This means that inserting two entries next to each other is exceedingly difficult (nearly preimage-attack level, albeit with lowered bit-ness). Therefore: keep track of TARGET_SIZE, and whenever we cross it, instead of just taking the original hash of the node, byte-append the hash of the previous node and take the hash of that (cheap, since you can re-init state in SHA-2). You then get a new unpredictable hash for the same leaf, and for the leaves after it, until you break out of the attack and things go back to normal.
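A rough sketch of that re-salting step, assuming SHA-256 via Node's built-in crypto module (the function name and the `pastTarget` bookkeeping are hypothetical): once a chunk has crossed TARGET_SIZE without closing, the boundary decision uses the leaf hash chained with the previous node's hash instead of the leaf hash alone.

```ts
import { createHash } from 'node:crypto'

// Illustrative only: past TARGET_SIZE, stop trusting the attacker-influenced
// leaf hash on its own and chain it with the previous node's hash. The result
// is unpredictable to the attacker for this leaf and every leaf after it,
// until a boundary finally lands and things go back to normal.
function boundaryHash (leafHash: Buffer, prevNodeHash: Buffer, pastTarget: boolean): Buffer {
  if (!pastTarget) return leafHash
  return createHash('sha256').update(leafHash).update(prevNodeHash).digest()
}
```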
You have to keep in mind how large the address space is for not matching. If the target size is 1000 you have a 99.9% chance of not closing under the defined limit. It’s actually not impossible to find a sequence in which every rolling hash stays within these boundaries, especially if the target size is low, which we do want to support. Increasing the threshold for a match slowly closes the available address space for an attack, and since even a sequence that entirely matches would be evenly distributed, matches stay fairly consistent after a small number of inserts. A large number of inserts is less of a concern because the overflow into the next node will push new data into the sequence, making any prior sequential attack highly likely to be cut off by new overflow matches at the front of the sequence.
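Back-of-the-envelope numbers for that 99.9% figure, assuming a close probability of exactly 1/TARGET_SIZE per entry (an idealized model, not a measurement):

```ts
const TARGET_SIZE = 1000
const pNoClose = 1 - 1 / TARGET_SIZE // 0.999: a single entry avoids closing 99.9% of the time

// Probability that a run of n consecutive entries all avoid a close, i.e.
// the window an attacker has for keeping one chunk growing:
const runAvoidsClose = (n: number): number => pNoClose ** n

console.log(runAvoidsClose(1000))  // ~0.368: runs the length of the target size are common
console.log(runAvoidsClose(10000)) // ~0.000045: long runs die off exponentially
```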
What's the path forward on this? If it's a tractable amount of work and there's time for it, can it be finished soon (i.e. in the next week or two)? If it's not likely to be carried much further at this time (or in the next couple of weeks), can we mark it very clearly as a draft (with a comment about when it's likely to be expanded on) and merge it that way?
Ping: can we make this mergeable?
still in a very early draft state, but i’ve written enough in this web editor that i’m no longer comfortable not having it committed to a branch :)