Max node size limitations #48
First off, I completely agree that this is a massive pain-point and we need to clearly define what users can and can't do.
I'm not sure how we can do this and still get the security guarantees we want. At the end of the day, we need to pick some cutoff where we can say "I'm not accepting this block from you at this point, it's too big and I can't validate it". Now, 1-2MiB may not be the correct solution, given that we make 256KiB nodes all the time. Really, we might want 10MiB as the hard limit but recommend 1-2MiB. However, the larger we make this, the easier it is to attack the network.
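The hard-versus-recommended cutoff described above could look something like the sketch below. This is purely illustrative: the constants, the `accept_block` name, and the warning behavior are assumptions for the sake of the example, not from any IPLD or Bitswap implementation.

```python
# Hypothetical sketch: a receiving node enforcing a hard block-size cutoff
# while only warning above a soft (recommended) limit. All names and
# constants here are illustrative, not from any real implementation.

SOFT_LIMIT = 2 * 1024 * 1024   # 2 MiB: recommended maximum
HARD_LIMIT = 10 * 1024 * 1024  # 10 MiB: refuse outright

def accept_block(block: bytes) -> bool:
    """Return True if the block may proceed to validation, False if rejected."""
    if len(block) > HARD_LIMIT:
        # "I'm not accepting this block from you, it's too big to validate"
        return False
    if len(block) > SOFT_LIMIT:
        print(f"warning: block is {len(block)} bytes, above the recommended limit")
    return True
```

The point of the two thresholds is that the hard limit bounds the attack surface while the soft limit steers well-behaved writers toward smaller nodes.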
I should re-phrase this to be "no hard size limitations in IPLD." Separately, I'd like to work towards removing size limitations from the transport layer (GraphSync/Bitswap). If the size limit is "my hash validator runs out of memory" or "my serializer runs out of memory," then those are reasonable and in-line with the way this currently works from a developer perspective (if you try to use a JSON parser on an object that is too big it runs out of memory). Storage nodes will end up having their own limitations that are, probably, configurable. A multi-user storage system needs to protect against this in a way that single user nodes may not.
Then we need to implement this guard at the layer where it will be attacked. Enforcing these limits in our serializers does nothing to stop this attack, because someone can just write a serializer that doesn't enforce the limit and put nodes into the network; in fact, that's trivial to do today. As we implement these limits we need to document the reasons for them at that layer. It's entirely possible that one approach to building a network will have a different size limit than another. It's also possible that different clients will need different limitations depending on their environment, so it may even be necessary to push this out of the transport layer and into the client implementation. If we look at prior art on the centralized web, we have not only size limitations but also network timeouts at every proxy and caching layer as well as in all clients, yet nothing in HTTP, HTML or CSS specifies a max size.
Ah. I see what you're getting at. Yes, the serializers and deserializers should work with arbitrary objects. As far as I know, we don't have any limits in go-ipld-*, right? If so, that's definitely a bug.
We do have to be careful here. Basically, we don't want to end up in a situation where a user can create a large node that other users then can't fetch. For new data (this doesn't fix, e.g., git), we can consider using merkle-tree based hash functions for large objects. That would allow us to validate parts of blocks as we receive them. We calculate the CID/hash after we serialize, so this should be pretty transparent to the user.
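The merkle-tree idea above can be sketched as follows: hash a large object as a tree over fixed-size chunks, so that a receiver who knows the root can verify each chunk as it arrives instead of buffering the whole block before checking anything. This is a toy construction to show the shape of the idea, not an IPLD API; the 256 KiB chunk size and the pairing scheme are assumptions.

```python
# Illustrative sketch (not an IPLD API): hash a large object as a Merkle
# tree over fixed-size chunks. A receiver that knows the root (e.g. from
# the CID) can validate each chunk against its leaf hash on arrival.

import hashlib

CHUNK = 256 * 1024  # 256 KiB leaves; an assumed chunk size

def leaf_hashes(data: bytes) -> list:
    """Hash each fixed-size chunk of the object independently."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]

def merkle_root(hashes: list) -> bytes:
    """Pairwise-combine leaf hashes up to a single 32-byte root."""
    if not hashes:
        return hashlib.sha256(b"").digest()
    while len(hashes) > 1:
        hashes = [
            hashlib.sha256(
                hashes[i] + (hashes[i + 1] if i + 1 < len(hashes) else b"")
            ).digest()
            for i in range(0, len(hashes), 2)
        ]
    return hashes[0]
```

Because each leaf covers only one chunk, a receiver can reject a corrupted chunk immediately, which is what makes larger blocks tolerable to validate incrementally.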
Quite some time has passed since the last update on this issue. Is there any news on this topic? I'm quite concerned about what @Stebalien said about users creating large nodes that other users then can't fetch. It would be nice to know in which places size limits are checked, and what happens when that validation fails in each case, to understand how all this affects the DHT, Bitswap, GraphSync and other components of the whole ecosystem.
The DHT only ever stores the CID of a block, so the size of the actual block isn't a problem. Bitswap's limit is 2MB (or maybe a few bytes below 2MB?). Graphsync implementations may not have a limit now but likely will in the future. The current recommendation is: keep your blocks at 1MB or below and nothing bad will happen.
mikeal commented Sep 19, 2018
First, I'd like to try and catalog/document the world we have today, then talk about what I think we should move towards.
Today
dag-cbor in JS (does Go have a limit?) has a hard limit at something like 500K with the current defaults.

I've now seen several issues where these limits are thrown around and selectively enforced. While we have not documented a hard size limit, the current limits in our implementations are used as an excuse not to fix limitations elsewhere. There are a lot of reasons why we want to keep nodes small, and many performance issues we can hit if nodes are too large. However, there's no consistent limit on all nodes, and we already know that, at some level, we'll have to support arbitrary sizes in order to support git.
More importantly, the developer impact of these size limitations is quite punitive. There's no way to know how large a node will be once serialized until you actually serialize it. If a developer wants to implement sharding once a node reaches a particular size, they have no way of predicting when the node hits that limitation. We often throw around "use sharding" or "use a HAMT" as a solution to this problem, but there just isn't a good way to predict when this is necessary based on the size of the serialized node. It's totally reasonable to tell developers "once you have 1K keys you should be sharding," but it's not reasonable to say "once the serialized CBOR representation is over 500K," because that means they'll always have to wrap serialization in a try/catch and they'll always be attempting to serialize gigantic nodes just to figure this out.
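The ergonomics gap above can be made concrete with a sketch: a size-based rule is only checkable after doing the (possibly huge) serialization, while a key-count rule is checkable up front. The function names, the JSON stand-in for CBOR, and the thresholds are illustrative assumptions, not real IPLD API.

```python
# Hedged sketch of the ergonomics problem. A size-based sharding rule
# forces you to serialize just to discover you must shard; a key-count
# rule is checkable before any serialization work. JSON stands in for
# CBOR here, and all names/thresholds are illustrative only.

import json

SIZE_LIMIT = 500 * 1024   # e.g. a 500K serializer limit
KEY_LIMIT = 1000          # "once you have 1K keys you should be sharding"

def must_shard_by_size(node: dict) -> bool:
    # Only knowable after paying the full serialization cost.
    return len(json.dumps(node).encode()) > SIZE_LIMIT

def must_shard_by_count(node: dict) -> bool:
    # Knowable up front, before serializing anything.
    return len(node) > KEY_LIMIT
```

The first predicate is exactly the try/catch-around-serialization pattern the paragraph complains about; the second is the kind of rule developers can actually plan around.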
Even worse, this creates an incentive to start doing compression at the node layer. I did this in some of the gharchive work and it's not a solution we should drive people towards. It means the compression gains we might see at the transport and storage layers will be redundant and possibly even punitive.
Solution (Future)