Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

hashmap: 3 kinds of map keys, string, bytes, integers (for discussion) #180

Closed
wants to merge 1 commit into from

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Aug 24, 2019

Not a firm proposal and I'd like to play with this in code before committing to it.

Comes out of a discussion about map key kinds, @whyrusleeping wanting int keys but IPLD ruling them out so far. This proposal isn't to change the data layer map rules, so with HashMap acting in place of a "map" it would still just be strings. But it enables alternative modes of operations when interacting directly with it.

This change moves key encoding from Bytes, to Strings, Bytes and Ints and introduces specific rules for how to encode integers to bytes for the purpose of hashing.

The major interesting thing this introduces is a form of sparse array, which provides the efficient automatic balancing given by hashing, but it misses out on ordered iteration, so you can get/has/set/delete but when you run any of the iterators you get an effectively randomised ordering. I don't know whether this trade-off makes it even worthwhile doing this, maybe it's a use-case thing?

Letting the key be one of three kinds is going to make this a bit annoying in Go. Not only do you need to be able to allow 3 types in arguments for get/has/set/delete but you need to return one of 3 types in the keys() iterator and entries() needs to return tuples or maps containing a key that's one of three types. I can imagine it getting messy, but that could be my Go inexperience. It could become more practical if we introduced a rule that disallowed mixing of types, allowing for a generic-style type description in typed programming land.

@rvagg rvagg mentioned this pull request Aug 24, 2019
@warpfork
Copy link
Contributor

Not only do you need to be able to allow 3 types in arguments for get/has/set/delete but you need to return one of 3 types in the keys() iterator and entries() needs to return tuples or maps containing a key that's one of three types

Unfortunately that doesn't quite reach the end of it, either. Needing three times more code would be... a notable cost, but palatable.

The rule against mixed types of keys would also be needed, as you note.

We'd also need a mechanisms for discriminating which one of the sets of methods to use. This is where it really gets eye-watering. Is there a way to do that which is palatable? The only solution I can think of is three different kinds of "map" in the data model... which is a pretty fundamental and eyebrow-raising thing to consider.

(I don't mean this comment to say it's not worth considering anyway via this angle of the spec here! Just to add a little more note on that last bit.)

@rvagg
Copy link
Member Author

rvagg commented Aug 26, 2019

The only solution I can think of is three different kinds of "map" in the data model... which is a pretty fundamental and eyebrow-raising thing to consider.

Right, so I'd like to keep these two discussions at least a little separate - this isn't a proposal to change the data model, just adding alternative operating modes to a data structure. That would enable modes that are at odds with the data model, so they're not entirely separate discussions but I want to be clear that I'm not suggesting changing the data model here, that's for a separate issue and someone braver than I.

@mikeal
Copy link
Contributor

mikeal commented Aug 26, 2019

Right, so I'd like to keep these two discussions at least a little separate - this isn't a proposal to change the data model, just adding alternative operating modes to a data structure. That would enable modes that are at odds with the data model, so they're not entirely separate discussions but I want to be clear that I'm not suggesting changing the data model here, that's for a separate issue and someone braver than I.

For me, this proposal sort of puts to rest the discussion about changing the data model because it shows how you can pretty easily support this without data model changes.

Unlike many other data models, the IPLD Data Model isn’t a constraint on what can be expressed, it’s only a constraint in how those expressions are serialized to blocks. The Data Model Map not having integer keys does not in any way preclude building Map data structures that support integer keys, it only specifies that they must serialized into simpler data structures that are easily supported in common formats.

Today, the data model is the thing we have solidified and we have a good deal of code built for working with it. But this is still “Stage 1” of IPLD. We’re working towards a future where most programmers use code built on top of this rather that directly against it. It’s short sighted to complicate the data model to enable features we can build easily on top of it just because, right now, we have better libraries for working w/ the data model than on top of it.

Where does this end? Why not do multi-block collections in the data model? If you start to extend the data model well beyond things that are supported already by common formats there’s no reason not to standardize something as advanced as multi-block collections in the data model. If we don’t draw some sort of line then there’s literally no end to the features we could consider and extend at the data model layer and at that point, what’s the point of a data model or layers at all?

We’re already using IPLD Schemas and will have implementations of advanced layouts by the end of the quarter that will include a Map-like multi-block data structure. We have experiments in Composites already that have this, not to mention Links w/ paths and other functionality. All of these show us paths to supporting this functionality without extending the data model.

@whyrusleeping
Copy link
Contributor

it’s only a constraint in how those expressions are serialized to blocks.

I mean, the whole reason i would choose to have a map with integer keys is because i care about how its serialized to blocks... thats like, the point.

@rvagg
Copy link
Member Author

rvagg commented Aug 28, 2019

@whyrusleeping can you explain this more? One main point of IPLD should be that it abstracts most of your serialization concerns. Would it matter if you loaded a block, got a map with integer keys that you could but it was actually encoded as some funky binary object with CBOR tag 2094?

@anorth
Copy link

anorth commented Sep 5, 2019

Filecoin cares a lot about the serialisation cost and size of the encoded objects that form state referenced by the blockchain. Both space and encoding effort are very expensive because they must be recomputed by every validating node.

@mikeal
Copy link
Contributor

mikeal commented Sep 6, 2019

can someone demonstrate the cost difference between an array of tuples w/ k/v pairs (which is the most likely serialization given the current data model) and a map w/ integer keys?

if the saving is actually that large, we could write a feature into the dag-cbor codec that serializes an array of tuples where all the first elements are integers as a map with integer keys. once it comes out of the deserializer it would still be an array of tuples, but we could optimize the encoding if the savings is all that notable.

there’s a bit of a conflict here. IPLD is designed to be flexible across formats and languages which means that, at some point, there’s going to be a tradeoff in terms of efficiency compared to what you could accomplish at a very low level. even if we supported every possible feature in CBOR we wouldn’t be able to touch what you could do with a custom block format.

in fact, if you really want to bring down the serialization size and you have some limitations you can impose on what is represented, writing a custom block format shouldn’t be out of the question. the purpose of the data model is to abstract these block format level features away enough that we can write common interfaces and one of the reasons for that is so that we can take any of these custom block formats people may create in the future.

@Stebalien
Copy link
Contributor

The overhead is one byte per entry (in CBOR).

@rvagg
Copy link
Member Author

rvagg commented Sep 9, 2019

I think we should probably drop the integer keys idea for now, it's too messy and it turns out that Filecoin needs ordered iteration, which we can't support here, so we don't have a user for unorered-sparse-array yet. Without an alternative I think they're just going to pursue #137.

The String/Buffer keys is more interesting for now, coupled with #184.

@rvagg
Copy link
Member Author

rvagg commented Sep 11, 2019

closing in favour of #192 which is this minus ints and sparse array ponderings

@rvagg rvagg closed this Sep 11, 2019
@rvagg rvagg deleted the rvagg/key-kinds branch September 11, 2019 00:51
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants