Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to derive serializers #6

Closed
bbenne10 opened this issue Aug 17, 2020 · 13 comments
Closed

Unable to derive serializers #6

bbenne10 opened this issue Aug 17, 2020 · 13 comments

Comments

@bbenne10
Copy link
Contributor

I'm currently working on a project where I would like to serialize and de-serialize instances of Bloomf.t to and from disk.
Because the main type is abstract over an input type and the actual implementation is not exposed, I don't have access from external code to implement {de,}serializers.
One very simple solution to this would be to simply expose priv.
Shortly, I'll be opening a PR that does just this (though it also renames priv to impl since it is no longer private).
This is not 100% ideal, as it leaves up to the consumer to reconstruct the abstracted type from context,
but I'm not yet skilled enough to come up with a better solution in a vacuum.
Do you all have thoughts on how one might implement fully type-safe serialization support here?

@pascutto
Copy link
Contributor

Hi! Thanks for your interest.
I'm not really keen to exposing internals of the library to the users, since that would enable them to break some important invariants of the filters.
Do you think writing simple {to,from}_bytes or {to,from}_string would match your use case?

@dinosaure
Copy link
Member

On other side, the Marshal module is another solution to serialize/deserialize any values.

@bbenne10
Copy link
Contributor Author

I'd prefer not to use Marshal, as it breaks between OCaml compiler versions which is a real problem for us.

A {to,from}_bytes or {to,from}_string COULD work.
I noticed after opening this PR that there's an open PR for saving state of the hash functions and the like, which is unimportant for me, but could VERY well be useful. Is there a reason that this hasn't been merged? Or is it just because it needs a rebase?

I completely understand not wanting to expose internals (especially as it breaks type encapsulation on the resultant type), but we'd then need to save a representation of the parametrized type so that the serialize would know how to reconstruct that type in deserialization.

A to/from bytes or string implementation could do that, but would require significant care to ensure that both sides of the operation worked properly. Saving the implementation type works more reliably (technically), but requires care from the developer to know what the type of the original instance was and only compare with that type. I'm fine with the latter, but I'm only comparing one type.

@pascutto
Copy link
Contributor

I noticed after opening this PR that there's an open PR for saving state of the hash functions and the like, which is unimportant for me, but could VERY well be useful. Is there a reason that this hasn't been merged? Or is it just because it needs a rebase?

Indeed, I actually forgot about #1.
The same caveat applies there, exposing types and functions that break the abstraction and enable the user to change or even know about the internals is not the way to go, as the structure contains subtle invariants that should not be breakable (this is enforced by the interface).
I'm not sure what you're refering to with "the state of the hash functions", but hash functions should behave as pure functions, so there is no state to save. Other parameters would indeed have to be saved though, and that would be the responsibility of the {of,to}_bytes functions.

I completely understand not wanting to expose internals (especially as it breaks type encapsulation on the resultant type), but we'd then need to save a representation of the parametrized type so that the serialize would know how to reconstruct that type in deserialization.

The bloom filters do not contain a representation of the parametrized type. (De)serialization is by essence unsafe, so it's always the responsibility of the user/storage to ensure the data is used correctly when written somewhere else, regardless of the serialization strategy.

@bbenne10
Copy link
Contributor Author

I'm not sure what you're refering to with "the state of the hash functions"

Seeds, unless I misread the PR.

The bloom filters do not contain a representation of the parametrized type.

Of course not - that is exactly the problem.
The question is "Given that the storage is trusted and that type definitions of the parametrized type haven't changed, how do we allow the user to serialize both a reference to the parametrized type AND the underlying implementation so that de-serialization is possible?"

Refinements could be made to that question in regards to the parametrized type's definition evolving or something, if you'd like, but this works for me for now.

How do we approach that problem? As mentioned, I'm new to the OCaml world, but quite well versed in the OO world. My first thought is reflection or introspection, but that requires some level of runtime type-checking and type preservation (which - to my understanding - OCaml does away with during compilation). How would you approach this problem? I'm happy to explore and attempt an implementation, but don't want to waste anyone's time with efforts that won't be accepted upstream.

(De)serialization is by essence unsafe, so it's always the responsibility of the user/storage to ensure the data is used correctly when written somewhere else, regardless of the serialization strategy.

Of course. But many of the oft-cited problems with serialization and deserialization fall outside the responsibility (or even the ability) of a library such as this to encapsulate.

Dealing with changed type definitions or mutated hash functions?
Bail if you can (with an exception)
or rely on the developers to know that changes to types/hash functions invalidate serialized data.

Someone else has access to your storage drives and can muck with your data?
Hope you trust them!

There's only so much that a language can do here.

@craigfe
Copy link
Member

craigfe commented Aug 20, 2020

How do we approach that problem? As mentioned, I'm new to the OCaml world, but quite well versed in the OO world. My first thought is reflection or introspection, but that requires some level of runtime type-checking and type preservation (which - to my understanding - OCaml does away with during compilation). How would you approach this problem? I'm happy to explore and attempt an implementation, but don't want to waste anyone's time with efforts that won't be accepted upstream.

Indeed: OCaml values are not tagged with their type at runtime; the assumption being that type errors are for the type checker 🙂 As a result, it's not natural in OCaml for serialised values to contain type information either: it's the users responsibility to ensure that type errors do not happen, and neither the OCaml runtime nor the bloomf library is ultimately capable of ensuring "type safety" (in the way that an OCaml programmer thinks of it).

Having said that it's not natural in OCaml, it's still possible. We can imagine an interface where the codecs require the user to pass an explicit "type representation" (like a JVM klass pointer, but well-typed):

val to_bytes : 'a type_repr -> 'a bloomf -> bytes
val of_bytes : 'a type_repr -> bytes -> 'a bloomf

(* e.g. *)
let buffer : int list bloomf = Bloomf.to_bytes Type.(list int) filter

which could then be included in the bytes representation to protect against users against creating a bloom filter at one type, passing it through a pipe, then trying to use it at another. We even have functions exactly like this in Irmin (where it's strictly required, since the codecs are derived from the runtime type representation). However:

  • the user has to build this representation by hand, so the API becomes heavy;

  • Bloomf doesn't do anything with this data other than performing an equality check later (so the user might as well build their own API on top of Bloomf that provides this run-time check);

  • it doesn't have much benefit: only adds a new sort of run-time error that the user might have better expressed at compile time with types (e.g. by tagging bytes with a phantom parameter)

In short: since dynamic typing can be built on top of static typing (but not vice-versa without a performance penalty), we tend to go with the flow of the OCaml language and not have runtime type representations. If this was a language that had already paid the performance penalty of dynamic types, it might be a different situation 🙂

@bbenne10
Copy link
Contributor Author

So I fail to see how this leaves us in any different a scenario than simply serializing and deserializing priv?
Please don't read this as disrespect or frustration - I just don't see an option that lets us solves this issue in a way that aligns with @pascutto's concerns.

@pascutto
Copy link
Contributor

pascutto commented Aug 20, 2020

So I fail to see how this leaves us in any different a scenario than simply serializing and deserializing priv?

That would be the way to go, my point was that this should be handled by internal functions (e.g. {to,from}_bytes), and not by exposing the type and doing this in the client code, in order to preserve the abstraction.
Of course, these would be responsible for serializing internal parameters along with the bit array.
What remains for the developer is:

  • Make sure that the serialised/deserialised data is used with the same parametrized type
  • Make sure that the same hash function is used before and after serialisation/deserialisation. In particular, that's where the "state of the hash function" fits, as there is no such thing in bloomf at the moment, you have to provide your seeded hash function to the functor instead.

Please don't read this as disrespect or frustration - I just don't see an option that lets us solves this issue in a way that aligns with @pascutto's concerns.

No problem :)
Maybe we misunderstood what your constraints are. Could you please precise what your use case is, and maybe tell us why {to,from}_bytes functions would not work?

@bbenne10
Copy link
Contributor Author

Maybe we misunderstood what your constraints are. Could you please precise what your use case is, and maybe tell us why {to,from}_bytes functions would not work?

I did not intend to imply that they wouldn't. Far from it, in fact, as I believe this would be a good answer! My initial solution is, admittedly, aimed at only my immediate use case (using ppx_deriving_yojson). {to,from}_bytes functions would also be perfectly acceptable and would subsume my need for deriving serializers for Bloomf at all.

I remain unconvinced that I yet understand enough to do this from scratch, but I have attempted to make that clear :P What other state need be serialized? I don't see anything other than m, p_len, k and b that need be written out. The rest of the module seems perfectly well contained and stateless. Am I wrong here? Is there something I'm missing? If not and seeing that all of the fields in question are floats or integers (or serializable to an int, in the case of the actual Bitv), would there be a problem of writing out each variable in question (in some arbitrary fixed order - say alphabetic on variable name) separated by some character (I'd propose ASCII code 31 - Unit Separator)?

@pascutto
Copy link
Contributor

pascutto commented Aug 20, 2020

What other state need be serialized? I don't see anything other than m, p_len, k and b that need be written out. The rest of the module seems perfectly well contained and stateless.

That should indeed be enough.

or serializable to an int, in the case of the actual Bitv

An int is probably not enough, but Bitv provides serialisation functions that you can use directly (see here).

would there be a problem of writing out each variable in question (in some arbitrary fixed order - say alphabetic on variable name) separated by some character (I'd propose ASCII code 31 - Unit Separator)?

That sounds ok to me! Maybe check that the Bitv serialisation does not use that specific character though, or put it at the end.

Let me know if you have any issues implementing this so I can help, or do it if I get the time to :)

@bbenne10
Copy link
Contributor Author

An int is probably not enough, but Bitv provides serialisation functions that you can use directly (see here).

Seems I misread max_length as max_size. I'll look into how to get this done in the morning (we just hit 5:10PM EST, so I'm about to head out for the evening). Hopefully, I should have something by about this time tomorrow (provided none of my other fires gets too large to need putting out ;) )

@bbenne10
Copy link
Contributor Author

I've got what appears to be a working (though maybe not idiomatic) implementation. I'll open a PR for discussion today. I'll have to familiarize myself with alcotest to get some tests working before I feel comfortable with it getting merged though.

@pascutto
Copy link
Contributor

Added in d581f77, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants