Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New serializer API #37

Closed
wants to merge 5 commits into from
Closed

New serializer API #37

wants to merge 5 commits into from

Conversation

pchampin
Copy link
Owner

with TripleSerializer and QuadSerializer traits,
and TripleStringifier and QuadStringifier specializations.
This is meant to replace the "duck-typing" style Config objects.

In the process, removed some duplicate code btw nt and nq serializers.

with TripleSerializer and QuadSerializer traits,
and TripleStringifier and QuadStringifier specializations.
This is meant to replace the "duck-typing" style Config objects.

In the process, removed some duplicate code btw nt and nq serializers.
and also a little bit of the parser module
@pchampin
Copy link
Owner Author

@MattesWhite @Tpt whould you mind having a look at this PR? This is the long awaited refactoring of the serializer module, in the line of what I did with parsers. What do you think?

) -> Result<String, StreamError<TS::Error, Self::Error>>
/// [`TripleSource`]: ../triple/stream/trait.TripleSource.html
/// [`Graph`]: ../graph/trait.Graph.html
pub trait TripleStringifier {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of the derive macro maybe this could be better solved by something like (pseudo code):

trait TripleStringifier: TripleSerializer<T>
where
    String: From<T>,
{
    fn stringify_triples(...) {
        self.serialize_triples_in(...).map(Into::into)
    }
    ...
}

Maybe use rather Vec<u8> instead of T and do a proper conversion (what you actually do in the macro).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding my proposal at line 21 this would be:

fn vec_to_string(v: Vec<u8>) -> String {
    String::from_utf8_lossy(&v).to_owned()
}

trait TripleStringifier: TripleSerializer<Outcome=Vec<u8>> {
    fn stringify_term<TD: TermData>(self, t: &Term<TD>) -> Result<String, Self::Error> {
        let mut ser = self;
        ser.serialize_term(t)?;
        ser.finish().map(vec_to_string)
    }

    ...
}

Maybe add some shorthand functions for Self: Default ...

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan either. Here are alternative solutions I considered, by decreasing order of preference:

  • have TripleStringifier (resp. QuadStringifier) automatically implemented by any impl of TripleSerializer (resp. QuadSerializer) supporting &mut Vec<u8> (which includes those supporting any Writer). I tried to do it that way, but I failed. The problem is that you need to declare a lifetime for the mut reference to Vec<u8>, and this proves to be tricky. I even tried something with a HRTB (for <'a> ...) but that failed as well.

  • have a proper derive macro, so that one could simply write [derive(TripleStringifier)] above their type implementing TripleSerializer. But I don't know procedural macros.

  • have hacky pseudo-derive using a standard macro. This is where we are.

But if you manage to get one the solutions above, I'm all for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clearly depends on the Serializer trait. Of course I'll spent time on that to get a working solution. But lets settle this for now and focus on the main trait.


fn new(write: W, config: Self::Config) -> Self;
/// A sink serializing into `target` the triples fed to it.
fn sink(&self, target: T) -> Self::Sink;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this architecture blurrs the lines what a serializer is. On the one hand there is TripleSerializer that offers methods to serialize (serialize_triples_in(to)*()) but it doesn't implement the serializing. On the other hand a sink must be implemented that actually does serialization but don't implements TripleSerializer ... 🤔 . Even worse, thinking about Turtle-serialization for streams and graphs it is required to implement two different algorithms, i.e. different 'sinks' but the trait only allows one associated sink.

I suggest a more plain trait the does not include construction but focuses more on serialization. Indeed, my proposal is pretty close to the original and is similar to a TripleSink.

trait TripleSerializer {
    type Outcome;
    type Error: std::error::Error;

    fn serialize_term<TD: TermData>(&mut self, t: &Term<TD>) -> Result<(), Self::Error>;

    fn serialize_triple<T: Triple>(&mut self, t: &T) -> Result<(), Self::Error>;

    fn serialize_triples<TS: TripleSource>(&mut self, ts: TS) -> Result<(), StreamError<TS::Error, Self::Error>>;

    fn serialize_graph<G>(&mut self, g: &G) -> Result<(), StreamError<G::Error, Self::Error>>;

    fn finish(self) -> Result<Self::Outcome, Self::Error>;
}

A TripleStringifier is would be similar just return String instead of () and provide a blanked implementation for TripleSerializer<Outcome=Vec<u8>>. However, this proposal probably needs some refinement.


* in this case the preposition 'in' indicates me that the serialization is done somewhere inside without modifying the outer thing. I'd rather go with 'into'. For me this is a more common description in Rust for: Take something convert it and modify the target. (BTW the same for in_sink().)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the two design principles that I tried to follow when refactoring the parser and serializer APIS:

  1. be intuitive: a Parser should have a parse method (or several); a Serializer should have a serialize method (or several);
  2. integrate well with the general streaming model of sophia, i.e. provide implementations of TripleSource/TripleSink (or their quad counterpart) whenever possible.

You object that the actual serializing is done in the sink and not in the serializer per se, but the same criticism could be done on parsers as well (the actual parsing is done by the source returned by the parse method).

Note also that I have been aware of this "inconsistency" from the start. This very concern lead me to the previous design, naming the "static" object with parse/serialize methods "Config" rather than "Parser"/"Serializer". This was deemed unintuitive by several people (including yourself, IIRC 😉), hence the new design (and item 1 above).

Even worse, thinking about Turtle-serialization for streams and graphs it is required to implement two different algorithms,

Serializing a stream and serializing (smartly) a graph do indeed require two different algorithms. So it might be desirable to have two different traits... On the down side, this would make the API more complex. On the up side, you might say that implementors could choose to implement only the graph serialization, and not the stream serialization (the opposite is not a problem, given the default implementation of serialize_graph_in). But my feeling is that once you have a the algorithm for graphs, the algorithm for streams is really not that hard to get...

i.e. different 'sinks'

No, the sink is only for parsing streams.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re. the TripleSerializer trait you proposed:

  • serialize_term does not make sense for some serializers (i.e. RDF/XML), where terms can not always be "isolated";

  • serialize_triple is not strictly required by users (they can easily reuse serizlier_triples or serialize_graph to get the same result); and for internal use, such a method does not always make sense (in Turtle or RDF/XML, several triples share the same subject, so not all triples are serialized identically);

  • serializer_triples means that your design also suffers from the "two distinct algorithms" problem discussed above.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re. "in" vs. "into":

I have given this some thought when designing the stream API – first I wanted to name the methods into_sink and into_graph of TripleSource, but this was misleading: it gave the impression that the methods were for converting the TripleSource into something else, which is wrong.

So I went for in instead of into, and I kept this habit for serializers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note also that I have been aware of this "inconsistency" from the start. This very concern lead me to the previous design, naming the "static" object with parse/serialize methods "Config" rather than "Parser"/"Serializer". This was deemed unintuitive by several people (including yourself, IIRC 😉), hence the new design (and item 1 above).

I have to admit that this design is actually pretty good. And the same I come up with on a second thought. However, I still believe that the initialization should not be part of a basic trait.

Indeed, the methods serialize_term() and serialize_triple() add no benefits and can be removed.

Regarding the 'in' vs 'into' discussion... why not name it serialize_triples() and serialize_graph()? This does exactly describe what is done and the API clearly tells the user if the argument is consumed (TripleSource) or just borrowed (Graph). For me this is one of the biggest advantages of Rust's typesystem 👍 .

To sum up, I now propose to stick with my previous proposed trait but remove the two unnecessary methods. Maybe add : TripleSink to get Self::Outcome and Self::Error.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I still believe that the initialization should not be part of a basic trait.

I agree with this general principle. But it seems that we are placing the "initialization" at different levels.

I'm assuming that by "initialization", you are referring to setting the target (a file in general) of the serializer. In my view, this is already a second step; the first step is setting the different options of the serialier, like "should it use ASCII or UTF-8" (for the N-Triples serializer) or "what prefixes should it use" (for a future Turtle serializer)...

I think those two steps should be separate. I think it is important that one can configure, say, a turtle serializer with a number of option (typically, which prefixes it should use), and then use that serializer several times, to serialize different graphs in different files...

Also, I'm trying to make the Parser and Serializer APIs as symmetrical as possible. But I give you that the symmetry is not perfect in the current state. I'll come with another proposal, in the main thread of this PR.

sophia/src/serializer.rs Outdated Show resolved Hide resolved
@MattesWhite MattesWhite mentioned this pull request Jan 24, 2020
Although, strictly speaking, `mut` in front of a parameter name is
not significant for the function signature, it is misleanding,
as pointed out in
#37 (comment)

This change, although semantically equivalent and slightly more
verbose, aims at avoiding that confusion.
@pchampin
Copy link
Owner Author

In a subcomment above, I promised another proposal, and I meant it, but actually I came back to my original design :-/. Let my try to explain how.

Currently, the parser library works like that:

// create a Parser -- this is not specified by the Parser trait
let p = TurtleParser::new().with_default_prefixes(&prefix_map);
// use it to parse file f1 in graph g1
p.parse(f1).in_graph(&mut g1);
// alternatively, use the triple stream API to parse file f2 in graph g2
let parsed_triples = p.parse(f2); // returns a TripleSource
g2.insert_all(parsed_triples);

I find this API intuitive, and easy to read and write.

If we try to imagine a perfectly symmetrical API for serializers, it would look like

// create a Serializer -- this is not specified by the Serializer trait
let s = TurtleSerializer::new().with_prefixes(&prefix_map);
// use it to serializer graph g1 in file f1
s.serialize_graph(&g1).in_(f1);
// alternatively, use the triple stream API to serialize graph g2 in file f2
let turtle_file = s.sink(f2); // returns a TripleSink
g2.triples().in_sink(turtle_file);

This looks (IMO) equally nice to the Parser example,
but it raises the question of what the serialize_graph method returns,
and furthermore what would be the use of the returned object,
except for calling its "in_" method (granted, the name is not good, but that's not the point here).
So I came to the conclusion that this intermediate object should probably not exist,
and I came back to the original design.

// create a Serializer -- this is not specified by the Serializer trait
let s = TurtleSerializer::new().with_prefixes(&prefix_map);
// use it to serializer graph g1 in file f1
s.serialize_graph_in(&g1, f1);
// alternatively, use the triple stream API to serialize graph g2 in file f2
let turtle_file = s.sink(f2); // returns a TripleSink
g2.triples().in_sink(turtle_file);

Does it make more sense?

@MattesWhite
Copy link
Contributor

I don't say that your proposal doesn't make sense. The serializer trait you suggest is perfectly aligned to the source/sink model you follow.

I think of a different model for serializers. For me a serializer's purpose is to serialize data to one target. I agree that this means there a two steps for initialization. First create some configuration that you may be able to share across threads and so on. The difference in my thinking is that the concrete serializer is initialized with that config and a target, so the one serializer is owning it one target. Throughout the life of the serializer I can add one or more graphs, triples and so on. Eventually, the serializer is consumed and the target is returned (therefore fn finish(self) -> Result<Self::Outcome>).

and furthermore what would be the use of the returned object,

Therefore, I suggested to return Result<()> until serialization is finish()ed nothing is returned but errors are reported.

To sum up, I prefer sharing a config and built serializers on demand, e.g. per thread. As far as I understand your solution shares one serializer that is responsible for serializing to several targets. I think that can become a problem when the serializer is used in different threads.

@MattesWhite
Copy link
Contributor

Another point for using one serializer per target is streaming Turtel serialization. In order to be able to produce property and object lists the serializer needs to store the last subject and predicate it serialized. This can be used to continue serialization over two streams. When one serializer is used for several targets this would not work. Worse, it could produce incorrect serializations, e.g. continue the serialization in another target.

@pchampin
Copy link
Owner Author

@MattesWhite Both approaches have their pros and cons. They both have their risks/limitations (for example, it does not always make sense to serialize two graphs into the same file -- with RDF/XML, that would produce an invalid document), but they also both have their valid uses cases.

So my conclusion is that we should probably support both approaches. I will amend the PR in that sense on the next few days.

Thanks a lot for your critical eye, and for challenging me into improving the design of Sophia :-)

@MattesWhite
Copy link
Contributor

Indeed, I thought about this problem two. And I'm curious about your solution. I suggest to solve this problem by consulting the RDF spec:

An RDF document is a document that encodes an RDF graph or RDF dataset in a concrete RDF syntax, [...].

Consequently, the serialization of an Graph would also finalize the serialization process - for a triple serializer, as you proposed. Maybe the best solution is realy to provide two serializer traits. The first for streaming triples that is capable of consuming several TripleSources before finishing. The second takes one complete Graph and finishes immedeatly. Where the streaming quads version should also be able to take Graphs.

To sum up:

trait TripleSerializer {
    type Outcome;
    type Error: std::error::Error;

    fn serialize<TS: TripleSource>(&mut self, ts: TS)
        -> Result<(), StreamError<TS::Error, Self::Error>>;
    fn finish(self) -> Result<Self::Outcome, Self::Error>;
}

trait GraphSerializer {
    type Outcome;
    type Error: std::error::Error;

    fn serialize<G: Graph>(self, g: &G)
        -> Result<Self::Outcome, StreamError<G::Error, Self::Error>>;
}

trait QuadSerializer {
    type Outcome;
    type Error: std::error::Error;

    fn serialize_stream<QS: QuadSource>(&mut self, qs: QS)
        -> Result<(), StreamError<QS::Error, Self::Error>>;
    fn serialize_graph<G: Graph>(&mut self, g: &G, name: Option<&Term>) 
        -> Result<(), StreamError<G::Error, Self::Error>>;
    fn finish(self) -> Result<Self::Outcome, Self::Error>;
}

trait DatasetSerializer {
    type Outcome;
    type Error: std::error::Error;

    fn serialize<D: Dataset>(self, d: &D) 
        -> Result<Self::Outcome, StreamError<D::Error, Self::Error>>;
}

Thanks a lot for your critical eye, and for challenging me into improving the design of Sophia :-)

Your welcome. I realy like our discussions as well and the results we found are pretty well I would say. Also I'm constantly a little bit afraid to go a step to far and start annoying you. So if this is ever the case please feel free to stop me ;-)

We now have two kinds of serializers, so to speak.

* (Triple|Quad)Serializer is mostly a serializing "configuration"

  + it is not attached to any target
  + it is the symetric to the (Triple|Quad)Parser trait
  + in some situations, it can also implement (Triple|Quad)Stringifier

* (Triple|Quad)SerializingSink is a serializer attached to a given target

  + it is produced by the former's `to` method of the former
  + it is responsible of the actual serialization
  + it acts as a (Triple|Quad)Sink, but implementations can also handle a
    whole Graph|Dataset for better serializations...
@pchampin
Copy link
Owner Author

Here is a new version of the serializer API. It is still very similar to my original proposal, but the serializing methods are now clearly carried by the sink (TripleSerializingSink and QuadSerializingSink, respectively).

For the sake of symmetry with the parser API, the use pattern that I described above is still possible, with the serialize_triples_in (etc.) methods. But those are mostly shortcuts for generating a sink and calling the corresponnding method on it.

Another use pattern, closer to the one you were advocating, is now possible:

// serializing graph g1 into file f1
let ttl_file = TurtleParser::new().with_prefixes(&prefix_map).to(f1);
//                                             this is new -> ^^^^^^
ttl_file.serialize_graph(&g1);

Note that I didn't follow your latest suggestion to have separate traits for serializing sources and serializing graphs. I would rather leave it to implementors to decide what they allow and what they disallow. It makes sense in Turtle to serialize two graphs in the same file; it does not in RDF/XML, so that implementation may chose to close the underlying file at the end of serialize_graph. Still, I can imagine some more exotic target where serializing several graphs in RDF/XML would make sense...

@MattesWhite
Copy link
Contributor

Note that I didn't follow your latest suggestion to have separate traits for serializing sources and serializing graphs. I would rather leave it to implementors to decide what they allow and what they disallow.

Okay you are right. It is okay to handle this by throwing an error if a second graph shouldn't be serialized.


For the sake of symmetry with the parser API, ...

I can understand that you stick to the 'serializer + sink' pattern to be symmetric with the parser. But in my view parsing and serialization are two inherent different things. Therefore, symmetry is not a valid design factor for the serializer API for me. On the first glance both parsing and serializing are symmetric. "Go from document to data" and "go from data to document". However, this is already the point where similarity ends. My statement can be proven by simply parsing a document and immediatly serialize it again. I'm pretty sure that for any set of parser and serializer the output will not be the exact same as the input despite being correct (as long as the input does not come from the serializer itself).

On an implementation level, the parser is right in supporting a lazy evaluation, i.e. an Iterator API, as we might only want to check if a certain triple is contained or we want to only parser a certain set of triples. Here the Iterator can help us to stop early and save resources. On the other hand, when we are serializing data we are in control of it already, only giving the triples to the serializer we want to. This means that a serializer will consume a TripleSource at once.

To sum up, I advocate to break the symmetrie of parser and serializer. For me the TripleSerializingSink is all the API we need for a serializer in a core-crate.


BTW the insymmetry of parsing and serializing is an example why I'm not a fan of the Sink-trait in general. It adds complexity where it don't have to be. For example, the Inserter and Remover sinks could be removed by implementing the same thing directly in Graph's methods. Maybe a more convincing point is that there is no special sink type in Rust I can think of no example neither in the std-lib nor in any popular crate I can think of. Of course I will change my mind if you bring any counterevidence.

@pchampin
Copy link
Owner Author

About parsing a file and reserializing it: the fact that its does not always round-trip does not mean that the general process is not symetric... We are considering different levels.

About sinks, you have a point, though...

@pchampin
Copy link
Owner Author

Superceded by 9a97dae

@pchampin pchampin closed this Feb 23, 2020
@pchampin pchampin deleted the serializer_api branch September 19, 2023 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants