-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New serializer API #37
Conversation
with TripleSerializer and QuadSerializer traits, and TripleStringifier and QuadStringifier specializations. This is meant to replace the "duck-typing" style Config objects. In the process, removed some duplicate code btw nt and nq serializers.
and also a little bit of the parser module
@MattesWhite @Tpt whould you mind having a look at this PR? This is the long awaited refactoring of the |
) -> Result<String, StreamError<TS::Error, Self::Error>> | ||
/// [`TripleSource`]: ../triple/stream/trait.TripleSource.html | ||
/// [`Graph`]: ../graph/trait.Graph.html | ||
pub trait TripleStringifier { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of the derive macro maybe this could be better solved by something like (pseudo code):
trait TripleStringifier: TripleSerializer<T>
where
String: From<T>,
{
fn stringify_triples(...) {
self.serialize_triples_in(...).map(Into::into)
}
...
}
Maybe use rather Vec<u8>
instead of T
and do a proper conversion (what you actually do in the macro).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding my proposal at line 21 this would be:
fn vec_to_string(v: Vec<u8>) -> String {
String::from_utf8_lossy(&v).to_owned()
}
trait TripleStringifier: TripleSerializer<Outcome=Vec<u8>> {
fn stringify_term<TD: TermData>(self, t: &Term<TD>) -> Result<String, Self::Error> {
let mut ser = self;
ser.serialize_term(t)?;
ser.finish().map(vec_to_string)
}
...
}
Maybe add some shorthand functions for Self: Default
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan either. Here are alternative solutions I considered, by decreasing order of preference:
-
have
TripleStringifier
(resp.QuadStringifier
) automatically implemented by any impl ofTripleSerializer
(resp.QuadSerializer
) supporting&mut Vec<u8>
(which includes those supporting anyWriter
). I tried to do it that way, but I failed. The problem is that you need to declare a lifetime for the mut reference toVec<u8>
, and this proves to be tricky. I even tried something with a HRTB (for <'a> ...
) but that failed as well. -
have a proper
derive
macro, so that one could simply write[derive(TripleStringifier)]
above their type implementingTripleSerializer
. But I don't know procedural macros. -
have hacky pseudo-derive using a standard macro. This is where we are.
But if you manage to get one the solutions above, I'm all for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This clearly depends on the Serializer
trait. Of course I'll spent time on that to get a working solution. But lets settle this for now and focus on the main trait.
sophia/src/serializer.rs
Outdated
|
||
fn new(write: W, config: Self::Config) -> Self; | ||
/// A sink serializing into `target` the triples fed to it. | ||
fn sink(&self, target: T) -> Self::Sink; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this architecture blurrs the lines what a serializer is. On the one hand there is TripleSerializer
that offers methods to serialize (serialize_triples_in(to)*()
) but it doesn't implement the serializing. On the other hand a sink must be implemented that actually does serialization but don't implements TripleSerializer
... 🤔 . Even worse, thinking about Turtle-serialization for streams and graphs it is required to implement two different algorithms, i.e. different 'sinks' but the trait only allows one associated sink.
I suggest a more plain trait the does not include construction but focuses more on serialization. Indeed, my proposal is pretty close to the original and is similar to a TripleSink
.
trait TripleSerializer {
type Outcome;
type Error: std::error::Error;
fn serialize_term<TD: TermData>(&mut self, t: &Term<TD>) -> Result<(), Self::Error>;
fn serialize_triple<T: Triple>(&mut self, t: &T) -> Result<(), Self::Error>;
fn serialize_triples<TS: TripleSource>(&mut self, ts: TS) -> Result<(), StreamError<TS::Error, Self::Error>>;
fn serialize_graph<G>(&mut self, g: &G) -> Result<(), StreamError<G::Error, Self::Error>>;
fn finish(self) -> Result<Self::Outcome, Self::Error>;
}
A TripleStringifier
is would be similar just return String
instead of ()
and provide a blanked implementation for TripleSerializer<Outcome=Vec<u8>>
. However, this proposal probably needs some refinement.
* in this case the preposition 'in' indicates me that the serialization is done somewhere inside without modifying the outer thing. I'd rather go with 'into'. For me this is a more common description in Rust for: Take something convert it and modify the target. (BTW the same for in_sink()
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are the two design principles that I tried to follow when refactoring the parser and serializer APIS:
- be intuitive: a Parser should have a
parse
method (or several); a Serializer should have aserialize
method (or several); - integrate well with the general streaming model of
sophia
, i.e. provide implementations ofTripleSource
/TripleSink
(or their quad counterpart) whenever possible.
You object that the actual serializing is done in the sink and not in the serializer per se, but the same criticism could be done on parsers as well (the actual parsing is done by the source returned by the parse
method).
Note also that I have been aware of this "inconsistency" from the start. This very concern lead me to the previous design, naming the "static" object with parse/serialize methods "Config" rather than "Parser"/"Serializer". This was deemed unintuitive by several people (including yourself, IIRC 😉), hence the new design (and item 1 above).
Even worse, thinking about Turtle-serialization for streams and graphs it is required to implement two different algorithms,
Serializing a stream and serializing (smartly) a graph do indeed require two different algorithms. So it might be desirable to have two different traits... On the down side, this would make the API more complex. On the up side, you might say that implementors could choose to implement only the graph serialization, and not the stream serialization (the opposite is not a problem, given the default implementation of serialize_graph_in
). But my feeling is that once you have a the algorithm for graphs, the algorithm for streams is really not that hard to get...
i.e. different 'sinks'
No, the sink is only for parsing streams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re. the TripleSerializer
trait you proposed:
-
serialize_term
does not make sense for some serializers (i.e. RDF/XML), where terms can not always be "isolated"; -
serialize_triple
is not strictly required by users (they can easily reuseserizlier_triples
orserialize_graph
to get the same result); and for internal use, such a method does not always make sense (in Turtle or RDF/XML, several triples share the same subject, so not all triples are serialized identically); -
serializer_triples
means that your design also suffers from the "two distinct algorithms" problem discussed above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re. "in" vs. "into":
I have given this some thought when designing the stream API – first I wanted to name the methods into_sink
and into_graph
of TripleSource
, but this was misleading: it gave the impression that the methods were for converting the TripleSource
into something else, which is wrong.
So I went for in
instead of into
, and I kept this habit for serializers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note also that I have been aware of this "inconsistency" from the start. This very concern lead me to the previous design, naming the "static" object with parse/serialize methods "Config" rather than "Parser"/"Serializer". This was deemed unintuitive by several people (including yourself, IIRC 😉), hence the new design (and item 1 above).
I have to admit that this design is actually pretty good. And the same I come up with on a second thought. However, I still believe that the initialization should not be part of a basic trait.
Indeed, the methods serialize_term()
and serialize_triple()
add no benefits and can be removed.
Regarding the 'in' vs 'into' discussion... why not name it serialize_triples()
and serialize_graph()
? This does exactly describe what is done and the API clearly tells the user if the argument is consumed (TripleSource
) or just borrowed (Graph
). For me this is one of the biggest advantages of Rust's typesystem 👍 .
To sum up, I now propose to stick with my previous proposed trait but remove the two unnecessary methods. Maybe add : TripleSink
to get Self::Outcome
and Self::Error
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I still believe that the initialization should not be part of a basic trait.
I agree with this general principle. But it seems that we are placing the "initialization" at different levels.
I'm assuming that by "initialization", you are referring to setting the target (a file in general) of the serializer. In my view, this is already a second step; the first step is setting the different options of the serialier, like "should it use ASCII or UTF-8" (for the N-Triples serializer) or "what prefixes should it use" (for a future Turtle serializer)...
I think those two steps should be separate. I think it is important that one can configure, say, a turtle serializer with a number of option (typically, which prefixes it should use), and then use that serializer several times, to serialize different graphs in different files...
Also, I'm trying to make the Parser and Serializer APIs as symmetrical as possible. But I give you that the symmetry is not perfect in the current state. I'll come with another proposal, in the main thread of this PR.
Although, strictly speaking, `mut` in front of a parameter name is not significant for the function signature, it is misleanding, as pointed out in #37 (comment) This change, although semantically equivalent and slightly more verbose, aims at avoiding that confusion.
In a subcomment above, I promised another proposal, and I meant it, but actually I came back to my original design :-/. Let my try to explain how. Currently, the parser library works like that: // create a Parser -- this is not specified by the Parser trait
let p = TurtleParser::new().with_default_prefixes(&prefix_map);
// use it to parse file f1 in graph g1
p.parse(f1).in_graph(&mut g1);
// alternatively, use the triple stream API to parse file f2 in graph g2
let parsed_triples = p.parse(f2); // returns a TripleSource
g2.insert_all(parsed_triples); I find this API intuitive, and easy to read and write. If we try to imagine a perfectly symmetrical API for serializers, it would look like // create a Serializer -- this is not specified by the Serializer trait
let s = TurtleSerializer::new().with_prefixes(&prefix_map);
// use it to serializer graph g1 in file f1
s.serialize_graph(&g1).in_(f1);
// alternatively, use the triple stream API to serialize graph g2 in file f2
let turtle_file = s.sink(f2); // returns a TripleSink
g2.triples().in_sink(turtle_file); This looks (IMO) equally nice to the Parser example, // create a Serializer -- this is not specified by the Serializer trait
let s = TurtleSerializer::new().with_prefixes(&prefix_map);
// use it to serializer graph g1 in file f1
s.serialize_graph_in(&g1, f1);
// alternatively, use the triple stream API to serialize graph g2 in file f2
let turtle_file = s.sink(f2); // returns a TripleSink
g2.triples().in_sink(turtle_file); Does it make more sense? |
I don't say that your proposal doesn't make sense. The serializer trait you suggest is perfectly aligned to the source/sink model you follow. I think of a different model for serializers. For me a serializer's purpose is to serialize data to one target. I agree that this means there a two steps for initialization. First create some configuration that you may be able to share across threads and so on. The difference in my thinking is that the concrete serializer is initialized with that config and a target, so the one serializer is owning it one target. Throughout the life of the serializer I can add one or more graphs, triples and so on. Eventually, the serializer is consumed and the target is returned (therefore
Therefore, I suggested to return To sum up, I prefer sharing a config and built serializers on demand, e.g. per thread. As far as I understand your solution shares one serializer that is responsible for serializing to several targets. I think that can become a problem when the serializer is used in different threads. |
Another point for using one serializer per target is streaming Turtel serialization. In order to be able to produce property and object lists the serializer needs to store the last subject and predicate it serialized. This can be used to continue serialization over two streams. When one serializer is used for several targets this would not work. Worse, it could produce incorrect serializations, e.g. continue the serialization in another target. |
@MattesWhite Both approaches have their pros and cons. They both have their risks/limitations (for example, it does not always make sense to serialize two graphs into the same file -- with RDF/XML, that would produce an invalid document), but they also both have their valid uses cases. So my conclusion is that we should probably support both approaches. I will amend the PR in that sense on the next few days. Thanks a lot for your critical eye, and for challenging me into improving the design of Sophia :-) |
Indeed, I thought about this problem two. And I'm curious about your solution. I suggest to solve this problem by consulting the RDF spec:
Consequently, the serialization of an To sum up: trait TripleSerializer {
type Outcome;
type Error: std::error::Error;
fn serialize<TS: TripleSource>(&mut self, ts: TS)
-> Result<(), StreamError<TS::Error, Self::Error>>;
fn finish(self) -> Result<Self::Outcome, Self::Error>;
}
trait GraphSerializer {
type Outcome;
type Error: std::error::Error;
fn serialize<G: Graph>(self, g: &G)
-> Result<Self::Outcome, StreamError<G::Error, Self::Error>>;
}
trait QuadSerializer {
type Outcome;
type Error: std::error::Error;
fn serialize_stream<QS: QuadSource>(&mut self, qs: QS)
-> Result<(), StreamError<QS::Error, Self::Error>>;
fn serialize_graph<G: Graph>(&mut self, g: &G, name: Option<&Term>)
-> Result<(), StreamError<G::Error, Self::Error>>;
fn finish(self) -> Result<Self::Outcome, Self::Error>;
}
trait DatasetSerializer {
type Outcome;
type Error: std::error::Error;
fn serialize<D: Dataset>(self, d: &D)
-> Result<Self::Outcome, StreamError<D::Error, Self::Error>>;
}
Your welcome. I realy like our discussions as well and the results we found are pretty well I would say. Also I'm constantly a little bit afraid to go a step to far and start annoying you. So if this is ever the case please feel free to stop me ;-) |
We now have two kinds of serializers, so to speak. * (Triple|Quad)Serializer is mostly a serializing "configuration" + it is not attached to any target + it is the symetric to the (Triple|Quad)Parser trait + in some situations, it can also implement (Triple|Quad)Stringifier * (Triple|Quad)SerializingSink is a serializer attached to a given target + it is produced by the former's `to` method of the former + it is responsible of the actual serialization + it acts as a (Triple|Quad)Sink, but implementations can also handle a whole Graph|Dataset for better serializations...
Here is a new version of the serializer API. It is still very similar to my original proposal, but the serializing methods are now clearly carried by the sink ( For the sake of symmetry with the parser API, the use pattern that I described above is still possible, with the Another use pattern, closer to the one you were advocating, is now possible: // serializing graph g1 into file f1
let ttl_file = TurtleParser::new().with_prefixes(&prefix_map).to(f1);
// this is new -> ^^^^^^
ttl_file.serialize_graph(&g1); Note that I didn't follow your latest suggestion to have separate traits for serializing sources and serializing graphs. I would rather leave it to implementors to decide what they allow and what they disallow. It makes sense in Turtle to serialize two graphs in the same file; it does not in RDF/XML, so that implementation may chose to close the underlying file at the end of |
Okay you are right. It is okay to handle this by throwing an error if a second graph shouldn't be serialized.
I can understand that you stick to the 'serializer + sink' pattern to be symmetric with the parser. But in my view parsing and serialization are two inherent different things. Therefore, symmetry is not a valid design factor for the serializer API for me. On the first glance both parsing and serializing are symmetric. "Go from document to data" and "go from data to document". However, this is already the point where similarity ends. My statement can be proven by simply parsing a document and immediatly serialize it again. I'm pretty sure that for any set of parser and serializer the output will not be the exact same as the input despite being correct (as long as the input does not come from the serializer itself). On an implementation level, the parser is right in supporting a lazy evaluation, i.e. an To sum up, I advocate to break the symmetrie of parser and serializer. For me the BTW the insymmetry of parsing and serializing is an example why I'm not a fan of the |
About parsing a file and reserializing it: the fact that its does not always round-trip does not mean that the general process is not symetric... We are considering different levels. About sinks, you have a point, though... |
Superceded by 9a97dae |
with TripleSerializer and QuadSerializer traits,
and TripleStringifier and QuadStringifier specializations.
This is meant to replace the "duck-typing" style Config objects.
In the process, removed some duplicate code btw nt and nq serializers.