
Performances? #172

Closed

fpservant opened this issue Apr 10, 2016 · 5 comments

@fpservant

Hi,
working with Jena, I ran some tests comparing serialization performance for RDF data. As it turns out, JSON-LD serialization seems to be rather slow: roughly 20 times slower than Turtle.

Here are my results, outputting a graph of 8000 triples (several runs, after warming everything up, writing to /dev/null):
JSON-LD/pretty : 237 ms
JSON-LD/flat : 225 ms
RDF/XML/pretty : 104 ms
RDF/XML/plain : 54 ms
Turtle/blocks : 11 ms
Turtle/flat : 13 ms
Turtle/pretty : 11 ms
N-Triples/utf-8 : 6 ms

The backend of my service uses RDF (Jena). Almost a quarter of a second to return a typical result is too slow.

Are there ways to improve it?

@ansell
Member

ansell commented Apr 23, 2016

If you could write a test directly using the JSONLD-Java APIs, I could help.

The numbers from my testing over 8000 synthetic/random triples are approximately the following, which seem to roughly match up with yours:

  • RDF triples to JSON-LD in-memory (RDFDataset -> Map<String, Object>) : 60ms
  • JSON-LD in-memory to String (non-pretty-print): 30ms
  • JSON-LD in-memory to String (pretty-print): 40ms
  • JSON-LD expansion (Map<String, Object> to Map<String, Object>): 60ms

Testing compaction is difficult without concrete test data from you, i.e., the JSON-LD context you are using and the exact triples.

There may be ways of improving performance slightly, but in the end the complexity of the JSON-LD algorithms may still limit it. By comparison, none of the other RDF serialisations involve algorithms/transformations, other than having to hold everything in memory to pretty-print Turtle/RDF-XML. If you are concerned about performance and need to serialise arbitrary RDF triples, you should prefer those formats over JSON-LD.

If you have a known schema, you could generate JSON (that is valid JSON-LD) directly by hand, much faster, using the Jackson (or another JSON library) APIs. The complexity is all in the JSON-LD APIs themselves, so avoiding them will help. One major difficulty with JSON is the requirement that everything be in memory before serialising if you want concise documents, which is why JSON is really only used for API results that tend to be very small, while other use cases rely on streaming-friendly formats.
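To illustrate the suggestion above, here is a minimal sketch of emitting valid JSON-LD directly for a known, fixed schema, bypassing the JSON-LD algorithms entirely. It uses only the JDK for self-containment (a real implementation would likely use Jackson's `JsonGenerator`); the class name, method, and the schema.org context are illustrative, not from this issue:

```java
// Sketch: for a fixed schema, build a valid JSON-LD document as a plain
// JSON string, skipping the JSON-LD API entirely. Hypothetical names.
public class JsonLdByHand {

    // Minimal JSON string escaping, sufficient for this illustration.
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Emits {"@context":{...},"@id":...,"name":...} for one node.
    static String buildPerson(String id, String name) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"@context\":{\"name\":\"http://schema.org/name\"},");
        sb.append("\"@id\":\"").append(esc(id)).append("\",");
        sb.append("\"name\":\"").append(esc(name)).append("\"}");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildPerson("http://example.org/p1", "Alice"));
    }
}
```

Because the structure is fixed, no expansion, compaction, or deduplication work is done at serialization time.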

@ansell ansell closed this as completed Apr 23, 2016
@fpservant
Author

fpservant commented May 2, 2016

Hi,
Thanks for the reply. I ran some more tests, measuring the times for the different kinds of output, and I think I have found some interesting results. It turns out that the results vary widely depending on the content of the file: the file I have been working with has something special about it that explains a large part of the performance problem I pointed out. I think a change can be made in the code that would solve it.

With a file similar to the one I used in my previous experiment (enclosed here as "slow.jsonld"),

jsonldperfs.zip

I get the following results (export from Jena, with a small hack to choose the output form):

model.size() 7559
JSON-LD/pretty EXPANDED:162 ms
JSON-LD/pretty COMPACTED:186 ms
JSON-LD/pretty FLATTENED:339 ms
JSON-LD/flat EXPANDED:162 ms
JSON-LD/flat COMPACTED:184 ms
JSON-LD/flat FLATTENED:337 ms
RDF/XML/pretty:89 ms
RDF/XML/plain:51 ms
Turtle/blocks:8 ms
Turtle/flat:12 ms
Turtle/pretty:9 ms
N-Triples/utf-8:2 ms

Again, a very big difference from the Turtle performance. But the point is that the relative cost of compaction is not the important factor here: if I understand correctly, the expanded output format corresponds to the basic one, and compaction begins with the same operations as those done for the expanded format.

But I also noticed that with other files, results are very different, and much better. I investigated the differences, and I found that the factor slowing things down in the previous file is that several nodes have a property with a lot of values. I removed these statements from the file, and here is what I got:

model.size() 1177
JSON-LD/pretty EXPANDED:4 ms
JSON-LD/pretty COMPACTED:12 ms
JSON-LD/pretty FLATTENED:15 ms
JSON-LD/flat EXPANDED:4 ms
JSON-LD/flat COMPACTED:11 ms
JSON-LD/flat FLATTENED:15 ms
RDF/XML/pretty:21 ms
RDF/XML/plain:11 ms
Turtle/blocks:1 ms
Turtle/flat:2 ms
Turtle/pretty:3 ms
N-Triples/utf-8:1 ms

Wow, that is fast! Of course, there are far fewer triples (1177 vs 7559), but the gain in time is clearly not proportional to the reduction in size: were that the case, the time for EXPANDED should be ~25 ms, not just 4 ms!

So I suspected that this could be related to some iteration over a list, and I found where this happens in the code: JsonLdAPI.fromRDF, at line 1857:

                // 3.5.6+7)
                JsonLdUtils.mergeValue(node, predicate, value);

mergeValue contains the following test, which ensures that a given value is not added twice:

        if ("@list".equals(key)
                || (value instanceof Map && ((Map<String, Object>) value).containsKey("@list"))
                || !deepContains(values, value)) {
            values.add(value);
        }

We're in the case where deepContains is called, and deepContains iterates over the items in values; hence the poor performance with my file. To check this, I modified line 1857 in JsonLdAPI to call a modified version of mergeValue, a "laxMergeValue" that doesn't verify whether value is already in values before adding it.
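Reconstructed from the description above, the change might look roughly like this; "laxMergeValue" is the hypothetical name used in this comment, not part of the released library, and the surrounding class is just scaffolding for the sketch:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the "laxMergeValue" described above: the same list-append that
// JsonLdUtils.mergeValue performs, but without the deepContains scan, so
// each add is O(1) instead of O(n) in the number of existing values.
public class LaxMerge {

    @SuppressWarnings("unchecked")
    static void laxMergeValue(Map<String, Object> obj, String key, Object value) {
        List<Object> values =
                (List<Object>) obj.computeIfAbsent(key, k -> new ArrayList<>());
        // No duplicate check: values appearing twice in the input
        // will appear twice in the output.
        values.add(value);
    }

    public static void main(String[] args) {
        Map<String, Object> node = new LinkedHashMap<>();
        laxMergeValue(node, "http://example.org/p", "v1");
        laxMergeValue(node, "http://example.org/p", "v1"); // duplicate kept
        System.out.println(node);
    }
}
```

The trade-off is exactly the one discussed below: duplicates in the input are no longer collapsed, which is why the conformance-test impact matters.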

Here is the time that I get with the first file (the big "slow" one):

JSON-LD/pretty EXPANDED:9 ms
JSON-LD/pretty COMPACTED:37 ms
JSON-LD/pretty FLATTENED:193 ms
JSON-LD/flat EXPANDED:8 ms
JSON-LD/flat COMPACTED:31 ms
JSON-LD/flat FLATTENED:187 ms
RDF/XML/pretty:88 ms
RDF/XML/plain:49 ms
Turtle/blocks:9 ms
Turtle/flat:12 ms
Turtle/pretty:10 ms
N-Triples/utf-8:3 ms

Same time as Turtle for the expanded format!

But is it OK to use this "laxMergeValue" instead of mergeValue at line 1857 of JsonLdAPI? Well, I'll leave that to the people who know the code, but I think it could be, as it seems to be about adding the triple

node predicate value

to the list of values of the property predicate of the subject node. Anyway, I am sure that it is possible to fix this iteration over the items of a list, which can have a very negative impact on the performance of the API.

Best Regards,

fps

@ansell
Member

ansell commented May 2, 2016

Thanks for looking into it further. I will see what I can do about that check (hopefully without breaking the spec!).

@ansell ansell reopened this May 2, 2016
@ansell
Member

ansell commented May 6, 2016

I can't seem to replicate your results locally, as removing the deepContains call doesn't seem to have any effect. I can't remove the entire if statement and always add values, as that breaks at least 19 of the conformance tests. Can you open a pull request with your proposed changes, and I will see if I can replicate them with your version of the code?
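One general technique for keeping the duplicate check without the linear scan (a sketch under my own assumptions, not necessarily the change that shipped in 0.8.3) is to maintain a HashSet alongside the value list. The java.util maps and lists implement content-based equals/hashCode, so the set performs a deep comparison in O(1) expected time per add:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: preserve "no duplicate values" semantics with an auxiliary
// HashSet, replacing the deepContains scan with an O(1) membership test.
// Illustrative only; the actual fix in jsonld-java may differ.
public class DedupValues {
    private final List<Object> values = new ArrayList<>();
    private final Set<Object> seen = new HashSet<>();

    void add(Object value) {
        if (seen.add(value)) { // returns false if a deep-equal value exists
            values.add(value);
        }
    }

    List<Object> values() {
        return values;
    }

    public static void main(String[] args) {
        DedupValues dv = new DedupValues();
        Map<String, Object> v = new HashMap<>();
        v.put("@value", "x");
        dv.add(v);
        dv.add(new HashMap<>(v)); // deep-equal duplicate, ignored
        System.out.println(dv.values().size()); // prints 1
    }
}
```

A caveat for this approach: it assumes the values stored are not mutated after insertion, since mutating a map already in the HashSet would break its hashing.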

@ansell
Member

ansell commented May 18, 2016

Released jsonld-java-0.8.3 with this fix in it; it should be on Maven Central in a few hours.
