
Notes, code and request for comments regarding Issue 172 #173

Closed · wants to merge 2 commits

Conversation

@fpservant commented May 8, 2016

Hi,

this is a response to the last comment by Peter Ansell on issue 172.

Summary:

  • there really is a performance problem in the situation I described
  • it occurs at the place in the code that I pointed out
  • the change I suggested cannot be applied blindly
  • but I think that in certain circumstances the caller may know that this optimisation is safe: namely, when they know that there are no duplicated triples in the RDFDataset
  • this version includes the possibility to call it optionally.

Peter, you write:

I can't seem to replicate your results locally, as removing the deepContains call doesn't seem to have any effect.

Please have a look at com.github.jsonldjava.core.Issue172Test.slowVsFast(). It shows that it can have a huge effect.

I can't remove the entire if statement and just always add values as that breaks at least 19 of the conformance tests

I took another look at the code. I now think that we cannot make the change I suggested without some precautions (more on that below).

But I am surprised that you say it breaks 19 tests, as I didn't see any fail when running the tests included in the release. Note that if you run the tests against this version, none fails, but that is not an indication of anything: as it stands, only the unmodified code is called, since my modification runs only when explicitly requested. If you want to run the tests with the modification active (that is, if you want to replicate what I suppose you did when you got the 19 failures), change line 1801 in my modified JsonLdApi from:

    public List<Object> fromRDF(final RDFDataset dataset) throws JsonLdError {
        // by default, don't change anything to current way of doing things
        return fromRDF(dataset, false);
    }

to:

    public List<Object> fromRDF(final RDFDataset dataset) throws JsonLdError {
        // initial suggestion of fps
        return fromRDF(dataset, true);
    }

and then run the tests. I don't see any failing. Am I missing something?

Anyway, the change that I suggested cannot be applied blindly. In a previous message, I said:

But is it OK to use this "laxMergeValue" instead of mergeValue at line 1857 of JsonLdApi? Well, I'll leave that to the people who know the code, but I think it could be the case, as it seems to be about adding the triple

node predicate value

to the list of values of the property predicate of the node subject.

This was wrong, because I was assuming that there are no duplicated triples in an RDFDataset. I was influenced by the way things work in Jena, but this is not the case with JSONLD-java RDFDatasets, as I verified in

com.github.jsonldjava.core.Issue172Test.duplicatedTriplesInAnRDFDataset().

This test shows that an RDFDataset may contain duplicated triples, and that the modification I suggested is not OK in such a case.

But I think that what I said holds if we know that the RDFDataset doesn't contain duplicated triples: in that case, we may call a "lax" version of mergeValue in JsonLdApi.fromRDF, around line 1874:

                // 3.5.6+7)
                if (noDupsInDataset) {
                    JsonLdUtils.mergeValue_lax(node, predicate, value);
                } else {
                    JsonLdUtils.mergeValue(node, predicate, value);
                }
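To make the trade-off concrete, here is a simplified, self-contained sketch of the two merge strategies. This is not the library's actual code: plain strings stand in for JSON-LD value objects, `contains` stands in for `JsonLdUtils.deepContains`, and the names are illustrative only.

```java
import java.util.*;

public class MergeValueSketch {

    // Current behaviour: scan the existing values before every add.
    // The scan makes merging n values under one predicate O(n^2) overall.
    static void mergeValue(Map<String, List<String>> node, String predicate, String value) {
        List<String> values = node.computeIfAbsent(predicate, k -> new ArrayList<>());
        if (!values.contains(value)) { // stand-in for the deepContains check
            values.add(value);
        }
    }

    // Lax variant: just append, O(1) per triple.
    // Safe only when the caller guarantees there are no duplicated triples.
    static void mergeValueLax(Map<String, List<String>> node, String predicate, String value) {
        node.computeIfAbsent(predicate, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> node = new HashMap<>();
        mergeValue(node, "p", "v1");
        mergeValue(node, "p", "v1");    // duplicate, filtered by the scan
        mergeValueLax(node, "p", "v2");
        System.out.println(node.get("p")); // [v1, v2]
    }
}
```

If the same duplicated triple were fed to `mergeValueLax`, the value list would contain it twice, which is exactly why the lax variant is only safe on deduplicated datasets.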

What do you, the people who know and understand the JSONLD-java code, think of this?

What I want to stress is that, in several important cases, a user of the API may know that the RDFDataset doesn't contain any duplicated triple. This is the case, for instance, when Jena passes a Jena model to JSONLD-java through JsonLdProcessor.fromRDF(Object input, RDFParser parser): it is guaranteed that there won't be any duplicated triples in the RDFDataset. Such a caller could benefit from a lax version of mergeValue in fromRDF. Hence the idea of:

public List<Object> fromRDF(final RDFDataset dataset, boolean noDupsInDataset)

Pass false (the default) and the behavior is unchanged; pass true and the "lax" mergeValue method is called.

Note that there are probably other places in fromRDF where the lax mergeValue method could be called. That remains to be investigated.

Best,

fps

@ansell (Member) commented May 8, 2016

Thanks for the tests to run to verify the results you are seeing locally. I will try to get around to reviewing the code soon, but some clarifications up front...

The part about the 19 tests failing applied only when the entire if statement in the mergeValues call was removed. I didn't see any tests fail when I changed mergeValues to laxMergeValues, which likely means either that the test suite is not comprehensive enough, or that there is an assumption that the RDFDataset will be deduplicated before being fed into the JSONLD API.

I will look into whether RDFDataset can easily be modified to be deduplicated, as that would simplify the discussion in this case.
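One inexpensive way to deduplicate before handing a dataset to the API would be an insertion-ordered set over the quads. The sketch below is hypothetical: `Quad` here is a stand-in record, not the library's `RDFDataset.Quad`, and it assumes quads compare by value.

```java
import java.util.*;

public class DedupSketch {

    // Hypothetical value type standing in for a quad; records get
    // field-by-field equals()/hashCode() for free.
    record Quad(String subject, String predicate, String object, String graph) {}

    // LinkedHashSet drops repeats while preserving insertion order.
    static List<Quad> deduplicate(List<Quad> quads) {
        return new ArrayList<>(new LinkedHashSet<>(quads));
    }

    public static void main(String[] args) {
        List<Quad> quads = List.of(
                new Quad("s", "p", "o1", "@default"),
                new Quad("s", "p", "o1", "@default"), // duplicated triple
                new Quad("s", "p", "o2", "@default"));
        System.out.println(deduplicate(quads).size()); // 2
    }
}
```

With a guarantee like this in place, every caller could take the lax path without needing a flag.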

I don't mind adding an extra public API method if it has large performance benefits, but in general I am trying to prune back the number of public API methods. That is the main reason the library is still in 0.x releases: I think there are too many public API methods and, worse, too many public fields accessed directly without accessors.

@fpservant (Author) commented May 9, 2016

Thank you for these clarifications. I understand the concern about the number of public methods - and the proposed one is kind of weird with its "I know that there are no duplicates" flag. So, if it turns out that deduplicating the dataset is easy to code and fast to execute, I agree that it would simplify everything.
Thanks again

@ansell (Member) commented May 18, 2016

I think the reason I couldn't replicate your results was that I was working with a fully randomised dataset. In a fully randomised dataset the difference is fairly small.

The results below come from small variations on the code you posted above, with all triples having a single subject and unique objects, and a different number of predicates each time. They show the running time of the current code decreasing as predicates are added, getting closer to a realistic dataset each time. I will work on some further variations besides the fully randomised one, where the two methods basically converge.

These are all for 2000 triples, 1000 benchmark rounds and 200 warming rounds.

Running test for lax versus slow for 1 predicate
Average time to parse a dataset containing one subject with 2000 different values of one subject and one predicate:
- Assuming no duplicates: 1
- Assuming duplicates: 163

Running test for lax versus slow for 2 predicates
Average time to parse a dataset containing one subject with 2000 different values of one subject and two predicates:
- Assuming no duplicates: 1
- Assuming duplicates: 85

Running test for lax versus slow for 5 predicates
Average time to parse a dataset containing one subject with 2000 different values of one subject and five predicates:
- Assuming no duplicates: 1
- Assuming duplicates: 41
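The shape of such a measurement can be sketched as follows. This is a hypothetical stand-in using `System.nanoTime`, not the actual benchmark from the test suite; it models the single-subject, single-predicate, unique-objects case, where the contains-scan is quadratic and the plain append is linear.

```java
import java.util.*;

public class MergeBenchSketch {

    // Merge n unique values under one predicate, either with a
    // contains-scan (strict) or a plain append (lax); returns elapsed nanoseconds.
    static long timeMerge(int triples, boolean lax) {
        List<String> values = new ArrayList<>();
        long start = System.nanoTime();
        for (int i = 0; i < triples; i++) {
            String v = "value-" + i; // unique objects, so the scan never finds a hit
            if (lax || !values.contains(v)) {
                values.add(v);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 2000;
        // A real benchmark would add warm-up rounds and average many iterations.
        System.out.println("lax:    " + timeMerge(n, true) + " ns");
        System.out.println("strict: " + timeMerge(n, false) + " ns");
    }
}
```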

@ansell (Member) commented May 18, 2016

The following is another fairly unrealistic case (100 subjects, 1 predicate, and unique objects for 2000 triples), but it shows how the gap between the two methods narrows.

Running test for lax versus slow for 100 subjects and 1 predicate
Average time to parse a dataset containing 2000 different triples:
- Assuming no duplicates: 1
- Assuming duplicates: 3

@ansell (Member) commented May 18, 2016

For 1000 subjects, 5 predicates, and unique objects it swings slightly the other way, but performance is still good:

Running test for lax versus slow for 1000 subjects and 5 predicates
Average time to parse a dataset containing 2000 different triples:
- Assuming no duplicates: 2
- Assuming duplicates: 1

Particularly given that result, I will merge your method in as an optional, deprecated method (deprecated here meaning the javadoc marks it as experimental, not that I plan to remove it as obsolete) that can be called by people who understand the consequences, to help with the extreme case you found.

In general, though, the library will still follow the JSONLD-API-compliant method that attempts to deduplicate values from the RDF stream during the merge, as it has fairly good performance in cases where the list of values attached to a given subject/predicate pair doesn't grow large.

I will look into deduplication of RDFDataset in another issue.

Closing this, as I merged your changes into my existing branch "perf" and will merge that into master.

@fpservant (Author)

Thanks Peter

fpservant added a commit to fpservant/jena that referenced this pull request Aug 28, 2016
(cf. jsonld-java/jsonld-java#173)
Changes following Andy's comments: RDFFormatVariant, indentation using spaces