Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor: Optimize the generated SPARQL queries #124

Conversation

syphax-bouazzouni
Copy link

@syphax-bouazzouni syphax-bouazzouni commented Jul 19, 2022

Preamble

This PR includes this

And require this

Starting issues

The generated SPARQL query to fetch data is not optimized because in the select clause we print all the attributes which causes the result to have a lot of rows (the number of rows is equal to the multiplication of the number of results of each attribute)

For example, if we have for a resource (e.g a Class) 3 attributes (e.g prefLabel, synonyms, definitions) with the following values:
prefLabel = prefLabel
synonyms = [syn_1, syn_2, syn_3]
definitions = [def_1, def_2]

We will get the following request results :

| prefLabel | syn_1 | def_1 |
| prefLabel | syn_2 | def_1 |
| prefLabel | syn_3 | def_1 |
| prefLabel | syn_1 | def_2 |
| prefLabel | syn_2 | def_2 |
| prefLabel | syn_3 | def_2 |

With this new implementation, we will have the following results
prefLabel
syn_1
syn_2
syn_3
def_1
def_2

The number of rows is equal to the multiplication of the number of results of each attribute, so if we have 4 attributes with each one 10 values, the result will be 10000 lines (10X10X10X10), whereas with this new implementation it is only 40 lines

Solution

We will transform all the requests to

SELECT DISTINCT ?id ?attributeProperty ?attributeObject 
FROM <graph>
WHERE { 
	?id a <type_uri> . 
  OPTIONAL { ?id ?attributeProperty ?attributeObject . }
FILTER(?id = <id_uri>) 
FILTER(?attributeProperty = <attribute_1_uri> || ... || ?attributeProperty = <attribute_n_uri>)

And for inverse attributes, we will use the following request

SELECT DISTINCT ?id ?attributeProperty ?attributeObject
FROM <graph> 
WHERE { 
 ?id a <type_uri> . 
 OPTIONAL { ?id ?attributeProperty ?attributeObject . } 
 OPTIONAL { ?attributeObject ?attributeProperty ?id . } // added line
FILTER(?id = <id_uri>) 
FILTER(?attributeProperty = <attribute_1_uri> || ... || ?attributeProperty = <attribute_n_uri>)

Solution issue

Theoretically, it should work, but the goo test_inverse.rb doesn’t pass because of a bug of 4store, where the ?attributeproperty was always returning resource not found

To reproduce the bug

Alternative and implemented solution for now

We will use UNION with binds for each attribute of a resource, as so

Include/general query

SELECT DISTINCT ?id ?attributeProperty ?attributeObject
FROM <graph_i> # for each graph from 1..n
WHERE { 
	?id a <type_uri> . 
	OPTIONAL {

		  { ?id <attribute_1_uri> ?attributeObject . BIND("var_1" as ?attributeProperty) } 
			UNION
				....
			UNION
		  { ?attributeObject <attribute_j_uri> ?id . BIND( "var_j" as ?attributeProperty)} # if attribute_j_uri is an inverse attribute 
		  UNION 
			  ....
			{ ?id <attribute_n_uri> ?attributeObject . BIND("var_n" as ?attributeProperty) } 
	}

FILTER(?id = <id_uri>) 
}

Aggregate query

unchanged

BNODE query

unchanged

SELECT DISTINCT ?id ?line1 ?line2 
FROM <http://goo.org/default/PersonPersistent>
 WHERE { 
   ?id a <http://goo.org/default/PersonPersistent> . 
   ?id <http://goo.org/default/contact_data> _:g64460 . 
   _:g64460 <http://goo.org/default/line1> ?line1. 
   _:g64460 <http://goo.org/default/line2> ?line2 . 
   FILTER(?id = <http://goo.org/default/person_persistent/Lewis> || ?id = <http://goo.org/default/person_persistent/John>) }

FILTER query (test/test_where#test_filter)

SELECT DISTINCT ?id ?attributeProperty ?attributeObject 
FROM <graph> 
WHERE { 
	  ?id a <type_uri> . 
    ?id <attribute_uri> ?internal_join_var_0 . #can alos be optional
    FILTER(?internal_join_var_0 > "var")
}

ORDER BY query (TODO)

SELECT DISTINCT ?id ?attributeProperty ?attributeObject 
FROM <graph> 
WHERE { 
	  ?id a <type_uri> . 
    ?id <attribute_uri> ?internal_join_var_0 . #can also be optional
}
ORDER BY ?internal_join_var_0

unmapped query (test/test_where#test_filter)

before

SELECT DISTINCT ?id ?object ?bind_as 
FROM <http://data.bioontology.org/ontologies/CSV_TEST_BRO/submissions/1> 
WHERE { 
			?id a <http://www.w3.org/2002/07/owl#Class> . 
      { ?id <http://bioportal.bioontology.org/ontologies/umls/hasSTY> ?object . BIND ("var_hasSTY0" as ?bind_as) } 
      UNION  
      { ?id <http://www.w3.org/2004/02/skos/core#historyNote> ?object . BIND ("var_historyNote0" as ?bind_as) } 
      UNION  
      { ?id <http://data.bioontology.org/metadata/prefixIRI> ?object . BIND ("var_prefixIRI0" as ?bind_as) } 
      ....
      UNION  
      { ?id <http://www.w3.org/2000/01/rdf-schema#label> ?object . BIND ("var_label0" as ?bind_as) } 
      UNION  
      { ?id <http://www.w3.org/2000/01/rdf-schema#subPropertyOf> ?object . BIND ("var_subPropertyOf0" as ?bind_as) } 
      FILTER(?id = <http://bioontology.org/ontologies/BiomedicalResourceOntology.owl#Ontology_Development_and_Management> || ... list of all ids...  || ?id = <http://bioontology.org/ontologies/BiomedicalResourceOntology.owl#Database>) }

new it is the same as the general/includes a query

SELECT DISTINCT ?id ?attributeProperty ?attributeObject 
FROM <http://data.bioontology.org/ontologies/CSV_TEST_BRO/submissions/1> 
WHERE { 
			?id a <http://www.w3.org/2002/07/owl#Class> . 
      { ?id <http://bioportal.bioontology.org/ontologies/umls/hasSTY> ?object . BIND ("var_hasSTY0" as ?bind_as) } 
      UNION  
      { ?id <http://www.w3.org/2004/02/skos/core#historyNote> ?object . BIND ("var_historyNote0" as ?bind_as) } 
      UNION  
      { ?id <http://data.bioontology.org/metadata/prefixIRI> ?object . BIND ("var_prefixIRI0" as ?bind_as) } 
      ....
      UNION  
      { ?id <http://www.w3.org/2000/01/rdf-schema#label> ?object . BIND ("var_label0" as ?bind_as) } 
      UNION  
      { ?id <http://www.w3.org/2000/01/rdf-schema#subPropertyOf> ?object . BIND ("var_subPropertyOf0" as ?bind_as) } 
      FILTER(?id = <http://bioontology.org/ontologies/BiomedicalResourceOntology.owl#Ontology_Development_and_Management> || ... list of all ids...  || ?id = <http://bioontology.org/ontologies/BiomedicalResourceOntology.owl#Database>) }

Equivalent predicates case

see *def self*.expand_equivalent_predicates(query, eq_p)

Before the filter was in the OPTIONAL clause

SELECT DISTINCT ?id ?attribute1
FROM <graph_i>
WHERE { 
    ?id a <type_uri> .
    OPTIONAL { ?id ?p ?attribute1 . FILTER(?p = <equivalent_uri_1> || .. || ?p = <equivalent_uri_n>) }
     ...
}

Now we add a UNION clause for each equivalent predicate

SELECT DISTINCT ?id ?attributeProperty ?attributeObject
FROM <graph_i>
WHERE { 
    ?id a <type_uri> .
    OPTIONAL { 
         {?id <equivalent_uri_1> ?attributeObject . BIND( "var_1" as ?attributeProperty)) }
				 UNION
         ...
         UNION
           {?id <equivalent_uri_2> ?attributeObject . BIND( "var_1" as ?attributeProperty)) }
     ...
		
}

Use case

With the use case of AGRVOC (2GB of triples) :

Old implementation New implementation
/ontologies/AGROVOC/submissions/1?display=all (132 properties) 24 minutes (1 032 240 lines returned) 4.3145 seconds (140 lignes returned)

@syphax-bouazzouni syphax-bouazzouni changed the title Optimize refactor the sparql query build unions Optimize & refactor the sparql query build unions Jul 19, 2022
@syphax-bouazzouni
Copy link
Author

Hi @mdorf, I think you are the one working on this part of the project. Could you please give me your opinion about this?

@syphax-bouazzouni syphax-bouazzouni changed the title Optimize & refactor the sparql query build unions Optimize the generated SPARQL queries Jul 19, 2022
@syphax-bouazzouni syphax-bouazzouni marked this pull request as ready for review July 20, 2022 14:23
@mdorf
Copy link
Member

mdorf commented Jul 21, 2022

Hi @mdorf, I think you are the one working on this part of the project. Could you please give me your opinion about this?

@syphax-bouazzouni, the solution appears to be sound based on the description. But, honestly, it's a bit difficult for me to provide a constructive feedback based on the number of commits. I need to understand how the original issue affected our system and how the solution addresses it. I wonder if this is a good topic to review during our face-to-face workshop in Montpellier in September.

@mdorf
Copy link
Member

mdorf commented Jul 21, 2022

Also, we need to make sure that ncbo/ontologies_linked_data and ncbo/ontologies_api tests pass with AllegroGraph with this change in place.

@jonquet
Copy link

jonquet commented Jul 21, 2022

Also, we need to make sure that ncbo/ontologies_linked_data and ncbo/ontologies_api tests pass with AllegroGraph with this change in place.

I do agree in principle to this. However, I don't think this should be a stopper (or slower) for not taking the change for 4store.
This issue (of the cartesian product) has been raised 5 years ago with @vemonet and we have here the opportunity to fix this.

Basically we expect this to fix the call :
https://data.bioontology.org/submissions
and possibly even
https://data.bioontology.org/submissions?display=all

@syphax-bouazzouni
Copy link
Author

syphax-bouazzouni commented Jul 21, 2022

@syphax-bouazzouni, the solution appears to be sound based on the description. But, honestly, it's a bit difficult for me to provide a constructive feedback based on the number of commits. I need to understand how the original issue affected our system and how the solution addresses it. I wonder if this is a good topic to review during our face-to-face workshop in Montpellier in September.

Yeah, I understand. For this kind of matter, I am totally available to do specific meetings to speak about it.
But yes we could also speak about it during the Ontoportal meeting, even though during that meeting I would prefer personally to speak more in a larger overview of our state and the improvements that we can do.

Also, we need to make sure that ncbo/ontologies_linked_data and ncbo/ontologies_api tests pass with AllegroGraph with this change in place.

For AllegroGraph, we have not yet set up an environment to test it, but naively I will say that it should also work for it because its basic SPARQL queries that didn't change a lot from before.

I do agree in principle to this. However, I don't think this should be a stopper (or slower) for not taking the change for 4store. This issue (of the cartesian product) has been raised 5 years ago with @vemonet and we have here the opportunity to fix this.

@jonquet for now the code for 4store and AllegroGraph are the same, so we can't isolate this improvement for only 4store

@syphax-bouazzouni syphax-bouazzouni changed the title Optimize the generated SPARQL queries Refactor: Optimize the generated SPARQL queries Jan 19, 2023
@alexskr alexskr changed the base branch from master to develop March 21, 2024 17:06
@mdorf mdorf self-assigned this Apr 16, 2024
@mdorf mdorf merged commit 1a6c63e into ncbo:develop Apr 16, 2024
@jonquet
Copy link

jonquet commented Apr 17, 2024

Cc @vemonet with who we originally found the bug! And congrats to @syphax-bouazzouni for such an investigation and fix.

@vemonet
Copy link

vemonet commented Apr 17, 2024

Wow nice to see it finally got resolved for everyone! Good job @syphax-bouazzouni, that was a consequent task

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants