Cleanup query string #93

EnnoMeijers · 2021-09-24T14:14:10Z

To make the SPARQL queries simpler and more robust, the query string should be cleaned up before it is passed to the query engine. At least the following steps could be taken:

remove starting and trailing spaces
remove/replace non-ASCII characters (é -> e, etc.)
convert numbers to strings by surrounding them with single quotes
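
As a rough illustration of the three steps above, here is a minimal TypeScript sketch. The function name cleanupQuery is hypothetical and not taken from the Network of Terms code base. It assumes that "non-ASCII" mainly means accented Latin letters that Unicode can decompose, and that a "number" is a word consisting only of digits; characters such as ø that have no decomposition are left untouched.

```typescript
// Hypothetical helper, for illustration only.
function cleanupQuery(raw: string): string {
  return raw
    .trim() // step 1: remove leading and trailing whitespace
    .normalize('NFD') // decompose "é" into "e" plus a combining accent
    .replace(/[\u0300-\u036f]/g, '') // step 2: drop the combining accents
    .split(/\s+/) // side effect: runs of inner whitespace collapse to one space
    .map(word => (/^\d+$/.test(word) ? `'${word}'` : word)) // step 3: quote digit-only words
    .join(' ');
}
```

For example, cleanupQuery('  café 1900 ') yields cafe '1900'.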

ddeboer · 2021-10-04T11:00:51Z

I discussed this issue with @sdevalk today.

Unfortunately, cleaning up (or: pre-processing) the search query is hard to do generically. Of your points, only

remove starting and trailing spaces

can be applied generically.

remove/replace non-ASCII characters (é -> e, etc.)

This will only work on specific configurations of Virtuoso.

convert numbers to strings by surrounding them with single quotes

This will only work for Virtuoso and fail to return results for non-Virtuoso sources such as CHT.

Moreover, the last two points interfere quite heavily with the user’s query and perhaps make the Network of Terms too smart: if sources support handling diacritics, we should let them do so.

However, I agree that our CONSTRUCT queries are becoming too complex and hard to maintain. To resolve this, we could adopt an alternative approach where we have the Network of Terms applications prepare queries specifically geared towards and optimised for term source technologies. We can do so in two ways:

1. Use the adapter pattern. This requires the application to know which term source uses which technology, so term source-specific knowledge spills over from the queries into the application code.
2. Prepare a number of query variants under different SPARQL parameter names, e.g. ?query for the unchanged user query, ?virtuoso_query for a Virtuoso-specific one, ?boolean_query that replaces A B with A AND B, ?diacritics_removed_query if that is still needed, etc. Each SPARQL query can then use the parameter that gives the best results for its source.

I prefer option 2. It adds some redundant string replacement operations, but their cost should be negligible (and the adapter pattern would add some overhead, too). The benefit is that the knowledge of how to query specific term sources remains in the catalog of queries (that may be exposed in some admin UI later on) and doesn’t spill over into the application code.
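
To make the second approach more concrete, here is a minimal TypeScript sketch. All names (QueryVariants, prepareVariants, bindVariants) are hypothetical. The textual substitution in bindVariants is only a stand-in for whatever parameter binding the SPARQL engine actually offers; it is an assumption, not how the Network of Terms works. The point is that the application computes every variant without knowing which source uses which technology, and each catalog query picks the parameter it wants.

```typescript
// All names below are hypothetical, for illustration only.
type QueryVariants = Record<string, string>;

// The application prepares every variant up front, regardless of term source.
function prepareVariants(userQuery: string): QueryVariants {
  const words = userQuery.trim().split(/\s+/).filter(w => w.length > 0);
  return {
    query: userQuery, // unchanged user input
    boolean_query: words.join(' AND '), // "A B" becomes "A AND B"
  };
}

// Naive stand-in for real parameter binding: replaces ?name with a quoted
// string literal when a variant of that name exists, and leaves other
// SPARQL variables (such as ?label) alone.
function bindVariants(sparql: string, variants: QueryVariants): string {
  return sparql.replace(/\?(\w+)/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(variants, name)
      ? JSON.stringify(variants[name])
      : match
  );
}
```

A catalog query that contains ?boolean_query then receives the AND-joined variant, while one that contains ?query receives the raw input; neither choice is visible in the application code.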

ddeboer · 2021-11-25T08:34:07Z

@sdevalk will make an inventory of the query variants that we need.

sdevalk · 2021-12-02T15:42:25Z

@ddeboer For further discussion: I've created an overview of the terminology sources that we currently query and the various preprocessing rules that we currently apply or that we could consider. The overview is in this Google Spreadsheet.

So far I think we need three query variants:

For queries that are used in literal searches (ABR, CHT, WO2, EuroVoc)
For queries that are used in Virtuoso full text searches (Adamlink, Brinkman, NTA, GTAA, RKDartists, Wereldculturen, Muziekweb, Muziekschatten)
For queries that don't need preprocessing (AAT, Wikidata)

ddeboer · 2021-12-03T07:35:26Z

Thanks @sdevalk, that’s really useful!

So we end up with these three query types:

?query that equals the raw query input
?literal_query for ABR etc. that converts to lowercase and trims whitespace
?boolean_query (or ?virtuoso_query or ?bif_query) for Adamlink etc. that quotes query parts and inserts boolean AND between them.
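
The two derived variants could be sketched roughly as follows in TypeScript. The function names are hypothetical, and quoting each part with single quotes is an assumption about what the Virtuoso full-text search expects; escaping of quote characters inside a query part is deliberately left out.

```typescript
// Hypothetical helpers, for illustration only.

// ?literal_query: lowercase the input and trim surrounding whitespace.
function literalQuery(raw: string): string {
  return raw.trim().toLowerCase();
}

// ?boolean_query: quote each whitespace-separated part and join with AND.
// Assumption: single quotes; quote characters inside a part are not escaped.
function booleanQuery(raw: string): string {
  return raw
    .trim()
    .split(/\s+/)
    .filter(part => part.length > 0)
    .map(part => `'${part}'`)
    .join(' AND ');
}
```

For example, literalQuery('  Van Gogh ') gives van gogh, and booleanQuery(' Rembrandt van  Rijn') gives 'Rembrandt' AND 'van' AND 'Rijn'.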

Does converting to lowercase influence the other queries? If not, we can probably reduce these to two query types by merging the first two, because removing whitespace definitely should not hurt the other queries.

sdevalk · 2021-12-03T09:52:44Z

@ddeboer Good point. Lowercasing the search words does not influence the other queries, so we can continue with two query types/variants.

github-actions · 2021-12-23T08:19:22Z

🎉 This issue has been resolved in version 5.5.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

EnnoMeijers added the enhancement label Sep 24, 2021

ddeboer mentioned this issue Oct 31, 2021

RKD query broken #107

Closed

ddeboer assigned sdevalk Nov 25, 2021

ddeboer self-assigned this Nov 25, 2021

This was referenced Dec 3, 2021

feat: Provide query variants to SPARQL query netwerk-digitaal-erfgoed/network-of-terms#398

Merged

feat: Simplify queries by using pre-processed values #112

Merged

ddeboer closed this as completed in #112 Dec 23, 2021

github-actions bot added the released label Dec 23, 2021

ddeboer mentioned this issue Oct 25, 2022

Content-encoding SPARQL query (België) netwerk-digitaal-erfgoed/network-of-terms#772

Open
