This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

Cleanup query string #93

Closed
EnnoMeijers opened this issue Sep 24, 2021 · 6 comments · Fixed by #112
Labels: enhancement (New feature or request), released

Comments

@EnnoMeijers
Contributor

EnnoMeijers commented Sep 24, 2021

To make the SPARQL queries simpler and more robust, the query string should be cleaned up before it is passed to the query engine. At least the following steps could be applied:

  • remove starting and trailing spaces
  • remove/replace non-ascii characters (é -> e, etc)
  • convert numbers to strings by surrounding them with single quotes
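
As an illustration only, the three steps above could be sketched as follows. The function names are hypothetical, and treating a "number" as a whitespace-separated token made up entirely of digits is an assumption, not something the issue specifies.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Replace accented letters with their base letter (é -> e)."""
    # NFD splits "é" into "e" plus a combining accent; the accent is then dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def quote_numbers(text: str) -> str:
    """Surround all-digit tokens with single quotes (assumed meaning of 'numbers')."""
    # Note: splitting and re-joining also collapses repeated inner whitespace.
    return " ".join(f"'{tok}'" if tok.isdigit() else tok for tok in text.split())


def clean_query(raw: str) -> str:
    """Apply the three proposed steps in order: trim, strip diacritics, quote numbers."""
    return quote_numbers(strip_diacritics(raw.strip()))


print(clean_query("  café 1880 "))
```

Characters that have no decomposition (for example ø) pass through unchanged in this sketch.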
@EnnoMeijers EnnoMeijers added the enhancement New feature or request label Sep 24, 2021
@ddeboer
Member

ddeboer commented Oct 4, 2021

I discussed this issue with @sdevalk today.

Unfortunately, cleaning up (or: pre-processing) the search query is hard to do generically. Of your points, only

  • remove starting and trailing spaces

can be applied generically.

  • remove/replace non-ascii characters (é -> e, etc)

This will only work on specific configurations of Virtuoso.

  • convert numbers to strings by surrounding them with single quotes

This will only work for Virtuoso and fail to return results for non-Virtuoso sources such as CHT.

Moreover, the last two points are quite invasive in the user’s queries and perhaps make the Network of Terms too smart: if sources support handling diacritics, we should let them do so.

However, I agree that our CONSTRUCT queries are becoming too complex and hard to maintain. To resolve this, we could adopt an alternative approach where we have the Network of Terms applications prepare queries specifically geared towards and optimised for term source technologies. We can do so in two ways:

  1. Use the adapter pattern, but this requires the application to know which term source uses which technology. As a result of this, term source-specific knowledge spills over from the queries into the application code.

  2. Prepare a number of queries under different SPARQL parameter names, e.g. ?query for the unchanged user query, ?virtuoso_query for a Virtuoso-specific one, ?boolean_query that replaces A B with A AND B, ?diacritics_removed_query if that is still needed, etc. In the SPARQL queries, we can then choose for each query the parameter that gives the best results.
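
A minimal sketch of option 2, assuming the application substitutes parameters into the query text itself. The helper names `build_variants` and `bind` and the example query fragments are hypothetical; a real application would more likely pass bindings to its SPARQL engine than rewrite strings.

```python
import re
from typing import Dict


def build_variants(raw: str) -> Dict[str, str]:
    """Prepare every variant up front; each SPARQL query picks the one it needs."""
    return {
        "query": raw,                               # unchanged user query
        "boolean_query": " AND ".join(raw.split()),  # "A B" -> "A AND B"
    }


def bind(template: str, variants: Dict[str, str]) -> str:
    """Replace ?name with a quoted string literal when name is a known variant."""

    def to_literal(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variants:
            return match.group(0)  # ordinary SPARQL variable such as ?label
        escaped = variants[name].replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    return re.sub(r"\?(\w+)", to_literal, template)


variants = build_variants("A B")
print(bind("FILTER(CONTAINS(?label, ?query))", variants))
print(bind("?label bif:contains ?boolean_query", variants))
```

The application code stays identical for every term source; only the query text decides which parameter is used, which is the property argued for above.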

I prefer 2. It adds some redundant string replacement operations, but their cost should be negligible (and the adapter pattern would add some overhead, too). The benefit is that the knowledge of how to query specific term sources remains in the catalog of queries (which may be exposed in some admin UI later on) and doesn’t spill over into the application code.

@ddeboer
Member

ddeboer commented Nov 25, 2021

@sdevalk will make an inventory of the query variants that we need.

@ddeboer ddeboer self-assigned this Nov 25, 2021
@sdevalk
Contributor

sdevalk commented Dec 2, 2021

@ddeboer For further discussion: I've created an overview of the terminology sources that we currently query and the various preprocessing rules that we currently apply or that we could consider. The overview is in this Google Spreadsheet.

So far I think we need three query variants:

  1. For queries that are used in literal searches (ABR, CHT, WO2, EuroVoc)
  2. For queries that are used in Virtuoso full text searches (Adamlink, Brinkman, NTA, GTAA, RKDartists, Wereldculturen, Muziekweb, Muziekschatten)
  3. For queries that don't need preprocessing (AAT, Wikidata)

@ddeboer
Member

ddeboer commented Dec 3, 2021

Thanks @sdevalk, that’s really useful!

So we end up with these three query types:

  • ?query that equals the raw query input
  • ?literal_query for ABR etc. that converts to lowercase and trims whitespace
  • ?boolean_query (or ?virtuoso_query or ?bif_query) for Adamlink etc. that quotes query parts and inserts boolean AND between them.
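
For discussion, the two derived types could look roughly like this. It is a sketch, not the released implementation: the function names are hypothetical, and the use of single quotes for "quotes query parts" is an assumption based on the quoting mentioned earlier in this issue.

```python
def literal_query(raw: str) -> str:
    """?literal_query: lowercase the input and trim surrounding whitespace."""
    return raw.strip().lower()


def boolean_query(raw: str) -> str:
    """?boolean_query: quote each whitespace-separated part and join with AND."""
    # Assumption: single quotes; parts that themselves contain quotes are not escaped here.
    return " AND ".join(f"'{part}'" for part in raw.split())


print(literal_query("  Rembrandt Van Rijn "))
print(boolean_query("van gogh"))
```

`?query` needs no function because it is the raw input itself.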

Does converting to lowercase influence the other queries? If not, we can probably reduce these to two query types by merging the first two, because removing whitespace definitely should not hurt the other queries.

@sdevalk
Contributor

sdevalk commented Dec 3, 2021

@ddeboer Good point. Lowercasing the search words does not influence the other queries, so we can continue with two query types/variants.

@github-actions

🎉 This issue has been resolved in version 5.5.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
