This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

NTA query lacks support for special characters #46

Closed
EnnoMeijers opened this issue Mar 18, 2021 · 8 comments · Fixed by #112

@EnnoMeijers
Contributor

The current nta.rq uses the bif:contains operator for Virtuoso but lacks support for searching strings with special characters, such as 'Sébastien' in schema:givenName. Additional single quotes surrounding the searchTerm should be added.

Searching for this name in the demonstrator results in an empty set. Searching in the KB's endpoints results in matches for both 'Sébastien' and 'Sebastien', see query. Fixing nta.rq is not straightforward because of the current magic in the query for handling boolean operators in the search string.

@ddeboer
Member

ddeboer commented Mar 18, 2021

Additional single quotes surrounding the searchTerm should be added.

An alternative solution is to strip diacritics from the query: searching for "sebastien" also returns Sébastien, but that may depend on the value of XAnyNormalization in the Virtuoso config.
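
A minimal sketch of what stripping diacritics could look like on the client side, assuming the search term is preprocessed in Python before it is sent to the endpoint. The function name `strip_diacritics` is hypothetical and not part of the Network of Terms code. Whether the stripped term still matches depends on the Virtuoso configuration, as noted above.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'Sébastien' -> 'Sebastien'."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Sébastien"))
```

Letters without a canonical decomposition, such as "ø", pass through unchanged, so this only covers accents that Unicode models as combining marks.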

@EnnoMeijers
Contributor Author

I suspect the current behavior might change over time, as KB is still working on this issue, but this is currently the case and we could fix it this way for the NTA query. If you are thinking of stripping the diacritics before sending the query to the different sources, then I am hesitant to do so because it might impact the results from other sources. In that case we should investigate the impact on the different sources.

@sdevalk
Contributor

sdevalk commented Mar 18, 2021

It indeed depends on the Virtuoso config. Virtuoso on NDE's data platform distinguishes between diacritics - "Sebastien" and "Sébastien" are distinct words (i.e. you cannot find the latter if you remove the diacritic). I have a solution in mind for fixing the query - I'll post it in this ticket.

@sdevalk
Contributor

sdevalk commented Mar 29, 2021

@EnnoMeijers and @ddeboer:

A straightforward solution (for NTA, but also for other sources that use Virtuoso, such as RKDartists) would be to wrap search words inside (single or double) quotes. The approach looks like this:

Example search phrase of a user: Granpré Molière.

  1. Replace whitespaces in a search phrase with AND operators (as is currently the case). Example: Granpré Molière would then become Granpré AND Molière.
  2. Replace each search word in a search phrase with a leading quote, the original search word, and a trailing quote. Example: Granpré AND Molière would then become 'Granpré' AND 'Molière'.
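
The two steps above can also be sketched outside SPARQL, for instance to test the rewriting in isolation. This is an illustration only, assuming Python preprocessing; the function name `rewrite_search_phrase` is hypothetical. It works on whitespace-separated tokens rather than regular expressions, and it upper-cases the operators, which the SPARQL-based approach does not do.

```python
OPERATORS = {"AND", "OR"}


def rewrite_search_phrase(phrase: str) -> str:
    """Insert AND between adjacent words and wrap each word in single quotes."""
    parts = []
    previous_was_word = False
    for token in phrase.split():
        if token.upper() in OPERATORS:
            # Keep boolean operators as they are (normalised to upper case).
            parts.append(token.upper())
            previous_was_word = False
        else:
            # Step 1: two words in a row get an implicit AND between them.
            if previous_was_word:
                parts.append("AND")
            # Step 2: wrap the search word in single quotes.
            parts.append(f"'{token}'")
            previous_was_word = True
    return " ".join(parts)


print(rewrite_search_phrase("Granpré Molière"))
```

This sketch does not handle parentheses or quotes inside search words.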

Following this approach, a query for the NTA could look like this (for testing only):

PREFIX schema: <http://schema.org/>
SELECT *
WHERE {
  ?person schema:mainEntityOfPage/schema:isPartOf <http://data.bibliotheken.nl/id/dataset/persons> .
  ?person schema:name ?name .
  FILTER(<bif:contains>(
    ?name,
    REPLACE(
      # Replace query "A B" with "A AND B", leaving queries "A AND B" or "A OR B" unchanged
      # STR() is required; otherwise Virtuoso complains ("Invalid character in free-text search expression...")
      STR(REPLACE(
        "Granpré Molière",
        "(?<!AND)(?<!OR)[[:space:]]+(?!AND)(?!OR)",
        " AND ",
        "i"
      )),
      # Add leading and trailing quote around each search word, except OR and AND
      "\\b(?!OR[[:space:]]+|AND[[:space:]]+)([^[:space:]]+)",
      "'$1'",
      "i"
    )
  ))
} 
ORDER BY ?name
LIMIT 10

Shortcut to Yasgui: https://api.triplydb.com/s/sSRMMT5bu

This query is a stab at solving the problem - it's not perfect. For instance, how should we handle search phrases with boolean parenthesizations, such as (Granpré OR Molière)? And how should we handle quotes that have been put in search phrases by users, such as Eric D'hondt? We can extend the query above by checking for this kind of input. (Granpré OR Molière) would then become ('Granpré' OR 'Molière'), not the erroneous ('Granpré' OR 'Molière)'. And Eric D'hondt would then become 'Eric' AND 'D\\'hondt', not the erroneous 'Eric' AND 'D'hondt'.
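
A rough sketch of how such checks could look if the rewriting were done in Python instead of inside the query. Everything here is an assumption for illustration: the function name `rewrite_with_groups` is hypothetical, the backslash escaping of single quotes simply follows the expected output described above and has not been verified against Virtuoso, and unbalanced parentheses are not detected.

```python
import re

# A token is either a single parenthesis or a run of non-space, non-parenthesis characters.
TOKEN_PATTERN = re.compile(r"[()]|[^\s()]+")
OPERATORS = {"AND", "OR"}


def rewrite_with_groups(phrase: str) -> str:
    """Quote search words while leaving parentheses and AND/OR outside the quotes."""
    parts = []
    previous = None  # kind of the previous token: "word", "op", "open" or "close"
    for token in TOKEN_PATTERN.findall(phrase):
        if token == "(":
            if previous in ("word", "close"):
                parts.append("AND")
            parts.append("(")
            previous = "open"
        elif token == ")":
            parts.append(")")
            previous = "close"
        elif token.upper() in OPERATORS:
            parts.append(token.upper())
            previous = "op"
        else:
            if previous in ("word", "close"):
                parts.append("AND")
            # Escape backslashes first, then single quotes inside the word.
            escaped = token.replace("\\", "\\\\").replace("'", "\\'")
            parts.append(f"'{escaped}'")
            previous = "word"
    # Words never contain spaces or parentheses, so this only tidies the joins.
    return " ".join(parts).replace("( ", "(").replace(" )", ")")


print(rewrite_with_groups("(Granpré OR Molière)"))
print(rewrite_with_groups("Eric D'hondt"))
```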

The overarching question perhaps is if and to what extent we want to 'rewrite' or 'preprocess' search phrases of users. Rewriting could make the searches 'smarter': users don't have to know the search syntaxes of sources; the Network of Terms takes care of this. On the other hand, rewriting could have undesired side-effects, depending on the user's input.

What do you think?

@EnnoMeijers
Contributor Author

I think this is an interesting approach for fixing a major part of the current problems experienced by the LM. It seems to be quite heavy in the processing resulting in a slow response of the KB SPARQL endpoint, are we still in an acceptable range here? I tried expanding the query to include searching in schema:alternateName as well (using the property path '|') but that resulted in a mysterious error: "Virtuoso 37000 Error SP031: SPARQL compiler: The group does not contain triple pattern with '$name' object before bif:contains() predicate".

At some point it will probably be inevitable to do preprocessing on the input, but I think we should be cautious with this because it might introduce more complexity and less predictable behavior. I suggest we should do more exploration on the real need for support for boolean search syntax. My impression is that the current user expectation is the everyday Google search experience. I think we should aim for a similar experience and stay away from introducing complexity for supporting boolean search operations.

@sdevalk
Contributor

sdevalk commented Apr 12, 2021

Thanks!

It seems to be quite heavy in the processing resulting in a slow response of the KB SPARQL endpoint, are we still in an acceptable range here?

I haven't noticed a real loss in performance. Virtuoso should execute the conversion/pre-processing in the FILTER (e.g. with STR() and REPLACE()) just once - before doing the actual searching - so it should have minimal impact.

I tried expanding the query to include searching in schema:alternateName as well (using the property path '|') but that resulted in a mysterious error: "Virtuoso 37000 Error SP031: SPARQL compiler: The group does not contain triple pattern with '$name' object before bif:contains() predicate".

Yes, I'm experiencing the same issue - bif:contains and property paths do not seem to be a good combination.

The query underneath uses VALUES for including other predicates, such as schema:alternateName. It performs quite well.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
SELECT *
WHERE {
  ?person schema:mainEntityOfPage/schema:isPartOf <http://data.bibliotheken.nl/id/dataset/persons> .
  ?person ?predicate ?name .
  VALUES ?predicate { rdfs:label schema:name schema:alternateName foaf:name }
  FILTER(<bif:contains>(
    ?name,
    REPLACE(
      # Replace query "A B" with "A AND B", leaving queries "A AND B" or "A OR B" unchanged
      # STR() is required; otherwise Virtuoso complains ("Invalid character in free-text search expression...")
      STR(REPLACE(
        "Granpré Molière",
        "(?<!AND)(?<!OR)[[:space:]]+(?!AND)(?!OR)",
        " AND ",
        "i"
      )),
      # Add leading and trailing quote around each search word, except OR and AND
      "\\b(?!OR[[:space:]]+|AND[[:space:]]+)([^[:space:]]+)",
      "'$1'",
      "i"
    )
  ))
} 
ORDER BY ?name
LIMIT 10

Shortcut to Yasgui: https://api.triplydb.com/s/u04WhDLr3
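
If the search phrase and the list of predicates were filled in by client code, the two variable parts of the query above could be generated along these lines. This is a sketch under assumptions, not how the Network of Terms builds its queries: the helper names `sparql_string_literal` and `values_clause` are hypothetical, and the escaping covers only backslashes, double quotes and line breaks.

```python
def sparql_string_literal(value: str) -> str:
    """Return value as a double-quoted SPARQL string literal with basic escaping."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def values_clause(variable: str, predicates: list) -> str:
    """Build a VALUES clause binding the variable to each prefixed predicate name."""
    joined = " ".join(predicates)
    return f"VALUES ?{variable} {{ {joined} }}"


print(values_clause("predicate", ["rdfs:label", "schema:name", "schema:alternateName", "foaf:name"]))
print(sparql_string_literal("Granpré Molière"))
```

Escaping the phrase before it is placed in the query keeps a double quote in the user's input from ending the string literal early.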

My impression is that the current user expectation is the everyday Google search experience. I think we should aim for a similar experience and stay away from introducing complexity for supporting boolean search operations.

That makes sense!

@EnnoMeijers
Contributor Author

Ok, looks good, let's go ahead with it!

@github-actions

🎉 This issue has been resolved in version 5.5.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
