Content-encoding SPARQL query (België) #772

Open
coret opened this issue Oct 21, 2022 · 9 comments

@coret
Contributor

coret commented Oct 21, 2022

When searching for België in the GTAA, no results are returned, whereas searching for Belgie returns België among other results.

Testing by @wmelder showed the following:

Running the construct_gtaa.rq query for België with
curl -H Accept:text/turtle --data-urlencode "query@queries/construct_gtaa.rq" 'https://{username}:{password}@gtaa.apis.beeldengeluid.nl/sparql'
yields no results, but
curl -H "Content-type: application/x-www-form-urlencoded; charset=utf-8" -H Accept:text/turtle --data-urlencode "query@queries/construct_gtaa.rq" 'https://{username}:{password}@gtaa.apis.beeldengeluid.nl/sparql'
does give results!

It seems the Comunica client (Network of Terms) sends UTF-8 but doesn't declare a charset in the Content-Type header, so the server presumably falls back to a non-UTF-8 default such as US-ASCII or ISO-8859-1.
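
To illustrate the suspected mechanism, here is a minimal Python sketch (standard library only, not the actual Comunica or server code). It percent-encodes the term as a UTF-8 client would, then decodes it the way a server assuming ISO-8859-1 would. Whether the GTAA endpoint really defaults to ISO-8859-1 is an assumption.

```python
from urllib.parse import quote, unquote

term = "België"

# What a UTF-8 client puts into a form-urlencoded body: ë becomes two bytes.
encoded = quote(term, encoding="utf-8")

# A server that assumes UTF-8 recovers the original term.
decoded_as_utf8 = unquote(encoded, encoding="utf-8")

# A server that assumes ISO-8859-1 turns each of the two bytes into its own
# character, so the search term no longer matches the indexed label.
decoded_as_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)
print(decoded_as_utf8)
print(decoded_as_latin1)
```

If the server behaves like the last case, the text index is queried for a string that does not occur in any label, which would explain the empty result.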

Should (or can) the charset be part of the dataset description of the GTAA within the Network of Terms (a client-side solution)? Or should a default charset (utf-8) be hardcoded in the Comunica call, with the option to override it via the dataset description?

Some other searches which have problems with searching for terms with diacritics: Ampèrestraat (Adamlink) and Curaçaostraat (Gouda Tijdmachine). Haven't checked if adding a charset helps with these sources.

Some other searches that do not have a problem with terms with diacritics: Eichstätt (WO2 thesaurus), Galileïsche (AAT), Henriëtte (RKDartists).

@wmelder
Contributor

wmelder commented Oct 21, 2022

Should (or can) the charset be part of the dataset description of the GTAA within the Network of Terms (a client-side solution)? Or should a default charset (utf-8) be hardcoded in the Comunica call, with the option to override it via the dataset description?

Adding a hardcoded charset to the HTTP Content-Type header would suffice. Otherwise, the receiving server doesn't know which encoding was used.

@wmelder
Contributor

wmelder commented Oct 21, 2022

Adding a hardcoded charset to the HTTP Content-Type header would suffice. Otherwise, the receiving server doesn't know which encoding was used.

On second thought... what if the server doesn't handle the charset properly, or doesn't default to UTF-8? Then it would be nice if the Network of Terms could send a charset that the server does handle properly. In those cases a per-dataset parameter would be necessary.
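
A rough sketch of that idea in Python (standard library only): default to utf-8 and let a per-dataset setting override it. The function name `build_form_request`, the `charset` parameter and the example.org endpoint are hypothetical and are not part of the Network of Terms or Comunica.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

DEFAULT_CHARSET = "utf-8"  # assumed default; a dataset could override it


def build_form_request(endpoint: str, query: str,
                       charset: Optional[str] = None) -> Request:
    """Build (but do not send) a form-urlencoded SPARQL POST request."""
    charset = charset or DEFAULT_CHARSET
    # Percent-encode the query using the same charset that the header declares.
    body = urlencode({"query": query}, encoding=charset).encode("ascii")
    headers = {
        "Content-Type": f"application/x-www-form-urlencoded; charset={charset}",
        "Accept": "text/turtle",
    }
    return Request(endpoint, data=body, headers=headers, method="POST")


sparql = 'SELECT * WHERE { ?s ?p "België"@nl } LIMIT 1'
default_request = build_form_request("https://example.org/sparql", sparql)
latin1_request = build_form_request("https://example.org/sparql", sparql,
                                    charset="iso-8859-1")
```

The point of the sketch is that the body encoding and the declared charset always come from the same value, so they cannot drift apart.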

@ddeboer
Member

ddeboer commented Oct 25, 2022

What is construct_gtaa.rq and where can I find it?

@wmelder
Contributor

wmelder commented Oct 25, 2022

@ddeboer construct_gtaa.rq is basically the gtaa.rq query, but it may include VALUES clauses for query and datasetUri, the variables that are normally filled in by the Network of Terms. We renamed it so that we could use it as a standalone test query file. Nothing exciting in itself.

@wmelder
Contributor

wmelder commented Oct 25, 2022

Currently these are the contents of the file:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX justskos: <http://justskos.org/ns/core#>
PREFIX text: <http://jena.apache.org/text#>

CONSTRUCT {
    ?uri a skos:Concept ;
        skos:prefLabel ?prefLabel ;
        skos:altLabel ?altLabel ;
        skos:hiddenLabel ?hiddenLabel ;
        skos:scopeNote ?scopeNote ;
        skos:broader ?broader_uri ;
        skos:narrower ?narrower_uri ;
        skos:related ?related_uri .
    ?broader_uri skos:prefLabel ?broader_prefLabel .
    ?narrower_uri skos:prefLabel ?narrower_prefLabel .
    ?related_uri skos:prefLabel ?related_prefLabel .
}
WHERE {
    VALUES ?query { "zelensky" }
    VALUES ?datasetUri {
        <http://data.beeldengeluid.nl/gtaa/Persoonsnamen>
        }
    ?uri text:query (skos:prefLabel skos:altLabel skos:hiddenLabel ?query) .
    ?uri skos:inScheme ?datasetUri ;
        justskos:status ?status .
    FILTER(?status IN ('approved', 'candidate'))

    OPTIONAL {
        ?uri skos:prefLabel ?prefLabel .
        FILTER(LANG(?prefLabel) = "nl" )
    }
    OPTIONAL {
        ?uri skos:altLabel ?altLabel .
        FILTER(LANG(?altLabel) = "nl")
    }
    OPTIONAL {
        ?uri skos:hiddenLabel ?hiddenLabel .
        FILTER(LANG(?hiddenLabel) = "nl")
    }
    OPTIONAL {
        ?uri skos:scopeNote ?scopeNote .
        FILTER(LANG(?scopeNote) = "nl")
    }
    OPTIONAL {
        ?uri skos:broader ?broader_uri .
        ?broader_uri skos:prefLabel ?broader_prefLabel .
        FILTER(LANG(?broader_prefLabel) = "nl")
    }
    OPTIONAL {
        ?uri skos:narrower ?narrower_uri .
        ?narrower_uri skos:prefLabel ?narrower_prefLabel .
        FILTER(LANG(?narrower_prefLabel) = "nl")
    }
    OPTIONAL {
        ?uri skos:related ?related_uri .
        ?related_uri skos:prefLabel ?related_prefLabel .
        FILTER(LANG(?related_prefLabel) = "nl")
    }
}
LIMIT 1000

@wmelder
Contributor

wmelder commented Oct 25, 2022

For this issue it should be modified a bit:

    VALUES ?query { "België" }
    VALUES ?datasetUri {
        <http://data.beeldengeluid.nl/gtaa/GeografischeNamen>
        }

ddeboer self-assigned this Oct 25, 2022

@ddeboer
Member

ddeboer commented Oct 25, 2022

For previous work on diacritics, see #426, netwerk-digitaal-erfgoed/network-of-terms-catalog#46 and netwerk-digitaal-erfgoed/network-of-terms-catalog#93. At least for Virtuoso sources (Adamlink), how diacritics are interpreted is out of our control.

@wmelder
Contributor

wmelder commented Oct 25, 2022

The SPARQL documentation says that a POST with application/sparql-query is always in UTF-8. For a POST with x-www-form-urlencoded, however, that is not stated. It may be better to use the application/sparql-query variant (so with unescaped UTF-8).

A tip from our developers...
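
For comparison, a minimal Python sketch of the application/sparql-query variant (standard library only). The helper name `build_direct_request` and the example.org endpoint are hypothetical, and the request is only constructed, not sent.

```python
from urllib.request import Request


def build_direct_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL POST with the query as the raw body."""
    return Request(
        endpoint,
        # The query goes into the body as-is, encoded as UTF-8, with no
        # percent-encoding step that a server could decode differently.
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "text/turtle",
        },
        method="POST",
    )


sparql = 'SELECT * WHERE { ?s ?p "België"@nl } LIMIT 1'
direct_request = build_direct_request("https://example.org/sparql", sparql)
```

Whether every endpoint in the catalog accepts this request style has not been checked here.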
