use XSD datatypes not schema.org datatypes #654

Open
VladimirAlexiev opened this issue May 15, 2024 · 5 comments

VladimirAlexiev commented May 15, 2024

Schema.org datatypes are not good:

  • they go against standard XSD datatypes that are the foundation of both XML and RDF.
  • they are tentative (don't specify a lexical representation), eg schema:Number doesn't say what kind of number
  • they are not implemented in semantic repositories, i.e. there are special indexes for xsd:date, xsd:decimal etc but not for schema:Date, schema:Number etc

schemaorg/schemaorg#1781 explains in more detail what's wrong with them.

This also leads to confusion, eg in https://github.com/mlcommons/croissant/blob/main/docs/croissant.ttl:

croissant:Format a rdf:Class ;
  rdfs:label "Format" ;
  rdfs:comment "Specifies how to parse the format of the data from a string representation. For example, format may hold a date format string, a number format, or a bounding box format." ;
  rdfs:subClassOf schema:Text .

croissant:format a rdf:Property ;
  rdfs:label "format" ;
  rdfs:comment "A format to parse the values of the data from text, e.g., a date format or number format." ;
  schema:domainIncludes croissant:DataSource ;
  schema:rangeIncludes croissant:Format .

  • A datatype cannot be a subclass of another datatype
  • You can define a custom datatype based on an XSD datatype, but that's done based on restrictions (eg "Age is a subset of Integer by fixing minInclusive and maxInclusive"). As you don't define any restriction for Format, there's no need to define a new datatype.
  • In the class description, it's unclear what "hold" means: is the string stored directly in cr:format, or does it point to a node with type cr:Format that holds the string?
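To make the "restriction" point concrete, here is a minimal Python sketch of a datatype derived from an integer base by fixing minInclusive and maxInclusive. The class name `RestrictedInteger`, the `ex:` names and the bounds 0..150 are all hypothetical, chosen only for illustration; this is not how Croissant or any RDF library implements datatypes.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Lexical form of an integer: optional sign followed by ASCII digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


@dataclass(frozen=True)
class RestrictedInteger:
    """Hypothetical helper: an integer datatype narrowed by optional bounds."""

    name: str
    min_inclusive: Optional[int] = None
    max_inclusive: Optional[int] = None

    def is_valid(self, lexical: str) -> bool:
        # First the base-type check, then the restricting facets.
        if INTEGER_LEXICAL.fullmatch(lexical) is None:
            return False
        value = int(lexical)
        if self.min_inclusive is not None and value < self.min_inclusive:
            return False
        if self.max_inclusive is not None and value > self.max_inclusive:
            return False
        return True


# Bounds are illustrative assumptions, not taken from any specification.
AGE = RestrictedInteger("ex:Age", min_inclusive=0, max_inclusive=150)

# A derived type with no facets accepts exactly what its base type accepts,
# which is why defining it adds nothing.
NO_FACETS = RestrictedInteger("ex:JustAnInteger")
```

With these definitions, `AGE.is_valid("200")` is rejected while `NO_FACETS.is_valid("200")` is accepted, the same as the base type would.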

This issue involves the ontologies and JSONLD context.
Here's a count of occurrences in the two ontologies:

    2 schema:Boolean                                                          
    1 schema:DateTime                                                         
   29 schema:Text                                                             
    2 schema:URL   
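A count like the one above could be reproduced with a few lines of Python. This is only a sketch: `SAMPLE_TTL` is a hypothetical excerpt standing in for the ontology files, and the list of datatype names in the pattern is an assumption about which ones are worth looking for.

```python
import re
from collections import Counter

# Hypothetical excerpt; in practice this would be the text of the ontology files.
SAMPLE_TTL = """
croissant:Format a rdf:Class ;
  rdfs:subClassOf schema:Text .
rai:dataCollectionTimeframe a rdf:Property ;
  schema:domainIncludes schema:Dataset ;
  schema:rangeIncludes schema:DateTime .
"""

# The trailing \b keeps schema:Date from matching inside schema:DateTime
# and schema:Text from matching inside longer names.
DATATYPE_PATTERN = re.compile(
    r"\bschema:(Boolean|Date|DateTime|Number|Text|Time|URL)\b"
)


def count_schema_datatypes(turtle_text: str) -> Counter:
    """Count occurrences of Schema.org datatype names in Turtle source text."""
    return Counter("schema:" + name for name in DATATYPE_PATTERN.findall(turtle_text))


counts = count_schema_datatypes(SAMPLE_TTL)
```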

Also, I think it's better to distinguish properties between owl:DatatypeProperty and owl:ObjectProperty.
Many Schema.org props are permissive and allow either literal or object ("string or thing"), but I think Croissant props are more precise.

@VladimirAlexiev
Author

This use of a schema datatype is perhaps the only valid one:

rai:dataCollectionTimeframe a rdf:Property ;
  rdfs:label "dataCollectionTimeframe" ;
  rdfs:comment "Timeframe in terms of start and end date of the collection process, that it described as a DateTime indicating a time period in <a href=\"https://en.wikipedia.org/wiki/ISO_8601#Time_intervals\">ISO 8601 time interval format</a>. For example, a collection time frame ranging from 2020 - 2022 can be indicated in ISO 8601 interval format via \"2020/2022\"." ;
  schema:domainIncludes schema:Dataset ;
  schema:rangeIncludes schema:DateTime .

The reason is that there's no XSD datatype to cover date intervals.

  • https://schema.org/DateTime description doesn't allow an interval: "A combination of date and time of day in the form [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm] (see Chapter 5.4 of ISO 8601)."
  • but https://schema.org/datasetTimeInterval allows an interval: "The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)". This prop has DateTime as its range (which confirms my claim that Schema datatypes are tentative)
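The mismatch can be checked mechanically. The sketch below turns the quoted form [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm] into a regular expression and shows that the interval value "2020/2022" does not fit it. The function names are hypothetical, and the interval helper only splits on "/" without validating either part.

```python
import re
from typing import Optional, Tuple

# Regex transcription of [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm]
SCHEMA_DATETIME = re.compile(
    r"-?[0-9]{4}-[0-9]{2}-[0-9]{2}"
    r"T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"(Z|[+-][0-9]{2}:[0-9]{2})?"
)


def is_schema_datetime(lexical: str) -> bool:
    """True if the string has the shape quoted in the schema:DateTime description."""
    return SCHEMA_DATETIME.fullmatch(lexical) is not None


def split_interval(lexical: str) -> Optional[Tuple[str, str]]:
    """Split a "start/end" interval; None unless there are two non-empty parts."""
    parts = lexical.split("/")
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]
```

So a consumer that trusts the declared range would reject "2020/2022", while one that follows the property description has to special-case the "/" separator.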

@benjelloun
Contributor

Wow, the discussion in schemaorg/schemaorg#1781 is quite amazing and instructive.

You make very valid points on the merits of xsd types vs. schema.org basic data types.

Personally, I would lean towards supporting both in Croissant, and specifying a clear mapping as you do in that discussion.
If there is a consensus on the benefits, we can recommend using the xsd basic data types over the schema.org ones in the next version of Croissant.
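As a rough idea of what such a mapping might look like in a tool, here is a minimal Python sketch. The table is illustrative and is not copied from schemaorg/schemaorg#1781; the entries for schema:Number and schema:Float in particular are assumptions, since Schema.org does not pin down their lexical form.

```python
# Illustrative mapping only; not the mapping proposed in schemaorg/schemaorg#1781.
SCHEMA_TO_XSD = {
    "schema:Boolean": "xsd:boolean",
    "schema:Date": "xsd:date",
    "schema:DateTime": "xsd:dateTime",
    "schema:Time": "xsd:time",
    "schema:Text": "xsd:string",
    "schema:URL": "xsd:anyURI",
    "schema:Integer": "xsd:integer",
    "schema:Float": "xsd:double",    # assumption: could equally be xsd:float
    "schema:Number": "xsd:decimal",  # assumption: schema:Number is underspecified
}


def to_xsd(datatype: str) -> str:
    """Return the XSD counterpart if one is listed, else the input unchanged."""
    return SCHEMA_TO_XSD.get(datatype, datatype)
```

Passing unknown names through unchanged keeps the typing extensible: values such as `xsd:date` or a class from another vocabulary are left as they are.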

In general, data typing in Croissant aims to be extensible, and not limited to a single namespace. For instance, users can "semantically" type their data by associating classes from schema.org, wikidata, or other vocabularies. That said, for basic data types we certainly want to favor consistency to reduce the burden on tools and users of the datasets.

As you noted, we do inherit some of the fuzziness of schema.org, but try to make things a bit more precise where necessary.

Regarding format, "holds" means "contains". cr:Format is just a marker type, but its values are still strings (err... I mean sc:Text. :-)

@pierrot0
Contributor

We are definitely going to need to differentiate between int8, int16, uint8... and xsd has short, long, unsignedLong, etc.
So in that regard xsd seems useful indeed.

Looking at numpy types, is xsd enough though? What mechanism do we want to support to describe a field as being an int128, or a complex number for example?
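To illustrate where XSD covers fixed-width types and where it stops, here is a small Python sketch that maps numpy-style dtype names (plain strings, numpy itself is not imported) to XSD built-ins. The table and the function name `xsd_for_dtype` are hypothetical; names with no entry, such as int128 or complex64, return None to signal that something beyond the XSD built-ins would be needed.

```python
from typing import Optional

# dtype names are plain strings here; the XSD types are the built-in
# fixed-range integer and floating-point datatypes.
NUMPY_TO_XSD = {
    "bool": "xsd:boolean",
    "int8": "xsd:byte",
    "int16": "xsd:short",
    "int32": "xsd:int",
    "int64": "xsd:long",
    "uint8": "xsd:unsignedByte",
    "uint16": "xsd:unsignedShort",
    "uint32": "xsd:unsignedInt",
    "uint64": "xsd:unsignedLong",
    "float32": "xsd:float",
    "float64": "xsd:double",
}


def xsd_for_dtype(dtype_name: str) -> Optional[str]:
    """Return the XSD built-in for a dtype name, or None if there is none listed."""
    return NUMPY_TO_XSD.get(dtype_name)
```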

@VladimirAlexiev
Author

@pierrot0 For that you'd need custom datatypes.

How about large multidimensional arrays (tensors)? NetCDF and HDF5 for example have mechanisms for capturing such in binary and for describing them.

@VladimirAlexiev
Author

@pierrot0 I've reread the discussion above.

> users can "semantically" type their data by associating classes from schema.org, wikidata, or other vocabularies.

What precisely do you mean by this, can you give an example?

> cr:Format is just a marker type, but its values are still strings (err... I mean sc:Text. :-)

  • Only resources (nodes) can have rdf:type
  • Literals can have a datatype, eg "application/json"^^cr:Format is a valid literal with a custom datatype. But do you really want that?
  • The legitimate use for datatypes is to signal/trigger special processing by the semantic database, eg
    • "2024-10-09"^^xsd:date tells it to index the literal as a date (so eg it should come before "12024-10-09")
    • "point(1 2)"^^geo:wktLiteral tells it to put a GeoSPARQL literal (expressed as Well Known Text) in a geospatial index
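The date example above can be sketched in Python to show why the datatype matters for ordering. The parser below is a simplified assumption: it handles only the plain year-month-day form (years of four or more digits, optional leading minus) and ignores timezones. Python's own `datetime.date` is not used because its maximum year is 9999.

```python
import re

# Simplified xsd:date lexical form: no timezone support in this sketch.
XSD_DATE = re.compile(r"(-?)([0-9]{4,})-([0-9]{2})-([0-9]{2})")


def date_key(lexical: str):
    """Turn a date string into a (year, month, day) tuple usable as a sort key."""
    match = XSD_DATE.fullmatch(lexical)
    if match is None:
        raise ValueError(f"not a supported date literal: {lexical!r}")
    sign, year, month, day = match.groups()
    signed_year = -int(year) if sign else int(year)
    return (signed_year, int(month), int(day))


literals = ["12024-10-09", "2024-10-09"]

# Plain string ordering compares character by character, so "1..." sorts first.
as_strings = sorted(literals)

# Value-based ordering puts year 2024 before year 12024.
as_dates = sorted(literals, key=date_key)
```

A store that sees only untyped strings can offer the first ordering at best; the second requires knowing the literal is a date.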
