Mapping PROV Qualified Names to xsd:QName

Luc Moreau edited this page Jul 24, 2015 · 8 revisions

1. Introduction

PROV-DM defines a PROV Identifier as a Qualified Name with the following definition: A qualified name is a name subject to namespace interpretation. It consists of a namespace, denoted by an optional prefix, and a local name. PROV-DM stipulates that a qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part.

PROV-N provides a concrete syntax for prov:QUALIFIED_NAME, further noting that a PROV-N qualified name QUALIFIED_NAME can be mapped to a valid IRI [RFC3987] by concatenating the namespace denoted by its prefix to the local name, whose \-escaped characters have been unescaped by dropping the character '\' (backslash).
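A minimal sketch of this mapping, for illustration only (it is not ProvToolbox code). The helper names `unescape_local` and `to_iri`, the prefix-to-namespace dictionary, and the `http://example.org/` namespace are all assumptions, and the sketch only handles names that carry an explicit prefix.

```python
def unescape_local(local: str) -> str:
    """Drop each escaping backslash, keeping the character it protects."""
    out = []
    i = 0
    while i < len(local):
        # A backslash followed by another character is an escape: skip it.
        if local[i] == "\\" and i + 1 < len(local):
            i += 1
        out.append(local[i])
        i += 1
    return "".join(out)


def to_iri(qname: str, namespaces: dict) -> str:
    """Concatenate the namespace IRI bound to the prefix with the unescaped local name."""
    # Assumption: the qualified name has the form prefix:local.
    prefix, _, local = qname.partition(":")
    return namespaces[prefix] + unescape_local(local)


# Hypothetical namespace binding, used only for this example.
NAMESPACES = {"ex": "http://example.org/"}
print(to_iri("ex:a01bc\\.", NAMESPACES))
```

Here the PROV-N form `ex:a01bc\.` is unescaped to the local name `a01bc.` before concatenation.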

PROV-XML defines the type of both the prov:id and prov:ref xml-attributes to be xsd:QName as that is the XSD datatype that most closely matches the qualified name definition by PROV-DM. Care should be taken when generating PROV identifier values in PROV-XML such that there is a known mapping to a URI.

A further note adds:

The xsd:QName datatype is more restrictive than the QualifiedName defined in [PROV-N], e.g. PROV-N allows local names to start with numbers, therefore valid identifier values in [PROV-N] serializations have the potential to not be valid identifier values in PROV-XML. It is recommended to enhance interoperability that provenance users strive to always use identifier schemes that map to valid xsd:QNames and URIs.

While this suggestion may work well for applications that are in full control of the design of their identifiers, this suggestion is not workable for applications, such as ProvToolbox, expected to consume arbitrary provenance in arbitrary representations. Any form of URI needs to be mapped to a Qualified Name for PROV-N and to an xsd:QName for PROV-XML.
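The restriction mentioned in the note can be checked mechanically, since the local part of an xsd:QName must be an NCName. The sketch below is an illustration rather than ProvToolbox code: the function name `is_ncname` is an assumption, and the character ranges are those of the XML 1.0 (Fifth Edition) Name production with the colon removed.

```python
import re

# NameStartChar ranges from XML 1.0 (Fifth Edition), without ':' and '_'.
_START = (
    r"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# An NCName is one start character (or '_') followed by any number of name characters.
_NCNAME = re.compile(
    "[_" + _START + "]"
    "[_" + _START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040]*"
)


def is_ncname(local: str) -> bool:
    """Return True if `local` could be the local part of an xsd:QName."""
    return _NCNAME.fullmatch(local) is not None


for candidate in ["abc01", "01", "a@b"]:
    print(candidate, is_ncname(candidate))
```

A local name such as `01` is acceptable in PROV-N but fails this check, which is exactly the situation the encoding in Section 2 addresses.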

This limitation was recognized by the Provenance Working Group. The beginning of a solution was outlined in email discussions, but it never made it into the PROV-XML specification.

The purpose of this document is to outline the mapping process of Qualified Names to xsd:QName adopted by ProvToolbox.

2. A Reversible Encoding

The suggestion outlined in the email discussions was to escape Qualified Names in an unspecified way and to rely on a separate, explicit URI representation for converting PROV-XML representations back into other PROV formats. Based on our experience with ProvToolbox, we felt this would negatively affect the readability of PROV-XML.

Instead, we have implemented a reversible encoding from Qualified Name to xsd:QName, which allows such an xsd:QName to be converted back to the original Qualified Name.

2.1 Underscore-encoding

A reversible encoding scheme already exists: percent-encoding, as used in URIs. However, the character % is not valid in an xsd:QName. Instead, we had to choose an escape character that is valid in local names and not used too frequently, because it would itself have to be escaped wherever it occurs.

After consideration, it was decided to use _ (Underscore).
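A minimal sketch of underscore-encoding, written for illustration and not taken from ProvToolbox. Several points are assumptions inferred from the examples in Section 3: an underscore is doubled, any other character that is not an XML name character becomes `_` followed by two uppercase hexadecimal digits, and characters above U+00FF that would need escaping are simply rejected. The function name `underscore_encode` is hypothetical.

```python
import re

# NameStartChar ranges from XML 1.0 (Fifth Edition), without ':' and '_'.
_START = (
    r"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# Any character allowed somewhere inside an NCName.
_NAME_CHAR = re.compile("[" + _START + r"_\-.0-9\u00B7\u0300-\u036F\u203F-\u2040]")


def underscore_encode(local: str) -> str:
    """Escape characters that cannot appear in an xsd:QName local name."""
    out = []
    for ch in local:
        if ch == "_":
            out.append("__")                 # the escape character escapes itself
        elif _NAME_CHAR.fullmatch(ch):
            out.append(ch)                   # valid name character, kept as-is
        elif ord(ch) <= 0xFF:
            out.append("_%02X" % ord(ch))    # e.g. '@' becomes '_40'
        else:
            # Assumption: this sketch does not define an escape for such characters.
            raise ValueError("no two-digit escape for %r" % ch)
    return "".join(out)


print(underscore_encode("a@b"))
print(underscore_encode("a01b_c"))
```

The result of this step may still begin with a character that is not allowed at the start of a local name; that case is handled separately by start-escaping.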

2.2 Start-escaping

The first character of an xsd:QName local name is expected to belong to a restricted subset of characters. For instance, a local name cannot start with a digit. Therefore, after underscore-encoding a local name, we further escape the first character with a _ (Underscore) if it is not a valid start character.
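Start-escaping can be sketched as a second pass over an already underscore-encoded local name. This is an illustration, not ProvToolbox code, and the name `start_escape` is hypothetical. Two behaviours are assumptions inferred from the examples in Section 3 rather than from the text above: an empty local name becomes `_`, and an encoded name that already begins with `_` also receives an extra leading `_`, so that a leading underscore always signals start-escaping.

```python
import re

# Start characters from XML 1.0 (Fifth Edition), without ':' and without '_'
# (assumption: '_' is reserved here as the start-escape marker).
_START_OK = re.compile(
    r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF]"
)


def start_escape(encoded: str) -> str:
    """Prefix '_' when the encoded local name has no acceptable first character."""
    if encoded == "" or not _START_OK.fullmatch(encoded[0]):
        return "_" + encoded
    return encoded


print(start_escape("01"))
print(start_escape("abc"))
```

Because a leading `_` is only ever produced by this step, a decoder can strip it unconditionally before undoing the underscore-encoding.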

3. Examples

The following table illustrates a few conversions.

| prov:QUALIFIED_NAME (*) | xsd:QName | Comment |
| --- | --- | --- |
| `ex:abc` | `ex:abc` | Provly identifier, no escaping required |
| `ex:abc01` | `ex:abc01` | Provly identifier, no escaping required |
| `ex:01` | `ex:_01` | Local name starting with a non PN_CHAR_START character, escaped with `_` |
| `ex:` | `ex:_` | Empty local name mapped to `_` |
| `ex:_` | `ex:___` | `_` escaped, and escaped again since it is at the start |
| `ex:a01b_c` | `ex:a01b__c` | Escape `_` |
| `ex:a@b` | `ex:a_40b` | Mapping of `@` to `_40` |
| `ex:a~b` | `ex:a_7Eb` | Mapping of `~` to `_7E` |
| `ex:a&b` | `ex:a_26b` | Mapping of `&` to `_26` |
| `ex:a+b` | `ex:a_2Bb` | Mapping of `+` to `_2B` |
| `ex:a*b` | `ex:a_2Ab` | Mapping of `*` to `_2A` |
| `ex:a#b` | `ex:a_23b` | Mapping of `#` to `_23` |
| `ex:a$b` | `ex:a_24b` | Mapping of `$` to `_24` |
| `ex:a!b` | `ex:a_21b` | Mapping of `!` to `_21` |
| `ex:a01/bc` | `ex:a01_2Fbc` | Mapping of `/` to `_2F` |
| `ex:a01b\c` | `ex:a01b_5Cc` | Mapping of `\` to `_5C` |
| `ex:a01b=c` | `ex:a01b_3Dc` | Mapping of `=` to `_3D` |
| `ex:a01b'c` | `ex:a01b_27c` | Mapping of `'` to `_27` |
| `ex:a01b(c` | `ex:a01b_28c` | Mapping of `(` to `_28` |
| `ex:a01b)c` | `ex:a01b_29c` | Mapping of `)` to `_29` |
| `ex:a01b,c` | `ex:a01b_2Cc` | Mapping of `,` to `_2C` |
| `ex:a01b:c` | `ex:a01b_3Ac` | Mapping of `:` to `_3A` |
| `ex:a01b;c` | `ex:a01b_3Bc` | Mapping of `;` to `_3B` |
| `ex:a01b[c` | `ex:a01b_5Bc` | Mapping of `[` to `_5B` |
| `ex:a01b]c` | `ex:a01b_5Dc` | Mapping of `]` to `_5D` |
| `ex:a01b.c` | `ex:a01b.c` | `.` permitted in QName |
| `ex:a01bc.` | `ex:a01bc.` | `.` permitted at end of QName |
| `ex:='(),_:;[].@~` | `ex:__3D_27_28_29_2C___3A_3B_5B_5D._40_7E` | Escape them all except `.` |
| `ex:?a\=b` | `ex:__3Fa_5C_3Db` | Escape symbols |
| `ex:55348dff-4fcc-4ac2-ab56-641798c64400` | `ex:_55348dff-4fcc-4ac2-ab56-641798c64400` | Escaping of a UUID-like QualifiedName |
| `ex:À-ÖØ-öø-˿Ͱͽ` | `ex:À-ÖØ-öø-˿Ͱͽ` | Support for Unicode |

(*) Note that the prov:QUALIFIED_NAME column displays unescaped Qualified Names. So, the correct syntax for ex:a01bc. is ex:a01bc\. since . is not allowed in final position.
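To illustrate that the encoding is reversible, the following sketch encodes and decodes a few of the local names from the table. It is an illustration under stated assumptions, not the ProvToolbox implementation: the function names `to_xsd_local` and `from_xsd_local` are hypothetical, escapes are assumed to be `_` plus two uppercase hexadecimal digits, characters above U+00FF that would need escaping are rejected, and the name character ranges are those of XML 1.0 (Fifth Edition).

```python
import re

# NameStartChar ranges from XML 1.0 (Fifth Edition), without ':' and '_'.
_START = (
    r"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
_START_RE = re.compile("[" + _START + "]")
_NAME_RE = re.compile("[" + _START + r"_\-.0-9\u00B7\u0300-\u036F\u203F-\u2040]")


def to_xsd_local(local: str) -> str:
    """Underscore-encode a local name, then start-escape it if needed."""
    out = []
    for ch in local:
        if ch == "_":
            out.append("__")
        elif _NAME_RE.fullmatch(ch):
            out.append(ch)
        elif ord(ch) <= 0xFF:
            out.append("_%02X" % ord(ch))
        else:
            raise ValueError("no two-digit escape for %r" % ch)
    enc = "".join(out)
    # Assumption: empty names and names starting with '_' are start-escaped too.
    if enc == "" or not _START_RE.fullmatch(enc[0]):
        enc = "_" + enc
    return enc


def from_xsd_local(enc: str) -> str:
    """Invert to_xsd_local: strip the start escape, then undo each '_' escape."""
    if enc.startswith("_"):
        enc = enc[1:]
    out = []
    i = 0
    while i < len(enc):
        if enc[i] != "_":
            out.append(enc[i])
            i += 1
        elif enc.startswith("__", i):
            out.append("_")
            i += 2
        else:
            out.append(chr(int(enc[i + 1:i + 3], 16)))
            i += 3
    return "".join(out)


# A few (unescaped local name, expected xsd local name) pairs from the table.
ROWS = [
    ("abc", "abc"),
    ("01", "_01"),
    ("", "_"),
    ("_", "___"),
    ("a01b_c", "a01b__c"),
    ("a@b", "a_40b"),
    ("='(),_:;[].@~", "__3D_27_28_29_2C___3A_3B_5B_5D._40_7E"),
    ("?a\\=b", "__3Fa_5C_3Db"),
]
for plain, encoded in ROWS:
    assert to_xsd_local(plain) == encoded
    assert from_xsd_local(encoded) == plain
print("round trip ok")
```

Decoding is unambiguous in this sketch because a leading `_` is produced only by start-escaping, and every remaining `_` is followed either by a second `_` or by two hexadecimal digits.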

4. Provly Identifiers

PROV-XML makes the following suggestion.

It is recommended to enhance interoperability that provenance users strive to always use identifier schemes that map to valid xsd:QNames and URIs.

We call these "provly identifiers". For instance, ex:ab01 is a "provly identifier", since it is both a PROV-N Qualified Name and an xsd:QName.

5. ProvToolbox support

The class org.openprovenance.prov.model.QualifiedNameUtils offers conversion methods implementing the encoding described in this document.

The method toQName() implements the encoding described in Section 2.

6. Conclusion

We recognize that this solution is our own and, in a sense, is not interoperable. Other solutions are possible, but at least one such solution is required to support interoperable conversions between PROV-XML and the other representations.

A consequence of releasing ProvToolbox 0.7.0 with support for this encoding is that previously generated PROV-XML documents may not be readable if they were not produced with this encoding.

A future version of PROV will have to specify the mapping between PROV representations and, specifically, will have to address the mapping of PROV-N identifiers to xsd:QName, as mandated by PROV-XML. The solution presented here will be an input to this standardization effort.