KDL Spec

This is the semi-formal specification for KDL, including the intended data model and the grammar.

This document describes KDL version 2.0.0-draft.4. It was released on 2024-02-12.

Introduction

KDL is a node-oriented document language. Its niche and purpose overlap with XML, as do many of its semantics. You can use KDL both as a configuration language and as a data exchange or storage format, if you so choose.

The bulk of this document is dedicated to a long-form description of all Components of a KDL document. There is also a much more terse Grammar at the end of the document that covers most of the rules, with some semantic exceptions involving the data model.

KDL is designed to be easy to read and easy to implement.

In this document, references to "left" or "right" refer to directions in the data stream towards the beginning or end, respectively; in other words, the directions if the data stream were only ASCII text. They do not refer to the writing direction of text, which can flow in either direction, depending on the characters used.

Components

Document

The toplevel concept of KDL is a Document. A Document is composed of zero or more Nodes, separated by newlines and whitespace, and eventually terminated by an EOF.

All KDL documents should be UTF-8 encoded and conform to the specifications in this document.

Example

The following is a document composed of two toplevel nodes:

foo {
    bar
}
baz

Node

Being a node-oriented language means that the real core component of any KDL document is the "node". Every node must have a name, which must be a String.

The name may be preceded by a Type Annotation to further clarify its type, particularly in relation to its parent node. (For example, clarifying that a particular date child node is for the publication date, rather than the last-modified date, with (published)date.)

Following the name are zero or more Arguments or Properties, separated by either whitespace or a slash-escaped line continuation. Arguments and Properties may be interspersed in any order, much as positional arguments and options are in command-line tools.

Children can be placed after the name and the optional Arguments and Properties, possibly separated by either whitespace or a slash-escaped line continuation.

Arguments are ordered relative to each other (but not relative to Properties) and that order must be preserved in order to maintain the semantics.

By contrast, Property order SHOULD NOT matter to implementations. Children should be used if an order-sensitive key/value data structure must be represented in KDL.

Nodes MAY be prefixed with Slashdash to "comment out" the entire node, including its properties, arguments, and children, and make it act as plain whitespace, even if it spreads across multiple lines.

Finally, a node is terminated by either a Newline, a semicolon (;) or the end of the file/stream (an EOF).

Example

foo 1 key=val 3 {
    bar
    (role)baz 1 2
}
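
The data model described above can be sketched as a small set of Python types. This is only an illustration: the class and field names (`Entry`, `Node`, `type_annotation`, and so on) are hypothetical and not part of KDL, and an implementation is free to represent nodes differently.

```python
from dataclasses import dataclass, field
from typing import Optional, Union

# A KDL Value is a String, a Number, a Boolean, or Null.
Value = Union[str, int, float, bool, None]


@dataclass
class Entry:
    """A Value together with its optional Type Annotation (hypothetical helper)."""
    value: Value
    type_annotation: Optional[str] = None


@dataclass
class Node:
    name: str
    type_annotation: Optional[str] = None
    # Arguments are ordered relative to each other.
    args: list[Entry] = field(default_factory=list)
    # Property order should not matter, so a mapping is enough.
    props: dict[str, Entry] = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)


# The example node above, built by hand rather than parsed.
foo = Node(
    name="foo",
    args=[Entry(1), Entry(3)],
    props={"key": Entry("val")},
    children=[
        Node(name="bar"),
        Node(name="baz", type_annotation="role", args=[Entry(1), Entry(2)]),
    ],
)
```

Keeping Arguments in a list and Properties in a mapping mirrors the ordering guarantees: Argument order is preserved, while Property order carries no meaning.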

Line Continuation

Line continuations allow Nodes to be spread across multiple lines.

A line continuation is a \ character followed by zero or more whitespace items (including multiline comments) and an optional single-line comment. It must be terminated by a Newline (including the Newline that is part of single-line comments).

Following a line continuation, processing of a Node can continue as usual.

Example

my-node 1 2 \  // comments are ok after \
        3 4    // This is the actual end of the Node.

Property

A Property is a key/value pair attached to a Node. A Property is composed of a String, followed immediately by an equals sign, and then a Value.

Properties should be interpreted left-to-right, with rightmost properties with identical names overriding earlier properties. That is:

node a=1 a=2

In this example, the node's a value must be 2, not 1.

No other guarantees about order should be expected by implementers. Deserialized representations may iterate over properties in any order and still be spec-compliant.
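
The override rule can be sketched in a few lines of Python. The function name `collect_properties` is hypothetical, and the input is assumed to be the (key, value) pairs of one node in the order they were read.

```python
def collect_properties(pairs):
    """Fold (key, value) pairs, read left to right, into a mapping.

    A later pair with the same key replaces the earlier one, so the
    rightmost property wins.
    """
    props = {}
    for key, value in pairs:
        props[key] = value
    return props


# Corresponds to: node a=1 a=2
print(collect_properties([("a", 1), ("a", 2)]))  # {'a': 2}
```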

Properties MAY be prefixed with /- to "comment out" the entire token and make it act as plain whitespace, even if it spreads across multiple lines.

Equals Sign

Any of the following characters may be used as equals signs in properties:

| Name | Character | Code Point |
| --- | --- | --- |
| EQUALS SIGN | = | U+003D |
| SMALL EQUALS SIGN | ﹦ | U+FE66 |
| FULLWIDTH EQUALS SIGN | ＝ | U+FF1D |
| HEAVY EQUALS SIGN | 🟰 | U+1F7F0 |

Argument

An Argument is a bare Value attached to a Node, with no associated key. It shares the same space as Properties, and may be interleaved with them.

A Node may have any number of Arguments, which should be evaluated left to right. KDL implementations MUST preserve the order of Arguments relative to each other (not counting Properties).

Arguments MAY be prefixed with /- to "comment out" the entire token and make it act as plain whitespace, even if it spreads across multiple lines.

Example

my-node 1 2 3 a b c

Children Block

A children block is a block of Nodes, surrounded by { and }. They are an optional part of nodes, and create a hierarchy of KDL nodes.

Regular node termination rules apply, which means multiple nodes can be included in a single-line children block, as long as they're all terminated by ;.

Example

parent {
    child1
    child2
}

parent { child1; child2; }

Value

A value is either: a String, a Number, a Boolean, or Null.

Values MUST be either Arguments or values of Properties. Only String values may be used as Node names or Property keys.

Values (both as arguments and as properties) MAY be prefixed by a single Type Annotation.

Type Annotation

A type annotation is a prefix to any Node Name or Value. It either suggests what type the value is intended to be treated as, or gives a context-specific elaboration of the more generic type the node name indicates.

Type annotations are written as a single String enclosed in ( and ). The String may be surrounded by Whitespace inside the parentheses, and the annotation may be separated from its target by Whitespace.

KDL does not specify any restrictions on what implementations might do with these annotations. They are free to ignore them, or use them to make decisions about how to interpret a value.

Additionally, the following type annotations MAY be recognized by KDL parsers. Parsers that recognize them SHOULD interpret them as follows:

Reserved Type Annotations for Numbers Without Decimals:

Signed integers of various sizes (the number is the bit size):

  • i8
  • i16
  • i32
  • i64

Unsigned integers of various sizes (the number is the bit size):

  • u8
  • u16
  • u32
  • u64

Platform-dependent integer types, both signed and unsigned:

  • isize
  • usize

Reserved Type Annotations for Numbers With Decimals:

IEEE 754 floating point numbers, both single (32) and double (64) precision:

  • f32
  • f64

IEEE 754-2008 decimal floating point numbers:

  • decimal64
  • decimal128

Reserved Type Annotations for Strings:

  • date-time: ISO8601 date/time format.
  • time: "Time" section of ISO8601.
  • date: "Date" section of ISO8601.
  • duration: ISO8601 duration format.
  • decimal: IEEE 754-2008 decimal string format.
  • currency: ISO 4217 currency code.
  • country-2: ISO 3166-1 alpha-2 country code.
  • country-3: ISO 3166-1 alpha-3 country code.
  • country-subdivision: ISO 3166-2 country subdivision code.
  • email: RFC5322 email address.
  • idn-email: RFC6531 internationalized email address.
  • hostname: RFC1123 internet hostname (only ASCII segments).
  • idn-hostname: RFC5890 internationalized internet hostname (only xn---prefixed ASCII "punycode" segments, or non-ASCII segments)
  • ipv4: RFC2673 dotted-quad IPv4 address.
  • ipv6: RFC2373 IPv6 address.
  • url: RFC3986 URI.
  • url-reference: RFC3986 URI Reference.
  • irl: RFC3987 Internationalized Resource Identifier.
  • irl-reference: RFC3987 Internationalized Resource Identifier Reference.
  • url-template: RFC6570 URI Template.
  • uuid: RFC4122 UUID.
  • regex: Regular expression. Specific patterns may be implementation-dependent.
  • base64: A Base64-encoded string, denoting arbitrary binary data.

Examples

node (u8)123
node prop=(regex).*
(published)date "1970-01-01"
(contributor)person name="Foo McBar"
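
As one possible use of the reserved integer annotations, a parser could check that an annotated number fits the named type. The sketch below is an assumption about how an implementation might do this. The spec does not require it, and the names `INT_BITS` and `fits_annotation` are hypothetical.

```python
# Bit widths for the reserved fixed-size integer annotations.
INT_BITS = {
    "i8": 8, "i16": 16, "i32": 32, "i64": 64,
    "u8": 8, "u16": 16, "u32": 32, "u64": 64,
}


def fits_annotation(annotation: str, value: int) -> bool:
    """Return True if `value` is representable in the annotated integer type.

    Annotations outside the table (including isize and usize, which are
    platform-dependent) are accepted without a range check.
    """
    bits = INT_BITS.get(annotation)
    if bits is None:
        return True
    if annotation.startswith("u"):
        return 0 <= value < 2 ** bits
    return -(2 ** (bits - 1)) <= value < 2 ** (bits - 1)


print(fits_annotation("u8", 123))  # True, as in: node (u8)123
print(fits_annotation("u8", 256))  # False
```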

String

Strings in KDL represent textual UTF-8 Values. A String is either an Identifier String (like foo), a Quoted String (like "foo") or a Raw String (like #"foo"#). Identifier Strings let you write short, "single-word" strings with a minimum of syntax; Quoted Strings let you write strings with whitespace (including newlines!) or escapes; Raw Strings let you write strings with whitespace but without escapes, allowing you to not worry about the string's content containing anything that might look like an escape.

Strings MUST be represented as UTF-8 values.

Strings MUST NOT include the code points for disallowed literal code points directly. Quoted Strings may include these code points as values by representing them with their corresponding \u{...} escape.

Identifier String

An Identifier String (sometimes referred to as just an "identifier") is composed of any Unicode Scalar Value other than non-initial characters, followed by any number of Unicode Scalar Values other than non-identifier characters.

A handful of patterns are disallowed, to avoid confusion with other values:

  • idents that appear to start with a Number (like 1.0v2 or -1em) or the "almost a number" pattern of a decimal point without a leading digit (like .1).
  • idents that are the language keywords (inf, -inf, nan, true, false, and null) without their leading #.

Identifiers that match these patterns MUST be treated as a syntax error; such values can only be written as quoted or raw strings. The precise details of the identifier syntax are specified in the Full Grammar below.

Identifier Strings are terminated by Whitespace or Newlines.

Non-initial characters

The following characters cannot be the first character in an Identifier String:

  • Any decimal digit (0-9).
  • Any non-identifier characters.

Additionally, the - character can only be used as an initial character if the second character is not a digit. This allows identifiers to look like --this, and removes the ambiguity of having an identifier look like a negative number.

Non-identifier characters

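The following characters cannot be used anywhere in an Identifier String:

  • Any of (){}[]/\"#;
  • Any Whitespace or Newline.
  • Any disallowed literal code points in KDL documents.
  • Any Equals Sign.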
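
The identifier rules can be sketched as a Python predicate. This is a minimal sketch, assuming the character sets from the Whitespace, Newline, Equals Sign, and Disallowed Literal Code Points sections and the identifier rules in the Full Grammar. All names are hypothetical, and `-inf` is rejected to follow the prose of this section.

```python
EQUALS_SIGNS = "=\uFE66\uFF1D\U0001F7F0"
WHITESPACE = (
    "\u0009\u000B\u0020\u00A0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A"
    "\u202F\u205F\u3000"
)
NEWLINES = "\u000D\u000A\u0085\u000C\u2028\u2029"
PUNCTUATION = "\\/(){};[]\"#"
KEYWORDS = {"inf", "-inf", "nan", "true", "false", "null"}
DIGITS = "0123456789"


def is_disallowed(ch: str) -> bool:
    """True for code points that may not appear literally in a document."""
    cp = ord(ch)
    return (
        cp <= 0x08 or 0x0E <= cp <= 0x1F or cp == 0x7F
        or 0xD800 <= cp <= 0xDFFF
        or 0x200E <= cp <= 0x200F or 0x202A <= cp <= 0x202E
        or 0x2066 <= cp <= 0x2069 or cp == 0xFEFF
    )


def is_identifier_char(ch: str) -> bool:
    return not (
        ch in EQUALS_SIGNS or ch in WHITESPACE or ch in NEWLINES
        or ch in PUNCTUATION or is_disallowed(ch)
    )


def is_identifier_string(s: str) -> bool:
    if not s or s in KEYWORDS:
        return False
    if not all(is_identifier_char(ch) for ch in s):
        return False
    # Skip an optional sign and an optional '.', then make sure what
    # follows is not a digit (which would look like a Number).
    rest = s[1:] if s[0] in "+-" else s
    if rest.startswith("."):
        rest = rest[1:]
    return not (rest and rest[0] in DIGITS)
```

For example, `is_identifier_string("--this")` is true, while `-1em`, `1.0v2`, `.1`, and `true` are all rejected.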

Quoted String

A Quoted String is delimited by " on either side of any number of literal string characters except unescaped " and \. This includes literal Newline characters, which means a single String Value can span multiple lines, following specific Multi-line String rules.

Like Identifier Strings, Quoted Strings MUST NOT include any of the disallowed literal code-points as code points in their body.

Quoted Strings also follow the Multi-line rules specified in Multi-line String.

Escapes

In addition to literal code points, a number of "escapes" are supported in Quoted Strings. "Escapes" are the character \ followed by another character, and are interpreted as described in the following table:

| Name | Escape | Code Pt |
| --- | --- | --- |
| Line Feed | `\n` | U+000A |
| Carriage Return | `\r` | U+000D |
| Character Tabulation (Tab) | `\t` | U+0009 |
| Reverse Solidus (Backslash) | `\\` | U+005C |
| Quotation Mark (Double Quote) | `\"` | U+0022 |
| Backspace | `\b` | U+0008 |
| Form Feed | `\f` | U+000C |
| Space | `\s` | U+0020 |
| Unicode Escape | `\u{(1-6 hex chars)}` | Code point described by hex characters, as long as it represents a Unicode Scalar Value |
| Whitespace Escape | See below | N/A |

Escaped Whitespace

In addition to escaping individual characters, \ can also escape whitespace. When a \ is followed by one or more literal whitespace characters, the \ and all of that whitespace are discarded. For example, "Hello World" and "Hello \ World" are semantically identical. See whitespace and newlines for how whitespace is defined.

Note that only literal whitespace is escaped; whitespace escapes (\n and such) are retained. For example, these strings are all semantically identical:

"Hello\       \nWorld"

    "Hello\n\
    World"

"Hello\nWorld"

"
  Hello
  World
  "

Invalid escapes

Except as described in the escapes table, above, \ MUST NOT precede any other characters in a string.
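
The escape rules of this section (the escapes table, escaped whitespace, and the rejection of any other escape) can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function name `unescape` is hypothetical, it operates on the text between the quotes of a single-line Quoted String, and it does not perform multi-line dedenting.

```python
import re

WHITESPACE = (
    "\u0009\u000B\u0020\u00A0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A"
    "\u202F\u205F\u3000"
)
NEWLINES = "\u000D\u000A\u0085\u000C\u2028\u2029"
SIMPLE_ESCAPES = {
    "n": "\n", "r": "\r", "t": "\t", "\\": "\\", '"': '"',
    "b": "\b", "f": "\f", "s": " ",
}
UNICODE_ESCAPE = re.compile(r"u\{([0-9a-fA-F]{1,6})\}")


def unescape(body: str) -> str:
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        i += 1  # step past the backslash
        if i >= len(body):
            raise ValueError("dangling backslash")
        esc = body[i]
        if esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
            i += 1
        elif esc == "u":
            m = UNICODE_ESCAPE.match(body, i)
            if m is None:
                raise ValueError("malformed unicode escape")
            cp = int(m.group(1), 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError("escape is not a Unicode Scalar Value")
            out.append(chr(cp))
            i = m.end()
        elif esc in WHITESPACE or esc in NEWLINES:
            # Whitespace escape: discard the backslash and the whole run
            # of literal whitespace and newlines that follows it.
            while i < len(body) and (body[i] in WHITESPACE or body[i] in NEWLINES):
                i += 1
        else:
            raise ValueError(f"invalid escape: \\{esc}")
    return "".join(out)
```

With this sketch, the bodies `Hello\       \nWorld` and `Hello\nWorld` both produce the same value, matching the equivalence described above.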

Raw String

Raw Strings in KDL are much like Quoted Strings, except they do not support \-escapes. They otherwise share the same properties as far as literal Newline characters go, multi-line rules, and the requirement of UTF-8 representation.

Raw String literals are represented with one or more # characters, followed by ", followed by any number of UTF-8 literals. The string is then closed by a " followed by a matching number of # characters. This means that the body of the string must not contain a " followed by the same number of # characters as the opening delimiter, or more, since that sequence would close the string.

Like other Strings, Raw Strings MUST NOT include any of the disallowed literal code-points as code points in their body. Unlike with Quoted Strings, these cannot simply be escaped, and are thus unrepresentable when using Raw Strings.

Example

just-escapes #"\n will be literal"#

The string contains the literal characters \n will be literal.

quotes-and-escapes ##"hello\n\r\asd"#world"##

The string contains the literal characters hello\n\r\asd"#world
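
The delimiter matching can be sketched in Python. The function name `read_raw_string` is hypothetical, and the sketch only locates the body: it does not apply the multi-line rules or check for disallowed code points.

```python
def read_raw_string(text: str, start: int = 0):
    """Read a Raw String literal that begins at text[start].

    Returns (body, end), where `end` is the index just past the closing
    delimiter.
    """
    i = start
    while i < len(text) and text[i] == "#":
        i += 1
    hashes = i - start
    if hashes == 0 or i >= len(text) or text[i] != '"':
        raise ValueError("not a raw string")
    # The string closes at the first quote followed by the same number
    # of # characters as the opening delimiter.
    closer = '"' + "#" * hashes
    end = text.find(closer, i + 1)
    if end == -1:
        raise ValueError("unterminated raw string")
    return text[i + 1:end], end + len(closer)


body, _ = read_raw_string(r'##"hello\n\r\asd"#world"##')
print(body)  # hello\n\r\asd"#world
```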

Multi-line Strings

When a Quoted or Raw String spans multiple lines with literal, non-escaped Newlines, it follows a special multi-line syntax that automatically "dedents" the string, allowing its value to be indented to a visually matching level if desired.

A Multi-line string MUST start with a Newline immediately following its opening ". Its final line MUST contain only whitespace, followed by a single closing ". All in-between lines that contain non-newline characters MUST start with the exact same whitespace as the final line (precisely matching codepoints, not merely counting characters).

The value of the Multi-line String omits the first and last Newline, the Whitespace of the last line, and the matching Whitespace prefix on all intermediate lines. The first and last Newline can be the same character (that is, empty multi-line strings are legal).

Strings with literal Newlines that do not immediately start with a Newline and whose final " is not preceded by optional whitespace and a Newline are illegal.

In other words, the final line specifies the whitespace prefix that will be removed from all other lines.

It is a syntax error for any body lines of the multi-line string to not match the whitespace prefix of the last line with the final quote.

Newline Normalization

Literal Newline sequences in Multi-line Strings must be normalized to a single U+000A (LF) during deserialization. This means, for example, that CR LF becomes a single LF during parsing.

This normalization does not apply to non-literal Newlines entered using escape sequences.

For clarity: this normalization is for individual sequences. That is, the literal sequence CRLF CRLF becomes LF LF, not LF.
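
The dedent and normalization rules above can be sketched in Python. This is a sketch under stated assumptions: the function name `dedent_multiline` is hypothetical, `raw` is everything between the opening and closing quotes, and backslash escapes are not handled, so it applies as written to Raw Strings or to Quoted Strings whose whitespace escapes have already been resolved.

```python
import re

WHITESPACE = (
    "\u0009\u000B\u0020\u00A0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A"
    "\u202F\u205F\u3000"
)
# CRLF comes first so that it is consumed as one Newline, not two.
NEWLINE_RE = re.compile("\r\n|[\r\n\u0085\u000C\u2028\u2029]")


def dedent_multiline(raw: str) -> str:
    lines = NEWLINE_RE.split(raw)
    if len(lines) < 2 or lines[0] != "":
        raise ValueError("a multi-line string must start with a Newline")
    prefix = lines[-1]
    if any(ch not in WHITESPACE for ch in prefix):
        raise ValueError("the final line must contain only whitespace")
    body = []
    for line in lines[1:-1]:
        if all(ch in WHITESPACE for ch in line):
            body.append("")  # whitespace-only lines become empty
        elif line.startswith(prefix):
            body.append(line[len(prefix):])
        else:
            raise ValueError("line does not start with the final line's prefix")
    # Joining with LF normalizes every literal Newline sequence.
    return "\n".join(body)
```

Splitting on Newline sequences and rejoining with a single LF handles normalization and dedenting in one pass, and comparing prefixes with `startswith` enforces exact code point matches rather than counting characters.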

Example

multi-line "
        foo
    This is the base indentation
            bar
    "

This example's string value will be:

    foo
This is the base indentation
        bar

which is equivalent to "    foo\nThis is the base indentation\n        bar" when written as a single-line string.


If the last line wasn't indented as far, it won't dedent the rest of the lines as much:

multi-line "
        foo
    This is no longer on the left edge
            bar
  "

This example's string value will be:

      foo
  This is no longer on the left edge
          bar

Equivalent to "      foo\n  This is no longer on the left edge\n          bar".


Empty lines can contain any whitespace, or none at all, and will be reflected as empty in the value:

multi-line "
    Indented a bit.

    A second indented paragraph.
    "

This example's string value will be:

Indented a bit.

A second indented paragraph.

Equivalent to "Indented a bit.\n\nA second indented paragraph."


The following yield syntax errors:

multi-line "
  closing quote with non-whitespace prefix"
multi-line "stuff
  "
// Every line must share the exact same prefix as the closing line.
multi-line "[\n]
[tab]a[\n]
[space][space]b[\n]
[space][tab][\n]
[tab]"

Interaction with Whitespace Escapes

Multi-line strings support the same mechanism for escaping whitespace. When processing a Multi-line String, implementations MUST dedent the string after resolving all whitespace escapes, but before resolving other backslash escapes. Furthermore, a whitespace escape that attempts to escape the final line's newline and/or whitespace prefix is invalid since the multi-line string has to still be valid with the escaped whitespace removed.

For example, the following example is illegal:

  // Equivalent to trying to write a string containing `foo\nbar\`.
  "
  foo
  bar\
  "

while the following example is allowed:

  "
  foo \
bar
  baz
  \   "
  // this is equivalent to
  "
  foo bar
  baz
  "

Number

Numbers in KDL represent numerical Values. There is no logical distinction in KDL between real numbers, integers, and floating point numbers. It's up to individual implementations to determine how to represent KDL numbers.

There are five syntaxes for Numbers: Keywords, Decimal, Hexadecimal, Octal, and Binary.

  • All non-Keyword numbers may optionally start with one of - or +, which determine whether they'll be positive or negative.
  • Binary numbers start with 0b and only allow 0 and 1 as digits, which may be separated by _. They represent numbers in radix 2.
  • Octal numbers start with 0o and only allow digits between 0 and 7, which may be separated by _. They represent numbers in radix 8.
  • Hexadecimal numbers start with 0x and allow digits between 0 and 9, as well as letters A through F, in either lower or upper case, which may be separated by _. They represent numbers in radix 16.
  • Decimal numbers are a bit more special:
    • They have no radix prefix.
    • They use digits 0 through 9, which may be separated by _.
    • They may optionally include a decimal separator ., followed by more digits, which may again be separated by _.
    • They may optionally be followed by E or e, an optional - or +, and more digits, to represent an exponent value.

Note that, similar to JSON and some other languages, numbers without an integer digit (such as .1) are illegal. They must be written with at least one integer digit, like 0.1. (These patterns are also disallowed from Identifier Strings, to avoid confusion.)
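
The non-Keyword number syntaxes can be sketched as a Python parser. Mapping integers to `int` and everything else to `float` is an assumption made for this example, since the choice of representation is left to implementations. The names `NUMBER_RE` and `parse_number` are hypothetical.

```python
import re

NUMBER_RE = re.compile(
    r"""
    (?P<sign>[+-])?
    (?:
        0x(?P<hex>[0-9a-fA-F][0-9a-fA-F_]*)
      | 0o(?P<oct>[0-7][0-7_]*)
      | 0b(?P<bin>[01][01_]*)
      | (?P<dec>[0-9][0-9_]*
            (?:\.[0-9][0-9_]*)?
            (?:[eE][+-]?[0-9][0-9_]*)?)
    )
    """,
    re.VERBOSE,
)


def parse_number(text: str):
    m = NUMBER_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a KDL number: {text!r}")
    negative = m.group("sign") == "-"
    for group, radix in (("hex", 16), ("oct", 8), ("bin", 2)):
        digits = m.group(group)
        if digits is not None:
            value = int(digits.replace("_", ""), radix)
            return -value if negative else value
    digits = m.group("dec").replace("_", "")
    # A fraction or exponent makes it a float in this sketch.
    is_float = any(c in digits for c in ".eE")
    value = float(digits) if is_float else int(digits)
    return -value if negative else value


print(parse_number("1_000"))   # 1000
print(parse_number("-0xFF"))   # -255
```

Input such as `.1` does not match the pattern and raises `ValueError`, in line with the requirement of at least one integer digit.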

Keyword Numbers

There are three special "keyword" numbers included in KDL to accommodate the widespread use of IEEE 754 floats:

  • #inf - floating point positive infinity.
  • #-inf - floating point negative infinity.
  • #nan - floating point NaN/Not a Number.

To go along with this and prevent foot guns, the bare Identifier Strings inf, -inf, and nan are considered illegal identifiers and should yield a syntax error.
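
In Python, the keyword numbers map naturally onto the `math` module's float constants. The sketch below assumes the token has already been isolated by a tokenizer, and the names `KEYWORD_NUMBERS` and `parse_keyword_number` are hypothetical.

```python
import math

KEYWORD_NUMBERS = {"#inf": math.inf, "#-inf": -math.inf, "#nan": math.nan}
ILLEGAL_BARE = {"inf", "-inf", "nan"}


def parse_keyword_number(token: str) -> float:
    if token in ILLEGAL_BARE:
        # The bare spellings are neither numbers nor legal identifiers.
        raise ValueError(f"{token!r} is illegal; write #{token} instead")
    if token not in KEYWORD_NUMBERS:
        raise ValueError(f"not a keyword number: {token!r}")
    return KEYWORD_NUMBERS[token]
```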

Boolean

A boolean Value is either the symbol #true or #false. These SHOULD be represented by implementations as boolean logical values, or some approximation thereof.

Example

my-node #true value=#false

Null

The symbol #null represents a null Value. It's up to the implementation to decide how to represent this, but it generally signals the "absence" of a value.

Example

my-node #null key=#null

Whitespace

The following characters should be treated as non-Newline white space:

| Name | Code Pt |
| --- | --- |
| Character Tabulation | U+0009 |
| Line Tabulation | U+000B |
| Space | U+0020 |
| No-Break Space | U+00A0 |
| Ogham Space Mark | U+1680 |
| En Quad | U+2000 |
| Em Quad | U+2001 |
| En Space | U+2002 |
| Em Space | U+2003 |
| Three-Per-Em Space | U+2004 |
| Four-Per-Em Space | U+2005 |
| Six-Per-Em Space | U+2006 |
| Figure Space | U+2007 |
| Punctuation Space | U+2008 |
| Thin Space | U+2009 |
| Hair Space | U+200A |
| Narrow No-Break Space | U+202F |
| Medium Mathematical Space | U+205F |
| Ideographic Space | U+3000 |

Single-line comments

Any text after //, until the next literal Newline, is "commented out" and is considered to be Whitespace.

Multi-line comments

In addition to single-line comments using //, comments can also be started with /* and ended with */. These comments can span multiple lines. They are allowed in all positions where Whitespace is allowed and can be nested.
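
Because multi-line comments nest, a scanner has to track depth rather than stop at the first `*/`. A minimal Python sketch, with the hypothetical function name `skip_multiline_comment`:

```python
def skip_multiline_comment(text: str, start: int) -> int:
    """Return the index just past the comment that opens at text[start]."""
    if not text.startswith("/*", start):
        raise ValueError("no multi-line comment at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1  # a nested comment opens
            i += 2
        elif text.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i  # the outermost comment has closed
        else:
            i += 1
    raise ValueError("unterminated multi-line comment")


text = "a /* outer /* inner */ still outer */ b"
print(repr(text[skip_multiline_comment(text, 2):]))  # ' b'
```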

Slashdash comments

Finally, a special kind of comment called a "slashdash", denoted by /-, can be used to comment out entire components of a KDL document logically, and have those elements be treated as whitespace.

Slashdash comments can be used before:

  • A Node name (or its type annotation): the entire Node is treated as Whitespace, including all props, args, and children.
  • A node Argument (or its type annotation), in which case the Argument value is treated as Whitespace.
  • A Property key, in which case the entire property, both key and value, is treated as Whitespace.
  • A Children Block, in which case the entire block, including all children within, is treated as Whitespace.

Newline

The following characters should be treated as new lines:

| Acronym | Name | Code Pt |
| --- | --- | --- |
| CR | Carriage Return | U+000D |
| LF | Line Feed | U+000A |
| CRLF | Carriage Return and Line Feed | U+000D + U+000A |
| NEL | Next Line | U+0085 |
| FF | Form Feed | U+000C |
| LS | Line Separator | U+2028 |
| PS | Paragraph Separator | U+2029 |

Note that for the purpose of new lines, CRLF is considered a single newline.
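
One way to honor the CRLF rule when splitting text into lines is to try the two-character sequence before the single characters. A small Python sketch, with the hypothetical names `NEWLINE_RE` and `split_lines`:

```python
import re

# CRLF is listed first so that it is consumed as one Newline, not two.
NEWLINE_RE = re.compile("\r\n|[\r\n\u0085\u000C\u2028\u2029]")


def split_lines(text: str):
    """Split `text` on every Newline recognized by KDL."""
    return NEWLINE_RE.split(text)


print(split_lines("a\r\nb\rc\u2028d"))  # ['a', 'b', 'c', 'd']
```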

Disallowed Literal Code Points

The following code points may not appear literally anywhere in the document. They may be represented in Strings (but not Raw Strings) using \u{}.

  • The codepoints U+0000-0008 or the codepoints U+000E-001F (various control characters).
  • U+007F (the Delete control character).
  • Any codepoint that is not a Unicode Scalar Value (U+D800-DFFF).
  • U+200E-200F, U+202A-202E, and U+2066-2069, the Unicode "direction control" characters.
  • U+FEFF, aka Zero-width Non-breaking Space (ZWNBSP)/Byte Order Mark (BOM), except as the first code point in a document.
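
The list above can be turned into a simple document scan. This Python sketch assumes the document has already been decoded into a string, and the names `is_disallowed` and `find_disallowed` are hypothetical.

```python
def is_disallowed(cp: int) -> bool:
    """True if code point `cp` may not appear literally in a document."""
    return (
        cp <= 0x08 or 0x0E <= cp <= 0x1F      # control characters
        or cp == 0x7F                         # Delete
        or 0xD800 <= cp <= 0xDFFF             # not Unicode Scalar Values
        or 0x200E <= cp <= 0x200F             # direction control
        or 0x202A <= cp <= 0x202E
        or 0x2066 <= cp <= 0x2069
        or cp == 0xFEFF                       # BOM, handled by the caller
    )


def find_disallowed(document: str):
    """Return the index of the first disallowed code point, or None."""
    for index, ch in enumerate(document):
        if index == 0 and ch == "\uFEFF":
            continue  # a BOM is permitted as the first code point
        if is_disallowed(ord(ch)):
            return index
    return None
```

A parser could run such a check up front, or fold it into tokenization so that the error position is reported alongside other syntax errors.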

Full Grammar

This is the full official grammar for KDL and should be considered authoritative if something seems to disagree with the text above. The grammar language syntax is defined below.

document := bom? nodes

nodes := (line-space* node)* line-space*

plain-line-space := newline | ws | single-line-comment
plain-node-space := ws* escline ws* | ws+

line-space := plain-line-space+ | '/-' plain-node-space* node
node-space := plain-node-space+ ('/-' plain-node-space* (node-prop-or-arg | node-children))?

required-node-space := node-space* plain-node-space+
optional-node-space := node-space*

base-node := type? optional-node-space string (required-node-space node-prop-or-arg)* (required-node-space node-children)?
node := base-node optional-node-space node-terminator
final-node := base-node optional-node-space node-terminator?
node-prop-or-arg := prop | value
node-children := '{' nodes final-node? '}'
node-terminator := single-line-comment | newline | ';' | eof

prop := string optional-node-space equals-sign optional-node-space value
value := type? optional-node-space (string | number | keyword)
type := '(' optional-node-space string optional-node-space ')'

equals-sign := See Table ([Equals Sign](#equals-sign))

string := identifier-string | quoted-string | raw-string

identifier-string := unambiguous-ident | signed-ident | dotted-ident
unambiguous-ident := ((identifier-char - digit - sign - '.') identifier-char*) - 'true' - 'false' - 'null' - 'inf' - '-inf' - 'nan'
signed-ident := sign ((identifier-char - digit - '.') identifier-char*)?
dotted-ident := sign? '.' ((identifier-char - digit) identifier-char*)?
identifier-char := unicode - unicode-space - newline - [\\/(){};\[\]"#] - disallowed-literal-code-points - equals-sign

quoted-string := '"' (single-line-string-body | newline multi-line-string-body newline unicode-space*) '"'
single-line-string-body := (string-character - newline)*
multi-line-string-body := string-character*
string-character := '\' escape | [^\\"] - disallowed-literal-code-points
escape := ["\\bfnrts] | 'u{' hex-digit{1, 6} '}' | (unicode-space | newline)+
hex-digit := [0-9a-fA-F]

raw-string := '#' raw-string-quotes '#' | '#' raw-string '#'
raw-string-quotes := '"' (single-line-raw-string-body | newline multi-line-raw-string-body newline unicode-space*) '"'
single-line-raw-string-body := (unicode - newline - disallowed-literal-code-points)*
multi-line-raw-string-body := (unicode - disallowed-literal-code-points)*

number := keyword-number | hex | octal | binary | decimal

decimal := sign? integer ('.' integer)? exponent?
exponent := ('e' | 'E') sign? integer
integer := digit (digit | '_')*
digit := [0-9]
sign := '+' | '-'

hex := sign? '0x' hex-digit (hex-digit | '_')*
octal := sign? '0o' [0-7] [0-7_]*
binary := sign? '0b' ('0' | '1') ('0' | '1' | '_')*

keyword := boolean | '#null'

keyword-number := '#inf' | '#-inf' | '#nan'

boolean := '#true' | '#false'

escline := '\\' ws* (single-line-comment | newline | eof)

newline := See Table (All line-break white_space)

ws := unicode-space | multi-line-comment

bom := '\u{FEFF}'

disallowed-literal-code-points := See Table (Disallowed Literal Code Points)

unicode := Any Unicode Scalar Value

unicode-space := See Table (All [White_Space](#whitespace) unicode characters which are not `newline`)

single-line-comment := '//' ^newline* (newline | eof)
multi-line-comment := '/*' commented-block
commented-block := '*/' | (multi-line-comment | '*' | '/' | [^*/]+) commented-block

Grammar language

The grammar language syntax is a combination of ABNF with some regex spice thrown in. Specifically:

  • Single quotes (') are used to denote literal text. \ within a literal string is used for escaping other single-quotes, for initiating unicode characters using hex values (\u{FEFF}), and for escaping \ itself (\\).
  • * is used for "zero or more", + is used for "one or more", and ? is used for "zero or one".
  • () can be used to group matches that must be matched together.
  • a | b means a or b, whichever matches first. If multiple items are before a |, they are a single group. a b c | d is equivalent to (a b c) | d.
  • [] are used for regex-style character matches, where any character between the brackets will be a single match. \ is used to escape \, [, and ]. They also support character ranges (0-9), and negation (^).
  • - is used for "except for" or "minus" whatever follows it. For example, a - 'x' means "any a, except something that matches the literal 'x'".
  • The prefix ^ means "something that does not match" whatever follows it. For example, ^foo means "must not match foo".