Whitespace separated literals (WSL)
===================================
WSL is a human and programmer friendly text format for relational
databases. WSL's design goals are explained in-depth on the project
webpage at http://jstimpfle.de/projects/wsl/main.html. A sample database
can be found at http://jstimpfle.de/projects/wsl/world_x.wsl.txt.
WSL can be explained as consisting of two parts: 1) a schema language
and 2) a database tuple notation which takes advantage of schema
information to enable a very clean and scripting-friendly lexical
syntax. Other extensions, for example query languages, can be added, but
are not part of this specification.
STRUCTURE
=========
A convenient way to represent Unicode is needed for practical reasons.
Therefore WSL databases must be UTF-8 encoded; no other encodings are
supported. http://utf8everywhere.org gives many good reasons for UTF-8
only.
A WSL database is normally contained in a single file or stream of
bytes. Lines are terminated by single newline (0x0a) characters. The
stream contains an optional inline database schema followed by the
relational data. The schema language is discussed in DATABASE SCHEMA
LANGUAGE. The relational data notation is explained in DATABASE TUPLES.
(Obvious variations are files or streams of only database tuples with a
separately provided schema, or databases split into one file per table).
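As a rough illustration of the layout described above, the following Python sketch separates the inline schema lines from the tuple lines of a database given as bytes. The function name split_database is hypothetical, and the sketch assumes that the inline schema is exactly the leading run of lines starting with '%'.

```python
def split_database(data: bytes):
    """Split a WSL database into (schema_lines, tuple_lines).

    Assumption: the inline schema is the leading run of lines that start
    with '%'; everything after it is tuple data.
    """
    text = data.decode("utf-8")  # raises UnicodeDecodeError on non-UTF-8 input
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the final newline terminates the last line
    schema = []
    i = 0
    while i < len(lines) and lines[i].startswith("%"):
        line = lines[i]
        if line != "%" and not line.startswith("% "):
            raise ValueError("expected one space after '%%' on line %d" % (i + 1))
        schema.append(line[2:])  # drop the '% ' prefix
        i += 1
    return schema, lines[i:]
```

The returned schema lines no longer carry the '%' prefix, so they can be handed to the same code that would read a separately provided schema.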
DATABASE SCHEMA LANGUAGE
========================
The schema language supports declaration of user-defined domains (column
data types), tables, and database integrity constraints (primary and
foreign keys). These are the most essential ingredients of relational
databases. The following schema is taken from the database referenced
above. The lines are prefixed with '%' as required for inline notation.
(The prefixes are not part of the schema language. They are omitted if
the schema is provided separately).
    % DOMAIN CityID Integer
    % DOMAIN CityName String
    % DOMAIN CountryCode Atom
    % DOMAIN CountryCode2 Atom
    % DOMAIN CountryName String
    % DOMAIN District String
    % DOMAIN Language String
    % DOMAIN IsOfficial Enum T F
    % DOMAIN Percentage Atom # decimal type not implemented
    % DOMAIN Population Integer
    % TABLE City CityID CityName CountryCode District Population
    % TABLE Country CountryCode CountryCode2 CountryName
    % TABLE Capital CountryCode CityID
    % TABLE Language CountryCode Language IsOfficial Percentage
    % KEY UniqueCountryCode Capital CC *
    % REFERENCE CapitalCountry Capital CC * => Country CC * *
    % REFERENCE CapitalCity Capital * CID => City CID * * * *
    % REFERENCE LanguageCountry Language CC * * * => Country CC * *
Explanation of this language follows.
IDENTIFIERS
-----------
Identifiers are used at various places in the schema language; they must
match the pattern [a-zA-Z][a-zA-Z0-9_]* and must be separated from other
tokens by space (0x20) characters.
An "uppercase identifier" is an identifier without lowercase characters.
A "lowercase identifier" is an identifier without uppercase characters.
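The identifier rules above translate directly into a regular expression check. The following Python sketch is one possible way to test them; the function names are hypothetical.

```python
import re

# Pattern taken from the identifier definition above.
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")

def is_identifier(s: str) -> bool:
    return IDENTIFIER.fullmatch(s) is not None

def is_uppercase_identifier(s: str) -> bool:
    # An identifier without lowercase characters (digits and '_' are fine).
    return is_identifier(s) and not any(c.islower() for c in s)

def is_lowercase_identifier(s: str) -> bool:
    # An identifier without uppercase characters.
    return is_identifier(s) and not any(c.isupper() for c in s)
```

Note that an identifier such as "CityID" is neither an uppercase nor a lowercase identifier, since it mixes both cases.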
LEXICAL SYNTAX OF THE SCHEMA LANGUAGE
-------------------------------------
A schema is encoded as a sequence of statements, where each statement
stands on its own line. In inline database notation, each line must
start with a '%' character. If the rest of the line is not empty, it
must be separated from the '%' character by one space character. The '%'
prefix is not part of the schema language and must not be used when the
schema is given separately.
The first identifier (separated by a single following space, or the end
of the line) is the statement type which decides how to interpret the
content of the line after it. For extensibility, if a statement type is
not known, the line is ignored. The statement types defined here are
DOMAIN, TABLE, KEY, and REFERENCE. Their respective syntaxes are
discussed in the following.
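The dispatch on the statement type could be sketched as follows in Python. The function name parse_statements is hypothetical; the input is assumed to be schema lines with any '%' prefix already removed, and empty lines are assumed to be skipped.

```python
KNOWN_STATEMENTS = ("DOMAIN", "TABLE", "KEY", "REFERENCE")

def parse_statements(schema_lines):
    """Return (statement_type, rest_of_line) pairs for the known statements."""
    statements = []
    for line in schema_lines:
        if line == "":
            continue  # assumption: empty schema lines carry no statement
        # The statement type ends at the first space or at the end of the line.
        kind, _, rest = line.partition(" ")
        if kind not in KNOWN_STATEMENTS:
            continue  # unknown statement types are ignored for extensibility
        statements.append((kind, rest))
    return statements
```

Each pair can then be passed to a handler for that statement type, which interprets the rest of the line according to the sections that follow.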
DOMAIN DECLARATIONS
-------------------
Example:
    % DOMAIN IsOfficial Enum T F
A DOMAIN line declares a new domain (column type). After the DOMAIN
token there come two more identifiers. The first identifier is the name
of the new domain (here, "IsOfficial"). The second identifier is the
name of a domain parser (here, the `Enum' domain parser) which is
optionally parameterized by the remaining line (here, "T F").
Domain parsers take the remaining line and return domain *objects*,
which in turn hold a value decoder and a value encoder (to decode /
encode values in database tuples).
Since domain parsers roughly correspond to the (data-) types of the
internal representations of database values, the terms "domain parser"
and "(data-) type" are often used interchangeably.
Each domain parser implementation should also provide sensible
pre-defined domain objects. If there is only one obvious implementation,
the domain object should have the same name as the domain parser, if
possible.
There are a number of standard domain parsers and pre-defined domain
objects that every conforming implementation must provide. These are
discussed in STANDARD DATATYPES.
Library implementations must allow users of the API to provide custom
datatypes for flexibility.
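To make the relationship between domain parsers and domain objects more concrete, here is a minimal Python sketch. All names (DomainObject, enum_parser, DOMAIN_PARSERS, declare_domain) are hypothetical, and the decoder works on a single already-isolated token, which is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class DomainObject:
    decode: Callable[[str], object]  # token text -> internal value
    encode: Callable[[object], str]  # internal value -> token text

def enum_parser(params: str) -> DomainObject:
    """Domain parser for Enum: params is the space separated member list."""
    members = params.split()

    def decode(token: str) -> object:
        if token not in members:
            raise ValueError("not a member of the enumeration: %r" % token)
        return token

    def encode(value: object) -> str:
        return str(decode(str(value)))  # same validity check in both directions

    return DomainObject(decode, encode)

# Registry of domain parsers; API users could add their own entries here.
DOMAIN_PARSERS: Dict[str, Callable[[str], DomainObject]] = {"Enum": enum_parser}

def declare_domain(rest: str, domains: Dict[str, DomainObject]) -> None:
    """Handle the text after the DOMAIN token, e.g. 'IsOfficial Enum T F'."""
    parts = rest.split(" ", 2)
    if len(parts) < 2:
        raise ValueError("DOMAIN needs a domain name and a domain parser name")
    name, parser_name = parts[0], parts[1]
    params = parts[2] if len(parts) == 3 else ""
    if parser_name not in DOMAIN_PARSERS:
        raise ValueError("unknown domain parser: %s" % parser_name)
    domains[name] = DOMAIN_PARSERS[parser_name](params)
```

Keeping the parsers in a plain dictionary is one simple way to let library users register custom datatypes.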
DOMAIN PARAMETERIZATION
-----------------------
As mentioned, what kind of domain object is returned from a domain
parser is controlled by the remaining characters in the DOMAIN
declaration line. This is called *domain parameterization*.
The parameter syntax differs between domain parsers, but there is a
recommended standard syntax. See RECOMMENDED DOMAIN PARSER
PARAMETERIZATION SYNTAX.
There are two intended use cases for parameterization: specifying
lexical syntax and specifying value constraints.
An example for parameterization of lexical syntax is the standard
"String" type which can be parameterized to recognize different styles
of string literals, like the [default style] and the "C style".
An example for parameterization of value constraints is the "Int" type
which can be parameterized to allow only values in a specified range.
Database values have domain dependent internal representation. For
example, String values might be implemented by contiguous memory
sequences, and Int values by machine integers. But the representation
must not be parameterizable: it must be the same for all domain objects
returned from the same domain parser. The rationale for this is that it
should be possible to efficiently "cast" values from distinct domains
created by the same parser to a most general version and to apply
generic functions to them. In the case of a datatype with string
representation, examples are comparison or concatenation functions.
RECOMMENDED DOMAIN PARSER PARAMETERIZATION SYNTAX
-------------------------------------------------
The parameterization syntax should if possible follow the style
    % DOMAIN Foo Bar arg1 arg2 opt1=foo opt2=bar
I.e. space separated words, where each word is a lowercase identifier
optionally followed by an equal sign and a user-chosen argument with
suitable syntax (preferably printable ASCII).
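A domain parser that follows the recommended style could split its parameter string roughly like this. The Python function parse_parameters is a hypothetical helper; it assumes that argument values contain no spaces.

```python
import re

# A lowercase identifier, optionally followed by '=' and an argument.
WORD = re.compile(r"([a-z][a-z0-9_]*)(?:=(.*))?")

def parse_parameters(params: str):
    """Split a parameter string into (plain_words, options)."""
    args = []
    opts = {}
    for word in params.split(" "):
        if word == "":
            continue  # tolerate an empty parameter string
        m = WORD.fullmatch(word)
        if m is None:
            raise ValueError("not in the recommended syntax: %r" % word)
        name, value = m.group(1), m.group(2)
        if value is None:
            args.append(name)
        else:
            opts[name] = value
    return args, opts
```

For the line shown above, the remaining text "arg1 arg2 opt1=foo opt2=bar" yields the plain words arg1 and arg2 and the options opt1 and opt2.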
UNIQUE KEY CONSTRAINTS
----------------------
Example:
    % TABLE Person PersonId Comment
    % KEY UniquePersonId Person P *
A unique key constraint is declared by a KEY statement. After the KEY
token comes the identifying name of the key constraint (which must be an
identifier). After that comes the name of a table and one more token for
each column in the table. Each column token should be either "*" or a
variable name (uppercase identifier). All identifiers in a KEY
declaration line must be distinct. The interpretation is that those
columns which are paired with variables form the unique key. The example
specifies that the Person table (which has two columns) is unique in the
first column. That means that no two rows of the Person table share the
same PersonId.
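Checking a unique key over already-decoded rows could look like the following Python sketch. The function names are hypothetical, and rows are assumed to be tuples of hashable values.

```python
def key_columns(column_tokens):
    """Indices of the columns paired with a variable (anything but '*')."""
    return [i for i, tok in enumerate(column_tokens) if tok != "*"]

def violates_key(rows, column_tokens):
    """True if two rows agree on all key columns."""
    idx = key_columns(column_tokens)
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in idx)
        if key in seen:
            return True
        seen.add(key)
    return False
```

For the KEY line in the example, the column tokens are ["P", "*"], so only the first column of each Person row takes part in the comparison.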
FOREIGN KEY CONSTRAINTS
-----------------------
A foreign key constraint is declared by a line starting with the token
REFERENCE. After the REFERENCE token come the identifying name of the
constraint and two table-column token sequences, as described in the
UNIQUE KEY CONSTRAINTS section. These two sequences must be separated by
a => token. The variables must be distinct within each sequence, and the
same variables must be used in both sequences. For example,
    % REFERENCE Friend1Person Friends P * => Person P * *
    % REFERENCE Friend2Person Friends * P => Person P * *
declares one foreign key constraint from table Friends (first column) to
Person (first column), and one foreign key constraint from Friends
(second column) to Person (first column). This
    % REFERENCE Friend2Person Friends P P => Person P * *
is invalid since each variable can be used only once on each side. This
    % REFERENCE PermRepo Perm D R * * => Repo D R *
declares a multi-column foreign key constraint from Perm to Repo.
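The variable rules for REFERENCE lines, and the check they imply, can be sketched in Python as follows. The function names are hypothetical; parse_reference takes the text after the REFERENCE token, and rows are assumed to be tuples of already-decoded values.

```python
def parse_reference(rest: str):
    """Parse e.g. 'PermRepo Perm D R * * => Repo D R *'."""
    name, _, body = rest.partition(" ")
    left, sep, right = body.partition(" => ")
    if not sep:
        raise ValueError("missing '=>' token")
    ltable, *lcols = left.split(" ")
    rtable, *rcols = right.split(" ")
    lvars = [c for c in lcols if c != "*"]
    rvars = [c for c in rcols if c != "*"]
    if len(set(lvars)) != len(lvars) or len(set(rvars)) != len(rvars):
        raise ValueError("a variable may be used only once on each side")
    if set(lvars) != set(rvars):
        raise ValueError("both sides must use the same variables")
    return name, (ltable, lcols), (rtable, rcols)

def violates_reference(local_rows, lcols, foreign_rows, rcols):
    """True if some local row has no matching row in the foreign table."""
    variables = sorted(c for c in lcols if c != "*")
    lidx = [lcols.index(v) for v in variables]
    ridx = [rcols.index(v) for v in variables]
    foreign = {tuple(r[i] for i in ridx) for r in foreign_rows}
    return any(tuple(r[i] for i in lidx) not in foreign for r in local_rows)
```

Matching columns by variable name rather than by position is what allows the second column of Friends to refer to the first column of Person.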
STANDARD DATATYPES
==================
ID
--
An ID is a simple one-word identifier. There are no delimiters (apart
from the surrounding spaces) and there is no escaping. IDs are meant to
serve as keys in relations. But they can help out for almost anything
which does not contain space characters if no more suitable type is
available, for example URLs or numbers.
Parameterization:
- if unparameterized, only IDs matching the pattern
[a-zA-Z][a-zA-Z0-9_]* are recognized.
- Further parameterization to be defined.
The value ordering is the lexical ordering of the bytes that make up the
UTF-8 encoded ID.
An unparameterized ID domain object must be pre-defined with name "ID".
String
------
Strings are arbitrary UTF-8 encoded text, but typically hold a short
sequence of words. Without parameterization, they are encoded as
literals between square brackets in [this style] without any escaping
defined. In this mode it's not possible to encode any of the characters
[, \, and ] (0x5b - 0x5d).
If the parameterization string contains the keyword "escape", the
following backslash escape sequences are recognized:
- \xHH Hexadecimal byte (lower case, 0x0a but not 0x0A).
- \uDDDD Four-digit Unicode code point (with leading zeroes).
- \UDDDDDDDD Eight-digit Unicode code point (with leading zeroes).
Other literal styles might be added, like "C style string literals".
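The bracket literal and the escape sequences listed above could be decoded along these lines. This Python sketch makes several assumptions that the text does not settle: the digits of \u and \U are lowercase hexadecimal like those of \xHH, bytes given by \xHH must combine into valid UTF-8, and a raw '[' is rejected inside a literal in both modes. The function name decode_string is hypothetical.

```python
import re

ESCAPES = {"x": 2, "u": 4, "U": 8}  # escape letter -> number of digits
HEX = re.compile(r"[0-9a-f]+")      # assumption: lowercase hex digits only

def decode_string(line: str, pos: int = 0, escape: bool = False):
    """Decode a [bracketed] literal starting at line[pos].

    Returns (value, position just after the closing bracket).
    """
    if line[pos:pos + 1] != "[":
        raise ValueError("expected '['")
    out = bytearray()
    i = pos + 1
    while i < len(line):
        c = line[i]
        if c == "]":
            return out.decode("utf-8"), i + 1
        if c == "[":
            raise ValueError("'[' cannot appear raw inside a literal")
        if c == "\\":
            if not escape:
                raise ValueError("backslash is not allowed without 'escape'")
            kind = line[i + 1:i + 2]
            if kind not in ESCAPES:
                raise ValueError("unknown escape sequence")
            n = ESCAPES[kind]
            digits = line[i + 2:i + 2 + n]
            if len(digits) != n or HEX.fullmatch(digits) is None:
                raise ValueError("bad escape digits")
            code = int(digits, 16)
            out += bytes([code]) if kind == "x" else chr(code).encode("utf-8")
            i += 2 + n
        else:
            out += c.encode("utf-8")
            i += 1
    raise ValueError("unterminated string literal")
```

Returning the end position lets a row parser continue with the next column, which matters because String literals may contain spaces.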
For value constraints, pattern matchers might be added in the future,
for example one for email addresses or one matching user-specified
regular expressions.
An unparameterized String domain object must be pre-defined with name
"String".
Int
---
Machine type signed integer. This has the same lexical representation
and interpretation as in C. An error should be thrown by the parser if a
lexeme can't be represented on the executing machine.
An unparameterized Int domain object must be available with name "Int".
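The range check described above might be sketched as follows in Python. This covers only plain decimal literals and assumes a 64-bit machine integer; whether other C forms such as hexadecimal or octal literals are accepted is not settled by the text above. The function names are hypothetical.

```python
import re

# Assumption: a 64-bit machine integer; the real limits depend on the platform.
INT_MIN, INT_MAX = -2**63, 2**63 - 1

# Assumption: decimal digits with an optional minus sign, no leading zeroes.
DECIMAL = re.compile(r"-?(0|[1-9][0-9]*)")

def decode_int(token: str) -> int:
    if DECIMAL.fullmatch(token) is None:
        raise ValueError("not a decimal integer literal: %r" % token)
    value = int(token)
    if not INT_MIN <= value <= INT_MAX:
        raise ValueError("integer not representable: %s" % token)
    return value

def encode_int(value: int) -> str:
    return str(value)
```

Python integers are unbounded, so the explicit range check stands in for the overflow that a C implementation would have to detect.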
Enum
----
Simple enumeration type. Parameterization is the space separated list of
identifiers that are part of the enumeration. The ordering of these
values is the order in the declaration.
Decimal
-------
TODO: Is it warranted to make this complexity required?
TODO: Specify more precisely.
Represents numbers in decimal (for example 10.1) with optional unit
suffix (for example 1.5kg), fixed number of decimal places and optional
left padded zeroes. Multiple unit levels are possible, for example
1h 03m 34.05s.
Parameterization: A format string like "{1000}h{060}m{060.2}s". This
would mean:
"{1000}": values from 0-999
"{060}": values from 0-59, zero padded to 2 places (as in 59)
"{060.2}s": values from 0-59, zero padded, with 2 places of precision.
A "{}" means any number without zero-padding. Zero-padding is only
possible if a value range is given. "{.03}" means "any number without
padding with 2 places of precision". Only the rightmost number can have
a precision indicated.
No digits are allowed right next to a bracket.
The obvious ordering rules apply.
DATABASE TUPLES
===============
After the header (the optional inline schema), each row of the data is
given on a single line (excluding the terminating newline character) as
a sequence of tokens that are separated by exactly one space (0x20)
character.
The first such token is always the name of the table to which the row
belongs.
The remaining tokens of the line are the values of the row (in order
corresponding to the columns of the table). Each token is parsed
according to the lexical rules created by the domain object of the
current column.
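Putting the rules above together, a row parser can walk along the line and let each column's lexer consume its own token. The Python sketch below is a simplified illustration: the lexer functions and parse_row are hypothetical names, lex_string handles only the unescaped bracket style, and lex_int relies on Python's int(), which is more lenient than a real Int domain would be.

```python
def lex_word(line: str, pos: int):
    """Read a delimiter-free token (e.g. an ID) up to the next space."""
    end = line.find(" ", pos)
    if end == -1:
        end = len(line)
    if end == pos:
        raise ValueError("empty token at column %d" % pos)
    return line[pos:end], end

def lex_string(line: str, pos: int):
    """Read an unescaped [bracketed] String literal."""
    if line[pos:pos + 1] != "[":
        raise ValueError("expected '[' at column %d" % pos)
    end = line.find("]", pos)
    if end == -1:
        raise ValueError("unterminated string literal")
    return line[pos + 1:end], end + 1

def lex_int(line: str, pos: int):
    token, end = lex_word(line, pos)
    return int(token), end  # simplified: int() accepts more than decimal digits

def parse_row(line: str, tables):
    """tables maps a table name to the list of lexers of its columns."""
    name, pos = lex_word(line, 0)
    if name not in tables:
        raise ValueError("unknown table: %s" % name)
    values = []
    for lexer in tables[name]:
        if line[pos:pos + 1] != " ":
            raise ValueError("expected a single space at column %d" % pos)
        value, pos = lexer(line, pos + 1)
        values.append(value)
    if pos != len(line):
        raise ValueError("trailing characters after the last column")
    return name, tuple(values)
```

With a hypothetical table Person whose columns use lex_word and lex_string, the line "Person p1 [likes WSL]" parses into the table name and a two-value row, while a doubled space between tokens is rejected.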