<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="lib/rfc2629.xslt"?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes"?>
<?rfc subcompact="no" ?>
<?rfc linkmailto="no" ?>
<?rfc editing="no" ?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc rfcedstyle="yes"?>
<?rfc-ext allow-markup-in-artwork="yes" ?>
<?rfc-ext include-index="no" ?>

<rfc ipr="trust200902"
     category="exp"
     submissionType="IETF"
     docName="draft-thierry-bulk-04">
  <front>
    <title abbrev="BULK1">Binary Uniform Language Kit 1.0</title>

    <author initials="P." surname="Thierry" fullname="Pierre Thierry">
      <organization>Thierry Technologies</organization>
      <address>
        <email>pierre@nothos.net</email>
      </address>
    </author>

    <date day="31" month="03" year="2024" />
    <keyword>binary</keyword>

    <abstract>
      <t>
        This specification describes a uniform, decentrally extensible and efficient format for
        data serialization.
      </t>
    </abstract>

  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <section title="Rationale">
        <t>
          This specification aims at finding an original trade-off between uniformity, generality,
          extensibility, decentralization, compactness and processing speed for a data format. It is
          our opinion that every widely used existing format occupies a different position than this
          one in the solution space for formats, that none is better on all axes, and that this one
          is the current best on several axes, hence this new design. It is also our opinion that
          some of those existing formats constitute an optimal solution for their specific use case,
          either in an absolute sense, or at least at the time of their design. But the ever-changing
          field of IT now faces new challenges that call for a new approach.
        </t>
	<t>
	  In particular, whereas the previous trend for Internet and Web standards and programming
	  tools has been to create human-readable syntaxes for data and protocols, the advent of
	  technologies like <xref target="protobuf">protocol buffers</xref>, <xref
	  target="Thrift">Thrift</xref>, the various binary serializations for JSON like <xref
	  target="Avro">Avro</xref> or <xref target="Smile">Smile</xref>, or the binary <xref
	  target="HTTP2">HTTP/2</xref> seem to indicate that the time is ripe for a generalized use
	  of binary, reserved until now for the low-level protocols. The lessons about flexibility
	  learnt in the previous switch from binary to plain text can now be applied to efficient
	  binary syntaxes.
	</t>
	<section title="Definitions">
	  <t>
	    By uniformity, we mean the property of a syntax that can be parsed even by an
	    application that doesn't understand the semantics of every part of the processed
	    data. Of course, almost all syntaxes that feature uniformity contain a limited number
	    of non-uniform elements. Also, uniformity really only has value in the face of
	    extension, as a fixed syntax doesn't need uniformity (it only makes the implementation
	    simpler).
	  </t>
	  <t>
	    Almost all extensible syntaxes have their extensible part uniform to a great degree. In
	    this specification, uniformity is hence evaluated on two criteria: first, the number of
	    non-uniform elements (and, incidentally, their diversity), second, the fact that the
	    uniformity of the extensible part is not a limitation to the users (i.e. that the
	    temptation to extend the format in a non-uniform way is as absent as possible).
	  </t>
	  <t>
	    A good counter-example is found in most programming languages. Adding a new branching
	    construct cannot be done in a terse way without modifying the underlying
	    implementation. Such a construct either cannot be defined by user code (because of
	    evaluation rules) or can in a terribly verbose and inconvenient way (with lots of
	    boilerplate code). Notable exceptions to this limitation of programming languages are
	    Lisp, Haskell and stack programming languages.
	  </t>
	  <t>
	    On the other hand, a stack programming language is the canonical example of a
	    non-uniform language. Each operator takes a number of operands from the stack. Not
	    knowing the arity of an operator makes it impossible to continue parsing, even when its
	    evaluation was optional to the final processing. In the design space, stack programming
	    languages completely sacrifice uniformity to achieve one of the highest combinations of
	    extensibility, compactness and speed of processing.
	  </t>
	  <t>
	    By generality, we mean the ability of a syntax to lend itself to describe any kind of
	    data with a reasonable (or better yet, high) level of compactness and simplicity. For
	    example, although both arrays and linked lists could be considered very general as they
	    are both able to store any kind of data, they actually are at the respective cost of
	    complexity (arrays need the embedding of data structure in the data or in the
	    processing logic) and size (in-memory linked lists can waste as much as half or two
	    thirds of the space for the overhead of the data structure).
	  </t>
	  <t>
	    By decentralization, we mean the ability to extend the syntax in a way that avoids
	    naming collisions without the use of a central registry. Note that the DNS, as we use
	    it, is NOT decentralized in this sense, but distributed, as it cannot work without its
	    root servers and prior knowledge of their location.
	  </t>
	</section>
	<section title="State of the art">
	  <t>
	    Uniformity, generality and extensibility are usually highly valued traits in format
	    design. Programming languages obviously feature them foremost, although their
	    generality usually stops at what they are supposed to express: procedures. Most of them
	    are ill-suited to represent arbitrary data, but notable exceptions include Lisp (where
	    "code is data") and JavaScript, from which a subset has been extracted to exchange
	    data, JSON, which has seen a tremendous success for this purpose. JSON may lack in
	    generality and compactness, but its design makes its parsing really straightforward and
	    fast. All of them, though, lack decentralization. Some of them make it possible to
	    extend them in a distributed way if some discipline is followed (for example, by naming
	    modules after domain names), but the discipline is not mandatory (and even with domain
	    names, a change of ownership makes name collisions possible).
	  </t>
	  <t>
	    The SGML/XML family of formats also features uniformity, generality and extensibility
	    and actually fares much better than programming languages on the three fronts. XML
	    namespaces also make XML naming distributed and there have been attempts at making it
	    compact (e.g. EXI from W3C, Fast Infoset from ISO/ITU or EBML).
	  </t>
	  <t>
	    All the previously cited formats clearly lack compactness, although just applying
	    standard compression techniques would sacrifice only very little processing time to
	    gain huge size reductions on most of their intended use cases, but compression may not
	    address their ineffectiveness at storing arbitrary bytes.
	  </t>
	  <t>
	    So-called binary formats pretty much exhibit the opposite trade-offs. Most of them are
	    not uniform to achieve better compactness. Some are specifically designed for a great
	    generality, but many lack extensibility. When they are extensible, it's never in a
	    decentralized way, again for reasons that have to do with compactness. They are usually
	    extremely fast to parse.
	  </t>
	  <t>
	    Actually, many binary formats are not so much formats but format frameworks, and
	    exclude extensibility by design. For each use case, an IDL compiler creates a brand new
	    format that is essentially incompatible with all other formats created by the same
	    compiler (EBML specifically cites this property among its own disadvantages). If the
	    IDL compiler and framework are correctly designed, such a format usually represents an
	    optimum in compactness and speed of processing, as the compiler can also automatically
	    generate an ad-hoc optimized parser.
	  </t>
	  <t>
	    Where extensibility has been planned in existing formats, it often doesn't get used
	    that much or at all because of the complications around it. Many binary formats include
	    reserved values meant to extend them to future uses, like the <spanx
	    style="verb">CM</spanx> field in the ZIP format. A case like this one faces a
	    chicken-and-egg problem: if you don't write and get a specification officially adopted,
	    implementations might not want to include your extension, but if your extension is
	    purely theoretical and hasn't been tested in the wild, you may face resistance to get
	    it officially adopted. This is probably why even though most compression formats
	    include the ability to later encode other compression methods, each new compression
	    method usually comes with its own format.
	  </t>
	  <t>
	    When extensions are managed with any form of registry, another issue is that you
	    usually need to reserve a large set of values for free experimentation, and once an
	    extension gains any traction while in experimentation, its authors face the difficulty
	    of switching all existing implementations to the definitive values they'll get. And how
	    experimenters choose their temporary values makes them vulnerable to conflicts with
	    others.
	  </t>
	</section>
      </section>
      <section title="Format overview">
	<t>
	  A BULK stream is a stream of 8-bit bytes, in big-endian order. Parsing a BULK stream
	  yields a sequence of expressions, which can be either atoms or forms, which are sequences
	  of expressions. The syntax of forms is entirely uniform, without a single exception: a
	  starting byte marker, a sequence of expressions and an ending byte marker. Among atoms,
	  only nil (the null byte) and arrays have a special syntax, for efficiency purposes. Even
	  booleans and floating-point numbers follow the uniform syntax that every other expression
	  follows.
	</t>
	<t>
	  Non-uniform atoms start with a marker byte, followed by a static or dynamic number of
	  bytes, depending on the type.
	</t>
	<t>
	  Any other atom is a reference, which consists of a namespace marker (in almost all cases,
	  a single byte) followed by an identifier within this namespace (a single byte). All in
	  all, a very small sacrifice is made in compactness for the benefit of a very simple
	  syntax: apart from nil and small integers, nothing is smaller than 2 bytes, and as most
	  forms involve a reference followed by some content, a form is usually 4 bytes + its
	  content.
	</t>
	<t>
	  A namespace marker in a BULK stream is associated to a namespace identified by some
	  identifier guaranteed to be unique without coordination (like a UUID or cryptographic
	  hash), thus ensuring decentralized extensibility. The stream can be processed even if the
	  application doesn't recognize the namespace. Parsing remains possible thanks to the
	  uniform syntax.
	</t>
	<t>
	  Combining BULK namespaces, BULK streams and even other formats doesn't need any
	  content transformation to work. Here are some examples:
	  <list style="symbols">
	    <t>
	      The content of a BULK stream, enclosed in list starting and ending byte markers,
	      constitutes a valid BULK expression. Thus BULK streams can be packed or annotated
	      within a BULK stream without modification. Annotation use cases include adding
	      metadata or cryptographic signature.
	    </t>
	    <t>
	      A BULK format could specify in its syntax the place for an expression holding
	      metadata. Whether the specification provides its own metadata forms or not, an
	      application could use a BULK serialization for MARC, TEI Header, XML or RDF for this
	      metadata expression. The vocabulary selected would be univocally expressed by the
	      namespace and every vocabulary would be parsed by the same mechanisms.
	    </t>
	    <t>
	      Whenever a content must be stored as-is instead of serialized or a highly-optimized
	      ad hoc serialization exists for some data, anything can always be stored within an
	      array. Arrays can contain arbitrary bytes and there is no limit to their size.
	    </t>
	  </list>
	</t>
	<t>
	  Furthermore, BULK expressions can be evaluated. Most expressions evaluate to themselves,
	  but some evaluate by default to the result of a pure function call, making it possible to
	  serialize data in an even more compact form, by eliminating boilerplate data and repeated
	  patterns.
	</t>
      </section>
      <section title="Conventions and Terminology">
        <t>
          The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD
          NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as
          described in <xref target="RFC2119">RFC 2119</xref>.
        </t>
        <t>
          Literal numerical values are provided in decimal or hexadecimal as appropriate.
          Hexadecimal literals are prefixed with <spanx style="verb">0x</spanx> to distinguish them
          from decimal literals.
        </t>
	<t>
	   The text notation of the BULK stream uses mnemonics for some byte sequences. Mnemonics
	   are series of characters, excluding all capital letters and white space, like <spanx
	   style="verb">this-is-one-mnemonic</spanx> or <spanx
	   style="verb">what-the-%§!?#-is-that?</spanx>. They are always separated by white
	   space. Outside the use of mnemonics, a sequence of bytes (of one or more bytes) can be
	   represented by its hexadecimal value as an unsigned integer (e.g. <spanx
	   style="verb">0x3F</spanx> or <spanx style="verb">0x3A0B770F</spanx>). Such a sequence of
	   bytes can include dashes to make it more readable (e.g. <spanx
	   style="verb">0xDDA37D36-85E6-4E6D-9B51-959E1CCE366C</spanx>). Some types in this
	   specification define a special syntax for their representation in the text notation.
	</t>
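	<t>
	  As an illustration of the hexadecimal notation described above, the following Python
	  sketch converts a literal into the bytes it denotes. The function name <spanx
	  style="verb">hex_literal_to_bytes</spanx> is hypothetical, and the sketch assumes that
	  a literal always has an even number of hexadecimal digits, which this specification
	  does not state explicitly.
	</t>
	<figure><artwork><![CDATA[
```python
def hex_literal_to_bytes(literal: str) -> bytes:
    """Convert a text-notation literal such as 0x3A0B770F to bytes.

    Assumption: dashes are purely cosmetic and the digit count is even.
    """
    if not literal.startswith("0x"):
        raise ValueError("hexadecimal literals are prefixed with 0x")
    # Dashes only improve readability, so they are dropped first.
    digits = literal[2:].replace("-", "")
    if not digits or len(digits) % 2 != 0:
        raise ValueError("expected an even, non-zero number of hex digits")
    return bytes.fromhex(digits)
```
]]></artwork></figure>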
	<t>
	  In the grammar, a shape is a pattern of bytes, following the rules of the text notation
	  for a BULK stream. Apart from mnemonics and fixed sequences of bytes, a shape can
	  contain:
	  <list style="symbols">
	    <t>an arbitrary sequence of a fixed number of bytes, represented by its size, i.e. a
	    number of bytes in decimal immediately followed by a B uppercase letter (e.g. <spanx
	    style="verb">4B</spanx>)</t>
	    <t>a typed sequence of bytes, represented by the name of its type, a capitalized word
	    (e.g.  <spanx style="verb">Foo</spanx>); this means a sequence of bytes whose specific
	    yield (cf. <xref target="parsing"/>) has this type</t>
	    <t>a named sequence of bytes (of zero or more bytes), represented by a series of any
	    character excluding '{}' between '{' and '}' (e.g. <spanx style="verb">{quux}</spanx>);
	    a named sequence can be typed or sized, in which case it is immediately followed by ':'
	    and a type or size (e.g. <spanx style="verb">{quux}:Bar</spanx> or <spanx
	    style="verb">{quux}:12B</spanx>)</t>
	  </list>
	</t>
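	<t>
	  The shape elements listed above can be told apart mechanically. The following Python
	  sketch classifies a single white-space-separated token of a shape. The function name
	  <spanx style="verb">classify_shape_element</spanx> and the tuple layout it returns are
	  hypothetical, and the sketch assumes that a type name is a capital letter followed by
	  letters or digits. Anything else is treated as a mnemonic or a fixed byte sequence.
	</t>
	<figure><artwork><![CDATA[
```python
import re

_SIZED = re.compile(r"(\d+)B")                  # e.g. 4B
_TYPED = re.compile(r"[A-Z][A-Za-z0-9]*")       # e.g. Foo (assumed pattern)
_NAMED = re.compile(r"\{([^{}]+)\}(?::(.+))?")  # e.g. {quux}:Bar


def classify_shape_element(token: str):
    """Classify one token of a shape (illustrative only)."""
    sized = _SIZED.fullmatch(token)
    if sized:
        return ("sized", int(sized.group(1)))
    if _TYPED.fullmatch(token):
        return ("typed", token)
    named = _NAMED.fullmatch(token)
    if named:
        name, suffix = named.groups()
        if suffix is None:
            return ("named", name, None)
        # A named sequence may only be followed by a type or a size.
        constraint = classify_shape_element(suffix)
        if constraint[0] not in ("sized", "typed"):
            raise ValueError("a name must be followed by a type or size")
        return ("named", name, constraint)
    # Mnemonics and hexadecimal literals end up here.
    return ("literal", token)
```
]]></artwork></figure>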
	<t>
	  When an entire shape describes the byte sequence of an atom, it is the normative
	  specification for parsing it, but shapes of forms are only normative with respect to
	  their default evaluation. A reference defined with a form shape can be used in different
	  shapes, albeit with different semantics and value and even when used in its default
	  shape, a processing application MAY give it alternative semantics.
	</t>
	<t>
	  For example, this specification defines a way to specify a string encoding with forms of
	  the shape <spanx style="verb">( stringenc {enc}:Expr )</spanx>. But the shapes <spanx
	  style="verb">( stringenc {arg1}:Int {arg2}:Int )</spanx> or <spanx style="verb">(
	  {arg1}:Int stringenc {arg2}:Int )</spanx> are syntactically valid. They just have unspecified
	  semantics, as far as this specification is concerned.
	</t>
	<t>
	  Some identifiers are expected to be verifiable against a byte sequence. This means that
	  there must be an algorithm that, given the byte sequence as input, produces the
	  identifier as output and, given a different byte sequence, would produce a different
	  identifier. Because this verification has security implications, the algorithm used
	  should have the same guarantees as a cryptographic hash function in terms of
	  collisions.
	</t>
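	<t>
	  One possible way to obtain a verifiable identifier is to use the digest of a
	  cryptographic hash function as the identifier. The following Python sketch does so
	  with SHA-256. The choice of SHA-256 and the function names are assumptions made for
	  illustration only, as the paragraph above does not mandate any particular algorithm.
	</t>
	<figure><artwork><![CDATA[
```python
import hashlib
import hmac


def make_identifier(data: bytes) -> bytes:
    """Derive an identifier from a byte sequence (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def verify_identifier(identifier: bytes, data: bytes) -> bool:
    """Check that the identifier is the one produced from data."""
    # compare_digest compares in constant time.
    return hmac.compare_digest(identifier, make_identifier(data))
```
]]></artwork></figure>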
      </section>
    </section>

    <section title="BULK syntax">
      <t>
	A BULK stream is a sequence of 8-bit bytes. Bits and bytes are in big-endian order. The
	result of parsing a BULK stream is a list of abstract data, called the abstract yield. BULK
	parsing is a function that is not injective: a BULK stream has only one abstract yield, but
	different BULK streams can have the same abstract yield.
      </t>
      <t>
	A processing application is not expected to actually produce the abstract yield, but an
	adaptation of the abstract yield to its own implementation, called the concrete
	yield. Also, some expressions in a BULK stream may have the semantics of a transformation
	of the abstract yield. A processing application MAY thus not produce or retain the concrete
	yield but the result of its transformation. This specification deals mainly with the byte
	sequence and the abstract yield and occasionally provides guidelines about the concrete
	yield. Of course, a processing application MAY not produce the concrete yield at all but
	produce various data structures and side effects from parsing the BULK stream.
      </t>
      <t>
	The abstract yield is a list of expressions. Expressions can be atoms or forms. Forms
	are lists of expressions. If a byte sequence is parsed as an expression, this byte
	sequence is said to denote this expression.
      </t>
      <t>
	When a sequence of bytes is named in a shape, its name can be used in this specification to
	designate either the byte sequence, or the expression or list of expressions it
	denotes. When there could be ambiguity, this specification specifies which is designated.
      </t>

      <section anchor="parsing" title="Parsing algorithm">
	<t>
	  The parser operates with a context, which is a list of expressions. Each time an
	  expression is parsed, it is appended at the end of the context. The initial context is the
	  abstract yield.
	</t>
	<t>
	  At the beginning of a BULK stream and after having consumed the byte sequence denoting a
	  complete expression, the parser is at the dispatch stage. At this stage, the next byte is
	  a marker byte, which tells the parser what kind of expression comes next (the marker byte
	  is the first byte of the sequence that denotes an expression). The expression appended to
	  the context after reading a byte sequence is called the specific yield of the byte
	  sequence.
	</t>
	<t>
	  The <spanx style="verb">0x01</spanx> and <spanx style="verb">0x02</spanx> marker bytes are
	  special cases. When the parser reads <spanx style="verb">0x01</spanx>, it immediately
	  appends an empty list to the current context. This list becomes the new context. This new
	  context has the previous context as parent. Then the parser returns to its dispatch
	  stage. When the parser reads <spanx style="verb">0x02</spanx>, it appends nothing to the
	  context, but instead the parent of the current context becomes the new context and the
	  parser returns to the dispatch stage. Thus it is a parsing error to read <spanx
	  style="verb">0x02</spanx> when the context is the abstract yield.
	</t>
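	<t>
	  The handling of contexts described above can be sketched in Python as follows. This is
	  a minimal illustration, not a reference parser: the function name <spanx
	  style="verb">parse_forms</spanx> is hypothetical, nil is represented by <spanx
	  style="verb">None</spanx>, and every marker byte other than <spanx
	  style="verb">0x00</spanx>, <spanx style="verb">0x01</spanx> and <spanx
	  style="verb">0x02</spanx> is rejected because its parsing rule is defined elsewhere.
	</t>
	<figure><artwork><![CDATA[
```python
def parse_forms(stream: bytes) -> list:
    """Parse a stream made only of nil atoms and form markers."""
    abstract_yield = []        # the initial context
    parents = []               # parent contexts, innermost last
    context = abstract_yield
    for byte in stream:
        if byte == 0x00:
            context.append(None)           # nil atom
        elif byte == 0x01:
            new_context = []
            context.append(new_context)    # the form is appended at once
            parents.append(context)
            context = new_context
        elif byte == 0x02:
            if not parents:
                raise ValueError("0x02 read in the abstract yield")
            context = parents.pop()        # back to the parent context
        else:
            raise NotImplementedError("marker not handled in this sketch")
    return abstract_yield
```
]]></artwork></figure>
	<t>
	  The sketch deliberately does not decide what happens when the stream ends inside a
	  form that has not been closed.
	</t>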
	<t>
	  Some forms have side-effects in their semantics. Those side-effects MUST NOT affect the
	  parsing of any expression. They can affect evaluation, in which case they MUST only affect
	  the evaluation of expressions in the scope of the form. The outer scope of an expression
	  is the part of its context that follows the expression. Some forms MAY define an inner
	  scope in their shape. The scope of an expression is the union of the outer and inner
	  scopes. This makes BULK lexically scoped.
	</t>
	<t>
	  Whenever a parsing error is encountered, parsing of the BULK stream MUST stop.
	</t>
	<section title="Summary of marker bytes">
	  <table>
	    <thead><tr><th>marker</th><th>shape</th></tr></thead>
	    <tbody>
	      <tr><td><spanx style="verb">00</spanx></td><td><xref target="nil"><spanx style="verb">nil</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">01</spanx></td><td><xref target="start"><spanx style="verb">(</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">02</spanx></td><td><xref target="end"><spanx style="verb">)</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">03</spanx></td><td><xref target="array"><spanx style="verb"># Nat {content}</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">04–0F</spanx></td><td><xref target="reserved"><spanx style="verb">reserved</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">10–7F</spanx></td><td><xref target="ref"><spanx style="verb">references</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">80–BF</spanx></td><td><xref target="smallint"><spanx style="verb">w6[value]</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">C0–FF</spanx></td><td><xref target="smallarray"><spanx style="verb">#[size] {content}</spanx></xref></td></tr>
	    </tbody>
	  </table>
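	  <t>
	    The table above translates directly into a dispatch function. The following Python
	    sketch returns the kind of expression announced by a marker byte. The function name
	    <spanx style="verb">classify_marker</spanx> and the returned labels are hypothetical.
	  </t>
	  <figure><artwork><![CDATA[
```python
def classify_marker(byte: int) -> str:
    """Return the kind of expression announced by a marker byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("a marker is a single 8-bit byte")
    if byte == 0x00:
        return "nil"
    if byte == 0x01:
        return "form start"
    if byte == 0x02:
        return "form end"
    if byte == 0x03:
        return "array"
    if byte <= 0x0F:
        return "reserved"
    if byte <= 0x7F:
        return "reference"
    if byte <= 0xBF:
        return "small integer"
    return "small array"
```
]]></artwork></figure>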
	</section>
	<section title="Evaluation">
	  <t>
	    A processing application MAY implement evaluation of BULK expressions and streams. When
	    evaluating a BULK stream, when the parser gets to the dispatch stage and the context is
	    the abstract yield, the last expression in the context is replaced by what it evaluates
	    to. (Of course, this description is supposed to provide the semantics of BULK
	    evaluation, but a processing application MAY implement evaluation with a different
	    algorithm as long as it provides the same semantics.)
	  </t>
	  <t>
	    The default evaluation rule is that an expression evaluates to itself. A name within a
	    namespace can have a value, which is what a reference associated to this name evaluates
	    to. A reference whose marker value is associated to no namespace or whose name has no
	    value evaluates to itself. How self-evaluating BULK expressions are represented in the
	    concrete yield is application-dependent, but future specifications MAY define a
	    standard API to access it, similar to the Document Object Model for XML.
	  </t>
	  <t>
	    The evaluation of a form obeys a special rule, though: if the first expression of the
	    form has type <spanx style="verb">Function</spanx>, that function is called with an
	    argument list and the form evaluates to the return value if it's an atom or the
	    evaluation of the return value if it is a form. If the function has type <spanx
	    style="verb">LazyFunction</spanx>, the argument list is the rest of the form. If the
	    function has type <spanx style="verb">EagerFunction</spanx>, the argument list is the
	    rest of the form, where each expression is replaced by what it evaluates to. Any
	    expression that has type <spanx style="verb">LazyFunction</spanx> or <spanx
	    style="verb">EagerFunction</spanx> also has type <spanx style="verb">Function</spanx>.
	  </t>
	  <t>
	    A form whose first expression doesn't have type <spanx style="verb">Function</spanx>
	    evaluates to itself.
	  </t>
	  <t>
	    When an application evaluates a BULK expression, it MUST verify that evaluation will
	    terminate in a finite number of evaluation steps. An application MAY verify finite
	    termination statically or dynamically. For example, an application MAY stop evaluation
	    in error after a predetermined number of steps.
	  </t>
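	  <t>
	    The evaluation rules above, together with a dynamic termination check, can be
	    sketched in Python as follows. All names are hypothetical. The sketch assumes that
	    forms are Python lists, that functions appear directly as objects in the first
	    position of a form (reference lookup is left out), and that a function which is
	    neither lazy nor eager receives its arguments unevaluated.
	  </t>
	  <figure><artwork><![CDATA[
```python
class Function:
    def __init__(self, call):
        self.call = call       # callable taking the argument list


class LazyFunction(Function):
    pass


class EagerFunction(Function):
    pass


class StepLimitExceeded(Exception):
    pass


class Evaluator:
    def __init__(self, max_steps: int = 100):
        self.remaining = max_steps

    def evaluate(self, expr):
        # Dynamic termination check: stop in error after a fixed budget.
        self.remaining -= 1
        if self.remaining < 0:
            raise StepLimitExceeded("evaluation step budget exhausted")
        # Atoms and forms without a function in front evaluate to themselves.
        if not isinstance(expr, list) or not expr:
            return expr
        head, args = expr[0], expr[1:]
        if not isinstance(head, Function):
            return expr
        if isinstance(head, EagerFunction):
            args = [self.evaluate(arg) for arg in args]
        result = head.call(args)
        # A returned form is evaluated in turn, an atom is returned as is.
        return self.evaluate(result) if isinstance(result, list) else result
```
]]></artwork></figure>
	  <t>
	    The sketch is recursive, so the step budget should stay well below the recursion
	    limit of the Python interpreter.
	  </t>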
	</section>
      </section>

      <section title="Forms">
	<section anchor="start" title="starting marker byte">
	  <t>
	    <list style="hanging">
	      <t hangText="marker"><spanx style="verb">0x01</spanx></t>
	      <t hangText="mnemonic"><spanx style="verb">(</spanx></t>
	    </list>
	  </t>
	</section>
	<section anchor="end" title="ending marker byte">
	  <t>
	    <list style="hanging">
	      <t hangText="marker"><spanx style="verb">0x02</spanx></t>
	      <t hangText="mnemonic"><spanx style="verb">)</spanx></t>
	    </list>
	  </t>
	</section>

	<section title="Difference between sequence and form">
	  <t>
	    There is a difference between a byte sequence denoting several expressions among the
	    current context and a byte sequence denoting a form (i.e. a single expression that is a
	    list of expressions). As an example, let's examine several forms of the shape <spanx
	    style="verb">( foo {bar} )</spanx>.
	  </t>
	  <t>
	    <list style="symbols">
	      <t>In the form <spanx style="verb">( foo nil nil nil )</spanx>, {bar} denotes 3
	      expressions, and they are three atoms in the yield.</t>

	      <t>In the form <spanx style="verb">( foo nil )</spanx>, {bar} is a single expression
	      in the yield, and that expression is an atom.</t>

	      <t>In the form <spanx style="verb">( foo ( nil nil nil ) )</spanx>, {bar} is also a
	      single expression in the yield, and that expression is a form, a list in the
	      yield.</t>
	    </list>
	  </t>
	  <t>
	    In a shape, when a byte sequence must yield a single expression, it has the type <spanx
	    style="verb">Expr</spanx>. So the last two examples fit the shape <spanx style="verb">(
	    foo {seq}:Expr )</spanx> but not the first. When a byte sequence must yield a form, it
	    has type <spanx style="verb">Form</spanx>. Thus the shape <spanx style="verb">( foo
	    {bar}:Form )</spanx> is equivalent to <spanx style="verb">( foo ( {bar} )
	    )</spanx>. Either one MAY be used.
	  </t>
	</section>
      </section>

      <section title="Atoms">
	<section anchor="nil" title="nil">
	  <t>
	    <list style="hanging">
	      <t hangText="marker"><spanx style="verb">0x00</spanx> (mnemonic: <spanx
	      style="verb">nil</spanx>)</t>
	      <t hangText="shape"><spanx style="verb">nil</spanx></t>
	    </list>
	  </t>
	  <t>
	    Apart from being a possible short marker value, the fact that the <spanx
	    style="verb">0x00</spanx> byte represents a valid atom means that a series of null bytes
	    is a valid part of a BULK stream, thus making the format less fragile. In a network
	    communication, nil atoms can be sent to keep the channel open. They can also be used as
	    padding at the end of a form or between forms.
	  </t>
	</section>

	<section title="Arrays">
	  <t>
	    Arrays can be used to store arbitrary bytes.
	  </t>
	  <t>
	    An array can be interpreted either as a bit sequence or as an unsigned integer in
	    binary notation. The choice depends on the context and the application. Actually, many
	    processing applications may not need to make any choice, as most programming language
	    implementations actually also confuse unsigned integers and bit sequences to some
	    extent. Expressions that are unsigned integers (that is, natural numbers) have type
	    <spanx style="verb">Nat</spanx>.
	  </t>
	  <t>
	    Big arrays typically store the content of a file or a binary message of another
	    format. They can also be used to store a vector or matrix of fixed-size elements.
	  </t>
	  <t>
	    In any case, the semantics of the content must be inferred by the processing
	    application; where ambiguity can appear, an application SHOULD enclose the array in a
	    type-denoting form.
	  </t>
	  <t>
	    Because BULK arrays have no end markers, the payload of a BULK array can constitute the
	    end of the stream.
	  </t>
	  <t>
	    The start and end of an array are known without reading its content, which means that
	    its content can be skipped in constant time and mapped in memory (or read lazily by any
	    other means).
	  </t>
	  <t>
	    Because BULK can use integers with arbitrary size to store the size of an array, BULK
	    arrays have no limit in size.
	  </t>
	  
	  <section anchor="array" title="Generic array">
	    <t>
	      <list style="hanging">
		<t hangText="marker"><spanx style="verb">0x03</spanx> (mnemonic: <spanx
		style="verb">#</spanx>)</t>
		<t hangText="shape"><spanx style="verb"># Nat {content}</spanx></t>
	      </list>
	    </t>
	    <t>
	      Arrays have a special parsing rule. After consuming the marker byte, the parser
	      returns to the dispatch stage. It is a parser error if the parsed expression is not of
	      type <spanx style="verb">Nat</spanx> or if its value cannot be recognized. This
	      integer is not added to any context, but the parser consumes as many bytes as this
	      integer and they constitute the content of this array.
	    </t>
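	    <t>
	      The parsing rule for generic arrays can be sketched in Python as follows. The
	      function names are hypothetical. The sketch assumes that the size is written as a
	      small unsigned integer or as an array, and that an array used as a <spanx
	      style="verb">Nat</spanx> is read as a big-endian unsigned integer.
	    </t>
	    <figure><artwork><![CDATA[
```python
def read_nat(stream: bytes, pos: int):
    """Read a Nat expression at pos and return (value, next position)."""
    marker = stream[pos]
    if 0x80 <= marker <= 0xBF:            # small unsigned integer
        return marker & 0x3F, pos + 1
    if 0xC0 <= marker <= 0xFF:            # small array read as an integer
        size, start = marker & 0x3F, pos + 1
    elif marker == 0x03:                  # generic array read as an integer
        size, start = read_nat(stream, pos + 1)
    else:
        raise ValueError("array size is not of type Nat")
    end = start + size
    if end > len(stream):
        raise ValueError("truncated integer")
    # Assumption: array content is a big-endian unsigned integer.
    return int.from_bytes(stream[start:end], "big"), end


def read_generic_array(stream: bytes, pos: int = 0):
    """Read a generic array at pos and return (content, next position)."""
    if stream[pos] != 0x03:
        raise ValueError("not a generic array marker")
    size, start = read_nat(stream, pos + 1)
    end = start + size
    if end > len(stream):
        raise ValueError("truncated array")
    # The size itself is not part of the yield, only the content is.
    return stream[start:end], end
```
]]></artwork></figure>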
	    <t>
	      In the text notation, a quoted string is the notation for an array containing the
	      encoding of that string in the <xref target="stringenc">current encoding</xref>,
	      except if the size of the encoding is below 64 bytes, cf. <xref
	      target="smallarray" sectionFormat="bare">small arrays</xref>.
	    </t>
	    <t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
	    <t>
	      In a shape, the type <spanx style="verb">String</spanx> is synonymous with <spanx
	      style="verb">Bytes</spanx>, but means that the content of the array is supposed to be
	      taken as a string in the current encoding.
	    </t>
	  </section>

	  <section anchor="smallarray" title="Small array">
	    <t>
	      <list style="hanging">
		<t hangText="marker"><spanx style="verb">0xC0–0xFF</spanx> (mnemonic: <spanx
		style="verb">#[size]</spanx>)</t>
		<t hangText="shape"><spanx style="verb">#[size] {content}</spanx></t>
	      </list>
	    </t>
	    <t>
	      Small arrays have a special parsing rule. The 6 least significant bits of the marker
	      byte are treated as an unsigned integer. This integer is not added to any context, but
	      the parser consumes as many bytes as this integer and they constitute the content of
	      this array.
	    </t>
	    <t>
	      In the text notation, the marker byte of a small array of size X is written as <spanx
	      style="verb">#[X]</spanx>. For example, <spanx style="verb">#[2] 0x1234</spanx> is a
	      notation for the bytes <spanx style="verb">0xC2 0x12 0x34</spanx>.
	    </t>
	    <t>
	      In the text notation, a quoted string is the notation for a small array containing the
	      encoding of that string in the current encoding if the size of the encoding is below
	      64 bytes.
	    </t>
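	    <t>
	      The choice between the two array syntaxes can be sketched in Python as follows,
	      using the marker range <spanx style="verb">0xC0–0xFF</spanx> for small arrays and
	      <spanx style="verb">0x03</spanx> for generic arrays. The function name <spanx
	      style="verb">encode_array</spanx> is hypothetical, and writing the size of a
	      generic array as a small array holding a big-endian integer is only one of the
	      encodings a writer could pick.
	    </t>
	    <figure><artwork><![CDATA[
```python
def encode_array(content: bytes) -> bytes:
    """Encode content as a small array when it fits, else a generic one."""
    size = len(content)
    if size < 64:
        # The 6 least significant bits of the marker hold the size.
        return bytes([0xC0 | size]) + content
    # Assumption: the Nat size is a small array read as a big-endian integer.
    size_bytes = size.to_bytes((size.bit_length() + 7) // 8, "big")
    if len(size_bytes) > 0x3F:
        raise ValueError("size does not fit this sketch")
    return bytes([0x03, 0xC0 | len(size_bytes)]) + size_bytes + content
```
]]></artwork></figure>
	    <t>
	      Under this reading, a two-byte content is preceded by the marker byte <spanx
	      style="verb">0xC2</spanx>.
	    </t>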
	    <t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
	  </section>

	  <section anchor="smallint" title="Small unsigned integers">
	    <t>
	      <list style="hanging">
		<t hangText="marker"><spanx style="verb">0x80–0xBF</spanx> (mnemonic: <spanx
		style="verb">w6[value]</spanx>)</t>
		<t hangText="shape"><spanx style="verb">w6[value]</spanx></t>
	      </list>
	    </t>
	    <t>
	      Small unsigned integers have a special parsing rule. The 6 least significant bits of
	      the marker byte are the value denoted by this byte (as bits or as an unsigned integer
	      in binary notation).
	    </t>
	    <t>
	      In the text notation, the marker byte of a small unsigned integer of value X is
	      written as <spanx style="verb">w6[X]</spanx>. For example, <spanx
	      style="verb">w6[11]</spanx> is a notation for the byte <spanx
	      style="verb">0x8B</spanx> (as is <spanx style="verb">11</spanx>).
	    </t>
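	    <t>
	      Encoding and decoding a small unsigned integer amounts to setting or masking the
	      two most significant bits of the marker byte, given the marker range <spanx
	      style="verb">0x80–0xBF</spanx>. A Python sketch, with hypothetical function names:
	    </t>
	    <figure><artwork><![CDATA[
```python
def encode_small_int(value: int) -> bytes:
    """Encode a value between 0 and 63 as a single marker byte."""
    if not 0 <= value <= 0x3F:
        raise ValueError("value does not fit in 6 bits")
    return bytes([0x80 | value])


def decode_small_int(marker: int) -> int:
    """Return the value held in the 6 least significant bits."""
    if not 0x80 <= marker <= 0xBF:
        raise ValueError("not a small unsigned integer marker")
    return marker & 0x3F
```
]]></artwork></figure>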
	    <t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
	  </section>
	</section>


	<section anchor="reserved" title="Reserved marker bytes">
	  <t>
	    Marker bytes <spanx style="verb">0x04–0x0F</spanx> are reserved for future major
	    versions of BULK. It is a parser error if a BULK stream with major version 1 contains
	    such a marker byte.
	  </t>
	</section>

	<section anchor="ref" title="References">
	  <t><list style="hanging">
	    <t hangText="marker"><spanx style="verb">0x10−0x7F</spanx></t>
	    <t hangText="shape">
	      <spanx style="verb">{ns}:1B {name}:1B</spanx>
	      <vspace/>
  	      <spanx style="verb">0x7F {ns'} {name}:1B</spanx>
	    </t>
	  </list>
	  </t>
	  <t>
	    The <spanx style="verb">{ns}</spanx> byte is a value associated with a namespace, called
	    the namespace marker. Values <spanx style="verb">0x10−0x17</spanx> are reserved for
	    namespaces defined by BULK specifications. Greater values can be associated with
	    namespaces identified by a unique identifier.
	  </t>
	  <t>
	    The <spanx style="verb">{name}</spanx> byte is the name within the
	    namespace. Vocabularies with more than 256 names thus need to be spread across several
	    namespaces.
	  </t>
	  <t>
	    The specification of a namespace SHOULD include a mnemonic for the namespace and for
	    each defined name. When descriptions use several namespaces, the mnemonic of a reference
	    SHOULD be the concatenation of the namespace mnemonic, ":" and the name mnemonic if
	    there can be an ambiguity. For example, the <spanx style="verb">fp</spanx> name in
	    namespace <spanx style="verb">math</spanx> becomes <spanx style="verb">math:fp</spanx>.
	  </t>
	  <t>Type: <spanx style="verb">Ref</spanx></t>
	  <section title="Special case">
	    <t>
	      References have a special parsing rule, intended for BULK streams that need a large
	      number of namespaces. If the marker byte is <spanx style="verb">0x7F</spanx>, the
	      parser continues to read bytes until it finds a byte different from
	      <spanx style="verb">0xFF</spanx>. The sum of all those bytes (the marker byte and the
	      final byte included), taken as unsigned integers, is the namespace marker. For example,
	      the reference denoted by the bytes <spanx style="verb">0x7F 0xFF 0x8C 0x1A</spanx> is
	      the name 26 in the namespace associated with 522 (127 + 255 + 140).
	    </t>
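	    <t>
	      A possible Python sketch of reference parsing, including the special case, is given
	      below. It assumes that the sum includes the <spanx style="verb">0x7F</spanx> marker
	      byte and the terminating byte, which is the reading that reproduces the example
	      above. The function name <spanx style="verb">read_reference</spanx> is hypothetical.
	    </t>
	    <figure><artwork><![CDATA[
```python
from typing import Tuple


def read_reference(data: bytes, pos: int = 0) -> Tuple[int, int, int]:
    """Read a reference starting at data[pos].

    Returns (namespace marker, name, position of the next unread byte).
    """
    marker = data[pos]
    if not 0x10 <= marker <= 0x7F:
        raise ValueError("not a reference marker byte")
    namespace = marker
    pos += 1
    if marker == 0x7F:
        # Special case: add the following bytes to the marker; the first
        # byte that differs from 0xFF is the last one added.
        while True:
            byte = data[pos]
            pos += 1
            namespace += byte
            if byte != 0xFF:
                break
    name = data[pos]  # the name is always a single byte
    return namespace, name, pos + 1
```
]]></artwork></figure>
	    <t>
	      With this sketch, <spanx style="verb">read_reference(bytes([0x7F, 0xFF, 0x8C,
	      0x1A]))</spanx> returns the namespace marker 522 and the name 26.
	    </t>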
	  </section>
	</section>

      </section>

    </section>

    <section title="Standard namespaces">
      <t>
	Standard namespaces have a fixed marker value and are not identified by a unique
	identifier.
      </t>

      <section title="BULK core namespace">
	<t>
	  <list style="hanging">
	      <t hangText="marker"><spanx style="verb">0x20</spanx> (mnemonic: <spanx
	      style="verb">bulk</spanx>)</t>
	  </list>
	</t>

	<section title="Version">
	  <t>
	    <list style="hanging">
	      <t hangText="name"><spanx style="verb">0x00</spanx> (mnemonic: <spanx
	      style="verb">version</spanx>)</t>
	      <t hangText="shape"><spanx style="verb">( version {major}:Nat {minor}:Nat
	      )</spanx></t>
	    </list>
	  </t>
	  <t>
	    When parsing a BULK stream, a processing application MUST explicitly determine the
	    major and minor version of the BULK specification that the stream obeys. This
	    information MAY be exchanged out-of-band, if BULK is used to exchange a number of very
	    small messages, where repeated headers of 8 bytes might become too big an overhead. A
	    processing application MUST NOT assume a default version.
	  </t>
	  <t>
	    If the version is expressed within a BULK stream, this form MUST be the first in the
	    stream. In any other place, this form has no semantics attached to it. This
	    specification defines BULK 1.0. When writing a BULK stream, an application MUST denote
	    {major} and {minor} by the smallest byte sequence possible.
	  </t>
	  <t>
	    An application writing a BULK stream to long-term storage (e.g. in a file or a database
	    record) SHOULD include a <spanx style="verb">version</spanx> form.
	  </t>
	  <t>
	    Two BULK versions with the same major version MUST share the same parsing rules and the
	    same definitions of marker bytes. Changing the syntax or semantics of existing marker
	    bytes, using marker bytes in the reserved interval, or changing the syntax or semantics
	    of existing names in standard namespaces warrants a new major version.
	  </t>
	  <t>
	    Adding standard namespaces or adding names in existing standard namespaces warrants a
	    new minor version.
	  </t>
	</section>

	<section title="Booleans">
	  <section title="true">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x02</spanx> (mnemonic: <spanx
		style="verb">true</spanx>)</t>
		<t hangText="shape"><spanx style="verb">true</spanx></t>
	      </list>
	    </t>
	    <t>
	      Type: <spanx style="verb">Boolean</spanx>.
	    </t>
	  </section>

	  <section title="false">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x03</spanx> (mnemonic: <spanx
		style="verb">false</spanx>)</t>
		<t hangText="shape"><spanx style="verb">false</spanx></t>
	      </list>
	    </t>
	    <t>
	      Type: <spanx style="verb">Boolean</spanx>.
	    </t>
	  </section>
	</section>

	<section title="Strings encoding">
	  <section anchor="stringenc" title="Current encoding">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x04</spanx> (mnemonic: <spanx
		style="verb">stringenc</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( stringenc {enc}:Encoding )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This tells the processing application that, in the scope of this expression, all
	      expressions that are understood by the application as character strings will be encoded
	      with the encoding designated by {enc}.
	    </t>
	    <t>
	      As the abstract yield doesn't contain strings but expressions that will be used as
	      strings by the application, it is not a parsing error if the application doesn't
	      recognize {enc}. In this situation, it is a parsing error when the application actually
	      needs to decode a byte sequence as a string. It is not a parsing error when a processing
	      application only transmits a byte sequence encoding a string, if it can accurately
	      convey the encoding to the receiving application.
	    </t>
	  </section>

	  <section title="IANA registered character set">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x05</spanx> (mnemonic: <spanx
		style="verb">iana-charset</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( iana-charset {id}:Nat )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This designates the string encoding registered among the <xref
	      target="IANA-Charsets">IANA Character Sets</xref> whose MIBenum is {id}.
	    </t>
	    <t>
	      Type: <spanx style="verb">Encoding</spanx>.
	    </t>
	  </section>

	  <section title="Windows code page">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x06</spanx> (mnemonic: <spanx
		style="verb">code-page</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( code-page {id}:Nat )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This designates the string encoding among Windows code pages whose identifier is {id}.
	    </t>
	    <t>
	      Type: <spanx style="verb">Encoding</spanx>.
	    </t>
	  </section>
	</section>

	<section title="Namespaces">
	  <section title="New namespace">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x07</spanx> (mnemonic: <spanx
		style="verb">ns</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( ns {marker}:Ref {id}:Expr )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This associates the namespace identified by {id} to the namespace marker of {marker},
	      within the scope of this expression.
	    </t>
	  </section>

	  <section title="Package">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x08</spanx> (mnemonic: <spanx
		style="verb">package</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( package {id}:Expr {namespaces}
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This creates a package identified by {id}. Packages are immutable: {id} MUST be
	      verifiable against the byte sequence {namespaces}. {namespaces} must be a series of
	      expressions each identifying a BULK namespace.
	    </t>
	  </section>

	  <section title="Import">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x09</spanx> (mnemonic: <spanx
		style="verb">import</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( import {base}:Nat {count}:Nat {id}:Expr
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This associates the first {count} namespaces in the package identified by {id} with a
	      continuous range of marker bytes starting at {base} within the scope of this
	      expression.
	    </t>
	    <t>
	      Example: <spanx style="verb">( import 28 3 0x0123456789ABCDEF )</spanx> associates
	      the first 3 namespaces of the package identified by <spanx
	      style="verb">0x0123456789ABCDEF</spanx> to the marker bytes 28, 29 and 30.
	    </t>
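	    <t>
	      The association made by this form can be pictured with the following Python sketch.
	      The function name and the representation of a namespace as a pair (package
	      identifier, index in the package) are assumptions made for the illustration only.
	    </t>
	    <figure><artwork><![CDATA[
```python
from typing import Dict, Tuple


def import_package(base: int, count: int,
                   package_id: bytes) -> Dict[int, Tuple[bytes, int]]:
    """Map a continuous range of markers to the first namespaces of a package.

    Marker base + i is associated with namespace number i of the package.
    """
    return {base + i: (package_id, i) for i in range(count)}


package = bytes.fromhex("0123456789ABCDEF")
markers = import_package(28, 3, package)
# markers now has the keys 28, 29 and 30
```
]]></artwork></figure>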
	  </section>
	</section>

	<section title="Definitions">
	  <t>
	    To define a reference is to change the value of its name in its namespace (as
	    identified by its unique identifier, not the marker value) within a certain scope.
	  </t>
	  <t>
	    If a BULK stream is not evaluated, the semantics of a definition are entirely
	    application-dependent.
	  </t>
	  <t>
	    When a BULK stream containing definitions for a namespace comes from a trusted source
	    (e.g. in configuration files of the application, or in the communication with an agent
	    that has been granted the relevant authority), an application MAY give those
	    definitions long-lasting semantics (i.e. keep the values of the names at the end of
	    parsing). This is the preferred mechanism for bulk namespace definition when the
	    semantics of the defined expressions can be expressed completely by BULK forms.
	  </t>

	  <section title="Simple definition">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x0A</spanx> (mnemonic: <spanx
		style="verb">define</spanx>)</t>
		<t hangText="shape">
		  <spanx style="verb">( define {ref}:Ref {value}:Expr )</spanx>
		  <vspace/>
		  <spanx style="verb">( define nil {value}:Expr )</spanx>
		  </t>
	      </list>
	    </t>
	    <t>
	      This defines the reference <spanx style="verb">{ref}</spanx> to the yield of <spanx
	      style="verb">{value}</spanx> in the outer scope of this form.
	    </t>
	    <t>
	      In any context where there is a default namespace where definitions are made,
	      e.g. <xref target="verifiable"><spanx style="verb">verifiable-ns</spanx></xref>, the
	      second shape defines the smallest name that is not yet defined to <spanx
	      style="verb">{value}</spanx>.
	    </t>
	  </section>

	  <section title="Named definition">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x0B</spanx> (mnemonic: <spanx
		style="verb">mnemonic/def</spanx>)</t>
		<t hangText="shape">
		  <spanx style="verb">( mnemonic/def {ref}:Ref {mnemonic}:String
		  {doc}:Expr {value} )</spanx>
		  <vspace/>
		  <spanx style="verb">( mnemonic/def nil {mnemonic}:String
		  {doc}:Expr {value} )</spanx>
		</t>
	      </list>
	    </t>
	    <t>
	      This suggests <spanx style="verb">{mnemonic}</spanx> as the mnemonic of the name
	      designated by <spanx style="verb">{ref}</spanx> in its namespace. If <spanx
	      style="verb">{value}</spanx> is of type Expr, this defines the reference <spanx
	      style="verb">{ref}</spanx> to <spanx style="verb">{value}</spanx> in the scope of this
	      form.
	    </t>
	    <t>
	      <spanx style="verb">{doc}</spanx> is any expression that provides documentation for
	      this reference. If it has type Bytes, it MUST be a string. It could be any kind of
	      metadata or document type.
	    </t>
	    <t>
	      In any context where there is a default namespace where definitions are made,
	      e.g. <xref target="verifiable"><spanx style="verb">verifiable-ns</spanx></xref>, the
	      second shape defines the smallest name that is not yet defined to <spanx
	      style="verb">{value}</spanx>.
	    </t>
	  </section>

	  <section title="Namespace description">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x0C</spanx> (mnemonic: <spanx
		style="verb">ns-mnemonic</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( ns-mnemonic {ns}:Expr {mnemonic}:String
		{doc} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This suggests {mnemonic} as the mnemonic of the namespace designated by {ns} (which
	      can be the integer to which this namespace is associated, a reference in this
	      namespace or the unique identifier of this namespace).
	    </t>
	  </section>

	  <section anchor="verifiable" title="Verifiable namespace definition">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x0D</spanx> (mnemonic: <spanx
		style="verb">verifiable-ns</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( verifiable-ns {marker}:Nat {id}:UniqueID
		{data}:Expr {mnemonic}:Expr {definitions} )</spanx></t>
		<t hangText="inner scope"><spanx style="verb">{id} {data} {mnemonic} {definitions}</spanx></t>
	      </list>
	    </t>
	    <t>
	      This associates the namespace identified by {id} to the namespace marker of {marker},
	      within the scope of this form. Verifiable namespaces are immutable: {id} MUST be
	      verifiable against the byte sequence <spanx style="verb">{data} {mnemonic}
	      {definitions}</spanx>. The semantics of this form is to define in its scope any
	      definition made within <spanx style="verb">{definitions}</spanx>.
	    </t>
	    <t>
	      If {mnemonic} is of type <spanx style="verb">String</spanx>, then this suggests it as
	      the mnemonic of the namespace. Else it MUST be <spanx style="verb">nil</spanx>.
	    </t>
	    <t>
	      If more data than {id} is needed to verify {id} against {definitions} (like the salt
	      of a hash function, or the namespace of a UUID), this data should be provided by
	      {data}. Else {data} MUST be <spanx style="verb">nil</spanx>.
	    </t>
	    <t>
	      A verifiable namespace wouldn't really be immutable if it used definitions from other
	      namespaces that aren't immutable. To that effect, an application SHOULD stop
	      processing this form with an error when <spanx style="verb">{definitions}</spanx>
	      contains references from namespaces that cannot be determined to be immutable
	      themselves. The goal is to prevent a user or system from being unwittingly vulnerable, so
	      an application MAY provide an option to accept a specific verifiable namespace, but an
	      application MUST NOT provide an option to accept any vulnerable verifiable
	      namespace. That is, an option like <spanx style="verb">--accept-ns
	      8f82849556d74466</spanx> is acceptable but <spanx
	      style="verb">--disable-immutability-check</spanx> is not.
	    </t>
	  </section>

	  <section title="Array concatenation">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x0E</spanx> (mnemonic: <spanx
		style="verb">concat</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( concat {array1}:Bytes {array2}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      <list style="hanging">
		<t hangText="Name's type">EagerFunction</t>
		<t hangText="Form's type">Bytes</t>
		<t hangText="Form's value">the concatenation of {array1} and {array2}.</t>
	      </list>
	    </t>
	    <t>
	      The value of this form is an array that contains the bytes in array1 followed by the
	      bytes in array2.
	    </t>
	  </section>

	  <section title="Substitution">
	    <section title="Substitution function">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x0F</spanx> (mnemonic: <spanx
		  style="verb">subst</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( subst {code} )</spanx></t>
		</list>
	      </t>
	      <t>
		<list style="hanging">
		  <t hangText="Name's type">LazyFunction</t>
		  <t hangText="Form's type">EagerFunction</t>
		  <t hangText="Form's value">A substitution function whose return value is the
		  value of {code}. Within {code}'s specific yield, the names <spanx
		  style="verb">arg</spanx> and <spanx style="verb">rest</spanx> are defined:</t>
		</list>
	      </t>
	    </section>
	    <section title="Argument">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x10</spanx> (mnemonic: <spanx
		  style="verb">arg</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( arg {n}:Nat )</spanx></t>
		</list>
	      </t>
	      <t>
		<list style="hanging">
		  <t hangText="Name's type">EagerFunction</t>
		  <t hangText="Form's type">Expr</t>
		  <t hangText="Form's value">the element number {n} (starting at zero) of the
		  substitution function's arguments list</t>
		</list>
	      </t>
	    </section>
	    <section title="Rest of arguments list">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x11</spanx> (mnemonic: <spanx
		  style="verb">rest</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( rest {n}:Nat )</spanx></t>
		</list>
	      </t>
	      <t>
		<list style="hanging">
		  <t hangText="Name's type">EagerFunction</t>
		  <t hangText="Form's type">Expr</t>
		  <t hangText="Form's value">the substitution function's arguments list without its
		  first {n} elements.</t>
		</list>
	      </t>
	      <section title="Examples">
		<t>Here is a definition of the inverse followed by the numbers 1/2, 1/3 and 1/4:</t>
		<t><spanx style="verb">( define inverse ( subst ( frac 1 ( arg 0 ) ) ) ) ( inverse
		2 ) ( inverse 3 ) ( inverse 4 )</spanx></t>
		<t>Substitution will splice multiple expressions in place:</t>
		<t>The evaluation of <spanx style="verb">( ( subst 1 ( rest 0 ) 2 ) 3 4 )</spanx>
		must yield the same as <spanx style="verb">( 1 3 4 2 )</spanx>.</t>
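		<t>
		  The splicing behaviour of these two examples can be imitated with the Python sketch
		  below. It is a simplified model, not an evaluator: forms are represented as tuples,
		  names as plain strings, and the function name <spanx
		  style="verb">substitute</spanx> is hypothetical. Nested substitution functions are
		  not given their own arguments list.
		</t>
		<figure><artwork><![CDATA[
```python
from typing import Any, List, Sequence


def substitute(body: Sequence[Any], args: Sequence[Any]) -> List[Any]:
    """Expand a sequence of expressions against an arguments list.

    ("arg", n) is replaced by argument number n (starting at zero);
    ("rest", n) is replaced by all arguments after the first n, spliced
    in place. Other forms are expanded recursively.
    """
    result: List[Any] = []
    for expr in body:
        if isinstance(expr, tuple) and len(expr) == 2 and expr[0] == "arg":
            result.append(args[expr[1]])
        elif isinstance(expr, tuple) and len(expr) == 2 and expr[0] == "rest":
            result.extend(args[expr[1]:])
        elif isinstance(expr, tuple):
            result.append(tuple(substitute(expr, args)))
        else:
            result.append(expr)
    return result


# ( inverse 2 ) with inverse defined as ( subst ( frac 1 ( arg 0 ) ) )
inverse_of_two = substitute([("frac", 1, ("arg", 0))], [2])

# ( ( subst 1 ( rest 0 ) 2 ) 3 4 )
spliced = substitute([1, ("rest", 0), 2], [3, 4])
```
]]></artwork></figure>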
	      </section>
	    </section>

	  </section>
	</section>

	<section title="Arithmetic">
	  <t>
	    A processing application must recognize the type of all expressions defined in this
	    specification that have the type Number, but an application MAY consider a number as
	    having an unknown value if it has no adequate data type to store it.
	  </t>
	  <t>
	    In the text notation of a BULK stream, a decimal integer represents the smallest byte
	    sequence that yields this integer. For example, <spanx style="verb">( 31 256 )</spanx>
	    is a notation for the bytes <spanx style="verb">0x01 0x9F 0xC20100 0x02</spanx>.
	  </t>

	  <section title="Unsigned integer">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x20</spanx> (mnemonic: <spanx
		style="verb">unsigned-int</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( unsigned-int {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      The bits contained in {bits} are the value of this integer in binary notation. This
	      form exists in case disambiguation of the semantics is necessary.
	    </t>
	    <t>
	      Type: <spanx style="verb">Number</spanx>, <spanx style="verb">Int</spanx>, <spanx
	      style="verb">Nat</spanx>.
	    </t>
	  </section>

	  <section title="Signed integer">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x21</spanx> (mnemonic: <spanx
		style="verb">signed-int</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( signed-int {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      The bits contained in {bits} are the value of this integer in two's-complement
	      notation.
	    </t>
	    <t>
	      Type: <spanx style="verb">Number</spanx>, <spanx style="verb">Int</spanx>.
	    </t>
	  </section>

	  <section title="Fraction">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x22</spanx> (mnemonic: <spanx
		style="verb">frac</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( frac {num}:Int {div}:Int )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is the number {num}/{div}.
	    </t>
	    <t>
	      Type: <spanx style="verb">Number</spanx>.
	    </t>
	  </section>

	  <section title="Binary floating-point number">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x23</spanx> (mnemonic: <spanx
		style="verb">binary-float</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( binary-float {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a floating-point number expressed in IEEE 754-2008 binary interchange
	      format. {bits} can be of size 16, 32, 64, 128 or any bigger multiple of 32 bits, as
	      per IEEE 754-2008 rules.
	    </t>
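	    <t>
	      For the three most common sizes, decoding could look like the following Python
	      sketch. It assumes that {bits} is available as a byte string in big-endian order,
	      which this section does not state, and it leaves out the 128-bit and larger formats
	      because Python's <spanx style="verb">struct</spanx> module has no format code for
	      them. The function name is hypothetical.
	    </t>
	    <figure><artwork><![CDATA[
```python
import struct

# Byte length of {bits} -> struct format (big-endian is an assumption).
_BINARY_FORMATS = {2: ">e", 4: ">f", 8: ">d"}


def decode_binary_float(bits: bytes) -> float:
    """Decode a binary16, binary32 or binary64 interchange value."""
    try:
        fmt = _BINARY_FORMATS[len(bits)]
    except KeyError:
        raise ValueError("unsupported size: %d bytes" % len(bits)) from None
    return struct.unpack(fmt, bits)[0]
```
]]></artwork></figure>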
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Float</spanx>.
	    </t>
	  </section>

	  <section title="Decimal floating-point number">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x24</spanx> (mnemonic: <spanx
		style="verb">decimal-float</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( decimal-float {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a floating-point number expressed in IEEE 754-2008 decimal interchange
	      format. {bits} can be of size 32, 64, 128 or any bigger multiple of 32 bits, as per
	      IEEE 754-2008 rules.
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Float</spanx>.
	    </t>
	  </section>

	  <section title="Binary fixed point number">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x25</spanx> (mnemonic: <spanx
		style="verb">binary-fixed</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( binary-fixed {point}:Nat {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a fixed point binary number. <spanx style="verb">{bits}</spanx> contains an
	      integer in two's complement. That integer divided by 2^point is the value of this
	      form. For example, <spanx style="verb">( binary-fixed 2 15 )</spanx> has value <spanx
	      style="verb">3.75<sub>10</sub></spanx> (<spanx
	      style="verb">11.11<sub>2</sub></spanx>).
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Float</spanx>.
	    </t>
	  </section>

	  <section title="Decimal fixed point number">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x26</spanx> (mnemonic: <spanx
		style="verb">decimal-fixed</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( decimal-fixed {point}:Nat {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a fixed point decimal number. <spanx style="verb">{bits}</spanx> contains an
	      integer in two's complement. That integer divided by 10^point is the value of this
	      form. For example, <spanx style="verb">( decimal-fixed 2 123 )</spanx> has value
	      <spanx style="verb">1.23</spanx>.
	    </t>
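	    <t>
	      This form and <spanx style="verb">binary-fixed</spanx> differ only in the base of
	      the divisor, so both can be illustrated by one Python sketch. It assumes that
	      {bits} is a big-endian byte string (an assumption, not stated here) and returns an
	      exact fraction to avoid rounding. The function name <spanx
	      style="verb">decode_fixed</spanx> is hypothetical.
	    </t>
	    <figure><artwork><![CDATA[
```python
from fractions import Fraction


def decode_fixed(point: int, bits: bytes, base: int) -> Fraction:
    """Value of a fixed point form: the integer in bits over base ** point.

    Use base=2 for binary-fixed and base=10 for decimal-fixed.
    """
    integer = int.from_bytes(bits, "big", signed=True)  # two's complement
    return Fraction(integer, base ** point)


decimal_example = decode_fixed(2, bytes([123]), 10)  # ( decimal-fixed 2 123 )
binary_example = decode_fixed(2, bytes([15]), 2)     # ( binary-fixed 2 15 )
```
]]></artwork></figure>
	    <t>
	      Here <spanx style="verb">decimal_example</spanx> equals 123/100 and <spanx
	      style="verb">binary_example</spanx> equals 15/4, that is 1.23 and 3.75.
	    </t>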
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Float</spanx>.
	    </t>
	  </section>

	  <section title="Decimal fixed point number with 2 decimal places">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x27</spanx> (mnemonic: <spanx
		style="verb">decimal2</spanx>)</t>
		<t hangText="value"><spanx style="verb">( subst ( decimal-fixed 2 ( arg 0 ) ) )</spanx></t>
	      </list>
	    </t>
	  </section>
	</section>

	<section title="Compact formats">
	  <t>
	    This specification and other specifications in the official BULK suite take the option
	    to use as their basic building block a form with a distinguishing reference as first
	    element (basically, they are a binary representation of an abstract syntax tree). As
	    noted previously, this means that most representations weigh 4 bytes plus their actual
	    content, which will in turn have some overhead because of one or several marker bytes.
	  </t>
	  <t>
	    But when there is a special need for compactness, BULK makes it possible to design
	    protocols and formats with different trade-offs, while retaining its property of being
	    parseable by processing applications not knowing the protocol in its entirety.
	  </t>
	  <t>
	    On one end of the spectrum, a format might choose to use an array to encapsulate an ad
	    hoc binary format. An extreme use of this scheme would be to use BULK just to make
	    explicit the binary format used. With a known profile (for example with a file extension
	    and/or media type for such explicitly typed BLOBs), a BULK stream that consists solely
	    of the version form, a reference that describes the binary format and an array will have
	    an overhead between 11 and 20 bytes depending on the size of the content. Without a
	    profile, with the namespaces associations, the overhead is between 45 and 54 bytes.
	  </t>
	  <t>
	    Still, even this extreme in the design space retains the ability to insert expressions
	    in the BULK stream, whatever their type. Thus metadata can be added about data that is
	    represented in a format that doesn't allow for metadata or for limited metadata.
	  </t>
	  <t>
	    In-between these two extremes, several options are available to produce a format that
	    leverages the BULK parser a lot more while being more compact than a basic BULK
	    format. The following forms provide a standard way to create such formats.
	  </t>
	  <t>
	    A flat list of operators and operands is called a BULK bytecode. Prefix bytecodes are
	    those where operators come before operands, postfix bytecodes are those where operators
	    come after operands. In the following forms, operators MUST be references.
	  </t>
	  <t>
	    The default semantics of a bytecode form is to transform it to an abstract syntax tree
	    of its content and then evaluate the resulting expression with the normal BULK
	    evaluation rules. When evaluating a bytecode form that doesn't provide arities, a
	    processing application MUST abort this transformation as soon as it encounters a
	    reference for which it cannot determine if it is an operator or its arity. When
	    evaluating a bytecode form that provides arities, any reference that is not known to be
	    an operator MUST be determined to be an operand.
	  </t>
	  <t>
	    To transform a prefix bytecode form, a processing application creates an alternate
	    context. If the first expression of the bytecode can be determined to be an operand, it
	    is removed from the beginning of the bytecode and appended at the end of the alternate
	    context. If the first expression of the bytecode is a reference that can be determined
	    to be an operator, it is removed from the beginning of the bytecode and a list is
	    created with the operator as the first expression, then as many next expressions as its
	    arity are removed from the beginning of the bytecode and appended at the end of this
	    list. Then that resulting list is appended at the end of the alternate context. The
	    transformation continues until the bytecode is empty, in which case the alternate
	    context replaces the bytecode form and the transformation is complete. The resulting
	    form can then be evaluated in turn.
	  </t>
	  <t> Example: the default semantics of </t>
	  <t><spanx style="verb">( prefix* ( ( 2 sgf:black ) ) sgf:game sgf:black 1 2
	  sgf:black 3 4 sgf:black 5 6 )</spanx> </t>
	  <t>is that it's transformed into</t>
	  <t><spanx style="verb">( sgf:game ( sgf:black 1 2 ) ( sgf:black 3 4 ) ( sgf:black 5 6 )
	  )</spanx> </t>
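	  <t>
	    The prefix transformation described above can be sketched in Python as follows. The
	    sketch assumes a form that provides arities (so any unknown reference is an operand),
	    models references as strings and the arities as a dictionary, and takes operands as
	    they come without transforming them further, as in the description. The function name
	    <spanx style="verb">transform_prefix</spanx> is hypothetical.
	  </t>
	  <figure><artwork><![CDATA[
```python
from typing import Any, Dict, List, Sequence, Tuple


def transform_prefix(bytecode: Sequence[Any],
                     arities: Dict[str, int]) -> Tuple[Any, ...]:
    """Turn a flat prefix bytecode into the form that replaces it."""
    pending: List[Any] = list(bytecode)
    context: List[Any] = []  # the alternate context
    while pending:
        head = pending.pop(0)
        arity = arities.get(head) if isinstance(head, str) else None
        if arity is None:
            context.append(head)  # operand: moved to the context as is
            continue
        if len(pending) < arity:
            raise ValueError("not enough operands for %s" % head)
        operands, pending = pending[:arity], pending[arity:]
        context.append((head, *operands))  # operator followed by operands
    return tuple(context)


moves = transform_prefix(
    ["sgf:game", "sgf:black", 1, 2, "sgf:black", 3, 4, "sgf:black", 5, 6],
    {"sgf:black": 2},
)
```
]]></artwork></figure>
	  <t>
	    The value of <spanx style="verb">moves</spanx> is the nested structure of the
	    example: <spanx style="verb">sgf:game</spanx> followed by three <spanx
	    style="verb">sgf:black</spanx> forms.
	  </t>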
	  <t>
	    To transform a postfix bytecode form, a processing application creates a data stack. If
	    the first expression of the bytecode can be determined to be an operand, it is removed
	    from the beginning of the bytecode and pushed on top of the stack. If the first
	    expression of the bytecode can be determined to be an operator, it is removed from the
	    beginning of the bytecode and a list is created with the operator as the first
	    expression, then as many next expressions as its arity are popped from the stack and
	    appended at the end of this list. Then that resulting list is pushed on top of the
	    stack. The transformation continues until the bytecode is empty, in which case the list
	    of the elements on the stack (with the top of the stack as the last element) replaces
	    the bytecode form and the transformation is complete. The resulting form can then be
	    evaluated in turn.
	  </t>
	  <t> Example: the default semantics of </t>
	  <t><spanx style="verb">( postfix* ( ( 2 sgf:black ) ) sgf:game 2 1 sgf:black
	  4 3 sgf:black 6 5 sgf:black )</spanx> </t>
	  <t>is that it's transformed into</t>
	  <t><spanx style="verb">( sgf:game ( sgf:black 1 2 ) ( sgf:black 3 4 ) ( sgf:black 5 6 )
	  )</spanx> </t>
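	  <t>
	    The postfix transformation can be sketched in the same way. As before, this is an
	    illustration under assumptions: the form provides arities, references are modelled as
	    strings, and the function name <spanx style="verb">transform_postfix</spanx> is
	    hypothetical.
	  </t>
	  <figure><artwork><![CDATA[
```python
from typing import Any, Dict, List, Sequence, Tuple


def transform_postfix(bytecode: Sequence[Any],
                      arities: Dict[str, int]) -> Tuple[Any, ...]:
    """Turn a flat postfix bytecode into the form that replaces it."""
    stack: List[Any] = []  # the data stack, top of the stack last
    for expr in bytecode:
        arity = arities.get(expr) if isinstance(expr, str) else None
        if arity is None:
            stack.append(expr)  # operand: pushed on top of the stack
            continue
        if len(stack) < arity:
            raise ValueError("not enough operands for %s" % expr)
        # Operands are popped one by one and appended in popping order.
        operands = [stack.pop() for _ in range(arity)]
        stack.append((expr, *operands))
    return tuple(stack)


moves = transform_postfix(
    ["sgf:game", 2, 1, "sgf:black", 4, 3, "sgf:black", 6, 5, "sgf:black"],
    {"sgf:black": 2},
)
```
]]></artwork></figure>
	  <t>
	    Because operands are appended in the order they are popped, the pair pushed as
	    <spanx style="verb">2 1</spanx> becomes <spanx style="verb">( sgf:black 1 2
	    )</spanx>, which is why the operands appear reversed in the bytecode of the example.
	  </t>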
	  <t>
	    If the overhead of several marker bytes in the operands of some operators is too much,
	    even more compactness can be achieved by packing together small operands. For example,
	    instead of an operator with two integers as its operands, one could specify an operator
	    to take a single word as operand and extract the integers from it (while still retaining
	    the ability to operate on many sizes of integers, because it can still deduce the size
	    of the integers by dividing the size of the word by two).
	  </t>
	  <t>
	    For example, a BULK format representing player moves with a pair of coordinates might
	    represent a single move with the following shapes:
	  </t>
	  <t>
	    <list style="hanging">
	      <t hangText="basic (8 bytes)"><spanx style="verb">( sgf:black/2 #[1] 0x41 #[1] 0x5A
	      )</spanx></t>
	      <t hangText="packed basic (7 bytes)"><spanx style="verb">( sgf:black/1 #[2] 0x41 0x5A
	      )</spanx></t>
	      <t hangText="bytecode (6 bytes)"><spanx style="verb">sgf:black/2 #[1] 0x41 #[1]
	      0x5A</spanx></t>
	      <t hangText="packed bytecode (5 bytes)"><spanx style="verb">sgf:black/1 #[2] 0x41
	      0x5A</spanx></t>
	    </list>
	  </t>
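	  <t>
	    The packed shapes rely on the operator splitting a single word into its two
	    coordinates. A minimal Python sketch of that step is shown below; it assumes unsigned
	    big-endian integers of equal size, and the function name <spanx
	    style="verb">unpack_pair</spanx> is hypothetical.
	  </t>
	  <figure><artwork><![CDATA[
```python
from typing import Tuple


def unpack_pair(word: bytes) -> Tuple[int, int]:
    """Split a word into two integers, each half the size of the word."""
    if len(word) == 0 or len(word) % 2 != 0:
        raise ValueError("word must have a non-zero, even number of bytes")
    half = len(word) // 2  # size of each integer, deduced from the word
    first = int.from_bytes(word[:half], "big")
    second = int.from_bytes(word[half:], "big")
    return first, second


coordinates = unpack_pair(bytes([0x41, 0x5A]))  # content of #[2] 0x41 0x5A
```
]]></artwork></figure>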
	  <t>
	    The transformation defined for the bytecode forms makes it possible to mix literal
	    expressions and operations represented by a list of operators and operands. In the
	    previous scenario, for example, one might represent alternating moves by two players as
	    a list of words, lowering the weight of each move to 3 bytes when coordinates are
	    below 256. The difference between all these schemes and an array is that you keep the
	    ability to insert other forms, for example to represent comments on the game or
	    variants.
	  </t>
	  <t>
	    The cost of the bytecode format is that if it contains operators whose arity is unknown
	    to a processing application, the whole list after the first occurrence of them is
	    unreadable to that processing application, whereas in the basic format, the processing
	    application can still process all the forms it understands (and it requires no
	    anticipation by the application creating the BULK stream).
	  </t>

	  <section title="Prefix bytecode">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x30</spanx> (mnemonic: <spanx
		style="verb">prefix</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( prefix {bytecode} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a prefix bytecode form that doesn't provide arities.
	    </t>
	  </section>

	  <section title="Prefix bytecode with arities">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x31</spanx> (mnemonic: <spanx
		style="verb">prefix*</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( prefix* {arities}:Expr {bytecode}
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a prefix bytecode form that provides arities.
	    </t>
	    <t>
	      <spanx style="verb">{arities}</spanx> MUST be a list of forms of shape <spanx
	      style="verb">( {arity}:Nat {refs} )</spanx>, where <spanx
	      style="verb">{refs}</spanx> MUST be a series of references. Each such form declares
	      that all references in its series are operators of arity <spanx
	      style="verb">{arity}</spanx>. <spanx style="verb">{arities}</spanx> can be given as a
	      literal form or as a reference whose value is such a list.
	    </t>
	    <t>
	      Within the prefix bytecode of this form, if there is a <spanx
	      style="verb">prefix</spanx> form, the arities declared in the outside form still
	      apply.
	    </t>
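	    <t>
	      The following Python sketch illustrates how declared arities let a processing
	      application group a flat prefix sequence. It is not normative and rests on
	      assumptions: references are modelled as Python strings, operands as any other value,
	      each operator is assumed to consume exactly the next <spanx
	      style="verb">{arity}</spanx> grouped items, and the names <spanx
	      style="verb">arity_table</spanx> and <spanx style="verb">group_prefix</spanx> are
	      hypothetical.
	    </t>
	    <figure>
	      <artwork><![CDATA[
```python
def arity_table(arities):
    """Flatten [(arity, [ref, ...]), ...] into {ref: arity}."""
    table = {}
    for arity, refs in arities:
        for ref in refs:
            table[ref] = arity
    return table


def group_prefix(tokens, table):
    """Group a flat prefix sequence into nested lists, one per operator.

    Assumption: strings are references, anything else is an operand.
    References without a declared arity are treated as operands.
    """
    def read(pos):
        head = tokens[pos]
        pos += 1
        if not isinstance(head, str) or head not in table:
            return head, pos              # operand: passed through unchanged
        form = [head]
        for _ in range(table[head]):      # consume exactly `arity` items
            item, pos = read(pos)
            form.append(item)
        return form, pos

    out, pos = [], 0
    while pos < len(tokens):
        item, pos = read(pos)
        out.append(item)
    return out


table = arity_table([(2, ["add", "mul"]), (1, ["neg"])])
grouped = group_prefix(["add", 1, "mul", 2, 3, "neg", 4], table)
```
]]></artwork>
	    </figure>
	    <t>
	      Under these assumptions, the seven items of the flat sequence are grouped into two
	      nested lists, one headed by <spanx style="verb">add</spanx> and one by <spanx
	      style="verb">neg</spanx>.
	    </t>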
	  </section>

	  <section title="Postfix bytecode">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x32</spanx> (mnemonic: <spanx
		style="verb">postfix</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( postfix {bytecode} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a postfix bytecode form that doesn't provide arities.
	    </t>
	  </section>

	  <section title="Postfix bytecode with arities">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x33</spanx> (mnemonic: <spanx
		style="verb">postfix*</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( postfix* {arities}:Expr {bytecode}
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a postfix bytecode form that provides arities.
	    </t>
	    <t>
	      <spanx style="verb">{arities}</spanx> MUST be a list of forms of shape <spanx
	      style="verb">( {arity}:Nat {refs} )</spanx>, where <spanx
	      style="verb">{refs}</spanx> MUST be a series of references. Each such form declares
	      that all references in its series are operators of arity <spanx
	      style="verb">{arity}</spanx>. <spanx style="verb">{arities}</spanx> can be given as a
	      literal form or as a reference whose value is such a list.
	    </t>
	    <t>
	      Within the postfix bytecode of this form, if there is a <spanx
	      style="verb">postfix</spanx> form, the arities declared in the outside form still
	      apply.
	    </t>
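	    <t>
	      As a non-normative illustration, a postfix sequence can be grouped with a stack. The
	      Python sketch below assumes that references are modelled as strings, that an operator
	      of arity N applies to the N most recently completed items, and that the resulting
	      list places the operator first. The name <spanx style="verb">group_postfix</spanx> is
	      hypothetical.
	    </t>
	    <figure>
	      <artwork><![CDATA[
```python
def group_postfix(tokens, table):
    """Group a flat postfix sequence using a {ref: arity} table.

    Assumption: strings are references, anything else is an operand.
    References without a declared arity are treated as operands.
    """
    stack = []
    for tok in tokens:
        arity = table.get(tok) if isinstance(tok, str) else None
        if arity is None:
            stack.append(tok)             # operand: pushed unchanged
            continue
        if arity > len(stack):
            raise ValueError(f"operator {tok!r} needs {arity} operands")
        split = len(stack) - arity
        args = stack[split:]              # the `arity` most recent items
        del stack[split:]
        stack.append([tok, *args])
    return stack


table = {"add": 2, "mul": 2, "neg": 1}
grouped = group_postfix([1, 2, 3, "mul", "add", 4, "neg"], table)
```
]]></artwork>
	    </figure>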
	  </section>

	  <section title="Arity declaration">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x34</spanx> (mnemonic: <spanx
		style="verb">arity</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( arity {arity}:Nat {refs} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      {refs} MUST be a series of references. It indicates that all references in this series
	      are operators of arity {arity}.
	    </t>
	    <t>
	      Whenever arities have been provided for some references in a namespace, a processing
	      application MUST treat all references in that namespace whose arities aren't provided
	      as operands.
	    </t>
	  </section>

	</section>
      </section>
    </section>

    <section title="Extension namespaces">
      <t>
	Extension namespaces are defined with a unique identifier, to be associated with a marker
	value.
      </t>
      <t>
	Because of BULK's decentralized nature, as far as a processing application is concerned,
	apart from standard namespaces, there is no difference between a namespace defined as part
	of the official BULK suite and a user-defined one.
      </t>
    </section>

    <section title="Profiles">
      <t>
	A profile is a byte sequence parsed by a processing application just after the <spanx
	style="verb">version</spanx> form or before the first expression if there is no <spanx
	style="verb">version</spanx> form. Thus a parser SHOULD look ahead at the beginning of a
	stream to see if the first three bytes are <spanx style="verb">( bulk:version</spanx>. With
	respect to the BULK stream, the profile is out-of-band information, usually implicit.
      </t>
      <t>
	A processing application doesn't need to include the profile in the concrete yield, as long
	as the semantics of the abstract yield are maintained.
      </t>
      <t>
	The same BULK stream might be processed with different profiles.
      </t>
      <t>
	A processing application MUST NOT deduce the profile from the content of a BULK stream.
      </t>

      <section title="Profile redundancy">
	<t>
	  A processing application SHOULD only rely on the use of a profile when it is a safe
	  assumption that the profile is known, for example within a communication where the
	  protocol dictates the profile.
	</t>
	<t>
	  In particular, long-term storage of a BULK stream SHOULD preserve profile information, for
	  example with a media type that dictates the profile.
	</t>
	<t>
	  Otherwise, an application writing a BULK stream to long-term storage SHOULD include the
	  profile after the version form. For this reason, the expressions in a profile SHOULD have
	  idempotent semantics.
	</t>
      </section>

      <section title="Standard profile">
	<t>
	  This specification defines the default profile that a processing application MUST use when
	  it is not using a specific profile:
	</t>
	<t>
	  <spanx style="verb">( bulk:stringenc ( bulk:iana-charset 106 ) )</spanx>
	</t>
	<t>
	  This means that the default string encoding in a BULK stream is UTF-8.
	</t>
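	<t>
	  As a non-normative illustration, an application could resolve the string encoding from
	  the MIBenum value given to <spanx style="verb">bulk:iana-charset</spanx> as sketched
	  below in Python. The table is deliberately partial, and the names <spanx
	  style="verb">MIBENUM_TO_CODEC</spanx> and <spanx style="verb">decode_string</spanx> are
	  hypothetical.
	</t>
	<figure>
	  <artwork><![CDATA[
```python
# Partial map from IANA MIBenum values to Python codec names.
MIBENUM_TO_CODEC = {3: "ascii", 4: "latin-1", 106: "utf-8"}

# MIBenum 106 (UTF-8) is the encoding set by the default profile.
DEFAULT_MIBENUM = 106


def decode_string(raw, mibenum=DEFAULT_MIBENUM):
    """Decode the bytes of a string with the encoding currently in force."""
    try:
        codec = MIBENUM_TO_CODEC[mibenum]
    except KeyError:
        raise ValueError(f"unsupported IANA charset {mibenum}") from None
    return raw.decode(codec)
```
]]></artwork>
	</figure>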
      </section>
    </section>

    <section title="Security Considerations" anchor="sec">
      <section title="Parsing">
	<t>
	  Parsing a BULK stream is designed to be free of side-effects for the processing application,
	  apart from storing the parsed results.
	</t>
	<t>
	  Arrays in BULK carry their size, so that the application knows in advance the size of
	  the data to read and store, which makes it easier to build robust code. Malicious
	  software, however, may announce an array with a size chosen to make an application
	  exhaust its available memory. When a BULK stream has been completely received, an array
	  bigger than the remaining data SHOULD trigger an error. When a BULK stream's size is not
	  known in advance, the application SHOULD use a growable data structure.
	</t>
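	<t>
	  The two recommendations above can be sketched as follows in Python. This is a minimal,
	  non-normative illustration: <spanx style="verb">read_array</spanx>, its parameters and
	  the chunk size are assumptions, and <spanx style="verb">stream</spanx> is any object with
	  a <spanx style="verb">read(n)</spanx> method returning bytes.
	</t>
	<figure>
	  <artwork><![CDATA[
```python
import io

CHUNK = 64 * 1024  # arbitrary upper bound on a single read


def read_array(stream, declared_size, remaining=None):
    """Read an array payload without trusting its announced size.

    `remaining` is the number of bytes left in the stream when the whole
    stream has already been received, or None when that is unknown.
    """
    if remaining is not None and declared_size > remaining:
        raise ValueError("array larger than remaining data")
    buf = bytearray()                     # grows only as data arrives
    while len(buf) < declared_size:
        chunk = stream.read(min(CHUNK, declared_size - len(buf)))
        if not chunk:
            raise EOFError("stream ended inside an array")
        buf += chunk
    return bytes(buf)


payload = read_array(io.BytesIO(b"abcdef"), 4, remaining=6)
```
]]></artwork>
	</figure>
	<t>
	  Memory use is then bounded by the data actually received plus one chunk, whatever size
	  the stream announces.
	</t>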
      </section>
      <section title="Forwarding">
	<t>
	  When a processing application forwards all or part of the data in a BULK stream to another
	  application, care must be taken if part of the forwarded data was not entirely recognized,
	  as it could be used by an attacker to benefit from the authority the forwarding
	  application has on the recipient of the data.
	</t>
      </section>
      <section title="Definitions">
	<t>
	  The architecture of a processing application SHOULD ensure that a malicious agent cannot
	  abuse authority given to it to define a namespace in order to modify associations in other
	  namespaces. Depending on the use of data structures storing BULK expressions, this could
	  amount to giving an attacker a way to manipulate the application's state. See <xref
	  target="robustNS"/> for an example of architecture that is resistant to that kind of
	  attack.
	</t>
      </section>
    </section>

    <section title="IANA Considerations">
      <t>
	This specification defines a new media type, application/bulk. The information for its
	registration with IANA is as follows:
      </t>
      <t>
	<list style="hanging">
	  <t hangText="Type name">application</t>
	  <t hangText="Subtype name">bulk</t>
	  <t hangText="Required parameters">none</t>
	  <t hangText="Optional parameters">none</t>
	  <t hangText="Encoding considerations">none, content is self-describing</t>
	  <t hangText="Security considerations">cf. <xref target="sec"/></t>
	  <t hangText="Interoperability considerations">the constraint to start any BULK file with a
	  version form has the side-effect that classes of BULK streams can be identified by a
	  sequence of bytes acting as a "magic number", at offset 0:
	  <list style="hanging">
	    <t hangText="0x012000">any BULK stream</t>
	    <t hangText="0x012000C1">a BULK stream of major version 1</t>
	    <t hangText="0x012000C1C002">a BULK stream of version 1.0</t>
	  </list>
	  </t>
	  <t hangText="Published specification">this document</t>
	  <t hangText="Applications that use this media type">none so far</t>
	  <t hangText="Fragment identifier considerations">this specification defines no semantics
	  for addressing the data with a fragment identifier; a future specification MAY define
	  fragment identifier syntaxes to address the content by byte offset or the parsed results
	  by their position in the yielded list</t>
	  <t hangText="Additional information">a future specification MAY define a naming convention
	  for media types based on bulk with a +bulk suffix, as for XML with +xml</t>
	</list>
      </t>
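      <t>
	As a non-normative illustration, a file-type detector could use the magic numbers listed
	under "Interoperability considerations" as sketched below in Python. The function name
	<spanx style="verb">classify</spanx> and its return values are hypothetical.
      </t>
      <figure>
	<artwork><![CDATA[
```python
MAGIC_ANY = bytes.fromhex("012000")
MAGIC_MAJOR_1 = bytes.fromhex("012000C1")
MAGIC_1_0 = bytes.fromhex("012000C1C002")


def classify(head):
    """Classify the first bytes of a file, most specific match first."""
    if head.startswith(MAGIC_1_0):
        return "BULK 1.0"
    if head.startswith(MAGIC_MAJOR_1):
        return "BULK major version 1"
    if head.startswith(MAGIC_ANY):
        return "BULK"
    return None                           # no leading version form
```
]]></artwork>
      </figure>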
    </section>

    <section title="Acknowledgements">
      <t>
	The original author of this specification read <eref
	target="http://www.schnada.de/grapt/eriknaggum-xmlrant.html">Erik Naggum's famous rant about
	XML</eref> several years before, and while forgotten as such, it clearly was the seed that
	slowly bloomed into the design of BULK. This format is dedicated to Erik.
      </t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <reference anchor="RFC2119">
        <front>
          <title>
            Key words for use in RFCs to Indicate Requirement Levels
          </title>
          <author initials="S." surname="Bradner" fullname="Scott Bradner">
            <organization>Harvard University</organization>
            <address><email>sob@harvard.edu</email></address>
          </author>
          <date month="March" year="1997"/>
        </front>
        <seriesInfo name="BCP" value="14"/>
        <seriesInfo name="RFC" value="2119"/>
      </reference>

      <reference anchor="IANA-Charsets" target="http://www.iana.org/assignments/character-sets">
        <front>
          <title>
	    IANA Charset Registry
          </title>
	  <author/>
	  <date/>
        </front>
      </reference>

    </references>


    <references title="Informative References">

      <reference anchor="HTTP2">
	<front>
	  <title>Hypertext Transfer Protocol version 2 (HTTP/2)</title>
	  <author initials="M." surname="Belshe" fullname="Mike Belshe">
	    <organization>BitGo</organization>
	    <address>
	      <email>mike@belshe.com</email>
	    </address>
	  </author>

	  <author initials="R." surname="Peon" fullname="Roberto Peon">
	    <organization>Google, Inc</organization>
	    <address>
	      <email>fenix@google.com</email>
	    </address>
	  </author>

	  <author initials="M." surname="Thomson" fullname="Martin Thomson" role="editor">
	    <organization>Mozilla</organization>
	    <address>
	      <postal>
		<street>331 E Evelyn Street</street>
		<city>Mountain View, CA</city>
		<code>94041</code>
		<country>US</country>
	      </postal>
	      <email>martin.thomson@gmail.com</email>
	    </address>
	  </author>
	  <date month="May" year="2015"/>
	</front>
        <seriesInfo name="RFC" value="7540"/>
      </reference>

      <reference anchor="Avro" target="http://avro.apache.org/docs/1.7.4/spec.html">
	<front>
	  <title>Apache Avro™ 1.7.4 Specification</title>
	  <author initials="D." surname ="Cutting" fullname="Doug Cutting">
            <organization>Cloudera</organization>
	  </author>
	  <date month="February" year="2013"/>
	</front>
      </reference>

      <reference anchor="protobuf" target="https://developers.google.com/protocol-buffers/">
	<front>
	  <title>Protocol Buffers</title>
	  <author/>
	  <date month="July" year="2008"/>
	</front>
      </reference>

      <reference anchor="Smile" target="http://wiki.fasterxml.com/SmileFormat">
	<front>
	  <title>Smile Data Format</title>
	  <author initials="T." surname ="Saloranta" fullname="Tatu Saloranta">
	    <address><email>tsaloranta@gmail.com</email></address>
	  </author>
	  <date month="September" year="2010"/>
	</front>
      </reference>

      <reference anchor="Thrift" target="http://thrift.apache.org/static/files/thrift-20070401.pdf">
	<front>
	  <title>Thrift: Scalable Cross-Language Services Implementation</title>
	  <author initials="M." surname ="Slee" fullname="Mark Slee">
	    <organization>Facebook</organization>
	    <address><email>mcslee@facebook.com</email></address>
	  </author>
	  <author initials="A." surname ="Agarwal" fullname="Aditya Agarwal">
	    <organization>Facebook</organization>
	    <address><email>aditya@facebook.com</email></address>
	  </author>
	  <author initials="M." surname ="Kwiatkowski" fullname="Marc Kwiatkowski">
	    <organization>Facebook</organization>
	    <address><email>marc@facebook.com</email></address>
	  </author>
	  <date month="April" year="2007"/>
	</front>
      </reference>

    </references>

    <section anchor="robustNS" title="Robust namespace definition">
      <t>
	This appendix suggests an architecture for a BULK processing application. It has the
	advantage that an agent cannot modify the values of names over which it has not
	specifically been given authority. This architecture doesn't ensure this property by
	checking the validity of definitions but by adhering to the Principle Of Least Authority,
	thus ensuring no false positives or TOCTOU race conditions.
      </t>
      <t>
	For each new context (including the abstract yield when parsing starts), the parser creates
	a new copy of each known namespace. These copies are available in this context to retrieve
	and define values. This implements the lexical scoping of definitions in addition to
	providing the robustness properties discussed here.
      </t>
      <t>
	By default, all namespaces created in a context are discarded at the end of this context.
      </t>
      <t>
	Of course, an implementation of the architecture presented here can be optimized compared to
	the abstract algorithm, for example by using copy-on-demand.
      </t>
      <t>
	A namespace that is not a copy made for its context, but the very object retained by the
	application afterwards, gives authority to make long-lasting definitions. Such a namespace
	is called lasting here.
      </t>
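      <t>
	The abstract algorithm above can be sketched as follows in Python. This is only an
	illustration under assumptions: namespaces are modelled as dictionaries, the class <spanx
	style="verb">Scopes</spanx> and its methods are hypothetical, and lasting namespaces are
	shared only with the outermost context (the abstract yield) while nested contexts always
	receive copies.
      </t>
      <figure>
	<artwork><![CDATA[
```python
class Scopes:
    """Per-context namespace tables (hypothetical helper, not from the spec)."""

    def __init__(self, known, lasting=()):
        self.known = known                # {namespace id: {name: value}}
        self.lasting = frozenset(lasting)
        self.stack = []

    def enter(self):
        if self.stack:
            # Nested context: copy what the enclosing context sees.
            scope = {ns: dict(t) for ns, t in self.stack[-1].items()}
        else:
            # Abstract yield: lasting namespaces are shared, others copied.
            scope = {ns: (t if ns in self.lasting else dict(t))
                     for ns, t in self.known.items()}
        self.stack.append(scope)

    def leave(self):
        self.stack.pop()                  # copies are simply discarded

    def define(self, ns, name, value):
        self.stack[-1][ns][name] = value

    def lookup(self, ns, name):
        return self.stack[-1][ns][name]


known = {"a": {"x": 1}, "b": {"y": 2}}
scopes = Scopes(known, lasting={"b"})
scopes.enter()                            # abstract yield
scopes.define("a", "x", 10)               # lands in a copy
scopes.define("b", "y", 20)               # lands in the lasting namespace
scopes.enter()                            # nested context
scopes.define("b", "y", 30)               # visible in this context only
inner = scopes.lookup("b", "y")
scopes.leave()
outer = scopes.lookup("b", "y")
scopes.leave()
```
]]></artwork>
      </figure>
      <t>
	Because a definition can only ever reach a table that was handed to the current context,
	no validity check is needed: after parsing, only the lasting namespace retains a
	definition made by the stream.
      </t>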
      <section title="Selective authority">
	<t>
	  A number of lasting namespaces are included for the abstract yield. Their unique
	  identifiers are agreed out-of-band. The disadvantage of this solution is that it needs
	  prior agreement on the definable namespaces.
	</t>
      </section>
      <section title="Open authority">
	<t>
	  Any <spanx style="verb">ns</spanx> form for a unique identifier unknown to the processing
	  application triggers the creation of a lasting namespace.
	</t>
	<t>
	  The disadvantage of this solution is that it opens a denial of service vulnerability. If
	  Bob is a processing application and Carol and Dave are agents communicating with Bob with
	  an open authority, Dave can prevent Carol from defining a namespace if it manages to
	  learn the unique identifier and to start a communication with Bob before Carol does.
	</t>
	<t>
	  If an agent uses a secure way to create unique identifiers, this solution is both flexible
	  and safe (the burden is not on the BULK processing application).
	</t>
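	<t>
	  As a non-normative illustration, an agent could draw its identifiers from a
	  cryptographically secure random source, so that another agent cannot guess them in
	  advance. The Python sketch below assumes that an identifier can be carried as an opaque
	  byte string; the function name and the 16-byte length are hypothetical choices, not
	  requirements of this specification.
	</t>
	<figure>
	  <artwork><![CDATA[
```python
import secrets


def new_namespace_id(nbytes=16):
    """Return an unpredictable identifier as opaque bytes (assumed format)."""
    return secrets.token_bytes(nbytes)


ns_id = new_namespace_id()
```
]]></artwork>
	</figure>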
      </section>
    </section>

  </back>
</rfc>