<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="lib/rfc2629.xslt"?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes"?>
<?rfc subcompact="no" ?>
<?rfc linkmailto="no" ?>
<?rfc editing="no" ?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc rfcedstyle="yes"?>
<?rfc-ext allow-markup-in-artwork="yes" ?>
<?rfc-ext include-index="no" ?>
<rfc ipr="trust200902"
category="exp"
submissionType="IETF"
docName="draft-thierry-bulk-04">
<front>
<title abbrev="BULK1">Binary Uniform Language Kit 1.0</title>
<author initials="P." surname="Thierry" fullname="Pierre Thierry">
<organization>Thierry Technologies</organization>
<address>
<email>pierre@nothos.net</email>
</address>
</author>
<date day="31" month="03" year="2024" />
<keyword>binary</keyword>
<abstract>
<t>
This specification describes a uniform, decentrally extensible and efficient format for
data serialization.
</t>
</abstract>
</front>
<middle>
<section anchor="intro" title="Introduction">
<section title="Rationale">
<t>
This specification aims at finding an original trade-off between uniformity, generality,
extensibility, decentralization, compactness and processing speed for a data format. It is
our opinion that every widely used existing format occupies a different position than this
one in the solution space for formats, that none is better on all axes, and that this one
is the current best on several axes, hence this new design. It is also our opinion that
some of those existing formats constitute an optimal solution for their specific use case,
either in an absolute sense, or at least at the time of their design. But the ever-changing
field of IT now faces new challenges that call for a new approach.
</t>
<t>
In particular, whereas the previous trend for Internet and Web standards and programming
tools has been to create human-readable syntaxes for data and protocols, the advent of
technologies like <xref target="protobuf">protocol buffers</xref>, <xref
target="Thrift">Thrift</xref>, the various binary serializations for JSON like <xref
target="Avro">Avro</xref> or <xref target="Smile">Smile</xref>, or the binary <xref
target="HTTP2">HTTP/2</xref> seem to indicate that the time is ripe for a generalized use
of binary, reserved until now for the low-level protocols. The lessons about flexibility
learnt in the previous switch from binary to plain text can now be applied to efficient
binary syntaxes.
</t>
<section title="Definitions">
<t>
By uniformity, we mean the property of a syntax that can be parsed even by an
application that doesn't understand the semantics of every part of the processed
data. Of course, almost all syntaxes that feature uniformity contain a limited number
of non-uniform elements. Also, uniformity really only has value in the face of
extension, as a fixed syntax doesn't need uniformity (it only makes the implementation
simpler).
</t>
<t>
Almost all extensible syntaxes have their extensible part uniform to a great degree. In
this specification, uniformity is hence evaluated on two criteria: first, the number of
non-uniform elements (and, incidentally, their diversity), second, the fact that the
uniformity of the extensible part is not a limitation to the users (i.e. that the
temptation to extend the format in a non-uniform way is as absent as possible).
</t>
<t>
A good counter-example is found in most programming languages. Adding a new branching
construct cannot be done in a terse way without modifying the underlying
implementation. Such a construct either cannot be defined by user code (because of
evaluation rules) or can in a terribly verbose and inconvenient way (with lots of
boilerplate code). Notable exceptions to this limitation of programming languages are
Lisp, Haskell and stack programming languages.
</t>
<t>
On the other hand, a stack programming language is the canonical example of a
non-uniform language. Each operator takes a number of operands from the stack. Not
knowing the arity of an operator makes it impossible to continue parsing, even when its
evaluation was optional to the final processing. In the design space, stack programming
languages completely sacrifice uniformity to achieve one of the highest combinations of
extensibility, compactness and speed of processing.
</t>
<t>
By generality, we mean the ability of a syntax to lend itself to describe any kind of
data with a reasonable (or better yet, high) level of compactness and simplicity. For
example, although both arrays and linked lists could be considered very general as they
are both able to store any kind of data, they actually are at the respective cost of
complexity (arrays need the embedding of data structure in the data or in the
processing logic) and size (in-memory linked lists can waste as much as half or two
thirds of the space for the overhead of the data structure).
</t>
<t>
By decentralization, we mean the ability to extend the syntax in a way that avoids
naming collisions without the use of a central registry. Note that the DNS, as we use
it, is NOT decentralized in this sense, but distributed, as it cannot work without its
root servers and prior knowledge of their location.
</t>
</section>
<section title="State of the art">
<t>
Uniformity, generality and extensibility are usually highly valued traits in format
design. Programming languages obviously feature them foremost, although their
generality usually stops at what they are supposed to express: procedures. Most of them
are ill-suited to represent arbitrary data, but notable exceptions include Lisp (where
"code is data") and JavaScript, from which a subset has been extracted to exchange
data, JSON, which has seen a tremendous success for this purpose. JSON may lack in
generality and compactness, but its design makes its parsing really straightforward and
fast. All of them, though, lack decentralization. Some of them make it possible to
extend them in a distributed way if some discipline is followed (for example, by naming
modules after domain names), but the discipline is not mandatory (and even with domain
names, a change of ownership makes it possible for name collisions).
</t>
<t>
The SGML/XML family of formats also features uniformity, generality and extensibility
and actually fares much better than programming languages on the three fronts. XML
namespaces also make XML naming distributed and there have been attempts at making it
compact (e.g. EXI from W3C, Fast Infoset from ISO/ITU or EBML).
</t>
<t>
All the previously cited formats clearly lack compactness, although just applying
standard compression techniques would sacrifice only very little processing time to
gain huge size reductions on most of their intended use cases, but compression may not
address their ineffectiveness at storing arbitrary bytes.
</t>
<t>
So-called binary formats pretty much exhibit the opposite trade-offs. Most of them are
not uniform to achieve better compactness. Some are specifically designed for a great
generality, but many lack extensibility. When they are extensible, it's never in a
decentralized way, again for reasons that have to do with compactness. They are usually
extremely fast to parse.
</t>
<t>
Actually, many binary formats are not so much formats as format frameworks, and
exclude extensibility by design. For each use case, an IDL compiler creates a brand new
format that is essentially incompatible with all other formats created by the same
compiler (EBML specifically cites this property among its own disadvantages). If the
IDL compiler and framework are correctly designed, such a format usually represents an
optimum in compactness and speed of processing, as the compiler can also automatically
generate an ad-hoc optimized parser.
</t>
<t>
Where extensibility has been planned in existing formats, it often doesn't get used
that much or at all because of the complications around it. Many binary formats include
reserved values meant to extend them to future uses, like the <spanx
style="verb">CM</spanx> field in the ZIP format. A case like this one faces a
chicken-and-egg problem: if you don't write and get a specification officially adopted,
implementations might not want to include your extension, but if your extension is
purely theoretical and hasn't been tested in the wild, you may face resistance to getting
it officially adopted. This is probably why even though most compression formats
include the ability to later encode other compression methods, each new compression
method usually comes with its own format.
</t>
<t>
When extensions are managed with any form of registry, another issue is that you
usually need to reserve a large set of values for free experimentation, and once an
extension gains any traction while in experimentation, its authors face the difficulty
of switching all existing implementations to the definitive values they'll get. And how
experimenters choose their temporary values makes them vulnerable to conflicts with
others.
</t>
</section>
</section>
<section title="Format overview">
<t>
A BULK stream is a stream of 8-bit bytes, in big-endian order. Parsing a BULK stream
yields a sequence of expressions, which can be either atoms or forms, which are sequences
of expressions. The syntax of forms is entirely uniform, without a single exception: a
starting byte marker, a sequence of expressions and an ending byte marker. Among atoms,
only nil (the null byte) and arrays have a special syntax, for efficiency purposes. Even
booleans and floating-point numbers follow the uniform syntax that every other expression
follows.
</t>
<t>
Non-uniform atoms start with a marker byte, followed by a static or dynamic number of
bytes, depending on the type.
</t>
<t>
Any other atom is a reference, which consists of a namespace marker (in almost all cases,
a single byte) followed by an identifier within this namespace (a single byte). All in
all, very little compactness is sacrificed for the benefit of a very simple
syntax: apart from nil and small integers, nothing is smaller than 2 bytes, and as most
forms involve a reference followed by some content, a form is usually 4 bytes + its
content.
</t>
<t>
A namespace marker in a BULK stream is associated to a namespace identified by some
identifier guaranteed to be unique without coordination (like a UUID or cryptographic
hash), thus ensuring decentralized extensibility. The stream can be processed even if the
application doesn't recognize the namespace. Parsing remains possible thanks to the
uniform syntax.
</t>
<t>
Combining BULK namespaces, BULK streams and even other formats doesn't need any
content transformation to work. Here are some examples:
<list style="symbols">
<t>
The content of a BULK stream, enclosed in list starting and ending byte markers,
constitutes a valid BULK expression. Thus BULK streams can be packed or annotated
within a BULK stream without modification. Annotation use cases include adding
metadata or cryptographic signature.
</t>
<t>
A BULK format could specify in its syntax the place for an expression holding
metadata. Whether the specification provides its own metadata forms or not, an
application could use a BULK serialization for MARC, TEI Header, XML or RDF for this
metadata expression. The vocabulary selected would be univocally expressed by the
namespace and every vocabulary would be parsed by the same mechanisms.
</t>
<t>
Whenever content must be stored as-is instead of serialized, or a highly-optimized
ad hoc serialization exists for some data, anything can always be stored within an
array. Arrays can contain arbitrary bytes and there is no limit to their size.
</t>
</list>
</t>
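<t>
As a non-normative illustration of the first example above, the following Python sketch
encloses a complete BULK stream between the starting and ending marker bytes, so that it
becomes a single form. The function name <spanx style="verb">wrap_stream</spanx> is
hypothetical, and the sketch assumes that its input is a complete, well-formed BULK stream.
</t>
<figure><artwork><![CDATA[
```python
START_MARKER = b"\x01"  # mnemonic "("
END_MARKER = b"\x02"    # mnemonic ")"


def wrap_stream(stream: bytes) -> bytes:
    """Enclose a complete BULK stream in a form, leaving its bytes untouched."""
    return START_MARKER + stream + END_MARKER


# Two nil atoms become one form that contains two nil atoms.
packed = wrap_stream(b"\x00\x00")
```
]]></artwork></figure>
<t>
Because the enclosed bytes are copied verbatim, the original stream can be recovered by
dropping the first and last byte of the form.
</t>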
<t>
Furthermore, BULK expressions can be evaluated. Most expressions evaluate to themselves,
but some evaluate by default to the result of a pure function call, making it possible to
serialize data in an even more compact form, by eliminating boilerplate data and repeated
patterns.
</t>
</section>
<section title="Conventions and Terminology">
<t>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD
NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as
described in <xref target="RFC2119">RFC 2119</xref>.
</t>
<t>
Literal numerical values are provided in decimal or hexadecimal as appropriate.
Hexadecimal literals are prefixed with <spanx style="verb">0x</spanx> to distinguish them
from decimal literals.
</t>
<t>
The text notation of the BULK stream uses mnemonics for some byte sequences. Mnemonics
are series of characters, excluding all capital letters and white space, like <spanx
style="verb">this-is-one-mnemonic</spanx> or <spanx
style="verb">what-the-%§!?#-is-that?</spanx>. They are always separated by white
space. Outside the use of mnemonics, a sequence of bytes (of one or more bytes) can be
represented by its hexadecimal value as an unsigned integer (e.g. <spanx
style="verb">0x3F</spanx> or <spanx style="verb">0x3A0B770F</spanx>). Such a sequence of
bytes can include dashes to make it more readable (e.g. <spanx
style="verb">0xDDA37D36-85E6-4E6D-9B51-959E1CCE366C</spanx>). Some types in this
specification define a special syntax for their representation in the text notation.
</t>
<t>
In the grammar, a shape is a pattern of bytes, following the rules of the text notation
for a BULK stream. Apart from mnemonics and fixed sequences of bytes, a shape can
contain:
<list style="symbols">
<t>an arbitrary sequence of a fixed number of bytes, represented by its size, i.e. a
number of bytes in decimal immediately followed by a B uppercase letter (e.g. <spanx
style="verb">4B</spanx>)</t>
<t>a typed sequence of bytes, represented by the name of its type, a capitalized word
(e.g. <spanx style="verb">Foo</spanx>); this means a sequence of bytes whose specific
yield (cf. <xref target="parsing"/>) has this type</t>
<t>a named sequence of bytes (of zero or more bytes), represented by a series of any
character excluding '{}' between '{' and '}' (e.g. <spanx style="verb">{quux}</spanx>);
a named sequence can be typed or sized, in which case it is immediately followed by ':'
and a type or size (e.g. <spanx style="verb">{quux}:Bar</spanx> or <spanx
style="verb">{quux}:12B</spanx>)</t>
</list>
</t>
<t>
When an entire shape describes the byte sequence of an atom, it is the normative
specification for parsing it, but shapes of forms are only normative with respect to
their default evaluation. A reference defined with a form shape can be used in different
shapes, albeit with different semantics and value, and even when used in its default
shape, a processing application MAY give it alternative semantics.
</t>
<t>
For example, this specification defines a way to specify a string encoding with forms of
the shape <spanx style="verb">( stringenc {enc}:Expr )</spanx>. But the shapes <spanx
style="verb">( stringenc {arg1}:Int {arg2}:Int )</spanx> or <spanx style="verb">(
{arg1}:Int stringenc {arg2}:Int )</spanx> are syntactically valid. They just have unspecified
semantics, as far as this specification is concerned.
</t>
<t>
Some identifiers are expected to be verifiable against a byte sequence. This means that
there must be an algorithm that, given the byte sequence as input, produces the
identifier as output and, given a different byte sequence, would produce a different
identifier. Because this verification has security implications, the algorithm used
should have the same guarantees as a cryptographic hash function in terms of
collisions.
</t>
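<t>
The following non-normative Python sketch shows one way such a verifiable identifier could
work. The choice of SHA-256 and the function names are assumptions made for illustration
only; the text above does not name a particular algorithm.
</t>
<figure><artwork><![CDATA[
```python
import hashlib
import hmac


def derive_identifier(byte_sequence: bytes) -> bytes:
    """Derive an identifier from a byte sequence (SHA-256 is an assumption)."""
    return hashlib.sha256(byte_sequence).digest()


def verify_identifier(identifier: bytes, byte_sequence: bytes) -> bool:
    """Check that the identifier is the one derived from the byte sequence."""
    expected = derive_identifier(byte_sequence)
    # compare_digest runs in constant time with respect to the contents.
    return hmac.compare_digest(identifier, expected)


definition = b"example namespace definition"
identifier = derive_identifier(definition)
```
]]></artwork></figure>
<t>
With these definitions, <spanx style="verb">verify_identifier(identifier, definition)</spanx>
holds, while verification against any modified byte sequence is expected to fail.
</t>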
</section>
</section>
<section title="BULK syntax">
<t>
A BULK stream is a sequence of 8-bit bytes. Bits and bytes are in big-endian order. The
result of parsing a BULK stream is a list of abstract data, called the abstract yield. BULK
parsing is deterministic but not injective: a BULK stream has only one abstract yield, but different BULK streams
can have the same abstract yield.
</t>
<t>
A processing application is not expected to actually produce the abstract yield, but an
adaptation of the abstract yield to its own implementation, called the concrete
yield. Also, some expressions in a BULK stream may have the semantics of a transformation
of the abstract yield. A processing application MAY thus not produce or retain the concrete
yield but the result of its transformation. This specification deals mainly with the byte
sequence and the abstract yield and occasionally provides guidelines about the concrete
yield. Of course, a processing application MAY not produce the concrete yield at all but
produce various data structures and side effects from parsing the BULK stream.
</t>
<t>
The abstract yield is a list of expressions. Expressions can be atoms or forms. Forms
are lists of expressions. If a byte sequence is parsed as an expression, this byte
sequence is said to denote this expression.
</t>
<t>
When a sequence of bytes is named in a shape, its name can be used in this specification to
designate either the byte sequence, or the expression or list of expressions it
denotes. When there could be ambiguity, this specification specifies which is designated.
</t>
<section anchor="parsing" title="Parsing algorithm">
<t>
The parser operates with a context, which is a list of expressions. Each time an
expression is parsed, it is appended at the end of the context. The initial context is the
abstract yield.
</t>
<t>
At the beginning of a BULK stream and after having consumed the byte sequence denoting a
complete expression, the parser is at the dispatch stage. At this stage, the next byte is
a marker byte, which tells the parser what kind of expression comes next (the marker byte
is the first byte of the sequence that denotes an expression). The expression appended to
the context after reading a byte sequence is called the specific yield of the byte
sequence.
</t>
<t>
The <spanx style="verb">0x01</spanx> and <spanx style="verb">0x02</spanx> marker bytes are
special cases. When the parser reads <spanx style="verb">0x01</spanx>, it immediately
appends an empty list to the current context. This list becomes the new context. This new
context has the previous context as parent. Then the parser returns to its dispatch
stage. When the parser reads <spanx style="verb">0x02</spanx>, it appends nothing to the
context, but instead the parent of the current context becomes the new context and the
parser returns to the dispatch stage. Thus it is a parsing error to read <spanx
style="verb">0x02</spanx> when the context is the abstract yield.
</t>
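<t>
The context handling described above can be sketched as follows. This non-normative Python
example only understands the <spanx style="verb">0x00</spanx>, <spanx
style="verb">0x01</spanx> and <spanx style="verb">0x02</spanx> marker bytes; the function
name, the use of <spanx style="verb">None</spanx> for nil and the error raised for a stream
that ends inside a form are assumptions of the sketch.
</t>
<figure><artwork><![CDATA[
```python
NIL, START, END = 0x00, 0x01, 0x02


def parse_forms(stream: bytes) -> list:
    """Parse a stream made only of nil, "(" and ")" marker bytes."""
    abstract_yield = []       # the initial context
    context = abstract_yield
    parents = []              # parent contexts of the current context
    for offset, marker in enumerate(stream):
        if marker == START:
            new_context = []
            context.append(new_context)  # the empty list is appended first
            parents.append(context)
            context = new_context
        elif marker == END:
            if not parents:
                # Reading 0x02 while the context is the abstract yield.
                raise ValueError(f"unexpected 0x02 at offset {offset}")
            context = parents.pop()
        elif marker == NIL:
            context.append(None)
        else:
            raise ValueError(f"marker 0x{marker:02X} not handled by this sketch")
    if parents:
        raise ValueError("stream ended inside an unterminated form")
    return abstract_yield


result = parse_forms(bytes([0x01, 0x00, 0x01, 0x00, 0x00, 0x02, 0x02, 0x00]))
```
]]></artwork></figure>
<t>
Here <spanx style="verb">result</spanx> is a list of two expressions: a form holding nil
and a nested form of two nil atoms, followed by a nil atom.
</t>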
<t>
Some forms have side-effects in their semantics. Those side-effects MUST NOT affect the
parsing of any expression. They can affect evaluation, in which case they MUST only affect
the evaluation of expressions in the scope of the form. The outer scope of an expression
is the part of its context that follows the expression. Some forms MAY define an inner
scope in their shape. The scope of an expression is the union of the outer and inner
scopes. This makes BULK lexically scoped.
</t>
<t>
Whenever a parsing error is encountered, parsing of the BULK stream MUST stop.
</t>
<section title="Summary of marker bytes">
<table>
<thead><tr><th>marker</th><th>shape</th></tr></thead>
<tbody>
<tr><td><spanx style="verb">0x00</spanx></td><td><xref target="nil"><spanx style="verb">nil</spanx></xref></td></tr>
<tr><td><spanx style="verb">0x01</spanx></td><td><xref target="start"><spanx style="verb">(</spanx></xref></td></tr>
<tr><td><spanx style="verb">0x02</spanx></td><td><xref target="end"><spanx style="verb">)</spanx></xref></td></tr>
<tr><td><spanx style="verb">0x03</spanx></td><td><xref target="array"><spanx style="verb"># Nat {content}</spanx></xref></td></tr>
<tr><td><spanx style="verb">0x04–0x0F</spanx></td><td><xref target="reserved"><spanx style="verb">reserved</spanx></xref></td></tr>
<tr><td><spanx style="verb">0x10–0x7F</spanx></td><td><xref target="ref"><spanx style="verb">references</spanx></xref></td></tr>
<tr><td><spanx style="verb">0x80–0xBF</spanx></td><td><xref target="smallint"><spanx style="verb">w6[value]</spanx></xref></td></tr>
<tr><td><spanx style="verb">0xC0–0xFF</spanx></td><td><xref target="smallarray"><spanx style="verb">#[size] {content}</spanx></xref></td></tr>
</tbody>
</table>
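<t>
The dispatch implied by this table can be sketched as a simple range check. The following
Python function is a non-normative illustration; its name and the strings it returns are
hypothetical.
</t>
<figure><artwork><![CDATA[
```python
def marker_kind(marker: int) -> str:
    """Classify a marker byte according to the summary table."""
    if not 0x00 <= marker <= 0xFF:
        raise ValueError("a marker is a single byte")
    if marker == 0x00:
        return "nil"
    if marker == 0x01:
        return "form-start"
    if marker == 0x02:
        return "form-end"
    if marker == 0x03:
        return "array"
    if marker <= 0x0F:
        return "reserved"
    if marker <= 0x7F:
        return "reference"
    if marker <= 0xBF:
        return "small-int"
    return "small-array"


kinds = [marker_kind(m) for m in (0x00, 0x0F, 0x10, 0x7F, 0x80, 0xBF, 0xC0, 0xFF)]
```
]]></artwork></figure>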
</section>
<section title="Evaluation">
<t>
A processing application MAY implement evaluation of BULK expressions and streams. When
evaluating a BULK stream, when the parser gets to the dispatch stage and the context is
the abstract yield, the last expression in the context is replaced by what it evaluates
to. This description is meant to define the semantics of BULK
evaluation; a processing application MAY implement evaluation with a different
algorithm as long as it provides the same semantics.
</t>
<t>
The default evaluation rule is that an expression evaluates to itself. A name within a
namespace can have a value, which is what a reference associated to this name evaluates
to. A reference whose marker value is associated to no namespace or whose name has no
value evaluates to itself. How self-evaluating BULK expressions are represented in the
concrete yield is application-dependent, but future specifications MAY define a
standard API to access it, similar to the Document Object Model for XML.
</t>
<t>
The evaluation of a form obeys a special rule, though: if the first expression of the
form has type <spanx style="verb">Function</spanx>, that function is called with an
argument list and the form evaluates to the return value if it's an atom or the
evaluation of the return value if it is a form. If the function has type <spanx
style="verb">LazyFunction</spanx>, the argument list is the rest of the form. If the
function has type <spanx style="verb">EagerFunction</spanx>, the argument list is the
rest of the form, where each expression is replaced by what it evaluates to. Any
expression that has type <spanx style="verb">LazyFunction</spanx> or <spanx
style="verb">EagerFunction</spanx> also has type <spanx style="verb">Function</spanx>.
</t>
<t>
A form whose first expression doesn't have type <spanx style="verb">Function</spanx>
evaluates to itself.
</t>
<t>
When an application evaluates a BULK expression, it MUST verify that evaluation will
terminate in a finite number of evaluation steps. An application MAY verify finite
termination statically or dynamically. For example, an application MAY stop evaluation
in error after a predetermined number of steps.
</t>
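<t>
The evaluation rules of this section, together with a dynamic termination check, can be
sketched as follows. This Python example is non-normative and makes several assumptions:
forms are Python lists, function values sit directly in the first position of a form
(instead of being reached through references), and the budget of 100 steps is arbitrary.
All class and function names are hypothetical.
</t>
<figure><artwork><![CDATA[
```python
class LazyFunction:
    """Function called with the rest of the form, unevaluated."""
    def __init__(self, fn):
        self.fn = fn


class EagerFunction:
    """Function called with the rest of the form, each expression evaluated."""
    def __init__(self, fn):
        self.fn = fn


class EvaluationError(Exception):
    """Raised when the step budget is exhausted."""


def evaluate(expr, steps_left=None):
    if steps_left is None:
        steps_left = [100]          # shared, mutable step budget
    if steps_left[0] <= 0:
        raise EvaluationError("evaluation stopped: step budget exhausted")
    steps_left[0] -= 1
    if not isinstance(expr, list) or not expr:
        return expr                 # atoms evaluate to themselves
    head, rest = expr[0], expr[1:]
    if isinstance(head, EagerFunction):
        args = [evaluate(arg, steps_left) for arg in rest]
    elif isinstance(head, LazyFunction):
        args = rest
    else:
        return expr                 # no function in first position
    value = head.fn(args)
    # An atom is returned as is; a returned form is evaluated in turn.
    return evaluate(value, steps_left) if isinstance(value, list) else value


add = EagerFunction(lambda args: sum(args))
quote = LazyFunction(lambda args: args[0])
total = evaluate([add, 1, [add, 2, 3]])
```
]]></artwork></figure>
<t>
In this sketch, a function that keeps returning a form that calls it again is stopped with
an error once the budget is spent, which is one way to satisfy the termination requirement
dynamically.
</t>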
</section>
</section>
<section title="Forms">
<section anchor="start" title="starting marker byte">
<t>
<list style="hanging">
<t hangText="marker"><spanx style="verb">0x01</spanx></t>
<t hangText="mnemonic"><spanx style="verb">(</spanx></t>
</list>
</t>
</section>
<section anchor="end" title="ending marker byte">
<t>
<list style="hanging">
<t hangText="marker"><spanx style="verb">0x02</spanx></t>
<t hangText="mnemonic"><spanx style="verb">)</spanx></t>
</list>
</t>
</section>
<section title="Difference between sequence and form">
<t>
There is a difference between a byte sequence denoting several expressions among the
current context and a byte sequence denoting a form (i.e. a single expression that is a
list of expressions). As an example, let's examine several forms of the shape <spanx
style="verb">( foo {bar} )</spanx>.
</t>
<t>
<list style="symbols">
<t>In the form <spanx style="verb">( foo nil nil nil )</spanx>, {bar} denotes 3
expressions, and they are three atoms in the yield.</t>
<t>In the form <spanx style="verb">( foo nil )</spanx>, {bar} is a single expression
in the yield, and that expression is an atom.</t>
<t>In the form <spanx style="verb">( foo ( nil nil nil ) )</spanx>, {bar} is also a
single expression in the yield, and that expression is a form, a list in the
yield.</t>
</list>
</t>
<t>
In a shape, when a byte sequence must yield a single expression, it has the type <spanx
style="verb">Expr</spanx>. So the last two examples fit the shape <spanx style="verb">(
foo {seq}:Expr )</spanx> but not the first. When a byte sequence must yield a form, it
has type <spanx style="verb">Form</spanx>. Thus the shape <spanx style="verb">( foo
{bar}:Form )</spanx> is equivalent to <spanx style="verb">( foo ( {bar} )
)</spanx>. Either one MAY be used.
</t>
</section>
</section>
<section title="Atoms">
<section anchor="nil" title="nil">
<t>
<list style="hanging">
<t hangText="marker"><spanx style="verb">0x00</spanx> (mnemonic: <spanx
style="verb">nil</spanx>)</t>
<t hangText="shape"><spanx style="verb">nil</spanx></t>
</list>
</t>
<t>
Apart from being a possible short marker value, the fact that the <spanx
style="verb">0x00</spanx> byte represents a valid atom means that a series of null bytes
is a valid part of a BULK stream, thus making the format less fragile. In a network
communication, nil atoms can be sent to keep the channel open. They can also be used as
padding at the end of a form or between forms.
</t>
</section>
<section title="Arrays">
<t>
Arrays can be used to store arbitrary bytes.
</t>
<t>
An array can be interpreted either as a bit sequence or as an unsigned integer in
binary notation. The choice depends on the context and the application. Actually, many
processing applications may not need to make any choice, as most programming language
implementations actually also conflate unsigned integers and bit sequences to some
extent. Expressions that are unsigned integers (that is, natural numbers) have type
<spanx style="verb">Nat</spanx>.
</t>
<t>
Big arrays typically store the content of a file or a binary message of another
format. They can also be used to store a vector or matrix of fixed-size elements.
</t>
<t>
In any case, the semantics of the content must be inferred by the processing
application; where ambiguity can appear, an application SHOULD enclose the array in a
type-denoting form.
</t>
<t>
Because BULK arrays have no end markers, the payload of a BULK array can constitute the
end of the stream.
</t>
<t>
The start and end of an array are known without reading its content, which means that
its content can be skipped in constant time and mapped in memory (or read lazily by any
other means).
</t>
<t>
Because BULK can use integers with arbitrary size to store the size of an array, BULK
arrays have no limit in size.
</t>
<section anchor="array" title="Generic array">
<t>
<list style="hanging">
<t hangText="marker"><spanx style="verb">0x03</spanx> (mnemonic: <spanx
style="verb">#</spanx>)</t>
<t hangText="shape"><spanx style="verb"># Nat {content}</spanx></t>
</list>
</t>
<t>
Arrays have a special parsing rule. After consuming the marker byte, the parser
returns to the dispatch stage. It is a parsing error if the parsed expression is not of
type <spanx style="verb">Nat</spanx> or if its value cannot be recognized. This
integer is not added to any context, but the parser consumes as many bytes as this
integer and they constitute the content of this array.
</t>
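<t>
The parsing rule above can be sketched in Python as follows. The example is non-normative:
the function names are hypothetical, and the size is only accepted when it is denoted by a
small unsigned integer, a small array or another generic array, all read as big-endian
unsigned integers. Any other expression of type <spanx style="verb">Nat</spanx> is outside
the scope of the sketch.
</t>
<figure><artwork><![CDATA[
```python
from typing import Tuple

GENERIC_ARRAY = 0x03


def parse_nat(stream: bytes, offset: int) -> Tuple[int, int]:
    """Parse one Nat expression; return (value, offset just after it)."""
    if offset >= len(stream):
        raise ValueError("stream ended where a Nat was expected")
    marker = stream[offset]
    if 0x80 <= marker <= 0xBF:                 # small unsigned integer
        return marker & 0x3F, offset + 1
    if 0xC0 <= marker <= 0xFF:                 # small array read as a Nat
        start = offset + 1
        end = start + (marker & 0x3F)
        if end > len(stream):
            raise ValueError("small array content is truncated")
        return int.from_bytes(stream[start:end], "big"), end
    if marker == GENERIC_ARRAY:                # generic array read as a Nat
        content, end = parse_array(stream, offset)
        return int.from_bytes(content, "big"), end
    raise ValueError(f"marker 0x{marker:02X} does not start a Nat here")


def parse_array(stream: bytes, offset: int) -> Tuple[bytes, int]:
    """Parse a generic array; return (content, offset just after it)."""
    if offset >= len(stream) or stream[offset] != GENERIC_ARRAY:
        raise ValueError("expected the 0x03 array marker")
    size, start = parse_nat(stream, offset + 1)  # the size joins no context
    end = start + size
    if end > len(stream):
        raise ValueError("array content is truncated")
    return stream[start:end], end


content, next_offset = parse_array(b"\x03\x85hello\x00", 0)
```
]]></artwork></figure>
<t>
Since the end offset is computed before the content is read, a parser built this way can
skip the content without inspecting it.
</t>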
<t>
In the text notation, a quoted string is the notation for an array containing the
encoding of that string in the <xref target="stringenc">current encoding</xref>,
except if the size of the encoding is below 64 bytes, cf. <xref
target="smallarray" sectionFormat="bare">small arrays</xref>.
</t>
<t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
<t>
In a shape, the type <spanx style="verb">String</spanx> is synonymous with <spanx
style="verb">Bytes</spanx>, but means that the content of the array is supposed to be
taken as a string in the current encoding.
</t>
</section>
<section anchor="smallarray" title="Small array">
<t>
<list style="hanging">
<t hangText="marker"><spanx style="verb">0xC0–0xFF</spanx> (mnemonic: <spanx
style="verb">#[size]</spanx>)</t>
<t hangText="shape"><spanx style="verb">#[size] {content}</spanx></t>
</list>
</t>
<t>
Small arrays have a special parsing rule. The 6 least significant bits of the marker
byte are treated as an unsigned integer. This integer is not added to any context, but
the parser consumes as many bytes as this integer and they constitute the content of
this array.
</t>
<t>
In the text notation, the marker byte of a small array of size X is written as <spanx
style="verb">#[X]</spanx>. For example, <spanx style="verb">#[2] 0x1234</spanx> is a
notation for the bytes <spanx style="verb">0xC2 0x12 0x34</spanx>, since the marker
byte is 0xC0 plus the size 2.
</t>
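<t>
As an illustration of the rule above, the following Python sketch encodes and decodes
a small array. The function names are hypothetical and the sketch assumes only what
this section states: the marker byte lies in <spanx style="verb">0xC0–0xFF</spanx> and
its 6 least significant bits give the size of the content.
</t>
<figure><artwork><![CDATA[
```python
def encode_small_array(content: bytes) -> bytes:
    """Prefix content with a small array marker byte (0xC0 | size)."""
    if len(content) > 0x3F:
        raise ValueError("a small array holds at most 63 bytes")
    return bytes([0xC0 | len(content)]) + content


def decode_small_array(data: bytes, pos: int = 0) -> tuple[bytes, int]:
    """Decode a small array at data[pos]; return (content, next position)."""
    marker = data[pos]
    if marker < 0xC0:
        raise ValueError("not a small array marker")
    size = marker & 0x3F  # the 6 least significant bits
    end = pos + 1 + size
    if end > len(data):
        raise ValueError("array content is truncated")
    return data[pos + 1:end], end


encoded = encode_small_array(b"abc")
decoded, next_pos = decode_small_array(encoded)
```
]]></artwork></figure>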
<t>
In the text notation, a quoted string is the notation for a small array containing the
encoding of that string in the current encoding if the size of the encoding is below
64 bytes.
</t>
<t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
</section>
<section anchor="smallint" title="Small unsigned integers">
<t>
<list style="hanging">
<t hangText="marker"><spanx style="verb">0x80–0xBF</spanx> (mnemonic: <spanx
style="verb">w6[value]</spanx>)</t>
<t hangText="shape"><spanx style="verb">w6[value]</spanx></t>
</list>
</t>
<t>
Small unsigned integers have a special parsing rule. The 6 least significant bits of
the marker byte are the value denoted by this byte (as bits or as an unsigned integer
in binary notation).
</t>
<t>
In the text notation, the marker byte of a small unsigned integer of value X is
written as <spanx style="verb">w6[X]</spanx>. For example, <spanx
style="verb">w6[11]</spanx> is a notation for the byte <spanx
style="verb">0x8B</spanx> (as is <spanx style="verb">11</spanx>).
</t>
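<t>
A minimal Python sketch of this rule, with hypothetical function names, assuming only
the marker range <spanx style="verb">0x80–0xBF</spanx> given above:
</t>
<figure><artwork><![CDATA[
```python
def encode_small_uint(value: int) -> bytes:
    """Encode a value from 0 to 63 as a single marker byte."""
    if not 0 <= value <= 0x3F:
        raise ValueError("value does not fit in 6 bits")
    return bytes([0x80 | value])


def decode_small_uint(byte: int) -> int:
    """Return the value carried by a small unsigned integer marker byte."""
    if not 0x80 <= byte <= 0xBF:
        raise ValueError("not a small unsigned integer marker")
    return byte & 0x3F  # the 6 least significant bits


five = encode_small_uint(5)
largest = decode_small_uint(0xBF)
```
]]></artwork></figure>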
<t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
</section>
</section>
<section anchor="reserved" title="Reserved marker bytes">
<t>
Marker bytes <spanx style="verb">0x04–0x0F</spanx> are reserved for future major
versions of BULK. It is a parser error if a BULK stream with major version 1 contains
such a marker byte.
</t>
</section>
<section anchor="ref" title="References">
<t><list style="hanging">
<t hangText="marker"><spanx style="verb">0x10–0x7F</spanx></t>
<t hangText="shape">
<spanx style="verb">{ns}:1B {name}:1B</spanx>
<vspace/>
<spanx style="verb">0x7F {ns'} {name}:1B</spanx>
</t>
</list>
</t>
<t>
The <spanx style="verb">{ns}</spanx> byte is a value associated with a namespace, called
the namespace marker. Values <spanx style="verb">0x10–0x17</spanx> are reserved for
namespaces defined by BULK specifications. Greater values can be associated with
namespaces identified by a unique identifier.
</t>
<t>
The <spanx style="verb">{name}</spanx> byte is the name within the
namespace. Vocabularies with more than 256 names thus need to be spread across several
namespaces.
</t>
<t>
The specification of a namespace SHOULD include a mnemonic for the namespace and for
each defined name. When descriptions use several namespaces, the mnemonic of a reference
SHOULD be the concatenation of the namespace mnemonic, ":" and the name mnemonic if
there can be an ambiguity. For example, the <spanx style="verb">fp</spanx> name in
namespace <spanx style="verb">math</spanx> becomes <spanx style="verb">math:fp</spanx>.
</t>
<t>Type: <spanx style="verb">Ref</spanx></t>
<section title="Special case">
<t>
References have a special parsing rule for BULK streams that need a large number of
namespaces. If the marker byte is <spanx style="verb">0x7F</spanx>, the parser
continues to read bytes until it finds a byte other than 0xFF. The namespace marker is
the sum of the marker byte and of all the bytes read in this way, up to and including
that last byte, each taken as an unsigned integer. For example, the reference denoted
by the bytes <spanx style="verb">0x7F 0xFF 0x8C 0x1A</spanx> is the name 26 in the
namespace associated with 522 (127 + 255 + 140).
</t>
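<t>
The following Python sketch illustrates one reading of this rule. The function name
<spanx style="verb">parse_reference</spanx> is hypothetical. The choice to include
the <spanx style="verb">0x7F</spanx> marker byte itself in the sum is inferred from
the example above, where 127 + 255 + 140 = 522.
</t>
<figure><artwork><![CDATA[
```python
def parse_reference(data: bytes, pos: int = 0) -> tuple[int, int, int]:
    """Parse a reference; return (namespace marker, name, next position)."""
    first = data[pos]
    if not 0x10 <= first <= 0x7F:
        raise ValueError("not a reference marker byte")
    namespace = first
    pos += 1
    if first == 0x7F:
        # Extended form: keep adding bytes until one differs from 0xFF.
        while True:
            byte = data[pos]
            pos += 1
            namespace += byte
            if byte != 0xFF:
                break
    name = data[pos]  # the name is always a single byte
    return namespace, name, pos + 1


extended = parse_reference(bytes([0x7F, 0xFF, 0x8C, 0x1A]))
simple = parse_reference(bytes([0x20, 0x00]))
```
]]></artwork></figure>
<t>
In this sketch, <spanx style="verb">extended</spanx> holds the namespace marker 522
and the name 26, and <spanx style="verb">simple</spanx> holds the namespace marker 32
and the name 0.
</t>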
</section>
</section>
</section>
</section>
<section title="Standard namespaces">
<t>
Standard namespaces have a fixed marker value and are not identified by a unique
identifier.
</t>
<section title="BULK core namespace">
<t>
<list style="hanging">
<t hangText="marker"><spanx style="verb">0x20</spanx> (mnemonic: <spanx
style="verb">bulk</spanx>)</t>
</list>
</t>
<section title="Version">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x00</spanx> (mnemonic: <spanx
style="verb">version</spanx>)</t>
<t hangText="shape"><spanx style="verb">( version {major}:Nat {minor}:Nat
)</spanx></t>
</list>
</t>
<t>
When parsing a BULK stream, a processing application MUST explicitly determine the
major and minor version of the BULK specification that the stream obeys. This
information MAY be exchanged out-of-band if BULK is used to exchange a number of very
small messages, where repeated headers of 8 bytes might become too big an overhead. A
processing application MUST NOT assume a default version.
</t>
<t>
If the version is expressed within a BULK stream, this form MUST be the first in the
stream. In any other place, this form has no semantics attached to it. This
specification defines BULK 1.0. When writing a BULK stream, an application MUST denote
{major} and {minor} by the smallest byte sequence possible.
</t>
<t>
An application writing a BULK stream to long-term storage (e.g. in a file or a database
record) SHOULD include a <spanx style="verb">version</spanx> form.
</t>
<t>
Two BULK versions with the same major version MUST share the same parsing rules and the
same definitions of marker bytes. Changing the syntax or semantics of existing marker
bytes, or using marker bytes in the reserved interval, warrants a new major version,
as does changing the syntax or semantics of existing names in standard namespaces.
</t>
<t>
Adding standard namespaces or adding names in existing standard namespaces warrants a
new minor version.
</t>
</section>
<section title="Booleans">
<section title="true">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x02</spanx> (mnemonic: <spanx
style="verb">true</spanx>)</t>
<t hangText="shape"><spanx style="verb">true</spanx></t>
</list>
</t>
<t>
Type: <spanx style="verb">Boolean</spanx>.
</t>
</section>
<section title="false">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x03</spanx> (mnemonic: <spanx
style="verb">false</spanx>)</t>
<t hangText="shape"><spanx style="verb">false</spanx></t>
</list>
</t>
<t>
Type: <spanx style="verb">Boolean</spanx>.
</t>
</section>
</section>
<section title="Strings encoding">
<section anchor="stringenc" title="Current encoding">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x04</spanx> (mnemonic: <spanx
style="verb">stringenc</spanx>)</t>
<t hangText="shape"><spanx style="verb">( stringenc {enc}:Encoding )</spanx></t>
</list>
</t>
<t>
This tells the processing application that, in the scope of this expression, all
expressions that are understood by the application as character strings will be encoded
with the encoding designated by {enc}.
</t>
<t>
As the abstract yield doesn't contain strings but expressions that will be used as
strings by the application, it is not a parsing error if the application doesn't
recognize {enc}. In this situation, it is a parsing error when the application actually
needs to decode a byte sequence as a string. It is not a parsing error when a processing
application only transmits a byte sequence encoding a string, if it can accurately
convey the encoding to the receiving application.
</t>
</section>
<section title="IANA registered character set">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x05</spanx> (mnemonic: <spanx
style="verb">iana-charset</spanx>)</t>
<t hangText="shape"><spanx style="verb">( iana-charset {id}:Nat )</spanx></t>
</list>
</t>
<t>
This designates the string encoding registered among the <xref
target="IANA-Charsets">IANA Character Sets</xref> whose MIBenum is {id}.
</t>
<t>
Type: <spanx style="verb">Encoding</spanx>.
</t>
</section>
<section title="Windows code page">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x06</spanx> (mnemonic: <spanx
style="verb">code-page</spanx>)</t>
<t hangText="shape"><spanx style="verb">( code-page {id}:Nat )</spanx></t>
</list>
</t>
<t>
This designates the string encoding among Windows code pages whose identifier is {id}.
</t>
<t>
Type: <spanx style="verb">Encoding</spanx>.
</t>
</section>
</section>
<section title="Namespaces">
<section title="New namespace">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x07</spanx> (mnemonic: <spanx
style="verb">ns</spanx>)</t>
<t hangText="shape"><spanx style="verb">( ns {marker}:Ref {id}:Expr )</spanx></t>
</list>
</t>
<t>
This associates the namespace identified by {id} to the namespace marker of {marker},
within the scope of this expression.
</t>
</section>
<section title="Package">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x08</spanx> (mnemonic: <spanx
style="verb">package</spanx>)</t>
<t hangText="shape"><spanx style="verb">( package {id}:Expr {namespaces}
)</spanx></t>
</list>
</t>
<t>
This creates a package identified by {id}. Packages are immutable: {id} MUST be
verifiable against the byte sequence {namespaces}. {namespaces} must be a series of
expressions, each identifying a BULK namespace.
</t>
</section>
<section title="Import">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x09</spanx> (mnemonic: <spanx
style="verb">import</spanx>)</t>
<t hangText="shape"><spanx style="verb">( import {base}:Nat {count}:Nat {id}:Expr
)</spanx></t>
</list>
</t>
<t>
This associates the first {count} namespaces in the package identified by {id} with a
contiguous range of marker bytes starting at {base} within the scope of this
expression.
</t>
<t>
Example: <spanx style="verb">( import 28 3 0x0123456789ABCDEF )</spanx> associates
the first 3 namespaces of the package identified by <spanx
style="verb">0x0123456789ABCDEF</spanx> with the marker bytes 28, 29 and 30.
</t>
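<t>
The marker assignment performed by <spanx style="verb">import</spanx> can be sketched
in Python as follows. The function name, the representation of a package as a list,
and the namespace identifiers <spanx style="verb">ns-a</spanx> to <spanx
style="verb">ns-d</spanx> are all hypothetical and only serve the illustration.
</t>
<figure><artwork><![CDATA[
```python
def import_namespaces(base: int, count: int, package: list[str]) -> dict[int, str]:
    """Map the first `count` namespaces of a package to consecutive markers."""
    if count > len(package):
        raise ValueError("package holds fewer namespaces than requested")
    return {base + offset: package[offset] for offset in range(count)}


# A hypothetical package of four namespaces, of which the first three are imported.
markers = import_namespaces(28, 3, ["ns-a", "ns-b", "ns-c", "ns-d"])
```
]]></artwork></figure>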
</section>
</section>
<section title="Definitions">
<t>
To define a reference is to change the value of its name in its namespace (as
identified by its unique identifier, not the marker value) within a certain scope.
</t>
<t>
If a BULK stream is not evaluated, the semantics of a definition are entirely
application-dependent.
</t>
<t>
When a BULK stream containing definitions for a namespace comes from a trusted source
(i.e. in configuration files of the application, or in the communication with an agent
that has been granted the relevant authority), an application MAY give those
definitions long-lasting semantics (i.e. keep the values of the names at the end of
parsing). This is the preferred mechanism for bulk namespace definition when the
semantics of the defined expressions can be expressed completely by BULK forms.
</t>
<section title="Simple definition">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x0A</spanx> (mnemonic: <spanx
style="verb">define</spanx>)</t>
<t hangText="shape">
<spanx style="verb">( define {ref}:Ref {value}:Expr )</spanx>
<vspace/>
<spanx style="verb">( define nil {value}:Expr )</spanx>
</t>
</list>
</t>
<t>
This defines the reference <spanx style="verb">{ref}</spanx> to the yield of <spanx
style="verb">{value}</spanx> in the outer scope of this form.
</t>
<t>
In any context where there is a default namespace where definitions are made,
e.g. <xref target="verifiable"><spanx style="verb">verifiable-ns</spanx></xref>, the
second shape defines the smallest name that is not yet defined to <spanx
style="verb">{value}</spanx>.
</t>
</section>
<section title="Named definition">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x0B</spanx> (mnemonic: <spanx
style="verb">mnemonic/def</spanx>)</t>
<t hangText="shape">
<spanx style="verb">( mnemonic/def {ref}:Ref {mnemonic}:String
{doc}:Expr {value} )</spanx>
<vspace/>
<spanx style="verb">( mnemonic/def nil {mnemonic}:String
{doc}:Expr {value} )</spanx>
</t>
</list>
</t>
<t>
This suggests <spanx style="verb">{mnemonic}</spanx> as the mnemonic of the name
designated by <spanx style="verb">{ref}</spanx> in its namespace. If <spanx
style="verb">{value}</spanx> is of type Expr, this defines the reference <spanx
style="verb">{ref}</spanx> to <spanx style="verb">{value}</spanx> in the scope of this
form.
</t>
<t>
<spanx style="verb">{doc}</spanx> is any expression that provides documentation for
this reference. If it has type Bytes, it MUST be a string. Otherwise, it can be any
kind of metadata or document type.
</t>
<t>
In any context where there is a default namespace where definitions are made,
e.g. <xref target="verifiable"><spanx style="verb">verifiable-ns</spanx></xref>, the
second shape defines the smallest name that is not yet defined to <spanx
style="verb">{value}</spanx>.
</t>
</section>
<section title="Namespace description">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x0C</spanx> (mnemonic: <spanx
style="verb">ns-mnemonic</spanx>)</t>
<t hangText="shape"><spanx style="verb">( ns-mnemonic {ns}:Expr {mnemonic}:String
{doc} )</spanx></t>
</list>
</t>
<t>
This suggests {mnemonic} as the mnemonic of the namespace designated by {ns} (which
can be the integer to which this namespace is associated, a reference in this
namespace or the unique identifier of this namespace).
</t>
</section>
<section anchor="verifiable" title="Verifiable namespace definition">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x0D</spanx> (mnemonic: <spanx
style="verb">verifiable-ns</spanx>)</t>
<t hangText="shape"><spanx style="verb">( verifiable-ns {marker}:Nat {id}:UniqueID
{data}:Expr {mnemonic}:Expr {definitions} )</spanx></t>
<t hangText="inner scope"><spanx style="verb">{id} {data} {mnemonic} {definitions}</spanx></t>
</list>
</t>
<t>
This associates the namespace identified by {id} to the namespace marker of {marker},
within the scope of this form. Verifiable namespaces are immutable: {id} MUST be
verifiable against the byte sequence <spanx style="verb">{data} {mnemonic}
{definitions}</spanx>. The semantics of this form is to define in its scope any
definition made within <spanx style="verb">{definitions}</spanx>.
</t>
<t>
If {mnemonic} is of type <spanx style="verb">String</spanx>, then this suggests it as
the mnemonic of the namespace. Else it MUST be <spanx style="verb">nil</spanx>.
</t>
<t>
If more data than {id} is needed to verify {id} against {definitions} (like the salt
of a hash function, or the namespace of a UUID), this data should be provided by
{data}. Else {data} MUST be <spanx style="verb">nil</spanx>.
</t>
<t>
A verifiable namespace wouldn't really be immutable if it used definitions from other
namespaces that aren't immutable. To that effect, an application SHOULD stop
processing this form with an error when <spanx style="verb">{definitions}</spanx>
contains references from namespaces that cannot be determined to be immutable
themselves. The goal is to prevent a user or system from being unwittingly vulnerable, so
an application MAY provide an option to accept a specific verifiable namespace, but an
application MUST NOT provide an option to accept any vulnerable verifiable
namespace. That is, an option like <spanx style="verb">--accept-ns
8f82849556d74466</spanx> is acceptable but <spanx
style="verb">--disable-immutability-check</spanx> is not.
</t>
</section>
<section title="Array concatenation">
<t>
<list style="hanging">
<t hangText="name"><spanx style="verb">0x0E</spanx> (mnemonic: <spanx
style="verb">concat</spanx>)</t>
<t hangText="shape"><spanx style="verb">( concat {array1}:Bytes {array2}:Bytes )</spanx></t>
</list>
</t>
<t>
<list style="hanging">
<t hangText="Name's type">EagerFunction</t>
<t hangText="Form's type">Bytes</t>
<t hangText="Form's value">the concatenation of {array1} and {array2}.</t>
</list>
</t>
<t>
The value of this form is an array that contains the bytes in array1 followed by the
bytes in array2.
</t>