=encoding utf8
=head1 TITLE
Synopsis 2: Bits and Pieces
=head1 AUTHORS
Larry Wall <larry@wall.org>
=head1 VERSION
Created: 10 Aug 2004
Last Modified: 28 Jul 2012
Version: 268
This document summarizes Apocalypse 2, which covers small-scale
lexical items and typological issues. (These Synopses also contain
updates to reflect the evolving design of Perl 6 over time, unlike the
Apocalypses, which are frozen in time as "historical documents".
These updates are not marked--if a Synopsis disagrees with its
Apocalypse, assume the Synopsis is correct.)
=head1 One-pass parsing
To the extent allowed by sublanguages' parsers, Perl is parsed using a
one-pass, predictive parser. That is, lookahead of more than one
"longest token" is discouraged. The currently known exceptions to
this are where the parser must:
=over 4
=item *
Locate the end of interpolated expressions that begin with a sigil
and might or might not end with brackets.
=item *
Recognize that a reduce operator is not really beginning a C<[...]> composer.
=back
One-pass parsing is fundamental to knowing exactly which language
you are dealing with at any moment, which in turn is fundamental
to allowing unambiguous language mutation in any desired direction.
(Generic languages are allowed, but only if intended; accidentally
generic languages lead to loss of linguistic identity and integrity.
This is the hard lesson of Perl 5's source filters and other
multi-pass parsing mistakes.)
=head1 Lexical Conventions
=head2 Unicode Semantics
In the abstract, Perl is written in Unicode, and has consistent Unicode
semantics regardless of the underlying text representations. By default
Perl presents Unicode in "NFG" formation, where each grapheme counts as
one character. A grapheme is what the novice user would think of as a
character in their normal everyday life, including any diacritics.
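As a minimal sketch of the grapheme-versus-codepoint distinction, assuming a current Rakudo in which strings report graphemes through C<.chars> and codepoints through C<.codes> (the variable name is illustrative only):

```raku
# "x" followed by COMBINING ACUTE ACCENT (U+0301). No precomposed
# codepoint exists for this pair, so it remains two codepoints
# while still being a single grapheme.
my $s = "x\x[0301]";

say $s.chars;   # 1 (graphemes)
say $s.codes;   # 2 (codepoints)
```

The diacritic is counted as part of the character it modifies, which is what the novice user would expect.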
Perl can count Unicode line and paragraph separators as line markers,
but that behavior had better be configurable so that Perl's idea of
line numbers matches what your editor thinks about Unicode lines.
Unicode horizontal whitespace is counted as whitespace, but it's better
not to use thin spaces where they will make adjoining tokens look like
a single token. On the other hand, Perl doesn't use indentation as syntax,
so you are free to use any amount of whitespace anywhere that whitespace
makes sense. Comments always count as whitespace.
=head2 Bracketing Characters
For some syntactic purposes, Perl distinguishes bracketing characters
from non-bracketing. Bracketing characters are defined as any Unicode
characters with either bidirectional mirrorings or Ps/Pe/Pi/Pf properties.
In practice, you're safest using matching characters with
Ps/Pe/Pi/Pf properties. ASCII angle brackets are a notable exception,
since they're bidirectional but not in the Ps/Pe/Pi/Pf sets.
Characters with no corresponding closing character do not qualify
as opening brackets. This includes the second section of the Unicode
BidiMirroring data table.
If a character is already used in Ps/Pe/Pi/Pf mappings, then any entry
in BidiMirroring is ignored (both forward and backward mappings).
For any given Ps character, the next Pe codepoint (in numerical
order) is assumed to be its matching character even if that is not
what you might guess using left-right symmetry. Therefore C<U+298D>
maps to C<U+298E>, not C<U+2990>, and C<U+298F> maps to C<U+2990>,
not C<U+298E>. Neither C<U+298E> nor C<U+2990> is a valid bracket
opener, despite having reverse mappings in the BidiMirroring table.
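The "next Pe codepoint" rule can be sketched as follows. This is only a model of the Ps case, assuming a current Rakudo where C<uniprop> on a codepoint returns its general category; C<next-closer> is a hypothetical helper, not part of any parser, and it does not handle Pi/Pf pairs or the BidiMirroring table:

```raku
# Hypothetical helper: scan upward from a Ps opener and return the
# first codepoint whose General_Category is Pe.
sub next-closer(Int $opener) {
    ($opener ^.. 0x10FFFF).first({ uniprop($_) eq 'Pe' })
}

say next-closer(0x298D).base(16);   # 298E
say next-closer(0x298F).base(16);   # 2990
say next-closer('('.ord).chr;       # )
```

The scan is purely numerical, which is why C<U+298D> pairs with C<U+298E> even though the glyphs are not mirror images of each other.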
The C<U+301D> codepoint has two closing alternatives, C<U+301E> and C<U+301F>;
Perl 6 recognizes only the one with the lower code point number, C<U+301E>,
as the closing bracket. This policy also applies to new one-to-many
mappings introduced in the future.
However, many-to-one mappings are fine; multiple opening characters
may map to the same closing character. For instance, C<U+2018>, C<U+201A>,
and C<U+201B> may all be used as the opener for the C<U+2019> closer.
Constructs that count openers and closers assume that only the given
opener is special. That is, if you open with one of the alternatives,
all other alternatives are treated as non-bracketing characters within
that construct.
=head2 Multiline Comments
Pod sections may be used reliably as multiline comments in Perl 6.
Unlike in Perl 5, Pod syntax now lets you use C<=begin comment>
and C<=end comment> to delimit a Pod block correctly without the need
for C<=cut>. (In fact, C<=cut> is now gone.) The format name does
not have to be C<comment> -- any unrecognized format name will do
to make it a comment. (However, bare C<=begin> and C<=end> probably
aren't good enough, because all comments in them will show up in the
formatted output.)
We have single paragraph comments with C<=for comment> as well.
That lets C<=for> keep its meaning as the equivalent of a C<=begin>
and C<=end> combined. As with C<=begin> and C<=end>, a comment started
in code reverts to code afterwards.
Since there is a newline before the first C<=>, the Pod form of comment
counts as whitespace equivalent to a newline. See S26 for more on
embedded documentation.
=head2 Single-line Comments
Except within a quote literal, a C<#> character always introduces a comment in
Perl 6. There are two forms of comment based on C<#>. Embedded
comments require the C<#> to be followed by a backtick (C<`>) plus one
or more opening bracketing characters.
All other uses of C<#> are interpreted as single-line comments that
work just as in Perl 5, starting with a C<#> character and
ending at the subsequent newline. They count as whitespace equivalent
to newline for purposes of separation. Unlike in Perl 5, C<#>
may I<not> be used as the delimiter in quoting constructs.
=head2 Embedded Comments
Embedded comments are supported as a variant on quoting syntax, introduced
by C<#`> plus any user-selected bracket characters (as defined in
L</Bracketing Characters> above):
say #`( embedded comment ) "hello, world!";
$object\#`{ embedded comments }.say;
$object\ #`「
embedded comments
」.say;
Brackets may be nested, following the same policy as ordinary quote brackets.
There must be no space between the C<#`> and the opening bracket character.
(There may be the I<visual appearance> of space for some double-wide
characters, however, such as the corner quotes above.)
For multiline comments it is recommended (but not required) to use two or
more brackets both for visual clarity and to avoid relying too much on
internal bracket counting heuristics when commenting code that may accidentally
miscount single brackets:
#`{{
say "here is an unmatched } character";
}}
However, it's sometimes better to use Pod comments because they are
implicitly line-oriented.
=head2 User-selected Brackets
For all quoting constructs that use user-selected brackets, you can open
with multiple identical bracket characters, which must be closed by the
same number of closing brackets. Counting of nested brackets applies only
to pairs of brackets of the same length as the opening brackets:
say #`{{
This comment contains unmatched } and { { { { (ignored)
Plus a nested {{ ... }} pair (counted)
}} q<< <<woot>> >> # says " <<woot>> "
Note however that bare circumfix or postcircumfix C<<< <<...>> >>> is
not a user-selected bracket, but the ASCII variant of the C<< «...» >>
interpolating word list. Only C<#`> and the C<q>-style quoters (including
C<m>, C<s>, C<tr>, and C<rx>) enable subsequent user-selected brackets.
=head2 Unspaces
Some languages such as C allow you to escape newline characters
to combine lines. Other languages (such as regexes) allow you to
backslash a space character for various reasons. Perl 6 generalizes
this notion to any kind of whitespace. Any contiguous whitespace
(including comments) may be hidden from the parser by prefixing it
with C<\>. This is known as the "unspace". An unspace can suppress
any of several whitespace dependencies in Perl. For example, since
Perl requires an absence of whitespace between a noun and a postfix
operator, using unspace lets you line up postfix operators:
%hash\ {$key}
@array\ [$ix]
$subref\($arg)
As a special case to support the use above, a backslash where
a postfix is expected is considered a degenerate form of unspace.
Note that whitespace is not allowed before that, hence
$subref \($arg)
is a syntax error (two terms in a row). And
foo \($arg)
will be parsed as a list operator with a C<Capture> argument:
foo(\($arg))
However, other forms of unspace may usefully be preceded by whitespace.
(Unary uses of backslash may therefore never be followed by whitespace
or they would be taken as an unspace.)
Other postfix operators may also make use of unspace:
$number\ ++;
$number\ --;
1+3\ i;
$object\ .say();
$object\#`{ your ad here }.say
Another normal use of a you-don't-see-this-space is to put
a dotted postfix on the next line:
$object\ # comment
.say
$object\#`[ comment
].say
$object\
.say
But unspace is mainly about language extensibility: it lets you continue
the line in any situation where a newline might confuse the parser,
regardless of your currently installed parser. (Unless, of course,
you override the unspace rule itself...)
Although we say that the unspace hides the whitespace from the parser,
it does not hide whitespace from the lexer. As a result, unspace is not
allowed within a token. Additionally, line numbers are still
counted if the unspace contains one or more newlines.
Since Pod chunks count as whitespace to the language, they are also
swallowed up by unspace. Heredoc boundaries are suppressed, however,
so you can split excessively long heredoc intro lines like this:
ok(q:to'CODE', q:to'OUTPUT', \
"Here is a long description", \ # --more--
todo(:parrøt<0.42>, :dötnet<1.2>));
...
CODE
...
OUTPUT
To the heredoc parser that just looks like:
ok(q:to'CODE', q:to'OUTPUT', "Here is a long description", todo(:parrøt<0.42>, :dötnet<1.2>));
...
CODE
...
OUTPUT
Note that this is one of those cases in which it is fine to have
whitespace before the unspace, since we're only trying to suppress
the newline transition, not all whitespace as in the case of postfix
parsing. (Note also that the example above is not meant to spec how
the test suite works.)
=head2 Comments in Unspaces and vice versa
An unspace may contain a comment, but a comment may not contain an unspace.
In particular, end-of-line comments do not treat backslash as significant.
If you say:
#`\ (...
or
#\ `(...
it is an end-of-line comment, not an embedded comment. Write:
\ #`(
...
)
to mean the other thing.
=head2 Optional Whitespace and Exclusions
In general, whitespace is optional in Perl 6 except where it is needed
to separate constructs that would be misconstrued as a single token or
other syntactic unit. (In other words, Perl 6 follows the standard
I<longest-token> principle, or in the cases of large constructs, a
I<prefer shifting to reducing> principle. See L</Grammatical Categories>
below for more on how a Perl program is analyzed into tokens.)
This is an unchanging deep rule, but the surface ramifications of it
change as various operators and macros are added to or removed from
the language, which we expect to happen because Perl 6 is designed to
be a mutable language. In particular, there is a natural conflict
between postfix operators and infix operators, either of which
may occur after a term. If a given token may be interpreted as
either a postfix operator or an infix operator, the infix operator
requires space before it. Postfix operators may never have intervening
space, though they may have an intervening dot. If further separation
is desired, an unspace or embedded comment may be used as described above, as long
as no whitespace occurs outside the unspace or embedded comment.
For instance, if you were to add your own C<< infix:<++> >> operator,
then it must have space before it. The normal autoincrementing
C<< postfix:<++> >> operator may never have space before it, but may
be written in any of these forms:
$x++
$x\++
$x.++
$x\ ++
$x\ .++
$x\#`( comment ).++
$x\#`((( comment ))).++
$x\
.++
$x\ # comment
# inside unspace
.++
$x\ # comment
# inside unspace
++ # (but without the optional postfix dot)
$x\#`『 comment
more comment
』.++
$x\#`[ comment 1
comment 2
=begin Podstuff
whatever (Pod comments ignore current parser state)
=end Podstuff
comment 3
].++
=head3 Implicit Topical Method Calls
A consequence of the postfix rule is that (except when delimiting a
quote or terminating an unspace) a dot with whitespace in front
of it is always considered a method call on C<$_> where a term is
expected. If a term is not expected at this point, it is a syntax
error. (Unless, of course, there is an infix operator of that name
beginning with dot. You could, for instance, define a Fortranly
C<< infix:<.EQ.> >> if the fit took you. But you'll have to be sure to
always put whitespace in front of it, or it would be interpreted as
a postfix method call instead.)
For example,
foo .method
and
foo
.method
will always be interpreted as
foo $_.method
but never as
foo.method
Use some variant of
foo\
.method
if you mean the postfix method call.
One consequence of all this is that you may no longer write a Num as
C<42.> with just a trailing dot. You must instead say either C<42>
or C<42.0>. In other words, a dot following a number can only be a
decimal point if the following character is a digit. Otherwise the
postfix dot will be taken to be the start of some kind of method call
syntax. (The C<.123> form with a leading
dot is still allowed, however, when a term is expected, and is equivalent
to C<0.123> rather than C<$_.123>.)
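A short sketch of both rules, assuming a current Rakudo; the C<shout> routine is hypothetical and exists only to stand in for C<foo> above:

```raku
# Hypothetical list operator standing in for "foo".
sub shout($text) { $text ~ "!" }

given "hello" {
    # Whitespace before the dot: a method call on $_,
    # so this is shout($_.uc), not shout.uc
    say shout .uc;    # HELLO!
}

# A leading dot followed by digits is a numeric literal, not $_.5
say .5 + 1;           # 1.5
```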
=head1 Built-In Data Types
Perl 6 has an optional type system that helps you write safer
code that performs better. The compiler is free to infer what type
information it can from the types you supply, but it will not complain
about missing type information unless you ask it to.
Perl 6 is an OO engine, but you're not generally required to think
in OO when that's inconvenient. However, some built-in concepts such
as filehandles are more object-oriented in a user-visible way
than in Perl 5.
=head2 The P6opaque Datatype
In support of OO encapsulation, there is a new fundamental data
representation: B<P6opaque>. External access to opaque objects is always
through method calls, even for attributes.
=head2 Name Equivalence of Types
Types are officially compared using name equivalence rather than
structural equivalence. However, we're rather liberal in what we
consider a name. For example, the name includes the version and
authority associated with the module defining the type (even if
the type itself is "anonymous"). Beyond that, when you instantiate
a parametric type, the arguments are considered part of the "long
name" of the resulting type, so one C<Array of Int> is equivalent to
another C<Array of Int>. (Another way to look at it is that the type
instantiation "factory" is memoized.) Typename aliases are considered
equivalent to the original type. In particular, the C<Array of Int> syntax
is just sugar for C<Array:of(Int)>, which is the canonical form of an
instantiated generic type.
This name equivalence of parametric types extends only to parameters
that can be considered immutable (or that at least can have an
immutable snapshot taken of them). Two distinct classes are never
considered equivalent even if they have the same attributes because
classes are not considered immutable.
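To illustrate, here is a small sketch assuming a current Rakudo, where C<Array[Int]> is the spelling of the instantiated type. The class names are hypothetical:

```raku
my Int @a;
my Int @b;

# Two instantiations with the same arguments name the same type.
say @a.WHAT === @b.WHAT;          # True
say @a.WHAT === Array[Int];       # True
say Array[Int] === Array[Str];    # False

# Structurally identical classes are still distinct types.
class Cat  { has $.name }
class Bird { has $.name }
say Cat === Bird;                 # False
```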
=head2 Properties on Objects
Perl 6 supports the notion of B<properties> on various kinds of
objects. Properties are like object attributes, except that they're
managed by the individual object rather than by the object's class.
According to S12, properties are actually implemented by a
kind of mixin mechanism, and such mixins are accomplished by the
generation of an individual anonymous class for the object (unless
an identical anonymous class already exists and can safely be shared).
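As a sketch of the mixin mechanism, assuming a current Rakudo where the C<but> operator mixes a value into a copy of an individual object (the variable name is illustrative):

```raku
# Mix a string value into this one Int; other Ints are unaffected.
my $answer = 42 but "forty-two";

say $answer + 1;               # 43
say ~$answer;                  # forty-two
say $answer ~~ Int;            # True
say $answer.^name eq 'Int';    # False (an anonymous mixin class)
```

The object still behaves as an C<Int>, but its class is an anonymous one generated for the mixin.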
=head3 Traits
Properties applied to objects constructed at compile-time, such as
variables and classes, are also called B<traits>. Traits cannot be
changed at run-time. Changes to run-time properties are done via
mixin instead, so that the compiler can optimize based on declared traits.
=head2 Types as Constraints
A variable's type is a constraint indicating what sorts of values the
variable may contain. More precisely, it's a promise that the object
or objects contained in the variable are capable of responding to the
methods of the indicated "role". See S12 for more about roles.
# $x can contain only Int objects
my Int $x;
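A minimal sketch of the constraint being enforced at assignment, assuming a current Rakudo; the names C<$count> and C<$text> are illustrative only:

```raku
my Int $count = 42;
my $text = "forty-two";    # hypothetical non-Int value

try { $count = $text }     # the assignment fails its type check
say $!.^name;              # X::TypeCheck::Assignment
say $count;                # 42
```

The failed assignment leaves the previous contents of the variable untouched.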
=head2 Container Types
A variable may itself be bound to a container type that specifies how
the container works, without specifying what kinds of things it contains.
# $x is implemented by the MyScalar class
my $x is MyScalar;
Constraints and container types can be used together:
# $x can contain only Int objects,
# and is implemented by the MyScalar class
my Int $x is MyScalar;
Note that C<$x> is also initialized to the C<Int> type object. See below for
more on this.
=head2 Type Objects
C<my Dog $spot> by itself does not automatically call a C<Dog> constructor.
It merely assigns an undefined C<Dog> prototype object to C<$spot>:
my Dog $spot; # $spot is initialized with ::Dog
my Dog $spot = Dog; # same thing
$spot.defined; # False
say $spot; # "Dog()"
Any type name used as a value is an undefined instance of that type's
prototype object, or I<type object> for short. See S12 for more on that.
Any type name in rvalue context is parsed as a single type value and
expects no arguments following it. However, a type object responds to the
function call interface, so you may use the name of a type with parentheses
as if it were a function, and any argument supplied to the call is coerced
to the type indicated by the type object. If there is no argument
in the parentheses, the type object returns itself:
my $type = Num; # type object as a value
my $num = $type($string); # coerce $string to Num
To get a real C<Dog> object, call a constructor method such as C<new>:
my Dog $spot .= new;
my Dog $spot = $spot.new; # .= is rewritten into this
You can pass in arguments to the constructor as well:
my Dog $cerberus .= new(heads => 3);
my Dog $cerberus = $cerberus.new(heads => 3); # same thing
=head2 Coercive type declarations
The parenthesized form of type coercion may be used in declarations
where it makes sense to accept a wider set of types but coerce them
to a narrow type. (This only works for one-way coercion, so you
may not declare any C<rw> parameter with a coercive type.) The type
outside the parens indicates the desired end result, and subsequent
code may depend on it being that type. The type inside the parens
indicates the acceptable set of types that are allowed to be bound or
assigned to this location via coercion. If the wide type is omitted,
C<Any> is assumed. In any case, the wide type is only indicative of
permission to coerce; there must still be an available coercion routine
from the wide type to the narrow type to actually perform the coercion.
sub foo (Str(Any) $y) {...}
sub foo (Str() $y) {...} # same thing
my Num(Cool) $x = prompt "Gimme a number";
Coercions may also be specified on the return type:
sub bar ($x, $y --> Int()) { return 3.5 } # returns 3
=head2 Containers of Native Types
If you say
my int @array is MyArray;
you are declaring that the elements of C<@array> are native integers,
but that the array itself is implemented by the C<MyArray> class.
Untyped arrays and hashes are still perfectly acceptable, but have
the same performance issues they have in Perl 5.
=head2 Methods on Arrays
To get the number of elements in an array, use the C<.elems> method. You can
also ask for the total string length of an array's elements, in codepoints
or graphemes, by calling C<.codes> or C<.graphs> respectively on the array.
The same methods apply to strings as well.
(Note that C<.codes> is not well-defined unless you know which
canonicalization is in effect. Hence, it allows an optional argument
to specify the meaning exactly if it cannot be known from context.)
There is no C<.length> method for either arrays or strings, because C<length>
does not specify a unit.
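A brief sketch, assuming a current Rakudo. Note the assumption that grapheme counts are spelled C<.chars> there rather than C<.graphs>, and that the total length of the elements is taken here by joining them first:

```raku
my @words = <alpha beta gamma>;

say @words.elems;               # 3  (number of elements)
say "alpha".chars;              # 5  (graphemes in one string)
say @words.join.chars;          # 14 (graphemes across all elements)
say so "alpha".can('length');   # False (there is no .length)
```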
=head2 Built-in Type Conventions
Built-in object types start with an uppercase letter. This includes
immutable types (e.g. C<Int>, C<Num>, C<Complex>, C<Rat>, C<Str>,
C<Bit>, C<Regex>, C<Set>, C<Block>, C<Iterator>),
as well as mutable (container) types, such as C<Scalar>,
C<Array>, C<Hash>, C<Buf>, C<Routine>, C<Module>, and non-instantiable Roles
such as C<Callable> and C<Integral>.
Non-object (native) types are lowercase: C<int>, C<num>, C<complex>,
C<rat>, C<buf>, C<bit>. Native types are primarily intended for
declaring compact array storage, that is, a sequence of storage locations of
the specified type laid out in memory contiguously without pointer indirection.
However, Perl will try to make those look like their corresponding uppercase
types if you treat them that way. (In other words, it does autoboxing.
Note, however, that sometimes repeated autoboxing can slow your program
more than the native type can speed it up.)
=head3 The C<.WHICH> Method for Value Types
Some object types can behave as value types. Every object can produce
a "WHICH" value that uniquely identifies the
object for hashing and other value-based comparisons. Normal objects
just use their location as their identity, but if a class wishes to behave as a
value type, it can define a C<.WHICH> method that makes different objects
look like the same object if they happen to have the same contents.
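The following sketch shows one way a class might opt in to value semantics. It assumes a current Rakudo, where C<===> compares C<.WHICH> results and C<ValueObjAt> is the identity type used for value-like objects; both class names are hypothetical:

```raku
class Point {
    has $.x;
    has $.y;
    # Assumption: identity is derived from contents, not location.
    multi method WHICH(Point:D:) {
        ValueObjAt.new("Point|$!x|$!y")
    }
}

# A normal class keeps the default, location-based identity.
class Plain { has $.x }

say Point.new(x => 1, y => 2) === Point.new(x => 1, y => 2);   # True
say Plain.new(x => 1) === Plain.new(x => 1);                   # False
```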
=head3 The C<ObjAt> Type
When we say that a normal object uses its location as its identity,
we do I<not> mean that it returns its address as a number. In the first
place, not all objects are in the same memory space (see the literature
on NUMA, for instance), and two objects should not accidentally have
the same identity merely because they were stored at the same offset in
two different memory spaces. We also do not want to allow accidental
identity collisions with values that really are numbers (or strings,
or any other mundane value type). Nor should we be encouraging people
to think of object locations that way in any case. So C<WHICH> still
returns a value rather than another object, but that value must be of
a special C<ObjAt> type that prevents accidental confusion with normal
value types, and at least discourages trivial pointer arithmetic.
Certainly, it is difficult to give a unique name to every possible
address space, let alone every possible address within every such
space. In the absence of a universal naming scheme, it can only
be made improbable that two addresses from two different spaces will
collide. A sufficiently large random number may represent the current
address space on output of an C<ObjAt> to a different address space,
or if serialized to YAML or XML. (This extra identity component
need not be output for debugging messages that assume the current
address space, since it will be the same big number consistently,
unless your process really is running under a NUMA.)
Alternately, if an object is being serialized to a form that does
not preserve object identity, there is no requirement to preserve
uniqueness, since the object in this case is really being translated
to a value type representation, and reconstituted on the other end
as a different unique object.
=head2 Variables Containing Undefined Values
A variable with a non-native type constraint may contain an I<undefined> value
such as a type object, provided the undefined value meets the type constraint.
my Int $x = Int; # works
my Buf $x = Buf8; # works
Variables with native types do not support undefinedness: it is an error
to assign an undefined value to them:
my int $y = Int; # dies
Since C<num> can support the value C<NaN> but not the general concept of
undefinedness, you can coerce an undefined value like this:
my num $n = computation() // NaN;
Variables of non-native types start out containing a type object of
the appropriate type unless explicitly initialized to a defined value.
Any container's default may be overridden by the C<is default(VALUE)>
trait. If the container's contents are deleted, the value is
notionally set to the provided default value; this value may or may not
be physically represented in memory, depending on the implementation
of the container. You should officially not care about that (much).
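A small sketch of the C<is default> trait, assuming a current Rakudo in which assigning C<Nil> plays the role of deleting a scalar's contents (the variable name is illustrative):

```raku
my Int $count is default(42);

say $count;      # 42 (never assigned, so the default shows through)
$count = 7;
say $count;      # 7
$count = Nil;    # "delete" the contents
say $count;      # 42
```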
=head2 The C<HOW> Method
Every object supports a C<HOW> function/method that returns the
metaclass instance managing it, regardless of whether the object
is defined:
'x'.HOW.methods('x'); # get available methods for strings
Str.HOW.methods(Str); # same thing with the prototype object Str
HOW(Str).methods(Str); # same thing as function call
'x'.methods; # this is likely an error - not a meta object
Str.methods; # same thing
(For a prototype system (a non-class-based object system), all objects
are merely managed by the same meta object.)
=head2 Roles
Perl supports generic types through what are called "roles"
which represent capabilities or interfaces. These roles
are generally not used directly as object types. For instance
all the numeric types perform the C<Numeric> role, and all
string types perform the C<Stringy> role, but there's no
such thing as a "Numeric" object, since these are generic
types that must be instantiated with extra arguments to produce
normal object types. Common roles include:
Stringy
Numeric
Real
Integral
Rational
Callable
Positional
Associative
Buf
Blob
=head2 The C<Num> and C<Rat> Types
Perl 6 intrinsically supports big integers and rationals through its
system of type declarations. C<Int> automatically supports promotion
to arbitrary precision, as well as holding C<Inf> and C<NaN> values.
Note that C<Int> assumes 2's complement arithmetic, so C<+^1 == -2>
is guaranteed. (Native C<int> operations need not support this on
machines that are not natively 2's complement. You must convert to
and from C<Int> to do portable bitops on such ancient hardware.)
C<Num> must support the largest native floating point format that
runs at full speed. It may be bound to an arbitrary precision type,
but by default it is the same type as a native C<num>. See below.
C<Rat> supports extended precision rational arithmetic.
Dividing two C<Integral> objects using C<< infix:</> >> produces
a C<Rat>, which is generally usable anywhere a C<Num> is usable, but
may also be explicitly cast to C<Num>. (Also, if either side is
C<Num> already, C<< infix:</> >> gives you a C<Num> instead of a C<Rat>.)
C<Rat> and C<Num> both do the C<Real> role.
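These rules can be checked with a few one-liners. This is only a sketch, assuming a current Rakudo:

```raku
say (1/3).^name;        # Rat (Int divided by Int)
say (1 / 3e0).^name;    # Num (one side is already a Num)
say (3/4).Num.^name;    # Num (explicit conversion)

say (1/3) ~~ Real;      # True
say 0.5e0 ~~ Real;      # True

say +^1 == -2;          # True (two's complement negation on Int)
```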
Lowercase types like C<int> and C<num> imply the native
machine representation for integers and floating-point numbers,
respectively, and do not promote to arbitrary precision, though
larger representations are always allowed for temporary values.
Unless qualified with a number of bits, C<int> and C<num> types represent
the largest native integer and floating-point types that run at full speed.
Numeric values in untyped variables use C<Int> and C<Num> semantics
rather than C<int> and C<num>.
However, for pragmatic reasons, C<Rat> values are guaranteed to be
exact only up to a certain point. By default, this is the precision
that would be represented by the C<Rat64> type, which is an alias for
C<Rational[Int,Uint64]>, which has a numerator
of C<Int> but is limited to a denominator of C<Uint64> (which may or may not
be implemented as a native C<uint64>, since small representations may be
desirable for small denominators). A C<Rat64> that
would require more than 64 bits of storage in the denominator is
automatically converted either to a C<Num> or to a lesser-precision
C<Rat>, at the discretion of the implementation. (Native types such
as C<rat64> limit the size of both numerator and denominator, though
not to the same size. The numerator should in general be twice the
size of the denominator to support user expectations. For instance,
a C<rat8> actually supports C<Rational[int16,uint8]>, allowing
numbers like C<100.01> to be represented, and a C<rat64>,
defined as C<Rational[int128,uint64]>, can hold the number of seconds since
the Big Bang with attosecond precision. Though perhaps not with
attosecond accuracy...)
The limitation on C<Rat> values is intended to be enforced only on
user-visible types. Intermediate values used internally in calculating
the values of C<Rat> operators may exceed this precision, or represent
negative denominators. That is, the temporaries used in calculating
the new numerator and denominator are (at least in the abstract) of
C<Int> type. After a new numerator and denominator are determined,
any sign is forced to be represented only by the numerator. Then if
the denominator exceeds the storage size of the unsigned integer used,
the fraction is reduced via gcd. If the resulting denominator is still
larger than the storage size, then and I<only> then may the precision
be reduced to fit into a C<Rat> or C<Num>.
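As an illustration only, the finishing steps just described can be
modelled in a few lines of Python (chosen because its integers are
arbitrary precision, like C<Int>). The name C<finish_rat>, the tuple
representation, and the use of a Python float to stand in for C<Num>
are assumptions of this sketch, not part of the specification; zero
denominators are not addressed.

```python
from math import gcd

UINT64_MAX = 2**64 - 1  # largest denominator a Rat64 can store


def finish_rat(nu, de):
    """Model the last steps of a Rat operation on Int temporaries.

    Returns a (numerator, denominator) pair if the result fits a Rat64,
    otherwise a float standing in for Num.
    """
    # Any sign is carried by the numerator only.
    if de < 0:
        nu, de = -nu, -de
    # Reduce via gcd only if the denominator does not fit as it is.
    if de > UINT64_MAX:
        g = gcd(nu, de)
        nu, de = nu // g, de // g
    # Only if it still does not fit may precision be given up.
    if de > UINT64_MAX:
        return nu / de
    return (nu, de)


sign_moved = finish_rat(1, -100)        # (-1, 100)
left_alone = finish_rat(3, 300)         # (3, 300)
reduced = finish_rat(2**65, 2**66)      # (1, 2)
degraded = finish_rat(1, 2**64 + 1)     # a float
```

Note that C<(3, 300)> comes back unreduced: in this model the gcd step
runs only when the denominator would not otherwise fit.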
C<Rat> addition and subtraction should attempt to preserve the
denominator of the more precise argument if that denominator is
an integral multiple of the less precise denominator. That is,
in practical terms, adding a column of dollars and cents should
generally end up with a result that has a denominator of 100, even
if values like 42 and 3.5 were added in. With other operators,
this guarantee cannot be made; in such cases, the user should probably
be explicitly rounding to a particular denominator anyway.
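A sketch of that denominator-preserving rule, again in Python and
again purely illustrative: rationals are modelled as
C<(numerator, denominator)> pairs with positive denominators, and
C<rat_add> is a hypothetical helper name, not an API of any
implementation.

```python
def rat_add(a, b):
    """Add two (numerator, denominator) pairs without reducing.

    If one denominator is an integral multiple of the other, the
    larger (more precise) denominator is kept as it is.
    """
    (an, ad), (bn, bd) = a, b
    if ad % bd == 0:
        return (an + bn * (ad // bd), ad)
    if bd % ad == 0:
        return (an * (bd // ad) + bn, bd)
    # Unrelated denominators: fall back to the textbook cross product.
    return (an * bd + bn * ad, ad * bd)


# 19.99 + 42 + 3.5, all expressed as pairs
total = rat_add(rat_add((1999, 100), (42, 1)), (7, 2))
unrelated = rat_add((1, 3), (1, 4))
```

Here C<total> is C<(6549, 100)>, that is 65.49 with the denominator
of 100 intact, while C<unrelated> is C<(7, 12)>.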
For applications that really need arbitrary precision denominators as
well as numerators at the cost of performance, C<FatRat> may be used,
which is defined as C<Rational[Int,Int]>, that is, as arbitrary precision in
both parts. There is no literal form for a C<FatRat>, so it must
be constructed using C<FatRat.new($nu,$de)>. In general, only math
operators with at least one C<FatRat> argument will return another
C<FatRat>, to prevent accidental promotion of reasonably fast C<Rat>
values into arbitrarily slow C<FatRat> values.
Although most rational implementations normalize or "reduce" fractions
to their smallest representation immediately through a gcd algorithm,
Perl allows a rational datatype to do so lazily at need, such as
whenever the denominator would run out of precision, but avoid the
overhead otherwise. Hence, if you are adding a bunch of C<Rat>s that
represent, say, dollars and cents, the denominator may stay 100 the
entire way through. The C<.nu> and C<.de> methods will return these
unreduced values. You can use C<$rat.=norm> to normalize the fraction.
(This also forces the sign on the denominator to be positive.)
The C<.perl> method will produce a decimal number if the denominator is
a power of 10, or normalizable to a power of 10 (that is, having factors
of only 2 and 5 (and -1)). Otherwise it will normalize and return a rational
literal of the form C<< <-47/3> >>. Stringifying a rational via C<.gist>
or C<.Str> returns an exact decimal number if possible, and otherwise rounds
off the repeated decimal based on the size of the denominator. For full
details see the documentation of C<Rat.gist> in S32.
C<Num.Str> and C<Num.gist> both produce valid Num literals, so they must
include the C<e> for the exponential.
    say 1/5;    # 0.2 exactly
    say 1/3;    # 0.333333

    say <2/6>.perl
                # <1/3>

    say 3.14159_26535_89793
                # 3.141592653589793 including last digit

    say 111111111111111111111111111111111111111111111.123
                # 111111111111111111111111111111111111111111111.123

    say 555555555555555555555555555555555555555555555/5
                # 111111111111111111111111111111111111111111111

    say <555555555555555555555555555555555555555555555/5>.perl
                # <111111111111111111111111111111111111111111111/1>

    say 2e2;    # 200e0 or 2e2 or 200.0e0 or 2.0e2
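The C<.perl> rule for rationals can be modelled as follows. This is a
Python sketch under stated assumptions, not the specified algorithm:
C<rat_perl> is a hypothetical name, and the choice to emit C<< <n/1> >>
for a denominator of 1 is inferred from the last C<.perl> example in
the listing rather than stated in the prose.

```python
from fractions import Fraction


def rat_perl(nu, de):
    """Return a decimal literal when exact, else a <n/d> literal."""
    f = Fraction(nu, de)              # normalizes; sign goes to numerator
    n, d = f.numerator, f.denominator
    # Strip factors of 2 and 5; anything left rules out an exact decimal.
    rest = d
    for p in (2, 5):
        while rest % p == 0:
            rest //= p
    if rest != 1 or d == 1:
        return f"<{n}/{d}>"
    # Smallest power of ten that the denominator divides.
    k = 0
    while 10**k % d != 0:
        k += 1
    digits = str(abs(n) * 10**k // d).rjust(k + 1, "0")
    sign = "-" if n < 0 else ""
    return f"{sign}{digits[:-k]}.{digits[-k:]}"


fifth = rat_perl(1, 5)          # "0.2"
third = rat_perl(2, 6)          # "<1/3>"
eighth = rat_perl(-1, 8)        # "-0.125"
whole = rat_perl(555, 5)        # "<111/1>"
```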
=head2 Infinity and C<NaN>
Perl 6 by default makes standard IEEE floating point concepts
visible, such as C<Inf> (infinity) and C<NaN> (not a number). Within a
lexical scope, pragmas may specify the nature of temporary values,
and how floating point is to behave under various circumstances.
All IEEE modes must be lexically available via pragma except in cases
where that would entail heroic efforts to bypass a braindead platform.
The default floating-point modes do not throw exceptions but rather
propagate C<Inf> and C<NaN>. The boxed object types may carry more detailed
information on where overflow or underflow occurred. Numerics in Perl
are not designed to give the identical answer everywhere. They are
designed to give the typical programmer the tools to achieve a good
enough answer most of the time. (Really good programmers may occasionally
do even better.) Mostly this just involves using enough bits that the
stupidities of the algorithm don't matter much.
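Python floats are IEEE doubles on common platforms, so they can serve
as a rough illustration of the default non-trapping behaviour
described above. The snippet shows Python's behaviour, not that of any
Perl 6 implementation, and uses multiplication because Python itself
chooses to raise on float division by zero.

```python
import math

big = 1e308
overflowed = big * 10                 # overflows quietly to infinity
undefined = overflowed - overflowed   # inf - inf has no value: NaN
still_nan = undefined + 1.0           # NaN propagates through later math

# NaN compares unequal to everything, including itself.
nan_equals_itself = (undefined == undefined)
```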
=head2 Strings, the C<Str> Type
A C<Str> is a Unicode string object. There is no corresponding native
C<str> type. However, since a C<Str> object may fill multiple roles,
we say that a C<Str> keeps track of its minimum and maximum Unicode
abstraction levels, and plays along nicely with the current lexical
scope's idea of the ideal character, whether that is bytes, codepoints,
graphemes, or characters in some language.
=head3 The C<StrPos> Type
For all builtin operations,
all C<Str> positions are reported as position objects, not integers.
These C<StrPos> objects point into a particular string at a particular
location independent of abstraction level, either by tracking the
string and position directly, or by generating an abstraction-level
independent representation of the offset from the beginning of the
string that will give the same results if applied to the same string
in any context. This is assuming the string isn't modified in the
meantime; a C<StrPos> is not a "marker" and is not required to follow
changes to a mutable string. For instance, if you ask for the positions
of matches done by a substitution, the answers are reported in terms of the
original string (which may now be inaccessible!), not as positions within
the modified string.
=head3 The C<StrLen> Type
The subtraction of two C<StrPos> objects gives a C<StrLen> object,
which is also not an integer, because the string between two positions
also has multiple integer interpretations depending on the units.
A given C<StrLen> may know that it represents 7 codepoints,
3 graphemes, and 1 letter in Malayalam, but it might only know this
lazily because it actually just hangs onto the two C<StrPos> endpoints
within the string that in turn may or may not just lazily point into
the string. (The lazy implementation of C<StrLen> is much like a
C<Range> object in that respect.)
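A minimal model of such a lazy length, written in Python for
illustration. The class name C<SpanLength> and its methods are
hypothetical, and the grapheme count is a deliberately crude
approximation (it counts codepoints whose canonical combining class is
zero) rather than full Unicode grapheme cluster segmentation.

```python
import unicodedata


class SpanLength:
    """Length between two positions; each unit is computed on demand."""

    def __init__(self, string, start, end):
        # Only the string and the two endpoints are stored.
        self.string, self.start, self.end = string, start, end

    def _span(self):
        return self.string[self.start:self.end]

    def codepoints(self):
        return len(self._span())

    def bytes(self):
        # Assumes a UTF-8 byte view of the string.
        return len(self._span().encode("utf-8"))

    def graphemes(self):
        # Approximation: a combining mark extends the preceding grapheme.
        return sum(1 for ch in self._span()
                   if unicodedata.combining(ch) == 0)


text = "e\u0301a"                  # 'e' + combining acute, then 'a'
length = SpanLength(text, 0, 3)
units = (length.bytes(), length.codepoints(), length.graphemes())
```

The same span thus reports 4 bytes, 3 codepoints, or 2 graphemes,
which is why a single integer cannot stand in for it.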
=head3 Units of Position Arguments
If you use integers as arguments where position objects are expected,
it will be assumed that you mean the units of the current lexically
scoped Unicode abstraction level (which defaults to graphemes).
Otherwise you'll need to coerce to the proper units:
    substr($string, Bytes(42), ArabicChars(1))
Of course, such a dimensional number will fail if used on a string
that doesn't provide the appropriate abstraction level.
=head3 Numeric Coercion of C<StrPos> or C<StrLen>
If a C<StrPos> or C<StrLen> is forced into a numeric context, it will
assume the units of the current Unicode abstraction level. It is
erroneous to pass such a non-dimensional number to a routine that
would interpret it with the wrong units.
Implementation note: since Perl 6 mandates that the default Unicode
processing level must view graphemes as the fundamental unit rather
than codepoints, this has some implications regarding efficient
implementation. It is suggested that all graphemes be translated on
input to unique grapheme numbers and represented as integers within
some kind of uniform array for fast substr access. For those graphemes
that have a precomposed form, use of that codepoint is suggested.
(Note that this means Latin-1 can still be represented internally
with 8-bit integers.)
For graphemes that have no precomposed form, a temporary private
id should be assigned that uniquely identifies the grapheme.
If such ids are assigned consistently throughout the process,
comparison of two graphemes is no more difficult than the comparison
of two integers, and comparison of base characters no more difficult
than a direct lookup into the id-to-NFD table.
Obviously, any temporary grapheme ids must be translated back to
some universal form (such as NFD) on output, and normal precomposed
graphemes may turn into either NFC or NFD forms depending on the
desired output. Maintaining a particular grapheme/id mapping over the
life of the process may have some GC implications for long-running
processes, but most processes will likely see a limited number of
non-precomposed graphemes.
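The id scheme suggested in this implementation note might be sketched
as follows. This is Python, for illustration only: the class name
C<GraphemeTable>, its methods, and the use of negative integers for
temporary ids are all assumptions of the sketch, and the caller is
assumed to have already split the input into grapheme clusters.

```python
import unicodedata


class GraphemeTable:
    """Maps grapheme clusters to integers for one process lifetime."""

    def __init__(self):
        self._id_of = {}    # NFD cluster -> temporary id
        self._nfd_of = {}   # temporary id -> NFD cluster

    def intern(self, cluster):
        nfc = unicodedata.normalize("NFC", cluster)
        if len(nfc) == 1:
            # A precomposed form exists: use that codepoint directly.
            return ord(nfc)
        nfd = unicodedata.normalize("NFD", cluster)
        if nfd not in self._id_of:
            temp_id = -(len(self._id_of) + 1)   # negative: never a codepoint
            self._id_of[nfd] = temp_id
            self._nfd_of[temp_id] = nfd
        return self._id_of[nfd]

    def to_nfd(self, gid):
        """Translate an id back to a universal form for output."""
        if gid < 0:
            return self._nfd_of[gid]
        return unicodedata.normalize("NFD", chr(gid))

    def base(self, gid):
        # The base character is the first codepoint of the NFD form.
        return self.to_nfd(gid)[0]


table = GraphemeTable()
e_acute = table.intern("e\u0301")    # has a precomposed form, U+00E9
q_acute = table.intern("q\u0301")    # no precomposed form: temporary id
q_again = table.intern("q\u0301")    # same id when assigned consistently
```

Comparing C<q_acute> with C<q_again> is then a plain integer
comparison, and C<table.to_nfd> plays the part of output translation.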
If the program has a scope that wants a codepoint view rather than
a grapheme view, the string visible to that lexical scope must also
be translated to universal form, just as with output translation.
Alternately, the temporary grapheme ids may be hidden behind an
abstraction layer. In any case, codepoint scope should never see
any temporary grapheme ids. (The lexical codepoint declaration
should probably specify which normalization form it prefers to
view strings under. Such a declaration could be applied to input
translation as well.)
=head2 The C<Buf> Type
A C<Buf> is a stringish view of an array of
integers, and has no Unicode or character properties without explicit
conversion to some kind of C<Str>. (The C<buf8>, C<buf16>, C<buf32>,
and C<buf64> types are the native counterparts; native buf types are
required to occupy contiguous memory for the entire buffer.)
Typically a C<Buf> is an array of bytes serving as a buffer. Bitwise
operations on a C<Buf> treat the entire buffer as a single large
integer. Bitwise operations on a C<Str> generally fail unless the
C<Str> in question can provide an abstract C<Buf> interface somehow.
Coercion to C<Buf> should generally invalidate the C<Str> interface.
As a generic role C<Buf> may be instantiated as any
of C<buf8>, C<buf16>, or C<buf32> (or as any type that provides the
appropriate C<Buf> interface), but when used to create a buffer C<Buf>
is punned to a class implementing C<buf8> (actually C<Buf[uint8]>).
Unlike C<Str> types, C<Buf> types prefer to deal with integer string
positions, and map these directly to the underlying compact array
as indices. That is, these are not necessarily byte positions--an
integer position just counts over the number of underlying positions,
where one position means one cell of the underlying integer type.
Builtin string operations on C<Buf> types return integers and expect
integers when dealing with positions. As a limiting case, C<buf8> is
just an old-school byte string, and the positions are byte positions.
Note, though, that if you remap a section of C<buf32> memory to be
C<buf8>, you'll have to multiply all your positions by 4.
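The factor of 4 can be seen in a small Python model, in which a
C<bytes> object plays the part of a C<buf8> view over C<buf32> memory.
The helper C<cell32> is hypothetical, and the little-endian layout is
an assumption of this sketch; the text above does not specify a byte
order.

```python
import struct

cells = [10, 20, 30, 40]
# Four uint32 cells laid out contiguously, little-endian: 16 bytes.
raw = struct.pack("<4I", *cells)


def cell32(buf, pos):
    """Read the uint32 cell at cell position pos from a byte view."""
    return struct.unpack_from("<I", buf, pos * 4)[0]


third_cell = cell32(raw, 2)    # cell position 2 in the 32-bit view
same_place = raw[2 * 4]        # byte position 8 in the 8-bit view
```

Cell position 2 of the 32-bit view starts at byte position 8 of the
8-bit view; both reads above yield 30 here, because the value fits in
the low-order byte.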
=head3 Native C<buf> Types
These native types are defined based on the C<Buf> role, parameterized
by the native integer type it is composed of:
    Name        Is really
    ====        =========
    buf1        Buf[bit]
    buf8        Buf[uint8]
    buf16       Buf[uint16]
    buf32       Buf[uint32]
    buf64       Buf[uint64]
There are no signed buf types provided as built-ins, but you may say
    Buf[int8]
    Buf[int16]
    Buf[int32]
    Buf[int64]
to get buffers of signed integers. It is also possible to define
a C<Buf> based on non-integers or on non-native types:
    Buf[complex64]
    Buf[FatRat]
    Buf[Int]
However, no guarantee of memory contiguity can be made for non-native types.
=head2 The C<utf8> Type
The C<utf8> type is derived from C<buf8>, with the additional constraint
that it may only contain validly encoded UTF-8. Likewise, C<utf16> is
derived from C<buf16>, and C<utf32> from C<buf32>.
Note that since these are type names, parentheses must always be
used to call them as coercers, since the listop form is not allowed
for coercions. That is:
    utf8 op $x
is always parsed as
    (utf8) op $x
and never as