@@ -17,7 +17,9 @@ Additionally, all Unicode codepoint names/named seq/emoji sequences are now case
17
17
18
18
say "\c[latin capital letter E]"; # OUTPUT: «E» (U+0045)
19
19
20
- =head1 Name Aliases
20
+ =head1 Entering Unicode Codepoints and Codepoint Sequences
21
+
22
+ =head2 Name Aliases
21
23
22
24
By name alias. Name Aliases are used mainly for codepoints without an official
23
25
name, for abbreviations, or for corrections (Unicode names never change).
@@ -41,15 +43,15 @@ Abbreviations:
41
43
say "\c[ZWJ]".uniname; # OUTPUT: «ZERO WIDTH JOINER»
42
44
say "\c[NBSP]".uniname; # OUTPUT: «NO-BREAK SPACE»
43
45
44
- =head1 Named Sequences
46
+ =head2 Named Sequences
45
47
46
48
You can also use any of the L<Named Sequences|http://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt>,
47
49
these are not single codepoints, but sequences of them. [Starting in 2017.02]
48
50
49
51
say "\c[LATIN CAPITAL LETTER E WITH VERTICAL LINE BELOW AND ACUTE]"; # OUTPUT: «É̩»
50
52
say "\c[LATIN CAPITAL LETTER E WITH VERTICAL LINE BELOW AND ACUTE]".ords; # OUTPUT: «(201 809)»
51
53
52
- =head2 Emoji Sequences
54
+ =head3 Emoji Sequences
53
55
54
56
Rakudo has support for Emoji 4.0 (the latest non-draft release) sequences.
55
57
For all of them see:
@@ -61,4 +63,50 @@ commas to separate different codepoints/sequences inside the same C<\c> sequence
61
63
say "\c[woman gesturing OK]"; # OUTPUT: «🙆♀️»
62
64
say "\c[family: man woman girl boy]"; # OUTPUT: «👨👩👧👦»
63
65
66
+ =head1 File Handles and I/O
67
+
68
+ Perl 6 applies X<normalization> by default to all input and output.
69
+ What does this mean? For example, á can be represented in two ways. Either using
70
+ one codepoint:
71
+
72
+ á (U+00E1 "LATIN SMALL LETTER A WITH ACUTE")
73
+
74
+ Or two codepoints:
75
+
76
+ a + ́ (U+0061 "LATIN SMALL LETTER A" + U+0301 "COMBINING ACUTE ACCENT")
77
+
78
+ Perl 6 will turn both these inputs into one codepoint, as is specified for
79
+ Normalization Form C, canonical composition (B<X<NFC>>). In most cases this is useful and means
80
+ that two equivalent inputs are both treated the same, and any text
81
+ you process or output from Perl 6 will be in this "canonical" form.
82
+
83
+ One case where we don't default to this is for file handles. This is because
84
+ file handles must be accessed exactly as the bytes are written on the disk.
85
+
86
+ You can use UTF8-C8 with any file handle to read the exact bytes as they are
87
+ on disk. The text may look funny when printed out, if you print it out using a
88
+ UTF8 handle. If you print it out to a handle where the output is UTF8-C8,
89
+ then it will render as you would normally expect, and be a byte for byte exact
90
+ copy. More technical details on UTF8-C8 on MoarVM are given below.
91
+
92
+ =head2 X<UTF8-C8>
93
+
94
+ X<UTF-8 Clean-8> is an encoder/decoder that primarily works like the UTF-8 one.
95
+ However, upon encountering a byte sequence that will either not decode as
96
+ valid UTF-8, or that would not round-trip due to normalization, it will use
97
+ NFG synthetics to keep track of the original bytes involved. This means that
98
+ encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they
99
+ originally existed. The synthetics contain 4 codepoints:
100
+
101
+ =item The codepoint 0x10FFFD (which is a private use codepoint)
102
+ =item The codepoint 'x'
103
+ =item The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)
104
+ =item The lower 4 bits of the non-decodable byte as a hex char (0..9A..F)
105
+
106
+ Under normal UTF-8 encoding, this means the unrepresentable characters will
107
+ come out as something like `?xFF`.
108
+
109
+ UTF-8 Clean-8 is used in places where MoarVM receives strings from the
110
+ environment, command line arguments, and file system queries.
111
+
64
112
=end pod
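
Not part of the commit: a minimal sketch of the behaviour the added section describes, assuming a Rakudo on MoarVM where the `utf8-c8` encoding name is available. The variable names are illustrative only. It checks that the two spellings of á compare equal after normalization, and that a byte sequence which is not valid UTF-8 survives a decode/encode round trip through UTF8-C8.

```raku
# Two spellings of "á": one precomposed codepoint, or base letter + combining accent.
my $composed   = "\x[E1]";      # U+00E1
my $decomposed = "a\x[301]";    # U+0061 followed by U+0301

# Strings are normalized, so both spellings are the same one-character string.
say $composed eq $decomposed;   # True
say $decomposed.chars;          # 1
say $decomposed.ords;           # (225)

# 0xFF can never appear in valid UTF-8, so a plain 'utf8' decode would fail here.
my $bytes     = Buf.new(0x61, 0xFF, 0x62);
my $str       = $bytes.decode('utf8-c8');
my $roundtrip = $str.encode('utf8-c8');

# Compare the raw byte values: the original bytes come back unchanged.
say $roundtrip.list.join(',') eq $bytes.list.join(',');   # True
```

The same encoding name should also be usable when opening a file handle, for example `open($path, :enc('utf8-c8'))`, where `$path` is a placeholder for a real file.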