@@ -8,6 +8,55 @@ Perl 6 has a high level of support of Unicode. This document aims to be both an
 overview as well as describe Unicode features which don't belong in the documentation
 for routines and methods.

+=head1 File Handles and I/O
+
+Perl6 applies X<normalization> by default to all input and output except for
+file names which are stored as L<C<UTF8-C8>|#UTF8-C8>.
+What does this mean? For example á can be represented 2 ways. Either using
+one codepoint:
+
+=for code :skip-test
+á (U+E1 "LATIN SMALL LETTER A WITH ACUTE")
+
+Or two codepoints:
+
+=for code :skip-test
+a + ́ (U+61 "LATIN SMALL LETTER A" + U+301 "COMBINING ACUTE ACCENT")
+
+Perl 6 will turn both these inputs into one codepoint, as is specified for
+normalization form canonical (B<X<NFC>>). In most cases this is useful and means
+that two inputs that are equivalent both are treated the same, and any text
+you process or output from Perl 6 will be in this "canonical" form.
+
+One case where we don't default to this, is for file handles. This is because
+file handles must be accessed exactly as the bytes are written on the disk.
+
+You can use L<UTF8-C8|#UTF8-C8> with any file handle to read the exact bytes as they are
+on disk. They may look funny when printed out, if you print it out using a
+UTF8 handle. If you print it out to a handle where the output is UTF8-C8,
+then it will render as you would normally expect, and be a byte for byte exact
+copy. More technical details on L<UTF8-C8|#UTF8-C8> on MoarVM below.
+
+=head2 X<UTF8-C8>
+
+X<UTF-8 Clean-8> is an encoder/decoder that primarily works as the UTF-8 one.
+However, upon encountering a byte sequence that will either not decode as
+valid UTF-8, or that would not round-trip due to normalization, it will use
+NFG synthetics to keep track of the original bytes involved. This means that
+encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they
+originally existed. The synthetics contain 4 codepoints:
+
+=item The codepoint 0x10FFFD (which is a private use codepoint)
+=item The codepoint 'x'
+=item The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)
+=item The lower 4 bits as the non-decodable byte as a hex char (0..9A..F)
+
+Under normal UTF-8 encoding, this means the unrepresentable characters will
+come out as something like `?xFF`.
+
+UTF-8 Clean-8 is used in places where MoarVM receives strings from the
+environment, command line arguments, and file system queries.
+
 =head1 Entering Unicode Codepoints and Codepoint Sequences

 You can enter Unicode codepoints by number (decimal as well as hexadecimal). For example, the character named
@@ -84,53 +133,5 @@ commas to separate different codepoints/sequences inside the same C<\c> sequence
 say "\c[woman gesturing OK]"; # OUTPUT: «🙆♀️»
 say "\c[family: man woman girl boy]"; # OUTPUT: «👨👩👧👦»

-=head1 File Handles and I/O
-
-Perl6 applies X<normalization> by default to all input and output except for
-file names which are stored as L<C<UTF8-C8>|#UTF8-C8>.
-What does this mean? For example á can be represented 2 ways. Either using
-one codepoint:
-
-=for code :skip-test
-á (U+E1 "LATIN SMALL LETTER A WITH ACUTE")
-
-Or two codepoints:
-
-=for code :skip-test
-a + ́ (U+61 "LATIN SMALL LETTER A" + U+301 "COMBINING ACUTE ACCENT")
-
-Perl 6 will turn both these inputs into one codepoint, as is specified for
-normalization form canonical (B<X<NFC>>). In most cases this is useful and means
-that two inputs that are equivalent both are treated the same, and any text
-you process or output from Perl 6 will be in this "canonical" form.
-
-One case where we don't default to this, is for file handles. This is because
-file handles must be accessed exactly as the bytes are written on the disk.
-
-You can use UTF8-C8 with any file handle to read the exact bytes as they are
-on disk. They may look funny when printed out, if you print it out using a
-UTF8 handle. If you print it out to a handle where the output is UTF8-C8,
-then it will render as you would normally expect, and be a byte for byte exact
-copy. More technical details on UTF8-C8 on MoarVM below.
-
-=head2 X<UTF8-C8>
-
-X<UTF-8 Clean-8> is an encoder/decoder that primarily works as the UTF-8 one.
-However, upon encountering a byte sequence that will either not decode as
-valid UTF-8, or that would not round-trip due to normalization, it will use
-NFG synthetics to keep track of the original bytes involved. This means that
-encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they
-originally existed. The synthetics contain 4 codepoints:
-
-=item The codepoint 0x10FFFD (which is a private use codepoint)
-=item The codepoint 'x'
-=item The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)
-=item The lower 4 bits as the non-decodable byte as a hex char (0..9A..F)
-
-Under normal UTF-8 encoding, this means the unrepresentable characters will
-come out as something like `?xFF`.
-
-UTF-8 Clean-8 is used in places where MoarVM receives strings from the
-environment, command line arguments, and file system queries.

 =end pod