Commit 95a783e

Move Information on normalization higher up on Unicode page
1 parent a0e4e79 commit 95a783e

1 file changed (+49, -48 lines changed)

doc/Language/unicode.pod6

Lines changed: 49 additions & 48 deletions
@@ -8,6 +8,55 @@ Perl 6 has a high level of support of Unicode. This document aims to be both an
 overview as well as describe Unicode features which don't belong in the documentation
 for routines and methods.
 
+=head1 File Handles and I/O
+
+Perl6 applies X<normalization> by default to all input and output except for
+file names which are stored as L<C<UTF8-C8>|#UTF8-C8>.
+What does this mean? For example á can be represented 2 ways. Either using
+one codepoint:
+
+=for code :skip-test
+á (U+E1 "LATIN SMALL LETTER A WITH ACUTE")
+
+Or two codepoints:
+
+=for code :skip-test
+a + ́ (U+61 "LATIN SMALL LETTER A" + U+301 "COMBINING ACUTE ACCENT")
+
+Perl 6 will turn both these inputs into one codepoint, as is specified for
+normalization form canonical (B<X<NFC>>). In most cases this is useful and means
+that two inputs that are equivalent both are treated the same, and any text
+you process or output from Perl 6 will be in this "canonical" form.
+
+One case where we don't default to this, is for file handles. This is because
+file handles must be accessed exactly as the bytes are written on the disk.
+
+You can use L<UTF8-C8|#UTF8-C8> with any file handle to read the exact bytes as they are
+on disk. They may look funny when printed out, if you print it out using a
+UTF8 handle. If you print it out to a handle where the output is UTF8-C8,
+then it will render as you would normally expect, and be a byte for byte exact
+copy. More technical details on L<UTF8-C8|#UTF8-C8> on MoarVM below.
+
+=head2 X<UTF8-C8>
+
+X<UTF-8 Clean-8> is an encoder/decoder that primarily works as the UTF-8 one.
+However, upon encountering a byte sequence that will either not decode as
+valid UTF-8, or that would not round-trip due to normalization, it will use
+NFG synthetics to keep track of the original bytes involved. This means that
+encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they
+originally existed. The synthetics contain 4 codepoints:
+
+=item The codepoint 0x10FFFD (which is a private use codepoint)
+=item The codepoint 'x'
+=item The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)
+=item The lower 4 bits as the non-decodable byte as a hex char (0..9A..F)
+
+Under normal UTF-8 encoding, this means the unrepresentable characters will
+come out as something like `?xFF`.
+
+UTF-8 Clean-8 is used in places where MoarVM receives strings from the
+environment, command line arguments, and file system queries.
+
 =head1 Entering Unicode Codepoints and Codepoint Sequences
 
 You can enter Unicode codepoints by number (decimal as well as hexadecimal). For example, the character named
@@ -84,53 +133,5 @@ commas to separate different codepoints/sequences inside the same C<\c> sequence
 say "\c[woman gesturing OK]"; # OUTPUT: «🙆‍♀️␤»
 say "\c[family: man woman girl boy]"; # OUTPUT: «👨‍👩‍👧‍👦␤»
 
-=head1 File Handles and I/O
-
-Perl6 applies X<normalization> by default to all input and output except for
-file names which are stored as L<C<UTF8-C8>|#UTF8-C8>.
-What does this mean? For example á can be represented 2 ways. Either using
-one codepoint:
-
-=for code :skip-test
-á (U+E1 "LATIN SMALL LETTER A WITH ACUTE")
-
-Or two codepoints:
-
-=for code :skip-test
-a + ́ (U+61 "LATIN SMALL LETTER A" + U+301 "COMBINING ACUTE ACCENT")
-
-Perl 6 will turn both these inputs into one codepoint, as is specified for
-normalization form canonical (B<X<NFC>>). In most cases this is useful and means
-that two inputs that are equivalent both are treated the same, and any text
-you process or output from Perl 6 will be in this "canonical" form.
-
-One case where we don't default to this, is for file handles. This is because
-file handles must be accessed exactly as the bytes are written on the disk.
-
-You can use UTF8-C8 with any file handle to read the exact bytes as they are
-on disk. They may look funny when printed out, if you print it out using a
-UTF8 handle. If you print it out to a handle where the output is UTF8-C8,
-then it will render as you would normally expect, and be a byte for byte exact
-copy. More technical details on UTF8-C8 on MoarVM below.
-
-=head2 X<UTF8-C8>
-
-X<UTF-8 Clean-8> is an encoder/decoder that primarily works as the UTF-8 one.
-However, upon encountering a byte sequence that will either not decode as
-valid UTF-8, or that would not round-trip due to normalization, it will use
-NFG synthetics to keep track of the original bytes involved. This means that
-encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they
-originally existed. The synthetics contain 4 codepoints:
-
-=item The codepoint 0x10FFFD (which is a private use codepoint)
-=item The codepoint 'x'
-=item The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)
-=item The lower 4 bits as the non-decodable byte as a hex char (0..9A..F)
-
-Under normal UTF-8 encoding, this means the unrepresentable characters will
-come out as something like `?xFF`.
-
-UTF-8 Clean-8 is used in places where MoarVM receives strings from the
-environment, command line arguments, and file system queries.
 
 =end pod
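
Note on the moved section: the two behaviours it describes can be checked outside Perl 6. The sketch below is in Python, not Perl 6, and is only an illustration under stated assumptions. It uses the standard library's unicodedata.normalize to show that the one-codepoint and two-codepoint spellings of á agree after NFC. The helper c8_synthetic is a hypothetical name, not a MoarVM or Perl 6 API. It only builds the four codepoints the section lists for one undecodable byte, and does not reproduce how MoarVM stores NFG synthetics internally.

```python
import unicodedata

# U+E1 is the single precomposed code point; U+61 U+301 is the decomposed pair.
precomposed = "\u00E1"
decomposed = "a\u0301"

# Without normalization the two spellings compare as different strings.
assert precomposed != decomposed

# NFC composes the pair into the single code point, so both inputs agree.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)


def c8_synthetic(byte: int) -> str:
    """Hypothetical helper: the four code points listed for one undecodable byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value")
    hex_digits = format(byte, "02X")  # upper 4 bits first, then lower 4 bits
    # Private-use code point U+10FFFD, then 'x', then the two hex characters.
    return "\U0010FFFD" + "x" + hex_digits


# Both normalized strings are equal and one code point long.
print(nfc_a == nfc_b, len(nfc_a), len(nfc_b))
# List the code points that stand in for the byte 0xFF.
print([f"U+{ord(c):04X}" for c in c8_synthetic(0xFF)])
```

The second print lists the private-use codepoint followed by 'x' and the two hex characters, which is why such a byte shows up as something like `?xFF` when written through an ordinary UTF-8 handle.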
