Commit e556ea3

[unicode-grant] Add documentation on UTF8-C8
We previously had no documentation on UTF8-C8. Here is a pretty good start for people looking to better understand what it means and at least *some* of the reasons why it exists.
1 parent 4b32979 commit e556ea3

1 file changed

+51
-3
lines changed

doc/Language/unicode.pod6

Lines changed: 51 additions & 3 deletions
@@ -17,7 +17,9 @@ Additionally, all Unicode codepoint names/named seq/emoji sequences are now case
1717
1818
say "\c[latin capital letter E]"; # OUTPUT: «E␤» (U+0045)
1919
20-
=head1 Name Aliases
20+
=head1 Entering Unicode Codepoints and Codepoint Sequences
21+
22+
=head2 Name Aliases
2123
2224
By name alias. Name Aliases are used mainly for codepoints without an official
2325
name, for abbreviations, or for corrections (Unicode names never change).
@@ -41,15 +43,15 @@ Abbreviations:
4143
say "\c[ZWJ]".uniname; # OUTPUT: «ZERO WIDTH JOINER␤»
4244
say "\c[NBSP]".uniname; # OUTPUT: «NO-BREAK SPACE␤»
4345
44-
=head1 Named Sequences
46+
=head2 Named Sequences
4547
4648
You can also use any of the L<Named Sequences|http://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt>,
4749
these are not single codepoints, but sequences of them. [Starting in 2017.02]
4850
4951
say "\c[LATIN CAPITAL LETTER E WITH VERTICAL LINE BELOW AND ACUTE]"; # OUTPUT: «É̩␤»
5052
say "\c[LATIN CAPITAL LETTER E WITH VERTICAL LINE BELOW AND ACUTE]".ords; # OUTPUT: «(201 809)␤»
5153
52-
=head2 Emoji Sequences
54+
=head3 Emoji Sequences
5355
5456
Rakudo has support for Emoji 4.0 (the latest non-draft release) sequences.
5557
For all of them see:
@@ -61,4 +63,50 @@ commas to separate different codepoints/sequences inside the same C<\c> sequence
6163
say "\c[woman gesturing OK]"; # OUTPUT: «🙆‍♀️␤»
6264
say "\c[family: man woman girl boy]"; # OUTPUT: «👨‍👩‍👧‍👦␤»
6365
66+
=head1 File Handles and I/O
67+
68+
Perl 6 applies X<normalization> by default to all input and output.
69+
What does this mean? For example, á can be represented in two ways: either using
70+
one codepoint:
71+
72+
á (U+E1 "LATIN SMALL LETTER A WITH ACUTE")
73+
74+
Or two codepoints:
75+
76+
a + ́ (U+61 "LATIN SMALL LETTER A" + U+301 "COMBINING ACUTE ACCENT")
77+
78+
Perl 6 will turn both these inputs into one codepoint, as is specified for
79+
Normalization Form C (B<X<NFC>>). In most cases this is useful and means
80+
that two inputs that are equivalent are both treated the same, and any text
81+
you process or output from Perl 6 will be in this "canonical" form.
82+
83+
One case where we don't default to this, is for the names of files. This is because
84+
the names of files must be accessed exactly as the bytes are written on the disk.
85+
86+
You can use UTF8-C8 with any file handle to read the exact bytes as they are
87+
on disk. The text may look funny if you print it out using a
88+
UTF8 handle. If you print it out to a handle where the output is UTF8-C8,
89+
then it will render as you would normally expect, and be a byte for byte exact
90+
copy. More technical details on UTF8-C8 on MoarVM below.
91+
92+
=head2 X<UTF8-C8>
93+
94+
X<UTF-8 Clean-8> is an encoder/decoder that mostly works like the UTF-8 one.
95+
However, upon encountering a byte sequence that will either not decode as
96+
valid UTF-8, or that would not round-trip due to normalization, it will use
97+
NFG synthetics to keep track of the original bytes involved. This means that
98+
encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they
99+
originally existed. The synthetics contain 4 codepoints:
100+
101+
=item The codepoint 0x10FFFD (which is a private use codepoint)
102+
=item The codepoint 'x'
103+
=item The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)
104+
=item The lower 4 bits of the non-decodable byte as a hex char (0..9A..F)
105+
106+
Under normal UTF-8 encoding, this means the unrepresentable characters will
107+
come out as something like C<?xFF>.
108+
109+
UTF-8 Clean-8 is used in places where MoarVM receives strings from the
110+
environment, command line arguments, and file system queries.
111+
64112
=end pod
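
Note on the added normalization text: the behavior it describes can be sketched in a few lines of Raku. This is a minimal illustration, assuming Rakudo on MoarVM; the variable names are hypothetical and not part of the commit.

```raku
# Two spellings of "á": precomposed, and base letter plus combining accent.
my $composed   = "\x[E1]";    # U+00E1 LATIN SMALL LETTER A WITH ACUTE
my $decomposed = "a\x[301]";  # U+0061 followed by U+0301 COMBINING ACUTE ACCENT

# Both literals are normalized when the strings are created,
# so they compare equal and hold a single character.
say $composed eq $decomposed; # OUTPUT: «True␤»
say $decomposed.chars;        # OUTPUT: «1␤»
say $decomposed.ords;         # OUTPUT: «(225)␤»

# Asking explicitly for the decomposed form recovers the two codepoints.
say $decomposed.NFD.list;     # OUTPUT: «(97 769)␤»
```

The two-codepoint input is no longer distinguishable from the one-codepoint input once it is a string, which is why a byte-preserving encoding is needed when the original bytes matter.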

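The byte-preserving round trip attributed to UTF8-C8 can be sketched in the same hedged way. The byte values below are made up for illustration, and the sketch assumes that `decode` and `encode` accept the `utf8-c8` encoding name on the Rakudo/MoarVM build in use.

```raku
# 'A', a byte that is not valid UTF-8 on its own, then 'B' (illustrative data).
my $bytes = Buf.new(0x41, 0xFF, 0x42);

# Decoding with utf8-c8 should not fail on the invalid byte; the added
# documentation says it is kept as an NFG synthetic instead.
my $text = $bytes.decode('utf8-c8');

# Encoding with utf8-c8 again is expected to reproduce the original bytes.
my $back = $text.encode('utf8-c8');
say $back.list eqv $bytes.list;

# Encoding the same string as plain UTF-8 instead is expected to emit the
# escaped form of the invalid byte, not the byte itself.
say $text.encode('utf8').list;
```

Decoding the same buffer with plain `utf8` would be expected to throw on the `0xFF` byte, which is the situation UTF8-C8 is meant to avoid.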