Web character encodings for Perl

NAME

Web::Encoding - Web Encodings APIs

SYNOPSIS

  use Web::Encoding;
  $bytes = encode_web_utf8 $chars;
  $chars = decode_web_utf8 $bytes;

DESCRIPTION

The Web::Encoding module provides a set of functions to handle Web-compatible character encodings.

The perl-web-encodings repository also contains the following modules:

Web::Encoding::UnivCharDet

The universalchardet (or universal detector) implementation in Perl, which can be used to implement HTML parsers.

Web::Encoding::Normalization

Implementation of Unicode's string normalization algorithms, i.e. NFC, NFD, NFKC, and NFKD.

FUNCTIONS

The functions described in the following subsections are exported by default.

Encoding labels and properties of encodings

The following functions handle encoding labels and return properties of encodings:

$key = encoding_label_to_name $label

Find the encoding identified by the specified label. Like the Encoding Standard's "get an encoding" steps, this function ignores leading and trailing spaces and compares labels ASCII case-insensitively. The function returns the encoding key (not a name) if the encoding is found, or undef otherwise.
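As a minimal sketch of the lookup described above (the label strings are illustrative, and the exact key printed depends on the module's data), a label with surrounding spaces and mixed case can be resolved like this:

```perl
use strict;
use warnings;
use Web::Encoding;

# Leading/trailing spaces are ignored and matching is ASCII case-insensitive.
my $key = encoding_label_to_name("  UTF8 ");
print defined($key) ? "key: $key\n" : "unknown label\n";

# An unrecognized label yields undef; the label below is a made-up example.
my $unknown = encoding_label_to_name("no-such-encoding");
print "not found\n" unless defined $unknown;
```

Checking the result with defined before using it avoids passing undef to the key-based functions.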

$key = fixup_html_meta_encoding_name $key

Replace an encoding key for the purpose of an HTML character encoding declaration, as in the "prescan a byte stream to determine its encoding" and "change the encoding" algorithms. The argument must be an encoding key (not a name or label). The function returns an encoding key.

$key = get_output_encoding_key $key

Return the result of applying the steps to get an output encoding. The argument must be an encoding key (not a name or label). The function returns an encoding key.

$name = encoding_name_to_compat_name $key

Convert an encoding key to its official name, as used in, for example, the characterSet or inputEncoding attributes of the Document interface. The argument must be an encoding key (not a name or label). The function returns an encoding name.

$boolean = is_ascii_compat_encoding_name $key

Return whether the specified encoding is an ASCII-compatible character encoding or not. The argument must be an encoding key (not a name or label).
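The key-based functions above can be chained. The following is a rough sketch, assuming that 'utf-16le' is the key the module uses for UTF-16LE (keys are described as lowercased encoding names); the printed values depend on the module's data and are not shown here.

```perl
use strict;
use warnings;
use Web::Encoding;

# Assumption: 'utf-16le' is a valid encoding key in this module.
my $declared = 'utf-16le';

# Key to use when the encoding comes from an HTML <meta> declaration.
my $for_meta = fixup_html_meta_encoding_name($declared);

# Key to use when producing output (e.g. form submission or URL encoding).
my $for_output = get_output_encoding_key($declared);

# Name suitable for exposing through DOM attributes such as characterSet.
my $display = encoding_name_to_compat_name($for_output);

my $ascii_compat = is_ascii_compat_encoding_name($for_output) ? 'yes' : 'no';

print "meta: $for_meta, output: $for_output, name: $display, ",
      "ASCII-compatible: $ascii_compat\n";
```

All four functions take a key, so a label obtained from user input would first need to go through encoding_label_to_name.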

$boolean = is_encoding_label $label

Return whether the specified label identifies an encoding or not. It compares labels ASCII case-insensitively. Unlike the encoding_label_to_name function, however, this function does not ignore spaces.
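The difference in whitespace handling between is_encoding_label and encoding_label_to_name can be illustrated with a short sketch (the label strings are examples only):

```perl
use strict;
use warnings;
use Web::Encoding;

# Exact label, matched ASCII case-insensitively.
my $plain = is_encoding_label("UTF8") ? 1 : 0;

# Same label with surrounding spaces: is_encoding_label does not trim them.
my $padded = is_encoding_label(" UTF8 ") ? 1 : 0;

# encoding_label_to_name trims spaces before looking the label up.
my $key = encoding_label_to_name(" UTF8 ");

print "plain: $plain, padded: $padded, lookup: ",
      (defined $key ? $key : 'undef'), "\n";
```

Use is_encoding_label when the input must already be an exact label, and encoding_label_to_name when lenient parsing is wanted.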

$key = locale_default_encoding_name $tag

Return the encoding key (not a name or label) of the default character encoding for a locale. If no default is known for the specified locale, undef is returned.

The argument, which identifies the locale, must be either a BCP 47 language tag or the string *. Data is available only for tags consisting of a primary language subtag alone and for zh-TW and zh-CN. Tags are compared ASCII case-insensitively. If * is specified, the function returns the global default encoding, which can be used when the locale is unknown or has no default.
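One way to combine the per-locale lookup with the * fallback is sketched below. The helper name default_key_for is hypothetical and not part of the module, and the tags passed to it are examples.

```perl
use strict;
use warnings;
use Web::Encoding;

# Hypothetical helper: return the locale's default encoding key,
# falling back to the global default when the locale has no entry.
sub default_key_for {
  my ($tag) = @_;
  my $key = locale_default_encoding_name($tag);
  $key = locale_default_encoding_name('*') unless defined $key;
  return $key;
}

for my $tag ('ja', 'zh-TW', 'x-example') {
  my $key = default_key_for($tag);
  print "$tag: ", (defined $key ? $key : 'undef'), "\n";
}
```

The returned value is an encoding key, so it can be passed directly to the other key-based functions.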

For the purposes of this module, the key of an encoding is a short string uniquely identifying the encoding. It is a lowercased variant of the encoding name defined in the Encoding Standard.

Note that the encoding names in the Encoding Standard are not compatible with Perl Encode module's encoding names.

Encoders and decoders

The following functions encode and decode strings:

$bytes = encode_web_utf8 $chars

Encode the character string in UTF-8 and return the encoded bytes.

This function corresponds to the "UTF-8 encode" operation of the Encoding Standard.

$chars = decode_web_utf8 $bytes

Decode the bytes as UTF-8 and return the decoded character string. Invalid bytes are replaced by U+FFFD characters; the function does not fail on malformed input.

This function corresponds to the "UTF-8 decode" operation of the Encoding Standard.

$chars = decode_web_utf8_no_bom $bytes

Decode the bytes as UTF-8, without recognizing a BOM, and return the decoded character string. Invalid bytes are replaced by U+FFFD characters; the function does not fail on malformed input.

This function corresponds to the "UTF-8 decode without BOM" operation of the Encoding Standard.
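The three functions can be compared side by side. This sketch assumes the behavior the Encoding Standard defines for the corresponding operations: "UTF-8 decode" consumes a leading BOM, while "UTF-8 decode without BOM" decodes it as U+FEFF. The sample strings are illustrative.

```perl
use strict;
use warnings;
use Web::Encoding;

# Round trip: U+3042 HIRAGANA LETTER A is three bytes in UTF-8.
my $chars = "\x{3042}";
my $bytes = encode_web_utf8($chars);      # "\xE3\x81\x82"
my $back  = decode_web_utf8($bytes);

# BOM handling differs between the two decoders.
my $with_bom = "\xEF\xBB\xBF" . "abc";
my $stripped = decode_web_utf8($with_bom);         # BOM is consumed
my $kept     = decode_web_utf8_no_bom($with_bom);  # BOM becomes U+FEFF

# An invalid byte is replaced rather than causing an error.
my $repaired = decode_web_utf8("a\xFFb");

printf "lengths: %d %d %d\n",
    length($stripped), length($kept), length($repaired);
```

Because the decoders never fail, callers that need to reject malformed input have to check the result for U+FFFD themselves.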

Unfortunately, no encoding other than UTF-8 from the Encoding Standard is supported for now.

SPECIFICATIONS

Encoding Standard <https://encoding.spec.whatwg.org/>.

HTML Standard <https://html.spec.whatwg.org/>.

DOM Standard <https://dom.spec.whatwg.org/>.

DEPENDENCY

The module requires Perl 5.8 or later.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

LICENSE

Copyright 2011-2016 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.