=head1 NAME

Web::LangTag - Language Tag Parsing, Conformance Checking, and Normalization

=head1 SYNOPSIS

  use Web::LangTag;
  my $lt = Web::LangTag->new;
  $lt->onerror ($code);
  $parsed = $lt->parse_tag ($tag);
  $result = $lt->check_parsed_tag ($parsed);
  $tag = $lt->normalize_tag ($tag);

=head1 DESCRIPTION

The C<Web::LangTag> module contains methods to handle language tags
as defined by BCP 47.  It can be used to parse, validate, or normalize
language tags according to the relevant standards.

=head1 METHODS

For the following methods, if an input or output is a language tag or
a language range, it is interpreted as a character string (possibly a
utf8-flagged string of characters), not a byte string.  Note that
although language tags and ranges are specified as strings of ASCII
characters, illegal tags and ranges can contain arbitrary non-ASCII
characters.

Since the relevant standards have changed incompatibly, a language tag
conformant to an older standard can be non-conforming according to the
latest standard.  For this reason, the module provides parsing,
validating, and normalizing methods for every version of the
standards.  However, in general, you should simply use the
B<non-versioned> methods.

=over 4

=item $lt = Web::LangTag->new

Creates a new language tag processor.

=item $lt->onerror ($code)

=item $code = $lt->onerror

Gets or sets the error handler for the processor.  Any parse error, as
well as any warning or additional processing information, is reported
to the handler.  See
<https://github.com/manakai/data-errors/blob/master/doc/onerror.txt>
for details of error handling.

The value should not be set while the processor is running.  If the
value is changed while the processor is running, the result is
undefined.

=back

=head2 Parsing

=over 4

=item $parsed = $lt->parse_tag ($tag)

Parses a language tag into subtags.  This method interprets the
language tag using the latest version of the language tag
specification.  At the time of writing, the latest version is RFC
5646.

=item $parsed = $lt->parse_rfc5646_tag ($tag)

Parses a language tag into subtags.  This method interprets the
language tag using the definition in RFC 5646.

=item $parsed = $lt->parse_rfc4646_tag ($tag)

Parses a language tag into subtags.  This method interprets the
language tag using the definition in RFC 4646.

=back

These methods return a hash reference, which contains one or more
key-value pairs from the following list:

=over 4

=item language (string)

The language subtag.  There is always a language subtag, even if the
input is illegal, unless there is a C<grandfathered> tag.  E.g.
C<'ja'> for input C<ja-JP>.

=item extlang (arrayref of strings)

The extlang subtags.  E.g. C<['yue']> for input C<zh-yue>.

=item script (string or C<undef>)

The script subtag.  E.g. C<'Latn'> for input C<ja-Latn-JP>.

=item region (string or C<undef>)

The region subtag.  E.g. C<'JP'> for input C<en-JP>.

=item variant (arrayref of strings)

The variant subtags.  E.g. C<['fonipa']> for input C<en-JP-fonipa>.

=item extension (arrayref of arrayrefs of strings)

The extension subtags.  E.g. C<[['u', 'islamCal']]> for input
C<en-US-u-islamCal>.

=item privateuse (arrayref of strings)

The privateuse subtags.  E.g. C<['x', 'pig', 'latin']> for input
C<x-pig-latin>.

=item illegal (arrayref of strings)

Illegal (syntactically non-conforming) string fragments.  E.g.
C<['1234', 'xyz', 'abc']> for input C<1234-xyz-abc>.

=item grandfathered (string or C<undef>)

The "grandfathered" language tag.  E.g. C<'i-default'> for input
C<i-default>.

=item u

If the tag contains a C<u> extension, the parse result of the
extension is stored here.  The value is an array reference of array
references of strings.  The first inner array reference contains the
attributes in the extension.  The remaining inner array references, if
any, represent the keywords (i.e. the key-type pairs) in the
extension, in their original order.  E.g. C<[[], ['ca', 'japanese'],
['va', '0061', '0061']]> for input C<ja-u-ca-japanese-va-0061-0061>.

=item t

If the tag contains a C<t> extension, the parse result of the
extension is stored here.  The value is an array reference containing
a parsed language tag followed by array references of strings.  The
first (zeroth) item in the outer array reference is the embedded
language tag, if any, or the C<undef> value.  The remaining items, if
any, represent the fields in the extension as array references of
subtags, in their original order.  E.g. C<[{language => 'de', region
=> 'JP'}, ['m0', 'und'], ['x0', 'medical']]> for input
C<ja-Latn-t-de-JP-m0-und-x0-medical>.

=back

Note that the original case (lower or upper) of each subtag is
preserved in the output.

=head2 Serialization

=over 4

=item $tag = $lt->serialize_parsed_tag ($parsed_tag)

Converts a parsed language tag into a language tag string.  The
argument must be a parsed tag as defined in the previous section; a
broken value would not be processed properly.  If the given parsed tag
does not represent a well-formed language tag, the resulting string
would not be a well-formed language tag either.

=back

=head2 Conformance checking (validation)

=over 4

=item $result = $lt->check_parsed_tag ($parsed)

Checks for conformance errors in the parsed language tag, against the
latest version of the language tag specification.  At the time of
writing, the latest version is RFC 5646.

=item $result = $lt->check_rfc5646_parsed_tag ($parsed)

Checks for conformance errors in the parsed language tag, against RFC
5646.  This method does not report any parse errors, as it receives a
B<parsed> language tag.

The method returns a hash reference with two keys: C<well-formed> and
C<valid>.  They indicate whether the given language tag is well-formed
and whether it is valid, as per RFC 5646.

=item $result = $lt->check_rfc4646_parsed_tag ($parsed)

Checks for conformance errors in the parsed language tag, against RFC
4646.  This method does not report any parse errors, as it receives a
B<parsed> language tag.

The method returns a hash reference with two keys: C<well-formed> and
C<valid>.
They indicate whether the given language tag is well-formed and
whether it is valid, as per RFC 4646.

=item $result = $lt->check_rfc3066_tag ($tag)

Parses the language tag and checks it for conformance errors against
RFC 3066.

The method returns an empty hash reference.

=item $result = $lt->check_rfc1766_tag ($tag)

Parses the language tag and checks it for conformance errors against
RFC 1766.

The method returns an empty hash reference.

=back

Note that the specs sometimes contain semantic or contextual
conformance rules, such as: "strongly RECOMMENDED that users not
define their own rules for language tag choice" (RFC 4646 4.1.),
"Subtags SHOULD only be used where they add useful distinguishing
information" (RFC 4646 4.1.), and "Use as precise a tag as possible,
but no more specific than is justified" (RFC 4646 4.1. 1.).  These
kinds of requirements cannot be tested without human interpretation,
and therefore the methods in this module do not (and in fact cannot)
try to detect violations of these rules.

=head2 Normalization

=over 4

=item $tag = $lt->normalize_tag ($tag_orig)

Normalizes the language tag by folding cases, following the latest
version of the language tag specification.  At the time of writing,
the latest version is RFC 5646.

=item $tag = $lt->normalize_rfc5646_tag ($tag_orig)

Normalizes the language tag by folding cases, following RFC 5646 2.1.
and 2.2.6.

Note that this method does not replace any subtag with its preferred
alternative, nor does it rearrange the order of subtags.

Although this method does not completely convert language tags into
their canonical form, its result will be good enough for comparison in
most usual situations.

=item $tag = $lt->canonicalize_tag ($tag_orig)

Normalizes the language tag into its canonicalized form, as per the
latest version of the language tag specification.  At the time of
writing, the latest version is RFC 5646.

=item $tag = $lt->canonicalize_rfc5646_tag ($tag_orig)

Normalizes the language tag into its canonicalized form, as per RFC
5646 4.5.  That is, it replaces any subtag with its Preferred-Value
form if possible and sorts any extension subtags.

Note that this method does NOT do any case folding.  In addition, the
"canonicalized form" of a language tag is not necessarily fully
canonical; for example, variant subtags might not be in the
recommended order.  Also, it does not canonicalize extension subtags.

Note that if the input is not a well-formed language tag according to
RFC 5646, the result string might not be a well-formed language tag
either.  Sometimes the canonicalization would turn a valid language
tag into an invalid language tag.

=item $tag = $lt->to_extlang_form_tag ($tag_orig)

Normalizes the language tag into its extlang form, as per the latest
version of the language tag specification.  At the time of writing,
the latest version is RFC 5646.

=item $tag = $lt->to_extlang_form_rfc5646_tag ($tag_orig)

Normalizes the language tag into its extlang form, as per RFC 5646
4.5.  The extlang form is the same as the canonicalized form, except
that use of extlang subtags is preferred to the language-only (or
extlang-free) representation.

Note that if the input is not a well-formed language tag according to
RFC 5646, the result string might not be a well-formed language tag
either.  Sometimes the canonicalization would turn a valid language
tag into an invalid language tag.

=back

=head2 Comparison

=over 4

=item $boolean = $lt->basic_filtering_range ($range, $tag)

Compares a basic language range to a language tag, according to the
latest version of the language range specification.  At the time of
writing, the latest version is RFC 4647.

=item $boolean = $lt->basic_filtering_rfc4647_range ($range, $tag)

Compares a basic language range to a language tag, according to RFC
4647 Section 3.3.1.  This method returns whether the range matches the
tag or not.
A basic language range is either a language tag or C<*>.  (For more
information, see RFC 4647 Section 2.1.)

=item $boolean = $lt->match_rfc3066_range ($range, $tag)

Compares a language-range to a language tag according to RFC 3066
Section 2.5.  This method returns whether the range matches the tag or
not.  Note that RFC 3066 is obsoleted by RFC 4647.

A language range is either a language tag or C<*>.  (For more
information, see RFC 3066 2.5.)

Note that this method is equivalent to
C<basic_filtering_rfc4647_range> by definition.

=item $boolean = $lt->extended_filtering_range ($range, $tag)

Compares an extended language range to a language tag, according to
the latest version of the language range specification.  At the time
of writing, the latest version is RFC 4647.

=item $boolean = $lt->extended_filtering_rfc4647_range ($range, $tag)

Compares an extended language range to a language tag, according to
RFC 4647 Section 3.3.2.  This method returns whether the range matches
the tag or not.

An extended language range is a language tag whose subtags can be
C<*>s.  (For more information, see RFC 4647 Section 2.2.)

=back

=head1 SPECIFICATIONS

=over 4

=item RFC1766

RFC 1766: Tags for the Identification of Languages
<http://tools.ietf.org/html/rfc1766>.  (Obsolete)

=item RFC3066

RFC 3066: Tags for the Identification of Languages
<http://tools.ietf.org/html/rfc3066>.  (Obsolete)

=item RFC4646

RFC 4646: Tags for Identifying Languages
<http://tools.ietf.org/html/rfc4646>.  (Obsolete)

=item RFC4647

RFC 4647: Matching of Language Tags
<http://tools.ietf.org/html/rfc4647>.

=item RFC5646

RFC 5646: Tags for Identifying Languages
<http://tools.ietf.org/html/rfc5646>.

=item RFC6067

RFC 6067: BCP 47 Extension U <http://tools.ietf.org/html/rfc6067>.

=item RFC6497

RFC 6497: BCP 47 Extension T - Transformed Content
<http://tools.ietf.org/html/rfc6497>.

=item LANGSUBTAGREG

IANA Language Subtag Registry
<http://www.iana.org/assignments/language-subtag-registry>.

=item LANGEXTREG

Language Tag Extensions Registry
<http://www.iana.org/assignments/language-tag-extensions-registry>.

=item LDML

UTS #35: Unicode Locale Data Markup Language
<http://unicode.org/reports/tr35/>.

=item UNICODELOCALEREG

Unicode Locale Extensions for BCP 47
<http://cldr.unicode.org/index/bcp47-extension>,
<http://unicode.org/repos/cldr/trunk/common/bcp47/>.

=item WEBLANGTAG

Comments in C<lib/Web/LangTag.pm>.

=back

=head1 DEPENDENCY

The module requires Perl 5.8 or later.

=head1 DEVELOPMENT

The latest version of the module is available on GitHub
<https://github.com/manakai/perl-web-langtag>.

Tests are run on Travis CI:
<https://travis-ci.org/manakai/perl-web-langtag>.

=head1 SEE ALSO

SuikaWiki:Language Tags
<http://suika.suikawiki.org/~wakaba/wiki/sw/n/language%20tags>.

Language tags
<https://github.com/manakai/data-web-defs/blob/master/data/langtags.json>.

=head1 AUTHOR

Wakaba <wakaba@suikawiki.org>.

=head1 LICENSE

Copyright 2007-2014 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut
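=head1 EXAMPLE

The basic filtering comparison described in the Comparison section can
be sketched in plain Perl.  The following is only a minimal
illustration of the rule in RFC 4647 Section 3.3.1, not the
implementation used by this module; the function name
C<basic_filter_sketch> is hypothetical and does not exist in
C<Web::LangTag>.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Web::LangTag): a sketch of the
# basic filtering rule in RFC 4647 Section 3.3.1.
sub basic_filter_sketch {
  my ($range, $tag) = @_;

  # The wildcard range matches every tag.
  return 1 if $range eq '*';

  # Ranges and tags are compared case-insensitively.
  my $r = lc $range;
  my $t = lc $tag;

  # An exact match always succeeds.
  return 1 if $r eq $t;

  # Otherwise the range must be a prefix of the tag, and the
  # character right after the prefix must be "-".
  return index ($t, "$r-") == 0 ? 1 : 0;
}

# "de-DE-1996" begins with "de-DE-", so the range matches.
printf "%s\n", basic_filter_sketch ('de-DE', 'de-DE-1996') ? 'match' : 'no match';

# "de-Deva" shares the characters "de-De" but not a whole subtag.
printf "%s\n", basic_filter_sketch ('de-DE', 'de-Deva') ? 'match' : 'no match';
```

With the module itself, the corresponding call is
C<< $lt->basic_filtering_rfc4647_range ($range, $tag) >>, which should
be preferred over a hand-written comparison.

=cut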