Skip to content

manakai/perl-web-langtag

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

=head1 NAME

Web::LangTag - Language Tag Parsing, Conformance Checking, and Normalization

=head1 SYNOPSIS

  use Web::LangTag;
  
  my $lt = Web::LangTag->new;
  $lt->onerror ($code);
  $parsed = $lt->parse_tag ($tag);
  $result = $lt->check_parsed_tag ($parsed);
  $tag = $lt->normalize_tag ($tag);

=head1 DESCRIPTION

The C<Web::LangTag> module contains methods to handle language tags as
defined by BCP 47.  It can be used to parse, validate, or normalize
language tags according to relevant standard.

=head1 METHODS

For the following methods, if an input or output is a language tag or
a language range, it is interpreted as a character string (or possibly
utf8 flagged string of characters), not a byte string.  Note that
although language tags and ranges are specified as a string of ASCII
characters, illegal tags and ranges can always contain any non-ASCII
characters.

Since relevant standards have been incompatibly changed, a language
tag comformant to old standard can be non-conforming according to the
latest standard.  For this reason, the module provides parsing,
validating, and normalizing methods for every versions of standards.
However, in general, you should simply use B<non-versioned> methods.

=over 4

=item $lt = Web::LangTag->new

Create a new language tag processor.

=item $lt->onerror ($code)

=item $code = $lt->onerror

Get or set the error handler for the processor.  Any parse error, as
well as warning and additional processing information, is reported to
the handler.  See
<https://github.com/manakai/data-errors/blob/master/doc/onerror.txt>
for details of error handling.

The value should not be set while the processor is running.  If the
value is changed, the result is undefined.

=back

=head2 Parsing

=over 4

=item $parsed = $lt->parse_tag ($tag)

Parses a language tag into subtags.  This method interprets the
language tag using the latest version of the language tag
specification.  At the time of writing, the latest version is RFC
5646.

=item $parsed = $lt->parse_rfc5646_tag ($tag)

Parses a language tag into subtags.  This method interprets the
language tag using the definition in RFC 5646.

=item $parsed = $lt->parse_rfc4646_tag ($tag)

Parses a language tag into subtags.  This method interprets the
language tag using the definition in RFC 4646.

=back

These methods return a hash reference, which contains one or more
key-value pairs from the following list:

=over 4

=item language (string)

The language subtag.  There is always a language subtag, even if the
input is illegal, unless there is C<grandfathered> tag.  E.g. C<'ja'>
for input C<ja-JP>.

=item extlang (arrayref of strings)

The extlang subtags.  E.g. C<'yue'> for input C<zh-yue>.

=item script (string or C<undef>)

The script subtag.  E.g. C<'Latn'> for input C<ja-Latn-JP>.

=item region (string or C<undef>)

The region subtag.  E.g. C<'JP'> for input C<en-JP>.

=item variant (arrayref of strings)

The variant subtags.  E.g. C<['fonipa']> for input C<en-JP-fonipa>.

=item extension (arrayref of arrayrefs of strings)

The extension subtags.  E.g. C<[['u', 'islamCal']]> for input
C<en-US-u-islamCal>.

=item privateuse (arrayref of strings)

The privateuse subtags.  E.g. C<['x', 'pig', 'latin']> for input
C<x-pig-latin>.

=item illegal (arrayref of strings)

Illegal (syntactically non-conforming) string fragments.
E.g. C<['1234', 'xyz', 'abc']> for input C<1234-xyz-abc>.

=item grandfathered (string or C<undef>)

"Grandfathered" language tag.  E.g. C<'i-default'> for input
C<i-default>.

=item u

If the tag contains a C<u> extension, parse result of the extension is
contained here.  The value is an array reference of array references
of strings.  The first inner array reference contains the attributes
in the extension.  The remaining inner array references, if any,
represent the keywords (i.e. the key-type pairs) in the extension in
original order.  E.g. C<[[], ['ca', 'japanese'], ['va', '0061',
'0061']]> for input C<ja-u-ca-japanese-va-0061-0061>.

=item t

If the tag contains a C<t> extension, parse result of the extension is
contained here.  The value is an array reference of parsed language
tag and array references of strings.  The first (zeroth) item in the
outer array reference is the embedded language tag, if any, or the
C<undef> value.  The remaining items, if any, represent fields in the
extension as array references of subtags, in original order.
E.g. C<[{language => 'de', region => 'JP'}, ['m0', 'und'], ['x0',
'medical']]> for input C<ja-Latn-t-de-JP-m0-und-x0-medical>.

=back

Note that original cases (lower- or upper-case) is preserved in the
output.

=head2 Serialization

=over 4

=item $tag = $lt->serialize_parsed_tag ($parsed_tag)

Convert a parsed language tag into a language tag string.  The
argument must be a parsed tag as defined in the previous section; a
broken value would not be processed properly.

If the given parsed tag does not represent a well-formed language tag,
the result string would not be a well-formed language tag.

=back

=head2 Conformance checking (validation)

=over 4

=item $result = $lt->check_parsed_tag ($parsed)

Checks for conformance errors in the parsed language tag, against the
latest version of the language tag specification.  At the time of
writing, the latest version is RFC 5646.

=item $result = $lt->check_rfc5646_parsed_tag ($parsed)

Checks for conformance errors in the parsed language tag, against RFC
5646.

This method does not report any parse errors, as this method receives
a B<parsed> language tag.

The method returns a hash reference with two keys: C<well-formed> and
C<valid>.  They represent whether the given language tag is
well-formed or valid or not as per RFC 5646.

=item $result = $lt->check_rfc4646_parsed_tag ($parsed)

Checks for conformance errors in the parsed language tag, against RFC
4646.

This method does not report any parse erros, as this method receives a
B<parsed> language tag.

The method returns a hash reference with two keys: C<well-formed> and
C<valid>.  They represent whether the given language tag is
well-formed or valid or not as per RFC 4646.

=item $result = $lt->check_rfc3066_tag ($tag)

Parses and checks for conformance errors in the parsed language tag,
against RFC 3066.

The method returns an empty hash reference.

=item $result = $lt->check_rfc1766_tag ($tag)

Parses and checks for conformance errors in the parsed language tag,
against RFC 1766.

The method returns an empty hash reference.

=back

Note that specs sometimes contain semantic or contextual conformance
rules, such as: "strongly RECOMMENDED that users not define their own
rules for language tag choice" (RFC 4646 4.1.), "Subtags SHOULD only
be used where they add useful distinguishing information" (RFC 4646
4.1.), and "Use as precise a tag as possible, but no more specific
than is justified" (RFC 4646 4.1. 1.).  These kinds of requirements
cannot be tested without human interpretation, and therefore the
methods in this module do not (or in fact cannot) try to detect
violation to these rules.

=head2 Normalization

=over 4

=item $tag = $lt->normalize_tag ($tag_orig)

Normalize the language tag by folding cases, following the latest
version of the language tag specification.  At the time of writing,
the latest version is RFC 5646.

=item $tag = $lt->normalize_rfc5646_tag ($tag_orig)

Normalize the language tag by folding cases, following RFC 5646
2.1. and 2.2.6.  Note that this method does not replace any subtag
into its preferred alternative; this method does not rearrange
ordering of subtags.

Although this method does not completely convert language tags into
their canonical form, its result will be good enough for comparison in
most usual situations.

=item $tag = $lt->canonicalize_tag ($tag_orig)

Normalize the language tag into its canonicalized form, as per the
latest version of the language tag specification.  At the time of
writing, the latest version is RFC 5646.

=item $tag = $lt->canonicalize_rfc5646_tag ($tag_orig)

Normalize the language tag into its canonicalized form, as per RFC
5646 4.5.  That is, replace any subtag into its Preferred-Value form
if possible and sort any extension subtags.  Note that this method
does NOT do any case folding.  In addition, the "canonicalized form"
of a langauge tag is not necessary a fully canonicalized form at all -
for example, variant subtags might not be in the recommended order.
Also, it does not canonicalize extension subtags.

Note that if the input is not a well-formed language tag according to
RFC 5646, the result string might not be a well-formed language tag as
well.  Sometimes the canonicalization would turn a valid langauge tag
into an invalid language tag.

=item $tag = $lt->to_extlang_form_tag ($tag_orig)

Normalize the language tag into its extlang form, as per the latest
version of the language tag specification.  At the time of writing,
the latest version is RFC 5646.

=item $tag = $lt->to_extlang_form_rfc5646_tag ($tag_orig)

Normalize the language tag into its extlang form, as per RFC 5646 4.5.
The extlang form is same as the canonicalized form, except that use of
extlang subtags is preferred to language-only (or extlang-free)
representation.

Note that if the input is not a well-formed language tag according to
RFC 5646, the result string might not be a well-formed language tag as
well.  Sometimes the canonicalization would turn a valid langauge tag
into an invalid language tag.

=back

=head2 Comparison

=over 4

=item $boolean = $lt->basic_filtering_range ($range, $tag)

Compares a basic language range to a language tag, according to the
latest version of the language range specification.  At the time of
writing, the latest version is RFC 4645.

=item $boolean = $lt->basic_filtering_rfc4647_range ($range, $tag)

Compares a basic language range to a language tag, according to RFC
4647 Section 3.3.1.  This method returns whether the range matches to
the tag or not.

A basic language range is either a language tag or C<*>.  (For more
information, see RFC 4647 Section 2.1.).

=item $boolean = $lt->match_rfc3066_range ($range, $tag)

Compares a language-range to a language tag according to RFC 3066
Section 2.5.  This method returns whether the range matches to the tag
or not.  Note that RFC 3066 is obsoleted by RFC 4647.

A language range is either a language tag or C<*>.  (For more
information, see RFC 3066 2.5).

Note that this method is equivalent to
C<basic_filtering_rfc4647_range> by definition.

=item $boolean = $lt->extended_filtering_range ($range, $tag)

Compares an extended language range to a language tag, according to
the latest version of the language range specification.  At the time
of writing, the latest version is RFC 4647.

=item $boolean = $lt->extended_filtering_rfc4647_range ($range, $tag)

Compares an extended language range to a language tag, according to
RFC 4647 Section 3.3.2.  This method returns whether the range matches
to the tag or not.

An extended language range is a language tag whose subtags can be
C<*>s.  (For more information, see RFC 4647 Section 2.2.).

=back

=head1 SPECIFICATIONS

=over 4

=item RFC1766

RFC 1766: Tags for the Identification of Languages
<http://tools.ietf.org/html/rfc1766>. (Obsolete)

=item RFC3066

RFC 3066: Tags for the Identification of Languages
<http://tools.ietf.org/html/rfc3066>. (Obsolete)

=item RFC4646

RFC 4646: Tags for Identifying Languages
<http://tools.ietf.org/html/rfc4646>. (Obsolete)

=item RFC4647

RFC 4647: Matching of Language Tags
<http://tools.ietf.org/html/rfc4647>.

=item RFC5646

RFC 5646: Tags for Identifying Languages
<http://tools.ietf.org/html/rfc5646>.

=item RFC6067

RFC 6067: BCP 47 Extension U <http://tools.ietf.org/html/rfc6067>.

=item RFC6497

RFC 6497: BCP 47 Extension T - Transformed Content
<http://tools.ietf.org/html/rfc6497>.

=item LANGSUBTAGREG

IANA Language Subtag Registry
<http://www.iana.org/assignments/language-subtag-registry>.

=item LANGEXTREG

Language Tag Extensions Registry
<http://www.iana.org/assignments/language-tag-extensions-registry>.

=item LDML

UTS #35: Unicode Locale Data Markup Language
<http://unicode.org/reports/tr35/>.

=item UNICODELOCALEREG

Unicode Locale Extensions for BCP 47
<http://cldr.unicode.org/index/bcp47-extension>,
<http://unicode.org/repos/cldr/trunk/common/bcp47/>.

=item WEBLANGTAG

Comments in the C<lib/Web/LangTag.pm>.

=back

=head1 DEPENDENCY

The module requires Perl 5.8 or later.

=head1 DEVELOPMENT

Latest version of the module is available at GitHub
<https://github.com/manakai/perl-web-langtag>.

Tests are run at Travis CI:
<https://travis-ci.org/manakai/perl-web-langtag>.

=head1 SEE ALSO

SuikaWiki:Language Tags
<http://suika.suikawiki.org/~wakaba/wiki/sw/n/language%20tags>

Language tags
<https://github.com/manakai/data-web-defs/blob/master/data/langtags.json>.

=head1 AUTHOR

Wakaba <wakaba@suikawiki.org>.

=head1 LICENSE

Copyright 2007-2014 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut

Releases

No releases published

Packages

No packages published

Languages