NAME

perlunicook - cookbookish examples of handling Unicode in Perl

DESCRIPTION

This man page contains short recipes showing how to handle common Unicode operations in Perl, plus one complete program at the end. Any undeclared variables in individual recipes are assumed to have been set to an appropriate value beforehand.

EXAMPLES

℞ 0: Standard preamble

Unless otherwise noted, all examples below require this standard preamble to work correctly, with the #! line adjusted to work on your system:

#!/usr/bin/env perl
use utf8;      # so literals and identifiers can be in UTF-8
use v5.12;     # or later to get "unicode_strings" feature
use strict;    # quote strings, declare variables
use warnings;  # on by default
use warnings  qw(FATAL utf8);    # fatalize encoding glitches
use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
use charnames qw(:full :short);  # unneeded in v5.16

This does make even Unix programmers binmode their binary streams, or open them with :raw, but that is the only way to get at them portably anyway.

WARNING: use autodie (before version 2.26) and use open do not get along with each other.

℞ 1: Generic Unicode-savvy filter

Always decompose on the way in, then recompose on the way out.

use Unicode::Normalize;

while (<>) {
    $_ = NFD($_);   # decompose + reorder canonically
    ...
} continue {
    print NFC($_);  # recompose (where possible) + reorder canonically
}

℞ 2: Fine-tuning Unicode warnings

As of v5.14, Perl distinguishes three subclasses of UTF-8 warnings.

use v5.14;                  # subwarnings unavailable any earlier
no warnings "nonchar";      # the 66 forbidden non-characters
no warnings "surrogate";    # UTF-16/CESU-8 nonsense
no warnings "non_unicode";  # for codepoints over 0x10_FFFF

℞ 3: Declare source in utf8 for identifiers and literals

Without the all-critical use utf8 declaration, putting UTF-8 in your literals and identifiers won't work right. If you used the standard preamble given above, this has already happened. In that case, you can do things like this:

use utf8;

my $measure   = "Ångström";
my @μsoft     = qw( cp852 cp1251 cp1252 );
my @ὑπέρμεγας = qw( ὑπέρ  μεγας );
my @鯉        = qw( koi8-f koi8-u koi8-r );
my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

If you forget use utf8, high bytes will be misunderstood as separate characters, and nothing will work right.

℞ 4: Characters and their numbers

The ord and chr functions work transparently on all codepoints, not just on ASCII, and in fact not even just on Unicode.

# ASCII characters
ord("A")
chr(65)

# characters from the Basic Multilingual Plane
ord("Σ")
chr(0x3A3)

# beyond the BMP
ord("𝑛")               # MATHEMATICAL ITALIC SMALL N
chr(0x1D45B)

# beyond Unicode! (up to MAXINT)
ord("\x{20_0000}")
chr(0x20_0000)

℞ 5: Unicode literals by character number

In an interpolated literal, whether a double-quoted string or a regex, you may specify a character by its number using the \x{HHHHHH} escape.

String: "\x{3a3}"
Regex:  /\x{3a3}/

String: "\x{1d45b}"
Regex:  /\x{1d45b}/

# even non-BMP ranges in regex work fine
/[\x{1D434}-\x{1D467}]/

℞ 6: Get character name by number

use charnames ();
my $name = charnames::viacode(0x03A3);

℞ 7: Get character number by name

use charnames ();
my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
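
Recipes 6 and 7 are inverses of each other. The following is a minimal sketch of the round trip using only the core charnames module; the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ();

# Number to name, then the same name back to a number.
my $name   = charnames::viacode(0x03A3);
my $number = charnames::vianame($name);

printf "U+%04X is %s\n", $number, $name;
```

Passing () to use charnames loads the module without enabling any \N{...} name sets, which is all these run-time lookups need.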

℞ 8: Unicode named characters

Use the \N{charname} notation to get the character by that name for use in interpolated literals (double-quoted strings and regexes). In v5.16, this is implicitly in effect:

use charnames qw(:full :short);

But prior to v5.16, you must be explicit about which set of charnames you want. The :full names are the official Unicode character names, aliases, or sequences, which all share a namespace.

use charnames qw(:full :short latin greek);

"\N{MATHEMATICAL ITALIC SMALL N}"      # :full
"\N{GREEK CAPITAL LETTER SIGMA}"       # :full

Anything else is a Perl-specific convenience abbreviation. Specify one or more scripts by name if you want short names that are script-specific.

"\N{Greek:Sigma}"                      # :short
"\N{ae}"                               #  latin
"\N{epsilon}"                          #  greek

The v5.16 release also supports a :loose import for loose matching of character names, which works just like loose matching of property names: that is, it disregards case, whitespace, and underscores:

"\N{euro sign}"                        # :loose (from v5.16)

℞ 9: Unicode named sequences

These look just like character names but return multiple codepoints. Notice the %vx vector-print functionality in printf.

use charnames qw(:full);
my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
printf "U+%v04X\n", $seq;
U+0100.0300

℞ 10: Custom named characters

Use :alias to give your own lexically scoped nicknames to existing characters, or even to give unnamed private-use characters useful names.

use charnames ":full", ":alias" => {
    ecute => "LATIN SMALL LETTER E WITH ACUTE",
    "APPLE LOGO" => 0xF8FF, # private use character
};

"\N{ecute}"
"\N{APPLE LOGO}"

℞ 11: Names of CJK codepoints

Sinograms like “東京” come back with character names of CJK UNIFIED IDEOGRAPH-6771 and CJK UNIFIED IDEOGRAPH-4EAC, because their “names” vary. The CPAN Unicode::Unihan module has a large database for decoding these (and a whole lot more), provided you know how to understand its output.

# cpan -i Unicode::Unihan
use Unicode::Unihan;
my $str = "東京";
my $unhan = Unicode::Unihan->new;
for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
    printf "CJK $str in %-12s is ", $lang;
    say $unhan->$lang($str);
}

This prints:

CJK 東京 in Mandarin     is DONG1JING1
CJK 東京 in Cantonese    is dung1ging1
CJK 東京 in Korean       is TONGKYENG
CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO

If you have a specific romanization scheme in mind, use the specific module:

# cpan -i Lingua::JA::Romanize::Japanese
use Lingua::JA::Romanize::Japanese;
my $k2r = Lingua::JA::Romanize::Japanese->new;
my $str = "東京";
say "Japanese for $str is ", $k2r->chars($str);

This prints:

Japanese for 東京 is toukyou

℞ 12: Explicit encode/decode

On rare occasions, such as a database read, you may be given encoded text that you need to decode.

use Encode qw(encode decode);

my $chars = decode("shiftjis", $bytes, 1);
 # OR
my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);
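
A round trip makes the difference between characters and bytes concrete. This is a sketch under stated assumptions: the sample string and variable names are illustrative, and LEAVE_SRC is added to the check value so that Encode does not consume the input variables.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Die on malformed data, but leave the source strings untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $chars = "\x{3A3}\x{3AF}";                  # two characters: Σ ί
my $bytes = encode("UTF-8", $chars, $check);   # four bytes
my $back  = decode("UTF-8", $bytes, $check);   # two characters again

# Malformed input raises an exception instead of being silently replaced.
my $bad     = "\xFF\xFE";
my $decoded = eval { decode("UTF-8", $bad, $check) };
my $failed  = !defined $decoded;

printf "%d chars, %d bytes, bad input rejected: %s\n",
    length($chars), length($bytes), $failed ? "yes" : "no";
```

A plain 1 as the third argument, as in the recipe, also dies on bad data, but it lets Encode modify the source string in place.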

For streams all in the same encoding, don't use encode/decode; instead set the file encoding when you open the file, or immediately afterwards with binmode, as described later below.

℞ 13: Decode program arguments as utf8

$ perl -CA ...
 or
$ export PERL_UNICODE=A
 or
    use Encode qw(decode);
    @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

℞ 14: Decode program arguments as locale encoding

# cpan -i Encode::Locale
use Encode qw(decode);   # decode() is called below; Encode exports no "locale" symbol
use Encode::Locale;

# use "locale" as an arg to encode/decode
@ARGV = map { decode(locale => $_, 1) } @ARGV;

℞ 15: Declare STD{IN,OUT,ERR} to be utf8

Use a command-line option, an environment variable, or else call binmode explicitly:

$ perl -CS ...
 or
$ export PERL_UNICODE=S
 or
use open qw(:std :encoding(UTF-8));
 or
binmode(STDIN,  ":encoding(UTF-8)");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding

# cpan -i Encode::Locale
use Encode;
use Encode::Locale;

# or as a stream for binmode or open
binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
binmode STDERR, ":encoding(console_out)" if -t STDERR;

℞ 17: Make file I/O default to utf8

Files opened without an encoding argument will be in UTF-8:

$ perl -CD ...
 or
$ export PERL_UNICODE=D
 or
use open qw(:encoding(UTF-8));

℞ 18: Make all I/O and args default to utf8

$ perl -CSDA ...
 or
$ export PERL_UNICODE=SDA
 or
use open qw(:std :encoding(UTF-8));
use Encode qw(decode);
@ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

℞ 19: Open file with specific encoding

Specify the stream encoding. This is the normal way to deal with encoded text, not by calling low-level functions.

# input file
    open(my $in_file, "< :encoding(UTF-16)", "wintext");
OR
    open(my $in_file, "<", "wintext");
    binmode($in_file, ":encoding(UTF-16)");
THEN
    my $line = <$in_file>;

# output file
    open(my $out_file, "> :encoding(cp1252)", "wintext");
OR
    open(my $out_file, ">", "wintext");
    binmode($out_file, ":encoding(cp1252)");
THEN
    print $out_file "some text\n";

More layers than just the encoding can be specified here. For example, the incantation ":raw :encoding(UTF-16LE) :crlf" includes implicit CRLF handling.
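
The open calls in this recipe can be checked with a write-then-read round trip. The sketch below assumes a scratch file created with the core File::Temp module; the file and variable names are illustrative, not part of the recipe.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file; its generated name carries no meaning.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write characters; the layer turns them into UTF-16 bytes.
open(my $out_file, ">:encoding(UTF-16)", $path) or die "open for write: $!";
print {$out_file} "\x{3A3}\x{3AF}\n";          # Σ ί and a newline
close $out_file or die "close: $!";

# Read them back through the same layer to get characters again.
open(my $in_file, "<:encoding(UTF-16)", $path) or die "open for read: $!";
my $line = <$in_file>;
close $in_file;

my $size_on_disk = -s $path;   # bytes on disk, byte-order mark included
```

The program only ever sees characters; the byte count on disk is larger than length($line) because every character is encoded and a byte-order mark is written first.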

℞ 20: Unicode casing

Unicode casing is very different from ASCII casing.

uc("henry ⅷ")  # "HENRY Ⅷ"
uc("tschüß")   # "TSCHÜSS"  notice ß => SS

# both are true:
"tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
"Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness

℞ 21: Unicode case-insensitive comparisons

Also available in the CPAN Unicode::CaseFold module, the fc “foldcase” function, new in v5.16, grants access to the same Unicode casefolding that the /i pattern modifier has always used:

use feature "fc"; # fc() function is from v5.16

# sort case-insensitively
my @sorted = sort { fc($a) cmp fc($b) } @list;

# both are true:
fc("tschüß")  eq fc("TSCHÜSS")
fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
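
Beyond sorting and comparing, the folded form also works as a hash key. Here is a small sketch that removes case-insensitive duplicates while keeping the first spelling seen; the word list is made up for illustration.

```perl
use v5.16;            # fc() needs v5.16 or later
use strict;
use warnings;
use utf8;
use feature "fc";

my @names = ("Straße", "STRASSE", "strasse", "Zebra", "ZEBRA");

# Keep the first spelling seen for each case-folded form.
my %seen;
my @unique = grep { !$seen{ fc($_) }++ } @names;

binmode(STDOUT, ":encoding(UTF-8)");
say join(", ", @unique);
```

Using lc here would treat "Straße" and "STRASSE" as different words, because lc leaves ß alone while fc folds it to "ss".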

℞ 22: Match Unicode linebreak sequence in regex

A Unicode linebreak matches the two-character CRLF grapheme or any one of the seven vertical whitespace characters. Good for dealing with text files coming from different operating systems.

\R

s/\R/\n/g;  # normalize all linebreaks to \n
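
As a small illustration of the substitution above, this sketch feeds it a string with four different line terminators; the sample text is made up.

```perl
use strict;
use warnings;

# Mixed line endings: CRLF, bare CR, LINE SEPARATOR (U+2028), and LF.
my $text = "one\r\ntwo\rthree\x{2028}four\n";

# \R consumes CRLF as a single unit, so no empty lines are introduced.
(my $unix = $text) =~ s/\R/\n/g;
my @lines = split /\n/, $unix;

print scalar(@lines), " lines\n";
```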

℞ 23: Get character category

Find the general category of a numeric codepoint.

use Unicode::UCD qw(charinfo);
my $cat = charinfo(0x3A3)->{category};  # "Lu"

℞ 24: Disabling Unicode-awareness in builtin charclasses

Restrict \w, \b, \s, \d, and the POSIX classes to ASCII, so that they no longer match the full Unicode range, either for this whole scope or for just one regex.

use v5.14;
use re "/a";

# OR

my($num) = $str =~ /(\d+)/a;

Or use specific un-Unicode properties, like \p{ahex} and \p{POSIX_Digit}. Properties still work normally no matter which charset modifiers (/d /u /l /a /aa) are in effect.
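
The effect of /a is easiest to see on digits from another script. A minimal sketch, with illustrative variable names and a Devanagari sample string:

```perl
use v5.14;
use strict;
use warnings;

my $ascii      = "4567";
my $devanagari = "\x{96A}\x{96B}\x{96C}\x{96D}";   # ४५६७

# Default rules: \d matches any Unicode decimal digit.
my $unicode_match = $devanagari =~ /\A\d+\z/ ? 1 : 0;

# With /a, \d means [0-9] only.
my $ascii_only    = $devanagari =~ /\A\d+\z/a ? 1 : 0;
my $plain         = $ascii      =~ /\A\d+\z/a ? 1 : 0;

# The property form is ASCII-only regardless of modifiers.
my $posix_prop    = $devanagari =~ /\A\p{POSIX_Digit}+\z/ ? 1 : 0;

say "default=$unicode_match /a=$ascii_only ascii=$plain posix=$posix_prop";
```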

℞ 25: Match Unicode properties in regex with \p, \P

These all match a single codepoint with the given property. Use \P in place of \p to match one codepoint lacking that property.

\pL, \pN, \pS, \pP, \pM, \pZ, \pC
\p{Sk}, \p{Ps}, \p{Lt}
\p{alpha}, \p{upper}, \p{lower}
\p{Latin}, \p{Greek}
\p{script_extensions=Latin}, \p{scx=Greek}
\p{East_Asian_Width=Wide}, \p{EA=W}
\p{Line_Break=Hyphen}, \p{LB=HY}
\p{Numeric_Value=4}, \p{NV=4}
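
A few of these properties in use, as a sketch: counting the characters of a made-up sample string by script and by general category.

```perl
use strict;
use warnings;

# Three Latin letters, three Greek letters (α β γ), three digits.
my $text = "abc \x{3B1}\x{3B2}\x{3B3} 123";

# The "() =" idiom counts matches by forcing list context.
my $latin       = () = $text =~ /\p{Latin}/g;
my $greek       = () = $text =~ /\p{Greek}/g;
my $numbers     = () = $text =~ /\pN/g;
my $non_letters = () = $text =~ /\PL/g;    # the spaces and the digits

print "Latin=$latin Greek=$greek N=$numbers non-letters=$non_letters\n";
```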

℞ 26: Custom character properties

Define at compile-time your own custom character properties for use in regexes.

# using private-use characters
sub In_Tengwar { "E000\tE07F\n" }

if (/\p{In_Tengwar}/) { ... }

# blending existing properties
sub Is_GraecoRoman_Title {<<'END_OF_SET'}
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET

if (/\p{Is_GraecoRoman_Title}/) { ... }
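
Both styles of definition can be tried out with a short self-contained script. The property names IsAsciiVowel and IsGreekCapital below are hypothetical, invented for this sketch; user-defined property names must begin with "In" or "Is".

```perl
use strict;
use warnings;

# Hypothetical property: five lowercase ASCII vowels,
# given as one hexadecimal codepoint per line.
sub IsAsciiVowel { return "61\n65\n69\n6F\n75\n" }

# Hypothetical property: Greek script (+) intersected (&) with
# the general category Lu, uppercase letter.
sub IsGreekCapital { return <<'END_OF_SET' }
+utf8::IsGreek
&utf8::IsLu
END_OF_SET

my $vowels      = () = "education" =~ /\p{IsAsciiVowel}/g;
my $sigma_upper = "\x{3A3}" =~ /\p{IsGreekCapital}/ ? 1 : 0;   # Σ
my $sigma_lower = "\x{3C3}" =~ /\p{IsGreekCapital}/ ? 1 : 0;   # σ
my $latin_upper = "A"       =~ /\p{IsGreekCapital}/ ? 1 : 0;

print "vowels=$vowels upper=$sigma_upper lower=$sigma_lower latin=$latin_upper\n";
```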

℞ 27: Unicode normalization

Typically, render into NFD on input and NFC on output. Using the NFKC or NFKD functions improves recall on searches, assuming you have already done the same to the text being searched. Note that this is about much more than just pre-combined compatibility glyphs; it also reorders marks according to their canonical combining classes and weeds out singletons.

use Unicode::Normalize;
my $nfd  = NFD($orig);
my $nfc  = NFC($orig);
my $nfkd = NFKD($orig);
my $nfkc = NFKC($orig);
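
To make the difference between the four forms concrete, here is a sketch with three sample characters chosen for illustration: a precomposed letter, a singleton, and a compatibility ligature.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC);

my $e_acute  = "\x{E9}";     # é, precomposed
my $angstrom = "\x{212B}";   # ANGSTROM SIGN, a singleton
my $ligature = "\x{FB01}";   # LATIN SMALL LIGATURE FI, a compatibility character

my $nfd_length     = length(NFD($e_acute));   # base letter plus combining accent
my $recomposed     = NFC(NFD($e_acute));      # back to the precomposed form
my $singleton_nfc  = NFC($angstrom);          # replaced by U+00C5, Å
my $ligature_nfc   = NFC($ligature);          # canonical forms keep the ligature
my $ligature_nfkc  = NFKC($ligature);         # compatibility forms split it

printf "NFD length %d, NFKC of ligature is '%s'\n", $nfd_length, $ligature_nfkc;
```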

℞ 28: Convert non-ASCII Unicode numerics

Unless you have used /a or /aa, \d matches more than ASCII digits only, but Perl's implicit string-to-number conversion does not currently recognize these. Here is how to convert such strings manually.

use v5.14;  # needed for num() function
use Unicode::UCD qw(num);
my $str = "got Ⅻ and ४५६७ and ⅞ and here";
my @nums = ();
while ($str =~ /(\d+|\N)/g) {  # not just ASCII!
   push @nums, num($1);
}
say "@nums";   #     12      4567      0.875

use charnames qw(:full);
my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
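
Called on one string at a time, num() either returns a number or undef, which is handy for validating input. A minimal sketch with illustrative sample values:

```perl
use v5.14;
use strict;
use warnings;
use Unicode::UCD qw(num);

my $twelve   = num("\x{216B}");                       # Ⅻ, ROMAN NUMERAL TWELVE
my $digits   = num("\x{96A}\x{96B}\x{96C}\x{96D}");   # ४५६७, Devanagari digits
my $fraction = num("\x{215E}");                       # ⅞, VULGAR FRACTION SEVEN EIGHTHS
my $nothing  = num("abc");                            # not numeric at all

say "sum: ", $twelve + $digits + $fraction;
say "non-numeric input gives ", defined $nothing ? $nothing : "undef";
```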

℞ 29: Match Unicode grapheme cluster in regex

Programmer-visible “characters” are codepoints matched by /./s, but user-visible “characters” are graphemes matched by /\X/.

# Find vowel *plus* any combining diacritics,underlining,etc.
my $nfd = NFD($orig);
$nfd =~ / (?=[aeiou]) \X /xi

℞ 30: Extract by grapheme instead of by codepoint (regex)

# match and grab five first graphemes
my($first_five) = $str =~ /^ ( \X{5} ) /x;

℞ 31: Extract by grapheme instead of by codepoint (substr)

# cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $first_five = $gcs->substr(0, 5);

℞ 32: Reverse string by grapheme

Reversing by codepoint messes up diacritics, mistakenly converting crème brûlée into éel̂urb em̀erc instead of into eélûrb emèrc; so reverse by grapheme instead. Both these approaches work right no matter what normalization the string is in:

$str = join("", reverse $str =~ /\X/g);

# OR: cpan -i Unicode::GCString
use Unicode::GCString;
$str = reverse Unicode::GCString->new($str);
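
The failure mode is easy to reproduce with core modules only. This sketch decomposes a sample word first, so that the accent is a separate codepoint, and then reverses it both ways.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $str = NFD("cr\x{E8}me");   # crème as c r e U+0300 m e

# Codepoint reversal: the combining grave ends up after the "m".
my $by_codepoint = reverse $str;

# Grapheme reversal: each \X keeps a base letter with its marks.
my $by_grapheme = join("", reverse $str =~ /\X/g);

printf "codepoints: %vX\n", $by_codepoint;
printf "graphemes:  %vX\n", NFC($by_grapheme);
```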

℞ 33: String length in graphemes

The string brûlée has six graphemes but up to eight codepoints. This counts by grapheme, not by codepoint:

my $str = "brûlée";
my $count = 0;
while ($str =~ /\X/g) { $count++ }

 # OR: cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $count = $gcs->length;

℞ 34: Unicode column-width for printing

Perl's printf, sprintf, and format think all codepoints take up 1 print column, but many take 0 or 2. Here, to show that normalization makes no difference, we print out both forms:

use Unicode::GCString;
use Unicode::Normalize;

my @words = qw/crème brûlée/;
@words = map { NFC($_), NFD($_) } @words;

for my $str (@words) {
    my $gcs = Unicode::GCString->new($str);
    my $cols = $gcs->columns;
    my $pad = " " x (10 - $cols);
    say $str, $pad, " |";
}

This generates the following, showing that it pads correctly no matter the normalization:

crème      |
crème      |
brûlée     |
brûlée     |

℞ 35: Unicode collation

Text sorted by numeric codepoint follows no reasonable alphabetic order; use the UCA for sorting text.

use Unicode::Collate;
my $col = Unicode::Collate->new();
my @list = $col->sort(@old_list);
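
To see what the collator buys you, compare it with the built-in sort on a small word list. The list is made up for this sketch.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @old_list = ("zebra", "Zulu", "apple", "\x{C9}mile", "eagle");   # Émile

# Built-in sort compares codepoints: capitals first, accented letters last.
my @by_codepoint = sort @old_list;

# The UCA compares base letters first, then accents, then case.
my @by_uca = Unicode::Collate->new->sort(@old_list);

binmode(STDOUT, ":encoding(UTF-8)");
print "codepoint: @by_codepoint\n";
print "UCA:       @by_uca\n";
```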

See the ucsort program from the Unicode::Tussle CPAN module for a convenient command-line interface to this module.

℞ 36: Case- and accent-insensitive Unicode sort

Specify a collation strength of level 1 to ignore case and diacritics, only looking at the basic character.

use Unicode::Collate;
my $col = Unicode::Collate->new(level => 1);
my @list = $col->sort(@old_list);

℞ 37: Unicode locale collation

Some locales have special sorting rules.

# either use v5.12, OR: cpan -i Unicode::Collate::Locale
use Unicode::Collate::Locale;
my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
my @list = $col->sort(@old_list);

The ucsort program mentioned above accepts a --locale parameter.

℞ 38: Making cmp work on text instead of codepoints

Instead of this:

@srecs = sort {
    $b->{AGE}   <=>  $a->{AGE}
                ||
    $a->{NAME}  cmp  $b->{NAME}
} @recs;

Use this:

my $coll = Unicode::Collate->new();
for my $rec (@recs) {
    $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
}
@srecs = sort {
    $b->{AGE}       <=>  $a->{AGE}
                    ||
    $a->{NAME_key}  cmp  $b->{NAME_key}
} @recs;
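
Here is the same technique as a self-contained sketch, with made-up records, so that the two orderings can be compared side by side.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Illustrative data: two people share an age, so NAME breaks the tie.
my @recs = (
    { NAME => "Zo\x{EB}",   AGE => 30 },   # Zoë
    { NAME => "\x{E9}mile", AGE => 30 },   # émile
    { NAME => "Adam",       AGE => 41 },
);

# Precompute one binary sort key per record.
my $coll = Unicode::Collate->new();
$_->{NAME_key} = $coll->getSortKey($_->{NAME}) for @recs;

my @naive = map { $_->{NAME} }
            sort { $b->{AGE} <=> $a->{AGE} || $a->{NAME} cmp $b->{NAME} } @recs;

my @collated = map { $_->{NAME} }
               sort { $b->{AGE} <=> $a->{AGE} || $a->{NAME_key} cmp $b->{NAME_key} } @recs;

binmode(STDOUT, ":encoding(UTF-8)");
print "cmp on names: @naive\n";
print "cmp on keys:  @collated\n";
```

Computing the keys once up front means the collator is not called again inside the comparison block, which matters for long lists.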

℞ 39: Case- and accent-insensitive comparisons

Use a collator object to compare Unicode text by character instead of by codepoint.

use Unicode::Collate;
my $es = Unicode::Collate->new(
    level => 1,
    normalization => undef
);

 # now both are true:
$es->eq("García",  "GARCIA" );
$es->eq("Márquez", "MARQUEZ");

℞ 40: Case- and accent-insensitive locale comparisons

Same, but in a specific locale.

my $de = Unicode::Collate::Locale->new(
           locale => "de__phonebook",
         );

# now this is true:
$de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS

℞ 41: Unicode linebreaking

Break up text into lines according to Unicode rules.

# cpan -i Unicode::LineBreak
use Unicode::LineBreak;
use charnames qw(:full);

my $para = "This is a super\N{HYPHEN}long string. " x 20;
my $fmt = Unicode::LineBreak->new;
print $fmt->break($para), "\n";

℞ 42: Unicode text in DBM hashes, the tedious way

Using a regular Perl string as a key or value for a DBM hash will trigger a wide character exception if any codepoints won't fit into a byte. Here is how to manage the translation manually:

use DB_File;
use Encode qw(encode decode);
tie %dbhash, "DB_File", "pathname";

 # STORE

# assume $uni_key and $uni_value are abstract Unicode strings
my $enc_key   = encode("UTF-8", $uni_key, 1);
my $enc_value = encode("UTF-8", $uni_value, 1);
$dbhash{$enc_key} = $enc_value;

 # FETCH

# assume $uni_key holds a normal Perl string (abstract Unicode)
my $enc_key   = encode("UTF-8", $uni_key, 1);
my $enc_value = $dbhash{$enc_key};
my $uni_value = decode("UTF-8", $enc_value, 1);

℞ 43: Unicode text in DBM hashes, the easy way

Here is how to manage the translation implicitly; all encoding and decoding is done automatically, just as with streams that have a particular encoding attached to them:

use DB_File;
use DBM_Filter;

my $dbobj = tie %dbhash, "DB_File", "pathname";
$dbobj->Filter_Push("utf8");  # this is the magic bit: filters both keys and values

 # STORE

# assume $uni_key and $uni_value are abstract Unicode strings
$dbhash{$uni_key} = $uni_value;

  # FETCH

# $uni_key holds a normal Perl string (abstract Unicode)
my $uni_value = $dbhash{$uni_key};

℞ 44: PROGRAM: Demo of Unicode collation and printing

Here is a full program showing how to make use of locale-sensitive sorting, Unicode casing, and managing print widths when some of the characters take up zero or two columns, not just one column each time. When run, the following program produces this nicely aligned output:

Crème Brûlée....... €2.00
Éclair............. €1.60
Fideuà............. €4.20
Hamburger.......... €6.00
Jamón Serrano...... €4.45
Linguiça........... €7.00
Pâté............... €4.15
Pears.............. €2.00
Pêches............. €2.25
Smørbrød........... €5.75
Spätzle............ €5.50
Xoriço............. €3.00
Γύρος.............. €6.50
막걸리............. €4.00
おもち............. €2.65
お好み焼き......... €8.00
シュークリーム..... €1.85
寿司............... €9.99
包子............... €7.50

Here is that program; tested on v5.14.

#!/usr/bin/env perl
# umenu - demo sorting and printing of Unicode food
#
# (obligatory and increasingly long preamble)
#
use utf8;
use v5.14;                       # for locale sorting
use strict;
use warnings;
use warnings  qw(FATAL utf8);    # fatalize encoding faults
use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
use charnames qw(:full :short);  # unneeded in v5.16

# std modules
use Unicode::Normalize;          # std perl distro as of v5.8
use List::Util qw(max);          # std perl distro as of v5.10
use Unicode::Collate::Locale;    # std perl distro as of v5.14

# cpan modules
use Unicode::GCString;           # from CPAN

# forward defs
sub pad($$$);
sub colwidth(_);
sub entitle(_);

my %price = (
    "γύρος"             => 6.50, # gyros
    "pears"             => 2.00, # like um, pears
    "linguiça"          => 7.00, # spicy sausage, Portuguese
    "xoriço"            => 3.00, # chorizo sausage, Catalan
    "hamburger"         => 6.00, # burgermeister meisterburger
    "éclair"            => 1.60, # dessert, French
    "smørbrød"          => 5.75, # sandwiches, Norwegian
    "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
    "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
    "jamón serrano"     => 4.45, # country ham, Spanish
    "pêches"            => 2.25, # peaches, French
    "シュークリーム"    => 1.85, # cream-filled pastry like eclair
    "막걸리"            => 4.00, # makgeolli, Korean rice wine
    "寿司"              => 9.99, # sushi, Japanese
    "おもち"            => 2.65, # omochi, rice cakes, Japanese
    "crème brûlée"      => 2.00, # crema catalana
    "fideuà"            => 4.20, # more noodles, Valencian
                                 # (Catalan=fideuada)
    "pâté"              => 4.15, # gooseliver paste, French
    "お好み焼き"        => 8.00, # okonomiyaki, Japanese
);

my $width = 5 + max map { colwidth } keys %price;

# So the Asian stuff comes out in an order that someone
# who reads those scripts won't freak out over; the
# CJK stuff will be in JIS X 0208 order that way.
my $coll  = Unicode::Collate::Locale->new(locale => "ja");

for my $item ($coll->sort(keys %price)) {
    print pad(entitle($item), $width, ".");
    printf " €%.2f\n", $price{$item};
}

sub pad($$$) {
    my($str, $width, $padchar) = @_;
    return $str . ($padchar x ($width - colwidth($str)));
}

sub colwidth(_) {
    my($str) = @_;
    return Unicode::GCString->new($str)->columns;
}

sub entitle(_) {
    my($str) = @_;
    $str =~ s{ (?=\pL)(\S)     (\S*) }
             { ucfirst($1) . lc($2)  }xge;
    return $str;
}

SEE ALSO

See these man pages, some of which are CPAN modules: perlunicode, perluniprops, perlre, perlrecharclass, perluniintro, perlunitut, perlunifaq, PerlIO, DB_File, DBM_Filter, DBM_Filter::utf8, Encode, Encode::Locale, Unicode::UCD, Unicode::Normalize, Unicode::GCString, Unicode::LineBreak, Unicode::Collate, Unicode::Collate::Locale, Unicode::Unihan, Unicode::CaseFold, Unicode::Tussle, Lingua::JA::Romanize::Japanese, Lingua::ZH::Romanize::Pinyin, Lingua::KO::Romanize::Hangul.

The Unicode::Tussle CPAN module includes many programs to help with working with Unicode, including these programs to fully or partly replace standard utilities: tcgrep instead of egrep, uniquote instead of cat -v or hexdump, uniwc instead of wc, unilook instead of look, unifmt instead of fmt, and ucsort instead of sort. For exploring Unicode character names and character properties, see its uniprops, unichars, and uninames programs. It also supplies these programs, all of which are general Unicode-aware filters: unititle and unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc, nfkd, and nfkc; and uc, lc, and tc.

Finally, see the published Unicode Standard (page numbers are from version 6.0.0), including these specific annexes and technical reports:

§3.13 Default Case Algorithms, page 113; §4.2 Case, pages 120–122; Case Mappings, pages 166–172, especially Caseless Matching starting on page 170.
UAX #44: Unicode Character Database
UTS #18: Unicode Regular Expressions
UAX #15: Unicode Normalization Forms
UTS #10: Unicode Collation Algorithm
UAX #29: Unicode Text Segmentation
UAX #14: Unicode Line Breaking Algorithm
UAX #11: East Asian Width

AUTHOR

Tom Christiansen <tchrist@perl.com> wrote this, with occasional kibitzing from Larry Wall and Jeffrey Friedl in the background.

COPYRIGHT AND LICENCE

Copyright © 2012 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.

Most of these examples are taken from the current edition of the “Camel Book”; that is, from the 4th Edition of Programming Perl, Copyright © 2012 Tom Christiansen <et al.>, 2012-02-13 by O'Reilly Media. The code itself is freely redistributable, and you are encouraged to transplant, fold, spindle, and mutilate any of the examples in this man page however you please for inclusion into your own programs without any encumbrance whatsoever. Acknowledgement via code comment is polite but not required.

REVISION HISTORY

v1.0.0 - first public release, 2012-02-27