Unicode mixed script confusables #229

rurban · 2016-11-30T09:53:06Z

To avoid TR39 confusable security exploits, we add the following Unicode rules for identifiers and literals:

1. The first non-Latin, non-Common Unicode script used in an identifier is the only one allowed. Any other script leads to a parser error.
2. Additional Unicode scripts can and should be declared via `use utf8 'Greek', 'script-name2', ...` to prevent mixed-script errors. This allows more scripts than rule 1. The declaration can be scoped to blocks.
3. The 'Common' and 'Latin' scripts are always enabled and don't need to be declared.
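
The rules above can be sketched in plain Perl. This is only an illustration, not the parser code: `mixed_script_error` is a hypothetical helper, `@declared` stands in for the scripts named in a `use utf8 ...` declaration, and the sketch assumes that one undeclared script is still allowed implicitly (rule 1) even when other scripts were declared. Handling of the 'Inherited' script is left out.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper, not cperl's implementation.
# Returns nothing (undef) if the identifier passes the rules, else an error string.
sub mixed_script_error {
    my ($ident, @declared) = @_;

    # Rule 3: Common and Latin are always enabled.
    # Rule 2: declared scripts are enabled as well.
    my %allowed = map { $_ => 1 } ('Common', 'Latin', @declared);

    my $implicit;    # Rule 1: the first undeclared script that was seen
    for my $ch (split //, $ident) {
        my $script = charscript(ord($ch)) // 'Unknown';
        next if $allowed{$script};
        if (!defined $implicit) {
            $implicit = $script;    # first undeclared script is accepted
            next;
        }
        next if $script eq $implicit;
        return sprintf "mixed scripts: U+%04X is %s, but %s is already in use",
            ord($ch), $script, $implicit;
    }
    return;
}

my $alpha = "\x{3b1}";    # GREEK SMALL LETTER ALPHA
my $cyr_a = "\x{430}";    # CYRILLIC SMALL LETTER A

for my $case (["x$alpha"], ["$alpha$cyr_a"], ["$alpha$cyr_a", 'Greek', 'Cyrillic']) {
    my $err = mixed_script_error(@$case);
    print defined($err) ? "error: $err\n" : "ok\n";
}
```

The second case mixes Greek and Cyrillic without a declaration and is the one that should be rejected; the third declares both scripts and passes.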

See http://www.unicode.org/reports/tr39/#Mixed_Script_Detection

This holds for all identifiers (all names: package, gv, sub, variables) and literal numbers.
The scriptname is returned by Unicode::UCD::charscript($codepoint_as_uv)

Currently there exist 131 scripts:
perl -alne'/; (\w+) #/ && print $1' lib/unicore/Scripts.txt | sort -u > scripts.lst

Ahom
Anatolian_Hieroglyphs
Arabic
Armenian
Avestan
Balinese
Bamum
Bassa_Vah
Batak
Bengali
Bopomofo
Brahmi
Braille
Buginese
Buhid
Canadian_Aboriginal
Carian
Caucasian_Albanian
Chakma
Cham
Cherokee
Common
Coptic
Cuneiform
Cypriot
Cyrillic
Deseret
Devanagari
Duployan
Egyptian_Hieroglyphs
Elbasan
Ethiopic
Georgian
Glagolitic
Gothic
Grantha
Greek
Gujarati
Gurmukhi
Han
Hangul
Hanunoo
Hatran
Hebrew
Hiragana
Imperial_Aramaic
Inherited
Inscriptional_Pahlavi
Inscriptional_Parthian
Javanese
Kaithi
Kannada
Katakana
Kayah_Li
Kharoshthi
Khmer
Khojki
Khudawadi
Lao
Latin
Lepcha
Limbu
Linear_A
Linear_B
Lisu
Lycian
Lydian
Mahajani
Malayalam
Mandaic
Manichaean
Meetei_Mayek
Mende_Kikakui
Meroitic_Cursive
Meroitic_Hieroglyphs
Miao
Modi
Mongolian
Mro
Multani
Myanmar
Nabataean
New_Tai_Lue
Nko
Ogham
Ol_Chiki
Old_Hungarian
Old_Italic
Old_North_Arabian
Old_Permic
Old_Persian
Old_South_Arabian
Old_Turkic
Oriya
Osmanya
Pahawh_Hmong
Palmyrene
Pau_Cin_Hau
Phags_Pa
Phoenician
Psalter_Pahlavi
Rejang
Runic
Samaritan
Saurashtra
Sharada
Shavian
Siddham
SignWriting
Sinhala
Sora_Sompeng
Sundanese
Syloti_Nagri
Syriac
Tagalog
Tagbanwa
Tai_Le
Tai_Tham
Tai_Viet
Takri
Tamil
Telugu
Thaana
Thai
Tibetan
Tifinagh
Tirhuta
Ugaritic
Vai
Warang_Citi
Yi

rurban · 2016-12-02T19:41:55Z

The remaining question is whether certain languages need aliases for sets of scripts, because they use multiple scripts by default: for example Japanese for Hiragana and Katakana (what about Kanji, i.e. Han?), and Korean for Hangul and Han (Chinese).
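
One possible shape for such aliases, as a sketch only: the table `%SCRIPT_ALIAS`, the alias names, and `expand_scripts` are all hypothetical, and including Han under Japanese is an assumption that the question above leaves open.

```perl
use strict;
use warnings;

# Hypothetical alias table: language name => scripts it would enable.
my %SCRIPT_ALIAS = (
    Japanese => [qw(Hiragana Katakana Han)],    # Han for Kanji is an assumption
    Korean   => [qw(Hangul Han)],
);

# Expand aliases into script names, keeping order and dropping duplicates.
# Names without an alias entry are passed through unchanged.
sub expand_scripts {
    my @names = @_;
    my (%seen, @out);
    for my $name (@names) {
        for my $script (@{ $SCRIPT_ALIAS{$name} // [$name] }) {
            push @out, $script unless $seen{$script}++;
        }
    }
    return @out;
}

print join(', ', expand_scripts('Japanese', 'Greek')), "\n";
# prints "Hiragana, Katakana, Han, Greek"
```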

Document the new Unicode mixed-script confusable security restriction. Declare valid Unicode scripts via `use utf8` arguments. This bug was introduced with 5.16. See #229.

Add isLATIN_or_COMMON_uni with data from lib/unicore/To/Sc.pl. Add a dummy utf8_check_script(uv) to check whether the script was declared. See #229.

Let them be generated by regen/regcharclass.pl. Change the API from UV cp to U8 *s, which spares a costly utf8n_to_uvchr conversion in hot code; the conversion is now done only when throwing the error. Slowdown <1%. See GH #229.

Add uvuni_get_script and a utf8_error_script helper, done via the simple Unicode::UCD::charscript method. See #229.

rurban added enhancement security labels Nov 30, 2016

rurban self-assigned this Nov 30, 2016

rurban added the in progress label Dec 1, 2016

rurban pushed a commit that referenced this issue Dec 3, 2016

use utf8 Script - add ident check

011d777

Add isLATIN_or_COMMON_uni with data from lib/unicore/To/Sc.pl. Add a dummy utf8_check_script(uv) to check whether the script was declared. See #229.

rurban pushed a commit that referenced this issue Dec 4, 2016

use utf8 Script - implement utf8_check_script

f32cad2

and uvuni_get_script, and a utf8_error_script helper. Done via the simple Unicode::UCD::charscript method. See #229.

rurban closed this as completed Dec 4, 2016

ghost removed the in progress label Dec 4, 2016
