This repository has been archived by the owner on Jun 1, 2023. It is now read-only.

Unicode mixed script confusables #229

Closed
rurban opened this issue Nov 30, 2016 · 1 comment

rurban commented Nov 30, 2016

To guard against the confusable-based spoofing attacks described in TR39, we add the following Unicode rules for identifiers and literals:

  1. The first non-Latin, non-Common Unicode script used for an identifier is the only one allowed. Any other script leads to a parser error.
  2. Additional Unicode scripts can and should be declared via `use utf8 'Greek', 'script-name2', ...` to prevent mixed-script errors. This allows more scripts than rule 1 does. The declaration can be scoped to a block.
  3. The 'Common' and 'Latin' scripts are always enabled and don't need to be declared.

See http://www.unicode.org/reports/tr39/#Mixed_Script_Detection

This holds for all identifiers (all names: package, gv, sub, variables) and for literal numbers.
The script name of a code point is returned by `Unicode::UCD::charscript($codepoint_as_uv)`.
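
As an illustration of the three rules, here is a minimal sketch in plain Perl built on `Unicode::UCD::charscript`. It is not the parser code: the helper name `first_mixed_script` and the `$declared` hash (standing in for the scripts enabled via `use utf8 '...'`) are made up for this example, and it assumes that only 'Latin' and 'Common' are exempt, exactly as rule 3 states.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper, not the cperl implementation.
# Returns the name of the first script that violates the rules,
# or nothing (undef in scalar context) if the identifier is acceptable.
sub first_mixed_script {
    my ($ident, $declared) = @_;
    my $first;    # first script that is neither Latin, Common nor declared (rule 1)
    for my $ch (split //, $ident) {
        my $script = charscript(ord $ch);
        next unless defined $script;
        next if $script eq 'Latin' || $script eq 'Common';    # rule 3
        next if $declared->{$script};                         # rule 2
        $first //= $script;
        return $script if $script ne $first;                  # mixed scripts
    }
    return;
}

# Greek alpha followed by Cyrillic a, written as escapes to keep the source ASCII.
my $ident = "\x{3b1}\x{430}";
print first_mixed_script($ident, {}) // 'ok', "\n";                  # Cyrillic
print first_mixed_script($ident, { Cyrillic => 1 }) // 'ok', "\n";   # ok
```

The sketch works on a decoded string and calls `charscript` for every character, which is far too slow for a tokenizer; it is only meant to make the rules concrete.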

Currently there are 131 scripts. The list was extracted with:
perl -alne'/; (\w+) #/ && print $1' lib/unicore/Scripts.txt | sort -u > scripts.lst

Ahom
Anatolian_Hieroglyphs
Arabic
Armenian
Avestan
Balinese
Bamum
Bassa_Vah
Batak
Bengali
Bopomofo
Brahmi
Braille
Buginese
Buhid
Canadian_Aboriginal
Carian
Caucasian_Albanian
Chakma
Cham
Cherokee
Common
Coptic
Cuneiform
Cypriot
Cyrillic
Deseret
Devanagari
Duployan
Egyptian_Hieroglyphs
Elbasan
Ethiopic
Georgian
Glagolitic
Gothic
Grantha
Greek
Gujarati
Gurmukhi
Han
Hangul
Hanunoo
Hatran
Hebrew
Hiragana
Imperial_Aramaic
Inherited
Inscriptional_Pahlavi
Inscriptional_Parthian
Javanese
Kaithi
Kannada
Katakana
Kayah_Li
Kharoshthi
Khmer
Khojki
Khudawadi
Lao
Latin
Lepcha
Limbu
Linear_A
Linear_B
Lisu
Lycian
Lydian
Mahajani
Malayalam
Mandaic
Manichaean
Meetei_Mayek
Mende_Kikakui
Meroitic_Cursive
Meroitic_Hieroglyphs
Miao
Modi
Mongolian
Mro
Multani
Myanmar
Nabataean
New_Tai_Lue
Nko
Ogham
Ol_Chiki
Old_Hungarian
Old_Italic
Old_North_Arabian
Old_Permic
Old_Persian
Old_South_Arabian
Old_Turkic
Oriya
Osmanya
Pahawh_Hmong
Palmyrene
Pau_Cin_Hau
Phags_Pa
Phoenician
Psalter_Pahlavi
Rejang
Runic
Samaritan
Saurashtra
Sharada
Shavian
Siddham
SignWriting
Sinhala
Sora_Sompeng
Sundanese
Syloti_Nagri
Syriac
Tagalog
Tagbanwa
Tai_Le
Tai_Tham
Tai_Viet
Takri
Tamil
Telugu
Thaana
Thai
Tibetan
Tifinagh
Tirhuta
Ugaritic
Vai
Warang_Citi
Yi
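
The script names can also be listed at runtime, without parsing `Scripts.txt`. A small sketch using `Unicode::UCD::charscripts`, which returns a hash reference keyed by script name; the count it reports depends on the Unicode version bundled with the perl that runs it, so it need not be 131.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscripts);

# Hash reference: script name => list of code point ranges.
my $scripts = charscripts();
my @names   = sort keys %$scripts;

print scalar(@names), " scripts\n";
print "$_\n" for @names;
```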

rurban commented Dec 2, 2016

The remaining question is whether certain languages need aliases for sets of scripts, because they use multiple scripts by default: for example Japanese for Hiragana and Katakana (what about Kanji, i.e. Han?) and Korean for Hangul and Han (Chinese).
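
One possible shape for such aliases is sketched below. The alias names `Japanese` and `Korean`, their membership, and the helper `expand_scripts` are all assumptions made for illustration; nothing here is decided. For comparison, TR39 defines similar augmented script sets, for instance Jpan (Han, Hiragana, Katakana) and Kore (Han, Hangul).

```perl
use strict;
use warnings;

# Hypothetical alias table: names and membership are assumptions.
my %SCRIPT_ALIAS = (
    Japanese => [qw(Han Hiragana Katakana)],
    Korean   => [qw(Han Hangul)],
);

# Expand aliases into plain script names, dropping duplicates
# while keeping the order of first appearance.
sub expand_scripts {
    my @names = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { @{ $SCRIPT_ALIAS{$_} // [$_] } } @names;
}

print join(' ', expand_scripts('Japanese', 'Greek')), "\n";
# Han Hiragana Katakana Greek
```

With a table like this, a declaration naming a language alias would enable every script in the set, while plain script names pass through unchanged.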

rurban pushed a commit that referenced this issue Dec 3, 2016
Document the new unicode mixed script confusable security
restriction. Declare valid unicode scripts via use utf8 arguments.
This bug was introduced with 5.16.

See #229.
rurban pushed a commit that referenced this issue Dec 3, 2016
add isLATIN_or_COMMON_uni with data from lib/unicore/To/Sc.pl
add dummy utf8_check_script(uv) to check if the script was declared.

See #229.
rurban pushed a commit that referenced this issue Dec 4, 2016
let them be generated by regen/regcharclass.pl

change API from UV cp to U8*s. Spares a costly utf8n_to_uvchr
conversion in hot code. Do this only when throwing the error.

Slowdown <1%
See GH #229.
rurban pushed a commit that referenced this issue Dec 4, 2016
and uvuni_get_script, and a utf8_error_script helper.
done via the simple Unicode::UCD::charscript method.

See #229
rurban closed this as completed Dec 4, 2016
ghost removed the in progress label Dec 4, 2016