Commit d573054

Enable encoding detection for Polish text
Previously, some accented letters commonly used to write Polish text were counted as 'rare' codepoints. Treat them as 'common' instead. Thanks to Alec for pointing this out.
1 parent 80d63e9 commit d573054

2 files changed, with 6 additions and 1 deletion

ext/mbstring/common_codepoints.txt

Lines changed: 5 additions & 0 deletions
@@ -3,6 +3,11 @@
 0x0020 0x007E # ASCII
 0x00A1 0x00AC # Pound sign, Yen sign, copyright sign...
 0x00AE 0x00FF # Accented Latin characters
+0x0104 0x0107 # Polish
+0x0118 0x0119 # Polish
+0x0141 0x0144 # Polish
+0x015A 0x015B # Polish
+0x0179 0x017C # Polish
 0x0300 0x030A # Diacritical marks
 0x0370 0x0377 # Greek
 0x037A 0x037F # Greek
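
The five new ranges account for the one changed row in rare_cp_bitvec.h. The following is a minimal sketch of that relationship, not the project's actual generator. It assumes that bit (cp & 31) of word (cp >> 5) is set when codepoint cp is rare, a layout inferred from the table values in this commit. The names SKETCH_WORDS, mark_common and build_polish_words are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_WORDS 16 /* hypothetical: covers U+0000..U+01FF only */

/* Hypothetical helper: clear the "rare" bit for every codepoint in [lo, hi].
 * Assumes bit (cp & 31) of word (cp >> 5) marks cp as rare. */
static void mark_common(uint32_t *bitvec, uint32_t lo, uint32_t hi)
{
    assert(lo <= hi && hi < SKETCH_WORDS * 32);
    for (uint32_t cp = lo; cp <= hi; cp++) {
        bitvec[cp >> 5] &= ~(UINT32_C(1) << (cp & 31));
    }
}

/* Start with every codepoint rare, then apply only the five Polish
 * ranges added to common_codepoints.txt in this commit. */
static void build_polish_words(uint32_t bitvec[SKETCH_WORDS])
{
    static const uint32_t ranges[][2] = {
        {0x0104, 0x0107},
        {0x0118, 0x0119},
        {0x0141, 0x0144},
        {0x015A, 0x015B},
        {0x0179, 0x017C},
    };

    memset(bitvec, 0xff, SKETCH_WORDS * sizeof(uint32_t));
    for (size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
        mark_common(bitvec, ranges[i][0], ranges[i][1]);
    }
}
```

Under that assumption, words 8 through 11 of the result (U+0100 to U+017F) come out as 0xfcffff0f, 0xffffffff, 0xf3ffffe1 and 0xe1ffffff, which are the first four values of the changed row. The other words stay all-ones here only because the sketch ignores the non-Polish ranges.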

ext/mbstring/rare_cp_bitvec.h

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 
 static uint32_t rare_codepoint_bitvec[] = {
 0xffffd9ff, 0x00000000, 0x00000000, 0x80000000, 0xffffffff, 0x00002001, 0x00000000, 0x00000000,
-0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
+0xfcffff0f, 0xffffffff, 0xf3ffffe1, 0xe1ffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
 0xfffff800, 0xffffffff, 0xffffffff, 0x0300ffff, 0x0000280f, 0x00000004, 0x00000000, 0x00000000,
 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
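
To show the effect of the changed row on individual letters, here is a hedged lookup sketch over the first 16 words of the updated table, which cover U+0000 to U+01FF. The real lookup code in mbstring is not part of this diff. The bit layout (bit cp & 31 of word cp >> 5, set meaning rare) is an assumption inferred from the table values, and the names bitvec_prefix and is_rare_codepoint are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* First 16 words of the updated table, copied from the diff above.
 * They cover U+0000..U+01FF; the rest of the table is omitted. */
static const uint32_t bitvec_prefix[16] = {
    0xffffd9ff, 0x00000000, 0x00000000, 0x80000000,
    0xffffffff, 0x00002001, 0x00000000, 0x00000000,
    0xfcffff0f, 0xffffffff, 0xf3ffffe1, 0xe1ffffff,
    0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
};

/* Hypothetical lookup: returns 1 if cp is marked rare, 0 if common.
 * Assumes bit (cp & 31) of word (cp >> 5) is the rare flag. */
static int is_rare_codepoint(uint32_t cp)
{
    assert(cp < 16 * 32); /* only the copied prefix is available here */
    return (int)((bitvec_prefix[cp >> 5] >> (cp & 31)) & 1);
}
```

With these values, is_rare_codepoint(0x0105) (ą) and is_rare_codepoint(0x0142) (ł) return 0, while is_rare_codepoint(0x0100) (Ā), which is not in any of the added ranges, still returns 1.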
