Optimize text encoding detection for speed (eliminate Unicode property lookups)

...by just testing whether the input codepoints fall within a few fixed
ranges instead. This avoids hash lookups in property tables.

From (micro-)benchmarking on my PC, this looks to be a bit less than 4x
faster than the existing code.
alexdowad committed Sep 20, 2021
1 parent ad14d65 commit 6acd4f7
Showing 1 changed file with 6 additions and 3 deletions.
9 changes: 6 additions & 3 deletions ext/mbstring/libmbfl/mbfl/mbfilter.c
@@ -294,13 +294,16 @@ static int mbfl_estimate_encoding_likelihood(int c, void *void_data)
 	 * it's the wrong one. */
 	if (c == MBFL_BAD_INPUT) {
 		data->num_illegalchars++;
-	} else if (php_unicode_is_cntrl(c) || php_unicode_is_private(c)) {
+	} else if (c < 0x9 || (c >= 0xE && c <= 0x1F) || (c >= 0xE000 && c <= 0xF8FF) || c >= 0xF0000) {
 		/* Otherwise, count how many control characters and 'private use'
 		 * codepoints we see. Those are rarely used and may indicate that
 		 * the candidate encoding is not the right one. */
 		data->score += 10;
-	} else if (php_unicode_is_punct(c)) {
-		/* Punctuation is also less common than letters/digits */
+	} else if ((c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) || (c >= 0x5B && c <= 0x60)) {
+		/* Punctuation is also less common than letters/digits; further, if
+		 * text in ISO-2022 or similar encodings is mistakenly identified as
+		 * ASCII or UTF-8, the misinterpreted string will tend to have an
+		 * unusually high density of ASCII punctuation characters. */
 		data->score++;
 	}
 	return 0;
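For readers who want to try the new range checks outside of libmbfl, here is a minimal standalone sketch. The two conditions are copied from the added lines of the diff; everything else is an assumption made for illustration: the `detector_data` struct (reduced to the two fields the hunk touches), the `BAD_INPUT` sentinel standing in for `MBFL_BAD_INPUT`, and the `estimate_likelihood` and `score_text` helpers. It is not the actual libmbfl code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the detector state: only the two fields
 * that the hunk above modifies. */
typedef struct {
	unsigned int num_illegalchars;
	unsigned int score;
} detector_data;

/* Hypothetical sentinel standing in for MBFL_BAD_INPUT. */
#define BAD_INPUT (-1)

/* Score one decoded codepoint using the fixed ranges from the commit.
 * A higher score means the candidate encoding looks less plausible. */
void estimate_likelihood(int c, detector_data *data)
{
	if (c == BAD_INPUT) {
		data->num_illegalchars++;
	} else if (c < 0x9 || (c >= 0xE && c <= 0x1F) || (c >= 0xE000 && c <= 0xF8FF) || c >= 0xF0000) {
		/* Control characters (excluding 0x9..0xD) and private use codepoints */
		data->score += 10;
	} else if ((c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) || (c >= 0x5B && c <= 0x60)) {
		/* ASCII punctuation in the three ranges the commit tests */
		data->score++;
	}
}

/* Hypothetical helper: score a whole array of already-decoded codepoints. */
detector_data score_text(const int *cps, size_t n)
{
	detector_data data = {0, 0};
	assert(cps != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		estimate_likelihood(cps[i], &data);
	}
	return data;
}
```

Calling `score_text` on the codepoints of `"Hi!"` adds one point for the `!`, while tab, line feed and carriage return (0x9, 0xA, 0xD) add nothing because they fall outside the penalized control range. The ranges as written also leave `{`, `|`, `}` and `~` (0x7B to 0x7E) and all non-ASCII punctuation unpenalized, which the old `php_unicode_is_punct` lookup presumably would have counted.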
