Optimize text encoding detection for speed (eliminate Unicode property lookups)

Instead of looking up Unicode properties, just test whether each input codepoint falls within a few fixed ranges. This avoids hash lookups in property tables. In (micro-)benchmarks on my PC, this looks to be a bit less than 4x faster than the existing code.
php · Sep 20, 2021 · 6acd4f7
1 parent ad14d65
Showing 1 changed file with 6 additions and 3 deletions.
diff --git a/ext/mbstring/libmbfl/mbfl/mbfilter.c b/ext/mbstring/libmbfl/mbfl/mbfilter.c
@@ -294,13 +294,16 @@ static int mbfl_estimate_encoding_likelihood(int c, void *void_data)
 	 * it's the wrong one. */
 	if (c == MBFL_BAD_INPUT) {
 		data->num_illegalchars++;
-	} else if (php_unicode_is_cntrl(c) || php_unicode_is_private(c)) {
+	} else if (c < 0x9 || (c >= 0xE && c <= 0x1F) || (c >= 0xE000 && c <= 0xF8FF) || c >= 0xF0000) {
 		/* Otherwise, count how many control characters and 'private use'
 		 * codepoints we see. Those are rarely used and may indicate that
 		 * the candidate encoding is not the right one. */
 		data->score += 10;
-	} else if (php_unicode_is_punct(c)) {
-		/* Punctuation is also less common than letters/digits */
+	} else if ((c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) || (c >= 0x5B && c <= 0x60)) {
+		/* Punctuation is also less common than letters/digits; further, if
+		 * text in ISO-2022 or similar encodings is mistakenly identified as
+		 * ASCII or UTF-8, the misinterpreted string will tend to have an
+		 * unusually high density of ASCII punctuation characters. */
 		data->score++;
 	}
 	return 0;
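
To see how the new range checks behave on concrete input, here is a minimal standalone sketch. The range constants are copied from the hunk above; the helper names (`is_ctrl_or_private`, `is_ascii_punct`, `codepoint_penalty`, `score_codepoints`) are hypothetical and do not exist in libmbfl. The sketch leaves out the `MBFL_BAD_INPUT` branch and assumes every codepoint is non-negative.

```c
#include <stddef.h>

/* Hypothetical helper: same ranges as the first new condition in the diff.
 * Covers most C0 controls, the BMP private use area, and everything from
 * U+F0000 upward. Assumes c >= 0. */
static int is_ctrl_or_private(int c)
{
	return c < 0x9 || (c >= 0xE && c <= 0x1F) ||
	       (c >= 0xE000 && c <= 0xF8FF) || c >= 0xF0000;
}

/* Hypothetical helper: same ranges as the second new condition in the diff.
 * Note that 0x7B..0x7E ('{', '|', '}', '~') are not included. */
static int is_ascii_punct(int c)
{
	return (c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) ||
	       (c >= 0x5B && c <= 0x60);
}

/* Penalty one codepoint adds to the score: 10, 1, or 0. */
static int codepoint_penalty(int c)
{
	if (is_ctrl_or_private(c)) {
		return 10;
	}
	if (is_ascii_punct(c)) {
		return 1;
	}
	return 0;
}

/* Sum of penalties over an array of already-decoded codepoints. */
static int score_codepoints(const int *cp, size_t n)
{
	int score = 0;
	for (size_t i = 0; i < n; i++) {
		score += codepoint_penalty(cp[i]);
	}
	return score;
}
```

As a usage example, take the ISO-2022-JP byte sequence `1B 24 42 46 7C 4B 5C 1B 28 42` and read each byte as an ASCII codepoint. Under this sketch the two ESC bytes contribute 10 each and `$`, `\` and `(` contribute 1 each, for a total of 23. The ASCII text `Hello, world` scores 1, for the comma alone. That gap is the effect the new comment in the diff describes.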