UCS-4 conversion does not pass BOM through to output

This is to match the way that we handle UCS-2. When a BOM is found at the beginning of a 'UCS-2' string (NOT 'UCS-2BE' or 'UCS-2LE'), we take note of the intended byte order and handle the string accordingly, but do NOT emit a BOM to the output. Rather, we just use the default byte order for the requested output encoding. Some might argue that if the input string used a BOM, and we are emitting output in a text encoding where both big-endian and little-endian byte orders are possible, we should include a BOM in the output string. To such hypothetical debaters of minutiae, I can only offer a shoulder shrug. No reasonable program which handles UCS-2 and UCS-4 text should require a BOM. Really, the concept of the BOM is a poor idea and should not have been included in Unicode. Standardizing on a single byte order would have been much better, similar to 'network byte order' for the Internet Protocol. But this is not the place to speak at length of such things.
php · Aug 30, 2021 · 97f8495
1 parent e6f1a72
commit 97f8495
Showing 1 changed file with 2 additions and 3 deletions.
diff --git a/ext/mbstring/libmbfl/filters/mbfilter_ucs4.c b/ext/mbstring/libmbfl/filters/mbfilter_ucs4.c
@@ -185,11 +185,10 @@ int mbfl_filt_conv_ucs4_wchar(int c, mbfl_convert_filter *filter)
 			} else {
 				filter->status = 0x100;		/* little-endian */
 			}
-			CK((*filter->output_function)(0xfeff, filter->data));
-		} else {
-			filter->status &= ~0xff;
+		} else if (n != 0xfeff) {
 			CK((*filter->output_function)(n, filter->data));
 		}
+		filter->status &= ~0xff;
 		break;
 	}
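
The behaviour described in the commit message can be illustrated with a small standalone sketch. This is not the libmbfl filter itself: `ucs4_decode_skip_bom` is a hypothetical helper that works on a whole buffer rather than byte by byte, assumes big-endian as the default byte order, and only looks for a BOM at the very beginning of the input. It shows the intended effect: the BOM selects the byte order and is consumed, but never reaches the output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libmbfl: decode a UCS-4 byte buffer into
 * code points. A BOM at the start of the input selects the byte order and is
 * consumed without being written to `out`. Trailing bytes that do not form a
 * full 4-byte unit are ignored. Returns the number of code points written. */
size_t ucs4_decode_skip_bom(const unsigned char *in, size_t len,
                            uint32_t *out, size_t out_cap)
{
    int little_endian = 0; /* assumption: big-endian when no BOM is present */
    size_t pos = 0, count = 0;

    if (len >= 4) {
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0xFE && in[3] == 0xFF) {
            pos = 4;            /* big-endian BOM: skip it */
        } else if (in[0] == 0xFF && in[1] == 0xFE && in[2] == 0x00 && in[3] == 0x00) {
            little_endian = 1;  /* little-endian BOM: note the order, skip it */
            pos = 4;
        }
    }

    for (; pos + 4 <= len && count < out_cap; pos += 4) {
        const unsigned char *p = in + pos;
        uint32_t n = little_endian
            ? ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
              ((uint32_t)p[1] << 8)  |  (uint32_t)p[0]
            : ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
              ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
        out[count++] = n;
    }
    return count;
}
```

With this sketch, the input bytes `FF FE 00 00 41 00 00 00` decode to the single code point U+0041, and no U+FEFF appears in the output.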