UCS-4 conversion does not pass BOM through to output
This is to match the way that we handle UCS-2. When a BOM is found at
the beginning of a 'UCS-2' string (NOT 'UCS-2BE' or 'UCS-2LE'), we take
note of the intended byte order and handle the string accordingly, but
do NOT emit a BOM to the output. Rather, we just use the default byte
order for the requested output encoding.
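
The behaviour described above can be sketched roughly as follows. This is a simplified, whole-buffer illustration and not the libmbfl streaming filter; `read_u32` and `ucs4_decode_sketch` are hypothetical names, and the sketch assumes big-endian is the default byte order when no BOM is present.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libmbfl): read one 32-bit unit
 * from four bytes in the given byte order. */
static uint32_t read_u32(const unsigned char *p, int little_endian)
{
    if (little_endian) {
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical decoder for a buffer labelled plain "UCS-4".
 * A leading BOM selects the byte order and is consumed; it is never
 * written to `out`. Returns the number of code points stored. */
size_t ucs4_decode_sketch(const unsigned char *in, size_t len,
                          uint32_t *out, size_t out_cap)
{
    int little_endian = 0; /* assumed default: big-endian */
    size_t i = 0, n = 0;

    if (len >= 4) {
        uint32_t first = read_u32(in, 0);
        if (first == 0xFFFE0000u) {
            /* Byte-swapped BOM: the input is little-endian. */
            little_endian = 1;
            i = 4;
        } else if (first == 0x0000FEFFu) {
            /* Big-endian BOM: keep the default, skip the BOM. */
            i = 4;
        }
    }

    for (; i + 4 <= len && n < out_cap; i += 4) {
        out[n++] = read_u32(in + i, little_endian);
    }
    return n;
}
```

An encoder for the requested output encoding would then write these code points in its own default byte order, without prepending a BOM.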

Some might argue that if the input string used a BOM, and we are
emitting output in a text encoding where both big-endian and
little-endian byte orders are possible, we should include a BOM in the
output string. To such hypothetical debaters of minutiae, I can only
offer a shoulder shrug. No reasonable program which handles UCS-2
and UCS-4 text should require a BOM.

Really, the concept of the BOM is a poor idea and should not have been
included in Unicode. Standardizing on a single byte order would have
been much better, similar to 'network byte order' for the Internet
Protocol. But this is not the place to speak at length of such things.
alexdowad committed Aug 30, 2021
1 parent e6f1a72 commit 97f8495
Showing 1 changed file with 2 additions and 3 deletions.
5 changes: 2 additions & 3 deletions ext/mbstring/libmbfl/filters/mbfilter_ucs4.c
@@ -185,11 +185,10 @@ int mbfl_filt_conv_ucs4_wchar(int c, mbfl_convert_filter *filter)
 			} else {
 				filter->status = 0x100;		/* little-endian */
 			}
-			CK((*filter->output_function)(0xfeff, filter->data));
-		} else {
-			filter->status &= ~0xff;
+		} else if (n != 0xfeff) {
 			CK((*filter->output_function)(n, filter->data));
 		}
+		filter->status &= ~0xff;
 		break;
 	}
