Skip to content

Commit

Permalink
Deprecate use of mbstring to convert text to Base64/QPrint/HTML entit…
Browse files Browse the repository at this point in the history
…ies/etc

The purpose of mbstring is for working with Unicode and legacy text
encodings; but Base64, QPrint, etc. are not text encodings and don't
really belong in mbstring. PHP already contains separate implementations
of Base64, QPrint, and HTML entities. It will be better to eventually
remove these non-encodings from mbstring.

Regarding HTML entities... there is a bit more to say. mbstring's
implementation of HTML entities is different from the other built-in
implementation (htmlspecialchars and htmlentities). Those functions
convert <, >, and & to HTML entities, but mbstring does not.

It appears that the original author of mbstring intended for something
to be done with <, >, and &. He used a table to identify which
characters should be converted to HTML entities, and </>/& all have a
special value in that table. However, nothing ever checks for that
special value, so the characters are passed through unconverted.

This seems like a very useless implementation of HTML entities. The most
important characters which need to be expressed as entities in HTML
documents are those three!
  • Loading branch information
alexdowad committed Nov 1, 2021
1 parent a899dc9 commit 9308974
Show file tree
Hide file tree
Showing 6 changed files with 54 additions and 12 deletions.
10 changes: 10 additions & 0 deletions UPGRADING
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,16 @@ PHP 8.2 UPGRADE NOTES

RFC: https://wiki.php.net/rfc/deprecate_partially_supported_callables

- Mbstring:
. Use of QPrint, Base64, Uuencode, and HTML-ENTITIES 'text encodings' is
deprecated for all Mbstring functions. Unlike all the other text
encodings supported by Mbstring, these do not encode a sequence of
Unicode codepoints, but rather a sequence of raw bytes. It is not
clear what the correct return values for most Mbstring functions should
be when one of these non-encodings is specified. Further, PHP has
separate, built-in implementations of all of them; for example, UUencoded
data can be handled using convert_uuencode/convert_uudecode.

========================================
5. Changed Functions
========================================
Expand Down
12 changes: 12 additions & 0 deletions ext/mbstring/mbstring.c
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,8 @@
#include "libmbfl/mbfl/mbfilter_wchar.h"
#include "libmbfl/filters/mbfilter_base64.h"
#include "libmbfl/filters/mbfilter_qprint.h"
#include "libmbfl/filters/mbfilter_htmlent.h"
#include "libmbfl/filters/mbfilter_uuencode.h"
#include "libmbfl/filters/mbfilter_ucs4.h"
#include "libmbfl/filters/mbfilter_utf8.h"
#include "libmbfl/filters/mbfilter_tl_jisx0201_jisx0208.h"
Expand Down Expand Up @@ -216,6 +218,16 @@ static const mbfl_encoding *php_mb_get_encoding(zend_string *encoding_name, uint
if (!encoding) {
zend_argument_value_error(arg_num, "must be a valid encoding, \"%s\" given", ZSTR_VAL(encoding_name));
return NULL;
} else if (encoding->no_encoding <= mbfl_no_encoding_qprint) {
if (encoding == &mbfl_encoding_base64) {
php_error_docref(NULL, E_DEPRECATED, "Handling Base64 via mbstring is deprecated; use base64_encode/base64_decode instead");
} else if (encoding == &mbfl_encoding_qprint) {
php_error_docref(NULL, E_DEPRECATED, "Handling QPrint via mbstring is deprecated; use quoted_printable_encode/quoted_printable_decode instead");
} else if (encoding == &mbfl_encoding_html_ent) {
php_error_docref(NULL, E_DEPRECATED, "Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead");
} else if (encoding == &mbfl_encoding_uuencode) {
php_error_docref(NULL, E_DEPRECATED, "Handling Uuencode via mbstring is deprecated; use convert_uuencode/convert_uudecode instead");
}
}

if (last_encoding_name) {
Expand Down
3 changes: 2 additions & 1 deletion ext/mbstring/tests/bug45722.phpt
Original file line number Diff line number Diff line change
Expand Up @@ -6,5 +6,6 @@ mbstring
<?php
var_dump(mb_check_encoding("&\xc2\xb7 TEST TEST TEST TEST TEST TEST", "HTML-ENTITIES"));
?>
--EXPECT--
--EXPECTF--
Deprecated: mb_check_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead in %s on line %d
bool(true)
3 changes: 2 additions & 1 deletion ext/mbstring/tests/bug71606.phpt
Original file line number Diff line number Diff line change
Expand Up @@ -7,5 +7,6 @@ mbstring
echo mb_strcut('&quot;', 0, 0, 'HTML-ENTITIES');
echo 'DONE', PHP_EOL;
?>
--EXPECT--
--EXPECTF--
Deprecated: mb_strcut(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead in %s on line %d
DONE
32 changes: 23 additions & 9 deletions ext/mbstring/tests/mb_convert_encoding.phpt
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ $sjis = base64_decode('k/qWe4zqg2WDTINYg2eCxYK3gUIwMTIzNIJUglWCVoJXgliBQg==');
// JIS string (BASE64 encoded)
$jis = base64_decode('GyRCRnxLXDhsJUYlLSU5JUgkRyQ5ISMbKEIwMTIzNBskQiM1IzYjNyM4IzkhIxsoQg==');
// EUC-JP string
$euc_jp = '日本語テキストです。0123456789。';
$euc_jp = "\xC6\xFC\xCB\xDC\xB8\xEC\xA5\xC6\xA5\xAD\xA5\xB9\xA5\xC8\xA4\xC7\xA4\xB9\xA1\xA301234\xA3\xB5\xA3\xB6\xA3\xB7\xA3\xB8\xA3\xB9\xA1\xA3";

// Test with single "form encoding"
// Note: For some reason it complains, results are different. Not researched.
Expand Down Expand Up @@ -85,8 +85,6 @@ $s = $euc_jp;
$s = mb_convert_encoding($s, 'JIS', 'auto');
print("JIS: ".base64_encode($s)."\n"); // JIS


// Invalid Parameters
echo "== INVALID PARAMETER ==\n";

$s = mb_convert_encoding(1234, 'EUC-JP');
Expand All @@ -95,15 +93,22 @@ print("INT: $s\n");
$s = mb_convert_encoding('', 'EUC-JP');
print("EUC-JP: $s\n"); // SJIS

$s = $euc_jp;
try {
var_dump(mb_convert_encoding($s, 'BAD'));
} catch (\ValueError $e) {
echo $e->getMessage() . \PHP_EOL;
function tryBadConversion($str, $encoding) {
try {
var_dump(mb_convert_encoding($str, $encoding));
} catch (ValueError $e) {
echo $e->getMessage(), "\n";
}
}

tryBadConversion($euc_jp, 'BAD');

tryBadConversion('abc', 'Quoted-Printable');
tryBadConversion('abc', 'BASE64');
tryBadConversion('abc', 'HTML-ENTITIES');

?>
--EXPECT--
--EXPECTF--
== BASIC TEST ==
EUC-JP: c6fccbdcb8eca5c6a5ada5b9a5c8a4c7a4b9a1a33031323334a3b5a3b6a3b7a3b8a3b9a1a3
EUC-JP: c6fccbdcb8eca5c6a5ada5b9a5c8a4c7a4b9a1a33031323334a3b5a3b6a3b7a3b8a3b9a1a3
Expand All @@ -125,3 +130,12 @@ JIS: GyRCRnxLXDhsJUYlLSU5JUgkRyQ5ISMbKEIwMTIzNBskQiM1IzYjNyM4IzkhIxsoQg==
INT: 1234
EUC-JP:
mb_convert_encoding(): Argument #2 ($to_encoding) must be a valid encoding, "BAD" given

Deprecated: mb_convert_encoding(): Handling QPrint via mbstring is deprecated; use quoted_printable_encode/quoted_printable_decode instead in %s on line %d
string(3) "abc"

Deprecated: mb_convert_encoding(): Handling Base64 via mbstring is deprecated; use base64_encode/base64_decode instead in %s on line %d
string(4) "YWJj"

Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead in %s on line %d
string(3) "abc"
6 changes: 5 additions & 1 deletion ext/mbstring/tests/mb_strlen_variation3.phpt
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,7 @@ foreach($encoding as $enc) {

echo "Done";
?>
--EXPECT--
--EXPECTF--
*** Testing mb_strlen() : usage variations ***

-- Iteration 1: UCS-4 --
Expand Down Expand Up @@ -335,12 +335,16 @@ Encoding byte4le recognised

-- Iteration 40: BASE64 --
-- ASCII String --

Deprecated: mb_strlen(): Handling Base64 via mbstring is deprecated; use base64_encode/base64_decode instead in %s on line %d
Encoding BASE64 recognised
-- Multibyte String --
Encoding BASE64 recognised

-- Iteration 41: HTML-ENTITIES --
-- ASCII String --

Deprecated: mb_strlen(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead in %s on line %d
Encoding HTML-ENTITIES recognised
-- Multibyte String --
Encoding HTML-ENTITIES recognised
Expand Down

6 comments on commit 9308974

@twose
Copy link
Member

@twose twose commented on 9308974 Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/phpstan/phpstan-src/blob/master/src/Type/Php/MbFunctionsReturnTypeExtension.php#L54

Code like this will trigger deprecation warnings now, and I can't think of an elegant way to solve it except for error suppression.

@alexdowad
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twose Thanks for letting us know the challenge you are facing.

It's unfortunate that some features which do not really make sense as part of MBString were added long ago. Some of these, like MBString's HTML entity handling feature, were completely broken from the day they were committed to the code base and have never worked properly, but now that we want to remove them, this has created some trouble for people like yourself. Sorry for that; it's not my intention to cause problems for any of our users.

The easiest and simplest way for you to avoid deprecation warnings in the code you link to is to include an if...continue clause in that loop. This is untested code, but just as a sample, it might look something like:

foreach (mb_list_encodings() as $encoding) {
    if ($encoding === 'HTML-ENTITIES' || $encoding === 'BASE64' || $encoding === 'Quoted-Printable')
        continue;
    // ...now go on and use $encoding as you wish

Please let me know if something similar to that does the trick for you.

@twose
Copy link
Member

@twose twose commented on 9308974 Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexdowad
Thank you for your quick response!

I wonder if these encodings will be removed from the mb_list_encodings() in the future.
If yes, then we need to remove such compatibility code at that time, right?

And, maybe it is reasonable to allow the use of 'mb_encoding_aliases()' without throwing deprecation warnings before removing these encodings?

@alexdowad
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if these encodings will be removed from the mb_list_encodings() in the future. If yes, then we need to remove such compatibility code at that time, right?

Well, whenever you do not support PHP 8.1 any more, then yes... you could remove it.

And, maybe it is reasonable to allow the use of 'mb_encoding_aliases()' without throwing deprecation warnings before removing these encodings?

For your use case, it does seem reasonable. Then again, if someone is specifically passing in a deprecated encoding which we want to remove (i.e. mb_encoding_aliases('BASE64')), then we would like that person to be warned of the deprecation.

Again, if mb_list_encodings was modified so as not to return the names of deprecated encodings, that would be better for your use case. But there is a chance that it might break existing code for someone else.

In general, trying to refactor APIs which are already in use is very, very hard. (Please see Hyrum's Law.) The best solution would be if we could just make a time machine, then go back in time and ensure that API design mistakes were never committed.

Failing that, another approach is to say that once released, APIs can never change, so if we want to fix design problems with mbstring, we should instead add mbstring2 and leave the old version untouched. This is the approach taken by Go's standard libraries (and there are notable advantages to that approach). However, in this case, I think the maintenance burden of maintaining two different versions of mbstring is unacceptable.

The remaining alternative, which has been chosen, is to 1) carefully think about breaking changes to make sure they are really better in the long term, 2) try to make changes in a way which minimizes the number of users affected, and 3) provide deprecation warnings in advance so users become aware of pending changes and can take action. In some cases, the "action" may be to provide feedback to the developers, which might impact when and how the changes are done.

We are very interested in hearing the views of PHP users. You have already expressed your view that mb_encoding_aliases should not raise deprecation warnings for encodings which may be removed later. If you can provide reasons why this is the best way to handle the deprecation (for all involved, not just yourself), and a developer is available to implement the suggestion, then it might be possible to go that way.

Thanks again for the report.

@twose
Copy link
Member

@twose twose commented on 9308974 Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexdowad
Thank you very much for your detailed explanation!
I don't actually have any point of view, I'm just asking. (Maybe my poor English expressive faculty makes you think I'm supporting another point of view 🥲).
Now I've got all the answers I want.
Thank you again for the quick response!

@stof
Copy link
Contributor

@stof stof commented on 9308974 Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexdowad I don't see any RFC about this new deprecation. This looks weird to me.

Please sign in to comment.