Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mb_str_split without any mbfl modifications PR3715 related #3808

Closed
wants to merge 18 commits into from

Conversation

@legale
Copy link
Contributor

legale commented Feb 9, 2019

mb_str_split without any mbfl modifications
This PR is related to #3715

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
@nikic nikic self-requested a review Feb 9, 2019
@legale legale force-pushed the legale:master branch 4 times, most recently from ff56509 to a501664 Feb 9, 2019
mbniebergall and others added 3 commits Feb 9, 2019
Tests for __set_state magic method for DateTime, DateTimeImmutable,
DateTimeZone and DatePeriod.
@legale legale force-pushed the legale:master branch from a501664 to bc5a87a Feb 9, 2019
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
legale added 2 commits Feb 9, 2019
@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Feb 9, 2019

Why windows tests are so unstable???

@petk petk added the Enhancement label Feb 10, 2019
This reverts commit b743eab.
ext/date/tests/DatePeriod_set_state.phpt Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
legale added 3 commits Feb 11, 2019
This reverts commit 7b109cd.
@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Feb 11, 2019

@nikic, @guilliamxavier, @IMSoP
The test fails, but the reason is deeper. Unusual situation. Thing is that it does not pass the test for an incorrect utf16 string.

https://www.ietf.org/rfc/rfc2781.txt
RFC2781 about UTF-16 says:

2.2 Decoding UTF-16
   Decoding of a single character from UTF-16 to an ISO 10646 character
   value proceeds as follows. Let W1 be the next 16-bit integer in the
   sequence of integers representing the text. Let W2 be the (eventual)
   next integer following W1.
   1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
      of W1. Terminate.
   **2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
      is in error and no valid character can be obtained using W1.
      Terminate.**
   3) If there is no W2 (that is, the sequence ends with W1), or if W2
      is not between 0xDC00 and 0xDFFF, the sequence is in error.
      Terminate.
   4) Construct a 20-bit unsigned integer U', taking the 10 low-order
      bits of W1 as its 10 high-order bits and the 10 low-order bits of
      W2 as its 10 low-order bits.

So there is no valid character for the 0xDC00 (first low surrogate). I think that in such cases, decoder should skip the character as it is, but the mbfl library decoder changes it.

@nikic

This comment has been minimized.

Copy link
Member

nikic commented Feb 12, 2019

@legale As mentioned on https://bugs.php.net/bug.php?id=77607, invalid character being replaced with ? or some other substitution character is expected behavior here (the problem is actually that the same does not happen for UTF-8 due to the table optimization, but that's a much more general issue than just this function).

@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Feb 12, 2019

@nikic

This comment has been minimized.

Copy link
Member

nikic commented Feb 12, 2019

@legale Yes ;)

@nikic

This comment has been minimized.

Copy link
Member

nikic commented Feb 12, 2019

When compiling this just now, I got a bunch of warnings:

/home/nikic/php-7.4/ext/mbstring/mbstring.c: In function ‘mbfl_split_output’:
/home/nikic/php-7.4/ext/mbstring/mbstring.c:2308:54: warning: pointer targets in passing argument 2 of ‘add_next_index_stringl’ differ in signedness [-Wpointer-sign]
         add_next_index_stringl(params->return_value, chunk->val, chunk->len); /* add chunk to the array */
                                                      ^~~~~
In file included from /home/nikic/php-7.4/main/php.h:37:0,
                 from /home/nikic/php-7.4/ext/mbstring/mbstring.c:27:
/home/nikic/php-7.4/Zend/zend_API.h:429:14: note: expected ‘const char *’ but argument is of type ‘unsigned char *’
 ZEND_API int add_next_index_stringl(zval *arg, const char *str, size_t length);
              ^~~~~~~~~~~~~~~~~~~~~~
/home/nikic/php-7.4/ext/mbstring/mbstring.c: In function ‘zif_mb_str_split’:
/home/nikic/php-7.4/ext/mbstring/mbstring.c:2384:35: warning: pointer targets in initialization differ in signedness [-Wpointer-sign]
             char const *chunk_p = p; /* chunk first byte pointer */
                                   ^
/home/nikic/php-7.4/ext/mbstring/mbstring.c:2465:50: warning: pointer targets in passing argument 2 of ‘add_next_index_stringl’ differ in signedness [-Wpointer-sign]
             add_next_index_stringl(return_value, p, chunk_len);
                                                  ^
In file included from /home/nikic/php-7.4/main/php.h:37:0,
                 from /home/nikic/php-7.4/ext/mbstring/mbstring.c:27:
/home/nikic/php-7.4/Zend/zend_API.h:429:14: note: expected ‘const char *’ but argument is of type ‘const unsigned char *’
 ZEND_API int add_next_index_stringl(zval *arg, const char *str, size_t length);
              ^~~~~~~~~~~~~~~~~~~~~~
/home/nikic/php-7.4/ext/mbstring/mbstring.c:2467:46: warning: pointer targets in passing argument 2 of ‘add_next_index_stringl’ differ in signedness [-Wpointer-sign]
         add_next_index_stringl(return_value, p, last - p);
                                              ^
In file included from /home/nikic/php-7.4/main/php.h:37:0,
                 from /home/nikic/php-7.4/ext/mbstring/mbstring.c:27:
/home/nikic/php-7.4/Zend/zend_API.h:429:14: note: expected ‘const char *’ but argument is of type ‘const unsigned char *’
 ZEND_API int add_next_index_stringl(zval *arg, const char *str, size_t length);
              ^~~~~~~~~~~~~~~~~~~~~~
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
legale added 2 commits Feb 12, 2019
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
@nikic nikic mentioned this pull request Feb 12, 2019
legale added 5 commits Feb 12, 2019
@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Feb 12, 2019

@nikic Did we've finished?

@nikic

This comment has been minimized.

Copy link
Member

nikic commented Feb 12, 2019

Merged as d77ad27 into 7.4+, with some style fixes. In particular, note that C code in php-src uses tab indentation, not space indentation.

Thanks for seeing this through :)

@nikic nikic closed this Feb 12, 2019
@kamil-tekiela

This comment has been minimized.

Copy link

kamil-tekiela commented Apr 27, 2019

Thank you for this RFC. It is the true answer to my Stack Overflow question: https://stackoverflow.com/questions/55782088/convert-a-string-into-an-array-of-characters-multi-byte
It has passed my UTF-8 test according to my expectations: https://3v4l.org/M85Fi/rfc#output

@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Apr 28, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
8 participants
You can’t perform that action at this time.