Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mb_str_split() added #3715

Closed
wants to merge 31 commits into from
Closed
Changes from 1 commit
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
57e82dc
mb_str_split() added
legale Dec 23, 2018
2b99b9d
mb_str_split() error fixed
legale Dec 23, 2018
47ce704
mb_str_split() error fixed
legale Dec 23, 2018
63c8e79
Update .gitignore
petk Dec 23, 2018
bef21b5
new mb_str_split() using libmbfl library functions
legale Dec 28, 2018
91a7309
new mb_str_split() using libmbfl library functions
legale Dec 28, 2018
223a8a9
mb_str_split() optional argument "encoding" added + minor changes
legale Dec 28, 2018
79978f7
mb_str_split() minor changes in function argument names
legale Dec 28, 2018
cbdf106
mb_str_split() tests
legale Dec 28, 2018
64cd160
minor changes to pass appveyor tests
legale Dec 28, 2018
24082c4
minor src and tests refactoring
legale Jan 2, 2019
781c1d6
// comments replaced with /**/
legale Jan 2, 2019
62bcf9e
more tests
legale Jan 5, 2019
7f2ce93
minor tests changes
legale Jan 10, 2019
314619e
rerun test
legale Jan 10, 2019
e09d115
mb_str_split function rewritten completely
legale Jan 13, 2019
d9bb662
mbfl collector_substr moved back to static
legale Jan 13, 2019
2f412a3
tests improved
legale Jan 13, 2019
5a64309
trying to fix a memory leak 1
legale Jan 13, 2019
4b0523f
tests minor changes
legale Jan 13, 2019
f6ee1fa
tests minor changes
legale Jan 13, 2019
12c5928
refactoring + faster way to parse UTF-16
legale Jan 17, 2019
e202945
refactoring & more tests
legale Jan 17, 2019
65eaec9
minor comment changes
legale Jan 17, 2019
87824d8
minor changes
legale Jan 17, 2019
95e0647
comments changes
legale Jan 18, 2019
c0b3f57
`git checkout ecd533d -- ext/mbstring/libmbfl/` path /ext/mbstring/li…
legale Jan 18, 2019
f036661
UTF-16 parse bug fixed and related test added
legale Jan 19, 2019
d868059
utf-16 optimization
legale Jan 22, 2019
ad77e03
endian.h replaced with brg_endian.h
legale Jan 22, 2019
2ff7061
minor changes + bug fix in php_mb_mbchar_bytes_ex()
legale Jan 23, 2019
File filter...
Filter file types
Jump to…
Jump to file or symbol
Failed to load files and symbols.

Always

Just for now

Next

mb_str_split() added

  • Loading branch information
legale committed Dec 23, 2018
commit 57e82dca712f60a440a7824ee5a9cfffb532be7a
@@ -303,3 +303,8 @@ tmp-php.ini
!/ext/mbstring/oniguruma.patch
!/ext/pcre/pcre2lib/config.h
!/win32/build/Makefile

# ------------------------------------------------------------------------------
# Clion IDE files
# ------------------------------------------------------------------------------
.idea/**
This conversation was marked as resolved by legale

This comment has been minimized.

Copy link
@petk

petk Dec 23, 2018

Contributor

Same as before...

This comment has been minimized.

Copy link
@legale

legale Dec 23, 2018

Author Contributor

Sorry for that.

This comment has been minimized.

Copy link
@petk

petk Dec 24, 2018

Contributor

Hello, thanks for the patches! Very nice of you to get to this part. What was meant in the previous comment was to omit this change :) Because /.idea/ is only one of the folders that we should then ignore. Visual studio Code for example uses /.vscode/ and so on... It can quickly increase the .gitignore even more what currently already is. I'm not sure if there would be performance impact on these editors and command line with too long .gitignore file but if not anything for the readability. In any case, you're welcome to send a separate pull request concerning this change only. Thanks.

This comment has been minimized.

Copy link
@cmb69

cmb69 Dec 24, 2018

Contributor

What was meant in the previous comment was to omit this change :)

ACK. I suggest to configure the Git client to ignore IDE files – this can also be done globally (i.e. for all Git repos on the machine).

@@ -529,6 +529,7 @@ static const zend_function_entry mbstring_functions[] = {
PHP_FE(mb_parse_str, arginfo_mb_parse_str)
PHP_FE(mb_output_handler, arginfo_mb_output_handler)
PHP_FE(mb_preferred_mime_name, arginfo_mb_preferred_mime_name)
PHP_FE(mb_str_split, arginfo_mb_str_split)
PHP_FE(mb_strlen, arginfo_mb_strlen)
PHP_FE(mb_strpos, arginfo_mb_strpos)
PHP_FE(mb_strrpos, arginfo_mb_strrpos)
@@ -2282,6 +2283,64 @@ PHP_FUNCTION(mb_output_handler)
}
/* }}} */

/* {{{ proto array mb_str_split(string str [, int split_length])
Convert a multibyte string to an array. If split_length is specified,
break the string down into chunks each split_length characters long. */
PHP_FUNCTION(mb_str_split){
zend_string *str;
zend_long split_length = 1;
const unsigned char *p, *last; //string pointers

ZEND_PARSE_PARAMETERS_START(1, 2)
Z_PARAM_STR(str)
Z_PARAM_OPTIONAL
Z_PARAM_LONG(split_length)
ZEND_PARSE_PARAMETERS_END();

if (split_length <= 0) {
php_error_docref(NULL, E_WARNING, "The length of each segment must be greater than zero");
RETURN_FALSE;
}

if (0 == ZSTR_LEN(str) || (size_t)split_length >= ZSTR_LEN(str)) {
array_init_size(return_value, 1);
add_next_index_stringl(return_value, ZSTR_VAL(str), ZSTR_LEN(str));
return;
}

//minimum array size is string length / 4 bytes / split_length
array_init_size(return_value, (ZSTR_LEN(str) >> 2) / (size_t)split_length);


//MACROS to create precount array
#define B4(n) n,n,n,n
#define B8(n) B4(n),B4(n)
#define B16(n) B8(n),B8(n)
#define B32(n) B16(n),B16(n)
#define B64(n) B32(n),B32(n)

unsigned char byte[256] = {B64(1),B64(1),B64(2),B32(2),B16(3),B8(4),B8(0)};

p = ZSTR_VAL(str);
last = p + ZSTR_LEN(str);

while (p < last) {
const unsigned char *chunk_p = p;
uint32_t chunk_length = 0;
for(uint32_t char_count = 0; char_count < split_length; ++char_count) {
/* 1 byte max 0b01111111 (127)
* 2 byte max 0b11011111 (223)
* 3 byte max 0b11101111 (239)
* 4 byte max ob11110111 (247) */
p += byte[*p];
chunk_length += byte[*p];
}
add_next_index_stringl(return_value, chunk_p, chunk_length);
}
This conversation was marked as resolved by legale

This comment has been minimized.

Copy link
@nikic

nikic Jan 11, 2019

Member

While this code will support all encodings, this is also going to be very slow. The issue is that we're operating on codepoint offsets here and finding which byte offset corresponds to a certain codepoint offset is an O(n) operation. Every mbfl_substr call is going to scan the string to find the offset anew. This means that the overall complexity of this function is going to be O(n^2), making it impractically slow for strings beyond a certain length (try to put a string with something like 100000 characters in here -- might also make for a good test).

To make this run in O(n) it will require a more specialized implementation. As already mentioned in #3715 (comment), there are usually three different implementations necessary, one for fixed-length encodings (trivial), one for UTF8 (basically what you did initially) and one general (you can keep your current code for the general case for now, as this is less important).

This comment has been minimized.

Copy link
@legale

legale Jan 13, 2019

Author Contributor

done.

return;
}
/* }}} */

/* {{{ proto int mb_strlen(string str [, string encoding])
Get character numbers of a string */
PHP_FUNCTION(mb_strlen)
@@ -78,6 +78,7 @@ PHP_FUNCTION(mb_substitute_character);
PHP_FUNCTION(mb_preferred_mime_name);
PHP_FUNCTION(mb_parse_str);
PHP_FUNCTION(mb_output_handler);
PHP_FUNCTION(mb_str_split);
PHP_FUNCTION(mb_strlen);
PHP_FUNCTION(mb_strpos);
PHP_FUNCTION(mb_strrpos);
ProTip! Use n and p to navigate between commits in a pull request.
You can’t perform that action at this time.