alexdowad
Contributor

Use the new (faster) encoding conversion code for case conversion functions like mb_convert_case, mb_strtoupper, and mb_strtolower. Speed increase is only about 50% for title casing, but 2-3x for other types of case conversion.

Fuzzed with libfuzzer. One bug in my first draft of the implementation was found, and a regression test added.

Note: the signature of one function with a public symbol (php_unicode_convert_case) has changed. This could break C extensions that link directly to mbstring and call this function. However, none of the PECL extensions do so.

FYA @cmb69 @nikic @kamil-tekiela

Perhaps @mvorisek might be interested. Recently he raised some suggestions about how to make mb_strtoupper and mb_strtolower faster. This PR does not close the performance gap with strtoupper and strtolower, but at least makes it much smaller than it was.

@alexdowad
Contributor Author

In case they are interested... @Girgias @kocsismate

};

static int convert_case_filter(int c, void *void_data)
MBSTRING_API zend_string *php_unicode_convert_case(int case_mode, const char *srcstr, size_t in_len, const mbfl_encoding *src_encoding, int illegal_mode, int illegal_substchar)
Member

@Girgias Girgias Sep 23, 2022

For the case mode, would it make sense to add an enum which contains all of the different modes?

And are illegal_mode and illegal_substchar really ints, rather than, say, a bool and a char? (Again, I know very little about character encodings, so the current API might be sensible.)

Contributor Author

Using an enum for case mode is a good idea.

illegal_mode is essentially an enum; there are 3-4 valid values. illegal_substchar is a Unicode codepoint, and therefore could be uint32_t.

Contributor Author

Hmm. It seems that introducing an enum for illegal_mode may require reorganizing the header files for mbstring. Depending on which header file I define it in, there are a lot of issues with the order of header file inclusion.

Contributor Author

I am just converting illegal_substchar to uint32_t.

For illegal_mode, I would like to handle this in the future. It seems to open a "can of worms" which is not directly related to this PR.
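
The type changes discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: `sketch_case_mode`, its enumerators, and `sketch_convert_case` are hypothetical names, not mbstring's actual definitions, and the toy converter handles ASCII letters only. It shows an enum standing in for an int case mode and a `uint32_t` carrying the substitution codepoint.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; NOT mbstring's actual definitions. */
typedef enum {
    SKETCH_CASE_UPPER,
    SKETCH_CASE_LOWER,
    SKETCH_CASE_TITLE
} sketch_case_mode;

static uint32_t ascii_upper(uint32_t c) { return (c >= 'a' && c <= 'z') ? c - 32 : c; }
static uint32_t ascii_lower(uint32_t c) { return (c >= 'A' && c <= 'Z') ? c + 32 : c; }

/* Toy, ASCII-only case conversion over an array of codepoints, done in place.
 * Values above U+10FFFF are not valid codepoints and are replaced by
 * substchar, which is itself a codepoint and therefore a uint32_t. */
static void sketch_convert_case(sketch_case_mode mode, uint32_t *cps, size_t len,
                                uint32_t substchar)
{
    int at_word_start = 1;

    for (size_t i = 0; i < len; i++) {
        uint32_t c = cps[i];

        if (c > 0x10FFFF) {
            cps[i] = substchar;
            at_word_start = 1;
            continue;
        }

        switch (mode) {
        case SKETCH_CASE_UPPER:
            c = ascii_upper(c);
            break;
        case SKETCH_CASE_LOWER:
            c = ascii_lower(c);
            break;
        case SKETCH_CASE_TITLE:
            /* Uppercase the first letter of each word, lowercase the rest. */
            c = at_word_start ? ascii_upper(c) : ascii_lower(c);
            break;
        }

        /* The next codepoint starts a word if this one is not an ASCII letter. */
        at_word_start = !((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));
        cps[i] = c;
    }
}
```

With an enum, a compiler can warn when a switch misses a mode, which a plain int argument does not allow.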

@kamil-tekiela
Member

kamil-tekiela commented Sep 23, 2022

Hi Alex,

I am doing some light testing and I cannot figure out why I get different outputs for mb_strtoupper() with your PR. I computed a diff of the codepoints, but I don't think these codepoints are valid Unicode characters, so I am a little confused why I got a mismatch. Is this something that your implementation changed?

Element 0 is the input codepoint; element 2 is the output codepoint.

array(40) {
  [0] =>
  array(2) {
    [0] =>
    int(11359)
    [2] =>
    int(11311)
  }
  [1] =>
  array(2) {
    [0] =>
    int(42945)
    [2] =>
    int(42944)
  }
  [2] =>
  array(2) {
    [0] =>
    int(42961)
    [2] =>
    int(42960)
  }
  [3] =>
  array(2) {
    [0] =>
    int(42967)
    [2] =>
    int(42966)
  }
  [4] =>
  array(2) {
    [0] =>
    int(42969)
    [2] =>
    int(42968)
  }
  [5] =>
  array(2) {
    [0] =>
    int(66967)
    [2] =>
    int(66928)
  }
  [6] =>
  array(2) {
    [0] =>
    int(66968)
    [2] =>
    int(66929)
  }
  [7] =>
  array(2) {
    [0] =>
    int(66969)
    [2] =>
    int(66930)
  }
  [8] =>
  array(2) {
    [0] =>
    int(66970)
    [2] =>
    int(66931)
  }
  [9] =>
  array(2) {
    [0] =>
    int(66971)
    [2] =>
    int(66932)
  }
  [10] =>
  array(2) {
    [0] =>
    int(66972)
    [2] =>
    int(66933)
  }
  [11] =>
  array(2) {
    [0] =>
    int(66973)
    [2] =>
    int(66934)
  }
  [12] =>
  array(2) {
    [0] =>
    int(66974)
    [2] =>
    int(66935)
  }
  [13] =>
  array(2) {
    [0] =>
    int(66975)
    [2] =>
    int(66936)
  }
  [14] =>
  array(2) {
    [0] =>
    int(66976)
    [2] =>
    int(66937)
  }
  [15] =>
  array(2) {
    [0] =>
    int(66977)
    [2] =>
    int(66938)
  }
  [16] =>
  array(2) {
    [0] =>
    int(66979)
    [2] =>
    int(66940)
  }
  [17] =>
  array(2) {
    [0] =>
    int(66980)
    [2] =>
    int(66941)
  }
  [18] =>
  array(2) {
    [0] =>
    int(66981)
    [2] =>
    int(66942)
  }
  [19] =>
  array(2) {
    [0] =>
    int(66982)
    [2] =>
    int(66943)
  }
  [20] =>
  array(2) {
    [0] =>
    int(66983)
    [2] =>
    int(66944)
  }
  [21] =>
  array(2) {
    [0] =>
    int(66984)
    [2] =>
    int(66945)
  }
  [22] =>
  array(2) {
    [0] =>
    int(66985)
    [2] =>
    int(66946)
  }
  [23] =>
  array(2) {
    [0] =>
    int(66986)
    [2] =>
    int(66947)
  }
  [24] =>
  array(2) {
    [0] =>
    int(66987)
    [2] =>
    int(66948)
  }
  [25] =>
  array(2) {
    [0] =>
    int(66988)
    [2] =>
    int(66949)
  }
  [26] =>
  array(2) {
    [0] =>
    int(66989)
    [2] =>
    int(66950)
  }
  [27] =>
  array(2) {
    [0] =>
    int(66990)
    [2] =>
    int(66951)
  }
  [28] =>
  array(2) {
    [0] =>
    int(66991)
    [2] =>
    int(66952)
  }
  [29] =>
  array(2) {
    [0] =>
    int(66992)
    [2] =>
    int(66953)
  }
  [30] =>
  array(2) {
    [0] =>
    int(66993)
    [2] =>
    int(66954)
  }
  [31] =>
  array(2) {
    [0] =>
    int(66995)
    [2] =>
    int(66956)
  }
  [32] =>
  array(2) {
    [0] =>
    int(66996)
    [2] =>
    int(66957)
  }
  [33] =>
  array(2) {
    [0] =>
    int(66997)
    [2] =>
    int(66958)
  }
  [34] =>
  array(2) {
    [0] =>
    int(66998)
    [2] =>
    int(66959)
  }
  [35] =>
  array(2) {
    [0] =>
    int(66999)
    [2] =>
    int(66960)
  }
  [36] =>
  array(2) {
    [0] =>
    int(67000)
    [2] =>
    int(66961)
  }
  [37] =>
  array(2) {
    [0] =>
    int(67001)
    [2] =>
    int(66962)
  }
  [38] =>
  array(2) {
    [0] =>
    int(67003)
    [2] =>
    int(66964)
  }
  [39] =>
  array(2) {
    [0] =>
    int(67004)
    [2] =>
    int(66965)
  }
}

Edit: I am running PHP 8.0 and while writing this, I realized I haven't compared it against 8.1. https://3v4l.org/M9sqM So it's not this PR.

@alexdowad
Contributor Author

@kamil-tekiela Thank you very much for testing! Could you share your test code?

@kamil-tekiela
Member

I only did a simple loop for ($i = 42; $i < 0x10FFFF; $i++) { and executed mb_strtoupper(mb_chr($i)); inside it. I then compared output on PHP 8.0 and with your patch. That's it. :)

@alexdowad
Contributor Author

@kamil-tekiela Then mb_strtoupper will use your default internal character encoding... what does mb_internal_encoding return on your system?

@kamil-tekiela
Member

It says UTF-8

@alexdowad
Contributor Author

@kamil-tekiela I am trying to investigate your findings, but am not getting the same results. Could you kindly share more of your testing code?

@alexdowad
Contributor Author

As an example:

<?php
var_dump(bin2hex(mb_strtoupper(mb_chr(11359))));

gives me:

string(6) "e2b0af"

@alexdowad
Contributor Author

@kamil-tekiela Sorry, I just saw the link you kindly provided to 3v4l. This is enough to work with.

@alexdowad
Contributor Author

Hmm. U+2C5F is Glagolitic Small Letter Caudate Chrivi, U+2C2F is Glagolitic Capital Letter Caudate Chrivi. So it looks like PHP 8.1 is doing the right thing there.

Looks like this difference must be caused by fe36b81.
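
The numbers in this exchange can be cross-checked with a small C sketch. It assumes nothing about mbstring's internals: `utf8_encode` is a hypothetical helper written for this illustration. Decimal 11359 and 11311 from the dump are U+2C5F and U+2C2F, and encoding U+2C2F as UTF-8 yields the bytes e2 b0 af reported above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode codepoint as UTF-8.
 * Returns the number of bytes written to out, or 0 if cp is a
 * surrogate or lies above U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0; /* surrogates are not encodable in UTF-8 */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0x2C2F, buf)` writes three bytes, 0xE2 0xB0 0xAF, which matches the `bin2hex` output for the uppercased character.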

@alexdowad
Contributor Author

Just pushed 2 commits based on the helpful comments from @Girgias. Any further comments?

Thank you, everyone!

@alexdowad
Contributor Author

Failure on Travis is spurious (it's on a test for stat and lstat functions).

@Girgias
Member

Girgias commented Oct 4, 2022

From my PoV, this looks OK, but as I said, I am far from an expert :)

…mb_strtolower

Speed increase is only about 50% for title casing, but 2-3x for other
forms of case conversion.
This value is a wchar, so the best type for it is uint32_t.
@alexdowad force-pushed the cleanup-mbstring-26 branch from c7b8cf7 to beae80b on October 4, 2022 20:06
@alexdowad
Contributor Author

Merged. If anyone notices anything that can be improved later on, I can still revise these changes.

@alexdowad closed this Oct 5, 2022