Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Fixing a bug in php-src

Foreword

These are notes on my first experience of a patch on the PHP core. It was primarily intended as personal log, but I wrote it again in a way to share my experience.

The objectives

I've been writing PHP code for years, so I've always wanted to give back a little and contribute to the PHP project. My C is a bit rusty, but it should be enough to fix a simple bug. So when I heard about a new call for testing, I tried my luck.

The new feature to get a random bug gave me #40531 mb_substr optional parameters from the official extension mbstring. This seemed like a good first task. It even looked too easy: I'll just add a if (x == NULL) and I'm done... or maybe not!

Fetching the sources

Through git

The PHP source code has migrated from SVN to git (2012-03-19). The official repositories have a copy on Github, so the easiest way to fetch the code is to fork it through the web interface.

  • You need an account on Github. Don't forget to configure your SSH keys.
  • Go to github php-src.
  • Click on the "Fork" link on the upper right part of the page.
  • Once your own fork is created, copy its URL (prefer SSH to HTTPS if possible).
  • Clone your repository on your local machine.
  • Switch to the branch you want to work on, here "PHP-5.4".
% git clone git@github.com:mytskine/php-src.git
% cd php-src
% git checkout PHP-5.4

Through an archive (not recommended)

This is how I first had a look at PHP's source before they used a Distributed VCS. If you have very little bandwidth, you may want to avoid fetching the source history through git. In this case, you can download the official archive of PHP 5.4.0, uncompress it, and of course put everything under a VCS (Git, Mercurial, Subversion, etc).

Here is an example with Mercurial (hg).

% wget http://fr.php.net/distributions/php-5.4.0.tar.bz2
% tar xf php-5.4.0.tar.bz2
% cd php-5.4.0
% hg init
% hg addremove
% hg commit -m "PHP 5.4.0 vanilla"

There are drawbacks to this. Mainly, you'll have to submit a patch instead of a pull request, and you'll have a hard time synchronizing your repository with the upstream changes.

The development environment

Configure, compile and test

Before changing anything, let's check the vanilla code works! I compiled PHP with the developer flags on. I had to force the extension ext/mbstring to load.

% ./buildconf
% ./configure \
--enable-debug \
--enable-maintainer-zts \
--enable-mbstring
% make -j5
% make -j5 test

The first line ("buildconf") is not necessary if you uncompressed an archive instead of cloning the repository. Of course, you'll nee a complete build system (autoconf, GCC, etc).

Compiling was done in less than a minute, while running the thousands of tests took 5 minutes. If all goes well, you end up with a summary like:

=====================================================================
TEST RESULT SUMMARY
---------------------------------------------------------------------
Exts skipped    :   50
Exts tested     :   27
---------------------------------------------------------------------

Number of tests : 12013              8267
Tests skipped   : 3746 ( 31.2%) --------
Tests warned    :    0 (  0.0%) (  0.0%)
Tests failed    :    0 (  0.0%) (  0.0%)
Expected fail   :   34 (  0.3%) (  0.4%)
Tests passed    : 8233 ( 68.5%) ( 99.6%)
---------------------------------------------------------------------
Time taken      :  261 seconds
=====================================================================

If PHP is already installed in your system, the CLI interpreter used for this test is not the system one, but ./sapi/cli/php.

Unfortunately, I had tests that failed, and this makes tracking regressions a difficult task. Here is the summary of the failures on my Debian testing OS.

Source --enable-maintainer-zts Failed tests
PHP 5.4.0 archive No 0
PHP 5.4.0 archive Yes 3
PHP-5.4 git branch No 5
PHP-5.4 git branch Yes 7

Working in a local git branch

I'll put my commits in a separate branch, so before changing anything to the source code:

% git checkout -b bug_40531
Switched to a new branch 'bug_40531'

Add custom tests

At first, I wrote my own test in the new file ext/mbstring/tests/bug40531.phpt:

--TEST--
Bug #40531 (mb_substr optional parameters)
--SKIPIF--
<?php extension_loaded('mbstring') or die('skip mbstring not available'); ?>
--FILE--
<?php
echo mb_substr('foobar', 3, null) . "\n";
?>
--EXPECT--
bar

Then I discovered I didn't need it: there was already a test for this bug, but it was disabled (skipped while waiting for a fix). So I just removed the two last lines of ext/mbstring/tests/mb_str_functions_opt-parameter.phpt:

--XFAIL--
mb functions fail to allow null instead of actual value

Now I ran only the tests that deal with the mbstring extension, and one of the fails... let's fix this!

% TESTS="ext/mbstring/tests" make test

But first, let's commit early:

% git commit ext/mbstring/tests/mb_str_functions_opt-parameter.phpt

Fixing the bug #40531

Why it was not easy (null and integers)

To find where the function mb_substr() is defined, I use the excellent ack program. The seaech is restricted to the code source of the extension "mbstring".

ack -Q 'PHP_FUNCTION(mb_substr)' ext/mbstring

Here is the interesting part of the code in ext/mbstring/mbstring.c:

/* {{{ proto string mb_substr(string str, int start [, int length [, string encoding]])
   Returns part of a string */
PHP_FUNCTION(mb_substr)
{
    size_t argc = ZEND_NUM_ARGS();
    char *str, *encoding;
    long from, len;
    int mblen, str_len, encoding_len;
    mbfl_string string, result, *ret;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "sl|ls", &str, &str_len, &from, &len, &encoding, &encoding_len) == FAILURE) {
        return;
    }

    // ... skip 16 lines of code ...

    if (argc < 3) {
        len = str_len;
    }

Quite easy to understand. The C function zend_parse_parameters() reads the arguments according to a template of their types. Here "sl|ls" means: a string, an integer (long), and eventually another integer, and another string. The last if sets the default value of len if this optional parameter wasn't given. Now I just have to change these last lines to:

    if (argc < 3 || len == NULL) { // a one line patch
        len = str_len;
    }

Now, the test that was skipped passes... But a test that succeeded fails!

This patch changed the behaviour of the function mb_substr(). C has static typing where NULL is the name of the empty pointer, and it's value is 0. This is why gcc showed some warnings when compiling: (len == NULL) was comparing a "long" integer with a "void*" pointer. So gcc was typecasting NULL to a long integer, that means it compared len with 0.

In PHP, these three lines should behave differently:

echo mb_substr("foobar", 3);       // "bar" expected
echo mb_substr("foobar", 3, 0);    // ""    expected
echo mb_substr("foobar", 3, null); // "bar" expected

Fortunately, PHP does differentiate null from 0. But a long variable in C cannot. So we can't read the PHP int|null parameter as a simple C long integer.

Handling a zval

I knew that PHP does differentiate null from 0, so there had to be way to access this information from within C. And the way to do this is to read the internal structure of a PHP data, which is called zval. See the official documentation for an explanation of zvals.

Instead of reading the final long value of the parameter, we need to read the raw zval value, and convert it into a long if it isn't null. The patch isn't a one-liner anymore, but its still simple:

/* {{{ proto string mb_substr(string str, int start [, int length [, string encoding]])
   Returns part of a string */
PHP_FUNCTION(mb_substr)
{
    size_t argc = ZEND_NUM_ARGS();
    char *str, *encoding;
    long from, len;
// declare a zval variable
    zval *zlen = NULL;
    int mblen, str_len, encoding_len;
    mbfl_string string, result, *ret;

// replace len by zlen, and its type "l" by "z"
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "sl|zs", &str, &str_len, &from, &zlen, &encoding, &encoding_len) == FAILURE) {
        return;
    }

    // ... skip 16 lines of code ...

// add a check for a null PHP parameter (not an 0 integer)
    if (argc < 3 || ZVAL_IS_NULL(zlen)) {
        len = str_len;
    } else {
        len = Z_LVAL_P(zlen);
    }

That's it! Now all the tests pass!

zval with an integer prototype

Alas, there is a hidden flaw. What happens if $len isn't a proper integer?

% ./sapi/cli/php -r 'echo mb_substr("foobar", 3, "3 ") . "\n";'

When zend_parse_parameters() knew that we expected an integer, it could handle the type checking, but know that we just read a zval, it silently accepts any value. Of course, some manual checks could be added inside mb_substr(), but it would lead to copying the code of zend_parse_arg_impl() that is called by zend_parse_parameters, and that's not admissible. Calling this function would be a solution, but it's not that easy, and a bit dirty.

A clean solution is to change to prototype of mb_substr(). Each PHP function has a prototype that checks the validity of the calls. This check is done before executing the function called.

The current prototype doesn't check anything, everything is left to the future call of zend_parse_parameters.

ZEND_BEGIN_ARG_INFO_EX(arginfo_mb_substr, 0, 0, 2)
    ZEND_ARG_INFO(0, str)
    ZEND_ARG_INFO(0, start)
    ZEND_ARG_INFO(0, length)
    ZEND_ARG_INFO(0, encoding)
ZEND_END_ARG_INFO()

By reading the code where the macro ZEND_ARG_INFO is defined, we see that a macro ZEND_ARG_TYPE_INFO exists.

#define ZEND_ARG_TYPE_INFO(pass_by_ref, name, type_hint, allow_null) ...

With a nice allow_null parameter, just what we need! So the prototype becomes:

ZEND_BEGIN_ARG_INFO_EX(arginfo_mb_substr, 0, 0, 2)
    ZEND_ARG_INFO(0, str)
    ZEND_ARG_INFO(0, start)
    ZEND_ARG_TYPE_INFO(0, length, IS_LONG, 1)
    ZEND_ARG_INFO(0, encoding)
ZEND_END_ARG_INFO()

Now let's compile and test...cornegidouille, ça plante !

% ./sapi/cli/php -r 'echo mb_substr("foobar", 3, "3 ") . "\n";'

Fatal error: Unknown typehint in Command line code on line 1

Tracing back what this ZEND_ARG_TYPE_INFO does, I found this code, inside a function called zend_verify_arg_type:

switch(cur_arg_info->type_hint) {
    case IS_ARRAY:
        // ...
        break;
    case IS_CALLABLE:
        // ...
        break;
    default:
        zend_error(E_ERROR, "Unknown typehint");
}

So the only types that can be checked are "array" and "callback". I guess there wasn't a need for more, since checking the "integer" type was done outside the prototype. In fact, this macro ZEND_ARG_TYPE_INFO is never used anywhere in the code!

Let's add the IS_LONG type to this switch:

    case IS_LONG:
        if (!arg) {
            return zend_verify_arg_error(E_RECOVERABLE_ERROR, zf, arg_num, "be of the type long", "", "none", "" TSRMLS_CC);
        }

        if (Z_TYPE_P(arg) != IS_LONG && (Z_TYPE_P(arg) != IS_NULL || !cur_arg_info->allow_null)
                && (Z_TYPE_P(arg) != IS_STRING || is_numeric_string(Z_STRVAL_P(arg), Z_STRLEN_P(arg), NULL, NULL, 0)) == 0) {
            return zend_verify_arg_error(E_RECOVERABLE_ERROR, zf, arg_num, "be of the type long", "", zend_zval_type_name(arg), "" TSRMLS_CC);
        }
        break;

Compile, test... OK!

The side effect

The new function mb_substr() behaves differently:

% php -r 'echo mb_substr("foobar", 3, "3 ") . "\n";'
PHP Notice:  A non well formed numeric value encountered in Command line code on line 1
PHP Stack trace:
PHP   1. {main}() Command line code:0
PHP   2. mb_substr('foobar', 3, '3 ') Command line code:1
bar
% ./sapi/cli/php -r 'echo mb_substr("foobar", 3, "3 ") . "\n";'

Catchable fatal error: Argument 3 passed to mb_substr() must be of the type long, string given in Command line code on line 1

It is more strict than his ancestor, and throws a fatal error when the third parameter is wrong. Is that a bad thing ? I don't think so. Sure, it breaks backward compatibility, but I can't imagine a single case where using a non-numeric string as the length parameter wouldn't be a bug. IMHO, this patch only breaks code that is already broken.

Propagating the fix

Once the work is finished in the bug branch, it must be merged into the main branch.

% git checkout PHP-5.4
M   .gitignore
Switched to branch 'PHP-5.4'
% git merge bug_40531
Updating 260e777..027284f
Fast-forward
 Zend/zend_execute.c                                |   11 +++++++++++
 ext/mbstring/mbstring.c                            |   18 ++++++++++++------
 .../tests/mb_str_functions_opt-parameter.phpt      |    2 --
 3 files changed, 23 insertions(+), 8 deletions(-)
% git branch -d bug_40531
Deleted branch bug_40531 (was 027284f).
% git push

Now I use the web interface of Github to send a pull request upstream, so that this fix gets incorporated into the next PHP releases.

Conclusion

My first contact with the source code of PHP was a real promenade. Knowing the internals of the variable and functions feels good, and in the future I won't hesitate to read the code when the documentation page of a function isn't clear or exhaustive.

There are still a few points that could be enhanced for newcomers like me:

  • The documentation on PHP's internals (php.net's PHP at the Core is far from being complete.
  • The unit tests fail too often.

Yet this travel into PHP's source code was a pleasant experience.

Something went wrong with that request. Please try again.