fix unicode content parsing #415

neomoto · 2013-04-16T19:40:24Z

fixed exception in parsing symbols like em dash or cyrillic letters.
Tests are included

fixed exception in parsing symbols like em dash or cyrillic letters. Tests are included

leafo · 2013-04-16T20:53:04Z

Your test compiles fine without applying the patch.

neomoto · 2013-04-17T06:37:20Z

Sorry, I forgot explain when it happens.
If you have string overloading with this settings:
mbstring.internal_encoding = UTF-8
mbstring.func_overload = 2

then you have exception on "parse error: failed at `content: '—';".
Specifically, this row
mbstring.internal_encoding = UTF-8
causes parse error. With only
mbstring.func_overload = 2
parsing goes well.

robocoder · 2013-04-17T15:51:53Z

The mbstring.func_overload ini setting can't be changed at run-time; also, I've run into systems where the mbstring extension isn't enabled.

How about prefixing strlen() and substr() calls with '_', and then adding these helper functions?

// mb_orig_strlen() and mb_orig_substr() only exist when mbstring.func_overload & 2 is true and the mbstring extension is loaded
if (function_exists('mb_orig_strlen')) {
    function _strlen($s) {
        return mb_orig_strlen($s);
    }

    function _substr() {
        return call_user_func_array('mb_orig_substr', func_get_args());
    }
} else {
    function _strlen($s) {
        return strlen($s);
    }

    function _substr() {
        return call_user_func_array('substr', func_get_args());
    }
}

leafo · 2013-04-17T17:21:42Z

The language itself doesn't use unicode so the parser doesn't need to be aware of it (no unicode keywords or anything). The buffer can be treated as a string of 8 bit chars and still have unicode characters pass through it fine. I'm guessing the problem is that mbstring.func_overload causes the string functions to work differently than the $buffer[x] character accessor which causes a mismatch and the parser to fail?

Does changing the ini setting with ini_set just for the parse work? If that doesn't work then another option is to replace all instances of $this->buffer[] with substr($this->buffer, ...). There's probably a noticeable performance hit to this change though.

neomoto · 2013-04-17T18:28:15Z

Does changing the ini setting with ini_set just for the parse work?

Yes, you are right.. Changing compile method on this

    public function compile($string, $name = null) {
        $locale = setlocale(LC_NUMERIC, 0);
        setlocale(LC_NUMERIC, "C");

        $internal_encoding = NULL;
        // only if mbstring installed
        if (function_exists('mb_orig_strlen')) {
            $internal_encoding = ini_get('mbstring.internal_encoding');
            if(!is_null($internal_encoding)){
                ini_set('mbstring.internal_encoding', NULL);
            }
        }
        $this->parser = $this->makeParser($name);
        $root = $this->parser->parse($string);

        $this->env = null;
        $this->scope = null;

        $this->formatter = $this->newFormatter();

        if (!empty($this->registeredVars)) {
            $this->injectVariables($this->registeredVars);
        }

        $this->sourceParser = $this->parser; // used for error messages
        $this->compileBlock($root);

        ob_start();
        $this->formatter->block($this->scope);
        $out = ob_get_clean();
        setlocale(LC_NUMERIC, $locale);
        if(!is_null($internal_encoding)){
            ini_set('mbstring.internal_encoding', $internal_encoding);
        }
        return $out;
    }

helps and parsing is going well even if
mbstring.internal_encoding = UTF-8
is set in php.ini or .htaccess

Parse error appears only if you had set mbstring.internal_encoding UTF-8

unnamed777 · 2013-12-09T15:43:20Z

Thanks for fix.

neomoto added 2 commits April 16, 2013 23:36

Added correct parsing of unicode symbols

5984fef

fixed exception in parsing symbols like em dash or cyrillic letters. Tests are included

removed typo

0a9b3c9

more smart way for fixing utf-8 files parse error

b8c0fbb

Parse error appears only if you had set mbstring.internal_encoding UTF-8

Krinkle added the pr: bug-fix label Oct 15, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix unicode content parsing #415

fix unicode content parsing #415

neomoto commented Apr 16, 2013

leafo commented Apr 16, 2013

neomoto commented Apr 17, 2013

robocoder commented Apr 17, 2013

leafo commented Apr 17, 2013

neomoto commented Apr 17, 2013

unnamed777 commented Dec 9, 2013

fix unicode content parsing #415

Are you sure you want to change the base?

fix unicode content parsing #415

Conversation

neomoto commented Apr 16, 2013

leafo commented Apr 16, 2013

neomoto commented Apr 17, 2013

robocoder commented Apr 17, 2013

leafo commented Apr 17, 2013

neomoto commented Apr 17, 2013

unnamed777 commented Dec 9, 2013