unicode characters? #531

voku · 2018-09-04T09:12:42Z

Hi, while investigating spatie/7to5#40 I tried to figure out what goes wrong and found that the "String_" class breaks the Unicode characters. Why does this happen, and how can I disable this behavior? Thanks!

lib/PhpParser/Node/Scalar/String_.php:

...
return chr(hexdec($str));
...

-->

  private static $BOM = [
      "\xef\xbb\xbf"     => 3, // UTF-8 BOM
      'ï»¿'              => 6, // UTF-8 BOM as "WINDOWS-1252" (one char has [maybe] more than one byte ...)
      "\x00\x00\xfe\xff" => 4, // UTF-32 (BE) BOM
      '  þÿ'             => 6, // UTF-32 (BE) BOM as "WINDOWS-1252"
      "\xff\xfe\x00\x00" => 4, // UTF-32 (LE) BOM
      'ÿþ  '             => 6, // UTF-32 (LE) BOM as "WINDOWS-1252"
      "\xfe\xff"         => 2, // UTF-16 (BE) BOM
      'þÿ'               => 4, // UTF-16 (BE) BOM as "WINDOWS-1252"
      "\xff\xfe"         => 2, // UTF-16 (LE) BOM
      'ÿþ'               => 4, // UTF-16 (LE) BOM as "WINDOWS-1252"
  ];

... becomes ...

    private static $BOM = [
        "" => 3,
        // UTF-8 BOM
        'ï»¿' => 6,
        // UTF-8 BOM as "WINDOWS-1252" (one char has [maybe] more than one byte ...)
        "\0\0��" => 4,
        // UTF-32 (BE) BOM
        '  þÿ' => 6,
        // UTF-32 (BE) BOM as "WINDOWS-1252"
        "��\0\0" => 4,
        // UTF-32 (LE) BOM
        'ÿþ  ' => 6,
        // UTF-32 (LE) BOM as "WINDOWS-1252"
        "��" => 2,
        // UTF-16 (BE) BOM
        'þÿ' => 4,
        // UTF-16 (BE) BOM as "WINDOWS-1252"
        "��" => 2,
        // UTF-16 (LE) BOM
        'ÿþ' => 4,
    ];

nikic · 2018-09-06T01:24:18Z

The precise formatting of string literals is not preserved; only their content (with escape sequences resolved) is. To preserve the original source code formatting (including string literals), the formatting-preserving pretty printer can be used: https://github.com/nikic/PHP-Parser/blob/master/doc/component/Pretty_printing.markdown#formatting-preserving-pretty-printing

nikic · 2018-09-06T01:25:48Z

Just to be clear, the generated code still has exactly the same meaning -- none of the characters will be corrupted, they're just no longer written using hex escape sequences. Though it may be that the code is corrupted after being modified by a text editor which replaces broken UTF-8.
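
To illustrate the point about identical meaning, here is a minimal, self-contained sketch (not taken from PHP-Parser itself) showing that a hex escape sequence and its resolved bytes are the same string in PHP. The helper name toHexEscapes is hypothetical and exists only to make the raw bytes visible again for inspection.

```php
<?php
// The escaped form, as written in the original source (UTF-8 BOM).
$escaped = "\xef\xbb\xbf";

// The same three bytes built one at a time, mirroring the
// chr(hexdec(...)) call quoted from String_.php in the report.
$resolved = chr(hexdec('ef')) . chr(hexdec('bb')) . chr(hexdec('bf'));

var_dump($escaped === $resolved); // bool(true)
var_dump(bin2hex($resolved));     // string(6) "efbbbf"

// Hypothetical helper (an assumption for illustration, not a PHP-Parser API):
// rewrites control bytes and bytes >= 0x7f as \xNN so they can be inspected.
function toHexEscapes(string $bytes): string
{
    return preg_replace_callback(
        '/[\x00-\x1f\x7f-\xff]/',
        static function (array $m): string {
            return sprintf('\\x%02x', ord($m[0]));
        },
        $bytes
    );
}

echo toHexEscapes("\x00\x00\xfe\xff"), "\n"; // prints \x00\x00\xfe\xff
```

The strings compare equal byte for byte, so the printed file behaves the same as long as nothing re-encodes it. Only the spelling inside the quotes differs.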

nikic · 2018-09-06T01:28:36Z

An alternative to the full formatting-preserving printer is to use the lexer's attribute handling (https://github.com/nikic/PHP-Parser/blob/master/doc/component/Lexer.markdown#attribute-handling) and extend the pretty printer to use the originalValue attribute. Especially if you know that none of your transformations modify string contents, this would be a quick fix.

voku · 2018-09-07T02:36:55Z

Big thanks for the help and for your work, it works pretty well (voku/7to5@7a62d12) :) but now I have a different problem...

... after converting a class from PHP 7 to PHP 5, the "if" conditions behaved differently, so I tried to fix it with this "hack":

-> #532

voku added a commit to voku/7to5 that referenced this issue Sep 7, 2018

[+]: keep unicode encoding

7a62d12

-> thanks @nikic -> nikic/PHP-Parser#531 -> wait for fix -> nikic/PHP-Parser#532

voku closed this as completed Sep 7, 2018
