unicode characters? #531

Closed
voku opened this issue Sep 4, 2018 · 4 comments

voku (Contributor) commented Sep 4, 2018

Hi, while investigating spatie/7to5#40 I tried to figure out what goes wrong, and I found that the "String_" class breaks the Unicode characters. Why does it do that, and how can I disable this behavior? Thanks!

lib/PhpParser/Node/Scalar/String_.php:

...
return chr(hexdec($str));
...

-->

  private static $BOM = [
      "\xef\xbb\xbf"     => 3, // UTF-8 BOM
      'ï»¿'              => 6, // UTF-8 BOM as "WINDOWS-1252" (one char has [maybe] more than one byte ...)
      "\x00\x00\xfe\xff" => 4, // UTF-32 (BE) BOM
      '  þÿ'             => 6, // UTF-32 (BE) BOM as "WINDOWS-1252"
      "\xff\xfe\x00\x00" => 4, // UTF-32 (LE) BOM
      'ÿþ  '             => 6, // UTF-32 (LE) BOM as "WINDOWS-1252"
      "\xfe\xff"         => 2, // UTF-16 (BE) BOM
      'þÿ'               => 4, // UTF-16 (BE) BOM as "WINDOWS-1252"
      "\xff\xfe"         => 2, // UTF-16 (LE) BOM
      'ÿþ'               => 4, // UTF-16 (LE) BOM as "WINDOWS-1252"
  ];

... becomes ...

    private static $BOM = [
        "" => 3,
        // UTF-8 BOM
        'ï»¿' => 6,
        // UTF-8 BOM as "WINDOWS-1252" (one char has [maybe] more than one byte ...)
        "\0\0��" => 4,
        // UTF-32 (BE) BOM
        '  þÿ' => 6,
        // UTF-32 (BE) BOM as "WINDOWS-1252"
        "��\0\0" => 4,
        // UTF-32 (LE) BOM
        'ÿþ  ' => 6,
        // UTF-32 (LE) BOM as "WINDOWS-1252"
        "��" => 2,
        // UTF-16 (BE) BOM
        'þÿ' => 4,
        // UTF-16 (BE) BOM as "WINDOWS-1252"
        "��" => 2,
        // UTF-16 (LE) BOM
        'ÿþ' => 4,
    ];

nikic (Owner) commented Sep 6, 2018

The precise formatting of string literals is not preserved; only their content (with escape sequences resolved) is. To preserve the original source formatting, including string literals, you can use formatting-preserving pretty printing: https://github.com/nikic/PHP-Parser/blob/master/doc/component/Pretty_printing.markdown#formatting-preserving-pretty-printing

nikic (Owner) commented Sep 6, 2018

Just to be clear, the generated code still has exactly the same meaning: none of the characters are corrupted, they are just no longer written as hex escape sequences. The file may still end up corrupted later, though, if it is modified by a text editor that replaces broken UTF-8.
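
As a quick check of that claim, here is a minimal sketch using only built-in functions. It compares a hex-escaped literal with the same bytes built through chr(hexdec(...)), the call quoted from String_.php above. The variable names are made up for illustration.

```php
<?php
// The escaped literal as written in the original source.
$escaped = "\xef\xbb\xbf";

// The same three bytes, resolved one escape at a time via chr(hexdec(...)).
$resolved = chr(hexdec('ef')) . chr(hexdec('bb')) . chr(hexdec('bf'));

var_dump($escaped === $resolved); // bool(true)
var_dump(strlen($resolved));      // int(3)
var_dump(bin2hex($resolved));     // string(6) "efbbbf"
```

Both strings hold the same bytes, so a literal printed with raw bytes behaves the same at runtime. Only the on-disk spelling differs, which is where an editor that "repairs" invalid UTF-8 could do damage.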

nikic (Owner) commented Sep 6, 2018

An alternative to the full formatting-preserving printer is to use the lexer's attribute handling (https://github.com/nikic/PHP-Parser/blob/master/doc/component/Lexer.markdown#attribute-handling) and extend the pretty printer to use the originalValue attribute. Especially if you know that none of your transformations modify string contents, this would be a quick fix.
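
A different workaround, not the originalValue approach described above, is to re-escape non-ASCII bytes when the literal is printed. The sketch below is library-free. The helper toHexEscapedLiteral() is hypothetical and not part of PHP-Parser, and where it would be called from (for example a pretty printer subclass) is an assumption.

```php
<?php
/**
 * Hypothetical helper: render a string as a double-quoted PHP literal,
 * writing every byte outside printable ASCII as a two-digit \xNN escape.
 */
function toHexEscapedLiteral(string $value): string
{
    $out = '';
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        $byte = $value[$i];
        $code = ord($byte);
        if ($byte === '\\' || $byte === '"' || $byte === '$') {
            // Characters with special meaning inside double quotes.
            $out .= '\\' . $byte;
        } elseif ($code < 0x20 || $code > 0x7e) {
            // Control characters and all non-ASCII bytes become hex escapes.
            $out .= sprintf('\\x%02x', $code);
        } else {
            $out .= $byte;
        }
    }
    return '"' . $out . '"';
}

echo toHexEscapedLiteral("\xff\xfe\x00\x00"), "\n"; // "\xff\xfe\x00\x00"
```

This escapes every non-ASCII byte, so valid UTF-8 text such as 'þÿ' would also come out as hex escapes. The meaning is unchanged but the original spelling is lost, which reusing originalValue avoids.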

voku added a commit to voku/7to5 that referenced this issue Sep 7, 2018

voku (Contributor, Author) commented Sep 7, 2018

Big thanks for the help and for your work, it works pretty well (voku/7to5@7a62d12) :) but now I have a different problem...

... after converting a class from PHP 7 to PHP 5, the "if" conditions behaved differently, so I tried to fix it with this "hack":

-> #532

voku closed this as completed Sep 7, 2018