Need help with JoliTypo and encoding errors #7

Closed
borisschapira opened this issue Jun 9, 2014 · 25 comments

@borisschapira

Hi, I've been trying to use JoliTypo for personal use on http://borisschapira.com/ but it causes encoding errors on accented characters. Here is an example of what I pass to the JoliTypo fixer (with the encoding detected via mb_detect_encoding) and what JoliTypo returns:

Mentions Légales (UTF-8)
Mentions Légales (ASCII) 

Here is my (pretty simple) code :

```php
function typofr($text)
{
    // Build the fixer once and reuse it across calls.
    static $fixer;
    if (!isset($fixer)) {
        $fixer = new Fixer(array(
            'Trademark'));
        $fixer->setLocale('fr_FR');
    }
    $fixed = $fixer->fix($text);
    // Debug: return the original text and log both versions with their detected encodings.
    return $text."<script>console && console.log('-------');console && console.log('".$text." (".mb_detect_encoding($text).")'); console && console.log('".$fixed." (".mb_detect_encoding($fixed).")')</script>";
}
```

And you can temporarily see the result here, in the console: http://borisschapira.com/

@damienalexandre
Member

Hey dude 👍

It looks like an encoding issue. Each time the detected encoding is UTF-8, the accent é becomes &Atilde;&copy;, which is a funny one!

Those entities are these two characters: Ã and ©.

This is the classic é combo. Can you try adding something like this?

$fixed = $fixer->fix(utf8_encode(utf8_decode($text)));

I think your content is not UTF-8. Let me know ASAP 😉 I have time today to help.

@borisschapira
Author

Fixing the decoded-then-re-encoded string did not work, but fixing the decoded string does:

Original : Mentions Légales (UTF-8)
Decoded : Mentions L�gales (UTF-8)
Re-encoded :  (ASCII)
Fix(Decoded) : Mentions L&eacute;gales (ASCII)
Fix(Re-encoded) : Mentions L&Atilde;&copy;gales (ASCII) 

Code:

$decoded = utf8_decode($text);
$reEncoded = utf8_encode($decoded);

$fixed = $fixer->fix($decoded);
$fixedWrong = $fixer->fix($reEncoded);

$logs = "<script> console && console.log('-------\\nOriginal : ".$text." (".mb_detect_encoding($text).")\\nDecoded : ".$decoded." (".mb_detect_encoding($decoded).")\\nRe-encoded : ".$reEncoded." (".mb_detect_encoding($reEncoded).")\\nFix(Decoded) : ".$fixed." (".mb_detect_encoding($fixed).")\\nFix(Re-encoded) : ".$fixedWrong." (".mb_detect_encoding($fixedWrong).")')</script>";

Is it possible that my content is UTF-8 but JoliTypo works only with ISO-8859-1?
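
A side note on the detection itself: mb_detect_encoding only guesses, so a strict validity check can be more telling when debugging. The following is a minimal sketch, assuming mb_check_encoding is acceptable for that purpose; the helper name describeEncoding is hypothetical and not part of JoliTypo.

```php
<?php
// Hypothetical debugging helper, not part of JoliTypo.
function describeEncoding(string $text): string
{
    if (mb_check_encoding($text, 'ASCII')) {
        return 'ASCII';           // only 7-bit bytes, e.g. entity-encoded output
    }
    if (mb_check_encoding($text, 'UTF-8')) {
        return 'valid UTF-8';     // says nothing about double encoding
    }

    return 'not valid UTF-8';     // e.g. raw ISO-8859-1 bytes
}

$utf8    = "Mentions L\u{E9}gales";                            // é stored as C3 A9
$latin1  = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');  // é stored as the single byte E9
$doubled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');  // UTF-8 bytes encoded a second time

echo describeEncoding('Mentions L&eacute;gales'), "\n"; // ASCII
echo describeEncoding($utf8), "\n";                     // valid UTF-8
echo describeEncoding($latin1), "\n";                   // not valid UTF-8
echo describeEncoding($doubled), "\n";                  // valid UTF-8
```

A double-encoded string still passes as valid UTF-8, and entity-encoded output is plain ASCII, so the detected encoding alone cannot reveal where the bytes went wrong.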

@damienalexandre
Member

Nope, JoliTypo forces UTF-8 at all stages.
The é sequence is typically a double-encoded é. I was able to reproduce this case like this:

var_dump("Mentions Légales"); // string(17) "Mentions Légales"
var_dump(utf8_decode("Mentions Légales")); // string(16) "Mentions L�gales"
var_dump(utf8_encode("Mentions Légales")); // string(19) "Mentions Légales"

The first string is already UTF-8. In the second example I decode it, but since I'm displaying it in a UTF-8 console, the é fails to display. In the third example I UTF-8 encode the already UTF-8 string, and é appears.

I just added a test to JoliTypo to test this:

$isoString = mb_convert_encoding("Mentions Légales", "ISO-8859-1", "UTF-8");
$this->assertEquals("Mentions L&eacute;gales", $fixer->fix(utf8_encode($isoString)));
$this->assertEquals("Mentions L&Atilde;&copy;gales", $fixer->fix(utf8_encode(utf8_encode($isoString))));
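
To make the double-encoding effect visible without depending on a terminal's encoding, the same experiment can be run at the byte level. This is only a sketch: it uses mb_convert_encoding for the ISO-8859-1 conversions instead of utf8_encode and utf8_decode (which are deprecated in recent PHP versions), and the variable names are illustrative.

```php
<?php
$utf8 = "L\u{E9}gales"; // é stored as the two bytes C3 A9

// Treating UTF-8 bytes as ISO-8859-1 and encoding them again doubles the encoding.
$doubled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";                          // 4cc3a967616c6573
echo bin2hex($doubled), "\n";                       // 4cc383c2a967616c6573
echo strlen($utf8), ' -> ', strlen($doubled), "\n"; // 7 -> 9

// One decoding pass removes exactly one layer of encoding.
$repaired = mb_convert_encoding($doubled, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8);                      // bool(true)
```

The two extra bytes are the same growth seen above, where the 17-byte string became 19 bytes after one superfluous encoding pass.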

@borisschapira
Author

OK, so if I understand correctly, my content is double-encoded in UTF-8. Good to know...

@borisschapira
Author

I really do not understand. For example:

// in a UTF-8-encoded PHP file
$fixer->fix('λ');

returns

&Icirc;&raquo;

I must be missing something...

@damienalexandre
Member

$fixer = new Fixer(array('Trademark'));
var_dump($fixer->fix('λ'));

This returns &lambda; on my computer.

$fixer->fix(utf8_encode('λ'));

returns what you got: &Icirc;&raquo;.

So let's try to understand:

  • which file editor do you use?
  • do you have mbstring? (php -m | grep mb)
  • which version of PHP?

Thx!
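
Why λ turns into exactly &Icirc;&raquo; can be checked byte by byte. A minimal sketch, in which htmlentities only stands in for whatever produces the entities inside JoliTypo (an assumption made for illustration):

```php
<?php
$lambda = "\u{3BB}";         // λ, stored as the bytes CE BB in UTF-8
echo bin2hex($lambda), "\n"; // cebb

// Reading those two bytes as ISO-8859-1 yields Î (CE) and » (BB).
$misread = mb_convert_encoding($lambda, 'UTF-8', 'ISO-8859-1');

echo htmlentities($lambda, ENT_QUOTES | ENT_HTML401, 'UTF-8'), "\n";  // &lambda;
echo htmlentities($misread, ENT_QUOTES | ENT_HTML401, 'UTF-8'), "\n"; // &Icirc;&raquo;
```

So getting &Icirc;&raquo; means that something in the chain read the UTF-8 bytes as ISO-8859-1.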

@borisschapira
Author

  • PhpStorm, with UTF-8 as my file encoding
  • mbstring is activated
  • PHP 5.5.6 (built: Nov 20 2013 15:43:46)

@borisschapira
Author

FYI, here is my fixer:

$fixers = [
    "Ellipsis",
    "Dimension",
    "Dash",
    "FrenchQuotes",
    "FrenchNoBreakSpace",
    "CurlyQuote",
    "Hyphen",
    "Trademark"];
$fixer = new Fixer($fixers);
$fixer->setLocale('fr_FR');

@damienalexandre
Member

I'm testing this exact snippet (you can download the file here) and it does work as expected:

$ [JoliTypo] php test.php
string(8) "&lambda;"
&lambda;

I have PHP 5.5.3-1ubuntu2.3 with mbstring & Xdebug. Can you try from the command line?

@mdarse
Contributor

mdarse commented Jun 16, 2014

FYI, it works correctly on my machine:

$ php test.php 
string(8) "&lambda;"
&lambda;

Tested on PHP 5.5.11 with Xdebug v2.2.3.
And on HHVM too ;)

$ hhvm test.php 
string(8) "&lambda;"
&lambda;

Tested on HipHop VM 3.0.1 (rel).

@borisschapira
Author

In the AlwaysData web-based SSH:

borisschapira@ssh:~/www/wp/wp-content/plugins/typofr$ php test.php
string(14) "&Icirc;&raquo;"
&Icirc;&raquo;
borisschapira@ssh:~/www/wp/wp-content/plugins/typofr$ php --version
PHP 5.5.6 (cli) (built: Nov 20 2013 15:43:46)
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2013 Zend Technologies

@borisschapira
Author

Maybe it's something related to the mbstring configuration:

mbstring

Multibyte Support => enabled
Multibyte string engine => libmbfl
HTTP input encoding translation => disabled
libmbfl version => 1.3.2

mbstring extension makes use of "streamable kanji code filter and converter", which is distributed under the GNU Lesser General Public License version 2.1.

Multibyte (japanese) regex support => enabled
Multibyte regex (oniguruma) backtrack check => On
Multibyte regex (oniguruma) version => 5.9.2

Directive => Local Value => Master Value
mbstring.detect_order => no value => no value
mbstring.encoding_translation => Off => Off
mbstring.func_overload => 0 => 0
mbstring.http_input => pass => pass
mbstring.http_output => pass => pass
mbstring.http_output_conv_mimetypes => ^(text/|application/xhtml\+xml) => ^(text/|application/xhtml\+xml)
mbstring.internal_encoding => no value => no value
mbstring.language => neutral => neutral
mbstring.strict_detection => Off => Off
mbstring.substitute_character => no value => no value

@damienalexandre
Member

I have the exact same mbstring configuration.

Can you also dump your libxml version? (It is heavily used in the lib via DOMDocument.)

libxml

libXML support => active
libXML Compiled Version => 2.9.1
libXML Loaded Version => 20901
libXML streams => enabled

Also your "dom" section?

dom

DOM/XML => enabled
DOM/XML API Version => 20031129
libxml Version => 2.9.1
HTML Support => enabled
XPath Support => enabled
XPointer Support => enabled
Schema Support => enabled
RelaxNG Support => enabled

Thx!
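
For anyone comparing environments the same way, the relevant values can also be printed from a small script rather than copied out of phpinfo(). A minimal sketch using only built-in constants and functions; the labels are illustrative:

```php
<?php
// Print the versions that are being compared in this thread.
printf("PHP:             %s\n", PHP_VERSION);
printf("mbstring loaded: %s\n", extension_loaded('mbstring') ? 'yes' : 'no');
printf("dom loaded:      %s\n", extension_loaded('dom') ? 'yes' : 'no');
printf("libxml:          %s\n", LIBXML_DOTTED_VERSION);
```

Running it with the same PHP binary that serves the site matters, since the CLI and the web server can be built against different libraries.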

@borisschapira
Author

libxml

libXML support => active
libXML Compiled Version => 2.6.32
libXML Loaded Version => 20632
libXML streams => enabled

dom

DOM/XML => enabled
DOM/XML API Version => 20031129
libxml Version => 2.6.32
HTML Support => enabled
XPath Support => enabled
XPointer Support => enabled
Schema Support => enabled
RelaxNG Support => enabled

@damienalexandre
Member

That looks like an old release of libxml2 (April 2008), so let's dig into this. Can you try running this code on your server? (It's an extract of how JoliTypo uses DOMDocument):

http://3v4l.org/v79Ok

@borisschapira
Author

Yep, that's it:

string(14) "&Icirc;&raquo;"

damienalexandre added a commit that referenced this issue Jun 16, 2014
Can't be sure if the bug is fixed in 2.7.0.
@borisschapira
Author

I've found a workaround on http://php.net/manual/en/domdocument.loadhtml.php.
It seems to work with an altered version of the simple test above: http://3v4l.org/HnhdC

Does it work with your other test cases?
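
For readers who cannot open the link, here is a sketch of one kind of workaround for DOMDocument::loadHTML() and UTF-8 input. The thread does not say which workaround was actually picked, so this is an assumption: it declares the charset in a meta tag so that libxml parses the bytes as UTF-8. The function name loadUtf8Fragment is hypothetical.

```php
<?php
// Sketch only: the thread does not say which workaround was picked.
// Assumption: a meta tag declaring the charset makes libxml parse the bytes as UTF-8.
function loadUtf8Fragment(string $fragment): DOMDocument
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML(
        '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        . '</head><body>' . $fragment . '</body></html>'
    );
    libxml_clear_errors();

    return $dom;
}

$dom  = loadUtf8Fragment("<p>\u{3BB}</p>");
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo bin2hex($text), "\n"; // cebb
```

If the two bytes CE BB come back unchanged, the parser treated the input as UTF-8 rather than ISO-8859-1.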

@damienalexandre
Member

This fix is interesting and it does not break any of the tests - I will look more closely and push a new version of JoliTypo 👍

Thx!

@borisschapira
Author

Let's hope it doesn't!

@damienalexandre
Member

Can you try the new branch on your server?

composer require jolicode/jolitypo dev-contentEncoding

@borisschapira
Author

The test is OK now, so I've tried to apply it to my whole website.
The following error occurred:

Warning: DOMDocument::loadHTML(): Empty string supplied as input in /home/borisschapira/www/wp/wp-content/plugins/typofr/vendor/jolicode/jolitypo/src/JoliTypo/Fixer.php on line 209

Seems that a test on the non-emptiness of the string is needed.
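
Such a check could look like the sketch below. It is not the fix that ended up in JoliTypo: fixUnlessEmpty is a hypothetical wrapper, and $fix stands in for a call such as [$fixer, 'fix'].

```php
<?php
// Hypothetical wrapper; $fix stands in for a call such as [$fixer, 'fix'].
function fixUnlessEmpty(callable $fix, string $content): string
{
    // Nothing to parse: return the input untouched instead of calling loadHTML('').
    if (trim($content) === '') {
        return $content;
    }

    return $fix($content);
}

// strtoupper is used here only as a stand-in for the real fixer.
echo fixUnlessEmpty('strtoupper', 'abc'), "\n"; // ABC
var_dump(fixUnlessEmpty('strtoupper', ''));     // string(0) ""
```

Treating whitespace-only input as empty is a design choice of this sketch; a stricter version could compare against '' only.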

@damienalexandre
Member

This error is now fixed (d5be8be),
you can try again \o/

composer update jolicode/jolitypo

@borisschapira
Author

Everything's fine: λ is handled and there is no longer any need for a UTF-8 decode before the fix.
All of http://borisschapira.com/ is running with JoliTypo and my CasperJS content test does not detect any error. GG!

@damienalexandre
Member

Great. I'm going to test on some websites too and tag a new release soon.

Thanks for your time ;-)

@borisschapira
Author

Thanks for yours!
