Unexpected behavior of 'add-xml-space' setting when used with 'wrap' => 0 and saveBuffer is called twice in tidy-html5 5.6.0 #673
@TysonAndre unfortunately I do not have PHP tidy wrapper (PHP 7.1.14) installed so I can not specifically test there... But testing with command line tidy I can get the same But when I use So it seems the PHP Wrapper code that handles To check the <p>this is a
string</p> I do not think this is anything to do with the option add-xml-space. This only adds an And that PR #645 is only concerning wrapping If I wanted to install and try the PHP 7.1.14 Tidy wrapper, can you give some pointers where I can download this from... especially in Windows... maybe I will get a chance to try it... and if there is a repo for the tidy wrapper code, maybe I could check the handling of that option... So at the moment this seems like a bug in the |
Oh. The PR I linked in tidy-html5's source wasn't even merged yet, as you said. http://windows.php.net/download contains zip files for Windows downloads. Unfortunately, Windows PHP builds bundle their own libtidy with the tidy DLLs, so that won't help you reproduce it. You'd have to build tidy from source. Building packages from source on Windows is difficult. It may be significantly easier to perform the installation steps in a Linux VM (e.g. via VirtualBox). Resources:
Also, I looked at the git history of https://github.com/php/php-src/tree/PHP-7.1/ext/tidy for the PHP 7.1 releases. I don't see any code changes to the tidy extension since much earlier than Jan 1, 2017.
|
I'm not able to reproduce it via the CLI, when checking out the 5.6.0 release tag, or So I'm guessing it's a bug caused when tidy is used as a library in a certain way. (The snippet below behaves as expected.)
~/programming/tidy-html5 ±3a30f6a⚡ » ./tidy --add-xml-space yes --show-body-only yes --wrap 1
<p>this is a string</p>
Info: Document content looks like HTML5
No warnings or errors were found.
<p>
this
is
a
string</p>
» echo '<p>this is a string</p>' | ./tidy --add-xml-space yes --show-body-only yes --wrap 0 --char-encoding utf8
Info: Document content looks like HTML5
No warnings or errors were found.
<p>this is a string</p> |
Commit 86e62db seems like it may be related. It removed multiple calls to AdjustConfig, and php-src/ext/tidy/tidy.c might have been relying on AdjustConfig being called automatically. The part that stands out is that the call to AdjustConfig converts wrap=0 to wrap=0x7fffffff, which is the same as how I manually worked around my issue: https://github.com/htacg/tidy-html5/blob/release/5.6/src/config.c#L1201-L1203
Also, in case you wanted to see what php-src is doing: it seems normal. There are no workarounds whatsoever based on option name (and tidy's declarations of option config data types haven't changed between 5.4 and 5.6 for the wrap config), and php-src's tidy extension isn't adding any options automatically. https://github.com/php/php-src/blob/PHP-7.1.14/ext/tidy/tidy.c#L499-L559 (This is run in a loop over the provided $options.) |
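To make the suspicion about AdjustConfig concrete, here is a toy model in C. It is only a sketch and not libtidy's code: `TOY_NO_WRAP`, `toy_adjust_wrap` and `toy_wrap` are hypothetical names, and the wrapping rule is a deliberate simplification of a pretty printer. The only detail taken from this thread is that wrap=0 is meant to be converted to 0x7fffffff before output is written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constant: the thread only states that wrap=0 becomes 0x7fffffff. */
#define TOY_NO_WRAP 0x7fffffffu

/* Toy normalisation step: a wrap of 0 means "never wrap". */
unsigned toy_adjust_wrap(unsigned wrap)
{
    return wrap == 0 ? TOY_NO_WRAP : wrap;
}

/* Toy wrapper (an assumption, not tidy's algorithm): copies `text` into `out`,
 * replacing a space with a newline whenever the next word would push the
 * current line past `wrap` columns. Words are separated by single spaces. */
void toy_wrap(const char *text, unsigned wrap, char *out, size_t out_size)
{
    assert(text != NULL && out != NULL);
    size_t n = strlen(text);
    assert(out_size > n); /* output is as long as the input, plus terminator */

    size_t column = 0; /* characters already written on the current line */
    for (size_t i = 0; i < n; i++) {
        if (text[i] != ' ') {
            out[i] = text[i];
            column++;
            continue;
        }
        /* Measure the word that follows this space. */
        size_t len = 0;
        while (i + 1 + len < n && text[i + 1 + len] != ' ') {
            len++;
        }
        if (column + 1 + len > wrap) {
            out[i] = '\n'; /* next word does not fit: break the line */
            column = 0;
        } else {
            out[i] = ' ';
            column++;
        }
    }
    out[n] = '\0';
}
```

In this model, calling `toy_wrap("this is a string", 0, buf, sizeof buf)` with the raw value puts every word on its own line, because no word ever fits within 0 columns. Passing `toy_adjust_wrap(0)` instead leaves the text on a single line. That matches the symptom reported here, but whether a skipped normalisation is the actual cause in libtidy is still a guess.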
saveBuffer looks like it gets called twice, and because it's called twice, the output gets the unusual wrapping. This affects 5.6.0 but not 5.4.0.
Details: The implementation of $tidy->parseString() afterwards calls saveToBuffer(), to set the (undocumented) object property $tidy->value.
https://github.com/php/php-src/blob/PHP-7.1.14/ext/tidy/tidy.c#L764-L765 is what happens for
» cat test.php
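<?php
$x = new Tidy();
// parseString() saves the document once internally, which populates the undocumented $x->value.
$x->parseString('<p>this is a string</p>', ['add-xml-space' => true, 'show-body-only' => true, 'wrap' => 0], 'utf8');
// Result of the first save: a single line, as expected.
var_export($x->value);
echo "\n";
// Converting to a string saves again: with 5.6.0 every word ends up on its own line.
echo (string)$x;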
» ~/php-7.1.14-tidy-nts-install/bin/php test.php
'<p>this is a string</p>
'
<p>
this
is
a
string</p> |
And if you plan to test this out in a Linux VM, see https://gist.github.com/TysonAndre/b7be05ab3e10f668b49afdde4a83764e for how I built ~/php-7.1.14-tidy-nts-install |
@TysonAndre I had written the following before you added more about the @TysonAndre as you point out, internally tidy changes wrap length zero to Reading say the Now when PHP So while it looks like Commit 86e62db, which removed some redundant calls to AdjustConfig in It seems I would need to build PHP from source in my native Windows 10 to be able to fully trace what is happening... I did start to set up php-7.1.14 from binaries, I can see even if I got this working, it would not show why if I get the same result as you... which would be frustrating, to say the least... will work on this... I am closing tonight, until tomorrow... Meantime, maybe you will spot something more... thanks... |
@TysonAndre ok, I set up a binary install of PHP 7.1.14 in C:\php... created a Set up a
Added >php test.php # and got output -
<p>this is a string</p>
So no problem! I did try adding Of course if I add I forked the GitHub PHP repo, and cloned my fork intending to build in Windows 10, x64, using MSVC 14 2015, but wow that source does not make it easy to build in native Windows... will leave that for another day... Of course now I have a PHP binary setup could now use earlier, different or later My binary version -
But unfortunately, cannot duplicate the problem... |
@TysonAndre I have now connected my PHP 7.1.14 binary installation, from a Created a
<?php
$x = new Tidy();
$x->parseString('<p>this is a string</p>', ['tidy-mark' => true, 'indent' => true, 'wrap' => 0], 'utf8');
echo (string)$x;
?>
Now browsing that file, through
<html>
<head>
<title></title>
</head>
<body>
<p>
this is a string
</p>
</body>
</html>
Which is also beautiful, BUT where is the
Now in reading the repo So I think it is just a question of the order in which they are applied to the Any PHP chosen defaults should be first, so then that can be overridden by any later user config choices... Any ideas on this? Thanks... |
Also, what's your output of There was no tidy.dll in the zip file distributed by windows.php.net. So I don't expect placing one anywhere to have an effect. (But not that familiar with their setup)
|
@TysonAndre yes I got suspicious right after I sent the above... In Windows, I can now see, That means, as you point out, adding a If ever there was a case for using the shared library, this would be one. Due to where and how Windows searches for a DLL, using the shared library would only need copying the And now that I look further I can see the same sad news using
So aside from a full compile of PHP, and the Maybe I would consider it in Ubuntu Linux, but given that the repo is nearly 500 MB on disk, I presently do not have the space in Anyway, for the moment I am back in the dark |
@TysonAndre, well some good news and some bad I found the space in my Ubuntu linux to build PHP 7.3 from the latest repo... And successfully got the I installed it in
And running
So this is for sure using the latest shared library tidy... The bad news is that I can now see this problem -
So the problem is there in this latest PHP repo source... As stressed I do not think this is a problem in But something in how the library is used... maybe something about And to repeat I do not think it is anything to do with I did struggle to setup a Windows/MSVC14 build using HTH... |
@TysonAndre just because it is now easy, I checked out tidy Now And
And This puts the problem definitely back in And going back to
<html>
<head>
<title>
</title>
</head>
<body>
<p>
this
is
a
string</p>
</body>
</html>
So not only the multi-lined text output, but also several additional Looking at the commits it seems this not only involved 86e62db, but 350f7b4... I can NOT find special comments, other than the commit messages by @balthisar, why these were done... If he gets a chance maybe he could comment now... So I could just try I pulled, checked out branch Now
And now the output of
<html>
<head>
<title></title>
</head>
<body>
<p>this is a string</p>
</body>
</html>
And to be sure
<?php
$x = new Tidy();
$x->parseString('<p>this is a string</p>', ['wrap' => 0], 'utf8');
echo (string)$x;
?>
This does not explain why the seemingly innocuous changes in Unless a better fix is found will consider merging And maybe back porting it, perhaps together with a few other subsequent commits, into release Look forward to further testing and comments... thanks... |
@TysonAndre have now merged PR #705 and hope that closes this... and #704 If I missed something please feel free to re-open, or a new issue... thanks |
PHP's echo tidy_repair_string ('<p>this is a string</p>', ['tidy-mark' => true, 'indent' => true, 'wrap' => 0], 'utf8'); (It seems to me that PHP's Tidy API has quite some glitches, and the documentation needs improvement in this regard.) |
…repair A change released in tidy 5.6.0 breaks php-tidy when using tidy_parse_string+tidy_clean_repair and wrap=0, incorrectly wrapping every single word. Also it seems that $tidy->value should not be used to retrieve the repaired html as far as it is undocumented and for internal use. We replace the call with tidy_repair_string which directly returns the repaired string. Relates to htacg/tidy-html5#673 Relates to https://bugs.php.net/bug.php?id=75947 Tests pass. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
I'm using the PHP tidy wrapper (PHP 7.1.14) and the tidy HTML5 5.6.0 release.
The behavior is different from 5.4.0. Tidy is inserting an excessive amount of newlines when those two options are used together.
Additionally, the behavior seems to contradict the documentation for
'tidy' => 0
It seems like https://github.com/htacg/tidy-html5/pull/645/files#diff-4a4e354609aef4a160784a5caa18e868R1710 might be a cause, but that's just an uninformed guess. (It looks like code that adds a newline).
A workaround is to set 'wrap' to an extremely large integer. This will continue to preserve newlines without breaking up long lines.