Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behavior of 'add-xml-space' setting when used with 'wrap' => 0 and saveBuffer is called twice in tidy-html5 5.6.0 #673

Closed
TysonAndre opened this issue Feb 9, 2018 · 15 comments

Comments

@TysonAndre
Copy link

TysonAndre commented Feb 9, 2018

I'm using the PHP tidy wrapper(PHP 7.1.14) and the tidy HTML5 5.6.0 release

The behavior is different from 5.4.0. Tidy is inserting an excessive amount of newlines when those two options are used together.

Additionally, the behavior seems to contradict the documentation for 'tidy' => 0

This option specifies the right margin Tidy uses for line wrapping. Tidy tries to wrap lines so that they do not exceed this length. Set wrap to zero if you want to disable line wrapping.

It seems like https://github.com/htacg/tidy-html5/pull/645/files#diff-4a4e354609aef4a160784a5caa18e868R1710 might be a cause, but that's just an uninformed guess. (It looks like code that adds a newline).

php > $x = new Tidy();$x->parseString('<p>this is a string</p>', ['add-xml-space' => true, 'show-body-only' => true, 'wrap' => 50], 'utf8'); echo (string)$x;
<p>this is a string</p>
php > $x = new Tidy();$x->parseString('<p>this is a string</p>', ['add-xml-space' => true, 'show-body-only' => true, 'wrap' => 0], 'utf8'); echo (string)$x;
<p>
this
is
a
string</p>

A workaround is to set 'wrap' to an extremely large integer. This will continue to preserve newlines without breaking up long lines

$x = new Tidy();$x->parseString('<p>this is a string</p>', ['add-xml-space' => true, 'show-body-only' => true, 'wrap' => (1 << 30)], 'utf8'); echo (string)$x;
<p>this is a string</p>
@geoffmcl
Copy link
Contributor

@TysonAndre unfortunately I do not have PHP tidy wrapper (PHP 7.1.14) installed so I can not specifically test there...

But testing with command line tidy I can get the same wrapped-to-multiple-lines, as you show by using -w 1 or -w 2...

But when I use -w 0 I get no wrapping as expected, and this is true for any version of tidy I tried, 5.4, 5.6, or current 5.7.3...

So it seems the PHP Wrapper code that handles 'wrap' >= 0 is wrong. Rather than setting the wrap to zero it is setting it to 1... or something...

To check the wrap option is otherwise ok, you could try 'wrap' >= 14, that is -w 14 and you should get two lines, like -

<p>this is a
string</p>

I do not think this is anything to do with the option add-xml-space. This only adds an xml:space="preserve" attribute to certain elements, when generating XML...

And that PR #645 is only concerning wrapping <? ... ?> elements... nothing to do with wrapping text...

If I wanted to install and try the PHP 7.1.14 Tidy wrapper, can you give some pointers where I can download this from... especially in Windows... maybe I will get a chance to try it... and if there is a repo for the tidy wrapper code, maybe I could check the handling of that option...

So at the moment this seems like a bug in the PHP tidy wrapper code, and this should be filed there... thanks...

@geoffmcl geoffmcl added this to the 5.7 milestone Feb 10, 2018
@TysonAndre
Copy link
Author

TysonAndre commented Feb 10, 2018

Oh. The PR I linked in tidy-html5's source wasn't even merged yet, as you said.


http://windows.php.net/download contains zip files for windows downloads. Unfortunately, windows PHP builds bundle their own libtidy with the tidy DLLs, so that won't help you reproduce it.. You'd have to build tidy from source

Building packages from source in windows is difficult. It may be significantly easier to perform the installation steps in a Linux VM (e.g. via VirtualBox)

Resources:

Executing configure --help should now contain an option to enable tidy. That would be --with-tidy=/path/to/tidy-html5 (Or "C:\path\to\tidy-html5" on windows? Not that familiar with batch scripting)


Also, I looked at the git history of https://github.com/php/php-src/tree/PHP-7.1/ext/tidy for the PHP 7.1 releases. I don't see any code changes to the tidy extension since much earlier than Jan 1, 2017

~/path/to/php-src ±PHP-7.1 » git log -p ext/tidy

commit 83a77383b92d8b1b1a5c141c10e9342aa64b1d92
Author: Anatol Belski <ab@php.net>
Date:   Thu Jan 11 14:28:09 2018 +0100

    Fix test for libtidy 5.6.0
    
    libtidy 5.6.0 remove the language option from the library, it is only
    supported on cli. Prior to that, this option was not used in the
    library. Thus, exclude the option presence from test.

diff --git a/ext/tidy/tests/030.phpt b/ext/tidy/tests/030.phpt
index c351f9a..46809a9 100644
--- a/ext/tidy/tests/030.phpt
+++ b/ext/tidy/tests/030.phpt
@@ -12,18 +12,15 @@ $buffer = '<html></html>';
 $config = array(
   'indent' => true, // AutoBool
   'indent-attributes' => true, // Boolean
-  'indent-spaces' => 3, // Integer
-  'language' => 'de'); // String
+  'indent-spaces' => 3); // Integer
 $tidy = new tidy();
 $tidy->parseString($buffer, $config);
 $c = $tidy->getConfig();
 var_dump($c['indent']);
 var_dump($c['indent-attributes']);
 var_dump($c['indent-spaces']);
-var_dump($c['language']);
 ?>
 --EXPECTF--
 int(1)
 bool(true)
 int(3)
-%s(2) "de"

commit ccd4716ec77c60391d63a1a6af72f96a09b1c617
Author: Xinchen Hui <laruence@gmail.com>
Date:   Tue Jan 2 12:53:31 2018 +0800

    year++

diff --git a/ext/tidy/php_tidy.h b/ext/tidy/php_tidy.h
index 26a9d41..0b30a21 100644
--- a/ext/tidy/php_tidy.h
+++ b/ext/tidy/php_tidy.h
@@ -2,7 +2,7 @@
   +----------------------------------------------------------------------+
   | PHP Version 7                                                        |
   +----------------------------------------------------------------------+
-  | Copyright (c) 1997-2017 The PHP Group                                |
+  | Copyright (c) 1997-2018 The PHP Group                                |
   +----------------------------------------------------------------------+
   | This source file is subject to version 3.01 of the PHP license,      |
   | that is bundled with this package in the file LICENSE, and is        |
diff --git a/ext/tidy/tidy.c b/ext/tidy/tidy.c
index f764b98..c6d6b80 100644
--- a/ext/tidy/tidy.c
+++ b/ext/tidy/tidy.c
@@ -2,7 +2,7 @@
   +----------------------------------------------------------------------+
   | PHP Version 7                                                        |
   +----------------------------------------------------------------------+
-  | Copyright (c) 1997-2017 The PHP Group                                |
+  | Copyright (c) 1997-2018 The PHP Group                                |
   +----------------------------------------------------------------------+
   | This source file is subject to version 3.01 of the PHP license,      |
   | that is bundled with this package in the file LICENSE, and is        |

commit dac6c639bb54e93fb24b82d25f74daaa12763512
Author: Sammy Kaye Powers <sammyk@sammykmedia.com>
Date:   Wed Jan 4 11:23:42 2017 -0600

    Update copyright headers to 2017

@TysonAndre
Copy link
Author

TysonAndre commented Feb 10, 2018

I'm not able to reproduce it via the CLI, when checking out the 5.6.0 release tag, or next, or so on.

So I'm guessing it's a bug caused when tidy is used as a library in a certain way:

(The below snippet behaves as expected)

~/programming/tidy-html5 ±3a30f6a⚡ » ./tidy --add-xml-space yes --show-body-only yes --wrap 1
<p>this is a string</p>
Info: Document content looks like HTML5
No warnings or errors were found.

<p>
this
is
a
string</p>
» echo '<p>this is a string</p>' | ./tidy --add-xml-space yes --show-body-only yes --wrap 0 --char-encoding utf8
Info: Document content looks like HTML5
No warnings or errors were found.

<p>this is a string</p>

@TysonAndre
Copy link
Author

Commit 86e62db seems like it may be related? It removed multiple calls to AdjustConfig. php-src/ext/tidy/tidy.c might have been relying on AdjustConfig being called automatically?

The part that stands out is that the call to AdjustConfig converts wrap=0 to wrap=0x7fffffff, which is the same as how I manually worked around my issue:

https://github.com/htacg/tidy-html5/blob/release/5.6/src/config.c#L1201-L1203

Also, in case you wanted to see what php-src is doing: It seems normal. There's no workarounds whatsoever based on option name (and tidy's declarations of option config data types haven't changed between 5.4 and 5.6 for the wrap config), and php-src's tidy extension isn't adding any options automatically

https://github.com/php/php-src/blob/PHP-7.1.14/ext/tidy/tidy.c#L499-L559 (This is run in a loop over the provided $options)

@TysonAndre
Copy link
Author

TysonAndre commented Feb 10, 2018

saveBuffer looks like it gets called twice. Because it's getting called twice, it's getting the unusual wrapping. This affects 5.6.0 but not 5.4.0

Details: The implementation of $tidy->parseString() afterwards calls saveToBuffer(), to set the (undocumented) object property $tidy->value
https://github.com/php/php-src/blob/PHP-7.1.14/ext/tidy/tidy.c#L1035 calls https://github.com/php/php-src/blob/PHP-7.1.14/ext/tidy/tidy.c#L820

https://github.com/php/php-src/blob/PHP-7.1.14/ext/tidy/tidy.c#L764-L765 is what happens for (string)$tidy). That would be the second time PHP calls saveToBuffer on the same instance. The second time saveToBuffer is called, it has incorrect wrapping.

» cat test.php 
<?php
$x = new Tidy();
$x->parseString('<p>this is a string</p>', ['add-xml-space' => true, 'show-body-only' => true, 'wrap' => 0], 'utf8');
var_export($x->value);
echo "\n";
echo (string)$x;
» ~/php-7.1.14-tidy-nts-install/bin/php test.php 
'<p>this is a string</p>
'
<p>
this
is
a
string</p>

@TysonAndre
Copy link
Author

And if you plan to test this out in a Linux VM, see https://gist.github.com/TysonAndre/b7be05ab3e10f668b49afdde4a83764e for how I built ~/php-7.1.14-tidy-nts-install

@TysonAndre TysonAndre changed the title Unexpected behavior of 'add-xml-space' setting when used with 'wrap' => 0 in tidy-html5 5.6.0 Unexpected behavior of 'add-xml-space' setting when used with 'wrap' => 0 and saveBuffer is called twice in tidy-html5 5.6.0 Feb 10, 2018
@geoffmcl
Copy link
Contributor

@TysonAndre I had written the following before you added more about the tidySaveBuffer, but it still seems relevant...

@TysonAndre as you point out, internally tidy changes wrap length zero to 0x7fffffff... but that should not concern an external library user, like PHP is, through its tidy.c... it is done automatically...

Reading say the _php_tidy_set_tidy_opt it looks correct for the integer wrap value... it does the API call tidyOptSetInt(doc, tidyOptGetId(opt), Z_LVAL(conv))... so given that the macro Z_LVAL does what I expect and converts the string to a uint 0, for 'wrap' => 0 then all is well...

Now when PHP tidy.c calls libtidy with a stream, probably through php_tidy_quick_repair or the like, then the Tidy API tidyParseBuffer will be called, and it you trace that into libtidy that will get to the internal tidy TY_(DocParseStream)( doc, in );, which all tidying gets to, and there you will note one of the things it does is TY_(AdjustConfig)( doc ); /* ensure config consistency */...

So while it looks like Commit 86e62db, which removed some redundent calls to AdjustConfig in TakeConfigSnapshot and other places, at this moment, I do not see it as the cause... but we can always be wrong! Will try to work out some way to test that assumption...

It seems I would need to build PHP from source in my native Windows 10 to be able to fully trace what is happening... I did start to setup php-7.1.14 from binaries, I can see even if I got this working, it would not show why if I get the same result as you... which would be frustrating, to say the least... will work on this...

I am closing tonight, until tomorrow... Meantime, maybe you will spot something more... thanks...

@geoffmcl
Copy link
Contributor

@TysonAndre ok, I setup a binary install of PHP 7.1.14 in C:\php... created a php.ini enabling extension=php_tidy.dll... copied in a 5.6.0 tidy.dll, that are available in the releases... for no particular reason I chose to using the tidy-5.6.0-vc14-mt-64b.zip DLL...

Set up a test.php file -

<?php
 $x = new Tidy();
 $x->parseString('<p>this is a string</p>', ['show-body-only' => true, 'wrap' => 0], 'utf8');
 echo (string)$x;
?>

Added c:\php to my PATH and ran -

>php test.php # and got output -
<p>this is a string</p>

So no problem! I did try adding 'add-xml-space' => true but as suggested this has no effect on the clean results... Am I doing something wrong?

Of course if I add 'wrap' => 1 I will get the multi-lined results that you show...

I forked the github PHP repo, and cloned my fork intending to build in Windows 10, x64, using MSVC 14 2015, but wow that source does not make it easy to build in native Windows... will leave that for another day...

Of course now I have a PHP binary setup could now use ealier, different or later tidy DLL's for testing... if need be...

My binary version -

C:\Users\user>php --version
PHP 7.1.14 (cli) (built: Jan 31 2018 00:15:15) ( ZTS MSVC14 (Visual C++ 2015) x64 )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.1.0, Copyright (c) 1998-2018 Zend Technologies

But unfortunately, can not duplicate the problem...

@geoffmcl
Copy link
Contributor

geoffmcl commented Feb 11, 2018

@TysonAndre I have now connected my PHP 7.1.14 binary installation, from a php-7.1.14-Win32-VC14-x64.zip, to Internet Information Services (IIS) 10, running on my Windows 10, through FastCGI, and have come up with a small problem, that maybe you or others can help me with... it seems a small thing, but...

Created a test2.php file in my local web pages, htdocs folder -

<?php
 $x = new Tidy();
 $x->parseString('<p>this is a string</p>', ['tidy-mark' => true, 'indent' => true, 'wrap' => 0], 'utf8');
 echo (string)$x;
?>

Now browsing that file, through localhost/test2.php correctly shows only this is a string - that's good - but when I view the source I only get -

<html>
  <head>
    <title></title>
  </head>
  <body>
    <p>
      this is a string
    </p>
  </body>
</html>

Which is also beautiful, BUT where is the 'tidy-mark' => true? I expect -

<meta name="generator" content="HTML Tidy for HTML5 for Windows version 5.6.0">

Now in reading the repo tidy.c PHP source, I can see in several places tidyOptSetBool(doc, TidyMark, no);, and have no problem with PHP Tidy using its own default config choices, but that should be done before establishing, adding, any user's config...

So I think it is just a question of the order in which they are applied to the tidy document...

Any PHP chosen defaults should be first, so then that can be overridden by any later user config choices...

Any ideas on this? Thanks...

@TysonAndre
Copy link
Author

Also, what's your output of echo (new Tidy())->getRelease();? I expect it to say 2017/11/25 if set up properly. (That's what mine says when built from source and pointing to tidy-html5-5.6.0

There was no tidy.dll in the zip file distributed by windows.php.net. So I don't expect placing one anywhere to have an effect. (But not that familiar with their setup)

» find . -iname '*tidy*'
./ext/php_tidy.dll

@geoffmcl
Copy link
Contributor

@TysonAndre yes I got suspicious right after I sent the above...

In Windows, I can now see, php_tidy.dll is linked with static libtidy - ARGH!!! ZUTE!!! - Thus the output of echo (new Tidy())->getRelease(); is 2017/03/01 - that is libtidy 5.4.0...

That means, as you point out, adding a tidy.dll anywhere does nothing!

If ever there was a case for using the shared library, this would be one. Due to where and how Windows searches for a DLL, using the shared library would only need copying the tidy.dll to the ./ext/ directory... which is what I had assumed, and done... now deleted... oh well...

And now that I look further I can see the same sad news using phpinfo();

Tidy support => enabled
libTidy Version => 5.4.0
libTidy Release => 2017/03/01
Extension Version => 7.1.14 ($Id: c6d6b80d55c5d351deaa745c3c540971e5583c24 $)

So asside from a full compile of PHP, and the tidy extension, using a later version of libtidy I have no way to test tidy 5.6, or later... and as stated this build does not seem an easy task in native Windows...

Maybe I would consider it in Ubuntu linux, but given that the repo is nearly 500MB on disk, I presently do not have the space in $HOME/projects... I would certainly hope php_tidy is built there using the shared lib tidy...

Anyway, for the moment I am back in the dark

@geoffmcl
Copy link
Contributor

@TysonAndre, well some good news and some bad

I found the space in my Ubuntu linux to build PHP 7.3 from the latest repo...

And successfully got the php app built. It successfully found and linked with my latest tidy shared library...

I installed it in /media/Disk2/projects/install/php/bin, so was able to see its dependencies...

$ readelf -d php
Dynamic section at offset 0xb04b80 contains 30 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libtidy.so.5]
...

And running <?php phpinfo(); ?> I get, in part -

Tidy support => enabled
libTidy Version => 5.7.3
libTidy Release => 2018/01/01
Extension Version => 7.3.0-dev ($Id: 2a0ed7b6e3816d9e5ef7efce5b6156132477f79e $)

So this is for sure using the latest shared library tidy...

The bad news is that I can now see this problem -

$ cat test.php
<?php
 $x = new Tidy();
 $x->parseString('<p>this is a string</p>', ['wrap' => 0], 'utf8');
 echo (string)$x;
?>
$ ./php test.php
*** skipping some output ***
<p>
this
is
a
string</p>

So the problem is there in this latest PHP repo source...

As stressed I do not think this is a problem in tidy library... but maybe...

But something in how the library is used... maybe something about tidySaveBuffer that you mentioned, or whatever...

And to repeat I do not think it is anything to do with --add-xml-space yes...

I did struggle to setup a Windows/MSVC14 build using php-sdk but this is a long haul... may keep working on that...

HTH...

geoffmcl added a commit that referenced this issue Feb 14, 2018
While these look like a code cleanup, they appear to have an adverse
consequence in the use of libtidy by PHP 7+, so have been reverted.

	modified:   src/config.c
	modified:   src/config.h
	modified:   src/tidylib.c
@geoffmcl
Copy link
Contributor

geoffmcl commented Feb 14, 2018

@TysonAndre just because it is now easy, I checked out tidy release/5.4 built and installed it...

Now echo (new Tidy())->getRelease(); yielded '2017/03/01'...

And phpinfo() agrees -

Tidy support => enabled
libTidy Version => 5.4.0
libTidy Release => 2017/03/01
Extension Version => 7.3.0-dev ($Id: 2a0ed7b6e3816d9e5ef7efce5b6156132477f79e $)

And $ ./php test.php correctly yields <p>this is a string</p>

This puts the problem definitely back in libtidy....

And going back to 5.7.3 I now note several other changes -

<html>

<head>

<title>
</title>
</head>
<body>

<p>
this
is
a
string</p>
</body>
</html>

So not only the multi-lined text output, but also several additional newlines that should not be there...

Looking at the commits it seems this not only involved 86e62db, but 350f7b4...

I can NOT find special comments, other than the commit messages by @balthisar, why these were done... If he gets a chance maybe he could comment now...

So I could just try reverting these two commits, but this already involves creating a new commit, so in a branch issue-673 I have removed these two to do some more testing...

I pulled, checked out branch issue-673, built and installed a new tidy, with a RC number, -DTIDY_RC_NUMBER=I673 -

Now phpinfo() shows -

Tidy support => enabled
libTidy Version => 5.7.3.I673
libTidy Release => 2018/01/01
Extension Version => 7.3.0-dev ($Id: 2a0ed7b6e3816d9e5ef7efce5b6156132477f79e $)

And now the output of $ ./php test.php appears correct -

<html>
<head>
<title></title>
</head>
<body>
<p>this is a string</p>
</body>
</html>

And to be sure test.php just contains -

<?php
 $x = new Tidy();
 $x->parseString('<p>this is a string</p>', ['wrap' => 0], 'utf8');
 echo (string)$x;
?>

This does not explain why the seemingly innocuous changes in AdjustConfig logic causes this change, but it is clear that it does...

Unless a better fix is found will consider merging issue-673 to next...

And maybe back porting it, perhaps together with a few other subsequent commits, into release 5.6, creating a 5.6.1 release... will see...

Look forward to further testing and comments... thanks...

@geoffmcl
Copy link
Contributor

@TysonAndre have now merged PR #705 and hope that closes this... and #704

If I missed something please feel free to re-open, or a new issue... thanks

@cmb69
Copy link
Contributor

cmb69 commented Oct 20, 2018

Which is also beautiful, BUT where is the 'tidy-mark' => true? I expect -

PHP's tidy::parseString() does not call tidyCleanAndRepair(). You have to call tidy::cleanRepair() after parsing to get the generator (and also the doctype). Alternatively, you can call tidy::repairString() which returns the parsed and repaired string (but does not modify the tidy object). So the simplest thing to do here is likely:

echo tidy_repair_string ('<p>this is a string</p>', ['tidy-mark' => true, 'indent' => true, 'wrap' => 0], 'utf8');

(It seems to me that PHP's Tidy API has quite some glitches, and the documentation needs improvement in this regard.)

Kdecherf added a commit to Kdecherf/php-readability that referenced this issue Feb 3, 2019
…repair

A change released in tidy 5.6.0 breaks php-tidy when using
tidy_parse_string+tidy_clean_repair and wrap=0, incorrectly wrapping
every single word. Also it seems that $tidy->value should not be used to
retrieve the repaired html as far as it is undocumented and for internal
use.

We replace the call with tidy_repair_string which directly returns the
repaired string.

Relates to htacg/tidy-html5#673
Relates to https://bugs.php.net/bug.php?id=75947

Tests pass.

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
j0k3r pushed a commit to Kdecherf/php-readability that referenced this issue Feb 4, 2019
…repair

A change released in tidy 5.6.0 breaks php-tidy when using
tidy_parse_string+tidy_clean_repair and wrap=0, incorrectly wrapping
every single word. Also it seems that $tidy->value should not be used to
retrieve the repaired html as far as it is undocumented and for internal
use.

We replace the call with tidy_repair_string which directly returns the
repaired string.

Relates to htacg/tidy-html5#673
Relates to https://bugs.php.net/bug.php?id=75947

Tests pass.

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants