Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP 8.3 regression DomNode->value setting #11469

Closed
andypost opened this issue Jun 17, 2023 · 19 comments
Closed

PHP 8.3 regression DomNode->value setting #11469

andypost opened this issue Jun 17, 2023 · 19 comments

Comments

@andypost
Copy link
Contributor

Description

The following code:

<?php
$document = <<<EOD
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>
<body>
<img data-caption="This is a &lt;a href=&quot;https://www.drupal.org&quot;&gt;quick&lt;/a&gt; test…" src="llama.jpg" />
</body>
</html>
EOD;
$dom = new \DOMDocument();
$dom->loadHTML($document, LIBXML_NOBLANKS);
$xpath = new \DOMXPath($dom);
foreach ($xpath->query('//@*[starts-with(name(.), "data-")]') as $node) {
  $value = $node->value;
  var_dump($value);
  $value = htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
  var_dump($value);
  $node->value = $value;
  var_dump($node->value);
}

Resulted in this output:

string(60) "This is a <a href="https://www.drupal.org">quick</a> test…"
string(82) "This is a &lt;a href=&quot;https://www.drupal.org&quot;&gt;quick&lt;/a&gt; test…"
string(82) "This is a &lt;a href=&quot;https://www.drupal.org&quot;&gt;quick&lt;/a&gt; test…"

But I expected this output instead:

string(60) "This is a <a href="https://www.drupal.org">quick</a> test…"
string(82) "This is a &lt;a href=&quot;https://www.drupal.org&quot;&gt;quick&lt;/a&gt; test…"
string(60) "This is a <a href="https://www.drupal.org">quick</a> test…"

This is a cause of few tests failed in Drupal HEAD https://www.drupal.org/project/drupal/issues/3366843

PHP Version

PHP 8.3.0-alpha1

Operating System

Alpinelinux, Debian

@andypost
Copy link
Contributor Author

andypost commented Jun 17, 2023

@nielsdos as I see you're cleaning up dom extension atm, maybe you have idea?

@nielsdos
Copy link
Member

This is caused by the fix in 50fdad8 for #8388.
See also the test case: 50fdad8#diff-420139235ef3d0732aa168c5bccbe8aff89e72bd7e40ad8566af14cf1d17906f

Basically, PHP incorrectly unescaped character references when setting nodeValue on DOM attributes. This is not in accordance to the DOM spec. This can cause a whole bunch of subtle issues (including security pitfalls), such as not being able to round-trip data:

$attrNode = new DOMAttr('test');
$attrNode->value = '&amp;';
print $attrNode->value; // prints & instead of &amp;

This behaviour was fixed in 8.3 to comply with the spec.
I see that Drupal uses htmlspecialchars to work around this old bug, which in 8.3 causes incorrect test output.

If really necessary, the fix can be reverted for the time being.

@andypost
Copy link
Contributor Author

Thank you!

I will ask in community because it can affect more editor's integrations as this code will need more then one place to fix, the broken code is https://git.drupalcode.org/project/drupal/-/blob/11.x/core/modules/editor/src/EditorXssFilter/Standard.php#L97

@andypost
Copy link
Contributor Author

Moreover our CI system is broken with Alpha1 - unable to parse xml-artifacts, Very probably the same cause

@nielsdos
Copy link
Member

I will ask in community because it can affect more editor's integrations as this code will need more then one place to fix, the broken code is https://git.drupalcode.org/project/drupal/-/blob/11.x/core/modules/editor/src/EditorXssFilter/Standard.php#L97

Thanks, please keep me posted. If necessary, this can be reverted and discussed on the internals list to find a possible alternative solution. e.g. something that comes to mind is having a rawNodeValue property with the right behaviour, although that might be confusing as well and existing code can still be buggy.

Moreover our CI system is broken with Alpha1 - unable to parse xml-artifacts, Very probably the same cause

I don't know how Drupal's parser works. In any case, nothing changed to the DOM parsing at PHP's side.

@neclimdul
Copy link

My initial reaction would be a concern this inconsistent behavior could lead to unsafe escaping down the line by someone trying to support two versions of PHP.

@nielsdos
Copy link
Member

nielsdos commented Jun 19, 2023

Thanks for the input.
@andypost @neclimdul I understand the concerns and I think PHP shouldn't break this hard. Especially @neclimdul's concern resonates. I will push a revert of the behaviour change this evening, this will fix this issue and the existing code can go on without having to support two different behaviours (different behaviours could lead to lots of issues).
I will then reopen the original issue and try to find an alternative solution which is hopefully BC break-free, probably with opt-in behaviour somehow (but we can discuss that on the email list).

@nielsdos
Copy link
Member

cc @tstarling

@tstarling
Copy link
Contributor

There's no alternative which both fixes the bug and avoids a b/c break. If necessary, I will open an RFC proposing to reapply my patch. There's no need for rawNodeValue since forwards-compatible code is already possible with e.g.

function setNodeValue($attrNode, $value) {
    $attrNode->nodeValue = '';
    $attrNode->appendChild(new \DOMText($value));
}

Drupal should do that instead.

If an application relies on buggy behaviour, how are we meant to fix the bug without breaking the application? The only solution is to stop relying on the bug.

@andypost
Copy link
Contributor Author

@tstarling thank you, maybe unset($attrNode->nodeValue); will be faster as first operation?

@tstarling
Copy link
Contributor

unset is not implemented for that property. It does nothing.

$ php -a
Interactive shell

php > $doc = new DOMDocument;
php > $attr = $doc->createAttribute('test');
php > $attr->nodeValue = 'test';
php > var_dump($attr->nodeValue);
string(4) "test"
php > unset($attr->nodeValue);
php > var_dump($attr->nodeValue);
string(4) "test"

@nielsdos
Copy link
Member

I don't know what the best way to handle this situation is.

This isn't the only revert I did, I also reverted another patch after people complained. The main problem with this bug (and many others in ext/dom) is that they've been around for about 20 years. It's hard to asses the impact, but it seems that many people expect the wrong behaviour, and their code breaks hard when the behaviour is corrected.

For bugfixes that break BC we generally send them to master. However, given the age of the bug, even this seems risky. Non-bugfix BCs have to go through an RFC process to get a community consensus on what to do with them. Perhaps an RFC for fixing all spec compliance bugs in ext/dom is necessary given how old these issues are and the amount of code relying on wrong behaviour.

One solution I suggested in another thread is to introduce a "spec compliant" flag which is opt-in. In that way existing applications will remain working, and new code can opt-in to the correct behaviour. This would avoid BC breaks. I think this also needs an RFC though, and we're close to the feature freeze unfortunately.

Please let me know your thoughts.

Whatever is chosen, this will probably need a discussion on the internals list.

@andypost
Copy link
Contributor Author

The best way is deprecate old property (make it throw warning at least) and introduce new one

@tstarling
Copy link
Contributor

The best way is deprecate old property (make it throw warning at least) and introduce new one

The property is named in the DOM spec. It's non-compliant to deprecate it.

@neclimdul
Copy link

Sounds similar to how we're stuck with a non ISO DateTimeInterface::ISO8601 :-/

Maybe its just one of those changes that needs a lot of documentation and examples in prominent places to explain why, and how to properly use the property between versions.

@andypost
Copy link
Contributor Author

Yes, such big changes should be mentioned in UPGRADING at least

Also having a test that shows how old behavior has changed is likely to see in the same commit

@andypost
Copy link
Contributor Author

After upgrade to 8.3.0beta3 we got new test failure, sounds like related https://www.drupal.org/project/drupal/issues/3383577#comment-15212429

TL'DR when body contain only < character (not &lt;) the textContent property of document has different value or even throws error

<?php
$document = <<<EOD
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>
<body><</body>
</html>
EOD;
$dom = new \DOMDocument();
$dom->loadHTML($document, LIBXML_NOBLANKS);
echo $dom->textContent . PHP_EOL;
$ php libxml.php 
<

$ php --ri dom

dom

DOM/XML => enabled
DOM/XML API Version => 20031129
libxml Version => 2.11.5
HTML Support => enabled
XPath Support => enabled
XPointer Support => enabled
Schema Support => enabled
RelaxNG Support => enabled

older version

# php libxml.php 
PHP Warning:  DOMDocument::loadHTML(): htmlParseStartTag: invalid element name in Entity, line: 4 in /mnt/libxml.php on line 10

root@74899afe88eb:/mnt# php --ri dom
dom

DOM/XML => enabled
DOM/XML API Version => 20031129
libxml Version => 2.9.10
HTML Support => enabled
XPath Support => enabled
XPointer Support => enabled
Schema Support => enabled
RelaxNG Support => enabled
root@74899afe88eb:/mnt# php -v
PHP 8.2.0 (cli) (built: Feb 16 2023 18:16:46) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.2.0, Copyright (c) Zend Technologies
    with Zend OPcache v8.2.0, Copyright (c), by Zend Technologies

@nielsdos
Copy link
Member

@andypost That's unrelated to this issue report. This is not because of a change in PHP but because of a change in libxml2. I vaguely remember there was a change related to tags and error handling to prevent some kind of bad behaviour with untrusted data. I believe you should report this to the libxml2 developers instead.

@andypost
Copy link
Contributor Author

Thank you! Yes, it it depends on libxml version

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants