Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOMDocument::savexml and friends ommit xmlns="" declaration for null namespace, creating incorrect xml representation of the DOM #11404

Closed
theseer opened this issue Jun 8, 2023 · 8 comments

Comments

@theseer
Copy link
Contributor

theseer commented Jun 8, 2023

Description

The following code:

<?php declare(strict_types = 1);

$dom1 = new DOMDocument;
$dom1->loadXML('<?xml version="1.0" ?><with xmlns="some:ns" />');

$dom2 = new DOMDocument;
$dom2->loadXML('<?xml version="1.0" ?><none />');

$dom1->documentElement->append(
    $dom1->importNode(
        $dom2->documentElement
    )
);

echo $dom1->saveXML();

Resulted in this output:

<?xml version="1.0"?>
<with xmlns="some:ns"><none/></with>

That is incorrect. The imported node none is in no namespace, the correct serialization to XML thus has to explicitly set the xmlns to "". Otherwise parsing the the generated XML back into a DOM places the node none into the same namespace as its parent with. Adding the following to the above example demonstrates that:

var_dump(
    $dom1->getElementsByTagName('with')->item(0)->namespaceURI,
    $dom1->getElementsByTagName('none')->item(0)->namespaceURI
);

$dom3 = new DOMDocument;
$dom3->loadXML(
    $dom1->saveXML()
);

var_dump(
    $dom3->getElementsByTagName('with')->item(0)->namespaceURI,
    $dom3->getElementsByTagName('none')->item(0)->namespaceURI
);

Output:

string(7) "some:ns"
NULL
string(7) "some:ns"
string(7) "some:ns"

The correct XML serialization of the DOM is:

<?xml version="1.0"?>
<with xmlns="some:ns"><none xmlns="" /></with>

A simple test in JavaScript shows the correct result at least in Firefox:

const parser = new DOMParser;

const dom1 = parser.parseFromString('<?xml version="1.0" ?><with xmlns="some:ns" />', 'text/xml');
const dom2 = parser.parseFromString('<?xml version="1.0" ?><none />', 'text/xml');

dom1.documentElement.appendChild(
        dom1.importNode(
                dom2.documentElement,
                true
        )
);

console.log(
         (new XMLSerializer).serializeToString(dom1)
);
 <?xml version="1.0" encoding="UTF-8"?>
<with xmlns="some:ns"><none xmlns=""/></with>

PHP Version

PHP 8.2.7 / PHP 8.1.20 / PHP 8.0.29 / PHP 7.4.33

Operating System

Fedora Linux 38 x86_64

@theseer
Copy link
Contributor Author

theseer commented Jun 8, 2023

For the record, it's equally wrong in Python 3:

from xml.dom.minidom import parseString

dom1 = parseString('<?xml version="1.0" ?><with xmlns="some:ns" />')
dom2 = parseString('<?xml version="1.0" ?><none />')

dom1.documentElement.appendChild(
    dom1.importNode(
        dom2.documentElement,True
    )
)

print dom1.toxml()

I'm not sure if minidom in Python is backed by LibXML 2. If so, it might (also?) be a LibXML 2 issue.

@damianwadley
Copy link
Member

Could be a libxml issue, however PHP does have a non-trivial amount of code in its implementation of importNode - including some work with node namespaces - so something there could be responsible as well...

@theseer
Copy link
Contributor Author

theseer commented Jun 8, 2023

Nice idea, but importNode is not to blame here. I just used it as an example derived from my original use case and because there is no workaround (see below). It's equally broken when the node is created inside the same DOM:

<?php declare(strict_types = 1);

$dom1 = new DOMDocument;
$dom1->loadXML('<?xml version="1.0" ?><with xmlns="some:ns" />');

$nodeA = $dom1->createElement('none');
$nodeB = $dom1->createElementNS(null, 'none');

$dom1->documentElement->appendChild($nodeA);
$dom1->documentElement->appendChild($nodeB);

echo    $dom1->saveXML();
<?xml version="1.0"?>
<with xmlns="some:ns"><none/><none/></with>

While it's of course questionable to use the DOM Level 1 API for creating $nodeA in a document containing namespaces, the 2nd $nodeB creation call is certainly valid. Both nodes have a NULL namespaceURI when queried:

var_dump($nodeA->namespaceURI, $nodeB->namespaceURI);
NULL
NULL

A potential workaround when in control of creating the nodes, instead of specifying NULL as namespace, one could use ''. I'm not sure if that qualifies as being equal to NULL from the DOM's standard perspective, but at least the serialization is correct then. Of course when the node gets imported, that's not an option as DOMNode::namespaceURI is read-only.

@Girgias
Copy link
Member

Girgias commented Jun 9, 2023

Going to assign @nielsdos to this, as they have the best understanding currently of the DOM extension.

@nielsdos
Copy link
Member

nielsdos commented Jun 10, 2023

Here's what I found.

Some more issues:

  • According to https://dom.spec.whatwg.org/#validate-and-extract the "empty string" namespace and the null namespace are equivalent due to step 1. This isn't handled like this in PHP's code.
  • The behaviour of namespaceUri is wrong. According to https://dom.spec.whatwg.org/#concept-attribute: namespaceUri is "null or a non-empty string". But null and the empty string can both be returned, likely as a consequence of the previous bullet point. For example by passing '' in createElementNS.

These two bullet points are bugs in PHP and can fortunately easily be resolved.

About libxml2's behavior.
libxml2 documentation says for xmlNewDocNode that the namespace argument is optional, but doesn't specify what its behaviour then is.
By reading libxml2 code and by trying some things out it seems that NULL just means it'll take on the default namespace.
However, we can see in the samples posted in this bug report that this behaviour is indeed incorrect wrt serialisation.

This is further confirmed by this C program:

#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

static xmlNodePtr firstElementNode(xmlNodePtr node) {
    while (node->type != XML_ELEMENT_NODE) node = node->next;
    return node;
}

int main() {
    // Parse document
    const char* xmlString = "<?xml version=\"1.0\" ?><with xmlns=\"some:ns\" />";
    xmlDocPtr doc = xmlReadMemory(xmlString, strlen(xmlString), NULL, NULL, 0);
    
    // Add some children
    xmlNodePtr rootNode = xmlDocGetRootElement(doc);
    xmlNodePtr child1 = xmlNewDocNode(doc, NULL, (const xmlChar *)"child1", NULL);
    xmlAddChild(rootNode, child1);
    xmlNodePtr child2 = xmlNewDocNode(doc, NULL, (const xmlChar *)"child2", NULL);
    xmlAddChild(rootNode, child2);

    // Print out doc's string representation
    xmlChar* serializedXmlString;
    xmlDocDumpFormatMemory(doc, &serializedXmlString, NULL, 1);
    printf("%s\n", serializedXmlString);
    // Both ->ns pointers are NULL
    printf("child1=%p, child1->ns=%p\n", child1, child1->ns);
    printf("child2=%p, child2->ns=%p\n", child2, child2->ns);

    // Read back the serialized document into newDoc, and print out its string representation
    xmlDocPtr newDoc = xmlReadMemory((const char*)serializedXmlString, strlen((const char*)serializedXmlString), NULL, NULL, 0);
    xmlFree(serializedXmlString);
    xmlDocDumpFormatMemory(newDoc, &serializedXmlString, NULL, 1);
    printf("%s\n", serializedXmlString);
    xmlFree(serializedXmlString);
    // Both ->ns pointers are no longer NULL, they point to the default namespace
    child1 = firstElementNode(xmlDocGetRootElement(newDoc)->children);
    printf("child1=%p, child1->ns=%p, ns href=%s\n", child1, child1->ns, child1->ns->href);
    child2 = firstElementNode(child1->next);
    printf("child2=%p, child2->ns=%p, ns href=%s\n", child2, child2->ns, child2->ns->href);

    // Cleanup
    xmlFreeDoc(newDoc);
    xmlFreeDoc(doc);
    xmlCleanupParser();

    return 0;
}

I'll create an issue report on the libxml2 project with some questions. Either this is all expected behaviour and we'll have to tweak something at PHP's side, or this is something wrong in libxml2.
I already have a PoC patch for that, but I want to hear back from libxml2 first before I continue.

EDIT: the libxml2 maintainer replied and said this behaviour is by design and the library user should perform the necessary checks.

nielsdos added a commit to nielsdos/php-src that referenced this issue Jun 11, 2023
…aration for null namespace, creating incorrect xml representation of the DOM

The NULL namespace is only correct when there is no default namespace
override. When there is, we need to manually set it to the empty string
namespace.
nielsdos added a commit to nielsdos/php-src that referenced this issue Jun 15, 2023
…aration for null namespace, creating incorrect xml representation of the DOM

The NULL namespace is only correct when there is no default namespace
override. When there is, we need to manually set it to the empty string
namespace.
nielsdos added a commit to nielsdos/php-src that referenced this issue Jun 16, 2023
…aration for null namespace, creating incorrect xml representation of the DOM

The NULL namespace is only correct when there is no default namespace
override. When there is, we need to manually set it to the empty string
namespace.
nielsdos added a commit that referenced this issue Jun 17, 2023
* PHP-8.1:
  Fix GH-11404: DOMDocument::savexml and friends ommit xmlns="" declaration for null namespace, creating incorrect xml representation of the DOM
nielsdos added a commit that referenced this issue Jun 17, 2023
* PHP-8.2:
  Fix GH-11404: DOMDocument::savexml and friends ommit xmlns="" declaration for null namespace, creating incorrect xml representation of the DOM
nielsdos added a commit that referenced this issue Jun 19, 2023
… declaration for null namespace, creating incorrect xml representation of the DOM"

This reverts commit 7eb3e9c.

Although the fix follows the spec, it causes issues because a lot of old
code assumes the incorrect behaviour PHP had since a long time.
We cannot do this yet, especially not in a stable release.
We revert this for the time being.
See GH-11428.
nielsdos added a commit that referenced this issue Jun 19, 2023
* PHP-8.1:
  Revert "Fix GH-11404: DOMDocument::savexml and friends ommit xmlns="" declaration for null namespace, creating incorrect xml representation of the DOM"
nielsdos added a commit that referenced this issue Jun 19, 2023
* PHP-8.2:
  Fix GH-11455: Segmentation fault with custom object date properties
  Revert "Fix GH-11404: DOMDocument::savexml and friends ommit xmlns="" declaration for null namespace, creating incorrect xml representation of the DOM"
@nielsdos
Copy link
Member

Reopening because #11428 (comment) and following comments.
The BC breaks is too big, certainly for a patch release.
We'll have to find a strategy to fix all spec issues in ext/dom in a way that doesn't massively break existing applications.

@nielsdos nielsdos reopened this Jun 19, 2023
@nielsdos
Copy link
Member

For future reference, this works correctly in #13031

@nielsdos
Copy link
Member

nielsdos commented Mar 9, 2024

This is fixed in the new opt-in spec-compliance mode, which was merged in #13031.
For more information about the opt-in mode, please see https://wiki.php.net/rfc/opt_in_dom_spec_compliance.
In short: the behaviour of the old DOM classes are unaffected for backwards-compatibility reasons. There are new DOM classes where your code is correctly handled.

@nielsdos nielsdos closed this as completed Mar 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants