DOMDocument / SimpleXMLElement incorrectly convert UTF-8 input when using ISO-8859-15 encoding #11663

Closed
filecage opened this issue Jul 10, 2023 · 4 comments

filecage commented Jul 10, 2023

Description

When I pass UTF-8 characters that cannot be converted to the target encoding into a DOMDocument node that has been initialized with an encoding like ISO-8859-15, I expect the library to handle these characters correctly. Since they cannot be printed in that encoding, I expect them to be encoded as HTML entities.

This is what happens when my DOMDocument encoding is ISO-8859-1. However, it does not work correctly when I'm using ISO-8859-15.

Example

The following code:

<?php

// Decimal Unicode Code Points: 8224, 8225, 8482, 49, 50, 51
$input = '†‡™123';

// Correct output -> the UTF-8-only characters are encoded as HTML entities
$domISO88591 = new DOMDocument('1.0', 'ISO-8859-1');
$domISO88591->appendChild($domISO88591->createElement('text', $input));
echo $domISO88591->saveXML();

// Incorrect output
$domISO885915 = new DOMDocument('1.0', 'ISO-8859-15');
$domISO885915->appendChild($domISO885915->createElement('text', $input));
echo $domISO885915->saveXML();

Result

<?xml version="1.0" encoding="ISO-8859-1"?>
<text>&#8224;&#8225;&#8482;123</text>

<?xml version="1.0" encoding="ISO-8859-15"?>
<text>&#8225;&#49;23</text>

Problem: The † and ™ are omitted, and somehow the 1 has been incorrectly converted to &#49;.

Expected Output

<?xml version="1.0" encoding="ISO-8859-1"?>
<text>&#8224;&#8225;&#8482;123</text>

<?xml version="1.0" encoding="ISO-8859-15"?>
<text>&#8224;&#8225;&#8482;123</text>

Both outputs should be exactly the same because, for this input, the two ISO encodings agree: both can represent 1, 2 and 3, and neither can represent †, ‡ or ™. Also, characters that are printable in the target encoding should not be encoded as HTML entities (as happened with 1 becoming &#49;).

PHP Version

PHP 8.2.8
(All of my local PHP binaries are affected, with the newest version being 8.2.8 and the oldest being 7.4.15)

Operating System

macOS 13.4, libxml 2.9.4

nielsdos (Member) commented

This is a libxml2 bug. I can reproduce this with 2.9.4 indeed, but not with version 2.11.4 which is used on my host.
Can you try upgrading libxml2? I believe a more recent version should fix your issue. Version 2.9.4 is also quite old.
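
To see which libxml2 version a given PHP binary reports, a minimal sketch like the following may help. It relies on the `LIBXML_DOTTED_VERSION` constant from PHP's libxml extension, which as far as I know reflects the version PHP was built against. The helper name `libxmlIsOlderThan` is hypothetical, and the `2.11.4` threshold is only the version reported as working in this thread, not a confirmed fix version.

```php
<?php

/**
 * Hypothetical helper: returns true when $current is older than $minimum.
 * $current defaults to the libxml2 version reported by PHP's libxml extension.
 */
function libxmlIsOlderThan(string $minimum, string $current = LIBXML_DOTTED_VERSION): bool
{
    return version_compare($current, $minimum, '<');
}

echo 'libxml2 version: ', LIBXML_DOTTED_VERSION, PHP_EOL;

// Assumption: 2.11.4 is merely the version reported as unaffected in this
// thread; the release that actually fixed the bug is not known.
if (libxmlIsOlderThan('2.11.4')) {
    echo 'Older than 2.11.4, so this build may be affected.', PHP_EOL;
}
```

The same information also appears in the libxml section of `php -i`.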

filecage commented Jul 10, 2023

@nielsdos thanks for the quick reply. I already suspected libxml2 to be at fault here, but so far I have been unable either to update my local instance that ships with Xcode or to compile PHP against a separate installation of libxml2 2.11.4, due to the removal of --with-libxml-dir in 29d1b7f.

If you're able to reproduce with libxml2 2.9.4 but not with 2.11.4 I think this issue can be closed again. In case an update of libxml2 doesn't fix it for me, I'll reopen.

filecage (Author) commented

Just for future reference: Updating my Xcode SDK (which is the default source for libxml2 on macOS) to the latest version did not help. The newest version of libxml2 that Apple ships is 2.9.13, and that version is still affected by this bug.

Having looked through the changelogs of libxml2, I can't really tell which version fixed the faulty behaviour, but I can confirm that it works well using 2.11.4.

Additionally, the docs are outdated on how to configure the path to libxml2 used during the compile step. Instead of the no longer supported --with-libxml-dir option, I successfully pointed the build at my own version of libxml2 via pkg-config.

I've created php/doc-en#2574 for the outdated docs.

nielsdos (Member) commented

Thanks! I can take a look at updating the docs sometime soon.
