XML parser: stop using CDATA with invalid XML entities, fixes #10765 #4432

sbraz · 2020-09-02T16:33:53Z

I have been looking further into this double-escaped ampersand issue and I propose the following patch.

It removes an old workaround, added in 2e6a43a, which resulted in invalid XML files where CDATA sections contained HTML entities, some of which were invalid XML entities (such as ü which was translated to ü).

This was only a problem because the htmlentities function was used (which attempts to escape all characters that have matching HTML entities) instead of htmlspecialchars (which only converts the 5 characters that have valid XML entities).

By switching to htmlspecialchars, we ensure that the resulting XML will be valid no matter what special characters are used inside any tag, removing the need to maintain a list of special tags.

I initially meant to only remove the double escaping because text inside CDATA doesn't need to be escaped so I wrote sbraz@95d37e7. But after understanding why CDATA was used in the first place, I thought it best to completely get rid of it and simplify the code while making it future-proof (no need to maintain a list of tags containing special characters).

Redmine Issue: https://redmine.pfsense.org/issues/10765
Ready for review

Remove an old workaround, added in 2e6a43a, which resulted in invalid XML files where CDATA sections contained HTML entities, some of which were invalid XML entities (such as "ü" which was translated to "ü"). This was only a problem because the htmlentities function was used (which attempts to escape all characters that have matching HTML entities) instead of htmlspecialchars (which only converts the 5 characters that have valid XML entities). By switching to htmlspecialchars, we ensure that the resulting XML will be valid no matter what special characters are used inside any tag, removing the need to maintain a list of special tags.

sbraz · 2020-09-02T16:49:11Z

There is one problem with the current patch: if some XML contained HTML entities, such as <tag><![CDATA[ü]]></tag>, those would not be converted properly while loading the configuration. I think we need to increment the XML version and add a function in upgrade_config.inc but it basically means we need to revive the old XML parsing logic and put it in there.

sbraz · 2020-09-03T09:35:08Z

Well, I don't know how to solve this since the config upgrades are run after the config is loaded.

jim-p

We need CDATA protection for many fields for a variety of reasons, encoding in this way is not sufficient. This change would break every config in multiple ways and will not be accepted.

If you encode, then fields have to be aware of the encoding so they can decode and then encode again when needed, with CDATA it's transparently protected.

sbraz · 2020-09-21T12:20:22Z

Hi @jim-p,
I have tried to find a way to parse and write the config in Python without breaking it for pfSense and it seems really difficult. The main problem is that pfSense's parser considers that both of these syntaxes correspond to &, which is wrong:

<ldap_extended_query>&</ldap_extended_query>
<ldap_extended_query><![CDATA[&]]></ldap_extended_query>

I understand that this would break every configuration and I am sure that this can be avoided one way or another by relying on the version attribute.

However, I don't understand your second point regarding encoding: isn't pfSense using UTF-8 for all operations?

I think the current behaviour, i.e. relying on HTML entities inside an XML document and making sure they are not interpreted as invalid by protecting them with CDATA is overly complex and hackish. Getting rid of the list of specific tags that require CDATA encoding is probably going to be a good thing in the long run so I think it should be done as soon as possible.

jim-p · 2020-09-21T12:29:02Z

I do not mean "encoding" like UTF-8, I mean when changing to/from HTML entities and so on.

Trust me when I say we have tried avoiding CDATA in the past and it is not viable. CDATA is easy, transparent to the frontend, and works wonderfully. There is no advantage to removing it, and I don't see any drastic changes there being accepted.

rbgarga requested a review from jim-p September 8, 2020 22:26

jim-p requested changes Sep 9, 2020

View reviewed changes

jim-p closed this Sep 9, 2020

sbraz mentioned this pull request Sep 21, 2020

Calling the plugin breaks special XML fields which use CDATA and contain ampersands opoplawski/ansible-pfsense#61

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

XML parser: stop using CDATA with invalid XML entities, fixes #10765 #4432

XML parser: stop using CDATA with invalid XML entities, fixes #10765 #4432

sbraz commented Sep 2, 2020

sbraz commented Sep 2, 2020

sbraz commented Sep 3, 2020

jim-p left a comment

sbraz commented Sep 21, 2020

jim-p commented Sep 21, 2020

XML parser: stop using CDATA with invalid XML entities, fixes #10765 #4432

XML parser: stop using CDATA with invalid XML entities, fixes #10765 #4432

Conversation

sbraz commented Sep 2, 2020

sbraz commented Sep 2, 2020

sbraz commented Sep 3, 2020

jim-p left a comment

Choose a reason for hiding this comment

sbraz commented Sep 21, 2020

jim-p commented Sep 21, 2020