Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML parser: stop using CDATA with invalid XML entities, fixes #10765 #4432

Closed
wants to merge 1 commit into from

Conversation

sbraz
Copy link

@sbraz sbraz commented Sep 2, 2020

Hi @jim-p,

I have been looking further into this double-escaped ampersand issue and I propose the following patch.

It removes an old workaround, added in 2e6a43a, which resulted in invalid XML files where CDATA sections contained HTML entities, some of which were invalid XML entities (such as ü which was translated to ü).

This was only a problem because the htmlentities function was used (which attempts to escape all characters that have matching HTML entities) instead of htmlspecialchars (which only converts the 5 characters that have valid XML entities).

By switching to htmlspecialchars, we ensure that the resulting XML will be valid no matter what special characters are used inside any tag, removing the need to maintain a list of special tags.

I initially meant to only remove the double escaping because text inside CDATA doesn't need to be escaped so I wrote sbraz@95d37e7. But after understanding why CDATA was used in the first place, I thought it best to completely get rid of it and simplify the code while making it future-proof (no need to maintain a list of tags containing special characters).

Remove an old workaround, added in 2e6a43a,
which resulted in invalid XML files where CDATA sections contained HTML
entities, some of which were invalid XML entities (such as "ü" which was
translated to "ü").

This was only a problem because the htmlentities function was used
(which attempts to escape all characters that have matching HTML entities)
instead of htmlspecialchars (which only converts the 5 characters that have
valid XML entities).

By switching to htmlspecialchars, we ensure that the resulting XML will be
valid no matter what special characters are used inside any tag,
removing the need to maintain a list of special tags.
@sbraz
Copy link
Author

sbraz commented Sep 2, 2020

There is one problem with the current patch: if some XML contained HTML entities, such as <tag><![CDATA[&uuml;]]></tag>, those would not be converted properly while loading the configuration. I think we need to increment the XML version and add a function in upgrade_config.inc but it basically means we need to revive the old XML parsing logic and put it in there.

@sbraz
Copy link
Author

sbraz commented Sep 3, 2020

Well, I don't know how to solve this since the config upgrades are run after the config is loaded.

@rbgarga rbgarga requested a review from jim-p September 8, 2020 22:26
Copy link
Contributor

@jim-p jim-p left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need CDATA protection for many fields for a variety of reasons, encoding in this way is not sufficient. This change would break every config in multiple ways and will not be accepted.

If you encode, then fields have to be aware of the encoding so they can decode and then encode again when needed, with CDATA it's transparently protected.

@sbraz
Copy link
Author

sbraz commented Sep 21, 2020

Hi @jim-p,
I have tried to find a way to parse and write the config in Python without breaking it for pfSense and it seems really difficult. The main problem is that pfSense's parser considers that both of these syntaxes correspond to &, which is wrong:

  • <ldap_extended_query>&amp;</ldap_extended_query>
  • <ldap_extended_query><![CDATA[&]]></ldap_extended_query>

I understand that this would break every configuration and I am sure that this can be avoided one way or another by relying on the version attribute.

However, I don't understand your second point regarding encoding: isn't pfSense using UTF-8 for all operations?

I think the current behaviour, i.e. relying on HTML entities inside an XML document and making sure they are not interpreted as invalid by protecting them with CDATA is overly complex and hackish. Getting rid of the list of specific tags that require CDATA encoding is probably going to be a good thing in the long run so I think it should be done as soon as possible.

@jim-p
Copy link
Contributor

jim-p commented Sep 21, 2020

I do not mean "encoding" like UTF-8, I mean when changing to/from HTML entities and so on.

Trust me when I say we have tried avoiding CDATA in the past and it is not viable. CDATA is easy, transparent to the frontend, and works wonderfully. There is no advantage to removing it, and I don't see any drastic changes there being accepted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants