Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML unicode escaping sequences #401

Merged
merged 10 commits into from
Jan 12, 2023
Merged

HTML unicode escaping sequences #401

merged 10 commits into from
Jan 12, 2023

Conversation

Dakror
Copy link
Contributor

@Dakror Dakror commented Jan 6, 2023

Like discussed in #399 , implementing HTML unicode unescaping seems more sensible right now than CSS-based.

This PR implements both decimal and hex decoding of html entities.

@mikke89 mikke89 added the enhancement New feature or request label Jan 7, 2023
Copy link
Owner

@mikke89 mikke89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, this will be good to have!

Did you try to benchmark this? I don't expect any major trouble, but it would be nice to be sure, especially for documents with a lot of text. We could also avoid copies by taking strings by value and moving, but let's see what the profiler says first.

Source/Core/ElementText.cpp Outdated Show resolved Hide resolved
Tests/Source/UnitTests/XMLParser.cpp Outdated Show resolved Hide resolved
@Dakror
Copy link
Contributor Author

Dakror commented Jan 10, 2023

I performed a benchmark with a modified version of the element benchmark and verified no performance decrease

@mikke89
Copy link
Owner

mikke89 commented Jan 10, 2023

Thank you!
I did some profiling earlier, and I got around 4% performance impact for the decoder during SetInnerRml (6% if we include when it runs for attributes). I think this is acceptable as it shouldn't make a huge difference in practical cases, although it would be something to try to improve later.

I was about to merge this, but I just thought about a potential issue. The decoding happens before checking for RML tags. So if I am correct, submitting e.g. &lt;p&gt; will actually create a new <p> element. It is also recursive, so if we e.g. submit <p>&amp;lt;</p> it will turn into a <p> element with the text < instead of &lt;. And so on...

@Dakror
Copy link
Contributor Author

Dakror commented Jan 11, 2023

I think my latest commit fixes that issue. What do you think?

@mikke89
Copy link
Owner

mikke89 commented Jan 12, 2023

Yup, I think it does!

However, the new test actually demonstrates another issue:

CHECK(StringUtilities::StripWhitespace(document->GetInnerRML()) == "<p>&lt;span/&gt;</p>");

This would be correct if we returned the inner text (you can use ElementText::GetText for that). However, this is supposed to return the RML - but if you pass this back to SetInnerRml then this creates a <p> element containing the text. I think we need to RML-encode the value returned from a text element's GetInnerRml.

@Dakror
Copy link
Contributor Author

Dakror commented Jan 12, 2023

True, we need to reverse the decoding to stay correct

@mikke89 mikke89 merged commit 4c61fef into mikke89:master Jan 12, 2023
@mikke89
Copy link
Owner

mikke89 commented Jan 12, 2023

Here we go, thanks!

@Dakror Dakror deleted the html-escape branch January 13, 2023 09:16
@xland
Copy link

xland commented Jan 14, 2023

Should Document be updated?

@mikke89
Copy link
Owner

mikke89 commented Jan 16, 2023

You're right, the documentation should be updated. I just made some changes and additions to the RML documentation. In particular, I wrote a new page on the RML syntax.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants