HTML unicode escaping sequences #401

Dakror · 2023-01-06T11:17:57Z

As discussed in #399, implementing HTML unicode unescaping seems more sensible right now than a CSS-based approach.

This PR implements both decimal and hex decoding of HTML entities.
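For readers unfamiliar with the two forms, here is a minimal sketch of what decoding decimal (`&#NNN;`) and hexadecimal (`&#xHHH;`) character references could look like. The names `DecodeNumericEntities`, `AppendUtf8`, and `DigitValue` are hypothetical and chosen for illustration; this is not the code from this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of a Unicode code point to `out`.
static void AppendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Returns the numeric value of a decimal or hex digit, or -1 if `c` is neither.
static int DigitValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: decodes "&#NNN;" and "&#xHHH;" in a single pass.
// Malformed or out-of-range references are copied through unchanged.
std::string DecodeNumericEntities(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '&' && i + 2 < in.size() && in[i + 1] == '#') {
            std::size_t j = i + 2;
            int base = 10;
            if (in[j] == 'x' || in[j] == 'X') {
                base = 16;
                ++j;
            }
            std::uint32_t cp = 0;
            std::size_t digits = 0;
            bool valid = true;
            while (j < in.size() && in[j] != ';') {
                const int d = DigitValue(in[j]);
                if (d < 0 || d >= base) { valid = false; break; }
                cp = cp * static_cast<std::uint32_t>(base) + static_cast<std::uint32_t>(d);
                if (cp > 0x10FFFF) { valid = false; break; }
                ++digits;
                ++j;
            }
            // Reject empty references, unterminated ones, and surrogate code points.
            const bool surrogate = (cp >= 0xD800 && cp <= 0xDFFF);
            if (valid && digits > 0 && j < in.size() && !surrogate) {
                AppendUtf8(out, cp);
                i = j + 1; // Skip past the terminating ';'.
                continue;
            }
        }
        out += in[i++];
    }
    return out;
}
```

With this sketch, `&#65;` and `&#x41;` both decode to `A`, while input that is not a well-formed reference (such as a lone `&`) passes through untouched.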

mikke89

Nice work, this will be good to have!

Did you try to benchmark this? I don't expect any major trouble, but it would be nice to be sure, especially for documents with a lot of text. We could also avoid copies by taking strings by value and moving, but let's see what the profiler says first.
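As an illustration of the "take by value and move" idea mentioned above, a minimal sketch follows. `TextHolder` is a hypothetical stand-in and not RmlUi's `ElementText`; whether this pays off in the real code path is exactly what the profiler would need to show.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for a text element; not RmlUi's ElementText.
class TextHolder {
public:
    // Taking the string by value lets callers move a temporary or an
    // already-decoded buffer in, so no extra copy is made on that path.
    void SetText(std::string new_text) { text = std::move(new_text); }

    const std::string& GetText() const { return text; }

private:
    std::string text;
};
```

A caller that no longer needs its buffer can write `holder.SetText(std::move(decoded));`, while callers passing an lvalue still get an ordinary copy.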

Source/Core/ElementText.cpp

Tests/Source/UnitTests/XMLParser.cpp

Co-authored-by: Michael R. P. Ragazzon <mikke89@users.noreply.github.com>

Dakror · 2023-01-10T17:14:30Z

I ran a benchmark with a modified version of the element benchmark and verified that there is no performance decrease.

mikke89 · 2023-01-10T18:44:17Z

Thank you!
I did some profiling earlier, and I measured around 4% performance impact from the decoder during SetInnerRML (6% if we include when it runs for attributes). I think this is acceptable as it shouldn't make a huge difference in practical cases, although it would be something to try to improve later.

I was about to merge this, but I just thought about a potential issue. The decoding happens before checking for RML tags. So if I am correct, submitting e.g. an entity-escaped tag will actually create a new element. It is also recursive, so if we e.g. submit &amp;lt; it will turn into an element with the text < instead of &lt;. And so on...
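To make the recursion concern concrete, here is a minimal sketch of a decoder that scans the input once and never rescans its own output. `DecodeOnce` and its small entity table are hypothetical and only for illustration; they are not the PR's implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a few named entities in one left-to-right pass.
// The output is never rescanned, so "&amp;lt;" becomes "&lt;" and not "<".
std::string DecodeOnce(const std::string& in)
{
    static const struct { const char* name; char value; } entities[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&amp;", '&'}, {"&quot;", '"'},
    };

    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        if (in[i] == '&') {
            for (const auto& e : entities) {
                const std::string name = e.name;
                if (in.compare(i, name.size(), name) == 0) {
                    out += e.value;
                    i += name.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched)
            out += in[i++];
    }
    return out;
}
```

Applying `DecodeOnce` a single time to `&amp;lt;` yields the literal text `&lt;`. Applying it twice yields `<`, which is the over-decoding described above; and if the decoded result is then fed to the tag parser, an escaped tag is treated as a real one.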

Dakror · 2023-01-11T16:20:46Z

I think my latest commit fixes that issue. What do you think?

mikke89 · 2023-01-12T15:03:01Z

Yup, I think it does!

However, the new test actually demonstrates another issue:

CHECK(StringUtilities::StripWhitespace(document->GetInnerRML()) == "<p>&lt;span/&gt;</p>");

This would be correct if we returned the inner text (you can use ElementText::GetText for that). However, this is supposed to return the RML, and if you pass this back to SetInnerRML then this creates an element containing the text. I think we need to RML-encode the value returned from a text element's GetInnerRML.
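A minimal sketch of what such RML-encoding could look like is shown below. `EncodeRml` is a hypothetical name and the set of escaped characters is an assumption; the function actually used in RmlUi may differ.

```cpp
#include <string>

// Hypothetical helper, not RmlUi's actual function: escapes the characters
// that would otherwise be interpreted as markup when the string is parsed again.
std::string EncodeRml(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
        case '<': out += "&lt;"; break;
        case '>': out += "&gt;"; break;
        case '&': out += "&amp;"; break;
        case '"': out += "&quot;"; break;
        default: out += c; break;
        }
    }
    return out;
}
```

The point is round-tripping: if a text element holds the literal text `<span/>`, encoding it on the way out gives `&lt;span/&gt;`, which decodes back to the same text when set again instead of being parsed as a tag.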

Dakror · 2023-01-12T16:50:41Z

True, we need to reverse the decoding to stay correct

mikke89 · 2023-01-12T21:00:34Z

Here we go, thanks!

xland · 2023-01-14T10:51:19Z

Should the documentation be updated?

mikke89 · 2023-01-16T00:02:42Z

You're right, the documentation should be updated. I just made some changes and additions to the RML documentation. In particular, I wrote a new page on the RML syntax.

Dakror added 2 commits January 6, 2023 12:10

HTML unicode escaping sequences

878ff4b

remove comment

7971003

mikke89 added the enhancement label Jan 7, 2023

mikke89 reviewed Jan 8, 2023

Source/Core/ElementText.cpp

Tests/Source/UnitTests/XMLParser.cpp

Dakror and others added 5 commits January 9, 2023 11:33

Update Tests/Source/UnitTests/XMLParser.cpp

dd37e25

Co-authored-by: Michael R. P. Ragazzon <mikke89@users.noreply.github.com>

Fix equality statement, still broken though.

da9247b

Change location of text escaping

1a3a228

fix unit test

f2643ee

Added benchmark for very long text elements

a5f0584

html unescaping only in inner text, not in rml-parsing context

4e20816

Dakror added 2 commits January 12, 2023 17:56

fix decoding - encoding loophole with element text

546aaf9

fix line break

a4dcee9

mikke89 merged commit 4c61fef into mikke89:master Jan 12, 2023

Dakror deleted the html-escape branch January 13, 2023 09:16
