HTML unicode escaping sequences #401
Nice work, this will be good to have!
Did you try to benchmark this? I don't expect any major trouble, but it would be nice to be sure, especially for documents with a lot of text. We could also avoid copies by taking strings by value and moving, but let's see what the profiler says first.
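For reference, the "take by value and move" idea mentioned above can be sketched as follows. `TextNode` is a hypothetical type made up for illustration and is not part of the library; the point is only that a by-value parameter lets callers who pass a temporary, or `std::move` their string, avoid a copy.

```cpp
#include <string>
#include <utility>

// Hypothetical type for illustration only; not an RmlUi class.
class TextNode {
public:
    // The parameter is taken by value: an lvalue argument is copied once,
    // an rvalue argument is moved. Either way the buffer is then moved
    // into the member instead of being copied a second time.
    explicit TextNode(std::string text) : text_(std::move(text)) {}

    void SetText(std::string text) { text_ = std::move(text); }

    const std::string& GetText() const { return text_; }

private:
    std::string text_;
};
```

Whether this is measurably faster depends on how the strings are produced, which is why profiling first seems reasonable.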
Co-authored-by: Michael R. P. Ragazzon <mikke89@users.noreply.github.com>
I ran a modified version of the element benchmark and saw no performance decrease.
Thank you! I was about to merge this, but I just thought of a potential issue. The decoding happens before checking for RML tags. So if I am correct, submitting e.g.
I think my latest commit fixes that issue. What do you think?
Yup, I think it does! However, the new test actually demonstrates another issue: `CHECK(StringUtilities::StripWhitespace(document->GetInnerRML()) == "<p><span/></p>");` This would be correct if we returned the inner text (you can use
True, we need to reverse the decoding on output to stay correct.
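A minimal sketch of what reversing the decoding could look like when serializing text back to RML. `EncodeTextForRml` is a hypothetical name, and the set of escaped characters (`<`, `>`, `&`) and the choice of decimal references are assumptions, not necessarily what the PR ended up doing.

```cpp
#include <string>

// Hypothetical helper, not the library's API: re-escape characters in a
// text node so that serialized RML parses back to the same text instead
// of being mistaken for markup.
std::string EncodeTextForRml(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
        case '<': out += "&#60;"; break; // would otherwise open a tag
        case '>': out += "&#62;"; break; // would otherwise close a tag
        case '&': out += "&#38;"; break; // would otherwise start an entity
        default:  out += c;       break;
        }
    }
    return out;
}
```

With something like this, text that was written as `&#60;span/&#62;` serializes back to an escaped form rather than to a literal `<span/>` element.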
Here we go, thanks!
Should the documentation be updated?
You're right, the documentation should be updated. I just made some changes and additions to the RML documentation. In particular, I wrote a new page on the RML syntax.
As discussed in #399, implementing HTML-style Unicode unescaping seems more sensible right now than a CSS-based approach.
This PR implements both decimal and hexadecimal decoding of HTML entities.
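As a rough illustration of decimal and hexadecimal decoding, here is a self-contained sketch that turns `&#NNN;` and `&#xHHH;` into UTF-8. The function names are hypothetical and this is not the code from the PR; malformed or out-of-range references are simply left untouched, which is an assumption about the desired behavior.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration; not the PR's implementation.

// Returns the numeric value of a digit in the given base, or -1.
static int DigitValue(char c, int base)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (base == 16 && c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (base == 16 && c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Appends a Unicode code point to 'out' encoded as UTF-8.
static void AppendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Replaces numeric character references with their UTF-8 encoding.
std::string DecodeNumericEntities(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '&' && i + 2 < in.size() && in[i + 1] == '#') {
            std::size_t j = i + 2;
            int base = 10;
            if (in[j] == 'x' || in[j] == 'X') { base = 16; ++j; }

            std::uint32_t cp = 0;
            std::size_t digits = 0;
            bool in_range = true;
            while (j < in.size()) {
                const int d = DigitValue(in[j], base);
                if (d < 0) break;
                cp = cp * static_cast<std::uint32_t>(base) + static_cast<std::uint32_t>(d);
                if (cp > 0x10FFFF) { in_range = false; break; }
                ++digits;
                ++j;
            }
            // Reject NUL and UTF-16 surrogates, which are not valid text.
            const bool valid = in_range && digits > 0 && cp != 0 &&
                               !(cp >= 0xD800 && cp <= 0xDFFF);
            if (valid && j < in.size() && in[j] == ';') {
                AppendUtf8(out, cp);
                i = j + 1;
                continue;
            }
        }
        out += in[i++]; // not a well-formed reference: copy verbatim
    }
    return out;
}
```

Note that `&#60;p&#62;` decodes to `<p>`, so where the decoding runs relative to tag parsing matters, as raised in the review above.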