Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to skip HTML entity decoding. #8

Merged
merged 2 commits into from
Feb 7, 2024

Conversation

SebastianStehle
Copy link
Contributor

Hi,

We use your library for our mjml renderer. In some way this is just converter from mjml / html to html. Therefore entities need to be transformed to the output as is.

We could just convert the string again when writing to the output, but this is not super fast. Therefore it would be ideal if we can just skip the decoding altogether.

I added this an option to the html reader. To make future improvements easier I added a class for the options.

@osjoberg
Copy link
Owner

osjoberg commented Feb 6, 2024

Hi Sebastian,

Thank you for another quality contribution!

The main thing that I see that could be improved is the SkipDecodingCharacterReferences flag most likely should affect all states where the decoding may start:

  • DataState
  • RcDataState
  • AttributeValueSingleQuotedState
  • AttributeValueDoubleQuotedState
  • AttributeValueUnquotedState

A side affect of setting SkipDecodingCharacterReferences = true is that the tokenizer will no longer report errors when reading invalid character references. I am undecided on if it makes sense or not.

I get that setting SkipDecodingCharacterReferences for the entire life-time of the HtmlReader is good enough for your scenario. I am however considering a more flexible API where the unencoded (raw) values instead could be accessed when needed from:

  • HtmlReader.GetAttributeRaw(string name)
  • HtmlReader.GetAttributeRaw(int index)
  • HtmlReader.TextRaw

The downside is that this would be more complex to implement without having a performance regression.

In any case I am having a look at this in the end of the week but I may not have the time to finish the whole feature.

Kind regards,
Oskar

@SebastianStehle
Copy link
Contributor Author

I will have a look to the other places, but your recommendation would be super difficult to implement without double buffering. The only thing that might work is a new string representation that is a combination of string segments. a string segment could then be a plain text or a html entity.

@SebastianStehle
Copy link
Contributor Author

SebastianStehle commented Feb 7, 2024

I have found more cases.

EDIT: I have fixed the cases you mentioned.

@osjoberg osjoberg merged commit 7ce10cb into osjoberg:master Feb 7, 2024
@SebastianStehle
Copy link
Contributor Author

Awesome. Thanks for the merge :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants