Option to skip HTML entity decoding. #8

SebastianStehle · 2024-02-06T16:33:21Z

Hi,

We use your library for our MJML renderer. In a sense, the renderer is just a converter from MJML/HTML to HTML, so entities need to be passed through to the output as is.

We could re-encode the string when writing to the output, but that is not very fast. It would therefore be ideal if we could skip the decoding altogether.
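To illustrate the decode-then-re-encode approach, here is a minimal sketch in Python using only the standard `html` module. It is not the library's code, and `round_trip` is a hypothetical helper name. Besides costing a second pass over the string, this particular standard-library round trip does not reproduce the original entities; other encoders may behave differently.

```python
import html


def round_trip(source: str) -> str:
    """Decode all character references, then re-encode for output."""
    decoded = html.unescape(source)           # pass 1: decode the references
    return html.escape(decoded, quote=False)  # pass 2: re-encode &, < and >


original = "Caf&eacute;&nbsp;&amp; bar"
result = round_trip(original)

# Only "&" is re-encoded; the e-acute and the non-breaking space stay
# as literal characters, so the output no longer matches the input.
print(result == original)
```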

I added this as an option to the HTML reader. To make future improvements easier, I added a class for the options.

osjoberg · 2024-02-06T21:26:04Z

Hi Sebastian,

Thank you for another quality contribution!

The main thing I see that could be improved is that the SkipDecodingCharacterReferences flag should most likely affect all states where decoding may start:

DataState
RcDataState
AttributeValueSingleQuotedState
AttributeValueDoubleQuotedState
AttributeValueUnquotedState

A side effect of setting SkipDecodingCharacterReferences = true is that the tokenizer will no longer report errors when reading invalid character references. I am undecided on whether that makes sense.
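The two points above can be sketched as follows. This is a simplified illustration in Python, not the tokenizer's actual code: the `State` enum mirrors the five state names listed, while `consume`, `skip_decoding` and the `errors` list are hypothetical. The sketch only flags unknown named references, which is far less than a real tokenizer validates.

```python
import html
import re
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    RCDATA = auto()
    ATTRIBUTE_VALUE_SINGLE_QUOTED = auto()
    ATTRIBUTE_VALUE_DOUBLE_QUOTED = auto()
    ATTRIBUTE_VALUE_UNQUOTED = auto()


# Matches text shaped like a character reference: &name; &#123; &#x1F;
_REFERENCE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def consume(text, state, skip_decoding, errors):
    """Return text as it would be emitted in the given state."""
    if skip_decoding:
        # Raw pass-through in every state: nothing is decoded,
        # so nothing is validated and no error is reported.
        return text

    def replace(match):
        decoded = html.unescape(match.group(0))
        if decoded == match.group(0):
            # html.unescape leaves unknown named references unchanged.
            errors.append(f"unknown reference {match.group(0)} in {state.name}")
        return decoded

    return _REFERENCE.sub(replace, text)
```

With `skip_decoding=True` the same input comes back untouched in all five states and `errors` stays empty, which is the trade-off described above.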

I get that setting SkipDecodingCharacterReferences for the entire lifetime of the HtmlReader is good enough for your scenario. I am, however, considering a more flexible API where the undecoded (raw) values could instead be accessed when needed from:

HtmlReader.GetAttributeRaw(string name)
HtmlReader.GetAttributeRaw(int index)
HtmlReader.TextRaw

The downside is that this would be more complex to implement without having a performance regression.
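As a rough illustration of what such an API shape could look like, here is a sketch in Python. `ReaderSketch` and its members are hypothetical stand-ins that loosely mirror the proposed names. It keeps the raw source and decodes lazily, which is only one possible design and does not address the performance concern in a real streaming tokenizer.

```python
import html


class ReaderSketch:
    """Hypothetical reader positioned on one element and its text."""

    def __init__(self, raw_attributes, text_raw):
        self._raw_attributes = dict(raw_attributes)
        self.text_raw = text_raw          # undecoded text, as in the source
        self._decoded_cache = {}

    def get_attribute_raw(self, name):
        # Undecoded attribute value, exactly as written in the source.
        return self._raw_attributes[name]

    def get_attribute(self, name):
        # Decode on first access only, then reuse the cached result.
        if name not in self._decoded_cache:
            self._decoded_cache[name] = html.unescape(self._raw_attributes[name])
        return self._decoded_cache[name]

    @property
    def text(self):
        return html.unescape(self.text_raw)


reader = ReaderSketch({"title": "Tom &amp; Jerry"}, "1 &lt; 2")
print(reader.get_attribute_raw("title"), "|", reader.get_attribute("title"))
```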

In any case, I will have a look at this at the end of the week, but I may not have time to finish the whole feature.

Kind regards,
Oskar

SebastianStehle · 2024-02-07T08:52:11Z

I will have a look at the other places, but your recommendation would be very difficult to implement without double buffering. The only thing that might work is a new string representation that is a combination of string segments. A string segment could then be either plain text or an HTML entity.
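The segment idea could look roughly like this. It is a sketch in Python under stated assumptions: `Segment`, `split_segments`, `raw` and `decoded` are hypothetical names, and the regular expression is a simplification of real character reference parsing.

```python
import html
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    source: str      # exact characters from the input
    is_entity: bool  # True if this segment is a character reference


_REFERENCE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def split_segments(text):
    """Split text into alternating plain-text and entity segments."""
    segments = []
    position = 0
    for match in _REFERENCE.finditer(text):
        if match.start() > position:
            segments.append(Segment(text[position:match.start()], False))
        segments.append(Segment(match.group(0), True))
        position = match.end()
    if position < len(text):
        segments.append(Segment(text[position:], False))
    return segments


def raw(segments):
    # Raw view: concatenate the source characters unchanged.
    return "".join(s.source for s in segments)


def decoded(segments):
    # Decoded view: only entity segments need any work.
    return "".join(
        html.unescape(s.source) if s.is_entity else s.source for s in segments
    )
```

Both views are derived from the same segment list, so the raw text is never lost and decoding happens only when the decoded view is requested.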

SebastianStehle · 2024-02-07T09:11:08Z

I have found more cases.

EDIT: I have fixed the cases you mentioned.

SebastianStehle · 2024-02-07T20:19:41Z

Awesome. Thanks for the merge :)

Skip HTML entity decoding.

9a24a4d

SebastianStehle mentioned this pull request Feb 6, 2024

Undesired interpretation of HTML-encoded characters SebastianStehle/mjml-net#192

Closed

More cases

ce43824

osjoberg merged commit 7ce10cb into osjoberg:master Feb 7, 2024
