Rewrite decoder #38

dan1wang · 2019-05-24T17:42:22Z

The decoder has multiple issues that must be addressed:

- The code is inconsistent.
- There is a typo in the Æ entity (PR "fix AElig typo" #37).
- The semicolon shouldn't always be optional:
  - This makes the package unusable in many applications (issue "HTML chars without ending ; still decoded" #19).
  - The semicolon is optional only for HTML 3.2 entities. It is required for entities added afterward.
  - The regEx is also broken. The decoder would fail with "&ampsomething".
- Html4Entities doesn't decode "&QUOT;><" but Html5Entities does. We should accept those special cases in quirk mode.
- The decoder doesn't accept numeric character references starting with "&#X", even though both "&#X" and "&#x" are valid in the HTML 4 and HTML 5 specifications.
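
To make the last point concrete, here is a minimal sketch of numeric character reference decoding that treats "&#x" and "&#X" the same way. The function name `decodeNumericReferences` is hypothetical and not part of this package, and the sketch uses a regEx only for brevity; it also assumes the strict rule that the trailing semicolon is required.

```javascript
// Hypothetical helper for illustration; not the package's API.
function decodeNumericReferences(text) {
  // "&#" + decimal digits, or "&#" + x/X + hex digits, always ending in ";".
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Values beyond the Unicode range are left untouched instead of throwing.
    if (codePoint > 0x10ffff) return match;
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericReferences('&#x41;&#X42;&#67;')); // ABC
```

A spec-conforming decoder would also need to decide what to do with surrogates and control characters; this sketch passes them through unchanged.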

I am doing a complete rewrite of the decoder. It will use an incremental parser instead of a regEx, and it will have a quirk mode and a strict mode.
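
Roughly, the idea looks like the sketch below. This is not the code in the rewrite branch; it is a single-pass scanner written only to illustrate the strict/quirk split. The names `decodeNamed`, `ENTITIES` and `LEGACY` are made up for the example, and the tiny tables stand in for the real entity lists.

```javascript
// Tiny illustrative table; the real tables hold hundreds of names.
const ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'],
  ['AElig', 'Æ'], ['hellip', '…'],
]);
// Assumption: names whose ";" may be omitted in quirk mode.
const LEGACY = new Set(['amp', 'lt', 'gt', 'quot', 'AElig']);

function decodeNamed(text, strict) {
  let out = '';
  let pos = 0;
  while (pos < text.length) {
    const amp = text.indexOf('&', pos);
    if (amp === -1) break;
    out += text.slice(pos, amp);
    // Scan the run of letters and digits that follows "&".
    let end = amp + 1;
    while (end < text.length && /[A-Za-z0-9]/.test(text[end])) end++;
    const name = text.slice(amp + 1, end);
    if (text[end] === ';' && ENTITIES.has(name)) {
      // Properly terminated entity: valid in both modes.
      out += ENTITIES.get(name);
      pos = end + 1;
      continue;
    }
    let matched = false;
    if (!strict) {
      // Quirk mode: accept the longest legacy name that prefixes the run.
      for (let len = name.length; len > 0 && !matched; len--) {
        const prefix = name.slice(0, len);
        if (LEGACY.has(prefix)) {
          out += ENTITIES.get(prefix);
          pos = amp + 1 + len;
          matched = true;
        }
      }
    }
    if (!matched) {
      // Not an entity: emit "&" literally and keep scanning after it.
      out += '&';
      pos = amp + 1;
    }
  }
  return out + text.slice(pos);
}

console.log(decodeNamed('&ampsomething', true));  // &ampsomething
console.log(decodeNamed('&ampsomething', false)); // &something
```

In strict mode only semicolon-terminated names are decoded, so the scanner never has to guess where a name ends.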

dan1wang · 2019-05-25T13:03:25Z

I did a complete re-write of the named-entity decoder: https://github.com/dan1wang/node-html-entities/tree/rewrite

Here are the benchmark results:

    XmlEntities.decode: 41ms, 244op/msec
    Html4Entities.decode: 59ms, 169op/msec
    html5Entities.decode: 57ms, 175op/msec
    nodeHtmlEncoder(entities).htmlDecode: 841ms, 12op/msec
    nodeHtmlEncoder(numerical).htmlDecode: 830ms, 12op/msec
    entities.decodeXML: 30ms, 333op/msec
    entities.decodeHTML4: 52ms, 192op/msec
    entities.decodeHTML5: 43ms, 233op/msec
    newDecoder.decodeHTML4Entities: 47ms, 213op/msec
    newDecoder.decodeHTML5Entities: 110ms, 91op/msec
    newDecoder.decodeHTML4EntitiesStrict: 26ms, 385op/msec
    newDecoder.decodeHTML5EntitiesStrict: 32ms, 313op/msec

For non-strict decoding, the new HTML5 entity decoder is the slowest of all. This is because once the semicolon is optional, the decoder has to try the longest named entities first, and there are a lot of entities to cover.
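
A small sketch of why longest-first matters and why it costs time. The table format (keys that include the ";" where it is required) and the names `TABLE`, `NAMES_LONGEST_FIRST` and `matchLongest` are assumptions for this example, not the rewrite's actual data structures.

```javascript
// Illustrative subset: "not" may appear without ";", "notin" may not.
const TABLE = new Map([
  ['not', '¬'], ['not;', '¬'], ['notin;', '∉'],
  ['amp', '&'], ['amp;', '&'],
]);
// Longest names first, so "notin;" is tried before "not".
const NAMES_LONGEST_FIRST = [...TABLE.keys()].sort((a, b) => b.length - a.length);

// pos is the index just after the "&".
function matchLongest(text, pos) {
  for (const name of NAMES_LONGEST_FIRST) {
    if (text.startsWith(name, pos)) {
      return { name, value: TABLE.get(name) };
    }
  }
  return null; // no entity starts here
}

console.log(matchLongest('&notin;', 1)); // { name: 'notin;', value: '∉' }
console.log(matchLongest('&notit;', 1)); // { name: 'not', value: '¬' }
```

Trying "not" before "notin;" would turn "&notin;" into "¬in;", which is why the order matters. The linear scan over every name on every "&" is where the time goes in this naive version.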

For strict decoding, the new decoder is lightning fast :-)

The code is incomplete (doesn't do numerical decoding yet).

The code should be changed to do String.split('&') first. That will speed things up considerably.
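
Something along these lines, as a rough sketch of the split-first idea for the strict case only. `decodeBySplit` and `STRICT_ENTITIES` are names made up for the example, the table is a tiny stand-in, and I have not benchmarked this version.

```javascript
// Tiny stand-in table for the example.
const STRICT_ENTITIES = new Map([['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"']]);

function decodeBySplit(text) {
  // Every element after the first begins right after an "&" in the input.
  const parts = text.split('&');
  let out = parts[0];
  for (let i = 1; i < parts.length; i++) {
    const part = parts[i];
    const semi = part.indexOf(';');
    const name = semi === -1 ? '' : part.slice(0, semi);
    if (STRICT_ENTITIES.has(name)) {
      // Known entity: emit its value, then the rest of the segment.
      out += STRICT_ENTITIES.get(name) + part.slice(semi + 1);
    } else {
      // Not an entity: restore the "&" that split() consumed.
      out += '&' + part;
    }
  }
  return out;
}

console.log(decodeBySplit('a &lt; b &amp;&amp; c &bogus; d')); // a < b && c &bogus; d
```

Text with no "&" at all comes back from split() as a single element and is returned without any table lookup.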

mdevils closed this as completed Dec 13, 2020
