Rewrite decoder #38

Closed
dan1wang opened this issue May 24, 2019 · 1 comment

@dan1wang

dan1wang commented May 24, 2019

The decoder has several issues that must be addressed:

  • the code is inconsistent
  • typo in the Æ (AElig) entity (fixed by PR "fix AElig typo" #37)
  • the semicolon should not always be optional
    • this makes the package unusable in many applications (see "HTML chars without ending ; still decoded" #19)
    • the semicolon is optional only for HTML 3.2 entities; it is required for entities added afterward
    • the regex is also broken: the decoder fails on "&ampsomething"
  • Html4Entities doesn't decode "&QUOT;><" but Html5Entities does. We should accept such special cases in quirk mode
  • numeric character references starting with "&#X" are not accepted, even though both "&#X" and "&#x" are valid in the HTML 4 and HTML 5 specifications
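
To illustrate the last point, here is a minimal sketch of numeric character reference decoding that accepts both hex prefixes. `decodeNumericReferences` is a hypothetical helper written for this example, not the package's API, and it assumes the trailing semicolon is required.

```javascript
// Hypothetical helper for illustration only; not part of html-entities.
function decodeNumericReferences(text) {
  // [xX] accepts both "&#x" and "&#X"; a trailing semicolon is required here.
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range values and surrogates untouched rather than guess.
    if (codePoint > 0x10ffff || (codePoint >= 0xd800 && codePoint <= 0xdfff)) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}
```

With this sketch, `decodeNumericReferences('&#X41;&#x42;&#67;')` yields `'ABC'`.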

I am doing a complete rewrite of the decoder. The new decoder will use an incremental parser instead of a regex, and it will have a quirk mode and a strict mode.
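
A rough sketch of what such a scanner could look like, under stated assumptions: the entity table is a tiny sample, the set of names allowed without a semicolon is an assumed subset of the HTML 3.2 entities, and `decodeNamed` is a hypothetical name. It is not the code from the rewrite.

```javascript
// Tiny illustrative table; a real decoder would use the full entity lists.
const SAMPLE_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', AElig: 'Æ', hellip: '…' };
// Names accepted without a trailing semicolon in quirk mode (assumed subset).
const LEGACY_NAMES = new Set(['amp', 'lt', 'gt', 'quot', 'AElig']);

function isAlnum(ch) {
  return /[0-9A-Za-z]/.test(ch);
}

// strict = true: a reference is decoded only when it ends with ';'.
// strict = false (quirk): legacy names are also decoded without ';'.
function decodeNamed(text, strict) {
  let out = '';
  let i = 0;
  while (i < text.length) {
    if (text[i] !== '&') {
      out += text[i++];
      continue;
    }
    // Read the run of alphanumerics that follows '&'.
    let j = i + 1;
    while (j < text.length && isAlnum(text[j])) j++;
    const name = text.slice(i + 1, j);
    if (text[j] === ';' && Object.prototype.hasOwnProperty.call(SAMPLE_ENTITIES, name)) {
      out += SAMPLE_ENTITIES[name];
      i = j + 1;
      continue;
    }
    if (!strict) {
      // Quirk mode: try the longest legacy name that prefixes the run.
      let matched = false;
      for (let len = name.length; len > 0; len--) {
        const prefix = name.slice(0, len);
        if (LEGACY_NAMES.has(prefix)) {
          out += SAMPLE_ENTITIES[prefix];
          i = i + 1 + len;
          matched = true;
          break;
        }
      }
      if (matched) continue;
    }
    out += '&'; // not a recognised reference: keep the ampersand as text
    i++;
  }
  return out;
}
```

In this sketch, `decodeNamed('&ampsomething', false)` gives `'&something'`, while `decodeNamed('&ampsomething', true)` leaves the input unchanged.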

@dan1wang
Author

I did a complete rewrite of the named-entity decoder: https://github.com/dan1wang/node-html-entities/tree/rewrite

Here are the benchmark results:

    XmlEntities.decode: 41ms, 244op/msec
    Html4Entities.decode: 59ms, 169op/msec
    html5Entities.decode: 57ms, 175op/msec
    nodeHtmlEncoder(entities).htmlDecode: 841ms, 12op/msec
    nodeHtmlEncoder(numerical).htmlDecode: 830ms, 12op/msec
    entities.decodeXML: 30ms, 333op/msec
    entities.decodeHTML4: 52ms, 192op/msec
    entities.decodeHTML5: 43ms, 233op/msec
    newDecoder.decodeHTML4Entities: 47ms, 213op/msec
    newDecoder.decodeHTML5Entities: 110ms, 91op/msec
    newDecoder.decodeHTML4EntitiesStrict: 26ms, 385op/msec
    newDecoder.decodeHTML5EntitiesStrict: 32ms, 313op/msec

For non-strict decoding, the new HTML5 entity decoder is slower than every other decoder in the benchmark except nodeHtmlEncoder. This is because if you make the semicolon optional, you have to try to match the longest named entities first, and there are a lot of entities to cover.
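
To make the cost concrete, here is a small sketch of longest-first matching. It assumes the semicolon is optional for every name, and it uses a three-entry sample table in which "not" is a prefix of "notin"; `longestNameAt` is a hypothetical helper, not the code from the rewrite.

```javascript
// Illustrative subset; "not" is a prefix of "notin".
const NAMES = { not: '¬', notin: '∉', amp: '&' };
const MAX_LEN = Math.max(...Object.keys(NAMES).map((n) => n.length));

// Returns the longest entity name that `text` starts with at `start`, or null.
// Every '&' can cost up to MAX_LEN lookups, which is where the time goes.
function longestNameAt(text, start) {
  const limit = Math.min(MAX_LEN, text.length - start);
  for (let len = limit; len > 0; len--) {
    const candidate = text.slice(start, start + len);
    if (Object.prototype.hasOwnProperty.call(NAMES, candidate)) return candidate;
  }
  return null;
}
```

Trying the short names first would match "not" inside "notin" and leave "in" behind as stray text, which is why the loop runs from the longest candidate down.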

For strict decoding, the new decoder is lightning fast :-) At 26ms for HTML4 and 32ms for HTML5, it beats every other HTML decoder in the benchmark.

The code is incomplete (it doesn't do numeric decoding yet).

The code should be changed to split the input on '&' first, using String.prototype.split('&'). That should speed things up considerably.
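
A minimal sketch of the split-first idea, assuming strict decoding (semicolon required) and a four-entry sample table. `decodeBySplit` and `SPLIT_ENTITIES` are hypothetical names for this example, and no speed-up has been measured for it here.

```javascript
// Sample table for illustration; a real decoder would use the full list.
const SPLIT_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"' };

function decodeBySplit(text) {
  const parts = text.split('&');
  // parts[0] precedes the first '&' and never contains an entity.
  let out = parts[0];
  for (let k = 1; k < parts.length; k++) {
    const part = parts[k];
    const end = part.indexOf(';');
    const name = end === -1 ? '' : part.slice(0, end);
    if (end !== -1 && Object.prototype.hasOwnProperty.call(SPLIT_ENTITIES, name)) {
      // Known entity: emit its character, then the text after the ';'.
      out += SPLIT_ENTITIES[name] + part.slice(end + 1);
    } else {
      out += '&' + part; // not a known entity: restore the '&'
    }
  }
  return out;
}
```

Each chunk after the first begins right after an ampersand, so only the start of each chunk needs a lookup and plain text is copied in bulk.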

@mdevils mdevils closed this as completed Dec 13, 2020