Complete rewrite of character decoding (not active yet) (#38) #39

Closed
wants to merge 3 commits into from

dan1wang

  • Use a code generator to create fast decoder code
    • the code generator is written to be readable and easy to maintain
    • the generated code is constructed to be super-fast
  • Use substring instead of substr (which MDN advises against)
  • Optional error handler in case the application needs to throw an error
  • Generally compliant with HTML 4/5 standard
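
The code-generator idea in the list above can be illustrated with a minimal sketch. This is not the PR's generator: the entity table, `generateLookupSource`, `lookup`, and `decodeNamed` are hypothetical names, and compiling with `new Function` is an assumption made to keep the example self-contained (a real generator might instead write the source to a file). It also shows `substring(start, end)`, which takes two indices rather than a start and a length.

```javascript
// Hypothetical entity table; the real decoder covers the full HTML 4/5 sets.
const ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"' };

// Build the body of a lookup function as a string: one `case` per entity.
function generateLookupSource(table) {
  const cases = Object.keys(table)
    .map((name) => `    case ${JSON.stringify(name)}: return ${JSON.stringify(table[name])};`)
    .join("\n");
  return `switch (name) {\n${cases}\n    default: return null;\n  }`;
}

// Compile the generated source once (assumption: new Function is permitted).
const lookup = new Function("name", generateLookupSource(ENTITIES));

function decodeNamed(text) {
  let out = "";
  let pos = 0;
  while (pos < text.length) {
    const amp = text.indexOf("&", pos);
    if (amp < 0) break;
    const semi = text.indexOf(";", amp + 1);
    if (semi < 0) break;
    // substring(start, end) takes two indices, unlike substr(start, length).
    const value = lookup(text.substring(amp + 1, semi));
    if (value === null) {
      // Not a known entity: keep the "&" literally and continue after it.
      out += text.substring(pos, amp + 1);
      pos = amp + 1;
    } else {
      out += text.substring(pos, amp) + value;
      pos = semi + 1;
    }
  }
  return out + text.substring(pos);
}

console.log(decodeNamed("a &lt; b &amp; c"));
```

The sketch only handles semicolon-terminated named entities; numeric references and error reporting are left out.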

The new code is not yet wired into the main classes, so that someone else can benchmark it first.

The decoder has a strict mode (off by default). In strict mode, the decoder rejects all entities without a trailing semicolon, as well as ", <, >, and &.
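
As a rough sketch of the semicolon rule only, the difference between the default and strict modes might look like the following. The function name `decodeLegacy`, the `strict` and `onError` options, and the choice to leave a rejected entity undecoded are all assumptions for illustration, not the PR's API.

```javascript
// Hypothetical subset of names that lenient decoding accepts without a semicolon.
const LEGACY = { amp: "&", lt: "<", gt: ">", quot: '"' };

function decodeLegacy(text, options = {}) {
  const { strict = false, onError = null } = options;
  return text.replace(/&(amp|lt|gt|quot)(;?)/g, (match, name, semi) => {
    if (semi === ";") return LEGACY[name];
    // Missing semicolon: report it if the caller supplied a handler.
    if (onError) onError(`missing semicolon after &${name}`);
    // Assumption: "reject" means the text is left as written.
    return strict ? match : LEGACY[name];
  });
}

const errors = [];
console.log(decodeLegacy("1 &lt 2 &amp; 3"));
console.log(decodeLegacy("1 &lt 2 &amp; 3", { strict: true, onError: (m) => errors.push(m) }));
```

An application that prefers to fail fast could pass an `onError` that throws instead of collecting messages.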

The HTML4 decoder accepts all HTML4 named entities plus &apos;, which is part of XHTML 1.0 and XML but not part of HTML 4.

Benchmark results on Windows 10 are as follows:

    XmlEntities.decode: 29ms, 345op/msec
    Html4Entities.decode: 41ms, 244op/msec
    html5Entities.decode: 43ms, 233op/msec
    nodeHtmlEncoder(entities).htmlDecode: 730ms, 14op/msec
    nodeHtmlEncoder(numerical).htmlDecode: 825ms, 12op/msec
    entities.decodeXML: 27ms, 370op/msec
    entities.decodeHTML4: 38ms, 263op/msec
    entities.decodeHTML5: 32ms, 313op/msec
    newDecoder.decodeHTML4Entities: 22ms, 455op/msec
    newDecoder.decodeHTML5Entities: 37ms, 270op/msec
    newDecoder.decodeHTML4EntitiesStrict: 15ms, 667op/msec
    newDecoder.decodeHTML5EntitiesStrict: 26ms, 385op/msec
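
To read the two columns together: the op/msec figures are consistent with dividing a fixed operation count by the elapsed time and rounding. The count of 10,000 below is inferred from the numbers and is not stated in the PR; `opsPerMsec` is a hypothetical helper.

```javascript
// Throughput as operations divided by elapsed milliseconds, rounded to an integer.
function opsPerMsec(operations, elapsedMs) {
  return Math.round(operations / elapsedMs);
}

const ASSUMED_OPERATIONS = 10000; // inferred from the figures, not stated in the PR

console.log(opsPerMsec(ASSUMED_OPERATIONS, 29)); // 345 (XmlEntities.decode)
console.log(opsPerMsec(ASSUMED_OPERATIONS, 22)); // 455 (newDecoder.decodeHTML4Entities)
console.log(opsPerMsec(ASSUMED_OPERATIONS, 15)); // 667 (newDecoder.decodeHTML4EntitiesStrict)
```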

NitinSingh2020 and others added 3 commits October 27, 2018 01:20
Owner

mdevils commented May 28, 2019

Hello @dan1wang. This is amazing work! I'm going to review this soon.

} else {` + /* https://infra.spec.whatwg.org/#c0-control */ `
if (((num > ${0xFDD0-1}) && (num < ${0xFDEF+1})) || ([${NON_CHARACTER.join(',')}].indexOf(num) >= 0)) {
parseError("${ERRORS.NON_CHARACTER.MSG}",${ERRORS.NON_CHARACTER.CODE});
} else if ((num == ${0x0D}) || (num < ${0x001F+1})) {
Author

Mistake here. C0 control characters other than \x09 (tab), \x0A (lf), and \x0C (ff) are invalid; \x20 (space) is not a control character and stays valid. Should be:
} else if ( (num < ${0x20}) && (num != ${0x09}) && (num != ${0x0A}) && (num != ${0x0C}) ) {
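
Written as plain functions rather than generator templates, the two conditions can be compared directly. The function names are hypothetical; the bodies mirror the expressions quoted above.

```javascript
// Condition from the diff under review.
function originalCheck(num) {
  return num == 0x0d || num < 0x001f + 1;
}

// Condition proposed in the comment: C0 controls except tab, LF, and FF.
function correctedCheck(num) {
  return num < 0x20 && num !== 0x09 && num !== 0x0a && num !== 0x0c;
}

for (const num of [0x09, 0x0a, 0x0c, 0x0d, 0x1f, 0x20]) {
  console.log(num.toString(16), originalCheck(num), correctedCheck(num));
}
```

The two functions disagree only on tab, LF, and FF, which the original condition flags and the corrected one allows; carriage return (\x0D) is flagged by both.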
