Complete rewrite of character decoding (not active yet) (#38) #39

dan1wang · 2019-05-28T07:03:43Z

- Use a code generator to create fast decoder code
  - the code generator is written to be readable and easy to maintain
  - the generated code is constructed to be super-fast
- Use substring instead of substr (which MDN advises against)
- Optional error handler in case the application needs to throw an error
- Generally compliant with the HTML 4/5 standards
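
To illustrate the code-generator idea in the list above, here is a minimal sketch, not the code in data/decoder-builder.js. The tiny `ENTITIES` table, the names `buildDecoderSource`, `lookupEntity`, and `decodeNamed`, and the use of `new Function` to compile the generated text are all assumptions made for illustration; the actual generator may emit its output differently. The sketch also uses `substring(start, end)` rather than the legacy `substr(start, length)`.

```javascript
// Hypothetical, tiny entity table; a real generator would work from the full HTML 4/5 lists.
const ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"' };

// Build the source text of a lookup body: one `case` per entity name.
function buildDecoderSource(entities) {
  const cases = Object.keys(entities)
    .map((name) => `    case ${JSON.stringify(name)}: return ${JSON.stringify(entities[name])};`)
    .join('\n');
  return `switch (name) {\n${cases}\n    default: return null;\n}`;
}

// Compile the generated source once; the result is an ordinary function.
const lookupEntity = new Function('name', buildDecoderSource(ENTITIES));

// Decode `&name;` references using the generated lookup.
function decodeNamed(text) {
  let out = '';
  let pos = 0;
  for (;;) {
    const amp = text.indexOf('&', pos);
    if (amp < 0) break;
    const semi = text.indexOf(';', amp + 1);
    if (semi < 0) break;
    // substring takes two indices (start, end), unlike substr (start, length).
    const ch = lookupEntity(text.substring(amp + 1, semi));
    if (ch === null) {
      // Unknown name: copy the text up to and including '&', then keep scanning.
      out += text.substring(pos, amp + 1);
      pos = amp + 1;
    } else {
      out += text.substring(pos, amp) + ch;
      pos = semi + 1;
    }
  }
  return out + text.substring(pos);
}

console.log(decodeNamed('a &lt; b &amp;&amp; c &gt; d')); // a < b && c > d
```

The generator itself stays small and readable, while the generated `switch` is the part that runs on every lookup.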

The new code is not attached to the main classes yet, so that someone else can benchmark it first.

The decoder has a strict mode (off by default). In strict mode, the decoder rejects all entities without a trailing semicolon, as well as the uppercase forms &QUOT;, &LT;, &GT;, and &AMP;.
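
A minimal sketch of the strict/non-strict distinction described above. The function name `decodeEntities`, the two small tables, and the choice to leave a rejected entity unchanged in the output are assumptions for illustration; the PR's decoder may instead report rejections through its optional error handler. The sketch also ignores the longest-prefix matching that HTML applies to semicolon-less entities.

```javascript
// Hypothetical tables covering only the four entities named in the description.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"' };
const LEGACY_UPPERCASE = { AMP: '&', LT: '<', GT: '>', QUOT: '"' };

const has = (table, key) => Object.prototype.hasOwnProperty.call(table, key);

function decodeEntities(text, strict = false) {
  return text.replace(/&([A-Za-z]+)(;?)/g, (match, name, semicolon) => {
    // Strict mode: an entity without a trailing semicolon is rejected (left as-is here).
    if (strict && semicolon === '') return match;
    if (has(NAMED, name)) return NAMED[name];
    // The uppercase forms are accepted only when strict mode is off.
    if (!strict && has(LEGACY_UPPERCASE, name)) return LEGACY_UPPERCASE[name];
    return match;
  });
}

console.log(decodeEntities('a &lt b &QUOT;'));       // a < b "
console.log(decodeEntities('a &lt b &QUOT;', true)); // a &lt b &QUOT;
```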

The HTML4 decoder accepts all HTML 4 named entities plus &apos;, which is part of XHTML 1.0 and XML but not part of HTML 4.

Benchmark results on Windows 10 are as follows:

    XmlEntities.decode: 29ms, 345op/msec
    Html4Entities.decode: 41ms, 244op/msec
    html5Entities.decode: 43ms, 233op/msec
    nodeHtmlEncoder(entities).htmlDecode: 730ms, 14op/msec
    nodeHtmlEncoder(numerical).htmlDecode: 825ms, 12op/msec
    entities.decodeXML: 27ms, 370op/msec
    entities.decodeHTML4: 38ms, 263op/msec
    entities.decodeHTML5: 32ms, 313op/msec
    newDecoder.decodeHTML4Entities: 22ms, 455op/msec
    newDecoder.decodeHTML5Entities: 37ms, 270op/msec
    newDecoder.decodeHTML4EntitiesStrict: 15ms, 667op/msec
    newDecoder.decodeHTML5EntitiesStrict: 26ms, 385op/msec
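
For readers who want to reproduce figures of this shape, here is a hedged sketch of how an op/msec value can be derived from an elapsed time. The iteration count of 10,000 is an inference: every row above is consistent with `Math.round(10000 / ms)`, but the PR does not state it. The names `opsPerMs` and `benchmark` are hypothetical and are not part of the repository's benchmark script.

```javascript
// Operations per millisecond, rounded to the nearest integer.
function opsPerMs(operations, elapsedMs) {
  return Math.round(operations / elapsedMs);
}

// Time `iterations` calls of fn(input) with the wall clock.
function benchmark(fn, input, iterations) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn(input);
  // Guard against a 0 ms reading on very fast runs.
  const elapsedMs = Math.max(Date.now() - start, 1);
  return { elapsedMs, opsPerMs: opsPerMs(iterations, elapsedMs) };
}

console.log(opsPerMs(10000, 29)); // 345, matching the XmlEntities.decode row
console.log(opsPerMs(10000, 15)); // 667, matching decodeHTML4EntitiesStrict
```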


mdevils · 2019-05-28T12:10:25Z

Hello @dan1wang. This is amazing work! I'm going to review this soon.

dan1wang · 2019-06-03T03:48:17Z

data/decoder-builder.js

+            } else {` + /* https://infra.spec.whatwg.org/#c0-control */ `
+                if (((num > ${0xFDD0-1}) && (num < ${0xFDEF+1})) || ([${NON_CHARACTER.join(',')}].indexOf(num) >= 0)) {
+                    parseError("${ERRORS.NON_CHARACTER.MSG}",${ERRORS.NON_CHARACTER.CODE});
+                } else if ((num == ${0x0D}) || (num < ${0x001F+1})) {


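Mistake here: this condition flags every code point below 0x20. Among the C0 control characters, \x09 (tab), \x0A (LF), and \x0C (FF) are valid; the rest, including \x0D (CR), are invalid. \x20 (space) is not a C0 control and is already excluded by the num < 0x20 test. Should be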
} else if ( (num < ${0x20}) && (num != ${0x09}) && (num != ${0x0A}) && (num != ${0x0C}) ) {
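
The corrected condition can be checked in isolation. The sketch below writes it as a standalone predicate next to the original one; the function names `isInvalidC0Control` and `originalCheck` are hypothetical, and `num` is assumed to be a non-negative code point that has already been parsed from the numeric reference.

```javascript
// Corrected check: C0 controls are invalid except tab, LF, and FF.
function isInvalidC0Control(num) {
  return num < 0x20 && num !== 0x09 && num !== 0x0A && num !== 0x0C;
}

// The condition from the diff, with its template expressions evaluated, for comparison.
function originalCheck(num) {
  return num === 0x0D || num < 0x001F + 1;
}

for (const cp of [0x09, 0x0A, 0x0C, 0x0D, 0x00, 0x20]) {
  console.log(cp.toString(16), originalCheck(cp), isInvalidC0Control(cp));
}
```

The two predicates disagree only on 0x09, 0x0A, and 0x0C, which the original condition wrongly reports as errors.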

NitinSingh2020 and others added 3 commits October 27, 2018 01:20

Fix typo

1611c40

Merge pull request mdevils#34 from NitinSingh2020/fix-typo

da2effe

Fix typo

Complete rewrite of character decoding (not active yet)

13a587f



mdevils force-pushed the master branch from fb9cce4 to 68a1a96 Compare April 11, 2020 14:46

mdevils closed this Dec 13, 2020
