htmlparser2 adapter uses no html-encoding in text nodes #8

Sebmaster · 2014-07-05T19:36:00Z

Compare this test script:

var parse5 = require('parse5');

var rawHtml = '&amp;lt;b&amp;gt;World&amp;lt;/b&amp;gt;';

var parser = new parse5.Parser(parse5.TreeAdapters.htmlparser2);
var dom = parser.parseFragment(rawHtml);

console.log(dom.children[0].data);

var htmlparser2 = require('htmlparser2');
var handler = new htmlparser2.DefaultHandler();
var parserInstance = new htmlparser2.Parser(handler, {
  xmlMode: false,
  lowerCaseTags: true,
  lowerCaseAttributeNames: true
});

parserInstance.includeLocation = false;
parserInstance.parseComplete(rawHtml);

console.log(handler.dom[0].data);

which produces this output:

&lt;b&gt;World&lt;/b&gt;
&amp;lt;b&amp;gt;World&amp;lt;/b&amp;gt;

Seems like text nodes already contain decoded data in parse5 and it's used as-is in the tree adapter?


fb55 · 2014-07-06T17:24:15Z

jsdom's entity decoding isn't entirely spec compliant & should be replaced with the parser-provided alternatives. htmlparser2 has a decodeEntities option, which results in the same behavior.
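To make the difference between the two outputs concrete, here is a minimal sketch of what "decoding once" does to the text from the test script. The `decodeOnce` helper and its `ENTITIES` table are hypothetical and cover only a few named entities; they are not the implementation used by htmlparser2, parse5, or jsdom, and a real decoder should come from the parser itself (for example via the `decodeEntities` option mentioned above).

```javascript
// Hypothetical helper for illustration only: handles just four named entities.
// It is NOT htmlparser2's or parse5's entity decoder.
var ENTITIES = { '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"' };

function decodeOnce(text) {
  // Single pass over the input: each match is replaced exactly once,
  // so "&amp;lt;" becomes "&lt;" rather than "<".
  return text.replace(/&(?:amp|lt|gt|quot);/g, function (match) {
    return ENTITIES[match];
  });
}

var rawHtml = '&amp;lt;b&amp;gt;World&amp;lt;/b&amp;gt;';

// Undecoded text, as htmlparser2 reported it in the test script.
console.log(rawHtml);

// Text decoded once, matching what the parse5 tree adapter reported.
console.log(decodeOnce(rawHtml));
```

Under these assumptions, the first line printed corresponds to the htmlparser2 result and the second to the parse5 result, which suggests the two parsers differ only in whether the text node has already been entity-decoded.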

domenic · 2014-07-06T17:25:43Z

The more of that kind of thing we can move into the parser, the better. I am not exactly sure how to do that, but we'll look into it...

Sebmaster · 2014-07-06T17:36:02Z

While I do agree that we should move that kind of thing into the parser as far as possible, this problem doesn't stem from jsdom.

I'd expect the htmlparser2 tree adapter to produce the same output format as htmlparser2 itself; however, that is apparently not the case.

Sebmaster · 2014-07-07T10:43:56Z

Oh well, thanks @fb55, had more time to look into it just now. Seems like we can drop our custom HTMLDecode function and use htmlparser2/parse5 for that. Thanks!


Sebmaster changed the title ~~htmlparser2 adapter uses invalid encoding in text nodes~~ htmlparser2 adapter uses no html-encoding in text nodes Jul 5, 2014

Sebmaster mentioned this issue Jul 5, 2014

XML compat / parse5 jsdom/jsdom#818

Closed

Sebmaster closed this as completed Jul 7, 2014

winhamwr mentioned this issue Dec 16, 2014

html entities fixes PolicyStat/gitlit#95

Merged

inikulin added a commit that referenced this issue Apr 16, 2018

Merge pull request #8 from HTMLParseErrorWG/doctype-update

be496e5

Update DOCTYPE tokenization per spec

milahu mentioned this issue Oct 18, 2022

htmlparser2-parse5-tree-adapter fb55/htmlparser2#1322

Closed
