Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htmlparser2 adapter uses no html-encoding in text nodes #8

Closed
Sebmaster opened this issue Jul 5, 2014 · 4 comments
Closed

htmlparser2 adapter uses no html-encoding in text nodes #8

Sebmaster opened this issue Jul 5, 2014 · 4 comments

Comments

@Sebmaster
Copy link
Contributor

Compare this test script:

var parse5 = require('parse5');

var rawHtml = '<b>World</b>';

var parser = new parse5.Parser(parse5.TreeAdapters.htmlparser2);
var dom = parser.parseFragment(rawHtml);

console.log(dom.children[0].data);

var htmlparser2 = require('htmlparser2');
var handler = new htmlparser2.DefaultHandler();
var parserInstance = new htmlparser2.Parser(handler, {
  xmlMode: false,
  lowerCaseTags: true,
  lowerCaseAttributeNames: true
});

parserInstance.includeLocation = false;
parserInstance.parseComplete(rawHtml);

console.log(handler.dom[0].data);

which produces this ouput:

<b>World</b>
<b>World</b>

Seems like text nodes already contain decoded data in parse5 and it's used as-is in the tree adapter?

@Sebmaster Sebmaster changed the title htmlparser2 adapter uses invalid encoding in text nodes htmlparser2 adapter uses no html-encoding in text nodes Jul 5, 2014
@fb55
Copy link
Collaborator

fb55 commented Jul 6, 2014

jsdom's entity decoding isn't entirely spec compliant & should be replaced with the parser-provided alternatives. htmlparser2 has a decodeEntities option, which results in the same behavior.

@domenic
Copy link

domenic commented Jul 6, 2014

The more of that kind of thing we can move into the parser, the better. I am not exactly sure how to do that, but we'll look into it...

@Sebmaster
Copy link
Contributor Author

While I do agree that we should move that kind of thing into the parser as far as possible, this problem doesn't stem from jsdom.

I'd expect the htmlparser2 tree-adapter to produce the same output format as htmlparser2 itself, however apparently this is not the case.

@Sebmaster
Copy link
Contributor Author

Oh well, thanks @fb55, had more time to look into it just now. Seems like we can drop our custom HTMLDecode function and use htmlparser2/parse5 for that. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants