doc.outerHTML is not properly (entity) encoding attribute values #641

Closed
SunilAgrawal opened this issue Jun 10, 2013 · 11 comments

@SunilAgrawal

Hi,
I was trying to use JSDOM on http://www.w3schools.com/js/tryit.asp?filename=tryjs_write by building a DOM out of the HTML, making minor changes, and then serializing the DOM.

However, the serialized HTML doesn't have its attribute values properly HTML-encoded. In particular, there's an attribute value 'Submit code &raquo' that gets HTML-decoded during DOM creation but is not re-encoded during serialization.

Is this a bug?

Thanks, Sunil

@domenic
Member

domenic commented Jun 10, 2013

> Is this a bug?

I'm not sure; do browsers behave differently?

@papandreou
Contributor

Not a bug. Whether a character was originally represented by an entity is intentionally not stored anywhere -- just like every other HTML/XML parser I've worked with. Here's what Chrome does:

var div = document.createElement('div');
div.innerHTML = '<span foo="Submit code &raquo"></span>';
console.log(div.outerHTML); // "<div><span foo="Submit code »"></span></div>"

See also #323

Is this actually causing problems? Do you need the reserialized output to be pure ASCII?
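
If pure-ASCII output is needed, one option is to post-process the serialized string yourself. The following is only a minimal sketch: `escapeNonAscii` is a hypothetical helper, not part of jsdom or any browser API, and it replaces every non-ASCII character with a numeric character reference.

```javascript
// Hypothetical helper (not a jsdom API): replace every character outside
// the ASCII range with a hexadecimal numeric character reference.
// The `u` flag makes the regex match whole code points, so characters
// outside the Basic Multilingual Plane become a single reference.
function escapeNonAscii(html) {
  return html.replace(/[\u0080-\u{10FFFF}]/gu, (ch) =>
    '&#x' + ch.codePointAt(0).toString(16).toUpperCase() + ';');
}

const serialized = '<span foo="Submit code »"></span>';
console.log(escapeNonAscii(serialized));
// <span foo="Submit code &#xBB;"></span>
```

Note that this treats the markup as a flat string, so it would also rewrite non-ASCII characters inside `<script>` and `<style>` elements, where character references are not decoded.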

@domenic
Member

domenic commented Jun 10, 2013

Thanks @papandreou for testing this for us.

@domenic domenic closed this as completed Jun 10, 2013
@thehesiod

The issue is that the page doesn't declare its encoding, so it is presumably ASCII. Are we to assume that the serialized output will be UTF-8 and will therefore override the document's encoding? I would assume you'd want a way to preserve the original encoding, so you can re-serialize the document as ASCII.

@domenic
Member

domenic commented Jun 11, 2013

@thehesiod I believe the default encoding for documents without a <meta charset> is UTF-8. However, you're welcome to prove us wrong! Just produce a .html file that gives different results when loaded into jsdom vs. when loaded into browsers.

@domenic
Member

domenic commented Jun 11, 2013

(Actually, I think the <meta charset> doesn't matter; I think that since JavaScript strings are always Unicode, innerHTML properties will never include escaped characters.)

@thehesiod

I do actually have proof :) If you load http://www.w3schools.com/js/tryit.asp?filename=tryjs_write into JSDOM and then serialize it back to the browser, you'll note that the "Submit Code »" button does not render correctly. However, if you do something like this: response.headers['content-type'] += '; charset=utf-8'; then it will render correctly. Thus, as I suspected, the default charset is not UTF-8 but probably ISO Latin-1.
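
The rendering problem described above can be reproduced without a browser. This is a minimal sketch using Node's `Buffer`: it encodes the button label as UTF-8 and then decodes the same bytes as ISO-8859-1, which is what a client does when no charset is declared and it falls back to Latin-1. `withUtf8Charset` is a hypothetical helper that illustrates the header fix-up; it is not part of jsdom.

```javascript
// UTF-8 encodes U+00BB (») as two bytes: 0xC2 0xBB.
const utf8Bytes = Buffer.from('Submit Code »', 'utf8');

// Decoding those bytes as ISO-8859-1 maps each byte to one character,
// so the two-byte sequence turns into two separate characters.
const misread = utf8Bytes.toString('latin1');
console.log(misread); // Submit Code Â»

// Hypothetical helper: declare UTF-8 in a Content-Type value unless a
// charset parameter is already present.
function withUtf8Charset(contentType) {
  return /;\s*charset=/i.test(contentType)
    ? contentType
    : contentType + '; charset=utf-8';
}

console.log(withUtf8Charset('text/html')); // text/html; charset=utf-8
```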

@thehesiod

I think in any scenario there's an issue:

  1. Given that JSDOM currently outputs UTF-8, what does JSDOM do if there's a meta charset: does it just ignore it (potentially creating mismatched charsets), or does it remove/update the meta charset?
    a) If it creates mismatched charsets, then I think it's a bug, as it's creating invalid HTML.
    b) If it's supposed to do the fix-ups, then for content that doesn't have the meta tag it should add one declaring UTF-8, because it is changing the encoding from Latin-1 to UTF-8, which results in garbage getting rendered.
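
The fix-up discussed in the list above can be sketched as a post-processing step on the serialized string. This is only an illustration under stated assumptions: `ensureUtf8Meta` is a hypothetical helper, not something jsdom provides, and it only handles the short `<meta charset=...>` form, not the `http-equiv="Content-Type"` variant.

```javascript
// Hypothetical helper (not a jsdom API): make the serialized HTML declare
// UTF-8, either by rewriting an existing <meta charset> or by adding one
// right after the opening <head> tag.
function ensureUtf8Meta(html) {
  const metaCharset = /<meta\s+charset\s*=\s*["']?[^"'\s>]+["']?\s*\/?>/i;
  if (metaCharset.test(html)) {
    return html.replace(metaCharset, '<meta charset="utf-8">');
  }
  // Require whitespace or ">" after "head" so that <header> does not match.
  const headOpen = /<head(\s[^>]*)?>/i;
  if (headOpen.test(html)) {
    return html.replace(headOpen, (tag) => tag + '<meta charset="utf-8">');
  }
  return html; // No <head> found: leave the input unchanged.
}

console.log(ensureUtf8Meta('<html><head><title>t</title></head></html>'));
// <html><head><meta charset="utf-8"><title>t</title></head></html>
```

A regex over markup is fragile; doing the same thing through the DOM before serializing would be more robust, at the cost of depending on the DOM implementation in use.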

@thehesiod

Actually, here's the real proof. On that page, if you execute the following in the console, you'll get the charset the browser thinks it is:
document.charset
"ISO-8859-1"

@domenic
Member

domenic commented Jun 11, 2013

Right, but serializing something back to the browser isn't jsdom's job. jsdom's job is to emulate an HTML DOM. Re-serializing is apparently something you're trying to use jsdom for, yes, but it's not a built-in feature. A simple example would be trying to deserialize and re-serialize a badly formed document like <html>text; jsdom's parser will create a real parse tree out of that, so when you do document.documentElement.outerHTML, you don't get <html>text back.

If you try this in the console of your browser on that page:

document.getElementById("submitBTN").outerHTML

you'll get back exactly what jsdom gives, namely "<input id=\"submitBTN\" value=\"Submit Code »\" onclick=\"submitTryit()\" type=\"button\">".

I'm still not seeing anything that jsdom does differently from a browser. Can you give me a line of JavaScript I can run which produces different results in jsdom and in the browser?

@thehesiod

Interesting. Indeed, outerHTML is decoded in the browser, but view source is encoded. I suppose it makes sense. We'll just do our fix-ups then. Thanks for the patience and time!

