Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getAttribute only works with lower-case values in XML #651

Closed
robdodson opened this issue Jul 4, 2013 · 13 comments
Closed

getAttribute only works with lower-case values in XML #651

robdodson opened this issue Jul 4, 2013 · 13 comments

Comments

@robdodson
Copy link
Contributor

When I use jsdom to parse an xml document it lower-cases all attributes. I believe this is the correct behavior for HTML according to MDN, but it is not the correct behavior for XML. In XML attributes are case sensitive so codeSystem and codesystem should be treated as two separate items.

Because DOMParser treats codeSystem and codesystem as separate entities my library works in the browser but not in node with jsdom. I can add a flag so it lower cases its arguments when it's using jsdom but that makes me a little iffy because it's not the expected behavior for XML.

@domenic
Copy link
Member

domenic commented Jul 4, 2013

Yeah, there's a few bugs about this... #620 and #393 come to mind. Remember that jsdom is for HTML DOMs, and is not meant as an XML parser; that said, if it behaves differently from a browser, then it's a bug.

@robdodson
Copy link
Contributor Author

Since we're using window.DOMParser to parse XML on the client side and that supports parsing XML perhaps we've fallen into a bit of a grey area. Swapping in jsdom makes it really easy to convert the lib to node—there's basically little to no code in our core that needs to change—but perhaps there's an alternative you would recommend?

Here's the project fork I'm working on if you're curious.

In #393 @tmpvar mentions that perhaps jsdom might consider supporting xml in the future. Is there anything I can do to help that along?

@domenic
Copy link
Member

domenic commented Jul 4, 2013

I mean, we're committed to supporting XML to exactly the extent that browsers do. It looks like window.DOMParser is a pretty full-featured XML parser, so we have to at least match that. It's not too gray of an area.

Swapping in jsdom makes it really easy to convert the lib to node

That's our goal :). So yeah, this definitely is in jsdom's purview.

Is there anything I can do to help that along?

For sure! To fix this particular issue, I'd suggest an approach where you figure out where we normalize the tag names, and if it's too early, make us stop doing that. We need to normalize them in HTML node serialization and when doing matches on HTML DOMs, but presumably there are places where we normalize them that we shouldn't be.

I can't imagine it'll be that straightforward, but we do have pretty nice test coverage, so you're unlikely to break anything.

@robdodson
Copy link
Contributor Author

ok cool. If I have some cycles this weekend I'll dive in. thanks!

@robdodson
Copy link
Contributor Author

I think I've tracked it down to the htmltodom.js file. There are two places where lowercasing happens.

line 60

parserInstance = new parser.Parser(handler, { lowerCaseTags: true, lowerCaseAttributeNames: true });

changing lowerCaseAttributeNames leaves the attributes untouched. The document is then iterated over by the setChild function. setChild forces all attributes to lowercase.

line 165

newNode.setAttribute(c.toLowerCase(), HTMLDecode(node.attribs[c]));

So changing lowerCaseAttributeNames to false and c.toLowerCase() to just c will leave XML attributes in their original case. The next question is how to signal that? Would a flag of some kind be too hacky?

@robdodson
Copy link
Contributor Author

@domenic I could put together a PR but I'm still not sure what's the best way to tell htmltodom.js to go into non-lowercase mode. Any thoughts?

@domenic
Copy link
Member

domenic commented Jul 21, 2013

@robdodson I don't think there should be a need for a flag. You should just be able match what browsers do, so when you use a window.DOMParser from jsdom it behaves exactly as a window.DOMParser in the browser would.

@robdodson
Copy link
Contributor Author

OK, I created the above PR to try to resolve this. window.DOMParser seems to be able to infer whether a document is XML or HTML and then it just does the right thing as far as attributes are concerned. I tried to replicate that by checking if the document has <?xml version="1.0"?> and if so I'm switching htmlparser2 to xmlMode: true

@ephaeton
Copy link

Your test fails when there's an XML document passed as raw string and it contains a byte order mark., as it expects the <?xml to be at the beginning of the string.

@robdodson
Copy link
Contributor Author

Oh hm.. Can you show me an example of what that looks like?

@ephaeton
Copy link

sample first line of a utf-8 encoded xml file with byte order mark:

$ head -1 path-to-xml | od -c
0000000 357 273 277   <   ?   x   m   l       v   e   r   s   i   o   n
0000020   =   "   1   .   0   "       e   n   c   o   d   i   n   g   =
0000040   "   U   T   F   -   8   "   ?   >  \n
0000052

I'm using this from nodejs, reading the data with readFileSync (with encoding given), i.e.,:

var data = fs.readFileSync('path-to-xml', 'UTF-8');
var jsdom = require("jsdom");
jsdom.jsdom(data.substring(1), null, {features: {QuerySelector: true}});

without the substring call, the match wouldn't work as it's anchored to the beginning of the string, and there the byte order mark is sitting.

potentially PEBKAC...

@ephaeton
Copy link

ephaeton commented May 6, 2014

Consider that the XML spec appendix F Autodetection of character encodings notes that this byte sequence (octal 0357 0273 0277 or hexadecimal 0xEF 0xBB 0xBF) denotes a UTF-8 encoding. It also denotes the valid byte order marks that may live at the beginning of the input before the xml declaration. IMHO you should consider supporting these byte order marks.

@robdodson
Copy link
Contributor Author

Would you be interested in putting together a PR which adds that support? I'm no longer working with jsdom so not sure when I would have to the time to do it.

@domenic domenic added the x(ht)ml label Jul 9, 2014
Sebmaster pushed a commit that referenced this issue Dec 2, 2015
Fixes #393. Fixes #651. Fixes #415 (wasn't quite applicable). Fixes #1276.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants