getAttribute only works with lower-case values in XML #651

robdodson · 2013-07-04T14:25:26Z

When I use jsdom to parse an xml document it lower-cases all attributes. I believe this is the correct behavior for HTML according to MDN, but it is not the correct behavior for XML. In XML attributes are case sensitive so codeSystem and codesystem should be treated as two separate items.

Because DOMParser treats codeSystem and codesystem as separate entities my library works in the browser but not in node with jsdom. I can add a flag so it lower cases its arguments when it's using jsdom but that makes me a little iffy because it's not the expected behavior for XML.

The text was updated successfully, but these errors were encountered:

domenic · 2013-07-04T18:34:56Z

Yeah, there's a few bugs about this... #620 and #393 come to mind. Remember that jsdom is for HTML DOMs, and is not meant as an XML parser; that said, if it behaves differently from a browser, then it's a bug.

robdodson · 2013-07-04T19:21:57Z

Since we're using window.DOMParser to parse XML on the client side and that supports parsing XML perhaps we've fallen into a bit of a grey area. Swapping in jsdom makes it really easy to convert the lib to node—there's basically little to no code in our core that needs to change—but perhaps there's an alternative you would recommend?

Here's the project fork I'm working on if you're curious.

In #393 @tmpvar mentions that perhaps jsdom might consider supporting xml in the future. Is there anything I can do to help that along?

domenic · 2013-07-04T19:28:22Z

I mean, we're committed to supporting XML to exactly the extent that browsers do. It looks like window.DOMParser is a pretty full-featured XML parser, so we have to at least match that. It's not too gray of an area.

Swapping in jsdom makes it really easy to convert the lib to node

That's our goal :). So yeah, this definitely is in jsdom's purview.

Is there anything I can do to help that along?

For sure! To fix this particular issue, I'd suggest an approach where you figure out where we normalize the tag names, and if it's too early, make us stop doing that. We need to normalize them in HTML node serialization and when doing matches on HTML DOMs, but presumably there are places where we normalize them that we shouldn't be.

I can't imagine it'll be that straightforward, but we do have pretty nice test coverage, so you're unlikely to break anything.

robdodson · 2013-07-05T17:24:19Z

ok cool. If I have some cycles this weekend I'll dive in. thanks!

robdodson · 2013-07-06T15:54:23Z

I think I've tracked it down to the htmltodom.js file. There are two places where lowercasing happens.

line 60

parserInstance = new parser.Parser(handler, { lowerCaseTags: true, lowerCaseAttributeNames: true });

changing lowerCaseAttributeNames leaves the attributes untouched. The document is then iterated over by the setChild function. setChild forces all attributes to lowercase.

line 165

newNode.setAttribute(c.toLowerCase(), HTMLDecode(node.attribs[c]));

So changing lowerCaseAttributeNames to false and c.toLowerCase() to just c will leave XML attributes in their original case. The next question is how to signal that? Would a flag of some kind be too hacky?

robdodson · 2013-07-20T01:19:32Z

@domenic I could put together a PR but I'm still not sure what's the best way to tell htmltodom.js to go into non-lowercase mode. Any thoughts?

domenic · 2013-07-21T09:01:56Z

@robdodson I don't think there should be a need for a flag. You should just be able match what browsers do, so when you use a window.DOMParser from jsdom it behaves exactly as a window.DOMParser in the browser would.

robdodson · 2013-07-21T17:09:01Z

OK, I created the above PR to try to resolve this. window.DOMParser seems to be able to infer whether a document is XML or HTML and then it just does the right thing as far as attributes are concerned. I tried to replicate that by checking if the document has <?xml version="1.0"?> and if so I'm switching htmlparser2 to xmlMode: true

ephaeton · 2014-04-28T08:36:51Z

Your test fails when there's an XML document passed as raw string and it contains a byte order mark., as it expects the <?xml to be at the beginning of the string.

robdodson · 2014-04-28T16:18:58Z

Oh hm.. Can you show me an example of what that looks like?

ephaeton · 2014-04-28T16:30:09Z

sample first line of a utf-8 encoded xml file with byte order mark:

$ head -1 path-to-xml | od -c
0000000 357 273 277   <   ?   x   m   l       v   e   r   s   i   o   n
0000020   =   "   1   .   0   "       e   n   c   o   d   i   n   g   =
0000040   "   U   T   F   -   8   "   ?   >  \n
0000052

I'm using this from nodejs, reading the data with readFileSync (with encoding given), i.e.,:

var data = fs.readFileSync('path-to-xml', 'UTF-8');
var jsdom = require("jsdom");
jsdom.jsdom(data.substring(1), null, {features: {QuerySelector: true}});

without the substring call, the match wouldn't work as it's anchored to the beginning of the string, and there the byte order mark is sitting.

potentially PEBKAC...

ephaeton · 2014-05-06T09:51:42Z

Consider that the XML spec appendix F Autodetection of character encodings notes that this byte sequence (octal 0357 0273 0277 or hexadecimal 0xEF 0xBB 0xBF) denotes a UTF-8 encoding. It also denotes the valid byte order marks that may live at the beginning of the input before the xml declaration. IMHO you should consider supporting these byte order marks.

robdodson · 2014-05-06T16:14:13Z

Would you be interested in putting together a PR which adds that support? I'm no longer working with jsdom so not sure when I would have to the time to do it.

Fixes #393. Fixes #651. Fixes #415 (wasn't quite applicable). Fixes #1276.

robdodson mentioned this issue Jul 7, 2013

Add support for Node.js blue-button/bluebutton.js#44

Merged

5 tasks

robdodson mentioned this issue Jul 21, 2013

Parser should toggle xmlMode depending on what kind of document is being parsed #661

Closed

robdodson mentioned this issue Jul 27, 2013

"TypeError: Arguments to path.resolve must be strings" on require blue-button/bluebutton.js#52

Closed

domenic added the x(ht)ml label Jul 9, 2014

Sebmaster pushed a commit that referenced this issue Dec 2, 2015

Add proper XML parsing

104406a

Fixes #393. Fixes #651. Fixes #415 (wasn't quite applicable). Fixes #1276.

Sebmaster closed this as completed in 52f774a Dec 6, 2015

amfio mentioned this issue Mar 12, 2019

Parsed XML doc's XPath queries fail with upper case attributes #2530

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getAttribute only works with lower-case values in XML #651

getAttribute only works with lower-case values in XML #651

robdodson commented Jul 4, 2013

domenic commented Jul 4, 2013

robdodson commented Jul 4, 2013

domenic commented Jul 4, 2013

robdodson commented Jul 5, 2013

robdodson commented Jul 6, 2013

robdodson commented Jul 20, 2013

domenic commented Jul 21, 2013

robdodson commented Jul 21, 2013

ephaeton commented Apr 28, 2014

robdodson commented Apr 28, 2014

ephaeton commented Apr 28, 2014

ephaeton commented May 6, 2014

robdodson commented May 6, 2014

getAttribute only works with lower-case values in XML #651

getAttribute only works with lower-case values in XML #651

Comments

robdodson commented Jul 4, 2013

domenic commented Jul 4, 2013

robdodson commented Jul 4, 2013

domenic commented Jul 4, 2013

robdodson commented Jul 5, 2013

robdodson commented Jul 6, 2013

robdodson commented Jul 20, 2013

domenic commented Jul 21, 2013

robdodson commented Jul 21, 2013

ephaeton commented Apr 28, 2014

robdodson commented Apr 28, 2014

ephaeton commented Apr 28, 2014

ephaeton commented May 6, 2014

robdodson commented May 6, 2014