Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DOMParser and XMLSerializer classes #1368

Closed
lehni opened this issue Jan 27, 2016 · 43 comments
Closed

Implement DOMParser and XMLSerializer classes #1368

lehni opened this issue Jan 27, 2016 · 43 comments
Labels

Comments

@lehni
Copy link
Contributor

lehni commented Jan 27, 2016

All modern browsers expose these two useful classes with both a very simple API:

https://developer.mozilla.org/en/docs/Web/API/DOMParser
https://developer.mozilla.org/en/docs/XMLSerializer

Without knowing the details, I believe it shouldn't be too hard to implement these as thin shim to what jsdom already has available.

I am currently using quickly hacked together shims in paper.js that look like the code below.

I am happy to work on a better version that can find its way into jsdom, if you think this would be useful and could give me some pointers as to where to start looking.

function XMLSerializer() {
}

XMLSerializer.prototype.serializeToString = function(node) {
    var text = jsdom.serializeDocument(node);
    // Fix a jsdom issue where all SVG tagNames are lowercased:
    // https://github.com/tmpvar/jsdom/issues/620
    var tagNames = ['linearGradient', 'radialGradient', 'clipPath', 'textPath'];
    for (var i = 0, l = tagNames.length; i < l; i++) {
        var tagName = tagNames[i];
        text = text.replace(
            new RegExp('(<|</)' + tagName.toLowerCase() + '\\b', 'g'),
            function(all, start) {
                return start + tagName;
            });
    }
    return text;
};

function DOMParser() {
}

DOMParser.prototype.parseFromString = function(string, contenType) {
    var div = document.createElement('div');
    div.innerHTML = string;
    return div.firstChild;
};
@domenic
Copy link
Member

domenic commented Jan 27, 2016

Sure, it would be great to have help implementing this!

Let's do this one at a time: one PR for DOMParser, one for XMLSerializer.

For DOMParser, the spec is at https://w3c.github.io/DOM-Parsing/#the-domparser-interface. You'd need to introduce new IDL (probably a new folder, domparsing, next to window/ and nodes/ and friends). Then write the impl class. You can peruse recent commits that introduced Location and History in the same fashion to see how that's done.

The actual implementation does indeed look not too complicated. We'd ideally want to do things a bit more in depth than your version though. I think the method will end up looking similar to jsdom.jsdom, actually! in lib/jsdom.js.

@lehni
Copy link
Contributor Author

lehni commented Jan 28, 2016

Thanks for the pointers! For XMLSerializer the specs are also there: https://w3c.github.io/DOM-Parsing/#the-xmlserializer-interface

For the DOMParser, I am not sure I understand your last comment. With jsdom.jsdom, did you mean exports.jsdom = ... inside jsdom.js?

@domenic
Copy link
Member

domenic commented Jan 28, 2016

Yep! The behavior is actually quite similar: take a string, and depending on another passed argument, create a document parsed as XML or a document parsed as HTML.

@lehni
Copy link
Contributor Author

lehni commented Jan 28, 2016

Alright, sounds good! BTW, I just realized that this is more according to specs:

DOMParser.prototype.parseFromString = function(string, contenType) {
    // Create a new document, since we're supposed to always return one.
    var doc = document.implementation.createHTMLDocument(''),
        body = doc.body,
        last;
    // Set the body's HTML, then change the DOM according the specs.
    body.innerHTML = string;
    // Remove all top-level children (<html><head/><body/></html>)
    while (last = doc.lastChild)
        doc.removeChild(last);
    // Insert the first child of the body at the top.
    doc.appendChild(body.firstChild);
    return doc;
};

@kahwee
Copy link

kahwee commented Feb 3, 2016

This may be helpful: https://developer.mozilla.org/en-US/docs/Web/API/DOMParser

It's under the section "DOMParser HTML extension for other browsers".

Tell me how I can help on this. I'm trying to write some test cases that involve DOMParser.

@domenic
Copy link
Member

domenic commented Feb 4, 2016

@kahwee I tried to outline how to do a pull request for DOMParser in #1368 (comment). You might want to coordinate with @lehni to make sure you two don't duplicate work.

@kahwee
Copy link

kahwee commented Feb 4, 2016

@domenic I haven't started work on this. I find the harder piece is XMLSerializer which xmldom sort of covers: https://github.com/jindw/xmldom/blob/master/dom.js

It is better to divide the task into 2 pull request.

@lehni How's progress on this?

Thank you both.

domenic added a commit that referenced this issue Jul 2, 2016
It's missing:

- <parsererror> elements when parsing XML strings fails
- copying the active document's URL

Also adds the empty-per-spec XMLDocument interface.

Closes #1341. Part of #1368.
@domenic
Copy link
Member

domenic commented Jul 2, 2016

So DOMParser is now implemented and will be in the next release.

I've looked at a bunch of XML serializers on npm and nobody seems to have put together one that matches the spec. In the meantime https://github.com/cburgmer/xmlserializer seems closest, so we could use that I guess. I opened cburgmer/xmlserializer#8 to get more spec compliance.

@domenic domenic added the feature label Jul 2, 2016
@lehni
Copy link
Contributor Author

lehni commented Jul 6, 2016

@domenic many thanks for implementing this! And apologies for dropping the ball on this one. Been swamped with work here...

@ianks
Copy link

ianks commented Aug 12, 2016

@domenic Does DOMParser in jsdom support parsing XML yet?

@domenic
Copy link
Member

domenic commented Aug 13, 2016

It supports it as well as jsdom does, which is pretty OKish

@ianks
Copy link

ianks commented Aug 13, 2016

I attemped to parse a SOAP response with it and it seemed to choke a bit, giving back undefined as a few of the properties on the Document object. I assume this probably has to do with namespaces?

@domenic
Copy link
Member

domenic commented Aug 13, 2016

Hmm, possibly. Could you give a small-ish testcase?

@ianks
Copy link

ianks commented Aug 15, 2016

@domenic I'm not sure how to write a full test case here, but this example highlights the undefined keys issue that exists.

var data =
  `<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Header>
          <ns2:ResponseHeader xmlns:ns2="https://adwords.google.com/api/adwords/mcm/v201509" xmlns="https://adwords.google.com/api/adwords/cm/v201509">
              <requestId>1234554321</requestId>
              <serviceName>ManagedCustomerService</serviceName>
              <methodName>get</methodName>
              <operations>1</operations>
              <responseTime>115</responseTime>
          </ns2:ResponseHeader>
      </soap:Header>
      <soap:Body>
          <ns2:getResponse xmlns="https://adwords.google.com/api/adwords/cm/v201509" xmlns:ns2="https://adwords.google.com/api/adwords/mcm/v201509">
              <ns2:rval>
                  <totalNumEntries>2</totalNumEntries>
                  <Page.Type>ManagedCustomerPage</Page.Type>
                  <ns2:entries>
                      <ns2:name>Test1</ns2:name>
                      <ns2:customerId>1234566789</ns2:customerId>
                  </ns2:entries>
                  <ns2:entries>
                      <ns2:name>Test2</ns2:name>
                      <ns2:customerId>987654321</ns2:customerId>
                  </ns2:entries>
              </ns2:rval>
          </ns2:getResponse>
      </soap:Body>
  </soap:Envelope>`;

var document = (new window.DOMParser()).parseFromString(data, "text/xml")

@domenic
Copy link
Member

domenic commented Aug 15, 2016

And what document object properties are undefined when you do that?

@ianks
Copy link

ianks commented Aug 15, 2016

The two types which are have undefined keys are: SymbolTreeNode and Document, which I believe correspond to the documentElement and... something else in chrome.

It should be noted that the way I required the files looks like this:

  // is there a better way?
  global.DOMParser = window.DOMParser = require('jsdom/lib/jsdom/living/domparsing/DOMParser-impl').implementation;

@domenic
Copy link
Member

domenic commented Aug 15, 2016

Oh, yeah, that won't work. You need to use an actual jsdom, e.g.

const DOMParser = jsdom.jsdom().defaultView.DOMParser;

Anyway, I still don't understand what properties you're talking about, exactly. document.SymbolTreeNode and document.Document are undefined in real browsers too, not just jsdom.

@ianks
Copy link

ianks commented Aug 15, 2016

Thank you for the info. Here is what I meant by SymbolTreeNode and Document types:

domparser

@domenic
Copy link
Member

domenic commented Aug 15, 2016

Weird. But I'd again really appreciate some code. Are you saying that document.undefined === SymbolTreeNode? Or what?

@ianks
Copy link

ianks commented Aug 15, 2016

Yes exactly!

@domenic
Copy link
Member

domenic commented Aug 15, 2016

Great. So then it can't be true that document.undefined === Document, so I guess that part of your comment was just a red herring.

@domenic
Copy link
Member

domenic commented Aug 15, 2016

Here's a test case showing that this is indeed happening: https://tonicdev.com/57b2082567f249120021c588/57b2082567f249120021c589/branches/master I'll open a separate issue.

@domenic
Copy link
Member

domenic commented Aug 15, 2016

Oh, no, that's not what's happening: https://tonicdev.com/57b2082567f249120021c588/57b2082567f249120021c589/branches/master

The issue is that both document.undefined and window.SymbolTreeNode are undefined. So I don't think there is any bug here. Maybe there is a bug in whatever software you are using to generate those tree views.

@ianks
Copy link

ianks commented Aug 15, 2016

Yes, I think you are right. Here is a list of all the properties I have when looping through document: https://gist.github.com/ianks/c935930573921025c28c460019ec85ab

I think my original diagnoses was just wrong 😢

Here is my super simple test case:

var data =
  `<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Header>
          <ns2:ResponseHeader xmlns:ns2="https://adwords.google.com/api/adwords/mcm/v201509" xmlns="https://adwords.google.com/api/adwords/cm/v201509">
              <requestId>1234554321</requestId>
              <serviceName>ManagedCustomerService</serviceName>
              <methodName>get</methodName>
              <operations>1</operations>
              <responseTime>115</responseTime>
          </ns2:ResponseHeader>
      </soap:Header>
      <soap:Body>
          <ns2:getResponse xmlns="https://adwords.google.com/api/adwords/cm/v201509" xmlns:ns2="https://adwords.google.com/api/adwords/mcm/v201509">
              <ns2:rval>
                  <totalNumEntries>2</totalNumEntries>
                  <Page.Type>ManagedCustomerPage</Page.Type>
                  <ns2:entries>
                      <ns2:name>Test1</ns2:name>
                      <ns2:customerId>1234566789</ns2:customerId>
                  </ns2:entries>
                  <ns2:entries>
                      <ns2:name>Test2</ns2:name>
                      <ns2:customerId>987654321</ns2:customerId>
                  </ns2:entries>
              </ns2:rval>
          </ns2:getResponse>
      </soap:Body>
  </soap:Envelope>`;

var doc = (new window.DOMParser()).parseFromString(data, "text/xml")
var entries = doc.querySelectorAll('rval > entries');

if (entries.length !== 2) {
  console.log('Invalid number of entries, should be 2, got ' + entries.length);
} else {
  console.log('Passed!');
}

@domenic
Copy link
Member

domenic commented Aug 15, 2016

That test case "fails" in Chrome and Firefox for me: http://jsbin.com/timizagego/edit?html,console

@ianks
Copy link

ianks commented Aug 15, 2016

Yeah, Monday got the best of me. Should be 2 entries, not one. I updated the example:

http://jsbin.com/duvavigasi/edit?html,console

@domenic
Copy link
Member

domenic commented Aug 15, 2016

Ah great! I'll file a separate issue.

@ianks
Copy link

ianks commented Aug 15, 2016

@domenic Thanks for holding my hand through that one. You are an excellent OSS contributor!

@Joris-van-der-Wel
Copy link
Member

Thank you for the info. Here is what I meant by SymbolTreeNode and Document types:

I am guessing that your tool is showing symbol keys as "undefined":

const mySymbol = Symbol();
// try inspecting this object in your tool:
const obj = {};
obj[mySymbol] = 123; // your tool would probably display this as {undefined: 123}
// you can also play around with:
Object.getOwnPropertySymbols(obj);

@DylanPiercey
Copy link

@domenic - great work getting DOMParser included. I was using it though and realized that it doesn't handle syntax errors the same way. You can check out what it should be like here.

Is this possible to add?

@domenic
Copy link
Member

domenic commented Oct 22, 2016

@DylanPiercey please file a separate issue with a test case. We already attempt to produce parsererror elements when possible, but I think there are some edge cases where it's not working.

@DylanPiercey
Copy link

DylanPiercey commented Oct 22, 2016

@domenic I was mostly just messing arround with stuff like

<div>
    <a></>
</div>

Which didn't throw a parseerror.

I can open a new issue, but if it's already implemented then thats good enough for me.
Do you think you could point me to an actual parse error that is handled by jsdom so that I can use it for some unit testing?

@jonnermut
Copy link

@domenic is there any way to avoid the lower casing of tag names when using the workaround for XMLSerializer.serializeToString proposed by the original poster?
Started having a look at parse5, and got lost in the api docs...

@domenic
Copy link
Member

domenic commented Nov 3, 2016

I'm not aware of any, no. We need to use an actual XML serializer, not a HTML one like parse5, to get XML-correct casing.

@jonnermut
Copy link

jonnermut commented Nov 5, 2016

I managed to get mostly want using jsdom's DOMParser and https://github.com/cburgmer/xmlserializer

One thorny one was that xmlserializer lower cases tags with the namespaceURI set to 'http://www.w3.org/1999/xhtml', which it seems that jsdom sets on all tags created via Document.createElement.

Here's some rough code illustrating this:

#!/usr/local/bin/node
var jsdom = require("jsdom");
var xmlserializer = require('xmlserializer');

function XMLSerializer() {
}

XMLSerializer.prototype.serializeToString = function(node) {
    return xmlserializer.serializeToString(node);
};

jsdom.env({
  html: `<html><body></body></html>`,
   virtualConsole: jsdom.createVirtualConsole().sendTo(console),
  done: function (err, window) {
     var DOMParser = window.DOMParser
     var xml = '<myTag myAttr="hello"></myTag>'
     var parser = new window.DOMParser();
      var doc = parser.parseFromString(xml, "application/xml");

      // swizzle the doc.createElement func, otherwise <otherElement> comes out as <otherelement>
      doc.createElement = function(localName) { return doc.createElementNS('', localName) }

      var el2 = doc.createElement('otherElement')
      doc.firstChild.appendChild(el2)

      var xmlSerializer = new XMLSerializer()
      console.log(xmlSerializer.serializeToString(doc))

  }
});

Outputs:

<myTag xmlns="null" myAttr="hello"><otherElement/></myTag>

Which is mostly want I want - case preserved - except for the xmlns="null" which I think is a recent bug in xmlserialiser

nhunzaker pushed a commit to nhunzaker/jsdom that referenced this issue Dec 22, 2016
It's missing:

- <parsererror> elements when parsing XML strings fails
- copying the active document's URL

Also adds the empty-per-spec XMLDocument interface.

Closes jsdom#1341. Part of jsdom#1368.
@tommedema
Copy link

tommedema commented Mar 15, 2018

I tried to polyfill XMLParser as suggested:

function XMLSerializer() {
}

XMLSerializer.prototype.serializeToString = function(node) {
    var text = jsdom.serializeDocument(node);
    return text;
};

Unfortunately serializeDocument is no longer part of the API. serialize() only works on the whole document; and it is not possible to create a new document from a HTML node (only from a string). It's not possible to turn a node into a string without a XML parser. Especially if you are trying to serialize a doctype.

My solution was to use xmldom:

import { XMLSerializer } from 'xmldom';
global.XMLSerializer = XMLSerializer;

@lehni
Copy link
Contributor Author

lehni commented Mar 16, 2018

My current version looks like this:

function XMLSerializer() {
};

XMLSerializer.prototype.serializeToString = function(node) {
  if (!node) {
    return '';
  }
  // Fix a jsdom issue where all SVG tagNames are lowercased:
  // https://github.com/tmpvar/jsdom/issues/620
  var text = node.outerHTML;
  var tagNames = ['linearGradient', 'radialGradient', 'clipPath', 'textPath'];
  for (var i = 0, l = tagNames.length; i < l; i++) {
    var tagName = tagNames[i];
    text = text.replace(
      new RegExp('(<|</)' + tagName.toLowerCase() + '\\b', 'g'),
      function(match, start) {
        return start + tagName;
      }
    );
  }
  return text;
};

@tommedema
Copy link

@lehni that doesn't return doctypes if you pass in document.documentElement as node, even though browsers do

@lehni
Copy link
Contributor Author

lehni commented Mar 17, 2018

I'm not saying it's perfect, but it works for my use-case

@teclone
Copy link

teclone commented Jul 12, 2018

@domenic, @tommedema, @ianks, @lehni. There is a complete javascript xml-serializer that i created and is avaliable as an npm package. It follows the w3c spec and even added some improvements. its serialization is neat and accurate with 99% test coverage.

@domenic
Copy link
Member

domenic commented Jul 12, 2018

Oh dear, I wish we'd known about that a month ago! @Sebmaster created our own package, w3c-xmlserializer, and #2282 has the code to integrate it all ready to go... I just need to find some time to work on jsdom -_-.

@teclone
Copy link

teclone commented Jul 12, 2018

Alright @domenic. sorry it came late.

@Sebmaster
Copy link
Member

I think this is it! Closing due to the release of v13.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants