Shop for JS DOM implementations #34
Comments
Hello. Have you given more thought to this issue? I have not yet understood how Fathom's structures are implemented. I understand that it needs a selector-like API on a DOM-like tree, but it seems to also need a selector API on types (?). cheerio uses https://www.npmjs.com/package/css-select as its selector API, which implements a subset of the https://github.com/jquery/sizzle selector API. Could you explain what interface the document's object model must meet for Fathom to work with it? Is Fathom what we could call a generic rule-based tree annotator, or does it have hard bindings to the DOM as we know it in HTML? |
Fathom is really quite generic. The only parts that interact with the DOM API are…
I have actually been thinking about switching Fathom to cheerio for speed. Do you happen to know whether cheerio can act on an arbitrary DOM tree, regardless of its implementation? It looks like it prefers instead to parse text. I wonder why, since css-select seems content to act on DOM-compliant nodes. Is cheerio doing some sort of indexing or caching? I'd like to maintain Fathom's ability to work with native DOM trees or at least wrap them transparently, since one of its major use cases is running within Firefox. To answer your question, I don't think it would be hard to decouple dom() from querySelectorAll(), but I'd have to do some thinking to decide how best to change the API. One nicety would be to support straight DOM access for certain rules but cheerio wrappings for others and be able to have callbacks for each sort coexist in one ruleset. I'd happily review a proposal, couched as a PR. Thanks for your interest! |
I think that cheerio was initially developed as a server-side, jQuery-style HTML walker. It is mainly a wrapper around htmlparser2 and css-select, which would explain why the main use case was to start with strings. I understand your requirement for Fathom to be able to work with a pre-parsed native DOM tree. I must admit that I don't know where cheerio's speed comes from. Is it from the parsing phase or, as css-select boasts, from a different way of doing selector search? I would certainly not hardcode cheerio inside Fathom, because cheerio has its own set of issues, so having a pluggable system is certainly the way to go towards a more generic tree annotator. I looked at the internals of cheerio. The "string"-based approach seems to be deeply coded inside cheerio. For example, cloned DOMs are serialized and re-parsed. Modifications like the cheerio-dom structure corresponds to the htmlparser2 dom structure which is defined by https://github.com/fb55/domhandler the I am not sure if cheerio can wrap a native DOM tree.
It seems that there is a https://github.com/nrkn/css-select-browser-adapter that is a Nevertheless, I think I need to dig a little more into how Fathom uses the DOM (directly or via callbacks) to better understand what is used and to check what it would mean for ruleset interoperability to have several not-fully-compatible DOM-like engines. Is there a complex real-life example that you know about and that I could look at to see what we are talking about DOM-wise? |
Note: after digging some more into cheerio,
|
Let me start with a correction: Fathom's clustering machinery uses the DOM API heavily, including css-select-browser-adapter looks short and sane. Thanks for test-driving it! Things like clustering would still need a way to walk around the tree, but I see cheerio has affordances for that. I'm not sure it has something like
I think this Readability workalike is your best bet for now. Drill into the functions it calls on DOM nodes, like domSort(), clusters(), and inlineTextLength(). Hmm, I suppose we had other DOM-dependent things after all, mostly in the utils module. They're technically optional, but they sure come in handy. |
I am now thinking hard about switching to cheerio, since jsdom is irredeemably leaky: jsdom/jsdom#1682. It crashed my attempts at auto-tuning my rule coefficients until I coded around it by preallocating all the DOMs rather than rebuilding them each time through the loop (which is fine and probably more efficient, at least until I have a zillion of them, but doesn't apply to other use cases). But I don't want to do that sort of trick for more general purposes; I'd have to keep a pool of jsdom docs around with all the attendant complexity. I tried switching to jsdom.env(), rumored to be free of leaks, but it is intrinsically async, returning null and depending on a callback to do anything. This makes my decidedly non-event-loopy optimization loop hang. And in any case, jsdom.env() leaks faster than ever, even when given a callback that calls window.close(). So I am more motivated than ever to explore this. :-) |
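The preallocation workaround described above can be sketched roughly as follows. This is only an illustration of the pattern, not the actual tuning code: `parseHtml` and `score` are hypothetical stand-ins (for an expensive parser such as jsdom and for a ruleset run), and the sketch assumes that scoring does not mutate the parsed documents.

```javascript
// Hypothetical stand-in for an expensive parser such as jsdom. It only
// counts calls so the effect of preallocation is visible.
let parseCount = 0;
function parseHtml(html) {
  parseCount += 1;
  return { html };
}

// Hypothetical scoring function: any pure function of (doc, coefficients).
function score(doc, coeffs) {
  return doc.html.length * coeffs[0];
}

const samples = ['<p>Hello</p>', '<p>Goodbye</p>'];

// Parse every sample once, before the optimization loop starts.
const docs = samples.map(parseHtml);

let best = { coeffs: null, total: -Infinity };
for (let c = 1; c <= 100; c++) {
  const coeffs = [c];
  // Reuse the preallocated docs; nothing is re-parsed inside the loop.
  const total = docs.reduce((sum, doc) => sum + score(doc, coeffs), 0);
  if (total > best.total) best = { coeffs, total };
}

console.log(parseCount); // 2, not 200
```

The trade-off is the one noted in the comment: every parsed document stays alive for the whole run, so memory grows with the size of the corpus rather than with the number of iterations.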
htmlparser2, which cheerio uses, is a nice project. It parses quickly. It claims to provide a DOM Level 1 interface, though I'm not seeing it, playing locally with its DomHandler. Next I'm going to have a look at the other contenders on the htmlparser-benchmark list. Do you know anything about any of them? |
I don't think htmlparser2 is the silver bullet either. It seems to have its own share of memory leaks, according to cheeriojs/cheerio#830. Last time I looked at these parsing tools (maybe a year ago), parse5 was the new kid on the block (in a good sense). As I understand it, Would the API in https://github.com/nrkn/css-select-browser-adapter be sufficient for the traversal needs of Fathom? I believe that the selectors offered by https://github.com/fb55/css-select would be sufficient compared to the native querySelectorAll that you are using, but I must admit that I don't fully understand how they differ. From my perspective, I would think that querySelectorAll would always be faster, because browsers adopted the native querySelectorAll after jQuery introduced this sort of API in JS land. But I don't know how they differ feature-wise. |
Regarding Node.contains, here is the implementation in cheerio - https://github.com/cheeriojs/cheerio/blob/765cdaaac56acdaf8779867c6205b7edde25e91f/lib/static.js#L167 . It should be easy enough to code through a css-select adapter. |
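For illustration, a containment check over a parent-linked tree could look like the sketch below. It assumes nodes shaped like domhandler output, with a `children` array and a `parent` back-pointer; the `el` helper is hypothetical and exists only to build a test tree. The sketch follows the DOM's inclusive semantics, where a node contains itself, which is not necessarily what the linked cheerio function does.

```javascript
// Build a tiny tree whose nodes carry `children` and a `parent`
// back-pointer. `el` is a hypothetical helper, not a library function.
function el(name, children = []) {
  const node = { type: 'tag', name, children, parent: null };
  for (const child of children) child.parent = node;
  return node;
}

// Inclusive containment, like the DOM's Node.contains():
// walk up from the candidate until the container or the root is reached.
function contains(container, candidate) {
  for (let node = candidate; node; node = node.parent) {
    if (node === container) return true;
  }
  return false;
}

const p = el('p');
const body = el('body', [p, el('p')]);
const html = el('html', [body]);

console.log(contains(html, p)); // true
console.log(contains(p, html)); // false
```

The cost is proportional to the depth of the candidate node, and no selector engine is involved.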
Yes.
I'd have to really dig through util.js to tell. Node.compareDocumentPosition() is one I immediately see is missing.
I don't really care if some silly corner of CSS3 isn't supported by one or the other. As long as we can select on tags and attrs and attr values, it's fine. |
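Where Node.compareDocumentPosition() is missing, document order can still be derived from parent links alone. The sketch below is one way to do that, assuming nodes with `children` and `parent` properties as in domhandler output; `el`, `pathFromRoot`, and `compareDocumentOrder` are hypothetical names, and the function returns a sort-style -1/0/1 rather than the DOM's bitmask.

```javascript
// Hypothetical helper that builds nodes with `children` and `parent`.
function el(name, children = []) {
  const node = { type: 'tag', name, children, parent: null };
  for (const child of children) child.parent = node;
  return node;
}

// List of ancestors from the root down to (and including) the node.
function pathFromRoot(node) {
  const path = [];
  for (let n = node; n; n = n.parent) path.unshift(n);
  return path;
}

// Returns -1 if a precedes b in document order, 1 if it follows, 0 if equal.
function compareDocumentOrder(a, b) {
  if (a === b) return 0;
  const pa = pathFromRoot(a);
  const pb = pathFromRoot(b);
  if (pa[0] !== pb[0]) throw new Error('nodes are in different trees');
  // Skip the shared prefix of the two ancestor paths.
  let i = 0;
  while (i < pa.length && i < pb.length && pa[i] === pb[i]) i++;
  if (i === pa.length) return -1; // a is an ancestor of b, so it comes first
  if (i === pb.length) return 1;  // b is an ancestor of a
  // Otherwise compare the diverging children of the last common ancestor.
  const siblings = pa[i - 1].children;
  return siblings.indexOf(pa[i]) < siblings.indexOf(pb[i]) ? -1 : 1;
}

const p1 = el('p');
const p2 = el('p');
const body = el('body', [p1, p2]);
const html = el('html', [body]);

const sorted = [p2, body, p1].sort(compareDocumentOrder);
console.log(sorted.map((n) => (n === body ? 'body' : n === p1 ? 'p1' : 'p2')));
// [ 'body', 'p1', 'p2' ]
```

A comparator like this would be enough for sorting nodes into document order, at the cost of rebuilding the ancestor paths on every comparison.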
No luck:
var htmlparser = require("htmlparser2");
var rawHtml = "<html><body><p>Hello</p><p>Goodbye</p></body></html>";
var handler = new htmlparser.DomHandler(function (error, dom) {
    if (error)
        console.log('error')
    else {
        console.log(dom);
        debugger;
    }
},
{withDomLvl1: true}
);
var parser = new htmlparser.Parser(handler);
parser.write(rawHtml);
parser.done();
> dom.ELEMENT_NODE
undefined
> dom.childNodes
undefined
Even if it's a Document, Document is a subinterface of Node and should have those properties. |
The dom level 1 compatibility indeed seems sparse - https://github.com/fb55/domhandler/blob/master/lib/node.js |
Yowzers. That's not nearly enough for our purposes.
|
Best domino port so far: #73 (comment) |
We use jsdom for our test cases (and, by accident, effectively recommend it for people who need to feed Fathom a string rather than a predigested DOM object). But jsdom is pretty slow, and the DOM API itself is not great. cheerio has been requested by one group inside Mozilla. There's also domjs and domino. Write pro/con lists and choose one. Here's a starting set of goals. We can edit it as needed.