Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shop for JS DOM implementations #34

Open
erikrose opened this issue Jul 18, 2016 · 14 comments
Open

Shop for JS DOM implementations #34

erikrose opened this issue Jul 18, 2016 · 14 comments

Comments

@erikrose
Copy link
Contributor

erikrose commented Jul 18, 2016

We use jsdom for our test cases (and, by accident, effectively recommend it for people who need to feed fathom a string rather than a predisgested DOM object). But jsdom is pretty slow, and the DOM API itself is not great. cheerio has been requested by one group inside Mozilla. There's also domjs and domino. Write pro/con lists and choose one. Here's a starting set of goals. We can edit it as needed.

  • Be fast.
  • Be well-maintained.
  • Remain able to take in DOMs produced by JS running in a browser. IOW, if we express ranker functions in terms of cheerio's API, we need a way to get from a native DOM object to "a cheerio" so those rankers can still be used.
@jeromew
Copy link

jeromew commented Mar 21, 2017

Hello,
Thanks for this project ! It seems like a very nice way to "abstract" what readability is doing.
My initial reaction was to see how I could use cheerio with this since it is the de-factor fast parser / walker I use for readibility-like page analysis.

Have you given more thought on this issue ? I have not yet understood how fathom's structures are implemented. I understand that it needs a selector-like API on a dom-like tree but it seems to also need a selector API on types (?).

cheerio uses https://www.npmjs.com/package/css-select as its selector-API, which implements a subset of the https://github.com/jquery/sizzle selector-API.

Could you explain what is the interface between fathom and the object model that the document needs to meet ? Is fathom what we could call a generic rule-based tree annotator or does it have hard bindings to the DOM as we know it in html ?

@erikrose
Copy link
Contributor Author

Fathom is really quite generic. The only parts that interact with the DOM API are…

  1. The querySelectorAll call in the dom() LHS. But there's no innate need to use the dom() LHS if you have some other source of data. You'd need to write that source into a new kind of LHS, and Fathom isn't pluggable in that direction atm, but there's no brittle piece of design keeping it from happening. Ultimately, I would like to push Fathom as a more generic data structure annotator and queryer, so this would be a useful thing to make pluggable.
  2. And any callbacks (for scoring, etc.) that call DOM methods on the nodes passed in

I have actually been thinking about switching Fathom to cheerio for speed. Do you happen to know whether cheerio can act on an arbitrary DOM tree, regardless of its implementation? It looks like it prefers instead to parse text. I wonder why, since css-select seems content to act on DOM-compliant nodes. Is cheerio doing some sort of indexing or caching? I'd like to maintain Fathom's ability to work with native DOM trees or at least wrap them transparently, since one of its major use cases is running within Firefox.

To answer your question, I don't think it would be hard to decouple dom() from querySelectorAll(), but I'd have to do some thinking to decide how best to change the API. One nicety would be to support straight DOM access for certain rules but cheerio wrappings for others and be able to have callbacks for each sort coexist in one ruleset. I'd happily review a proposal, couched as a PR. Thanks for your interest!

@jeromew
Copy link

jeromew commented Mar 22, 2017

I think that cheerio was initially developped as a server-side jquery style html walker. It is mainly a wrapper around htmlparser2 and css-select and that would explain why the main use case was to start with strings.

I understand your requirement for fathom to be able to work with a pre-parsed native DOM tree.

I must admit that I don't know where the speed from cheerio comes from. Is it from the parsing phase, or as css-select boasts, from a different way of doing selector research ?

I would certainly not hardcode cheerio inside fathom because cheerio has its own set of issues. so having a pluggable system is certainly the way to go towards a more generic tree annotator.

I looked at the internals of cheerio. The "string" based approach seems to be deeply coded inside cheerio. For example, cloned doms are serialized + re-parsed. Modifications like appendTo(target) can append to strings that will be parsed internally. So there seem to be a lot of duck-typing between strings and the cheerio-dom impl.

the cheerio-dom structure corresponds to the htmlparser2 dom structure which is defined by https://github.com/fb55/domhandler

the find method, which is close to querySelectorAll, is imply implemented as css-select(selector, elems, options)

I am not sure if cheerio can wrap a native DOM tree. css-select advertises the fact that it can be used with different adapters

In its default configuration, css-select queries the DOM structure of the domhandler module (also known as htmlparser2 DOM). It uses domutils as its default adapter over the DOM structure. See Options below for details on querying alternative DOM structures.

It seems that there is a https://github.com/nrkn/css-select-browser-adapter that is a css-select adapter for a native DOM tree, so at least css-select can work both on native DOM and on cheerio's dom.

Nevertheless, I think I need to dig a little more into how fathom uses DOM (directly or via callbacks) to better understand what is used and check what it would mean for rulesets interoperability to have several non fully-compatible DOM-like engines.

Is there a complex real-life example that you know about and that I could look at to see what we are talking about DOM-wise ?

@jeromew
Copy link

jeromew commented Mar 22, 2017

Note: after digging some more into cheerio,

  1. I confirm that css-select 1.3.0-rc0 can traverse a "native DOM" tree when using the css-select-browser-adapter adapter (I successfully used css-select to traverse a jsdom instance)

  2. I can't find a way to instanciate cheerio with a jsdom instance and have it transparently hand it over to css-select. Cheerio wants to wrap the DOM context inside a cheerio object and there is currently no code path for a transparent wrapping of a jsdom node tree (that I consider equivalent to a native DOM tree). There may be a way to PR the cheerio code to accepts this for traversal but it seems a lot more complicated for all the cheerio dom tree modifiers.

@erikrose
Copy link
Contributor Author

erikrose commented Mar 23, 2017

Let me start with a correction: Fathom's clustering machinery uses the DOM API heavily, including compareDocumentPosition(), which I've seen go unimplemented in JS DOM implementations.

css-select-browser-adapter looks short and sane. Thanks for test-driving it! Things like clustering would still need a way to walk around the tree, but I see cheerio has affordances for that. I'm not sure it has something like Node.contains(anotherNode), but one could certainly write one.

Is there a complex real-life example that you know about and that I could look at to see what we are talking about DOM-wise?

I think this Readability workalike is your best bet for now. Drill into the functions it calls on DOM nodes, like domSort(), clusters(), and inlineTextLength(). Hmm, I suppose we had other DOM-dependent things after all, mostly in the utils module. They're technically optional, but they sure come in handy.

@erikrose
Copy link
Contributor Author

erikrose commented Mar 29, 2017

I am now thinking hard about switching to cheerio, since jsdom is irredeemably leaky: jsdom/jsdom#1682. It crashed my attempts at auto-tuning my rule coefficients until I coded around it by preallocating all the DOMs rather than rebuilding them each time through the loop (which is fine and probably more efficient, at least until I have a zillion of them, but doesn't apply to other use cases). But I don't want to do that sort of trick for more general purposes; I'd have to keep a pool of jsdom docs around with all the attendant complexity. I tried switching to jsdoc.env(), rumored to be free of leaks, but it is intrinsically async, returning null and depending on a callback to do anything. This makes my decided non-event-loopy optimization loop hang. And in any case, jsdom.env() leaks faster than ever, even when given a callback that calls window.close().

So I am more motivated than ever to explore this. :-)

@erikrose
Copy link
Contributor Author

htmlparser2, which cheerio uses, is a nice project. It parses quickly. It claims to provide a DOM Level 1 interface, though I'm not seeing it, playing locally with its DomHandler. Next I'm going to have a look at the other contenders on the htmlparser-benchmark list. Do you know anything about any of them?

@jeromew
Copy link

jeromew commented Mar 30, 2017

I don't think htmlparser2 is the silver bullet either. It seems to have its own share of memory leaks according to cheeriojs/cheerio#830.
I didn't know it had DOM Level 1 interface. I haven't tried it. According to https://github.com/fb55/domhandler you need to add a withDomLvl1 option to have DOM Level 1 properties.

Last time I looked at these parsing tools (1 year ago maybe), parse5 was the new kid on the block (in a good sense).

As I understand it, fathom only needs read-only access to the DOM is that correct ? (selecting elements and traversing the DOM in several ways for clustering) so maybe a simpler "read-only cheerio" based on the adapter pattern that css-select is advertising could be the way to go for fathom. It could work with any dom-friendly interface for which and adapter could be written.

Would the API in https://github.com/nrkn/css-select-browser-adapter be sufficient for the traversal needs of fathom ?

I believe that the selectors offered by https://github.com/fb55/css-select would be sufficient compared to the native querySelectorAll that you are using but I must admit that I don't fully understand how they are differing. In my perspective, I would think that querySelectorAll would always be faster because browsers adopted the native querySelectorAll after jQuery introduced this sort of API in js land. But I don't know what is their difference feature-wise.

@jeromew
Copy link

jeromew commented Mar 30, 2017

Regarding Node.contains, here is the implementation in cheerio - https://github.com/cheeriojs/cheerio/blob/765cdaaac56acdaf8779867c6205b7edde25e91f/lib/static.js#L167 . It should be easy enough to code through a css-select adapter.

@erikrose
Copy link
Contributor Author

erikrose commented Mar 30, 2017

As I understand it, fathom only needs read-only access to the DOM is that correct?

Yes.

Would the API in https://github.com/nrkn/css-select-browser-adapter be sufficient for the traversal needs of fathom?

I'd have to really dig through util.js to tell. Node.compareDocumentPosition() is one I immediately see is missing.

sufficient compared to the native querySelectorAll that you are using but I must admit that I don't fully understand how they are differing

I don't really care if some silly corner of CSS3 isn't supported by one or the other. As long as we can select on tags and attrs and attr values, it's fine.

@erikrose
Copy link
Contributor Author

you need to add a withDomLvl1 option to have DOM Level 1 properties

No luck:

var htmlparser = require("htmlparser2");

var rawHtml = "<html><body><p>Hello</p><p>Goodbye</p></body></html>";
var handler = new htmlparser.DomHandler(function (error, dom) {
            if (error)
                console.log('error')
            else {
                console.log(dom);
                debugger;
            }
        },
        {withDomLvl1: true}
    );
var parser = new htmlparser.Parser(handler);
parser.write(rawHtml);
parser.done();

> dom.ELEMENT_NODE
undefined
> dom.childNodes
undefined

Even if it's a Document, Document is a subinterface of Node and should have those properties.

@jeromew
Copy link

jeromew commented Mar 31, 2017

The dom level 1 compatibility indeed seems sparse - https://github.com/fb55/domhandler/blob/master/lib/node.js

@erikrose
Copy link
Contributor Author

erikrose commented Mar 31, 2017 via email

@erikrose
Copy link
Contributor Author

Best domino port so far: #73 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants