Fix readability to work with real dom (fixes #72) #80
I'd take the comment fix; it's really just for the benefit of the testcase. I'm still worried about perf here, though...
Summarizing issues from #78:
Nit: jsdom? (lowercase)
Regarding perf, I don't understand the problem.
Can you elaborate? :-) I'm sure they'll want to know.
I can reduce a test case if I'm told where to begin.
> Can you elaborate? :-)
For some reason, jsdom returns a bunch of undefined items in the childNodes list (please note I'm not fascinated with the idea of investigating this further atm, but maybe you are?).
> I'm sure they'll want to know.
They don't. The 3.x branch of jsdom is reportedly no longer officially maintained, since they moved to the io.js platform for 4.x.
But you're invited to fork and maintain your own version if you will (hint: don't).
> (please note I'm not fascinated with the idea of investigating this further atm, but maybe you are?).
I am. And it was stupid of me to ask to be told where to begin. I should just comment out your line, notice the bug in the tests and start from there.
I am interested, because jsdom is the closest attempt I'm aware of at a conformant standalone DOM in Node.js.
> But you're invited to fork and maintain your own version if you will (hint: don't).
:-p I'll try to repro the bug you found, find out whether it's still in 4.x and report to them if so.
... wait... isn't it just a bug due to removing elements in a for-loop?
You cache the length, start the loop, and remove elements. Once an element has been removed, node.childNodes is shorter than it was when the loop started, so by the end of the loop node.childNodes[n] can only return undefined. Possibly, you're also forgetting to remove some comments, by skipping one element after one was removed.
You can clone the list via Array.prototype.slice.call (or Array.from) and iterate over that (with forEach, Yay \o/)
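The cached-length explanation can be reproduced without jsdom. A minimal sketch, assuming a plain array stands in for the live childNodes list (makeParent, visitWithCachedLength and removeCommentsSnapshot are hypothetical names, not code from this patch):

```javascript
const COMMENT_NODE = 8; // same value as Node.COMMENT_NODE in the DOM

// Hypothetical stand-in for a parent node whose childNodes list is "live":
// removing a child shrinks the list immediately.
function makeParent(nodeTypes) {
  const parent = { childNodes: nodeTypes.map((t) => ({ nodeType: t })) };
  parent.removeChild = (node) => {
    parent.childNodes.splice(parent.childNodes.indexOf(node), 1);
  };
  return parent;
}

// Problematic pattern: the length is cached before any removal happens.
// The i-- keeps the loop from skipping a sibling, but the cached length is
// now too large, so the final reads fall off the end of the list.
function visitWithCachedLength(parent) {
  const seen = [];
  for (let i = 0, len = parent.childNodes.length; i < len; i++) {
    const node = parent.childNodes[i];
    seen.push(node); // may be undefined once the list has shrunk
    if (node && node.nodeType === COMMENT_NODE) {
      parent.removeChild(node);
      i--;
    }
  }
  return seen;
}

// Suggested pattern: copy the list first, then iterate over the copy.
function removeCommentsSnapshot(parent) {
  Array.from(parent.childNodes).forEach((node) => {
    if (node.nodeType === COMMENT_NODE) {
      parent.removeChild(node);
    }
  });
}
```

With one comment among three children, visitWithCachedLength reads four entries and the last one is undefined, while the snapshot version only ever sees real nodes.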
I'm wrong about forgetting some elements because of the n-- below, but I think the length explanation still stands.
Fixed using a forEach loop.
I want at least some idea of how the JSDOMParser/Readability.js changes here impact performance, in relative terms instead of "on our really fast machines, when run in isolation on node.js, it takes 20ms so there's no problem". :-)
In particular, I suspect the HTML encoding I added in JSDOMParser.js could be very slow.
Now that we have benchmarks, it's pretty easy to see that this indeed slows down JSDOMParser, though it seems to speed up Readability a little, which is somewhat surprising. I'm going to look at using some more laziness in the textContent/innerHTML stuff, as well as ensuring entities like |
Now we have benchmarks, is performance really bad if we remove this and use the generic behavior? I still think we shouldn't distinguish between parsers within Readability…
Hmm okay got it, JSDOMParser doesn't support what's required below.
Which part does it not support? I think it should work in theory, but I'd be very surprised if it wasn't slower. That said, the benchmarks are super noisy on my machine. I get results anywhere between 170 and 250 ops/s for the JSDOMParser part on the reference benchmark, with an average of about 200, all on the same changeset.
Trying to fix that by increasing the iterations massively increases the number of ops/s, which makes me think we're hitting some kind of JIT optimization, which isn't really realistic (we won't normally readerize the same document a bunch of times in a row).
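To illustrate the warm-up concern, here is a rough sketch that times a short run and a long run of the same function separately. opsPerSecond, compareRuns and sampleWorkload are hypothetical stand-ins, not the project's benchmark harness, and the numbers will differ from machine to machine:

```javascript
// Hypothetical helper: call fn `iterations` times and return operations per second.
function opsPerSecond(fn, iterations) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    fn();
  }
  // Guard against a zero reading on very short runs.
  const elapsedNs = Math.max(Number(process.hrtime.bigint() - start), 1);
  return iterations / (elapsedNs / 1e9);
}

// Measure a short run first (little time for the JIT to optimize), then a
// long run of the same function. A large gap between the two suggests the
// long run is mostly measuring optimized code.
function compareRuns(fn, shortRuns, longRuns) {
  const short = opsPerSecond(fn, shortRuns);
  const long = opsPerSecond(fn, longRuns);
  return { short, long };
}

// Stand-in workload (an assumption, not the real parser): split some markup.
function sampleWorkload() {
  return "<p>a &amp; b</p>".repeat(200).split("<").length;
}

const result = compareRuns(sampleWorkload, 5, 5000);
console.log(result.short.toFixed(0), "ops/s (short run)");
console.log(result.long.toFixed(0), "ops/s (long run)");
```

Since a reader mode normally processes a given document once, the short-run figure is arguably the more representative of the two.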
> Which part does it not support?
JSDOMParser doesn't seem to support ownerDocument, though adding support for it is probably no big deal.
> we won't normally readerize the same document a bunch of times in a row
True :)
Looks good to me. Some comments you might want to address before landing. r+
…ular: no firstElementChild implementation...)
For instance, jsdom's more spec-compliant parsing causes issues with auto-closing elements (lifehacker article) and with not having self-closing <img> and <br> tags. The former was fixed by removing offending markup, the latter by adjusting JSDOMParser to be more sane, and the expected outputs to cope with this. Finally, JSDOMParser automatically drops comments. The test code needed to manually do this in the jsdom case.
This patch replaces #78. It removes HTML comments from the jsdom-generated document.
r=? @gijsk @leibovic