Fix readability to work with real dom (fixes #72) by n1k0 · Pull Request #80 · mozilla/readability

n1k0 · 2015-03-27T10:09:10Z

This patch replaces #78. It removes HTML comments from JSDOM generated document.

r=? @gijsk @leibovic

gijsk · 2015-03-30T18:51:20Z

I'd take the comment fix; it's really just for the benefit of the testcase.

I'm still worried about perf here, though...

gijsk · 2015-03-30T18:55:22Z

Summarizing issues from #78:

perf
HTML encoding issues
???

DavidBruant · 2015-03-30T19:43:09Z

Nit: jsdom? (lowercase)

DavidBruant · 2015-03-30T19:55:59Z

Regarding perf, I don't understand the problem.
From what I understand, at the branch in this PR, there are 22 test pages. Each page has its DOM built via both jsdom and JSDOMParser and the each of these 2 Documents goes through the Readability().parse(), so it's 44 DOM building + readability (+22 removing comments). On my machine, running all that takes under 700ms. Granted, it's in command line, not in the browser fighting against other threads and addons for CPU, but I find that extremely fast.
What is the performance problem? Is there a page or a use case in particular that has performance issues?

DavidBruant · 2015-03-30T19:59:08Z

Can you elaborate? :-) I'm sure they'll want to know.
I can reduce a test case if I'm told where to begin.

Can you elaborate? :-)

For some reason, jsdom retrieves a bunch of undefined items in childNodes list for no obvious reason (please note I'm not fascinated with the idea of investigating this further atm, but maybe you are?).

I'm sure they'll want to know.

They don't. The 3.x branch of jsdom is claimed being not officially maintained anymore since they moved to the io.js platform for 4.x.

But you're invited to fork and maintain your own version if you will (hint: don't).

(please note I'm not fascinated with the idea of investigating this further atm, but maybe you are?).

I am. And it was stupid of me asking to be told where to begin. I should just comment your line, notice the bug in the tests and start from there.
I am interested, because jsdom is the closest attempt I'm aware of of a conformant standalone DOM in Node.js.

But you're invited to fork and maintain your own version if you will (hint: don't).

:-p I'll try to repro the bug you found, find out whether it's still in 4.x and report to them if so.

... wait... isn't it just a bug due to removing elements in a for-loop?
You cache the length, start the loop, remove elements. If one element got removed, at the end of the loop, node.childNodes[n] can only return undefined, because the node.childNodes isn't of the same length that it was at the beginning of the loop any longer. Possibly, you're also forgetting to remove some comments, by skipping one element after one was removed.

You can clone the list via Array.prototype.slice.call (or Array.from) and iterate over that (with forEach, Yay \o/)

I'm wrong about forgetting some elements because of the n-- below, but I think the length explanation still stands.

Fixed using a forEach loop.

gijsk · 2015-03-31T15:11:41Z

What is the performance problem?

I want at least some idea of how the JSDOMParser/Readability.js changes here impact performance, in relative terms instead of "on our really fast machines, when run in isolation on node.js, it takes 20ms so there's no problem". :-)

gijsk · 2015-03-31T15:18:27Z

In particular, I suspect the HTML encoding I added in JSDOMParser.js could be very slow.

gijsk · 2015-04-02T13:55:20Z

Now that we have benchmarks, it's pretty easy to see that this indeed slows down JSDOMParser, though it seems to speed up Readability a little, which is somewhat surprising. I'm going to look at using some more laziness in the textContent/innerHTML stuff, as well as ensuring entities like ¨ and ৸ all work.

gijsk · 2015-04-02T20:38:43Z

So, this version makes JSDOMParser faster than it currently is, and Readability slightly slower. All in all, I'm pretty happy with the code here at this point. @n1k0, @leibovic can one of you review? :-)

n1k0 · 2015-04-03T13:34:59Z

Now we have benchmarks, is performance really bad if we remove this and use the generic behavior? I still think we shouldn't distinguish between parsers within Readability…

Hmm okay got it, JSDOMParser doesn't support what's required below.

Which part does it not support? I think it should work in theory, but I'd be very surprised if it wasn't slower. That said, the benchmarks are super noisy on my machine. I get results anywhere between 170 and 250 ops/s for the JSDOMParser part on the reference benchmark, with an average of about 200, all on the same changeset.

Trying to fix that by increasing the iterations massively increases the number of ops/s, which makes me think we're hitting some kind of JIT optimization, which isn't really realistic (we won't normally readerize the same document a bunch of times in a row).

Which part does it not support?

JSDOMParser doesn't seem to support ownerDocument, though that's probably no big deal adding support for it.

we won't normally readerize the same document a bunch of times in a row

True :)

n1k0 · 2015-04-03T14:46:51Z

Looks good to me. Some comments you might want to address before landing. r+

…+ add tests

…ular: no firstElementChild implementation...)

For instance, jsdom's more spec-compliant parsing causes issues with auto-closing elements (lifehacker article) and with not having self-closing <img> and <br> tags. The former was fixed by removing offending markup, the latter by adjusting JSDOMParser to be more sane, and the expected outputs to cope with this. Finally, JSDOMParser automatically drops comments. The test code needed to manually do this in the jsdom case.

n1k0 force-pushed the fix-readability-to-work-with-real-dom branch from 4e4ce80 to 925e222 Compare March 27, 2015 10:11

DavidBruant mentioned this pull request Mar 29, 2015

JSDOMParser errors on http://www.newyorker.com/magazine/2014/12/01/quiet-german #77

Closed

gijsk mentioned this pull request Mar 30, 2015

Fix readability to work with real dom (fixes #72) #78

Closed

DavidBruant reviewed Mar 30, 2015
View reviewed changes

Comment thread test/test-readability.js Outdated

Copy link
Copy Markdown

Contributor

DavidBruant Mar 30, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: jsdom? (lowercase)

DavidBruant reviewed Mar 30, 2015
View reviewed changes

n1k0 force-pushed the fix-readability-to-work-with-real-dom branch 3 times, most recently from f461263 to 6c810a9 Compare March 31, 2015 08:54

gijsk force-pushed the fix-readability-to-work-with-real-dom branch from 6c810a9 to eb07c70 Compare April 1, 2015 23:28

gijsk force-pushed the fix-readability-to-work-with-real-dom branch 2 times, most recently from f53f3ce to 62e8b5d Compare April 2, 2015 17:25

n1k0 reviewed Apr 3, 2015
View reviewed changes

gijsk added 5 commits April 3, 2015 22:32

Fix JSDOMParser to do some HTML escaping and identify as JSDOMParser …

62801fa

…+ add tests

Fix Readability.js to work with jsdom's DOM implementation (in partic…

7c60dba

…ular: no firstElementChild implementation...)

Turn off JS when using JSDOM in tests.

225adef

Fix embedded video testcase to work with both jsdom and JSDOMParser

a04f9d9

Conversation

n1k0 commented Mar 27, 2015

Uh oh!

gijsk commented Mar 30, 2015

Uh oh!

gijsk commented Mar 30, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DavidBruant commented Mar 30, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gijsk commented Mar 31, 2015

Uh oh!

gijsk commented Mar 31, 2015

Uh oh!

gijsk commented Apr 2, 2015

Uh oh!

gijsk commented Apr 2, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

n1k0 commented Apr 3, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants