Skip to content

Fix readability to work with real dom (fixes #72)#80

Merged
gijsk merged 5 commits into
masterfrom
fix-readability-to-work-with-real-dom
Apr 3, 2015
Merged

Fix readability to work with real dom (fixes #72)#80
gijsk merged 5 commits into
masterfrom
fix-readability-to-work-with-real-dom

Conversation

@n1k0
Copy link
Copy Markdown
Contributor

@n1k0 n1k0 commented Mar 27, 2015

This patch replaces #78. It removes HTML comments from JSDOM generated document.

r=? @gijsk @leibovic

@gijsk
Copy link
Copy Markdown
Contributor

gijsk commented Mar 30, 2015

I'd take the comment fix; it's really just for the benefit of the testcase.

I'm still worried about perf here, though...

@gijsk
Copy link
Copy Markdown
Contributor

gijsk commented Mar 30, 2015

Summarizing issues from #78:

  • perf
  • HTML encoding issues
  • ???

Comment thread test/test-readability.js Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: jsdom? (lowercase)

@DavidBruant
Copy link
Copy Markdown
Contributor

Regarding perf, I don't understand the problem.
From what I understand, at the branch in this PR, there are 22 test pages. Each page has its DOM built via both jsdom and JSDOMParser and the each of these 2 Documents goes through the Readability().parse(), so it's 44 DOM building + readability (+22 removing comments). On my machine, running all that takes under 700ms. Granted, it's in command line, not in the browser fighting against other threads and addons for CPU, but I find that extremely fast.
What is the performance problem? Is there a page or a use case in particular that has performance issues?

Comment thread test/test-readability.js Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate? :-) I'm sure they'll want to know.
I can reduce a test case if I'm told where to begin.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate? :-)

For some reason, jsdom retrieves a bunch of undefined items in childNodes list for no obvious reason (please note I'm not fascinated with the idea of investigating this further atm, but maybe you are?).

I'm sure they'll want to know.

They don't. The 3.x branch of jsdom is claimed being not officially maintained anymore since they moved to the io.js platform for 4.x.

But you're invited to fork and maintain your own version if you will (hint: don't).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(please note I'm not fascinated with the idea of investigating this further atm, but maybe you are?).

I am. And it was stupid of me asking to be told where to begin. I should just comment your line, notice the bug in the tests and start from there.
I am interested, because jsdom is the closest attempt I'm aware of of a conformant standalone DOM in Node.js.

But you're invited to fork and maintain your own version if you will (hint: don't).

:-p I'll try to repro the bug you found, find out whether it's still in 4.x and report to them if so.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... wait... isn't it just a bug due to removing elements in a for-loop?
You cache the length, start the loop, remove elements. If one element got removed, at the end of the loop, node.childNodes[n] can only return undefined, because the node.childNodes isn't of the same length that it was at the beginning of the loop any longer. Possibly, you're also forgetting to remove some comments, by skipping one element after one was removed.

You can clone the list via Array.prototype.slice.call (or Array.from) and iterate over that (with forEach, Yay \o/)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wrong about forgetting some elements because of the n-- below, but I think the length explanation still stands.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed using a forEach loop.

@n1k0 n1k0 force-pushed the fix-readability-to-work-with-real-dom branch 3 times, most recently from f461263 to 6c810a9 Compare March 31, 2015 08:54
@gijsk
Copy link
Copy Markdown
Contributor

gijsk commented Mar 31, 2015

What is the performance problem?

I want at least some idea of how the JSDOMParser/Readability.js changes here impact performance, in relative terms instead of "on our really fast machines, when run in isolation on node.js, it takes 20ms so there's no problem". :-)

@gijsk
Copy link
Copy Markdown
Contributor

gijsk commented Mar 31, 2015

In particular, I suspect the HTML encoding I added in JSDOMParser.js could be very slow.

@gijsk gijsk force-pushed the fix-readability-to-work-with-real-dom branch from 6c810a9 to eb07c70 Compare April 1, 2015 23:28
@gijsk
Copy link
Copy Markdown
Contributor

gijsk commented Apr 2, 2015

Now that we have benchmarks, it's pretty easy to see that this indeed slows down JSDOMParser, though it seems to speed up Readability a little, which is somewhat surprising. I'm going to look at using some more laziness in the textContent/innerHTML stuff, as well as ensuring entities like ¨ and ৸ all work.

@gijsk gijsk force-pushed the fix-readability-to-work-with-real-dom branch 2 times, most recently from f53f3ce to 62e8b5d Compare April 2, 2015 17:25
@gijsk
Copy link
Copy Markdown
Contributor

gijsk commented Apr 2, 2015

So, this version makes JSDOMParser faster than it currently is, and Readability slightly slower. All in all, I'm pretty happy with the code here at this point. @n1k0, @leibovic can one of you review? :-)

Comment thread Readability.js
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we have benchmarks, is performance really bad if we remove this and use the generic behavior? I still think we shouldn't distinguish between parsers within Readability…

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm okay got it, JSDOMParser doesn't support what's required below.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part does it not support? I think it should work in theory, but I'd be very surprised if it wasn't slower. That said, the benchmarks are super noisy on my machine. I get results anywhere between 170 and 250 ops/s for the JSDOMParser part on the reference benchmark, with an average of about 200, all on the same changeset.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to fix that by increasing the iterations massively increases the number of ops/s, which makes me think we're hitting some kind of JIT optimization, which isn't really realistic (we won't normally readerize the same document a bunch of times in a row).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part does it not support?

JSDOMParser doesn't seem to support ownerDocument, though that's probably no big deal adding support for it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we won't normally readerize the same document a bunch of times in a row

True :)

@n1k0
Copy link
Copy Markdown
Contributor Author

n1k0 commented Apr 3, 2015

Looks good to me. Some comments you might want to address before landing. r+

gijsk added 5 commits April 3, 2015 22:32
…ular: no firstElementChild implementation...)
For instance, jsdom's more spec-compliant parsing causes issues with auto-closing elements (lifehacker article) and with not having self-closing <img> and <br> tags. The former was fixed by removing offending markup, the latter by adjusting JSDOMParser to be more sane, and the expected outputs to cope with this.

Finally, JSDOMParser automatically drops comments. The test code needed to manually do this in the jsdom case.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants