faster_lsi: Massively accelerate LSI performance. #664

Merged
merged 2 commits into from Jan 17, 2013

7 participants
@jashank
Contributor

jashank commented Oct 31, 2012

Currently, Classifier::LSI rebuilds the index every time an entry is added. This runs into massive performance overheads on my website; theoretically, disabling automatic index rebuilds, and explicitly rebuilding the LSI index at the end of the LSI repopulation should speed things up nicely.
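To make the cost difference concrete, here is a minimal sketch that only counts rebuild work. `ToyIndex` and `populate` are hypothetical stand-ins invented for illustration, not the Classifier::LSI implementation; the toy assumes a rebuild touches every item once, whereas a real LSI rebuild is likely far more expensive per item.

```ruby
# Toy stand-in for an index that can rebuild on every insert.
# ToyIndex is hypothetical; it only counts how much work the rebuilds do.
class ToyIndex
  attr_reader :rebuilds, :items_processed

  def initialize(auto_rebuild: true)
    @auto_rebuild = auto_rebuild
    @items = []
    @rebuilds = 0
    @items_processed = 0
  end

  # Add one item; rebuild right away only when auto_rebuild is on.
  def add_item(item)
    @items << item
    build_index if @auto_rebuild
  end

  # A rebuild touches every item currently in the index.
  def build_index
    @rebuilds += 1
    @items_processed += @items.size
  end
end

# Fill an index with posts, rebuilding once at the end if auto_rebuild is off.
def populate(posts, auto_rebuild:)
  index = ToyIndex.new(auto_rebuild: auto_rebuild)
  posts.each { |post| index.add_item(post) }
  index.build_index unless auto_rebuild
  index
end

posts = (1..273).map { |n| "post-#{n}" }
slow = populate(posts, auto_rebuild: true)
fast = populate(posts, auto_rebuild: false)

puts "auto rebuild:   #{slow.rebuilds} rebuilds, #{slow.items_processed} items processed"
puts "single rebuild: #{fast.rebuilds} rebuild, #{fast.items_processed} items processed"
```

With 273 items, the auto-rebuild path performs 273 rebuilds and touches 37,401 items in total, against a single rebuild over 273 items. The gap grows quadratically with the number of posts even in this simplified model.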

As a side note, I use pandoc-ruby here to provide a more featureful Markdown transformer, so be mindful that the numbers I quote have artificially imposed I/O overheads.

With just the 76 posts I wrote this year (abysmal, I know), I come up with the following figures:

Without faster_lsi:
  jekyll --lsi  16.91s user 0.88s system 97% cpu 18.302 total
With faster_lsi:
  jekyll --lsi  2.72s user 0.77s system 88% cpu 3.940 total

With 109 posts, we begin to see even better improvements:

Without faster_lsi:
  jekyll --lsi  51.00s user 1.47s system 98% cpu 53.060 total
With faster_lsi:
  jekyll --lsi  5.04s user 1.12s system 91% cpu 6.735 total

At this point, with faster_lsi active, the I/O overheads start to take longer than the LSI itself. I call that fairly conclusive. But wait, there's more. I have 273 posts lying around... I wonder what happens if I feed them all in. With faster_lsi, it was nice and clippy. Without it, I simply gave up, and went and refilled my cup of tea. And it was still going.

Without faster_lsi:
  jekyll --lsi  1277.86s user 10.90s system 99% cpu 21:30.29 total
With faster_lsi:
  jekyll --lsi  34.62s user 4.43s system 96% cpu 40.430 total

That is, in anyone's books, a major improvement. Note, however, that I don't know just how well this will perform with jekyll --auto because I don't know how it does the LSI rebuilds. I think (but please, don't quote me on this) that the LSI is rebuilt every time Jekyll picks up a file change.

So, all up, the performance improvement is massive, and grows with the number of files you have. At the last point (273 posts), the build runs roughly 32 times faster.
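The speedup at each corpus size can be checked with a few lines of arithmetic. This sketch copies only the wall-clock totals from the figures above; the `timings` and `speedups` names are illustrative.

```ruby
# Wall-clock totals in seconds, copied from the timing figures above.
timings = {
  76  => { without: 18.302, with: 3.940 },
  109 => { without: 53.060, with: 6.735 },
  273 => { without: 21 * 60 + 30.29, with: 40.430 } # 21:30.29 total
}

# Speedup factor = time without faster_lsi / time with faster_lsi.
speedups = timings.transform_values { |t| t[:without] / t[:with] }

speedups.each do |posts, factor|
  puts format("%3d posts: %.1fx faster", posts, factor)
end
```

This works out to about 4.6x at 76 posts, 7.9x at 109 posts, and 31.9x at 273 posts, so the benefit widens as the corpus grows.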

A better solution would be to cache the LSI index and/or content data somehow. I'll leave that for when faster_lsi takes over ten minutes to run.

@parkr

Member

parkr commented Dec 18, 2012

This. Is. Awesome.

@jashank

Contributor

jashank commented Dec 19, 2012

I found it quite useful at the time I wrote it. Classifier::LSI is, even with GSL, agonisingly slow.

@agarie

agarie commented Dec 30, 2012

I made a rake task to generate posts and got similar results to yours, @jashank.

It's obvious that this pull request should be merged. The only problem I see is lack of tests - I doubt @mojombo will accept the patch without them.

@jashank

Contributor

jashank commented Dec 30, 2012

I'm trying to work out how to create a test for this. There are no existing tests for the LSI, and I'm personally not sure how to prove it works. Perhaps use tests/source to build a known-good LSI, and compare the test LSI with it? That may work, except how do I then know that the known-good LSI is, in fact, good...?

Theoretically, this change does not affect the LSI at all; it only changes how entries are inserted into it and how it is managed internally. So as long as it produces the same results as it did before (which it seems to), there's no problem.
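One possible shape for such a test, as a rough sketch: build the same corpus both ways and compare the related-item rankings. `ToyRelatedIndex` and `build` are hypothetical and rank by shared words rather than real LSI; they only illustrate the comparison, under the assumption that a rebuild depends solely on the items present.

```ruby
# Hypothetical toy index that ranks documents by shared-word count.
# It stands in for the real classifier only to show the shape of the test.
class ToyRelatedIndex
  def initialize(auto_rebuild: true)
    @auto_rebuild = auto_rebuild
    @docs = []
    @words = {}
  end

  def add_item(doc)
    @docs << doc
    build_index if @auto_rebuild
  end

  # Recompute the word set of every document from scratch.
  def build_index
    @words = {}
    @docs.each { |doc| @words[doc] = doc.downcase.split.uniq }
  end

  # Other documents, most shared words first; ties keep insertion order.
  def find_related(doc)
    mine = @words.fetch(doc)
    others = @docs - [doc]
    ranked = others.each_with_index.sort_by do |other, i|
      [-(mine & @words[other]).size, i]
    end
    ranked.map(&:first)
  end
end

# Populate an index, rebuilding once at the end if auto_rebuild is off.
def build(docs, auto_rebuild:)
  index = ToyRelatedIndex.new(auto_rebuild: auto_rebuild)
  docs.each { |doc| index.add_item(doc) }
  index.build_index unless auto_rebuild
  index
end

docs = [
  "jekyll builds static sites",
  "lsi finds related posts",
  "jekyll finds related posts with lsi",
  "tea is good"
]

incremental = build(docs, auto_rebuild: true)
batch = build(docs, auto_rebuild: false)

# Collect every document whose ranking differs between the two builds.
mismatches = docs.reject do |doc|
  incremental.find_related(doc) == batch.find_related(doc)
end
puts mismatches.empty? ? "identical results" : "differs for: #{mismatches.inspect}"
```

This sidesteps the "known-good LSI" question: the test does not assert what the right answer is, only that both insertion strategies agree on it.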

@qrush

Contributor

qrush commented Jan 2, 2013

Can anyone prove this works? Sounds like it's great to merge but I don't have any sites that use LSI. Can anyone provide a repo that uses it, perhaps to use as benchmark or just paste one?

@jashank

Contributor

jashank commented Jan 2, 2013

@qrush, I've created a test repo, Jashank/jekyll-lsi-test, which contains 964 auto-generated posts. I'm currently running both 'normal' LSI and faster_lsi against it, to compare performance. I don't expect normal LSI to finish tonight. Benchmarks to come.

@jashank

Contributor

jashank commented Jan 8, 2013

My LSI benchmark did eventually finish, with no difference in the results. If someone else would like to verify that the results they get on the auto-generated corpus don't differ with faster_lsi, that would be greatly appreciated.

In any case, there should be absolutely no difference in the LSI given the same dataset. I haven't changed the way that the classifier runs, only (effectively) when it runs.

@parkr

Member

parkr commented Jan 11, 2013

Please add parentheses around the args to Classifier::LSI.new:

lsi = Classifier::LSI.new(:auto_rebuild => false)
@mojombo

Contributor

mojombo commented Jan 11, 2013

+1.

parkr added a commit that referenced this pull request Jan 17, 2013

Merge pull request #664 from Jashank/faster_lsi
faster_lsi: Massively accelerate LSI performance.

@parkr parkr merged commit faf5e44 into jekyll:master Jan 17, 2013

1 check passed

default The Travis build passed

@chrishough

chrishough commented Apr 15, 2014

+1

@jekyll jekyll locked and limited conversation to collaborators Feb 27, 2017
