faster_lsi: Massively accelerate LSI performance. #664

Merged
merged 2 commits into from about 1 year ago

6 participants

Jashank Jeremy Parker Moore Carlos Agarie Nick Quaranto Tom Preston-Werner Chris Hough
Jashank Jeremy

Currently, Classifier::LSI rebuilds the index every time an entry is added. This runs into massive performance overheads on my website; theoretically, disabling automatic index rebuilds, and explicitly rebuilding the LSI index at the end of the LSI repopulation should speed things up nicely.
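To make the cost difference concrete, here is a minimal sketch using a hypothetical stand-in class, `ToyIndex`. It is not Classifier::LSI; it only mimics the `:auto_rebuild` switch, and it assumes (purely for illustration) that one rebuild costs one unit of work per stored item. The real rebuild is far more expensive than that, so the gap in practice is larger.

```ruby
# Hypothetical stand-in for an index that is expensive to rebuild.
# This is NOT Classifier::LSI; it only imitates the auto_rebuild option.
class ToyIndex
  attr_reader :rebuilds, :work

  def initialize(options = {})
    @auto_rebuild = options.fetch(:auto_rebuild, true)
    @items = []
    @rebuilds = 0
    @work = 0
  end

  # Store the item and, in auto-rebuild mode, rebuild straight away.
  def add_item(item)
    @items << item
    build_index if @auto_rebuild
  end

  # Assumed cost model: one unit of work per stored item per rebuild.
  def build_index
    @rebuilds += 1
    @work += @items.size
  end
end

posts = (1..273).map { |n| "post #{n}" }

# Rebuild after every insertion (the old behaviour).
eager = ToyIndex.new
posts.each { |post| eager.add_item(post) }

# Insert everything first, then rebuild once (the faster_lsi approach).
deferred = ToyIndex.new(:auto_rebuild => false)
posts.each { |post| deferred.add_item(post) }
deferred.build_index

puts "eager:    #{eager.rebuilds} rebuilds, #{eager.work} units of work"
puts "deferred: #{deferred.rebuilds} rebuilds, #{deferred.work} units of work"
```

Even under this generous cost model, the eager variant does quadratic work in the number of posts, while the deferred variant does linear work.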

As a side note, I use pandoc-ruby to provide a more featureful Markdown transformer, so be mindful that the numbers I quote here include artificially imposed I/O overheads.

With just the 76 posts I wrote this year (abysmal, I know), I come up with the following figures:

Without faster_lsi:
  jekyll --lsi  16.91s user 0.88s system 97% cpu 18.302 total
With faster_lsi:
  jekyll --lsi  2.72s user 0.77s system 88% cpu 3.940 total

With 109 posts, we begin to see even better improvements:

Without faster_lsi:
  jekyll --lsi  51.00s user 1.47s system 98% cpu 53.060 total
With faster_lsi:
  jekyll --lsi  5.04s user 1.12s system 91% cpu 6.735 total

At this point, with faster_lsi active, the I/O overheads begin to take longer than the LSI itself. I call that fairly conclusive. But wait, there's more. I have 273 posts lying around... I wonder what happens if I feed them all in. With faster_lsi, it was nice and clippy. Without it, I simply gave up, and went and refilled my cup of tea. And it was still going.

Without faster_lsi:
  jekyll --lsi  1277.86s user 10.90s system 99% cpu 21:30.29 total
With faster_lsi:
  jekyll --lsi  34.62s user 4.43s system 96% cpu 40.430 total

That is, in anyone's books, a major improvement. Note, however, that I don't know just how well this will perform with jekyll --auto because I don't know how it does the LSI rebuilds. I think (but please, don't quote me on this) that the LSI is rebuilt every time Jekyll picks up a file change.

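So, all up, the performance improvement is massive, and it grows with the number of posts. At the last data point (273 posts), the build is roughly 32 times faster in wall-clock time (21:30.29 versus 40.430 seconds), which is where the figure of about 3200% comes from.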
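For reference, the speedup factors can be recomputed from the wall-clock totals quoted above. This is just arithmetic on those figures; the constant names are mine.

```ruby
# Wall-clock totals in seconds, copied from the timings quoted above.
# 21:30.29 is converted to 21 * 60 + 30.29 seconds.
TIMINGS = {
  76  => { :without => 18.302,          :with => 3.940 },
  109 => { :without => 53.060,          :with => 6.735 },
  273 => { :without => 21 * 60 + 30.29, :with => 40.430 }
}

# Speedup factor = time without faster_lsi / time with faster_lsi.
SPEEDUPS = TIMINGS.map do |posts, t|
  [posts, (t[:without] / t[:with]).round(1)]
end

SPEEDUPS.each do |posts, factor|
  puts "#{posts} posts: #{factor}x faster"
end
```

The factor rises from under 5x at 76 posts to nearly 32x at 273 posts, which is consistent with the old behaviour getting disproportionately worse as the corpus grows.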

A more optimal solution would be to cache the LSI index and/or content data somehow. I'll leave that to when faster_lsi takes over ten minutes to run.

Jashank Jeremy faster_lsi: Massively accelerate LSI performance.
85f2dff
Parker Moore
Owner

This. Is. Awesome.

Jashank Jeremy

I found it quite useful at the time I wrote it. Classifier::LSI is, even with GSL, agonisingly slow.

Carlos Agarie

I made a rake task to generate posts and got similar results to yours, @Jashank.

It's obvious that this pull request should be merged. The only problem I see is lack of tests - I doubt @mojombo will accept the patch without them.

Jashank Jeremy

I'm trying to work out how to create a test for this. There are no existing tests for the LSI, and I'm personally not sure how to prove it works. Perhaps use tests/source to build a known-good LSI, and compare the test LSI with it? That may work, except how do I then know that the known-good LSI is, in fact, good...?

Theoretically, this change does not affect the LSI at all; it only changes how entries are inserted into it and how it is managed internally. So as long as it produces the same results as it did before (which it seems to), there's no problem.
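One possible shape for such a test, sketched against a hypothetical toy index rather than the real classifier: build the index once with automatic rebuilds and once with a single deferred rebuild, then assert the two are equal. `ToyWordIndex` and its word-count "index" are assumptions made for illustration. A real test would substitute Classifier::LSI (using the `:auto_rebuild`, `add_item` and `build_index` calls from this patch) and compare whatever output Jekyll actually consumes.

```ruby
# Hypothetical toy index (NOT Classifier::LSI): the "index" is a table of
# word counts per document, recomputed from scratch on every rebuild.
class ToyWordIndex
  attr_reader :index

  def initialize(options = {})
    @auto_rebuild = options.fetch(:auto_rebuild, true)
    @docs = []
    @index = {}
  end

  def add_item(text)
    @docs << text
    build_index if @auto_rebuild
  end

  # Rebuild from the stored documents only, so the result depends on
  # what was inserted, not on when the rebuilds happened.
  def build_index
    @index = {}
    @docs.each_with_index do |text, i|
      counts = Hash.new(0)
      text.downcase.scan(/\w+/) { |word| counts[word] += 1 }
      @index[i] = counts
    end
  end
end

docs = [
  "Jekyll builds static sites",
  "LSI finds related posts",
  "Related posts in Jekyll"
]

# Rebuild after every insertion.
eager = ToyWordIndex.new
docs.each { |doc| eager.add_item(doc) }

# Insert everything, then rebuild once.
deferred = ToyWordIndex.new(:auto_rebuild => false)
docs.each { |doc| deferred.add_item(doc) }
deferred.build_index

puts(eager.index == deferred.index ? "indexes match" : "indexes differ")
```

This sidesteps the "known-good LSI" problem: the test does not need to know what a correct index looks like, only that both insertion strategies agree.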

Nick Quaranto

Can anyone prove this works? Sounds like it's great to merge but I don't have any sites that use LSI. Can anyone provide a repo that uses it, perhaps to use as benchmark or just paste one?

Jashank Jeremy

@qrush, I've created a test repo, Jashank/jekyll-lsi-test, which contains 964 auto-generated posts. I'm currently running both 'normal' LSI and faster_lsi against it, to compare performance. I don't expect normal LSI to finish tonight. Benchmarks to come.

Jashank Jeremy

My LSI benchmark did eventually finish, with no difference in the results. If someone else would like to verify that the results they get on the auto-generated corpus don't differ with faster_lsi, that would be greatly appreciated.

In any case, there should be absolutely no difference in the LSI given the same dataset. I haven't changed the way that the classifier runs, only (effectively) when it runs.

Parker Moore
Owner

Please add parentheses around the args to Classifier::LSI.new:

lsi = Classifier::LSI.new(:auto_rebuild => false)
Tom Preston-Werner
Owner

+1.

Parker Moore parkr merged commit faf5e44 into from January 16, 2013
Parker Moore parkr closed this January 16, 2013
Chris Hough

+1

Showing 2 unique commits by 1 author.

Oct 31, 2012
Jashank Jeremy faster_lsi: Massively accelerate LSI performance.
85f2dff
Jan 11, 2013
Jashank Jeremy Slight stylistic tweak to LSI initialisation.
Recommended-by: parkr
68333cd

Showing 1 changed file with 5 additions and 2 deletions.

lib/jekyll/post.rb
@@ -162,9 +162,12 @@ def related_posts(posts)
 
       if self.site.lsi
         self.class.lsi ||= begin
-          puts "Running the classifier... this could take a while."
-          lsi = Classifier::LSI.new
+          puts "Starting the classifier..."
+          lsi = Classifier::LSI.new(:auto_rebuild => false)
+          $stdout.print("  Populating LSI... ");$stdout.flush
           posts.each { |x| $stdout.print(".");$stdout.flush;lsi.add_item(x) }
+          $stdout.print("\n  Rebuilding LSI index... ")
+          lsi.build_index
           puts ""
           lsi
         end