Browse files

faster_lsi: Massively accelerate LSI performance.

Currently, Classifier::LSI rebuilds the index every time an entry is
added.  This runs into massive performance overheads on my website;
theoretically, disabling automatic index rebuilds, and explicitly
rebuilding the LSI index at the end of the LSI repopulation should
speed things up nicely.

As a side note, here, I use pandoc-ruby to provide a more featureful
Markdown transformer, so be mindful that the numbers I quote here have
artifically imposed I/O overheads.

With just the 76 posts I wrote this year (abysmal, I know), I come up
with the following figures:

    Without faster_lsi:
      jekyll --lsi  16.91s user 0.88s system 97% cpu 18.302 total
    With faster_lsi:
      jekyll --lsi  2.72s user 0.77s system 88% cpu 3.940 total

With 109 posts, we begin to see even better improvements:

    Without faster_lsi:
      jekyll --lsi  51.00s user 1.47s system 98% cpu 53.060 total
    With faster_lsi:
      jekyll --lsi  5.04s user 1.12s system 91% cpu 6.735 total

At this point, we begin to see I/O overheads being slower than LSI
when faster_lsi is active.  I call that fairly conclusive.  But wait,
there's more.  I have 273 posts lying around... I wonder what happens
if I feed them all in.  With faster_lsi, it was nice and clippy.
Without it, I simply gave up, and went and refilled my cup of tea.
And it was still going.

    Without faster_lsi:
      jekyll --lsi  1277.86s user 10.90s system 99% cpu 21:30.29 total
    With faster_lsi:
      jekyll --lsi  34.62s user 4.43s system 96% cpu 40.430 total

That is, in anyone's books, a major improvement.  Note, however, that
I don't know just how well this will perform with `jekyll --auto`
because I don't know how it does the LSI rebuilds.  I _think_ (but
please, don't commit me on this) that the LSI is rebuilt every time
Jekyll picks up a file change.

So, all up, the performance improvement is massive, and scales
depending on how many files you have.  At the last point, the
improvement is just on 3200%.

A more optimal solution would be to cache the LSI index and/or content
data somehow.  I'll leave that to when faster_lsi takes over ten
minutes to run.
  • Loading branch information...
1 parent fa8400a commit 85f2dfffa69f33058e643ffd7edd01e6137a5059 @jashank jashank committed Oct 31, 2012
Showing with 5 additions and 2 deletions.
  1. +5 −2 lib/jekyll/post.rb
@@ -162,9 +162,12 @@ def related_posts(posts)
self.class.lsi ||= begin
- puts "Running the classifier... this could take a while."
- lsi =
+ puts "Starting the classifier..."
+ lsi = :auto_rebuild => false
+ $stdout.print(" Populating LSI... ");$stdout.flush
posts.each { |x| $stdout.print(".");$stdout.flush;lsi.add_item(x) }
+ $stdout.print("\n Rebuilding LSI index... ")
+ lsi.build_index
puts ""

0 comments on commit 85f2dff

Please sign in to comment.