faster_lsi: Massively accelerate LSI performance. #664
Currently, Classifier::LSI rebuilds the index every time an entry is added. This causes massive performance overhead on my website. Theoretically, disabling automatic index rebuilds and explicitly rebuilding the LSI index once, at the end of the LSI repopulation, should speed things up nicely.
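To illustrate why deferring the rebuild should help, here is a minimal sketch. `ToyIndex` is a hypothetical stand-in written for this example, not the classifier gem; it only counts rebuilds and pretends that each rebuild costs one unit of work per stored item (a real LSI rebuild involves an SVD and is likely more expensive than that).

```ruby
# Hypothetical stand-in for an index with an auto-rebuild switch.
# It is NOT Classifier::LSI; it only counts how often a rebuild runs.
class ToyIndex
  attr_reader :rebuilds, :work

  def initialize(auto_rebuild: true)
    @auto_rebuild = auto_rebuild
    @items = []
    @rebuilds = 0
    @work = 0
  end

  # Store the item; rebuild immediately only when auto-rebuild is on.
  def add_item(item)
    @items << item
    build_index if @auto_rebuild
  end

  # Assumed cost model: one unit of work per item currently stored.
  def build_index
    @rebuilds += 1
    @work += @items.size
  end
end

posts = (1..76).map { |i| "post #{i}" }

# Rebuild after every insert (the current behaviour described above).
eager = ToyIndex.new(auto_rebuild: true)
posts.each { |post| eager.add_item(post) }

# Insert everything first, then rebuild exactly once.
deferred = ToyIndex.new(auto_rebuild: false)
posts.each { |post| deferred.add_item(post) }
deferred.build_index

puts "eager:    #{eager.rebuilds} rebuilds, #{eager.work} units of work"
puts "deferred: #{deferred.rebuilds} rebuild, #{deferred.work} units of work"
```

With the classifier gem itself, the same pattern would presumably be `Classifier::LSI.new(:auto_rebuild => false)`, followed by `add_item` for each post and a single `build_index` at the end; check the exact option name against the gem's documentation.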
As a side note, I use pandoc-ruby here to provide a more featureful Markdown transformer, so be mindful that the numbers I quote have artificially imposed I/O overheads.
With just the 76 posts I wrote this year (abysmal, I know), I come up with the following figures:
With 109 posts, we begin to see even better improvements:
At this point, we begin to see the I/O overhead taking longer than the LSI itself when faster_lsi is active. I call that fairly conclusive. But wait, there's more. I have 273 posts lying around... I wonder what happens if I feed them all in. With faster_lsi, it was nice and clippy. Without it, I simply gave up and went to refill my cup of tea. And it was still going.
That is, in anyone's books, a major improvement. Note, however, that I don't know just how well this will perform with
So, all up, the performance improvement is massive, and it grows with the number of files you have. At the last data point, the improvement is about 3200%.
A better solution would be to cache the LSI index and/or content data somehow. I'll leave that until faster_lsi takes over ten minutes to run.
I'm trying to work out how to create a test for this. There are no existing tests for the LSI, and I'm personally not sure how to prove it works. Perhaps use
Theoretically, this change does not affect the LSI at all. It only changes how entries are inserted into it and how it is managed internally, so as long as it produces the same results as before (which it seems to), there's no problem.
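One possible shape for such a check, sketched with a hypothetical stand-in: `OverlapIndex` below is invented for this example and ranks documents by shared words, which is not how LSI works. The corpus is made up as well. The point is only the structure of the comparison: fill one index with a rebuild after every insert, fill another with a single rebuild at the end, and confirm that every query returns the same ranking.

```ruby
require "set"

# Hypothetical stand-in index (NOT Classifier::LSI): ranks documents by
# how many distinct words they share with the query document.
class OverlapIndex
  def initialize(auto_rebuild: true)
    @auto_rebuild = auto_rebuild
    @docs = []
    @words = {}
  end

  def add_item(doc)
    @docs << doc
    build_index if @auto_rebuild
  end

  # Recompute the word set of every stored document.
  def build_index
    @words = @docs.to_h { |d| [d, d.downcase.split.to_set] }
  end

  def find_related(doc, max = 3)
    query = doc.downcase.split.to_set
    others = @docs.reject { |d| d == doc }
    # Sort by descending overlap, then alphabetically so ties are stable.
    others.sort_by { |d| [-(@words.fetch(d) & query).size, d] }.first(max)
  end
end

# Made-up corpus, for illustration only.
corpus = [
  "ruby static site generator",
  "static site hosting tips",
  "latent semantic indexing in ruby",
  "tea brewing notes",
]

eager = OverlapIndex.new(auto_rebuild: true)
corpus.each { |doc| eager.add_item(doc) }

deferred = OverlapIndex.new(auto_rebuild: false)
corpus.each { |doc| deferred.add_item(doc) }
deferred.build_index

# Collect every document whose related list differs between the two modes.
mismatches = corpus.reject { |doc| eager.find_related(doc) == deferred.find_related(doc) }
puts mismatches.empty? ? "identical results" : "differs for: #{mismatches.inspect}"
```

A real test would swap the stand-in for the actual LSI class and a fixed corpus, keeping the same compare-both-modes structure.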
referenced this pull request
Jan 2, 2013
My LSI benchmark did eventually finish with no difference. If someone else would like to verify that the results they get on the auto-generated corpus don't differ with
In any case, there should be absolutely no difference in the LSI given the same dataset. I haven't changed the way that the classifier runs, only (effectively) when it runs.