This repository has been archived by the owner on Jan 28, 2020. It is now read-only.

Sped up Elasticsearch indexing. #460

Merged
ShawnMilo merged 1 commit into master from speedup/skm/search_indexing on Aug 4, 2015

Conversation

ShawnMilo (Contributor)

Cache course information so the import doesn't run extra database
queries per LearningResource.
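
For context, here is a minimal sketch of the caching idea described above, tied to the get_course_metadata function and ALLOW_CACHING flag that appear later in this thread. The Course model import path, its field names, and the exact flag handling are assumptions for illustration, not code from this PR.

```python
from django.conf import settings

from learningresources.models import Course  # import path assumed

# Assumed module-level cache of course metadata, keyed by course id.
COURSE_METADATA_CACHE = {}


def get_course_metadata(course_id):
    """Return org/course_number/run for a course, hitting the database at most once."""
    use_cache = getattr(settings, "ALLOW_CACHING", False)
    if use_cache and course_id in COURSE_METADATA_CACHE:
        return COURSE_METADATA_CACHE[course_id]
    course = Course.objects.get(id=course_id)  # the query we want to avoid repeating
    metadata = {
        "org": course.org,
        "course_number": course.course_number,
        "run": course.run,
    }
    if use_cache:
        COURSE_METADATA_CACHE[course_id] = metadata
    return metadata
```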

```diff
@@ -136,12 +136,25 @@ class LearningResource(BaseModel):
     xa_histogram_grade = models.FloatField(default=0)
     url_name = models.TextField(null=True)

-    def get_preview_url(self):
-        """Create a preview URL."""
+    def get_preview_url(self, org=None, course_number=None, run=None):
```
Contributor

I wonder if this would be cleaner as a standalone function since we're now taking in all the arguments we need explicitly.

ShawnMilo (Contributor Author)

Right now all the kwargs are optional; that's because I didn't want to force any other code that calls this to change.

We could make it a separate function, but I think the bigger win would be to not have it implicitly do database lookups when it runs. I don't know if that's practical, though.
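
To illustrate the optional-kwargs approach, here is a rough sketch; the fallback to self.course and the build_preview_url helper are assumptions for illustration, not the PR's actual code.

```python
def get_preview_url(self, org=None, course_number=None, run=None):
    """Create a preview URL, preferring caller-supplied course information."""
    # Existing callers can keep calling get_preview_url() with no arguments;
    # only the import/indexing path passes the cached values in explicitly.
    if org is None or course_number is None or run is None:
        course = self.course  # relation name assumed; this is the extra DB hit
        org = org or course.org
        course_number = course_number or course.course_number
        run = run or course.run
    # build_preview_url is a hypothetical helper standing in for whatever
    # URL construction the real method performs.
    return build_preview_url(org, course_number, run, self.url_name)
```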

@noisecapella (Contributor)

It would be nice if the cache were in a separate object so that it could be easily unit tested. I know there's already functional testing of the search indexing, but unit tests highlight problems more specifically, which allows for quicker diagnosis.

@noisecapella (Contributor)

Looks good overall. A few minor comments, and tests are failing, which you're probably already aware of.

@ShawnMilo (Contributor Author)

Yep, I'm working on fixing those now. I did it really quickly one morning before stand-up and didn't clean it up before pushing. This morning's stand-up prompted me to throw it up for people to look at.

@ShawnMilo (Contributor Author)

Note: When this passes code review, I still need to flatten it.

@noisecapella (Contributor)

Looks good except for the typo; feel free to merge.

Disabled by default; an environment variable should be
added to production to enable this.
@ShawnMilo (Contributor Author)

Thanks, typo fixed and flattened. Will merge after tests pass.

@carsongee: This is disabled by default, so if we want to use it in Heroku, an environment variable has to be set: ALLOW_CACHING.

ShawnMilo added a commit that referenced this pull request Aug 4, 2015
@ShawnMilo ShawnMilo merged commit 9aeec10 into master Aug 4, 2015
@ShawnMilo ShawnMilo deleted the speedup/skm/search_indexing branch August 4, 2015 15:32
```python
}


def get_course_metadata(course_id):
```
Contributor

I think this would all be cleaner using a memoize decorator: https://wiki.python.org/moin/PythonDecoratorLibrary#Memoize

Contributor

We could also use a Django local-memory cache.

ShawnMilo (Contributor Author)

The problem with memoize is that there's no concept of expiration. I do think Django's caching would be an upgrade, since they added the ability to set an expiration timeout in Django 1.7. I'll open an issue for it.
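
As a rough sketch of the Django-cache variant being discussed, assuming a local-memory backend and placeholder key/timeout values; the _load_course_metadata_from_db helper is hypothetical.

```python
# settings.py would configure a local-memory cache backend, e.g.:
# CACHES = {
#     "default": {"BACKEND": "django.core.cache.backends.locmem.LocMemCache"},
# }

from django.core.cache import cache

COURSE_METADATA_TIMEOUT = 60 * 10  # placeholder: expire entries after 10 minutes


def get_course_metadata(course_id):
    """Return course metadata, caching it with an expiration timeout."""
    key = "course_metadata_{0}".format(course_id)  # key format assumed
    metadata = cache.get(key)
    if metadata is None:
        metadata = _load_course_metadata_from_db(course_id)  # hypothetical loader
        cache.set(key, metadata, COURSE_METADATA_TIMEOUT)
    return metadata
```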

Contributor

It would be pretty trivial to add expiration to memoize: either make the expiration time a parameter of the decorator, or add your datetime.now() - INDEX_CACHE["born"] > MAX_INDEX_AGE check inside the memoize wrapper if it is constant.
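
A minimal sketch of an expiring memoize along these lines, with the expiration time as a decorator parameter; the names and the ten-minute value are illustrative only.

```python
import functools
import time


def memoize_with_expiry(max_age_seconds):
    """Memoize a function, discarding cached results older than max_age_seconds."""
    def decorator(func):
        cache = {}  # maps positional args -> (timestamp, result)

        @functools.wraps(func)
        def wrapper(*args):
            now = time.time()
            entry = cache.get(args)
            if entry is not None and now - entry[0] <= max_age_seconds:
                return entry[1]
            result = func(*args)
            cache[args] = (now, result)
            return result
        return wrapper
    return decorator


# Hypothetical usage on the lookup discussed in this thread:
# @memoize_with_expiry(max_age_seconds=600)
# def get_course_metadata(course_id):
#     ...
```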

@carsongee (Contributor)

It would be nice if there were a performance test we could add that asserts we won't get slower, e.g. assert the number of DB queries for the import of a course and make sure that number doesn't go up with the flag enabled. Also, I don't know if we want the feature flag; if it makes things faster and doesn't break things, it should just always be on, right? Or do it with a function parameter so the caller can determine whether it is on, rather than the sysadmin?
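
For reference, the kind of query-count test being suggested could be built on Django's CaptureQueriesContext (or assertNumQueries); the sketch below assumes a hypothetical import_course entry point and test fixture.

```python
from django.db import connection
from django.test import TestCase
from django.test.utils import CaptureQueriesContext


class ImportQueryCountTests(TestCase):
    """Sketch: enabling the cache should not increase the number of DB queries."""

    def test_cached_import_uses_fewer_queries(self):
        # import_course and self.course_archive are hypothetical stand-ins for
        # the real import entry point and test fixture.
        with CaptureQueriesContext(connection) as uncached:
            import_course(self.course_archive, cache_enabled=False)
        with CaptureQueriesContext(connection) as cached:
            import_course(self.course_archive, cache_enabled=True)
        self.assertLess(len(cached.captured_queries), len(uncached.captured_queries))
```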

@ShawnMilo (Contributor Author)

The main practical reason the flag is there is that it breaks all kinds of tests if it's on by default. Since it's called implicitly when various objects are saved, an optional kwarg won't help; it has to be something set in a larger scope.

The other reason it should be off by default is that otherwise it's going to cause a lot of confusion during development, as we save something and refresh and don't see the change -- unless we always keep caching in the front of our minds or remember to disable it locally. Since cache invalidation is one of the hard problems in programming, I think it's best to leave it off by default to prevent surprises. I also know this from bitter experience.

@carsongee (Contributor)

We shouldn't write it in such a way that it breaks the tests when the feature is on; that would invalidate much of our testing (as it relates to verifying we are production ready). Part of adding this would be making the code those tests use smarter (and probably some of the tests smarter as well) to deal with it; otherwise we end up with all those problems you have likely run into.
For this specifically, you can bypass that harder problem: cache invalidation is actually very easy to handle here since we aren't using threads. We just need to turn the cache on at the start of the import and turn it off afterwards; that way we limit the scope of the change and wouldn't need a flag to turn it on and off. We can even limit the scope of the caching to inside the import pipeline.
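
One way to express "turn the cache on at the start of the import and off afterwards" is a context manager wrapped around the import pipeline; a rough sketch with assumed names, not code from this PR.

```python
from contextlib import contextmanager

# Assumed module-level cache and flag, mirroring the sketch earlier in the thread.
COURSE_METADATA_CACHE = {}
CACHING_ENABLED = False


@contextmanager
def course_cache():
    """Enable course-metadata caching only for the duration of an import."""
    global CACHING_ENABLED
    CACHING_ENABLED = True
    try:
        yield
    finally:
        # Turn caching off and drop any stale entries once the import finishes.
        CACHING_ENABLED = False
        COURSE_METADATA_CACHE.clear()


# Hypothetical usage inside the import pipeline:
# with course_cache():
#     import_course(archive)
```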

@carsongee (Contributor)

Have you looked at the speed-up this provides for large courses, e.g. 8.01 or the MechCX course? E.g. from 5 minutes to 4 minutes, or similar?

@ShawnMilo (Contributor Author)

I ran it against all the courses I had locally, and overall it cut the time in half. I didn't test any courses in isolation.

Your comments above (regarding cache invalidation) will take some more thinking to respond to, so my response will come later. Just so you don't think I missed or ignored that part.

ShawnMilo added a commit that referenced this pull request Aug 7, 2015
Closes #467.

Improves upon PR #460 by getting rid of a "global" variable and
a custom settings.py value. Also adds tests which confirm that
fewer queries are done when the caching is used.
ShawnMilo added a commit that referenced this pull request Aug 11, 2015
Closes #467.

Improves upon PR #460 by getting rid of a "global" variable and
a custom settings.py value. Also adds tests which confirm that
fewer queries are done when the caching is used.