This repository has been archived by the owner on Jan 28, 2020. It is now read-only.

Sped up Elasticsearch indexing. #460

Merged
ShawnMilo merged 1 commit into master from speedup/skm/search_indexing on Aug 4, 2015

Conversation

ShawnMilo (Contributor)

Cache course information so the import doesn't run extra database
queries per LearningResource.
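
For context, here is a minimal sketch of the caching idea described above, tied to the get_course_metadata function and ALLOW_CACHING flag that appear later in this thread. The Course model import path, its field names, and the exact flag handling are assumptions for illustration, not code from this PR.

```python
from django.conf import settings

from learningresources.models import Course  # import path assumed

# Assumed module-level cache of course metadata, keyed by course id.
COURSE_METADATA_CACHE = {}


def get_course_metadata(course_id):
    """Return org/course_number/run for a course, hitting the database at most once."""
    use_cache = getattr(settings, "ALLOW_CACHING", False)
    if use_cache and course_id in COURSE_METADATA_CACHE:
        return COURSE_METADATA_CACHE[course_id]
    course = Course.objects.get(id=course_id)  # the query we want to avoid repeating
    metadata = {
        "org": course.org,
        "course_number": course.course_number,
        "run": course.run,
    }
    if use_cache:
        COURSE_METADATA_CACHE[course_id] = metadata
    return metadata
```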

```diff
@@ -136,12 +136,25 @@ class LearningResource(BaseModel):
     xa_histogram_grade = models.FloatField(default=0)
     url_name = models.TextField(null=True)

-    def get_preview_url(self):
-        """Create a preview URL."""
+    def get_preview_url(self, org=None, course_number=None, run=None):
```
Contributor

I wonder if this would be cleaner as a standalone function since we're now taking in all the arguments we need explicitly.

ShawnMilo (Contributor Author)

Right now all the kwargs are optional; that's because I didn't want to force any other code that calls this to change.

We could make it a separate function, but I think the bigger win would be to not have it implicitly do database lookups when it runs. I don't know if that's practical, though.
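
To illustrate the optional-kwargs approach, here is a rough sketch; the fallback to self.course and the build_preview_url helper are assumptions for illustration, not the PR's actual code.

```python
def get_preview_url(self, org=None, course_number=None, run=None):
    """Create a preview URL, preferring caller-supplied course information."""
    # Existing callers can keep calling get_preview_url() with no arguments;
    # only the import/indexing path passes the cached values in explicitly.
    if org is None or course_number is None or run is None:
        course = self.course  # relation name assumed; this is the extra DB hit
        org = org or course.org
        course_number = course_number or course.course_number
        run = run or course.run
    # build_preview_url is a hypothetical helper standing in for whatever
    # URL construction the real method performs.
    return build_preview_url(org, course_number, run, self.url_name)
```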

@noisecapella (Contributor)

It would be nice if the cache were in a separate object so that it could be easily unit tested. I know there's already functional testing of the search indexing, but unit tests highlight problems more specifically, which allows for quicker diagnosis.

@noisecapella (Contributor)

Looks good overall. A few minor comments, and tests are failing, which you're probably already aware of.

@ShawnMilo (Contributor Author)

Yep, I'm working on fixing those now. I did it really quickly one morning before stand-up and didn't clean it up before pushing. This morning's stand-up prompted me to throw it up for people to look at.

@ShawnMilo (Contributor Author)

Note: When this passes code review, I still need to flatten it.

@noisecapella (Contributor)

Looks good except for the typo; feel free to merge.

Disabled by default; an environment variable should be
added to production to enable this.
@ShawnMilo (Contributor Author)

Thanks, typo fixed and flattened. Will merge after tests pass.

@carsongee: This is disabled by default, so if we want to use it in Heroku, an environment variable has to be set: ALLOW_CACHING.

ShawnMilo added a commit that referenced this pull request Aug 4, 2015
@ShawnMilo ShawnMilo merged commit 9aeec10 into master Aug 4, 2015
@ShawnMilo ShawnMilo deleted the speedup/skm/search_indexing branch August 4, 2015 15:32
```python
}


def get_course_metadata(course_id):
```
Contributor

I think this would all be cleaner using a memoize decorator: https://wiki.python.org/moin/PythonDecoratorLibrary#Memoize

Contributor

We could also use a Django local-memory cache.

ShawnMilo (Contributor Author)

The problem with memoize is that there's no concept of expiration. I do think Django's caching would be an upgrade, since they added the ability to set an expiration timeout in Django 1.7. I'll open an issue for it.
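
As a rough sketch of the Django-cache variant being discussed, assuming a local-memory backend and placeholder key/timeout values; the _load_course_metadata_from_db helper is hypothetical.

```python
# settings.py would configure a local-memory cache backend, e.g.:
# CACHES = {
#     "default": {"BACKEND": "django.core.cache.backends.locmem.LocMemCache"},
# }

from django.core.cache import cache

COURSE_METADATA_TIMEOUT = 60 * 10  # placeholder: expire entries after 10 minutes


def get_course_metadata(course_id):
    """Return course metadata, caching it with an expiration timeout."""
    key = "course_metadata_{0}".format(course_id)  # key format assumed
    metadata = cache.get(key)
    if metadata is None:
        metadata = _load_course_metadata_from_db(course_id)  # hypothetical loader
        cache.set(key, metadata, COURSE_METADATA_TIMEOUT)
    return metadata
```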

Contributor

It would be pretty trivial to add expiration to memoize: either make the expiration time a parameter of the decorator, or add your datetime.now() - INDEX_CACHE["born"] > MAX_INDEX_AGE check inside the memoize wrapper if it is constant.
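
A minimal sketch of an expiring memoize along these lines, with the expiration time as a decorator parameter; the names and the ten-minute value are illustrative only.

```python
import functools
import time


def memoize_with_expiry(max_age_seconds):
    """Memoize a function, discarding cached results older than max_age_seconds."""
    def decorator(func):
        cache = {}  # maps positional args -> (timestamp, result)

        @functools.wraps(func)
        def wrapper(*args):
            now = time.time()
            entry = cache.get(args)
            if entry is not None and now - entry[0] <= max_age_seconds:
                return entry[1]
            result = func(*args)
            cache[args] = (now, result)
            return result
        return wrapper
    return decorator


# Hypothetical usage on the lookup discussed in this thread:
# @memoize_with_expiry(max_age_seconds=600)
# def get_course_metadata(course_id):
#     ...
```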

@carsongee (Contributor)

It would be nice if there were a performance test we could add that asserts we won't get slower, e.g. assert the number of DB queries for the import of a course and make sure that number doesn't go up with the flag enabled. Also, I don't know if we want the feature flag; if it makes things faster and doesn't break things, it should just always be on, right? Or do it with a function parameter so the caller can determine whether it is on, rather than the sysadmin?
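
For reference, the kind of query-count test being suggested could be built on Django's CaptureQueriesContext (or assertNumQueries); the sketch below assumes a hypothetical import_course entry point and test fixture.

```python
from django.db import connection
from django.test import TestCase
from django.test.utils import CaptureQueriesContext


class ImportQueryCountTests(TestCase):
    """Sketch: enabling the cache should not increase the number of DB queries."""

    def test_cached_import_uses_fewer_queries(self):
        # import_course and self.course_archive are hypothetical stand-ins for
        # the real import entry point and test fixture.
        with CaptureQueriesContext(connection) as uncached:
            import_course(self.course_archive, cache_enabled=False)
        with CaptureQueriesContext(connection) as cached:
            import_course(self.course_archive, cache_enabled=True)
        self.assertLess(len(cached.captured_queries), len(uncached.captured_queries))
```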

@ShawnMilo (Contributor Author)

The main practical reason the flag is there is that it breaks all kinds of tests if it's on by default. Since it's called implicitly when various objects are saved, an optional kwarg won't help; it has to be something set in a larger scope.

The other reason it should be off by default is that otherwise it's going to cause a lot of confusion during development, as we save something and refresh and don't see the change -- unless we always keep caching in the front of our minds or remember to disable it locally. Since cache invalidation is one of the hard problems in programming, I think it's best to leave it off by default to prevent surprises. I also know this from bitter experience.

@carsongee (Contributor)

We shouldn't write it in such a way that it breaks the tests when the feature is on; that would invalidate much of our testing (as it relates to verifying we are production ready). Part of adding this would be making the code those tests use smarter (and probably some of the tests smarter as well) to deal with it; otherwise we end up with all those problems you have likely run into.
For this specifically, you can bypass that harder problem: cache invalidation is actually very easy to handle here since we aren't using threads. We just need to turn the cache on at the start of the import and turn it off afterwards; that way we limit the scope of the change and wouldn't need a flag to turn it on and off. We can even limit the scope of the caching to inside the import pipeline.
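
One way to express "turn the cache on at the start of the import and off afterwards" is a context manager wrapped around the import pipeline; a rough sketch with assumed names, not code from this PR.

```python
from contextlib import contextmanager

# Assumed module-level cache and flag, mirroring the sketch earlier in the thread.
COURSE_METADATA_CACHE = {}
CACHING_ENABLED = False


@contextmanager
def course_cache():
    """Enable course-metadata caching only for the duration of an import."""
    global CACHING_ENABLED
    CACHING_ENABLED = True
    try:
        yield
    finally:
        # Turn caching off and drop any stale entries once the import finishes.
        CACHING_ENABLED = False
        COURSE_METADATA_CACHE.clear()


# Hypothetical usage inside the import pipeline:
# with course_cache():
#     import_course(archive)
```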

@carsongee (Contributor)

Have you looked at the speed-up this provides for large courses, e.g. 8.01 or the MechCX course? E.g. from 5 minutes to 4 minutes, or similar?

@ShawnMilo (Contributor Author)

I ran it against all the courses I had locally, and overall it cut the time in half. I didn't test any courses in isolation.

Your comments above (regarding cache invalidation) will take some more thinking to respond to, so my response will come later. Just so you don't think I missed or ignored that part.

ShawnMilo added a commit that referenced this pull request Aug 7, 2015
Closes #467.

Improves upon PR #460 by getting rid of a "global" variable and
a custom settings.py value. Also adds tests which confirm that
fewer queries are done when the caching is used.
ShawnMilo added a commit that referenced this pull request Aug 11, 2015
Closes #467.

Improves upon PR #460 by getting rid of a "global" variable and
a custom settings.py value. Also adds tests which confirm that
fewer queries are done when the caching is used.