Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: infinite growth of cache when auto eviction is disabled #34210

Merged

Conversation

regisb
Copy link
Contributor

@regisb regisb commented Feb 8, 2024

Description

The course structure cache keys were previously set with an explicit timeout of None which was overriding the cache default timeout of 2h. This was OK in environments where the cache is configured with a maximum memory limit and an automatic key eviction strategy. But in others (such as Tutor), the course structure cache could grow infinitely.

It was agreed that course structure cache keys should be long-lived but should respect the default cache structure timeout. Thus, we set here the TTL to 1 week.

We can also configure Tutor to use a cache eviction policy. But that means we need to set a maxmemory value in Redis. It's not possible to set a value that will be appropriate for everyone:

  • if it's higher than the total memory (e.g: in small instances), server will crash before the cache is filled.
  • if it's too low (e.g: in large instances), the whole platform will abruptly slow down as many cache entries are suddenly evicted.

That question of whether Tutor should define a cache eviction policy is still under discussion, but it can be resolved independently of this change.

Supporting information

See discussion here: overhangio/tutor#984

Testing instructions

Please provide detailed step-by-step instructions for testing this change.

Deadline

ASAP. We would like to backport this change in Tutor.

@openedx-webhooks
Copy link

Thanks for the pull request, @regisb! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Feb 8, 2024
regisb added a commit to overhangio/tutor that referenced this pull request Feb 8, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.
regisb added a commit to overhangio/tutor that referenced this pull request Feb 8, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.

To manually expire existing keys, run:

    tutor local run cms ./manage.py cms shell -c "from django.core.cache import caches; c = caches['course_structure_cache']; [c.expire(key, 604800) for key in c.keys('*')]"
@ormsbee
Copy link
Contributor

ormsbee commented Feb 9, 2024

@robrap: I wanted to bring this to your attention. I don't think it will have a negative impact on you folks, but I'd like to go over your config on Monday if you have time.

@robrap
Copy link
Contributor

robrap commented Feb 12, 2024

@ormsbee: I got your message, but haven’t gotten to this. Let’s coordinate in Slack.

@ormsbee
Copy link
Contributor

ormsbee commented Feb 12, 2024

@robrap: Scheduled the meeting for tomorrow. Thank you!

regisb added a commit to overhangio/tutor that referenced this pull request Feb 13, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.

Thus, Redis now has a 4gb maxmemory. If you need more memory on your
instance, you should implement the "redis-conf" patch.

To manually expire existing keys, run:

    tutor local run cms ./manage.py cms shell -c "from django.core.cache import caches; c = caches['course_structure_cache']; [c.expire(key, 604800) for key in c.keys('*')]"
@regisb
Copy link
Contributor Author

regisb commented Feb 13, 2024

Thank you for being on top of this Dave & Robert :)

@robrap
Copy link
Contributor

robrap commented Feb 13, 2024

  • [inform] I'm in the midst of updating on our end, and will let you know when this is complete.
  • [request] It seems like this might warrant an ! in the commit message and a backward incompatible footnote that let's someone know that if they wish to maintain the old behavior, they will need an explicit cache setting for course_structure_cache and that its TIMEOUT setting needs to be set to None.

@robrap
Copy link
Contributor

robrap commented Feb 13, 2024

@regisb @ormsbee: My setting changes have been deployed. I'm not sure if you want to delay this by a day, but I guess it can go whenever you wish.

@regisb regisb force-pushed the overhangio/course-structure-cache-timeout branch from 78185f6 to dfa16b3 Compare February 14, 2024 08:38
@regisb
Copy link
Contributor Author

regisb commented Feb 14, 2024

I have updated the commit message to include backward-compatible instructions, as well as the ! breaking change prefix.

See discussion here: overhangio/tutor#984

This is a breaking change that will explicitely set the timeout of
course structure cache entries to 1 week, instead of being unlimited. If
you wish to revert to the former behaviour, you should set
`course_structure_cache["TIMEOUT"] = None`.

The course structure cache keys were previously set with an explicit
timeout of `None` which was overriding the cache default timeout of 2h.
This was OK in environments where the cache is configured with a maximum
memory limit and an automatic key eviction strategy. But in others (such
as Tutor), the course structure cache could grow infinitely.

It was agreed that course structure cache keys should be long-lived but
should respect the default cache structure timeout. Thus, we set here
the TTL to 1 week.

We can also configure Tutor to use a cache eviction policy. But that
means we need to set a `maxmemory` value in Redis. It's not possible to
set a value that will be appropriate for everyone:
- if it's higher than the total memory (e.g: in small instances), server
  will crash before the cache is filled.
- if it's too low (e.g: in large instances), the whole platform will abruptly
  slow down as many cache entries are suddenly evicted.

That question of whether Tutor should define a cache eviction policy is
still under discussion, but it can be resolved independently of this
change.
@regisb regisb force-pushed the overhangio/course-structure-cache-timeout branch from dfa16b3 to ad201cd Compare February 14, 2024 08:39
@ormsbee ormsbee merged commit 4daf452 into openedx:master Feb 14, 2024
46 checks passed
@openedx-webhooks
Copy link

@regisb 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX staging environment in preparation for a release to production.

@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX production environment.

1 similar comment
@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX production environment.

@regisb regisb deleted the overhangio/course-structure-cache-timeout branch February 19, 2024 08:09
regisb added a commit to overhangio/tutor that referenced this pull request Feb 19, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.

Thus, Redis now has a 4gb maxmemory. If you need more memory on your
instance, you should implement the "redis-conf" patch.

To manually expire existing keys, run:

    tutor local run cms ./manage.py cms shell -c "from django.core.cache import caches; c = caches['course_structure_cache']; [c.expire(key, 604800) for key in c.keys('*')]"
regisb added a commit to overhangio/tutor that referenced this pull request Feb 19, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.

Thus, Redis now has a 4gb maxmemory. If you need more memory on your
instance, you should implement the "redis-conf" patch.

To manually expire existing keys, run:

    tutor local run cms ./manage.py cms shell -c "from django.core.cache import caches; c = caches['course_structure_cache']; [c.expire(key, 604800) for key in c.keys('*')]"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

5 participants