Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: allow to setup redis maxmemory #984

Closed
wants to merge 1 commit into from

Conversation

Ian2012
Copy link

@Ian2012 Ian2012 commented Jan 16, 2024

Description

In some of our production environments, we have experienced an incremental use of Redis capacity. After some research we found that this code (with no major changes since the nutmeg release) was storing course structure data in Redis without TTL as it the expected cache layer was memcached and it's configured by default to use the LRU eviction policy.

The above code stored the data in the following key format: course_structure.{{db}}{{version_id}} so there was a lot of outdated/unused data (due to changes in the courses) stored in Redis indefinitely.

Note: Although the majority of the keys stored in Redis were associated with sessions, those already had a TTL configured and didn't represent the majority of the data. It was estimated (using a sample) that every key in the course_structure namespace has 0.3 MB of data (for small courses) and there was around 54k keys there.

To configure an eviction policy in Redis we need to configure two params:

  • maxmemory: Setups the Redis memory usage limit.
  • maxmemory-policy: Setups the eviction policy used by Redis when the maxmemory limit is reached.

See Key eviction for more information.

The problem was discussed here in the Juniper release, but the problem is the same.

This PR solves that issue by configuring the above parameters.

@regisb
Copy link
Contributor

regisb commented Jan 22, 2024

This PR points to a serious issue: that the course structure cache can grow without bounds when many changes are made to the courses. The TIMEOUT=7200 parameter of the course_structure_cache is bypassed by the timeout=None argument in edx-platform.

Am I getting this right?

While I recognize that this is a serious issue, and I thank you for suggesting a fix, your approach has some important drawbacks:

  1. The max memory threshold will work for some users, but it will inevitably have to be tweaked for others, with a value that is very hard to guess.
  2. Enabling eviction for all keys may have some unintended effects which may be critical.

I think that the "right" fix would be to not override the default timeout, as was suggested by @ormsbee here. But I'm open to other approaches.

@ormsbee
Copy link
Contributor

ormsbee commented Jan 22, 2024

I didn't realize that Redis wouldn't evict something unless it had an explicit TTL set on it. I think it's fine to put a timeout on the structure cache entries, but it should be a long timeout (a week maybe?). I definitely wouldn't want to go to the default 7200 seconds.

@regisb
Copy link
Contributor

regisb commented Jan 23, 2024

Thanks for commenting Dave. We can definitely bump the current timeout from 7200s to 1 week. Would you be open to a PR that removed the None timeout?

I can confirm that course_structure* keys have no expiration time, as opposed to others:

$ tutor local exec redis bash
$ redis-cli -n 1
127.0.0.1:6379[1]> KEYS *
...
85) "default:1:celery-task-meta-f7cbb83b-f03f-4391-8ba1-cca0e5d314bd"
86) "course_structure:1:650d788f8e457ec4d9e18d64"
87) "course_structure:1:649d8994d3f9cafd41d4090f"
88) "course_structure:1:650d788f5998d7ab16e18d67"
89) "course_structure:1:64ff3dbf2918040af8062d4b"
90) "course_structure:1:649d89a5c6af476799d40911"
127.0.0.1:6379[1]> EXPIRETIME course_structure:1:649d89a5c6af476799d40911
(integer) -1
127.0.0.1:6379[1]> EXPIRETIME default:1:common.djangoapps.xblock_django.models.XBlockConfiguration.common.djangoapps.xblock_django.api.disabled_xblocks
(integer) 1705998245

@ormsbee
Copy link
Contributor

ormsbee commented Jan 23, 2024

@regisb: I was thinking this over last night. If having any timeout is sufficient to trigger Redis's LRU eviction policy, let's just set it for something that's absurdly high, like a year. I have on one occasion been saved by having structure cache entries > 1 month old. Tag me on that for a PR, and I'll merge it.

@regisb
Copy link
Contributor

regisb commented Jan 23, 2024

If having any timeout is sufficient to trigger Redis's LRU eviction policy

I might be wrong, but I don't think this is the case. In Redis, LRU eviction will only work if there is a defined maximum memory value. I would like to avoid defining such a value. If we do define such a value, then LRU eviction will kick in even for keys that do not have an expiration time.

Thus, defining a very large timeout will not resolve our issue.

@Ian2012
Copy link
Author

Ian2012 commented Jan 23, 2024

@regisb You are right, we would need to define the maxmemory parameter in Redis for the LRU eviction to work.

@ormsbee
Copy link
Contributor

ormsbee commented Jan 23, 2024

So if someone makes a PR that sets the timeout for structures to a week, I will merge it. But I really think that we do need to put some value for maxmemory and make Redis do LRU. We shouldn't be deploying our cache in a mode where it just explodes when it fills up. I don't know what has to happen in order to make that value scale with the size of the server though.

@Ian2012
Copy link
Author

Ian2012 commented Jan 24, 2024

@regisb Would you agree on creating a patch instead for it? So people can define whatever parameter they need for their redis installation. Also, this fix can be more easily backported to other tutor versions instead of edx-platform versions.

What do you think?

@regisb regisb self-assigned this Jan 30, 2024
@regisb
Copy link
Contributor

regisb commented Feb 6, 2024

Sorry for my slow response...

But I really think that we do need to put some value for maxmemory and make Redis do LRU. We shouldn't be deploying our cache in a mode where it just explodes when it fills up.

The flip side of that is that for smaller platforms, unused and useless cached entries would just accumulate until reaching the maxmemory, which would be a waste of resources.

Would you agree on creating a patch instead for it?

Yes, absolutely. We need to make it possible to customize Redis settings.

Here's what I propose:

  1. In edx-platform, remove the None timeout such that the cache key uses the default timeout.
  2. In edx-platform, bump the default timeout of course_structure_cache from 7200 to 1 week.
  3. In Tutor, also bump the default timeout of course_structure_cache from 7200 to 1 week.
  4. In Tutor, make it possible to customize Redis settings by introducing a dedicated patch.

That way, the default would be to not have a maxmemory for Redis, but people would be able to set it based on their requirements using a simple patch.

@DawoudSheraz DawoudSheraz self-assigned this Feb 6, 2024
@ormsbee
Copy link
Contributor

ormsbee commented Feb 7, 2024

That plan works for me, and I think it puts us in a better place than we are now.

But I really think that we do need to put some value for maxmemory and make Redis do LRU. We shouldn't be deploying our cache in a mode where it just explodes when it fills up.

The flip side of that is that for smaller platforms, unused and useless cached entries would just accumulate until reaching the maxmemory, which would be a waste of resources.

Just to be clear, I mean to say that Redis should have a maxmemory limit set for operational safety reasons, regardless of what we decide the TTL for any particular cache entries are. We could set it to a conservative number for smaller installs, and let larger ones override it with higher values.

@regisb
Copy link
Contributor

regisb commented Feb 8, 2024

Just to be clear, I mean to say that Redis should have a maxmemory limit set for operational safety reasons, regardless of what we decide the TTL for any particular cache entries are. We could set it to a consertive number for smaller installs, and let larger ones override it with higher values.

Yes, I agree, but I'm really wondering what this conservative number should be. Maybe 8G?

According to my experiments, we can hardcode maxmemory 8gb in redis.conf, and operators will be able to override it using just a patch. This would allow us to avoid introducing a new configuration setting. @Ian2012 would that work for you?

I opened an upstream PR here: openedx/edx-platform#34210
Here's the corresponding Tutor PR: #1001

regisb added a commit that referenced this pull request Feb 8, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.
regisb added a commit that referenced this pull request Feb 8, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.

To manually expire existing keys, run:

    tutor local run cms ./manage.py cms shell -c "from django.core.cache import caches; c = caches['course_structure_cache']; [c.expire(key, 604800) for key in c.keys('*')]"
@Ian2012
Copy link
Author

Ian2012 commented Feb 8, 2024

I agree on the patch but the default maxmemory seems too high. 4gb seems like a more reasonable default for most installations.

@regisb
Copy link
Contributor

regisb commented Feb 13, 2024

OK, let's go with 4GB.

regisb added a commit that referenced this pull request Feb 13, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.

Thus, Redis now has a 4gb maxmemory. If you need more memory on your
instance, you should implement the "redis-conf" patch.

To manually expire existing keys, run:

    tutor local run cms ./manage.py cms shell -c "from django.core.cache import caches; c = caches['course_structure_cache']; [c.expire(key, 604800) for key in c.keys('*')]"
regisb added a commit to overhangio/edx-platform that referenced this pull request Feb 14, 2024
See discussion here: overhangio/tutor#984

This is a breaking change that will explicitely set the timeout of
course structure cache entries to 1 week, instead of being unlimited. If
you wish to revert to the former behaviour, you should set
`course_structure_cache["TIMEOUT"] = None`.

The course structure cache keys were previously set with an explicit
timeout of `None` which was overriding the cache default timeout of 2h.
This was OK in environments where the cache is configured with a maximum
memory limit and an automatic key eviction strategy. But in others (such
as Tutor), the course structure cache could grow infinitely.

It was agreed that course structure cache keys should be long-lived but
should respect the default cache structure timeout. Thus, we set here
the TTL to 1 week.

We can also configure Tutor to use a cache eviction policy. But that
means we need to set a `maxmemory` value in Redis. It's not possible to
set a value that will be appropriate for everyone:
- if it's higher than the total memory (e.g: in small instances), server
  will crash before the cache is filled.
- if it's too low (e.g: in large instances), the whole platform will abruptly
  slow down as many cache entries are suddenly evicted.

That question of whether Tutor should define a cache eviction policy is
still under discussion, but it can be resolved independently of this
change.
ormsbee pushed a commit to openedx/edx-platform that referenced this pull request Feb 14, 2024
See discussion here: overhangio/tutor#984

This is a breaking change that will explicitely set the timeout of
course structure cache entries to 1 week, instead of being unlimited. If
you wish to revert to the former behaviour, you should set
`course_structure_cache["TIMEOUT"] = None`.

The course structure cache keys were previously set with an explicit
timeout of `None` which was overriding the cache default timeout of 2h.
This was OK in environments where the cache is configured with a maximum
memory limit and an automatic key eviction strategy. But in others (such
as Tutor), the course structure cache could grow infinitely.

It was agreed that course structure cache keys should be long-lived but
should respect the default cache structure timeout. Thus, we set here
the TTL to 1 week.

We can also configure Tutor to use a cache eviction policy. But that
means we need to set a `maxmemory` value in Redis. It's not possible to
set a value that will be appropriate for everyone:
- if it's higher than the total memory (e.g: in small instances), server
  will crash before the cache is filled.
- if it's too low (e.g: in large instances), the whole platform will abruptly
  slow down as many cache entries are suddenly evicted.

That question of whether Tutor should define a cache eviction policy is
still under discussion, but it can be resolved independently of this
change.
regisb added a commit that referenced this pull request Feb 19, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.

Thus, Redis now has a 4gb maxmemory. If you need more memory on your
instance, you should implement the "redis-conf" patch.

To manually expire existing keys, run:

    tutor local run cms ./manage.py cms shell -c "from django.core.cache import caches; c = caches['course_structure_cache']; [c.expire(key, 604800) for key in c.keys('*')]"
regisb added a commit that referenced this pull request Feb 19, 2024
See the discussion here: #984
And the upstream PR here: openedx/edx-platform#34210

The tl;dr is that the Redis course structure cache was growing without
bounds. While the upstream fix should resolve that issue, we decided
that Tutor should have a maxmemory limit and an eviction policy set for
operational safety.

Thus, Redis now has a 4gb maxmemory. If you need more memory on your
instance, you should implement the "redis-conf" patch.

To manually expire existing keys, run:

    tutor local run cms ./manage.py cms shell -c "from django.core.cache import caches; c = caches['course_structure_cache']; [c.expire(key, 604800) for key in c.keys('*')]"
@regisb
Copy link
Contributor

regisb commented Feb 20, 2024

Closed by #1001.

@regisb regisb closed this Feb 20, 2024
MichaelLimaOliveira pushed a commit to projeto-desenvolve-open/edx-platform that referenced this pull request May 24, 2024
See discussion here: overhangio/tutor#984

This is a breaking change that will explicitely set the timeout of
course structure cache entries to 1 week, instead of being unlimited. If
you wish to revert to the former behaviour, you should set
`course_structure_cache["TIMEOUT"] = None`.

The course structure cache keys were previously set with an explicit
timeout of `None` which was overriding the cache default timeout of 2h.
This was OK in environments where the cache is configured with a maximum
memory limit and an automatic key eviction strategy. But in others (such
as Tutor), the course structure cache could grow infinitely.

It was agreed that course structure cache keys should be long-lived but
should respect the default cache structure timeout. Thus, we set here
the TTL to 1 week.

We can also configure Tutor to use a cache eviction policy. But that
means we need to set a `maxmemory` value in Redis. It's not possible to
set a value that will be appropriate for everyone:
- if it's higher than the total memory (e.g: in small instances), server
  will crash before the cache is filled.
- if it's too low (e.g: in large instances), the whole platform will abruptly
  slow down as many cache entries are suddenly evicted.

That question of whether Tutor should define a cache eviction policy is
still under discussion, but it can be resolved independently of this
change.
ashajyothi-r pushed a commit to ts-tech-repo/edx-platform-quince that referenced this pull request Jun 5, 2024
See discussion here: overhangio/tutor#984

This is a breaking change that will explicitely set the timeout of
course structure cache entries to 1 week, instead of being unlimited. If
you wish to revert to the former behaviour, you should set
`course_structure_cache["TIMEOUT"] = None`.

The course structure cache keys were previously set with an explicit
timeout of `None` which was overriding the cache default timeout of 2h.
This was OK in environments where the cache is configured with a maximum
memory limit and an automatic key eviction strategy. But in others (such
as Tutor), the course structure cache could grow infinitely.

It was agreed that course structure cache keys should be long-lived but
should respect the default cache structure timeout. Thus, we set here
the TTL to 1 week.

We can also configure Tutor to use a cache eviction policy. But that
means we need to set a `maxmemory` value in Redis. It's not possible to
set a value that will be appropriate for everyone:
- if it's higher than the total memory (e.g: in small instances), server
  will crash before the cache is filled.
- if it's too low (e.g: in large instances), the whole platform will abruptly
  slow down as many cache entries are suddenly evicted.

That question of whether Tutor should define a cache eviction policy is
still under discussion, but it can be resolved independently of this
change.
krishna-ts pushed a commit to ts-tech-repo/edx-platform-quince that referenced this pull request Jun 11, 2024
See discussion here: overhangio/tutor#984

This is a breaking change that will explicitely set the timeout of
course structure cache entries to 1 week, instead of being unlimited. If
you wish to revert to the former behaviour, you should set
`course_structure_cache["TIMEOUT"] = None`.

The course structure cache keys were previously set with an explicit
timeout of `None` which was overriding the cache default timeout of 2h.
This was OK in environments where the cache is configured with a maximum
memory limit and an automatic key eviction strategy. But in others (such
as Tutor), the course structure cache could grow infinitely.

It was agreed that course structure cache keys should be long-lived but
should respect the default cache structure timeout. Thus, we set here
the TTL to 1 week.

We can also configure Tutor to use a cache eviction policy. But that
means we need to set a `maxmemory` value in Redis. It's not possible to
set a value that will be appropriate for everyone:
- if it's higher than the total memory (e.g: in small instances), server
  will crash before the cache is filled.
- if it's too low (e.g: in large instances), the whole platform will abruptly
  slow down as many cache entries are suddenly evicted.

That question of whether Tutor should define a cache eviction policy is
still under discussion, but it can be resolved independently of this
change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants