How can appending a date stamp work, and what is it for? #189

Open
dabrahams opened this issue Feb 29, 2024 · 4 comments

Comments

@dabrahams

From reading GitHub's docs on the underlying cache mechanism, it seems like the appending could only work if you were treating what we put in the key: option as one of the restore-keys:, because keys are matched exactly. But then, you offer a restore-keys: option too. So are you concatenating our specified key with the restore-keys?
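
If I'm reading the docs right, the only way this could work is something like the following, where the exact key never matches an existing entry and every restore goes through the prefix fallback (a sketch of what I assume the action passes to actions/cache; the key names and timestamp are invented for illustration):

```yaml
- uses: actions/cache@v4
  with:
    path: .ccache
    # The exact key is unique on every run because of the appended timestamp,
    # so it never matches an existing entry on restore...
    key: ccache-ubuntu-1709164800
    # ...and restores instead rely on prefix matching: the most recently
    # created cache whose key starts with this prefix is restored.
    restore-keys: ccache-ubuntu-
```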

Also, it's not clear what kind of scenario benefits from appending a date stamp. It seems like it can actually be counterproductive for cache invalidation: when an old cache is matched, it is marked "used" at the end of the run, so it ends up roughly as fresh as the copy with the new date stamp. Because cache space is limited and GitHub throws away the oldest entries, the matched old cache can outlive something that is still needed, even though in almost every scenario the copy with the new date stamp should take precedence. Can you explain why anyone would want this option enabled (and eventually put the answer in the README)?

@dabrahams
Author

Now I'm starting to understand one possible reason for the date stamp: GitHub won't replace an existing cache with the same key, which they don't warn us about. But then, if you allow people to turn off the date stamp, shouldn't you also automatically clear the existing entry, as shown here?
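
For concreteness, the kind of clearing step I mean could look like this (a sketch, not something this action provides; it assumes GitHub CLI 2.32+ on the runner, actions: write permission for the token, and an illustrative key name):

```yaml
- name: Clear the stale cache entry so the new one can be saved
  run: gh cache delete "ccache-ubuntu" || true  # ignore "not found" on first run
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    GH_REPO: ${{ github.repository }}
```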

@hendrikmuhs
Owner

Nice finding, highly appreciated! There have been so many fruitless debates about the timestamp; see the long discussion in #138. If you dig into the issues and PRs you'll find more. Your work collects the facts, and I agree it would be best to put that into the README.

After long debate I agreed to accept a PR making the timestamp optional. I have never been convinced of its usefulness, and therefore I am not using it either, like most users of this action. However, in the interest of the few who have a strong opinion, I merged the option to disable the timestamp. It is optional and used at your own risk (well, that is true anyway ;-) ). Later in #138 there are reports about problems, especially when actions run in parallel. That would actually also be my concern regarding the workaround you mention. Again, I am not convinced about disabling the timestamp, as I don't see what problem it solves; IMO it does not solve the problem of running out of space, which is definitely a problem for some large projects. However, I think this must be solved in a different way. But if anyone from the non-timestamp users is willing to implement the clear functionality, I'll happily merge it.

As you have spent a lot of time on it: feel free to put your findings into a PR for the README. I am super busy at the moment, so sorry for the slow response. But if you or anyone else wants to pick this up and improve the docs, I'll do my best to review and merge ASAP.

@dabrahams
Author

> there are reports about problems, especially when actions run in parallel.

Sure, if two jobs running in parallel try to write the same cache, there's a race condition, but that seems like a “buyer beware” situation any programmer should expect. If you wanted to protect the naïve user from this problem, you could always inject a job identifier into the cache key. I suppose that if GitHub's implementation were very poor it might be possible to corrupt caches, but “surely” they make cache writes functionally atomic, right?!

> I am not convinced about disabling the timestamp, as I don't see what problem it solves; IMO it does not solve the problem of running out of space, which is definitely a problem for some large projects

I don't understand why you would think it doesn't help prevent running out of space. The default behavior is to accumulate new caches and let GitHub clear out the older ones when it notices you're over the limit, which is not immediate AFAICT; I've often seen the message “Approaching total cache storage limit (27.3 GB of 10 GB Used)”. So by default, projects are constantly going over the limit. Mine was, and disabling the timestamp and using clear is preventing that: I seldom go over the limit at all anymore, because I'm not leaving old caches around; they get replaced.

But exceeding the cache limit wouldn't be a problem at all (except for GitHub's own resource usage concerns) if it weren't for the fact that old caches get marked as used when the updated cache is written. I have a broad matrix build, with each element of the matrix using and contributing cache. When one element finishes running, its new cache will be older than the old cache from another element of the matrix.

Maybe the problem you're referring to is that one single cache can exceed the limit on its own. But you could cap your max cache size at 10G and issue an error or warning if the user tries to select a larger cache size, so that's easily addressed, no?

I'm happy to submit a PR for the README once you and I come to some consensus here. I've never developed a GH action and don't have a clue how they're tested, so actually implementing the clear functionality is a bigger lift for me.

@hendrikmuhs
Owner

> If you wanted to protect the naïve user from this problem, you could always inject a job identifier into the cache key.

That's what the key option is for; users should use a different identifier for every case.

> I don't understand why you would think it doesn't help prevent running out of space.

That's not what I wrote. I wrote: "IMO it does not solve the problem of running out of space". In other words: you can prevent running out of space that way, in the same sense that you can prevent running out of space by not using the cache at all. I don't think disabling the timestamp is a good solution to the problem. It moves cache eviction into the client, but cache eviction should happen in the cache implementation itself (server-side). If you run over the limit, GitHub cleans up for you; GitHub does not reject new entries. That's how it should work. Note that GitHub has a much better view of the cache than a runner inside a workflow does.

> Maybe the problem you're referring to is that one single cache can exceed the limit on its own. But you could cap your max cache size at 10G and issue an error or warning if the user tries to select a larger cache size, so that's easily addressed, no?

max-size has a default of 500M. Are you setting it to a higher value? If you haven't changed it, a single cache fits comfortably within the limit.

> When one element finishes running, its new cache will be older than the old cache from another element of the matrix.

Sounds like a configuration problem to me. If you already have problems with short-lived cache entries, I don't understand how disabling append-timestamp helps.

Are you sharing caches between different matrix elements? I think that might be the problem. I use a different key for every matrix element to prevent sharing the cache. It doesn't make sense to me to share the cache between different build environments, because ccache wouldn't hit anyway, but the entries would live side by side. I'd rather use smaller individual caches per build environment.
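
Roughly like this (a sketch using this action's key and max-size inputs; the matrix values and size are just examples):

```yaml
strategy:
  matrix:
    os: [ubuntu-22.04, macos-14]
steps:
  - uses: hendrikmuhs/ccache-action@v1
    with:
      # One cache per workflow and matrix element, so build environments
      # never share (or evict) each other's entries.
      key: ${{ github.workflow }}-${{ matrix.os }}
      max-size: 300M  # smaller individual caches fit more elements under the limit
```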

Again, I am not against disabling the timestamp; use cases differ, and if you are happy with it, together with your manual clear, that's fine with me.

In my projects I don't run into the problems you describe; my approach would be:

  • 1 cache per workflow and matrix element
  • limit max-size to fit more elements into the repository limit
  • experiment with the --evict-older-than AGE option in ccache (would be a nice option for the action; see the sketch below)
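
For the last point, such a step could look like this (a sketch; it assumes ccache >= 4.4, which introduced --evict-older-than, and the 7d age is an arbitrary example):

```yaml
- name: Trim old objects from the local ccache
  # Drop objects not used within the last 7 days before the cache is saved.
  run: ccache --evict-older-than 7d
```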
