Deduplicate Check and Prunes for the same backup repository #214
Comments
Some ideas to spin:
This ain't exactly easy. I started implementing, but soon discovered that
The graphic is in the PR for editing, though it's not intended to stay there in that form.
One thought I just had: if we do this, should we provide stable deduplication for this? Example: A new … To guarantee the same interval regardless of operator restarts, it would need a way to know which schedule it should prefer.
Could we sort them by date of creation? It could work something like this:
Now, if a
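A rough sketch of the sort-by-creation-date idea (the Schedule stand-in type and helper below are illustrative, not k8up code): the oldest Schedule wins, so the same one is preferred after every operator restart.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// schedule is a minimal stand-in for the real Schedule resource.
type schedule struct {
	Name      string
	CreatedAt time.Time // would be metadata.creationTimestamp on the real object
}

// preferredSchedule picks the schedule that should own the deduplicated
// prune/check job: the one created first. Assumes at least one schedule.
func preferredSchedule(schedules []schedule) schedule {
	sort.Slice(schedules, func(i, j int) bool {
		return schedules[i].CreatedAt.Before(schedules[j].CreatedAt)
	})
	return schedules[0]
}

func main() {
	winner := preferredSchedule([]schedule{
		{Name: "ns-b/schedule", CreatedAt: time.Date(2021, 3, 2, 0, 0, 0, 0, time.UTC)},
		{Name: "ns-a/schedule", CreatedAt: time.Date(2021, 1, 5, 0, 0, 0, 0, time.UTC)},
	})
	fmt.Println("deduplicated job stays with", winner.Name) // ns-a/schedule
}
```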
I had another idea over the weekend: we could hash the repository string and the type and use that as the randomness seed (https://golang.org/pkg/math/rand/#Seed). So each type and repo combination will generate the same "random" time. This way we only have to track whether at least one of the jobs is registered for a given type/repo combination. By hashing the values before using them as the seed, it should generate enough spread for the schedules.
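A minimal sketch of that idea, assuming a plain FNV hash and math/rand (the job type and repository strings are made up, and this is not k8up's actual code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// deterministicDailyTime derives a stable "random" time of day for a given
// job type + repository combination. The same inputs always produce the same
// seed, so every operator restart computes the same schedule.
func deterministicDailyTime(jobType, repository string) (hour, minute int) {
	h := fnv.New64a()
	h.Write([]byte(jobType + "|" + repository))
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	return r.Intn(24), r.Intn(60)
}

func main() {
	hh, mm := deterministicDailyTime("prune", "s3:https://objects.example.com/backups")
	fmt.Printf("@daily-random for this repo resolves to %02d:%02d\n", hh, mm)
}
```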
It sounds like rand should not be used then, but rather a number should be deduced directly from the hash. For one, because there's the underlying assumption that the implementation of

My main concern with this solution: it's anything but obvious to understand. I.e., it's a very implicit solution, and in my experience implicit solutions are hard to understand right away for the next developer.
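For comparison, a sketch of deriving the time directly from the hash, without involving math/rand at all (again illustrative only, not k8up code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// minuteOfDay maps a job type + repository combination directly to a minute
// of the day (0..1439), with no dependency on rand's implementation details.
func minuteOfDay(jobType, repository string) (hour, minute int) {
	h := fnv.New64a()
	h.Write([]byte(jobType + "|" + repository))
	m := int(h.Sum64() % (24 * 60))
	return m / 60, m % 60
}

func main() {
	hh, mm := minuteOfDay("check", "s3:https://objects.example.com/backups")
	fmt.Printf("check for this repo would always run at %02d:%02d\n", hh, mm)
}
```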
Sure, we can use something else to generate the times, but I feel like we'd have to get the randomness down for the same types and repos. Your suggestion could still lead to garbled execution times if there are a lot of namespace changes on a cluster.
Unpopular opinion: the more we try to solve these "stable across restarts" problems, the more I'm convinced we should get rid of any internal state altogether, e.g. replace the cron library with K8s CronJobs etc. In a private project/operator I'm facing exactly the same problem: handling scheduling and restarts. I have found a working solution; we can discuss it if you're interested. At the moment I'm a bit hesitant to come up with complicated "solutions" that solve deduplication across restarts when using internal state. Maybe we should limit the deduplication feature to
If we implement the deduplication logic for

I agree that switching to k8s native CronJobs could help with some things, but it may make other things more complicated. I also agree that off-loading as much state as possible to k8s is desirable, but there are cases where I think having a small in-memory state could make sense, for example to reduce the amount of API queries. I'm interested in hearing your solution for that issue.
With the switch to Operator SDK, or rather controller-runtime, the client has a built-in read cache by default. Each GETted object lands in the cache and is automatically watched for changes. Repeated GETs for already retrieved objects don't even reach the API server anymore. It's actually harder to ignore the cache for certain objects. So, as far as performance goes, I think it's worse when we try to maintain our own, barely tested cache ;)
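For illustration, standard controller-runtime usage of that cached client (not k8up's actual code; the namespace name is arbitrary):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// The manager's default client reads through an informer-backed cache.
	c := mgr.GetClient()

	go func() {
		ns := &corev1.Namespace{}
		// Once the caches have synced, repeated Gets like these are answered
		// from the cache; the object is watched and kept up to date for free.
		_ = c.Get(context.Background(), client.ObjectKey{Name: "default"}, ns)
		_ = c.Get(context.Background(), client.ObjectKey{Name: "default"}, ns)
	}()

	// Starting the manager populates the cache and keeps it in sync.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```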
It depends on whether we also implement stable deduplication across restarts. If we decide to make it stable, we are accepting added complexity and reduced maintainability, whereas with ephemeral deduplication we can simplify the deduplication at the cost of missing schedules, as you described.
My personal opinion is that missing schedules are something the k8up operator should avoid as much as possible. Nobody wants a backup solution that may or may not trigger a job.
Hi, what's the current status of this issue in the latest version of k8up? Does k8up ensure that a
Hi @smlx, K8up doesn't yet deduplicate jobs to the same repository. However, there are already mechanisms in place that prevent two exclusive jobs (like prune and check) from running at the same time.
Summary
As a K8up user
I want to deduplicate Jobs that target the same repository
So that exclusive Jobs are not run excessively
Context
Check and Prune are Restic Jobs that need exclusive access to the backend repository: only one job can effectively run at the same time. However, multiple backups can target the same Restic repository.

The operator should deduplicate prune jobs that are managed by a smart schedule. For example, if there are multiple schedules with @daily-random prunes to the same S3 endpoint, the scheduler should only register one of them. But if the prunes have explicit cron patterns like "5 4 * * *" and "5 5 * * *", they should NOT be deduplicated. This ensures maximum flexibility if, for some reason, a user explicitly wants multiple prune runs.
Out of Scope

Further links
Acceptance criteria

- Given Schedules with either a Check or a Prune job, when the same randomized predefined cron syntax is specified and targets the same backup repository, then ignore the duplicated schedule of the same job type that also has the same schedule and backend.
- Given Schedules with jobs that are already deduplicated, when changing the cron schedule of one of the jobs, then remove the deduplication and schedule both jobs separately.
- Given Schedules with jobs that are already deduplicated, when changing the backend of a Schedule, then remove the deduplication and schedule the jobs separately.

Implementation Ideas