
worker pods design doc #1653

Open · wants to merge 1 commit into master from skriss:worker-pods-design
Conversation

@skriss (Contributor) commented Jul 11, 2019

Signed-off-by: Steve Kriss <krisss@vmware.com>

This is a draft design doc for #487. Please provide feedback!

@nrb @carlisia @prydonius

@nrb (Member) left a comment:

This also has implications for developer workflow - in order to do basic backup/restore operations, it will no longer be sufficient to run only a local server. An image with the relevant changes must also be uploaded before they can be tested.

@skriss (Contributor, Author) commented Jul 11, 2019

> This also has implications for developer workflow - in order to do basic backup/restore operations, it will no longer be sufficient to run only a local server. An image with the relevant changes must also be uploaded before they can be tested.

Yes, good point.

I definitely want to debate if this is the right model or not - it has several benefits but also potentially some drawbacks (this being one). I'm not yet at the point where I'm 100% on it.

@prydonius (Collaborator) left a comment:

> This also has implications for developer workflow - in order to do basic backup/restore operations, it will no longer be sufficient to run only a local server. An image with the relevant changes must also be uploaded before they can be tested.

There is also perhaps a benefit here in that you don't need the Velero server running in order to perform a backup with this approach. I could in fact run the `velero backup run` command locally and test any changes I've made. Testing the e2e flow, though, is indeed more difficult.


### Updates to `velero backup logs`

The `velero backup logs` command will be modified so that if a backup is in progress, the logs are retrieved from the worker pod's stdout (essentially proxying to `kubectl -n velero logs POD_NAME`). A `-f/--follow` flag will be added to allow streaming of logs for an in-progress backup.

@prydonius (Collaborator) · Jul 11, 2019

Even after completion, the Pod will be kept around until it's deleted, so we can choose to always pull the logs from the Pod as long as it is still around.

@skriss (Author, Contributor) · Jul 15, 2019

Yeah, it may be most efficient to always do that if the pod is still around, otherwise fall back to the archived logs in object storage.

@nrb (Member) · Jul 15, 2019

Would it be possible that the logs grow too large for the cluster to keep around?
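
For illustration, a minimal Go sketch (client-go, recent API versions that take a context) of the retrieval order discussed in this thread: stream from the worker pod's stdout while the pod still exists, otherwise fall back to the archived logs in object storage. The `backup-<name>` pod naming and the `downloadArchivedLogs` helper are assumptions of the sketch, not part of the proposal.

```go
package logs

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// BackupLogs streams logs for the given backup: from the worker pod if it
// still exists, otherwise from the archived log file in object storage.
// downloadArchivedLogs is a hypothetical helper standing in for the existing
// object-storage download path.
func BackupLogs(ctx context.Context, client kubernetes.Interface, namespace, backupName string, follow bool, out io.Writer, downloadArchivedLogs func(backupName string) (io.ReadCloser, error)) error {
	podName := "backup-" + backupName // assumed deterministic worker pod name

	// Check whether the worker pod is still around.
	_, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	switch {
	case err == nil:
		// Proxy the pod's stdout, optionally following it while the backup runs.
		req := client.CoreV1().Pods(namespace).GetLogs(podName, &corev1.PodLogOptions{Follow: follow})
		stream, err := req.Stream(ctx)
		if err != nil {
			return fmt.Errorf("streaming worker pod logs: %w", err)
		}
		defer stream.Close()
		_, err = io.Copy(out, stream)
		return err
	case apierrors.IsNotFound(err):
		// Worker pod is gone; fall back to the archived logs in object storage.
		rc, err := downloadArchivedLogs(backupName)
		if err != nil {
			return fmt.Errorf("downloading archived logs: %w", err)
		}
		defer rc.Close()
		_, err = io.Copy(out, rc)
		return err
	default:
		return err
	}
}
```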


### Changes to backup deletion controller

The Velero server keeps a record of current in-progress backups and disallows them from being deleted. This will need to be updated to account for backups running in worker pods. The most sensible approach is for the backup deletion controller to first check the backup's status via the API, and to allow deletion to proceed if it's not new/in-progress. If the backup is new or in-progress, the controller can check to see if a worker pod is running for the backup. If so, the deletion is disallowed because the backup is actively being processed. If there is no worker pod for the backup, the backup can be deleted as it's not actively being processed.

@prydonius (Collaborator) · Jul 11, 2019

Does this mean that the deletion request is marked as disallowed and never reconsidered, or will the deletion request just be re-enqueued at a later time and eventually cause the backup to be deleted after it has been completed?

@skriss (Author, Contributor) · Jul 15, 2019

I believe that currently they're disallowed, marked as "processed", and never re-processed. We can certainly look at changing that.

@carlisia (Member) · Jul 31, 2019

Could this be an appropriate place to delete the associated pod?
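
For illustration, a small Go sketch of the check described in this section: deletion is allowed unless the backup is new/in-progress and its worker pod still exists. The phase strings and the deterministic `backup-<name>` pod name are assumptions of the sketch.

```go
package deletion

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canDelete returns true if the backup may be deleted: either it is not
// new/in-progress, or it is but no worker pod exists for it (i.e. nothing
// is actively processing it).
func canDelete(ctx context.Context, client kubernetes.Interface, namespace, backupName, phase string) (bool, error) {
	if phase != "New" && phase != "InProgress" {
		return true, nil // completed/failed backups are never actively processed
	}

	podName := "backup-" + backupName // assumed deterministic worker pod name
	_, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return true, nil // no worker pod: the backup isn't being processed
	}
	if err != nil {
		return false, err
	}
	return false, nil // worker pod exists: the backup is actively being processed
}
```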


### Open items

- It probably makes sense to use Kubernetes Jobs to control worker pods, rather than directly creating "bare pods". The Job will ensure that a worker pod successfully runs to completion.

@prydonius (Collaborator) · Jul 11, 2019

One of the benefits of Jobs over Pods is being able to configure the maximum number of retries, but if we don't need to retry failed backups then a Pod should suffice. Do we retry backups today?

@nrb (Member) · Jul 15, 2019

As of today, we just fail backups and restores. Retries may be helpful; I'd appreciate community feedback on that.

@skriss (Author, Contributor) · Jul 15, 2019

In my mind there are two different scenarios to consider here:

  1. The backup completes, with a final phase of "Failed" or "PartiallyFailed". I would not currently consider this grounds for a retry.
  2. The worker pod gets evicted or otherwise forcibly terminated before the backup has had a chance to complete. In this case I would want to retry.
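
If Jobs are used, here is a minimal Go sketch (batch/v1 types) of what a worker Job could look like. `BackoffLimit` is the knob that would cover scenario 2 (pod-level failures such as eviction); scenario 1 would only trigger a retry if the worker process exited non-zero. Names, image, and arguments are placeholders, not proposed values.

```go
package worker

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newWorkerJob sketches a Job wrapping a worker pod for a single backup.
func newWorkerJob(namespace, backupName, image string) *batchv1.Job {
	backoffLimit := int32(2) // retries apply to pod-level failures (e.g. eviction), not "Failed" backups

	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: namespace,
			Name:      "backup-" + backupName, // deterministic name per backup
		},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever, // let the Job create replacement pods
					Containers: []corev1.Container{
						{
							Name:  "worker",
							Image: image,
							Args:  []string{"backup", "run", backupName}, // hypothetical worker entrypoint
						},
					},
				},
			},
		},
	}
}
```
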
- Currently, restic repository lock management is handled by an in-process lock manager in the Velero server. In order for backups/restores to safely run concurrently, the design for restic lock management needs to change. There is an [open issue](https://github.com/heptio/velero/issues/1540) for this which is currently scoped for the v1.1 release.
- There are several prometheus metrics that are emitted as part of the backup process. Since backups will no longer be running in the Velero server, we need to find a way to expose those values. One option is to store any value that would feed a metric as a field on the backup's `status`, and to have the Velero server scrape values from there for completed backups.
- Over time, many completed worker pods will exist in the `velero` namespace. We need to consider whether this poses any issue and whether we should garbage-collect them more aggressively.

@prydonius (Collaborator) · Jul 11, 2019

Just FYI, there is an alpha feature to clean up Jobs/Pods some amount of time after completion (https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#ttl-mechanism-for-finished-jobs). There's some work to be done to take this to beta, and no one has picked it up right now, so it's not clear when it will be promoted, but it's worth keeping in mind.

@nrb (Member) · Jul 15, 2019

Good to know about that alpha feature. I think, absent that feature, we'd be ok cleaning up pods when a Backup's TTL is met.

@prydonius (Collaborator) · Jul 15, 2019

If the Backup is deleted when the TTL is met, an easy way to do this would be to just have the Pod/Job's ownerReference set to the Backup.

@skriss (Author, Contributor) · Jul 15, 2019

Yeah, I was thinking we'd use owner refs here to get cascading deletion.
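
A minimal Go sketch of that approach: the worker pod (or Job) lists the Backup as its controlling owner, so deleting the Backup cascades to the pod via Kubernetes garbage collection. The `velero.io/v1` API version is an assumption of the sketch.

```go
package worker

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// setBackupOwner marks the Backup as the controlling owner of the worker pod,
// so that deleting the Backup garbage-collects the pod.
func setBackupOwner(pod *corev1.Pod, backupName string, backupUID types.UID) {
	controller := true
	pod.OwnerReferences = []metav1.OwnerReference{{
		APIVersion: "velero.io/v1", // assumed Backup API group/version
		Kind:       "Backup",
		Name:       backupName,
		UID:        backupUID,
		Controller: &controller,
	}}
}
```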


## Security Considerations

Each worker pod will be created in the `velero` namespace, will use the `velero` service account for identity, and will have the `cloud-credentials` secret mounted as a volume. As a result, the worker pods will have the same level of privilege as the Velero server pod.
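
For illustration, a rough Go sketch of a worker pod carrying the same identity and credentials as the server; the image, naming, and mount path are placeholders, not proposed values.

```go
package worker

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newWorkerPod sketches a worker pod that runs with the velero service
// account and has the cloud-credentials secret mounted, mirroring the
// Velero server pod's privileges.
func newWorkerPod(backupName, image string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: "velero",
			Name:      "backup-" + backupName, // deterministic name per backup
		},
		Spec: corev1.PodSpec{
			ServiceAccountName: "velero",
			RestartPolicy:      corev1.RestartPolicyNever,
			Volumes: []corev1.Volume{{
				Name: "cloud-credentials",
				VolumeSource: corev1.VolumeSource{
					Secret: &corev1.SecretVolumeSource{SecretName: "cloud-credentials"},
				},
			}},
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: image,
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "cloud-credentials",
					MountPath: "/credentials", // assumed mount path
				}},
			}},
		},
	}
}
```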

@prydonius (Collaborator) · Jul 11, 2019

Presumably we could lower the privileges of the Velero server if it now only needs to create these Jobs/Pods and no longer needs cluster-admin access for backing up all the resources.

@nrb (Member) · Jul 15, 2019

The Velero server's permissions would be reduced, but Velero overall would still need cluster-admin access.

So the server itself does change, but in terms of Velero's overall set of permissions, I don't think it changes things.

That said, this design may enable the server to instantiate a worker pod with the requesting user's same permissions, and the backup flow within that worker pod could differentiate between not found and permission denied errors to handle logging. This requires more detailed thought, though, as even then access to cluster resources is needed if we continue to do things like follow a Pod->PV->PVC.

@prydonius (Collaborator) · Jul 15, 2019

Yeah you're right that the workers will still need cluster-admin, though this still feels like an improvement as the long-running server process no longer needs it.

> That said, this design may enable the server to instantiate a worker pod with the requesting user's same permissions

Interesting, how do you see this working? Would the user need to specify the ServiceAccount to use for the worker?

@nrb (Member) · Jul 15, 2019

That's a good question. It could be exposed for them to set, or maybe it's created on the fly to match the current user? That may be tricky to get right, so likely less desirable.

@nrb (Member) left a comment:

Overall, I think this proposal is fairly simple, and accomplishes a few things at once (backup/restore operations, including scheduled ones, not blocking each other, and immediate log streaming).

My read on this is that it's likely less of an engineering time investment than the alternatives, too, but I'm not 100% positive there.


> There is also perhaps a benefit here in that you don't need the Velero server running in order to perform a backup with this approach. I could in fact run the `velero backup run` command locally and test any changes I've made. Testing the e2e flow though is indeed more difficult.

That's true! And may make some testing scenarios easier, in fact. So I suppose it may be an acceptable change. It'll definitely have to come with docs for contributors.

That said, I would also want to make sure we're still "dogfooding" the actual workflow end users are asked to use, just to be sure we're seeing the same issues.


One alternative option is to simply increase the number of workers being used per controller. This would, in theory, allow multiple backups/restores to be processed concurrently by the Velero server. The downsides to this option are that (1) this doesn't do anything to improve Velero's scalability, since all of the work is still being done within a single pod/process, and (2) it doesn't easily enable log streaming for in-progress backups.

### Run multiple replicas of the Velero server

@nrb (Member) · Jul 15, 2019

Have you given thought to leader election at server startup? That was my initial thought, not per-backup.

The problem with per-backup log streaming is still there, though.

@skriss (Author, Contributor) · Jul 15, 2019

What would the non-leaders do in that scenario?

@nrb (Member) · Jul 15, 2019

They would become workers, similar in function to what's proposed here. However, that means we could still only process as many backups at once as there are replicas.

@skriss (Author, Contributor) · Jul 15, 2019

Ah, so the leader assigns a worker to a backup, and the worker processes it?

@nrb (Member) · Jul 15, 2019

Yeah, that's what I was thinking.
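
For reference, a rough Go sketch of server-startup leader election using client-go's leaderelection package, to illustrate the alternative being discussed; the lock name, namespace, identity, and timings are placeholders, and this is not part of the proposal.

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection lets only one replica act as the "leader" (scheduler)
// at a time; the other replicas would act as workers in the model discussed above.
func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, id string, runLeader func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Namespace: "velero", Name: "velero-server-lock"}, // placeholder lock
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: runLeader,  // dispatch backups to workers while leading
			OnStoppedLeading: func() {},  // fall back to worker-only behavior
		},
	})
}
```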



## Alternatives Considered

@nrb (Member) · Jul 15, 2019

Another alternative may be a static set of workers that process some queue of items. However, at the moment I would prefer pod-per-item due to the simplicity of implementing log streaming.

@nrb added the Design label · Jul 16, 2019


> - There are several prometheus metrics that are emitted as part of the backup process. Since backups will no longer be running in the Velero server, we need to find a way to expose those values. One option is to store any value that would feed a metric as a field on the backup's `status`, and to have the Velero server scrape values from there for completed backups.

@nrb (Member) · Jul 18, 2019

I think we should do some more research on how distributed metrics are tracked with prometheus. It may be wishful thinking, but we can't be the only ones to want to do this.

I also think the public API on our metrics package can be sufficient for masking most of this from the backup code, though the logic inside may change to route the metric correctly.
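
Purely as an illustration of the "metric values as status fields" option from the open items (the field and metric names below are hypothetical, not existing Velero API fields or metrics), the server could republish values from completed backups through the existing metrics package:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical status fields a worker pod could populate for a completed
// backup; the Velero server would read these and feed its Prometheus metrics.
type backupMetricsStatus struct {
	ItemsBackedUp   int
	Errors          int
	DurationSeconds float64
}

var backupDuration = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "velero_backup_duration_seconds", // placeholder metric name
		Help: "Duration of completed backups, scraped from backup status.",
	},
	[]string{"backup"},
)

func init() {
	prometheus.MustRegister(backupDuration)
}

// recordFromStatus republishes the values a worker wrote into the backup's
// status. The server would call this when it observes a completed backup.
func recordFromStatus(backupName string, s backupMetricsStatus) {
	backupDuration.WithLabelValues(backupName).Set(s.DurationSeconds)
}
```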

@carlisia (Member) left a comment:

This is looking pretty solid.

> - Currently, restic repository lock management is handled by an in-process lock manager in the Velero server. In order for backups/restores to safely run concurrently, the design for restic lock management needs to change. There is an [open issue](https://github.com/heptio/velero/issues/1540) for this which is currently scoped for the v1.1 release.

@carlisia (Member) · Jul 31, 2019

v1.x release now.



Velero controllers will no longer directly execute backup/restore logic themselves (note: the rest of this document will refer only to backups, for brevity, but it applies equally to restores). Instead, when the controller is informed of a new backup custom resource, it will immediately create a new worker pod which is responsible for end-to-end processing of the backup, including validating the spec, scraping the Kubernetes API server for resources, triggering persistent volume snapshots, writing data to object storage, and updating the custom resource's status as appropriate.

A worker pod will be given a deterministic name based on the name of the backup it's executing. This will prevent Velero from inadvertently creating multiple worker pods for the same backup, since any subsequent attempts to create a pod with the specified name will fail.

@carlisia (Member) · Jul 31, 2019

If the pod name will match the given backup name, I suppose we will first check whether that backup already exists in storage before creating the pod? If so, we would be able to make use of the worker pod itself for all validations except this one.

@nrb (Member) · Aug 1, 2019

Doing the check before making the pod makes sense to me. Fairly sure we already check to see if the backup exists before starting one now, so this would go after that check.
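
A small Go sketch of that create path: the deterministic name means a second create attempt for the same backup fails with `AlreadyExists`, which the controller can treat as "a worker already exists for this backup". This assumes a recent client-go; the pod spec itself would be built elsewhere.

```go
package worker

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createWorkerPod creates the worker pod for a backup, relying on the
// deterministic name to guarantee at most one worker per backup: a second
// create attempt fails with AlreadyExists and is treated as a no-op.
func createWorkerPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	_, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil // a worker pod for this backup already exists
	}
	if err != nil {
		return fmt.Errorf("creating worker pod %s: %w", pod.Name, err)
	}
	return nil
}
```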



This design trivially enables running multiple backups concurrently, as each one runs in its own isolated pod and the Velero server's backup controller does not need to wait for the backup to complete before spawning a worker pod for the next one. Additionally, Velero becomes much more scalable, as the resource requirements for the Velero server itself are largely unaffected by the number of backups. Instead, Velero's scalability becomes limited only by the total amount of resources available in the cluster to process worker pods.

@carlisia (Member) · Jul 31, 2019

Is available cluster capacity something we would need to check before initiating a backup? What happens if there aren't enough resources to complete a backup? (I suppose the same questions would apply with today's design, but the scenario is much less likely to happen.)

DRAFT worker pods design doc
Signed-off-by: Steve Kriss <krisss@vmware.com>

@skriss force-pushed the skriss:worker-pods-design branch from a6dfef2 to a12eba5 · Sep 4, 2019

@skriss (Contributor, Author) commented Sep 13, 2019

Crazy idea: we could do away with the restic daemonset, move the PVB/PVR controllers into the core Velero server process, and have them dispatch worker pods to run restic backups/restores on the appropriate nodes.

That would even enable us to back up hostPath volumes, since we would know at the time of the worker pod creation which volumes to back up.

@nrb (Member) commented Sep 13, 2019

Doesn't sound that crazy to me. If I'm following, would this mean the `--use-restic` flag in `velero install` is essentially a no-op, as the restic controllers would just always run?

@skriss (Contributor, Author) commented Sep 13, 2019

Yeah, probably. We could disable the PVB/PVR controllers by default if we wanted to, and only enable them if `--use-restic` is specified. Not sure yet.

@nrb (Member) commented Sep 13, 2019

In the abstract, I like the idea of moving the restic work from the DaemonSet to, well, workers. It logically unifies the "work" vs "scheduling" concepts.

@prydonius (Collaborator) commented Sep 13, 2019

+1 to running restic operations in worker pods, that's a great idea!

> Yeah, probably. We could disable the PVB/PVR controllers by default if we wanted to, and only enable them if `--use-restic` is specified. Not sure yet.

IMO we should definitely disable them by default until we consider restic GA, and even once restic is GA it might make sense to disable them, especially if we don't see restic being used in conjunction with volume snapshots. (Good use case for feature flags.)

@skriss (Contributor, Author) commented Sep 16, 2019

Another thing to think about: we probably need to synchronize execution of hooks. It's probably not ok to run fsfreeze twice for the same pod because two backups are occurring simultaneously.
