Support for Deploying Terminating Pods as Brokers #437

Merged: 39 commits into project-akri:main on Jan 13, 2022

Conversation

@kate-goldenring (Contributor) commented Dec 10, 2021

What this PR does / why we need it:
This adds support for deploying Jobs to devices discovered by Akri. Previously, Akri only supported deploying Pods that never terminated. Adding Jobs enables more scenarios for Akri, such as device management. It follows the design in this proposal: project-akri/akri-docs#17. Some specific changes to call out:

  • The Configuration CRD: (Breaking change) The brokerPodSpec section of the Configuration changed to a brokerType enum with brokerPodSpec or brokerJobSpec options (see the sketch after this list).
  • The Controller: Logic to deploy a Pod or Job depending on the brokerType of the Configuration. Logic to handle the brokerType of a Configuration being updated -- will bring down and redeploy workloads and services.
  • The Agent: Logic to not bring down a Configuration if the brokerType changed.
  • Helm Charts: In rbac.yaml, extends the Controller's ClusterRole to enable it to create, delete, and modify Jobs. Enables providing a JobSpec in Configurations, where completions, parallelism, and backoffLimit are configurable. All other types of JobSpec customizations will need to be done by creating a Configuration yaml and modifying it. Instructions for this can be added to our docs, just as they are for PodSpecs. Removed the default image for debugEcho -- docs will need to be updated for this.
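
As referenced in the Configuration CRD bullet above, here is a minimal sketch of what a Configuration with a Job broker could look like under the new schema. The brokerSpec/brokerJobSpec field names follow the paths discussed in the review below (e.g. .spec.brokerSpec.brokerJobSpec); the discoveryHandler details and values are placeholders, so treat this as an illustrative sketch rather than the authoritative schema:

  # Illustrative sketch only -- see the CRD in this PR for the authoritative schema.
  apiVersion: akri.sh/v0
  kind: Configuration
  metadata:
    name: akri-debug-echo-foo
  spec:
    discoveryHandler:
      name: debugEcho
      discoveryDetails: |+
        descriptions:
        - "foo0"
        - "foo1"
    brokerSpec:
      # Exactly one of brokerPodSpec or brokerJobSpec may be set.
      brokerJobSpec:
        completions: 1
        parallelism: 1
        backoffLimit: 2
        template:
          spec:
            containers:
            - name: broker
              image: busybox
              command: ["sh", "-c", "echo 'Hello World'; sleep 5"]
            restartPolicy: Never

The debugEcho.configuration.brokerJob.* Helm values used in the steps below render into a Configuration of roughly this shape.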

Docs
I opened a docs PR in tandem with this one that updates the Pod deployment documentation and adds documentation on Jobs: project-akri/akri-docs#21

Tests
This PR includes unit tests. Should we extend our e2e workflows to test Jobs? Maybe we could test just a subset of our platform/K8s distro matrix for Jobs to prevent too many runs. If so, can that be tabled to a later PR so as to not overload this one?

Special notes for your reviewer:
I went ahead and built and published some containers for folks to try out this implementation. You can run this on a Kubernetes playground if you do not have a cluster:

  1. Clone the repo, check out this branch, and package the Helm chart locally:

    git clone git@github.com:kate-goldenring/akri.git --branch strategy-jobs akri-jobs
    helm package akri-jobs/deployment/helm
    
  2. Install Akri with a debugEcho Configuration that will deploy a Job to each discovered device (foo0 and foo1). The Job will echo "Hello World" and then sleep for 5 seconds before terminating.

    helm install akri akri-0.8.0.tgz \
     --set agent.allowDebugEcho=true \
     --set debugEcho.discovery.enabled=true \
     --set debugEcho.discovery.image.tag=latest \
     --set controller.image.repository=ghcr.io/kate-goldenring/job-brokers/controller \
     --set controller.image.tag=v0.8.0-20220112_182137-amd64 \
     --set agent.image.repository=ghcr.io/kate-goldenring/job-brokers/agent \
     --set agent.image.tag=v0.8.0-20220107_184416-amd64 \
    --set debugEcho.configuration.brokerJob.image.repository=busybox \
    --set debugEcho.configuration.brokerJob.command[0]="sh" \
    --set debugEcho.configuration.brokerJob.command[1]="-c" \
    --set debugEcho.configuration.brokerJob.command[2]="echo 'Hello World'" \
    --set debugEcho.configuration.brokerJob.command[3]="sleep 5" \
    --set debugEcho.configuration.enabled=true
    
  3. Say you are feeling more exuberant and want the Job to echo "Hello Amazing World"; to do so, change the brokerSpec. Upgrade the installation and watch Akri delete all of the resources (Instances, Jobs) and recreate them, deploying the new Job (run kubectl logs on the Pods to verify).

    helm upgrade akri akri-0.8.0.tgz \
    --set agent.allowDebugEcho=true \
    --set debugEcho.discovery.enabled=true \
    --set debugEcho.discovery.image.tag=latest \
    --set controller.image.repository=ghcr.io/kate-goldenring/job-brokers/controller \
    --set controller.image.tag=v0.8.0-20220112_182137-amd64 \
    --set agent.image.repository=ghcr.io/kate-goldenring/job-brokers/agent \
    --set agent.image.tag=v0.8.0-20220107_184416-amd64 \
    --set debugEcho.configuration.brokerJob.image.repository=busybox \
    --set debugEcho.configuration.brokerJob.command[0]="sh" \
    --set debugEcho.configuration.brokerJob.command[1]="-c" \
    --set debugEcho.configuration.brokerJob.command[2]="echo 'Hello Amazing World'" \
    --set debugEcho.configuration.brokerJob.command[3]="sleep 5" \
    --set debugEcho.configuration.enabled=true
    
  4. Test out some of the Job settings by modifying the Job to run twice per device (completions=2) in parallel (parallelism=2). These Helm settings simply map to the equivalently named fields of the Kubernetes JobSpec.

    helm upgrade akri akri-0.8.0.tgz \
     --set agent.allowDebugEcho=true \
     --set debugEcho.discovery.enabled=true \
     --set debugEcho.discovery.image.tag=latest \
     --set controller.image.repository=ghcr.io/kate-goldenring/job-brokers/controller \
     --set controller.image.tag=v0.8.0-20220112_182137-amd64 \
     --set agent.image.repository=ghcr.io/kate-goldenring/job-brokers/agent \
     --set agent.image.tag=v0.8.0-20220107_184416-amd64 \
    --set debugEcho.configuration.brokerJob.image.repository=busybox \
    --set debugEcho.configuration.brokerJob.command[0]="sh" \
    --set debugEcho.configuration.brokerJob.command[1]="-c" \
    --set debugEcho.configuration.brokerJob.command[2]="echo 'Hello Amazing World'" \
    --set debugEcho.configuration.brokerJob.command[3]="sleep 5" \
    --set debugEcho.configuration.brokerJob.parallelism=2 \
    --set debugEcho.configuration.brokerJob.completions=2 \
    --set debugEcho.configuration.enabled=true
    
  5. Pods are still deployed as usual, and they can also be updated.

    helm install akri akri-0.8.0.tgz \
     --set agent.allowDebugEcho=true \
     --set debugEcho.discovery.enabled=true \
     --set debugEcho.discovery.image.tag=latest \
     --set controller.image.repository=ghcr.io/kate-goldenring/job-brokers/controller \
     --set controller.image.tag=v0.8.0-20220112_182137-amd64 \
     --set agent.image.repository=ghcr.io/kate-goldenring/job-brokers/agent \
     --set agent.image.tag=v0.8.0-20220107_184416-amd64 \
    --set debugEcho.configuration.brokerPod.image.repository=nginx \
    --set debugEcho.configuration.enabled=true
    

Note on future extension
Many deployment scenarios can be fulfilled with Jobs just by adjusting parallelism (the number of identical Pods that should run concurrently) and completions (the number of Pods of the Job that need to complete successfully). However, an additional scenario could be enabled by adding support in the Controller for setting podAntiAffinity (by Instance name).
For example, you could specify that you want 3 Pods of a Job running concurrently (parallelism=3 and completions=3) but at most 1 on each node. Limiting them to one per node can be done by setting podAntiAffinity in the JobSpec like so:

  spec:
    template:
      metadata:
        labels:
          akri.sh/instance: akri-instance-ha5h
      spec:
        containers:
        - name: pi
          image: perl
          command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        restartPolicy: Never
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: akri.sh/instance
                  operator: In
                  values:
                  - akri-instance-ha5h
              topologyKey: "kubernetes.io/hostname"

This would need to be done by the controller rather than in the Helm charts as it is likely desirable to set podAntiAffinity per device/instance rather than per Configuration.

If applicable:

  • this PR has an associated PR with documentation in akri-docs
  • this PR contains unit tests
  • added code adheres to standard Rust formatting (cargo fmt)
  • code builds properly (cargo build)
  • code is free of common mistakes (cargo clippy)
  • all Akri tests succeed (cargo test)
  • inline documentation builds (cargo doc)
  • version has been updated appropriately (./version.sh)
  • all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

@kate-goldenring (Contributor, Author):

It would be good to figure out how to break these big PRs into several smaller ones; no action needed for this one.

One way would be to make the Configuration CRD change in a first PR and then add the implementation in another.

@kate-goldenring (Contributor, Author) commented Jan 6, 2022:

> When a Job spec is changed, do you kill old Jobs and schedule new Jobs?
>
> edit: I think I found that the answer to this is yes. Do we need to rethink this? What if a firmware update of a device is in progress and you kill the Job? You might brick the device that is being flashed.

Good point. One thing we could do is never delete a Job -- let them stick around as "Completed" (which may be useful anyway for log purposes) -- and just deploy the new Job and add an anti-affinity that will not run it so long as another Job is running.

);
}
Some(broker) => {
if !should_recreate_config(&config, &prev_config_state) {
Collaborator commented:

Sorry, it has been a while since I've looked at this code ... should_recreate_config is confusing to me.

Here we are only acting if it returns false; it seems like we used to do nothing if it returned false.

The name is also confusing to me. Are we actually recreating a Configuration, or are we just executing code as though the Configuration were deleted and recreated?

Contributor (Author) commented:

I agree the name is unhelpful. It works in the Agent's context but not the Controller's. It may be best to rename it to something like broker_changed or, more specifically, only_broker_changed. This check is done by the Agent upon a Configuration change: if true, it does not recreate the Configuration but rather lets the change stand (no further action). The Controller does this check too, and if true, it redeploys the brokers to reflect the new Configuration. This is our first step toward more explicit handling of Configuration modifications. For this PR, we could leave this out and stick to the old way of deleting and reapplying a Configuration, but that means Instances will be brought down and redeployed. We need to figure out how to handle this gracefully, and this was my first stab at it. It does mean, though, that if the Controller goes down for some reason, a Configuration may be applied that does not reflect the system (since the Controller was not there to redeploy the brokers).

Contributor (Author) commented:

@bfjelds @jiria @romoh on this note, I am thinking that, at least for this PR, we should remove support for handling broker changes and stick to our old approach of the Agent deleting and reapplying modified Configurations. This would make this PR smaller, as it would remove the Configuration watcher from the Controller, removing the "Configuration Event" section of the flow diagram here for the time being.

Contributor commented:

All in favor of splitting into smaller PRs.

Another thought on how to handle Controller restarts: we could cache the Configuration that was used to create a broker as part of a broker annotation. Then the Controller could check whether the broker is based on the latest Configuration or not.
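
For illustration, such an annotation might look something like the sketch below; the annotation keys and Job name are hypothetical, not something defined by Akri or this PR:

  # Hypothetical sketch: the annotation keys and Job name are illustrative only.
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: akri-debug-echo-foo-8120fe-job          # hypothetical broker Job name
    annotations:
      akri.sh/configuration: akri-debug-echo-foo  # Configuration this broker was rendered from
      akri.sh/configuration-generation: "3"       # generation of that Configuration at render time
  spec:
    template:
      spec:
        containers:
        - name: broker
          image: busybox
          command: ["sh", "-c", "echo 'Hello World'"]
        restartPolicy: Never

On restart, the Controller could then compare the cached generation against the Configuration's current metadata.generation and redeploy only when they differ.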

Contributor (Author) commented:

Good point. That can also be done simply via the Job name; I am using the Configuration generation in the Job name. However, someone could feasibly delete a Configuration and reapply a new one while the Controller is down, and then the Controller would wrongly think it has the right state.

@kate-goldenring (Contributor, Author):

> When a Job spec is changed, do you kill old Jobs and schedule new Jobs?
>
> edit: I think I found that the answer to this is yes. Do we need to rethink this? What if a firmware update of a device is in progress and you kill the Job? You might brick the device that is being flashed.
>
> Good point. One thing we could do is never delete a Job -- let them stick around as "Completed" (which may be useful anyway for log purposes) -- and just deploy the new Job and add an anti-affinity that will not run it so long as another Job is running.

@jiria anti-affinity cannot be set in a JobSpec; however, we could set it in the PodSpecs as illustrated in the "Note on future extension" section of the PR description above. That would restrict any two Pods of a Job from running on a node, which seems okay, but there may be a scenario (though it doesn't seem likely) where someone wants two identical Pods of a Job (parallelism=2+) to run on the same node at once.

@kate-goldenring (Contributor, Author) commented Jan 7, 2022:

@romoh @jiria @bfjelds to reduce the size of this PR, I have removed the Configuration watcher (config_action.rs, configuration_state.rs) from the Controller and am now ignoring brokerSpec changes in the Agent (config_action.rs) -- motivated by this note. Not only was this the most confusing section (the Controller and Agent coordinating Configuration recreation, letting Jobs finish before deleting them, managing Configuration state), it is also hard to justify: it was added only to enable modifying a brokerSpec without bringing down and recreating Instances, and since Instances are still directly tied to a Configuration, this seems unnecessary. I have updated the PR description -- the examples point to newly built containers. There should now be much less to review. I also updated the proposal and added a section on concerns over gracefully handling Configuration changes. Gracefully handling Configuration changes is a discussion we can leave for another proposal and PR.

@jiria (Contributor) commented Jan 11, 2022:

> When a Job spec is changed, do you kill old Jobs and schedule new Jobs?
>
> edit: I think I found that the answer to this is yes. Do we need to rethink this? What if a firmware update of a device is in progress and you kill the Job? You might brick the device that is being flashed.
>
> Good point. One thing we could do is never delete a Job -- let them stick around as "Completed" (which may be useful anyway for log purposes) -- and just deploy the new Job and add an anti-affinity that will not run it so long as another Job is running.
>
> @jiria anti-affinity cannot be set in a JobSpec; however, we could set it in the PodSpecs as illustrated in the "Note on future extension" section of the PR description above. That would restrict any two Pods of a Job from running on a node, which seems okay, but there may be a scenario (though it doesn't seem likely) where someone wants two identical Pods of a Job (parallelism=2+) to run on the same node at once.

Why do we need the anti-affinity in this case?

Re cleanup, IMHO some cleanup might be useful, but perhaps not killing the running Jobs.

Also, from another point of view, we might not want to have two firmware updates running at the same time. Is there a good way for the two Jobs to signal each other?

@kate-goldenring (Contributor, Author):

> When a Job spec is changed, do you kill old Jobs and schedule new Jobs?
>
> edit: I think I found that the answer to this is yes. Do we need to rethink this? What if a firmware update of a device is in progress and you kill the Job? You might brick the device that is being flashed.
>
> Good point. One thing we could do is never delete a Job -- let them stick around as "Completed" (which may be useful anyway for log purposes) -- and just deploy the new Job and add an anti-affinity that will not run it so long as another Job is running.
>
> @jiria anti-affinity cannot be set in a JobSpec; however, we could set it in the PodSpecs as illustrated in the "Note on future extension" section of the PR description above. That would restrict any two Pods of a Job from running on a node, which seems okay, but there may be a scenario (though it doesn't seem likely) where someone wants two identical Pods of a Job (parallelism=2+) to run on the same node at once.
>
> Why do we need the anti-affinity in this case?
>
> Re cleanup, IMHO some cleanup might be useful, but perhaps not killing the running Jobs.
>
> Also, from another point of view, we might not want to have two firmware updates running at the same time. Is there a good way for the two Jobs to signal each other?

@jiria per this comment and the changes associated with it, I removed all graceful handling of Configuration changes from this PR. As this thread shows, we have plenty to discuss here. For now, Jobs will be deleted immediately upon Configuration changes, just like Pods are. We don't do anything graceful for Pods currently either. See this diagram from the proposal PR for a visual. As to your questions, "anti-affinity" was the signaling I was hoping for; i.e., I was looking for a way for a Job to know that another Job is running and therefore not run. Anti-affinity cannot do that for Jobs. I agree that we should think this through for firmware update scenarios and make a plan.

// Misplaced `resources`
// Valid: .request.object.spec.brokerSpec.brokerPodSpec.containers[*].resources
// Invalid: .request.object.spec.brokerSpec.brokerPodSpec.resources
const INVALID_BROKER_POD_SPEC: &str = r#"
Collaborator commented:

I think it would be good to have an invalid test where both Job and Pod specs are defined.

Contributor (Author) commented:

My understanding is that that type of request would never even get to the admission controller because it would never be correctly applied to the cluster. I believe a Configuration must be successfully applied (matching the expected CRD format) before it is passed to an admission controller as an AdmissionReview. So this test is better suited as a unit test in configuration.rs, affirming that such a Configuration will not pass the expectation that it is of the Configuration CRD type.

Contributor (Author) commented:

I had the wrong thinking around this. Currently, our CRD does not specify oneOf with regard to brokerSpec. It seems that we cannot use the oneOf structural schema specifier since we need to set x-kubernetes-preserve-unknown-fields: true. We should investigate further to see whether our Configuration CRD can prevent specifying both types of brokers -- that way, the error isn't caught by our application and the apply errors out instead. For now, here is the test that asserts that serde will catch this and error out the Agent: 20422dc
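
For context, here is a rough sketch of the kind of invalid Configuration that test covers -- one that specifies both broker types under brokerSpec (field names as used elsewhere in this PR; values are placeholders):

  # Invalid by intent: brokerSpec sets both brokerPodSpec and brokerJobSpec.
  # Without a oneOf constraint in the CRD, the API server accepts this, so the Agent's
  # serde deserialization is what rejects it (per the test referenced above).
  apiVersion: akri.sh/v0
  kind: Configuration
  metadata:
    name: akri-debug-echo-foo
  spec:
    discoveryHandler:
      name: debugEcho
      discoveryDetails: ""
    brokerSpec:
      brokerPodSpec:
        containers:
        - name: broker
          image: nginx
      brokerJobSpec:
        template:
          spec:
            containers:
            - name: broker
              image: busybox
            restartPolicy: Never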

@kate-goldenring (Contributor, Author):

@romoh @jiria this should be ready for a final look

@kate-goldenring merged commit efffe84 into project-akri:main on Jan 13, 2022.
@adithyaj mentioned this pull request on Nov 7, 2022.