Better way to do "scheduled chaos" #1223
Comments
Here are some thoughts about how scheduled chaos can be implemented. The idea is to continue using ChaosEngine to describe chaos intent, with a separate CR that purely defines scheduling policies. The scheduler controller can create/delete the engine based on the policy selected, with enhancements over time to smart-generate the engine from certain default templates (phase-2).
While Kubernetes cron is pretty good in itself, custom logic is needed on top of it to provide other capabilities such as instance count definitions, included/excluded days, randomness quotient, etc., which are needed in the chaos context.
# <----------! PHASE-1 !------------>
## fields may be set/empty depending on type=immediate(now)/onceAtTime(once)/
## repeat(b/w intervals).
## ChaosScheduler controller creates the engine resource depending upon schedule policy
## Below schema allows you to reuse chaosengine as is, continue to be gitOps-controlled
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-chaos
spec:
  schedule:
    type: "repeat"
    startTime: "2020-02-04T13:09:00Z"
    endTime: "2020-02-04T13:14:00Z"
    minChaosInterval: ""
    instanceCount: "2"
    includedDays: 0-6
    random: false
  ## suspend: true to freeze/halt chaos
  suspend: false
  ## embed the entire engine
  engineTemplate: |
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: nginx-chaos
    spec:
      appinfo:
        appns: default
        applabel: "app=nginx"
        appkind: deployment
      engineState: 'active'
      chaosServiceAccount: pod-delete-sa
      monitoring: false
      jobCleanUpPolicy: 'delete'
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION
                  value: '30'
# <----------! PHASE-2 !------------>
## the engine template is much shorter, i.e., only salient information will
## be passed such as experiment name & serviceaccount. The chaos-scheduler
## operator (actually, a co-controller inside the scheduler operator, let's call it
## engine-constructor) will be able to derive the rest from a pre-defined config
## template (configmap or any other?) which provides a standard set of
## run-properties (jobCleanupPolicy, monitoring). The app info will be derived by
## the constructor controller (which runs as a sidecar, say) based on an annotation:
## litmuschaos.io/chaos-schedule: schedule-nginx-chaos
## This in principle is similar to the requirement discussed here:
## https://github.com/litmuschaos/litmus/issues/1227
kind: ChaosSchedule
metadata:
  name: schedule-nginx-chaos
spec:
  schedule:
    type: "repeat"
    startTime: "2020-02-04T13:09:00Z"
    endTime: "2020-02-04T13:14:00Z"
    minChaosInterval: ""
    instanceCount: "2"
    includedDays: "monday,tuesday,wednesday"
    random: false
  ## suspend: true to freeze/halt chaos
  suspend: false
  ## embed the entire engine
  engineTemplate:
    spec:
      chaosServiceAccount: pod-delete-sa
      experiments:
        - pod-delete
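To make the phase-2 idea concrete, here is a minimal Python sketch of what the engine-constructor step might look like: it overlays the abridged template on a set of default run-properties and fills in the app info. The function name `build_engine`, the `DEFAULT_RUN_PROPERTIES` values, and the way app info is passed in are all assumptions for illustration; the proposal only says the defaults would come from a config template and the app info from an annotation.

```python
import copy

# Hypothetical defaults; the proposal only says they would come from a
# pre-defined config template (a ConfigMap or similar).
DEFAULT_RUN_PROPERTIES = {
    "engineState": "active",
    "monitoring": False,
    "jobCleanUpPolicy": "delete",
}


def build_engine(engine_name, abridged_spec, appinfo, defaults=None):
    """Expand an abridged engineTemplate spec into a full ChaosEngine dict."""
    spec = copy.deepcopy(DEFAULT_RUN_PROPERTIES if defaults is None else defaults)
    # App info is assumed to have been looked up already (e.g. via the annotation).
    spec["appinfo"] = dict(appinfo)
    # Values given explicitly in the abridged template win over the defaults.
    for key, value in abridged_spec.items():
        if key != "experiments":
            spec[key] = value
    # Phase-2 lists experiments by bare name; expand each into the {name: ...} form.
    spec["experiments"] = [
        {"name": name} for name in abridged_spec.get("experiments", [])
    ]
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": engine_name},
        "spec": spec,
    }


engine = build_engine(
    "nginx-chaos",
    {"chaosServiceAccount": "pod-delete-sa", "experiments": ["pod-delete"]},
    {"appns": "default", "applabel": "app=nginx", "appkind": "deployment"},
)
```

The resulting dict has the same shape as the phase-1 embedded engine, minus the per-experiment env overrides, which the abridged template does not carry.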
Here is something I have been working on: a controller with all the functionality of a CronJob, driven by a CR YAML such as this:
So, my thoughts around these are: Does this make sense, @ksatchit?
IMO it might be beneficial to:
@ksatchit @rahulchheda can we dedicate some comments to writing down the requirements in plain English? I think we have gotten too deep into K8s controllers and so on. We will get back to the implementation aspects as well; however, I want to get the requirements understood first. After reading through the comments, I gather we want to create experiments via the engine. These experiments perhaps run only once. As per the problem statement, we would like to create them in a scheduled manner.
@AmitKumarDas, @cpitstick-argo's requirement is succinct in this regard. I have tried to keep the ChaosSchedule schema close to plain English.
Here is how I currently do it. One concern I would have is: do you need to delete the old ChaosEngine before you create a new one? Maybe this is an internal implementation detail, but I think it's important for keeping the Kubernetes cluster clean. Note that this is currently a Helm template, initialized with values that look like this:
Yes. The scheduler is expected to remove the ChaosEngine (and thereby other chaos resources) at the end of each instance.
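A rough Python sketch of how that create/delete lifecycle could be planned for a `repeat` schedule follows. It assumes the window between `startTime` and `endTime` is split evenly into `instanceCount` slots, with each engine created at the start of its slot and deleted at the end; that even split, the function name `plan_instances`, and the use of Python's weekday numbering (0 = Monday) for `includedDays` are assumptions, not something the proposal specifies.

```python
from datetime import datetime, timedelta, timezone


def plan_instances(start, end, instance_count, included_days=range(7)):
    """Return (create_at, delete_at) pairs that evenly divide [start, end)."""
    if instance_count < 1 or end <= start:
        return []
    slot = (end - start) / instance_count
    windows = []
    for i in range(instance_count):
        create_at = start + i * slot
        # Skip instances that would start on an excluded day.
        if create_at.weekday() in included_days:
            # The engine is removed at the end of its slot, before the next one starts.
            windows.append((create_at, create_at + slot))
    return windows


start = datetime(2020, 2, 4, 13, 9, tzinfo=timezone.utc)
end = datetime(2020, 2, 4, 13, 14, tzinfo=timezone.utc)
for create_at, delete_at in plan_instances(start, end, 2):
    print("create engine at", create_at.isoformat(), "delete at", delete_at.isoformat())
```

Because each delete time coincides with the next create time at the latest, at most one ChaosEngine from the schedule exists at any moment, which addresses the cluster-cleanliness concern raised above.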
An alpha version of the scheduler is now available. Refer:
Keeping this issue open for further refinement and discussion, but I will remove it from the milestone tracker. Enhancements to the alpha scheduler in 1.4 will be made via new issues.
1.5 includes an enhanced schema for the scheduler, OpenAPI v3 validation, and support for halt/resume chaos. Support for randomized scheduling is tracked here: #1562. Closing this issue now.
Use-case: To have a way to run experiments on a schedule, as in a cron.
There are two ways I know of to do this:
1. (Hacky): Set the experiment duration to MAX_INT (or equivalent) and then set the interval to whatever period is desired. This would, in theory, keep running for decades with a 32-bit MAX_INT (about 68 years if the duration is counted in seconds). However, it is clearly not the best architecture, as the engine runs the entire time and consumes more resources than may be necessary, especially on larger Kubernetes clusters. It is also extremely non-standard and does not take advantage of the standardized language for cron jobs.
2. Delete the ChaosEngine pod periodically utilizing the Kubernetes Cron framework, as such:
I view this as superior to #1 as it relies on more standardized infrastructure, but it is still a hack as it relies on the undefined behavior that experiments restart when the ChaosEngine pod is deleted.
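As a quick check on the time horizon of the first approach, the arithmetic can be sketched in Python. The unit of the duration is an assumption here (seconds, which is how values such as `TOTAL_CHAOS_DURATION: '30'` are written elsewhere in this thread); the `unit_seconds` parameter is a hypothetical knob for trying other units, since the result scales directly with it.

```python
INT32_MAX = 2**31 - 1                      # largest signed 32-bit integer
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # Julian year


def horizon_years(max_value=INT32_MAX, unit_seconds=1):
    """Years covered by a duration of max_value units of unit_seconds each."""
    return max_value * unit_seconds / SECONDS_PER_YEAR


print(round(horizon_years(), 2))  # duration counted in seconds
```

With the duration in seconds this comes to roughly 68 years, which is still effectively "forever" for a chaos run but well short of a millennium.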
Ergo: Litmus needs a more formal way to support this.