
Better way to do "scheduled chaos" #1223

Closed · cpitstick-argo opened this issue Feb 24, 2020 · 12 comments

Comments

@cpitstick-argo

Use case: to have a way to run experiments on a schedule, as with a cron job.

There are two ways I know of to do this:

  1. (Hacky): Set the experiment duration to MAX_INT (or equivalent) and set the chaos interval to whatever period is desired. This would, in theory, keep a single engine running for decades. It is clearly not the best architecture, though: the engine is running the entire time, consuming more resources than may be necessary, especially on larger Kubernetes clusters. It is also extremely non-standard and does not take advantage of the standardized cron syntax. (A rough sketch of this approach is included at the end of this comment.)

  2. Delete the ChaosEngine runner pod periodically using the Kubernetes CronJob framework, like so:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cpu-hog-litmus-chaos-cron
  namespace: {{ .chaos-namespace }}
spec:
  schedule: "*/5 * * * *" #Every five minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cpu-hog-litmus-chaos-cron
            image: alpine
            args:
            - /bin/sh
            - -c
            - 'apk add curl &&
              curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.14.9/bin/linux/amd64/kubectl &&
              chmod +x ./kubectl &&
              ./kubectl delete pod engine-hello-world-node-cpu-hog-runner -n hello-world-example'
          restartPolicy: OnFailure

I view this as superior to #1 because it relies on more standardized infrastructure, but it is still a hack: it depends on the undefined behavior that experiments restart when the ChaosEngine runner pod is deleted.

Ergo: Litmus needs a more formal way to support this.
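
For reference, a rough sketch of what approach #1 amounts to, assuming a simple pod-delete experiment (the names, labels, and values below are purely illustrative):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-hello-world-pod-delete
  namespace: hello-world-example
spec:
  appinfo:
    appns: hello-world-example
    applabel: "app=hello-world"
    appkind: deployment
  engineState: "active"
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # "run forever": chaos duration (in sec) set to MAX_INT
            - name: TOTAL_CHAOS_DURATION
              value: "2147483647"
            # period between chaos injections (in sec)
            - name: CHAOS_INTERVAL
              value: "600"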

@ksatchit (Member)

cc: @chandankumar4 @Sanjay1611

@ksatchit (Member) commented Apr 17, 2020

Here are some thoughts on how scheduled chaos could be implemented. The idea is to continue using the ChaosEngine to describe chaos intent, with a separate CR purely defining scheduling policies. The scheduler controller can create/delete the engine based on the selected policy, with enhancements over time to smart-generate the engine from certain default templates (phase-2).

While the Kubernetes cron is pretty good in itself, custom logic on top of it is needed to provide other capabilities required in the chaos context, such as instance count definitions, included/excluded days, a randomness quotient, etc.

# <----------! PHASE-1 !------------>

## fields may be set/empty depending on type = immediate (now) / onceAtTime (once) /
## repeat (between intervals).
## The ChaosScheduler controller creates the engine resource depending upon the schedule policy.
## The schema below allows you to reuse the ChaosEngine as is and remain gitOps-controlled.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-chaos
spec:
  schedule:
    type: "repeat"
    startTime: "2020-02-04T13:09:00Z"
    endTime: "2020-02-04T13:14:00Z"
    minChaosInterval: ""
    instanceCount: "2"
    includedDays: 0-6
    random: false
  ## suspend: true to freeze/halt chaos
  suspend: false
  ## embed the entire engine
  engineTemplate: |
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: nginx-chaos
    spec:
      appinfo:
        appns: default
        applabel: "app=nginx"
        appkind: deployment
      engineState: 'active'
      chaosServiceAccount: pod-delete-sa
      monitoring: false
      jobCleanUpPolicy: 'delete'
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION
                  value: '30'
# <----------! PHASE-2 !------------>

## the engine template is much shorter, i.e., only salient information such as the
## experiment name & serviceaccount will be passed. The chaos-scheduler
## operator (actually, a co-controller inside the scheduler operator; let's call it
## engine-constructor) will be able to derive the rest from a pre-defined config
## template (configmap or any other?) which provides a standard set of
## run-properties (jobCleanupPolicy, monitoring). The app info will be derived by
## the constructor controller (which runs as a sidecar, say) based on an annotation:

## litmuschaos.io/chaos-schedule: schedule-nginx-chaos

## This is in principle similar to the requirement discussed here:
## https://github.com/litmuschaos/litmus/issues/1227

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-chaos
spec:
  schedule:
    type: "repeat"
    startTime: "2020-02-04T13:09:00Z"
    endTime: "2020-02-04T13:14:00Z"
    minChaosInterval: ""
    instanceCount: "2"
    includedDays: "monday,tuesday,wednesday"
    random: false
  ## suspend: true to freeze/halt chaos
  suspend: false 
  ## embed the entire engine
  engineTemplate:
    spec:
      chaosServiceAccount: pod-delete-sa
      experiments:
        - pod-delete

Sanjay1611 self-assigned this Apr 17, 2020

@rahulchheda (Member)

Here is something I have been working on: creating a controller with all the functionality of a CronJob, driven by a CR YAML such as:

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: rdm2bq
  labels:
    name: rdm2bq
    role: job
spec:
  schedule: "* * * * *"
  concurrencyPolicy: "Forbid"
  suspend: false
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    metadata:
      name: engine123
      namespace: jack

@rahulchheda (Member)

So, my thoughts around these are:
- It's very easy to create a controller that just uses a simplified schema with just a crontab, e.g. "*/1 * * * *" (a sketch of this is included at the end of this comment).
- But I still feel the need to keep it in sync with the CronJob controller (of Kubernetes), to support all of its functionality and for ease of readability of the spec.
- So I would like to create a controller that has the logic of the CronJob controller embedded inside it. It seems very difficult at first, but I'm still trying to establish this as a POC.
- The other way to look at this is that we don't take the CronJob as a reference at all, since all of its functionality might not be needed in our case.

Does this make sense? @ksatchit
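
For illustration, the "simplified schema with just a crontab" from the first bullet could look something like this; the engineRef field is hypothetical and only meant to show the shape of the idea:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: nginx-chaos-every-minute
spec:
  # plain crontab string evaluated by the scheduler controller
  schedule: "*/1 * * * *"
  # existing ChaosEngine (in the same namespace) to re-create on every tick
  engineRef: nginx-chaos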

@ksatchit (Member)

IMO it might be beneficial to:

  • Abstract the cron string behind something more generic that users will understand (we would still use cron as the underlying implementation logic, plus some custom wrapper logic).
  • Account for the fact that a native cron still doesn't have the ability to specify randomness, instance counts (perform these chaos actions a given number of times within, say, time "X" to time "Y"), excluded days, etc.
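
For example, instead of exposing a raw cron string, an intent like "a handful of chaos instances, weekdays only, within a given window" could be expressed with the Phase-1 style fields proposed above (the values here are illustrative):

spec:
  schedule:
    type: "repeat"
    startTime: "2020-05-04T09:00:00Z"   # window start
    endTime: "2020-05-08T18:00:00Z"     # window end
    minChaosInterval: "30m"             # at least 30 minutes between instances
    instanceCount: "10"                 # total chaos instances within the window
    includedDays: "monday,tuesday,wednesday,thursday,friday"
    random: true                        # randomize when instances fire within the window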

@AmitKumarDas commented Apr 17, 2020

@ksatchit @rahulchheda can we have some comments dedicated to writing down the requirements in plain English? I think we have gotten too deep into K8s controllers and so on. We will get back to the implementation aspects as well; however, I wanted to get the requirements understood first.

After reading through the comments, I guess we want to create experiments via the engine. These experiments perhaps run only once. We would like to create experiments in a scheduled manner, as per the problem statement.

@ksatchit (Member) commented Apr 17, 2020

> @ksatchit @rahulchheda can we have some comments dedicated to writing down the requirements in plain English? I think we have gotten too deep into K8s controllers and so on. We will get back to the implementation aspects as well; however, I wanted to get the requirements understood first.
>
> After reading through the comments, I guess we want to create experiments via the engine. These experiments perhaps run only once. We would like to create experiments in a scheduled manner, as per the problem statement.

@AmitKumarDas: @cpitstick-argo's requirement is succinct in this regard. I have tried to keep the ChaosSchedule schema reading as close to plain English as possible.

@cpitstick-argo (Author)

Here is how I currently do it. One concern I would have is: do you need to delete the old ChaosEngine before you create a new one? Maybe this is an internal implementation detail, but I think it's important for keeping the Kubernetes cluster clean. Note that this is currently a Helm template, initialized with values that look like this:

chaos_installation_namespace: "hello-world-alpha"
chaos_application_label: "app=hello-world-alpha-application"
kubectl_image: "url"
pod_delete:
  enabled: true
  cron: "*/10 * * * *"
  total_chaos_duration: 480
  chaos_interval: 2
{{- if .Values.pod_delete.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-delete-manifest
  namespace: {{ .Values.chaos_installation_namespace }}
data:
  pod-delete-manifest.yaml: |
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: chaos-engine-pod-delete
      namespace: {{ .Values.chaos_installation_namespace }}
    spec:
      appinfo:
        appns: {{ .Values.chaos_installation_namespace }}
        applabel: {{ .Values.chaos_application_label }}
        appkind: deployment
      # It can be true/false
      annotationCheck: "true"
      # It can be active/stop
      engineState: "active"
      #ex. values: ns1:name=percona,ns2:run=nginx
      auxiliaryAppInfo: ""
      chaosServiceAccount: cloud-platform-chaos-admin
      monitoring: false
      # It can be delete/retain
      jobCleanUpPolicy: "delete" # "retain" for debugging. "delete" for production.
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                # set chaos duration (in sec) as desired
                - name: TOTAL_CHAOS_DURATION
                  value: "{{ .Values.pod_delete.total_chaos_duration }}"
                # set chaos interval (in sec) as desired
                - name: CHAOS_INTERVAL
                  value: "{{ .Values.pod_delete.chaos_interval }}"
                # pod failures without --force & default terminationGracePeriodSeconds
                - name: FORCE
                  value: "false"
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: pod-delete-litmus-chaos-cron
  namespace: {{ .Values.chaos_installation_namespace }}
spec:
  schedule: "{{ .Values.pod_delete.cron }}"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cloud-platform-chaos-admin
          initContainers:
            - name: pod-delete-remove-chaos-engine
              image: {{ .Values.kubectl_image }}
              imagePullPolicy: IfNotPresent
              args:
                - "delete"
                - "ChaosEngine"
                - "chaos-engine-pod-delete"
                - "-n"
                - "{{ .Values.chaos_installation_namespace }}"
                - "--ignore-not-found=true"
          containers:
            - name: pod-delete-litmus-chaos-cron
              image: {{ .Values.kubectl_image }}
              imagePullPolicy: IfNotPresent
              args:
                - "apply"
                - "-f"
                - "/manifest/pod-delete-manifest.yaml"
              volumeMounts:
                - mountPath: /manifest
                  name: manifest-temp
          restartPolicy: OnFailure
          volumes:
            - name: manifest-temp
              configMap:
                name: pod-delete-manifest
                defaultMode: 0777
{{- end }}

@ksatchit (Member) commented Apr 18, 2020

Yes. The scheduler is expected to remove the ChaosEngine (and thereby other chaos resources) at the end of each instance.

ajeshbaby added this to To do in 1.4 May 11, 2020
@ksatchit (Member)

An alpha version of the scheduler is now available. Refer:

Keeping this issue open for further refinement and discussion. But will remove this issue from the milestone tracker. Enhancements to the alpha scheduler in 1.4 will be via new issues.

ksatchit removed this from the 1.4 milestone May 16, 2020
ksatchit removed this from To do in 1.4 May 16, 2020
@ksatchit (Member)

1.5 consists of an enhanced schema for the scheduler, OpenAPI v3 validation, and support for halting/resuming chaos. Support for randomized scheduling is tracked here: #1562
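
As a rough sketch, halting/resuming in terms of the suspend field proposed earlier in this thread might look like the snippet below; the actual 1.5 field names may differ:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-chaos
spec:
  # halt further chaos instances; set back to false to resume
  suspend: true
  # (schedule and engineTemplate omitted for brevity)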

Closing this issue now.
