
Better way to do "scheduled chaos" #1223

Closed · cpitstick-argo opened this issue Feb 24, 2020 · 12 comments

Comments

@cpitstick-argo

Use case: to have a way to run experiments on a schedule, as with a cron job.

There are two ways I know of to do this:

  1. (Hacky): Set the experiment duration to MAX_INT (or equivalent) and set the chaos interval to whatever period is desired. This would, in theory, keep a single engine running for decades. It is clearly not the best architecture, though: the engine is running the entire time, consuming more resources than may be necessary, especially on larger Kubernetes clusters. It is also extremely non-standard and does not take advantage of the standardized cron syntax. (A rough sketch of this approach is included at the end of this comment.)

  2. Delete the ChaosEngine runner pod periodically using the Kubernetes CronJob framework, like so:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cpu-hog-litmus-chaos-cron
  namespace: {{ .chaos-namespace }}
spec:
  schedule: "*/5 * * * *" #Every five minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cpu-hog-litmus-chaos-cron
            image: alpine
            args:
            - /bin/sh
            - -c
            - 'apk add curl &&
              curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.14.9/bin/linux/amd64/kubectl &&
              chmod +x ./kubectl &&
              ./kubectl delete pod engine-hello-world-node-cpu-hog-runner -n hello-world-example'
          restartPolicy: OnFailure

I view this as superior to #1 because it relies on more standardized infrastructure, but it is still a hack: it depends on the undefined behavior that experiments restart when the ChaosEngine runner pod is deleted.

Ergo: Litmus needs a more formal way to support this.
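
For reference, a rough sketch of what approach #1 amounts to, assuming a simple pod-delete experiment (the names, labels, and values below are purely illustrative):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-hello-world-pod-delete
  namespace: hello-world-example
spec:
  appinfo:
    appns: hello-world-example
    applabel: "app=hello-world"
    appkind: deployment
  engineState: "active"
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # "run forever": chaos duration (in sec) set to MAX_INT
            - name: TOTAL_CHAOS_DURATION
              value: "2147483647"
            # period between chaos injections (in sec)
            - name: CHAOS_INTERVAL
              value: "600"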

@ksatchit (Member)

cc: @chandankumar4 @Sanjay1611

@ksatchit (Member) commented Apr 17, 2020

Here are some thoughts on how scheduled chaos could be implemented. The idea is to continue using the ChaosEngine to describe chaos intent, with a separate CR purely defining scheduling policies. The scheduler controller can create/delete the engine based on the selected policy, with enhancements over time to smart-generate the engine from certain default templates (phase-2).

While the Kubernetes cron is pretty good in itself, custom logic on top of it is needed to provide other capabilities required in the chaos context, such as instance count definitions, included/excluded days, a randomness quotient, etc.

# <----------! PHASE-1 !------------>

## fields may be set/empty depending on type = immediate (now) / onceAtTime (once) /
## repeat (between intervals).
## The ChaosScheduler controller creates the engine resource depending upon the schedule policy.
## The schema below allows you to reuse the ChaosEngine as is and remain gitOps-controlled.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-chaos
spec:
  schedule:
    type: "repeat"
    startTime: "2020-02-04T13:09:00Z"
    endTime: "2020-02-04T13:14:00Z"
    minChaosInterval: ""
    instanceCount: "2"
    includedDays: 0-6
    random: false
  ## suspend: true to freeze/halt chaos
  suspend: false
  ## embed the entire engine
  engineTemplate: |
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: nginx-chaos
    spec:
      appinfo:
        appns: default
        applabel: "app=nginx"
        appkind: deployment
      engineState: 'active'
      chaosServiceAccount: pod-delete-sa
      monitoring: false
      jobCleanUpPolicy: 'delete'
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION
                  value: '30'
# <----------! PHASE-2 !------------>

## the engine template is much shorter, i.e., only salient information such as the
## experiment name & serviceaccount will be passed. The chaos-scheduler
## operator (actually, a co-controller inside the scheduler operator; let's call it
## engine-constructor) will be able to derive the rest from a pre-defined config
## template (configmap or any other?) which provides a standard set of
## run-properties (jobCleanupPolicy, monitoring). The app info will be derived by
## the constructor controller (which runs as a sidecar, say) based on an annotation:

## litmuschaos.io/chaos-schedule: schedule-nginx-chaos

## This is in principle similar to the requirement discussed here:
## https://github.com/litmuschaos/litmus/issues/1227

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-chaos
spec:
  schedule:
    type: "repeat"
    startTime: "2020-02-04T13:09:00Z"
    endTime: "2020-02-04T13:14:00Z"
    minChaosInterval: ""
    instanceCount: "2"
    includedDays: "monday,tuesday,wednesday"
    random: false
  ## suspend: true to freeze/halt chaos
  suspend: false 
  ## embed the entire engine
  engineTemplate:
    spec:
      chaosServiceAccount: pod-delete-sa
      experiments:
        - pod-delete

Sanjay1611 self-assigned this Apr 17, 2020

@rahulchheda (Member)

Here is something I have been working on: creating a controller with all the functionality of a CronJob, driven by a CR YAML such as:

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: rdm2bq
  labels:
    name: rdm2bq
    role: job
spec:
  schedule: "* * * * *"
  concurrencyPolicy: "Forbid"
  suspend: false
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    metadata:
      name: engine123
      namespace: jack

@rahulchheda (Member)

So, my thoughts around these are:
- It's very easy to create a controller that just uses a simplified schema with just a crontab, e.g. "*/1 * * * *" (a sketch of this is included at the end of this comment).
- But I still feel the need to keep it in sync with the CronJob controller (of Kubernetes), to support all of its functionality and for ease of readability of the spec.
- So I would like to create a controller that has the logic of the CronJob controller embedded inside it. It seems very difficult at first, but I'm still trying to establish this as a POC.
- The other way to look at this is that we don't take the CronJob as a reference at all, since all of its functionality might not be needed in our case.

Does this make sense? @ksatchit
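
For illustration, the "simplified schema with just a crontab" from the first bullet could look something like this; the engineRef field is hypothetical and only meant to show the shape of the idea:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: nginx-chaos-every-minute
spec:
  # plain crontab string evaluated by the scheduler controller
  schedule: "*/1 * * * *"
  # existing ChaosEngine (in the same namespace) to re-create on every tick
  engineRef: nginx-chaos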

@ksatchit (Member)

IMO it might be beneficial to:

  • Abstract the cron string behind something more generic that users will understand (we would still use cron as the underlying implementation logic, plus some custom wrapper logic).
  • Account for the fact that a native cron still doesn't have the ability to specify randomness, instance counts (perform these chaos actions a given number of times within, say, time "X" to time "Y"), excluded days, etc.
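
For example, instead of exposing a raw cron string, an intent like "a handful of chaos instances, weekdays only, within a given window" could be expressed with the Phase-1 style fields proposed above (the values here are illustrative):

spec:
  schedule:
    type: "repeat"
    startTime: "2020-05-04T09:00:00Z"   # window start
    endTime: "2020-05-08T18:00:00Z"     # window end
    minChaosInterval: "30m"             # at least 30 minutes between instances
    instanceCount: "10"                 # total chaos instances within the window
    includedDays: "monday,tuesday,wednesday,thursday,friday"
    random: true                        # randomize when instances fire within the window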

@AmitKumarDas commented Apr 17, 2020

@ksatchit @rahulchheda can we have some comments dedicated to writing down the requirements in plain English? I think we have gotten too deep into K8s controllers and so on. We will get back to the implementation aspects as well; however, I wanted to get the requirements understood first.

After reading through the comments, I guess we want to create experiments via the engine. These experiments perhaps run only once. We would like to create experiments in a scheduled manner, as per the problem statement.

@ksatchit (Member) commented Apr 17, 2020

> @ksatchit @rahulchheda can we have some comments dedicated to writing down the requirements in plain English? I think we have gotten too deep into K8s controllers and so on. We will get back to the implementation aspects as well; however, I wanted to get the requirements understood first.
>
> After reading through the comments, I guess we want to create experiments via the engine. These experiments perhaps run only once. We would like to create experiments in a scheduled manner, as per the problem statement.

@AmitKumarDas: @cpitstick-argo's requirement is succinct in this regard. I have tried to keep the ChaosSchedule schema reading as close to plain English as possible.

@cpitstick-argo (Author)

Here is how I currently do it. One concern I would have is: do you need to delete the old ChaosEngine before you create a new one? Maybe this is an internal implementation detail, but I think it's important for keeping the Kubernetes cluster clean. Note that this is currently a Helm template, initialized with values that look like this:

chaos_installation_namespace: "hello-world-alpha"
chaos_application_label: "app=hello-world-alpha-application"
kubectl_image: "url"
pod_delete:
  enabled: true
  cron: "*/10 * * * *"
  total_chaos_duration: 480
  chaos_interval: 2
{{- if .Values.pod_delete.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-delete-manifest
  namespace: {{ .Values.chaos_installation_namespace }}
data:
  pod-delete-manifest.yaml: |
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: chaos-engine-pod-delete
      namespace: {{ .Values.chaos_installation_namespace }}
    spec:
      appinfo:
        appns: {{ .Values.chaos_installation_namespace }}
        applabel: {{ .Values.chaos_application_label }}
        appkind: deployment
      # It can be true/false
      annotationCheck: "true"
      # It can be active/stop
      engineState: "active"
      #ex. values: ns1:name=percona,ns2:run=nginx
      auxiliaryAppInfo: ""
      chaosServiceAccount: cloud-platform-chaos-admin
      monitoring: false
      # It can be delete/retain
      jobCleanUpPolicy: "delete" # "retain" for debugging. "delete" for production.
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                # set chaos duration (in sec) as desired
                - name: TOTAL_CHAOS_DURATION
                  value: "{{ .Values.pod_delete.total_chaos_duration }}"
                # set chaos interval (in sec) as desired
                - name: CHAOS_INTERVAL
                  value: "{{ .Values.pod_delete.chaos_interval }}"
                # pod failures without --force & default terminationGracePeriodSeconds
                - name: FORCE
                  value: "false"
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: pod-delete-litmus-chaos-cron
  namespace: {{ .Values.chaos_installation_namespace }}
spec:
  schedule: "{{ .Values.pod_delete.cron }}"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cloud-platform-chaos-admin
          initContainers:
            - name: pod-delete-remove-chaos-engine
              image: {{ .Values.kubectl_image }}
              imagePullPolicy: IfNotPresent
              args:
                - "delete"
                - "ChaosEngine"
                - "chaos-engine-pod-delete"
                - "-n"
                - "{{ .Values.chaos_installation_namespace }}"
                - "--ignore-not-found=true"
          containers:
            - name: pod-delete-litmus-chaos-cron
              image: {{ .Values.kubectl_image }}
              imagePullPolicy: IfNotPresent
              args:
                - "apply"
                - "-f"
                - "/manifest/pod-delete-manifest.yaml"
              volumeMounts:
                - mountPath: /manifest
                  name: manifest-temp
          restartPolicy: OnFailure
          volumes:
            - name: manifest-temp
              configMap:
                name: pod-delete-manifest
                defaultMode: 0777
{{- end }}

@ksatchit (Member) commented Apr 18, 2020

Yes. The scheduler is expected to remove the ChaosEngine (and thereby other chaos resources) at the end of each instance.

ajeshbaby added this to To do in 1.4 May 11, 2020
@ksatchit (Member)

An alpha version of the scheduler is now available. Refer:

Keeping this issue open for further refinement and discussion. But will remove this issue from the milestone tracker. Enhancements to the alpha scheduler in 1.4 will be via new issues.

ksatchit removed this from the 1.4 milestone May 16, 2020
ksatchit removed this from To do in 1.4 May 16, 2020
@ksatchit (Member)

1.5 consists of an enhanced schema for the scheduler, OpenAPI v3 validation, and support for halting/resuming chaos. Support for randomized scheduling is tracked here: #1562
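
As a rough sketch, halting/resuming in terms of the suspend field proposed earlier in this thread might look like the snippet below; the actual 1.5 field names may differ:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-chaos
spec:
  # halt further chaos instances; set back to false to resume
  suspend: true
  # (schedule and engineTemplate omitted for brevity)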

Closing this issue now.
