1.12.0

@ksatchit ksatchit released this 15 Jan 19:05
390d134

New Features & Enhancements

  • Moves the Litmus Portal to beta-1 phase with the following improvements:

    • Supports edit of (cron) schedule in chaos workflows
    • Ability to suspend/disable schedules
    • Improved chaos workflow diagrams with appropriate log representation for different stages/steps
    • Increased (K8s) validation in the workflow construction wizard
    • Adds the infra changes necessary to support private repositories for MyHub (UI support to come in 1.13.0)
  • Introduces a revamped chaos-exporter that removes the dependency on the heptio event-router for the experiment execution state, which was being used to build chaos-interleaved application dashboards. The chaos exporter now pushes an increased set of metrics (chaos start/end times, status, success percentage per run, experiment-specific cumulative pass/fail counts, etc.) and can operate in both cluster-wide and namespaced modes.

  • Enhances the httpProbe with options to skip certificate checks via the insecureSkipVerify flag in the ChaosEngine schema

  • Enhances the pod-autoscaler experiment with the ability to scale multiple applications (type: deployments, statefulsets) based on an APP_AFFECTED_PERC environment variable, with the apps being filtered via label selectors. Also adds support for OnChaos probes for the experiment.

  • Supports random selection of EC2 instances/Kubernetes nodes for the ec2-terminate experiment in cases where the target instance is not explicitly specified.

  • Improves error handling logic in the node-drain experiment and adds a timeout flag (equal to the chaos duration) to the drain operation to prevent indefinite execution (e.g., when evictions are stuck or blocked by pod disruption budgets)

  • Extends the ImagePullPolicy configuration to external probe pods (in cases where the cmdProbe is configured to run on “source” images other than the litmus go-runner).

  • Homogenizes the experiment pod logs for target pod information prior to chaos injection

  • Promotes the non-root go-runner from tech-preview to a release image. Accompanied by changes to experiments where applicable (commands, paths & file permissions)

  • Introduces a tech-preview of enhanced chaos rollback/revert logic (used initially for network chaos experiments executed in “serial” sequence) to achieve guaranteed chaos rollback/revert under failure conditions such as helper pod eviction, unexpected chaos process termination, deletion/removal, etc. (image: litmuschaos/go-runner:1.12.0-revert)

  • Enhances the ChaosResult schema to hold cumulative success/failure count information of the different run instances for a given experiment.

  • Introduces a new scaffolded chaoslib template in the litmus SDK that allows injection and revert of chaos via the CHAOS_INJECT_COMMAND & CHAOS_KILL_COMMAND environment variables, thereby giving users flexibility in creating preview experiments.

  • Releases v0.3.1 of chaos-ci-lib with fixes and enhancements to the chaos BDD library, and updates the e2e suites to use it.

  • Migrates from TravisCI (where applicable) to GitHub Actions, with parallel workflows for lint, security scan, e2e & build/push operations, in light of reduced support for OSS projects on the former.

  • Enhances the litmus-e2e suite with new tests for verification of annotation-enabled & disabled chaos execution, the ec2-terminate experiment & pumba-based chaoslib functionality. Adds the feature coverage tracker with an initial set of test cases for litmus-portal e2e pipelines

  • Enhances the litmus-helm chart testing workflows as per the latest K8s/Helm standards

  • Improves the node-restart experiment documentation and adds node-poweroff experiment documentation, with steps to obtain the SSH keys & set up the secrets for execution.

  • Simplifies the experiment pages UX on the ChaosHub with explanation/steps to use the chaos artifacts
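
The drain timeout noted above for the node-drain experiment can be pictured with kubectl's own --timeout flag on the drain command. The following is only a rough sketch: the helper name drain_with_timeout and its argument order are hypothetical, and it is not claimed to be how the experiment itself performs the drain. It simply shows a drain that gives up once the chaos duration has elapsed.

```shell
# Hypothetical helper: drain a node, giving up after the chaos duration.
#   $1 = node name, $2 = chaos duration in seconds
drain_with_timeout() {
  local node="$1" duration="$2"
  # --timeout bounds how long kubectl waits before abandoning the drain.
  kubectl drain "$node" --ignore-daemonsets --timeout="${duration}s"
}

# Example (requires cluster access; node name is a placeholder):
#   drain_with_timeout worker-node-1 60
#   kubectl uncordon worker-node-1   # make the node schedulable again
```

Without a timeout, a drain can wait indefinitely on an eviction that never completes.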

Major Bug Fixes

  • Fixes spurious events received on ChaosEngines installed with engineState set to stop (for deferred execution purposes). Also ensures that the ChaosInitialization is recorded once finalizers have been applied on the CR

  • Prevents a false positive with probe execution (in cases where probes were defined without the RunProperties specification) by mandating the latter using CRD validation.

  • Fixes failed/timed-out helper pod checks in the node-restart and node-poweroff experiments with an enhanced status check logic that accepts any of several desired pod states (such as Succeeded, Running, etc.) instead of just “Running”

  • Fixes the failure to kill target docker containers using the “litmus” LIB due to the missing “host” flag pointing to the correct daemon socket path

  • Fixes a regression on the pod-cpu-hog experiment that caused only a single md5sum process to be launched on the target pods irrespective of the CPU_CORES (number of cores) input to the experiment.

  • Fixes a regression (panic) in the chaos-runner triggered when secret volumes are defined in the ChaosExperiment/ChaosEngine

  • Synchronizes event messages (from the experiment pod as well as chaos-runner pod sources) with the latest experiment status/verdict in case of repeated execution (caused by frequent abort/restart operations) instead of holding stale info.

  • Replaces hardcoded socket paths in experiment helper configurations with values derived from the SOCKET_PATH environment variable

  • Fixes failed application status checks on infra-chaos experiments where the .spec.appinfo.applabel is not specified/skipped. In this case, the health of all pods in the chaos namespace is verified.

  • Fixes the documentation with the correct kubectl command to patch the ChaosEngine for abort/restart.
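
As an illustration of the abort/restart patch mentioned in the last item, here is a minimal sketch. It assumes that abort and restart are driven by the engineState field referred to earlier in these notes, with the values stop and active. The function names, engine name, and namespace are placeholders, and the project documentation remains the reference for the exact command.

```shell
# Build the merge patch that sets .spec.engineState on a ChaosEngine.
# Assumed values: "stop" (abort) and "active" (restart).
engine_state_patch() {
  local state="$1"
  case "$state" in
    active|stop) printf '{"spec":{"engineState":"%s"}}\n' "$state" ;;
    *) echo "unsupported engineState: $state" >&2; return 1 ;;
  esac
}

# Apply the patch to a ChaosEngine (hypothetical helper).
#   $1 = engine name, $2 = namespace, $3 = desired state
set_engine_state() {
  local engine="$1" namespace="$2" state="$3" patch
  patch="$(engine_state_patch "$state")" || return 1
  kubectl patch chaosengine "$engine" -n "$namespace" --type merge --patch "$patch"
}

# Example (requires cluster access; names are placeholders):
#   set_engine_state nginx-chaos default stop     # abort
#   set_engine_state nginx-chaos default active   # restart
```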

Major Known Issues & Limitations

Issue:

Forced removal of the experiment helper pods (where applicable: notably network chaos experiments), either manually or due to Kubernetes eviction, can cause the chaos revert operation at the end of the chaos duration to fail or never run. This will cause the application under test (AUT) to continue being subjected to chaos unless manually recovered.

Workaround:

The experiment pod logs show when the helper operations have failed. In that case, the AUT pod(s) can be deleted so that they are rescheduled with a new network namespace (this applies only to applications managed by a higher-level controller such as a deployment, statefulset, daemonset, etc.).

Fix:

This is being actively worked on (retry mechanism for chaos revert initiated in case of failed/missing helper pods) and should be available in a subsequent release.

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail, even though chaos is injected successfully, because certain default utilities used to detect the chaos processes and kill them (reverting chaos at the end of the chaos duration) are unavailable in the target’s image.

Workaround:

Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the CHAOS_KILL_COMMAND env variable.
Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.
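
To give a feel for the first workaround, the sketch below finds processes by name using only the /proc filesystem and shell builtins, for target images that lack utilities such as ps or pgrep. It is an illustration under assumptions rather than the experiment's default logic: the function name pids_by_name is hypothetical, it assumes a Linux target with /proc mounted, and how the resulting command is supplied through CHAOS_KILL_COMMAND is left to the experiment documentation. The md5sum process name is the one mentioned above for pod-cpu-hog.

```shell
# Print the PID of every process whose name (from /proc/<pid>/comm) equals $1.
# Relies on builtins of common shells, not on ps, pgrep, or pkill.
# Note: the kernel truncates names in /proc/<pid>/comm to 15 characters.
pids_by_name() {
  local target="$1" dir name
  for dir in /proc/[0-9]*; do
    [ -r "$dir/comm" ] || continue
    # The process may exit between the check and the read; skip it if so.
    { read -r name < "$dir/comm"; } 2>/dev/null || continue
    if [ "$name" = "$target" ]; then
      echo "${dir#/proc/}"
    fi
  done
}

# Example: stop leftover md5sum stress processes.
#   pids="$(pids_by_name md5sum)"
#   [ -n "$pids" ] && kill $pids
```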

Fix:

This is being actively worked on (a native litmus chaoslib that can inject stress processes without the exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.12.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos
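
The CRD check can also be scripted. The sketch below assumes the operator manifest installs three CRDs named after the resource kinds referred to in these notes (ChaosEngine, ChaosExperiment, ChaosResult) in the litmuschaos.io group. The exact CRD names and the missing_crds helper are assumptions for illustration, so adjust the list if your installation differs.

```shell
# CRD names assumed to be installed by the operator manifest.
EXPECTED_CRDS="chaosengines.litmuschaos.io chaosexperiments.litmuschaos.io chaosresults.litmuschaos.io"

# Read `kubectl get crds -o name` output on stdin and print each
# expected CRD that is absent. Prints nothing when all are present.
missing_crds() {
  local installed crd
  installed="$(cat)"
  for crd in $EXPECTED_CRDS; do
    case "$installed" in
      *"/$crd"*) ;;          # found in the listing
      *) echo "$crd" ;;      # not found: report it
    esac
  done
}

# Example (requires cluster access):
#   kubectl get crds -o name | missing_crds
```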

For more details refer to the documentation at Docs