cpu-stress Chaos Experiment (#518)

* modified the cmdProbe for inline mode of execution to accommodate litmusd

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* go mod tidy

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* bootstrapped process-kill experiment files

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated types.go and environment.go

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated secret envs

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated experiment logic and added steady state validation steps

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed action from probe refactor function parameters

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added serial and parallel chaos execution steps

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added conn parameter to probe

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added logic for closing websocket in the end of the experiment

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added experiment to bin

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* corrected the agent endpoint

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* corrected environment.go

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated logs, removed close message and added parallel sequence as default

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated experiment charts

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated experiment charts

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated authorization header, replaced Processes struct with int slice of pids

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* restored experiment image

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated test.yml

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added rbac, README, exported charts

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added websocket connection to chaos details struct, restored probe functions params

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed websocket connection in chaoslib params

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated code function

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated readme

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* restructured directories, added m-agent tag

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated workflow branch

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed guest-os pkg

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* Chore(stress-chaos): Run CPU chaos with percentage of cpu cores (#482)

* Chore(stress-chaos): Run CPU chaos with percentage of cores

Signed-off-by: uditgaurav <udit@chaosnative.com>

* updated client side m-agent design; added channelised message sending

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added liveness check for process kill

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated mutex lock to an RWMutex lock, locked read operations on the map

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* Fixing Alpine CVEs by upgrading the version (#486)

* updated WaitForDurationAndCheckLiveness function

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated cpu-stress experiment and steady-state condition

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* corrected probe format

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added functionality for multiple websocket connections

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated liveness check to test for all the connections and added parallel chaos injection

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated m-agent cmd probe for only one agent endpoint

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated underChaosEndpoints for abort

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* optimised make connections logic

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed redundant check and comments

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated comments for function

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated chaosInterval timer for fixing infinitely running chaosInterval

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added CLOSE_CONNECTION action for closure of websocket connections

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* Chore(vulnerability): Remove openebs retry module and update pkgs (#488)

* Chore(vulnerability): Fix some vulnerabilities by updating the pkgs

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Chore(vulnerability): Remove openebs retry module and update pkgs

Signed-off-by: udit <udit@chaosnative.com>

* added chaos revert logic

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated connection close on ERROR functionality and return on Read error

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added log for chaos revert

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* reverted env params

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added abort log info, added defer close statement to message listener, added load percentage validation

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated probe error feedback, removed charts

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated mutex locks for RLock and RUnlock, updated connect agent function parameters

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* Chore(cgroup): Add support for cgroup version2 in stress-chaos experiment (#490)

Signed-off-by: uditgaurav <udit@chaosnative.com>

* updated mutex locks

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* Chore(snyk): Fix snyk security scan on litmus-go (#492)

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Chore(network-chaos): Randomize Chaos Tunables for Network Chaos Experiment (#491)

* Chore(network-chaos):

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Chore(network-chaos): Randomize Chaos Tunables for Network Chaos Experiment

Signed-off-by: uditgaurav <udit@chaosnative.com>

Co-authored-by: Karthik Satchitanand <karthik.s@mayadata.io>

* Chore(randomize): Randomize stress-chaos tunables (#487)

* Chore(randomize): Randomize stress-chaos tunables

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Update stress-chaos.go

* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill (#493)

* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill

Signed-off-by: uditgaurav <udit@chaosnative.com>

* (enhancement)experiment: add node label filter for pod network and stress chaos (#494)

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Fix(targetContainer): Incorrect target container passed in the helper pod for pod level experiments (#496)

* Fix target container issue

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Fix target container issue

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#498)

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#499)

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Chore(warn): Remove warning Neither --kubeconfig nor --master was specified for InClusterConfig (#507)

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Chore(ssm): Update the ssm file path in the Dockerfile (#508)

Signed-off-by: uditgaurav <udit@chaosnative.com>

* GCP Experiments Refactor, New Label Selector Experiments and IAM Integration (#495)

* experiment init

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated experiment file

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated experiment lib

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated post chaos validation

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated empty slices to nil, updated experiment name in environment.go

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed experiment charts

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* bootstrapped gcp-vm-disk-loss-by-label artifacts

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed device-names input for gcp-vm-disk-loss experiment, added API calls to derive device name internally

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed redundant condition check in gcp-vm-disk-loss experiment pre-requisite checks

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* reformatted error messages

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* replaced the SetTargetInstances function

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added settargetdisk function for getting target disk names using label

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* refactored Target Disk Attached VM Instance memorisation, updated vm-disk-loss and added lib logic for vm-disk-loss-by-label experiment

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* added experiment to bin and cleared default experiment name in environment.go

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed charts

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated test.yml

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated AutoScalingGroup to ManagedInstanceGroup; updated logic for checking InstanceStop recovery for ManagedInstanceGroup VMs; Updated log and error messages with VM names

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed redundant computeService code snippets

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* removed redundant computeService code snippets in gcp-disk-loss experiments

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated logic for deriving default GCP SA credentials for computeService

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* updated logging for IAM integration

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* refactored log and error messages and wait for start/stop instances logic

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* fixed logs, optimised control statements, added comments, corrected experiment names

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* fixed file exists check logic

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* updated instance and device name fetch logic for disk loss

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* updated logs

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* update(sdk): updating litmus sdk for the defaultAppHealthCheck (#513)

Signed-off-by: shubhamc <shubhamc@jfrog.com>

Co-authored-by: shubhamc <shubhamc@jfrog.com>

* fix: updated release workflow (#512)

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>

* Added Active Node Count Check using AWS APIs (#500)

* Added node count check using aws apis

Signed-off-by: Akash Shrivastava <akash@chaosnative.com>

* Added node count check using aws apis to instance terminate by tag experiment

Signed-off-by: Akash Shrivastava <akash@chaosnative.com>

* Log improvements; Code improvement in findActiveNodeCount function;

Signed-off-by: Akash Shrivastava <akash@chaosnative.com>

* Added log for instance status check failed in find active node count

Signed-off-by: Akash Shrivastava <akash@chaosnative.com>

* Added check if active node count is less than provided instance ids

Signed-off-by: Akash Shrivastava <akash@chaosnative.com>

* updated appns podlist filtering error handling (#515)

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

Co-authored-by: Udit Gaurav <35391335+uditgaurav@users.noreply.github.com>
Co-authored-by: Vedant Shrotria <vedant.shrotria@harness.io>

* go mod tidy

Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>

* return error if node not present (#516)

Signed-off-by: Akash Shrivastava <akash@chaosnative.com>

* Chore(helper pod): Make setHelper data as tunable (#519)

Signed-off-by: uditgaurav <udit@chaosnative.com>

* added CPUs check in prerequisites check

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* removed .DS_Store

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* removed .DS_Store

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* updated rbac and readme

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* removed .DS_Store

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* updated qemu github action

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* updated qemu action version

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* updated m-agent go-runner tag to 2.10.0-Beta1

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* updated target names

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

* updated machine=>Machine targets, removed .DS_Store

Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

Co-authored-by: Udit Gaurav <35391335+uditgaurav@users.noreply.github.com>
Co-authored-by: Raj Babu Das <mail.rajdas@gmail.com>
Co-authored-by: Karthik Satchitanand <karthik.s@mayadata.io>
Co-authored-by: Shubham Chaudhary <shubham.chaudhary@mayadata.io>
Co-authored-by: shubhamc <shubhamc@jfrog.com>
Co-authored-by: Soumya Ghosh Dastidar <44349253+gdsoumya@users.noreply.github.com>
Co-authored-by: Akash Shrivastava <akash@chaosnative.com>
Co-authored-by: Vedant Shrotria <vedant.shrotria@harness.io>
9 people committed Jun 14, 2022
1 parent 7a11fd5 commit f4892a7
Showing 12 changed files with 758 additions and 5 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/build.yml
@@ -67,7 +67,7 @@ jobs:
ref: ${{ github.event.pull_request.head.sha }}

- name: Set up QEMU
uses: docker/setup-qemu-action@v1
uses: docker/setup-qemu-action@v2
with:
platforms: all

@@ -83,7 +83,7 @@ jobs:
push: false
file: build/Dockerfile
platforms: linux/amd64,linux/arm64
tags: litmuschaos/go-runner:m-agent
tags: litmuschaos/go-runner:2.10.0-Beta1

trivy:
needs: pre-checks
4 changes: 2 additions & 2 deletions .github/workflows/push.yml
@@ -47,7 +47,7 @@ jobs:
- uses: actions/checkout@v2

- name: Set up QEMU
uses: docker/setup-qemu-action@v1
uses: docker/setup-qemu-action@v2
with:
platforms: all

@@ -69,4 +69,4 @@
push: true
file: build/Dockerfile
platforms: linux/amd64,linux/arm64
tags: litmuschaos/go-runner:m-agent
tags: litmuschaos/go-runner:2.10.0-Beta1
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -45,7 +45,7 @@ jobs:
echo "${RELEASE_TAG}" > ${{ github.workspace }}/tag.txt
- name: Set up QEMU
uses: docker/setup-qemu-action@v1
uses: docker/setup-qemu-action@v2
with:
platforms: all

3 changes: 3 additions & 0 deletions bin/experiment/experiment.go
@@ -51,6 +51,7 @@ import (
ebsLossByTag "github.com/litmuschaos/litmus-go/experiments/kube-aws/ebs-loss-by-tag/experiment"
ec2TerminateByID "github.com/litmuschaos/litmus-go/experiments/kube-aws/ec2-terminate-by-id/experiment"
ec2TerminateByTag "github.com/litmuschaos/litmus-go/experiments/kube-aws/ec2-terminate-by-tag/experiment"
cpuStress "github.com/litmuschaos/litmus-go/experiments/os/cpu-stress/experiment"
processKill "github.com/litmuschaos/litmus-go/experiments/os/process-kill/experiment"
vmpoweroff "github.com/litmuschaos/litmus-go/experiments/vmware/vm-poweroff/experiment"

@@ -165,6 +166,8 @@ func main() {
redfishNodeRestart.NodeRestart(clients)
case "process-kill":
processKill.ProcessKill(clients)
case "cpu-stress":
cpuStress.CPUStressExperiment(clients)
case "gcp-vm-instance-stop-by-label":
gcpVMInstanceStopByLabel.GCPVMInstanceStopByLabel(clients)
case "gcp-vm-disk-loss-by-label":
308 changes: 308 additions & 0 deletions chaoslib/litmus/cpu-stress/lib/cpu-stress.go
@@ -0,0 +1,308 @@
package lib

import (
"os"
"os/signal"
"strconv"
"strings"
"sync"
"syscall"
"time"

"github.com/gorilla/websocket"
clients "github.com/litmuschaos/litmus-go/pkg/clients"
"github.com/litmuschaos/litmus-go/pkg/events"
"github.com/litmuschaos/litmus-go/pkg/log"
"github.com/litmuschaos/litmus-go/pkg/machine/common/messages"
experimentTypes "github.com/litmuschaos/litmus-go/pkg/os/cpu-stress/types"
"github.com/litmuschaos/litmus-go/pkg/probe"
"github.com/litmuschaos/litmus-go/pkg/types"
"github.com/litmuschaos/litmus-go/pkg/utils/common"
"github.com/pkg/errors"
)

var inject, abort chan os.Signal
var timeDuration = 60 * time.Second
var chaosRevert sync.WaitGroup
var underChaosEndpoints []int

type cpuStressParams struct {
Workers string
Load string
Timeout string
}

// InjectCPUStressChaos contains the preparation and injection steps for the experiment
func InjectCPUStressChaos(experimentsDetails *experimentTypes.ExperimentDetails, clients clients.ClientSets, resultDetails *types.ResultDetails, eventsDetails *types.EventDetails, chaosDetails *types.ChaosDetails) error {

// inject channel is used to transmit signal notifications.
inject = make(chan os.Signal, 1)
// Catch and relay certain signal(s) to inject channel.
signal.Notify(inject, os.Interrupt, syscall.SIGTERM)

// abort channel is used to transmit signal notifications.
abort = make(chan os.Signal, 1)
// Catch and relay certain signal(s) to abort channel.
signal.Notify(abort, os.Interrupt, syscall.SIGTERM)

// waiting for the ramp time before chaos injection
if experimentsDetails.RampTime != 0 {
log.Infof("[Ramp]: Waiting for the %vs ramp time before injecting chaos", experimentsDetails.RampTime)
common.WaitForDuration(experimentsDetails.RampTime)
}

agentEndpointList := strings.Split(experimentsDetails.AgentEndpoints, ",")

select {
case <-inject:
// stopping the chaos execution, if abort signal received
os.Exit(0)
default:

// watching for the abort signal and revert the chaos
go AbortWatcher(chaosDetails.WebsocketConnections, agentEndpointList, abort, chaosDetails)
chaosRevert.Add(1)

switch strings.ToLower(experimentsDetails.Sequence) {
case "serial":
if err := injectChaosInSerialMode(experimentsDetails, agentEndpointList, clients, resultDetails, eventsDetails, chaosDetails, abort); err != nil {
return err
}
case "parallel":
if err := injectChaosInParallelMode(experimentsDetails, agentEndpointList, clients, resultDetails, eventsDetails, chaosDetails, abort); err != nil {
return err
}
default:
return errors.Errorf("%v sequence is not supported", experimentsDetails.Sequence)
}

// wait for the ramp time after chaos injection
if experimentsDetails.RampTime != 0 {
log.Infof("[Ramp]: Waiting for the %vs ramp time after injecting chaos", experimentsDetails.RampTime)
common.WaitForDuration(experimentsDetails.RampTime)
}
}

return nil
}

// injectChaosInSerialMode injects CPU stress chaos in serial mode i.e. one after the other
func injectChaosInSerialMode(experimentsDetails *experimentTypes.ExperimentDetails, agentEndpointList []string, clients clients.ClientSets, resultDetails *types.ResultDetails, eventsDetails *types.EventDetails, chaosDetails *types.ChaosDetails, abort chan os.Signal) error {

// ChaosStartTimeStamp contains the start timestamp, when the chaos injection begins
ChaosStartTimeStamp := time.Now()
duration := int(time.Since(ChaosStartTimeStamp).Seconds())

for duration < experimentsDetails.ChaosDuration {

if experimentsDetails.EngineName != "" {
msg := "Injecting " + experimentsDetails.ExperimentName + " chaos in VM instance"
types.SetEngineEventAttributes(eventsDetails, types.ChaosInject, msg, "Normal", chaosDetails)
events.GenerateEvents(eventsDetails, clients, chaosDetails, "ChaosEngine")
}

for i := range agentEndpointList {

log.Infof("[Chaos]: Injecting CPU stress for %s agent endpoint", agentEndpointList[i])
feedback, payload, err := messages.SendMessageToAgent(chaosDetails.WebsocketConnections[i], "EXECUTE_EXPERIMENT", cpuStressParams{experimentsDetails.CPUs, experimentsDetails.LoadPercentage, strconv.Itoa(experimentsDetails.ChaosInterval)}, &timeDuration)
if err != nil {
return errors.Errorf("failed while sending message to agent, err: %v", err)
}

// ACTION_SUCCESSFUL feedback is received only if the cpu stress chaos has been injected successfully
if feedback != "ACTION_SUCCESSFUL" {
if feedback == "ERROR" {

agentError, err := messages.GetErrorMessage(payload)
if err != nil {
return errors.Errorf("failed to interpret error message from agent, err: %v", err)
}

return errors.Errorf("error occurred while injecting CPU stress chaos for %s agent endpoint, err: %s", agentEndpointList[i], agentError)
}

return errors.Errorf("unintelligible feedback received from agent: %s", feedback)
}

underChaosEndpoints = append(underChaosEndpoints, i)

common.SetTargets(agentEndpointList[i], "injected", "Machine", chaosDetails)

log.Infof("[Chaos]: CPU stress chaos injected successfully in %s agent endpoint", agentEndpointList[i])

// run the probes during chaos
// the OnChaos probes execution will start in the first iteration and keep running for the entire chaos duration
if len(resultDetails.ProbeDetails) != 0 && i == 0 {
if err = probe.RunProbes(chaosDetails, clients, resultDetails, "DuringChaos", eventsDetails); err != nil {
return err
}
}

// wait for the chaos interval
log.Infof("[Wait]: Waiting for chaos interval of %vs", experimentsDetails.ChaosInterval)
if err := common.WaitForDurationAndCheckLiveness([]*websocket.Conn{chaosDetails.WebsocketConnections[i]}, []string{agentEndpointList[i]}, experimentsDetails.ChaosInterval, abort, &chaosRevert); err != nil {
return errors.Errorf("error occurred during liveness check, err: %v", err)
}

log.Infof("[Chaos]: Reverting CPU stress for %s agent endpoint", agentEndpointList[i])
feedback, payload, err = messages.SendMessageToAgent(chaosDetails.WebsocketConnections[i], "REVERT_CHAOS", nil, &timeDuration)
if err != nil {
return errors.Errorf("failed while sending message to agent, err: %v", err)
}

// ACTION_SUCCESSFUL feedback is received only if the cpu stress chaos has been reverted successfully
if feedback != "ACTION_SUCCESSFUL" {
if feedback == "ERROR" {

agentError, err := messages.GetErrorMessage(payload)
if err != nil {
return errors.Errorf("failed to interpret error message from agent, err: %v", err)
}

return errors.Errorf("error occurred while reverting CPU stress chaos for %s agent endpoint, err: %s", agentEndpointList[i], agentError)
}

return errors.Errorf("unintelligible feedback received from agent: %s", feedback)
}

underChaosEndpoints = underChaosEndpoints[:len(underChaosEndpoints)-1]

common.SetTargets(agentEndpointList[i], "reverted", "Machine", chaosDetails)
}

duration = int(time.Since(ChaosStartTimeStamp).Seconds())
}

return nil
}

// injectChaosInParallelMode injects CPU stress chaos in parallel mode i.e. all at once
func injectChaosInParallelMode(experimentsDetails *experimentTypes.ExperimentDetails, agentEndpointList []string, clients clients.ClientSets, resultDetails *types.ResultDetails, eventsDetails *types.EventDetails, chaosDetails *types.ChaosDetails, abort chan os.Signal) error {

// ChaosStartTimeStamp contains the start timestamp, when the chaos injection begins
ChaosStartTimeStamp := time.Now()
duration := int(time.Since(ChaosStartTimeStamp).Seconds())

for duration < experimentsDetails.ChaosDuration {

if experimentsDetails.EngineName != "" {
msg := "Injecting " + experimentsDetails.ExperimentName + " chaos in VM instance"
types.SetEngineEventAttributes(eventsDetails, types.ChaosInject, msg, "Normal", chaosDetails)
events.GenerateEvents(eventsDetails, clients, chaosDetails, "ChaosEngine")
}

// inject cpu stress chaos
for i := range agentEndpointList {

log.Infof("[Chaos]: Injecting CPU stress for %s agent endpoint", agentEndpointList[i])
feedback, payload, err := messages.SendMessageToAgent(chaosDetails.WebsocketConnections[i], "EXECUTE_EXPERIMENT", cpuStressParams{experimentsDetails.CPUs, experimentsDetails.LoadPercentage, strconv.Itoa(experimentsDetails.ChaosInterval)}, &timeDuration)
if err != nil {
return errors.Errorf("failed while sending message to agent, err: %v", err)
}

// ACTION_SUCCESSFUL feedback is received only if the cpu stress chaos has been injected successfully
if feedback != "ACTION_SUCCESSFUL" {
if feedback == "ERROR" {

agentError, err := messages.GetErrorMessage(payload)
if err != nil {
return errors.Errorf("failed to interpret error message from agent, err: %v", err)
}

return errors.Errorf("error occurred while injecting CPU stress chaos for %s agent endpoint, err: %s", agentEndpointList[i], agentError)
}

return errors.Errorf("unintelligible feedback received from agent: %s", feedback)
}

underChaosEndpoints = append(underChaosEndpoints, i)

common.SetTargets(agentEndpointList[i], "injected", "Machine", chaosDetails)

log.Infof("[Chaos]: CPU stress chaos injected successfully in %s agent endpoint", agentEndpointList[i])
}

// run the probes during chaos
// the OnChaos probes execution will start in the first iteration and keep running for the entire chaos duration
if len(resultDetails.ProbeDetails) != 0 {
if err := probe.RunProbes(chaosDetails, clients, resultDetails, "DuringChaos", eventsDetails); err != nil {
return err
}
}

// wait for the chaos interval
log.Infof("[Wait]: Waiting for chaos interval of %vs", experimentsDetails.ChaosInterval)
if err := common.WaitForDurationAndCheckLiveness(chaosDetails.WebsocketConnections, agentEndpointList, experimentsDetails.ChaosInterval, abort, &chaosRevert); err != nil {
return errors.Errorf("error occurred during liveness check, err: %v", err)
}

for i := range agentEndpointList {

log.Infof("[Chaos]: Reverting CPU stress for %s agent endpoint", agentEndpointList[i])
feedback, payload, err := messages.SendMessageToAgent(chaosDetails.WebsocketConnections[i], "REVERT_CHAOS", nil, &timeDuration)
if err != nil {
return errors.Errorf("failed while sending message to agent, err: %v", err)
}

// ACTION_SUCCESSFUL feedback is received only if the cpu stress chaos has been reverted successfully
if feedback != "ACTION_SUCCESSFUL" {
if feedback == "ERROR" {

agentError, err := messages.GetErrorMessage(payload)
if err != nil {
return errors.Errorf("failed to interpret error message from agent, err: %v", err)
}

return errors.Errorf("error occurred while reverting CPU stress chaos for %s agent endpoint, err: %s", agentEndpointList[i], agentError)
}

return errors.Errorf("unintelligible feedback received from agent: %s", feedback)
}

common.SetTargets(agentEndpointList[i], "reverted", "Machine", chaosDetails)

underChaosEndpoints = underChaosEndpoints[1:]
}

duration = int(time.Since(ChaosStartTimeStamp).Seconds())
}

return nil
}

// AbortWatcher will watch for the abort signal and revert the chaos
func AbortWatcher(connections []*websocket.Conn, agentEndpointList []string, abort chan os.Signal, chaosDetails *types.ChaosDetails) {

<-abort

log.Info("[Abort]: Chaos Revert Started")

for _, i := range underChaosEndpoints {

log.Infof("[Abort]: Reverting CPU stress for %s agent endpoint", agentEndpointList[i])
feedback, payload, err := messages.SendMessageToAgent(connections[i], "ABORT_EXPERIMENT", nil, &timeDuration)
if err != nil {
log.Errorf("unable to send abort chaos message to %s agent endpoint, err: %v", agentEndpointList[i], err)
}

// ACTION_SUCCESSFUL feedback is received only if the cpu stress chaos has been aborted successfully
if feedback != "ACTION_SUCCESSFUL" {
if feedback == "ERROR" {

agentError, err := messages.GetErrorMessage(payload)
if err != nil {
log.Errorf("failed to interpret error message from agent, err: %v", err)
}

log.Errorf("error occurred while aborting the experiment for %s agent endpoint, err: %s", agentEndpointList[i], agentError)
}

log.Errorf("unintelligible feedback received from agent: %s", feedback)
}

common.SetTargets(agentEndpointList[i], "reverted", "Machine", chaosDetails)
}

log.Info("[Abort]: Chaos Revert Completed")
os.Exit(1)
}
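
For reference, the serial and parallel flows above both reduce to the same per-endpoint exchange with the m-agent: send EXECUTE_EXPERIMENT with the stress parameters, hold the stress for the chaos interval, then send REVERT_CHAOS. The sketch below distills that exchange into a single helper. It is an illustrative sketch only, not part of the diff above; the stressOnce name and its parameters are hypothetical, and it assumes messages.SendMessageToAgent and messages.GetErrorMessage accept and return values exactly as they are used in the chaoslib above.

package lib

import (
	"time"

	"github.com/gorilla/websocket"
	"github.com/litmuschaos/litmus-go/pkg/machine/common/messages"
	"github.com/pkg/errors"
)

// stressOnce sketches one inject-then-revert cycle against a single m-agent
// endpoint, mirroring the per-endpoint flow of injectChaosInSerialMode above.
// The params argument stands in for the cpuStressParams payload sent with the
// EXECUTE_EXPERIMENT action (hypothetical helper, for illustration only).
func stressOnce(conn *websocket.Conn, params interface{}, interval time.Duration) error {
	timeout := 60 * time.Second

	// ask the agent to start the CPU stress
	feedback, payload, err := messages.SendMessageToAgent(conn, "EXECUTE_EXPERIMENT", params, &timeout)
	if err != nil {
		return errors.Errorf("failed while sending message to agent, err: %v", err)
	}
	if feedback != "ACTION_SUCCESSFUL" {
		if feedback == "ERROR" {
			agentError, err := messages.GetErrorMessage(payload)
			if err != nil {
				return errors.Errorf("failed to interpret error message from agent, err: %v", err)
			}
			return errors.Errorf("agent reported an error: %s", agentError)
		}
		return errors.Errorf("unintelligible feedback received from agent: %s", feedback)
	}

	// hold the stress for the chaos interval, then ask the agent to revert it
	time.Sleep(interval)

	feedback, _, err = messages.SendMessageToAgent(conn, "REVERT_CHAOS", nil, &timeout)
	if err != nil {
		return errors.Errorf("failed while sending message to agent, err: %v", err)
	}
	if feedback != "ACTION_SUCCESSFUL" {
		return errors.Errorf("revert did not succeed, feedback: %s", feedback)
	}
	return nil
}

Serial mode runs this cycle one endpoint at a time, while parallel mode issues all EXECUTE_EXPERIMENT messages first and all REVERT_CHAOS messages after the shared liveness wait.
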
14 changes: 14 additions & 0 deletions experiments/os/cpu-stress/README.md
@@ -0,0 +1,14 @@
## Experiment Metadata

<table>
<tr>
<th> Name </th>
<th> Description </th>
<th> Documentation Link </th>
</tr>
<tr>
<td> CPU Stress </td>
<td> The CPU Stress experiment stresses the CPUs of the target machine(s). </td>
<td> Coming Soon </td>
</tr>
</table>
