Fix(container-kill): Adds statusCheckTimeout to container kill recovery #498

uditgaurav · 2022-03-30T04:45:48Z

Signed-off-by: uditgaurav <udit@chaosnative.com>

What this PR does / why we need it:

Adds statusCheckTimeout to container kill recovery

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

Checklist:

Fixes #
PR message has documentation-related information
Labelled this PR & related issue with breaking-changes tag
PR message has breaking-changes-related information
Labelled this PR & related issue with requires-upgrade tag
PR message has upgrade-related information
Commit has unit tests
Commit has integration tests
E2E run Required for the changes

Signed-off-by: uditgaurav <udit@chaosnative.com>

* Chore(stress-chaos): Run CPU chaos with percentage of cpu cores (#482)
* Chore(stress-chaos): Run CPU chaos with percentage of cores Signed-off-by: uditgaurav <udit@chaosnative.com>
* Fixeing alpine CVEs by upgrading the version (#486)
* Chore(vulnerability): Remove openebs retry module and update pkgs (#488)
* Chore(vulnerability): Fix some vulnerability by updaing the pkgs Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(vulnerability): Remove openebs retry module and update pkgs Signed-off-by: udit <udit@chaosnative.com>
* Chore(cgroup): Add support for cgroup version2 in stress-chaos experiment (#490) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(snyk): Fix snyk security scan on litmus-go (#492) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(network-chaos): Randomize Chaos Tunables for Netowork Chaos Experiment (#491)
* Chore(network-chaos): Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(network-chaos): Randomize Chaos Tunables for Netowork Chaos Experiment Signed-off-by: uditgaurav <udit@chaosnative.com> Co-authored-by: Karthik Satchitanand <karthik.s@mayadata.io>
* Chore(randomize): Randomize stress-chaos tunables (#487)
* Chore(randomize): Randomize stress-chaos tunables Signed-off-by: uditgaurav <udit@chaosnative.com>
* Update stress-chaos.go
* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill (#493)
* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill Signed-off-by: uditgaurav <udit@chaosnative.com>
* (enahncement)experiment: add node label filter for pod network and stress chaos (#494) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Fix(targetContainer): Incorrect target container passed in the helper pod for pod level experiments (#496)
* Fix target container issue Signed-off-by: uditgaurav <udit@chaosnative.com>
* Fix target container issue Signed-off-by: uditgaurav <udit@chaosnative.com>
* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#498) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#499) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(warn): Remove warning Neither --kubeconfig nor --master was specified for InClusterConfig (#507) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(ssm): Update the ssm file path in the Dockerfile (#508) Signed-off-by: uditgaurav <udit@chaosnative.com>
* GCP Experiments Refactor, New Label Selector Experiments and IAM Integration (#495)
* experiment init Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated experiment file Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated experiment lib Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated post chaos validation Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated empty slices to nil, updated experiment name in environment.go Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed experiment charts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* bootstrapped gcp-vm-disk-loss-by-label artiacts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed device-names input for gcp-vm-disk-loss experiment, added API calls to derive device name internally Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed redundant condition check in gcp-vm-disk-loss experiment pre-requisite checks Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* reformatted error messages Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* replaced the SetTargetInstances function Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added settargetdisk function for getting target disk names using label Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* refactored Target Disk Attached VM Instance memorisation, updated vm-disk-loss and added lib logic for vm-disk-loss-by-label experiment Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added experiment to bin and cleared default experiment name in environment.go Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed charts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated test.yml Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated AutoScalingGroup to ManagedInstanceGroup; updated logic for checking InstanceStop recovery for ManagedInstanceGroup VMs; Updated log and error messages with VM names Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed redundant computeService code snippets Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed redundant computeService code snippets in gcp-disk-loss experiments Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated logic for deriving default gcp sa credentials for computeService Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated logging for IAM integration Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* refactored log and error messages and wait for start/stop instances logic Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* fixed logs, optimised control statements, added comments, corrected experiment names Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* fixed file exists check logic Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated instance and device name fetch logic for disk loss Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated logs Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* update(sdk): updating litmus sdk for the defaultAppHealthCheck (#513) Signed-off-by: shubhamc <shubhamc@jfrog.com> Co-authored-by: shubhamc <shubhamc@jfrog.com>
* fix: updated release workflow (#512) Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
* Added Active Node Count Check using AWS APIs (#500)
* Added node count check using aws apis Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Added node count check using aws apis to instance terminate by tag experiment Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Log improvements; Code improvement in findActiveNodeCount function; Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Added log for instance status check failed in find active node count Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Added check if active node count is less than provided instance ids Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* updated appns podlist filtering error handling (#515) Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io> Co-authored-by: Udit Gaurav <35391335+uditgaurav@users.noreply.github.com> Co-authored-by: Vedant Shrotria <vedant.shrotria@harness.io>
* return error if node not present (#516) Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Chore(helper pod): Make setHelper data as tunable (#519) Signed-off-by: uditgaurav <udit@chaosnative.com>

Co-authored-by: Udit Gaurav <35391335+uditgaurav@users.noreply.github.com>
Co-authored-by: Raj Babu Das <mail.rajdas@gmail.com>
Co-authored-by: Karthik Satchitanand <karthik.s@mayadata.io>
Co-authored-by: Shubham Chaudhary <shubham.chaudhary@mayadata.io>
Co-authored-by: shubhamc <shubhamc@jfrog.com>
Co-authored-by: Soumya Ghosh Dastidar <44349253+gdsoumya@users.noreply.github.com>
Co-authored-by: Akash Shrivastava <akash@chaosnative.com>
Co-authored-by: Vedant Shrotria <vedant.shrotria@harness.io>

* modified the cmdProbe for inline mode of execution to accomodate litmusd Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* go mod tidy Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* bootstrapped process-kill experiment files Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated types.go and environment.go Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated secret envs Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated experiment logic and added steady state validation steps Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed action from probe refactor function parameters Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added serial and parallel chaos execution steps Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added conn parameter to probe Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added logic for closing websocket in the end of the experiment Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added experiment to bin Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* corrected the agent endpoint Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* corrected environement.go Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated logs, removed close message and added parallel sequence as default Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated experiment charts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated experiment charts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated authorization header, replaced Processes struct with int slice of pids Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* restored experiment image Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated test.yml Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added rbac, README, exported charts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added websocket connection to chaos details struct, restored probe functions params Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed websocket connection in chaoslib params Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated code function Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated readme Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* restructured directories, added m-agent tag Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated workflow branch Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed guest-os pkg Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* Chore(stress-chaos): Run CPU chaos with percentage of cpu cores (#482)
* Chore(stress-chaos): Run CPU chaos with percentage of cores Signed-off-by: uditgaurav <udit@chaosnative.com>
* updated client side m-agent design; added channelised message sending Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added liveness check for process kill Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated mutex lock to an RWMutex lock, locked read operations on the map Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* Fixeing alpine CVEs by upgrading the version (#486)
* updated WaitForDurationAndCheckLiveness function Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated cpu-stress experiment and steady-state condition Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* corrected probe format Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added functionality for multiple websocket connections Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated liveness check to test for all the connections and added parallel chaos injection Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated m-agent cmd probe for only one agent endpoint Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated underChaosEndpoints for abort Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* optimised make connections logic Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed redundant check and comments Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated comments for function Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated chaosInterval timer for fixing infinitely running chaosInterval Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added CLOSE_CONNECTION action for closure of websocket connections Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* Chore(vulnerability): Remove openebs retry module and update pkgs (#488)
* Chore(vulnerability): Fix some vulnerability by updaing the pkgs Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(vulnerability): Remove openebs retry module and update pkgs Signed-off-by: udit <udit@chaosnative.com>
* added chaos revert logic Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated connection close on ERROR functionalty and return on Read error Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added log for chaos revert Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* reverted env params Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added abort log info, added defer close statement to message listener, added load percentage validation Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated probe error feedback, removed charts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated mutex locks for RLock and RUnlock, updated connect agent function parameters Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* Chore(cgroup): Add support for cgroup version2 in stress-chaos experiment (#490) Signed-off-by: uditgaurav <udit@chaosnative.com>
* updated mutex locks Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* Chore(snyk): Fix snyk security scan on litmus-go (#492) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(network-chaos): Randomize Chaos Tunables for Netowork Chaos Experiment (#491)
* Chore(network-chaos): Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(network-chaos): Randomize Chaos Tunables for Netowork Chaos Experiment Signed-off-by: uditgaurav <udit@chaosnative.com> Co-authored-by: Karthik Satchitanand <karthik.s@mayadata.io>
* Chore(randomize): Randomize stress-chaos tunables (#487)
* Chore(randomize): Randomize stress-chaos tunables Signed-off-by: uditgaurav <udit@chaosnative.com>
* Update stress-chaos.go
* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill (#493)
* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(randomize): Randomize chaos tunables for schedule chaos and disk-fill Signed-off-by: uditgaurav <udit@chaosnative.com>
* (enahncement)experiment: add node label filter for pod network and stress chaos (#494) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Fix(targetContainer): Incorrect target container passed in the helper pod for pod level experiments (#496)
* Fix target container issue Signed-off-by: uditgaurav <udit@chaosnative.com>
* Fix target container issue Signed-off-by: uditgaurav <udit@chaosnative.com>
* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#498) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Fix(container-kill): Adds statusCheckTimeout to container kill recovery (#499) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(warn): Remove warning Neither --kubeconfig nor --master was specified for InClusterConfig (#507) Signed-off-by: uditgaurav <udit@chaosnative.com>
* Chore(ssm): Update the ssm file path in the Dockerfile (#508) Signed-off-by: uditgaurav <udit@chaosnative.com>
* GCP Experiments Refactor, New Label Selector Experiments and IAM Integration (#495)
* experiment init Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated experiment file Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated experiment lib Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated post chaos validation Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated empty slices to nil, updated experiment name in environment.go Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed experiment charts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* bootstrapped gcp-vm-disk-loss-by-label artiacts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed device-names input for gcp-vm-disk-loss experiment, added API calls to derive device name internally Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed redundant condition check in gcp-vm-disk-loss experiment pre-requisite checks Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* reformatted error messages Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* replaced the SetTargetInstances function Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added settargetdisk function for getting target disk names using label Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* refactored Target Disk Attached VM Instance memorisation, updated vm-disk-loss and added lib logic for vm-disk-loss-by-label experiment Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* added experiment to bin and cleared default experiment name in environment.go Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed charts Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated test.yml Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated AutoScalingGroup to ManagedInstanceGroup; updated logic for checking InstanceStop recovery for ManagedInstanceGroup VMs; Updated log and error messages with VM names Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed redundant computeService code snippets Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* removed redundant computeService code snippets in gcp-disk-loss experiments Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated logic for deriving default gcp sa credentials for computeService Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* updated logging for IAM integration Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* refactored log and error messages and wait for start/stop instances logic Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* fixed logs, optimised control statements, added comments, corrected experiment names Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* fixed file exists check logic Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated instance and device name fetch logic for disk loss Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated logs Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* update(sdk): updating litmus sdk for the defaultAppHealthCheck (#513) Signed-off-by: shubhamc <shubhamc@jfrog.com> Co-authored-by: shubhamc <shubhamc@jfrog.com>
* fix: updated release workflow (#512) Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
* Added Active Node Count Check using AWS APIs (#500)
* Added node count check using aws apis Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Added node count check using aws apis to instance terminate by tag experiment Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Log improvements; Code improvement in findActiveNodeCount function; Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Added log for instance status check failed in find active node count Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Added check if active node count is less than provided instance ids Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* updated appns podlist filtering error handling (#515) Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io> Co-authored-by: Udit Gaurav <35391335+uditgaurav@users.noreply.github.com> Co-authored-by: Vedant Shrotria <vedant.shrotria@harness.io>
* go mod tidy Signed-off-by: neelanjan00 <neelanjan@chaosnative.com>
* return error if node not present (#516) Signed-off-by: Akash Shrivastava <akash@chaosnative.com>
* Chore(helper pod): Make setHelper data as tunable (#519) Signed-off-by: uditgaurav <udit@chaosnative.com>
* added CPUs check in prerequisites check Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* removed .DS_Store Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* removed .DS_Store Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated rbac and readme Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* removed .DS_Store Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated qemu github action Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated qemu action version Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated m-agent go-runner tag to 2.10.0-Beta1 Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated target names Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>
* updated machine=>Machine targets, removed .DS_Store Signed-off-by: Neelanjan Manna <neelanjan.manna@harness.io>

Co-authored-by: Udit Gaurav <35391335+uditgaurav@users.noreply.github.com>
Co-authored-by: Raj Babu Das <mail.rajdas@gmail.com>
Co-authored-by: Karthik Satchitanand <karthik.s@mayadata.io>
Co-authored-by: Shubham Chaudhary <shubham.chaudhary@mayadata.io>
Co-authored-by: shubhamc <shubhamc@jfrog.com>
Co-authored-by: Soumya Ghosh Dastidar <44349253+gdsoumya@users.noreply.github.com>
Co-authored-by: Akash Shrivastava <akash@chaosnative.com>
Co-authored-by: Vedant Shrotria <vedant.shrotria@harness.io>

Fix(container-kill): Adds statusCheckTimeout to container kill recovery

40612d9

Signed-off-by: uditgaurav <udit@chaosnative.com>

uditgaurav requested review from ispeakc0de and ksatchit as code owners March 30, 2022 04:45

ksatchit approved these changes Mar 30, 2022

Merge branch 'master' into container-kill-fix

cf2504c

avaakash approved these changes Apr 13, 2022

uditgaurav merged commit 7d7adcb into litmuschaos:master Apr 14, 2022
