
Aws az down #230

Closed · wants to merge 1 commit

Conversation

@damienomurchu (Author) commented Dec 9, 2020

What

Add a Litmus experiment to simulate the failure of an AWS availability zone (AZ).

Why

Evaluate the resiliency of the application under test when an AZ goes down.

How

  • Add a new litmus-go experiment for az-down
  • Create dummy network ACLs with deny-all traffic policies
  • Assign the ACLs to the subnets of the AZ under test
  • Revert the ACL changes after the experiment concludes (a Go sketch of this ACL swap follows below)
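The ACL swap described above can be pictured with a short Go sketch. This is a minimal illustration against aws-sdk-go v1, not the PR's actual code: the package name, blockSubnets, and the revert map are invented here for the example. It relies on the fact that a freshly created custom network ACL denies all traffic by default.

// Package azdown sketches the ACL swap described in the PR summary.
package azdown

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// blockSubnets points every target subnet at a new deny-all NACL and
// returns the original NACL ID for each new association ID, so the
// caller can revert once the chaos duration elapses.
func blockSubnets(svc *ec2.EC2, vpcID string, subnetIDs []string) (map[string]string, error) {
	// A custom NACL starts with only the default deny-all entries.
	denyACL, err := svc.CreateNetworkAcl(&ec2.CreateNetworkAclInput{
		VpcId: aws.String(vpcID),
	})
	if err != nil {
		return nil, err
	}

	// Look up the NACLs currently associated with the target subnets.
	out, err := svc.DescribeNetworkAcls(&ec2.DescribeNetworkAclsInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("association.subnet-id"),
			Values: aws.StringSlice(subnetIDs),
		}},
	})
	if err != nil {
		return nil, err
	}

	targets := map[string]bool{}
	for _, id := range subnetIDs {
		targets[id] = true
	}

	revert := map[string]string{} // new association ID -> original NACL ID
	for _, acl := range out.NetworkAcls {
		for _, assoc := range acl.Associations {
			if !targets[aws.StringValue(assoc.SubnetId)] {
				continue // the NACL may also serve subnets outside the AZ
			}
			// Re-point the association; AWS returns a fresh association
			// ID, which is what the revert step must use.
			resp, err := svc.ReplaceNetworkAclAssociation(&ec2.ReplaceNetworkAclAssociationInput{
				AssociationId: assoc.NetworkAclAssociationId,
				NetworkAclId:  denyACL.NetworkAcl.NetworkAclId,
			})
			if err != nil {
				return revert, err
			}
			revert[aws.StringValue(resp.NewAssociationId)] = aws.StringValue(acl.NetworkAclId)
		}
	}
	return revert, nil
}

Reverting is then the same call in reverse: iterate the returned map, call ReplaceNetworkAclAssociation with each stored original NACL ID, and finally delete the deny-all ACL with DeleteNetworkAcl.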

@damienomurchu (Author) commented Dec 10, 2020

Thanks for the eyes @ksatchit. I'm having a little trouble verifying it using okteto, which I documented in #229 (before I created this PR). Let me know if I'm missing anything besides the serviceaccount and the litmus-experiment that I should deploy to test locally.

@ksatchit (Member)

Hi @damienomurchu! Once again, thanks for working on the experiment. I have shared some info here regarding the tests. At this point, okteto is more for repeated dev-testing. The tests with the chaosexperiment and chaosengine are a separate step - not integrated with okteto as such.
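For anyone following along, the ChaosEngine step mentioned here would look roughly like the sketch below. This is an assumption reconstructed from the runner log later in this thread (engine name, namespace, app label, service account); the appns value is a placeholder, not taken from the PR.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: az-down-chaos
  namespace: litmus
spec:
  # appinfo echoes the runner log further down; appns is a placeholder
  appinfo:
    appns: "default"
    applabel: "app.kubernetes.io/name=kafka"
    appkind: ""
  chaosServiceAccount: az-down-sa
  experiments:
    - name: az-down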

@damienomurchu force-pushed the aws-az-down branch 2 times, most recently from bd04c79 to 6da4196 on December 11, 2020 at 17:51
@damienomurchu (Author) commented Dec 11, 2020

I've been running this locally with okteto and have verified all but the destructive aspects of the experiment, namely the reassignment of the blocking ACLs to take the targeted AZ's instances out of circulation.

The multi-AZ cluster I've been testing this against is due to expire, so I will most likely test this on a new cluster on Monday. All going well, I'll tidy up the PR and take it out of draft.

Thanks for the helpful info and fix on #229!

@ksatchit (Member)

Thanks @damienomurchu!! Also adding other community members specifically interested in this experiment for their perspective/reviews - cc: @kazukousen @suhrud-kumar


import "github.com/litmuschaos/litmus-go/pkg/log"

func AZDown() error {
@damienomurchu (Author) commented on the snippet above:
Most of the az-down business logic is in the experiment right now, but I will look to refactor it and move it here.

@damienomurchu force-pushed the aws-az-down branch 5 times, most recently from 921e7d8 to 01be13f on December 15, 2020 at 12:18
@damienomurchu (Author) commented Dec 15, 2020

Mostly verified by running the experiment locally with okteto against a multi-AZ cluster. Will run it against a fresh cluster later today via the regular Litmus workflow.

Signed-off-by: Damien Murphy <damurphy@redhat.com>
@damienomurchu (Author) commented Dec 15, 2020

Hmm, I'm getting a panic over a nil map assignment that I wasn't getting when running the experiment locally with okteto:

❯ oc logs -f pod/az-down-chaos-runner                                    
W1215 15:51:49.797393       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T15:51:49Z" level=info msg="Experiments details are as follows" Experiments List="[az-down]" Engine Name=az-down-chaos appLabels="app.kubernetes.io/name=kafka" appKind= Service Account Name=az-down-sa Engine Namespace=litmus
panic: assignment to entry in nil map

goroutine 1 [running]:
github.com/litmuschaos/chaos-runner/pkg/utils.(*ExperimentDetails).SetLabels(...)
	/home/circleci/go/src/github.com/litmuschaos/chaos-runner/pkg/utils/experimentHelpers.go:168
github.com/litmuschaos/chaos-runner/pkg/utils.(*ExperimentDetails).SetValueFromChaosExperiment(0xc420891ba8, 0xc4201c3a40, 0xc4202da100, 0xc420891a00, 0x1, 0x1)
	/home/circleci/go/src/github.com/litmuschaos/chaos-runner/pkg/utils/experimentHelpers.go:40 +0x1d5
github.com/litmuschaos/chaos-runner/pkg/utils.(*ExperimentDetails).SetValueFromChaosResources(0xc420891ba8, 0xc420891a00, 0xc4201c3a40, 0xc4202da100, 0xc4202da1e0, 0x1)
	/home/circleci/go/src/github.com/litmuschaos/chaos-runner/pkg/utils/experimentHelpers.go:206 +0x396
main.main()
	/home/circleci/go/src/github.com/litmuschaos/chaos-runner/bin/runner.go:56 +0x779

I've rebuilt the binary and the image from the latest PR changes before testing. Any maps I'm using seem to be initialised, but I seem to be missing something, so any fresh eyes would be welcome :)
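For context, a generic Go note rather than anything from this thread: the panic above is what Go raises when code assigns to a map that was declared but never initialised with make(). The struct below is a stand-in invented for illustration, not the actual chaos-runner type.

package main

// Stand-in type illustrating the "assignment to entry in nil map"
// panic from the log above; not the real chaos-runner struct.
type ExperimentDetails struct {
	ExpLabels map[string]string // nil until initialised with make()
}

func main() {
	exp := ExperimentDetails{}

	// exp.ExpLabels["name"] = "az-down" // panics: assignment to entry in nil map

	exp.ExpLabels = make(map[string]string) // the usual fix
	exp.ExpLabels["name"] = "az-down"       // safe now
}

Note that the stack trace points into chaos-runner rather than the litmus-go code in this PR, so a stale or mismatched runner image could also explain why the panic appears only outside the okteto setup.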

@damienomurchu (Author) commented Dec 15, 2020

@ksatchit I feel the PR is probably sufficiently advanced to consider taking it out of draft, but after today I'm mostly AFK for about a month, so I will be guided by you as to whether I should take it out of draft before then and resolve any outstanding issues thrown up by the CI.

@ksatchit (Member)

Hi @damienomurchu, yes, I guess we can move it out of draft to an actual PR!! Will take a closer look shortly (just wrapping up the 1.11.0 release).

@ksatchit (Member)

Tagging @uditgaurav on the e2e failure in the container-kill experiment.

@ksatchit added the WIP label on May 25, 2021
@damienomurchu (Author)

Closing this, as we went in a different direction instead: reusing a shell script that drives the aws-cli to perform the same steps as the logic in this experiment. I will leave the PR branch up in case any of this work is useful to others.

Thanks again for the help and feedback with this PR @ksatchit
