Added cluster shut down scenario #25
Conversation
This PR is based on node_scenarios PR #10, as I am reusing the AWS node stop and start functions.
Can one of the admins verify this patch?
Force-pushed from 3888ee3 to 513be85
kraken/invoke/command.py (Outdated)

    except Exception:
        logging.error("Failed to run %s" % (command))
I think you should keep a logging statement here and not just pass
This function was added just for the stop_kubelet node scenario. Once this command is executed, the kubelet stops even if the command exceeds the timeout (set here to 45s), so an exception indicating 'failed to run command' wouldn't be meaningful in that case.
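To illustrate the point being discussed, here is a minimal sketch of a best-effort command runner that tolerates a timeout instead of treating it as a failure. This is not the PR's actual implementation; the function name and the 45s default are assumptions based on the comment above.

```python
import logging
import subprocess

def run_best_effort(command, timeout=45):
    """Run a shell command, tolerating a timeout.

    For a command like `systemctl stop kubelet`, the action takes effect
    even if the command never returns within the timeout, so a timeout is
    logged at debug level rather than raised as a failure.
    """
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        logging.debug(
            "Command %s timed out after %ss; assuming it took effect",
            command, timeout,
        )
        return ""
    except Exception:
        logging.error("Failed to run %s", command)
        raise
```

The design choice here is to distinguish a timeout (expected for stop_kubelet, since the node stops responding) from a genuine failure to launch the command.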
kraken/kubernetes/client.py (Outdated)

    nodes = []
    try:
        ret = cli.list_node(pretty=True)
        if label_selector:
            ret = cli.list_node(pretty=True, label_selector=label_selector)
I do not see anywhere where you call list_nodes and pass it a label_selector. Is that correct?
Yeah, previously list_nodes was used on that line instead of list_killable_node.
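For context, a node-listing helper with an optional label selector might look roughly like the sketch below. It is written against a kubernetes CoreV1Api-style client passed in as a parameter; the function name is illustrative, not the PR's exact code.

```python
import logging

def list_nodes(cli, label_selector=None):
    """Return node names from a kubernetes CoreV1Api-like client,
    optionally filtered by a label selector."""
    nodes = []
    try:
        if label_selector:
            ret = cli.list_node(pretty=True, label_selector=label_selector)
        else:
            ret = cli.list_node(pretty=True)
    except Exception as e:
        logging.error("Exception when calling list_node: %s" % e)
        raise
    for node in ret.items:
        nodes.append(node.metadata.name)
    return nodes
```

Querying once with the selector (instead of listing all nodes first and then re-listing) avoids the redundant API call flagged in the diff above.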
I have just a few little comments, but I was able to run this cluster shut down scenario the other day and it worked perfectly!
    for node in nodes:
        cloud_object.stop_instances(node_id[node])
    logging.info("Waiting for 250s to shut down all the nodes")
    time.sleep(250)
Are we able to have the user set this in their config or use 250 as a default? Is there a specific reason we chose 250 seconds here?
There's no specific reason; it took around 2 minutes on a 10-node cluster, so I chose 250 seconds to accommodate bigger clusters. The start_instance function can be called on a node only when it is in the stopped state; otherwise it throws an error. However, I have added a try/except condition so that we sleep for 10 additional seconds when a node isn't in the stopped state even after 250 seconds.
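One way to avoid both a hardcoded 250s and the extra 10s retry is to poll for the target instance state until a configurable deadline. This is a generic sketch, not the PR's code: `get_state` stands in for whatever per-cloud call reports instance state (e.g. backed by boto3's `describe_instances` on AWS, which also offers an `instance_stopped` waiter that does this internally).

```python
import logging
import time

def wait_for_state(get_state, instance_id, target="stopped",
                   timeout=250, poll_interval=10):
    """Poll until the instance reaches `target` state or the timeout expires.

    get_state: callable(instance_id) -> current state string; assumed to be
    backed by the cloud provider's describe/status API.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_state(instance_id) == target:
            logging.info("Instance %s reached state %s", instance_id, target)
            return True
        time.sleep(poll_interval)
    logging.error("Instance %s not %s after %ss", instance_id, target, timeout)
    return False
```

With this shape, `timeout` and `poll_interval` can come straight from the scenario config, addressing the reviewer's question about making the wait user-configurable.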
    stopped_nodes = nodes - restarted_nodes
    logging.info("Waiting for 250s to allow cluster component initilization")
    time.sleep(250)
    logging.info("Successfully injected cluster_shut_down scenario!")
Are we able to add in a verification that the nodes are all back up and ready here? Is that too much for kraken that it can just be handled in cerberus? Thoughts?
I think this part would be handled by cerberus. With cerberus integration, when we receive true, kraken proceeds with the next scenario, indicating all the nodes are ready; with false, we terminate kraken, indicating some components aren't healthy. But it could be explicitly specified after that line once cerberus integration is enabled, if needed. Thoughts?
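If an explicit in-kraken check were wanted instead of deferring to cerberus, verifying node readiness amounts to inspecting each node's `Ready` condition. A minimal sketch against a CoreV1Api-like client (function name assumed, not from the PR):

```python
def not_ready_nodes(cli):
    """Return the names of nodes whose Ready condition is not True."""
    not_ready = []
    for node in cli.list_node().items:
        ready = any(
            cond.type == "Ready" and cond.status == "True"
            for cond in node.status.conditions
        )
        if not ready:
            not_ready.append(node.metadata.name)
    return not_ready
```

An empty return value would mean all nodes came back; anything else could be logged before deciding whether to proceed or abort.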
Force-pushed from fb242b9 to a7cb346
@yashashreesuresh Can we rebase the code please?
Force-pushed from a7cb346 to 2dc9aa3
This commit adds a scenario to shut down all the nodes including the masters and restarts them after a specified duration.
Force-pushed from 2dc9aa3 to 4e4ba5a
@chaitanyaenr I have rebased the code, but I was not able to test it completely as I don't have a cluster. After this PR gets merged, cluster_shut_down can be moved to a separate file where cluster_shut_down scenarios for other cloud types can be added as well.
@yashashreesuresh Thanks for rebasing the code. No worries about the cluster, I will test the scenario and let you know how it goes.
@@ -41,4 +43,4 @@ We are always looking for more enhancements, fixes to make it better, any contri

    ### Community
    Key Members(slack_usernames): paigerube14, rook, mffiedler, mohit, dry923, rsevilla, ravi
    * [**#sig-scalability on Kubernetes Slack**](https://kubernetes.slack.com)
    * [**#forum-perfscale on CoreOS Slack**](https://coreos.slack.com)
is this an open slack channel that anyone can get on?
@@ -18,6 +18,8 @@ kraken:

    - litmus_scenarios: # List of litmus scenarios to load
        - - https://hub.litmuschaos.io/api/chaos/1.10.0?file=charts/generic/node-cpu-hog/rbac.yaml
          - scenarios/node_hog_engine.yaml
    - cluster_shut_down_scenario:
The scenario type in run_kraken looks for a scenario type that ends with an "s". Just need to add an s here so it becomes: cluster_shut_down_scenarios
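Based on the config diff above, the corrected entry would presumably look like the following (indentation assumed from the surrounding kraken config):

```yaml
kraken:
  chaos_scenarios:
    - cluster_shut_down_scenarios:        # note the trailing "s"
        - scenarios/cluster_shut_down_scenario.yaml
```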
    runs = shut_down_config["runs"]
    shut_down_duration = shut_down_config["shut_down_duration"]
    cloud_type = shut_down_config["cloud_type"]
    if cloud_type == "aws":
Can we add in the other cloud types that have been added?
In addition, could we add an else case for when the cloud type is not supported, so we exit with an error message instead of running?
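A minimal sketch of the suggested else case (the set of supported clouds and the function name are illustrative, not the merged code):

```python
import logging
import sys

SUPPORTED_CLOUDS = {"aws", "gcp", "azure", "openstack"}

def check_cloud_type(cloud_type):
    """Exit with an error message when the configured cloud_type
    is not supported for cluster_shut_down scenarios."""
    if cloud_type not in SUPPORTED_CLOUDS:
        logging.error(
            "Cloud type %s is not supported for cluster_shut_down scenarios"
            % cloud_type
        )
        sys.exit(1)
    return cloud_type
```

Failing fast here means a typo in the config surfaces immediately rather than partway through a chaos run.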
I based most of my coding on this PR and got it merged for AWS, Azure, OpenStack and GCP.