Added cluster shut down scenario #25

Closed

Conversation

yashashreesuresh
Contributor

This commit adds a scenario that shuts down all the nodes, including the masters, and restarts them after a specified duration.

@yashashreesuresh
Contributor Author

This PR is based on the node_scenarios PR (#10), as I am reusing the AWS node stop and start functions.

@rht-perf-ci

Can one of the admins verify this patch?

except Exception:
    logging.error("Failed to run %s" % (command))
Collaborator

I think you should keep a logging statement here and not just pass

Contributor Author

This function was added just for the stop_kubelet node scenario. Once this command is executed, the kubelet stops even if the command exceeds the timeout (set to 45s here), so an exception saying 'failed to run command' wouldn't be meaningful in that case.
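For illustration, a minimal sketch of that behaviour, assuming a subprocess-based runner and the usual oc debug approach for stopping the kubelet (the PR's actual command helper may differ): a timeout is logged as informational rather than treated as a failure, because the kubelet has normally already stopped by then.

import logging
import subprocess

def stop_kubelet(node, timeout=45):
    # Hypothetical sketch, not the PR's exact helper: stop the kubelet on a
    # node and tolerate the command exceeding the timeout, since the kubelet
    # is typically stopped already when that happens.
    command = "oc debug node/%s -- chroot /host systemctl stop kubelet" % node
    try:
        subprocess.run(command, shell=True, timeout=timeout, check=True)
    except subprocess.TimeoutExpired:
        logging.info("Command exceeded %ss timeout; kubelet on %s should already be stopped" % (timeout, node))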

nodes = []
try:
    ret = cli.list_node(pretty=True)
    if label_selector:
        ret = cli.list_node(pretty=True, label_selector=label_selector)
Collaborator

I do not see anywhere that you call list_nodes and pass it a label_selector. Is that correct?

Contributor Author

Yeah, previously on this line, list_nodes was used instead of list_killable_node.

@paigerube14
Collaborator

I have just a few little comments, but I was able to run this cluster shut down scenario the other day and it worked perfectly!

for node in nodes:
    cloud_object.stop_instances(node_id[node])
logging.info("Waiting for 250s to shut down all the nodes")
time.sleep(250)
Collaborator

Are we able to have the user set this in their config or use 250 as a default? Is there a specific reason we chose 250 seconds here?

Contributor Author

There's no specific reason; it took around 2 minutes on a 10-node cluster, so I chose 250 seconds to accommodate bigger clusters. The start_instance function can be called on a node only when it is in the stopped state, otherwise it throws an error. However, I have added a try/except so that we sleep for 10 additional seconds when a node isn't in the stopped state even after 250 seconds.
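A rough sketch of that retry, reusing the cloud_object and node_id mapping from the snippet above (start_instances is assumed to mirror the existing stop_instances helper):

import logging
import time

def restart_nodes(cloud_object, node_id, nodes, extra_wait=10):
    # Illustrative only: try to start each node; if it is not yet in the
    # stopped state, wait a little longer and retry once.
    for node in nodes:
        try:
            cloud_object.start_instances(node_id[node])
        except Exception:
            logging.info("Node %s is not in stopped state yet, waiting %ss before retrying" % (node, extra_wait))
            time.sleep(extra_wait)
            cloud_object.start_instances(node_id[node])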

stopped_nodes = nodes - restarted_nodes
logging.info("Waiting for 250s to allow cluster component initilization")
time.sleep(250)
logging.info("Successfully injected cluster_shut_down scenario!")
Collaborator

Are we able to add in a verification here that the nodes are all back up and ready? Or is that too much for kraken and something that can just be handled in cerberus? Thoughts?

Contributor Author

I think this part would be handled by cerberus. With cerberus integration, when we receive a true, kraken proceeds with the next scenario, indicating all the nodes are ready; with a false, we terminate kraken, indicating some components aren't healthy. But it can be explicitly specified after this line when cerberus integration is enabled, if needed. Thoughts?
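If kraken did add an explicit check, a minimal sketch using the Kubernetes Python client (the same cli that list_node is called on above) could look like this; the function name and timeouts are hypothetical:

import time

def wait_for_nodes_ready(cli, timeout=600, interval=10):
    # Hypothetical verification: poll until every node reports the Ready
    # condition with status "True", or give up after `timeout` seconds.
    elapsed = 0
    while elapsed < timeout:
        nodes = cli.list_node(pretty=True).items
        not_ready = [
            node.metadata.name
            for node in nodes
            if not any(cond.type == "Ready" and cond.status == "True"
                       for cond in node.status.conditions)
        ]
        if not not_ready:
            return True
        time.sleep(interval)
        elapsed += interval
    return False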

@chaitanyaenr
Collaborator

@yashashreesuresh Can we rebase the code please?

This commit adds a scenario to shut down all the nodes including
the masters and restarts them after a specified duration.
@yashashreesuresh
Contributor Author

@yashashreesuresh Can we rebase the code please?

@chaitanyaenr I have rebased the code, but I was not able to test it completely as I don't have a cluster. The code still needs to be tested.

After this PR gets merged, cluster_shut_down can be moved to a separate file, where the cluster_shut_down scenario for other cloud types can be added as well.

@chaitanyaenr
Collaborator

@yashashreesuresh Thanks for rebasing the code. No worries about the cluster, will test the scenario and let you know how it goes.

@@ -41,4 +43,4 @@ We are always looking for more enhancements, fixes to make it better, any contri
### Community
Key Members(slack_usernames): paigerube14, rook, mffiedler, mohit, dry923, rsevilla, ravi
* [**#sig-scalability on Kubernetes Slack**](https://kubernetes.slack.com)
* [**#forum-perfscale on CoreOS Slack**](https://coreos.slack.com)
Collaborator

Is this an open Slack channel that anyone can join?

@@ -18,6 +18,8 @@ kraken:
- litmus_scenarios: # List of litmus scenarios to load
- - https://hub.litmuschaos.io/api/chaos/1.10.0?file=charts/generic/node-cpu-hog/rbac.yaml
- scenarios/node_hog_engine.yaml
- cluster_shut_down_scenario:
Collaborator

The scenario type in run_kraken is looking for a scenario type that ends with an "s". Just need to add an "s" here so it becomes cluster_shut_down_scenarios.
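For context, a hedged sketch of the dispatch this refers to; the surrounding config structure and function names here are assumptions, not kraken's exact code:

# Hypothetical dispatch loop illustrating why the plural key matters.
for scenario_type, scenarios in chaos_scenarios.items():
    if scenario_type == "cluster_shut_down_scenarios":  # note the trailing "s"
        run_cluster_shut_down(scenarios)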

runs = shut_down_config["runs"]
shut_down_duration = shut_down_config["shut_down_duration"]
cloud_type = shut_down_config["cloud_type"]
if cloud_type == "aws":
Collaborator

Can we add in the other cloud types that have been added?
In addition, could we add an else case for when the cloud type is not supported, so that the scenario doesn't run and instead exits with an error message? See the sketch below.
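A minimal sketch of the suggested guard, assuming the shut_down_config keys from the snippet above; the supported list here is illustrative, not kraken's actual set of clouds at the time of this PR:

import logging
import sys

def check_cloud_type(shut_down_config, supported=("aws", "gcp", "azure", "openstack")):
    # Sketch of the suggested else case: exit with an error when the
    # configured cloud type is not one the scenario can handle.
    cloud_type = shut_down_config["cloud_type"]
    if cloud_type not in supported:
        logging.error("Cloud type %s is not supported for cluster_shut_down scenarios" % cloud_type)
        sys.exit(1)
    return cloud_type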

@paigerube14
Collaborator

I based most of my coding on this PR and got it merged for AWS, Azure, OpenStack, and GCP in #102.
