
Added cluster shut down scenario #25

Closed
4 changes: 3 additions & 1 deletion README.md
@@ -24,6 +24,8 @@ Kraken supports pod, node and time/date based scenarios.

- [Time Scenarios](docs/time_scenarios.md)

- [Cluster Shut Down Scenarios](docs/cluster_shut_down_scenarios.md)

### Kraken scenario pass/fail criteria and report
It's important to check whether the targeted component recovered from the chaos injection and whether the Kubernetes/OpenShift cluster is healthy, since failures in one component can have an adverse impact on other components. Kraken does this by:
- Having built-in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
@@ -41,4 +43,4 @@ We are always looking for more enhancements, fixes to make it better, any contributions are most welcome.
### Community
Key Members(slack_usernames): paigerube14, rook, mffiedler, mohit, dry923, rsevilla, ravi
* [**#sig-scalability on Kubernetes Slack**](https://kubernetes.slack.com)
* [**#forum-perfscale on CoreOS Slack**](https://coreos.slack.com)
Collaborator commented:
is this an open slack channel that anyone can get on?

2 changes: 2 additions & 0 deletions config/config.yaml
@@ -18,6 +18,8 @@ kraken:
        - litmus_scenarios:                                  # List of litmus scenarios to load
            - - https://hub.litmuschaos.io/api/chaos/1.10.0?file=charts/generic/node-cpu-hog/rbac.yaml
              - scenarios/node_hog_engine.yaml
        - cluster_shut_down_scenario:
paigerube14 (Collaborator) commented on Feb 18, 2021:
The scenario type check in run_kraken looks for a scenario type that ends with an "s". Just need to add an "s" here to be: cluster_shut_down_scenarios

            - scenarios/cluster_shut_down_scenario.yml

cerberus:
    cerberus_enabled: False # Enable it when cerberus is previously installed
9 changes: 9 additions & 0 deletions docs/cluster_shut_down_scenarios.md
@@ -0,0 +1,9 @@
#### Kubernetes/OpenShift cluster shut down scenario
Scenario to shut down all the nodes, including the masters, and restart them after a specified duration. The cluster shut down scenario can be injected by placing the shut_down config file under the cluster_shut_down_scenario option in the Kraken config. Refer to the [cluster_shut_down_scenario](https://github.com/openshift-scale/kraken/blob/master/scenarios/cluster_shut_down_scenario.yml) config file.

```
cluster_shut_down_scenario: # Scenario to stop all the nodes for specified duration and restart the nodes
    runs: 1 # Number of times to execute the cluster_shut_down scenario
    shut_down_duration: 120 # duration in seconds to shut down the cluster
    cloud_type: aws # cloud type on which Kubernetes/OpenShift runs
```
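
For reference, a sketch of how this scenario is wired into the main kraken config, following the config/config.yaml change in this PR (indentation reconstructed; this uses the plural key cluster_shut_down_scenarios that the reviewer notes run_kraken expects):

```
kraken:
    chaos_scenarios:
        - cluster_shut_down_scenarios:
            - scenarios/cluster_shut_down_scenario.yml
```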
60 changes: 59 additions & 1 deletion run_kraken.py
@@ -12,7 +12,7 @@
import kraken.invoke.command as runcommand
import kraken.litmus.common_litmus as common_litmus
import kraken.node_actions.common_node_functions as nodeaction
-from kraken.node_actions.aws_node_scenarios import aws_node_scenarios
+from kraken.node_actions.aws_node_scenarios import AWS, aws_node_scenarios
from kraken.node_actions.general_cloud_node_scenarios import general_node_scenarios
from kraken.node_actions.gcp_node_scenarios import gcp_node_scenarios
import kraken.time_actions.common_time_functions as time_actions
@@ -277,6 +277,57 @@ def litmus_scenarios(scenarios_list, config, litmus_namespaces, litmus_uninstall
    return litmus_namespaces


# Inject the cluster shut down scenario
def cluster_shut_down(shut_down_config, config):
    runs = shut_down_config["runs"]
    shut_down_duration = shut_down_config["shut_down_duration"]
    cloud_type = shut_down_config["cloud_type"]
    if cloud_type == "aws":
Collaborator commented:
Can we add in the other cloud types that have been added? In addition, could we add an else case for when the cloud type is not supported, so the scenario doesn't run and instead exits with an error message?
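
A minimal sketch of the suggested guard, under stated assumptions: AWS is the class this PR imports in run_kraken.py, while GCP is a hypothetical analogue that the PR does not add; names are illustrative only.

```python
import logging
import sys

def get_cloud_object(cloud_type):
    # Map the configured cloud type to a cloud client; exit on unsupported types.
    if cloud_type == "aws":
        return AWS()
    elif cloud_type == "gcp":
        return GCP()  # hypothetical analogue, not part of this PR
    else:
        logging.error("Cloud type %s is not supported for cluster_shut_down scenarios" % cloud_type)
        sys.exit(1)
```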

        cloud_object = AWS()

    nodes = set(kubecli.list_nodes())
    node_id = {}
    for node in nodes:
        node_id[node] = cloud_object.get_instance_id(node)

    for _ in range(runs):
        logging.info("Starting cluster_shut_down scenario injection")
        for node in nodes:
            cloud_object.stop_instances(node_id[node])
        logging.info("Waiting for 250s to shut down all the nodes")
        time.sleep(250)
Collaborator commented:
Are we able to have the user set this in their config, or use 250 as a default? Is there a specific reason we chose 250 seconds here?

Contributor (author) replied:
There's no specific reason; it took around 2 minutes on a 10-node cluster, so I chose 250 seconds to accommodate clusters of bigger sizes. The start_instances function can be called on a node only when it is in the stopped state, else it throws an error. However, I have added a try/except such that we sleep for 10 additional seconds when a node isn't in the stopped state even after 250 seconds.
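
If this were made user-configurable, the scenario file could carry an optional knob along these lines (node_stop_timeout is a hypothetical name, not part of this PR; 250 would remain the default):

```
cluster_shut_down_scenario:
    runs: 1
    shut_down_duration: 120
    node_stop_timeout: 250 # hypothetical: seconds to wait for all nodes to reach the stopped state
    cloud_type: aws
```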

logging.info("Shutting down the cluster for the specified duration: %s"
% (shut_down_duration))
time.sleep(shut_down_duration)
logging.info("Restarting the nodes")
restarted_nodes = set()
stopped_nodes = nodes
while restarted_nodes != nodes:
for node in stopped_nodes:
try:
cloud_object.start_instances(node_id[node])
restarted_nodes.add(node)
except Exception:
time.sleep(10)
continue
stopped_nodes = nodes - restarted_nodes
logging.info("Waiting for 250s to allow cluster component initilization")
time.sleep(250)
logging.info("Successfully injected cluster_shut_down scenario!")
Collaborator commented:
Are we able to add in a verification that the nodes are all back up and ready here? Or is that too much for kraken, and should it just be handled in cerberus? Thoughts?

Contributor (author) replied:
I think this part would be handled by cerberus. With cerberus integration, when we receive a true, kraken proceeds with the next scenario, indicating all the nodes are ready; with a false, we terminate kraken, indicating some components aren't healthy. But it can be explicitly specified after this line when cerberus integration is enabled, if needed. Thoughts?
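
For comparison, a minimal sketch of an explicit in-kraken readiness check, under stated assumptions: wait_for_ready_nodes and kubecli.get_node_status are illustrative names, not existing kraken helpers, though kubecli itself is the client module already used above via kubecli.list_nodes.

```python
import time
import logging

def wait_for_ready_nodes(expected_nodes, timeout=600):
    # Hypothetical helper: poll until every restarted node reports Ready.
    # kubecli.get_node_status(node) is an assumed accessor, illustrative only.
    deadline = time.time() + timeout
    while time.time() < deadline:
        not_ready = [node for node in expected_nodes
                     if kubecli.get_node_status(node) != "Ready"]
        if not not_ready:
            return True
        logging.info("Waiting on %d node(s) to become Ready" % len(not_ready))
        time.sleep(10)
    return False
```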

    cerberus_integration(config)
    logging.info("")


def cluster_shut_down_scenarios(scenarios_list, config):
    for shut_down_config in scenarios_list:
        with open(shut_down_config, 'r') as f:
            shut_down_config = yaml.full_load(f)
            shut_down_config = shut_down_config["cluster_shut_down_scenario"]
            cluster_shut_down(shut_down_config, config)
            logging.info("Waiting for the specified duration: %s" % (wait_duration))
            time.sleep(wait_duration)


# Main function
def main(cfg):
    # Start kraken
@@ -329,6 +380,7 @@ def main(cfg):
    failed_post_scenarios = []
    litmus_namespaces = []
    litmus_installed = False

    # Loop to run the chaos starts here
    while (int(iteration) < iterations):
        # Inject chaos scenarios specified in the config
@@ -350,6 +402,7 @@
                # Inject time skew chaos scenarios specified in the config
                elif scenario_type == "time_scenarios":
                    time_scenarios(scenarios_list, config)

                elif scenario_type == "litmus_scenarios":
                    if not litmus_installed:
                        common_litmus.install_litmus(litmus_version)
@@ -359,8 +412,13 @@
                                                         litmus_namespaces,
                                                         litmus_uninstall)

                # Inject cluster shut down scenario specified in the config
                elif scenario_type == "cluster_shut_down_scenarios":
                    cluster_shut_down_scenarios(scenarios_list, config)

        iteration += 1
        logging.info("")

    if litmus_uninstall and litmus_installed:
        for namespace in litmus_namespaces:
            common_litmus.delete_chaos(namespace)
4 changes: 4 additions & 0 deletions scenarios/cluster_shut_down_scenario.yml
@@ -0,0 +1,4 @@
cluster_shut_down_scenario: # Scenario to stop all the nodes for specified duration and restart the nodes
    runs: 1 # Number of times to execute the cluster_shut_down scenario
    shut_down_duration: 120 # duration in seconds to shut down the cluster
    cloud_type: aws # cloud type on which Kubernetes/OpenShift runs