Litmus integration with node cpu hog experiment #33
Conversation
Can one of the admins verify this patch?
kraken/litmus/common_litmus.py (outdated)

```python
                     (engine_name, experiment_name, namespace))
        status = chaos_result.split(": ")[1].strip()
        while status == "Awaited":
            logging.info("Waiting for chaos result to finish waiting, sleeping 10 seconds")
```
Not a big deal, but this is chatty on the console for longer tests; we get this line repeatedly. Maybe something to follow up on as we see how other Litmus scenarios run in Kraken.
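One way to cut the noise would be to log once when waiting starts and again only when the status changes. A minimal sketch (the `get_status` callback and injectable `sleep` are hypothetical stand-ins for the code quoted above):

```python
import logging
import time


def wait_for_chaos_result(get_status, interval=10, sleep=time.sleep):
    """Poll a status callback until it leaves 'Awaited', logging only on change.

    get_status: callable returning the current chaos result status string.
    sleep: injectable for testing; defaults to time.sleep.
    """
    last_logged = None
    status = get_status()
    while status == "Awaited":
        if status != last_logged:
            # Log once instead of every iteration of the poll loop.
            logging.info("Chaos result is 'Awaited'; polling every %s seconds", interval)
            last_logged = status
        sleep(interval)
        status = get_status()
    return status
```

This keeps the console quiet for long-running experiments while still recording that the wait began.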
run_kraken.py (outdated)

```python
    if litmus_scenarios:
        for namespace in litmus_namespaces:
            common_litmus.delete_experiments(namespace)
        common_litmus.uninstall_litmus()
```
Can we catch SIGINT (Ctrl-C) and clean up Litmus? I think we're doing that in Cerberus, possibly for Slack. This would help (a) when things don't go well with a scenario and it is stuck, and (b) when running in daemon mode. We could do this as a follow-on PR; it's not a must to merge this one.
Yes, I definitely think this would be nice to add. I might do it separately from this PR just to be able to get this one in quickly.
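As a sketch of what that follow-on could look like (the helper name is hypothetical, and `delete_experiments`/`uninstall_litmus` are injected stand-ins for the `common_litmus` functions in this PR):

```python
import signal
import sys


def install_litmus_cleanup(namespaces, delete_experiments, uninstall_litmus):
    """Register a SIGINT handler that tears down Litmus before exiting.

    delete_experiments and uninstall_litmus mirror the common_litmus helpers
    in this PR; they are passed in so the sketch stays self-contained.
    """
    def handler(signum, frame):
        # Best-effort cleanup mirroring the normal end-of-run teardown.
        for namespace in namespaces:
            delete_experiments(namespace)
        uninstall_litmus()
        sys.exit(1)

    signal.signal(signal.SIGINT, handler)
    return handler  # returned so the teardown path can be exercised directly
```

Registering this early in `run_kraken.py` would cover both the stuck-scenario and daemon-mode cases.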
```python
def delete_experiments(namespace):
    runcommand.invoke("kubectl delete chaosengine --all -n " + str(namespace))
    runcommand.invoke("kubectl delete chaosexperiment --all -n " + str(namespace))
    runcommand.invoke("kubectl delete chaosresult --all -n " + str(namespace))
```
I think you forgot to clean up the service accounts here.
The service accounts should be cleaned up here: https://github.com/openshift-scale/kraken/blob/4b83f67379b058446b8fce7f5040ed61cbd43476/run_kraken.py#L285
Is that correct?
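If the service accounts did need to come out here, a hedged extension of `delete_experiments` might look like the following. The `invoke` parameter stands in for kraken's `runcommand.invoke`, and deleting *all* service accounts in the namespace is an illustrative assumption; a real cleanup would target only the accounts the rbac.yaml created:

```python
def delete_experiments_and_sa(namespace, invoke):
    """Delete the Litmus CRs and, additionally, service accounts in a namespace.

    invoke stands in for runcommand.invoke; injected here for testability.
    NOTE: '--all' on serviceaccount is an assumption for illustration only.
    """
    for resource in ("chaosengine", "chaosexperiment", "chaosresult", "serviceaccount"):
        invoke("kubectl delete %s --all -n %s" % (resource, namespace))
```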
```python
        if "not found" in status:
            return False

        if status == "Pass":
```
It would be nice to also take the state of the Litmus probes into consideration in the results, in the future.
Force-pushed from 43c5f57 to 8fd4106.
There are 3 custom resources that are created during each Litmus scenario. Below is a description of the resources:

* ChaosEngine: A resource to link a Kubernetes application or Kubernetes node to a ChaosExperiment. ChaosEngine is watched by Litmus' Chaos-Operator, which then invokes Chaos-Experiments.
* ChaosExperiment: A resource to group the configuration parameters of a chaos experiment. ChaosExperiment CRs are created by the operator when experiments are invoked by ChaosEngine.
* ChaosResult: A resource to hold the results of a chaos experiment. The Chaos-exporter reads the results and exports the metrics into a configured Prometheus server.
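For example, the verdict recorded in a ChaosResult can be read back with kubectl. A sketch, with two explicit assumptions: the `<engine>-<experiment>` result-name convention and the `status.experimentstatus.verdict` jsonpath are taken from Litmus documentation rather than from this PR, and `invoke` stands in for a shell helper like `runcommand.invoke`:

```python
def get_chaos_verdict(engine, experiment, namespace, invoke):
    """Read the verdict from the ChaosResult for an engine/experiment pair.

    Assumes Litmus names the ChaosResult '<engine>-<experiment>' and stores
    the verdict under status.experimentstatus.verdict.
    """
    result_name = "%s-%s" % (engine, experiment)
    output = invoke(
        "kubectl get chaosresult %s -n %s "
        "-o jsonpath='{.status.experimentstatus.verdict}'" % (result_name, namespace)
    )
    return output.strip()
```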
Thoughts on using the in-cluster Prometheus by default to ship the metrics in the case of OpenShift?
I think this is built into Litmus, so it would have to be on their side? I'd need some clarification on this part.
config/config.yaml (outdated)

```yaml
    time_scenarios:                                              # List of chaos time scenarios to load
        - scenarios/time_scenarios_example.yml
    litmus_scenarios:
        - - https://hub.litmuschaos.io/api/chaos/1.9.1?file=charts/generic/node-cpu-hog/rbac.yaml
```
Maybe consume an rbac file local to the scenarios dir, to be able to modify the namespace? Thoughts?
I'm not sure which is better here. It might be good to keep it as the GitHub link so it stays up to date, but I could see it both ways.
Sorry for the late comment. The GitHub link has the disadvantage of not working in disconnected clusters. Probably not a common scenario, and not a blocker for this PR. But does using the link make the node_hog_rbac.yaml file superfluous, or should we keep it around?
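One way to support both modes (a sketch; the function name is hypothetical and not part of this PR) is to treat each scenario entry as either a URL or a local path, so disconnected clusters can point at files under `scenarios/` while connected ones keep the hub.litmuschaos.io link:

```python
import urllib.request


def read_scenario(ref):
    """Return the YAML text for a scenario reference.

    http(s) references are fetched, as the config's hub.litmuschaos.io URL
    would be; anything else is treated as a local file path, which keeps
    disconnected clusters working.
    """
    if ref.startswith(("http://", "https://")):
        with urllib.request.urlopen(ref) as resp:
            return resp.read().decode("utf-8")
    with open(ref) as f:
        return f.read()
```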
Force-pushed from 9254ec2 to ca7bd5a.
I think I have updated this PR with all comments. @jordigilh, are you able to take another look?
LGTM, thanks! /Jordi
Force-pushed from 3aab944 to 5e11af1.
scenarios/node_hog_rbac.yaml (outdated; new 37-line file, shown truncated)

```yaml
---
```
Is this file still needed since the config is now using the URL?
Should we keep the experiment file (which I had previously deleted from this PR) and this file for the disconnected-clusters situation you mentioned in your previous comment? Or is it better to just call out in the documentation that you'll have to download the files for whatever version you want, so they don't become badly out of date?
That was my confusion: I needed to download the experiment file and add it to config.yaml (it is gone from there too).
- wget https://hub.litmuschaos.io/api/chaos/1.9.1?file=charts/generic/node-cpu-hog/experiment.yaml -O scenarios/experiment.yaml
- Edit config/config.yml and add experiment.yaml after node_hog_engine.yaml.
- TARGET_NODES worked fine for 1.9.1; that was in the sample experiment file.

Our sample config.yaml should probably work OOTB and include a local experiment file and a local engine file, since they are the most likely to be updated. If not, we need to include instructions in the Litmus README for downloading scenarios and adding them to the config. My opinion, anyway. @jordigilh @chaitanyaenr, other opinions?
From my understanding, the code will install all experiments based on the version you specify in the overarching config file, so there is no need to do the wget or have the experiment files locally.
Setting TARGET_NODES in the ChaosEngine file should set an environment variable that configures the specific experiment with the given target node name.
I'm working on validating this and updating the Litmus documentation file to be a little clearer on the steps needed.
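As a sketch of that mechanism (the helper name is hypothetical, and the nested dict shape abbreviates the ChaosEngine YAML once parsed), setting the env entry programmatically might look like:

```python
def set_target_nodes(engine, node_name):
    """Set (or add) the TARGET_NODES env var in a parsed ChaosEngine dict.

    engine mirrors the nested structure of node_hog_engine.yaml after YAML
    parsing (dicts and lists); only the experiment env list is touched.
    """
    for experiment in engine["spec"]["experiments"]:
        env = experiment["spec"]["components"]["env"]
        for entry in env:
            if entry["name"] == "TARGET_NODES":
                entry["value"] = node_name
                break
        else:
            # No existing entry: append one so the runner targets this node.
            env.append({"name": "TARGET_NODES", "value": node_name})
    return engine
```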
Yes, you can keep this locally for disconnected clusters, but then you'll have to make sure the spec matches the Litmus go-runner version, particularly when you upgrade the Litmus version. You only need to apply the ChaosEngine that will be used for this particular experiment, and discard the others.
Removed the rbac file for the node_hog scenario
I think you'll need the rbac file to define the service account, the role, and the role binding. Unless you are handling that elsewhere :)
I meant I removed the downloaded rbac file from the scenarios folder. We're still using the hub.litmuschaos.io link in the config to set the service account, role, and role binding.
ok sorry, my bad :)
scenarios/node_hog_engine.yaml (outdated)

```yaml
          value: '60'

        # ENTER THE NAME OF THE APPLICATION NODE
        - name: APP_NODE
```
I think this env var needs to be TARGET_NODES (see: https://docs.litmuschaos.io/docs/node-cpu-hog/#prepare-chaosengine). @jordigilh, can you confirm? Even when I use TARGET_NODES, though, it is not honored and the load runs on a different node.
It looks like the key was updated between versions 1.9 and 1.10; the 1.9 documentation still shows APP_NODES (https://docs.litmuschaos.io/docs/1.9.0/node-cpu-hog/).
It seems like the versions are updated quite often. Do we want to just pick a version and keep using it until we have a need to update? Should we use 1.10 now, since it's the latest?
@paigerube14 confirmed it should be named TARGET_NODES.
They also changed the structure of the probes in the schema. Since you don't use probes in this case, you should be good to move to 1.10.0.
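Since the key changed between releases, a version-aware helper could hedge against both names. A sketch; the 1.10.0 cutover and the APP_NODES spelling for 1.9.x are taken from the discussion above, not verified against every release:

```python
def node_env_key(litmus_version):
    """Return the ChaosEngine env var name that targets specific nodes.

    Assumes 1.9.x used APP_NODES and 1.10.0 renamed it to TARGET_NODES.
    Comparing integer tuples avoids the '1.10' < '1.9' string-ordering bug.
    """
    parts = tuple(int(p) for p in litmus_version.split("."))
    return "TARGET_NODES" if parts >= (1, 10) else "APP_NODES"
```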
Updated config file to use the 1.10.0 versions.
@paigerube14 LGTM. Let's squash and merge, and iterate in subsequent PRs/issues.
This PR is for #31
This will install Litmus, apply any experiments, check their status, and then tear down Litmus with each Kraken run.
The scenario I was able to run, and whose scenario files I am adding here, is the node CPU hog. For this scenario you have to add the node whose CPU you want to hog in both scenario/node_hog.yaml and scenario/node_hog_engine.yaml.
Other scenarios that we can add can be found here:
https://hub.litmuschaos.io/