
Litmus integration with node cpu hog experiment #33

Merged (1 commit, Dec 3, 2020)

Conversation

@paigerube14 (Collaborator)

This PR is for #31

With each kraken run, this will install Litmus, apply any configured experiments, check their status, and then tear Litmus down.

The scenario I was able to run, and whose files I am adding here, is the node CPU hog. For this scenario you have to specify the node whose CPU you want to hog in both scenarios/node_hog.yaml and scenarios/node_hog_engine.yaml.

Other scenarios that we can add can be found here:
https://hub.litmuschaos.io/
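
For context, a minimal sketch of the per-run flow described above. Only delete_experiments() and uninstall_litmus() appear in this PR's diff; install_litmus(), deploy_all_experiments(), and check_experiment() are hypothetical names used for illustration:

```python
from kraken.litmus import common_litmus

def run_litmus_scenarios(litmus_scenarios, litmus_namespaces, version="1.10.0"):
    common_litmus.install_litmus(version)                 # placeholder: deploy the Litmus operator
    for scenario in litmus_scenarios:
        common_litmus.deploy_all_experiments(scenario)    # placeholder: kubectl apply the scenario CRs
        if not common_litmus.check_experiment(scenario):  # placeholder: poll the ChaosResult
            raise RuntimeError("Litmus experiment failed: %s" % scenario)
    # Tear-down, as in this PR's diff:
    for namespace in litmus_namespaces:
        common_litmus.delete_experiments(namespace)
    common_litmus.uninstall_litmus()
```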

@rht-perf-ci

Can one of the admins verify this patch?

```python
(engine_name, experiment_name, namespace))
status = chaos_result.split(": ")[1].strip()
while status == "Awaited":
    logging.info("Waiting for chaos result to finish waiting, sleeping 10 seconds")
```
Collaborator

Not a big deal, but this is chatty on the console for longer tests; we get this message repeatedly. Maybe something to follow up on as we see how other Litmus scenarios run in kraken.
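
A minimal sketch of one way to quiet this loop. The counter-based throttle is an illustration, not this PR's implementation, and get_status stands in for however the status line is actually fetched:

```python
import logging
import time

def wait_for_chaos_result(get_status, interval=10, log_every=6):
    """Poll until the ChaosResult leaves the 'Awaited' state.

    Logs on the first poll and then only every `log_every` polls
    (once a minute with the defaults) to keep the console quiet.
    """
    polls = 0
    status = get_status()
    while status == "Awaited":
        if polls % log_every == 0:
            logging.info("Chaos result still Awaited, checking every %s seconds", interval)
        polls += 1
        time.sleep(interval)
        status = get_status()
    return status
```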

run_kraken.py

```python
if litmus_scenarios:
    for namespace in litmus_namespaces:
        common_litmus.delete_experiments(namespace)
    common_litmus.uninstall_litmus()
```
Collaborator

Can we catch SIGINT (ctrl-c) and clean up Litmus? I think we're doing that in Cerberus, possibly for Slack. This would help a) when things don't go well with a scenario and it is stuck, and b) when running in daemon mode. We could do this as a follow-on PR; it's not a must to merge this one.
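
A minimal sketch of such a handler using Python's signal module. The cleanup calls mirror the tear-down code above; wiring this into run_kraken.py, and the import path, are assumptions:

```python
import logging
import signal
import sys

from kraken.litmus import common_litmus  # assumed import path

def install_sigint_cleanup(litmus_namespaces):
    """Register a ctrl-c handler that tears Litmus down before exiting."""
    def handler(signum, frame):
        logging.info("SIGINT received, cleaning up Litmus before exit")
        for namespace in litmus_namespaces:
            common_litmus.delete_experiments(namespace)
        common_litmus.uninstall_litmus()
        sys.exit(1)
    signal.signal(signal.SIGINT, handler)
```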

Collaborator Author

Yes, I definitely think this would be nice to add. I might do it separately from this PR just to be able to get this one in quickly.

kraken/litmus/common_litmus.py

```python
def delete_experiments(namespace):
    runcommand.invoke("kubectl delete chaosengine --all -n " + str(namespace))
    runcommand.invoke("kubectl delete chaosexperiment --all -n " + str(namespace))
    runcommand.invoke("kubectl delete chaosresult --all -n " + str(namespace))
```

Collaborator

I think you forgot to clean up the service accounts here.
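
A minimal sketch of the missing cleanup, in the same style as delete_experiments above. The resource names are assumptions (the node-cpu-hog rbac.yaml on the Litmus hub creates a service account and a cluster role/binding sharing one name); adjust per experiment:

```python
def delete_experiment_rbac(namespace, name="node-cpu-hog-sa"):
    # Assumed names from the experiment's rbac.yaml; uses kraken's existing
    # runcommand helper like the rest of this module.
    runcommand.invoke("kubectl delete serviceaccount " + name + " -n " + str(namespace))
    runcommand.invoke("kubectl delete clusterrole " + name)
    runcommand.invoke("kubectl delete clusterrolebinding " + name)
```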


if "not found" in status:
return False

if status == "Pass":

Collaborator

It would be nice, in the future, to also take the state of the Litmus probes into consideration in the results.
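
A minimal sketch of what that could look like. The ChaosResult status fields vary across Litmus releases, so the jsonpath here is an assumption to verify before use:

```python
def get_probe_status(result_name, namespace):
    # Assumption: Litmus records per-probe results in the ChaosResult status;
    # verify the exact field against `kubectl get chaosresult <name> -o yaml`
    # for the Litmus version in use.
    return runcommand.invoke(
        "kubectl get chaosresult " + result_name + " -n " + str(namespace)
        + " -o jsonpath='{.status.probeStatus}'"
    ).strip()
```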

paigerube14 force-pushed the litmus_1 branch 3 times, most recently from 43c5f57 to 8fd4106 on October 30, 2020.
There are 3 custom resources that are created during each Litmus scenario. Below is a description of the resources:
* ChaosEngine: A resource to link a Kubernetes application or Kubernetes node to a ChaosExperiment. ChaosEngine is watched by Litmus' Chaos-Operator, which then invokes Chaos-Experiments.
* ChaosExperiment: A resource to group the configuration parameters of a chaos experiment. ChaosExperiment CRs are created by the operator when experiments are invoked by ChaosEngine.
* ChaosResult: A resource to hold the results of a chaos experiment. The Chaos-Exporter reads the results and exports the metrics into a configured Prometheus server.
Collaborator

Thoughts on using the in-cluster Prometheus by default to ship the metrics in the case of OpenShift?

Collaborator Author

I think this is built into Litmus, so it would have to be on their side? I would need some clarification on this part.

```yaml
- time_scenarios:          # List of chaos time scenarios to load
    - scenarios/time_scenarios_example.yml
- litmus_scenarios:
    - - https://hub.litmuschaos.io/api/chaos/1.9.1?file=charts/generic/node-cpu-hog/rbac.yaml
```
Collaborator

Maybe consume the rbac file local to the scenarios dir, to be able to modify the namespace? Thoughts?

Collaborator Author

I'm not sure which is better here. It might be good to keep it as the GitHub link so it stays up to date, but I could see it both ways.

Collaborator

Sorry for the late comment. The GitHub link has the disadvantage of not working in disconnected clusters. That's probably not a common scenario and not a blocker for this PR. But does using the link make the node_hog_rbac.yaml file superfluous? Or should we keep it around?

@paigerube14 (Collaborator Author)

I think I have updated this PR to address all comments. @jordigilh, are you able to take another look?

@jordigilh (Collaborator)

LGTM

Thanks!

/Jordi


```
@@ -0,0 +1,37 @@
---
```
Collaborator

Is this file still needed since the config is now using the URL?

Collaborator Author

Should we keep the experiment file (which I had previously deleted from this PR) and this file for the disconnected-clusters situation you mentioned in your previous comment? Or is it better to just call out in the documentation that you'll have to download and use the files for whatever version you want, so they don't become badly out of date?

Collaborator

That was my confusion: I needed to download the experiment file and add it to config.yaml (it is gone from there too).

Our sample config.yaml should probably work OOTB and include a local experiment file and a local engine file, since those are the most likely to be updated. If not, we need to include instructions in the Litmus README for downloading scenarios and adding them to the config. My opinion, anyway. @jordigilh @chaitanyaenr, other opinions?

Collaborator Author

From my understanding, the code will install all experiments based on the version you specify in the overarching config file, so there is no need to do the wget or keep the experiment files locally.
Setting TARGET_NODES in the ChaosEngine file should set an environment variable that configures the specific experiment with the given target node name.

I'm working on validating this and updating the Litmus documentation file to be a little clearer on the steps needed.
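
A minimal sketch of how that per-run substitution could look. Kraken's actual mechanism may differ; the yaml-patching approach and the single-experiment indexing are assumptions:

```python
import yaml  # PyYAML

def set_target_node(engine_path, node_name):
    """Set TARGET_NODES in a ChaosEngine file before applying it.

    Assumes the single-experiment layout of the node-cpu-hog engine;
    index differently for engines that run multiple experiments.
    """
    with open(engine_path) as f:
        engine = yaml.safe_load(f)
    env = engine["spec"]["experiments"][0]["spec"]["components"]["env"]
    for var in env:
        if var["name"] == "TARGET_NODES":
            var["value"] = node_name
    with open(engine_path, "w") as f:
        yaml.safe_dump(engine, f)
```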

Collaborator

Yes, you can keep this locally for disconnected clusters, but then you'll have to make sure that the spec matches the Litmus Go runner version, particularly when you upgrade Litmus.
You only need to apply the ChaosEngine that will be used for this particular experiment, and discard the others.

Collaborator Author

Removed the rbac file for the node_hog scenario.

@jordigilh (Collaborator), Dec 2, 2020

I think you'll need the rbac file to define the service account, the role, and the role binding. Unless you are handling that elsewhere :)

Collaborator Author

I meant that I removed the downloaded rbac file from the scenarios folder. I'm still using the hub.litmus link in the config to set the service account, role, and role binding.

Collaborator

ok sorry, my bad :)

```yaml
  value: '60'

# ENTER THE NAME OF THE APPLICATION NODE
- name: APP_NODE
```
Collaborator

I think this env var needs to be TARGET_NODES (see https://docs.litmuschaos.io/docs/node-cpu-hog/#prepare-chaosengine). @jordigilh, can you confirm? Even when I use TARGET_NODES, though, it is not honored and the load runs on a different node.

Collaborator Author

It looks like the key was updated between versions 1.9 and 1.10. The 1.9 documentation still shows APP_NODES (https://docs.litmuschaos.io/docs/1.9.0/node-cpu-hog/).

It seems like the versions are updated quite often. Do we want to just pick a version and keep using it until we have a need to update? Should we use 1.10 now, since it's the latest?

Collaborator

@paigerube14 confirmed it should be named TARGET_NODES.
They also changed the structure of the probes in the schema. Since you don't use probes in this case, you should be good to move to 1.10.0.

Collaborator Author

Updated config file to use the 1.10.0 versions.

@mffiedler (Collaborator)

@paigerube14 LGTM. Let's squash and merge, and iterate in subsequent PRs/issues.
