
Litmus integration with node cpu hog experiment #33

Merged (1 commit, Dec 3, 2020)

Conversation

@paigerube14 (Collaborator)

This PR is for #31

With each kraken run, this will install Litmus, apply any configured experiments, check their status, and then tear Litmus down.

The scenario I was able to run, and whose files I am adding here, is the node CPU hog. For this scenario you have to specify the node whose CPU you want to hog in both scenarios/node_hog.yaml and scenarios/node_hog_engine.yaml.

Other scenarios that we can add can be found here:
https://hub.litmuschaos.io/
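
For context, a minimal sketch of the per-run flow described above. Only delete_experiments() and uninstall_litmus() appear in this PR's diff; install_litmus(), deploy_all_experiments(), and check_experiment() are hypothetical names used for illustration:

```python
from kraken.litmus import common_litmus

def run_litmus_scenarios(litmus_scenarios, litmus_namespaces, version="1.10.0"):
    common_litmus.install_litmus(version)                 # placeholder: deploy the Litmus operator
    for scenario in litmus_scenarios:
        common_litmus.deploy_all_experiments(scenario)    # placeholder: kubectl apply the scenario CRs
        if not common_litmus.check_experiment(scenario):  # placeholder: poll the ChaosResult
            raise RuntimeError("Litmus experiment failed: %s" % scenario)
    # Tear-down, as in this PR's diff:
    for namespace in litmus_namespaces:
        common_litmus.delete_experiments(namespace)
    common_litmus.uninstall_litmus()
```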

@rht-perf-ci

Can one of the admins verify this patch?

```python
(engine_name, experiment_name, namespace))
status = chaos_result.split(": ")[1].strip()
while status == "Awaited":
    logging.info("Waiting for chaos result to finish waiting, sleeping 10 seconds")
```
Collaborator

Not a big deal, but this is chatty on the console for longer tests; we get this message repeatedly. Maybe something to follow up on as we see how other Litmus scenarios run in kraken.
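
A minimal sketch of one way to quiet this loop. The counter-based throttle is an illustration, not this PR's implementation, and get_status stands in for however the status line is actually fetched:

```python
import logging
import time

def wait_for_chaos_result(get_status, interval=10, log_every=6):
    """Poll until the ChaosResult leaves the 'Awaited' state.

    Logs on the first poll and then only every `log_every` polls
    (once a minute with the defaults) to keep the console quiet.
    """
    polls = 0
    status = get_status()
    while status == "Awaited":
        if polls % log_every == 0:
            logging.info("Chaos result still Awaited, checking every %s seconds", interval)
        polls += 1
        time.sleep(interval)
        status = get_status()
    return status
```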

run_kraken.py

```python
if litmus_scenarios:
    for namespace in litmus_namespaces:
        common_litmus.delete_experiments(namespace)
    common_litmus.uninstall_litmus()
```
Collaborator

Can we catch SIGINT (ctrl-c) and clean up Litmus? I think we're doing that in Cerberus, possibly for Slack. This would help a) when things don't go well with a scenario and it is stuck, and b) when running in daemon mode. We could do this as a follow-on PR; it's not a must to merge this one.
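
A minimal sketch of such a handler using Python's signal module. The cleanup calls mirror the tear-down code above; wiring this into run_kraken.py, and the import path, are assumptions:

```python
import logging
import signal
import sys

from kraken.litmus import common_litmus  # assumed import path

def install_sigint_cleanup(litmus_namespaces):
    """Register a ctrl-c handler that tears Litmus down before exiting."""
    def handler(signum, frame):
        logging.info("SIGINT received, cleaning up Litmus before exit")
        for namespace in litmus_namespaces:
            common_litmus.delete_experiments(namespace)
        common_litmus.uninstall_litmus()
        sys.exit(1)
    signal.signal(signal.SIGINT, handler)
```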

Collaborator Author

Yes, I definitely think this would be nice to add. I might do it separately from this PR just to be able to get this one in quickly.

kraken/litmus/common_litmus.py

```python
def delete_experiments(namespace):
    runcommand.invoke("kubectl delete chaosengine --all -n " + str(namespace))
    runcommand.invoke("kubectl delete chaosexperiment --all -n " + str(namespace))
    runcommand.invoke("kubectl delete chaosresult --all -n " + str(namespace))
```

Collaborator

I think you forgot to clean up the service accounts here.
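
A minimal sketch of the missing cleanup, in the same style as delete_experiments above. The resource names are assumptions (the node-cpu-hog rbac.yaml on the Litmus hub creates a service account and a cluster role/binding sharing one name); adjust per experiment:

```python
def delete_experiment_rbac(namespace, name="node-cpu-hog-sa"):
    # Assumed names from the experiment's rbac.yaml; uses kraken's existing
    # runcommand helper like the rest of this module.
    runcommand.invoke("kubectl delete serviceaccount " + name + " -n " + str(namespace))
    runcommand.invoke("kubectl delete clusterrole " + name)
    runcommand.invoke("kubectl delete clusterrolebinding " + name)
```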


if "not found" in status:
return False

if status == "Pass":

Collaborator

It would be nice, in the future, to also take the state of the Litmus probes into consideration in the results.
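
A minimal sketch of what that could look like. The ChaosResult status fields vary across Litmus releases, so the jsonpath here is an assumption to verify before use:

```python
def get_probe_status(result_name, namespace):
    # Assumption: Litmus records per-probe results in the ChaosResult status;
    # verify the exact field against `kubectl get chaosresult <name> -o yaml`
    # for the Litmus version in use.
    return runcommand.invoke(
        "kubectl get chaosresult " + result_name + " -n " + str(namespace)
        + " -o jsonpath='{.status.probeStatus}'"
    ).strip()
```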

paigerube14 force-pushed the litmus_1 branch 3 times, most recently from 43c5f57 to 8fd4106 on October 30, 2020.
There are 3 custom resources that are created during each Litmus scenario. Below is a description of the resources:
* ChaosEngine: A resource to link a Kubernetes application or Kubernetes node to a ChaosExperiment. ChaosEngine is watched by Litmus' Chaos-Operator, which then invokes Chaos-Experiments.
* ChaosExperiment: A resource to group the configuration parameters of a chaos experiment. ChaosExperiment CRs are created by the operator when experiments are invoked by ChaosEngine.
* ChaosResult: A resource to hold the results of a chaos experiment. The Chaos-Exporter reads the results and exports the metrics into a configured Prometheus server.
Collaborator

Thoughts on using the in-cluster Prometheus by default to ship the metrics in the case of OpenShift?

Collaborator Author

I think this is built into Litmus, so it would have to be on their side? I would need some clarification on this part.

```yaml
- time_scenarios:          # List of chaos time scenarios to load
    - scenarios/time_scenarios_example.yml
- litmus_scenarios:
    - - https://hub.litmuschaos.io/api/chaos/1.9.1?file=charts/generic/node-cpu-hog/rbac.yaml
```
Collaborator

Maybe consume the rbac file local to the scenarios dir, to be able to modify the namespace? Thoughts?

Collaborator Author

I'm not sure which is better here. It might be good to keep it as the GitHub link so it stays up to date, but I could see it both ways.

Collaborator

Sorry for the late comment. The GitHub link has the disadvantage of not working in disconnected clusters. That's probably not a common scenario and not a blocker for this PR. But does using the link make the node_hog_rbac.yaml file superfluous? Or should we keep it around?

@paigerube14 (Collaborator Author)

I think I have updated this PR to address all comments. @jordigilh, are you able to take another look?

@jordigilh (Collaborator)

LGTM

Thanks!

/Jordi


```
@@ -0,0 +1,37 @@
---
```
Collaborator

Is this file still needed since the config is now using the URL?

Collaborator Author

Should we keep the experiment file (which I had previously deleted from this PR) and this file for the disconnected-clusters situation you mentioned in your previous comment? Or is it better to just call out in the documentation that you'll have to download and use the files for whatever version you want, so they don't become badly out of date?

Collaborator

That was my confusion: I needed to download the experiment file and add it to config.yaml (it is gone from there too).

Our sample config.yaml should probably work OOTB and include a local experiment file and a local engine file, since those are the most likely to be updated. If not, we need to include instructions in the Litmus README for downloading scenarios and adding them to the config. My opinion, anyway. @jordigilh @chaitanyaenr, other opinions?

Collaborator Author

From my understanding, the code will install all experiments based on the version you specify in the overarching config file, so there is no need to do the wget or keep the experiment files locally.
Setting TARGET_NODES in the ChaosEngine file should set an environment variable that configures the specific experiment with the given target node name.

I'm working on validating this and updating the Litmus documentation file to be a little clearer on the steps needed.
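
A minimal sketch of how that per-run substitution could look. Kraken's actual mechanism may differ; the yaml-patching approach and the single-experiment indexing are assumptions:

```python
import yaml  # PyYAML

def set_target_node(engine_path, node_name):
    """Set TARGET_NODES in a ChaosEngine file before applying it.

    Assumes the single-experiment layout of the node-cpu-hog engine;
    index differently for engines that run multiple experiments.
    """
    with open(engine_path) as f:
        engine = yaml.safe_load(f)
    env = engine["spec"]["experiments"][0]["spec"]["components"]["env"]
    for var in env:
        if var["name"] == "TARGET_NODES":
            var["value"] = node_name
    with open(engine_path, "w") as f:
        yaml.safe_dump(engine, f)
```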

Collaborator

Yes, you can keep this locally for disconnected clusters, but then you'll have to make sure that the spec matches the Litmus Go runner version, particularly when you upgrade Litmus.
You only need to apply the ChaosEngine that will be used for this particular experiment, and discard the others.

Collaborator Author

Removed the rbac file for the node_hog scenario.

@jordigilh (Collaborator), Dec 2, 2020

I think you'll need the rbac file to define the service account, the role, and the role binding. Unless you are handling that elsewhere :)

Collaborator Author

I meant that I removed the downloaded rbac file from the scenarios folder. I'm still using the hub.litmus link in the config to set the service account, role, and role binding.

Collaborator

ok sorry, my bad :)

```yaml
  value: '60'

# ENTER THE NAME OF THE APPLICATION NODE
- name: APP_NODE
```
Collaborator

I think this env var needs to be TARGET_NODES (see https://docs.litmuschaos.io/docs/node-cpu-hog/#prepare-chaosengine). @jordigilh, can you confirm? Even when I use TARGET_NODES, though, it is not honored and the load runs on a different node.

Collaborator Author

It looks like the key was updated between versions 1.9 and 1.10. The 1.9 documentation still shows APP_NODES (https://docs.litmuschaos.io/docs/1.9.0/node-cpu-hog/).

It seems like the versions are updated quite often. Do we want to just pick a version and keep using it until we have a need to update? Should we use 1.10 now, since it's the latest?

Collaborator

@paigerube14 confirmed it should be named TARGET_NODES.
They also changed the structure of the probes in the schema. Since you don't use probes in this case, you should be good to move to 1.10.0.

Collaborator Author

Updated config file to use the 1.10.0 versions.

@mffiedler (Collaborator)

@paigerube14 LGTM. Let's squash and merge, and iterate in subsequent PRs/issues.
