
Litmus attacks fail to work on OpenShift Cluster v4.3 #1538

Closed · Vijay5775 opened this issue Jun 8, 2020 · 15 comments
Labels: kind/bug, project/community (issues raised by community members)
@Vijay5775

I'm trying to run chaos tests using Litmus on OCP v4.3. I've followed all the steps outlined here -> https://docs.litmuschaos.io/docs/openshift-litmus/

The output of each command outlined in the above URL appeared as expected, but when I try to run an experiment, it fails. I've only tried the first experiment, 'container-kill', and unfortunately can't get the container killed.

Additionally, I have elevated the privileges of the service account I created ('container-kill-sa') by adding it to the 'anyuid' and 'privileged' SCCs, but still can't seem to crack it.

One thing I did notice is that it triggers a 'pumba-sig-kill' container to initiate the attack (as attached; the container I'm targeting is highlighted as well). If Litmus is based on Pumba, then I doubt it will work on OCP v4.3, as the Pumba developer has already confirmed that the code doesn't support CRI-O based runtime environments, which OCP v4.3 runs on (and which has been the case since OCP v3.7, when CRI-O became the runtime environment).

Can someone please confirm whether the above is the reason the attacks fail? If not, is there a way to get this resolved? Many thanks.

(screenshot attached)

P.S. I also can't figure out anything from the log (screenshot attached).

@ispeakc0de
Member

ispeakc0de commented Jun 9, 2020

Yes, you are right: Pumba does not support the CRI-O runtime, and we use the Pumba (Docker) LIB as the default. You can use a different LIB to run this container-kill experiment on containerd/CRI-O runtimes; internally it uses crictl. Please verify the socket path on your cluster nodes and modify the value accordingly (see the sketch below).
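
For example, on OCP 4.x you could confirm it with a node debug session. This is only a minimal sketch; the node name is a placeholder:

# open a debug shell on a worker node and switch to the host filesystem
oc debug node/<worker-node-name>
chroot /host

# check which runtime endpoint crictl is configured with
cat /etc/crictl.yaml

# confirm the CRI-O socket actually exists at that path
ls -l /var/run/crio/crio.sock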

We can override the values of LIB and LIB_IMAGE from the ChaosEngine CR (https://docs.litmuschaos.io/docs/container-kill/#supported-experiment-tunables).
I have prepared a sample ChaosEngine below; it may help:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  # It can be true/false
  annotationCheck: 'true'
  # It can be active/stop
  engineState: 'active'
  #ex. values: ns1:name=percona,ns2:run=nginx 
  auxiliaryAppInfo: ''
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: container-kill-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: 'delete' 
  experiments:
    - name: container-kill
      spec:
        components:
          env:
            # specify the name of the container to be killed
            - name: TARGET_CONTAINER
              value: 'nginx'

            # provide the chaos interval
            - name: CHAOS_INTERVAL
              value: '10'

            # provide the total chaos duration
            - name: TOTAL_CHAOS_DURATION
              value: '20'

            # provide the lib here
            - name: LIB
              value: 'containerd'

            # provide the lib image here
            - name: LIB_IMAGE
              value: 'litmuschaos/container-killer:latest'

            # provide the container runtime path for containerd
            # applicable only for containerd runtime
            - name: CONTAINER_PATH
              value: '/run/containerd/containerd.sock'
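
Once saved (say as chaosengine.yaml), applying it should kick off the run. Just a quick sketch, assuming the container-kill ChaosExperiment CR and RBAC from the docs are already installed in the target namespace:

# create the ChaosEngine and watch for the runner / experiment pods
kubectl apply -f chaosengine.yaml
kubectl get pods -n default -w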

@Vijay5775
Author

Thanks a lot Shubham.

To verify the socket path, do I issue something like the below? Thanks again.

# cat /etc/crictl.yaml
runtime-endpoint: /var/run/crio/crio.sock

@Vijay5775
Author

Hi Shubham,

Below is my chaosengine.yaml, but still no luck with the attack. Is there anything you can suggest? I'm unsure whether I need to change anything in it. Many thanks.


apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: articles-chaos
  namespace: cloud-native-starter
spec:
  # It can be true/false
  annotationCheck: 'true'
  # It can be active/stop
  engineState: 'active'
  #ex. values: ns1:name=percona,ns2:run=nginx 
  auxiliaryAppInfo: ''
  appinfo:
    appns: 'cloud-native-starter'
    applabel: 'app=articles'
    appkind: 'deployment'
  chaosServiceAccount: container-kill-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: 'delete' 
  experiments:
    - name: container-kill
      spec:
        components:
          env:
            # specify the name of the container to be killed
            - name: TARGET_CONTAINER
              value: 'articles'

            # provide the chaos interval
            - name: CHAOS_INTERVAL
              value: '10'

            # provide the total chaos duration
            - name: TOTAL_CHAOS_DURATION
              value: '20'

            # provide the lib here
            - name: LIB
              value: 'containerd'

            # provide the lib image here
            - name: LIB_IMAGE
              value: 'litmuschaos/container-killer:latest'

            # provide the container runtime path for containerd
            # applicable only for containerd runtime
            - name: CONTAINER_PATH
              value: '/var/run/crio/crio.sock'

@ksatchit
Member

cc: @gprasath @ispeakc0de

@ispeakc0de
Member

Hi @Vijay5775

Can you please share the following information so we can understand the problem? A sketch of the commands to pull these logs is below.

  • Logs of the containerd helper pod (i.e., containerd-chaos-xxxxxx), if it has already been created; please also verify the path of the sock file /var/run/crio/crio.sock and the contents of /etc/crictl.yaml.
  • If the helper pod is not created, provide the logs of the container-kill pod (i.e., container-kill-xxxx).
  • If the container-kill pod is not created either, provide the logs of the chaos-operator.
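
Something along these lines should pull them (the pod names are placeholders; adjust the namespaces to wherever the application and the operator are running):

# helper pod, if it was created
kubectl logs <containerd-chaos-xxxxxx> -n <app-namespace>

# experiment job pod
kubectl logs <container-kill-xxxx> -n <app-namespace>

# chaos-operator, if neither of the above exists
kubectl logs <chaos-operator-pod> -n litmus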

@Vijay5775
Author

As requested,

#1 containerd-chaos-sdjnfb

Observation: The log window just shows 'Hello!' and the pod continues to show a status of 'running'. It doesn't exit even after the test completes unless I delete it manually. I've attached the log file pulled from the OpenShift console (again, it just contains the text 'Hello!').

Pending: I will update the sock file (/var/run/crio/crio.sock) and /etc/crictl.yaml details, possibly by noon today.

#2 container-kill-0brnoh-6fp2b

Observation:

I can see this message on line 133 of the log, and in subsequent lines:

"level=fatal msg=\"failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded\"\ncommand terminated with exit code 1", "stderr_lines": ["time=\"2020-06-11T19:39:24Z\" level=fatal msg=\"failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded\"", "command terminated with exit code 1""

I've attached the log for you to review and advise whether the issue is due to the above (I've masked only the IP as 10.xx.xx.xx; the rest is as-is from the OpenShift console).
P.S. I did add the service account created as part of the procedure to the privileged SCC; I'm unsure whether I need to elevate anything else here. Is the message above suggesting that whoever runs "oc apply -f chaosengine.yaml" has to have 'cluster-admin' privileges? In a real-life scenario, a normal tester won't be able to get such privileges to perform these tests. Please suggest. Thanks.
container-kill-0brnoh-6fp2b-container-kill-0brnoh.log
containerd-chaos-sdjnfb-containerd-chaos.log
articles-chaos-runner-chaos-runner.log

#3 articles-chaos-runner-chaos-runner
Log attached as-is from the OpenShift Console for reference.

@Vijay5775
Author

The pending details are below, thanks.

/usr/libexec/crio/conmon -s -c f25882e0f18f6260555c2734f468ad9ac423ad4c775ca299c0bdabb2fe9a9fbf -n k8s_prometheus_prometheus-k8s-0_openshift-monitoring_b6646d09-ae9e-4182-bd05-774b10332028_1 -u f25882e0f18f6260555c2734f468ad9ac423ad4c775ca299c0bdabb2fe9a9fbf -r /usr/bin/runc -b /var/data/crioruntimestorage/overlay-containers/f25882e0f18f6260555c2734f468ad9ac423ad4c775ca299c0bdabb2fe9a9fbf/userdata --persist-dir /var/data/criorootstorage/overlay-containers/f25882e0f18f6260555c2734f468ad9ac423ad4c775ca299c0bdabb2fe9a9fbf/userdata -p /var/data/crioruntimestorage/overlay-containers/f25882e0f18f6260555c2734f468ad9ac423ad4c775ca299c0bdabb2fe9a9fbf/userdata/pidfile -P /var/data/crioruntimestorage/overlay-containers/f25882e0f18f6260555c2734f468ad9ac423ad4c775ca299c0bdabb2fe9a9fbf/userdata/conmon-pidfile -l /var/log/pods/openshift-monitoring_prometheus-k8s-0_b6646d09-ae9e-4182-bd05-774b10332028/prometheus/1.log --exit-dir /var/run/crio/exits --socket-dir-path /var/run/crio --log-level error --runtime-arg --root=/run/runc

[root@kube-bqv4q0ls0pvtrrksmdug-dciglibga15-default-000003a3 crio]# pwd
/var/run/crio
[root@kube-bqv4q0ls0pvtrrksmdug-dciglibga15-default-000003a3 crio]# ls | grep sock
crio.sock
[root@kube-bqv4q0ls0pvtrrksmdug-dciglibga15-default-000003a3 crio]#
[root@kube-bqv4q0ls0pvtrrksmdug-dciglibga15-default-000003a3 crio]# more /etc/crictl.yaml
runtime-endpoint: unix:///var/run/crio/crio.sock

I'd appreciate it if you could take a look and advise on a fix. Many thanks.

@ispeakc0de
Member

ispeakc0de commented Jun 13, 2020

Hi @Vijay5775

Thanks for providing all the information. I have made a few modifications to the container-kill experiment. It would be great if you could try the following experiment CR (modified for your use case). Please let me know if you still face any problems running this experiment.

apiVersion: litmuschaos.io/v1alpha1
description:
  message: "Kills a container belonging to an application pod \n"
kind: ChaosExperiment
metadata:
  name: container-kill
  version: 0.1.16
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups:
          - ""
          - "apps"
          - "batch"
          - "litmuschaos.io"
        resources:
          - "jobs"
          - "pods"
          - "pods/log"
          - "events"
          - "pods/exec"
          - "chaosengines"
          - "chaosexperiments"
          - "chaosresults"
        verbs:
          - "create"
          - "list"
          - "get"
          - "update"
          - "patch"
          - "delete"
    image: "litmuschaos/ansible-runner:ci"
    args:
    - -c
    - ansible-playbook ./experiments/generic/container_kill/container_kill_ansible_logic.yml -i /etc/ansible/hosts -vv; exit 0
    command:
    - /bin/bash
    env:

    - name: ANSIBLE_STDOUT_CALLBACK
      value: 'default'

    - name: TARGET_CONTAINER
      value: 'articles'

    # Period to wait before injection of chaos in sec
    - name: RAMP_TIME
      value: ''

    # It supports pumba and containerd 
    - name: LIB
      value: 'containerd'

    # provide the chaos interval
    - name: CHAOS_INTERVAL
      value: '10'

    # provide the total chaos duration
    - name: TOTAL_CHAOS_DURATION
      value: '20'

    - name: CONTAINER_PATH
      value: '/var/run/crio/crio.sock'
    # LIB_IMAGE can be - gaiaadm/pumba:0.6.5, litmuschaos/container-kill-helper:latest
    # For pumba image use: gaiaadm/pumba:0.6.5
    # For containerd/crio image use: litmuschaos/container-kill-helper:latest
    - name: LIB_IMAGE  
      value: 'litmuschaos/container-kill-helper:latest' 

    labels:
      name: container-kill
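
If the older container-kill experiment CR is already present on the cluster, re-creating it is the simplest way to pick up the change. A quick sketch, assuming the CR above is saved as container_kill_experiment.yaml and lives in the application namespace:

# replace the existing experiment definition with the modified one
kubectl delete chaosexperiment container-kill -n cloud-native-starter
kubectl apply -f container_kill_experiment.yaml -n cloud-native-starter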

@Vijay5775
Author

Vijay5775 commented Jun 14, 2020

Hi @ispeakc0de

Thanks for your prompt response and support.

After applying the new 'container-kill' experiment supplied above, I did run into errors initially, but I was able to spot that the previous chaosengine.yaml referenced LIB_IMAGE as "litmuschaos/container-killer:latest". I changed this to "litmuschaos/container-kill-helper:latest", as in the info you supplied, and all is good now. I'm able to execute successfully and the test has finally passed :)

Results as below,

1st iteration: (screenshot attached)

2nd iteration: (screenshot attached)

Test summary: Passed (screenshot attached)

A few things I need your help with, please:

  1. I noticed that it targets the same container within the 'articles' application in both iterations. Is there a randomness option so that it picks a different container of the 'articles' application in each iteration, or is there some other option?

  2. Is there a separate LIB_IMAGE I should be using for the other experiments (listed below)?

Pod Delete
Pod Network Latency
Pod Network Loss
Pod Network Corruption
Pod CPU Hog
Pod Memory Hog
Disk Fill
Disk Loss
Node CPU Hog
Node Memory Hog
Node Drain

And thanks again for the excellent support provided. Much appreciated, cheers.

@Vijay5775
Author

Hi @ispeakc0de

The other thing I noticed is that the container-kill experiment only runs successfully the first time. Once I change the target container details within the same yaml files (chaosengine.yaml & container_kill_experiment.yaml) and attempt to run the test again to kill a different container, the test doesn't appear to run, nor does it schedule any new container-kill-xxxx pods. (It also doesn't work if I target the first container again, even though that run previously succeeded.)

I wasn't able to debug this much further, but could see the below:

(screenshot attached)

Can you please help advise? Many thanks.

@ispeakc0de
Member

ispeakc0de commented Jun 17, 2020

Yes, the experiment docs have a Supported Experiment Tunables section which lists any additional LIB_IMAGE or env values where required. Adding the pod-delete link for reference.
NOTE: Conflicts such as env mismatches can be avoided by using a versioned litmus image, i.e. litmuschaos/ansible-runner:1.5.0, and pulling the corresponding chart from the chart hub.
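
As a rough illustration, pinning the runner just means setting a released tag in the experiment CR (1.5.0 here is only an example; use the tag that matches the chart you pull):

spec:
  definition:
    # pin the runner to a released tag instead of ci/latest
    image: "litmuschaos/ansible-runner:1.5.0"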

@ispeakc0de
Member

ispeakc0de commented Jun 17, 2020

There is a one-to-one mapping between the chaos experiment and the chaos engine. If you modify the chaos experiment (make some changes in its spec and recreate it), the corresponding chaos engine cannot consume the new changes, as it has already created its resources against the older one.

To pick up the new changes we have to re-create the corresponding chaos engine as well, so that it points to the newer version of the chaos experiment.
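
In practice that just means deleting and re-applying the engine after the experiment CR changes. A minimal sketch, using the names and namespace from your earlier manifests:

# drop the engine that was created against the old experiment spec
kubectl delete chaosengine articles-chaos -n cloud-native-starter

# re-apply the updated experiment, then a fresh engine pointing at it
kubectl apply -f container_kill_experiment.yaml -n cloud-native-starter
kubectl apply -f chaosengine.yaml -n cloud-native-starter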

From the provided screenshot I can see a Summary event, created by the container-kill-xxxx-xxxx pod, stating that the experiment Failed. If possible, can you please share the logs of that container-kill-xxxx-xxxx pod? It may help to find the reason for the failure.

@Vijay5775
Author

Hi @ispeakc0de,

I did try recreating fresh yamls for the chaos experiment and chaos engine.

#1 It doesn't appear to initiate or trigger any attacks, nor schedule any container-kill-xxxx-xxxx pods.
#2 I'm only able to see the attached 'Unable to get chaosengine' message on the OC portal and console.

I'd appreciate it if you could help advise where the problem is. Thanks.

(screenshots attached)

I've attached the yamls below for reference:
1st Iteration:

chaoseng1.txt
container_kill_exp1.txt

2nd Iteration:

chaoseng2.txt
container_kill_exp2.txt

Regards,
Vijay

@ispeakc0de
Member

Hi @Vijay5775 ,

I apologize for the late response. Is your application annotated? If not, please annotate the application first: in your chaosengine manifest the spec.annotationCheck attribute is true, so the application must be annotated before you apply the chaosengine.

kubectl annotate deploymentconfig <deployment-configname> litmuschaos.io/chaos="true" -n satoghos-in

If you don't want to annotate the application, the alternative is to change the value of spec.annotationCheck to false in the chaosengine, as in the snippet below.
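
For reference, the relevant part of the ChaosEngine would then look like this:

spec:
  # skip the annotation check so un-annotated apps can still be targeted
  annotationCheck: 'false'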

@ksatchit
Member

ksatchit commented Sep 16, 2020

Support for the containerd & CRI-O runtimes has been enhanced in the latest 1.8.0 release (ref: release notes). Also supported is the ability to inject chaos without defining chaos annotations on the target deployment (via .spec.annotationCheck: "false"). I will be closing this issue on that account. Please re-open if the issues are still observed.
