
feat(chaos): add kubelet-restart chaos test #890

Merged: 4 commits into master, Sep 15, 2020

Conversation

@harshshekhar15 (Contributor) commented Sep 8, 2020

Signed-off-by: Harsh Shekhar harsh.shekhar@mayadata.io

This PR intends to do the following:

  • Add Kubera chaos test - TCID-KUBELET-RESTART.
  • Add Docker installation to the Dockerfile.

Exact application name that is under test.

Storage engine that is under test

OpenEBS version if required.

Assumptions of this PR

Notes to reviewer.

Anything else we need to know?

Versions:

  • Kubernetes version:
  • Kubernetes platform:
  • kubectl version:

# Restart Kubelet container present on the node
- name: Restarting Kubelet container in Rancher
  shell: docker restart kubelet

Contributor

It needs to be restarted on the node where the pods are scheduled.

Contributor

  • If restarted, it will be recreated immediately. Can we instead stop the kubelet container and wait until the node gets into NotReady?
  • The pods on the respective node should be rescheduled onto some other node if resources are available.
  • Once things are set, you can start the kubelet container back on the node (see the sketch below).
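A minimal Ansible sketch of that stop/wait/start flow, assuming the target node name is available in a variable (app_node here is hypothetical, and the retry values are illustrative, not part of this PR):

- name: Stopping Kubelet container on the target node
  shell: docker stop kubelet

- name: Waiting until the node goes into NotReady state
  shell: kubectl get node {{ app_node }} --no-headers
  register: nodeStatus
  until: "'NotReady' in nodeStatus.stdout"
  retries: 30
  delay: 10

- name: Starting Kubelet container back on the node
  shell: docker start kubelet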

Contributor

To do this, we need root access, right?

Contributor Author

@gprasath it will restart the docker on the node on which the pod is scheduled. We have made use of Docker-out-of-Docker for this: we have mounted the host's Docker socket into this pod.
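For context, a minimal sketch of how the host's Docker socket can be mounted into the chaos pod; the container, image, and volume names here are illustrative, not taken from this PR:

spec:
  containers:
  - name: kubelet-restart          # illustrative container name
    image: oep-chaos-runner:ci     # illustrative image
    volumeMounts:
    - name: docker-socket
      mountPath: /var/run/docker.sock
  volumes:
  - name: docker-socket
    hostPath:
      path: /var/run/docker.sock
      type: File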

Contributor Author

@gprasath what will happen to the StatefulSets scheduled on that node if we stop the kubelet until the node is in NotReady state?

Contributor Author

Yes, we need root access, so we are running the container with securityContext set to privileged: true.
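For reference, a minimal sketch of that securityContext on the chaos container (the container name is illustrative):

containers:
- name: kubelet-restart
  securityContext:
    privileged: true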

Contributor

> @gprasath what will happen to the StatefulSets scheduled on that node if we stop the kubelet until the node is in NotReady state?

That StatefulSet replica pod will be in Pending state if the node is not ready.

Contributor

> @gprasath it will restart the docker on the node on which the pod is scheduled. We have made use of Docker-out-of-Docker for this: we have mounted the host's Docker socket into this pod.

Restart is instantaneous. There won't be any considerable impact.

Contributor Author

@gprasath I think I was not clear -- the above command will restart the kubelet container running on the host node itself.

# Task will fail if any of the pods is not in the 'Running' phase
- name: Checking Kubera pods status
  shell: kubectl get pods -n {{ kuberaNamespace }} --field-selector=status.phase!=Running --no-headers
  register: podStatus

Contributor

When restarting, the pods will not enter a not-Running state, as the kubelet comes back online immediately.

Contributor Author

@gprasath we are verifying that Kubera is working fine after the restart.
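For context, a minimal sketch of how such a post-restart verification could be written with a retry loop; the retries and delay values are illustrative and not taken from this PR:

- name: Verifying Kubera pods are Running after the kubelet restart
  shell: kubectl get pods -n {{ kuberaNamespace }} --field-selector=status.phase!=Running --no-headers
  register: podStatus
  until: podStatus.stdout == ""
  retries: 30
  delay: 10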

Contributor

Restart won't have any impact.

Contributor

The default pod eviction timeout is 5 minutes. There will be an impact on container status only if that timeout is exceeded @harshshekhar15
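For reference, on clusters where it is configurable, this timeout corresponds to the kube-controller-manager flag, shown here with its default value:

--pod-eviction-timeout=5m0s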


How do we ensure the Kubera pods and the restart job are running on the same node?

Contributor Author

@AmitKumarDas some of the Kubera pods will be present on whatever node the job's pod is scheduled on.

    path: /var/run/docker.sock
    type: File
imagePullSecrets:
- name: oep-secret

Contributor

Where do we create this secret? Can we add more info about this?

@harshshekhar15 (Contributor Author) Sep 9, 2020

@gprasath we are creating this secret after installing Kubera, as a part of the litmus prerequisites; ref: https://github.com/mayadata-io/oep-e2e/blob/master/litmus/prerequisite/docker-secret.yml
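For reference, an image pull secret like this is typically created with something along these lines; the namespace and credential values below are placeholders, not taken from the linked file:

kubectl create secret docker-registry oep-secret \
  --namespace <kubera-namespace> \
  --docker-server <registry-url> \
  --docker-username <username> \
  --docker-password <password>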

Contributor

Add a comment in the places where we use this secret.

@gprasath (Contributor) commented Sep 9, 2020

And can you add a README for this scenario, describing the procedure?

@amitbhatt818 (Contributor) left a comment

/lgtm

apiVersion: batch/v1
kind: Job
metadata:
  generateName: kubelet-restart-


Do we restart the kubelet of specific nodes?
We may want to target those nodes where the specific app is running.
Otherwise this might result in flakiness.
Please show otherwise if this will not result in flakiness.

Contributor Author

@AmitKumarDas no, we are not specifying any node; on whichever node the job's pod is scheduled, it will restart docker on that host node. This will not result in flakiness, as a kubelet restart does not have much effect on the Kubera components; it is just that the pods of any StatefulSet scheduled on that node will go into NotReady state and will be back to Ready state as soon as the kubelet starts running on that node again.

  when: platform == "RANCHER"

- name: Printing the status of nodes of the cluster
  shell: kubectl get nodes -o wide

Contributor

Can we check the status of the node where the chaos is injected?
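A possible way to check just that node, assuming the target node name has been captured earlier into a variable (app_node here is hypothetical):

- name: Printing the status of the node where the chaos was injected
  shell: kubectl get node {{ app_node }} -o wide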

@gprasath (Contributor)

Enhancement to this test: #892

@gprasath merged commit 2abfb6a into master on Sep 15, 2020.
@harshshekhar15 deleted the add-kubelet-restart-test branch on September 15, 2020 at 10:38.