Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manager, Reduce toleration in case of node reboot #374

Conversation

RamLavi
Copy link
Member

@RamLavi RamLavi commented Nov 2, 2022

What this PR does / why we need it:
Currently if the node upon which KMP-manager deployment runs reboots,
the KMP pod is evicted only 5 minutes (default) after the node is unreachable.
Since KMP-manager deployment is is essential for VM create/update
service-ability, reducing this eviction to 1 minute. This is done by
altering the tolerations config [0]

This means that if the node is not available/ready for more than 1
minute, the pod will evict and will try to reschedule using the
configured affinity.

[0] https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions

Special notes for your reviewer:

Release note:

Reduce Kubemacpool-manager node eviction timeout

@RamLavi
Copy link
Member Author

RamLavi commented Nov 2, 2022

Since kubevirtci does not support multiple master clusters, there is no option to check this behavior on CI, so I checked this PR on U/S cluster manually by

  1. deploying KMP on a 2 node cluster
  2. moving KMP to node X
  3. on docker pausing node X container
  4. waiting until node X status is set to NotReady
  5. counting 1 minutes and expecting the KMP pod to evict to the other node

@RamLavi
Copy link
Member Author

RamLavi commented Nov 2, 2022

/retest

@RamLavi
Copy link
Member Author

RamLavi commented Nov 3, 2022

/cc @ormergi

Copy link

@orelmisan orelmisan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @RamLavi, please consider adding a reference to the relevant K8s documentation to the PR description:
https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions

Currently if the node upon which KMP-manager deployment runs reboots,
the KMP pod is evicted only 5 minutes (default) after the node is unreachable.
Since KMP-manager deployment is is essential for VM create/update
service-ability, reducing this eviction to 1 minute. This is done by
altering the tolerations config [0]

This means that if the node is not available/ready for more than 1
minute, the pod will evict and will try to reschedule using the
configured affinity.

[0] https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions

Signed-off-by: Ram Lavi <ralavi@redhat.com>
@RamLavi RamLavi force-pushed the reduce_kmp_node_eviction_time branch from 4de1d87 to 85cf418 Compare November 3, 2022 12:57
@RamLavi
Copy link
Member Author

RamLavi commented Nov 3, 2022

Hi @RamLavi, please consider adding a reference to the relevant K8s documentation to the PR description: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions

Done

@kubevirt-bot
Copy link
Collaborator

@orelmisan: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@RamLavi
Copy link
Member Author

RamLavi commented Nov 3, 2022

failed test errored on BadRequest from the server, and it doesn't seem related. I want to rerun the lane a few times to be sure
/test pull-kubemacpool-e2e-k8s

Copy link
Member

@qinqon qinqon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some e2e test ? It's a important feature

Copy link
Member

@qinqon qinqon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@kubevirt-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qinqon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot merged commit f559c07 into k8snetworkplumbingwg:main Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants