New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1933847: enable hard affinity + PodDisruptionBudget for Prometheus and Thanos Ruler pods #1341
Bug 1933847: enable hard affinity + PodDisruptionBudget for Prometheus and Thanos Ruler pods #1341
Conversation
@simonpasquier: An error was encountered querying GitHub for users with public email (juzhao@redhat.com) for bug 1933847 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details. Full error message.
non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits\",\n \"message\": \"You have exceeded a secondary rate limit. Please wait a few minutes before you try again.\"\n}\n"
Please contact an administrator to resolve this issue, then request a bug refresh with In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
2e997dc
to
4f9c446
Compare
/retitle Bug 1933847: enable hard affinity + PodDisruptionBudget for Prometheus and Thanos Ruler pods |
/test e2e-agnostic |
1 similar comment
/test e2e-agnostic |
@simonpasquier: This pull request references Bugzilla bug 1933847, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh The requirements for Bugzilla bugs have changed, recalculating validity. |
@openshift-merge-robot: This pull request references Bugzilla bug 1933847, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
4f9c446
to
1f7ca91
Compare
1f7ca91
to
20277d9
Compare
/bugzilla refresh |
@simonpasquier: This pull request references Bugzilla bug 1933847, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
20277d9
to
0804599
Compare
/assign @jan--f |
This change introduces back hard pod anti-affinity rules and pod disruption budgets for Prometheus (both k8s and user-workload) as well as Thanos Ruler. The cluster monitoring operator updates the `Upgradeable` condition to false when it detects that the pods aren't correctly spread to ensure that an upgrade only happens in safe configurations. This change fixes a couple of bugs: * https://bugzilla.redhat.com/show_bug.cgi?id=1933847 * https://bugzilla.redhat.com/show_bug.cgi?id=1949262 * https://bugzilla.redhat.com/show_bug.cgi?id=1955490 It also removes the TestRebalanceWorkloads end-to-end tests because on a fresh cluster with hard anti-affinity rules, it isn't possible anymore to trigger a situation where both pods would be scheduled on the same node. Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
0804599
to
fc854a5
Compare
This looks good, I have one question though: in #1330 we actively balance prometheus and alertmanager. Here we set hard affinity settings for prometheus and thanos ruler. Should alertmanager also get hard affinity settings? |
@jan--f yes we have https://bugzilla.redhat.com/show_bug.cgi?id=1955489 to track this but it implies reducing the number of replicas to 2 and using the |
@simonpasquier: This pull request references Bugzilla bug 1933847, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
While #1472 has merged in 4.9 a couple of hours, we'll keep on holding this PR a bit to make sure that we haven't introduced any regression. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jan--f, simonpasquier The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
@simonpasquier: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged:
These pull request must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with Bugzilla bug 1933847 has not been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This requires #1330 (on 4.9 and 4.8.z) as well as openshift/release#21564 to be merged first.We need #1472 to be merged (backport of #1330 to release-4.9) to ensure that before the pods spread on different nodes before the upgrade to 4.10 starts (otherwise the new hard anti-affinity rules might block the process).