Bug 1807140: jsonnet/rules.jsonnet: Add MultipleContainersOOMKilled alert #676
Conversation
@lilic: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
          message: 'Multiple containers were out of memory killed within the past 15 minutes.',
        },
        labels: {
          severity: 'info',
Why `info`? OOM means we don't have enough memory in a cluster and it should be scaled, which in many cases is a manual action requiring human intervention and possibly a page. Can we raise it to `warning`?
No. Because this alert is not exact, we do not want to use the warning level; that was decided in the PR that was merged and in offline conversations with the team. They are adding a proper metric directly in crio in 4.5 that will actually have the info we need on how many containers are being OOM killed, and we will remove this alert and make it warning then. For now this alert is good enough, and it's mainly used for us to get a ballpark on how often we see OOMs in the wild.
See #668 and #642 (comment)
It's mostly because this is a heuristic: we assume that if a container restarted and its last termination reason was an OOM, then all of those restarts were OOMs. If we truly knew that x amount of OOMs happened, then I would agree to making this a warning, but with this heuristic I would be more careful.
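To make the over-counting concrete, here is the heuristic split into its two halves. This is a sketch assembled from the expression in the diff below, not the file verbatim:

```jsonnet
// Sketch of the heuristic, split into its two halves. The `and` clause only
// checks the *last* termination reason, so if a container restarted 6 times
// in the window but only the final exit was an OOM, all 6 restarts are
// counted as OOM kills.
local restartsInWindow =
  'max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m]))';
local lastExitWasOOM =
  'max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1';
{
  expr: 'sum(%s and %s) > 5' % [restartsInWindow, lastExitWasOOM],
}
```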
@lilic: This pull request references Bugzilla bug 1807139, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from fbcbf40 to ee14eb5 (Compare)
@@ -196,6 +196,17 @@
        severity: 'warning',
      },
    },
    {
      expr: 'sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m])) and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5',
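Putting the fragments of this diff together with the `message`/`labels` snippet above, the rule reads roughly as follows. This is a reconstruction, not the merged file verbatim: the `alert` name is taken from the PR title, the `annotations` nesting is assumed (conventional, but not visible in the hunks), and any `for` duration is omitted because it does not appear here:

```jsonnet
{
  alert: 'MultipleContainersOOMKilled',
  // The expression:
  //   - increase(kube_pod_container_status_restarts_total[12m]) estimates each
  //     container's restart count over the last 12 minutes,
  //   - `and ... == 1` keeps only containers whose last termination reason
  //     was OOMKilled,
  //   - the summed restart count must be greater than 5 for the alert to fire.
  expr: 'sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m])) and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5',
  annotations: {
    message: 'Multiple containers were out of memory killed within the past 15 minutes.',
  },
  labels: {
    severity: 'info',
  },
}
```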
iirc the 12min comes from some kube behavior, right? Do we have some docs about this to link to, for future Frederic to understand?
ha :D Yes, this is the way kube restarts containers. There are no docs, just a lot of testing with various timespans until the graph eventually normalised.
ok, that works for me; no need for a comment then
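For context on the 12m choice: the kubelet restarts crashing containers with an exponential CrashLoopBackOff. The figures below (10s doubling up to a 5m cap) are an assumption about default kubelet behavior, not something stated in this PR; they just make the "normalised the graph" observation plausible:

```jsonnet
// Back-of-the-envelope sketch. The backoff figures are an assumption about
// kubelet CrashLoopBackOff defaults, not taken from this PR.
local backoffSeconds = [10, 20, 40, 80, 160, 300];
// Cumulative delay across one full backoff ramp:
local total = std.foldl(function(acc, d) acc + d, backoffSeconds, 0);
{
  totalSeconds: total,       // 610
  totalMinutes: total / 60,  // ~10.2
}
```

Under those assumptions one full ramp takes a bit over 10 minutes, so a 12m `increase()` window covers an entire cycle and the restart counts stop oscillating with the backoff.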
/bugzilla refresh
@lilic: This pull request references Bugzilla bug 1807139, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Looking at this without extra context: is there a reason this isn't added to the kubernetes-mixin project first and then in here?
@metalmatze we will be adding the proper alert in the mixin project when the crio metric is available; see the metric in the PR linked from the previous PR that was merged into master. This PR is just backporting it :D This is a guess alert and not exact at all; we want to fully test it and just want to get an indication of how often these alerts fire, see the discussion a bit above in this PR: #676 (comment). It's also detailed in the linked bugzilla and on slack. ;) I have created a jira ticket so we don't forget to remove this once the proper metric that tracks the number of OOM killed containers is added.
/lgtm
/retest Please review the full test history for this PR and help us cut down flakes.
@lilic this needs a rebase now
@lilic: This pull request references Bugzilla bug 1807140, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@smarterclayton to have this merged in 4.4 we need the dependent bugzilla in verified or modified state; you mentioned you will get it verified?
#668 was at 10; you need to change master to 5 and then associate it with 1807139, and then once it merges this will merge (you don't need a verified PR in master, you just need a merged PR in master).
/bugzilla refresh
@lilic: This pull request references Bugzilla bug 1807140, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
PR opened #690
/bugzilla refresh
@lilic: This pull request references Bugzilla bug 1807140, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This triggers when multiple containers were killed within a certain time range. This alert is not 100% accurate, but it is the best we can currently do with the metrics we have.
Had to rebase, so the lgtm was removed. @openshift/openshift-team-monitoring PTAL
/bugzilla refresh
@lilic: This pull request references Bugzilla bug 1807140, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: brancz, lilic, s-urbaniak
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@lilic: All pull requests linked via external trackers have merged. Bugzilla bug 1807140 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.3
@rphillips: #676 failed to apply on top of branch "release-4.3":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
cc @openshift/openshift-team-monitoring
/hold
Let's see if we need this adjusted first, or if it's fine like this. So far in 4.4 there were no alerts firing for this, but then we only have CI and nightlies, and the OOM bug was fixed.