
Pod readiness metrics, part 2 #2250

Open
erhudy opened this issue Nov 27, 2023 · 3 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

erhudy commented Nov 27, 2023

What would you like to be added:

I would like kube-state-metrics to be able to report on pod readiness gates, so that proper alerting can be set up for readiness gates that are not proceeding.

Why is this needed:

This is to be able to monitor and alert on stuck readiness gates from within the viewpoint of Kubernetes.

This issue was originally raised in #1981 with a proposed solution (combining it into the overall pod ready status), but was closed because that solution was deemed not satisfactory. I propose an alternative solution below.

Describe the solution you'd like

I propose the creation of two new metrics, tentatively named kube_pod_readiness_gate_status_total and kube_pod_readiness_gate_status_ready, to track the total number of pod readiness gates (PRGs) and the number of those that are ready.

The disposition of these metrics can be determined fairly simply by inspecting the pod spec and status (the example below is taken from a real pod):

spec:
  readinessGates:
  - conditionType: target-health.elbv2.k8s.aws/k8s-kongk8s-kongk8su-928018d45d
  - conditionType: target-health.elbv2.k8s.aws/k8s-kongk8s-kongk8su-01c4e87d17
  - conditionType: target-health.elbv2.k8s.aws/k8s-kongk8s-kongk8su-02d5a84fe7
  - conditionType: target-health.elbv2.k8s.aws/k8s-kongk8s-kongk8su-15c742bed9
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'Health checks failed with these codes: [401]'
    reason: Target.ResponseCodeMismatch
    status: "True"
    type: target-health.elbv2.k8s.aws/k8s-kongk8s-kongk8su-15c742bed9
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'Health checks failed with these codes: [401]'
    reason: Target.ResponseCodeMismatch
    status: "True"
    type: target-health.elbv2.k8s.aws/k8s-kongk8s-kongk8su-928018d45d
  - lastProbeTime: null
    lastTransitionTime: null
    message: Target registration is in progress
    reason: Elb.RegistrationInProgress
    status: "True"
    type: target-health.elbv2.k8s.aws/k8s-kongk8s-kongk8su-01c4e87d17
  - lastProbeTime: null
    lastTransitionTime: null
    message: Target registration is in progress
    reason: Elb.RegistrationInProgress
    status: "True"
    type: target-health.elbv2.k8s.aws/k8s-kongk8s-kongk8su-02d5a84fe7

For a given pod, the total metric would be the number of entries in the spec.readinessGates array. The ready count would be determined by matching each entry in that array to the condition of the same type in status.conditions and counting those that report status: "True".
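The matching logic described above can be sketched as follows. This is a hypothetical helper operating on a pod represented as a plain dict (as returned by the Kubernetes API in JSON form), not kube-state-metrics code, which is written in Go:

```python
def readiness_gate_counts(pod):
    """Return (total, ready) counts of a pod's readiness gates.

    total: number of entries in spec.readinessGates
    ready: gates whose matching condition in status.conditions
           reports status "True"
    """
    gates = pod.get("spec", {}).get("readinessGates", [])
    conditions = pod.get("status", {}).get("conditions", [])

    # Index conditions by their type for O(1) lookup per gate.
    status_by_type = {c["type"]: c.get("status") for c in conditions}

    total = len(gates)
    ready = sum(
        1
        for gate in gates
        if status_by_type.get(gate["conditionType"]) == "True"
    )
    return total, ready
```

A gate with no matching condition in status.conditions simply counts as not ready, which matches the situation while a controller (e.g. the AWS LB controller) has not yet posted its condition.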

Additional context

I am happy to do the work to implement this; I just want a thumbs-up on the proposed methodology before I start.

@erhudy erhudy added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 27, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 27, 2023
@dgrisonnet (Member) commented:

/assign
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 30, 2023
@dgrisonnet (Member) commented:

These metrics are going to be quite expensive.

Is your goal to monitor the ELB readiness via the pods? Because for pod readiness, you should already be able to alert on kube_pod_status_ready.


erhudy commented Dec 6, 2023

Pod readiness doesn't necessarily capture what I'm looking for here. Kubernetes can consider a pod ready while its pod readiness gates are failing: for example, a misconfiguration on an associated ingress could prevent the AWS LB controller from provisioning an ELB, which would be reflected in the pod readiness gate, yet the pod itself would still show up as N/N ready because it is fine from the kubelet's perspective.
