
Check whether metricObj can be converted to *v1beta2.MetricValueList #80392

Merged
merged 1 commit into kubernetes:master on Jul 30, 2019

Conversation

@tedyu (Contributor) commented Jul 20, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
As #80390 mentions, namespacedMetrics#GetForObject doesn't check whether metricObj can be converted to *v1beta2.MetricValueList.

This PR adds the check to avoid a potential panic.
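
Roughly, the fix replaces the direct type assertion with the comma-ok form, so an unexpected payload becomes an ordinary error instead of a panic. A minimal sketch (not the exact diff; `convertToMetricValueList` is a hypothetical helper name, while the real change lives in namespacedMetrics#GetForObject in versioned_client.go):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/metrics/pkg/apis/custom_metrics/v1beta2"
)

// convertToMetricValueList (hypothetical name) guards the conversion:
// the comma-ok assertion turns a mismatched payload, e.g. a *v1.Status
// error body, into an ordinary error instead of a panic.
func convertToMetricValueList(metricObj runtime.Object) (*v1beta2.MetricValueList, error) {
	res, ok := metricObj.(*v1beta2.MetricValueList)
	if !ok {
		return nil, fmt.Errorf("the custom metrics API server returned %T, not *v1beta2.MetricValueList", metricObj)
	}
	return res, nil
}

func main() {
	// A genuine MetricValueList converts cleanly; anything else errors out.
	if res, err := convertToMetricValueList(&v1beta2.MetricValueList{}); err == nil {
		fmt.Printf("got %d metric values\n", len(res.Items))
	}
}
```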

Which issue(s) this PR fixes:
Fixes #80390

Does this PR introduce a user-facing change?:
NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 20, 2019
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 20, 2019
@tedyu (Contributor, Author) commented Jul 22, 2019

cc @piosz

@gyuho (Member) commented Jul 23, 2019

@kawych Could we get this reviewed?

@shyamjvs (Member) commented:

/cc @MaciekPytel @DirectXMan12 @losipiuk

Could one of you take a look at this? This seems to be a serious bug: the HPA panics, causing the controller-manager to crashloop, whenever the type assertion to *v1beta2.MetricValueList fails:

E0724 00:06:29.842903       1 runtime.go:69] Observed a panic: &runtime.TypeAssertionError{interfaceString:"runtime.Object", concreteString:"*v1.Status", assertedString:"*v1beta2.MetricValueList", missingMethod:""} (interface conversion: runtime.Object is *v1.Status, not *v1beta2.MetricValueList)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:502
/usr/local/go/src/runtime/iface.go:252
/usr/local/go/src/runtime/iface.go:262
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/metrics/pkg/client/custom_metrics/versioned_client.go:269
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/metrics/pkg/client/custom_metrics/multi_client.go:136
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/metrics/rest_metrics_client.go:113
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/replica_calculator.go:158
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:347
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:274
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:550
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:318
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:210
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:198
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:164
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:2361
panic: interface conversion: runtime.Object is *v1.Status, not *v1beta2.MetricValueList [recovered]
        panic: interface conversion: runtime.Object is *v1.Status, not *v1beta2.MetricValueList

goroutine 1733 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x330f7c0, 0xc4221f9c80)
        /usr/local/go/src/runtime/panic.go:502 +0x229
k8s.io/kubernetes/vendor/k8s.io/metrics/pkg/client/custom_metrics.(*namespacedMetrics).GetForObjects(0xc420cfb2a0, 0x0, 0x0, 0x3a5b3e8, 0x3, 0x3df7240, 0xc420c80e20, 0xc4221e4000, 0x13, 0x3df72a0, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/metrics/pkg/client/custom_metrics/versioned_client.go:269 +0x4d3
k8s.io/kubernetes/vendor/k8s.io/metrics/pkg/client/custom_metrics.(*multiClientInterface).GetForObjects(0xc4221fa380, 0x0, 0x0, 0x3a5b3e8, 0x3, 0x3df7240, 0xc420c80e20, 0xc4221e4000, 0x13, 0x3df72a0, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/metrics/pkg/client/custom_metrics/multi_client.go:136 +0x118
k8s.io/kubernetes/pkg/controller/podautoscaler/metrics.(*customMetricsClient).GetRawMetric(0xc420426270, 0xc4221e4000, 0x13, 0xc420886db7, 0x7, 0x3df7240, 0xc420c80e20, 0x3df72a0, 0x5ba7488, 0x0, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/metrics/rest_metrics_client.go:113 +0x102
k8s.io/kubernetes/pkg/controller/podautoscaler.(*ReplicaCalculator).GetMetricReplicas(0xc420d66fc0, 0x2, 0x55d4a80, 0xc4221e4000, 0x13, 0xc420886db7, 0x7, 0x3df7240, 0xc420c80e20, 0x3df72a0, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/replica_calculator.go:158 +0xb0
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).computeStatusForPodsMetric(0xc420aeb180, 0x2, 0xc42233b678, 0x4, 0x0, 0xc42230da00, 0x0, 0x0, 0xc4222ff500, 0x3df7240, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:347 +0xd8
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).computeReplicasForMetrics(0xc420aeb180, 0xc4222ff500, 0xc421044280, 0xc420f5f710, 0x1, 0x1, 0x11, 0x3b2be6e, 0x3d, 0x0, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:274 +0x933
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).reconcileAutoscaler(0xc420aeb180, 0xc42200f980, 0xc4209da460, 0x17, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:550 +0x16de
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).reconcileKey(0xc420aeb180, 0xc4209da460, 0x17, 0x30401e0, 0xc420a57c40, 0xc420d74280)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:318 +0x277
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).processNextWorkItem(0xc420aeb180, 0xc422133d00)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:210 +0xde
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).worker(0xc420aeb180)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:198 +0x2b
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).(k8s.io/kubernetes/pkg/controller/podautoscaler.worker)-fm()
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:164 +0x2a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc421f63a00)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc421f63a00, 0x3b9aca00, 0x0, 0x1, 0xc42044a420)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc421f63a00, 0x3b9aca00, 0xc42044a420)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:164 +0x1ca

This seems to be a regression introduced in this commit - kubernetes/metrics@2bd4d84

Happy to give more info.
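
For reference, the panic itself is just Go's single-result type assertion failing on a mismatched concrete type; a minimal standalone sketch (stand-in types, no Kubernetes imports) reproduces the failure mode:

```go
package main

import "fmt"

// Stand-ins for runtime.Object, *v1.Status, and *v1beta2.MetricValueList.
type object interface{ kind() string }

type status struct{}          // what the adapter actually returned
type metricValueList struct{} // what the client expected

func (status) kind() string          { return "Status" }
func (metricValueList) kind() string { return "MetricValueList" }

func main() {
	var obj object = status{}

	// Comma-ok form: a mismatch is an ordinary, recoverable condition.
	if _, ok := obj.(metricValueList); !ok {
		fmt.Println("not a MetricValueList; return an error instead")
	}

	// Single-result form: a mismatch panics, which is what took down
	// the controller-manager here.
	_ = obj.(metricValueList) // panic: interface conversion
}
```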

@shyamjvs (Member) commented:

I dug a bit further into the issue and found the following:

  • The HPA was panicking (as shown above) while trying to typecast the result of scraping the custom metrics API from *v1.Status to *v1beta2.MetricValueList
  • The custom metrics API itself was being served by the kube-metrics-adapter pod running on the cluster. From the logs of that pod, it seemed to be successfully serving the requests from the HPA:
I0724 00:58:39.350470       1 wrap.go:42] GET /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/total-response-rate?labelSelector=app%3D<redacted>: (326.604µs) 200 [[kube-controller-manager/v1.12.6 (linux/amd64) kubernetes/d69f1bf/horizontal-pod-autoscaler] 10.188.3.89:58484]
  • However, even though the status code was 200, the response being sent wasn't actually a valid one. I found this by making a "get metrics" call myself:
$  kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/total-response-rate?labelSelector=app=<redacted>" --v=8
I0724 02:26:41.904913   29275 round_trippers.go:416] GET http://localhost:8080/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/total-response-rate?labelSelector=app%3Dcollector-green%2Ccomponent%3Dcollector%2Crelease%3Dcollector-green
I0724 02:26:41.904957   29275 round_trippers.go:423] Request Headers:
I0724 02:26:41.904963   29275 round_trippers.go:426]     Accept: application/json, */*
I0724 02:26:41.904968   29275 round_trippers.go:426]     User-Agent: kubectl/v1.12.6 (linux/amd64) kubernetes/d69f1bf
I0724 02:26:41.909067   29275 round_trippers.go:441] Response Status: 200 OK in 4 milliseconds
I0724 02:26:41.909081   29275 round_trippers.go:444] Response Headers:
I0724 02:26:41.909084   29275 round_trippers.go:447]     Date: Wed, 24 Jul 2019 02:26:41 GMT
I0724 02:26:41.909093   29275 round_trippers.go:447]     Audit-Id: 7a1fb523-f5ea-43a8-8015-62fb3ba7771a
I0724 02:26:41.909100   29275 round_trippers.go:447]     Content-Length: 122
I0724 02:26:41.909107   29275 round_trippers.go:447]     Content-Type: application/json
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"expected pointer, but got nil","code":500}
  • There are two problems here:
    • One is with the metrics adapter: it returned a 200 status code even though scraping metrics failed, hiding the actual status code (500) inside the response body. I'll file this under zalando-incubator/kube-metrics-adapter#40 (Crash in Kube Controller Manager)
    • The second is with the k8s metrics client: it tries to typecast the response to MetricValueList without checking whether that's feasible, and consequently panics. IMO we should prevent an externally served API (like this one) from causing crashes in the kube-controller-manager, and I think this PR helps with that; see the sketch below
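
A hypothetical sketch of that defensive handling (`decodeMetrics` is illustrative, not the PR's code): recognize a metav1.Status body, which an adapter may send even under HTTP 200, and surface it as an error rather than asserting blindly.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/metrics/pkg/apis/custom_metrics/v1beta2"
)

// decodeMetrics (hypothetical helper) turns a Status payload into an
// error instead of letting a blind type assertion panic.
func decodeMetrics(obj runtime.Object) (*v1beta2.MetricValueList, error) {
	switch t := obj.(type) {
	case *v1beta2.MetricValueList:
		return t, nil
	case *metav1.Status:
		return nil, fmt.Errorf("metrics API returned a Status instead of metrics: %s (code %d)", t.Message, t.Code)
	default:
		return nil, fmt.Errorf("unexpected object type %T from metrics API", obj)
	}
}

func main() {
	// Simulate the body observed in the kubectl call above.
	_, err := decodeMetrics(&metav1.Status{Message: "expected pointer, but got nil", Code: 500})
	fmt.Println(err)
}
```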

@szuecs (Member) commented Jul 24, 2019

I think we had this issue a while ago, too.
Basically it's a version mismatch between the versions of the Metrics objects.

@szuecs (Member) commented Jul 24, 2019

We have an open PR that should fix the issue on our side for users of v1beta1 (GKE): zalando-incubator/kube-metrics-adapter#64
It would be nice if someone could verify it on GKE.

@tedyu (Contributor, Author) commented Jul 24, 2019

ping @piosz for review

@tedyu (Contributor, Author) commented Jul 25, 2019

@kawych
Can you take another look?

@kawych (Contributor) commented Jul 26, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 26, 2019
mikkeloscar added a commit to zalando-incubator/kube-metrics-adapter that referenced this pull request Jul 26, 2019
Fixes the response from `GetMetricsBySelector` in case no metrics are
found. This issue caused a panic in kube-controller-manager:
kubernetes/kubernetes#80392

Fix #40

Signed-off-by: Mikkel Oscar Lyderik Larsen <mikkel.larsen@zalando.de>
szuecs pushed a commit to zalando-incubator/kube-metrics-adapter that referenced this pull request Jul 26, 2019
Fixes the response from `GetMetricsBySelector` in case no metrics are
found. This issue caused a panic in kube-controller-manager:
kubernetes/kubernetes#80392

Fix #40

Signed-off-by: Mikkel Oscar Lyderik Larsen <mikkel.larsen@zalando.de>
@tedyu (Contributor, Author) commented Jul 29, 2019

@piosz
Can you review?

@piosz (Member) commented Jul 30, 2019

/approve
reviewed by @kawych

@piosz (Member) commented Jul 30, 2019

/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: piosz, tedyu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 30, 2019
@piosz piosz added the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Jul 30, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 30, 2019
@k8s-ci-robot k8s-ci-robot merged commit c6101ab into kubernetes:master Jul 30, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Jul 30, 2019
@shyamjvs (Member) commented:

Thanks!

@tedyu Could you cherry-pick the changes to our supported versions (1.13, 1.14, 1.15)?

@tedyu (Contributor, Author) commented Jul 30, 2019

I would be happy to do that.

My workspace is on master.
Let me follow:
https://github.com/eBay/Kubernetes/blob/master/docs/devel/cherry-picks.md

Thanks

@shyamjvs (Member) commented:

You can use this script to generate PRs for the patches - https://github.com/kubernetes/kubernetes/blob/master/hack/cherry_pick_pull.sh
