Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gRPC status code labels are reported as description strings rather than numbers #11449

Closed
patrick-steadman opened this issue Oct 2, 2023 · 1 comment · Fixed by linkerd/linkerd2-proxy#2480
Assignees
Labels

Comments

@patrick-steadman
Copy link

What is the issue?

After upgrading linkerd-proxy to cr.l5d.io/linkerd/proxy:stable-2.14.0, the grpc_status label for metric becomes a full-sentence description with whitespace like ""The operation completed successfully" rather than a numerical code like "2". Likely this was introduced earlier.

This change breaks existing dashboards and monitoring that expect the numeric gPRC response codes. These description strings also have whitespace and other characters that don't work well in tags.

How can it be reproduced?

It can be reproduced by running version stable-2.14.0 of the linkerd-proxy and making gRPC requests to a meshed service. When viewing the output from the /metrics endpoint of prometheus/openmetrics, the output will look something like:

route_response_total{direction="inbound",dst="algolia.thundercats.svc.cluster.local:8080",rt_route="GetSearchApiKey",status_code="200",classification="success",grpc_status="The operation completed successfully",error=""} 12
route_response_total{direction="inbound",dst="algolia.thundercats.svc.cluster.local:8080",rt_route="GetSearchApiKeys",status_code="200",classification="success",grpc_status="The operation completed successfully",error=""} 2

I think it'd also be pretty easy to reproduce this in a failing unit test case here: https://github.com/linkerd/linkerd2-proxy/blob/45b324f7b4bc5221b5ca796f79830d04ecce8e79/linkerd/app/core/src/metrics.rs#L425C48-L425C48

Logs, error output, etc

route_response_total{direction="inbound",dst="algolia.thundercats.svc.cluster.local:8080",rt_route="GetSearchApiKey",status_code="200",classification="success",grpc_status="The operation completed successfully",error=""} 12
route_response_total{direction="inbound",dst="algolia.thundercats.svc.cluster.local:8080",rt_route="GetSearchApiKeys",status_code="200",classification="success",grpc_status="The operation completed successfully",error=""} 2

output of linkerd check -o short

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-10-03T18:50:09Z
    see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2023-10-02T18:50:08Z
    see https://linkerd.io/2.14/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2023-10-02T18:04:59Z
    see https://linkerd.io/2.14/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2023-10-02T18:50:09Z
    see https://linkerd.io/2.14/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.14.0 but the latest stable version is 2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.14.0 but the latest stable version is 2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-59b7b9ddbd-fw2gp (stable-2.14.0)
        * linkerd-destination-59b7b9ddbd-gkp7t (stable-2.14.0)
        * linkerd-destination-59b7b9ddbd-t5lzr (stable-2.14.0)
        * linkerd-identity-759686967-7lddr (stable-2.14.0)
        * linkerd-identity-759686967-cdfwm (stable-2.14.0)
        * linkerd-identity-759686967-zq2xm (stable-2.14.0)
        * linkerd-proxy-injector-64fcfb4cd7-8dm28 (stable-2.14.0)
        * linkerd-proxy-injector-64fcfb4cd7-nntv6 (stable-2.14.0)
        * linkerd-proxy-injector-64fcfb4cd7-v7b75 (stable-2.14.0)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints

linkerd-jaeger
--------------
‼ jaeger extension proxies are up-to-date
    some proxies are not running the current version:
        * collector-5f57dc685b-488nb (stable-2.14.0)
        * jaeger-79dd465474-qmtnc (stable-2.14.0)
        * jaeger-injector-84c8c45df4-l8z54 (stable-2.14.0)
    see https://linkerd.io/2.14/checks/#l5d-jaeger-proxy-cp-version for hints

linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
    certificate will expire on 2023-10-02T22:52:07Z
    see https://linkerd.io/2.14/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
        * metrics-api-86f769585-hwwrc (stable-2.14.0)
        * prometheus-76d7bcc46f-zz7dd (stable-2.14.0)
        * tap-55f596cf7-lcctv (stable-2.14.0)
        * tap-55f596cf7-wcbb6 (stable-2.14.0)
        * tap-55f596cf7-wst2z (stable-2.14.0)
        * tap-injector-6656db5976-xcqdg (stable-2.14.0)
        * web-7df59675cc-47977 (stable-2.14.0)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints

Status check results are √

Environment

  • Kuberentes 2.25
  • Linkerd 2.14

Possible solution

Add call to .code() here so that a numeric code is emitted? https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/app/core/src/metrics.rs#L425C48-L425C48

Additional context

No response

Would you like to work on fixing this bug?

yes

@adleong
Copy link
Member

adleong commented Oct 2, 2023

thanks for reporting. I'm looking into this.

@adleong adleong self-assigned this Oct 2, 2023
adleong added a commit to linkerd/linkerd2-proxy that referenced this issue Oct 4, 2023
Fixes linkerd/linkerd2#11449

The `grpc_status` metric label is rendered as a long form, human readable string value in the proxy metrics.  For example:

```
response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1
```

This is because of the Display impl for Code.  We explicitly convert to an i32 so this renders as a number instead:

```
response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1
```

Signed-off-by: Alex Leong <alex@buoyant.io>
@hawkw hawkw mentioned this issue Oct 19, 2023
hawkw added a commit that referenced this issue Oct 19, 2023
## edge-23.10.3

This edge release fixes issues in the proxy and destination controller which can
result in Linkerd proxies sending traffic to stale endpoints. In addition, it
contains other bugfixes and updates dependencies to include patches for the
security advisories [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 and GHSA-c827-hfw6-qwvm.

* Fixed an issue where the Destination controller could stop processing
  changes in the endpoints of a destination, if a proxy subscribed to that
  destination stops reading service discovery updates. This issue results in
  proxies attempting to send traffic for that destination to stale endpoints
  ([#11483], fixes [#11480], [#11279], and [#10590])
* Fixed a regression introduced in stable-2.13.0 where proxies would not
  terminate unused service discovery watches, exerting backpressure on the
  Destination controller which could cause it to become stuck
  ([linkerd2-proxy#2484] and [linkerd2-proxy#2486])
* Added `INFO`-level logging to the proxy when endpoints are added or removed
  from a load balancer. These logs are enabled by default, and can be disabled
  by [setting the proxy log level][proxy-log-level] to
  `warn,linkerd=info,linkerd_proxy_balance=warn` or similar
  ([linkerd2-proxy#2486])
* Fixed a regression where the proxy rendered `grpc_status` metric labels as a
  string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes
  [#11449])
* Added missing `imagePullSecrets` to `linkerd-jaeger` ServiceAccount ([#11504])
* Updated the control plane's dependency on the `golang.google.org/grpc` Go
  package to include patches for [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 ([#11496])
* Updated dependencies on `rustix` to include patches for GHSA-c827-hfw6-qwvm
  ([linkerd2-proxy#2488] and [#11512]).

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11449]: #11449
[#11480]: #11480
[#11504]: #11504
[#11504]: #11512
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484
[linkerd2-proxy#2486]: linkerd/linkerd2-proxy#2486
[linkerd2-proxy#2488]: linkerd/linkerd2-proxy#2488
[proxy-log-level]: https://linkerd.io/2.14/tasks/modifying-proxy-log-level/
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
hawkw pushed a commit to linkerd/linkerd2-proxy that referenced this issue Oct 25, 2023
Fixes linkerd/linkerd2#11449

The `grpc_status` metric label is rendered as a long form, human readable string value in the proxy metrics.  For example:

```
response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1
```

This is because of the Display impl for Code.  We explicitly convert to an i32 so this renders as a number instead:

```
response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1
```

Signed-off-by: Alex Leong <alex@buoyant.io>
hawkw pushed a commit to linkerd/linkerd2-proxy that referenced this issue Oct 25, 2023
Fixes linkerd/linkerd2#11449

The `grpc_status` metric label is rendered as a long form, human readable string value in the proxy metrics.  For example:

```
response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1
```

This is because of the Display impl for Code.  We explicitly convert to an i32 so this renders as a number instead:

```
response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1
```

Signed-off-by: Alex Leong <alex@buoyant.io>
hawkw pushed a commit to linkerd/linkerd2-proxy that referenced this issue Oct 26, 2023
Fixes linkerd/linkerd2#11449

The `grpc_status` metric label is rendered as a long form, human readable string value in the proxy metrics.  For example:

```
response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1
```

This is because of the Display impl for Code.  We explicitly convert to an i32 so this renders as a number instead:

```
response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1
```

Signed-off-by: Alex Leong <alex@buoyant.io>
hawkw pushed a commit to linkerd/linkerd2-proxy that referenced this issue Oct 26, 2023
Fixes linkerd/linkerd2#11449

The `grpc_status` metric label is rendered as a long form, human readable string value in the proxy metrics.  For example:

```
response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1
```

This is because of the Display impl for Code.  We explicitly convert to an i32 so this renders as a number instead:

```
response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1
```

Signed-off-by: Alex Leong <alex@buoyant.io>
mateiidavid added a commit that referenced this issue Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which
can result in Linkerd proxies sending traffic to stale endpoints. In addition,
it contains a bug fix for profile resolutions for pods bound on host ports and
includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3

* Control Plane
  * Fixed an issue where the Destination controller could stop processing
    changes in the endpoints of a destination, if a proxy subscribed to that
    destination stops reading service discovery updates. This issue results in
    proxies attempting to send traffic for that destination to stale endpoints
    ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod
    metadata for profile resolutions for a pod accessed via the host network
    (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several
    dependencies (including Go's gRPC and net libraries)

* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as
    a string rather than as the numeric status code ([linkerd2-proxy#2480];
    fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not
    terminate unusred service discovery watches, exerting backpressure on the
    Destination controller which could cause it to become stuck
    ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

Signed-off-by: Matei David <matei@buoyant.io>
mateiidavid added a commit that referenced this issue Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which
can result in Linkerd proxies sending traffic to stale endpoints. In addition,
it contains a bug fix for profile resolutions for pods bound on host ports and
includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3

* Control Plane
  * Fixed an issue where the Destination controller could stop processing
    changes in the endpoints of a destination, if a proxy subscribed to that
    destination stops reading service discovery updates. This issue results in
    proxies attempting to send traffic for that destination to stale endpoints
    ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod
    metadata for profile resolutions for a pod accessed via the host network
    (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several
    dependencies (including Go's gRPC and net libraries)

* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as
    a string rather than as the numeric status code ([linkerd2-proxy#2480];
    fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not
    terminate unusred service discovery watches, exerting backpressure on the
    Destination controller which could cause it to become stuck
    ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

---------

Signed-off-by: Matei David <matei@buoyant.io>
Co-authored-by: Alejandro Pedraza <alejandro@buoyant.io>
Co-authored-by: Oliver Gould <ver@buoyant.io>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants