Customizing Istio Metrics with unexpected metrics #38277

Closed
3 of 15 tasks
YaHuiSong opened this issue Apr 7, 2022 · 13 comments
Labels
area/extensions and telemetry, lifecycle/automatically-closed (indicates a PR or issue that has been closed automatically), lifecycle/stale (indicates a PR or issue hasn't been manipulated by an Istio team member for a while)

Comments

@YaHuiSong
YaHuiSong commented Apr 7, 2022

Bug Description

situation:
I configured a custom COUNTER metric posta_total and a custom HISTOGRAM metric istio_posta_duration_milliseconds, both with the dimensions url_path and response_status, in the IstioOperator YAML file. Then I ran into some strange issues.

  1. got extra weird metrics with names like the following
response_status___200___url_path____app___istio_posta_total
response_status___200___url_path____app___istio_posta_duration_milliseconds_buckets
response_status___200___url_path____app___istio_posta_duration_milliseconds_sum
response_status___200___url_path____app___istio_posta_duration_milliseconds_total
  2. for the HISTOGRAM metric, the bucket distribution is not correct

use the PromQL with
sum(istio_request_duration_milliseconds_bucket{pod="xxxxxxxx"}) by (le)

result: 2 values
image

Use istio standard metric with the same pod
sum(istio_request_duration_milliseconds_bucket{pod="xxxxx"}) by (le)

result: 3 values
image

  3. another question, not a bug
    Is it possible to override the le dimension of the existing HISTOGRAM metrics?
    I know from the FAQ that telemetry V2 does not support customizing buckets for histogram type metrics: https://istio.io/latest/about/faq/metrics-and-logs/#telemetry-v1-vs-v2.

But I am not sure whether telemetry V2 supports customizing buckets now. The default le values are [0.5, 1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000, 300000, 600000, 1800000, 3600000].
For real services, almost all values fall below 5000, so most of the default le values are not actually used.
So, if this is not supported yet, does the Istio team have any plan to support customizing buckets for histogram type metrics to make them work better?

I am new to Istio metrics, so please correct me if I have made any mistakes. Thanks in advance.

What I did:

  1. configure the istioOperator with
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
 name: istio
 namespace: istio-system
spec:
 components:
   ingressGateways:
   - enabled: true
     k8s:
       env:
         ....
     name: istio-ingressgateway
 hub: gcr.io/istio-release
 profile: default
 meshConfig:
   defaultConfig:
     proxyStatsMatcher:
       inclusionPrefixes:
         - istio_posta_total
         - istio_posta_duration_milliseconds_buckets
         - istio_posta_duration_milliseconds_sum
         - istio_posta_duration_milliseconds_total
     extraStatTags:
       - url_path
       - response_status
 values:
   gateways:
     istio-ingressgateway:
       type: NodePort
   pilot:
     env:
       PILOT_ENABLE_VIRTUAL_SERVICE_DELEGATE: "true"
   telemetry:
     v2:
       prometheus:
         configOverride:
           inboundSidecar:
             definitions:
             - name: posta_total
               type: "COUNTER"
               value: "1"
             - name: posta_duration_milliseconds
               type: "HISTOGRAM"
               value: "1"
             metrics:
               - name: posta_total
                 dimensions:
                   url_path: request.url_path
                   response_status: string(response.code)
               - name: posta_duration_milliseconds
                 dimensions:
                   url_path: request.url_path
                   response_status: string(response.code)

  2. check the envoyFilter
  3. restart the application pod
  4. check the metrics in the Istio Prometheus dashboard

Version

➜  ✗ istioctl version
client version: 1.11.0
control plane version: 1.12.4
data plane version: 1.12.4 (1103 proxies)
➜  ✗ kubectl version --short
Client Version: v1.22.1
Server Version: v1.21.5-eks-bc4871b
➜  ✗ helm version --short
v3.6.3+gd506314

Additional Information

Running with the following config:

istio-namespace: istio-system
full-secrets: false
timeout (mins): 30
include: { }
exclude: { Namespaces: kube-system, kube-public, kube-node-lease, local-path-storage } AND { Namespaces: kube-system, kube-public, kube-node-lease, local-path-storage }
end-time: 2022-04-07 15:07:07.306058 +0800 CST

The following Istio control plane revisions/versions were found in the cluster:

Revision default:
&version.MeshInfo{
    {
        Component: "pilot",
        Info:      version.BuildInfo{Version:"1.12.4", GitRevision:"d60cc270251c6fed857dd5ec0546207e0a7dec1f", GolangVersion:"go1.17.7", BuildStatus:"Clean", GitTag:"1.12.4"},
    },
    {
        Component: "pilot",
        Info:      version.BuildInfo{Version:"1.12.4", GitRevision:"d60cc270251c6fed857dd5ec0546207e0a7dec1f", GolangVersion:"go1.17.7", BuildStatus:"Clean", GitTag:"1.12.4"},
    },
    {
        Component: "pilot",
        Info:      version.BuildInfo{Version:"1.12.4", GitRevision:"d60cc270251c6fed857dd5ec0546207e0a7dec1f", GolangVersion:"go1.17.7", BuildStatus:"Clean", GitTag:"1.12.4"},
    },
    {
        Component: "pilot",
        Info:      version.BuildInfo{Version:"1.12.4", GitRevision:"d60cc270251c6fed857dd5ec0546207e0a7dec1f", GolangVersion:"go1.17.7", BuildStatus:"Clean", GitTag:"1.12.4"},
    },
    {
        Component: "pilot",
        Info:      version.BuildInfo{Version:"1.12.4", GitRevision:"d60cc270251c6fed857dd5ec0546207e0a7dec1f", GolangVersion:"go1.17.7", BuildStatus:"Clean", GitTag:"1.12.4"},
    },
}

The following proxy revisions/versions were found in the cluster:
Revision default: Versions {1.12.4}

xxxxxx

Affected product area

  • Docs
  • Installation
  • Networking
  • Performance and Scalability
  • Extensions and Telemetry
  • Security
  • Test and Release
  • User Experience
  • Developer Infrastructure
  • Upgrade
  • Multi Cluster
  • Virtual Machine
  • Control Plane Revisions

Is this the right place to submit this?

  • This is not a security vulnerability
  • This is not a question about how to use Istio
@zirain
Member

zirain commented Apr 7, 2022

got extra weird metric with names like following

I tried that config locally and everything looks good. Did you restart the pod after you changed the IstioOperator CR?

@YaHuiSong
Author

YaHuiSong commented Apr 7, 2022

@zirain Which Istio version are you using?
Yes, I did restart the pod and still got the weird metrics. I tried this configuration in 2 different clusters, and both produced the weird metrics. From the Envoy dashboard, I got this:
image
image

@zirain
Member

zirain commented Apr 7, 2022

Do you have any Istio annotations on the pod?
I tested it with Istio v1.12. Can you share the config_dump of a pod that has the weird metrics?

@douglas-reid
Contributor

For (1): are response_status and url_path present in the in-cluster MeshConfig extra stats tag definition? It looks like they should be, based on the config you present, but it would be worth double-checking.

Generally, when we see funky exported prom metrics like that, it is an indication that the regexps configured to transform the raw metrics into prom metrics are incorrect (and missing these cases). That, typically, indicates that the proxy is not aware of the statsTag info at bootstrap (startup time).

We recommend restarting the pods with the proxies to force regeneration of the bootstrap info from the control plane. If the MeshConfig looks correct and there are not annotations on the pod itself potentially overriding that configuration, maybe confirm there are not error logs in the control plane?
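
As a rough illustration of point (1), the sketch below shows how a raw stat name can turn into one of the odd names reported above when the tags are not extracted first. The `=.=` and `;.;` separators and the raw stat name itself are assumptions made for illustration (they are not quoted anywhere in this thread), and `sanitize` and `extract_tags` are hypothetical helpers, not Envoy or Istio functions.

```python
import re

# ASSUMPTION: the separators ("=.=" between a tag and its value, ";.;" between
# fields) and this raw stat name are illustrative, not taken from the thread.
RAW_STAT = "response_status=.=200;.;url_path=.=/app;.;istio_posta_total"


def sanitize(name):
    """Replace every character that is not valid in a Prometheus metric name."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


def extract_tags(name):
    """What a working tag-extraction step would do: split tags from the name."""
    *fields, metric = name.split(";.;")
    tags = dict(field.split("=.=", 1) for field in fields)
    return metric, tags


# Without tag extraction, the whole raw name is sanitized into one metric name.
print(sanitize(RAW_STAT))
# With tag extraction, the metric name and its labels come out separately.
print(extract_tags(RAW_STAT))
```

Under these assumptions the first call yields a name of the same shape as the metrics in the report, while the second yields `istio_posta_total` with `response_status` and `url_path` as labels.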

For (2): It looks like there may be an issue with copy/paste in your examples. Both queries look the same to me. What difference is being expressed? I think what you are suggesting is that your custom histogram is acting as a counter. Is that correct?

If so, that is probably due to the use of the value of "1" in your configuration.

             - name: posta_duration_milliseconds
               type: "HISTOGRAM"
               value: "1"

This means that you are making an observation of 1 millisecond with each request. You will likely want to use some expression over request.duration instead.
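
A minimal sketch of why a constant observation makes a histogram look like a counter, using the default le boundaries quoted earlier in this issue. The `cumulative_buckets` helper and the sample latencies are hypothetical, and the `+Inf` bucket is left out for brevity.

```python
from bisect import bisect_left

# Default bucket upper bounds (le) quoted earlier in this issue, in milliseconds.
DEFAULT_LE = [0.5, 1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000,
              30000, 60000, 300000, 600000, 1800000, 3600000]


def cumulative_buckets(observations, bounds=DEFAULT_LE):
    """Return {le: number of observations <= le}, as Prometheus buckets do."""
    counts = [0] * len(bounds)
    for value in observations:
        first = bisect_left(bounds, value)  # index of the first bound >= value
        for i in range(first, len(bounds)):
            counts[i] += 1
    return dict(zip(bounds, counts))


# value: "1" records an observation of 1 for every request (100 requests here).
constant = cumulative_buckets([1] * 100)
# Hypothetical request latencies in milliseconds, for contrast.
varied = cumulative_buckets([0.3, 4, 4, 20, 80, 700])

print(sorted(set(constant.values())))  # distinct bucket counts, constant case
print(sorted(set(varied.values())))    # distinct bucket counts, varied case
```

With a constant observation of 1, only two distinct bucket counts ever appear: zero for le=0.5 and the request count for every other boundary. Varied latencies spread across more boundaries.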

For (3), adding customization around histogram buckets is something we would like to offer. IIRC, this is still a bootstrap setting in Envoy (and not under the focus of the Telemetry API at the moment, as it requires proxy restarts). I think there is progress that can be made, but it will require a decent amount of doing.

@YaHuiSong
Author

YaHuiSong commented Apr 8, 2022

@douglas-reid @zirain Thanks for your quick response

For (1): I checked the pod; its annotations are as follows

  proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
  sidecar.istio.io/status: '{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null,"revision":"default"}'

I think these are unrelated to the custom metrics. Besides, the value of the pod's PROXY_CONFIG env var looks fine:

- name: PROXY_CONFIG
     value: |
       {"extraStatTags":["url_path","response_status"],"proxyStatsMatcher":{"inclusionPrefixes":["istio_posta_total","istio_posta_duration_milliseconds_buckets","istio_posta_duration_milliseconds_sum","istio_posta_duration_milliseconds_total"]},"holdApplicationUntilProxyStarts":true}
  

I also checked

  1. the Envoy stats; the stats format seems to be the same as for the Istio standard metrics
  2. the istiod log; since I applied the IstioOperator with the custom metrics, there have been no related error logs
  3. the envoyFilter with the custom metric config
  4. the Envoy config_dump
    check.zip: the attachment contains the Envoy stats, the envoyFilter, and the config_dump files

For (2): Yes, @douglas-reid, you are correct. Thanks for this; it works for me now.

For (3): It is great news that there is progress. I know that customizing buckets is not easy work. Looking forward to the good news, and thanks for letting me know.

@zirain
Member

zirain commented Apr 8, 2022

The config_dump file you shared seems to be incomplete?

@YaHuiSong
Author

YaHuiSong commented Apr 14, 2022

@zirain Sorry for missing the message. I re-dumped the config and uploaded it; please help check
config_dump.zip

@zirain
Member

zirain commented Apr 14, 2022

@zirain Sorry for missing the message. I re-dumped the config and uploaded it; please help check config_dump.zip

The config dump seems OK. Does this pod still have the weird metrics?

@YaHuiSong
Author

@zirain Yes, the pods with the Istio Envoy proxy still have the weird metrics.

@douglas-reid
Contributor

@YaHuiSong can you post the non-prometheus dump of the stats from a proxy that matches that config dump?

@YaHuiSong
Author

@douglas-reid What do you mean by the non-prometheus dump of the stats? I used curl http://localhost:15000/config_dump to get the config dump. What is the difference between this config dump and the non-prometheus dump?
Thanks in advance

@douglas-reid
Contributor

@douglas-reid What do you mean by the non-prometheus dump of the stats? I used curl http://localhost:15000/config_dump to get the config dump. What is the difference between this config dump and the non-prometheus dump? Thanks in advance

Try: curl http://localhost:15000/stats
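
For context, the Envoy admin interface serves raw stats at `/stats` and the same stats in Prometheus format at `/stats/prometheus`, whereas `/config_dump` returns configuration rather than stats. A small sketch for comparing the two stat views follows. The `fetch` and `grep_stats` helpers are hypothetical names, and the code assumes it runs somewhere that can reach the admin port used in this thread.

```python
from urllib.request import urlopen

ADMIN = "http://localhost:15000"  # Envoy admin address used in this thread


def fetch(path):
    """GET an Envoy admin endpoint such as "/stats" or "/stats/prometheus"."""
    with urlopen(ADMIN + path, timeout=5) as resp:
        return resp.read().decode("utf-8")


def grep_stats(stats_text, needle):
    """Keep only the stat lines that mention `needle`."""
    return [line for line in stats_text.splitlines() if needle in line]


# Usage from a place that can reach the admin port (not executed here):
#   raw = grep_stats(fetch("/stats"), "posta")
#   prom = grep_stats(fetch("/stats/prometheus"), "posta")
```

Comparing the raw lines with the Prometheus lines for the same metric shows whether the tags were still embedded in the stat name before export.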

@istio-policy-bot istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Jul 28, 2022
@istio-policy-bot

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2022-04-28. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

@istio-policy-bot istio-policy-bot added the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Aug 12, 2022