Change ingress_upstream_latency_seconds metrics type to Histogram #8397

Lyt99 · 2022-03-29T12:00:54Z

Change the type of the ingress_upstream_latency_seconds metric from Summary to Histogram, and update the example Grafana dashboard to use the new metric type.

What this PR does / why we need it:

It fixes #8285 and may fix #8228.

We have hit this problem in both benchmark and production environments.

Under heavy load, a Summary metric has to keep recalculating its quantiles, which is expensive and not lock-free. This blocks other metric requests and consumes a large amount of memory, until the controller gets OOM-killed.
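
To illustrate why a fixed-bucket Histogram avoids this cost, here is a minimal sketch in Go using only the standard library. It is not the Prometheus client's implementation and not the code in this PR; the `Histogram` type, the `NewHistogram`, `Observe`, and `Cumulative` names, and the bucket bounds are all hypothetical. Each observation is a binary search plus one atomic increment, and memory stays at one counter per bucket no matter how many requests are observed.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// Histogram is a hypothetical, simplified fixed-bucket histogram.
// Observe only touches atomic counters, so concurrent writers never
// wait on a mutex and memory use does not grow with traffic.
type Histogram struct {
	bounds []float64 // sorted bucket upper bounds, in seconds
	counts []uint64  // one counter per bound, plus a final +Inf bucket
}

// NewHistogram copies and sorts the given upper bounds.
func NewHistogram(bounds []float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe increments the first bucket whose upper bound is >= v.
// Values above every bound land in the trailing +Inf bucket.
func (h *Histogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	atomic.AddUint64(&h.counts[i], 1)
}

// Cumulative returns running totals per bucket, which is the shape
// a Prometheus histogram exposes through its "le" buckets.
func (h *Histogram) Cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i := range h.counts {
		running += atomic.LoadUint64(&h.counts[i])
		out[i] = running
	}
	return out
}

func main() {
	h := NewHistogram([]float64{0.005, 0.05, 0.5, 5})
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				h.Observe(float64(i%100) / 100) // synthetic latencies in [0, 1) seconds
			}
		}()
	}
	wg.Wait()
	fmt.Println(h.Cumulative())
}
```

The trade-off is that quantiles are no longer computed in the process; they have to be estimated from the bucket counts by whoever queries the metric.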

Changing the Summary to a lock-free Histogram would solve the problem, but it introduces a breaking change.
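
The breaking part is on the consumer side: a Summary exports precomputed series with a `quantile` label, while a Histogram exports cumulative `_bucket` series with an `le` label, so dashboards and alerts have to estimate quantiles at query time. The sketch below shows that estimation using linear interpolation inside a bucket, in the style of PromQL's `histogram_quantile`. It is a simplified illustration only; the `Bucket` type, the `EstimateQuantile` and `sampleBuckets` names, and the sample counts are made up.

```go
package main

import (
	"fmt"
	"math"
)

// Bucket mirrors one "le" bucket: Count is the cumulative number of
// observations that were <= UpperBound.
type Bucket struct {
	UpperBound float64
	Count      float64
}

// EstimateQuantile estimates the q-quantile (0 <= q <= 1) from cumulative
// buckets sorted by UpperBound, where the last bucket is +Inf.
// Observations are assumed to be spread evenly inside each bucket.
func EstimateQuantile(q float64, buckets []Bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total

	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < len(buckets)-1 && buckets[i].Count < rank {
		i++
	}
	if i == len(buckets)-1 {
		// The rank falls in the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].UpperBound
	}

	lower, prev := 0.0, 0.0 // the first bucket is assumed to start at 0
	if i > 0 {
		lower, prev = buckets[i-1].UpperBound, buckets[i-1].Count
	}
	inBucket := buckets[i].Count - prev
	if inBucket == 0 {
		return lower
	}
	return lower + (buckets[i].UpperBound-lower)*(rank-prev)/inBucket
}

// sampleBuckets returns made-up cumulative counts for illustration.
func sampleBuckets() []Bucket {
	return []Bucket{{0.1, 10}, {0.5, 60}, {1, 90}, {math.Inf(1), 100}}
}

func main() {
	b := sampleBuckets()
	fmt.Printf("p50 ~ %.3fs\n", EstimateQuantile(0.5, b))
	fmt.Printf("p99 ~ %.3fs\n", EstimateQuantile(0.99, b))
}
```

The accuracy of such an estimate depends on the chosen bucket bounds, so anything that previously read the exact `quantile` series needs to be revisited after the type change.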

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation only

Which issue/s this PR fixes

fixes #8285 and #8228

How Has This Been Tested?

Create a cluster and install nginx-ingress-controller with only 1 replica.
Add some backends and an Ingress rule.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-dep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
spec:
  type: ClusterIP
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-ingress-example
  labels:
    name: test-ingress-example
spec:
  rules:
  - host: test.example.com
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: nginx-svc
            port:
              number: 80

Run wrk with the template below to benchmark it. Replace <Controller Endpoint IP> with the real endpoint IP.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alpine
spec:
  selector:
    matchLabels:
      app: alpine
  replicas: 3
  template:
    metadata:
      labels:
        app: alpine
    spec:
      containers:
      - name: wrk
        image: alpine:latest
        command: 
          - "sh"
          - "-c"
          - |
            mount -o remount rw /proc/sys
            sysctl -w net.core.somaxconn=65535
            sysctl -w net.ipv4.ip_local_port_range="1024 65535"
            apk update
            apk add bash curl wrk
            echo <Controller Endpoint IP> test.example.com >> /etc/hosts
            wrk --timeout 20s -d 20m -c 1024 -t 1024 http://test.example.com
        securityContext:
          capabilities:
            drop:
            - ALL
            add:
            - SYS_ADMIN

Observe the memory usage of the controller pod.
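
One possible way to track memory during the run is to scrape the controller's metrics endpoint periodically and read a resident-memory gauge. The sketch below only covers the parsing step and makes assumptions: it takes already-fetched text in the Prometheus exposition format, and it looks for `process_resident_memory_bytes`, the gauge name commonly exported by Prometheus client libraries, which may not match what a given controller build exposes. The `GaugeValue` function and the sample payload are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// GaugeValue scans Prometheus text exposition output for an unlabeled
// sample called name and returns its value. Comment lines (# HELP, # TYPE)
// and samples of other metrics are skipped.
func GaugeValue(exposition, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == name {
			v, err := strconv.ParseFloat(fields[1], 64)
			if err != nil {
				return 0, false
			}
			return v, true
		}
	}
	return 0, false
}

func main() {
	// Made-up payload standing in for one scrape of the metrics endpoint.
	sample := "# TYPE process_resident_memory_bytes gauge\n" +
		"process_resident_memory_bytes 1.34217728e+08\n"

	if v, ok := GaugeValue(sample, "process_resident_memory_bytes"); ok {
		fmt.Printf("resident memory: %.1f MiB\n", v/(1<<20))
	}
}
```

Comparing this value across scrapes before and after the change should show whether memory still grows with request volume.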

Checklist:

My change requires a change to the documentation.
I have updated the documentation accordingly.
I've read the CONTRIBUTION guide
I have added tests to cover my changes.
All new and existing tests passed.

k8s-ci-robot · 2022-03-29T12:01:02Z

@Lyt99: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-03-29T12:01:03Z

Hi @Lyt99. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-04-13T02:16:54Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Lyt99
To complete the pull request process, please ask for approval from elvinefendi after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tao12345666333 · 2022-04-13T02:34:42Z

Thanks, it's on my list, I will review it asap

/assign

Lyt99 · 2022-05-31T07:06:49Z

Hi @tao12345666333, any progress here?

tao12345666333 · 2022-06-24T16:44:30Z

/ok-to-test

Lyt99 · 2022-06-25T12:57:40Z

BTW, #8726 added a new Summary-type metric. I'm afraid that feature will trigger the same problem.

k8s-triage-robot · 2022-09-23T13:57:36Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot · 2022-09-30T18:34:35Z

@Lyt99: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2022-10-30T19:33:01Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Lyt99 · 2022-10-31T02:48:39Z

Closing; please check out #8285 for the reason.

k8s-ci-robot added the cncf-cla: yes and needs-triage labels Mar 29, 2022

k8s-ci-robot added the needs-ok-to-test label Mar 29, 2022

k8s-ci-robot added the needs-kind, needs-priority, and size/M labels Mar 29, 2022

k8s-ci-robot requested review from ElvinEfendi and strongjz March 29, 2022 12:01

k8s-ci-robot assigned ElvinEfendi Mar 29, 2022

k8s-ci-robot added the needs-rebase label Apr 13, 2022

Lyt99 force-pushed the metrics-use-histogram branch from 452eaf1 to 1935251 April 13, 2022 02:16

k8s-ci-robot removed the needs-rebase label Apr 13, 2022

k8s-ci-robot assigned tao12345666333 Apr 13, 2022

k8s-ci-robot added the needs-rebase label Jun 23, 2022

Lyt99 force-pushed the metrics-use-histogram branch from 1935251 to d432a57 June 23, 2022 06:56

k8s-ci-robot removed the needs-rebase label Jun 23, 2022

Commit 8d1609b: Change ingress_upstream_latency_seconds from Summary to Histogram

Lyt99 force-pushed the metrics-use-histogram branch from d432a57 to 8d1609b June 23, 2022 07:08

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Jun 24, 2022

k8s-ci-robot added the lifecycle/stale label Sep 23, 2022

k8s-ci-robot added the needs-rebase label Sep 30, 2022

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Oct 30, 2022

Lyt99 closed this Oct 31, 2022
