OCPBUGS-18971: limit number of simultaneous client requests #76
Conversation
This change adds a hard-coded limit of 100 simultaneous client requests against the Prometheus service. It avoids exhausting the number of TCP connections for clusters with many running containers when a client asks for pod metrics across all namespaces (the number of requests is equal to twice the number of pods).
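For illustration, here is a minimal sketch of one way to implement such a limit in Go, using a buffered channel as a counting semaphore. Only the limit of 100 comes from this PR; the limitedClient type is hypothetical, and GenericAPIClient/APIResponse stand in for the repo's existing client types (assumed to expose Do(ctx, verb, endpoint, body)):

package client

import (
	"context"
	"fmt"
	"io"
)

// maxConcurrentRequests caps the number of in-flight requests against
// the Prometheus service.
const maxConcurrentRequests = 100

// limitedClient wraps an existing API client; the buffered channel acts
// as a counting semaphore.
type limitedClient struct {
	inner GenericAPIClient
	sem   chan struct{} // capacity: maxConcurrentRequests
}

func newLimitedClient(inner GenericAPIClient) *limitedClient {
	return &limitedClient{
		inner: inner,
		sem:   make(chan struct{}, maxConcurrentRequests),
	}
}

// Do blocks until a semaphore slot is free or the context expires, then
// delegates to the wrapped client and releases the slot when done.
func (c *limitedClient) Do(ctx context.Context, verb, endpoint string, body io.Reader) (APIResponse, error) {
	select {
	case c.sem <- struct{}{}:
		defer func() { <-c.sem }()
	case <-ctx.Done():
		return APIResponse{}, fmt.Errorf("waiting for a request slot: %w", ctx.Err())
	}
	return c.inner.Do(ctx, verb, endpoint, body)
}

Releasing the slot in a defer frees it even if the wrapped call panics, and selecting on ctx.Done() while waiting means callers time out cleanly instead of queuing forever.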
/cc @machine424
/cc @dgrisonnet
This looks fine to me, but I am still pretty concerned about allowing customers to use VPA with prometheus-adapter. It might have side effects that they don't expect and disrupt their workloads.
pkg/client/limiter.go (outdated)

)

func init() {
	legacyregistry.MustRegister(inflightRequests)
note: we will have to change the registry once kubernetes-sigs#599 is merged and rebased
Yes, it was a quick modification to verify that the change does what it's meant to do. I can remove it as it's not necessary for the fix.
On second thought, I've added a max_requests metric so we can even alert on the prometheus adapter being throttled (e.g. inflight/max == 1 for a significant time).
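For reference, registering such gauges through legacyregistry (as the limiter.go hunk above does for inflightRequests) could look roughly like this; the metric names are illustrative, not taken from the PR:

package client

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var (
	// inflightRequests tracks how many requests are currently waiting
	// for, or being served by, the Prometheus API.
	inflightRequests = metrics.NewGauge(&metrics.GaugeOpts{
		Name: "prometheus_adapter_inflight_requests", // illustrative name
		Help: "Current number of in-flight requests to Prometheus.",
	})
	// maxRequests exposes the configured ceiling, so that throttling can
	// be detected with a query like inflight/max == 1 over time.
	maxRequests = metrics.NewGauge(&metrics.GaugeOpts{
		Name: "prometheus_adapter_max_requests", // illustrative name
		Help: "Maximum number of concurrent requests to Prometheus.",
	})
)

func init() {
	legacyregistry.MustRegister(inflightRequests)
	legacyregistry.MustRegister(maxRequests)
	maxRequests.Set(maxConcurrentRequests)
}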
total.Add(1)

w.Write([]byte("{}"))
if r.URL.Path == "/nonblocking" {
Knowing that c.Do(ctx2, "GET", "/nonblocking", nil) would never reach this, why keep it? To make the test easier to debug?
To ensure that the test catches the issue when/if c.Do(ctx2, "GET", "/nonblocking", nil) isn't blocked on the client side.
Without /nonblocking, the test would be stuck as long as unblock isn't closed. But OK, this makes the test easier to debug.
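Piecing the quoted hunks together, the test server presumably looks something like this sketch (inside a test function, with net/http, net/http/httptest, and sync/atomic imported; total and unblock are assumptions based on the discussion):

var total atomic.Int64         // counts requests that reached the server
unblock := make(chan struct{}) // closed by the test to release blocked requests

srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	total.Add(1)
	// /nonblocking answers immediately; any other path hangs until the
	// test closes unblock, keeping the limiter's slots occupied.
	if r.URL.Path != "/nonblocking" {
		<-unblock
	}
	w.Write([]byte("{}"))
}))
defer srv.Close()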
pkg/client/limiter_test.go (outdated)

	}(i)
}

// Wait for the unblocked requests to hit the server.
Suggested change:

- // Wait for the unblocked requests to hit the server.
+ // Wait for the first maxConcurrentRequests requests to hit the server.

To avoid confusion with /nonblocking.
case ctx2.Err() == nil:
	t.Fatalf("expected %dth request to timeout", maxConcurrentRequests+2)
}
We can also add a check that total.Load() == maxConcurrentRequests in here: that will also prove that the query doesn't reach the server, and that the equality in line 71 wasn't just transient.
it should be detected by the Do() request returning without error.
The client's Do function may be faulty; it's better to check from the server side as well (+ check that the for total.Load() != maxConcurrentRequests condition wasn't transient).
The total.Load() != maxConcurrentRequests condition can't be transient because the test spawns exactly 100 connections and then waits for total == 100. Only after that does it start the request to /nonblocking.
it spawns maxConcurrentRequests + 1 = 101 (for i := 0; i < maxConcurrentRequests+1; i++) IIUC.
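For context, the shape of the test being discussed is roughly the following sketch, stitched together from the quoted hunks (the /blocking path name is made up; c, ctx, and total come from the surrounding test):

// Spawn maxConcurrentRequests+1 clients. The extra one parks inside the
// limiter waiting for a free slot, so only maxConcurrentRequests
// requests ever reach the server.
for i := 0; i < maxConcurrentRequests+1; i++ {
	go func(i int) {
		// The server holds these requests open until unblock is closed.
		_, _ = c.Do(ctx, "GET", fmt.Sprintf("/blocking/%d", i), nil)
	}(i)
}

// Spin until exactly maxConcurrentRequests requests have hit the server.
for total.Load() != maxConcurrentRequests {
}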
// Make one more blocked request which should timeout before hitting the server.
ctx2, _ := context.WithTimeout(ctx, time.Second)
_, err := c.Do(ctx2, "GET", "/nonblocking", nil)
Should we check that it returns an APIResponse{} as well?
since it returns an error, the first returned value is irrelevant IMHO.
Just a safeguard, in case the receiver doesn't start by checking the error.
Not sure I understand :)
We assume the function calling this Do doesn't use the result if there is an error, but what if that function doesn't check the error and uses a faulty response that Do returns?
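A sketch of the hazard being described; handle and the use of resp are hypothetical:

// A careless caller that drops the error on the floor.
resp, _ := c.Do(ctx, "GET", "/api/v1/query", nil)
// If Do guarantees a zero-valued APIResponse on failure, resp is merely
// empty here; if Do instead returned a half-built response, handle()
// would act on garbage.
handle(resp)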
Signed-off-by: Simon Pasquier <spasquie@redhat.com>

Force-pushed from fbbd511 to 562c57c.
// that it could fail the Kubelet liveness probes and lead to Prometheus pod
// restarts.
// The number has been chosen from empirical data.
const maxConcurrentRequests = 100
No power of 2, I'm disappointed :)
At worst a Prometheus pod will serve both prometheus-adapters (if the other Prometheus is down or the load balancer missed it), or even 3 or 4 in some weird rolling scenarios. That results in roughly 400 connections (4 × 100), below --web.max-connections=512, which leaves room in that queue and the other ones for the other clients.
While testing, I realized that even with 2 prometheus-adapter pods and 2 Prometheus pods running, all adapter requests can end up going to the same Prometheus because the service is configured with client IP affinity.
Yes, I hit similar cases as well while testing; I didn't dig into how the load balancing is set up.
I think 100 is a good value.
(Just some nits; especially if we don't touch the code in the future, we don't have to worry about regressions.)
@simonpasquier: This pull request references Jira Issue OCPBUGS-18971, which is invalid:

The bug has been updated to refer to the pull request using the external bug tracker. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/jira refresh
@simonpasquier: This pull request references Jira Issue OCPBUGS-18971, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

Requesting review from QA contact. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@simonpasquier: This pull request references Jira Issue OCPBUGS-18971, which is valid. 3 validation(s) were run on this bug.

Requesting review from QA contact. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/hold
for total.Load() != maxConcurrentRequests {
}

// Make one more request which should be blocked at the client level.
Just for my edification, why do the 101st request separately?
Because it's easier to understand what the test does, and L82 ensures that the first 100 requests have hit the server.
@simonpasquier: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
With @machine424, we checked on a live cluster with 1000 user namespaces + pods, with and without the PR.

8 concurrent pods.metrics list clients with this PR:

4 concurrent pods.metrics list clients without this PR:

The PR provides better request latency from a client standpoint. Without the PR, increasing the number of pods.metrics clients to 8 triggers liveness probe failures.
/hold cancel

It should be good for another round of review now.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424, simonpasquier

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@simonpasquier: Jira Issue OCPBUGS-18971: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-18971 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/cherrypick release-4.14
@simonpasquier: new pull request created: #77 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/jira refresh
@simonpasquier: Jira Issue OCPBUGS-18971 is in an unrecognized state (MODIFIED) and will not be moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Another implementation for OCPBUGS-18971.