OCPBUGS-18971: limit number of simultaneous client requests #76
Conversation
This change adds a hard-coded limit of 100 simultaneous client requests against the Prometheus service. It avoids exhausting the number of TCP connections for clusters with many running containers when a client asks for pod metrics across all namespaces (the number of requests is equal to twice the number of pods).
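For illustration, here is a minimal sketch of one way to implement such a limit in Go, using a buffered channel as a counting semaphore. Only the limit of 100 comes from this PR; the limitedClient type is hypothetical, and GenericAPIClient/APIResponse stand in for the repo's existing client types (assumed to expose Do(ctx, verb, endpoint, body)):

package client

import (
	"context"
	"fmt"
	"io"
)

// maxConcurrentRequests caps the number of in-flight requests against
// the Prometheus service.
const maxConcurrentRequests = 100

// limitedClient wraps an existing API client; the buffered channel acts
// as a counting semaphore.
type limitedClient struct {
	inner GenericAPIClient
	sem   chan struct{} // capacity: maxConcurrentRequests
}

func newLimitedClient(inner GenericAPIClient) *limitedClient {
	return &limitedClient{
		inner: inner,
		sem:   make(chan struct{}, maxConcurrentRequests),
	}
}

// Do blocks until a semaphore slot is free or the context expires, then
// delegates to the wrapped client and releases the slot when done.
func (c *limitedClient) Do(ctx context.Context, verb, endpoint string, body io.Reader) (APIResponse, error) {
	select {
	case c.sem <- struct{}{}:
		defer func() { <-c.sem }()
	case <-ctx.Done():
		return APIResponse{}, fmt.Errorf("waiting for a request slot: %w", ctx.Err())
	}
	return c.inner.Do(ctx, verb, endpoint, body)
}

Releasing the slot in a defer frees it even if the wrapped call panics, and selecting on ctx.Done() while waiting means callers time out cleanly instead of queuing forever.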
/cc @machine424
/cc @dgrisonnet
This looks fine to me, but I am still pretty concerned about allowing customers to use VPA with prometheus-adapter. It might have side effects that they don't expect and disrupt their workloads.
pkg/client/limiter.go (outdated)

)

func init() {
	legacyregistry.MustRegister(inflightRequests)
note: we will have to change the registry once kubernetes-sigs#599 is merged and rebased
Yes, it was a quick modification to verify that the change does what it's meant to do. I can remove it as it's not necessary for the fix.
On second thought, I've added a max_requests metric so we can even alert on the prometheus adapter being throttled (e.g. inflight/max == 1 for a significant time).
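For reference, registering such gauges through legacyregistry (as the limiter.go hunk above does for inflightRequests) could look roughly like this; the metric names are illustrative, not taken from the PR:

package client

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var (
	// inflightRequests tracks how many requests are currently waiting
	// for, or being served by, the Prometheus API.
	inflightRequests = metrics.NewGauge(&metrics.GaugeOpts{
		Name: "prometheus_adapter_inflight_requests", // illustrative name
		Help: "Current number of in-flight requests to Prometheus.",
	})
	// maxRequests exposes the configured ceiling, so that throttling can
	// be detected with a query like inflight/max == 1 over time.
	maxRequests = metrics.NewGauge(&metrics.GaugeOpts{
		Name: "prometheus_adapter_max_requests", // illustrative name
		Help: "Maximum number of concurrent requests to Prometheus.",
	})
)

func init() {
	legacyregistry.MustRegister(inflightRequests)
	legacyregistry.MustRegister(maxRequests)
	maxRequests.Set(maxConcurrentRequests)
}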
total.Add(1)

w.Write([]byte("{}"))
if r.URL.Path == "/nonblocking" {
Knowing that c.Do(ctx2, "GET", "/nonblocking", nil) would never reach this, why keep it? To make the test easier to debug?
To ensure that the test catches the issue when/if c.Do(ctx2, "GET", "/nonblocking", nil) isn't blocked on the client side.
Without /nonblocking, the test would be stuck as long as unblock isn't closed. But OK, this makes the test easier to debug.
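Piecing the quoted hunks together, the test server presumably looks something like this sketch (inside a test function, with net/http, net/http/httptest, and sync/atomic imported; total and unblock are assumptions based on the discussion):

var total atomic.Int64         // counts requests that reached the server
unblock := make(chan struct{}) // closed by the test to release blocked requests

srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	total.Add(1)
	// /nonblocking answers immediately; any other path hangs until the
	// test closes unblock, keeping the limiter's slots occupied.
	if r.URL.Path != "/nonblocking" {
		<-unblock
	}
	w.Write([]byte("{}"))
}))
defer srv.Close()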
pkg/client/limiter_test.go (outdated)

	}(i)
}

// Wait for the unblocked requests to hit the server.
Suggested change:

- // Wait for the unblocked requests to hit the server.
+ // Wait for the first maxConcurrentRequests requests to hit the server.

To avoid confusion with /nonblocking.
case ctx2.Err() == nil:
	t.Fatalf("expected %dth request to timeout", maxConcurrentRequests+2)
}
We can also add a check that total.Load() == maxConcurrentRequests in here: that will also prove that the query doesn't reach the server, and that the equality in line 71 wasn't just transient.
it should be detected by the Do() request returning without error.
The client's Do function may be faulty; it's better to check from the server side as well (+ check that the for total.Load() != maxConcurrentRequests condition wasn't transient).
The total.Load() != maxConcurrentRequests condition can't be transient because the test spawns exactly 100 connections and then waits for total == 100. Only after that does it start the request to /nonblocking.
it spawns maxConcurrentRequests + 1 = 101 (for i := 0; i < maxConcurrentRequests+1; i++) IIUC.
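For context, the shape of the test being discussed is roughly the following sketch, stitched together from the quoted hunks (the /blocking path name is made up; c, ctx, and total come from the surrounding test):

// Spawn maxConcurrentRequests+1 clients. The extra one parks inside the
// limiter waiting for a free slot, so only maxConcurrentRequests
// requests ever reach the server.
for i := 0; i < maxConcurrentRequests+1; i++ {
	go func(i int) {
		// The server holds these requests open until unblock is closed.
		_, _ = c.Do(ctx, "GET", fmt.Sprintf("/blocking/%d", i), nil)
	}(i)
}

// Spin until exactly maxConcurrentRequests requests have hit the server.
for total.Load() != maxConcurrentRequests {
}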
// Make one more blocked request which should timeout before hitting the server.
ctx2, _ := context.WithTimeout(ctx, time.Second)
_, err := c.Do(ctx2, "GET", "/nonblocking", nil)
Should we check that it returns an APIResponse{} as well?
since it returns an error, the first returned value is irrelevant IMHO.
Just a safeguard, in case the receiver doesn't start by checking the error.
Not sure I understand :)
We assume the function calling this Do doesn't use the result if there is an error, but what if that function doesn't check the error and uses a faulty response that Do returns?
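A sketch of the hazard being described; handle and the use of resp are hypothetical:

// A careless caller that drops the error on the floor.
resp, _ := c.Do(ctx, "GET", "/api/v1/query", nil)
// If Do guarantees a zero-valued APIResponse on failure, resp is merely
// empty here; if Do instead returned a half-built response, handle()
// would act on garbage.
handle(resp)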
Signed-off-by: Simon Pasquier <spasquie@redhat.com>

Force-pushed from fbbd511 to 562c57c.
// that it could fail the Kubelet liveness probes and lead to Prometheus pod
// restarts.
// The number has been chosen from empirical data.
const maxConcurrentRequests = 100
No power of 2, I'm disappointed :)
At worst a Prometheus pod will serve both prometheus-adapters (if the other Prometheus is down or the load balancer missed it), or even 3 or 4 in some weird rolling scenarios. That results in roughly 400 connections (4 × 100), below --web.max-connections=512, which leaves room in that queue and the other ones for the other clients.
While testing, I realized that even with 2 prometheus-adapter pods and 2 Prometheus pods running, all adapter requests can end up going to the same Prometheus because the service is configured with client IP affinity.
Yes, I hit similar cases as well while testing; I didn't dig into how the load balancing is set up.
I think 100 is a good value.
(Just some nits; especially if we don't touch the code in the future, we don't have to worry about regressions.)
@simonpasquier: This pull request references Jira Issue OCPBUGS-18971, which is invalid:

The bug has been updated to refer to the pull request using the external bug tracker. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/jira refresh
@simonpasquier: This pull request references Jira Issue OCPBUGS-18971, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

Requesting review from QA contact. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@simonpasquier: This pull request references Jira Issue OCPBUGS-18971, which is valid. 3 validation(s) were run on this bug.

Requesting review from QA contact. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/hold
for total.Load() != maxConcurrentRequests {
}

// Make one more request which should be blocked at the client level.
Just for my edification, why do the 101st request separately?
Because it's easier to understand what the test does, and L82 ensures that the first 100 requests have hit the server.
@simonpasquier: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
With @machine424, we checked on a live cluster with 1000 user namespaces + pods, with and without the PR.

8 concurrent pods.metrics list clients with this PR:

4 concurrent pods.metrics list clients without this PR:

The PR provides better request latency from a client standpoint. Without the PR, increasing the number of pods.metrics clients to 8 triggers liveness probe failures.
/hold cancel

It should be good for another round of review now.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424, simonpasquier

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@simonpasquier: Jira Issue OCPBUGS-18971: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-18971 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/cherrypick release-4.14
@simonpasquier: new pull request created: #77 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/jira refresh
@simonpasquier: Jira Issue OCPBUGS-18971 is in an unrecognized state (MODIFIED) and will not be moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Another implementation for OCPBUGS-18971.