
OCPBUGS-18971: limit number of simultaneous client requests #76

Merged
3 commits merged into openshift:master on Oct 9, 2023

Conversation

simonpasquier

Another implementation for OCPBUGS-18971.

This change adds a hard-coded limit of 100 simultaneous client requests
against the Prometheus service. It avoids exhausting TCP connections to
Prometheus on clusters with lots of running containers when a client asks
for pod metrics across all namespaces (the number of requests is twice
the number of pods).
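For illustration, here is a minimal sketch of the client-side limiting approach described above. The names limitedClient and newLimitedClient are placeholders invented for this example and are not the actual code in pkg/client/limiter.go; only the maxConcurrentRequests constant and its value come from the PR.

```go
package client

import (
	"net/http"
)

// maxConcurrentRequests caps the number of in-flight requests to Prometheus
// (the PR hard-codes this to 100).
const maxConcurrentRequests = 100

// limitedClient wraps an HTTP client with a counting semaphore so that at
// most maxConcurrentRequests requests are in flight at any time.
type limitedClient struct {
	inner *http.Client
	sem   chan struct{} // buffered channel used as a semaphore
}

func newLimitedClient(inner *http.Client) *limitedClient {
	return &limitedClient{
		inner: inner,
		sem:   make(chan struct{}, maxConcurrentRequests),
	}
}

// Do blocks until a slot is free or the request's context is canceled, then
// forwards the request to the underlying client.
func (c *limitedClient) Do(req *http.Request) (*http.Response, error) {
	select {
	case c.sem <- struct{}{}: // acquire a slot
	case <-req.Context().Done():
		return nil, req.Context().Err()
	}
	defer func() { <-c.sem }() // release the slot when the request is done
	return c.inner.Do(req)
}
```

Acquiring the slot inside a select also means a queued request gives up as soon as its context expires instead of waiting forever behind the limit.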
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2023
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 2, 2023
@simonpasquier
Author

/cc @machine424

@openshift-ci openshift-ci bot requested a review from machine424 October 2, 2023 14:10
@simonpasquier
Author

/cc @dgrisonnet

@openshift-ci openshift-ci bot requested a review from dgrisonnet October 3, 2023 15:19
@dgrisonnet (Member) left a comment

This looks fine to me, but I am still pretty concerned about allowing customers to use VPA with prometheus-adapter. It might have side-effect that they don't expect and disrupt their workload.

)

func init() {
legacyregistry.MustRegister(inflightRequests)
Member

note: we will have to change the registry once kubernetes-sigs#599 is merged and rebased

Author

Yes it was a quick modification to verify that the change does what it's meant to do. I can remove it as it's not necessary for the fix.

Author

On second thought, I've added a max_requests metric so we can even alert on the prometheus adapter being throttled (e.g. inflight/max == 1 for a significant time).
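For illustration, a minimal sketch of what the inflight/max gauge pair could look like, assuming the maxConcurrentRequests constant added by the PR; the metric names below are invented for the example and are not necessarily the ones the PR registers.

```go
package client

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var (
	// Number of in-flight client requests to the Prometheus service.
	inflightRequests = metrics.NewGauge(&metrics.GaugeOpts{
		Name: "prometheus_adapter_inflight_client_requests",
		Help: "Number of in-flight client requests to the Prometheus service.",
	})

	// Configured maximum, exposed so that the saturation ratio can be
	// alerted on, e.g. inflight/max == 1 for a significant period of time.
	maxRequests = metrics.NewGauge(&metrics.GaugeOpts{
		Name: "prometheus_adapter_max_client_requests",
		Help: "Maximum number of simultaneous client requests to the Prometheus service.",
	})
)

func init() {
	legacyregistry.MustRegister(inflightRequests, maxRequests)
	maxRequests.Set(maxConcurrentRequests)
}
```

With both gauges exposed, an alert can fire when the ratio of the two stays at 1 for a sustained period, which is the throttling condition described above.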

total.Add(1)

w.Write([]byte("{}"))
if r.URL.Path == "/nonblocking" {

Knowing that c.Do(ctx2, "GET", "/nonblocking", nil) would never reach this, why keep it? To make the test easier to debug?

Author

To ensure that the test catches the issue when/if c.Do(ctx2, "GET", "/nonblocking", nil) isn't blocked on the client side.

Without the /nonblocking case, the test would be stuck as long as unblock isn't closed.
But OK, this does make the test easier to debug.

}(i)
}

// Wait for the unblocked requests to hit the server.

Suggested change
// Wait for the unblocked requests to hit the server.
// Wait for the first maxConcurrentRequests requests to hit the server.

To avoid confusion with /nonblocking.

case ctx2.Err() == nil:
t.Fatalf("expected %dth request to timeout", maxConcurrentRequests+2)
}

We can also add a check that total.Load() == maxConcurrentRequests here; that will also prove that the query doesn't reach the server and that the equality on line 71 wasn't just transient.

Author

it should be detected by the Do() request returning without error.

The client Do() function may be faulty; it's better to check from the server side as well (and to verify that the for total.Load() != maxConcurrentRequests check wasn't just transient).

Author

The total.Load() != maxConcurrentRequests check can't be transient because the test spawns exactly 100 connections and then waits for total == 100. Only after that does it start the request to /nonblocking.

it spawns maxConcurrentRequests + 1 = 101 (for i := 0; i < maxConcurrentRequests+1; i++) IIUC.


// Make one more blocked request which should timeout before hitting the server.
ctx2, _ := context.WithTimeout(ctx, time.Second)
_, err := c.Do(ctx2, "GET", "/nonblocking", nil)

should we check that it returns an APIResponse{} as well?

Author

since it returns an error, the first returned value is irrelevant IMHO.

Just a safeguard, in case the receiver doesn't start by checking the error.

Author

Not sure I understand :)

We assume the function calling this Do doesn't use the result if there is an error, but what if that function doesn't check the error and uses a faulty response returned by Do?

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
// that it could fail the Kubelet liveness probes and lead to Prometheus pod
// restarts.
// The number has been chosen from empirical data.
const maxConcurrentRequests = 100

no power of 2, I'm disappointed :)

At worst, a Prometheus pod will serve both prom-adapter pods (if the other Prometheus is down or the LB misses it), or even 3 or 4 adapters in some weird rolling-update scenarios, which results in roughly 400 connections, below --web.max-connections=512. That leaves room in that queue, and in the other ones, for the other clients.

Author

While testing, I've realized that even with 2 prometheus-adapter pods and 2 Prometheus pods running, all adapter requests can end up going to the same Prometheus because the service is configured with client IP affinity.

Yes, I saw similar cases as well while testing; I didn't dig into how the LB is set up.
I think 100 is a good value.
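For context, a hypothetical sketch of the Kubernetes Service setting being referred to; this is not the actual Prometheus Service manifest from the cluster, just the relevant field.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// With ClientIP session affinity, every request from a given
	// prometheus-adapter pod is pinned to the same Prometheus pod, so one
	// Prometheus instance can end up serving all adapter traffic.
	spec := corev1.ServiceSpec{
		SessionAffinity: corev1.ServiceAffinityClientIP,
	}
	fmt.Println("session affinity:", spec.SessionAffinity)
}
```

With ClientIP affinity, each adapter pod keeps hitting the same Prometheus endpoint, so the worst case discussed above (all adapters against one Prometheus) is a realistic scenario rather than an edge case.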

@machine424

(Just some nits; especially if we don't touch this code in the future, we don't have to worry about regressions.)

@simonpasquier simonpasquier changed the title WIP: fix: limit number of simultaneous client requests OCPBUGS-18971: limit number of simultaneous client requests Oct 5, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 5, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 5, 2023
@openshift-ci-robot

@simonpasquier: This pull request references Jira Issue OCPBUGS-18971, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Another implementation for OCPBUGS-18971.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@simonpasquier
Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 5, 2023
@openshift-ci-robot

@simonpasquier: This pull request references Jira Issue OCPBUGS-18971, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

@simonpasquier: This pull request references Jira Issue OCPBUGS-18971, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

In response to this:

Another implementation for OCPBUGS-18971.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from juzhao October 5, 2023 15:29
@simonpasquier
Author

/hold
need to fix some nits.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 5, 2023
for total.Load() != maxConcurrentRequests {
}

// Make one more request which should be blocked at the client level.

Just for my edification, why do the 101st request separately?

Author

because it's easier to understand what the test does + L82 ensures that the first 100 requests have hit the server.
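Putting the pieces of this review thread together, here is a rough sketch of the test's overall shape. It reuses the limitedClient placeholder from the earlier sketch rather than the PR's actual client API, so names and details differ from the real test; it also folds in the suggested final check that the throttled request never reached the server.

```go
package client

import (
	"context"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"testing"
	"time"
)

func TestConcurrencyLimitSketch(t *testing.T) {
	var total atomic.Int64
	unblock := make(chan struct{})

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		total.Add(1)
		// Requests to /nonblocking return immediately; everything else
		// blocks until the test closes the unblock channel.
		if r.URL.Path != "/nonblocking" {
			<-unblock
		}
		w.Write([]byte("{}"))
	}))
	defer srv.Close()
	defer close(unblock)

	c := newLimitedClient(srv.Client())
	ctx := context.Background()

	// Saturate the limiter: maxConcurrentRequests requests reach the server
	// and block there, plus one extra request queued on the client side.
	for i := 0; i < maxConcurrentRequests+1; i++ {
		go func() {
			req, _ := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL+"/blocking", nil)
			_, _ = c.Do(req) // errors are irrelevant for the sketch
		}()
	}

	// Wait for the first maxConcurrentRequests requests to hit the server.
	for total.Load() != maxConcurrentRequests {
	}

	// One more request must be throttled on the client side and time out
	// before ever reaching the server.
	ctx2, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()
	req, _ := http.NewRequestWithContext(ctx2, http.MethodGet, srv.URL+"/nonblocking", nil)
	if _, err := c.Do(req); err == nil {
		t.Fatalf("expected the %dth request to time out", maxConcurrentRequests+2)
	}
	if got := total.Load(); got != maxConcurrentRequests {
		t.Fatalf("expected %d requests to have reached the server, got %d", maxConcurrentRequests, got)
	}
}
```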

@openshift-ci

openshift-ci bot commented Oct 6, 2023

@simonpasquier: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@simonpasquier
Author

With @machine424, we checked on a live cluster with 1000 user namespaces (and their pods), with and without the PR.

8 concurrent pods.metrics list clients with this PR:

thread 7 (last 10 calls): avg: 12.332307567s, max: 17.173590539s
thread 6 (last 10 calls): avg: 7.09612641s, max: 14.421384552s
thread 2 (last 10 calls): avg: 9.685171344s, max: 14.058853074s
thread 0 (last 10 calls): avg: 8.964397842s, max: 13.960817837s
thread 1 (last 10 calls): avg: 7.854924695s, max: 12.632421842s
thread 4 (last 10 calls): avg: 8.987394292s, max: 14.40615962s
thread 3 (last 10 calls): avg: 8.558045975s, max: 11.919687794s
thread 5 (last 10 calls): avg: 7.505023338s, max: 13.791409648s

4 concurrent pods.metrics list clients without this PR:

thread 2 (last 10 calls): avg: 10.388223139s, max: 18.269492054s
thread 1 (last 10 calls): avg: 15.003234702s, max: 37.633859814s
thread 0 (last 10 calls): avg: 15.600047318s, max: 37.038546625s
thread 2 (last 10 calls): avg: 16.452837928s, max: 31.755846445s

The PR provides better request latency from a client standpoint. Without the PR, increasing the number of pods.metrics clients to 8 triggers liveness probe failures.

@simonpasquier
Author

/hold cancel

it should be good for another round of review now.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2023
@machine424

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 9, 2023
@openshift-ci

openshift-ci bot commented Oct 9, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424, simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot merged commit f5d8b0a into openshift:master Oct 9, 2023
6 checks passed
@openshift-ci-robot

@simonpasquier: Jira Issue OCPBUGS-18971: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-18971 has been moved to the MODIFIED state.

In response to this:

Another implementation for OCPBUGS-18971.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@simonpasquier simonpasquier deleted the ocpbugs-18971-take-2 branch October 9, 2023 13:25
@simonpasquier
Author

/cherrypick release-4.14

@openshift-cherrypick-robot

@simonpasquier: new pull request created: #77

In response to this:

/cherrypick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@simonpasquier
Author

/jira refresh

@openshift-ci-robot

@simonpasquier: Jira Issue OCPBUGS-18971 is in an unrecognized state (MODIFIED) and will not be moved to the MODIFIED state.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.