
oauth-proxy's CPU limit is far too low #62

Open

fcami opened this issue Feb 2, 2023 · 2 comments

fcami commented Feb 2, 2023

Describe the bug

Once a certain number of routes are created, oauth-proxy's CPU limit is far too low for it to answer its liveness probes in a timely fashion.

To Reproduce
Steps to reproduce the behavior:

Deploy about 80-120 inference models with routes.
A config that ALWAYS reproduces the problem: 6 namespaces, each with 2 model mesh pods, and 800 inference models per namespace (see the sketch below):
MINIO_MODEL_COUNT=800
NS_COUNT=6
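
A minimal sketch of driving such a configuration, assuming the two variables above are consumed by whatever deployment tooling is in use (the script name below is a placeholder, not part of this repository); the route count check uses the standard OpenShift CLI:

  # Placeholder driver; only the variable names and values come from this report.
  export MINIO_MODEL_COUNT=800   # inference models per namespace
  export NS_COUNT=6              # namespaces, each with 2 model mesh pods
  ./deploy-inference-models.sh   # hypothetical deployment script

  # Count the inference service routes that were created.
  oc get routes -A --no-headers | wc -l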

Expected behavior

All model mesh pods deployed, all inference service routes created, everything ready to serve inference requests.

Actual behavior

oauth liveness probes are missed:

  Warning  Unhealthy  5m50s (x664 over 23h)  kubelet  Liveness probe failed: Get "https://10.131.0.35:8443/oauth/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Leading to:

  Warning  BackOff    10m (x5667 over 22h)   kubelet  Back-off restarting failed container

And obviously:

modelmesh-serving-ovms-1.x-5bbbf88fdf-spxlw   4/5     CrashLoopBackOff   490 (3m27s ago)   23h

In fact, all model mesh instances (pods) are unstable, due to the oauth-proxy container failing its liveness probes.
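
For reference, a sketch of how the probe failures and restarts can be inspected; the namespace below is an assumption, while the pod name is the one from the output above:

  # Restart counts and per-container readiness of the model mesh pods.
  kubectl get pods -n modelmesh-serving

  # Liveness-probe failures and back-off events for one affected pod.
  kubectl describe pod modelmesh-serving-ovms-1.x-5bbbf88fdf-spxlw -n modelmesh-serving

  # All recent warning events in the namespace, newest last.
  kubectl get events -n modelmesh-serving --field-selector type=Warning --sort-by=.lastTimestamp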

Environment (please complete the following information):

ODH

  • quay.io/opendatahub/rest-proxy:v0.9.3-auth
  • openvino/model_server:2022.2
  • quay.io/opendatahub/modelmesh-runtime-adapter:v0.9.3-auth
  • quay.io/opendatahub/modelmesh:v0.9.3-auth
  • registry.redhat.io/openshift4/ose-oauth-proxy:v4.8

Additional context

The CPU limit is probably set far too low (100m) for an SSL-terminated endpoint.
Since flooding the endpoint might be enough to cause system instability and therefore a DoS, setting a much higher limit (or no CPU limit at all) would probably be better.
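
To confirm the current limit on a running deployment, something like the following should work (deployment and namespace names are assumptions inferred from the pod name above; only the 100m figure comes from this report):

  # Print the oauth-proxy container's resource requests and limits.
  kubectl get deployment modelmesh-serving-ovms-1.x -n modelmesh-serving \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="oauth-proxy")].resources}'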

fcami added a commit to fcami/modelmesh-serving that referenced this issue Feb 2, 2023
With a CPU limit of 100m, oauth-proxy seems unable to cope with the
load associated with its own liveness probes, leading to the model
mesh pods being restarted every so often once a certain number of
routes (inference services) are created.

Raise the CPU limit from 100m to 2.
Raise the CPU request from 100m to 0.5.

Related-to: opendatahub-io#62
Related-to: opendatahub-io#16
Signed-off-by: François Cami <fcami@redhat.com>
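
For illustration, a rough equivalent applied directly to a live cluster (deployment and namespace names are placeholders, and the modelmesh controller may reconcile such an edit away, so the durable fix is the manifest change described in the commit above):

  # Bump the oauth-proxy container's CPU request and limit in place.
  kubectl set resources deployment modelmesh-serving-ovms-1.x -n modelmesh-serving \
    -c oauth-proxy --requests=cpu=500m --limits=cpu=2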

heyselbi commented Dec 5, 2023

@fcami is this still an issue? What's the reason for a higher limit for oauth-proxy?


fcami commented Dec 5, 2023

The reason is explained in the original post.
I cannot test anymore, so please do what you will.
