Describe the bug
oauth-proxy's CPU limit is far too low for it to answer liveness probes in a timely fashion once a certain number of routes have been created.
To Reproduce
Steps to reproduce the behavior:
Deploy about 80-120 inference models with routes.
A config that ALWAYS reproduces the problem: 6 namespaces, each with 2 model mesh pods and 800 inference models per namespace:
MINIO_MODEL_COUNT=800
NS_COUNT=6
Expected behavior
All model mesh pods are deployed, all inference service routes are created, and everything is ready to serve inference requests.
Actual behavior
oauth-proxy liveness probes are missed:
Warning Unhealthy 5m50s (x664 over 23h) kubelet Liveness probe failed: Get "https://10.131.0.35:8443/oauth/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
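For context, the probe that is timing out presumably looks something like the sketch below. The path, port, and scheme come straight from the logged URL; the timing values are the Kubernetes defaults, assumed here rather than read from the actual deployment. With a 100m CPU limit, a throttled container can easily take longer than one second to complete the TLS handshake and respond:

```yaml
livenessProbe:
  httpGet:
    path: /oauth/healthz   # from the logged URL above
    port: 8443             # from the logged URL above
    scheme: HTTPS
  timeoutSeconds: 1        # Kubernetes default; assumption, not read from the deployment
  periodSeconds: 10        # Kubernetes default; assumption, not read from the deployment
  failureThreshold: 3      # Kubernetes default; assumption, not read from the deployment
```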
The failed probes lead to container restarts; in fact, all model mesh instances (pods) are unstable because the oauth-proxy container keeps failing its liveness probes.
Environment (please complete the following information):
ODH
Additional context
The CPU limit (100m) is probably set far too low for an SSL-terminated endpoint.
Since flooding the endpoint might be enough to cause system instability and therefore a DoS, setting a much higher limit (or no CPU limit at all) would probably be better.
fcami added a commit to fcami/modelmesh-serving that referenced this issue on Feb 2, 2023:
With a CPU limit of 100m, oauth-proxy seems unable to cope with the
load associated with its own liveness probes, leading to the model
mesh pods being restarted every so often once a certain number of
routes (inference services) are created.
Raise the CPU limit from 100m to 2.
Raise the CPU request from 100m to 0.5.
Related-to: opendatahub-io#62
Related-to: opendatahub-io#16
Signed-off-by: François Cami <fcami@redhat.com>
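In resource-spec terms, the commit amounts to something like the sketch below (the container name is assumed for illustration; the CPU values are taken from the commit message):

```yaml
containers:
  - name: oauth-proxy   # container name assumed for illustration
    resources:
      requests:
        cpu: 500m       # raised from 100m, per the commit message
      limits:
        cpu: "2"        # raised from 100m to 2 full cores, per the commit message
```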