[Bug]: Container size not recognized when CPU limit is not set #1730

Closed
tkolo opened this issue Sep 1, 2023 · 3 comments · Fixed by #1739
Labels

  • feature/ds-projects: Data Science Projects feature (formerly Data Science Groupings - DSG)
  • kind/bug: Something isn't working
  • priority/normal: An issue with the product; fix when possible
  • rhods-1.33

Comments


tkolo commented Sep 1, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Deploy type

Downstream version (e.g. RHODS 1.29)

Version

RHODS 1.31

Current Behavior

When I modify workspace sizes to ones that set both requests but limit only memory usage, the selected size is applied correctly, but it is later not recognized by the ODS dashboard.

Expected Behavior

Workspace sizes should be correctly recognized in the dashboard.

Steps To Reproduce

  1. Install RHODS/OpenDataHub
  2. In the dashboard config, modify the notebook sizes to include sizes that do not limit CPU (see the sketch after this list)
  3. Create a new data science project
  4. Add a workspace with the new size
  5. Notice that on the project dashboard the new workspace is visible with an unknown size
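
To make step 2 concrete, here is a minimal sketch of the kind of size entry involved and of how an "Unknown" size could arise. The field names follow spec.notebookSizes in OdhDashboardConfig as I understand them, and findSize is purely hypothetical, not the dashboard's actual matching code:

```typescript
// Sketch only: a notebook size entry of the kind configured under
// spec.notebookSizes in OdhDashboardConfig, as the dashboard would consume it.
// Treat the exact field names as an assumption, not a verified schema.
type ResourceValues = { cpu?: string; memory?: string };

interface NotebookSize {
  name: string;
  resources: {
    requests?: ResourceValues;
    limits?: ResourceValues;
  };
}

// The kind of size described in step 2: both requests set, memory limited,
// but no CPU limit.
const noCpuLimitSize: NotebookSize = {
  name: 'Medium (no CPU limit)',
  resources: {
    requests: { cpu: '2', memory: '8Gi' },
    limits: { memory: '8Gi' }, // limits.cpu intentionally omitted
  },
};

// Hypothetical lookup (not the dashboard's real code) showing how "Unknown"
// can appear: if the comparison requires limits.cpu to be present on the
// running container, a workbench created from a size without a CPU limit
// never matches any configured entry.
const findSize = (container: NotebookSize['resources'], sizes: NotebookSize[]): string =>
  sizes.find(
    (s) =>
      container.limits?.cpu !== undefined &&
      container.limits?.cpu === s.resources.limits?.cpu &&
      container.limits?.memory === s.resources.limits?.memory,
  )?.name ?? 'Unknown';

// Under such a check, even the size's own resources fail to match it:
const shown = findSize(noCpuLimitSize.resources, [noCpuLimitSize]); // "Unknown"
```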

Workaround (if any)

Set CPU limits

What browsers are you seeing the problem on?

Firefox, Chrome, Microsoft Edge

Anything else

My reasoning behind not setting a CPU limit comes from this blog post: https://home.robusta.dev/blog/stop-using-cpu-limits

tkolo added the kind/bug, priority/normal, and untriaged labels on Sep 1, 2023

tkolo commented Sep 1, 2023

Similar issues: #1513, #826

manaswinidas added the feature/ds-projects label and removed the untriaged label on Sep 6, 2023
shalberd commented

Hmm, well, not setting limits is lazy; you just shift the problem handling to the VM / worker node level and allow containers to grab resources. Maybe you also have to differentiate here between CPU and memory ...
Regarding the argument "This is what happens when you have CPU limits. Resources are available but you aren't allowed to use them": you typically monitor average workloads with Prometheus, set up alerts, and on that basis set your limit and request. Also, resources are then available for other containers / pods to use, which is of a lot of value in dynamic scheduling environments. Think pipeline task containers, for example.
It would be interesting to hear what others think of this.
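
As a purely hypothetical illustration of differentiating CPU and memory while still deriving requests from monitored averages: keep a memory limit with headroom, and treat the CPU limit as the part that is up for debate. The values below are placeholders, not recommendations:

```typescript
// Hypothetical workbench resources; values are placeholders imagined as
// derived from Prometheus observations of average usage.
const workbenchResources = {
  requests: {
    cpu: '500m',   // around the observed average CPU usage
    memory: '2Gi', // around the observed average working set
  },
  limits: {
    memory: '4Gi', // hard cap with headroom; exceeding it means the OOM killer
    // cpu: '2',   // the contested part: omitting it allows bursting, setting it caps noisy neighbours
  },
};
```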
About the dashboard bug: yes, good that it was fixed.


tkolo commented Sep 13, 2023

I agree that memory should have limits set; the mentioned article seems to agree on that point as well. Although this is also a double-edged sword: I've noticed that often, when a process attempts to exceed its limit, it is simply OOM-killed rather than just prevented from growing.

As for CPU, I'd say that whether setting limits is helpful or not largely depends on the cluster's use case. If it's a production cluster where some workloads must be protected from resource starvation and it's advisable to prevent any form of CPU starvation, then sure, limits are probably a good idea, especially since Kubernetes will grant different QoS classes to pods depending on how requests/limits are set.
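
For reference, a short sketch of how the QoS class follows from requests/limits; the objects are illustrative, the classification rules are the standard Kubernetes ones:

```typescript
// Illustrative container resource specs and the QoS class Kubernetes assigns
// to a pod whose containers use them.

// Guaranteed: CPU and memory limits are set and equal to the requests.
const guaranteed = {
  requests: { cpu: '2', memory: '8Gi' },
  limits: { cpu: '2', memory: '8Gi' },
};

// Burstable: some requests/limits are set but the Guaranteed criteria are not
// met, e.g. requests plus a memory limit and no CPU limit (this issue's setup).
const burstable = {
  requests: { cpu: '2', memory: '8Gi' },
  limits: { memory: '8Gi' },
};

// BestEffort: no requests or limits at all; typically first in line for
// eviction when the node comes under resource pressure.
const bestEffort = {};
```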

In my use case, however, we're talking about a bare-metal cluster that is exclusively dedicated to data analytics (hence the usage of RHODS 🙂). In that case, as long as I have some guarantee that the cluster won't starve its core components to death (it's a single-master-node cluster... for now), which CPU and memory requests alone seem to provide, I don't care if my data analyst's notebook takes 6 cores or 126 cores. As long as it doesn't excessively affect other workloads and any excess resources are split evenly between the loads that request them, it's all good.
That of course doesn't mean that proper monitoring isn't needed; however, in practice my main concern right now is resource overbooking rather than over-consumption. For example, the oauth-proxy in the odh-dashboard deployment requests 1 GB of RAM and half a core of CPU, despite needing a fraction of these resources in practice. Why that particular deployment scales itself to 5 instances in my cluster is also beyond me; perhaps I'll fix both in a future PR 🙂.
