This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
Scenario
Some low-utilization jobs may block other jobs from being submitted. We'd like to have a service plugin which can:
detect the utilization of all jobs
notify users whose jobs have had low utilization over the past few days (default: below 20% over 5 days, customizable)
if the user has a justification for the job's usage, the admin can extend the job's lifetime; otherwise, the low-utilization job will be killed automatically after 1 day.
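The notify/extend/kill flow described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `JobUsage`, `Action`, and `decide_action` are hypothetical names and not part of OpenPAI, the average utilization over the 5-day window is assumed to be computed elsewhere, and the 1-day grace period is assumed to start when the user is notified.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    NOTIFY = "notify"
    KILL = "kill"


@dataclass
class JobUsage:
    """Hypothetical per-job record; field names are assumptions."""
    job_name: str
    user: str
    avg_gpu_utilization: float  # percent, averaged over the look-back window
    notified_at: Optional[datetime] = None     # when the user was warned
    extended_until: Optional[datetime] = None  # admin-granted extension


def decide_action(
    job: JobUsage,
    now: datetime,
    threshold: float = 20.0,             # default from the proposal: 20%
    grace: timedelta = timedelta(days=1),  # default from the proposal: 1 day
) -> Action:
    """Return what the plugin should do with one job right now."""
    # An admin extension protects the job until it expires.
    if job.extended_until is not None and now < job.extended_until:
        return Action.NONE
    # Jobs at or above the threshold are left alone.
    if job.avg_gpu_utilization >= threshold:
        return Action.NONE
    # First detection of low utilization: warn the user.
    if job.notified_at is None:
        return Action.NOTIFY
    # Already warned and the grace period has passed: kill the job.
    if now - job.notified_at >= grace:
        return Action.KILL
    # Warned, but still inside the grace period.
    return Action.NONE
```

A scheduler could call `decide_action` for every running job on each pass and dispatch on the returned `Action`; how notifications are sent and how jobs are stopped is left out on purpose.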
An alternative implementation: provide an incentive model that grants bonus tokens to users, and let each user decide how to spend their GPU hours.
Add job start time and GPU hours to the job utilization. Currently, rest-server returns only the job submission time and the job completion time; it does not return the time the job started running. Refer to Job API not return correct appLaunchedTime #4295
Change user GPU utilization to a weighted average. Because the REST API currently computes job duration as completion time minus submission time, rather than completion time minus start-running time, the weighted average might not be correct. Refer to Job API not return correct appLaunchedTime #4295
Add the date to the email notification's title, e.g. change "pai cluster utilization" to "pai cluster utilization - 3.17"
Add a status column showing the job status at the time the report is generated
Add a GPU count column showing the number of GPUs used by the job
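To illustrate why the start-running time matters for the weighted average, here is a minimal sketch that weights each job's utilization by its GPU hours. The weighting scheme (GPU count times running duration) is an assumption, since the items above do not spell out the weights, and `JobRecord`, `gpu_hours`, and `weighted_user_utilization` are hypothetical names rather than rest-server fields.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class JobRecord:
    """Hypothetical job record; field names are assumptions."""
    gpu_count: int
    avg_gpu_utilization: float  # percent, averaged over the job's run
    started_at: datetime        # time the job started running, not submission time
    completed_at: datetime


def gpu_hours(job: JobRecord) -> float:
    """GPU hours = number of GPUs x hours actually spent running."""
    hours = (job.completed_at - job.started_at).total_seconds() / 3600.0
    return job.gpu_count * max(hours, 0.0)


def weighted_user_utilization(jobs: Iterable[JobRecord]) -> Optional[float]:
    """Average utilization across a user's jobs, weighted by GPU hours.

    Returns None when the jobs consumed no GPU hours at all.
    """
    total_hours = 0.0
    weighted_sum = 0.0
    for job in jobs:
        hours = gpu_hours(job)
        total_hours += hours
        weighted_sum += hours * job.avg_gpu_utilization
    if total_hours == 0.0:
        return None
    return weighted_sum / total_hours
```

If `started_at` were replaced by the submission time, any time a job spent waiting in the queue would inflate its weight, which is the inaccuracy the item above points at.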
Enable debugging mode for the debug VC.
(debugging mode: users can SSH into the node and use it for debugging for 1~2 hours; the system automatically disconnects the node when time is up.)
Prototype for enabling a cluster-level policy for job management
(the prototype: disable the SSH port. Jobs are automatically killed if their utilization stays below 20% for 1~2 hours)