revert cache to original state on evict and bind errors #909
What this PR does / why we need it:
Might be related to #891.
Special notes for your reviewer:
Hi @mateuszlitwin. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with
Once the patch is verified, the new status will be reflected by the
I understand the commands that are listed here.
@k82cn Will do.
BTW, the bug that happened on my cluster was that each time "Selected node NotReady" occurred during allocation binding, the task would end up in the Binding status even though the pod was Pending. This PR is an attempt to make Bind/Evict handle failures like that.
However I do not understand why "Selected node NotReady" would happen in the first place. It seems to me it can only happen if
EDIT: I suppose it was some race condition (SchedulerCache and Session having different nodes)
This change ensures that node, task and job info remain unchanged if an error occurs during a SchedulerCache.Bind or SchedulerCache.Evict call. Previously, if an error occurred during the binding phase (e.g. "Selected node NotReady"), the task could get stuck in the Binding status indefinitely while the real pod stayed in the Pending status.

- SchedulerCache.Evict and SchedulerCache.Bind now revert the task status on NodeInfo.UpdateTask and NodeInfo.AddTask errors.
- Modified the behavior of NodeInfo.AddTask: it now updates the task's node name upon successful addition, similar to how JobInfo.UpdateTaskStatus updates the task's status.
- Handle the unchecked error in JobInfo.UpdateTaskStatus.
- Log at FATAL level in NodeInfo.UpdateTask when an impossible situation happens: failing to add a task after removing a task from the node info.

Might be related to #891
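For reviewers who want the idea without reading the diff, here is a minimal sketch of the revert-on-error pattern described above. It is not the code from this PR: `Task`, `Job`, `Node`, `Bind` and `ErrNodeNotReady` are hypothetical, simplified stand-ins for the scheduler's TaskInfo, JobInfo, NodeInfo and SchedulerCache.Bind, and the readiness check is an assumed way to reproduce the "Selected node NotReady" failure.

```go
package main

import (
	"errors"
	"fmt"
)

// TaskStatus is a hypothetical stand-in for the scheduler's task status type.
type TaskStatus int

const (
	Pending TaskStatus = iota
	Binding
)

// Task, Job and Node are simplified, hypothetical stand-ins for
// TaskInfo, JobInfo and NodeInfo. They are not the real types.
type Task struct {
	Name     string
	Status   TaskStatus
	NodeName string
}

type Job struct {
	Tasks map[string]*Task
}

// UpdateTaskStatus sets the task's status and returns an error
// if the task does not belong to the job.
func (j *Job) UpdateTaskStatus(t *Task, s TaskStatus) error {
	if _, ok := j.Tasks[t.Name]; !ok {
		return fmt.Errorf("task %q not found in job", t.Name)
	}
	t.Status = s
	return nil
}

type Node struct {
	Name  string
	Ready bool
	Tasks map[string]*Task
}

// ErrNodeNotReady is an assumed error used to mimic "Selected node NotReady".
var ErrNodeNotReady = errors.New("selected node NotReady")

// AddTask records the task on the node. The task's node name is set
// only after the addition succeeds.
func (n *Node) AddTask(t *Task) error {
	if !n.Ready {
		return ErrNodeNotReady
	}
	n.Tasks[t.Name] = t
	t.NodeName = n.Name
	return nil
}

// Bind moves the task to Binding and places it on the node. If placement
// fails, the task's status is restored to its original value so the
// cached state keeps matching the real pod.
func Bind(j *Job, n *Node, t *Task) error {
	original := t.Status
	if err := j.UpdateTaskStatus(t, Binding); err != nil {
		return err
	}
	if err := n.AddTask(t); err != nil {
		// Revert the status change; report both errors if the revert fails too.
		if rerr := j.UpdateTaskStatus(t, original); rerr != nil {
			return fmt.Errorf("bind failed: %v; revert failed: %v", err, rerr)
		}
		return err
	}
	return nil
}

func main() {
	t := &Task{Name: "pod-1", Status: Pending}
	j := &Job{Tasks: map[string]*Task{t.Name: t}}
	n := &Node{Name: "node-1", Ready: false, Tasks: map[string]*Task{}}

	err := Bind(j, n, t)
	fmt.Println(err, t.Status == Pending, t.NodeName == "")
}
```

In this sketch a failed `Bind` leaves the task in Pending with no node name, rather than stuck in Binding. An Evict path would presumably follow the same shape: remember the original status, attempt the update, and restore on error.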
[APPROVALNOTIFIER] This PR is APPROVED
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@k82cn: GitHub didn't allow me to request PR reviews from the following users: sivanzcw, lminzhw.
Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.