This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Rest Server] Update restart policy to avoid stuck pending pods #3856

Merged: abuccts merged 2 commits into master from xiongyf/update-restart-policy on Nov 18, 2019

abuccts (Member) commented on Nov 15, 2019

Update restart policy to avoid stuck pending pods #3760.

Update restart policy to avoid stuck pending pods #3760
@@ -319,7 +319,7 @@ const generateTaskRole = (taskRole, labels, config) => {
       },
       spec: {
         privileged: false,
-        restartPolicy: 'Never',
+        restartPolicy: gangAllocation === 'true' ? 'Never' : 'OnFailure',
Contributor

Could you add a comment for this? We should revert this change after we upgrade k8s to 1.16.2 or above.

Member Author

added
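
For readers following the diff, here is a minimal sketch of how the commented policy selection might look if pulled into a helper. The function name `selectRestartPolicy` is hypothetical; the PR itself keeps the expression inline in `generateTaskRole`, and the exact wording of the added comment is not shown in this conversation.

```javascript
// Hypothetical helper for illustration only; not part of the PR.
// Temporary workaround for stuck pending pods (#3760). Per the review
// comment, this should be reverted once k8s is upgraded.
const selectRestartPolicy = (gangAllocation) => {
  // gangAllocation is compared against the string 'true', as in the diff,
  // so a boolean true would fall through to 'OnFailure'.
  return gangAllocation === 'true' ? 'Never' : 'OnFailure';
};

console.log(selectRestartPolicy('true'));
console.log(selectRestartPolicy('false'));
```

Keeping the comparison against the string `'true'` mirrors the diff; any other value, including a boolean, selects `'OnFailure'`.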

Add comments
@abuccts abuccts merged commit 3a7a351 into master Nov 18, 2019
@abuccts abuccts deleted the xiongyf/update-restart-policy branch November 18, 2019 03:28
@yqwang-ms
Member

Users may not want to retry forever on failure. Could the alerting system automatically delete the pod when it detects such an issue? @Binyang2014

@Binyang2014
Contributor

> Users may not want to retry forever on failure. Could the alerting system automatically delete the pod when it detects such an issue? @Binyang2014

Maybe we can use an alert-manager webhook to achieve this: create another service that receives such alerts and takes some action.

For a better experience, the admin could configure how this webhook reacts when an alert is received.

For now, if we want to auto-delete the pod, I think we should find a way to let the admin know what will happen, and let the admin turn the feature off if it causes other issues.
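
The webhook idea above could be sketched roughly as follows. This is an illustration under assumptions, not an existing OpenPAI service: the alert name `PodStuckPending`, the `namespace` and `pod` labels on the alert, the config shape, and the `deletePod` callback are all hypothetical. Only the general payload shape (an `alerts` array whose entries carry `status` and `labels`) follows the Alertmanager webhook format.

```javascript
const http = require('http');

// Hypothetical admin-facing config: auto-delete is off by default so the
// admin has to opt in, and can turn it off again if it causes problems.
const defaultConfig = {
  autoDeleteEnabled: false,
  alertName: 'PodStuckPending', // assumed alert name, not from the PR
};

// Pure decision function: which pods would be deleted for this payload?
// Assumes the alert rule attaches `namespace` and `pod` labels.
const selectPodsToDelete = (payload, config) => {
  if (!config.autoDeleteEnabled) {
    return [];
  }
  const alerts = Array.isArray(payload && payload.alerts) ? payload.alerts : [];
  return alerts
    .filter((alert) => alert.status === 'firing')
    .filter((alert) => alert.labels && alert.labels.alertname === config.alertName)
    .filter((alert) => alert.labels.namespace && alert.labels.pod)
    .map((alert) => ({ namespace: alert.labels.namespace, pod: alert.labels.pod }));
};

// Builds the receiver without starting it. `deletePod` is a caller-supplied
// function (hypothetical) that performs the actual deletion.
const createReceiver = (config, deletePod) =>
  http.createServer((req, res) => {
    if (req.method !== 'POST') {
      res.writeHead(405);
      res.end();
      return;
    }
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      let payload;
      try {
        payload = JSON.parse(body);
      } catch (err) {
        res.writeHead(400);
        res.end();
        return;
      }
      const targets = selectPodsToDelete(payload, config);
      targets.forEach((target) => {
        // Log every action so the admin can see what the webhook did.
        console.log(`auto-deleting pod ${target.namespace}/${target.pod}`);
        deletePod(target);
      });
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ deleted: targets.length }));
    });
  });
```

Separating `selectPodsToDelete` from the HTTP handler keeps the deletion policy testable on its own, and the `autoDeleteEnabled` flag reflects the point that the admin should be able to switch the behavior off.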

Projects: Pure K8s based PAI (Awaiting triage)
Participants: 3