Too long time needed for a job to be killed after exceeding relatively short ActiveDeadlineSeconds #32149

Closed
soltysh opened this issue Sep 6, 2016 · 6 comments · Fixed by #48454

Comments

Contributor

@soltysh soltysh commented Sep 6, 2016

If you set a relatively short .spec.activeDeadlineSeconds, say 60 seconds (or generally anything less than the full resync period, which is 10 minutes), and your job stabilizes after the first few seconds, in other words no new pods are created and nothing else involves the job controller, you end up waiting more than 10 minutes before the Job is actually killed for exceeding ADS.
The reason is that the controller synchronizes an object only when the job itself or one of its underlying pods changes, or when a full resync is performed, which happens once every 10 minutes. The result is an unnecessarily long wait for that longer timeout.

Possible fixes:

  1. Mark jobs with ADS set and have them fully resynced more often.

  2. Queue jobs with ADS set, so they are resynced more often.

    Other possibilities?

@erictune @janetkuo ideas?

Contributor Author

@soltysh soltysh commented Sep 6, 2016

The problem showed up in e2e tests, which bloated to almost 11 minutes while waiting for the next resync; see #31973 for the fix on the test side.

Member

@erictune erictune commented Sep 6, 2016

Job controller needs to have a priority queue of upcoming deadlines and the corresponding job objects, and each time the sync loop runs, it needs to pop all passed deadlines, verify the job still has the deadline, and then delete the job.

Contributor Author

@soltysh soltysh commented Sep 13, 2016

When fixing this issue make sure to remove the hack introduced when fixing e2e in #31973.

Member

@janetkuo janetkuo commented Apr 26, 2017

We could enqueue after a certain amount of time (use delaying queue) based on activeDeadlineSeconds.

Contributor Author

@soltysh soltysh commented Apr 27, 2017

Yup, that's the plan, basically.

Contributor

@weiwei04 weiwei04 commented Jul 4, 2017

Opened a PR to fix this issue, @soltysh @erictune

k8s-github-robot pushed a commit that referenced this issue Aug 29, 2017
Automatic merge from submit-queue (batch tested with PRs 44719, 48454)

check job ActiveDeadlineSeconds

**What this PR does / why we need it**:

enqueue a sync task after ActiveDeadlineSeconds

**Which issue this PR fixes**:

fixes #32149

**Special notes for your reviewer**:

**Release note**:

```release-note
enqueue a sync task to wake up jobcontroller to check job ActiveDeadlineSeconds in time
```