
Add build timeout, by setting ActiveDeadlineSeconds on a build pod #4821

Merged (1 commit, Oct 14, 2015)

Conversation

@soltysh (Contributor) commented Sep 28, 2015

@bparees this partially addresses the problems mentioned in #341 by adding the possibility to time out a build. But there are two questions:

  1. The name of the field: I've chosen exactly the same name that is applied to a pod, ActiveDeadlineSeconds. Is that ok?
  2. Do we want to default it to some reasonable value, e.g. 30 minutes? For now it is not defaulted. If yes, will we allow disabling it by setting it to zero or a negative value?

One last thing: you specify this timeout in seconds, exactly as you do with ADS.
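For reference, a minimal sketch of the wiring this implies, assuming illustrative names: only pod.Spec.ActiveDeadlineSeconds is the real upstream Kubernetes field, while the helper, the build-side field, and the import paths are approximations rather than the code from this PR. When the build controller creates the build pod, the build-level value is copied onto the pod spec.

    import (
    	buildapi "github.com/openshift/origin/pkg/build/api"
    	kapi "k8s.io/kubernetes/pkg/api"
    )

    // Illustrative helper: copy the build-level timeout onto the generated
    // build pod. The kubelet enforces ActiveDeadlineSeconds by killing the
    // pod once it has been active longer than the given number of seconds.
    func applyBuildDeadline(pod *kapi.Pod, build *buildapi.Build) {
    	// The field name on the build side is still under discussion below.
    	if build.Spec.ActiveDeadlineSeconds != nil {
    		pod.Spec.ActiveDeadlineSeconds = build.Spec.ActiveDeadlineSeconds
    	}
    }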

@bparees (Contributor) commented Sep 28, 2015

@smarterclayton api review. Personally I think we might want to use a different name, MaxDuration or Timeout or something. ActiveDeadlineSeconds doesn't really convey what we're doing with it here. (Personally I read it as "deadline in seconds before this must be active", though I realize that's not what it means on the pod either.)

@soltysh thanks for doing this. Can you add a test (maybe extended or integration) that actually attempts to run a build with a short ADS and confirms it gets killed?

What state does the pod go to when it times out? Is it marked failed? Ultimately we want the build to be marked failed. I'm also wondering if there's any way to indicate on the build why it failed; otherwise it may be confusing to users.

This timeout is from when the pod is created until it terminates, so it includes any time spent pulling the image, right? What about the case where the image for the pod can't be found/pulled (so the pod keeps retrying the pull)? Will that eventually time out due to this value also?

@soltysh (Author) commented Sep 28, 2015

> What state does the pod go to when it times out? Is it marked failed? Ultimately we want the build to be marked failed. I'm also wondering if there's any way to indicate on the build why it failed; otherwise it may be confusing to users.

The pod state is DeadlineExceeded; we'd have to map it into a Build status, which is currently Failed. You can even observe the following events on the build:

  5m        5m      2   {kubelet openshiftdev.local}                        DeadlineExceeded    Pod was active on the node longer than specified deadline
  5m        5m      1   {kubelet openshiftdev.local}    spec.containers{sti-build}      killing         Killing with docker id bc00c09a36bc
  5m        5m      1   {kubelet openshiftdev.local}    implicitly required container POD   killing         Killing with docker id e15d170bb989

I'll fix it and add the e2e test once we agree on the field name. Besides, I forgot to update generate.go to copy the ADS over to the build, and I was wondering why it wasn't working for me just now 😉.

> This timeout is from when the pod is created until it terminates, so it includes any time spent pulling the image, right? What about the case where the image for the pod can't be found/pulled (so the pod keeps retrying the pull)? Will that eventually time out due to this value also?

The case where it takes too long to pull an image counts toward ADS, as you said. With not-found images it's different: the build controller checks at build creation whether the builder image exists, which ensures the image is present. But in all other cases ADS will actively kill the pod once it reaches the timeout.
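A rough sketch of the mapping described above, with approximate type and constant names (the real build controller code in origin is more involved): a pod killed for exceeding its deadline ends up in phase Failed with reason DeadlineExceeded, and that is translated into a failed build.

    // Illustrative only: translate a deadline-exceeded pod into a failed build.
    // Build status/message field names are approximations of the API of the time.
    func updateBuildFromPod(build *buildapi.Build, pod *kapi.Pod) {
    	if pod.Status.Phase == kapi.PodFailed {
    		build.Status = buildapi.BuildStatusFailed
    		if pod.Status.Reason == "DeadlineExceeded" {
    			// Surfacing the reason on the build itself is the follow-up
    			// work discussed below (see #4909).
    			build.Message = "Build pod was active on the node longer than the specified deadline"
    		}
    	}
    }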

@bparees (Contributor) commented Sep 28, 2015

> The pod state is DeadlineExceeded; we'd have to map it into a Build status, which is currently Failed. You can even observe the following events on the build:

Maybe it's just user training, but the events list probably isn't the most obvious way to indicate the outcome here. Maybe it's just a matter of the UIs (CLI, console) checking if that event exists and providing a better indication, though. I'd start by making it clearer in the describe() impl.

> The case where it takes too long to pull an image counts toward ADS, as you said. With not-found images it's different: the build controller checks at build creation whether the builder image exists, which ensures the image is present. But in all other cases ADS will actively kill the pod once it reaches the timeout.

I think you're describing the case where no imagestream is found, so we can't resolve the imagestream reference into a pull spec. I'm talking about the case where k8s attempts the docker pull and gets no image. That is after the build (and pod) are created; k8s will repeatedly retry the pull in that case. Does that fall under ADS? It sounds like yes.

@smarterclayton (Contributor):

Well, to be fair we should be updating the status of the build and build config to represent these states, which we have plenty of issues to cover. Rather than change the UI right now I'd prefer to start fixing that problem.


@soltysh (Author) commented Sep 29, 2015

@smarterclayton actually that was intended as the 2nd part of the issue: first introduce the timeout, which will let us address part of the problems. Then I'll extend the status with appropriate information, including the active killing. Does that sound reasonable?

@smarterclayton (Contributor):

Yes, I'd rather not change the UI at this moment as a workaround, and just fix the real problem.


@soltysh (Author) commented Oct 2, 2015

@bparees ready for final review. I've added extended tests as requested.
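Not the actual test from this PR, but a sketch of the shape such an extended test can take; the exutil helpers, the "timeout-bc" fixture, and the waitForBuildPhase poller below are assumptions rather than verified utilities from the origin tree of that time.

    import (
    	g "github.com/onsi/ginkgo"
    	o "github.com/onsi/gomega"

    	exutil "github.com/openshift/origin/test/extended/util"
    )

    // Hypothetical sketch: start a build whose completionDeadlineSeconds is
    // very small and assert it ends up Failed once the kubelet kills the pod.
    var _ = g.Describe("builds: completion deadline", func() {
    	oc := exutil.NewCLI("build-deadline", exutil.KubeConfigPath())

    	g.It("should fail a build that exceeds its deadline", func() {
    		// "timeout-bc" is a hypothetical BuildConfig fixture with
    		// completionDeadlineSeconds set to 1.
    		err := oc.Run("start-build").Args("timeout-bc").Execute()
    		o.Expect(err).NotTo(o.HaveOccurred())

    		// waitForBuildPhase is a hypothetical polling helper returning
    		// the terminal phase of the named build.
    		phase := waitForBuildPhase(oc, "timeout-bc-1")
    		o.Expect(phase).To(o.Equal("Failed"))
    	})
    })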

@soltysh (Author) commented Oct 2, 2015

@smarterclayton the status is already handled by @rhcarvalho in #4909

@openshift-bot:

continuous-integration/openshift-jenkins/test Waiting: Determining build queue position

@@ -58,6 +58,10 @@ type BuildSpec struct {

// Compute resource requirements to execute the build
Resources kapi.ResourceRequirements

// Optional duration in seconds relative to the StartTime that the build may be active on a node
Review comment (Contributor):

relative to the StartTime is not correct, as noted in the docs PR.

@bparees (Contributor) commented Oct 2, 2015

One minor nit, and I'd like to see tests pass; otherwise lgtm.

@rhcarvalho after this merges, please make sure that "oc get build foo" shows something informational for builds that were timed out (i.e. shows more than just "Failed").

@soltysh (Author) commented Oct 5, 2015

Fixed the comment and hoping for a re-test, since the previous failure was a make verify flake for some weird reason.

"maxDuration": {
"type": "integer",
"format": "int64",
"description": "optional duration in seconds the build may be active on a node relative to StartTime before the system will actively try to mark it failed and kill associated containers; value must be a positive integer"
Review comment (Contributor):

regen docs

@soltysh (Author) commented Oct 5, 2015

Right... my bad. Should be fixed now.

@bparees (Contributor) commented Oct 5, 2015

lgtm. @smarterclayton api review please.

@@ -13819,6 +13819,11 @@
"resources": {
"$ref": "v1.ResourceRequirements",
"description": "the desired compute resources the build should have"
},
"maxDuration": {
Review comment (Contributor):

Why isn't this activeDeadlineSeconds?

Review comment (Contributor):

The description is lifted exactly from activeDeadlineSeconds - that makes me think the name should be exactly activeDeadlineSeconds.

Review comment (@soltysh):

We've talked with @bparees about naming and decided to go with maxDuration, as it speaks more clearly to what the field does than activeDeadlineSeconds.
Other than that, if sometime in the future we decide to allow multi-pod builds, we'd still have maxDuration, which would leverage the job's maxDuration (or whatever it'll be called).

Review comment (Contributor):

Anything you name this has to have a suffix of "DeadlineSeconds" because that's the established convention in the API. You can call it buildDeadlineSeconds or activeDeadlineSeconds or (other variant), but it has to fit that pattern.


Review comment (@soltysh):

@bparees your choice; personally I'd go with buildDeadlineSeconds.

Review comment (Contributor):

But you have to justify why this field, which does exactly the same thing as activeDeadlineSeconds and even means the same thing, should be different.


Review comment (Contributor):

activeDeadlineSeconds is not a good name. It implies a deadline for something to become active; it does not imply a maximum amount of time something is allowed to be active. Not to mention "deadline" implies a specific point in time, not a length of time. activeDurationSeconds might have made sense.

So my justification is that I want the BuildConfig API to actually make sense: when people look at the field, they will intuitively understand what it does.

And I don't see any reason this has to match the underlying API we are using to implement it. We could choose to implement build duration limits with any manner of mechanisms. That we happened to implement it by leveraging an underlying API that isn't exposed to the user here is irrelevant to how we name this field.

If it has to end in DeadlineSeconds, then I want it named "buildDurationDeadlineSeconds".

Review comment (Contributor):

completionDeadlineSeconds I might buy. We're going to use standard terminology in our APIs because we are a Kube ecosystem project. The reason it is "seconds" is that by convention everywhere in the API we use seconds to denote the unit of the field. The reason it is "deadline" is that it's a precise technical term to describe the behavior when the time is exceeded. You can't have a duration deadline. A deadline is a duration, and it's measured in seconds.


Review comment (Contributor):

Then it's a precise technical term that's being used incorrectly. Deadlines are points in time, not lengths of time. A length of time can be used to compute a deadline given a starting time, but it's not a deadline by itself.

completionDeadlineSeconds it is.

@soltysh (Author) commented Oct 7, 2015

CompletionDeadlineSeconds it is then.

@soltysh (Author) commented Oct 7, 2015

Docs updated as well, ptal.

@bparees (Contributor) commented Oct 7, 2015

lgtm

@soltysh (Author) commented Oct 9, 2015

I'm seeing 4 extended failures, all of them from k8s. Shouldn't those be disabled, @Kargakis, or are these a post-Go 1.5.1 problem, @mfojtik?

@soltysh (Author) commented Oct 9, 2015

The failures are:

/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/test/e2e/scheduler_predicates.go:373

And three times:

/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/test/e2e/expansion.go:129

@soltysh (Author) commented Oct 9, 2015

I've temporarily disabled one e2e test that kept failing on my machine locally, namely SchedulerPredicates. I've seen it being changed significantly upstream, which should be picked up with the next rebase.

@soltysh (Author) commented Oct 14, 2015

Rebased and removed the 2nd commit regarding the SchedulerPredicates disablement. Let's see the Jenkins results now.

@@ -58,6 +58,11 @@ type BuildSpec struct {

// Compute resource requirements to execute the build
Resources kapi.ResourceRequirements

// Optional duration in seconds, counted from the time when a build pod gets
Review comment (Contributor):

If this is optional, maybe it would be worth explaining what the default behavior is if I don't set it.

Review comment (@soltysh):

If this is optional, the default behavior is that it does not influence build duration; at least that's my understanding. Why then add some additional info here?

@mfojtik (Contributor) commented Oct 14, 2015

LGTM (couple small nits)

@soltysh (Author) commented Oct 14, 2015

I've removed the word "parallel" from the e2e Describe. I've also removed all test tags; I'll tag this with just this test.

@soltysh (Author) commented Oct 14, 2015

Addressed @mfojtik's nit regarding describe naming: changed "Completion Deadline Seconds" to "Fail Build After".

@soltysh (Author) commented Oct 14, 2015

[testonlyextended][extended:core(CompletionDeadlineSeconds)]

@openshift-bot:

continuous-integration/openshift-jenkins/test Running (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/5773/)

@soltysh (Author) commented Oct 14, 2015

Added validation, which I had previously forgotten. Thanks to @wzheng1 for catching that in the test cases.
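For completeness, a hedged sketch of what that validation amounts to, mirroring the swagger description above ("value must be a positive integer"); the function name and error handling are illustrative, not the exact origin validation code.

    import (
    	"fmt"

    	buildapi "github.com/openshift/origin/pkg/build/api"
    )

    // Illustrative only: the field is optional, but when set it must be a
    // positive number of seconds.
    func validateCompletionDeadlineSeconds(spec *buildapi.BuildSpec) []error {
    	var errs []error
    	if spec.CompletionDeadlineSeconds != nil && *spec.CompletionDeadlineSeconds <= 0 {
    		errs = append(errs, fmt.Errorf("completionDeadlineSeconds must be a positive integer"))
    	}
    	return errs
    }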

@openshift-bot:

Evaluated for origin testonlyextended up to c5abd48

@mfojtik (Contributor) commented Oct 14, 2015

LGTM

@soltysh (Author) commented Oct 14, 2015

@smarterclayton awaiting your sign-off on this.

@smarterclayton (Contributor):

LGTM [merge]

@smarterclayton (Contributor):

BTW your extended run didn't test anything:

=== SUITE OpenShift extended tests suite (224 total specs, 0 will run):
=== RUN "test/extended/images/s2i_php.go":

@smarterclayton (Contributor):

The extended tests should probably fail if 0 tests are run.

@mfojtik (Contributor) commented Oct 14, 2015

I wonder why nothing was run; that focus should match @soltysh's test.

@openshift-bot:

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/3616/) (Image: devenv-fedora_2464)

@openshift-bot:

Evaluated for origin merge up to c5abd48

@openshift-bot:

[Test]ing while waiting on the merge queue

@openshift-bot:

Evaluated for origin test up to c5abd48

@openshift-bot:

continuous-integration/openshift-jenkins/testonlyextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/5774/)

@soltysh (Author) commented Oct 14, 2015

@mfojtik can I kindly ask you to have a closer look at it? You're the e2e expert here 😻

openshift-bot pushed a commit that referenced this pull request Oct 14, 2015
@openshift-bot merged commit a101136 into openshift:master on Oct 14, 2015
@soltysh deleted the build_timeout branch on October 27, 2015 08:30