
Add build timeout, by setting ActiveDeadlineSeconds on a build pod #4821

Merged (1 commit, Oct 14, 2015)

Conversation

@soltysh (Contributor) commented Sep 28, 2015

@bparees this partially addresses the problems mentioned in #341 by adding the possibility to time out a build. But there are two questions:

  1. The name of the field: I've chosen exactly the same name that is applied to a pod, ActiveDeadlineSeconds. Is that ok?
  2. Do we want to default it to some reasonable value, e.g. 30 minutes? For now it is not defaulted. If yes, will we allow disabling it by setting it to zero or a negative value?

One last thing: you specify this timeout in seconds, exactly as you do with ADS.
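For reference, a minimal sketch of the wiring this implies, assuming illustrative names: only pod.Spec.ActiveDeadlineSeconds is the real upstream Kubernetes field, while the helper, the build-side field, and the import paths are approximations rather than the code from this PR. When the build controller creates the build pod, the build-level value is copied onto the pod spec.

    import (
    	buildapi "github.com/openshift/origin/pkg/build/api"
    	kapi "k8s.io/kubernetes/pkg/api"
    )

    // Illustrative helper: copy the build-level timeout onto the generated
    // build pod. The kubelet enforces ActiveDeadlineSeconds by killing the
    // pod once it has been active longer than the given number of seconds.
    func applyBuildDeadline(pod *kapi.Pod, build *buildapi.Build) {
    	// The field name on the build side is still under discussion below.
    	if build.Spec.ActiveDeadlineSeconds != nil {
    		pod.Spec.ActiveDeadlineSeconds = build.Spec.ActiveDeadlineSeconds
    	}
    }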

@bparees (Contributor) commented Sep 28, 2015

@smarterclayton api review. Personally I think we might want to use a different name, MaxDuration or Timeout or something. ActiveDeadlineSeconds doesn't really convey what we're doing with it here. (Personally I read it as "deadline in seconds before this must be active", though I realize that's not what it means on the pod either.)

@soltysh thanks for doing this. Can you add a test (maybe extended or integration) that actually attempts to run a build with a short ADS and confirms it gets killed?

What state does the pod go to when it times out? Is it marked failed? Ultimately we want the build to be marked failed. I'm also wondering if there's any way to indicate on the build why it failed; otherwise it may be confusing to users.

This timeout is from when the pod is created until it terminates, so it includes any time spent pulling the image, right? What about the case where the image for the pod can't be found/pulled (so the pod keeps retrying the pull)? Will that eventually time out due to this value also?

@soltysh (Author) commented Sep 28, 2015

> What state does the pod go to when it times out? Is it marked failed? Ultimately we want the build to be marked failed. I'm also wondering if there's any way to indicate on the build why it failed; otherwise it may be confusing to users.

The pod state is DeadlineExceeded; we'd have to map it into a Build status, which is currently Failed. You can even observe the following events on the build:

  5m        5m      2   {kubelet openshiftdev.local}                        DeadlineExceeded    Pod was active on the node longer than specified deadline
  5m        5m      1   {kubelet openshiftdev.local}    spec.containers{sti-build}      killing         Killing with docker id bc00c09a36bc
  5m        5m      1   {kubelet openshiftdev.local}    implicitly required container POD   killing         Killing with docker id e15d170bb989

I'll fix it and add the e2e test once we agree on the field name. Besides, I forgot to update generate.go to copy the ADS over to the build, and I was wondering why it wasn't working for me just now 😉.

> This timeout is from when the pod is created until it terminates, so it includes any time spent pulling the image, right? What about the case where the image for the pod can't be found/pulled (so the pod keeps retrying the pull)? Will that eventually time out due to this value also?

The case where it takes too long to pull an image counts toward ADS, as you said. With not-found images it's different: the build controller checks at build creation whether the builder image exists, which ensures the image is present. But in all other cases ADS will actively kill the pod once it reaches the timeout.
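A rough sketch of the mapping described above, with approximate type and constant names (the real build controller code in origin is more involved): a pod killed for exceeding its deadline ends up in phase Failed with reason DeadlineExceeded, and that is translated into a failed build.

    // Illustrative only: translate a deadline-exceeded pod into a failed build.
    // Build status/message field names are approximations of the API of the time.
    func updateBuildFromPod(build *buildapi.Build, pod *kapi.Pod) {
    	if pod.Status.Phase == kapi.PodFailed {
    		build.Status = buildapi.BuildStatusFailed
    		if pod.Status.Reason == "DeadlineExceeded" {
    			// Surfacing the reason on the build itself is the follow-up
    			// work discussed below (see #4909).
    			build.Message = "Build pod was active on the node longer than the specified deadline"
    		}
    	}
    }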

@bparees (Contributor) commented Sep 28, 2015

> The pod state is DeadlineExceeded; we'd have to map it into a Build status, which is currently Failed. You can even observe the following events on the build:

Maybe it's just user training, but the events list probably isn't the most obvious way to indicate the outcome here. Maybe it's just a matter of the UIs (CLI, console) checking if that event exists and providing a better indication, though. I'd start by making it clearer in the describe() impl.

> The case where it takes too long to pull an image counts toward ADS, as you said. With not-found images it's different: the build controller checks at build creation whether the builder image exists, which ensures the image is present. But in all other cases ADS will actively kill the pod once it reaches the timeout.

I think you're describing the case where no imagestream is found, so we can't resolve the imagestream reference into a pull spec. I'm talking about the case where k8s attempts the docker pull and gets no image. That is after the build (and pod) are created; k8s will repeatedly retry the pull in that case. Does that fall under ADS? It sounds like yes.

@smarterclayton (Contributor):

Well, to be fair we should be updating the status of the build and build config to represent these states, which we have plenty of issues to cover. Rather than change the UI right now I'd prefer to start fixing that problem.


@soltysh (Author) commented Sep 29, 2015

@smarterclayton actually that was intended as the 2nd part of the issue: first introduce the timeout, which will let us address part of the problems. Then I'll extend the status with appropriate information, including the active killing. Does that sound reasonable?

@smarterclayton (Contributor):

Yes, I'd rather not change the UI at this moment as a workaround, and just fix the real problem.


@soltysh (Author) commented Oct 2, 2015

@bparees ready for final review. I've added extended tests as requested.
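Not the actual test from this PR, but a sketch of the shape such an extended test can take; the exutil helpers, the "timeout-bc" fixture, and the waitForBuildPhase poller below are assumptions rather than verified utilities from the origin tree of that time.

    import (
    	g "github.com/onsi/ginkgo"
    	o "github.com/onsi/gomega"

    	exutil "github.com/openshift/origin/test/extended/util"
    )

    // Hypothetical sketch: start a build whose completionDeadlineSeconds is
    // very small and assert it ends up Failed once the kubelet kills the pod.
    var _ = g.Describe("builds: completion deadline", func() {
    	oc := exutil.NewCLI("build-deadline", exutil.KubeConfigPath())

    	g.It("should fail a build that exceeds its deadline", func() {
    		// "timeout-bc" is a hypothetical BuildConfig fixture with
    		// completionDeadlineSeconds set to 1.
    		err := oc.Run("start-build").Args("timeout-bc").Execute()
    		o.Expect(err).NotTo(o.HaveOccurred())

    		// waitForBuildPhase is a hypothetical polling helper returning
    		// the terminal phase of the named build.
    		phase := waitForBuildPhase(oc, "timeout-bc-1")
    		o.Expect(phase).To(o.Equal("Failed"))
    	})
    })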

@soltysh (Author) commented Oct 2, 2015

@smarterclayton the status is already handled by @rhcarvalho in #4909

@openshift-bot:

continuous-integration/openshift-jenkins/test Waiting: Determining build queue position

@@ -58,6 +58,10 @@ type BuildSpec struct {

// Compute resource requirements to execute the build
Resources kapi.ResourceRequirements

// Optional duration in seconds relative to the StartTime that the build may be active on a node
Review comment (Contributor):

relative to the StartTime is not correct, as noted in the docs PR.

@bparees (Contributor) commented Oct 2, 2015

One minor nit, and I'd like to see tests pass; otherwise lgtm.

@rhcarvalho after this merges, please make sure that "oc get build foo" shows something informational for builds that were timed out (i.e. shows more than just "Failed").

@soltysh (Author) commented Oct 5, 2015

Fixed the comment and hoping for a re-test, since the previous failure was a make verify flake for some weird reason.

"maxDuration": {
"type": "integer",
"format": "int64",
"description": "optional duration in seconds the build may be active on a node relative to StartTime before the system will actively try to mark it failed and kill associated containers; value must be a positive integer"
Review comment (Contributor):

regen docs

@soltysh (Author) commented Oct 5, 2015

Right... my bad. Should be fixed now.

@bparees (Contributor) commented Oct 5, 2015

lgtm. @smarterclayton api review please.

@@ -13819,6 +13819,11 @@
"resources": {
"$ref": "v1.ResourceRequirements",
"description": "the desired compute resources the build should have"
},
"maxDuration": {
Review comment (Contributor):

Why isn't this activeDeadlineSeconds?

Review comment (Contributor):

The description is lifted exactly from activeDeadlineSeconds - that makes me think the name should be exactly activeDeadlineSeconds.

Review comment (@soltysh):

We've talked with @bparees about naming and decided to go with maxDuration, as it speaks more clearly to what the field does than activeDeadlineSeconds.
Other than that, if sometime in the future we decide to allow multi-pod builds, we'd still have maxDuration, which would leverage the job's maxDuration (or whatever it'll be called).

Review comment (Contributor):

Anything you name this has to have a suffix of "DeadlineSeconds" because that's the established convention in the API. You can call it buildDeadlineSeconds or activeDeadlineSeconds or (other variant), but it has to fit that pattern.


Review comment (@soltysh):

@bparees your choice; personally I'd go with buildDeadlineSeconds.

Review comment (Contributor):

But you have to justify why this field, which does exactly the same thing as activeDeadlineSeconds and even means the same thing, should be different.


Review comment (Contributor):

activeDeadlineSeconds is not a good name. It implies a deadline for something to become active; it does not imply a maximum amount of time something is allowed to be active. Not to mention "deadline" implies a specific point in time, not a length of time. activeDurationSeconds might have made sense.

So my justification is that I want the BuildConfig API to actually make sense: when people look at the field, they will intuitively understand what it does.

And I don't see any reason this has to match the underlying API we are using to implement it. We could choose to implement build duration limits with any manner of mechanisms. That we happened to implement it by leveraging an underlying API that isn't exposed to the user here is irrelevant to how we name this field.

If it has to end in DeadlineSeconds, then I want it named "buildDurationDeadlineSeconds".

Review comment (Contributor):

completionDeadlineSeconds I might buy. We're going to use standard terminology in our APIs because we are a Kube ecosystem project. The reason it is "seconds" is that by convention everywhere in the API we use seconds to denote the unit of the field. The reason it is "deadline" is that it's a precise technical term to describe the behavior when the time is exceeded. You can't have a duration deadline. A deadline is a duration, and it's measured in seconds.


Review comment (Contributor):

Then it's a precise technical term that's being used incorrectly. Deadlines are points in time, not lengths of time. A length of time can be used to compute a deadline given a starting time, but it's not a deadline by itself.

completionDeadlineSeconds it is.

@soltysh (Author) commented Oct 7, 2015

CompletionDeadlineSeconds it is then.

@soltysh (Author) commented Oct 7, 2015

Docs updated as well, ptal.

@bparees (Contributor) commented Oct 7, 2015

lgtm

@soltysh (Author) commented Oct 9, 2015

I'm seeing 4 extended failures, all of them from k8s. Shouldn't those be disabled, @Kargakis, or are these a post-Go 1.5.1 problem, @mfojtik?

@soltysh (Author) commented Oct 9, 2015

The failures are:

/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/test/e2e/scheduler_predicates.go:373

And three times:

/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/test/e2e/expansion.go:129

@soltysh (Author) commented Oct 9, 2015

I've temporarily disabled one e2e test that kept failing on my machine locally, namely SchedulerPredicates. I've seen it being changed significantly upstream, which should be picked up with the next rebase.

@soltysh (Author) commented Oct 14, 2015

Rebased and removed the 2nd commit regarding the SchedulerPredicates disablement. Let's see the Jenkins results now.

@@ -58,6 +58,11 @@ type BuildSpec struct {

// Compute resource requirements to execute the build
Resources kapi.ResourceRequirements

// Optional duration in seconds, counted from the time when a build pod gets
Review comment (Contributor):

If this is optional, maybe it would be worth explaining what the default behavior is if I don't set it.

Review comment (@soltysh):

If this is optional, the default behavior is that it does not influence build duration; at least that's my understanding. Why then add some additional info here?

@mfojtik (Contributor) commented Oct 14, 2015

LGTM (couple small nits)

@soltysh (Author) commented Oct 14, 2015

I've removed the word "parallel" from the e2e Describe. I've also removed all test tags; I'll tag this with just this test.

@soltysh (Author) commented Oct 14, 2015

Addressed @mfojtik's nit regarding describe naming: changed "Completion Deadline Seconds" to "Fail Build After".

@soltysh (Author) commented Oct 14, 2015

[testonlyextended][extended:core(CompletionDeadlineSeconds)]

@openshift-bot:

continuous-integration/openshift-jenkins/test Running (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/5773/)

@soltysh (Author) commented Oct 14, 2015

Added validation, which I had previously forgotten. Thanks to @wzheng1 for catching that in the test cases.
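For completeness, a hedged sketch of what that validation amounts to, mirroring the swagger description above ("value must be a positive integer"); the function name and error handling are illustrative, not the exact origin validation code.

    import (
    	"fmt"

    	buildapi "github.com/openshift/origin/pkg/build/api"
    )

    // Illustrative only: the field is optional, but when set it must be a
    // positive number of seconds.
    func validateCompletionDeadlineSeconds(spec *buildapi.BuildSpec) []error {
    	var errs []error
    	if spec.CompletionDeadlineSeconds != nil && *spec.CompletionDeadlineSeconds <= 0 {
    		errs = append(errs, fmt.Errorf("completionDeadlineSeconds must be a positive integer"))
    	}
    	return errs
    }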

@openshift-bot:

Evaluated for origin testonlyextended up to c5abd48

@mfojtik (Contributor) commented Oct 14, 2015

LGTM

@soltysh (Author) commented Oct 14, 2015

@smarterclayton awaiting your sign-off on this.

@smarterclayton (Contributor):

LGTM [merge]

@smarterclayton (Contributor):

BTW your extended run didn't test anything:

=== SUITE OpenShift extended tests suite (224 total specs, 0 will run):
=== RUN "test/extended/images/s2i_php.go":

@smarterclayton (Contributor):

The extended tests should probably fail if 0 tests are run.

@mfojtik (Contributor) commented Oct 14, 2015

I wonder why nothing was run; that focus should match @soltysh's test.

@openshift-bot:

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/3616/) (Image: devenv-fedora_2464)

@openshift-bot:

Evaluated for origin merge up to c5abd48

@openshift-bot:

[Test]ing while waiting on the merge queue

@openshift-bot:

Evaluated for origin test up to c5abd48

@openshift-bot:

continuous-integration/openshift-jenkins/testonlyextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/5774/)

@soltysh (Author) commented Oct 14, 2015

@mfojtik can I kindly ask you to have a closer look at it? You're the e2e expert here 😻

openshift-bot pushed a commit that referenced this pull request Oct 14, 2015
@openshift-bot merged commit a101136 into openshift:master on Oct 14, 2015
@soltysh deleted the build_timeout branch on October 27, 2015 08:30