-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add build timeout, by setting ActiveDeadlineSeconds on a build pod #4821
Conversation
@smarterclayton api review. Personally I think we might want to use a different name. MaxDuration or Timeout or something. ActiveDeadlineSeconds doesn't really convey what we're doing with it here. (personally I read it as "deadline in seconds before this must be active, though i realize that's not what it means on the pod either) @soltysh thanks for doing this.... can you add a test (maybe extended or integration) that actually attempts to run a build with a short ADS and confirms it gets killed? what state does the pod go to when it times out? is it marked failed? ultimately we want the build to be marked failed. I'm also wondering if there's any way to indicate on the build why it failed, otherwise it may be confusing to users. This timeout is from when the pod is created until it terminates, so it includes any time spent pulling the image, right? What about the case where the image for the pod can't be found/pulled (so the pod keeps retrying the pull). Will that eventually timeout due to this value also? |
The pod state is
I'll fix it and add the e2e test once we agree on the field name. Besides I forgot to update
The case where it takes too long to get an image counts into ADS as you said. With not found images it's different, the build controller at build creation checks if the builder image exists, which ensures the image existence. But in all other cases ADS will actively kill the pod once it reaches the timeout. |
maybe it's just user training but the events list probably isn't the most obvious way to indicate the outcome here. Maybe it's just a matter of the UIs (cli, console) checking if that event exists and providing a better indication though. I'd start by making it cleared in the describe() impl.
I think you're describing the case when there's no imagestream found so we can't resolve the imagestream reference into a pull spec. i'm talking about the case where k8s attempts the docker pull and gets no image. That is after the build (and pod) are created. k8s will repeatedly retry the pull in that case. Does that fall under ADS? It sounds like yes. |
Well, to be fair we should be updating the status of the build and build On Mon, Sep 28, 2015 at 10:01 PM, Ben Parees notifications@github.com
|
@smarterclayton actually that was intended as a 2nd part of the issue, first introduce the timeout, which will allow us addressing part of the problems. Then I'll be extending the status with appropriate information, including the active killing. Does that sounds comforting? |
Yes, I'd rather not change the UI at this moment to work around and just On Tue, Sep 29, 2015 at 3:53 PM, Maciej Szulik notifications@github.com
|
@bparees ready for final review. I've added extended tests as requested. |
@smarterclayton the status is handled already by @rhcarvalho in #4909 |
continuous-integration/openshift-jenkins/test Waiting: Determining build queue position |
@@ -58,6 +58,10 @@ type BuildSpec struct { | |||
|
|||
// Compute resource requirements to execute the build | |||
Resources kapi.ResourceRequirements | |||
|
|||
// Optional duration in seconds relative to the StartTime that the build may be active on a node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relative to the StartTime is not correct, as noted in the docs PR.
one minor nit and would like to see tests pass, otherwise lgtm. @rhcarvalho after this merges please make sure that "oc get build foo" shows something informational for builds that were timed out (ie shows more than just "Failed") |
Fixed comment and hoping for re-test, since previous was |
"maxDuration": { | ||
"type": "integer", | ||
"format": "int64", | ||
"description": "optional duration in seconds the build may be active on a node relative to StartTime before the system will actively try to mark it failed and kill associated containers; value must be a positive integer" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regen docs
Right... my bad. Should be fixed now. |
lgtm. @smarterclayton api review please. |
@@ -13819,6 +13819,11 @@ | |||
"resources": { | |||
"$ref": "v1.ResourceRequirements", | |||
"description": "the desired compute resources the build should have" | |||
}, | |||
"maxDuration": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this activeDeadlineSeconds
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description is lifted exactly from activeDeadlineSeconds - that makes me think the name should be exactly activeDeadlineSeconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've talked with @bparees about naming and decided to go with maxDuration
as it speaks more clearly what the field does than activeDeadlineSecods
.
Other than that if sometime in the future we'll decide to allow multi-pod builds, we'd still have maxDuration
which will leverage job's maxDuration
(or whatever it'll be called).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything you name this has to have a suffix of "DeadlineSeconds" because
that's the established convention in the API. You can call it
buildDeadlineSeconds or activeDeadlineSeconds or (other variant), but it
has to fit that pattern.
On Oct 6, 2015, at 4:50 PM, Maciej Szulik notifications@github.com wrote:
In api/swagger-spec/oapi-v1.json
#4821 (comment):
@@ -13819,6 +13819,11 @@
"resources": {
"$ref": "v1.ResourceRequirements",
"description": "the desired compute resources the build should have"
},
"maxDuration": {
We've talked with @bparees https://github.com/bparees about naming and
decided to go with maxDuration as it speaks more clearly what the field
does than activeDeadlineSecods.
Other than that if sometime in the future we'll decide to allow multi-pod
builds, we'd still have maxDuration which will leverage job's maxDuration
(or whatever it'll be called).
—
Reply to this email directly or view it on GitHub
https://github.com/openshift/origin/pull/4821/files#r41320288.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bparees your choice, personally I'd go with buildDeadlineSeconds
On Oct 6, 2015 11:12 PM, "Clayton Coleman" notifications@github.com wrote:
In api/swagger-spec/oapi-v1.json
#4821 (comment):@@ -13819,6 +13819,11 @@
"resources": {
"$ref": "v1.ResourceRequirements",
"description": "the desired compute resources the build should have"
},
"maxDuration": {
Anything you name this has to have a suffix of "DeadlineSeconds" because
that's the established convention in the API. You can call it
buildDeadlineSeconds or activeDeadlineSeconds or (other variant), but it
has to fit that pattern. On Oct 6, 2015, at 4:50 PM, Maciej Szulik <
notifications@github.com> wrote: In api/swagger-spec/oapi-v1.json <
https://github.com/openshift/origin/pull/4821#discussion_r41320288>:
@@ -13819,6 +13819,11 @@ "resources": { "$ref": "v1.ResourceRequirements",
"description": "the desired compute resources the build should have" + }, +
"maxDuration": {
We've talked with @bparees https://github.com/bparees <
https://github.com/bparees> about naming and decided to go with
maxDuration as it speaks more clearly what the field does than
activeDeadlineSecods. Other than that if sometime in the future we'll
decide to allow multi-pod builds, we'd still have maxDuration which will
leverage job's maxDuration (or whatever it'll be called). — Reply to this
email directly or view it on GitHub <
https://github.com/openshift/origin/pull/4821/files#r41320288>.—
Reply to this email directly or view it on GitHub
https://github.com/openshift/origin/pull/4821/files#r41323078.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, you have to justify why this field which does exactly the same thing
as activeDeadlineSeconds and even means the same thing should be different.
On Tue, Oct 6, 2015 at 5:22 PM, Maciej Szulik notifications@github.com
wrote:
In api/swagger-spec/oapi-v1.json
#4821 (comment):@@ -13819,6 +13819,11 @@
"resources": {
"$ref": "v1.ResourceRequirements",
"description": "the desired compute resources the build should have"
},
"maxDuration": {
@bparees https://github.com/bparees your choice, personally I'd go with
buildDeadlineSeconds
… <#1503f089b371cccd_>
On Oct 6, 2015 11:12 PM, "Clayton Coleman" notifications@github.com
wrote: In api/swagger-spec/oapi-v1.json <
https://github.com/openshift/origin/pull/4821#discussion_r41323078>: > @@
-13819,6 +13819,11 @@ > "resources": { > "$ref": "v1.ResourceRequirements","description": "the desired compute resources the build should have" > +
}, > + "maxDuration": { Anything you name this has to have a suffix of
"DeadlineSeconds" because that's the established convention in the API. You
can call it buildDeadlineSeconds or activeDeadlineSeconds or (other
variant), but it has to fit that pattern. On Oct 6, 2015, at 4:50 PM,
Maciej Szulik < notifications@github.com> wrote: In
api/swagger-spec/oapi-v1.json <
https://github.com/openshift/origin/pull/4821#discussion_r41320288>: @@
-13819,6 +13819,11 @@ "resources": { "$ref": "v1.ResourceRequirements",
"description": "the desired compute resources the build should have" + }, +
"maxDuration": { We've talked with @bparees https://github.com/bparees <
https://github.com/bparees> < https://github.com/bparees> about naming
and decided to go with maxDuration as it speaks more clearly what the field
does than activeDeadlineSecods. Other than that if sometime in the future
we'll decide to allow multi-pod builds, we'd still have maxDuration which
will leverage job's maxDuration (or whatever it'll be called). — Reply to
this email directly or view it on GitHub <
https://github.com/openshift/origin/pull/4821/files#r41320288>. — Reply
to this email directly or view it on GitHub <
https://github.com/openshift/origin/pull/4821/files#r41323078>.—
Reply to this email directly or view it on GitHub
https://github.com/openshift/origin/pull/4821/files#r41324241.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
activeDeadlineSeconds is not a good name. it implies a deadline for something to become active. it does not imply a maximum amount of time something is allowed to be active. Not to mention "deadline" implies a specific point in time, not a length of time. activeDurationSeconds might have made sense.
so my justification is i want the buildconfig api to actually make sense when people look at the field, they will intuitively understand what the field does.
And i don't see any reason this has to match the underlying api we are using to implement it. We could choose to implement build duration limits with any manner of mechanisms. That we happened to choose to implement it by leveraging an underlying api that's not being exposed to the user here is irrelevant to how we name this field.
if it has to end in DeadlineSeconds then i want it named "buildDurationDeadlineSeconds"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
completionDeadlineSeconds I might buy. We're going to use standard
terminology in our APIs because we are a Kube ecosystem project. The
reason it is "seconds" is because by convention everywhere in the API we
use seconds to denote the type of a file. The reason it is deadline is
because it's a precise technical term to describe the behavior when the
time is exceeded. You can't have a duration deadline. A deadline is a
duration, and it's measured in seconds.
On Tue, Oct 6, 2015 at 6:00 PM, Ben Parees notifications@github.com wrote:
In api/swagger-spec/oapi-v1.json
#4821 (comment):@@ -13819,6 +13819,11 @@
"resources": {
"$ref": "v1.ResourceRequirements",
"description": "the desired compute resources the build should have"
},
"maxDuration": {
activeDeadlineSeconds is not a good name. it implies a deadline for
something to become active. it does not imply a maximum amount of time
something is allowed to be active. Not to mention "deadline" implies a
specific point in time, not a length of time. activeDurationSeconds might
have made sense.so my justification is i want the buildconfig api to actually make sense
when people look at the field, they will intuitively understand what the
field does.And i don't see any reason this has to match the underlying api we are
using to implement it. We could choose to implement build duration limits
with any manner of mechanisms. That we happened to choose to implement it
by leveraging an underlying api that's not being exposed to the user here
is irrelevant to how we name this field.if it has to end in DeadlineSeconds then i want it named
"buildDurationDeadlineSeconds"—
Reply to this email directly or view it on GitHub
https://github.com/openshift/origin/pull/4821/files#r41328471.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then it's a precise technical term that's being used incorrectly. deadlines are points in time, not lengths of time. a length of time can be used to compute a deadline given a starting time, but it's not a deadline by itself.
completionDeadlineSeconds it is.
|
Docs updated as well, ptal. |
lgtm |
The failures are:
And three times:
|
I've temporarily disabled one e2e tests that kept failing on my machine locally, that is SchedulerPredicates. I've seen it being changed significantly upstream, which should be picked up with next rebase. |
a63eeb9
to
f1d8a58
Compare
Rebased and removed the 2nd commit re SchedulerPredicates disablement. Let's see the jenkins results now. |
@@ -58,6 +58,11 @@ type BuildSpec struct { | |||
|
|||
// Compute resource requirements to execute the build | |||
Resources kapi.ResourceRequirements | |||
|
|||
// Optional duration in seconds, counted from the time when a build pod gets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is optional, maybe it would be worth to explain what is the default behavior if I don't set this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is optional the default behavior is that it does not influence build duration, at least that's my understanding. Why then adding some additional info here.
LGTM (couple small nits) |
f1d8a58
to
ee2b543
Compare
I've removed |
ee2b543
to
385735d
Compare
Addressed @mfojtik nit regarding describe naming, changed |
[testonlyextended][extended:core(CompletionDeadlineSeconds)] |
continuous-integration/openshift-jenkins/test Running (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/5773/) |
385735d
to
c5abd48
Compare
Added validation, of which I previously forgotten. Thanks to @wzheng1 for catching that in the test cases. |
Evaluated for origin testonlyextended up to c5abd48 |
LGTM |
@smarterclayton awaiting your sign off for this. |
LGTM [merge] |
BTW your extended run didn't test anything:
|
The extended tests should probably fail if 0 tests are run. |
I wonder why nothing was ran, that focus should match @soltysh test. |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/3616/) (Image: devenv-fedora_2464) |
Evaluated for origin merge up to c5abd48 |
[Test]ing while waiting on the merge queue |
Evaluated for origin test up to c5abd48 |
continuous-integration/openshift-jenkins/testonlyextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/5774/) |
@mfojtik can I ask you kindly to have a closer look at it, you're the e2e expert here 😻 |
Merged by openshift-bot
@bparees this is partially dealing with the problems mentioned in #341 by adding possibility to timeout build. But there are two questions:
ActiveDeadlineSeconds
, is it ok?One last thing, you specify this timeout in seconds, exactly as you do with ADS.