ui: explain why alloc is failing #17213

Closed (2 commits)


ChaiWithJai
Contributor

Resolves #16942

This PR adds a derived state property to the Allocation model that explains why an allocation has stopped rescheduling. It also updates the template to show that reason and to link to the follow-up evaluation, so the user can debug what is happening.

Comment on lines +222 to +223
case this.task.code.errors.length > 0:
return "there was an error in the task or service's code.";
Contributor

Will this prevent a reschedule, or simply throw an error but otherwise continue on? I'm asking because I want to make sure that we're giving the right cause when we say "stopped ... because of ________".

Contributor Author

This is a derived state computation that produces the reason an allocation has stopped rescheduling, for display in the allocations.allocation.index view. The reason is only shown when the current allocation (the one associated with the view) has stopped rescheduling.

Contributor Author

See the invocation in the template layer for usage details. It is only applied to the top-level allocation on the reschedule timeline, because the rest of the items in the timeline are RescheduleEventRow, which provides only links and no text.

@github-actions

Ember Asset Size action

As of e5116aa

Files that got Bigger 🚨:

| File | raw | gzip |
| --- | --- | --- |
| nomad-ui.js | +1.05 kB | +278 B |

Files that stayed the same size 🤷‍:

| File | raw | gzip |
| --- | --- | --- |
| vendor.js | 0 B | 0 B |
| nomad-ui.css | 0 B | 0 B |
| vendor.css | 0 B | 0 B |

'task.{config.errors.length,code.errors.length}'
)
get failureReason() {
switch (this.hasStoppedRescheduling) {
Member

For my context, this field value we're reading is from this function, right?:

  get hasStoppedRescheduling() {
    return (
      !this.get('nextAllocation.content') &&
      !this.get('followUpEvaluation.content') &&
      this.clientStatus === 'failed'
    );
  }

If so, this field doesn't seem relevant to determining why the allocation is failing; for that, you only care about the client status.
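
A related point about switching on this value: JavaScript's `switch` compares each `case` expression to the discriminant with strict equality, so `switch (this.hasStoppedRescheduling)` selects the first case whose condition equals that boolean, whether it is `true` or `false`. A minimal sketch in plain JavaScript, where `reasonViaSwitch`, `reasonViaIf`, and the shape of the `alloc` objects are hypothetical stand-ins and not the PR's code:

```javascript
// Hypothetical stand-in for the getter under review; not the PR's actual code.
function reasonViaSwitch(alloc) {
  switch (alloc.hasStoppedRescheduling) {
    // The case expression is evaluated, then compared to the discriminant with ===.
    case alloc.clientStatus === 'failed':
      return 'matched the clientStatus case';
    default:
      return 'no case matched';
  }
}

// The same check written as a plain condition on the client status only.
function reasonViaIf(alloc) {
  return alloc.clientStatus === 'failed'
    ? 'matched the clientStatus case'
    : 'no case matched';
}

// Illustrative inputs (assumed shapes, not real model instances).
const failed = { hasStoppedRescheduling: true, clientStatus: 'failed' };
const running = { hasStoppedRescheduling: false, clientStatus: 'running' };
```

With these inputs, `reasonViaSwitch(running)` takes the `clientStatus` case even though the allocation has not failed, because `false === false`, while `reasonViaIf(running)` does not.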

Contributor Author

Mmmmm.... good call.

)
get failureReason() {
switch (this.hasStoppedRescheduling) {
case this.node.status === 'failed':
Member

There's no "failed" node status. See structs.go#L1866-L1871
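
As a rough illustration of why that case can never fire, the sketch below checks a status string against a list of node statuses. The list is an assumption written from memory of the constants at the linked lines and should be confirmed against structs.go; `NODE_STATUSES` and `isKnownNodeStatus` are hypothetical names, not part of the Nomad UI.

```javascript
// Assumed node status values; confirm against the linked lines in structs.go.
const NODE_STATUSES = ['initializing', 'ready', 'down', 'disconnected'];

// Hypothetical helper: true only for a status string in the assumed list.
function isKnownNodeStatus(status) {
  return NODE_STATUSES.includes(status);
}
```

Under that assumption, `isKnownNodeStatus('failed')` is false, so a `node.status === 'failed'` comparison would never be true, whereas `'down'` is a status a node can actually report.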

Comment on lines +218 to +219
case this.resources.length === 0:
return 'the resources that the allocation was scheduled on were not available.';
Member

If the allocation has no resources, it wouldn't have been placed in the first place, right? I don't think we can actually see this case.

case this.task.config.errors.length > 0:
return "the task or service's configuration was incorrect.";
case this.task.code.errors.length > 0:
return "there was an error in the task or service's code.";
Member

I don't think we know this -- we only know that the task returned errors. It could be a bug in the user's code, an infrastructure outage, networking failure, etc, etc and all of that is owned by the application (not Nomad). It'd probably be handy to link to where we show the task events here though.
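
A message along these lines could state only what is known, namely that the task returned errors, and point the user at the task events. This is a hedged sketch: `describeTaskErrors`, its parameters, and the wording are all hypothetical, and it does not build a real link.

```javascript
// Hypothetical helper: reports only that the task returned errors and
// defers diagnosis to the task events. Name and wording are illustrative.
function describeTaskErrors(taskName, errorCount) {
  if (errorCount === 0) {
    return null; // nothing to report
  }
  const noun = errorCount === 1 ? 'error' : 'errors';
  return `Task "${taskName}" reported ${errorCount} ${noun}. See the task events for details.`;
}
```

The helper deliberately avoids attributing the errors to the user's code or configuration, since the cause may lie elsewhere.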

@github-actions

Ember Test Audit comparison

| | main | e5116aa | change |
| --- | --- | --- | --- |
| passes | 1496 | 1495 | -1 |
| failures | 4 | 5 | +1 |
| flaky | 0 | 0 | 0 |
| duration | 000ms | 000ms | -000ms |

@tgross
Member

tgross commented May 16, 2023

@ChaiWithJai this doesn't seem like it actually solves the problem described in #16942. The description you have here (and the code) tells us why the allocation failed, not why the scheduler stopped scheduling it.

@ChaiWithJai
Contributor Author

> @ChaiWithJai this doesn't seem like it actually solves the problem described in #16942. The description you have here (and the code) tells us why the allocation failed, not why the scheduler stopped scheduling it.

Yeah... looks like I conflated the concepts. There doesn't appear to be a solution to this problem. There's no follow-up evaluation to link to in the evaluations view, and we can't predictably compute derived state to give the user a better explanation.

Successfully merging this pull request may close these issues.

[ui] Give explanation for hasStoppedRescheduling
3 participants