MGMT-15878: Ensure that hosts emit event showing why preparation failed. #5521
@paul-maidment: This pull request references MGMT-15878 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: paul-maidment
internal/bminventory/inventory.go
Outdated
@@ -1292,10 +1292,12 @@ func (b *bareMetalInventory) InstallClusterInternal(ctx context.Context, params
// prepare cluster and hosts for installation
err = b.db.Transaction(func(tx *gorm.DB) error {
if err = b.clusterApi.PrepareForInstallation(ctx, cluster, tx); err != nil {
b.clusterApi.HandlePreInstallError(ctx, cluster, fmt.Errorf("failed to transition to installation preparation due to error:%w", err))
Don't all of these already return the error to the controller, showing the result in the SpecSync condition?
https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/clusterdeployments_controller.go#L394
+1, I don't think that we have an issue in the sync part.
internal/cluster/transition.go
Outdated
@@ -143,7 +144,7 @@ func (th *transitionHandler) PostPrepareForInstallation(sw stateswitch.StateSwit
if !ok {
return errors.New("PostPrepareForInstallation invalid argument")
}
extra := append(append(make([]interface{}, 0), "install_started_at", strfmt.DateTime(time.Now()), "installation_preparation_completion_status", ""), resetLogsField...)
extra := append(append(make([]interface{}, 0), "install_started_at", strfmt.DateTime(time.Now())), resetLogsField...)
?
This is to prevent that field being wiped on restart of installation.
If this remains here, we cannot use this field as the basis of any condition that will persist across installation attempts.
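To illustrate the concern, here is a minimal sketch of how resetting the field on every installation start erases a value written by a previous attempt. It is not the service's code: `applyUpdates`, `statusAfterRestart`, the `"failed"` value and the placeholder timestamp are all hypothetical, and a plain map stands in for the database record. Only the two column names come from the diff.

```go
package main

import "fmt"

// statusField is the column name taken from the diff under review.
const statusField = "installation_preparation_completion_status"

// applyUpdates is a hypothetical stand-in for a database update that takes
// alternating column/value pairs, in the style of the "extra" slice in the diff.
func applyUpdates(record map[string]interface{}, extra ...interface{}) {
	for i := 0; i+1 < len(extra); i += 2 {
		if key, ok := extra[i].(string); ok {
			record[key] = extra[i+1]
		}
	}
}

// statusAfterRestart simulates a second installation attempt on a record whose
// status field was set by an earlier attempt. "failed" is a made-up value and
// the timestamp is a placeholder for strfmt.DateTime(time.Now()).
func statusAfterRestart(resetStatus bool) string {
	record := map[string]interface{}{statusField: "failed"}
	extra := []interface{}{"install_started_at", "2024-01-01T00:00:00Z"}
	if resetStatus {
		// Old behaviour: the status column is reset alongside the start time.
		extra = append(extra, statusField, "")
	}
	applyUpdates(record, extra...)
	return fmt.Sprint(record[statusField])
}

func main() {
	fmt.Printf("with reset:    %q\n", statusAfterRestart(true))
	fmt.Printf("without reset: %q\n", statusAfterRestart(false))
}
```

With the reset in place, anything derived from the field sees an empty value after a restart; without it, the value from the previous attempt is still available to a condition.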
cc @ori-amizur
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #5521 +/- ##
==========================================
+ Coverage 67.70% 67.93% +0.22%
==========================================
Files 233 233
Lines 34267 34746 +479
==========================================
+ Hits 23202 23603 +401
- Misses 9001 9069 +68
- Partials 2064 2074 +10
internal/host/statemachine.go
Outdated
DestinationState: stateswitch.State(models.HostStatusKnown),
PostTransition: th.PostHostPreparationTimeout(),
Documentation: stateswitch.TransitionRuleDoc{
Name: "Preparing timed out host move to insufficient",
It moves to known, not insufficient.
/lgtm
/hold
/unhold
/hold
/retest
/unhold
internal/host/transition.go
Outdated
if !ok {
return false, errors.New("IsPreparingTimedOut incompatible type of StateSwitch")
}
// if *sHost.host.Status != models.HostStatusPreparingForInstallation {
Remove the commented-out code.
When preparation fails for a host, we do not implement any kind of timeout for the host to indicate that this has occurred.
This means that it is sometimes not possible for the user to determine the cause of a cluster timeout (which will inevitably be caused by the timeout of a host during preparation).
Presently, there are two ways in which a host may time out (no result received within the cluster timeout):
- An inconclusive result from the pulling of cluster images
- An inconclusive result from the disk speed check
This PR introduces a timeout to detect these scenarios and report on them in a host timeout event so that the user may have a clue as to what has happened.
This PR is in addition to MGMT-15814, which introduces a cluster condition to track a cluster timeout when there is a failure to configure the preparation of a cluster within a given time frame (for example, if the assisted pod crashes).
Together these PRs should improve the overall quality of error reporting.
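The detection described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the `hostPreparation` type, the `missingResults` and `timeoutEvent` helpers, the message wording and the 10-minute `preparationTimeout` value are all hypothetical. The only parts taken from the description are the two checks that can remain inconclusive.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// preparationTimeout is an assumed value for illustration only;
// the PR description does not state the real timeout.
const preparationTimeout = 10 * time.Minute

// hostPreparation is a hypothetical record of which preparation
// results a host has reported so far.
type hostPreparation struct {
	ImagePullDone      bool
	DiskSpeedCheckDone bool
}

// missingResults lists the preparation checks that have not reported a result.
func missingResults(p hostPreparation) []string {
	var missing []string
	if !p.ImagePullDone {
		missing = append(missing, "cluster image pull")
	}
	if !p.DiskSpeedCheckDone {
		missing = append(missing, "disk speed check")
	}
	return missing
}

// timeoutEvent returns an event message and true when the timeout has elapsed
// while at least one preparation result is still missing.
func timeoutEvent(p hostPreparation, elapsed time.Duration) (string, bool) {
	if elapsed < preparationTimeout {
		return "", false
	}
	missing := missingResults(p)
	if len(missing) == 0 {
		return "", false
	}
	msg := fmt.Sprintf("host preparation timed out after %s, no result received for: %s",
		preparationTimeout, strings.Join(missing, ", "))
	return msg, true
}

func main() {
	// The host reported an image pull result but never a disk speed result.
	p := hostPreparation{ImagePullDone: true}
	if msg, timedOut := timeoutEvent(p, 2*preparationTimeout); timedOut {
		fmt.Println(msg)
	}
}
```

Naming the missing result in the message is what gives the user a clue about which step stalled, rather than only seeing that the cluster timed out.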
/lgtm
@paul-maidment: all tests passed!