MGMT-15878: Ensure that hosts emit event showing why preparation failed. #5521

Merged
merged 1 commit into from Nov 6, 2023

Conversation

paul-maidment
Contributor

@paul-maidment paul-maidment commented Oct 1, 2023

When preparation fails for a host, there is currently no host-level timeout to signal that this has occurred.

As a result, the user sometimes cannot determine the cause of a cluster timeout (which will inevitably be caused by a host timing out during preparation).

Presently, there are two ways in which a host may time out (no result received within the cluster timeout):

  • An inconclusive result from the pulling of cluster images
  • An inconclusive result from the disk speed check

This PR introduces a timeout to detect these scenarios and report them in a host timeout event, so that the user has an indication of what happened.

This PR is in addition to MGMT-15814, which introduces a cluster condition to track a cluster timeout when the preparation of a cluster cannot be configured within a given time frame (for example, if the assisted pod crashes).

Together, these PRs should improve the overall quality of error reporting.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot

openshift-ci-robot commented Oct 1, 2023

@paul-maidment: This pull request references MGMT-15878 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.

In response to this:

Presently, when installing a cluster, there is no indication of the reason for the last preparation failure. We record the fact that a failure occurred in the database field cluster.installation_preparation_completion_status, but we do not store any reason for the failure.

We are adding the field cluster.installation_preparation_completion_status_reason so that we can record the reason at the same time an event is generated.

The contents of cluster.installation_preparation_completion_status and cluster.installation_preparation_completion_status_reason are used to determine the state of a newly created condition, LastInstallationPreparationFailed.
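
One way to picture how the two fields could drive the condition is sketched below. Only the condition name LastInstallationPreparationFailed comes from the text above; the `condition` struct, the `"failed"` status value, and the `Reason` strings are assumptions made for illustration and may not match the actual implementation.

```go
package main

import "fmt"

// condition is a hypothetical stand-in for a Kubernetes-style status condition.
type condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// preparationFailed is an assumed value of
// installation_preparation_completion_status; the real constant may differ.
const preparationFailed = "failed"

// lastInstallationPreparationFailed derives the condition from the stored
// status and the stored reason.
func lastInstallationPreparationFailed(status, reason string) condition {
	c := condition{
		Type:    "LastInstallationPreparationFailed",
		Status:  "False",
		Reason:  "PreparationNotFailed",
		Message: "The last installation preparation did not fail",
	}
	if status == preparationFailed {
		c.Status = "True"
		c.Reason = "PreparationFailed"
		c.Message = reason
		if reason == "" {
			c.Message = "The last installation preparation failed; no reason was recorded"
		}
	}
	return c
}

func main() {
	c := lastInstallationPreparationFailed(preparationFailed, "disk speed check was inconclusive")
	fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```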

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 1, 2023
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2023
@openshift-ci

openshift-ci bot commented Oct 1, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 1, 2023
@openshift-ci

openshift-ci bot commented Oct 1, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: paul-maidment

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2023
@paul-maidment paul-maidment force-pushed the MGMT-15878 branch 3 times, most recently from f778b4e to 95780fb Compare October 2, 2023 01:18
@@ -1292,10 +1292,12 @@ func (b *bareMetalInventory) InstallClusterInternal(ctx context.Context, params
    // prepare cluster and hosts for installation
    err = b.db.Transaction(func(tx *gorm.DB) error {
        if err = b.clusterApi.PrepareForInstallation(ctx, cluster, tx); err != nil {
            b.clusterApi.HandlePreInstallError(ctx, cluster, fmt.Errorf("failed to transition to installation preparation due to error:%w", err))
Contributor

Don't all of these already return the error to the controller, showing the result in the SpecSync condition?
https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/clusterdeployments_controller.go#L394

Contributor

+1, I don't think we have an issue in the sync part.

@@ -143,7 +144,7 @@ func (th *transitionHandler) PostPrepareForInstallation(sw stateswitch.StateSwit
    if !ok {
        return errors.New("PostPrepareForInstallation invalid argument")
    }
-   extra := append(append(make([]interface{}, 0), "install_started_at", strfmt.DateTime(time.Now()), "installation_preparation_completion_status", ""), resetLogsField...)
+   extra := append(append(make([]interface{}, 0), "install_started_at", strfmt.DateTime(time.Now())), resetLogsField...)
Contributor

?

Contributor Author

This prevents the installation_preparation_completion_status field from being wiped when installation restarts.
If the reset remains here, we cannot use this field as the basis of any condition that must persist across installation attempts.
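
To illustrate the point, here is a small sketch of how a key/value update list of this shape affects a stored record. `applyUpdates`, `buildExtra`, the map-based record, and the contents of `resetLogsField` are all hypothetical stand-ins; the real update goes through the database layer rather than a map.

```go
package main

import (
	"fmt"
	"time"
)

const statusField = "installation_preparation_completion_status"

// resetLogsField mirrors the name used in the diff; its contents here are assumed.
var resetLogsField = []interface{}{"logs_info", ""}

// applyUpdates writes alternating key/value pairs into rec, loosely mimicking
// an update helper that accepts extra fields as a flat list.
func applyUpdates(rec map[string]interface{}, extra ...interface{}) {
	for i := 0; i+1 < len(extra); i += 2 {
		if key, ok := extra[i].(string); ok {
			rec[key] = extra[i+1]
		}
	}
}

// buildExtra builds the update list with or without the status reset.
func buildExtra(resetStatus bool) []interface{} {
	extra := []interface{}{"install_started_at", time.Now()}
	if resetStatus {
		extra = append(extra, statusField, "")
	}
	return append(extra, resetLogsField...)
}

func main() {
	rec := map[string]interface{}{statusField: "failed"}

	// New behaviour: the status key is not in the list, so the value survives.
	applyUpdates(rec, buildExtra(false)...)
	fmt.Println("without reset:", rec[statusField])

	// Old behaviour: the status is cleared on every installation restart.
	applyUpdates(rec, buildExtra(true)...)
	fmt.Println("with reset:", rec[statusField])
}
```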

@paul-maidment paul-maidment force-pushed the MGMT-15878 branch 2 times, most recently from 912d4d6 to 2613c24 Compare October 3, 2023 15:40
@openshift-ci openshift-ci bot added the api-review Categorizes an issue or PR as actively needing an API review. label Oct 3, 2023
@openshift-ci openshift-ci bot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 4, 2023
@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 4, 2023
@paul-maidment paul-maidment force-pushed the MGMT-15878 branch 3 times, most recently from 59ba5bd to d654d62 Compare October 12, 2023 13:37
@paul-maidment paul-maidment marked this pull request as ready for review October 12, 2023 13:41
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2023
@paul-maidment
Contributor Author

cc @ori-amizur

@codecov

codecov bot commented Oct 12, 2023

Codecov Report

Merging #5521 (92f3e38) into master (73b1599) will increase coverage by 0.22%.
Report is 6 commits behind head on master.
The diff coverage is 86.36%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5521      +/-   ##
==========================================
+ Coverage   67.70%   67.93%   +0.22%     
==========================================
  Files         233      233              
  Lines       34267    34746     +479     
==========================================
+ Hits        23202    23603     +401     
- Misses       9001     9069      +68     
- Partials     2064     2074      +10     
Files                              Coverage Δ
internal/cluster/statemachine.go   100.00% <100.00%> (ø)
internal/host/common.go            88.88% <ø> (ø)
internal/host/config.go            100.00% <ø> (ø)
internal/host/monitor.go           81.36% <ø> (-0.12%) ⬇️
internal/host/statemachine.go      100.00% <100.00%> (ø)
internal/host/transition.go        54.61% <80.00%> (+1.99%) ⬆️

... and 8 files with indirect coverage changes

    DestinationState: stateswitch.State(models.HostStatusKnown),
    PostTransition:   th.PostHostPreparationTimeout(),
    Documentation: stateswitch.TransitionRuleDoc{
        Name: "Preparing timed out host move to insufficient",
Contributor

The host moves to known, not insufficient.

@ori-amizur
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2023
@paul-maidment
Contributor Author

/hold
Issue spotted with messaging around timeout

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2023
@paul-maidment
Contributor Author

/unhold
Tested manually and working well

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2023
@paul-maidment
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2023
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2023
@paul-maidment
Contributor Author

/retest

@paul-maidment paul-maidment force-pushed the MGMT-15878 branch 4 times, most recently from c827422 to 87b97b3 Compare November 5, 2023 10:15
@paul-maidment
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2023
    if !ok {
        return false, errors.New("IsPreparingTimedOut incompatible type of StateSwitch")
    }
    // if *sHost.host.Status != models.HostStatusPreparingForInstallation {
Contributor

Remove commented code

@ori-amizur
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 6, 2023
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 27e5672 and 2 for PR HEAD 92f3e38 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD ea87037 and 1 for PR HEAD 92f3e38 in total


openshift-ci bot commented Nov 6, 2023

@paul-maidment: all tests passed!

Full PR test history. Your PR dashboard.

@openshift-merge-bot openshift-merge-bot bot merged commit 08e3d8f into openshift:master Nov 6, 2023
14 checks passed
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.