Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PWX-30419: Use reconcile loop to check pre-flight status. Timeout after 15m. #1070

Merged
merged 63 commits into from
Aug 7, 2023

Conversation

jrivera-px
Copy link
Contributor

What this PR does / why we need it:
When executing pre-flight check we would wait for the checks to complete blocking the reconcile loop. This prevented any other state change from being processed. So this change corrected that. Use cluster conditions to determine if we are still executing a pre-flight. If so check the status each time though the loop. Add a 15 min timeout based on the time the DS has been running.
Which issue(s) this PR fixes (optional)
Closes #
PWX-30419
Special notes for your reviewer:

harsh-px and others added 30 commits March 28, 2023 16:07
Signed-off-by: Harsh Desai <hadesai@purestorage.com>
Signed-off-by: Harsh Desai <hadesai@purestorage.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
* Updating CSV to use 23.3.1 released image

* Update for 23.3.1 release

* Controller gen vendor

Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>

* PWX-29389 Add CRD for portworx diags collection

Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>

* PWX-29409: Ignore zones with no nodes (#1008)

In disaggregated mode, there could be zones in which no storage nodes
  might be present. Such a zone would make the maxSNPZ value to be 0.
  CHanging the behavior to ignore 0 nodes in a zone for maxSNPZ
  calculation.

Signed-off-by: Naveen Revanna <nrevanna@purestorage.com>

---------

Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>
Signed-off-by: Naveen Revanna <nrevanna@purestorage.com>
Co-authored-by: CNBU Jenkins <cnbu-jenkins@purestorage.com>
Co-authored-by: Jiafeng Liao <jliao@purestorage.com>
Co-authored-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>
Co-authored-by: Naveen Revanna <83608369+nrevanna@users.noreply.github.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
…emove 'wait' code.

Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
…that don't work since Validate was removed from the controller.validate() func. PWX-30373 to try and fix later.

Signed-off-by: Jose Rivera <jose@portworx.com>
… check failure to trigger the needed workflow.

Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
…space.

Signed-off-by: Jose Rivera <jose@portworx.com>
…running CBT namespace.

Signed-off-by: Jose Rivera <jose@portworx.com>
…ctionality correctly.

Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
… exists.

Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>

// Skip over driver Validate() if an error has occurred
if err == nil {
err = c.driverValidate(toUpdate)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we return an cluster condition constant showing the state of the preflight pods?

Comment on lines 537 to 543
if toUpdate.Annotations[pxutil.AnnotationPreflightCheck] != "false" { // In progress
condition.Status = corev1.ClusterConditionStatusInProgress
logrus.Infof("storage cluster preflight in progress...")

} else {
condition.Status = corev1.ClusterConditionStatusCompleted
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the condition based on the output from driverValidate.

@@ -412,55 +412,59 @@ func (c *Controller) validateCloudStorageLabelKey(cluster *corev1.StorageCluster
return nil
}

func (c *Controller) runPreflightCheck(cluster *corev1.StorageCluster) error {
func bringUpStorageCluster(cluster *corev1.StorageCluster) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to return an error here.

@@ -657,6 +714,15 @@ func (c *Controller) syncStorageCluster(
}
}

if startOk, err := bringUpStorageCluster(cluster); err == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: isPreflightComplete?

if err != nil {
if errors.IsNotFound(err) {
condition.Status = corev1.ClusterConditionStatusInProgress
condition.Message = "pre-flight: in progress..."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*Preflight pods haven't started yet.

…termin status..

Signed-off-by: Jose Rivera <jose@portworx.com>
…ault condition on CloudDrivePermission error.

Signed-off-by: Jose Rivera <jose@portworx.com>
Copy link
Member

@piyush-nimbalkar piyush-nimbalkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrivera-px can we UTs for these changes either in this PR or a different one?

if total != 0 && completed == total {
logrus.Infof("pre-flight: checks completed...")
} else {
if age >= 15*time.Minute {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Extract the duration as a constant?

@@ -842,7 +920,7 @@ func (c *Controller) gcNeeded(obj client.Object) bool {

if obj.GetObjectKind().GroupVersionKind().Kind == "ConfigMap" {
if obj.GetName() == "px-attach-driveset-lock" ||
strings.HasPrefix(obj.GetName(), "px-bringup-queue-lock") {
strings.HasPrefix(obj.GetName(), "pxb-ringup-queue-lock") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is a typo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah transpose the chars on accident..

Signed-off-by: Jose Rivera <jose@portworx.com>
…imeout.

Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
@jrivera-px jrivera-px merged commit 0298ebb into master Aug 7, 2023
6 of 8 checks passed
jrivera-px added a commit that referenced this pull request Aug 7, 2023
…er 15m. (#1070)

* PWX-28826 Boilerplace

Signed-off-by: Harsh Desai <hadesai@purestorage.com>

* more boilerplate

Signed-off-by: Harsh Desai <hadesai@purestorage.com>

* PWX-28826:  Pre-flight check for DMthin.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Add comments and move StorageNode cleanup.

Signed-off-by: Jose Rivera <jose@portworx.com>

* Passed checks should be Info events.

Signed-off-by: Jose Rivera <jose@portworx.com>

* Passed checks should be Info events. (#1010)

Signed-off-by: Jose Rivera <jose@portworx.com>

* Pwx 28826 (#1011)

* Pwx 28826 (#1012)

* PWX-28826: Update with the latest master changes. (#1013)

* Updating CSV to use 23.3.1 released image

* Update for 23.3.1 release

* Controller gen vendor

Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>

* PWX-29389 Add CRD for portworx diags collection

Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>

* PWX-29409: Ignore zones with no nodes (#1008)

In disaggregated mode, there could be zones in which no storage nodes
  might be present. Such a zone would make the maxSNPZ value to be 0.
  CHanging the behavior to ignore 0 nodes in a zone for maxSNPZ
  calculation.

Signed-off-by: Naveen Revanna <nrevanna@purestorage.com>

---------

Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>
Signed-off-by: Naveen Revanna <nrevanna@purestorage.com>
Co-authored-by: CNBU Jenkins <cnbu-jenkins@purestorage.com>
Co-authored-by: Jiafeng Liao <jliao@purestorage.com>
Co-authored-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>
Co-authored-by: Naveen Revanna <83608369+nrevanna@users.noreply.github.com>

* Add PassPreFlight event tag and logging

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Check status of portworx container in pre-flight pod and remove 'wait' code.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Fix unit test.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Fix unit test.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: PR review changes and fix portworx_test.go UTs

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: fix gomack Validate calls.  Also comment out the two tests that don't work since Validate was removed from the controller.validate() func. PWX-30373 to try and fix later.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-30373: Re-add back in the commented out tests and add K8s version check failure to trigger the needed workflow.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Exit pre-check wait if running CBT namespace.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Add 5 min timeout to pre-flight status check.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Exit GetPreFlightStatus() with success if running CBT namespace.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Don't automatically enable dmthin via pre-flight check if running CBT namespace.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-30373: Revert UT and integration test hacks.  Need to mock the functionality correctly.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Increase pre-flight daemonset ready wait to 10mins.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: fix 'TestValidate' UT.  Don't error if pre-flight daemonset exists.

Signed-off-by: Jose Rivera <jose@portworx.com>

* Use reconcile loop to check pre-flight status.  Timeout after 15m.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Startup check was not handling if preflight was not enabled.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Re-enable preflight for integration tests.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: disable preflight for integration tests.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Only do preflight if AWS.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-30496: if '-T dmthin' exists in stc before preflight is ran and preflight fails don't start.  Add default metadata dev for dmthin if it does not exist.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Use condition set to determine of CheckCloudDrivePermission check is to be run. Handle skip annotation and pod cleanup.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Remove debug logging

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-30419:  Remove more debug logging.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-30419: Revert to correct version.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-30419: Fix unit test.  Need to enable pre-flight.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-30419: Fix TestEKSPreflightCheck unit test. Pre-flight now includes DMthin checks so need to handle that.

Signed-off-by: Jose Rivera <jose@portworx.com>

* update from master, conflict fix.

* PWX-28826: Use cluster condition to determine pre-flight status.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: return cluster condition from driverValidate as that to determin status..

Signed-off-by: Jose Rivera <jose@portworx.com>

* Add err message to default condition message and make sure to use default condition on CloudDrivePermission error.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Add a constant for pre-flight timeout value.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: pre-flight timeout should prevent startup.  Add event for timeout.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Fix unit tests.

Signed-off-by: Jose Rivera <jose@portworx.com>

* PWX-28826: Use progress string.

Signed-off-by: Jose Rivera <jose@portworx.com>

---------

Signed-off-by: Harsh Desai <hadesai@purestorage.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>
Signed-off-by: Naveen Revanna <nrevanna@purestorage.com>
Co-authored-by: Harsh Desai <hadesai@purestorage.com>
Co-authored-by: CNBU Jenkins <cnbu-jenkins@purestorage.com>
Co-authored-by: Jiafeng Liao <jliao@purestorage.com>
Co-authored-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>
Co-authored-by: Naveen Revanna <83608369+nrevanna@users.noreply.github.com>
harsh-px added a commit that referenced this pull request Aug 7, 2023
…er 15m. (#1070) (#1188)

* PWX-28826 Boilerplace



* more boilerplate



* PWX-28826:  Pre-flight check for DMthin.



* PWX-28826: Add comments and move StorageNode cleanup.



* Passed checks should be Info events.



* Passed checks should be Info events. (#1010)



* Pwx 28826 (#1011)

* Pwx 28826 (#1012)

* PWX-28826: Update with the latest master changes. (#1013)

* Updating CSV to use 23.3.1 released image

* Update for 23.3.1 release

* Controller gen vendor



* PWX-29389 Add CRD for portworx diags collection



* PWX-29409: Ignore zones with no nodes (#1008)

In disaggregated mode, there could be zones in which no storage nodes
  might be present. Such a zone would make the maxSNPZ value to be 0.
  CHanging the behavior to ignore 0 nodes in a zone for maxSNPZ
  calculation.



---------








* Add PassPreFlight event tag and logging



* PWX-28826: Check status of portworx container in pre-flight pod and remove 'wait' code.



* PWX-28826: Fix unit test.



* PWX-28826: Fix unit test.



* PWX-28826: PR review changes and fix portworx_test.go UTs



* PWX-28826: fix gomack Validate calls.  Also comment out the two tests that don't work since Validate was removed from the controller.validate() func. PWX-30373 to try and fix later.



* PWX-30373: Re-add back in the commented out tests and add K8s version check failure to trigger the needed workflow.



* PWX-28826: Exit pre-check wait if running CBT namespace.



* PWX-28826: Add 5 min timeout to pre-flight status check.



* PWX-28826: Exit GetPreFlightStatus() with success if running CBT namespace.



* PWX-28826: Don't automatically enable dmthin via pre-flight check if running CBT namespace.



* PWX-30373: Revert UT and integration test hacks.  Need to mock the functionality correctly.



* PWX-28826: Increase pre-flight daemonset ready wait to 10mins.



* PWX-28826: fix 'TestValidate' UT.  Don't error if pre-flight daemonset exists.



* Use reconcile loop to check pre-flight status.  Timeout after 15m.



* PWX-28826: Startup check was not handling if preflight was not enabled.



* PWX-28826: Re-enable preflight for integration tests.



* PWX-28826: disable preflight for integration tests.



* PWX-28826: Only do preflight if AWS.



* PWX-30496: if '-T dmthin' exists in stc before preflight is ran and preflight fails don't start.  Add default metadata dev for dmthin if it does not exist.



* PWX-28826: Use condition set to determine of CheckCloudDrivePermission check is to be run. Handle skip annotation and pod cleanup.



* PWX-28826: Remove debug logging



* PWX-30419:  Remove more debug logging.



* PWX-30419: Revert to correct version.



* PWX-30419: Fix unit test.  Need to enable pre-flight.



* PWX-30419: Fix TestEKSPreflightCheck unit test. Pre-flight now includes DMthin checks so need to handle that.



* update from master, conflict fix.

* PWX-28826: Use cluster condition to determine pre-flight status.



* PWX-28826: return cluster condition from driverValidate as that to determin status..



* Add err message to default condition message and make sure to use default condition on CloudDrivePermission error.



* PWX-28826: Add a constant for pre-flight timeout value.



* PWX-28826: pre-flight timeout should prevent startup.  Add event for timeout.



* PWX-28826: Fix unit tests.



* PWX-28826: Use progress string.



---------

Signed-off-by: Harsh Desai <hadesai@purestorage.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>
Signed-off-by: Naveen Revanna <nrevanna@purestorage.com>
Co-authored-by: Harsh Desai <hadesai@purestorage.com>
Co-authored-by: CNBU Jenkins <cnbu-jenkins@purestorage.com>
Co-authored-by: Jiafeng Liao <jliao@purestorage.com>
Co-authored-by: Piyush Nimbalkar <pnimbalkar@purestorage.com>
Co-authored-by: Naveen Revanna <83608369+nrevanna@users.noreply.github.com>
@piyush-nimbalkar piyush-nimbalkar deleted the PWX-28826-use-reconcile-loop branch September 15, 2023 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants