-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PWX-30419: Use reconcile loop to check pre-flight status. Timeout after 15m. #1070
Conversation
Signed-off-by: Harsh Desai <hadesai@purestorage.com>
Signed-off-by: Harsh Desai <hadesai@purestorage.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
* Updating CSV to use 23.3.1 released image * Update for 23.3.1 release * Controller gen vendor Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> * PWX-29389 Add CRD for portworx diags collection Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> * PWX-29409: Ignore zones with no nodes (#1008) In disaggregated mode, there could be zones in which no storage nodes might be present. Such a zone would make the maxSNPZ value to be 0. CHanging the behavior to ignore 0 nodes in a zone for maxSNPZ calculation. Signed-off-by: Naveen Revanna <nrevanna@purestorage.com> --------- Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> Signed-off-by: Naveen Revanna <nrevanna@purestorage.com> Co-authored-by: CNBU Jenkins <cnbu-jenkins@purestorage.com> Co-authored-by: Jiafeng Liao <jliao@purestorage.com> Co-authored-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> Co-authored-by: Naveen Revanna <83608369+nrevanna@users.noreply.github.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
…emove 'wait' code. Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
…that don't work since Validate was removed from the controller.validate() func. PWX-30373 to try and fix later. Signed-off-by: Jose Rivera <jose@portworx.com>
… check failure to trigger the needed workflow. Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
…space. Signed-off-by: Jose Rivera <jose@portworx.com>
…running CBT namespace. Signed-off-by: Jose Rivera <jose@portworx.com>
…ctionality correctly. Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
… exists. Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
|
||
// Skip over driver Validate() if an error has occurred | ||
if err == nil { | ||
err = c.driverValidate(toUpdate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we return an cluster condition constant showing the state of the preflight pods?
if toUpdate.Annotations[pxutil.AnnotationPreflightCheck] != "false" { // In progress | ||
condition.Status = corev1.ClusterConditionStatusInProgress | ||
logrus.Infof("storage cluster preflight in progress...") | ||
|
||
} else { | ||
condition.Status = corev1.ClusterConditionStatusCompleted | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the condition based on the output from driverValidate.
@@ -412,55 +412,59 @@ func (c *Controller) validateCloudStorageLabelKey(cluster *corev1.StorageCluster | |||
return nil | |||
} | |||
|
|||
func (c *Controller) runPreflightCheck(cluster *corev1.StorageCluster) error { | |||
func bringUpStorageCluster(cluster *corev1.StorageCluster) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to return an error here.
@@ -657,6 +714,15 @@ func (c *Controller) syncStorageCluster( | |||
} | |||
} | |||
|
|||
if startOk, err := bringUpStorageCluster(cluster); err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: isPreflightComplete?
drivers/storage/portworx/portworx.go
Outdated
if err != nil { | ||
if errors.IsNotFound(err) { | ||
condition.Status = corev1.ClusterConditionStatusInProgress | ||
condition.Message = "pre-flight: in progress..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*Preflight pods haven't started yet.
Signed-off-by: Jose Rivera <jose@portworx.com>
…termin status.. Signed-off-by: Jose Rivera <jose@portworx.com>
…ault condition on CloudDrivePermission error. Signed-off-by: Jose Rivera <jose@portworx.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrivera-px can we UTs for these changes either in this PR or a different one?
drivers/storage/portworx/portworx.go
Outdated
if total != 0 && completed == total { | ||
logrus.Infof("pre-flight: checks completed...") | ||
} else { | ||
if age >= 15*time.Minute { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Extract the duration as a constant?
@@ -842,7 +920,7 @@ func (c *Controller) gcNeeded(obj client.Object) bool { | |||
|
|||
if obj.GetObjectKind().GroupVersionKind().Kind == "ConfigMap" { | |||
if obj.GetName() == "px-attach-driveset-lock" || | |||
strings.HasPrefix(obj.GetName(), "px-bringup-queue-lock") { | |||
strings.HasPrefix(obj.GetName(), "pxb-ringup-queue-lock") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is a typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah transpose the chars on accident..
Signed-off-by: Jose Rivera <jose@portworx.com>
…imeout. Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
Signed-off-by: Jose Rivera <jose@portworx.com>
…er 15m. (#1070) * PWX-28826 Boilerplace Signed-off-by: Harsh Desai <hadesai@purestorage.com> * more boilerplate Signed-off-by: Harsh Desai <hadesai@purestorage.com> * PWX-28826: Pre-flight check for DMthin. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Add comments and move StorageNode cleanup. Signed-off-by: Jose Rivera <jose@portworx.com> * Passed checks should be Info events. Signed-off-by: Jose Rivera <jose@portworx.com> * Passed checks should be Info events. (#1010) Signed-off-by: Jose Rivera <jose@portworx.com> * Pwx 28826 (#1011) * Pwx 28826 (#1012) * PWX-28826: Update with the latest master changes. (#1013) * Updating CSV to use 23.3.1 released image * Update for 23.3.1 release * Controller gen vendor Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> * PWX-29389 Add CRD for portworx diags collection Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> * PWX-29409: Ignore zones with no nodes (#1008) In disaggregated mode, there could be zones in which no storage nodes might be present. Such a zone would make the maxSNPZ value to be 0. CHanging the behavior to ignore 0 nodes in a zone for maxSNPZ calculation. Signed-off-by: Naveen Revanna <nrevanna@purestorage.com> --------- Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> Signed-off-by: Naveen Revanna <nrevanna@purestorage.com> Co-authored-by: CNBU Jenkins <cnbu-jenkins@purestorage.com> Co-authored-by: Jiafeng Liao <jliao@purestorage.com> Co-authored-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> Co-authored-by: Naveen Revanna <83608369+nrevanna@users.noreply.github.com> * Add PassPreFlight event tag and logging Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Check status of portworx container in pre-flight pod and remove 'wait' code. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Fix unit test. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Fix unit test. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: PR review changes and fix portworx_test.go UTs Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: fix gomack Validate calls. Also comment out the two tests that don't work since Validate was removed from the controller.validate() func. PWX-30373 to try and fix later. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-30373: Re-add back in the commented out tests and add K8s version check failure to trigger the needed workflow. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Exit pre-check wait if running CBT namespace. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Add 5 min timeout to pre-flight status check. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Exit GetPreFlightStatus() with success if running CBT namespace. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Don't automatically enable dmthin via pre-flight check if running CBT namespace. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-30373: Revert UT and integration test hacks. Need to mock the functionality correctly. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Increase pre-flight daemonset ready wait to 10mins. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: fix 'TestValidate' UT. Don't error if pre-flight daemonset exists. Signed-off-by: Jose Rivera <jose@portworx.com> * Use reconcile loop to check pre-flight status. Timeout after 15m. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Startup check was not handling if preflight was not enabled. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Re-enable preflight for integration tests. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: disable preflight for integration tests. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Only do preflight if AWS. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-30496: if '-T dmthin' exists in stc before preflight is ran and preflight fails don't start. Add default metadata dev for dmthin if it does not exist. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Use condition set to determine of CheckCloudDrivePermission check is to be run. Handle skip annotation and pod cleanup. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Remove debug logging Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-30419: Remove more debug logging. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-30419: Revert to correct version. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-30419: Fix unit test. Need to enable pre-flight. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-30419: Fix TestEKSPreflightCheck unit test. Pre-flight now includes DMthin checks so need to handle that. Signed-off-by: Jose Rivera <jose@portworx.com> * update from master, conflict fix. * PWX-28826: Use cluster condition to determine pre-flight status. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: return cluster condition from driverValidate as that to determin status.. Signed-off-by: Jose Rivera <jose@portworx.com> * Add err message to default condition message and make sure to use default condition on CloudDrivePermission error. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Add a constant for pre-flight timeout value. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: pre-flight timeout should prevent startup. Add event for timeout. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Fix unit tests. Signed-off-by: Jose Rivera <jose@portworx.com> * PWX-28826: Use progress string. Signed-off-by: Jose Rivera <jose@portworx.com> --------- Signed-off-by: Harsh Desai <hadesai@purestorage.com> Signed-off-by: Jose Rivera <jose@portworx.com> Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> Signed-off-by: Naveen Revanna <nrevanna@purestorage.com> Co-authored-by: Harsh Desai <hadesai@purestorage.com> Co-authored-by: CNBU Jenkins <cnbu-jenkins@purestorage.com> Co-authored-by: Jiafeng Liao <jliao@purestorage.com> Co-authored-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> Co-authored-by: Naveen Revanna <83608369+nrevanna@users.noreply.github.com>
…er 15m. (#1070) (#1188) * PWX-28826 Boilerplace * more boilerplate * PWX-28826: Pre-flight check for DMthin. * PWX-28826: Add comments and move StorageNode cleanup. * Passed checks should be Info events. * Passed checks should be Info events. (#1010) * Pwx 28826 (#1011) * Pwx 28826 (#1012) * PWX-28826: Update with the latest master changes. (#1013) * Updating CSV to use 23.3.1 released image * Update for 23.3.1 release * Controller gen vendor * PWX-29389 Add CRD for portworx diags collection * PWX-29409: Ignore zones with no nodes (#1008) In disaggregated mode, there could be zones in which no storage nodes might be present. Such a zone would make the maxSNPZ value to be 0. CHanging the behavior to ignore 0 nodes in a zone for maxSNPZ calculation. --------- * Add PassPreFlight event tag and logging * PWX-28826: Check status of portworx container in pre-flight pod and remove 'wait' code. * PWX-28826: Fix unit test. * PWX-28826: Fix unit test. * PWX-28826: PR review changes and fix portworx_test.go UTs * PWX-28826: fix gomack Validate calls. Also comment out the two tests that don't work since Validate was removed from the controller.validate() func. PWX-30373 to try and fix later. * PWX-30373: Re-add back in the commented out tests and add K8s version check failure to trigger the needed workflow. * PWX-28826: Exit pre-check wait if running CBT namespace. * PWX-28826: Add 5 min timeout to pre-flight status check. * PWX-28826: Exit GetPreFlightStatus() with success if running CBT namespace. * PWX-28826: Don't automatically enable dmthin via pre-flight check if running CBT namespace. * PWX-30373: Revert UT and integration test hacks. Need to mock the functionality correctly. * PWX-28826: Increase pre-flight daemonset ready wait to 10mins. * PWX-28826: fix 'TestValidate' UT. Don't error if pre-flight daemonset exists. * Use reconcile loop to check pre-flight status. Timeout after 15m. * PWX-28826: Startup check was not handling if preflight was not enabled. * PWX-28826: Re-enable preflight for integration tests. * PWX-28826: disable preflight for integration tests. * PWX-28826: Only do preflight if AWS. * PWX-30496: if '-T dmthin' exists in stc before preflight is ran and preflight fails don't start. Add default metadata dev for dmthin if it does not exist. * PWX-28826: Use condition set to determine of CheckCloudDrivePermission check is to be run. Handle skip annotation and pod cleanup. * PWX-28826: Remove debug logging * PWX-30419: Remove more debug logging. * PWX-30419: Revert to correct version. * PWX-30419: Fix unit test. Need to enable pre-flight. * PWX-30419: Fix TestEKSPreflightCheck unit test. Pre-flight now includes DMthin checks so need to handle that. * update from master, conflict fix. * PWX-28826: Use cluster condition to determine pre-flight status. * PWX-28826: return cluster condition from driverValidate as that to determin status.. * Add err message to default condition message and make sure to use default condition on CloudDrivePermission error. * PWX-28826: Add a constant for pre-flight timeout value. * PWX-28826: pre-flight timeout should prevent startup. Add event for timeout. * PWX-28826: Fix unit tests. * PWX-28826: Use progress string. --------- Signed-off-by: Harsh Desai <hadesai@purestorage.com> Signed-off-by: Jose Rivera <jose@portworx.com> Signed-off-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> Signed-off-by: Naveen Revanna <nrevanna@purestorage.com> Co-authored-by: Harsh Desai <hadesai@purestorage.com> Co-authored-by: CNBU Jenkins <cnbu-jenkins@purestorage.com> Co-authored-by: Jiafeng Liao <jliao@purestorage.com> Co-authored-by: Piyush Nimbalkar <pnimbalkar@purestorage.com> Co-authored-by: Naveen Revanna <83608369+nrevanna@users.noreply.github.com>
What this PR does / why we need it:
When executing pre-flight check we would wait for the checks to complete blocking the reconcile loop. This prevented any other state change from being processed. So this change corrected that. Use cluster conditions to determine if we are still executing a pre-flight. If so check the status each time though the loop. Add a 15 min timeout based on the time the DS has been running.
Which issue(s) this PR fixes (optional)
Closes #
PWX-30419
Special notes for your reviewer: