Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PWX-30419: Use reconcile loop to check pre-flight status. Timeout after 15m. #1070

Merged
merged 63 commits into from
Aug 7, 2023
Merged
Show file tree
Hide file tree
Changes from 58 commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
069761b
PWX-28826 Boilerplace
harsh-px Mar 28, 2023
f8b43c0
more boilerplate
harsh-px Mar 28, 2023
29d1d99
PWX-28826: Pre-flight check for DMthin.
jrivera-px Apr 3, 2023
f82f027
PWX-28826: Add comments and move StorageNode cleanup.
jrivera-px Apr 3, 2023
740fc60
Passed checks should be Info events.
jrivera-px Apr 4, 2023
b930040
Passed checks should be Info events. (#1010)
jrivera-px Apr 5, 2023
e06c9f4
Merge branch 'PWX-28826' of github.com:libopenstorage/operator into P…
jrivera-px Apr 5, 2023
6a0e863
Pwx 28826 (#1011)
jrivera-px Apr 5, 2023
94c6226
Pwx 28826 (#1012)
jrivera-px Apr 5, 2023
0b20b13
PWX-28826: Update with the latest master changes. (#1013)
jrivera-px Apr 5, 2023
4cb501e
Merge branch 'master' into PWX-28826
jrivera-px Apr 5, 2023
0ff38ae
Add PassPreFlight event tag and logging
jrivera-px Apr 6, 2023
5d0843e
PWX-28826: Check status of portworx container in pre-flight pod and r…
jrivera-px Apr 6, 2023
766939d
PWX-28826: Fix unit test.
jrivera-px Apr 6, 2023
2b4e04b
PWX-28826: Fix unit test.
jrivera-px Apr 6, 2023
c82bd43
PWX-28826: PR review changes and fix portworx_test.go UTs
jrivera-px Apr 8, 2023
02a5f6c
PWX-28826: fix gomack Validate calls. Also comment out the two tests…
jrivera-px Apr 9, 2023
ed4aa58
PWX-30373: Re-add back in the commented out tests and add K8s version…
jrivera-px Apr 9, 2023
e21de49
PWX-28826: Exit pre-check wait if running CBT namespace.
jrivera-px Apr 10, 2023
ff94003
PWX-28826: Add 5 min timeout to pre-flight status check.
jrivera-px Apr 10, 2023
c931dae
PWX-28826: Exit GetPreFlightStatus() with success if running CBT name…
jrivera-px Apr 10, 2023
3a4b027
PWX-28826: Don't automatically enable dmthin via pre-flight check if …
jrivera-px Apr 10, 2023
b55a756
PWX-30373: Revert UT and integration test hacks. Need to mock the fu…
jrivera-px Apr 11, 2023
6ec969e
PWX-28826: Increase pre-flight daemonset ready wait to 10mins.
jrivera-px Apr 12, 2023
528ad6d
PWX-28826: fix 'TestValidate' UT. Don't error if pre-flight daemonse…
jrivera-px Apr 13, 2023
7bd5f40
Merge branch 'master' into PWX-28826
jrivera-px Apr 16, 2023
0ab2abf
Use reconcile loop to check pre-flight status. Timeout after 15m.
jrivera-px Apr 16, 2023
5978e4f
PWX-28826: Startup check was not handling if preflight was not enabled.
jrivera-px Apr 16, 2023
9cd7f81
PWX-28826: Re-enable preflight for integration tests.
jrivera-px Apr 16, 2023
541f6be
PWX-28826: disable preflight for integration tests.
jrivera-px Apr 17, 2023
a2e89fa
PWX-28826: Only do preflight if AWS.
jrivera-px Apr 17, 2023
a63a768
PWX-30496: if '-T dmthin' exists in stc before preflight is ran and …
jrivera-px Apr 20, 2023
12e3584
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px May 3, 2023
0757938
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jun 9, 2023
f4bfc58
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jun 9, 2023
953d9fc
PWX-28826: Use condition set to determine of CheckCloudDrivePermissio…
jrivera-px Jun 9, 2023
6f843c3
PWX-28826: Remove debug logging
jrivera-px Jun 9, 2023
90ce4ab
PWX-30419: Remove more debug logging.
jrivera-px Jun 9, 2023
37c5281
PWX-30419: Revert to correct version.
jrivera-px Jun 9, 2023
98c2f14
PWX-30419: Fix unit test. Need to enable pre-flight.
jrivera-px Jun 12, 2023
f1b6e9b
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jun 13, 2023
434251d
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jun 14, 2023
03fbb76
PWX-30419: Fix TestEKSPreflightCheck unit test. Pre-flight now includ…
jrivera-px Jun 16, 2023
50b829a
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jul 7, 2023
f883477
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jul 12, 2023
4d6cab8
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jul 13, 2023
92d7b5e
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jul 14, 2023
64b0819
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jul 27, 2023
79b8905
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jul 27, 2023
fa1097c
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jul 27, 2023
a8c8f0e
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jul 31, 2023
55a7d3d
update from master, conflict fix.
jrivera-px Jul 31, 2023
ba25d86
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Jul 31, 2023
5a9933f
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Aug 3, 2023
418da07
PWX-28826: Use cluster condition to determine pre-flight status.
jrivera-px Aug 3, 2023
73d4808
PWX-28826: return cluster condition from driverValidate as that to de…
jrivera-px Aug 4, 2023
1c7b4b3
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Aug 4, 2023
8d6cf9f
Add err message to default condition message and make sure to use def…
jrivera-px Aug 4, 2023
8ed780b
Merge branch 'master' into PWX-28826-use-reconcile-loop
jrivera-px Aug 4, 2023
602f01b
PWX-28826: Add a constant for pre-flight timeout value.
jrivera-px Aug 4, 2023
eb1862b
PWX-28826: pre-flight timeout should prevent startup. Add event for …
jrivera-px Aug 5, 2023
ee9bc1d
PWX-28826: Fix unit tests.
jrivera-px Aug 5, 2023
e6659b5
PWX-28826: Use progress string.
jrivera-px Aug 6, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
116 changes: 81 additions & 35 deletions drivers/storage/portworx/portworx.go
Original file line number Diff line number Diff line change
Expand Up @@ -98,70 +98,116 @@ func (p *portworx) Init(
}

func (p *portworx) Validate(cluster *corev1.StorageCluster) error {

condition := &corev1.ClusterCondition{
Source: pxutil.PortworxComponentName,
Type: corev1.ClusterConditionTypePreflight,
}

setClusterCondition := func(status corev1.ClusterConditionStatus, msg string) {
condition.Status = status
condition.Message = msg
util.UpdateStorageClusterCondition(cluster, condition)
}

podSpec, err := p.GetStoragePodSpec(cluster, "")
if err != nil {
err = fmt.Errorf("pre-flight: get storage pod spec: %v", err)
setClusterCondition(corev1.ClusterConditionStatusFailed, err.Error())
return err
}

preFlighter := NewPreFlighter(cluster, p.k8sClient, podSpec)

deletePreflight := func() {
_, _, err := GetPreFlightPodsFromNamespace(p.k8sClient, cluster.Namespace)
if err == nil {
// Clean up the pre-flight pods
logrus.Infof("pre-flight: cleaning pre-flight ds...")
if derr := preFlighter.DeletePreFlight(); derr != nil {
logrus.Errorf("pre-flight: error deleting pre-flight: %v", derr)
}
}
}

check, ok := cluster.Annotations[pxutil.AnnotationPreflightCheck]
check = strings.TrimSpace(strings.ToLower(check))
if !ok || check == "skip" {
setClusterCondition(corev1.ClusterConditionStatusDisabled, "pre-flight: skipped...")
logrus.Infof(condition.Message)
deletePreflight()
return nil
}

// Start the pre-flight container. The pre-flight checks at this time are specific to enabling DMthin
err = preFlighter.RunPreFlight()
if err != nil {
if !errors.IsAlreadyExists(err) {
err = fmt.Errorf("pre-flight: run error: %v", err)
setClusterCondition(corev1.ClusterConditionStatusFailed, err.Error())
deletePreflight()
return err
}
logrus.Debugf("pre-flight: container already running...")
}

defer func() {
// Clean up the pre-flight pods
logrus.Infof("pre-flight: cleaning pre-flight ds...")
if derr := preFlighter.DeletePreFlight(); derr != nil {
logrus.Errorf("pre-flight: error deleting pre-flight: %v", derr)
}
}()

cnt := 0
// Wait for all the pre-flight pods to finish
for {
time.Sleep(3 * time.Second) // Pause before status check
completed, inProgress, total, err := preFlighter.GetPreFlightStatus()
if err != nil {
logrus.Errorf("pre-flight: error getting pre-flight status: %v", err)
return err
}
logrus.Infof("pre-flight: Completed [%v] In Progress [%v] Total [%v]", completed, inProgress, total)

if total != 0 && completed == total {
logrus.Infof("pre-flight: completed...")
break
completed, inProgress, total, age, err := preFlighter.GetPreFlightStatus()
if err != nil {
if errors.IsNotFound(err) {
setClusterCondition(corev1.ClusterConditionStatusInProgress, "pre-flight: has not started yet...")
logrus.Infof(condition.Message)
return nil
}
err = fmt.Errorf("pre-flight: error getting pre-flight status: %v", err)
setClusterCondition(corev1.ClusterConditionStatusFailed, err.Error())
deletePreflight()
return err
}

// Add five minute timeout. If we do reconcile loop check we will need a different way.
cnt++
if cnt == 200 { // 3s * 100 = 300s (10 mins)
err = fmt.Errorf("pre-flight: pre-flight status check timed out")
logrus.Errorf("%v", err)
logrus.Infof("pre-flight: Completed [%v] In Progress [%v] Total [%v]", completed, inProgress, total)
if total != 0 && completed == total {
logrus.Infof("pre-flight: checks completed...")
} else {
if age >= 15*time.Minute {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Extract the duration as a constant?

err = fmt.Errorf("pre-flight: pre-flight check timed out")
setClusterCondition(corev1.ClusterConditionStatusTimeout, err.Error())
deletePreflight()
return err
}
setClusterCondition(corev1.ClusterConditionStatusInProgress,
fmt.Sprintf("pre-flight: Operation still in progress: Completed [%v] In Progress [%v] Total [%v]",
completed, inProgress, total))
logrus.Infof(condition.Message)
return nil
}

// Process all the StorageNode.Status.Checks
var storageNodes []*corev1.StorageNode

storageNodes, err = p.storageNodesList(cluster)
if err == nil {
err = preFlighter.ProcessPreFlightResults(p.recorder, storageNodes)
if err != nil {
logrus.Errorf("pre-flight: Error processing results: %v", err)
}
} else {
logrus.Errorf("pre-flight incomplete: Error getting storage node list: %v", err)
if err != nil {
err = fmt.Errorf("pre-flight incomplete: Error getting storage node list: %v", err)
setClusterCondition(corev1.ClusterConditionStatusFailed, err.Error())
deletePreflight()
return err
}

return err
logrus.Infof("pre-flight: process pre-flight results...")
err = preFlighter.ProcessPreFlightResults(p.recorder, storageNodes)
if err != nil {
err = fmt.Errorf("pre-flight: Error processing results: %v", err)
setClusterCondition(corev1.ClusterConditionStatusFailed, err.Error())
deletePreflight()
return err
}

setClusterCondition(corev1.ClusterConditionStatusCompleted, "pre-flight: done...")
logrus.Infof(condition.Message)
deletePreflight()

return nil
}

func (p *portworx) initializeComponents() {
for _, comp := range component.GetAll() {
comp.Initialize(p.k8sClient, *p.k8sVersion, p.scheme, p.recorder)
Expand Down
12 changes: 12 additions & 0 deletions drivers/storage/portworx/portworx_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,9 @@ func TestValidate(t *testing.T) {
ObjectMeta: metav1.ObjectMeta{
Name: "px-cluster",
Namespace: "kube-test",
Annotations: map[string]string{
pxutil.AnnotationPreflightCheck: "true",
},
},
}

Expand Down Expand Up @@ -151,6 +154,7 @@ func TestValidate(t *testing.T) {
},
}

coreops.SetInstance(coreops.New(fakek8sclient.NewSimpleClientset()))
k8sClient := testutil.FakeK8sClient(preflightDS)

err := k8sClient.Create(context.TODO(), preFlightPod1)
Expand Down Expand Up @@ -222,6 +226,9 @@ func TestValidateCheckFailure(t *testing.T) {
ObjectMeta: metav1.ObjectMeta{
Name: "px-cluster",
Namespace: "kube-test",
Annotations: map[string]string{
pxutil.AnnotationPreflightCheck: "true",
},
},
}

Expand Down Expand Up @@ -287,6 +294,7 @@ func TestValidateCheckFailure(t *testing.T) {
},
}

coreops.SetInstance(coreops.New(fakek8sclient.NewSimpleClientset()))
k8sClient := testutil.FakeK8sClient(preflightDS)

err := k8sClient.Create(context.TODO(), preFlightPod1)
Expand Down Expand Up @@ -323,6 +331,9 @@ func TestValidateMissingRequiredCheck(t *testing.T) {
ObjectMeta: metav1.ObjectMeta{
Name: "px-cluster",
Namespace: "kube-test",
Annotations: map[string]string{
pxutil.AnnotationPreflightCheck: "true",
},
},
}

Expand Down Expand Up @@ -379,6 +390,7 @@ func TestValidateMissingRequiredCheck(t *testing.T) {
},
}

coreops.SetInstance(coreops.New(fakek8sclient.NewSimpleClientset()))
k8sClient := testutil.FakeK8sClient(preflightDS)

err := k8sClient.Create(context.TODO(), preFlightPod1)
Expand Down
27 changes: 18 additions & 9 deletions drivers/storage/portworx/preflight.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ import (
"fmt"
"regexp"
"strings"
"time"

"github.com/hashicorp/go-version"
"github.com/sirupsen/logrus"
Expand Down Expand Up @@ -43,8 +44,9 @@ type PreFlightPortworx interface {
// RunPreFlight runs the pre-flight daemonset
RunPreFlight() error
// GetPreFlightStatus returns the status of the pre-flight daemonset
// returns the no. of completed, in progress and total pods
GetPreFlightStatus() (int32, int32, int32, error)
// returns the no. of completed, in progress and total pods, plus how
// long the daemonset has been running.
GetPreFlightStatus() (int32, int32, int32, time.Duration, error)
// GetPreFlightPods returns the pods of the pre-flight daemonset
GetPreFlightPods() ([]*v1.Pod, error)
// ProcessPreFlightResults process StorageNode status checks
Expand Down Expand Up @@ -77,7 +79,7 @@ func NewPreFlighter(
}
}

func getPreFlightPodsFromNamespace(k8sClient client.Client, namespace string) (*appsv1.DaemonSet, []*v1.Pod, error) {
func GetPreFlightPodsFromNamespace(k8sClient client.Client, namespace string) (*appsv1.DaemonSet, []*v1.Pod, error) {
ds := &appsv1.DaemonSet{}
err := k8sClient.Get(
context.TODO(),
Expand Down Expand Up @@ -227,14 +229,14 @@ func (u *preFlightPortworx) CreatePreFlightDaemonsetSpec(ownerRef *metav1.OwnerR

// GetPreFlightPods returns the pods of the pre-flight daemonset
func (u *preFlightPortworx) GetPreFlightPods() ([]*v1.Pod, error) {
_, pods, err := getPreFlightPodsFromNamespace(u.k8sClient, u.cluster.Namespace)
_, pods, err := GetPreFlightPodsFromNamespace(u.k8sClient, u.cluster.Namespace)
return pods, err
}

func (u *preFlightPortworx) GetPreFlightStatus() (int32, int32, int32, error) {
ds, pods, err := getPreFlightPodsFromNamespace(u.k8sClient, u.cluster.Namespace)
func (u *preFlightPortworx) GetPreFlightStatus() (int32, int32, int32, time.Duration, error) {
ds, pods, err := GetPreFlightPodsFromNamespace(u.k8sClient, u.cluster.Namespace)
if err != nil {
return -1, -1, -1, err
return -1, -1, -1, 0, err
}
totalPods := ds.Status.DesiredNumberScheduled
completedPods := 0
Expand All @@ -247,8 +249,11 @@ func (u *preFlightPortworx) GetPreFlightStatus() (int32, int32, int32, error) {
}
}
}

statusAge := time.Since(ds.CreationTimestamp.Time)
logrus.Infof("Pre-flight status running time: %v", statusAge)
logrus.Infof("Pre-flight Status: Completed [%v] InProgress [%v] Total Pods [%v]", completedPods, totalPods-int32(completedPods), totalPods)
return int32(completedPods), totalPods - int32(completedPods), totalPods, nil
return int32(completedPods), totalPods - int32(completedPods), totalPods, statusAge, nil
}

func (u *preFlightPortworx) RunPreFlight() error {
Expand Down Expand Up @@ -288,7 +293,11 @@ func (u *preFlightPortworx) RunPreFlight() error {

err = u.k8sClient.Create(context.TODO(), preflightDS)
if err != nil {
logrus.Errorf("runPreFlight: error creating: %v", err)
if errors.IsAlreadyExists(err) {
logrus.Infof("runPreFlight: daemonset already exists")
} else {
logrus.Errorf("RunPreFlight: error creating: %v", err)
}
}

return err
Expand Down
8 changes: 7 additions & 1 deletion pkg/controller/storagecluster/controller_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -9114,15 +9114,21 @@ func TestEKSPreflightCheck(t *testing.T) {

// TestCase: fix the permission then rerun preflight
cluster.Annotations[pxutil.AnnotationPreflightCheck] = "true"
// Cloud drive permission check will only be checked once when the status condition is not set.
cluster.Status.Conditions = []corev1.ClusterCondition{}
preflightOps.EXPECT().CheckCloudDrivePermission(cluster).Return(nil)
err = controller.runPreflightCheck(cluster)
require.NoError(t, err)

// Pre-flight checks now contain DMthin checks which will not be executed unless cloud permission checks passes.
// Since the DMthin checks will not be executed the condition returned will be 'ClusterConditionStatusInProgress'.
// This is good enough to determine if cloud permission is working. If it had failed DMthin would not be run and
// the condition would have been ClusterConditionStatusFailed.
err = testutil.Get(k8sClient, cluster, cluster.Name, cluster.Namespace)
require.NoError(t, err)
condition = util.GetStorageClusterCondition(cluster, pxutil.PortworxComponentName, corev1.ClusterConditionTypePreflight)
require.NotNil(t, condition)
require.Equal(t, corev1.ClusterConditionStatusCompleted, condition.Status)
require.Equal(t, corev1.ClusterConditionStatusInProgress, condition.Status)
}

func TestStorageClusterStateDuringValidation(t *testing.T) {
Expand Down
Loading
Loading