
cmd: wait for the cluster to be initialized #1132

Merged · 3 commits · Jan 31, 2019

Conversation

staebler (Contributor)

After creating the cluster, wait until the ClusterVersion object indicates that the cluster has been initialized prior to exiting the installer.
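For context, the wait this PR adds looks conceptually like the Go sketch below. It is illustrative only, not the installer's actual code: it assumes openshift/api and openshift/client-go are vendored, a client-go version whose Get takes a context, and a hypothetical entry point that reads KUBECONFIG.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForInitializedCluster polls the ClusterVersion object named "version"
// until its Available condition is True, or the timeout expires.
func waitForInitializedCluster(ctx context.Context, cfg *rest.Config, timeout time.Duration) error {
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		return err
	}
	return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
		cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
		if err != nil {
			return false, nil // transient API errors: keep waiting
		}
		for _, c := range cv.Status.Conditions {
			if c.Type == configv1.OperatorAvailable && c.Status == configv1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}

func main() {
	// Hypothetical entry point: the real installer builds its rest.Config
	// from the freshly generated admin kubeconfig instead.
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := waitForInitializedCluster(context.Background(), cfg, 30*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, "Cluster failed to initialize:", err)
		os.Exit(1)
	}
	fmt.Println("Cluster is initialized")
}
```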

@openshift-ci-robot added the size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels on Jan 26, 2019.
staebler (Contributor, Author)

Also, this was made worse because neither the installer nor the e2e run setup waits for cluster version to be valid, so we were starting e2e before the cluster was fully configured.

We should fix that as well (probably in both places).

@smarterclayton Is this what you are looking for with regards to having the installer verify that the cluster version is valid?

staebler (Contributor, Author) commented Jan 26, 2019

This is the error from the installer when run with these changes.

FATAL Cluster failed to initialize: Could not update clusteroperator "cluster-monitoring-operator" (config.openshift.io/v1, 177 of 241)


But the cluster did eventually initialize, which means that the installer should not stop when the cluster version object has a Failing/True condition.

wking (Member) commented Jan 26, 2019

> But the cluster did eventually initialize...

Would be interesting to see how it recovered. Maybe it will turn up in CI where we'll have logs.

staebler (Contributor, Author) commented Jan 26, 2019

Output from the installer with it changed to not stop after seeing a Failing/True condition.

time="2019-01-25T23:15:12-05:00" level=info msg="Waiting up to 10m0s for the cluster to be initialized..."
time="2019-01-25T23:15:13-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:15:44-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:15:53-05:00" level=debug msg="Still waiting for the cluster to initialize: Could not update clusteroperator \"openshift-apiserver\" (config.openshift.io/v1, 116 of 241)"
time="2019-01-25T23:16:24-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:16:54-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:17:25-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:17:56-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:18:34-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:19:12-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:19:43-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:20:03-05:00" level=debug msg="Still waiting for the cluster to initialize: Could not update clusteroperator \"cluster-monitoring-operator\" (config.openshift.io/v1, 177 of 241)"
time="2019-01-25T23:20:34-05:00" level=debug msg="Still waiting for the cluster to initialize: Could not update clusteroperator \"cluster-monitoring-operator\" (config.openshift.io/v1, 177 of 241)"
time="2019-01-25T23:20:53-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:21:24-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:21:55-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:22:07-05:00" level=debug msg="Cluster is initialized"
time="2019-01-25T23:22:07-05:00" level=info msg="Waiting up to 10m0s for the openshift-console route to be created..."

staebler (Contributor, Author)

Pushed an update to wait up to 30 minutes, to not stop when ClusterVersion reports Failing/True, and to better downsample the log messages.
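For illustration, downsampling can be as simple as only emitting the debug line when the ClusterVersion failure message changes or a minimum interval has elapsed. The sketch below is one assumed way to do it, not the installer's actual implementation; the package and type names are hypothetical, and it assumes the sirupsen/logrus logger the installer's output format suggests.

```go
package clusterwait

import (
	"time"

	"github.com/sirupsen/logrus"
)

// waitLogger suppresses repetitive "Still waiting..." lines: it only logs
// when the message changes or minInterval has passed since the last line.
type waitLogger struct {
	lastMessage string
	lastLogged  time.Time
	minInterval time.Duration
}

func (w *waitLogger) logProgress(message string) {
	if message == w.lastMessage && time.Since(w.lastLogged) < w.minInterval {
		return
	}
	if message == "" {
		logrus.Debug("Still waiting for the cluster to initialize...")
	} else {
		logrus.Debugf("Still waiting for the cluster to initialize: %s", message)
	}
	w.lastMessage = message
	w.lastLogged = time.Now()
}
```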

staebler (Contributor, Author)

/retest

smarterclayton (Contributor) left a comment

A couple of comments but yes.

Review threads on cmd/openshift-install/create.go (outdated, resolved)

The ClusterVersion client is not available in the version of openshift/client-go
that was pinned. These changes remove the pinnings of openshift/client-go and
openshift/api.

Commands run:
  dep ensure
  dep ensure -update github.com/openshift/api

Vendor openshift/library-go for access to helper functions for evaluating
the status conditions of the ClusterVersion object.

After creating the cluster, wait up to 30 minutes for the
ClusterVersion object to indicate that the cluster has been
initialized prior to exiting the installer.
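The library-go helpers mentioned in the vendoring commit make the condition checks terse. The sketch below shows the kind of call involved; the v1helpers import path and signature are my assumption of the usual helper package, so treat it as illustrative rather than the PR's exact code.

```go
package clusterwait

import (
	configv1 "github.com/openshift/api/config/v1"
	cov1helpers "github.com/openshift/library-go/pkg/config/clusteroperator/v1helpers"
)

// clusterState reports whether the ClusterVersion says the cluster is
// available, and whether it is currently reporting Failing=True (which,
// per this PR, is logged but does not abort the wait).
func clusterState(cv *configv1.ClusterVersion) (available, failing bool) {
	available = cov1helpers.IsStatusConditionTrue(cv.Status.Conditions, configv1.OperatorAvailable)
	failing = cov1helpers.IsStatusConditionTrue(cv.Status.Conditions, configv1.ClusterStatusConditionType("Failing"))
	return available, failing
}
```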
abhinavdahiya (Contributor)

/approve

This looks good; will wait for some time to allow other people to comment.
@smarterclayton @wking PTAL

smarterclayton (Contributor)

/test e2e-aws

Sorry, using you as a guinea pig

wking (Member) commented Jan 30, 2019

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Jan 30, 2019.
openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, staebler, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,staebler,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

wking (Member) commented Jan 31, 2019

e2e-aws:

Flaky tests:

[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should support existing directory [Suite:openshift/conformance/parallel] [Suite:k8s]

Failing tests:

[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails [Suite:openshift/conformance/parallel] [Suite:k8s]

Notes on the StatefulSet panics here.

/retest

@openshift-merge-robot openshift-merge-robot merged commit 9f73958 into openshift:master Jan 31, 2019
@wking wking mentioned this pull request Jan 31, 2019
wking added a commit to wking/openshift-installer that referenced this pull request Oct 27, 2021
The console may become optional [1], so teach the installer to handle
its absence gracefully.

We've waited on the console since way back in ff53523 (add logs at
end of install for kubeadmin, consoleURL, 2018-12-06, openshift#806).  Back
then, install-complete timing was much less organized, and since
e17ba3c (cmd: wait for the cluster to be initialized, 2019-01-25, openshift#1132)
we've blocked on ClusterVersion going Available=True. So the current
dependency chain is:

1. Console route admission blocks console operator from going
   Available=True in its ClusterOperator.
2. Console ClusterOperator blocks cluster-version operator from
   going Available=True in ClusterVersion.
3. ClusterVersion blocks installer's waitForInitializedCluster.

So we no longer need to wait for the route to show up, and can fail
fast if we get a clear IsNotFound.  I'm keeping a bit of polling so we
don't fail an install on a temporary network hiccup.

We don't want to drop the console check entirely, because when it is
found, we want:

* To continue to log that access pathway on install-complete.
* To continue to append the router CA to the kubeconfig.

That latter point has been done since 4033577 (append router CA to
cluster CA in kubeconfig, 2019-02-12, openshift#1242).  The motivation in that
commit message is not explicit, but the idea is to support folks who
naively run 'oc login' with the kubeadmin kubeconfig [2] (despite that
kubeconfig already having cluster-root access) when the console
route's cert's CA happens to be something that the user's local trust
store doesn't include by default.

[1]: openshift/enhancements#922
[2]: openshift#1541 (comment)
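A rough Go sketch of that fail-fast-on-NotFound idea follows. The package name, timeouts, and helper are assumptions for illustration, not the installer's actual code; it assumes the openshift/client-go route clientset and a client-go version whose Get takes a context.

```go
package consolewait

import (
	"context"
	"errors"
	"time"

	routeclient "github.com/openshift/client-go/route/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/rest"
)

// errConsoleNotInstalled marks a definitive "the console is absent" result.
var errConsoleNotInstalled = errors.New("openshift-console route not found; assuming the console is not installed")

// getConsoleRouteHost polls briefly for the console route so a transient
// network hiccup does not fail the install, but returns immediately with
// errConsoleNotInstalled on a clean IsNotFound from the API server.
func getConsoleRouteHost(ctx context.Context, cfg *rest.Config) (string, error) {
	client, err := routeclient.NewForConfig(cfg)
	if err != nil {
		return "", err
	}
	var host string
	err = wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		route, getErr := client.RouteV1().Routes("openshift-console").Get(ctx, "console", metav1.GetOptions{})
		if apierrors.IsNotFound(getErr) {
			return false, errConsoleNotInstalled // fail fast: console not installed
		}
		if getErr != nil {
			return false, nil // retry on other (possibly transient) errors
		}
		host = route.Spec.Host
		return true, nil
	})
	return host, err
}
```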
wking added a commit to wking/openshift-installer that referenced this pull request Aug 1, 2022, with the same commit message as above.

TrilokGeer pushed a commit to TrilokGeer/installer that referenced this pull request Sep 14, 2022, with the same commit message as above.