Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 2090816: Make bootstrap timeout configurable #5979

Closed
wants to merge 1 commit into from

Conversation

honza
Copy link
Member

@honza honza commented Jun 8, 2022

We add a new bootstrap-timeout flag to the create cluster command.

We add a new `bootstrap-timeout` flag to the `create cluster` command.
@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jun 8, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 8, 2022

@honza: This pull request references Bugzilla bug 2090816, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (augol@redhat.com), skipping review request.

In response to this:

Bug 2090816: Make bootstrap timeout configurable

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hardys
Copy link
Contributor

hardys commented Jun 8, 2022

As someone who's mostly used the installer for baremetal (where the delays related to the environment can vary a lot) this makes sense to me (although I suspect we'd ideally want to expose some way to control all of the timeouts, not only the bootstrap-complete one).

However previous feedback from @patrickdillon indicated there may be an installer team preference to just increasing the hard-coded timeout (just for baremetal) similar to #2979

Personally I prefer the configurable timeout approach, since it avoids making installs fail-slow when there are not specific issues (like slow POST times, network bandwidth constraints etc), and also it's hard to choose any one hard-coded value that will always work on-prem, because the delays can vary so much unlike in typical cloud scenarios where things are more deterministic.

@sadasu
Copy link
Contributor

sadasu commented Jun 8, 2022

I could be wrong but do we have a default value for this configurable bootstrap-timeout flag?

@honza
Copy link
Member Author

honza commented Jun 8, 2022

I could be wrong but do we have a default value for this configurable bootstrap-timeout flag?

Yes

@sadasu
Copy link
Contributor

sadasu commented Jun 8, 2022

I could be wrong but do we have a default value for this configurable bootstrap-timeout flag?

Yes

Cool! I missed that.

@patrickdillon
Copy link
Contributor

Thanks for putting this together with the alternative #5981. In general, this timeout should only be for baremetal (or perhaps other on-prem environs too)--not cloud. Hopefully the reason is clear, but I can explain further if needed.

That cross-platform approach is one issue with this implementation. I'm also trying to check whether this flag applies to all create commands or just create cluster. I'm struggling to get this to work and the flag is undocumented:

$ ./openshift-install create cluster --help
Create an OpenShift cluster

Usage:
  openshift-install create cluster [flags]

Flags:
  -h, --help   help for cluster

Global Flags:
      --dir string         assets directory (default ".")
      --log-level string   log level (e.g. "debug | info | warn | error") (default "info")
$ ./openshift-install create cluster --dir gcp-test --bootstrap-timeout 30
FATA[0000] Error executing openshift-install: unknown flag: --bootstrap-timeout

These reasons suggest to me that a flag is not the right approach for this.

Personally I prefer the configurable timeout approach, since it avoids making installs fail-slow when there are not specific issues (like slow POST times, network bandwidth constraints etc), and also it's hard to choose any one hard-coded value that will always work on-prem, because the delays can vary so much unlike in typical cloud scenarios where things are more deterministic.

I take @hardys point here, and I agree that we don't want to degrade the general experience to enable potentially a handful of environments. I will point out that the downside for a configurable timeout is that it would still require configuration for those problematic environments to work, rather than working out of the box. But with the understanding those troublesome environments are the exception and not the norm, we agree it is better to require the extra configuration for those failing environments rather than potentially fail slow in all cases.

If the flag solution is not going to work, I see two potential solutions for allowing this to be configurable:

  • environment variable - generally we avoid env vars and we don't document them, but I feel like this might fit the pattern where they have been applied in the past. I would need to double check (@sdodson do you have a sense?) whether we have restrictions about customers using them at all, or whether it would be suitable in cases like this
  • add a field in the baremetal platform in the install config - I don't like this idea, but it is technically possible

@honza
Copy link
Member Author

honza commented Jun 8, 2022

That cross-platform approach is one issue with this implementation. I'm also trying to check whether this flag applies to all create commands or just create cluster. I'm struggling to get this to work and the flag is undocumented

I'm not sure what happened there.

$ ./bin/openshift-install create cluster --help
Create an OpenShift cluster

Usage:
  openshift-install create cluster [flags]

Flags:
      --bootstrap-timeout int   Bootstrap timeout in minutes (default 20)
  -h, --help                    help for cluster

Global Flags:
      --dir string         assets directory (default ".")
      --log-level string   log level (e.g. "debug | info | warn | error") (default "info")

The new flag is specific to the create cluster command.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 8, 2022

@honza: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure 97e1b2f link false /test e2e-azure
ci/prow/okd-e2e-aws 97e1b2f link false /test okd-e2e-aws
ci/prow/e2e-crc 97e1b2f link false /test e2e-crc
ci/prow/e2e-libvirt 97e1b2f link false /test e2e-libvirt
ci/prow/e2e-ibmcloud 97e1b2f link false /test e2e-ibmcloud
ci/prow/e2e-openstack 97e1b2f link false /test e2e-openstack

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sdodson
Copy link
Member

sdodson commented Jun 8, 2022

environment variable - generally we avoid env vars and we don't document them, but I feel like this might fit the pattern where they have been applied in the past. I would need to double check (@sdodson do you have a sense?) whether we have restrictions about customers using them at all, or whether it would be suitable in cases like this

Yes, we maintain a firm stance that environment variables are only suitable for installer development uses. Having read the discussion in this PR I guess a flag scoped to cluster create is OK.

@patrickdillon
Copy link
Contributor

I'm not sure what happened there.

Definitely a mistake on my end!

Yes, we maintain a firm stance that environment variables are only suitable for installer development uses. Having read the discussion in this PR I guess a flag scoped to cluster create is OK.

Thanks Scott. As environment variables are out of the question, I think the install config seems like the only possible option for providing configurable timeouts.

Even if we do allow the timeout to be configurable across platforms (which I don't think is a good idea), we don't generally have install configuration through flags. At the minimum this would create a documentation problem.

@hardys
Copy link
Contributor

hardys commented Jun 14, 2022

Even if we do allow the timeout to be configurable across platforms (which I don't think is a good idea), we don't generally have install configuration through flags. At the minimum this would create a documentation problem.

Most configuration is done via the install-config, which describes the desired state of the cluster - but a flag seems like a reasonable approach for something which modifies the default behavior of the installer?

I guess we could make this platform-specific (e.g only enabled via baremetal tag for the openshift-baremetal-install binary?), but are we sure users will never need this escape-hatch option on other platforms? (for example I can imagine vsphere/openstack and any on-prem platforms having less deterministic performance than most clouds)

@sadasu
Copy link
Contributor

sadasu commented Jun 14, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 14, 2022
@sadasu
Copy link
Contributor

sadasu commented Jun 14, 2022

@honza Can we update documentation within this PR?

@sdodson
Copy link
Member

sdodson commented Jun 14, 2022

The rationale from @hardys is sound and with @sadasu LGTM we'll move forward with this.
/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 14, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 14, 2022
@sadasu
Copy link
Contributor

sadasu commented Jun 14, 2022

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Jun 14, 2022
@sdodson
Copy link
Member

sdodson commented Jun 14, 2022

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 14, 2022

// commandF can be used to alter the `command` above; this is necessary for
// things like persistent flags
commandF func(cmd *cobra.Command)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we need this here? It doesn't look like we need this for all commands so why don't we add it only in the newCreateCmd() function?


func newCreateCmd() *cobra.Command {
	cmd := &cobra.Command{
		Use:   "create",
		Short: "Create part of an OpenShift cluster",
		RunE: func(cmd *cobra.Command, args []string) error {
			return cmd.Help()
		},
	}
cmd.PersistentFlags().IntVar(&createOpts.bootstrapTimeout, "bootstrap-timeout", 20, "Bootstrap timeout in minutes")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a general solution: each subcommand can now modify the cobra.Command instance independently. Setting the bootstrap timeout only makes sense when creating the cluster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it my bad. Was hoping to find a way to not add commandF for just one use case but I guess we can't.

@honza
Copy link
Member Author

honza commented Jun 15, 2022

Closing this as #5981 has merged

@honza honza closed this Jun 15, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 15, 2022

@honza: This pull request references Bugzilla bug 2090816. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 2090816: Make bootstrap timeout configurable

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants