
Bug 1919407: openstack/validation: enforce control plane size #4585

Merged
merged 1 commit on Feb 2, 2021

Conversation

crawford
Contributor

This is a follow up to b6e3088, which hard codes the size of the
control plane to three nodes. We have a broad policy within OpenShift to
only support three-node control planes, but there's nothing
technically preventing a user from using a different size. In the case
of IPI on OpenStack, however, there is a technical restriction, so this
explicitly validates that.
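
The check described above can be sketched as follows. This is a minimal, self-contained illustration of the idea, not the installer's actual code: `InstallConfig` here is a pared-down stand-in for the installer's `types.InstallConfig`, and the error wording is hypothetical.

```go
package main

import "fmt"

// InstallConfig is a pared-down stand-in for the installer's
// types.InstallConfig; only the field this check reads is modeled.
type InstallConfig struct {
	ControlPlaneReplicas *int64
}

// validateControlPlaneSize rejects any explicit control plane size other
// than three, mirroring the validation this PR adds for OpenStack IPI.
// An unset replica count is left for defaulting elsewhere.
func validateControlPlaneSize(ic *InstallConfig) error {
	if ic.ControlPlaneReplicas != nil && *ic.ControlPlaneReplicas != 3 {
		return fmt.Errorf("control plane must consist of exactly three nodes, got %d", *ic.ControlPlaneReplicas)
	}
	return nil
}

func main() {
	five, three := int64(5), int64(3)
	fmt.Println(validateControlPlaneSize(&InstallConfig{ControlPlaneReplicas: &five}))  // rejected with an error
	fmt.Println(validateControlPlaneSize(&InstallConfig{ControlPlaneReplicas: &three})) // accepted: <nil>
}
```

The actual PR wires this into the OpenStack provisioning-time validation rather than the generic type validation, so that UPI installs are not affected; see the review discussion below.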

@openshift-ci-robot added bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jan 22, 2021
@openshift-ci-robot
Contributor

@crawford: This pull request references Bugzilla bug 1919407, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1919407: openstack/validation: enforce control plane size

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@crawford
Contributor Author

/bugzilla refresh

@openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 22, 2021
@openshift-ci-robot
Contributor

@crawford: This pull request references Bugzilla bug 1919407, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jan 22, 2021
@crawford
Contributor Author

/cc @pierreprinetti

This is a follow up to b6e3088, which hard codes the size of the
control plane to three nodes. We have a broad policy within OpenShift to
only support three-node control planes, but there's nothing
technically preventing a user from using a different size. In the case
of IPI on OpenStack, however, there is a technical restriction, so this
explicitly validates that.

// ValidateForProvisioning validates that the install config is valid for provisioning the cluster.
func ValidateForProvisioning(ic *types.InstallConfig) error {
	if ic.ControlPlane.Replicas != nil && *ic.ControlPlane.Replicas != 3 {
Contributor

I think it would be more appropriate to put the check in the type validation. The validation where you have it is meant for checks that require access to the platform provider.

Contributor Author

I thought about doing that, but I ended up putting it here because I didn't want UPI installs to also be caught by this check. My intention is to guard against the very narrow case of an IPI install on OpenStack.

Contributor

OK. I'll buy that.

@crawford
Contributor Author

/test e2e-openstack

Contributor

@staebler left a comment

/approve

@openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2021
@pierreprinetti
Member

/hold

In the case of IPI on OpenStack, however, there is a technical restriction

Can you please help us reproduce the issue you are facing? In our tests, the cluster deploys as many master nodes as set in the install-config ‘replicas’.

@openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2021
Contributor

@mdbooth left a comment

This feels like a recipe for confusion to me for a number of reasons.

The control plane must be exactly 3 nodes on all platforms, so validating it only for OpenStack is potentially confusing for any future maintainer. This applies especially if we should ever support more than 3 masters, as this OpenStack-specific validation would become a support landmine if we forgot to remove it. I could support this change, but it should be common for all platforms. Case in point: we were surprised to discover this support limitation existed, as we didn't previously know about it.

OpenStack doesn't have this limitation in any case, as MCO will provision additional masters after bootstrap.

It's my understanding (and sincere hope) that this issue will be addressed by moving masters to a machineset in a future release. The consequence of that on this code would be that it would be deleted(🎉), so I don't think we should place too much weight on the argument of it being a burden on that effort. As noted above, this outlier-validation itself creates a burden on future work that doesn't currently exist.

@crawford
Contributor Author

This applies especially if we should ever support more than 3 masters, as this OpenStack-specific validation would become a support landmine if we forgot to remove it.

This is backwards. The intention of this narrow validation is to ensure that this workaround doesn't become a landmine; the validation itself isn't the landmine. Before supporting a particular cluster configuration, it must be verified in CI. If/When we go to enable five-node clusters in CI, this validation would provide immediate feedback that the workaround needs to be revisited and fixed in a more robust manner. At the very least, that would mean making the workaround generic for any number of control nodes, but ideally it would be a fix to the upstream Terraform provider. Without this validation, it seems very likely that this workaround would go unnoticed and potentially be exercised by customers.

The point about Red Hat not providing support for clusters that differ from a three-node control plane is orthogonal to this discussion. Some sort of validation there is obviously needed, but this issue goes beyond that (i.e. if we introduce and then later relaxed the global three-node validation, I would still expect this OpenStack-specific validation to stand until the workaround is addressed).

To state my position very clearly, the workaround in question causes a change to the bootstrap flow. Changes of that magnitude must be discussed in https://github.com/openshift/enhancements and must apply to all platforms, except in extreme cases (this is not one of those). You can avoid that discussion by ensuring that the problematic bootstrap flow is never exercised (e.g. this PR). Otherwise, we need to discuss moving all platforms to this new bootstrap flow.

@mdbooth
Contributor

mdbooth commented Jan 28, 2021

To state my position very clearly, the workaround in question causes a change to the bootstrap flow. Changes of that magnitude must be discussed in https://github.com/openshift/enhancements and must apply to all platforms, except in extreme cases (this is not one of those). You can avoid that discussion by ensuring that the problematic bootstrap flow is never exercised (e.g. this PR). Otherwise, we need to discuss moving all platforms to this new bootstrap flow.

This is not a new bootstrap flow for 2 reasons:

  1. It does not change the flow for any supported configuration.
  2. It does not add the apparently unsupported control plane scale-out behaviour: it simply moves it from day 2 to day 1.

If we want to enforce 3 controllers globally, which seems sensible as it appears to be a source of confusion, we should do it globally. We might also consider disabling control-plane scale out globally if it is a support concern.

My long-term preference here would be to entirely remove terraform from the equation, at least for machine creation. Ideally we would not maintain 2 different ways of creating control plane nodes.

To be clear, I don't think we should merge this. Do it globally or not at all.

@eparis
Member

eparis commented Feb 1, 2021

Let me see if I understand the situation correctly.

  • OpenStack IPI hard codes the creation of THREE (no more, no fewer) control plane nodes via terraform
  • OpenStack IPI creates Machines for the number of control plane nodes in the install-config
  • OpenStack Machine API would scale (up and down? just up?) the control plane nodes after install.
  • All other cloud IPI create the number of control plane nodes specified in the install config via terraform
  • (really unsure?) ?All? other cloud IPI create machines for control plane nodes?

It does seem the OpenStack IPI has a unique flow for count != 3.

I also recognize that we only support 3-control-plane clusters today. I should point out that we are actively working on supporting 1-control-plane clusters and starting to support 5-node clusters (and make sure both 4 and 2 function). However those overall OCP supported configuration questions are 100% unrelated to THIS discussion. The overall statement of 3-nodes only or not is not being discussed here. What is being discussed here is how OCP on OSP behaves quite differently than every other IPI and that it has hard coding of the number 3.

As such, I completely support Alex's belief that we should carefully call out this number 3 and should do so only for OpenStack IPI. When the product as a whole supports 1,(2),3,(4), and 5 node clusters, all of the other IPIs are VERY likely to work exactly the same. But OpenStack IPI is NOT likely to work exactly the same. We should merge this commit so the first person trying to use the untested (can't count on testing by other platforms here) configuration won't get bitten. When we get to that point, I would expect the shiftstack team to test, confirm functionality, and then remove this check. But as it stands today we shouldn't ask our customers to even try such a thing.

@pierreprinetti
Member

/hold cancel

@eparis
Your summary looks perfect to me; I would only add that QE has verified clusters with master replicas != 3. While I disagree with your conclusions, I understand the rationale.

We've come to a standoff where:

  • for some, introducing a validation specifically for OpenStack represents a flake because both the supportability and the actual behaviour of the cluster mimics what happens in the other platforms;
  • for others, the behaviour of OpenStack (relying on MAO rather than Terraform to stand up masters) is a flake worth a platform-specific validation.

I am removing the /hold because I believe that at this point, nobody wants to further impede a user because of our arguing over how to treat an unsupported configuration.

We will work on a long-term fix to the original user issue that does not rely on a peculiar workflow for OpenStack. In the meantime:

/approve
/lgtm
/cherry-pick release-4.6

@openshift-cherrypick-robot

@pierreprinetti: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

In response to this:

/hold cancel

@eparis
Your summary looks perfect to me; I will maybe add that QE has verified clusters with master replica !=3. While I disagree with your conclusions, I understand the rationale.

We've come to a standoff where:

  • for some, introducing a validation specifically for OpenStack represents a flake because both the supportability and the actual behaviour of the cluster mimics what happens in the other platforms;
  • for others, the behaviour of OpenStack (relying on MAO rather than Terraform to stand up masters) is a flake worth a platform-specific validation.

I am removing the /hold because I believe that at this point, nobody wants to further impede a user because of our arguing over how to treat an unsupported configuration.

We will work on a long-term fix to the original user issue that does not rely on a peculiar workflow for OpenStack. In the meantime:

/approve
/lgtm
/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2021
@openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 2, 2021
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pierreprinetti, staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@pierreprinetti
Member

/bugzilla refresh

@openshift-ci-robot
Contributor

@pierreprinetti: This pull request references Bugzilla bug 1919407, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. and removed bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. labels Feb 2, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit c60d07e into openshift:master Feb 2, 2021
@openshift-ci-robot
Contributor

@crawford: All pull requests linked via external trackers have merged:

Bugzilla bug 1919407 has been moved to the MODIFIED state.

In response to this:

Bug 1919407: openstack/validation: enforce control plane size

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@pierreprinetti: #4585 failed to apply on top of branch "release-4.6":

Applying: openstack/validation: enforce control plane size
Using index info to reconstruct a base tree...
M	pkg/asset/installconfig/openstack/validate.go
M	pkg/asset/installconfig/platformprovisioncheck.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/asset/installconfig/platformprovisioncheck.go
CONFLICT (content): Merge conflict in pkg/asset/installconfig/platformprovisioncheck.go
Auto-merging pkg/asset/installconfig/openstack/validate.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 openstack/validation: enforce control plane size
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
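
For anyone reproducing this locally, the recovery flow the hints describe is standard `git am` conflict handling. The following is a throwaway sketch in a temporary repository with illustrative file names and contents (the real conflict was in pkg/asset/installconfig/platformprovisioncheck.go), not the actual backport:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email you@example.com
git config user.name "You"

printf 'original\n' > validate.go
git add validate.go
git commit -qm 'base'
branch=$(git symbolic-ref --short HEAD)

# Build a patch on a side branch...
git checkout -qb feature
printf 'feature change\n' > validate.go
git commit -qam 'feature change'
git format-patch -1 --stdout > "$tmp/feature.patch"

# ...then diverge the original branch so the patch no longer applies cleanly.
git checkout -q "$branch"
printf 'conflicting change\n' > validate.go
git commit -qam 'conflicting change'

# -3 requests the three-way fallback seen in the bot's log. On conflict,
# inspect with `git am --show-current-patch=diff`, then either resolve and
# run `git am --continue`, or back out entirely with `git am --abort`.
if ! git am -3 "$tmp/feature.patch" >/dev/null 2>&1; then
    git am --abort
    echo 'patch conflicted; aborted cleanly'
fi
```

In the end the backport was done by hand in a follow-up PR rather than by the bot, which is the usual outcome when the automated cherry-pick hits a conflict like this.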

In response to this:

/hold cancel

@eparis
Your summary looks perfect to me; I will maybe add that QE has verified clusters with master replica !=3. While I disagree with your conclusions, I understand the rationale.

We've come to a standoff where:

  • for some, introducing a validation specifically for OpenStack represents a flake because both the supportability and the actual behaviour of the cluster mimics what happens in the other platforms;
  • for others, the behaviour of OpenStack (relying on MAO rather than Terraform to stand up masters) is a flake worth a platform-specific validation.

I am removing the /hold because I believe that at this point, nobody wants to further impede a user because of our arguing over how to treat an unsupported configuration.

We will work on a long-term fix to the original user issue that does not rely on a peculiar workflow for OpenStack. In the meantime:

/approve
/lgtm
/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pierreprinetti
Member

Backported in #4612

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

9 participants