Bug 1919407: openstack/validation: enforce control plane size #4585
@crawford: This pull request references Bugzilla bug 1919407, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@crawford: This pull request references Bugzilla bug 1919407, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
/cc @pierreprinetti
This is a follow-up to b6e3088, which hard-codes the size of the control plane to three nodes. We have a broad policy within OpenShift to only support three-node control planes, but there's nothing technically preventing a user from using a different size. In the case of IPI on OpenStack, however, there is a technical restriction, so this explicitly validates that.
// ValidateForProvisioning validates that the install config is valid for provisioning the cluster.
func ValidateForProvisioning(ic *types.InstallConfig) error {
	if ic.ControlPlane.Replicas != nil && *ic.ControlPlane.Replicas != 3 {
I think it would be more appropriate to put the check in the type validation. The validation where you have it is meant to be for validation that requires access to the platform provider.
I thought about doing that, but I ended up putting it here because I didn't want UPI installs to also be caught by this check. My intention is to guard against the very narrow case of an IPI install on OpenStack.
OK. I'll buy that.
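For readers following the thread, the check under review can be sketched in isolation. This is a minimal sketch, not the installer's code: `MachinePool` and `InstallConfig` are simplified stand-ins for the installer's types, the extra nil guard on `ControlPlane` and the error text are assumptions, and only the nil-guarded comparison against 3 is taken from the diff above.

```go
package main

import (
	"errors"
	"fmt"
)

// MachinePool and InstallConfig are simplified stand-ins (assumptions) for the
// installer's types; only the Replicas pointer matters for this sketch.
type MachinePool struct {
	Replicas *int64
}

type InstallConfig struct {
	ControlPlane *MachinePool
}

// ValidateForProvisioning mirrors the condition in the diff: an explicitly set
// replica count other than 3 is rejected, while an unset count is accepted.
// The ControlPlane nil guard and the error message are hypothetical additions.
func ValidateForProvisioning(ic *InstallConfig) error {
	if ic.ControlPlane != nil && ic.ControlPlane.Replicas != nil && *ic.ControlPlane.Replicas != 3 {
		return errors.New("control plane must be exactly three nodes when provisioning on OpenStack")
	}
	return nil
}

func main() {
	five := int64(5)
	ic := &InstallConfig{ControlPlane: &MachinePool{Replicas: &five}}
	// A five-node request is rejected before any provisioning starts.
	fmt.Println(ValidateForProvisioning(ic))
}
```

Because the check lives in the provisioning path rather than in type validation, a flow that never calls this function (a UPI install, as discussed above) would not be affected by it.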
/test e2e-openstack
/approve
/hold
Can you please help us reproduce the issue you are facing? In our tests, the cluster deploys as many master nodes as set in the install-config ‘replicas’.
This feels like a recipe for confusion to me for a number of reasons.
The control plane must be exactly 3 nodes on all platforms, so validating it only for OpenStack is potentially confusing for any future maintainer. This applies especially if we should ever support more than 3 masters, as this OpenStack-specific validation would become a support landmine if we forgot to remove it. I could support this change, but it should be common for all platforms. Case in point: we were surprised to discover this support limitation existed, as we didn't previously know about it.
OpenStack doesn't have this limitation in any case, as MCO will provision additional masters after bootstrap.
It's my understanding (and sincere hope) that this issue will be addressed by moving masters to a machineset in a future release. The consequence of that on this code would be that it would be deleted(🎉), so I don't think we should place too much weight on the argument of it being a burden on that effort. As noted above, this outlier-validation itself creates a burden on future work that doesn't currently exist.
This is backwards. The intention of this narrow validation is to ensure that this workaround doesn't become a landmine; the validation itself isn't the landmine. Before supporting a particular cluster configuration, it must be verified in CI. If/when we go to enable five-node clusters in CI, this validation would provide immediate feedback that the workaround needs to be revisited and fixed in a more robust manner. At the very least, that would mean making the workaround generic for any number of control nodes, but ideally it would be a fix to the upstream Terraform provider. Without this validation, it seems very likely that this workaround would go unnoticed and potentially be exercised by customers.
The point about Red Hat not providing support for clusters that differ from a three-node control plane is orthogonal to this discussion. Some sort of validation there is obviously needed, but this issue goes beyond that (i.e. if we introduced and then later relaxed the global three-node validation, I would still expect this OpenStack-specific validation to stand until the workaround is addressed).
To state my position very clearly: the workaround in question causes a change to the bootstrap flow. Changes of that magnitude must be discussed in https://github.com/openshift/enhancements and must apply to all platforms, except in extreme cases (this is not one of those). You can avoid that discussion by ensuring that the problematic bootstrap flow is never exercised (e.g. this PR). Otherwise, we need to discuss moving all platforms to this new bootstrap flow.
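The layering argued for in this comment can be sketched as two independent checks. Everything here is hypothetical and for illustration only: the function names, the `supported` set, and the error messages are assumptions, not installer code. The sketch shows a product-wide support policy that can be relaxed without touching the OpenStack-specific guard.

```go
package main

import "fmt"

// validateSupportPolicy is a hypothetical platform-independent check. The set
// of supported sizes is passed in so that the policy can be relaxed later.
func validateSupportPolicy(replicas int64, supported map[int64]bool) error {
	if !supported[replicas] {
		return fmt.Errorf("control plane size %d is not a supported configuration", replicas)
	}
	return nil
}

// validateOpenStackIPI is a hypothetical stand-in for the narrow guard in this
// PR. It stays at exactly 3 until the hard-coded workaround is addressed.
func validateOpenStackIPI(replicas int64) error {
	if replicas != 3 {
		return fmt.Errorf("OpenStack IPI provisioning requires exactly 3 control plane nodes, got %d", replicas)
	}
	return nil
}

func main() {
	// Suppose the product-wide policy is later relaxed to allow 1, 3, and 5.
	relaxed := map[int64]bool{1: true, 3: true, 5: true}

	fmt.Println("support policy:", validateSupportPolicy(5, relaxed))
	fmt.Println("OpenStack IPI: ", validateOpenStackIPI(5))
}
```

Under these assumptions, a five-node request passes the relaxed policy but is still stopped by the OpenStack guard, which is the immediate feedback the comment describes.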
This is not a new bootstrap flow for 2 reasons:
If we want to enforce 3 controllers globally, which seems sensible as it appears to be a source of confusion, we should do it globally. We might also consider disabling control-plane scale out globally if it is a support concern. My long-term preference here would be to entirely remove terraform from the equation, at least for machine creation. Ideally we would not maintain 2 different ways of creating control plane nodes. To be clear, I don't think we should merge this. Do it globally or not at all. |
Let me see if I understand the situation correctly.
It does seem the OpenStack IPI has a unique flow for count != 3. I also recognize that we only support 3-control-plane clusters today. I should point out that we are actively working on supporting 1-control-plane clusters and starting to support 5-node clusters (and to make sure both 4 and 2 function). However, those overall OCP supported-configuration questions are 100% unrelated to THIS discussion. The overall statement of 3-nodes-only or not is not being discussed here.
What is being discussed here is how OCP on OSP behaves quite differently from every other IPI and that it hard-codes the number 3. As such, I completely support Alex's belief that we should carefully call out this number 3 and should do so only for OpenStack IPI. When the product as a whole supports 1-, (2-), 3-, (4-), and 5-node clusters, all of the other IPIs are VERY likely to work exactly the same. But OpenStack IPI is NOT likely to work exactly the same.
We should merge this commit so the first person trying to use the untested (can't count on testing by other platforms here) configuration won't get bitten. When we get to that point, I would expect the shiftstack team to test, confirm functionality, and then remove this check. But as it stands today we shouldn't ask our customers to even try such a thing.
/hold cancel @eparis We've come to a standoff where:
I am removing the hold. We will work on a long-term fix to the original user issue that does not rely on a peculiar workflow for OpenStack. In the meantime: /approve
@pierreprinetti: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pierreprinetti, staebler The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest Please review the full test history for this PR and help us cut down flakes. |
/bugzilla refresh |
@pierreprinetti: This pull request references Bugzilla bug 1919407, which is valid. 3 validation(s) were run on this bug
@crawford: All pull requests linked via external trackers have merged: Bugzilla bug 1919407 has been moved to the MODIFIED state.
@pierreprinetti: #4585 failed to apply on top of branch "release-4.6":
Backported in #4612 |