Bug 1805639: Validate credentialsSecret #673
@JoelSpeed: This pull request references Bugzilla bug 1805639, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
this looks reasonable to me, i just want to make sure i understand it. is the idea here that during validation for a machine create/update we check to ensure that the credentials secret exists and if not we add an error which causes a graceful failure?
Yep that's the plan, prevents someone from updating the secret to point to a secret that doesn't exist. Currently spinning up a cluster to test that this works as expected |
thanks Joel, makes sense to me. |
I've manually tested this and can confirm, any secret that existed within the namespace was allowed, but a secret that did not exist was rejected with the appropriate message. I'm happy for this to merge once the tests pass |
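The behaviour described above (accept a reference to any secret that exists in the namespace, reject one that does not) can be sketched roughly as follows. This is a minimal standalone illustration, not the PR's code: `SecretGetter`, `fakeSecrets`, `ErrNotFound`, `validateCredentialsSecret`, and the secret names are all hypothetical stand-ins, and a real webhook would query the API server through a Kubernetes client instead.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical stand-in for a Kubernetes "not found" API error.
var ErrNotFound = errors.New("secret not found")

// SecretGetter is a hypothetical, minimal view of a Kubernetes client.
type SecretGetter interface {
	GetSecret(namespace, name string) error
}

// fakeSecrets is an in-memory SecretGetter keyed by "namespace/name".
type fakeSecrets map[string]bool

func (f fakeSecrets) GetSecret(namespace, name string) error {
	if f[namespace+"/"+name] {
		return nil
	}
	return ErrNotFound
}

// validateCredentialsSecret returns an error when the referenced secret
// cannot be found in the given namespace, and nil when it exists.
func validateCredentialsSecret(c SecretGetter, namespace, name string) error {
	if name == "" {
		return errors.New("providerSpec.credentialsSecret: expected a secret name")
	}
	err := c.GetSecret(namespace, name)
	if errors.Is(err, ErrNotFound) {
		return fmt.Errorf("providerSpec.credentialsSecret: secret %s/%s not found", namespace, name)
	}
	return err
}

func main() {
	// Illustrative namespace and secret names only.
	secrets := fakeSecrets{"openshift-machine-api/aws-cloud-credentials": true}

	fmt.Println(validateCredentialsSecret(secrets, "openshift-machine-api", "aws-cloud-credentials"))
	fmt.Println(validateCredentialsSecret(secrets, "openshift-machine-api", "typo-credentials"))
}
```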
I don't think the webhook is the right way to solve this. I'd prefer to go immediately to failed if the credentials are invalid.
Reason being, it's not desirable to have the machineset controller fail to successfully create machines. If someone deletes the secret later, it's going to be hard to troubleshoot what's happening. With just failing the machines, we can put a reason right onto the machine-object.
While this does marginally improve the UX when creating a machineset, it has a greater negative impact on troubleshooting down the road IMO.
We shouldn't force the machine to
Why does this change make it any harder to debug than the existing code?
This helps in a small way yes, it's trying to improve the UX at machineset creation time, but we can't catch all cases, and I don't think there's an easy way to catch all cases. I don't think this makes the experience worse than it already is, I don't think it should interfere with debugging etc. and since it doesn't affect any of the existing validations or behaviour, we aren't taking any reporting of errors away. |
/retest |
@JoelSpeed: The following tests failed, say
Unless we're extending the machineSet to add some information to the machineSet.status object about why it can't create machines, I'm against this change. This change will inevitably lead to "I created a machineset, but I don't have any machines" type of questions. It won't be obvious what's broken or where to look. Couple that with the fact that people use node/instance/machine all interchangeably, it's going to be hard to communicate to users where to look for what conditions.

As far as failing the machine due to invalid credentials, we should check to ensure that the machine wasn't provisioned already. We could check this before we set finalizers: ensure the secret is present; if it is, we set finalizers; if it's not, we set failed (possibly skip setting finalizers?). If finalizers are set, we have no way of knowing if an instance was created if credentials go missing (on the outside chance that the machine-controller crashes around the same time the credentials are removed and immediately after the request to create the instance was submitted).
it seems like there is still debate to be had about this, i'm going to remove my approval until the issues raised here have been resolved.
We could add a condition, and/or use an event to communicate this.
If the MachineSet is accepted, then it will have a valid credentialsSecret, the only way you could end up with this is if the MachineSet is created and then the credentialsSecret is deleted at a later point. It's possible, so I agree, we should be adding some more obvious errors to the MachineSet events or status
Your suggestion here is that we swap the ordering of checking the instance exists with the addition of the finalizer, so that we only add the finalizer after we successfully create the machine? I think this is likely to cause issues with Machine leaking. We had a bug recently with Azure where |
Not quite. My suggestion is to check that the credentials secret exists before adding the finalizer. If it doesn't, we fail. If the credentials secret exists, we add the finalizer, don't check it any more. This way, if the secret disappears later we know there is a possibility that the machine was created. If this happens, then we should not remove the machine and require the user to remove the finalizer and instruct them to check the cloud to remove the leaked instance. This is the only way to ensure we don't leak an instance. |
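The ordering proposed above can be sketched as a small state transition. This is only an illustration under stated assumptions: the `Machine` struct, its fields, the phase strings, and the `reconcile` function are hypothetical simplifications, not the machine controller's actual types or logic.

```go
package main

import "fmt"

// Machine is a hypothetical, heavily simplified machine object.
type Machine struct {
	Name         string
	HasFinalizer bool
	Phase        string
	Message      string
}

// reconcile checks the credentials secret before the finalizer is added and
// stops treating a missing secret as a plain failure once the finalizer is set.
func reconcile(m *Machine, secretExists bool) {
	switch {
	case !m.HasFinalizer && !secretExists:
		// Nothing can have been created yet, so failing is safe.
		m.Phase = "Failed"
		m.Message = "credentials secret not found; no instance was requested"
	case !m.HasFinalizer && secretExists:
		// Only now is the finalizer added and provisioning allowed to start.
		m.HasFinalizer = true
		m.Phase = "Provisioning"
	case m.HasFinalizer && !secretExists:
		// An instance may already exist: keep the finalizer and ask the
		// user to check the cloud before removing it by hand.
		m.Message = "credentials secret missing after finalizer was set; " +
			"check the cloud for a leaked instance before removing the finalizer"
	default:
		// Finalizer set and secret present: normal reconciliation continues.
	}
}

func main() {
	typo := &Machine{Name: "worker-a"}
	reconcile(typo, false)
	fmt.Printf("%s: phase=%q finalizer=%v\n", typo.Name, typo.Phase, typo.HasFinalizer)

	ok := &Machine{Name: "worker-b"}
	reconcile(ok, true)  // secret present: finalizer added
	reconcile(ok, false) // secret deleted later: machine is kept, not failed
	fmt.Printf("%s: phase=%q finalizer=%v message=%q\n", ok.Name, ok.Phase, ok.HasFinalizer, ok.Message)
}
```

The point of the design is that the finalizer doubles as a marker: if it is absent, no create request can have been sent, so failing the machine cannot leak an instance.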
This makes more sense to me now, though I do think this PR provides a pretty similar experience; we are just checking it exists before the Machine ever gets created. We don't even get to the point where the Machine could go failed, which I think is preferable. Though I appreciate the concerns about the UX of a MachineSet failing to create machines. Personally, I still prefer to fail early and provide details of failed machine creates on the MachineSet YAML. I think this could provide the plumbing/opportunity to give users more helpful information, for the various paths, about why the MachineSet couldn't do something.
We have the possibility to race with the CCO on first install. Let's not fail machines or otherwise do things that might be helpful but cause other buggy behavior. I suggest we close this. |
Changes look good to me, but we surely should not merge this like that, talking about #673 (comment). Could it be implemented using warnings instead of rejecting resource, so we just notify the user that it is not expected to have no userSecret? This should not affect cluster installation.
Also, this PR implements passing resource client to webhooks, which is needed to fix BZ 1889620, so the fix is currently blocked.
This needs picking up again as it hasn't been touched in a while, but switching to a warning could be a good way to signal to the user that something isn't quite as expected, how do others feel about that approach? |
Let's drop this. We should just indicate on machine status that the secret is not present. In most cases it will be, in limited cases where a user makes a typo, it will be obvious how to fix it. |
/hold These webhooks are not a priority. |
36ea4b3 to a0660ed
/hold cancel I've updated this to warn users when a credentials secret does not exist. This is non blocking and means that users may be alerted to a problem earlier should there be one. PTAL |
/retest |
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Danil-Grigorev The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This commit introduces the ability to validate the existence of the credentialsSecret reference in the machine providerSpec for AWS. To make this possible, a kube client is stored in the admissionHandler and passed through the machineAdmissionFn signature.
There are scenarios where this could cause a race condition (eg on installation) if Machine creation is rejected immediately when no credentials secret exists. This change ensures users are warned, but it is not a fatal error for the request.
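A rough sketch of the shape this commit message describes: a handler that holds a lookup and passes it to the validation function, which reports a missing secret as a warning instead of an error. Only the names `admissionHandler` and `machineAdmissionFn` come from the text above; the signature, the `secretLookup` and `machine` types, `validateAWSMachine`, and the empty-name check are assumptions made for illustration, with a plain function standing in for the kube client.

```go
package main

import (
	"errors"
	"fmt"
)

// secretLookup is a hypothetical stand-in for the kube client: it reports
// whether a secret exists in a namespace.
type secretLookup func(namespace, name string) bool

// machine is a hypothetical, simplified view of the fields being validated.
type machine struct {
	Namespace         string
	CredentialsSecret string
}

// machineAdmissionFn receives the lookup through its signature and returns
// whether the request is allowed, plus non-fatal warnings and fatal errors.
// The exact signature here is an assumption.
type machineAdmissionFn func(m machine, lookup secretLookup) (bool, []string, []error)

// admissionHandler stores the lookup so it can be handed to the admission function.
type admissionHandler struct {
	lookup   secretLookup
	validate machineAdmissionFn
}

// Handle runs the configured admission function with the stored lookup.
func (h admissionHandler) Handle(m machine) (bool, []string, []error) {
	return h.validate(m, h.lookup)
}

// validateAWSMachine warns, but does not reject, when the secret is missing.
func validateAWSMachine(m machine, lookup secretLookup) (bool, []string, []error) {
	var warnings []string
	var errs []error

	if m.CredentialsSecret == "" {
		// Assumed for illustration: a missing reference is still a hard error.
		errs = append(errs, errors.New("providerSpec.credentialsSecret: expected a secret name"))
	} else if !lookup(m.Namespace, m.CredentialsSecret) {
		// Non-fatal: the secret may simply not have been created yet,
		// for example during installation.
		warnings = append(warnings, fmt.Sprintf(
			"providerSpec.credentialsSecret: secret %s/%s not found, machines may fail to provision",
			m.Namespace, m.CredentialsSecret))
	}
	return len(errs) == 0, warnings, errs
}

func main() {
	h := admissionHandler{
		lookup:   func(namespace, name string) bool { return name == "aws-cloud-credentials" },
		validate: validateAWSMachine,
	}
	allowed, warnings, errs := h.Handle(machine{Namespace: "openshift-machine-api", CredentialsSecret: "not-created-yet"})
	fmt.Println(allowed, warnings, errs)
}
```

Returning warnings separately from errors keeps the request allowed, so a secret that appears slightly later than the Machine does not cause a rejected create.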
a0660ed to d36a7e3
The change doesn't block machine from being created, looks good to me in that case
/lgtm
/retest |
/retest Please review the full test history for this PR and help us cut down flakes. |
@JoelSpeed: The following tests failed, say
@JoelSpeed: All pull requests linked via external trackers have merged: Bugzilla bug 1805639 has been moved to the MODIFIED state. In response to this:
This lets the webhook validate that the credentialsSecret exists before the machine resource is created.
Additionally, we should probably reflect any unexpected error in using the credentialsSecret as providerStatus conditions: https://issues.redhat.com/browse/OCPCLOUD-931
I'm taking over this bug from @enxebre as he is away for three weeks. I've applied all of my suggestions from the original PR #660 and am happy for this to merge once it has undergone some manual testing.