Bug 1805639: Validate credentialsSecret #673

JoelSpeed · 2020-08-10T14:01:32Z

This let the webhook to validate the credentialsSecret existence before creating the machine resource.
Additionally we should probably reflect any unexpected error to use the credentialsSecret as providerStatus conditions https://issues.redhat.com/browse/OCPCLOUD-931

I'm taking over this bug from @enxebre as he is away for three weeks. I've applied all of my suggestions from the original PR #660 and am happy for this to merge once it has undergone some manual testing.

openshift-ci-robot · 2020-08-10T14:01:37Z

@JoelSpeed: This pull request references Bugzilla bug 1805639, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.6.0) matches configured target release for branch (4.6.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1805639: Validate credentialsSecret

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

elmiko

this looks reasonable to me, i just want to make sure i understand it. is the idea here that during validation for a machine create/update we check to ensure that the credentials secret exists and if not we add an error which causes a graceful failure?

JoelSpeed · 2020-08-10T14:23:09Z

this looks reasonable to me, i just want to make sure i understand it. is the idea here that during validation for a machine create/update we check to ensure that the credentials secret exists and if not we add an error which causes a graceful failure?

Yep that's the plan, prevents someone from updating the secret to point to a secret that doesn't exist. Currently spinning up a cluster to test that this works as expected

elmiko · 2020-08-10T14:24:42Z

thanks Joel, makes sense to me.
/approve

JoelSpeed · 2020-08-10T16:07:13Z

I've manually tested this and can confirm, any secret that existed within the namespace was allowed, but a secret that did not exist was rejected with the appropriate message. I'm happy for this to merge once the tests pass

michaelgugino

I don't think the webhook is the right way to solve this. I'd prefer to go immediately to failed if the credentials are invalid.

Reason being, it's not desirable to have the machineset controller fail to successfully create machines. If someone deletes the secret later, it's going to be hard to troubleshoot what's happening. With just failing the machines, we can put a reason right onto the machine-object.

While this does marginally improve the UX when creating a machineset, it has a greater negative impact on troubleshooting down the road IMO.

JoelSpeed · 2020-08-17T11:27:10Z

I don't think the webhook is the right way to solve this. I'd prefer to go immediately to failed if the credentials are invalid.

We shouldn't force the machine to Failed if the credentials are invalid since we can't check whether the VM exists or not (because no creds). If we forced it to failed because someone deleted the credentials after the Machine was created, this could cause a machine to go Failed when actually, it's still running.

If someone deletes the secret later, it's going to be hard to troubleshoot what's happening.

Why does this change make it any harder to debug than the existing code?

While this does marginally improve the UX when creating a machineset, it has a greater negative impact on troubleshooting down the road IMO.

This helps in a small way yes, it's trying to improve the UX at machineset creation time, but we can't catch all cases, and I don't think there's an easy way to catch all cases.

I don't think this makes the experience worse than it already is, I don't think it should interfere with debugging etc. and since it doesn't affect any of the existing validations or behaviour, we aren't taking any reporting of errors away.

JoelSpeed · 2020-08-18T14:41:31Z

/retest

openshift-ci-robot · 2020-08-18T16:34:50Z

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-aws-operator	`36ea4b3`	link	`/test e2e-aws-operator`
ci/prow/e2e-aws	`36ea4b3`	link	`/test e2e-aws`
ci/prow/e2e-azure	`36ea4b3`	link	`/test e2e-azure`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

michaelgugino · 2020-08-19T20:02:57Z

Unless we're extending the machineSet to add some information to the machineSet.status object about why it can't create machines, I'm against this change.

This change will inevitably lead to "I created a machineset, but I don't have any machines" type of questions. It won't be obvious what's broken or where to look. Couple that with the fact that people use node/instance/machine all interchangeably, it's going to be hard to communicate to users where to look for what conditions.

As far as failing the machine due to invalid credentials, we should check to ensure that the machine wasn't provisioned already. We could check this before we set finalizers, ensure the secret is present, if it is, we set finalizers, if it's not, we set failed (possibly skip setting finalizers?) If finalizers are set, we have no way of knowing if an instance was created if credentials go missing (on the outside chance that the machine-controller crashes around the same time the credentials are removed and immediately after the request to create the instance was submitted).

elmiko · 2020-08-19T20:10:12Z

it seems like there is debate to had about this still, i'm going to remove my approval until the issues raised here have been resolved.
/approve cancel

JoelSpeed · 2020-08-20T10:49:58Z

Unless we're extending the machineSet to add some information to the machineSet.status object about why it can't create machines, I'm against this change.

We could add a condition, and/or use an event to communicate this.

This change will inevitably lead to "I created a machineset, but I don't have any machines" type of questions. It won't be obvious what's broken or where to look. Couple that with the fact that people use node/instance/machine all interchangeably, it's going to be hard to communicate to users where to look for what conditions.

If the MachineSet is accepted, then it will have a valid credentialsSecret, the only way you could end up with this is if the MachineSet is created and then the credentialsSecret is deleted at a later point. It's possible, so I agree, we should be adding some more obvious errors to the MachineSet events or status

As far as failing the machine due to invalid credentials, we should check to ensure that the machine wasn't provisioned already. We could check this before we set finalizers, ensure the secret is present, if it is, we set finalizers, if it's not, we set failed (possibly skip setting finalizers?) If finalizers are set, we have no way of knowing if an instance was created if credentials go missing (on the outside chance that the machine-controller crashes around the same time the credentials are removed and immediately after the request to create the instance was submitted).

Your suggestion here is that we swap the ordering of checking the instance exists with the addition of the finalizer, so that we only add the finalizer after we successfully create the machine? I think this is likely to cause issues with Machine leaking. We had a bug recently with Azure where Create always returns an error initially due to the async nature of their API, so unless we are 100% sure we can determine the difference between a create failing legitimately and this async failure, then add the finalizer, we could possible leak an instance.
Note how at the moment we add the finalizer and then requeue, this is to prevent conflict errors that might cause us to have created the machine but not committed the finalizer. Again this is leak prevention and I don't think we should be changing this. The finalizer needs to be on the Machine before we attempt to create the instance

michaelgugino · 2020-08-27T03:13:26Z

Your suggestion here is that we swap the ordering of checking the instance exists with the addition of the finalizer, so that we only add the finalizer after we successfully create the machine? I think this is likely to cause issues with Machine leaking.

Not quite. My suggestion is to check that the credentials secret exists before adding the finalizer. If it doesn't, we fail. If the credentials secret exists, we add the finalizer, don't check it any more. This way, if the secret disappears later we know there is a possibility that the machine was created. If this happens, then we should not remove the machine and require the user to remove the finalizer and instruct them to check the cloud to remove the leaked instance. This is the only way to ensure we don't leak an instance.

JoelSpeed · 2020-09-07T09:33:49Z

My suggestion is to check that the credentials secret exists before adding the finalizer.

This makes more sense to me now, though I do think this PR provides a pretty similar experience, we are just checking it exists before the Machine ever gets created. We don't even get to the point where the Machine could go failed, which I think is preferable. Though I appreciate the concerns about the UX of a MachineSet failing to create machines.

Personally I prefer the fail early and provide details of failed machine creates on the MachineSet YAML still, I think this could provide the plumbing/opportunity to provide more helpful information to users for the various paths as to why the MachineSet couldn't do something

michaelgugino · 2020-11-02T13:37:28Z

We have the possibility to race with the CCO on first install. Let's not fail machines or otherwise do things that might be helpful but cause other buggy behavior. I suggest we close this.

Danil-Grigorev

Changes look good to me, but we surely should not merge this like that, talking about #673 (comment). Could it be implemented using warnings instead of rejecting resource, so we just notify the user that it is not expected to have no userSecret? This should not affect cluster installation.

Also, this PR implements passing resource client to webhooks, which is needed to fix BZ 1889620, so the fix is currently blocked.

JoelSpeed · 2020-11-09T12:15:25Z

This needs picking up again as it hasn't been touched in a while, but switching to a warning could be a good way to signal to the user that something isn't quite as expected, how do others feel about that approach?

michaelgugino · 2020-11-12T14:04:10Z

Let's drop this. We should just indicate on machine status that the secret is not present. In most cases it will be, in limited cases where a user makes a typo, it will be obvious how to fix it.

michaelgugino · 2020-11-12T14:09:27Z

/hold

These webhooks are not a priority.

JoelSpeed · 2020-11-30T18:27:04Z

/hold cancel

I've updated this to warn users when a credentials secret does not exist. This is non blocking and means that users may be alerted to a problem earlier should there be one. PTAL

JoelSpeed · 2020-12-01T08:39:22Z

/retest

Danil-Grigorev

/approve

openshift-ci-robot · 2020-12-01T10:16:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Danil-Grigorev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Danil-Grigorev]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This commit introduces the ability to validate the existence of the credentialsSecret reference in the machine providerSpec for AWS. To make this possible a kube client is stored in the admissionHandler which is then passed at convenience trough the machineAdmissionFn signature.

There are scenarios where this could cause a race condition (eg on installation) if Machine creation is rejected immediately when no credentials secret exists. This change ensures users are warned, but it is not a fatal error for the request.

alexander-demicev

The change doesn't block machine from being created, looks good to me in that case

/lgtm

JoelSpeed · 2020-12-10T13:57:22Z

/retest

openshift-bot · 2020-12-10T14:21:04Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-12-10T14:33:55Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-12-10T15:38:49Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-12-10T15:51:49Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-merge-robot · 2020-12-10T16:36:30Z

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-azure-operator	`d36a7e3`	link	`/test e2e-azure-operator`
ci/prow/e2e-vsphere-operator	`d36a7e3`	link	`/test e2e-vsphere-operator`
ci/prow/e2e-vsphere	`d36a7e3`	link	`/test e2e-vsphere`
ci/prow/e2e-libvirt	`d36a7e3`	link	`/test e2e-libvirt`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci-robot · 2020-12-10T16:56:25Z

@JoelSpeed: All pull requests linked via external trackers have merged:

openshift/machine-api-operator#673

Bugzilla bug 1805639 has been moved to the MODIFIED state.

In response to this:

Bug 1805639: Validate credentialsSecret

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Aug 10, 2020

openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 10, 2020

openshift-ci-robot requested review from elmiko and enxebre August 10, 2020 14:01

JoelSpeed mentioned this pull request Aug 10, 2020

Bug 1805639: Validate credentialsSecret #660

Closed

elmiko reviewed Aug 10, 2020

View reviewed changes

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 10, 2020

michaelgugino suggested changes Aug 10, 2020

View reviewed changes

openshift-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 19, 2020

Danil-Grigorev mentioned this pull request Nov 2, 2020

Bug 1881865: AWS machine does not fail on missing userData secret openshift/cluster-api-provider-aws#354

Closed

Danil-Grigorev suggested changes Nov 9, 2020

View reviewed changes

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020

JoelSpeed force-pushed the validate-creds branch from 36ea4b3 to a0660ed Compare November 30, 2020 18:23

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 30, 2020

Danil-Grigorev approved these changes Dec 1, 2020

View reviewed changes

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2020

enxebre and others added 5 commits December 10, 2020 11:47

Add validation for credentials existence in vSphere

5b4c707

Add validation for credentials existence in GCP

1706724

Add validation for credentials existence in Azure

ee784d8

JoelSpeed force-pushed the validate-creds branch from a0660ed to d36a7e3 Compare December 10, 2020 11:50

alexander-demicev approved these changes Dec 10, 2020

View reviewed changes

openshift-ci-robot assigned alexander-demicev Dec 10, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2020

openshift-merge-robot merged commit ed12b42 into openshift:master Dec 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1805639: Validate credentialsSecret #673

Bug 1805639: Validate credentialsSecret #673

JoelSpeed commented Aug 10, 2020

openshift-ci-robot commented Aug 10, 2020

elmiko left a comment

JoelSpeed commented Aug 10, 2020

elmiko commented Aug 10, 2020

JoelSpeed commented Aug 10, 2020

michaelgugino left a comment

JoelSpeed commented Aug 17, 2020

JoelSpeed commented Aug 18, 2020

openshift-ci-robot commented Aug 18, 2020

michaelgugino commented Aug 19, 2020

elmiko commented Aug 19, 2020

JoelSpeed commented Aug 20, 2020

michaelgugino commented Aug 27, 2020

JoelSpeed commented Sep 7, 2020

michaelgugino commented Nov 2, 2020

Danil-Grigorev left a comment

JoelSpeed commented Nov 9, 2020

michaelgugino commented Nov 12, 2020

michaelgugino commented Nov 12, 2020

JoelSpeed commented Nov 30, 2020

JoelSpeed commented Dec 1, 2020

Danil-Grigorev left a comment

openshift-ci-robot commented Dec 1, 2020

alexander-demicev left a comment

JoelSpeed commented Dec 10, 2020

openshift-bot commented Dec 10, 2020

openshift-bot commented Dec 10, 2020

openshift-bot commented Dec 10, 2020

openshift-bot commented Dec 10, 2020

openshift-merge-robot commented Dec 10, 2020 •

edited

openshift-ci-robot commented Dec 10, 2020

Bug 1805639: Validate credentialsSecret #673

Bug 1805639: Validate credentialsSecret #673

Conversation

JoelSpeed commented Aug 10, 2020

openshift-ci-robot commented Aug 10, 2020

elmiko left a comment

Choose a reason for hiding this comment

JoelSpeed commented Aug 10, 2020

elmiko commented Aug 10, 2020

JoelSpeed commented Aug 10, 2020

michaelgugino left a comment

Choose a reason for hiding this comment

JoelSpeed commented Aug 17, 2020

JoelSpeed commented Aug 18, 2020

openshift-ci-robot commented Aug 18, 2020

michaelgugino commented Aug 19, 2020

elmiko commented Aug 19, 2020

JoelSpeed commented Aug 20, 2020

michaelgugino commented Aug 27, 2020

JoelSpeed commented Sep 7, 2020

michaelgugino commented Nov 2, 2020

Danil-Grigorev left a comment

Choose a reason for hiding this comment

JoelSpeed commented Nov 9, 2020

michaelgugino commented Nov 12, 2020

michaelgugino commented Nov 12, 2020

JoelSpeed commented Nov 30, 2020

JoelSpeed commented Dec 1, 2020

Danil-Grigorev left a comment

Choose a reason for hiding this comment

openshift-ci-robot commented Dec 1, 2020

alexander-demicev left a comment

Choose a reason for hiding this comment

JoelSpeed commented Dec 10, 2020

openshift-bot commented Dec 10, 2020

openshift-bot commented Dec 10, 2020

openshift-bot commented Dec 10, 2020

openshift-bot commented Dec 10, 2020

openshift-merge-robot commented Dec 10, 2020 • edited

openshift-ci-robot commented Dec 10, 2020

openshift-merge-robot commented Dec 10, 2020 •

edited