Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1805639: Validate credentialsSecret #673

Merged
merged 5 commits into from Dec 10, 2020

Conversation

JoelSpeed
Copy link
Contributor

This let the webhook to validate the credentialsSecret existence before creating the machine resource.
Additionally we should probably reflect any unexpected error to use the credentialsSecret as providerStatus conditions https://issues.redhat.com/browse/OCPCLOUD-931

I'm taking over this bug from @enxebre as he is away for three weeks. I've applied all of my suggestions from the original PR #660 and am happy for this to merge once it has undergone some manual testing.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Aug 10, 2020
@openshift-ci-robot
Copy link
Contributor

@JoelSpeed: This pull request references Bugzilla bug 1805639, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1805639: Validate credentialsSecret

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 10, 2020
Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks reasonable to me, i just want to make sure i understand it. is the idea here that during validation for a machine create/update we check to ensure that the credentials secret exists and if not we add an error which causes a graceful failure?

@JoelSpeed
Copy link
Contributor Author

this looks reasonable to me, i just want to make sure i understand it. is the idea here that during validation for a machine create/update we check to ensure that the credentials secret exists and if not we add an error which causes a graceful failure?

Yep that's the plan, prevents someone from updating the secret to point to a secret that doesn't exist. Currently spinning up a cluster to test that this works as expected

@elmiko
Copy link
Contributor

elmiko commented Aug 10, 2020

thanks Joel, makes sense to me.
/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 10, 2020
@JoelSpeed
Copy link
Contributor Author

I've manually tested this and can confirm, any secret that existed within the namespace was allowed, but a secret that did not exist was rejected with the appropriate message. I'm happy for this to merge once the tests pass

Copy link
Contributor

@michaelgugino michaelgugino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the webhook is the right way to solve this. I'd prefer to go immediately to failed if the credentials are invalid.

Reason being, it's not desirable to have the machineset controller fail to successfully create machines. If someone deletes the secret later, it's going to be hard to troubleshoot what's happening. With just failing the machines, we can put a reason right onto the machine-object.

While this does marginally improve the UX when creating a machineset, it has a greater negative impact on troubleshooting down the road IMO.

@JoelSpeed
Copy link
Contributor Author

I don't think the webhook is the right way to solve this. I'd prefer to go immediately to failed if the credentials are invalid.

We shouldn't force the machine to Failed if the credentials are invalid since we can't check whether the VM exists or not (because no creds). If we forced it to failed because someone deleted the credentials after the Machine was created, this could cause a machine to go Failed when actually, it's still running.

If someone deletes the secret later, it's going to be hard to troubleshoot what's happening.

Why does this change make it any harder to debug than the existing code?

While this does marginally improve the UX when creating a machineset, it has a greater negative impact on troubleshooting down the road IMO.

This helps in a small way yes, it's trying to improve the UX at machineset creation time, but we can't catch all cases, and I don't think there's an easy way to catch all cases.

I don't think this makes the experience worse than it already is, I don't think it should interfere with debugging etc. and since it doesn't affect any of the existing validations or behaviour, we aren't taking any reporting of errors away.

@JoelSpeed
Copy link
Contributor Author

/retest

@openshift-ci-robot
Copy link
Contributor

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-operator 36ea4b3 link /test e2e-aws-operator
ci/prow/e2e-aws 36ea4b3 link /test e2e-aws
ci/prow/e2e-azure 36ea4b3 link /test e2e-azure

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@michaelgugino
Copy link
Contributor

Unless we're extending the machineSet to add some information to the machineSet.status object about why it can't create machines, I'm against this change.

This change will inevitably lead to "I created a machineset, but I don't have any machines" type of questions. It won't be obvious what's broken or where to look. Couple that with the fact that people use node/instance/machine all interchangeably, it's going to be hard to communicate to users where to look for what conditions.

As far as failing the machine due to invalid credentials, we should check to ensure that the machine wasn't provisioned already. We could check this before we set finalizers, ensure the secret is present, if it is, we set finalizers, if it's not, we set failed (possibly skip setting finalizers?) If finalizers are set, we have no way of knowing if an instance was created if credentials go missing (on the outside chance that the machine-controller crashes around the same time the credentials are removed and immediately after the request to create the instance was submitted).

@elmiko
Copy link
Contributor

elmiko commented Aug 19, 2020

it seems like there is debate to had about this still, i'm going to remove my approval until the issues raised here have been resolved.
/approve cancel

@openshift-ci-robot openshift-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 19, 2020
@JoelSpeed
Copy link
Contributor Author

Unless we're extending the machineSet to add some information to the machineSet.status object about why it can't create machines, I'm against this change.

We could add a condition, and/or use an event to communicate this.

This change will inevitably lead to "I created a machineset, but I don't have any machines" type of questions. It won't be obvious what's broken or where to look. Couple that with the fact that people use node/instance/machine all interchangeably, it's going to be hard to communicate to users where to look for what conditions.

If the MachineSet is accepted, then it will have a valid credentialsSecret, the only way you could end up with this is if the MachineSet is created and then the credentialsSecret is deleted at a later point. It's possible, so I agree, we should be adding some more obvious errors to the MachineSet events or status

As far as failing the machine due to invalid credentials, we should check to ensure that the machine wasn't provisioned already. We could check this before we set finalizers, ensure the secret is present, if it is, we set finalizers, if it's not, we set failed (possibly skip setting finalizers?) If finalizers are set, we have no way of knowing if an instance was created if credentials go missing (on the outside chance that the machine-controller crashes around the same time the credentials are removed and immediately after the request to create the instance was submitted).

Your suggestion here is that we swap the ordering of checking the instance exists with the addition of the finalizer, so that we only add the finalizer after we successfully create the machine? I think this is likely to cause issues with Machine leaking. We had a bug recently with Azure where Create always returns an error initially due to the async nature of their API, so unless we are 100% sure we can determine the difference between a create failing legitimately and this async failure, then add the finalizer, we could possible leak an instance.
Note how at the moment we add the finalizer and then requeue, this is to prevent conflict errors that might cause us to have created the machine but not committed the finalizer. Again this is leak prevention and I don't think we should be changing this. The finalizer needs to be on the Machine before we attempt to create the instance

@michaelgugino
Copy link
Contributor

Your suggestion here is that we swap the ordering of checking the instance exists with the addition of the finalizer, so that we only add the finalizer after we successfully create the machine? I think this is likely to cause issues with Machine leaking.

Not quite. My suggestion is to check that the credentials secret exists before adding the finalizer. If it doesn't, we fail. If the credentials secret exists, we add the finalizer, don't check it any more. This way, if the secret disappears later we know there is a possibility that the machine was created. If this happens, then we should not remove the machine and require the user to remove the finalizer and instruct them to check the cloud to remove the leaked instance. This is the only way to ensure we don't leak an instance.

@JoelSpeed
Copy link
Contributor Author

My suggestion is to check that the credentials secret exists before adding the finalizer.

This makes more sense to me now, though I do think this PR provides a pretty similar experience, we are just checking it exists before the Machine ever gets created. We don't even get to the point where the Machine could go failed, which I think is preferable. Though I appreciate the concerns about the UX of a MachineSet failing to create machines.

Personally I prefer the fail early and provide details of failed machine creates on the MachineSet YAML still, I think this could provide the plumbing/opportunity to provide more helpful information to users for the various paths as to why the MachineSet couldn't do something

@michaelgugino
Copy link
Contributor

We have the possibility to race with the CCO on first install. Let's not fail machines or otherwise do things that might be helpful but cause other buggy behavior. I suggest we close this.

Copy link
Contributor

@Danil-Grigorev Danil-Grigorev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, but we surely should not merge this like that, talking about #673 (comment). Could it be implemented using warnings instead of rejecting resource, so we just notify the user that it is not expected to have no userSecret? This should not affect cluster installation.

Also, this PR implements passing resource client to webhooks, which is needed to fix BZ 1889620, so the fix is currently blocked.

@JoelSpeed
Copy link
Contributor Author

This needs picking up again as it hasn't been touched in a while, but switching to a warning could be a good way to signal to the user that something isn't quite as expected, how do others feel about that approach?

@michaelgugino
Copy link
Contributor

Let's drop this. We should just indicate on machine status that the secret is not present. In most cases it will be, in limited cases where a user makes a typo, it will be obvious how to fix it.

@michaelgugino
Copy link
Contributor

/hold

These webhooks are not a priority.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020
@JoelSpeed
Copy link
Contributor Author

/hold cancel

I've updated this to warn users when a credentials secret does not exist. This is non blocking and means that users may be alerted to a problem earlier should there be one. PTAL

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 30, 2020
@JoelSpeed
Copy link
Contributor Author

/retest

Copy link
Contributor

@Danil-Grigorev Danil-Grigorev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Danil-Grigorev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2020
enxebre and others added 5 commits December 10, 2020 11:47
This commit introduces the ability to validate the existence of the credentialsSecret reference in the machine providerSpec for AWS.
To make this possible a kube client is stored in the admissionHandler which is then passed at convenience trough the machineAdmissionFn signature.
There are scenarios where this could cause a race condition (eg on 
installation) if Machine creation is rejected immediately when no 
credentials secret exists. This change ensures users are warned, but it 
is not a fatal error for the request.
Copy link
Contributor

@alexander-demicev alexander-demicev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change doesn't block machine from being created, looks good to me in that case

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2020
@JoelSpeed
Copy link
Contributor Author

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Copy link
Contributor

openshift-merge-robot commented Dec 10, 2020

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-azure-operator d36a7e3 link /test e2e-azure-operator
ci/prow/e2e-vsphere-operator d36a7e3 link /test e2e-vsphere-operator
ci/prow/e2e-vsphere d36a7e3 link /test e2e-vsphere
ci/prow/e2e-libvirt d36a7e3 link /test e2e-libvirt

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit ed12b42 into openshift:master Dec 10, 2020
@openshift-ci-robot
Copy link
Contributor

@JoelSpeed: All pull requests linked via external trackers have merged:

Bugzilla bug 1805639 has been moved to the MODIFIED state.

In response to this:

Bug 1805639: Validate credentialsSecret

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants