New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CCO-366 Add ability to detect AWS STS and behave accordingly #542
Conversation
Skipping CI for Draft Pull Request. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #542 +/- ##
==========================================
- Coverage 48.06% 47.84% -0.23%
==========================================
Files 93 93
Lines 11305 11488 +183
==========================================
+ Hits 5434 5496 +62
- Misses 5251 5359 +108
- Partials 620 633 +13
|
pkg/aws/actuator/actuator.go
Outdated
ErrReason: minterv1.InsufficientCloudCredentials, | ||
Message: msg, | ||
stsDetected, err := utils.IsTimedTokenCluster(a.Client, logger) | ||
if stsDetected { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this change e2e started working. I am kind of unsure about it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, here is logging from a Clusterbot cluster launched with workflow-launch openshift-e2e-aws-manual-oidc-sts https://github.com/openshift/cloud-credential-operator/pull/542
and then having run make test-e2e-sts
manually:
time="2023-06-12T13:40:45Z" level=info msg="reconciling clusteroperator status"
time="2023-06-12T13:40:45Z" level=info msg="operator set to disabled / manual mode" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:40:45Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:40:45Z" level=info msg="adding finalizer: cloudcredential.openshift.io/deprovision" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req secret=default/test-sts-secret
time="2023-06-12T13:40:45Z" level=info msg="operator set to disabled / manual mode" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:40:45Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:40:45Z" level=info msg="timed token access cluster detected: true, so not trying to provision with root secret" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req secret=default/test-sts-secret
time="2023-06-12T13:40:45Z" level=info msg="actuator detected STS enabled cluster making secret" actuator=aws cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:40:45Z" level=info msg="creating secret" actuator=aws cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:40:45Z" level=info msg="clusteroperator status updated" controller=status
time="2023-06-12T13:40:45Z" level=info msg="reconciling clusteroperator status"
time="2023-06-12T13:40:45Z" level=info msg="secret created successfully" actuator=aws cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:40:56Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2023-06-12T13:40:56Z" level=info msg="reconcile complete" controller=metrics elapsed=4.220597ms
time="2023-06-12T13:41:56Z" level=info msg="reconciling clusteroperator status"
time="2023-06-12T13:42:56Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2023-06-12T13:42:56Z" level=info msg="reconcile complete" controller=metrics elapsed=4.133929ms
time="2023-06-12T13:44:45Z" level=info msg="reconciling clusteroperator status"
time="2023-06-12T13:44:45Z" level=info msg="operator set to disabled / manual mode" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:44:45Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:44:45Z" level=warning msg="no user name set on credentials being deleted, most likely were never provisioned or using passthrough creds" actuator=aws cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:44:45Z" level=info msg="target secret deleted successfully" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req secret=default/test-sts-secret targetSecret=default/test-sts-secret
time="2023-06-12T13:44:45Z" level=info msg="actuator deletion complete, removing finalizer" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req secret=default/test-sts-secret
time="2023-06-12T13:44:45Z" level=info msg="reconciling clusteroperator status"
time="2023-06-12T13:44:45Z" level=info msg="operator set to disabled / manual mode" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:44:45Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/test-sts-creds-req
time="2023-06-12T13:44:45Z" level=info msg="clusteroperator status updated" controller=status
time="2023-06-12T13:44:56Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2023-06-12T13:44:56Z" level=info msg="reconcile complete" controller=metrics elapsed=3.554297ms
time="2023-06-12T13:46:56Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2023-06-12T13:46:56Z" level=info msg="reconcile complete" controller=metrics elapsed=4.054413ms
time="2023-06-12T13:46:56Z" level=info msg="reconciling clusteroperator status"
So I feel more confident the test is really passing.
|
||
func onAdd(t *testing.T, cfg envconf.Config, ctx context.Context) func(obj interface{}) { | ||
time.Sleep(2 * time.Minute) | ||
if err := cfg.Client().Resources().Get(ctx, secretName, namespace, secret); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because of unsureness about Secret really existing here, despite this being a callback from the watch on the Secret.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The obj interface{}
passed to this func is the object that was just added - perhaps use that? 2min wait should not be needed here
/assign @abutcher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple small things. Wondering if some of the new Info
logs should be moved to Debug
based on their potential noise level but those can be moved or followed up on.
} | ||
|
||
func computeClusterOperatorVersions() string { | ||
currentVersion := os.Getenv("RELEASE_VERSION") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if this is not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't get to use the feature gate? As featuregates.NewFeatureGateAccess
likely doesn't do much for nil as the desired version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A string cannot be nil in golang, it will instead be an empty string. So, in such a case it will be looking for the cluster version to be an empty string. In such a case, I expect a lot of other things will also be broken. The good news is RELEASE_VERSION is defined in the 03-deployment.yaml and should always be non-empty.
pkg/operator/utils/utils.go
Outdated
logger.WithError(err).Error("error loading CCO configuration to determine mode") | ||
return false, err | ||
} | ||
if credentialsMode != "Manual" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super-nit: is manual not a constant somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yikes! investigating I find that mode referred to as both "manual" and "Manual" in yamls, docs and code. Yes, there is a constant, that has it as "manual" (1b384cc#diff-f6d93feef6d59ca8ced14b54fe0058d41f14de44450baf6b723cc1c2ec280c54R25) but it doesn't seem to be used everywhere now....
So I'm a little leery to start using the constant version here until CCO team can review a bit more. I'd say it can wait until another day to get better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitOperatorConfiguration() returns an operatorv1.CloudCredentialsMode which has a constant CloudCredentialsModeManual = "Manual".
|
||
func onAdd(t *testing.T, cfg envconf.Config, ctx context.Context) func(obj interface{}) { | ||
time.Sleep(2 * time.Minute) | ||
if err := cfg.Client().Resources().Get(ctx, secretName, namespace, secret); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The obj interface{}
passed to this func is the object that was just added - perhaps use that? 2min wait should not be needed here
pkg/operator/credentialsrequest/credentialsrequest_controller.go
Outdated
Show resolved
Hide resolved
Changed all 3 instances to Debug in d6bb916 |
return reconcile.Result{}, err | ||
if !stsDetected { | ||
logger.Infof("operator detects STS enabled cluster") | ||
return reconcile.Result{}, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This return short circuits continuing through the controller to create the secret for the CredentialsRequest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth adding a test to the credentials request controller tests to ensure we go through the motions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the end of the day and it's been a long one for me, again, I'm not seeing a good way to create a test on the controller with STS enabled. There might be one, I'm just kind of lost at the moment. Maybe we should add a card to Jira for "Streamline Credentials Request Controller Unit Tests and add STS enabled and feature gate flags"?
// The presence of an STSRoleARN within the AWSProviderSpec initiates creation of a secret containing IAM | ||
// Role details necessary for assuming the IAM Role via Amazon's Secure Token Service. | ||
// +optional | ||
// +kubebuilder:validation:Pattern:="^arn:(?P<Partition>[^:\n]*):(?P<Service>[^:\n]*):(?P<Region>[^:\n]*):(?P<AccountID>[^:\n]*):(?P<Ignore>(?P<ResourceType>[^:\/\n]*)[:\/])?(?P<Resource>.*)$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The role ARN validation is causing CRD generation to fail which for some reason fails silently when ran via make update
. The silently failing part is possibly related to the openshift/api workaround.
'_output/tools/bin/controller-gen-v0.2.5' schemapatch:manifests="./manifests" paths="./pkg/apis/cloudcredential/v1" 'output:dir="./manifests"'
/home/abutcher/go/src/github.com/openshift/cloud-credential-operator/pkg/apis/cloudcredential/v1/types_aws.go:36:2: invalid char escape (at <input>:1:1)
/home/abutcher/go/src/github.com/openshift/cloud-credential-operator/pkg/apis/cloudcredential/v1/types_aws.go:36:2: invalid char escape (at <input>:1:1)
/home/abutcher/go/src/github.com/openshift/cloud-credential-operator/pkg/apis/cloudcredential/v1/types_aws.go:36:2: unable to parse string: invalid syntax (at <input>:1:1)
/home/abutcher/go/src/github.com/openshift/cloud-credential-operator/pkg/apis/cloudcredential/v1/types_aws.go:36:2: invalid char escape (at <input>:1:1)
/home/abutcher/go/src/github.com/openshift/cloud-credential-operator/pkg/apis/cloudcredential/v1/types_aws.go:36:2: invalid char escape (at <input>:1:1)
/home/abutcher/go/src/github.com/openshift/cloud-credential-operator/pkg/apis/cloudcredential/v1/types_aws.go:36:2: unable to parse string: invalid syntax (at <input>:1:1)
Error: not all generators ran successfully
This comment was marked as duplicate.
This comment was marked as duplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have general concern about the logic defining what is STS enabled vs feature gate enabled (per platform). Otherwise, a few minor nitpicks.
@@ -46,6 +46,8 @@ type Actuator interface { | |||
Upgradeable(operatorv1.CloudCredentialsMode) *configv1.ClusterOperatorStatusCondition | |||
// GetCredentialsRootSecret returns the credentials root secret. | |||
GetCredentialsRootSecret(ctx context.Context, cr *minterv1.CredentialsRequest) (*corev1.Secret, error) | |||
// Query STSFeatureGateEnabled. | |||
STSFeatureGateEnabled() bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I'm a little concerned that feature gate enabled is determined per platform. I would instead expect the STSEnabled to be per platform and featureGateEnabled to simply reflect the status of the feature gate (platform agnostic). I don't think it's worth blocking on, but please make sure the functionality continues to work as expected when the STSFeatureGate bits are eventually removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, noted, I understand what you're saying, if you elaborate a little bit on your top level comment and make comment on the Jira about removing the FeatureGate that should do it for remembering to deal with this.
pkg/operator/utils/utils.go
Outdated
logger.WithError(err).Error("error loading CCO configuration to determine mode") | ||
return false, err | ||
} | ||
if credentialsMode != "Manual" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitOperatorConfiguration() returns an operatorv1.CloudCredentialsMode which has a constant CloudCredentialsModeManual = "Manual".
} | ||
configInformers := configinformers.NewSharedInformerFactory(clientSet, 10*time.Minute) | ||
desiredVersion := computeClusterOperatorVersions() | ||
missingVersion := desiredVersion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should be missingVersion := "0.0.1-snapshot"
to match what is in the 03-deployment.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems wrong to hard code to a string which might end up different from YAML contents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly... I just noted the difference from many of the other operators.
- cluster-ingress-operator
- cloud-network-config-controller
- azure-disk-csi-driver-operator
- azure-file-csi-driver-operator
- cluster-storage-operator
- machine-api-provider-azure
I believe what exists today (missingVersion := desiredVersion
) will cause it to treat all versions as if they are the missing version, thereby skipping the version check altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to keep in mind, the purpose of missingVersion
is to handle the situation where the 03-deployment.yaml is applied from locally compiled code, and where it is not going through the process of having a RELEASE_VERSION defined via our release process. Or, in other words, it is explicitly to enable locally compiled operator(s) to function as desired by ignoring the version check itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@jstuever Could you elaborate a bit? Even given that we'll probably need to come back to fix the logic later, it would be helpful to put the specifcs here for a reference for that? One thing I noticed as I try to add a unit test on the STS enabled path in credentialsrequest_controller.go, I am finding it easy to test the FeatureGate enabled part, but hard to fake the STS detected part, because that boolean is hard to reach from the tests. |
…s for" This reverts commit 6561e65. Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
``` $ make update ``` Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
For whatever reason, when the upstream target calls controller-gen, the tool reads and writes *every* YAML file in the `./manifests` directory, even files that the tool is not writing to during the generator step. We copy our `00-config-custresdef.yaml` from the OpenShift API repo, which uses a bsepoke generator with different formatting on the output. Therefore, when `controller-gen` is run on the YAML we copy from the API repo, it re-formats it and causes the `git diff` check to fail. In this override, we simply copy back in the YAMLs from the API repo after generating, overwriting anything `controller-gen` may have done, so we don't have this spurious failure. Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
In reply to #542 (comment)
|
/test e2e-aws-manual-oidc |
Responding to #542 (comment) I agree, the constants file is confusing. I had to dig in to understand how it is used. It looks like they are used solely for keys (string) in a mapping in the metrics operator. The mapping appears to be more of a status use case than a desired configuration. There also appear to be additional status options compared to |
@bentito: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
We spent some time in openshift#511 and again in openshift#542 trying to reconcile CRDs. The problem is that we want to *generate* the CredentialsRequest CRD from code in this repo, but *use* (copy) the CloudCredential CRD from openshift/api, which we vendor. But we invoke controller-gen through build-machinery-go, and it does unexpected things to the latter, which breaks validation. With this commit, we move the CredentialsRequest CRD to a `generated` subdirectory and the CloudCredential CRD to an `imported` subdirectory. This lets us go back to the simpler invocation of bmg's tooling while keeping everything in the shape we expect. One more quirk: Because build-machinery-go starts defining dependency chains for targets like `update`, we need to start defining that dependency chain *before* we import the bmg libs to ensure that we copy/generate CRDs *before* we include them in bindata.
/approve proxy for @abutcher |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: 2uasimojo, bentito, stevekuznetsov The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cherry-pick release-4.13 |
@stevekuznetsov: #542 failed to apply on top of branch "release-4.13":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Adds capability to detect AWS STS enabled clusters and behave a bit like Mint mode when detected.
There is now a small unit test suite. And a single e2e test that uses an STS workflow.
Notes:
must-gather
make me unsure:Ah, it seems to be working, see inline comment with log lines showing Secret creation/deletion on CredentialsRequest
addition in the e2e test.