
Bug 2065597: Add support for dynamic, user-managed config #78

Conversation

stephenfin
Contributor

@stephenfin stephenfin commented Mar 31, 2022

When working on openshift/enhancements#1009, we noticed that we had the same problem with Cinder CSI that we did with OpenStack Cloud provider: namely, that we are using a static cloud.conf for configuration. This means all storage settings from openshift-config/cloud-provider-config are lost when we upgrade.

This PR resolves the issue by replacing the static config map containing the Cinder CSI driver config with a dynamic config map managed by the operator. To populate it, we reuse the copy-modify-save pattern that fixed the same cloud provider issue in the Cluster Cloud Controller Manager Operator (CCCMO).
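The copy-modify-save pattern mentioned here can be sketched as a pure function over config map data. All names below (the function, the `config` and `cloud.conf` keys) are illustrative stand-ins, not the operator's actual API:

```go
package main

import "fmt"

// syncCloudConfig sketches the copy-modify-save pattern: copy the
// user-managed config map data, modify the cloud.conf payload for the
// Cinder CSI driver, and save the result for an operator-managed config
// map. Key names here are hypothetical.
func syncCloudConfig(source map[string]string, transform func(string) string) map[string]string {
	// Copy: never mutate the user-managed source data.
	dest := make(map[string]string, len(source))
	for k, v := range source {
		dest[k] = v
	}
	// Modify: rewrite the cloud.conf payload for the CSI driver.
	if conf, ok := dest["config"]; ok {
		dest["cloud.conf"] = transform(conf)
		delete(dest, "config")
	}
	// Save: the caller would apply dest to the operator-managed config map.
	return dest
}

func main() {
	src := map[string]string{"config": "[Global]\n"}
	dst := syncCloudConfig(src, func(s string) string { return s + "use-clouds = true\n" })
	fmt.Println(dst["cloud.conf"])
}
```

Because the source data is copied rather than mutated, user edits to the config map in `openshift-config` survive and are re-synced on the next reconcile.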

TODO:

  • Add a check for a new, CSI-only config map in the openshift-config namespace. This should be preferred if available, to allow us to get away from combining cloud provider and CSI configuration (these are different services nowadays).
  • Decide if I need to squash the last two patches (see comment in the commits)
  • Manually verify that this is doing what we think
  • Tests!

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 31, 2022
@openshift-ci
Contributor

openshift-ci bot commented Mar 31, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Member

@mandre mandre left a comment

Seems like a very good start.
I'm less familiar with the controller changes, so I would have to give it a spin to check it's working as expected. You could perhaps mark your PR as ready for review so that we can start by testing it in CI.

pkg/controllers/config/configsync.go Outdated Show resolved Hide resolved
}

isMultiAZDeployment, err := isMultiAZDeployment()
if err != nil {
Member

Should it be an error case, or should we fallback and provide a default value?

Contributor Author

So this was an error previously (see starter.go). I've just preserved the logic here. There's no reason we couldn't default to false though. WDYT?

blockStorage, _ := cfg.GetSection("BlockStorage")
if blockStorage != nil {
klog.Infof("[BlockStorage] section found; dropping any legacy settings...")
// Remove the legacy keys, once we ensure they're not overridden
Member

Looks like we're removing trust-device-path unconditionally. Should we make the operator degraded like we did with CCCMO when this option is set?

Contributor Author

This should really have a TODO here. The comment indicates what I want to do but I didn't know what the default value should be. I'll have a look at an existing cluster.
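The behaviour under discussion could look something like the sketch below: drop the legacy key, but also report whether it was set, so the caller can mark the operator Degraded (as CCCMO does) instead of silently ignoring the setting. The line-based INI handling and the function name are illustrative; the operator uses a real INI library:

```go
package main

import (
	"fmt"
	"strings"
)

// dropTrustDevicePath strips the legacy trust-device-path key from the
// [BlockStorage] section and reports whether it was set. This is a hedged
// sketch of one possible behaviour, not the operator's actual code.
func dropTrustDevicePath(conf string) (string, bool) {
	var out []string
	found := false
	inBlockStorage := false
	for _, line := range strings.Split(conf, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "[") {
			inBlockStorage = trimmed == "[BlockStorage]"
		}
		if inBlockStorage && strings.HasPrefix(trimmed, "trust-device-path") {
			found = true
			continue // drop the legacy key
		}
		out = append(out, line)
	}
	return strings.Join(out, "\n"), found
}

func main() {
	conf := "[Global]\nregion = r1\n[BlockStorage]\ntrust-device-path = true\n"
	cleaned, wasSet := dropTrustDevicePath(conf)
	fmt.Println(cleaned)
	fmt.Println("was set:", wasSet)
}
```

With the boolean in hand, the controller could either log a warning or set a Degraded condition when the user had explicitly configured the option.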

@stephenfin stephenfin marked this pull request as ready for review April 7, 2022 09:43
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 7, 2022
@stephenfin
Contributor Author

There are still TODOs on this but this is good enough to run CI on (and see what's crashing and burning)

@openshift-ci openshift-ci bot requested a review from gnufied April 7, 2022 09:44
@openshift-ci
Contributor

openshift-ci bot commented Apr 21, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stephenfin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2022
No one should ever check a compiled binary in.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
Allow us to store the common namespace information as we add controllers
in the future.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
@stephenfin stephenfin force-pushed the cloud-provider-config-upgrade branch from 4c2ba42 to cce8a89 Compare April 29, 2022 14:23
@gnufied
Member

gnufied commented Apr 29, 2022

When working on openshift/enhancements#1009, we noticed that we had the same problem with Cinder CSI that we did with OpenStack Cloud provider: namely, that we are using a static cloud.conf for configuration. This means all storage settings from openshift-config/cloud-provider-config are lost when we upgrade.

Can you explain - how settings from openshift-config/cloud-provider-config are lost when we upgrade? I understand the argument that legacy-cloud-configuration may not be compatible with cloud-config that is required for external CCM but how are settings lost?

We had a similar problem of incompatible cloud-configuration with vsphere-csi driver and we ended up rewriting the configmap from openshift-config namespace. But no setting was lost afaik.

This PR resolves this issue by replacing the static config map containing the Cinder CSI driver config with a dynamic config map managed by the operator

The existing configmap configuration is not really static afaict. If user changes configmap inside openshift-config namespace, then a new controller will be deployed.

@mandre
Member

mandre commented May 2, 2022

Can you explain - how settings from openshift-config/cloud-provider-config are lost when we upgrade? I understand the argument that legacy-cloud-configuration may not be compatible with cloud-config that is required for external CCM but how are settings lost?

It was not really that these settings were lost, but they were not used by Cinder CSI.

If the customer had set any blockstorage options in the user-managed cloud.conf (in openshift-config/cloud-provider-config), they were not ported over to the Cinder CSI cloud.conf, and from a user point of view were effectively lost on upgrade. Prior to this change the Cinder CSI operator would read its cloud.conf from the openstack-cinder-config/openshift-cluster-csi-drivers static asset, and any change to the CM would be overwritten.

We had a similar problem of incompatible cloud-configuration with vsphere-csi driver and we ended up rewriting the configmap from openshift-config namespace. But no setting was lost afaik.

Right, this is the purpose of this change too.
We're now reading the cloud.conf from the user-managed cloud.conf at openshift-config/cloud-provider-config and transform it into a cloud.conf Cinder CSI can use at openshift-cluster-csi-drivers/csi-cinder-config.

This PR resolves this issue by replacing the static config map containing the Cinder CSI driver config with a dynamic config map managed by the operator

The existing configmap configuration is not really static afaict. If user changes configmap inside openshift-config namespace, then a new controller will be deployed.

That was only valid for in-tree, unfortunately. With this change, the user-managed cloud.conf is now the source for both the in-tree config and Cinder CSI's cloud.conf.

@@ -284,12 +284,10 @@ spec:
path: clouds.yaml
- name: config-cinderplugin
configMap:
name: openstack-cinder-config
name: csi-cinder-config
Member

How about we call this CM cloud-conf and store the file under the cloud.conf key, to align with the CCM's cloud config?

@mandre
Member

mandre commented May 2, 2022

Finally some green \o/
The e2e-openstack-csi job is failing because of an issue in CIRO that should be fixed with openshift/cluster-image-registry-operator#773.

There are still a couple of oddities with this PR:

  • it executes getCloudInfo() way more often than the 20 min sync interval we set
  • the debug logs say the zones are empty, while the generated cloud.conf has ignore-volume-az = yes

@stephenfin stephenfin force-pushed the cloud-provider-config-upgrade branch from c0b4614 to 277232a Compare May 3, 2022 17:37
@mandre mandre changed the title Add support for dynamic, user-managed config Bug 2065597: Add support for dynamic, user-managed config May 5, 2022
@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 5, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 5, 2022

@stephenfin: This pull request references Bugzilla bug 2065597, which is invalid:

  • expected the bug to target the "4.11.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2065597: Add support for dynamic, user-managed config

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mandre
Member

mandre commented May 5, 2022

/bugzilla refresh

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label May 5, 2022
@openshift-ci openshift-ci bot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label May 5, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 5, 2022

@mandre: This pull request references Bugzilla bug 2065597, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from eurijon May 5, 2022 12:10
Member

@mandre mandre left a comment

Did we also find an explanation for why getCloudInfo was called so frequently?

pkg/controllers/config/configsync_test.go Outdated Show resolved Hide resolved
if err != nil {
return false, fmt.Errorf("couldn't collect info about cloud availability zones: %w", err)
}
klog.V(2).Infof("found cloud info: %+v", ci)
Member

We'll have to remove these debug logs before merging.

Contributor Author

I've kept them for now while I try to debug the frequent calls but I have added a TODO.

klog.V(2).Infof("found volume zones: %+v", ci.ComputeZones)
// We consider a cloud multi-AZ when it either has several different zones
// or the compute and volume zones differ.
return len(ci.ComputeZones) != 1 || len(ci.VolumeZones) != 1 || ci.ComputeZones[0] != ci.VolumeZones[0], nil
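The check in this hunk can be restated as a small pure function (the name is illustrative) to make its three conditions explicit: anything other than exactly one compute zone, exactly one volume zone, and the two zones matching counts as multi-AZ:

```go
package main

import "fmt"

// isMultiAZ restates the check from the diff above: a cloud is considered
// multi-AZ unless there is exactly one compute zone and exactly one volume
// zone and they are the same zone.
func isMultiAZ(computeZones, volumeZones []string) bool {
	return len(computeZones) != 1 || len(volumeZones) != 1 || computeZones[0] != volumeZones[0]
}

func main() {
	fmt.Println(isMultiAZ([]string{"nova"}, []string{"nova"}))      // false: single matching AZ
	fmt.Println(isMultiAZ([]string{"az1", "az2"}, []string{"az1"})) // true: several compute zones
	fmt.Println(isMultiAZ([]string{"nova"}, []string{"cinder-az"})) // true: zones differ
}
```

Note that empty zone lists also come out as multi-AZ, which matters for the caching discussion below.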
Member

We still have an issue where getCloudInfo() always returns empty compute and volume zones. I believe this comes from the implementation of getCloudInfo(), where we return a cached value if it's not nil.

The first invocation triggers an error (perhaps because it's too early?) and we never fetch the cloud info again. The error is different based on the cloud where the CI runs.

On vexxhost-managed, we get:

E0503 18:17:23.602278       1 base_controller.go:272] ConfigSync reconciliation failed: couldn't collect info about cloud availability zones: failed to create a compute client: Post "https://rdo.vexxhost.ca:5000/v3/auth/tokens": dial tcp: i/o timeout

While on vexxhost-mecha (using self-signed cert) we get:

E0503 17:57:33.893921       1 base_controller.go:272] ConfigSync reconciliation failed: couldn't collect info about cloud availability zones: failed to create a compute client: Error parsing CA Cert from /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem

Contributor Author

I've modified getCloudInfo so that it no longer caches the CloudInfo result internally. Instead, isMultiAZDeployment will cache the result returned by getCloudInfo but only if there are one or more volume and compute AZs. This makes sense since both Nova and Cinder have a default AZ of nova (which you shouldn't actually use! See warning here). Any request that doesn't return at least this is errant.

We probably want to discuss whether we even want to do this, since a user could configure it themselves and it feels a bit "magical"...
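A minimal sketch of the caching fix described here, assuming the shape of the code: the fetch no longer caches internally, and the caller caches the result only when it contains at least one compute and one volume zone, since both Nova and Cinder default to a "nova" AZ and an empty answer is errant. All names are illustrative stand-ins for pkg/controllers/config/cloudinfo.go:

```go
package main

import "fmt"

// CloudInfo is a hypothetical stand-in for the operator's type.
type CloudInfo struct {
	ComputeZones []string
	VolumeZones  []string
}

var cachedInfo *CloudInfo

// isMultiAZDeployment caches a fetched CloudInfo only when it is non-empty,
// so an errant early response (no zones) is retried on the next sync.
func isMultiAZDeployment(fetch func() (*CloudInfo, error)) (bool, error) {
	ci := cachedInfo
	if ci == nil {
		var err error
		ci, err = fetch()
		if err != nil {
			return false, fmt.Errorf("couldn't collect info about cloud availability zones: %w", err)
		}
		// Only cache a useful answer; an empty one is retried next sync.
		if len(ci.ComputeZones) > 0 && len(ci.VolumeZones) > 0 {
			cachedInfo = ci
		}
	}
	return len(ci.ComputeZones) != 1 || len(ci.VolumeZones) != 1 ||
		ci.ComputeZones[0] != ci.VolumeZones[0], nil
}

func main() {
	calls := 0
	fetch := func() (*CloudInfo, error) {
		calls++
		if calls == 1 {
			return &CloudInfo{}, nil // errant early response: no zones yet
		}
		return &CloudInfo{ComputeZones: []string{"nova"}, VolumeZones: []string{"nova"}}, nil
	}
	multi, _ := isMultiAZDeployment(fetch) // empty zones: reported multi-AZ, not cached
	fmt.Println(multi)
	multi, _ = isMultiAZDeployment(fetch) // re-fetched and now cached
	fmt.Println(multi)
	multi, _ = isMultiAZDeployment(fetch) // served from cache; fetch count stays at 2
	fmt.Println(multi, calls)
}
```

This also explains the earlier oddity: with an unconditional cache, one bad early fetch would pin empty zones forever, while with no cache at all getCloudInfo runs on every reconcile.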

@mandre
Member

mandre commented May 6, 2022

/retest

@@ -24,14 +26,38 @@ type clients struct {

var ci *CloudInfo

// getCloudInfo fetches and caches metadata from openstack
func getCloudInfo() (*CloudInfo, error) {
func isMultiAZDeployment() (bool, error) {
var err error

if ci != nil {
Member

Suggested change
if ci != nil {
if ci == nil {

Contributor Author

@stephenfin stephenfin May 6, 2022

Ugh

pkg/controllers/config/cloudinfo.go Outdated Show resolved Hide resolved
This is responsible for generating a new config map containing
configuration for the cinder CSI driver. This config map is based on a
user-provided config map (found at '').
It is unlikely that this file can be used as-is and we therefore need to
conduct a bit of surgery on it. Namely, we must do the following:

- Add the following to the [Global] section:

  - [Global] use-clouds
  - [Global] clouds-file
  - [Global] cloud

- Add the following to the [BlockStorage] section:

  - [BlockStorage] ignore-volume-az

- Drop the following from the [BlockStorage] section:

  - [BlockStorage] trust-device-path

With this done, we can remove our use of static config maps. That will
be done in a separate change.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
We're now creating this dynamically, meaning we no longer need these
static assets. Remove them.

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
@stephenfin stephenfin force-pushed the cloud-provider-config-upgrade branch from e12fa96 to fdaaac6 Compare May 10, 2022 11:49
Member

@mandre mandre left a comment

/lgtm
/hold for CI

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 10, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 10, 2022
@mandre
Member

mandre commented May 10, 2022

The failure seems like a flake and unrelated to this change, however let's make sure of it:
/test e2e-openstack

@openshift-ci
Contributor

openshift-ci bot commented May 10, 2022

@stephenfin: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mandre
Member

mandre commented May 10, 2022

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 10, 2022
@openshift-merge-robot openshift-merge-robot merged commit fbf6c4e into openshift:master May 10, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 10, 2022

@stephenfin: All pull requests linked via external trackers have merged:

Bugzilla bug 2065597 has been moved to the MODIFIED state.

In response to this:

Bug 2065597: Add support for dynamic, user-managed config

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
