Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional trusted certificate authorities for image registries #1295

Merged
merged 6 commits into from Apr 21, 2023

Conversation

dmage
Copy link
Member

@dmage dmage commented Dec 1, 2022

No description provided.

@dmage dmage force-pushed the node-ca-replacement branch 2 times, most recently from 682d400 to 498fbcb Compare December 1, 2022 17:22
@dmage
Copy link
Member Author

dmage commented Dec 1, 2022

/assign @mrunalp @sinnykumari
/cc @flavianmissi

There is a intention to make the image registry operator optional, and we are looking for a new way to distribute certificate authorities for registries. It seems MCO is the best fit for this feature. I briefly described my idea how it could work without the registry operator, please let me know if something is unclear or you have other ideas.


### User Stories

* As a <role>, I want to <take some action> so that I can <accomplish a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly:

  • As a customer, I want to be able to remove the image-registry operator, so I can save some CPU/memory overhead in a cluster that does not need an internal image registry. But I still need external image registry CAs listed in the cluster Images distributed so that...

or some such?

* Make additionalTrustedCA be managed by machine-config-operator.
* Create a mechanism for the cluster-image-registry-operator to distribute
certificate authorities for the integrated image registry.
* Remove the node-ca daemon set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Removing the DaemonSet is a nice simplification, but I think it's an implementation/design choice, and not a goal in its own right. Possibly "minimize redundant pathways for getting data onto node disks" or something is a goal that is motivating the DaemonSet-removal?

the `openshift-config-managed` namespace and merge its content with the
user-provided config map. This will allow cluster-image-registry-operator to
provide certificate authorities for the integated image registry without
changing the user-provided config map.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsaid in the current proposal, but implied by the enhancements/machine-config/ position, is that the output of this controller will be MachineConfigs feeding into whichever MachineConfigPools are necessary to get this new content rolled out over the whole cluster. Can we spell that out in detail here, possibly with feedback from the MachineConfig folks about how to target whichever MachineConfigPools are required?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this would generate MachineConfig.

-----END CERTIFICATE-----
```

The cluster-config-operator should update every node to propagete these
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "propagete" -> propagate"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cluster-config-operator is a dead repo. It contains no code and should contain no code. It has no owners. It holds CRD and CR definitions required to bootstrap the first master, nothing more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#812 had floated a node operator. Allowing folks to save some CPU/memory by removing the registry operator, while replacing it with a new node operator, doesn't actually win many resources back for the customer. But #812 shows that we have other reasons for wanting the node folks to have their own operator, including avoiding the need to do things like openshift/machine-config-operator#1453. No worries from me if the image-registry folks can find a team with an existing core operator to hand this functionality off to. Or if the node folks can continue to talk other existing core operators into letting them co-maintain node-associated portions. But I'm cross-linking #812 in case folks decide to go that route instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant machine-config-operator, it already manages files like registries.conf. Updated.


If the config maps are updated, old certificate authorities should be removed
and new ones should be added. Nodes nor applications shouldn't be restarted to
apply changes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will these changes require a CRI-O reload to take effect? If so, and if MachineConfig end up being used as the delivery mechanism, this will need a bump.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing mechanism doesn't reload anything, and it seems to work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trevor is right here, if we need to avoid disruption here then we'll need to add these paths to the live-apply set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may actually need a restart for these fields. I can't find the special handling for a CRI-O reload for this path

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmage and @haircommander Did we ever reach conclusion on this? Oleg says it works currently without reloading cri-o but Peter says no.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtrmac do we read the /etc/docker/certs.d each time we pull or is it cached at all? I would have expected it to be cached similarly to registries.conf but the current behavior imlies it's not

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s https://github.com/containers/image/blob/bb66acc37f166c03470cb58d3b8c808c288e1c2a/docker/docker_client.go#L268 , and it’s not cached at all - it’s read on every pull afresh (and in case of CRI-O, actually multiple times per pull). That’s not a documented/committed behavior, but I also can’t see any reason that would motivate us to change.

(And yes, registries.conf is cached. Sadly, the (too many) config files in c/image were added over time and are not very consistent in their file formats, lookup locations, or behaviors.)

* Make additionalTrustedCA be managed by machine-config-operator.
* Create a mechanism for the cluster-image-registry-operator to distribute
certificate authorities for the integrated image registry.
* Remove the node-ca daemon set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several of us on the MCO/CoreOS side were really surprised to discover this node-ca daemonset a while ago. The fewer things that mutate the node state the better.

cc @jkyros

the `openshift-config-managed` namespace and merge its content with the
user-provided config map. This will allow cluster-image-registry-operator to
provide certificate authorities for the integated image registry without
changing the user-provided config map.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this would generate MachineConfig.


If the config maps are updated, old certificate authorities should be removed
and new ones should be added. Nodes nor applications shouldn't be restarted to
apply changes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trevor is right here, if we need to avoid disruption here then we'll need to add these paths to the live-apply set.

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2023
@flavianmissi
Copy link
Member

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2023
name: registry-ca
```

If the image registry operator is installed, there should be a config map:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we are talking here about making image-registry operator optional, what happens when image-registry operator is not installed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The image-registry-ca config map is required for the image registry to correctly work in a cluster.
When the image registry operator is not installed in the cluster, the image-registry-ca config map will not be present.

The important thing here is that when it is present, the MCO should distribute the CA on the cluster nodes (which the registry operator currently does, but will stop doing).
It's worth noting that this would be the path of least friction. Another option would be to rely on machine configs (created by the registry operator) to distribute the CAs required by the registry, so the MCO would only need a new path for distributing user provided CAs via additionalTrustedCA in the images.config.openshift.io/cluster (to keep backwards compatibility). The downside of this approach is that could potentially disrupt other openshift components (such as builds) that also rely on the integrated registry CAs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense. thanks

Copy link
Contributor

@sinnykumari sinnykumari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few questions:

  1. Does node-ca write/update these ca today to all nodes in the cluster?
  2. I am wondering if we really need a new MCO controller for this. Even today, MCO reads and updates nodes with cluster level object(like proxy, infra) as well as configmap (like kube-cloud-config) in openshift-config-managed namespace. We could look for image cluster object and image-registry-ca cm ?
  3. Who will be doing and maintaining MCO related changes?

certificate authorities for image registries, so that openshift-apiserver can
access registries.

For them, the controller should create a merged config map in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "the controller" here referring to MCO controller or somethingelse?

@yuqi-zhang
Copy link
Contributor

I think the MCO can also manage this like it currently manages kubelet certs, but long term, we should have a separate path for cert management, which we are looking into (https://issues.redhat.com/browse/MCO-467).

+1 to the questions Sinny posted above, but I am also wondering:

  1. how often are these CAs and certificates rotated, and by what mechanism?
  2. it doesn't seem we have a consensus on yet if a reboot or a crio reload is required for this to take effect. Has this been otherwise tested?
  3. similar to Sinny's question, are these additional CAs always present, and is there always a canonical location to read these from, or is it up to the MCO to piece it together based on in-cluster objects?

@flavianmissi
Copy link
Member

Thanks for reviewing @sinnykumari, tried to answer your questions in line:

Does node-ca write/update these ca today to all nodes in the cluster?

Yes, to /etc/docker/certs.d.

I am wondering if we really need a new MCO controller for this. Even today, MCO reads and updates nodes with cluster level object(like proxy, infra) as well as configmap (like kube-cloud-config) in openshift-config-managed namespace. We could look for image cluster object and image-registry-ca cm ?

I think you and your team are best equipped to make that decision. We should probably rephrase this EP to request a "mechanism" rather than a "controller" specifically. cc @dmage could you reword? This is the first mention of "controller" that I found. There are others.

Who will be doing and maintaining MCO related changes?

We would like the MCO team to do (and maintain) these changes for us, as we are very understaffed in the image registry team 😄

@flavianmissi
Copy link
Member

@yuqi-zhang thanks for commenting, answers below.

how often are these CAs and certificates rotated, and by what mechanism?

The user provided CAs are well, rotated by the user, so we can't answer that part.
The internal registry CAs AFAIK come from the service CA operator. It seems they're updated once every other year.

it doesn't seem we have a consensus on yet if a reboot or a crio reload is required for this to take effect. Has this been otherwise tested?

I have not tested this. The node-ca daemon set does not reboot anything, and it works.

similar to Sinny's question, are these additional CAs always present, and is there always a canonical location to read these from, or is it up to the MCO to piece it together based on in-cluster objects?

Yes, it is up to the MCO to piece together the internal registry CAs with the user provided CAs.
The internal registry CAs will only be present when the registry operator is installed (right now it's always installed). The location can be changed, the suggested locations here are just suggestions, if you think we can do better then let's discuss it (:
The user provided CAs are in the openshift-config namespace, the name itself comes from the images.config.openshift.io/cluster object, under the .spec.additionalTrustedCA.

@yuqi-zhang
Copy link
Contributor

Thanks for the answers! I think then off the top of my head, the MCO will attempt to do something like this:

  1. the MCO reads AdditionalTrustedCA from images.config.openshift.io/cluster, if any exists (we actually do read images.config.openshift.io/cluster for image pull configuration)
  2. the MCO reads the internal registry CAs (place tbd)
  3. the MCO renders them (if they exist) into some field (controllerconfig?)
  4. the MCC takes that and populates a templated file at /etc/docker/certs.d into the rendered config
  5. the MCD lays that down as a disruptionless operation

With the note that long term (maybe not that long) we want certs to be managed differently. Does that mostly make sense?

When release does this need to happen in?

@flavianmissi
Copy link
Member

Thanks for coming back to us so quickly @yuqi-zhang!

Points 1 and 2 from your list make sense to me, but I can't claim to understand 3-5, I'll leave the internals to you and your team 😄 The result (CAs configured on /etc/docker/certs.d) is what matters to us, and that looks good from where I stand.

With the note that long term (maybe not that long) we want certs to be managed differently. Does that mostly make sense?

The makes complete sense! Is the long[er] term approach something that we want defined in this EP as well? Or is it worth to just note that we need a better solution, and leave the step of defining such solution to its own EP?

When release does this need to happen in?

Our PM would like this to land on 4.14. WDYT?

@yuqi-zhang
Copy link
Contributor

The makes complete sense! Is the long[er] term approach something that we want defined in this EP as well? Or is it worth to just note that we need a better solution, and leave the step of defining such solution to its own EP?

I think we don't have to include it. It can be a separate process, no need to increase scope of this 😄

Our PM would like this to land on 4.14. WDYT?

I will bring this to the team as part of planning to see how the timelines would align.

@sinnykumari
Copy link
Contributor

Thanks Flavian for answering all the questions. As Jerry said, we will need to discuss within team and see feasible this is for the MCO team.

@flavianmissi
Copy link
Member

Hi @sinnykumari and @yuqi-zhang, that is great news - thank you! The image registry team will gladly help with reviews.

I'll sync with @dmage this week so that we update this EP with a link to that ticket, as well as the clarifications we discussed here earlier.

@flavianmissi
Copy link
Member

@sdodson could you help us reviewing and eventually getting this approved?
Thanks!

@sdodson
Copy link
Member

sdodson commented Mar 14, 2023

@sdodson could you help us reviewing and eventually getting this approved?

I should have time to look at this tomorrow, sorry for the delay.

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2023
@flavianmissi
Copy link
Member

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2023
@sdodson
Copy link
Member

sdodson commented Apr 13, 2023

/approve
Light on implementation details but it's also clear that the MCO team are asking all the right questions and they've agreed to inherit this cluster responsibility.

Key for me will be ensuring as, Trevor and Colin point out, that these files are able to be live applied.

@sinnykumari You're up next for final review. If the MCO team has any implementation details they'd like to get recorded lets get those added as suggested updates.

/cc @mrunalp
Mrunal wanted to take a look too

@openshift-ci openshift-ci bot requested a review from mrunalp April 13, 2023 18:24
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2023
creation-date: 2022-12-01
last-updated: 2022-12-01
tracking-link:
- https://issues.redhat.com/browse/MCO-499
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MCO-499 track mainly work from MCO side. I think we should also have here jira card linked here that covers image registries side of work.

@sinnykumari
Copy link
Contributor

sinnykumari commented Apr 14, 2023

LGTM
@sdodson MCO team haven't fleshed out yet the overall details of how it will look like. We have some planned spikes for next sprint which will provide more details and follow-up stories into linked jira card MCO-499

@sdodson
Copy link
Member

sdodson commented Apr 19, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 19, 2023
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 6b3209f and 2 for PR HEAD 77e38ad in total

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2023
dmage and others added 2 commits April 20, 2023 11:47
…registries.md

Co-authored-by: Flavian Missi <flaviamissi@gmail.com>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 20, 2023

@dmage: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@flavianmissi
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 21, 2023
@openshift-ci openshift-ci bot merged commit 40a1bc1 into openshift:master Apr 21, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet