Additional trusted certificate authorities for image registries #1295
/assign @mrunalp @sinnykumari There is an intention to make the image registry operator optional, and we are looking for a new way to distribute certificate authorities for registries. It seems the MCO is the best fit for this feature. I briefly described my idea of how it could work without the registry operator; please let me know if something is unclear or you have other ideas.
### User Stories

* As a <role>, I want to <take some action> so that I can <accomplish a
Possibly:
- As a customer, I want to be able to remove the image-registry operator, so I can save some CPU/memory overhead in a cluster that does not need an internal image registry. But I still need external image registry CAs listed in the `cluster` Images distributed so that...
or some such?
* Make additionalTrustedCA be managed by machine-config-operator.
* Create a mechanism for the cluster-image-registry-operator to distribute
certificate authorities for the integrated image registry.
* Remove the node-ca daemon set.
nit: Removing the DaemonSet is a nice simplification, but I think it's an implementation/design choice, and not a goal in its own right. Possibly "minimize redundant pathways for getting data onto node disks" or something is a goal that is motivating the DaemonSet-removal?
the `openshift-config-managed` namespace and merge its content with the
user-provided config map. This will allow cluster-image-registry-operator to
provide certificate authorities for the integrated image registry without
changing the user-provided config map.
Unsaid in the current proposal, but implied by the `enhancements/machine-config/` position, is that the output of this controller will be MachineConfigs feeding into whichever MachineConfigPools are necessary to get this new content rolled out over the whole cluster. Can we spell that out in detail here, possibly with feedback from the MachineConfig folks about how to target whichever MachineConfigPools are required?
Yes, I think this would generate MachineConfig.
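The merge described in the quoted proposal text could look roughly like the Python sketch below. It is only an illustration: the function name `merge_ca_config_maps`, the example keys, and the conflict rule (the user-provided entry wins when both maps define the same key) are assumptions, not something the proposal specifies.

```python
def merge_ca_config_maps(user_data, managed_data):
    """Merge the `data` sections of two CA config maps into a new dict.

    Assumption (not stated in the proposal): on a key conflict the
    user-provided entry wins. Neither input is modified.
    """
    merged = dict(managed_data or {})
    merged.update(user_data or {})
    return merged


# Hypothetical example data; the PEM bodies are placeholders.
user = {"registry.example.com": "USER-CA-PEM"}
managed = {"image-registry.openshift-image-registry.svc..5000": "REGISTRY-CA-PEM"}
merged = merge_ca_config_maps(user, managed)
```

Returning a new dictionary keeps the user-provided config map untouched, which is the property the proposal text asks for.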
```
-----END CERTIFICATE-----
```

The cluster-config-operator should update every node to propagete these
nit: "propagete" -> "propagate"
the cluster-config-operator is a dead repo. It contains no code and should contain no code. It has no owners. It holds CRD and CR definitions required to bootstrap the first master, nothing more.
#812 had floated a node operator. Allowing folks to save some CPU/memory by removing the registry operator, while replacing it with a new node operator, doesn't actually win many resources back for the customer. But #812 shows that we have other reasons for wanting the node folks to have their own operator, including avoiding the need to do things like openshift/machine-config-operator#1453. No worries from me if the image-registry folks can find a team with an existing core operator to hand this functionality off to. Or if the node folks can continue to talk other existing core operators into letting them co-maintain node-associated portions. But I'm cross-linking #812 in case folks decide to go that route instead.
I meant machine-config-operator, it already manages files like `registries.conf`. Updated.
If the config maps are updated, old certificate authorities should be removed
and new ones should be added. Neither nodes nor applications should be restarted to
apply changes.
Will these changes require a CRI-O reload to take effect? If so, and if MachineConfig ends up being used as the delivery mechanism, this will need a bump.
The existing mechanism doesn't reload anything, and it seems to work.
Trevor is right here, if we need to avoid disruption here then we'll need to add these paths to the live-apply set.
It may actually need a restart for these fields. I can't find the special handling for a CRI-O reload for this path
@dmage and @haircommander Did we ever reach a conclusion on this? Oleg says it currently works without reloading CRI-O, but Peter says no.
@mtrmac do we read `/etc/docker/certs.d` each time we pull, or is it cached at all? I would have expected it to be cached similarly to registries.conf, but the current behavior implies it's not.
That’s https://github.com/containers/image/blob/bb66acc37f166c03470cb58d3b8c808c288e1c2a/docker/docker_client.go#L268 , and it’s not cached at all - it’s read on every pull afresh (and in case of CRI-O, actually multiple times per pull). That’s not a documented/committed behavior, but I also can’t see any reason that would motivate us to change.
(And yes, `registries.conf` is cached. Sadly, the (too many) config files in c/image were added over time and are not very consistent in their file formats, lookup locations, or behaviors.)
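To illustrate the uncached behavior described above, here is a minimal Python sketch. It is not the containers/image code (which is Go); it assumes the conventional layout where the CA files for a registry live in `<certs_dir>/<host>/*.crt`, and the function name `load_registry_cas` is hypothetical. Because the directory is read on every call, a file written between two calls is picked up without any reload.

```python
import os
import tempfile


def load_registry_cas(certs_dir, registry_host):
    """Return the contents of every *.crt file for one registry, read fresh from disk."""
    host_dir = os.path.join(certs_dir, registry_host)
    if not os.path.isdir(host_dir):
        return []
    cas = []
    for name in sorted(os.listdir(host_dir)):
        if name.endswith(".crt"):
            with open(os.path.join(host_dir, name)) as f:
                cas.append(f.read())
    return cas


# Demo in a temporary directory standing in for /etc/docker/certs.d.
certs_dir = tempfile.mkdtemp()
host = "registry.example.com:5000"  # hypothetical registry
before = load_registry_cas(certs_dir, host)  # nothing on disk yet

os.makedirs(os.path.join(certs_dir, host))
with open(os.path.join(certs_dir, host, "ca.crt"), "w") as f:
    f.write("PLACEHOLDER-CA")
after = load_registry_cas(certs_dir, host)  # sees the new file immediately
```

If the real reader behaves this way, updating the files on disk is enough for the next pull to see the new CAs, which would be consistent with the node-ca daemon set working today without a reload.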
* Make additionalTrustedCA be managed by machine-config-operator.
* Create a mechanism for the cluster-image-registry-operator to distribute
certificate authorities for the integrated image registry.
* Remove the node-ca daemon set.
Several of us on the MCO/CoreOS side were really surprised to discover this `node-ca` daemonset a while ago. The fewer things that mutate the node state the better.
cc @jkyros
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
/remove-lifecycle stale
```
name: registry-ca
```

If the image registry operator is installed, there should be a config map:
As we are talking here about making image-registry operator optional, what happens when image-registry operator is not installed?
The `image-registry-ca` config map is required for the image registry to work correctly in a cluster.
When the image registry operator is not installed in the cluster, the `image-registry-ca` config map will not be present.
The important thing here is that when it is present, the MCO should distribute the CA to the cluster nodes (which the registry operator currently does, but will stop doing).
It's worth noting that this would be the path of least friction. Another option would be to rely on machine configs (created by the registry operator) to distribute the CAs required by the registry, so the MCO would only need a new path for distributing user-provided CAs via `additionalTrustedCA` in `images.config.openshift.io/cluster` (to keep backwards compatibility). The downside of this approach is that it could potentially disrupt other OpenShift components (such as builds) that also rely on the integrated registry CAs.
make sense. thanks
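A rough Python sketch of the distribution step discussed above: turning config map entries into the files a node needs, and working out which stale files to delete. The `..` to `:` key translation follows the documented `additionalTrustedCA` convention for hosts with ports; the function names, the `ca.crt` file name, and the `/etc/docker/certs.d` root are assumptions for illustration, not the MCO's actual implementation.

```python
import posixpath

CERTS_ROOT = "/etc/docker/certs.d"  # assumed target directory on the node


def desired_ca_files(config_map_data, root=CERTS_ROOT):
    """Map config map keys (host or host..port) to {file path: PEM content}."""
    files = {}
    for key, pem in config_map_data.items():
        host = key.replace("..", ":")  # config map keys cannot contain ':'
        files[posixpath.join(root, host, "ca.crt")] = pem
    return files


def plan_sync(existing_paths, desired):
    """Return (paths to write, paths to remove) so the node matches `desired`."""
    to_write = sorted(desired)
    to_remove = sorted(set(existing_paths) - set(desired))
    return to_write, to_remove


# Hypothetical input: two registries in the merged config map, one stale file on disk.
desired = desired_ca_files({
    "registry.example.com": "PEM-A",
    "registry.example.com..5000": "PEM-B",
})
to_write, to_remove = plan_sync(
    ["/etc/docker/certs.d/old.example.com/ca.crt"], desired
)
```

When the `image-registry-ca` config map is absent, the same functions can simply be fed the user-provided data alone.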
A few questions:
- Does node-ca write/update these CAs today on all nodes in the cluster?
- I am wondering if we really need a new MCO controller for this. Even today, the MCO reads cluster-level objects (like proxy, infra) as well as config maps (like kube-cloud-config) in the openshift-config-managed namespace and updates nodes with them. We could look for the image cluster object and the `image-registry-ca` cm?
- Who will be doing and maintaining the MCO-related changes?
certificate authorities for image registries, so that openshift-apiserver can
access registries.

For them, the controller should create a merged config map in
Is "the controller" here referring to the MCO controller or something else?
I think the MCO can also manage this like it currently manages kubelet certs, but long term, we should have a separate path for cert management, which we are looking into (https://issues.redhat.com/browse/MCO-467). +1 to the questions Sinny posted above, but I am also wondering:
Thanks for reviewing @sinnykumari, tried to answer your questions in line:
Yes, to
I think you and your team are best equipped to make that decision. We should probably rephrase this EP to request a "mechanism" rather than a "controller" specifically. cc @dmage could you reword? This is the first mention of "controller" that I found. There are others.
We would like the MCO team to do (and maintain) these changes for us, as we are very understaffed in the image registry team 😄
@yuqi-zhang thanks for commenting, answers below.
The user provided CAs are well, rotated by the user, so we can't answer that part.
I have not tested this. The node-ca daemon set does not reboot anything, and it works.
Yes, it is up to the MCO to piece together the internal registry CAs with the user-provided CAs.
Thanks for the answers! I think then off the top of my head, the MCO will attempt to do something like this:
With the note that long term (maybe not that long) we want certs to be managed differently. Does that mostly make sense? Which release does this need to happen in?
Thanks for coming back to us so quickly @yuqi-zhang! Points 1 and 2 from your list make sense to me, but I can't claim to understand 3-5, I'll leave the internals to you and your team 😄 The result (CAs configured on
That makes complete sense! Is the long[er] term approach something that we want defined in this EP as well? Or is it worth just noting that we need a better solution, and leaving the step of defining such a solution to its own EP?
Our PM would like this to land in 4.14. WDYT?
I think we don't have to include it. It can be a separate process, no need to increase scope of this 😄
I will bring this to the team as part of planning to see how the timelines would align.
Thanks Flavian for answering all the questions. As Jerry said, we will need to discuss within the team and see how feasible this is for the MCO team.
Hi @sinnykumari and @yuqi-zhang, that is great news - thank you! The image registry team will gladly help with reviews. I'll sync with @dmage this week so that we update this EP with a link to that ticket, as well as the clarifications we discussed here earlier.
@sdodson could you help us review this and eventually get it approved?
I should have time to look at this tomorrow, sorry for the delay.
/approve Key for me will be ensuring, as Trevor and Colin point out, that these files are able to be live applied. @sinnykumari You're up next for final review. If the MCO team has any implementation details they'd like to get recorded, let's get those added as suggested updates. /cc @mrunalp
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cgwalters, sdodson The full list of commands accepted by this bot can be found here. The pull request process is described here
creation-date: 2022-12-01
last-updated: 2022-12-01
tracking-link:
  - https://issues.redhat.com/browse/MCO-499
MCO-499 mainly tracks work from the MCO side. I think we should also link a Jira card here that covers the image registry side of the work.
/lgtm
…registries.md Co-authored-by: Flavian Missi <flaviamissi@gmail.com>
@dmage: all tests passed!
/lgtm