Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pin and pre-load images #1481

Merged
merged 20 commits into from
Nov 1, 2023
Merged
Changes from 14 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
349 changes: 349 additions & 0 deletions enhancements/machine-config/pin-and-pre-load-images.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,349 @@
---
title: pin-and-pre-load-images
authors:
- "@jhernand"
reviewers:
- "@avishayt" # To ensure that this will be usable with the appliance.
- "@danielerez" # To ensure that this will be usable with the appliance.
- "@mrunalp" # To ensure that this can be implemented with CRI-O and MCO.
- "@nmagnezi" # To ensure that this will be usable with the appliance.
- "@oourfali" # To ensure that this will be usable with the appliance.
approvers:
- "@sinnykumari"
- "@mrunalp"
api-approvers:
- "@sdodson"
- "@zaneb"
- "@deads2k"
- "@JoelSpeed"
creation-date: 2023-09-21
last-updated: 2023-09-21
tracking-link:
- https://issues.redhat.com/browse/RFE-4482
see-also:
- https://github.com/openshift/enhancements/pull/1432
- https://github.com/openshift/api/pull/1609
replaces: []
superseded-by: []
---

# Pin and pre-load images
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW containers/bootc#128 is related to this, and would have the advantage it'd work in an ergonomic way outside of OCP too.


## Summary

Provide an mechanism to pin and pre-load container images.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the motivation for pinning? I understand pre-loading, but disqualifying the images from garbage collection in the kubelet means unused images will clutter the disk and potentially cause the disk to be overused. Is it to prevent cri-o from removing them on upgrade as it does today?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation is to be able for the cluster to operate properly in the absence of an external registry. The storage should suffice in these use-cases. The idea is to pin only platform and partner images only, not all customer images running workloads on this cluster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the only object is to get the clusters to operate without a registry, then simply pre-loading should suffice IMO. pinning doesn't seem necessary to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would you ensure the pre-loaded images don't get removed by GC before they are needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if "The storage should suffice in these use-cases" then GC should never kick in as it's currently only triggered by disk usage hitting a certain threshold

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note also that users will be adding/modifying/removing their own workloads, which can bring new images, consume disk space and trigger the GC. We don't want the pinned images to be the ones evicted by that. Do we still agree that it is better to explicitly say that some images should never be removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still agree that it is better to explicitly say that some images should never be removed?

I do, though we do need to think through what the user experience/recovery/alerting is if they do run out of storage because of pinned images and the system can't evict anything. I.e. how does the user get notified of the problem? And once they hit the problem is the cluster already broken or do they have time to take action before then? If they find out after the disk is full what's the recovery steps and what are the implications for the cluster+workloads until the recovery steps are performed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that problems due to exhausted disk space aren't new: they are already possible, and I assume that your concerns are already somehow addressed. However there are some differences:

  1. If disk exhaustion happens while trying to pull the images then we will need a mechanism to report it via the API, because this won't be initiated by the kubelet and there is no pod status where to report it. The status of the PinnedImageSet should be enough for that. CRI-O will still report the issue in its log (if there is space for that), like for any other image.

  2. If disk exhaustion happens after the images have been pulled, then we need to make sure that the kubelet GC doesn't select these images. My understanding is that that is already solved by the pinning support in CRI-O: even if kubelet asks CRI-O to delete a pinned image CRI-O will not delete it. @mrunalp can you confirm/reject this?

  3. If the recovery steps (in documentation) include deleting images then we should amend them to ensure these images aren't removed.

I'l try to include these points in the document.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to the risk ans mitgations section.


## Motivation

Slow and/or unreliable connections to the image registry servers interfere with
operations that require pulling images. For example, an upgrade may require
pulling more than one hundred images. Failures to pull those images cause
retries that interfere with the upgrade process and may eventually make it
fail. One way to improve that is to pull the images in advance, before they are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have data indicating inability to pull an image (whether due to an issue in the registry or networking connectivity or whatever) is a common cause of upgrade failures?

i think the more compelling motivation for this EP is to speed up upgrades by pre-pulling images so that when the user actually clicks the "upgrade now" button, the cluster spends less time in an upgrading state (which users tend to perceive as an unstable/risky state to be in)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have numbers. I know that the upgrade failure rate due to failures to pull images is 100% when there is no registry. That is our main use case.

I am refraining from saying that this speeds up upgrades because in my experience it doesn't. A regular upgrade with a reasonable internet connection takes approx 40 minutes. The same upgrade after pre-pulling the images also takes approx those 40 minutes, at least that is my experience. I am assuming that is different when there is a severe bandwidth limitation, but I didn't test that scenario.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safe to say that downloading the images in advance provides a more consistent upgrade time in environments with limited connectivity. This is important when scheduling upgrades into maintenance windows, even if the upgrade might not otherwise fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @avishayt, added your comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bparees I know that it's not uncommon for upgrade CI jobs to have some image pulls get throttled entering ImagePullBackOff, that's in small five node clusters, if the clusters were significantly larger the likelihood of throttling increases because we'd have more individual nodes pulling images.

actually needed, and ensure that they aren't removed. Doing that provides a
more consistent upgrade time in those environments. That is important when
scheduling upgrades into maintenance windows, even if the upgrade might not
otherwise fail.

### User Stories

#### Pre-load and pin upgrade images

As the administrator of a cluster that has a low bandwidth and/or unreliable
connection to an image registry server I want to pin and pre-load all the
images required for the upgrade in advance, so that when I decide to actually
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we picking the image to pre-load? Pulling the full payload may be more than what's actually needed. Are we going to look at the currently running images and only pull newer references of those from the payload?

Copy link
Contributor Author

@jhernand jhernand Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Determining the set of image references for an upgrade is out of the scope for this enhancement. It will be part of a separate enhancement (based on the now obsolete #1432 ) describing the orchestration of the upgrade without a registry.

Anyhow, the initial idea was to have a single PinnedImageSet with all the payload images, for all the nodes. We added the spec.nodeSelector to be able to have different sets of images for control plane nodes and worker nodes. For example:

# For control plane nodes:
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
  name: my-control-plane-pinned-images
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  pinnedImages:
  ... 

---   
      
# For worker nodes:
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata: 
  name: my-worker-pinned-images
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  pinnedImages:
  ...

I expect that the calculation of those sets of images for the control plane nodes and the worker nodes will be done in advance by CVO or by the tool that creates the upgrade bundle. We don't intend to look at the currently running images for that.

Note that this same PinnedImageSet mechanism could be used by other tools or applications that calculate a set of images looking at the currently running ones; the mechanism doesn't forbid that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd indeed keep that aside for now. the way those are calculated is in parallel to providing the relevant APIs to allow that to work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with not addressing this now and just viewing it as roughly equivalent to mirroring the release image plus any additional images, however long term I'd like to see us move toward all upgrades pre-fetching and pinning images and before we get there I'd like to ensure that we try to limit the images as much as reasonable.

We should however try to make it clear what disk space will be consumed by this effort.

perform the upgrade there will be no need to contact that slow and/or
unreliable registry server and the upgrade will successfully complete in a
predictable time.

#### Pre-load and pin application images

As the administrator of a cluster that has a low bandwidth and/or unreliable
connection to an image registry server I want to pin and pre-load the images
required by my application in advance, so that when I decide to actually deploy
it there will be no need to contact that slow and/or unreliable registry server
and my application will successfully deploy in a predictable time.

### Goals

Provide a mechanism that cluster administrators can use to pin and pre-load
container images.

### Non-Goals

We wish to use the mechanism described in this enhancement to orchestrate
upgrades without a registry server. But that orchestration is not a goal
of this enhancement; it will be part of a separate enhancement based on
parts of https://github.com/openshift/enhancements/pull/1432.

## Proposal

### Workflow Description

1. The administrator of a cluster uses the new `PinnedImageSet` custom resource
to request that a set of container images are pinned and pre-loaded:

```yaml
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
name: my-pinned-images
spec:
nodeSelector:
matchLabels:
node-role.kubernetes.io/control-plane: ""
pinnedImages:
- quay.io/openshift-release-dev/ocp-release@sha256:...
- quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
- quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
...
```

1. The machine config operators ensures that all the images are pinned and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one final thing I am not clear on: who is pulling the images?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should determine that before we move forward, though I could be convinced otherwise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion is to have a new PinnedImageSetController sub-controller in the machine-config-controller. I will try to clarify that in this sentence. It is also explained in the implementation details section:

The machine-config-controller will grow a new PinnedImageSetController sub
controller that will watch the PinnedImageSet custom resources. When one of
those is created, updated or deleted the controller will start a daemon set that
will do the following in each node of the cluster:

  1. Verify that there is enough disk space available in the machine to pull the
    images.

  2. Create, modify or delete the CRI-O pinning configuration file.

  3. Reload the CRI-O configuration, with the equivalent of systemctl reload crio.

    Note that currently CRI-O doesn't recalculate the set of pinned images on
    reload, support for that will need to be added.

  4. Use the CRI-O gRPC API to run the equivalent of crictl pull for each of the
    images.

This sub-controller can't pull the images directly because that requires talking to the gRPC API of CRI-O, and that is only available via localhost in each node. It is for that reason that I am suggesting to add a daemon set to do it. It could also be part of the machine-config-daemon instead of a new daemon set, but I am not familiar enough with that code to say what is better. I also tried to explain that in the implementation details section.

Ultimately the images will be pulled by the CRI-O running in each node, think of it as going to each node and running crictl pull ... for each image.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this similar to how MCO applies kubeletconfig and containerruntime config on node. kubeletconfig and containerruntime controller processes respective CRD applied to the cluster which in results renders a new MachineConfig. Node sub-controller applies desired config annotation on node to signal machine-config-deamon (MCD running on each node ) to apply the change. In similar way, PinnedImageSetController would process the applied PinnedImageSetController CRD and a new rendered MachineConifg gets generated. MCD learns to apply the change by pulling those images using crictl pull and whatever needs to be done locally on the node. Alternatively, we can introduce a new daemon if needed due to any unknown limitation in MCD. Note that introducing a new daemon would still require some sort of co-ordination so that they don't go sideways like MCD started drain/reboot due a change and at the same time new daemon to pulling the images)

pulled in all the nodes that match the node selector.

### API Extensions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my understanding, we will also introduce a pinnedImageSet CRD in MCO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is what I mention below:

A new PinnedImageSet object will be added to the
machineconfiguration.openshift.io API group.

Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By CRD for pinnedImageSet I meant similar to CRD we have today like machineconfig , kubeletconfig so that admin have friendly output to see by oc or equivalent utility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is what I mean when I say "object". I replaced "object" with "custom resource" to try to make it clear. So, we will have a new PinnedImageSet custom resource. It is defined in detail in openshift/api#1609.

Copy link
Member

@zaneb zaneb Oct 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, a new type in the k8s API is a CRD (Custom Resource Definition) and an instance of that type is a CR (or object)</pedantry>


A new `PinnedImageSet` custom resource definition will be added to the
`machineconfiguration.openshift.io` API group.

The new custom resource definition is described in detail in
https://github.com/openshift/api/pull/1609.

### Implementation Details/Notes/Constraints

Starting with version 4.14 of OpenShift CRI-O will have the capability to pin
certain images (see [this](https://github.com/cri-o/cri-o/pull/6862) pull
request for details). That capability will be used to pin all the images
required for the upgrade, so that they aren't garbage collected by kubelet and
CRI-O.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not quite true today. These images aren't currently disqualified for image removal on upgrade that CRI-O currently does

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: we're working on dropping this behavior in CRI-O but it will take some time (4.17 time frame maybe??)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So images will be removed when CRI-O itself is upgraded regardless of being pinned? If so, is there any way around it? Would it help to disable the crio-wipe service.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah there's a setting to disable it. I also think it's reasonable to disqualify pinned images in the restart flow

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crio-wipe.service no longer does the image wiping, it happens internally-- setting the version_file_persist config field to "" will disable it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include that change to CRI-O (disqualify pinned images in the restart flow) in this enhancement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, added a paragraph explaining that.


In addition when the CRI-O service is upgraded and restarted it removes all the
images. This used to be done by the `crio-wipe` service, but is now done
internally by CRI-O. It can be avoided setting the `version_file_persist`
configuration parameter to "", but that would affect all images, not just the
pinned ones. This behavior needs to be changed in CRI-O so that pinned images
aren't removed, regardless of the value of `version_file_persist`.
Comment on lines +126 to +127
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


The changes to pin the images will be done in a file inside the
`/etc/crio/crio.conf.d` directory. To avoid potential conflicts with files
manually created by the administrator the name of this file will be the name of
the `PinnedImageSet` custom resource concatenated with the UUID assiged by the
API server. For example, if the custom resource is this:

```yaml
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
name: my-pinned-images
uuid: 550a1d88-2976-4447-9fc7-b65e457a7f42
spec:
pinnedImages:
- quay.io/openshift-release-dev/ocp-release@sha256:...
- quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
- quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
...
```

Then the complete path will be this:

```txt
/etc/crio/crio.conf.d/my-pinned-images-550a1d88-2976-4447-9fc7-b65e457a7f42.conf
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I maybe wrong, but all these extra constraint makes me think that we may need a new sub-controller in MCC to render the pinnedImageSet and apply the extra logic to generate a file with filename having amended uuid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar enough with MCC to be certain, but what you say sounds reasonable to me. Added a sentence explaining that:

This logic to handle the pinned image sets and generate the configuration files
will be part of a new sub-controller in MCC.

Does that look good?

Copy link
Contributor

@hexfusion hexfusion Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The authoritative config is based on a sorting of the files done by crio. So for example if [crio.image] contained pinned_images in the following files.

00-default.conf
my-pinned-images-550a1d88-2976-4447-9fc7-b65e457a7f42.conf
01-my files.conf

the later 01-my files.conf would take preference.


The content of the file will the `pinned_images` parameter containing the list
of images:

```toml
pinned_images=[
"quay.io/openshift-release-dev/ocp-release@sha256:...",
"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...",
"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...",
...
]
```

In addition to the list of images to be pinned, the `PinnedImageSet` custom
resource will also contain a node selector. This is intended to support
different sets of images for different kinds of nodes. For example, to pin
different images for control plane and worker nodes the user could create two
`PinnedImageSet` custom resources:

```yaml
# For control plane nodes:
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
name: my-control-plane-pinned-images
spec:
nodeSelector:
matchLabels:
node-role.kubernetes.io/control-plane: ""
pinnedImages:
...

---

# For worker nodes:
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
name: my-control-plane-pinned-images
jhernand marked this conversation as resolved.
Show resolved Hide resolved
spec:
nodeSelector:
matchLabels:
node-role.kubernetes.io/worker: ""
pinnedImages:
...
```

This is specially convenient for pinning images for upgrades: there are many
images that are needed only by the control plane nodes and there is no need to
have them consuming disk space in worker nodes.

When no node selector is specified the images will be pinned in all the nodes
of the cluster.

When a `PinnedImageSet` custom resource is added, modified or deleted the
machine config operator will create, modify or delete the configuration file,
reload the CRI-O configuration (with the equivalent of `systemctl reload crio`)
and then it will use the CRI-O gRPC API to run the equivalent of `crictl pull`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pinned images don't currently get reset on reload, support for that will need to be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @haircommander. Added a paragraph explaining that:

Note that currently CRI-O doesn't reset pinned images on reload, support for
that will need to be added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will that be added?

What specifically do we mean by "reset" here? If I land a new PinnedImageSet config and reload will cri-o pick that up on reload but if I were to modify or delete it then those changes wouldn't be picked up on reload?

As long as new configs are applied on reload I think this is likely manageable, infact some customers may wish to preserve their ability to preform a z-stream rollback not wishing to delete the previous PIS until subsequent update. That would however mean that they need storage capacity for up to 3 PIS, old, current, and to-be applied versions. When applying the update they could remove the old version and allow those images to be garbage collected after reboot.

Copy link
Member

@sdodson sdodson Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhernand @sinnykumari Any thoughts about what specifically is doing the pulling here? Is that MCD on the node growing a controller to iterate over all PIS .conf files? the MCO itself or MCC calling CRI-O remotely over gRPC? I assume this interface isn't already network accessible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very familiar with the machine config code, but I'd say that this should be a new machine-config-controller sub-controller that watches the PinnedImageSet custom resources. When it comes to actually pulling the images it should either create a new daemon set or else let the machine-config-daemon do it, because that task requires access to the gRPC API of CRI-O which is only available via localhost. In the prototype we used a daemon set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind updating it to be clear that either the MCD or another new DaemonSet will be responsible for monitoring the PIS configs and retrieving images?

Copy link
Contributor Author

@jhernand jhernand Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the implementation details section to try to clarify that there will be a new machine-config-controller sub controller and a new daemon set. Please check again: 25fc032

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we want to create a new deamonset to fetch images. MCO team didn't have any implementation detail discussion for it. But per my understanding, how I see this is MCC gains a new sub-controller to process PIS Custom Resource Object supplied to the cluster and renders it into rendered MacineConifg. MCD is where we do usually node level processing. So, if MCD can handle writing PIS related config on node as well as pulling listed images then we should use MCD. Otherwise, we can look for adding a new daemonset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. I think this is an implementation detail that we don't need to close in this document.

for each of the images.

Note that currently CRI-O doesn't reset pinned images on reload, support for
that will need to be added.

Note that this will happen in all the nodes of the cluster that match the node
selector.

When all the images have been succesfully pinned and pulled in all the matching
nodes the `Ready` condition will be set to `True`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is only 1 status value here, does the MCO have a way to track progress across nodes? Can the user see that progress somehow?

What happens if MCO/MCC is restarted in the middle of pinning a set of images, will it have the information it needs to recover?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the currently proposed API there is no mechanism to track progress across nodes. We did have that in the prototype: a set of per-node list of images successfully pulled. But I decided to left that out here because it is not really necessary for the user of the API: all what the user of the API needs to know is if the images have been successfully pulled (in all the nodes) or not. I acknowledge that per-node progress information may be needed to actually implement this, but I'd consider it an implementation detail that doesn't need to be here. Let me know if you disagree and I will add those details.

When MCO/MCC come back after a restart or crash they will know if the images have been pulled and pinned already (that is part of the PinnedImageSet status). If they haven't then the process can start again: a typical reconciliation. Some image may already have been pulled. CRI-O will be asked to pull it again, it will contact the registry, verify that it is already pulled, and do nothing (layers are already downloaded, and won't be downloaded again).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurred to me that node objects contain the list of images on the node because I believe the scheduler prefers to schedule pods toward nodes where the image is already present. So this seems like it could be computed centrally but the cost of doing so may be high to watch or list all nodes and confirm that all currently pinned images are present on all of them.

I'm fine with nodes working autonomously toward pulling all of the images but I think there needs to be a central point to answer the question "Right now, is a given PinnedImageSet pulled on all nodes." And ideally in the future PinnedImageSet could be extended to indicate which release image it contains and then the CVO can inspect all PinnedImageSets to know if there's a pending operation for the release payload that's being proposed for an upgrade.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the currently proposed API (see openshift/api#1609) there is a Ready condition to indicate that all the images of a PinnedImageSet have been pinned and pulled in all the nodes of the cluster. For example:

apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
  name: my-worker-pinned-images
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  pinnedImages: [ ... ]
status:
  conditions:
  - type: Ready
    status: "True"   # <-- This indicates that all the images have been pulled in all the nodes.
  - type: Failed
    status: "False"

That is what users of the API need. For example, CVO would need just to check this, it won't be interested in knowing what is the progress for specific nodes.

The magic that is missing and I didn't describe in the document is how that is calculated. There are multiple mechanisms that can be used for that. For example, we could add an annotation (or label) to nodes to indicate that all the images are pulled in that particular node, and then the machine-config-controller sub-controller could watch nodes and calculate this. Or we could put that in the status of the PinnedImageSet:

...
status:
  conditions:
  - type: Ready
    status: "False"
  - type: Failed
    status: "False"
  nodeConditions:  # <-- This is a dictionary, where keys are node names.
    node0:
    - type: Ready
      status: "True"   # <-- This indicates 'node0' pulled all the images.
    - type: Failed
      status: "False"
    node1:
    - type: Ready
      status: "False"   # <-- This indicates 'node1' pulled all the images.
    - type: Failed
      status: "False"
    ...

I'd say that this latter is better, because all the information is in one single place: less objects to watch.

There are other alternatives, like having a list of node names that have all the images pulled (smaller, so better for large clusters), having it only in memory of the machine-config-controller sub-controller (assuming it uses the leadership mechanism), etc. I am refraining from writing any of that in this document because I believe it is an implementation detail that is of no interest to the user of the API.

Regarding how CVO will potentially use this, I think it should create a PinnedImageSet with a name (or label, or annotatoin) that contains the version number. For example, to upgrade to 4.13.150 it would create something like this:

apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
  name: upgrade-worker-4.13.150
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  pinnedImages: [ ... ]

CVO will then be watching these objects, and process only the ones that it created for that particular upgrade. When all of them have the Ready condition set to True (the one that says "for all nodes") it will then proceed to the next steps of the upgrade.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We maybe able to utilize and extend MCO state reporting API to indicate status of PinnedImageSet on a node openshift/api#1596.
Also, I beleive in the end MCO will manage, apply and process PinnedImageSet nodes at pool level? So, MCP status should throw error if it failed to apply PinnedImageSet on a node in the pool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that the machine-config-daemon (or a new daemon set, whatever we decide to implement this) will need to enumerate the PinnedImageSet custom resources, and for each one check if the spec.nodeSelector matches the node where it is running. If it matches then it will need to pull the images. I guess I am missing something because I don't see the difficulty. @sinnykumari @haircommander @yuqi-zhang can you elaborate on what is your concern so that I can try to address it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just got to this thread today. So as I understand the PinnedImageSet object, the final goal is just to make sure each node is able to call cri-o with a list of images it needs to pull, correct?

In a sense if we wish to process this logic separately, we don't need to tie them to machineconfig pools. In fact we probably don't even need a new controller. The "smaller" implementation is just a daemon with RBAC access to PinnedImageSet object + its own node object, and is able to watch for changes. Once any PIS changes, the daemon can cross check with its node object to see if images need to be pulled, and just run a crio call. So I guess there isn't really a need to leverage MCO directives.

This is just one way to interpret this. I could be missing some critical big picture here though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to have the images available in the nodes, calling CRI-O to do it is just how we suggest to implement it.

Note that CRI-O doesn't have a call to pull multiple images at once, instead each image requires a separate call. See the PullImage RPC.

Your implementation suggestion looks reasonable to me @yuqi-zhang. I tried to not be too specific about the fine details because I assume we will decide about that during the implementation. The document already mentions that this can be part of the machine-config-daemon or of an independent daemon set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, thanks, then the general idea seems fine to me. I think we can follow up on implementation details on the daemonset later.

@sinnykumari was I able to cover your concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Jerry, sounds good. I have been thinking it as a sub-controller and use MCD to make sure that these node operations are happening in co-ordinated way in MCO which I captured in #1481 (comment) . But yes, implementing them would make things more clearer on the best path.


```yaml
status:
conditions:
- type: Ready
status: "True"
- type: Failed
status: "False"
```

If something fails the `Failed` condition will be set to `True`, and the
details of the error will be in the message. For example if `node12` fails to
pull an image:

```yaml
status:
pinnedImages:
- quay.io/openshift-release-dev/ocp-release@sha256:...
- quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
conditions:
- type: Ready
status: "False"
- type: Failed
status: "True"
message: Node 'node12' failed to pull image `quay.io/...` because ...
```

This logic to handle the pinned image sets and generate the configuration files
will be part of a new sub-controller in MCC.

### Risks and Mitigations

Pre-loading disk images can consume a large amount of disk space. For example,
pre-loading all the images required for an upgrade can consume more 32 GiB.
This is a risk because disk exhaustion can affect other workloads and the
control plane. There is already a mechanism to mitigate that: the kubelet
garbage collection. To ensure disk space issues are reported and that garbage
collection doesn't interfere we will introduced new mechanisms:

1. If disk exhaustion happens while trying to pre-load images the issues will
be reported explicitly via the status of the `PinnedImageSet` status. Note
typically these issues are reported as failures to pull images in the status of
pods, but in this case there are no pods pulling the images. CRI-O will still
report these issues in the log (if there is space for that), like for any other
image pull.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we then do anything with the PIS crio config? Do we just keep trying to pull pinned images hoping that garbage collection frees sufficient space to complete the operation?

If we delete it then that doesn't really free the space until cri-o gets restarted per earlier notes about resetting. This worries me that we could wedge nodes, should we ensure that before we pull images we compute the space which the garbage collection threshold would trigger at and abort if the content we'd pull exceeds the minimum free space available and not pull if GC is in progress?

32GiB to pull, 320GiB disk with 10% GC threshold wouldn't allow us to pull, but any threshold higher than that would because we can be sure if GC is not running we've got enough space to complete our operation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We intend to add the upgrade orchestration in a separate EP, based on #1432. I would like to do after this and #1483 are approved.

My understanding is that "reset" in this context means that when CRI-O receives the reload signal it needs to forget the previous set of pinned images and rebuild it from the configuration files. According to @haircommander it doesn't do that today.

Note that our suggestion is that when some PinnedImageSet changes first CRI-O is reconfigured and reloaded, and only then images start to be pulled. Imagine, for example, that you have images A and B pinned and then you remove B and add C. What would happen first is that CRI-O is reconfigured so that only A and C are pinned. That is unlikely to fail, as it only involves generating configuration files and sending signals. Only after that would we start to pull the images. That is more likely to fail, for example because C cannot be pulled. But even if that fails Kubelet can still garbage collect B.

I'd say that if disk is exhausted while pulling the images the behavior should be similar to what happens when a regular pod fails to pull an image. We retry that, so we should retry this as well.

Computing the disk space needed in advance would be helpful indeed. But implementing it will be challenging. It means contacting the registry to figure out the size of the layers, but also figuring out how much space those layers will take when extracted, and that requires intimate knowledge of the underlying workings of the container library. Would it be acceptable to use an heuristic? For example, can we contact the registry to find the sizes of the layers and then assume that the space consumed will be twice that? I say twice because that is roughly the case of the current OCP payloads. @sdodson let me know of you think that is acceptable.

Copy link
Member

@sdodson sdodson Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, 2x the size of the layers seems like a good start to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, added the disk space check details to the implementation details and risk mitigation sections.


1. If disk exhaustion happens after the images have been pulled, then we will
ensure that the kubelet garbage collector doesn't select these images. That
will be handled by the image pinning support in CRI-O: even if kubelet asks
CRI-O to delete a pinned image CRI-O will not delete it.

1. The recovery steps in the documentation will be amended to ensure that these
images aren't deleted to recover disk space.

### Drawbacks

This approach requires non trivial changes to the machine config operator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least as currently defined another drawback is the implication that every pinned image is pulled/pinned on every node, even though most of the nodes will never run pods using a given image. (e.g. master nodes will never run a user workload image. and worker nodes will never run control plane images. but there are more subtle cases too).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are subtleties, like compact three node clusters, where control plane nodes do run workloads. I think we can address this concern with the "pinned image set" concept that includes a node selector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently ctrcfgs are run per MCP, so a different set could be configured per control plane and worker pools, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can support different sets of images for control plane and workers adding a nodeSelector to then pinned image set, something like this:

pinnedImageSets:
- name: upgrade-control-plane
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  images:
  - quay.io/openshift-release-dev/ocp-release@sha256:...
  - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
  - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
  ...
- name: upgrade-workers
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  images:
  - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
  - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...


## Design Details

### Open Questions

None.

### Test Plan

We add a CI test that verifies that images are correctly pinned and pre-loaded.

### Graduation Criteria

The feature will ideally be introduced as `Dev Preview` in OpenShift 4.X,
moved to `Tech Preview` in 4.X+1 and declared `GA` in 4.X+2.

#### Dev Preview -> Tech Preview

- Availability of the CI test.

- Obtain positive feedback from at least one customer.

#### Tech Preview -> GA

- User facing documentation created in
[https://github.com/openshift/openshift-docs](openshift-docs).

#### Removing a deprecated feature

Not applicable, no feature will be removed.

### Upgrade / Downgrade Strategy

Upgrades from versions that don't support the `PinnedImageSet` custom resource
don't require any special handling because the custom resource is optional:
there will be no such custom resource in the upgraded cluster.

Downgrades to versions that don't support the `PinnedImageSet` custom resource
don't require any changes. The existing pinned images will be ignored in the
downgraded cluster, and will eventually be garbage collected.

### Version Skew Strategy

Not applicable.

### Operational Aspects of API Extensions

#### Failure Modes

Image pulling may fail due to lack of disk space or other reasons. This will be
reported via the conditions in the `PinnedImageSet` custom resource. See the
risks and mitigations section for details.
jhernand marked this conversation as resolved.
Show resolved Hide resolved

#### Support Procedures

Nothing.

## Implementation History

There is an initial prototype exploring some of the implementation details
described here in this [https://github.com/jhernand/upgrade-tool](repository).

## Alternatives

The alternative to this is to manually pull the images in all the nodes of the
cluster, manually create the `/etc/crio/crio.conf.d/pin.conf` file and manually
reload the CRI-O service.

## Infrastructure Needed

Infrastructure will be needed to run the CI test described in the test plan
above.