The MCO manages the Red Hat Enterprise Linux CoreOS (RHCOS) operating system. Further, the operating system itself is just another part of the release image, called machine-os-content. In other words, the cluster controls the operating system.
"Bootimage" vs machine-os-content
We will use the term "bootimage" to mean an initial RHCOS disk image, such as an AMI, bare metal raw disk image, VMware VMDK, OpenStack qcow2, etc. These bootimages are built using coreos-assembler.
Today, the installer pins the "bootimages"
it uses, and released installers also pin the release image. As noted above,
release images contain
machine-os-content, which can be a different
RHCOS version. You can find the installer-pinned bootimage in e.g. this file.
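If you want to see which machine-os-content a particular release image references, one way (a sketch; the release pullspec below is only an illustrative placeholder) is oc adm release info:

```
# Print the machine-os-content pullspec pinned inside a release image.
# The release pullspec is a placeholder; substitute the release you care about.
oc adm release info quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64 \
  --image-for=machine-os-content
```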
A pending enhancement describes generating and inspecting bootimage data from the release image (not yet implemented).
It's essential to understand that the bootimage and the machine-os-content container image are both essentially wrappers for an OSTree commit.
The OSTree format is an image format designed for in-place operating system updates; it operates at the filesystem level (like container images) but (unlike container runtimes) has tooling to manage things such as the bootloader and the persistence of /etc and /var.
The reason we wrap an OSTree commit inside a container image is so that the release image encapsulates basically everything about a cluster (except the bootimage). This makes it easy to mirror updates offline.
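As a hedged illustration of that mirroring property, something like oc adm release mirror copies the release image and everything it references, including machine-os-content, into a local registry (the registry names below are placeholders):

```
# Mirror a release image (and everything it references, including
# machine-os-content) into a local registry for disconnected environments.
# Both pullspecs below are placeholders.
oc adm release mirror \
  --from=quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64 \
  --to=registry.example.com/ocp4/openshift4 \
  --to-release-image=registry.example.com/ocp4/openshift4:4.8.0-x86_64
```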
Applying OS updates before kubelet
We do not want to require that a new bootimage is released for every update, and in general it can be hard to require that in every environment (for example, bare metal PXE setups).
As of today, when a node boots, the MCO serves it Ignition for configuration, including a systemd unit called machine-config-daemon-pull.service, which pulls the machine-config-daemon code onto the host; the pulled code is then run to perform an OS update and reboot.
One important property of this is that it means OS updates are applied before any potentially untrusted workloads land on the node. Because we just use podman for this, we also don't have to worry about having old kubelets join a cluster.
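If you want to poke at this on a node, a rough sketch (assuming the unit names referenced above; run from a debug shell on the host):

```
# From `oc debug node/<name>` followed by `chroot /host`:
# show the Ignition-provided pull unit and the firstboot unit's journal.
systemctl cat machine-config-daemon-pull.service
journalctl -u machine-config-daemon-firstboot.service
```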
Understanding OS updates at installation time
In this example we'll discuss AWS, but the process is much the same for booting bare metal machines via PXE, on Azure, or on a private OpenStack instance.
The openshift-installer starts, and uses the AMI it has pinned for the bootstrap node as well as for the control plane nodes.
The bootstrap node's bootkube.sh service pulls the release image, which contains a reference to the MCO (machine-config-operator) and also a reference to a newer machine-os-content.

The bootkube.sh service runs the MCO in "bootstrap" mode to generate and serve Ignition to the master machines.
The control plane nodes wait in the initramfs, retrying until they are able to fetch the Ignition config from the bootstrap node.
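For reference, the Machine Config Server (MCS) serves these configs on port 22623; a hedged sketch of fetching one manually (the hostname is a placeholder, and newer MCS versions expect the Accept header shown):

```
# Fetch the rendered master Ignition config from the Machine Config Server.
# api-int.example.com stands in for the cluster's internal API/MCS endpoint.
curl -k -H "Accept: application/vnd.coreos.ignition+json;version=3.2.0" \
  https://api-int.example.com:22623/config/master
```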
When that succeeds, the firstboot OS update process described above runs, extracting the OS update from the machine-os-content container image, and the control plane nodes each reboot (before kubelet.service has started).
When the control plane nodes reboot and form a cluster, the bootstrap node is torn down.
At this point, Ignition has been executed, and that only runs once.
machine-config-daemon-firstboot.service is no longer used for OS updates.
The master machines use the machine-api-operator to boot the workers. Each worker pulls Ignition configs from the MCS running on the control plane. The exact same process of performing an upgrade and reboot happens for each worker.
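One way to confirm afterwards that a node is running the OS delivered via machine-os-content, rather than the version baked into its bootimage, is rpm-ostree status (here <node-name> is a placeholder):

```
# Show the booted (and any staged) OS deployments on a node; the booted
# deployment should reflect the updated OS, not the original bootimage.
oc debug node/<node-name> -- chroot /host rpm-ostree status
```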
Management via the Machine Config Daemon
After a node (whether control plane or worker) has joined the cluster, the MCO takes over. Previously, each individual node was running systemd units; now changes are coordinated via the MCO.
When the administrator starts an oc adm upgrade, if a new machine-os-content is provided in the release image, it will be rolled out to the control plane and workers. Every change now will be managed by a MachineConfigPool, ensuring that only 1 machine at a time is changed (via the maxUnavailable: 1 default).
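You can watch that rollout through the pools; a small sketch (the pool name below is the default worker pool, and maxUnavailable lives on the pool spec):

```
# Watch the pools roll out the new configuration / OS content.
oc get machineconfigpools
# maxUnavailable is a MachineConfigPool spec field; 1 is the default.
oc patch machineconfigpool/worker --type=merge -p '{"spec":{"maxUnavailable":1}}'
```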
MCD host upgrade execution
Today, mostly for SELinux reasons, the MCD copies itself to the host and runs in the host context. It then provides the updated content to rpm-ostreed.service, a daemon already running on the host.
The MCD watches the systemd journal for the relevant services and proxies their logs, so you should be able to run oc -n openshift-machine-config-operator logs -c machine-config-daemon pod/machine-config-daemon-... to follow the update. See this pull request.
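To find the right daemon pod for a given node, something like the following should work (the k8s-app label is, to the best of my knowledge, the one the MCO applies to its daemonset pods, and <pod-suffix> is a placeholder):

```
# List machine-config-daemon pods and the nodes they run on,
# then follow the logs of the one you care about.
oc -n openshift-machine-config-operator get pods \
  -l k8s-app=machine-config-daemon -o wide
oc -n openshift-machine-config-operator logs -f -c machine-config-daemon \
  machine-config-daemon-<pod-suffix>
```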
When a node boots for the first time, it's the main CoreOS Ignition process which handles the config provided. This Ignition runs in the initramfs and performs any repartitioning and filesystem layout, etc.
MachineConfig objects also have higher-level fields such as extensions and kernelArguments that aren't handled by Ignition. The invention of /etc/ignition-machine-config-encapsulated.json is a way to bridge these two worlds; it's the target MachineConfig in JSON form. The MCS injects this file into the Ignition it serves to the node. Then, when the node boots, machine-config-daemon-firstboot.service (also written by Ignition) reads that file and handles the bits (kernel arguments, extensions, etc.) that weren't handled by the "main Ignition" process.
These OS-level changes are done along with the OS update process described above. This ensures that, for example, if one specifies nosmt as a kernel argument to turn off hyperthreading and more strongly isolate workloads, that kernel argument will be applied before kubelet.service starts and any workloads land on the node.
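For example, a minimal sketch of such a MachineConfig (the object name here is arbitrary):

```
# Apply a MachineConfig that adds the `nosmt` kernel argument to workers.
# The MCO renders this into the worker pool and applies it alongside OS updates.
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-nosmt
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - nosmt
EOF
```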
Questions and answers
Q: I upgraded OpenShift and noticed that my AMI hasn't changed, is this normal?
Yes, see openshift/enhancements#201 (as well as the rest of this document; we do in-place updates without changing the bootimage).
Q: Is the integrity of operating system upgrades verified?
The overall approach here is that the operating system is just one part of the cluster.
Integrity of the OpenShift platform is handled to start by the
cluster version operator.
Today the CVO will by default GPG verify the integrity of the release image
before applying it. The release image contains a sha256 digest of machine-os-content,
which is used by the MCO for updates. On the host, the container runtime
podman verifies the integrity of that
sha256 when pulling the image,
before the MCO reads its content. Hence, there is end-to-end GPG-verified integrity
for the operating system updates (as well as the rest of the cluster components
which run as regular containers).
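If you want to see the digest-pinning half of that chain directly, pulling by digest works the same way outside the cluster; podman refuses the pull if the fetched content doesn't match the sha256 (the pullspec below is the same example used later in this document):

```
# Pull by digest: podman verifies the content against the sha256 before use.
podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02d810d3eb284e684bd20d342af3a800e955cccf0bb55e23ee0b434956221bdd
```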
Q: Why do you do this weird "ostree repository in container" thing? Why ostree?
We're using a system that works; ostree is a well tested "image like" update system that has been in use for many years by multiple distributions. It handles SELinux and bootloaders, etc. We're just "encapsulating" that system inside a container image for all of the above reasons (management, etc.).
At some point in the future, though, it's likely that we will try to change the machine-os-content container to look more like an unpacked container image.
Q: How do I look at the content in the ostree repository inside the container?
You can get the ostree tool from many distributions; for example, run yum -y install ostree inside a RHEL UBI container.
From there, probably the simplest thing is to use
oc image extract
to unpack the container image. Something like this:
```
$ mkdir machine-os-content
$ oc image extract quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02d810d3eb284e684bd20d342af3a800e955cccf0bb55e23ee0b434956221bdd --path /:machine-os-content
$ find machine-os-content/srv/repo/ -name '*.commit'
machine-os-content/srv/repo/objects/33/dd81479490fbb61a58af8525a71934e7545b9ed72d846d3e32a3f33f6fac9d.commit
$ ostree --repo=machine-os-content/srv/repo ls 33dd81479490fbb61a58af8525a71934e7545b9ed72d846d3e32a3f33f6fac9d
d00755 0 0 0 /
l00777 0 0 0 /bin -> usr/bin
l00777 0 0 0 /home -> var/home
l00777 0 0 0 /lib -> usr/lib
l00777 0 0 0 /lib64 -> usr/lib64
l00777 0 0 0 /media -> run/media
l00777 0 0 0 /mnt -> var/mnt
l00777 0 0 0 /opt -> var/opt
l00777 0 0 0 /ostree -> sysroot/ostree
l00777 0 0 0 /root -> var/roothome
l00777 0 0 0 /sbin -> usr/sbin
l00777 0 0 0 /srv -> var/srv
d00755 0 0 0 /boot
d00755 0 0 0 /dev
d00755 0 0 0 /proc
d00755 0 0 0 /run
d00755 0 0 0 /sys
d00755 0 0 0 /sysroot
d01777 0 0 0 /tmp
d00755 0 0 0 /usr
d00755 0 0 0 /var
$
```