Supporting CRI-ContainerD on Windows
Table of Contents
- Release Signoff Checklist
- User Stories
- Implementation Details/Notes/Constraints
- Risks and Mitigations
- Design Details
- Implementation History
- Infrastructure Needed
Release Signoff Checklist
ACTION REQUIRED: In order to merge code into a release, there must be an issue in kubernetes/enhancements referencing this KEP and targeting a release milestone before Enhancement Freeze of the targeted release.
For enhancements that make changes to code or processes/procedures in core Kubernetes i.e., kubernetes/kubernetes, we require the following Release Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These checklist items must be updated for the enhancement to be released.
- kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- KEP approvers have set the KEP status to implementable
- Design details are appropriately documented
- Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- Graduation criteria is in place
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Note: Any PRs to move a KEP to implementable, or significant changes once it is marked implementable, should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).
Note: This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
The ContainerD maintainers have been working on CRI support which is stable on Linux, but is not yet available for Windows as of ContainerD 1.2. Currently it’s planned for ContainerD 1.3, and the developers in the Windows container platform team have most of the key work merged into master already. Supporting CRI-ContainerD on Windows means users will be able to take advantage of the latest container platform improvements that shipped in Windows Server 2019 / 1809 and beyond.
Windows Server 2019 includes an updated host container service (HCSv2) that offers more control over how containers are managed. This can remove some limitations and improve Kubernetes API compatibility. However, the current Docker EE 18.09 release has not been updated to work with the Windows HCSv2; only ContainerD has been migrated. Moving to CRI-ContainerD allows the Windows OS team and Kubernetes developers to focus on an interface designed to work with Kubernetes, which should improve compatibility and accelerate development.
Additionally, users could choose to run with only CRI-ContainerD instead of Docker EE if they wanted to reduce the install footprint or produce their own self-supported CRI-ContainerD builds.
- Improve the matrix of Kubernetes features that can be supported on Windows
- Provide a path forward to implement Kubernetes-specific features that are not available in the Docker API today
- Align with dockershim deprecation timelines once they are defined
- Running Linux containers on Windows nodes. This would be addressed as a separate KEP since the use cases are different.
- Deprecating dockershim. This is out of scope for this KEP. The effort to migrate that code out of tree is in KEP PR 866 and deprecation discussions will happen later.
Improving Kubernetes integration for Windows Server containers
Moving to the new Windows HCSv2 platform and ContainerD would allow Kubernetes to add support for:
- Mounting single files, not just folders, into containers
- Termination messages (depends on single file mounts)
- /etc/hosts (c:\windows\system32\drivers\etc\hosts) file mapping
Improved isolation and compatibility between Windows pods using Hyper-V
Hyper-V enables each pod to run within its own hypervisor partition, with a separate kernel. This means that we can build forward compatibility for containers across Windows OS versions. For example, a container built using Windows Server 1809 could be run on a node running Windows Server 1903. This pod would use the Windows Server 1809 kernel to preserve full compatibility, and other pods could run using either a shared kernel with the node, or their own isolated Windows Server 1903 kernels. Containers requiring 1809 and 1903 (or later) cannot be mixed in the same pod; they must be deployed in separate pods so the matching kernel may be used. The reverse direction is not supported: running Windows Server version 1903 containers on a Windows Server 2019/1809 host will not work.
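The compatibility rules in the paragraph above can be summarized in a short sketch. This is only an illustration of the rules as stated here, not code from the kubelet or ContainerD; the function names, the `RELEASE_ORDER` list, and the isolation strings `"process"` and `"hyperv"` are assumptions made for the example.

```python
# Illustrative ordering of the Windows Server releases named in this KEP,
# oldest first. A real implementation would compare OS build numbers.
RELEASE_ORDER = ["1809", "1903"]


def can_run(container_release, node_release, isolation):
    """Return True if a container built for container_release is expected
    to start on a node running node_release under the given isolation."""
    c = RELEASE_ORDER.index(container_release)
    n = RELEASE_ORDER.index(node_release)
    if isolation == "process":
        # Shared kernel: container and node OS versions must match.
        return c == n
    if isolation == "hyperv":
        # Isolated kernel: an older or equal container release can run on
        # a newer node, but a newer container cannot run on an older node.
        return c <= n
    raise ValueError(f"unknown isolation mode: {isolation}")


def pod_is_valid(container_releases):
    """All containers in a pod share one kernel, so they must target the
    same release."""
    return len(set(container_releases)) == 1
```

For example, `can_run("1809", "1903", "hyperv")` is true, while `can_run("1903", "1809", "hyperv")` and `can_run("1809", "1903", "process")` are both false.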
In addition, some customers may desire hypervisor-based isolation as an additional line of defense against a container break-out attack.
Adding Hyper-V support would use RuntimeClass. Three typical RuntimeClass names would be configured in CRI-ContainerD to support common deployments:
- runhcs-wcow-process [default] - process isolation is used, container & node OS version must match
- runhcs-wcow-hypervisor - Hyper-V isolation is used, Pod will be compatible with containers built with Windows Server 2019 / 1809. Physical memory overcommit is allowed with overages filled from pagefile.
- runhcs-wcow-hypervisor-1903 - Hyper-V isolation is used, Pod will be compatible with containers built with Windows Server 1903. Physical memory overcommit is allowed with overages filled from pagefile.
Using Hyper-V isolation does require some extra memory for the isolated kernel & system processes. This could be accounted for by implementing the PodOverhead proposal for those runtime classes. We would include a recommended PodOverhead in the default CRDs, likely between 100-200M.
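As a rough sketch of what those RuntimeClass objects might look like, the snippet below builds one manifest per handler name from the list above. It assumes the `node.k8s.io/v1beta1` API version and the `overhead.podFixed` field from the PodOverhead proposal; the `150M` figure is a placeholder inside the 100-200M range mentioned above, not a measured value.

```python
import json

# Handler names come from the list above. The overhead values are
# placeholders (assumptions), not recommendations.
HANDLERS = {
    "runhcs-wcow-process": None,
    "runhcs-wcow-hypervisor": "150M",
    "runhcs-wcow-hypervisor-1903": "150M",
}


def runtime_class(name, memory_overhead=None):
    """Build a RuntimeClass manifest as a plain dict."""
    manifest = {
        "apiVersion": "node.k8s.io/v1beta1",  # assumed API version
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": name,
    }
    if memory_overhead is not None:
        # Only the Hyper-V classes carry extra memory for the isolated kernel.
        manifest["overhead"] = {"podFixed": {"memory": memory_overhead}}
    return manifest


manifests = [runtime_class(name, mem) for name, mem in HANDLERS.items()]

if __name__ == "__main__":
    print(json.dumps(manifests, indent=2))
```

The process-isolated class carries no overhead entry because it shares the node kernel.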
Improve Control over Memory & CPU Resources with Hyper-V
The Windows kernel itself cannot provide reserved memory for pods, containers or processes. They are always fulfilled using virtual allocations which could be paged out later. However, using a Hyper-V partition improves control over memory and CPU cores. Hyper-V can either allocate memory on-demand (while still enforcing a hard limit), or it can be reserved as a physical allocation up front. Physical allocations may be able to enable large page allocations within that range (to be confirmed) and improve cache coherency. CPU core counts may also be limited so a pod only has certain cores available, rather than shares spread across all cores, and applications can tune thread counts to the actually available cores.
Operators could deploy additional RuntimeClasses with more granular control for performance critical workloads:
- 2019-Hyper-V-Reserve: Hyper-V isolation is used, Pod will be compatible with containers built with Windows Server 2019 / 1809. Memory reserve == limit, and is guaranteed to not page out.
- 2019-Hyper-V-Reserve-Core: Same as above, except all but CPU cores are masked out.
- 1903-Hyper-V-Reserve: Hyper-V isolation is used, Pod will be compatible with containers built with Windows Server 1903. Memory reserve == limit, and is guaranteed to not page out.
- 1903-Hyper-V-Reserve-Core: Same as above, except all but CPU cores are masked out.
Improved Storage Control with Hyper-V
Hyper-V also brings the capability to attach storage to pods using block-based protocols (SCSI) instead of file-based protocols (host file mapping / NFS / SMB). These capabilities could be enabled in HCSv2 with CRI-ContainerD, so this could be an area of future work. Some examples could include:
Attaching a "physical disk" (such as a local SSD, iSCSI target, Azure Disk or Google Persistent Disk) directly to a pod. The kubelet would need to identify the disk beforehand, then attach it as the pod is created with CRI. It could then be formatted and used within the pod without being mounted or accessible on the host.
Creating Persistent Local Volumes using a local virtual disk attached directly to a pod. This would create local, non-resilient storage that could be formatted from the pod without being mounted on the host. This could be used to build out more resource controls such as fixed disk sizes and QoS based on IOPs or throughput and take advantage of high speed local storage such as temporary SSDs offered by cloud providers.
Enable runtime resizing of container resources
With virtual-based allocations and Hyper-V, it should be possible to increase the limit for a running pod. This won’t give it a guaranteed allocation, but will allow it to grow without terminating and scheduling a new pod. This could be a path to vertical pod autoscaling. This still needs more investigation and is mentioned as a future possibility.
The work needed will span multiple repos, so SIG-Windows will maintain a Windows CRI-Containerd Project Board to track everything in one place.
Proposal: Use RuntimeClass Scheduler to simplify deployments based on OS version requirements
As of version 1.14, RuntimeClass is not considered by the Kubernetes scheduler. There’s no guarantee that a node can start a pod, and it could fail until it’s scheduled on an appropriate node. Additional node labels and nodeSelectors are required to avoid this problem. RuntimeClass Scheduling proposes being able to add nodeSelectors automatically when using a RuntimeClass, simplifying the deployment.
Windows forward compatibility will bring a new challenge as well because there are two ways a container could be run:
- Constrained to the OS version it was designed for, using process-based isolation
- Running on a newer OS version using Hyper-V. This second case could be enabled with a RuntimeClass. If a separate RuntimeClass was used based on OS version, this means the scheduler could find a node with matching class.
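To make the idea of automatically added nodeSelectors concrete, here is a minimal sketch of the merge step. It is not the RuntimeClass Scheduling implementation; the function name, the `example.com/...` labels, and the mapping of class names to selectors are hypothetical, and rejecting conflicting selectors is an assumed policy.

```python
import copy


def apply_runtime_class_selector(pod_spec, runtime_class_selectors):
    """Return a copy of pod_spec with the nodeSelector required by its
    RuntimeClass merged in. Raises ValueError on conflicting values."""
    result = copy.deepcopy(pod_spec)
    name = pod_spec.get("runtimeClassName")
    if name is None:
        return result
    merged = dict(pod_spec.get("nodeSelector", {}))
    for key, value in runtime_class_selectors[name].items():
        if key in merged and merged[key] != value:
            # Assumed policy: refuse the pod rather than silently override.
            raise ValueError(f"nodeSelector conflict on {key!r}")
        merged[key] = value
    result["nodeSelector"] = merged
    return result


# Hypothetical mapping of RuntimeClass name to the node labels it requires.
RUNTIME_CLASS_SELECTORS = {
    "runhcs-wcow-process": {
        "kubernetes.io/os": "windows",
        "example.com/windows-release": "1809",
    },
    "runhcs-wcow-hypervisor": {
        "kubernetes.io/os": "windows",
        "example.com/hyperv": "true",
    },
}
```

With this in place, a pod author only names the RuntimeClass; the labels that pin the pod to a suitable node are filled in on their behalf.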
Proposal: Standardize hypervisor annotations
There are a large number of Windows annotations defined that can control how Hyper-V will configure its hypervisor partition for the pod. Today, these could be set in the RuntimeClasses defined in the CRI-ContainerD configuration file on the node, but it would be easier to maintain them if key settings around resources (CPU, memory, storage) could be aligned across multiple hypervisors and exposed in CRI.
Doing this would make pod definitions more portable between different isolation types. It would also avoid the need for a "t-shirt size" list of RuntimeClass instances to choose from:
- 1903-Hyper-V-Reserve-4Core-PhysicalMemory ...
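A quick sketch shows how fast such a "t-shirt size" list grows when each setting is baked into the RuntimeClass name. The naming pattern follows the example above; the specific core counts and the `VirtualMemory` mode are assumptions chosen only to illustrate the combinatorics.

```python
from itertools import product

# Assumed option sets; only the naming pattern comes from the example above.
OS_RELEASES = ["2019", "1903"]
CORE_COUNTS = [1, 2, 4, 8]
MEMORY_MODES = ["VirtualMemory", "PhysicalMemory"]

# One RuntimeClass name per combination of settings.
names = [
    f"{release}-Hyper-V-Reserve-{cores}Core-{memory}"
    for release, cores, memory in product(OS_RELEASES, CORE_COUNTS, MEMORY_MODES)
]

if __name__ == "__main__":
    print(len(names))
    for name in names:
        print(name)
```

Even these three small option sets yield 16 distinct classes, and each additional setting multiplies the count again.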
Windows Server 2019
This work would be carried out and tested using the already-released Windows Server 2019. That will give customers a migration path from Docker 18.09 to CRI-ContainerD if they want to get this new functionality. Windows Server 1903 and later will also be supported once they’re tested.
It was announced that the upcoming 1.3 release would include Windows support, but that release and timeline are still in planning as of early April 2019.
The code needed to run ContainerD is merged, and experimental support in moby has merged. CRI is in the process of being updated, and open issues are tracked on the Windows CRI-Containerd Project Board.
The CRI plugin changes needed to enable Hyper-V isolation are still in a development branch jterry75/cri and don’t have an upstream PR open yet.
Code: mostly done. CI+CD: lacking.
Flannel isn’t expected to require any changes since the Windows-specific metaplugins ship outside of the main repo. However, there is still not a stable release supporting Windows so it needs to be built from source. Additionally, the Windows-specific metaplugins to support ContainerD are being developed in a new repo Microsoft/windows-container-networking. It’s still TBD whether this code will be merged into containernetworking/plugins, or maintained in a separate repo.
- Sdnbridge - this works with host-gw mode, replaces win-bridge
- Sdnoverlay - this works with vxlan overlay mode, replaces win-overlay
Code: in progress. CI+CD: lacking.
The same sdnbridge plugin should work with kubenet as well. If someone would like to use kubenet instead of flannel, that should be feasible.
GCE uses the win-bridge meta-plugin today for managing Windows network interfaces. This would also need to migrate to sdnbridge.
Storage: in-tree AzureFile, AzureDisk, Google PD
These are expected to work and the same tests will be run for both dockershim and CRI-ContainerD.
Storage: FlexVolume for iSCSI & SMB
These out-of-tree plugins are expected to work, but are not tested in prow jobs today. If they graduate to stable we’ll add them to testgrid.
Risks and Mitigations
As mentioned earlier, builds are not yet available. We will publish the setup steps required to build & test in the kubernetes-sigs/windows-testing repo during the course of alpha so testing can commence.
The existing test cases running on Testgrid that cover Windows Server 2019 with Docker will be reused with CRI-ContainerD. Testgrid will include results for both ContainerD and dockershim.
- TestGrid: SIG-Windows: flannel-l2bridge-windows-master - this uses dockershim
- TestGrid: SIG-Windows: containerd-l2bridge-windows-master - this uses ContainerD
Test cases that depend on ContainerD and won't pass with Dockershim will be marked with
dockershim is deprecated.
Proposed for 1.18
- Windows Server 2019 containers can run with process level isolation
- TestGrid has results for Kubernetes master branch. CRI-ContainerD and CNI are built from source and may include non-upstream PRs.
Alpha -> Beta Graduation
Proposed for 1.19 or later
- Feature parity with dockershim, including:
- Group Managed Service Account support
- Named pipe & Unix domain socket mounts
- Support RuntimeClass to enable Hyper-V isolation
- Publicly available builds (beta or better) of CRI-ContainerD, at least one CNI
- TestGrid results for above builds with Kubernetes master branch
Beta -> GA Graduation
Proposed for 1.20 or later
- Stable release of CRI-ContainerD on Windows, at least one CNI
- Master & release branches on TestGrid
Upgrade / Downgrade Strategy
Because no Kubernetes API changes are expected, there is no planned upgrade/downgrade testing at the cluster level.
Node upgrade/downgrade is currently out of scope of the Kubernetes project, but we'll aim to include CRI-ContainerD in other efforts such as
kubeadm bootstrapping for nodes.
As discussed in SIG-Node, there's also no testing on switching CRI on an existing node. These are expected to be installed and configured as a prerequisite before joining a node to the cluster.
Version Skew Strategy
There are no version skew considerations needed, for the same reasons described in the upgrade/downgrade strategy.
- 2019-04-24 - KEP started, based on the earlier doc shared with SIG-Windows and SIG-Node
- 2019-09-20 - Updated with new milestones
- 2020-01-21 - Updated with new milestones
CRI-O is another runtime that aims to closely support all the fields available in the CRI spec. Currently there aren't any maintainers porting it to Windows so it's not a viable alternative.
No new infrastructure is currently needed from the Kubernetes community. The existing test jobs using prow & testgrid will be copied and modified to test CRI-ContainerD in addition to dockershim.