
Support containerd as the container runtime #1048

Merged (62 commits) on Jul 6, 2022

Conversation

@nhaghighat (Contributor) commented Jun 20, 2022

Description

This PR enables working with the containerd runtime by introducing an init container to pull images; the controller detects the runtime and configures Orchest accordingly.
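
For context, the init-container mechanism boils down to something like the pod sketch below (illustrative only, not the exact manifest the controller generates; the image-puller name, image tags and socket path are assumptions):

# Sketch: an init container pulls the step image through the node's containerd
# socket before the main container starts, so the kubelet finds it locally.
apiVersion: v1
kind: Pod
metadata:
  name: pipeline-step-example
spec:
  initContainers:
    - name: image-puller                       # hypothetical name
      image: orchest/image-puller:v2022.06.4   # hypothetical image/tag
      command: ["ctr", "--namespace", "k8s.io",
                "--address", "/var/run/containerd/containerd.sock",
                "image", "pull", "docker.io/orchest/base-kernel-py:v2022.06.4"]
      volumeMounts:
        - name: containerd-socket
          mountPath: /var/run/containerd/containerd.sock
  containers:
    - name: step
      image: docker.io/orchest/base-kernel-py:v2022.06.4
  volumes:
    - name: containerd-socket
      hostPath:
        path: /var/run/containerd/containerd.sock
        type: Socket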

In order to test this PR you need a cluster with a containerd runtime; microk8s is suggested. After microk8s is installed (here), the following addons need to be enabled:

microk8s enable hostpath-storage \
  && microk8s enable dns \
  && microk8s enable ingress

In order to be able to push images to the microk8s node, after rebuilding all the images with some valid tag (for example v2022.06.4), you can save them to a tar file via the following command:

docker save $(docker images | awk '{if ($1 ~ /^orchest\//) new_var=sprintf("%s:%s", $1, $2); print new_var}' | grep v2022.06.4 | sort | uniq) -o orchest-images.tar

Then this tar file can be shipped to the microk8s node via scp:
scp ./orchest-images.tar {your_user}@${microk8s node ip}:~/

Then, inside the microk8s node, you can import the images via the following command (note that ctr has to be installed; binaries can be found here):

sudo ctr -n k8s.io -a /var/snap/microk8s/common/run/containerd.sock i import orchest-images.tar

# Or use microk8s ctr
microk8s ctr --namespace k8s.io --address /var/snap/microk8s/common/run/containerd.sock image import orchest-images.tar

Then Orchest can be installed with the orchest-cli, using the following command:

orchest install --socket-path=/var/snap/microk8s/common/run/containerd.sock --dev

Note:

The manifests must be generated via make manifestgen in the orchest-controller directory:

TAGNAME=v2022.06.4 make -C ./services/orchest-controller manifestgen

Checklist

  • The documentation reflects the changes.
  • The PR branch is set up to merge into dev instead of master.
  • I haven't introduced breaking changes that would disrupt existing jobs, i.e. backwards compatibility is maintained.
  • In case I changed the dependencies in any requirements.in I have run pip-compile to update the corresponding requirements.txt.
  • In case I changed one of the services' models.py I have performed the appropriate database migrations (refer to the DB migration docs).
  • In case I changed code in the orchest-sdk I followed its release checklist.
  • In case I changed code in the orchest-cli I followed its release checklist.
  • The newly added image-puller has to be pushed to DockerHub on release, so it needs to be added to the correct .github/workflow/... file.
  • Add documentation about installing on microk8s.
  • Merge both CLI init containers into one.
  • Update third parties on update.

@nhaghighat (Contributor Author)

It has been tested with microk8s and minikube with containerd. k3s will be tested as well.

@fruttasecca (Member)

Not sure I agree with the deletion of the node agent:

  • Orchest loses the ability to delete images from the node
  • pre-pulling is still an important part of the UX imo; not only do we pull env images, but a bunch of other things as well

I guess part of the reason for this is the same reason we are using different images to interface with different CRIs? #1048 (comment)

Is the requirement for the cluster nodes to have a homogeneous runtime dictated by a lack of time (i.e. to keep scope down?) or a technical requirement?

@nhaghighat (Contributor Author)

Not sure I agree with the deletion of the node agent.

I agree, this has that downside, but if we want to use the node-agent we need some mechanism to delay pod deployment until the node-agent has informed us that pulling is done.

Is the requirement for the cluster nodes to have a homogeneous runtime dictated by a lack of time (i.e. to keep scope down?) or a technical requirement?

It is a technical requirement: we don't know on which node the pod will end up, so in a non-homogeneous cluster Orchest does not know which image to use.

@fruttasecca (Member) commented Jun 21, 2022

I agree, this has this downside, but if we want to use node-agent we need to have some mechanism to delay pod deployment, until node-agent informed us about pulling.

This would only affect the "puller" part of the node-agent, so I think we should at least retain the "deleter" part of it and aim to keep the puller part of it if possible. To be more precise, it only affects environment images and custom jupyter images (essentially all images that are pushed to the internal registry), so other images that we are pre-pulling are not part of the issue and would still benefit from being pre-pulled.

I think we have some architectural issues/questions here to solve:

  • how do different container runtimes handle concurrent pull calls? AFAIK that's fine but I haven't been able to find explicit docs about it for docker and/or containerd
  • now that we support containerd, the node agent should cover that as well
  • this PR provides multiple images to interface with different container runtimes, what do we do with the node-agent? One image of the node agent for every container runtime? (Keen to know @yannickperrenet's opinion as well)

Note that, assuming we keep the node-agent around, the new images introduced by this PR should be part of the set of images that the node puller and/or deleter considers.

It is a technical requirement, we don't know on which node the pod will end up, so in not-homogeneous cluster, orchest does not know which image to use.

I see, makes sense. I think there might be some hacky workarounds for this but I'd rather keep the abstraction, I guess not many people have heterogeneous clusters anyway (?).

@fruttasecca (Member)

Note that users can create custom jupyter images that are pushed to the internal registry, so the same kind of pre-pull behaviour should take place (you should look into the sessions _core.py module AFAIR). Also the /orchest-images-to-pre-pull endpoint in namespace_ctl might need to be looked into (not sure).

@fruttasecca (Member)

Would jupyter kernels actually work? I think _get_kernel_pod_manifest in services/orchest-webserver/app/app/res/kernels/launch_kubernetes.py might need to be changed to include the init containers?

@fruttasecca (Member)

Another thing that might need looking into: user services allow you to use an environment image built in Orchest as a service, and the logic happens at _get_user_service_deployment_service_manifest in services/orchest-api/app/app/core/sessions/_manifests.py. I guess we need to have init containers there as well when the service image is an Orchest environment image.

@fruttasecca (Member)

Another thing I would like to mention is the path of the init container directories, e.g. services/tools/docker/. Can we find a name that fits what the tool is doing and then improve the directory structure? I think this might have benefits in the long run, e.g. services/tools/init-container-puller/docker/, etc. Discussing it in this PR feels a bit off already since I keep coming up with (probably wrong) names for it. This would also affect how we refer to this particular tool in the codebase, i.e. RUNTIME_IMAGE.

@@ -164,13 +164,13 @@ controller:

 # -- Specifies the container runtime interface to use (one of: `docker`, `kubelet`, `k8sapi`, `pns`, `emissary`)
 ## Ref: https://argoproj.github.io/argo-workflows/workflow-executors/
-containerRuntimeExecutor: docker
+containerRuntimeExecutor: emissary
Contributor:

If a user does orchest update, does it automatically pick up the emissary executor? I guess so, right?

Contributor Author:

I think not, the third parties are only checked at installation time.

Contributor:

I thought that changes to third parties would also be applied on orchest update? Otherwise orchest update wouldn't actually 100% update Orchest.

Member:

I feel the update should indeed take care of updating third parties as much as possible

Contributor Author:

I spent quite some time on it; it is quite complex to solve, as we use the OrchestCluster hash to detect whether the cluster needs updating, and this part is not part of the OrchestCluster. So the way we detect upgrades needs to be changed. I implemented that part, but I'm afraid it might take more time.

As it does not affect our current users, I suggest creating a ticket for it so we can properly fix it later.

@yannickperrenet (Contributor) commented Jun 21, 2022

Some questions I have (in no particular order, but I figured the numbers could help to reference resp. question):

(1) I am not so sure about the deletion of the node-agent (pretty much what @fruttasecca mentioned in #1048 (comment)):

  • We need to keep deletion as otherwise all nodes in the cluster would slowly start to get a full disk due to the images.
  • Pre-pulling seems like a good idea still (to me) given that otherwise the init container would have to do the pulling, meaning that (from a UX standpoint) the user has to wait longer for e.g. a session to start.
  • "One image of the node agent for every container runtime?" (from Support containerd as the container runtime #1048 (comment) by @fruttasecca) -> I think we could use environment variables as well here (just like it is done for the orchest-api) that are added by the orchest-controller. I would rather have one bigger node-agent image that works for both container runtimes instead of having one for each. Given that it has a Python base, I don't think there is much gain anyways (in a relative sense).

(2) @nhaghighat You said in #1048 (comment) that you tested microk8s. Did it work out of the box or did you have to use the newly added --socket-path CLI option? In case it did not work by default then I think we should add a section to the docs that explains how to run Orchest on microk8s.

(3) The newly added docker-cli and containerd-cli need to be pushed to DockerHub on release. So they need to be added to the correct .github/workflow/... file.

(4) As mentioned in #1048 (comment), we might want to change the path from services/tools/... to tools/.... I don't really see the docker-cli/containerd-cli containers as services (at least not in the same way as the orchest-api etc.)

(5) Indeed the init container needs to be added to more places:

@nhaghighat (Contributor Author) commented Jun 21, 2022

how do different container runtimes handle concurrent pull calls? AFAIK that's fine but I haven't been able to find explicit docs about it for docker and/or containerd

@fruttasecca: Do you mean a concurrent pull of different images or of the same image?

now that we support containerd, the node agent should cover that as well

Pre-pulling seems like a good idea still (to me) given that otherwise the init container would have to do the pulling, meaning that (from a UX standpoint) the user has to wait longer for e.g. a session to start.

I also like the idea of pre-pulling, but I think at the moment it does not bring us much: most of our users have a limited number of nodes and the images will be pulled by the first step container, so for the rest of the steps the image is probably already present on the node.
Also, it fills all the nodes with images that might not be needed at all.
For example, in a 3 node cluster it pulls jupyter-eg and jupyter-server on all of them, but the pod is spawned on only one of the nodes.

this PR provides multiple images to interface with different container runtimes, what do we do with the node-agent? One image of the node agent for every container runtime?

I would rather have one bigger node-agent image that works for both container runtimes instead of having one for each. Given that it has a Python base, I don't think there is much gain anyways (in a relative sense).

If we decide to keep node-agent that would also be my preference.
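
To make the single-image idea concrete, the runtime selection could be injected by the orchest-controller roughly like this (a sketch only; the env var names, image tag and socket path are assumptions, not something defined by this PR):

# Sketch: one node-agent image serving both runtimes, told which CRI to talk
# to via environment variables set by the orchest-controller.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
        - name: node-agent
          image: orchest/node-agent:v2022.06.4   # tag assumed
          env:
            - name: CONTAINER_RUNTIME            # hypothetical: "docker" or "containerd"
              value: containerd
            - name: CONTAINER_RUNTIME_SOCKET     # hypothetical: detected socket path
              value: /var/snap/microk8s/common/run/containerd.sock
          volumeMounts:
            - name: runtime-socket
              mountPath: /var/snap/microk8s/common/run/containerd.sock
      volumes:
        - name: runtime-socket
          hostPath:
            path: /var/snap/microk8s/common/run/containerd.sock
            type: Socket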

We need to keep deletion as otherwise all nodes in the cluster would slowly start to get a full disk due to the images.

I think we should at least retain the "deleter" part of it and aim to keep the puller part of it if possible

Kubelet has functionality to regularly remove unused, untagged images from the nodes.

In case it did not work by default then I think we should add a section to the docs that explains how to run Orchest on microk8s.

No, I have changed the OrchestCluster CRD; I will add an example and a section to the docs about it.
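
Purely as a placeholder for that docs example, such an OrchestCluster could look roughly like this (the field name and API version below are assumptions and may not match the actual CRD):

# Illustrative OrchestCluster sketch; "containerRuntimeSocket" and the API
# version are assumed here, check the generated CRD/docs for the real names.
apiVersion: orchest.io/v1alpha1
kind: OrchestCluster
metadata:
  name: cluster-1
  namespace: orchest
spec:
  orchest:
    containerRuntimeSocket: /var/snap/microk8s/common/run/containerd.sock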

@nhaghighat (Contributor Author) commented Jun 21, 2022

We can just first try to pull using Docker, and if that does not exist, then we can try to pull using containerd. This would also solve the need to check for a homogeneous cluster

The image size will not grow that much if we have both in the same image; now it is about 50MB each, so it would be roughly 100MB.
But it still does not solve deployment in a non-homogeneous cluster: we still need to detect the socket location beforehand, and it is different for different container runtimes.
Solving this problem now seems like premature optimization; most of the clusters are homogeneous.

I think the ultimate solution to solve all the mentioned issues is to eventually have our own scheduler.

@fruttasecca (Member) commented Jun 21, 2022

@fruttasecca : You mean concurrent pull of different images or same image?

Same, I was thinking about race conditions etc.

I also like the idea of pre-pulling, but I think at the moment it does not bring us much, most of our users have limited amount of nodes and the images will be pulled by the first step container, so for the rest of the steps the image is probably already present on the node.

I don't feel very strongly about the need for a pre-puller like the current state of the node-agent, although I think it still brings a significant UX improvement on first use(s), and first impressions matter, though I don't have anything quantifiable. About pulling images where they are not needed: I get the sentiment, but any multi-node (serious) use of Orchest will eventually lead to all "types" of pods being scheduled on different nodes, so I am not sure we should weigh this point very much.
Imo if we keep the node-agent, let's say, for deletion (and thus implement/make it work with different container runtimes), the adjustments needed to bring the puller logic up to speed should be so minimal that it wouldn't make sense to renounce the UX improvement.

I would rather have one bigger node-agent image that works for both container runtimes...
If we decide to keep node-agent that would also be my preference.

same :)

Kubelet has functionality to regularly remove unused, untagged images from the nodes.

I think it might help us in reducing complexity by offloading more logic to k8s, I've skimmed through the docs very quickly and I think it has some downsides:

  • it requires setting and/or changing the kubelet configuration, I am not sure we can do that on third-party clusters or what's the overhead to implement this on different flavours of k8s, and asking users to do that would get in the way of a "it just works" experience imo
  • as far as I could see it's controlled by a periodic check configured through --minimum-image-ttl-duration duration, which means that freeing space is not immediate and you end up having delays at random times due to the image being re-pulled, the latter is a minor point imo, especially if the logic that considers an image "unused" is smart (which I don't know tho)
  • thinking about the cloud, if we require users to have at least X free gigs before performing an update we know that the outdated images will be swiftly cleaned up after said update, a periodic check with a long period wouldn't give us that, or we would have to complicate the logic further to quickly clean up those unused images. (@yannickperrenet tagging you here since this is likely of interest for the cloud migration)
  • generally speaking, and especially thinking about the cloud, freeing up space quickly might make a great difference when it comes to UX
  • we would still need to keep track of what images are to be deleted since environment and custom jupyter images are pushed to the internal registry

So imo even if this kubelet functionality had come up during the implementation/review of the node-deleter, I am not sure we would have gone with it.

Solving this problem now seems like premature optimization; most of the clusters are homogeneous.

Agree

@fruttasecca (Member)

I think the ultimate solution to solve all the mentioned issues is to eventually have our own scheduler.

I agree, or at least I agree on the fact that the current solution is not the final one, but when it comes to managing scope/roadmap/timelines I think this init container "hack" might be the right call in order to save time

@nhaghighat (Contributor Author)

I agree, or at least I agree on the fact that the current solution is not the final one, but when it comes to managing scope/roadmap/timelines I think this init container "hack" might be the right call in order to save time

Agree, it was the right call.

@nhaghighat (Contributor Author) commented Jun 21, 2022

it requires setting and/or changing the kubelet configuration

Garbage collection of images and containers is the default behavior of kubelet, so it is safe to assume users have not disabled it.

as far as I could see it's controlled by a periodic check configured through...

There are different TTL and threshold configurations. I think the default one is fine for us.

end up having delays at random times due to the image being re-pulled

This is true, but it can still happen with the node-agent. Imagine this scenario: kubelet deletes an image and then, before the node-agent pulls the image again, a pod is scheduled to run on that node.

For example, there is image-gc-high-threshold, which triggers garbage collection if the disk usage reaches the threshold (of course it checks the disk usage periodically).

More information about it can be found here:
https://unofficial-kubernetes.readthedocs.io/en/latest/concepts/cluster-administration/kubelet-garbage-collection/
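
For reference, the knobs mentioned above live in the KubeletConfiguration; a minimal sketch (the values shown are the usual defaults, listed only to illustrate which settings are involved):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageMinimumGCAge: 2m            # minimum age before an unused image may be garbage collected
imageGCHighThresholdPercent: 85  # disk usage percentage that triggers image garbage collection
imageGCLowThresholdPercent: 80   # disk usage percentage the garbage collector tries to free down to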

@yannickperrenet yannickperrenet changed the title Work with containerd Support containerd as the container runtime Jun 21, 2022
@yannickperrenet (Contributor)

So regarding #1048 (comment): on orchest update we are currently not updating any changes we make to third parties? E.g. for new version X we make a change to the manifest of Argo (this has happened in the past where we added an environment variable, if I am not mistaken @fruttasecca); then this change won't actually be reflected when an existing user updates/upgrades to version X? That is not acceptable imo, but since it is not introduced in this PR I am happy to have it fixed in another PR.

Thus, for the time being, I guess we should not make changes to third-party helm charts as existing users won't get the update. Luckily adding this functionality to the orchest-controller should not be a breaking change.

@nhaghighat (Contributor Author)

then this change won't actually be reflected when an existing user updates/upgrades to version X

That is true, this change is not reflected for existing users.

That is not acceptable imo

I agree, but that requires detecting changes to the third parties, which is not that easy, as we use an in-container values file for that.

@yannickperrenet (Contributor) left a comment

Ohhyeahhhh!! Finally 😉 🕺

Great job 💯

Labels: new feature request