[feature request] Automatic docker image pull on each node #864

Closed

ltetrel opened this issue May 29, 2019 · 4 comments

ltetrel commented May 29, 2019

Hi everyone,

We are launching a new BinderHub on Compute Canada for neuroscience: http://neurolibre.conp.ca/

We sometimes experience really slow pull times from Docker Hub (mostly because we share our bandwidth with other users on https://docs.computecanada.ca/wiki/Cloud_resources). This causes a really long pod spawn time if the image is not already on the node where the user's instance starts (https://mybinder.readthedocs.io/en/latest/faq.html#what-factors-influence-how-long-it-takes-a-binder-session-to-start).

Would it be possible to have a mechanism that automatically pulls the image on each node (so we would merge steps 1 and 2)? Then, if a user's instance is assigned to a new node, they will not need to wait for a pull.
For now we manually force the pull on each node.

Even better, to avoid redundancy: an admin could mount a shared volume on each node where the Docker images would be stored (https://medium.com/@ibrahimgunduz34/how-to-change-docker-data-folder-configuration-33d372669056), but only for the user images, not the system images (hub, binder, etc.).
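
For reference, the linked article's approach boils down to pointing Docker's data directory at another location, which could be the shared mount. A minimal sketch of /etc/docker/daemon.json, where /mnt/shared/docker is just a hypothetical mount point (the daemon then needs a restart, and the shared filesystem has to support the storage driver):

{
  "data-root": "/mnt/shared/docker"
}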

Thanks to all the team for your great work!

betatim (Member) commented May 29, 2019

It isn't currently possible to do this out-of-the-box. It will require a bit of work, but I think it would be generally useful, so it would be great if someone worked on this.

Instead of having a shared drive between all nodes, I would investigate how much faster things get when you set up a Docker image registry near your cluster. For example, for mybinder.org we use Google Container Registry, which has a faster network connection to our cluster than Docker Hub. This is probably less work than having BinderHub trigger a pull for every build.

The shared drive would be a network drive, so there would be no speed-up compared to a registry. And you can get the registry "for free": it is a few clicks on GKE and just works (AWS must have something like this, and for bare-metal clusters it is a task you should be able to give to whoever provides you with a k8s cluster). A high-performance shared (write-many) network filesystem sounds like it would use up a lot of your time (and nerves).
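
As a rough sketch (untested here, with the project ID and credentials as placeholders), pointing the BinderHub Helm chart at a registry close to the cluster looks something like this:

# sketch of BinderHub Helm chart values pushing to / pulling from
# Google Container Registry; <gcp-project> and the key are placeholders
config:
  BinderHub:
    use_registry: true
    image_prefix: gcr.io/<gcp-project>/binder-

registry:
  url: https://gcr.io
  username: _json_key
  password: |
    <contents of a GCP service-account JSON key>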

Now for how we could extend BinderHub, if a local registry isn't fast enough:

The JupyterHub Helm chart already has the notion of fetching images before users arrive: https://zero-to-jupyterhub.readthedocs.io/en/latest/optimization.html?highlight=image%20puller#pulling-images-before-users-arrive. For a standard JupyterHub the job is a bit easier, as the set of images needed is known ahead of time. I'd probably try to copy the ideas from the image puller, extend them to handle images on-the-fly, and make a new thing from that.

You could imagine that BinderHub edits/triggers a continuous image puller "job" each time an image build completes. This way we could reuse infrastructure that zero2jupyterhub already created (and debugged :) ).
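
To make that concrete: the z2jh puller is essentially a DaemonSet whose init container runs the target image with a no-op command, forcing every node to pull it. A hand-rolled sketch along those lines (the image name is a placeholder):

# sketch of a z2jh-style image puller DaemonSet; the init container
# exists only for its side effect of pulling the image onto each node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: binder-image-puller
spec:
  selector:
    matchLabels:
      app: binder-image-puller
  template:
    metadata:
      labels:
        app: binder-image-puller
    spec:
      initContainers:
        - name: pull
          image: gcr.io/<gcp-project>/binder-example:latest  # placeholder
          command: ["sh", "-c", "exit 0"]  # no-op; we only want the pull
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.2  # tiny container that just idles

BinderHub could then patch the DaemonSet's image (or create one DaemonSet per image) after each successful build.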

Some things to ponder while building this:

  • not every image needs to be on every node, as some images are not very popular; is this a second-order effect?
  • need to check how eviction from the Docker image cache on each node works (see the sketch after this list). Last time we cleaned out the mybinder.org registry we deleted tens of terabytes of images, so storage gets used fast: you need a clean-up strategy that works. Kubernetes does some clean-up, but I don't know what strategy it uses and whether a different strategy would be better (always delete the biggest image, always the least used, randomly, etc.)
  • build in throttling so a node doesn't get overwhelmed with pulling images
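
On the eviction question: kubelet already ships an image garbage collector that deletes least-recently-used images based on disk-usage thresholds, so part of the clean-up strategy is configurable. A KubeletConfiguration sketch (the percentages shown are the upstream defaults):

# kubelet starts deleting least-recently-used images once image disk
# usage exceeds the high threshold, and stops below the low threshold
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80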

We could probably find someone to give advice, review, or work with you if you start working on this.

ltetrel (Author) commented May 29, 2019

Thanks for the fast reply!

It is maybe too technical for me, haha.
We also thought about using our own registry; for now I think we will leave it as it is and try to get better internet speed on our server.

I will let you know if we make progress on this side.

cc @dlq @mathieuboudreau @cmd-ntrf

thomas-bc commented

Bringing this up again: this would be a great feature to have!

consideRatio (Member) commented Sep 11, 2021

I think this is a feature supported by the JupyterHub Helm chart that BinderHub relies on; it is documented in the Zero to JupyterHub optimization guide linked above.

A specific pre-defined image will be pulled on each node:

jupyterhub:
  prePuller:
    extraImages:
      myExtraImageIWantPulled:
        name: jupyter/all-spark-notebook
        tag: 2343e33dec46
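
The key myExtraImageIWantPulled is an arbitrary label; with this in place, the chart's pre-puller makes sure jupyter/all-spark-notebook:2343e33dec46 is pulled on every node, including nodes added later when the continuous pre-puller is enabled.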

I think it is out of scope for the BinderHub Helm chart to support something more advanced than this, even if something more advanced could make sense, because we lack the maintenance capacity to implement and maintain such a feature.
