This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Dockerhub Rate Limiting #5219

Closed
5 of 15 tasks
debuggy opened this issue Jan 4, 2021 · 3 comments · Fixed by #5290
debuggy commented Jan 4, 2021

Introduction

On November 20, 2020, rate limits on anonymous and free authenticated use of Docker Hub went into effect. Anonymous and free Docker Hub users are limited to 100 and 200 container image pull requests per six hours, respectively -- from the official doc.

Issues

  1. In the current OpenPAI, all jobs in the cluster pull images from dockerhub with a single free account, so the rate limit is reached quickly when the cluster is under heavy load.
    [screenshot]

  2. When deploying or upgrading OpenPAI, there is a step where the admin builds and pushes new images to the configured registry (currently docker.io/openpai). The admin will fail to build and push new images once the rate limit is reached.
    [screenshot]

  3. When starting services in OpenPAI, the k8s cluster creates pods that pull images from the configured registry (currently docker.io/openpai). Once the rate limit is reached, the cluster fails to create pods.
    [screenshot]

Proposals

  1. Deploy a new pull-through cache docker registry for dockerhub that caches images which have already been pulled. All jobs that use the same image pull from this registry instead of dockerhub, reducing the pull rate.
  2. Reference: Google Cloud's suggestion
    • Docker Hub treats GKE as an anonymous user by default. This means that unless you are specifying Docker Hub credentials in your configuration, your cluster is subject to the new throttling of 100 image pulls per six hours, per IP.

    • Container Registry hosts a cache of the most requested Docker Hub images from Google Cloud, and GKE is configured to use this cache by default. This means that the majority of image pulls by GKE workloads should not be affected by Docker Hub’s new rate limits. Furthermore, to remove any chance that your images would not be in the cache in the future, we recommend that you migrate your dependencies into Artifact Registry, so that you can pull all your images from a registry under your control.

  3. Create a docker image cache and configure the OpenPAI k8s cluster to use this cache.
  4. Move OpenPAI service images to a registry under our control.
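The pull-through cache in proposal 1 maps onto the standard proxy mode of the `registry:2` image. Below is a minimal sketch of the registry's config.yml, assuming Docker Hub as the upstream; the credentials are placeholders, and supplying them makes cache misses count against the authenticated (200-pull) quota instead of the anonymous one:

```yaml
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  # Pull-through cache mode: cache misses are fetched from Docker Hub and stored.
  remoteurl: https://registry-1.docker.io
  username: <dockerhub-user>      # placeholder, optional
  password: <dockerhub-password>  # placeholder, optional
```

Once clients are configured to use this registry as a mirror, only cache misses reach Docker Hub, so repeated pulls of the same image no longer consume quota.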

Work Plan

  • Test the rate limit from a local script
    Test scripts can be found in this repo; currently the Python version is finished.
    In the local test phase, due to the download-rate limit, I implemented a minimal test image and reached the docker hub rate limit after 326 downloads. After that, it responds with a 500 error code.
    [screenshot]
    The repo above also contains a bash script to query the remaining pull quota. When the rate limit is reached, it shows the following.
    [screenshot]

  • Deploy a local pull-through registry to check whether the pull rate is actually reduced
    The repo above also contains a bash script, boot_local_docker_pull_through_cache_registry.sh, which boots a pull-through cache registry on the local machine. With the local pull-through cache registry, after 400 downloads in 2 hours the remaining docker hub quota is still 99 (out of the 100-pull anonymous quota).

  • Test the rate limit in the OpenPAI cluster to see in which scenarios it is triggered and causes problems
    Tried submitting multi-taskrole jobs to reach the rate limit on the int bed, but hit an OOM issue; job name: v-xianglong_85cf447f

  • Deploy pull through registry in OpenPAI and test

    • Switch the registry storage backend to Azure storage (already got the key)
    • Test basic authentication on the registry with the htpasswd setting (works locally)
    • Add TLS support for the registry in the htpasswd setting (chose to offload TLS to pylon/nginx)
    • Set up the registry as a k8s service (in progress; can't access pai-exp-01, working on setting up a k8s cluster in a VM) Ref. repo
    • Switch the pai cluster docker hub registry to the new service (related source code)
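The "query remaining pull quota" script mentioned in the work plan can be sketched against Docker Hub's documented rate-limit check: fetch an anonymous token, then read the `ratelimit-remaining` header from a HEAD request on the `ratelimitpreview/test` image. The function names here are illustrative, not the actual script's:

```shell
#!/bin/sh
# Extract the numeric value of a rate-limit header ("99;w=21600" -> "99").
parse_ratelimit() {  # $1 = header name; raw HTTP response headers on stdin
  grep -i "^$1:" | cut -d' ' -f2 | cut -d';' -f1 | tr -d '\r'
}

# Query Docker Hub's rate-limit preview endpoint with an anonymous token.
check_quota() {
  token=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
    | sed 's/.*"token" *: *"\([^"]*\)".*/\1/')
  curl -sI -H "Authorization: Bearer $token" \
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
    | parse_ratelimit ratelimit-remaining
}
```

Per Docker's documentation, the HEAD request itself does not count against the quota, so this check can be run repeatedly.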

Demos

TODO

Test cases

  • submit a job using a dockerhub image
    • cache registry log shows that it received a pull-image request from the job container
  • batch-submit 100+ simple jobs
    • cache registry log records all the pull-image requests
    • check the worker nodes' rate limit and confirm the limit is not reached
  • run contrib/kubespray/docker-cache-config-distribute.yml as ansible-playbook -i hosts.yml docker-cache-config-distribute.yml; the pai-master node should be added to /etc/docker/daemon.json on all cluster master/worker nodes
  • change the docker cache host parameter in docker-cache-config-distribute.yml and run it again; the newly specified IP is added to /etc/docker/daemon.json on all cluster master/worker nodes
  • test docker pull on each master/worker node at least 400 times
  • test multiple service start/stop operations to verify behavior after many pulls
  • test that the cluster works normally with docker-cache both enabled and disabled
  • test the storage backend in both "azure" and "fs" modes
  • test with htpasswd auth enabled and disabled; htpasswd is an Apache auth tool, and an htpasswd auth file can be generated with htpasswd and base64-encoded to fit the k8s secret format
  • test kubespray quick-start-kubespray.sh; the docker cache config should be distributed automatically
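For reference, the change the playbook is expected to land in /etc/docker/daemon.json on each node is roughly the following sketch (the IP is a placeholder for the pai-master cache registry, not the playbook's literal output; `insecure-registries` is only needed while the cache is served over plain HTTP):

```json
{
  "registry-mirrors": ["http://10.0.0.1:5000"],
  "insecure-registries": ["10.0.0.1:5000"]
}
```

With `registry-mirrors` set, the docker daemon tries the mirror first for Docker Hub pulls and falls back to dockerhub only if the mirror is unavailable.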
siaimes commented Jan 4, 2021

Good idea. I think this can also solve the problem of not being able to pull the k8s images in mainland China. Users only need to pull the target images to the master node through a proxy server (or use pre-downloaded images), push them to the local docker registry, and then deploy the OpenPAI services.

On the other hand, after this optimization, our services can run normally as long as the master node can access the Internet: all the work of pulling images from dockerhub is done by the master node, and the worker nodes then pull the images from the mirror registry deployed on the master node.

debuggy mentioned this issue Jan 4, 2021
debuggy commented Jan 4, 2021

TODO List:

  • pai-exp-int ssh (IT returned the VM)
  • dockerhub account secret usage (can get it from cluster-configuration)
  • pai job scheduler OOM: int bed, job name v-xianglong_85cf447f

@SwordFaith
Contributor

I have built a repo to test the registry container and k8s deployment. Anyone who wants to try it locally or on a k8s cluster can refer to this repo. @debuggy
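For anyone trying the k8s route from the work plan, a minimal Deployment + Service sketch for the cache registry might look like the following. The names and the in-cluster wiring are illustrative, not the repo's actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docker-cache          # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels: {app: docker-cache}
  template:
    metadata:
      labels: {app: docker-cache}
    spec:
      containers:
      - name: registry
        image: registry:2
        env:
        # Pull-through cache mode, same as the local-machine script.
        - name: REGISTRY_PROXY_REMOTEURL
          value: https://registry-1.docker.io
        ports:
        - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: docker-cache
spec:
  selector: {app: docker-cache}
  ports:
  - port: 5000
    targetPort: 5000
```

Node docker daemons would then point `registry-mirrors` at this service's exposed address.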
