Better auth support for swarm-mode services running images in private repos #24940

Open
friism opened this issue Jul 22, 2016 · 29 comments
Labels
area/swarm kind/enhancement Enhancements are not bugs or new features but can improve usability or performance.

Comments

@friism
Contributor

friism commented Jul 22, 2016

This is a follow-up to #24372

The current flow to deploy service with image in private repo is something like:

  1. docker login
  2. docker service create --registry-auth <private-repo-url>

With --registry-auth, registry authentication is passed to swarm and swarm passes it to the worker nodes when tasks are created so they can successfully pull images from the private repo.
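
As a rough illustration of what "passing registry authentication to swarm" amounts to: the Engine API accepts credentials for a service in an `X-Registry-Auth` header, which is a base64url-encoded JSON auth config. The sketch below only shows that encoding; the username, password, and registry address are made-up placeholders, and the helper names are hypothetical.

```python
import base64
import json


def encode_registry_auth(username: str, password: str, serveraddress: str) -> str:
    """Build the value of an X-Registry-Auth header (base64url-encoded JSON)."""
    auth_config = {
        "username": username,
        "password": password,
        "serveraddress": serveraddress,
    }
    raw = json.dumps(auth_config).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_registry_auth(header_value: str) -> dict:
    """Reverse of encode_registry_auth; the password is trivially recoverable."""
    return json.loads(base64.urlsafe_b64decode(header_value))


# Placeholder values, not real credentials.
header = encode_registry_auth("foo-user", "hunter2", "registry.example.com")
print(decode_registry_auth(header)["password"])  # prints "hunter2"
```

The encoding is not encryption: whatever was typed at login is what travels with the service, which is why rolling the password invalidates it.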

When this flag is used with Docker Hub / Cloud, resetting the Hub password causes credentials to be rolled (and similarly when creds are rolled with other authenticated registry implementations). That will then cause running apps relying on that registry/auth to randomly start failing. This is a bad failure mode, because a password update or cred roll (potentially even by a different user) will cause seemingly unrelated apps to slowly start failing. Because of the tight coupling, the only correct way to update credentials is to carefully coordinate the update with service update for all services that rely on a private repo.

I don't know what a better design would be; I'm creating this issue for tracking.

cc @stevvooe

@thaJeztah thaJeztah added this to the 1.13.0 milestone Jul 23, 2016
@thaJeztah thaJeztah added kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. area/swarm labels Jul 23, 2016
@rageshkrishna

rageshkrishna commented Oct 24, 2016

If I've understood it correctly, running docker service update --with-registry-auth <service> will trigger a rolling update. I wouldn't mind scripting this, but only if there was some way to avoid the rolling update and just get the service definition updated with the new credentials so they can be used the next time the service needs to be deployed. We use ECR and this is definitely a pain point for us right now.

@stevvooe
Contributor

Most of the rolling update issues should be addressed with secrets, but I am thin on the details.

The root problem here is that we are sending hub passwords. If this was a token, the hub could change the credentials without affecting the token.

@friism
Contributor Author

friism commented Oct 26, 2016

@borjaburgos @fermayo is this something that you can help with? I'm reading up on the OAuth registry spec. I guess it would be something like:

  1. docker service create --with-registry-auth foo-user/bar-repo
  2. [optional] swarm makes sure that the current creds work for that repo
  3. swarm uses creds to obtain refresh token that's scoped to that particular repo and is read-only
  4. swarm hangs on to said refresh token
  5. whenever swarm commands a worker to start a task/container for this service, it obtains a short-lived auth-token from Hub and passes that along with the task command.
  6. the long-lived tokens are not nullified on password reset, etc.
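
The steps above can be sketched as a small simulation. Everything here is hypothetical: `FakeTokenService` stands in for a registry auth service and `Manager` for a swarm manager; neither mirrors a real Hub or swarm API, and the 300-second lifetime is only an assumed default.

```python
import secrets
import time
from dataclasses import dataclass, field


@dataclass
class FakeTokenService:
    """Stand-in for a registry auth service; not a real Hub API."""
    passwords: dict = field(default_factory=dict)       # user -> password
    refresh_tokens: dict = field(default_factory=dict)  # token -> (user, repo)

    def issue_refresh_token(self, user, password, repo):
        # Step 3: exchange credentials for a repo-scoped, read-only refresh token.
        if self.passwords.get(user) != password:
            raise PermissionError("bad credentials")
        token = secrets.token_hex(16)
        self.refresh_tokens[token] = (user, repo)
        return token

    def issue_access_token(self, refresh_token, ttl_seconds=300, now=None):
        # Step 5: mint a short-lived pull token from the refresh token.
        if refresh_token not in self.refresh_tokens:
            raise PermissionError("refresh token revoked or unknown")
        user, repo = self.refresh_tokens[refresh_token]
        issued = time.time() if now is None else now
        return {"repo": repo, "scope": "pull", "expires_at": issued + ttl_seconds}

    def reset_password(self, user, new_password):
        # Step 6: a password reset leaves refresh tokens untouched.
        self.passwords[user] = new_password


@dataclass
class Manager:
    """Hypothetical swarm manager that keeps only the refresh token (step 4)."""
    hub: FakeTokenService
    refresh_token: str = ""

    def create_service(self, user, password, repo):
        self.refresh_token = self.hub.issue_refresh_token(user, password, repo)

    def start_task(self, now=None):
        # The worker receives a short-lived token, never the password.
        return self.hub.issue_access_token(self.refresh_token, now=now)


hub = FakeTokenService(passwords={"foo-user": "old-password"})
manager = Manager(hub)
manager.create_service("foo-user", "old-password", "foo-user/bar-repo")
hub.reset_password("foo-user", "new-password")
task_token = manager.start_task(now=1000.0)
print(task_token["expires_at"])  # 1300.0
```

In this model a password reset after `create_service` does not stop new tasks from getting pull tokens, which is the property the proposal is after.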

This should work for DTR too I suppose

/cc @dmp42 @diogomonica @pdevine

@fermayo
Member

fermayo commented Oct 27, 2016

@friism at the moment, tokens given by the registry authentication service only live for 5 minutes and cannot be refreshed. However, users can generate an API key in Docker Cloud that can be used to authenticate against our registry using it as a password. That API key is valid even after password refreshes, and can be revoked from Docker Cloud.

@stevvooe
Contributor

stevvooe commented Oct 27, 2016

@fermayo There exists the concept of identity tokens. Such a key could also defer access to refresh the authorization.

Let's think outside the box on what we can do with garant tokens. The current thinking that they are limited to just the registry or are limited to 5 minutes is preventing us from implementing further workflows. We haven't even scratched the surface of what is possible with the access model provided by garant. Let's build up on this solid base rather than continuing to re-invent the wheel.

Registries don't care either way.

cc @dmcgowan

@diogomonica
Contributor

@fermayo can a user provide that API key base64 encoded in a way that Docker recognizes?

If so, we could add that token, and change --with-registry-auth to receive the name of a secret that will be sent down to the node.

@friism
Contributor Author

friism commented Oct 28, 2016

... or would it make sense for Docker Engine to auto-provision an API key (if possible) during docker service create? @diogomonica @stevvooe can you confirm that the credentials currently given to workers when starting tasks/containers are in fact the short-lived tokens, and not the full password?

@diogomonica
Contributor

Right now it's the full password.

@fermayo
Member

fermayo commented Oct 28, 2016

@diogomonica the user can provide that API key as a replacement of their password (i.e. providing it on docker login)
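
For reference, when no credentials store is configured, docker login keeps each registry's credentials in `~/.docker/config.json` as a base64 encoding of `username:password`. A minimal sketch of what the entry would look like with an API key in the password slot; the user name and key below are invented for illustration, and `auth_entry` is a hypothetical helper.

```python
import base64
import json


def auth_entry(username: str, secret: str) -> dict:
    """Return one entry of the "auths" map in ~/.docker/config.json."""
    token = base64.b64encode(f"{username}:{secret}".encode("utf-8"))
    return {"auth": token.decode("ascii")}


# Hypothetical user and API key; the key simply takes the password's place.
config = {
    "auths": {
        "https://index.docker.io/v1/": auth_entry("foo-user", "example-api-key"),
    }
}
print(json.dumps(config, indent=2))
```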

@thaJeztah
Member

/cc @simonferquel

Also see the discussion on #31878 (comment), which contains information about (e.g.) the AWS credentials helper

@thaJeztah
Member

Oh, and my comment here #31632 (comment) (interested in thoughts on that one)

@thaJeztah thaJeztah modified the milestones: 17.06, 1.13.0 Apr 20, 2017
@kencochrane
Contributor

Is there any way we can change the way credential stores and helpers work, so that they are not just a client side feature, and we can use them server side as well?

If so, then we could allow the swarm service to pick which credential store and helpers they want to use, to help manage their registry credentials. It would be really cool if there was a swarm secrets credstore, where we can store the registry password as a swarm secret, and then if you need to change your password you just need to change the swarm secret, and ideally this doesn't trigger a rolling update of your service.

This would also work well with registries that use short-lived tokens, like ECR with the AWS ECR credential helper. Whenever Swarm needs to authenticate, it would reach out via API to get a valid auth token, and the token would never need to be manually rotated.
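
To make the idea concrete: a credential helper is just a program that, for its `get` action, reads a registry URL on stdin and answers with a JSON object containing `ServerURL`, `Username`, and `Secret`. Below is a minimal sketch of that one action, assuming a hypothetical in-memory `SECRET_STORE` where a swarm-secrets-backed helper would look up real secrets. It is not an existing helper, and the error text is illustrative only.

```python
import io
import json

# Hypothetical backing store; in the idea above this would be swarm secrets.
SECRET_STORE = {
    "registry.example.com": ("deploy-bot", "example-token"),
}


def handle_get(stdin, stdout) -> int:
    """Serve the `get` action of the credential-helper protocol.

    The caller writes the registry URL on stdin and expects a JSON object
    with ServerURL, Username and Secret on stdout.
    """
    server_url = stdin.read().strip()
    entry = SECRET_STORE.get(server_url)
    if entry is None:
        stdout.write("credentials not found\n")
        return 1
    username, secret = entry
    json.dump({"ServerURL": server_url, "Username": username, "Secret": secret}, stdout)
    return 0


# Simulate `echo registry.example.com | docker-credential-<name> get`.
out = io.StringIO()
exit_code = handle_get(io.StringIO("registry.example.com\n"), out)
print(exit_code, out.getvalue())
```

Because the lookup happens at call time, changing the stored secret changes what the next pull sees without touching the service definition.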

Just an idea. Are there any other current ideas that haven't been posted to this issue yet?

@stevvooe
Contributor

@kencochrane That sounds right on target. Secrets should be used here and we should introduce "external secrets".

I am not sure what the current plans for this are.

@diogomonica
Contributor

@kencochrane we're thinking about it. Currently creating "secret types" for our external stores. I believe "registry auth" will probably be one of those types.

/cc @cyli @aaronlehmann

@kencochrane
Contributor

Ok, thanks. If I remember correctly, you can't change a swarm secret once it is made. You need to rotate it out.

Would these different secret types be able to change? If not, it wouldn't work for auth that uses short lived tokens.

Since the registry auth secrets don't need to have access inside of a container, hopefully we can treat them differently and allow updating, or some other way for them to be dynamic. Like the way the ECR cred helper uses IAM roles to get a short lived auth token via an API.

@aaronlehmann
Contributor

Ideally we would have some kind of plugin API that lets the manager request short-lived tokens for every task that's created, and provide individual tokens along with those tasks, rather than providing the actual credentials to all worker nodes. I'd really like to move away from the model where workers handle the credentials and request tokens for themselves.

@friism
Contributor Author

friism commented May 22, 2017

@aaronlehmann that works for me too and it's somewhat like what I proposed here: #24940 (comment)

From the standpoint of Docker for AWS this is also ideal, since only managers would require the privileged registry read-only role.

@friism
Contributor Author

friism commented Jul 17, 2017

Cross-posting what I think is the current path forward (cc @diogomonica):

  1. Docker will get pluggable external secret stores
  2. --with-registry-auth transitions to using docker secrets for storing registry creds (right now I think it's just embedded in the service def)
  3. We implement an EC2 instance metadata pseudo-secret store
  4. That gets wired up so that swarm dynamically reads ECR registry creds as a secret, with that secret provided by the pseudo-secret EC2 instance metadata secrets plugin.

@thaJeztah thaJeztah removed this from the 17.06.0 milestone Jan 25, 2018
@nvivo

nvivo commented Jan 27, 2018

Hi. Is this implemented yet? Is there any workaround to keep swarm deploying containers over time with short lived tokens?

@jeffjen

jeffjen commented Feb 22, 2018

Still waiting for this feature. Managing 20+ services with a private registry is not easy when you rotate or scale node count.

@trajano

trajano commented Jul 6, 2018

Given all these issues, I am thinking of just setting up a private registry that I control and that does not have the short-lived token issue, perhaps with registry:2 or Nexus.

I just want to use shepherd to do the automatic deployment.

@nvivo

nvivo commented Jul 6, 2018

I ended up just paying for Docker Hub.

@RAKedz

RAKedz commented Jul 18, 2018

I decided last week to use a credential manager with our Swarm. We have been using Swarm for quite some time on Ubuntu 16.04 with about 6 nodes. It was set up by going through each node, running docker login against our private Nexus registry, and adding --with-registry-auth to docker service create.

I decided to use the pass credential manager on a node from the list posted here:

https://docs.docker.com/engine/reference/commandline/login/

I followed the instructions here:

docker/docker-credential-helpers#102

Using latest 0.6.0 with gpg2.

It tested fine, as the instructions indicated. I did a simple test of pulling and pushing images to the private registry. When I removed one of my services for testing on the node using pass, it failed, indicating it couldn't find the image. I ended up putting the original docker login results back.

I have been reading about the shortcomings of credential managers with Swarm. Is there one that works with Swarm? I'd hate to keep pressing on only to hit another wall.

Appreciate your inputs on this.

@thaJeztah
Member

@RAKedz note that the credential helper on the node is not used in a swarm setup (unless that node is the manager from which the service is created or updated).

In a swarm, credentials are distributed through the raft store; when you create a service and add the --with-registry-auth flag (docker service create --with-registry-auth .....), the following happens:

  • the cli from which the service is created reads the credentials (which could be using a credential helper installed on that machine)
  • using those credentials, the image's digest is resolved
  • the service definition, including credentials, is sent to the manager daemon and stored in the raft store

The service's tasks are now scheduled, and the node on which a task is deployed gets the credentials that are stored in the raft store; the node will pull the image using those credentials.

It's important to know that credentials may expire (depending on the authentication mechanism used by the registry you're using); if a task is re-deployed to a different node (for example, a task failed, and a new one is started to replace it) and the credentials have expired, pulling the image will fail, because the service only has the credentials that were stored at the time the service was created.

If you update a service and pass --with-registry-auth when doing so, the CLI again reads the credentials using the credential helper and updates the service definition in the raft store with those credentials.
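
Given that behaviour, one workaround for registries with expiring credentials is a scheduled job on a manager that logs in again and re-sends the credentials for each service. A minimal sketch that only builds the commands such a job might run, assuming an ECR registry and the AWS CLI; the account ID, region, and service names are placeholders, and `refresh_commands` is a hypothetical helper. As noted earlier in the thread, the update may trigger a rolling update, so this is not free.

```python
import shlex


def refresh_commands(registry: str, region: str, services: list) -> list:
    """Return the shell commands a scheduled job could run on a manager node."""
    login = (
        f"aws ecr get-login-password --region {shlex.quote(region)}"
        f" | docker login --username AWS --password-stdin {shlex.quote(registry)}"
    )
    updates = [
        f"docker service update --with-registry-auth {shlex.quote(name)}"
        for name in services
    ]
    return [login] + updates


# Placeholder account ID, region and service names.
for command in refresh_commands(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com", "us-east-1", ["web", "worker"]
):
    print(command)
```

The function deliberately returns strings instead of executing anything, so the output can be reviewed before being wired into cron or a similar scheduler.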

@RAKedz

RAKedz commented Jul 19, 2018

@thaJeztah thanks so much for describing how this works. I would basically just install the credential helper on the managers and add --with-registry-auth to both service update and service create. I would also have to docker logout and docker login once I have it installed.

I also read about the credential helper docker-credential-secretservice; compared with pass, its installation looked much simpler.

See the two links below:

https://hackernoon.com/getting-rid-of-docker-plain-text-credentials-88309e07640d
https://www.projectatomic.io/blog/2016/03/docker-credentials-store/

I will now go back to my credential helper journey. Will keep you posted.

@rmetcalf9

rmetcalf9 commented Mar 21, 2021

I am trying to migrate my docker swarm from using images on docker hub to images inside ECR.
I have just read through this issue from 2016 (5 years ago.)

In summary it seems to say that it is not possible to use docker swarm with ECR repositories due to the credential token expiring.

The best solution I can see is https://medium.com/@MahmoudGaballah/ecr-for-docker-swarm-fdea3a9b01b1 - creating an ECR Proxy but this is complicated.

Is it correct that, as of Mar 2021, swarm is incompatible with ECR? Is it a work in progress?

@trajano

trajano commented Mar 21, 2021

@rmetcalf9 I would recommend NOT using ECR, instead just run your own registry like Sonatype Nexus. This allows you to decouple yourself from AWS if needed.

@rmetcalf9

@trajano Thanks for the response. I think you are right. I am in the process of setting up a repo. Frustrating, though; this should be one of the simple use cases!

@trajano

trajano commented Mar 25, 2021

@rmetcalf9 if you want to be parsimonious initially, you can use Sonatype Nexus.

Here's my setup (note I am still on Traefik 1.7 at work; I've been too busy to move to 2.x).

version: "3.7"
services:
  nexus:
    image: sonatype/nexus3:3.29.2
    volumes:
      - nexus-data:/nexus-data
    deploy:
      replicas: 1
      restart_policy:
        condition: any
        delay: 10s
      resources:
        limits:
          memory: 3g
        reservations:
          memory: 1024M
      labels:
        - "traefik.dockerv1.frontend.rule=Host:repo.devhaus.com; PathPrefix:/v1/"
        - "traefik.dockerv1.frontend.entryPoints=https"
        - "traefik.dockerv1.frontend.passHostHeader=true"
        - "traefik.dockerv1.frontend.priority=100"
        - "traefik.dockerv1.port=8082"
        - "traefik.dockerv2.frontend.rule=Host:repo.devhaus.com; PathPrefix:/v2/"
        - "traefik.dockerv2.frontend.entryPoints=https"
        - "traefik.dockerv2.frontend.passHostHeader=true"
        - "traefik.dockerv2.frontend.priority=100"
        - "traefik.dockerv2.port=8082"
        - "traefik.proxydockerv1.frontend.rule=Host:docker-proxy.XXXXXXXXXXXXX.com; PathPrefix:/v1/"
        - "traefik.proxydockerv1.frontend.entryPoints=https"
        - "traefik.proxydockerv1.frontend.passHostHeader=true"
        - "traefik.proxydockerv1.frontend.priority=100"
        - "traefik.proxydockerv1.port=8083"
        - "traefik.proxydockerv2.frontend.rule=Host:docker-proxy.XXXXXXXXXXXXXX.com; PathPrefix:/v2/"
        - "traefik.proxydockerv2.frontend.entryPoints=https"
        - "traefik.proxydockerv2.frontend.passHostHeader=true"
        - "traefik.proxydockerv2.frontend.priority=100"
        - "traefik.proxydockerv2.port=8083"
        - "traefik.main.frontend.rule=Host:repo.devhaus.com"
        - "traefik.main.frontend.entryPoints=https"
        - "traefik.main.frontend.passHostHeader=true"
        - "traefik.main.port=8081"
        - "traefik.enable=true"
        - "traefik.docker.network=traefik"
    networks:
      - traefik
volumes:
  nexus-data:
    driver: local
    driver_opts:
      type: nfs
      device: ":/nexus"
      o: addr=fs-XXXXXXXXXXXXX.com,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport
networks:
  traefik:
    external: true

Mind you, you WILL get sticker shock when your team grows and you want to get the Pro version of Nexus. Another alternative would be JFrog, but I don't think they have a good free tier.
