docker service create doesn't work when network and generic-resource are both attached #44378
This looks simple enough to fix, but probably has to happen over at swarmkit.
AssignedGenericResources in constraint_enforcer.go were falsely checked inside a case that enforced Reservations to be set. Furthermore, the if statement had a missing !. Signed-off-by: Martin Braun <braun@neuroforge.de>
Good find! Pretty sure we ran into this as well. I submitted a PR over at swarmkit :)
This has been bugging me, thanks for finding it!
…-generic-resources Fixes moby/moby#44378
The fix for this should be included in 23.0.2.
We still encounter this bug on Docker Engine Version Here is an excerpt from
Background: We run on Ubuntu 20.04 with an apt-installed docker engine. Any suggestions, or is anyone else seeing this behaviour?
Hmm, this looks unrelated though @mrnicegyu11 - but can you try to share a reproducer here? I tried the original steps on a single node (local devenv) with the following and it worked:
docker network create -d overlay --scope swarm test-network
docker service create --network test-network --generic-resource "gpu_test=1" --name test-service quay.io/centos/centos:stream8 bash -c "env && sleep infinity"
docker service ps --no-trunc test-service
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
k3lrk5xjiorlorb7m7f9znyxn test-service.1 quay.io/centos/centos:stream8@sha256:59460e4360f0657a24ce56339b117f14ac236dd51e1cf33f30ed5725ba1b4429 ubuntu Running Running 16 seconds ago
docker service inspect test-service
[
{
"ID": "df0915k0vefz3f1u6zvafy2fq",
"Version": {
"Index": 5584
},
"CreatedAt": "2023-08-11T14:12:43.079778404Z",
"UpdatedAt": "2023-08-11T14:12:43.080247586Z",
"Spec": {
"Name": "test-service",
"Labels": {},
"TaskTemplate": {
"ContainerSpec": {
"Image": "quay.io/centos/centos:stream8@sha256:59460e4360f0657a24ce56339b117f14ac236dd51e1cf33f30ed5725ba1b4429",
"Args": [
"bash",
"-c",
"env \u0026\u0026 sleep infinity"
],
"Init": false,
"StopGracePeriod": 10000000000,
"DNSConfig": {},
"Isolation": "default"
},
"Resources": {
"Limits": {},
"Reservations": {
"GenericResources": [
{
"DiscreteResourceSpec": {
"Kind": "gpu_test",
"Value": 1
}
}
]
}
},
"RestartPolicy": {
"Condition": "any",
"Delay": 5000000000,
"MaxAttempts": 0
},
"Placement": {
"Platforms": [
{
"Architecture": "arm64",
"OS": "linux"
},
{
"Architecture": "ppc64le",
"OS": "linux"
},
{
"Architecture": "amd64",
"OS": "linux"
}
]
},
"Networks": [
{
"Target": "vv3h7voq9m7b4k4rzovv8x9mf"
}
],
"ForceUpdate": 0,
"Runtime": "container"
},
"Mode": {
"Replicated": {
"Replicas": 1
}
},
"UpdateConfig": {
"Parallelism": 1,
"FailureAction": "pause",
"Monitor": 5000000000,
"MaxFailureRatio": 0,
"Order": "stop-first"
},
"RollbackConfig": {
"Parallelism": 1,
"FailureAction": "pause",
"Monitor": 5000000000,
"MaxFailureRatio": 0,
"Order": "stop-first"
},
"EndpointSpec": {
"Mode": "vip"
}
},
"Endpoint": {
"Spec": {
"Mode": "vip"
},
"VirtualIPs": [
{
"NetworkID": "vv3h7voq9m7b4k4rzovv8x9mf",
"Addr": "10.0.2.2/24"
}
]
}
}
]
The generic resource is set up like this: "node-generic-resources": ["gpu_test=GPU-1"]
This looks to me like an issue with your networks. Do you by any chance have a lot of overlay networks? How many containers do you have? Maybe the default network size is not big enough?
@s4ke thanks a lot already for the rapid response, let me provide you with some more details about our networking in docker swarm:
However, this error also occurs on an "empty" swarm with your basic example on your swarm (7 ubuntu machines). I can reproduce it with your given
More background: At the time when the error happened on docker engine major versions I have run your minimal PoC code on our cluster, for docker engine version
For docker engine version
Just to be clear: I reproduced the issue on a docker swarm where nothing except your given
Thanks for the help / communication! :--)
@dperny ideas?
@mrnicegyu11 happy to help. Just as additional info so it's easier to reproduce: can you share more about your env? Which nodes have which docker version running? I am wondering if we can build a unit test out of your situation in moby/swarmkit. Also, I am unsure if the behaviour differs between having only a single node vs multiple. EDIT: just reread... You have no version mismatch. Hmm
@s4ke Yes, we keep our machines' docker engine version always in sync :) To make progress, maybe I can suggest the following: We provision our machines with ansible playbooks, so it is probably feasible for me within a reasonable time to reproduce this on a two-machine swarm (1 manager, 1 worker with generic resource) that we, for example, spin up vanilla on some cloud provider. If it also happens there, we would have an accessible and secluded environment to have a look at :) I will report back. If it doesn't happen there, it would be interesting to see if a full reinstall of the operating system will fix the issue on the machines of our swarm. After all, it is very weird that for us the versions where other people are affected run just fine, while the newer versions have the issue even though it should be fixed. I will also mention that I just up- and downgraded the docker engine version, so the
If you can try to reproduce that would be great. I will try to do the same next week or so.
@mrnicegyu11 have you managed to reproduce it?
Sorry, I had to temporarily drop this for some other tasks. Thanks for the gentle reminder; I am trying to reproduce it now once again :)
@s4ke I have managed to reproduce the problem on an isolated docker swarm created in AWS. The problem occurs with docker engine version 20, but not with version 24. I'll PM you some details for now :) Sorry for the delay!
@s4ke Here is terraform & ansible code that makes the issue reproducible, at least for me :) https://github.com/ITISFoundation/minimal-example-docker-custom-constraint-bug
The fix was not merged until 23.0.2, so this is expected. You'll want to update to a supported engine version to benefit from recent fixes (currently 24.0.z). I'm guessing that the "works" or "does not work" got inverted from this original statement:
@neersighted thanks for replying :--). In fact, the wording did not get mixed up. Let me clarify: in an exactly opposite scenario to what is expected from the docker engine changelogs, we do not encounter the issue on docker engine version 20.x, but we do encounter it on docker engine version 24.x. We are very confused by this as well, and probably I am overlooking something somewhere. Nevertheless, with our machine setup (swarm on Ubuntu 20.04, with the nvidia docker runtime), the problem is reproducible for me, even when spinning up cloud VM machines. We are provisioning them with ansible, so everything should be captured in code in the link I posted in a previous comment. Let me know if I can provide some more specifics, thanks.
@mrnicegyu11 Before I try to reproduce this on multiple nodes, can you verify whether this also happens on a single node for you?
This is from a single node Swarm with GPU:
We use this daemon.json:
Force updating also works:
But since there is only one resource present and the first task of the service is still running, the force update initially reports "no suitable node".
But it recovers from this nicely.
@s4ke I will give it a shot!
@s4ke I have been testing what you propose and indeed that works. But it helped me narrow down our problem.
with the following:
for example, which could mean that we have 5 GPUs in the node, right? I would expect that every one of my services would decrease the number of available GPUs by 1, OK? Now running your example code shows:
ubuntu@ip-10-0-2-119:~$ cat /etc/docker/daemon.json
{
"node-generic-resources": ["NVIDIA-GPU=5"]
}
ubuntu@ip-10-0-2-119:~$ docker --version
Docker version 24.0.7, build afdd53b
ubuntu@ip-10-0-2-119:~$ docker network create -d overlay --scope swarm test-network
ir8hbeen70wdduvektz3q3xyp
ubuntu@ip-10-0-2-119:~$ docker service create --network test-network --generic-resource "NVIDIA-GPU=1" --name test-service quay.io/centos/centos:stream8 bash -c "env && sleep infinity"
wp59zh4y58fo8rrsmnnrm90xy
overall progress: 0 out of 1 tasks
1/1: assigned node no longer meets constraints
This does not work. I cannot even start 1 service this way. What is wrong here? Is the way I define the generic resources wrong? Or the way the service is started?
And I may add that not attaching the network does indeed work as well.
docker node inspect self | jq
[
{
"ID": "296wpf736ezdstsx57djzoax1",
"Version": {
"Index": 340
},
"CreatedAt": "2024-01-18T11:38:43.97870017Z",
"UpdatedAt": "2024-01-18T11:42:19.505035908Z",
"Spec": {
"Labels": {},
"Role": "manager",
"Availability": "active"
},
"Description": {
"Hostname": "ip-10-0-2-119",
"Platform": {
"Architecture": "x86_64",
"OS": "linux"
},
"Resources": {
"NanoCPUs": 8000000000,
"MemoryBytes": 33160314880,
"GenericResources": [
{
"DiscreteResourceSpec": {
"Kind": "NVIDIA-GPU",
"Value": 5
}
}
]
},
"Engine": {
"EngineVersion": "24.0.7", |
Have you tried listing all GPUs separately? I will check the discrete example and get back to you. I'm unsure, though, how my code change could have broken this.
@s4ke the point is, in our working system we do not list the GPUs, but the available VRAM amount. Therefore we are very interested in having it work with the numbers again. Basically, listing them is not realistic for us. The idea being that we have a machine with a GPU that has 11000 MB VRAM; then a user can start 1 service with 400 VRAM, another with 200 VRAM and a last one with 10500 VRAM. Unless you have a way to do that, and maybe we can adjust the syntax.
I got that. What is the last version this worked on?
Same as what @mrnicegyu11 said, docker 20.x
You are correct. If you use discrete resources, this is causing an issue. I will try to see what I can do, but this is maybe something we need the cavalry for :D @dperny
I am confused. After playing around with unit tests, it looks like HasResource in validate.go has the boolean query the wrong way around for discrete resources. It is:

case *api.GenericResource_DiscreteResourceSpec:
    if res.GetDiscreteResourceSpec() == nil {
        return false
    }
    if res.GetDiscreteResourceSpec().Value < rtype.DiscreteResourceSpec.Value {
        return false
    }

It should be:

case *api.GenericResource_DiscreteResourceSpec:
    if res.GetDiscreteResourceSpec() == nil {
        return false
    }
    if res.GetDiscreteResourceSpec().Value > rtype.DiscreteResourceSpec.Value {
        return false
    }

HasResource has the following documentation:
This kind of explains why it started working for people with named resources but stopped working for people with discrete resources: my fix inverted the boolean logic for the caller (which was correct), but I missed the wrong logic in HasResource.
If the values match, everything succeeds in scheduling:
martinb@ubuntu:~$ docker service create --network test-network --generic-resource "gpu_test=2" --name test-service-2 quay.io/centos/centos:stream8 bash -c "env && sleep infinity"
m5mv5ubs1tpu7lhmrnzci0lno
overall progress: 0 out of 1 tasks
1/1: assigned node no longer meets constraints
^COperation continuing in background.
Use `docker service ps m5mv5ubs1tpu7lhmrnzci0lno` to check progress.
martinb@ubuntu:~$ docker ^C
martinb@ubuntu:~$ docker service rm test-service-2
test-service-2
martinb@ubuntu:~$ docker service create --network test-network --generic-resource "gpu_test=5" --name test-service-2 quay.io/centos/centos:stream8 bash -c "env && sleep infinity"
2k1vivjkv0yf0k5suzqgansbu
overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service converged
I will prepare a fix and PR it over at swarmkit.
Fix is found and PR'd over at moby/swarmkit. Thanks for the investigation @sanderegg @mrnicegyu11
Hey @s4ke, thank you very much for the super fast reaction time. Looking forward. 💯
@s4ke do you have an idea when it would be released? ETA? Thanks again!
I am merely a contributor to this project, so I don't know. Usually this is taken up quite fast for these kinds of small issues. I hope to get this into 25.0.x, but this is more a question for the folks at docker/mirantis.
PTAL @neersighted @thaJeztah
Update: PR over at swarmkit is merged.
Very strange. We actually have a situation with Docker 24.0.2 where we create the service with discrete resources, and it works if we create it via a stack file. If we create the service via CLI it does not work, and your issue is reproducible @sanderegg
It's released. Please check if your issue is fixed @sanderegg
In my tests, the issue seems to be fixed now.
Description

Issue

In docker swarm, --generic-resource does not work when it is used alongside --network. This is due to an incorrect condition, if genericresource.HasResource(ta, available.Generic), in the constraint_enforcer.go code when the service is brought up. The code should read if !genericresource.HasResource(ta, available.Generic) { so that a task which has an assigned and available GenericResource is not removed. This bug is important as it prevents the usage of generic resources in Docker Swarm; this is particularly relevant for assigning services to nodes based on GPU availability.

The generic-resources feature used to work properly in swarm in version 18.06.1.

Reproduce

Bug Investigation + Reproduction steps

This functionality was working in version 18.06.1 but not in any version afterwards. Each release was tested through these steps. An additional condition required for the bug to occur is that the service must be brought up on a non-manager swarm node:

1. Edit /etc/docker/daemon.json on the worker to add an item to node-generic-resources. Restart the docker service.
2. Add a worker node to the swarm.
3. Create a service with the network attached as well as a generic-resource. This step will fail and the service will never get to the running state.
4. Check docker service ps. These errors continue in a loop where a new task is created and subsequently rejected. This error does not resolve by itself and the service never reaches the Running state.

Expected behavior

docker service create should create a service with a network and generic-resource attached.

docker version

docker info

Additional Info

No response