node null not found error, service stuck with new state #37073

Open
unkusr007 opened this issue May 16, 2018 · 5 comments

@unkusr007

Hello,

Since today I've had an issue with Docker Swarm that I can't fix.
Most of the replicas can't start; they stay in the "new" state without a node assigned:

j33jnn1pa048        backend_preview-internal-backend.1       quay.acme.com/smart/backend.internal:1.3.23   swarm-worker-02.smart.acme.org       Running             Running about an hour ago
c229dwboskik         \_ backend_preview-internal-backend.1   quay.acme.com/smart/backend.internal:1.3.23   swarm-manager-02.smart.acme.org      Shutdown            Shutdown about an hour ago
i0jqy75m4bjg        backend_preview-internal-backend.2       quay.acme.com/smart/backend.internal:1.3.23   swarm-manager-01.smart.acme.org      Running             Running 8 minutes ago
0ui98kdrpxp8        backend_preview-internal-backend.3       quay.acme.com/smart/backend.internal:1.3.23                                        Running             New 9 minutes ago
z3tkqvfyka23        backend_preview-internal-backend.4       quay.acme.com/smart/backend.internal:1.3.23                                        Running             New 9 minutes ago
kpamd2zxgmc2        backend_preview-internal-backend.5       quay.acme.com/smart/backend.internal:1.3.23                                        Running             New 9 minutes ago
ildzzir56340        backend_preview-internal-backend.6       quay.acme.com/smart/backend.internal:1.3.23                                        Running             New 9 minutes ago
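
For reference, a task listing like the one above can be reproduced with docker service ps (the service name here is taken from the task names in the output):

# List all tasks of the service, including the node each task was scheduled on and its state
docker service ps backend_preview-internal-backend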

On all 3 managers, journalctl -fu docker.service shows this log message continuously:

May 16 00:57:36 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:36.199069535Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:41 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:41.275237547Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:41 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:41.338913421Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:41 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:41.414829123Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:46 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:46.487962133Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:46 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:46.549313053Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:46 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:46.621488846Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:51 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:51.692517060Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:51 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:51.753280638Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:51 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:51.824797787Z" level=error msg="Error getting node null: node null not found"

docker version:

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.4
 Git commit:   9ee9f40
 Built:        Thu Apr 26 04:27:49 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   9ee9f40
  Built:        Thu Apr 26 04:27:49 2018
  OS/Arch:      linux/amd64
  Experimental: true

docker info:

Containers: 35
 Running: 35
 Paused: 0
 Stopped: 0
Images: 31
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: l3ku0pkmcdsq1cgx3hnv8a07i
 Is Manager: true
 ClusterID: wpdw3up06qrulow40cb0cketk
 Managers: 3
 Nodes: 6
 Orchestration:
  Task History Retention Limit: 1
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.162.128.41
 Manager Addresses:
  10.162.128.40:2377
  10.162.128.41:2377
  10.162.128.42:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: v0.13.2 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.14.39-coreos
Operating System: Container Linux by CoreOS 1745.2.0 (Rhyolite)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.45GiB
Name: swarm-manager-02.smart.acme.org
ID: 7KC3:YI4V:DYQZ:KK2K:PA2Q:NVDY:AUDP:MJ2U:I2PD:GLKC:UJJL:YPQL
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Thanks !

@thaJeztah
Member

ping @dperny @anshulpundir @chungers PTAL

@unkusr007
Author

I rebuilt my whole cluster, and after a few days the error is present again.
I don't know what to do. It's a production cluster; everything had been working fine for a few months, and now it's barely usable.

@unkusr007
Author

unkusr007 commented May 24, 2018

Please disregard the message "Error getting node null: node null not found"; it comes from a script that I ran on the server.
I've downgraded to Docker 17.09.01 because of the other issue, and it seems stable now.

@dperny
Contributor

dperny commented May 24, 2018

Looks like IP address exhaustion. If the task is stuck in the NEW state, it's because it hasn't passed through the allocator yet. Do these tasks belong to a service that is attached to a network? Are there enough IP addresses on the network for every task?
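
For anyone hitting this, a minimal sketch of how to check for address exhaustion on an overlay network (the network name below is a placeholder; run on a manager node):

# Show the subnet configured for the network
docker network inspect my_overlay_net --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}'

# Verbose output on a manager also lists the services/tasks attached swarm-wide,
# so the addresses in use can be compared against the subnet size
docker network inspect --verbose my_overlay_net

Note that, in addition to one address per task, service VIPs and the per-node load-balancer endpoints on the network also consume addresses, so a /24 can fill up faster than the raw task count suggests.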

@unkusr007
Author

I read about IP address exhaustion, so I recreated my cluster and changed the CIDR to 10.0.0.0/21 instead of /24, for ~150 containers. The problem came back after a few days. One thing fixed it: draining all managers except one, after which all replicas for all my services came up fine. A sketch of those commands follows below.
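
For reference, a sketch of the operations described above, using a manager name from the listing earlier in this issue (the network name is a placeholder; repeat the drain for each manager to be drained):

# Stop scheduling service tasks on a manager and move its existing tasks elsewhere
docker node update --availability drain swarm-manager-02.smart.acme.org

# Later, return the node to normal scheduling
docker node update --availability active swarm-manager-02.smart.acme.org

# Create an overlay network with a larger subnet than the default /24
docker network create --driver overlay --subnet 10.0.0.0/21 backend_preview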
