node null not found error, service stuck with new state #37073

Open
unkusr007 opened this issue May 16, 2018 · 5 comments

@unkusr007

Hello,

Since today I've had an issue with Docker Swarm that I can't fix.
Most of the replicas can't start; they stay in the "new" state without a node assigned:

j33jnn1pa048        backend_preview-internal-backend.1       quay.acme.com/smart/backend.internal:1.3.23   swarm-worker-02.smart.acme.org       Running             Running about an hour ago
c229dwboskik         \_ backend_preview-internal-backend.1   quay.acme.com/smart/backend.internal:1.3.23   swarm-manager-02.smart.acme.org      Shutdown            Shutdown about an hour ago
i0jqy75m4bjg        backend_preview-internal-backend.2       quay.acme.com/smart/backend.internal:1.3.23   swarm-manager-01.smart.acme.org      Running             Running 8 minutes ago
0ui98kdrpxp8        backend_preview-internal-backend.3       quay.acme.com/smart/backend.internal:1.3.23                                        Running             New 9 minutes ago
z3tkqvfyka23        backend_preview-internal-backend.4       quay.acme.com/smart/backend.internal:1.3.23                                        Running             New 9 minutes ago
kpamd2zxgmc2        backend_preview-internal-backend.5       quay.acme.com/smart/backend.internal:1.3.23                                        Running             New 9 minutes ago
ildzzir56340        backend_preview-internal-backend.6       quay.acme.com/smart/backend.internal:1.3.23                                        Running             New 9 minutes ago
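
For reference, a task listing like the one above can be reproduced with docker service ps (the service name here is taken from the task names in the output):

# List all tasks of the service, including the node each task was scheduled on and its state
docker service ps backend_preview-internal-backend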

On all 3 managers, journalctl -fu docker.service shows this log message continuously:

May 16 00:57:36 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:36.199069535Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:41 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:41.275237547Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:41 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:41.338913421Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:41 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:41.414829123Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:46 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:46.487962133Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:46 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:46.549313053Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:46 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:46.621488846Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:51 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:51.692517060Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:51 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:51.753280638Z" level=error msg="Error getting node null: node null not found"
May 16 00:57:51 swarm-manager-02.smart.acme.org env[3479]: time="2018-05-16T00:57:51.824797787Z" level=error msg="Error getting node null: node null not found"

docker version:

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.4
 Git commit:   9ee9f40
 Built:        Thu Apr 26 04:27:49 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   9ee9f40
  Built:        Thu Apr 26 04:27:49 2018
  OS/Arch:      linux/amd64
  Experimental: true

docker info:

Containers: 35
 Running: 35
 Paused: 0
 Stopped: 0
Images: 31
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: l3ku0pkmcdsq1cgx3hnv8a07i
 Is Manager: true
 ClusterID: wpdw3up06qrulow40cb0cketk
 Managers: 3
 Nodes: 6
 Orchestration:
  Task History Retention Limit: 1
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.162.128.41
 Manager Addresses:
  10.162.128.40:2377
  10.162.128.41:2377
  10.162.128.42:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: v0.13.2 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.14.39-coreos
Operating System: Container Linux by CoreOS 1745.2.0 (Rhyolite)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.45GiB
Name: swarm-manager-02.smart.acme.org
ID: 7KC3:YI4V:DYQZ:KK2K:PA2Q:NVDY:AUDP:MJ2U:I2PD:GLKC:UJJL:YPQL
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Thanks !

@thaJeztah
Member

ping @dperny @anshulpundir @chungers PTAL

@unkusr007
Author

I rebuilt my whole cluster, and after a few days the error is present again.
I don't know what to do. It's a production cluster; everything had been working fine for a few months, and now it's barely usable.

@unkusr007
Author

unkusr007 commented May 24, 2018

Please disregard the message "Error getting node null: node null not found"; it comes from a script that I ran on the server.
I've downgraded to Docker 17.09.01 because of the other issue, and it seems stable now.

@dperny
Contributor

dperny commented May 24, 2018

Looks like IP address exhaustion. If the task is stuck in the NEW state, it's because it hasn't passed through the allocator yet. Do these tasks belong to a service that is attached to a network? Are there enough IP addresses on the network for every task?
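
For anyone hitting this, a minimal sketch of how to check for address exhaustion on an overlay network (the network name below is a placeholder; run on a manager node):

# Show the subnet configured for the network
docker network inspect my_overlay_net --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}'

# Verbose output on a manager also lists the services/tasks attached swarm-wide,
# so the addresses in use can be compared against the subnet size
docker network inspect --verbose my_overlay_net

Note that, in addition to one address per task, service VIPs and the per-node load-balancer endpoints on the network also consume addresses, so a /24 can fill up faster than the raw task count suggests.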

@unkusr007
Author

I read about IP address exhaustion, so I recreated my cluster and changed the CIDR to 10.0.0.0/21 instead of /24, for ~150 containers. The problem came back after a few days. One thing fixed it: draining all managers except one, after which all replicas for all my services came up fine. A sketch of those commands follows below.
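
For reference, a sketch of the operations described above, using a manager name from the listing earlier in this issue (the network name is a placeholder; repeat the drain for each manager to be drained):

# Stop scheduling service tasks on a manager and move its existing tasks elsewhere
docker node update --availability drain swarm-manager-02.smart.acme.org

# Later, return the node to normal scheduling
docker node update --availability active swarm-manager-02.smart.acme.org

# Create an overlay network with a larger subnet than the default /24
docker network create --driver overlay --subnet 10.0.0.0/21 backend_preview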
