New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Daemon “Node join event” logs in 17.06 #33962

Open
kinghuang opened this Issue Jul 5, 2017 · 15 comments

Comments

Projects
None yet
10 participants
@kinghuang

kinghuang commented Jul 5, 2017

Description

Since upgrading my swarms to Docker CE 17.06, the Docker daemons have been endlessly outputting info log lines about “Node join event for (some node here)”. Is this normal? They are annoying in that there's a huge number of log lines being generated, but there hasn't been any noticeable ill-effect.

Steps to reproduce the issue:

  1. Given a 10-node Docker swarm on Docker CE 17.05, upgrade the entire swarm to 17.06.

Describe the results you received:

The Docker daemon logs on all the swarm nodes are endlessly outputting log lines about node join events. Here's a brief sample.

Jul 05 06:48:00 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:48:00.549832682-06:00" level=info msg="Node join event for itrmsdev06.ucalgary.ca-1765e9fefb75/10.41.149.149"
Jul 05 06:48:02 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:48:02.849150257-06:00" level=info msg="Node join event for itrmsdev07.ucalgary.ca-0550e75fb99a/10.41.149.147"
Jul 05 06:48:32 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:48:32.854937108-06:00" level=info msg="Node join event for itrmsdev08.ucalgary.ca-ba7e5b1ad050/10.41.149.87"
Jul 05 06:49:02 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:49:02.858112455-06:00" level=info msg="Node join event for itrmsdev06.ucalgary.ca-1765e9fefb75/10.41.149.149"
Jul 05 06:49:04 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:49:04.130278304-06:00" level=info msg="Node join event for itrmsdev02.ucalgary.ca-a87b5ed6e3c6/10.41.149.139"
Jul 05 06:49:32 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:49:32.863643544-06:00" level=info msg="Node join event for itrmsdev02.ucalgary.ca-a87b5ed6e3c6/10.41.149.139"
Jul 05 06:49:36 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:49:36.352351798-06:00" level=info msg="Node join event for itrmsdev07.ucalgary.ca-0550e75fb99a/10.41.149.147"
Jul 05 06:49:53 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:49:53.765021527-06:00" level=info msg="Node join event for itrmsdev04.ucalgary.ca-4eb27ff0412f/10.41.149.148"
Jul 05 06:49:53 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:49:53.815197302-06:00" level=info msg="Node join event for itrmsdev10.ucalgary.ca-ba90731aa607/10.41.149.84"
Jul 05 06:50:02 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:50:02.868975750-06:00" level=info msg="Node join event for itrmsdev04.ucalgary.ca-4eb27ff0412f/10.41.149.148"
Jul 05 06:50:06 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:50:06.359332176-06:00" level=info msg="Node join event for itrmsdev07.ucalgary.ca-0550e75fb99a/10.41.149.147"
Jul 05 06:50:18 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:50:18.518216484-06:00" level=info msg="Node join event for itrmsdev08.ucalgary.ca-ba7e5b1ad050/10.41.149.87"
Jul 05 06:50:23 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:50:23.820983401-06:00" level=info msg="Node join event for itrmsdev10.ucalgary.ca-ba90731aa607/10.41.149.84"
Jul 05 06:50:32 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:50:32.894985130-06:00" level=info msg="Node join event for itrmsdev04.ucalgary.ca-4eb27ff0412f/10.41.149.148"
Jul 05 06:50:36 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:50:36.364264838-06:00" level=info msg="Node join event for itrmsdev07.ucalgary.ca-0550e75fb99a/10.41.149.147"
Jul 05 06:50:48 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:50:48.524466576-06:00" level=info msg="Node join event for itrmsdev08.ucalgary.ca-ba7e5b1ad050/10.41.149.87"
Jul 05 06:51:02 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:51:02.900238466-06:00" level=info msg="Node join event for itrmsdev08.ucalgary.ca-ba7e5b1ad050/10.41.149.87"
Jul 05 06:51:18 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:51:18.532095242-06:00" level=info msg="Node join event for itrmsdev08.ucalgary.ca-ba7e5b1ad050/10.41.149.87"
Jul 05 06:51:23 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:51:23.836963853-06:00" level=info msg="Node join event for itrmsdev10.ucalgary.ca-ba90731aa607/10.41.149.84"
Jul 05 06:51:32 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:51:32.906082694-06:00" level=info msg="Node join event for itrmsdev10.ucalgary.ca-ba90731aa607/10.41.149.84"
Jul 05 06:51:34 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:51:34.168579959-06:00" level=info msg="Node join event for itrmsdev02.ucalgary.ca-a87b5ed6e3c6/10.41.149.139"
Jul 05 06:51:36 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:51:36.380119279-06:00" level=info msg="Node join event for itrmsdev07.ucalgary.ca-0550e75fb99a/10.41.149.147"
Jul 05 06:51:47 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:51:47.406343066-06:00" level=info msg="Node join event for itrmsdev09.ucalgary.ca-491ef8ad6092/10.41.149.86"
Jul 05 06:51:48 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:51:48.538771147-06:00" level=info msg="Node join event for itrmsdev08.ucalgary.ca-ba7e5b1ad050/10.41.149.87"
Jul 05 06:52:02 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:52:02.911722699-06:00" level=info msg="Node join event for itrmsdev04.ucalgary.ca-4eb27ff0412f/10.41.149.148"
Jul 05 06:52:32 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:52:32.918801817-06:00" level=info msg="Node join event for itrmsdev09.ucalgary.ca-491ef8ad6092/10.41.149.86"
Jul 05 06:52:53 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:52:53.860424186-06:00" level=info msg="Node join event for itrmsdev10.ucalgary.ca-ba90731aa607/10.41.149.84"
Jul 05 06:53:00 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:53:00.615539332-06:00" level=info msg="Node join event for itrmsdev06.ucalgary.ca-1765e9fefb75/10.41.149.149"
Jul 05 06:53:02 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:53:02.923647658-06:00" level=info msg="Node join event for itrmsdev02.ucalgary.ca-a87b5ed6e3c6/10.41.149.139"
Jul 05 06:53:04 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:53:04.187673479-06:00" level=info msg="Node join event for itrmsdev02.ucalgary.ca-a87b5ed6e3c6/10.41.149.139"
Jul 05 06:53:23 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:53:23.866587231-06:00" level=info msg="Node join event for itrmsdev10.ucalgary.ca-ba90731aa607/10.41.149.84"
Jul 05 06:53:30 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:53:30.621512944-06:00" level=info msg="Node join event for itrmsdev06.ucalgary.ca-1765e9fefb75/10.41.149.149"
Jul 05 06:53:32 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:53:32.928307151-06:00" level=info msg="Node join event for itrmsdev03.ucalgary.ca-ce40e4760502/10.41.149.138"
Jul 05 06:53:45 itrmsdev01.ucalgary.ca dockerd[61491]: time="2017-07-05T06:53:45.632933649-06:00" level=info msg="Node join event for itrmsdev03.ucalgary.ca-ce40e4760502/10.41.149.138"

Describe the results you expected:

Previous versions of Docker did not produce these log lines. They are adding very significant noise to the daemon logs. I'm not sure if this indicates a problem, or what node join events mean in this context.

Additional information you deem important (e.g. issue happens only occasionally):

The log line comes from libnetwork in networkdb/delegate.go for NodeEventTypeJoin, and was merged in docker/libnetwork#1775.

Output of docker version:

→  ~ docker version
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:31:53 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:21:56 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

→  ~ docker info
Containers: 88
 Running: 37
 Paused: 0
 Stopped: 51
Images: 681
Server Version: 17.06.0-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: gelf
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: 5i7ghc9agzge2hqfrmrgcucvn
 Is Manager: true
 ClusterID: m0gz05zqgmhx67noqdvko2npr
 Managers: 3
 Nodes: 10
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: 10.41.149.137
 Manager Addresses:
  10.41.149.137:2377
  10.41.149.138:2377
  10.41.149.139:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.21.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.51GiB
Name: itrmsdev01.ucalgary.ca
ID: MYIJ:6PFS:CZR5:QGU6:2ZBA:S3T7:3ER5:3QRM:5QYB:ZM22:SH6N:S2JP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

The data here is from a 10-node Docker CE 17.06.0 swarm on RHEL 7.3. My other Docker CE 17.06.0 swarms also exhibit the same behaviour.

@cpuguy83

This comment has been minimized.

Contributor

cpuguy83 commented Jul 5, 2017

Looks like something in libnetwork, which I don't see in master.

Ping @sanimej

@sanimej

This comment has been minimized.

sanimej commented Jul 5, 2017

@kinghuang This should have been a debug message.

All the cluster nodes exchange a gossip message to indicate what networks are currently active on a node. This message is also synced over TCP every 30 seconds or if a peer queries for it. This doesn't indicate nor should cause any issue other than the chatty logs. I will push a change to fix it.

@thaJeztah

This comment has been minimized.

Member

thaJeztah commented Jul 8, 2017

Opened docker/libnetwork#1833 to change the logging level from info to debug

@thaJeztah

This comment has been minimized.

Member

thaJeztah commented Aug 5, 2017

The noisy logs are resolved on master through 68d8198 (#34103), and included in 17.07

@thaJeztah thaJeztah closed this Aug 5, 2017

@Cas-pian

This comment has been minimized.

Cas-pian commented Aug 16, 2017

@thaJeztah Will this also be fixed in 17.06.1?

@matzegebbe

This comment has been minimized.

matzegebbe commented Aug 22, 2017

@Cas-pian For me I still see it in 17.06.1

Aug 22 09:30:24 emea-jas-a13p dockerd[12307]: time="2017-08-22T09:30:24.697638549+02:00" level=info msg="Node join event for emea-jas-a16p-5d424159f458/172.20.26.21"

root@xxxx:~# docker version
Client:
 Version:      17.06.1-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   874a737
 Built:        Thu Aug 17 22:01:59 2017
 OS/Arch:      linux/amd64
@kinghuang

This comment has been minimized.

kinghuang commented Aug 31, 2017

I've upgraded to Docker CE 17.07 on one of my swarms, but I'm still getting the “Node join event …” log lines at info level.

@thaJeztah Is the fix for sure in 17.07? Or, is this a new problem?

Aug 31 16:52:16 itrmswebdev01.ucalgary.ca dockerd[27005]: time="2017-08-31T16:52:16.106887576-06:00" level=info msg="Node join event for itrmswebdev02.ucalgary.ca-452ee701da73/10.43.64.120"
Aug 31 16:52:46 itrmswebdev01.ucalgary.ca dockerd[27005]: time="2017-08-31T16:52:46.110052018-06:00" level=info msg="Node join event for itrmswebdev03.ucalgary.ca-76769f98baac/10.43.64.121"
Aug 31 16:53:14 itrmswebdev01.ucalgary.ca dockerd[27005]: time="2017-08-31T16:53:14.323864136-06:00" level=info msg="Node join event for itrmswebdev03.ucalgary.ca-76769f98baac/10.43.64.121"
Aug 31 16:53:16 itrmswebdev01.ucalgary.ca dockerd[27005]: time="2017-08-31T16:53:16.114292669-06:00" level=info msg="Node join event for itrmswebdev02.ucalgary.ca-452ee701da73/10.43.64.120"
Aug 31 16:53:44 itrmswebdev01.ucalgary.ca dockerd[27005]: time="2017-08-31T16:53:44.327674147-06:00" level=info msg="Node join event for itrmswebdev03.ucalgary.ca-76769f98baac/10.43.64.121"
Aug 31 16:53:46 itrmswebdev01.ucalgary.ca dockerd[27005]: time="2017-08-31T16:53:46.117058648-06:00" level=info msg="Node join event for itrmswebdev03.ucalgary.ca-76769f98baac/10.43.64.121"
[kchuang@itrmswebdev01 ~]$ docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 17.07.0-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: y1xhcbb4im3wrj1e4yd6jwksh
 Is Manager: true
 ClusterID: v35u6m3fg05esczglsvdsoqur
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: 10.43.64.119
 Manager Addresses:
  10.43.64.119:2377
  10.43.64.120:2377
  10.43.64.121:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.4 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.694GiB
Name: itrmswebdev01.ucalgary.ca
ID: ZN6S:FSZV:RPTY:GGAZ:DK2F:UVE3:RDUO:CPF4:LQUJ:TBJW:4HSR:OQ2C
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
@kinghuang

This comment has been minimized.

kinghuang commented Sep 8, 2017

@thaJeztah I've completed upgrading my swarms to 17.07.0-ce. They're all still outputing the “Node join event…” messages.

@thaJeztah

This comment has been minimized.

Member

thaJeztah commented Sep 11, 2017

I can confirm the logs are still noisy; I dived a bit in the code to understand what everything does, and think I see what's happening (I'm not really familiar with that code, so explaining what I saw below);

First the node is retrieved from the database through getNode(), followed by some logic for single-node managers.

After that, the node is retrieved from the database again, this time using checkAndGetNode()

Looking at checkAndGetNode(), that method checks if the cached status of the node found was set by the current (or later) event and if not (thus from an older event), the node's state is removed.
In practice, this is likely to happen (i.e. a "node join" event adds the node to the nDB.nodes list, and the next event removes it from nDB.nodes).

After that, the code I added checks if the node's status was already known as "joined" (due to the explanation above, it won't be), and after that, adds the node to the list of joined nodes, and if it was a "new" join (again, probably "always"), logs the node join event.

Possible solutions:

  • check if the node exists in nDB.nodes before calling nDB.checkAndGetNode()
  • rethink nDB.checkAndGetNode() - why is the node's always removed? Instead, we could probably;
    • inside the switch nEvent.Type update the node's state depending on the type of event, and remove the node from all other states
    • if an unknown event is received, remove the node from all lists (or something?)

@fcrisciani @abhinandanpb any suggestions?

@fcrisciani

This comment has been minimized.

Contributor

fcrisciani commented Sep 11, 2017

@thaJeztah the proper solution is in splitting the lamport time between the node events and network events and after that correct the use of the witness.

The witness clock always checks that the node lamport time is max(local, event) and if that is > than the local one it increments it by 1 and update the local one. Every 30s there is a TCP sync that exchange the full series of messages and the node related ones are getting the latest lamport time that is in the node, that got bumped by the previous logic that I was mentioning, creating the illusion to other nodes that the events received are actually new and so printing the log (and incrementing again the lamport time).
The proper solution is to make sure that the lamport time of events remains consistent and old node join events are not getting advertised with higher lamport time.
I hope that the explanation is not too convoluted, it took me some time too to understand why that was happening

@MarvAmBass

This comment has been minimized.

MarvAmBass commented Nov 8, 2017

Hi,

I'm still getting this error with version 'Server Version: 17.09.0-ce'.

@fcrisciani your solution has to do with the source code not with the server setup right? Because I don't know how to make sure that the lamport time remains consistent from the server/node side.

Thanks

@fcrisciani

This comment has been minimized.

Contributor

fcrisciani commented Nov 9, 2017

@MarvAmBass yep correct, that will require a code change, at the moment not being a breaking issue that creates traffic disruption is not getting much priority.

@agilob

This comment has been minimized.

agilob commented Dec 8, 2017

The noisy logs are resolved on master through 68d8198 (#34103), and included in 17.07

I'm still getting this on docker -v version 17.11.0-ce, build 1caf76c

Dec 08 21:14:47 rpi1 dockerd[403]: time="2017-12-08T21:14:47.461682063Z" level=info msg="Node join event for b6
Dec 08 21:14:48 rpi1 dockerd[403]: time="2017-12-08T21:14:48.268214535Z" level=info msg="Node join event for b6
Dec 08 21:14:48 rpi1 dockerd[403]: time="2017-12-08T21:14:48.461724884Z" level=info msg="Node join event for b6
Dec 08 21:14:49 rpi1 dockerd[403]: time="2017-12-08T21:14:49.348132454Z" level=info msg="Node join event for b6
Dec 08 21:14:49 rpi1 dockerd[403]: time="2017-12-08T21:14:49.472665401Z" level=info msg="Node join event for b6
Dec 08 21:14:50 rpi1 dockerd[403]: time="2017-12-08T21:14:50.283037118Z" level=info msg="Node join event for b6
Dec 08 21:14:50 rpi1 dockerd[403]: time="2017-12-08T21:14:50.461113483Z" level=info msg="Node join event for b6
Dec 08 21:14:51 rpi1 dockerd[403]: time="2017-12-08T21:14:51.268262994Z" level=info msg="Node join event for b6
Dec 08 21:14:51 rpi1 dockerd[403]: time="2017-12-08T21:14:51.461198307Z" level=info msg="Node join event for b6
Dec 08 21:14:52 rpi1 dockerd[403]: time="2017-12-08T21:14:52.272784101Z" level=info msg="Node join event for b6
Dec 08 21:14:52 rpi1 dockerd[403]: time="2017-12-08T21:14:52.461887169Z" level=info msg="Node join event for b6
Dec 08 21:14:53 rpi1 dockerd[403]: time="2017-12-08T21:14:53.279151326Z" level=info msg="Node join event for b6
Dec 08 21:14:53 rpi1 dockerd[403]: time="2017-12-08T21:14:53.463325080Z" level=info msg="Node join event for b6
Dec 08 21:14:54 rpi1 dockerd[403]: time="2017-12-08T21:14:54.273065756Z" level=info msg="Node join event for b6
Dec 08 21:14:54 rpi1 dockerd[403]: time="2017-12-08T21:14:54.461628790Z" level=info msg="Node join event for b6
Dec 08 21:14:55 rpi1 dockerd[403]: time="2017-12-08T21:14:55.267522221Z" level=info msg="Node join event for b6
Dec 08 21:14:55 rpi1 dockerd[403]: time="2017-12-08T21:14:55.461482600Z" level=info msg="Node join event for b6
@thaJeztah

This comment has been minimized.

Member

thaJeztah commented Dec 8, 2017

@agilob if you read beyond that you would have seen that this was reopened and my later comment: #33962 (comment)

@fcrisciani

This comment has been minimized.

Contributor

fcrisciani commented Dec 8, 2017

17.12 should fix that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment