Pods are sometimes deployed without internal network connectivity #14092

Closed
andrewklau opened this issue May 7, 2017 · 62 comments

@andrewklau
Contributor

andrewklau commented May 7, 2017

The projects that are running CronJobs seem to eventually lose internal network connectivity.

e.g. a PHP + MySQL setup, with a cron job running every minute whose PHP script connects to MySQL.

By the end of the day, the PHP web server is unable to connect to the MySQL server (connection timeout).

The same thing happens with Redis and NodeJS, so it doesn't appear to be technology-specific.

These are the same ScheduledJobs that were being run on 1.4 (exported and re-imported as CronJobs).

Even after cleaning up the completed jobs (jobs older than 2 hours are deleted), the issue still occurs.

Projects without CronJobs don't appear to be affected.
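For reference, a minimal sketch of the kind of CronJob involved (the name, image, and batch/v2alpha1 API group are assumptions for this release, not the actual manifest):

oc create -f - <<'EOF'
apiVersion: batch/v2alpha1          # CronJob API group on this release (assumption)
kind: CronJob
metadata:
  name: db-healthcheck              # hypothetical name
spec:
  schedule: "*/1 * * * *"           # every minute
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: healthcheck
            image: php:7.0          # illustrative image containing health.php
            command: ["php", "health.php"]   # pings the MySQL service
          restartPolicy: OnFailure
EOF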

Version

openshift v1.5.0+031cbe4
kubernetes v1.5.2+43a9be4

@knobunc
Contributor

knobunc commented May 9, 2017

Can you please get the logs from the node where the failing pod is running?

@andrewklau
Contributor Author

The issue only happens after a day or two of running; I had suspended all cron jobs when they started bringing down their services.

I'm re-enabling some and will post back the logs when it happens again.

@andrewklau
Contributor Author

I was able to replicate it again. The issue doesn't seem to relate particularly to CronJobs, but it happens mostly on projects that are running them, most of the time right after a build succeeds and a new deployment is created.

Below are the errors:


May 13 09:23:05 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: I0513 09:23:05.875583   14242 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/f1124b4b-3769-11e7-ba45-06579dce22e7-default-token-zdk92" (spec.Name: "default-token-zdk92") pod "f1124b4b-3769-11e7-ba45-06579dce22e7" (UID: "f1124b4b-3769-11e7-ba45-06579dce22e7").
May 13 09:23:05 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: I0513 09:23:05.875583   14242 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/f1124b4b-3769-11e7-ba45-06579dce22e7-default-token-zdk92" (spec.Name: "default-token-zdk92") pod "f1124b4b-3769-11e7-ba45-06579dce22e7" (UID: "f1124b4b-3769-11e7-ba45-06579dce22e7").
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[127682]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port veth00735fc5
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: device veth00735fc5 left promiscuous mode
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[127114]: DHCPDISCOVER on veth58a37e85 to 255.255.255.255 port 67 interval 11 (xid=0x268e1510)
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127701-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127701-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[127712]: ovs|00001|db_ctl_base|ERR|no row "veth00735fc5" in table Port
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[127717]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port veth00735fc5
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127713-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: W0513 09:23:06.195544   14242 multitenant.go:156] refcounting error on vnid 14177148
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: W0513 09:23:06.195544   14242 multitenant.go:156] refcounting error on vnid 14177148
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127713-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal oci-systemd-hook[127730]: systemdhook <debug>: Skipping as container command is /pod, not init or systemd
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127732-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127732-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-10): Unmounting Filesystem
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[127397]: receive_packet failed on veth00735fc5: Network is down
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494631386.3021] device (veth00735fc5): driver 'veth' does not support carrier detection.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494631386.3027] device (veth00735fc5): state change: ip-config -> unmanaged (reason 'unmanaged') [70 10 3]
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494631386.3097] dhcp4 (veth00735fc5): canceled DHCP transaction, DHCP client pid 127397
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494631386.3098] dhcp4 (veth00735fc5): state changed unknown -> done
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: time="2017-05-13T09:23:06.373001459+10:00" level=error msg="Handler for POST /v1.24/containers/53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48/stop returned error: Container 53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48 is already stopped"
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: W0513 09:23:06.373203   14242 docker_manager.go:1582] No ref for pod '"53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48 gg/pi-1494631380-5286k"'
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: W0513 09:23:06.373203   14242 docker_manager.go:1582] No ref for pod '"53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48 gg/pi-1494631380-5286k"'

To clarify, running CronJobs alone doesn't seem to break the project networking. It seems to happen mostly when a build finishes and a new deployment starts, but only on projects that are running CronJobs, so perhaps it's some sort of race condition if the two happen at the same time?

I have the full log dump of one of the nodes where this happened.

@andrewklau
Contributor Author

More relevant logs


May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.0812] device (veth77b313c0): link connected
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.0827] manager: (veth77b313c0): new Veth device (/org/freedesktop/NetworkManager/Devices/4663)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1029] device (veth77b313c0): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1073] ifcfg-rh: add connection in-memory (87e3c85c-92b0-3cd4-bed2-a5d37ba0b0da,"Wired connection 3")
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1082] settings: (veth77b313c0): created default wired connection 'Wired connection 3'
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1099] device (veth77b313c0): state change: unavailable -> disconnected (reason 'none') [20 30 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1114] policy: auto-activating connection 'Wired connection 3'
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1133] device (veth77b313c0): Activation: starting connection 'Wired connection 3' (87e3c85c-92b0-3cd4-bed2-a5d37ba0b0da)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1135] device (veth77b313c0): state change: disconnected -> prepare (reason 'none') [30 40 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1144] device (veth77b313c0): state change: prepare -> config (reason 'none') [40 50 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1157] device (veth77b313c0): state change: config -> ip-config (reason 'none') [50 70 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1162] dhcp4 (veth77b313c0): activation: beginning transaction (timeout in 45 seconds)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[101569]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 veth77b313c0
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: device veth77b313c0 entered promiscuous mode
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1790] dhcp4 (veth77b313c0): dhclient started with pid 101571
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630086.1809] device (veth77b313c0): enslaved to non-master-type device ovs-system; ignoring
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[101571]: DHCPDISCOVER on veth77b313c0 to 255.255.255.255 port 67 interval 7 (xid=0x74555c58)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Mounting V5 Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Ending clean mount
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: I0513 09:01:26.441869   14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-worker-110-c8x7z_(ec154037-3766-11e7-ba45-06579dce22e7)": "10.131.10.130"
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: I0513 09:01:26.441869   14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-worker-110-c8x7z_(ec154037-3766-11e7-ba45-06579dce22e7)": "10.131.10.130"
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Unmounting Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Mounting V5 Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Ending clean mount
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Unmounting Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Mounting V5 Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Ending clean mount
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101641-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101641-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: nf_conntrack: falling back to vmalloc.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: nf_conntrack: falling back to vmalloc.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Started docker container 30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Starting docker container 30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Mounting V5 Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal oci-systemd-hook[101678]: systemdhook <debug>: Skipping as container command is /pod, not init or systemd
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101679-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101679-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Ending clean mount
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: E0513 09:01:27.284821   14242 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)" container "POD": symlink  /var/log/containers/auth-139-ljc8r__POD-30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.log: no such file or directory
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: E0513 09:01:27.284821   14242 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)" container "POD": symlink  /var/log/containers/auth-139-ljc8r__POD-30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.log: no such file or directory
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Unmounting Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: E0513 09:01:27.370479   14242 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'auth-139-ljc8r' - Unexpected command output Device "eth0" does not exist.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]:  with error: exit status 1
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: E0513 09:01:27.370479   14242 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'auth-139-ljc8r' - Unexpected command output Device "eth0" does not exist.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: with error: exit status 1
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4540] manager: (vethf0daeab9): new Veth device (/org/freedesktop/NetworkManager/Devices/4664)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4562] device (vethf0daeab9): link connected
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4563] device (vethf0daeab9): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4600] ifcfg-rh: add connection in-memory (c9cf445a-4141-3e70-a65c-87a3588f6c49,"Wired connection 11")
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4608] settings: (vethf0daeab9): created default wired connection 'Wired connection 11'
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4637] device (vethf0daeab9): state change: unavailable -> disconnected (reason 'none') [20 30 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4677] policy: auto-activating connection 'Wired connection 11'
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4704] device (vethf0daeab9): Activation: starting connection 'Wired connection 11' (c9cf445a-4141-3e70-a65c-87a3588f6c49)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4706] device (vethf0daeab9): state change: disconnected -> prepare (reason 'none') [30 40 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4714] device (vethf0daeab9): state change: prepare -> config (reason 'none') [40 50 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[101727]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 vethf0daeab9
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4724] device (vethf0daeab9): state change: config -> ip-config (reason 'none') [50 70 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.4727] dhcp4 (vethf0daeab9): activation: beginning transaction (timeout in 45 seconds)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: device vethf0daeab9 entered promiscuous mode
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.5142] dhcp4 (vethf0daeab9): dhclient started with pid 101729
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info>  [1494630087.5176] device (vethf0daeab9): enslaved to non-master-type device ovs-system; ignoring
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[101729]: DHCPDISCOVER on vethf0daeab9 to 255.255.255.255 port 67 interval 5 (xid=0xe7661df)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Mounting V5 Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Ending clean mount
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Unmounting Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: I0513 09:01:27.921616   14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)": "10.131.10.131"
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: I0513 09:01:27.921616   14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)": "10.131.10.131"

auth-worker-110-c8x7z and auth-139-ljc8r are two of the pods that recently came up and errored. Other pods came up without a problem.

@andrewklau
Contributor Author

andrewklau commented May 15, 2017

I have been able to repeatedly replicate this issue on two different clusters. Steps:

  • Create a project
  • Create from a template (e.g. cakephp-mysql)
  • Scale up to at least 2 replicas
  • Create a CronJob that runs php health.php every minute (this just pings the MySQL database)
  • Ensure healthchecks are hitting health.php (this helps catch the DNS lookup failure, since the pod will CrashLoop)
  • Trigger a new build
    (repeat new builds until eventually a pod fails, unable to connect to the database)

It seems to happen when a new deployment starts at the same time the CronJob executes. It happens every 2-3 builds.

Delete the CronJob and all completed Jobs and keep triggering builds, and the issue no longer occurs. (A scripted version of these steps is sketched below.)
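Roughly scripted, as a sketch (the template/buildconfig name is from the cakephp-mysql example and may differ per release; the CronJob is the hypothetical one sketched in the first comment):

oc new-project repro-test
oc new-app cakephp-mysql-example                  # template name is an assumption
oc scale dc/cakephp-mysql-example --replicas=2
# create the every-minute health-check CronJob, then:
while true; do
  oc start-build cakephp-mysql-example --follow   # a successful build triggers a new deployment
  oc get pods | grep -i crashloop && break        # stop once a pod is failing its DB health check
done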

@andrewklau
Copy link
Contributor Author

After further tests, it doesn't appear to be related to CronJobs at all. It was just occurring more frequently on projects that had CronJobs.

@andrewklau andrewklau changed the title Projects using CronJobs eventually lose internal network connectivity Pods are sometimes deployed without internal network connectivity May 15, 2017
@harobed

harobed commented May 17, 2017

@mbalazs

mbalazs commented May 17, 2017

I have experienced very similar issues after upgrading to 1.5. I don't have any CronJobs though, so I can confirm it is probably unrelated to CronJobs.

Occasionally, when a Pod starts up, it does not have network connectivity. It happens to about one in 5 newly created pods. Restarting (deleting) the Pod, or restarting the origin-node or openvswitch service on the node the Pod is running on, fixes the problem.
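For anyone hitting this, those workarounds amount to (the pod name is a placeholder):

oc delete pod <broken-pod>          # reschedules the pod
# or, on the node the pod is running on:
systemctl restart origin-node
systemctl restart openvswitch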

I have tried to investigate a bit further, and I noticed that in this setup:

(project A / pod1) <-> Node1 <-> Node2 <-> (project A / pod2)

If pod1 is experiencing this issue and I ping pod2 from pod1, the packets actually arrive at pod2, and it sends ICMP replies, which come back encapsulated in VXLAN to Node1, but they never show up on pod1's virtual interface. So the issue seemed to be with the VXLAN termination of incoming traffic to the Pod. At this point I started looking at the OpenFlow rules but couldn't determine anything; nothing was trivially missing, so I restarted origin-node on Node1, which resolved the issue. But it's still an annoyance, as it happens fairly frequently.
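For what it's worth, that observation can be made with something like this on Node1 (a sketch; the veth name is hypothetical, use whatever host-side interface the pod actually has, e.g. from ovs-vsctl list-ports br0):

tcpdump -ni eth0 'udp port 4789'         # ICMP replies arriving from Node2 over the VXLAN overlay
tcpdump -ni veth00735fc5 icmp            # ...but nothing showing up on pod1's veth
ovs-ofctl -O OpenFlow13 dump-flows br0   # the OpenFlow rules programmed on the SDN bridge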

@danwinship
Contributor

May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: W0513 09:23:06.195544   14242 multitenant.go:156] refcounting error on vnid 14177148

That's the problem; something is going wrong in the networking code and causing it to drop OVS rules for a namespace when, apparently, the namespace is still in use.
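One way to see the symptom (a sketch, not an official diagnostic): the multitenant plugin keys a namespace's traffic by its VNID in OVS register reg0, so when its rules are dropped the VNID disappears from the flow table. Using the VNID from the warning above:

ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -i d8537c   # 0xd8537c == 14177148; no matches means no rules left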

Could you restart the node with --loglevel=5, run until the error shows up, and then post the logs (from just the atomic-openshift-node service) somewhere? (It will probably be way too large to post here but you could use a pastebin and post just a link, or you could email it to me (danw at redhat))
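On an RPM-style install that amounts to something like the following (a sketch; the sysconfig path and flag placement are assumptions, adjust for containerised installs):

sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/origin-node
systemctl restart origin-node
# reproduce the failure, then collect just the node service logs
journalctl -u origin-node > origin-node.log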

@danwinship
Contributor

from just the atomic-openshift-node service

where by "atomic-openshift-node" I probably mean "origin-node" I guess...

@andrewklau
Contributor Author

I'll try to replicate this again on a test cluster so I don't need to update all the nodes to loglevel=5

@mbalazs curious if you are also running on AWS?

@mbalazs

mbalazs commented May 17, 2017

@andrewklau No, it's on a private OpenNebula installation. Some additional info that might help in reproducing:

  1. I have two deployments in the project, a webapp called "vanilla" and a mysql server: https://pastebin.com/RL90QCsd
  2. The problem manifests itself by the webapp not being able to connect to mysql through the mysql service. (It cannot access the pod directly, either). Both deployments are scaled to 1 replica.
  3. It seems that by only deleting the "vanilla-..." pods over and over again, the issue does not manifest at all; the new pods always work. However, if I do an "oc rollout latest vanilla", a broken pod presents itself after 3-4 iterations. I guess it's probably related to multiple pods being started / stopped at the same time...

@andrewklau
Contributor Author

@danwinship I am emailing the logs to you now

I was able to replicate this fairly quickly:

  • Create a 2-node cluster (CentOS, openshift-ansible, multitenant SDN, containerised install)
  • Deploy cakephp-ex
  • nodeSelector'd php to the master
  • nodeSelector'd mysql to the node

Continued to oc rollout on cakephp-ex; by deployment 6 it crashed. (Roughly the loop sketched below.)
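As a sketch (deploymentconfig name assumed to be cakephp-ex):

for i in $(seq 1 10); do
  oc rollout latest dc/cakephp-ex
  oc rollout status dc/cakephp-ex   # wait for the new deployment to finish
  # then test php -> mysql connectivity across the two nodes
done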

@gravis

gravis commented May 18, 2017

Please note that we're having the exact same issue, and it's not related to CronJobs.
I hope we'll have a patch soon, otherwise we'll have to downgrade to 1.4 :(
thanks

@knobunc
Contributor

knobunc commented May 18, 2017

Can you tell me if the broken pod has an IP address? Can it ping the node it is on? Can the node ping it? (I'm wondering if this is an ARP cache problem)
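Concretely, those checks amount to something like this (pod/IP names are placeholders; assumes the pod image ships ping):

oc get pod <broken-pod> -o wide            # does the pod have an IP, and which node is it on?
oc rsh <broken-pod> ping -c3 <node-ip>     # pod -> node
ping -c3 <pod-ip>                          # node -> pod (run on that node)
ip neigh show | grep <pod-ip>              # the node's ARP cache entry for the pod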

@mbalazs

mbalazs commented May 18, 2017

@knobunc In my case, the broken pod (php) has an IP address, and it can ping other nodes in the cluster. It can also ping the docker registry pod. And I can also ping the broken pod from the cluster nodes and from the registry pod.

However, it cannot ping the other pod (mysql) in the same project, and vice versa. The other pod can also ping all the nodes and the registry. So the only thing that seems to be broken is the communication between these two pods.

@andrewklau
Contributor Author

@knobunc I have a 2-node test cluster running that I created to replicate the problem. It has hit the problem, if you would like access. It's also running at --loglevel=5

@andrewklau
Contributor Author

@knobunc @danwinship are there any additional logs or access I can provide to further identify what might be the cause?

@knobunc
Contributor

knobunc commented May 24, 2017

From that pod, can you ping pods that are on the same node?

What plugin are you using for networking?

@andrewklau
Contributor Author

It seems to only happen with the multitenant plugin.

@harobed

harobed commented May 24, 2017

Yes: multitenant

@andrewklau
Contributor Author

This is starting to happen more frequently. Were you able to get logs from @weliang1's environment, or should I reproduce it on a cluster and gather logs again?

@knobunc
Contributor

knobunc commented Jun 6, 2017

@weliang1 has the logs we need... we're trying to track down the race. Thanks!

@knobunc knobunc assigned pecameron and unassigned pecameron Jun 6, 2017
@andrewklau
Contributor Author

Is the problem in openvswitch or the node? I'm wondering if it'd be possible to run a downgraded openvswitch v1.4.1 just so this stops causing app outages in the interim.

@harobed

harobed commented Jun 8, 2017

I'm wondering if it'd be possible to run a downgraded openvswitch v1.4.1 just so this stops causing app outages in the interim.

Can someone do the test? :)

@gravis

gravis commented Jun 8, 2017

Are you close to having a fix, or is it worth downgrading now?
Thanks

@danwinship
Contributor

it's not in openvswitch

@gravis

gravis commented Jun 8, 2017

@andrewklau We have downgraded our nodes to 1.4.1 and the issue seems to be gone. All pods are starting properly now. We're waiting a few days before declaring victory, but so far so good.
PS: we downgraded only the nodes (origin-node), not ovs nor the master(s).
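For a containerised install, that node-only downgrade is roughly (a sketch; IMAGE_VERSION in /etc/sysconfig/origin-node is an assumption about how the containerised service is configured):

sed -i 's/^IMAGE_VERSION=.*/IMAGE_VERSION=v1.4.1/' /etc/sysconfig/origin-node
systemctl restart origin-node    # leaves openvswitch and the masters untouched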

@andrewklau
Contributor Author

@gravis did you roll out new nodes or just switch the node container image?

@knobunc when/if(?) we get a fix, will origin get a minor version bump, or would we need to wait for 3.6?

@gravis

gravis commented Jun 9, 2017 via email

@andrewklau
Contributor Author

I'm not sure if this is related, but I noticed this pop up in one of the events for a 1.5 node:

 "SetupNetwork" for "wordpress-5-deploy_retirement" with SetupNetworkError: "Failed to setup network for pod \"wordpress-5-deploy_retirement(ecaeb6ee-4ead-11e7-8646-06579dce22e7)\" using network plugins \"cni\": CNI request failed with status 400: 'error running network setup script:\nhostVethName vethab45e589, contVethMac 6a:35:8d:9a:31:80, podIP 10.131.10.17, podConfig &plugin.PodConfig{vnid:0xe8a77f, ingressBandwidth:\"\", egressBandwidth:\"\", wantMacvlan:false}\n + ovs-ofctl -O OpenFlow13 add-flow br0 'table=2, priority=100, in_port=-1, arp, nw_src=10.131.10.17, arp_sha=6a:35:8d:9a:31:80, actions=load:15247231->NXM_NX_REG0[], goto_table:5'\n2017-06-11T13:57:38Z|00001|ofp_util|WARN|Negative value -1 is not a valid port number.\novs-ofctl: -1: port value out of range for in_port\n\n'; Skipping pod"

@caruccio

Is this going to be backported to 1.5?

@AlbertoPeon
Contributor

Backporting this to 1.5 would be great!

@gravis

gravis commented Jun 27, 2017

Any news regarding a backport? This issue is still blocking our nodes from upgrading to 1.5 :(

@caruccio

caruccio commented Jun 27, 2017 via email

@eparis
Member

eparis commented Jun 27, 2017

The fix was merged in #14801

@danwinship
Contributor

yeah, the fix is on the release-1.5 branch but AFAIK we don't do regular point releases of origin

@tobru

tobru commented Jun 27, 2017

There has already been a point release for origin, so why not do another? With this issue fixed, I guess it would be great to have another point release, no?

@gravis

gravis commented Jul 11, 2017

Hi guys, it would be really nice to have a backport of this in the 1.5 branch.
This bug justifies a point release all by itself.

Thanks

@gravis

gravis commented Jul 17, 2017

We've been waiting for a patch version for almost a month now, anyone?

@tobru

tobru commented Jul 17, 2017

I'm pretty sure there won't be any point release for 1.5. But it's very easy to build a release on your own: https://github.com/openshift/origin/blob/release-1.5/HACKING.md -> just use the release-1.5 branch.
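Roughly (per the HACKING doc):

git clone https://github.com/openshift/origin.git
cd origin
git checkout release-1.5
make release    # builds the binaries and the release images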

@gravis

gravis commented Jul 17, 2017

Yes, we can build our own images, but if we can't use official images anymore, that pretty much means the end of OpenShift for us :(

@danwinship
Contributor

but if we can't use official images anymore, that pretty much mean the end of openshift for us :(

Why? Origin is not formally supported by Red Hat anyway. Literally the only difference between "official" images and images you built yourself is who typed "make release".

@ibotty
Contributor

ibotty commented Jul 18, 2017

Yes, but some people trust CentOS (which provides the origin packages), but not some random locally-built packages.
