DNS is broken after node reboot #8144

Closed
mbforbes opened this Issue May 12, 2015 · 16 comments

@mbforbes
Contributor

mbforbes commented May 12, 2015

This is blocking #7580 (test reboot).

I originally ran into this trying to run the guestbook test after the nodes had all rebooted; here is the log for the failed guestbook test.

To help understand what is happening, I added the following logging after this line:

Logf("Call to makeRequestToGuestbook(client, cmd='%s', arg='%s', ns='%s') failed.", cmd, arg, ns)
Logf("\t response: expected '%s', got: '%s'", expectedResponse, res)
Logf("\t error:    expected '<nil>', got: '%v'", err)

The first time it's called, I get an error saying there are no endpoints available for "frontend"; after that, I get the following repeated errors:

INFO: Call to makeRequestToGuestbook(client, cmd='get', arg='', ns='e2e-tests-kubectl-7d744f0d-775f-4fd0-93b2-88135adbc11e') failed.
INFO:    response: expected '{"data": ""}', got: '<br />
<b>Fatal error</b>:  Uncaught exception 'Predis\Connection\ConnectionException' with message 'php_network_getaddresses: getaddrinfo failed: Name or service not known [tcp://redis-slave:6379]' in /vendor/predis/predis/lib/Predis/Connection/AbstractConnection.php:141
Stack trace:
#0 /vendor/predis/predis/lib/Predis/Connection/StreamConnection.php(96): Predis\Connection\AbstractConnection->onConnectionError('php_network_get...', 0)
#1 /vendor/predis/predis/lib/Predis/Connection/StreamConnection.php(70): Predis\Connection\StreamConnection->tcpStreamInitializer(Object(Predis\Connection\ConnectionParameters))
#2 /vendor/predis/predis/lib/Predis/Connection/AbstractConnection.php(96): Predis\Connection\StreamConnection->createResource()
#3 /vendor/predis/predis/lib/Predis/Connection/StreamConnection.php(144): Predis\Connection\AbstractConnection->connect()
#4 /vendor/predis/predis/lib/Predis/Connection/AbstractConnection.php(181): Predis\Connection\StreamConnection->connect()
#5 /vendor/predis/predis/lib/Predis/Connection/StreamConnectio in <b>/vendor/predis/predis/lib/Predis/Connection/AbstractConnection.php</b> on line <b>141</b><br />
'
INFO:    error:    expected '<nil>', got: '<nil>'

Just to make that more readable, the error before the stack trace is: Uncaught exception 'Predis\Connection\ConnectionException' with message 'php_network_getaddresses: getaddrinfo failed: Name or service not known [tcp://redis-slave:6379]' in /vendor/predis/predis/lib/Predis/Connection/AbstractConnection.php:141
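
The exception comes from the frontend pod failing to resolve the name redis-slave at all (getaddrinfo), rather than failing to connect to a resolved address. As a minimal sketch (not part of the e2e suite; the name redis-slave and port 6379 are simply taken from the error above), something like the following, run from any pod in the same namespace, separates a DNS failure from a TCP connection failure:

// dnscheck.go - hypothetical standalone check, not part of the test suite.
// Step 1 asks whether the service name resolves; step 2 asks whether the
// resolved address actually accepts connections on the Redis port.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	addrs, err := net.LookupHost("redis-slave")
	if err != nil {
		// This is the case the guestbook frontend is hitting (getaddrinfo failure).
		fmt.Println("DNS lookup failed:", err)
		return
	}
	fmt.Println("redis-slave resolved to:", addrs)

	conn, err := net.DialTimeout("tcp", "redis-slave:6379", 5*time.Second)
	if err != nil {
		fmt.Println("TCP connect failed:", err)
		return
	}
	conn.Close()
	fmt.Println("TCP connect OK")
}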

Any tips or help greatly appreciated. I'm not sure where in the tcp / services / php / redis stack this is going wrong.

Apologies in advance for the large cc:
+cc @roberthbailey @zmerlynn @quinton-hoole @ixdy

@roberthbailey
Member

roberthbailey commented May 13, 2015

@piosz (since he originally ported this test from shell to go and may have some ideas)

@piosz
Member

piosz commented May 13, 2015

It looks like a problem with DNS. Is DNS working after node reboot? Could you provide some details on how I can reproduce the bug?

@mbforbes
Contributor

mbforbes commented May 13, 2015

Hey @piosz, thanks for jumping in. No, DNS isn't working after reboot; here's a log of the DNS test.

This is actually fairly easy to reproduce:

  • create a testing cluster with go run hack/e2e.go --v --up
  • ssh into each node and run sudo reboot, then wait for all nodes/pods to be Running & ready
  • run the test of your choice; for example, to run guestbook, run go run hack/e2e.go --v --test --test_args="--ginkgo.focus=guestbook"

This is done programmatically in #7580, and tests start breaking, including guestbook, DNS, elasticsearch, and influx/heapster. Any help greatly appreciated!

@roberthbailey
Member

roberthbailey commented May 13, 2015

@thockin FYI

Should we just rename this issue "DNS is broken after reboot"?

@thockin
Member

thockin commented May 13, 2015

What does reboot have to do with DNS? That's bizarre. Trying to repro.

@thockin
Member

thockin commented May 13, 2015

From controller-manager logs:

I0513 07:13:10.186807 7 defaults.go:179] upward merge of container.Capabilities.Add for container skydns
I0513 07:13:10.186848 7 defaults.go:183] upward merge of container.Capabilities.Drop for container skydns
I0513 07:13:10.186873 7 defaults.go:189] downward merge of container.Privileged for container skydns
I0513 07:13:10.189282 7 controller_utils.go:98] Controller kube-dns either never recorded expectations, or the ttl expired.
I0513 07:13:10.191572 7 replication_controller.go:339] Finished syncing controller "default/kube-dns" (2.346368ms)
I0513 07:13:10.191395 7 endpoints_controller.go:258] Finished syncing service "default/kube-dns" endpoints. (20.849555ms)

@lavalamp - you touched it last, any clue what this means?

It says it synced service kube-dns, but 'get endpoints' shows nothing.

root@e2e-test-thockin-master:/home/thockin# kubectl get endpoints kube-dns
NAME       ENDPOINTS
kube-dns

root@e2e-test-thockin-master:/home/thockin# kubectl get service kube-dns
NAME       LABELS                                                               SELECTOR           IP(S)       PORT(S)
kube-dns   k8s-app=kube-dns,kubernetes.io/cluster-service=true,name=kube-dns    k8s-app=kube-dns   10.0.0.10   53/UDP
                                                                                                               53/TCP

root@e2e-test-thockin-master:/home/thockin# kubectl get pods -l k8s-app=kube-dns
POD              IP           CONTAINER(S)   IMAGE(S)                                          HOST                                          LABELS                                                STATUS    CREATED      MESSAGE
kube-dns-6tbbb   10.245.0.4                                                                    e2e-test-thockin-minion-fdo0/146.148.61.252   k8s-app=kube-dns,kubernetes.io/cluster-service=true   Running   15 hours
                              skydns         gcr.io/google_containers/skydns:2015-03-11-001                                                                                                        Running   15 minutes
                              etcd           gcr.io/google_containers/etcd:2.0.9                                                                                                                   Running   15 minutes
                              kube2sky       gcr.io/google_containers/kube2sky:1.4                                                                                                                 Running   9 seconds    last termination: exit code 1

@piosz
Member

piosz commented May 13, 2015

Thanks @mbforbes for the details. The guestbook problem should be resolved once DNS starts working. I'll also look into the DNS problem.

cc @satnam6502 who may have an idea why elasticsearch doesn't work

@satnam6502
Contributor

satnam6502 commented May 13, 2015

The elasticsearch test needs DNS to be working. So if DNS is busted, the elasticsearch e2e test will fail.

@satnam6502
Contributor

satnam6502 commented May 13, 2015

Although note that I did update the Elasticsearch logging setup on Tuesday and I did slightly adjust the elasticsearch logging e2e test.

piosz changed the title from "e2e 'kubectl guestbook' test fails after node reboot" to "DNS is broken after node reboot" on May 13, 2015

@piosz
Member

piosz commented May 13, 2015

kube2sky is crashlooping with the following error:

2015/05/13 11:54:46 Etcd server found: http://127.0.0.1:4001
2015/05/13 11:54:47 Failed to create a kubernetes client: stat /etc/dns_token/kubeconfig: no such file or directory

Crashloop logs:

pszczesniak@kubernetes-minion-v115:~$ sudo docker ps -a
CONTAINER ID        IMAGE                                   COMMAND                CREATED                  STATUS                       PORTS               NAMES
d747cec0c714        gcr.io/google_containers/kube2sky:1.4   "/kube2sky -domain=c   Less than a second ago                                                    k8s_kube2sky.cb40782f_kube-dns-pw41j_default_e8583425-f955-11e4-8669-42010af012c8_4ec6d028
dd7a4b46ad3e        gcr.io/google_containers/kube2sky:1.4   "/kube2sky -domain=c   11 seconds ago           Exited (1) 8 seconds ago                         k8s_kube2sky.cb40782f_kube-dns-pw41j_default_e8583425-f955-11e4-8669-42010af012c8_cfa9a2ff
3ff7f50003fb        gcr.io/google_containers/kube2sky:1.4   "/kube2sky -domain=c   21 seconds ago           Exited (1) 18 seconds ago                        k8s_kube2sky.cb40782f_kube-dns-pw41j_default_e8583425-f955-11e4-8669-42010af012c8_2ac1d06d
cba76b58e697        gcr.io/google_containers/kube2sky:1.4   "/kube2sky -domain=c   31 seconds ago           Exited (1) 28 seconds ago                        k8s_kube2sky.cb40782f_kube-dns-pw41j_default_e8583425-f955-11e4-8669-42010af012c8_a5d29473
832cecca3f76        gcr.io/google_containers/kube2sky:1.4   "/kube2sky -domain=c   41 seconds ago           Exited (1) 38 seconds ago                        k8s_kube2sky.cb40782f_kube-dns-pw41j_default_e8583425-f955-11e4-8669-42010af012c8_eba7a878
be83e7e16546        gcr.io/google_containers/kube2sky:1.4   "/kube2sky -domain=c   51 seconds ago           Exited (1) 48 seconds ago                        k8s_kube2sky.cb40782f_kube-dns-pw41j_default_e8583425-f955-11e4-8669-42010af012c8_bfa9a817
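
For context on why a missing file shows up as a crashloop: kube2sky reads its kubeconfig from a path that is supposed to be mounted from the dns_token secret and exits if it cannot build a client, so kubelet keeps restarting the container, which is what produces the string of Exited (1) entries above. A rough sketch of that startup behavior (a hypothetical simplification, not the actual kube2sky source):

// Hypothetical simplification of kube2sky's startup path, to illustrate the
// crashloop: if the secret volume is not mounted, the kubeconfig file is
// missing, client creation fails, and the process exits non-zero, so kubelet
// restarts it over and over.
package main

import (
	"log"
	"os"
)

// kubeconfigPath is assumed here; it matches the path in the error message.
const kubeconfigPath = "/etc/dns_token/kubeconfig"

func main() {
	if _, err := os.Stat(kubeconfigPath); err != nil {
		// Produces the "stat /etc/dns_token/kubeconfig: no such file or
		// directory" failure seen in the container logs.
		log.Fatalf("Failed to create a kubernetes client: %v", err)
	}
	// ... otherwise, build a client from kubeconfigPath and start syncing
	// services into SkyDNS ...
}
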
@satnam6502
Contributor

satnam6502 commented May 13, 2015

So it looks like it is not finding the dns_token secret?

@antmanler
Contributor

antmanler commented May 13, 2015

+1 the same issue, cannot find dns_token

@roberthbailey
Member

roberthbailey commented May 13, 2015

This appears to be caused by #7958.

@thockin
Member

thockin commented May 13, 2015

I was just writing the same response. Yes.

@lavalamp
Member

lavalamp commented May 13, 2015

Yeah, the crashloop explains why the endpoints controller isn't making any endpoints. (I'm getting nervous because there have been NO BUGS in my rewrite of the endpoints controller, so there's probably a really awful one waiting to bite us, but this is not it.) This is clearly #7958. Let's just close this as a dup?

@piosz
Member

piosz commented May 13, 2015

duplicate of #7958
