Kubectl stop rc <name> is flaky #4528

Closed · opened Feb 18, 2015 · 12 comments
Labels: kind/bug, priority/backlog, sig/api-machinery

ghost commented Feb 18, 2015

While testing get pods performance under pressure I found that stopping replication controllers doesn't work reliably. Workflow:
Create 50 pods, each with a different rc:

for i in `seq 50`; do cluster/kubectl.sh run-container test$i --image=kubernetes/pause --port=12345; done

works reliably, but the reverse

for i in `seq 50`; do cluster/kubectl.sh stop rc test$i & done; wait

doesn't. I get errors of this type:

F0218 15:01:04.217323   30271 stop.go:52] 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

which leaves the cluster in a state where the rc's replica count may or may not have been set to 0, and the rc itself is not deleted:

Running: cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pod
POD                                          IP                  CONTAINER(S)            IMAGE(S)                            HOST                                                              LABELS                                   STATUS
elasticsearch-logging-controller-xshy2       10.244.2.3          elasticsearch-logging   dockerfile/elasticsearch            kubernetes-minion-3.c.gmarek-kubernetes.internal/23.251.151.97    name=elasticsearch-logging               Running
kibana-logging-controller-plmkm              10.244.1.5          kibana-logging          kubernetes/kibana:1.0               kubernetes-minion-2.c.gmarek-kubernetes.internal/104.154.95.14    name=kibana-logging                      Running
monitoring-heapster-controller-hlnmd         10.244.2.4          heapster                kubernetes/heapster:v0.6            kubernetes-minion-3.c.gmarek-kubernetes.internal/23.251.151.97    name=heapster,uses=monitoring-influxdb   Running
monitoring-influx-grafana-controller-bfhf7   10.244.2.5          influxdb                kubernetes/heapster_influxdb:v0.3   kubernetes-minion-3.c.gmarek-kubernetes.internal/23.251.151.97    name=influxGrafana                       Running
                                                                 grafana                 kubernetes/heapster_grafana:v0.3                                                                                                               
skydns-kq8u2                                 10.244.1.6          etcd                    quay.io/coreos/etcd:latest          kubernetes-minion-2.c.gmarek-kubernetes.internal/104.154.95.14    k8s-app=skydns                           Running
                                                                 kube2sky                kubernetes/kube2sky:1.0                                                                                                                        
                                                                 skydns                  kubernetes/skydns:2014-12-23-001                                                                                                               
test1-fjgt7                                  10.244.3.52         test1                   kubernetes/pause                    kubernetes-minion-4.c.gmarek-kubernetes.internal/104.154.88.56    run-container=test1                      Running
test12-rs3rh                                 10.244.0.53         test12                  kubernetes/pause                    kubernetes-minion-1.c.gmarek-kubernetes.internal/104.154.85.184   run-container=test12                     Running
test14-9k9z5                                 10.244.3.50         test14                  kubernetes/pause                    kubernetes-minion-4.c.gmarek-kubernetes.internal/104.154.88.56    run-container=test14                     Running
test26-86kb7                                 10.244.1.45         test26                  kubernetes/pause                    kubernetes-minion-2.c.gmarek-kubernetes.internal/104.154.95.14    run-container=test26                     Running
test27-z3bke                                 10.244.0.51         test27                  kubernetes/pause                    kubernetes-minion-1.c.gmarek-kubernetes.internal/104.154.85.184   run-container=test27                     Running
test29-bxiq1                                 10.244.0.52         test29                  kubernetes/pause                    kubernetes-minion-1.c.gmarek-kubernetes.internal/104.154.85.184   run-container=test29                     Running
test36-ot6y0                                 10.244.3.51         test36                  kubernetes/pause                    kubernetes-minion-4.c.gmarek-kubernetes.internal/104.154.88.56    run-container=test36                     Running

real    0m1.137s
user    0m0.229s
sys 0m0.038s

Running: cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get rc
CONTROLLER                             CONTAINER(S)            IMAGE(S)                            SELECTOR                     REPLICAS
elasticsearch-logging-controller       elasticsearch-logging   dockerfile/elasticsearch            name=elasticsearch-logging   1
kibana-logging-controller              kibana-logging          kubernetes/kibana:1.0               name=kibana-logging          1
monitoring-heapster-controller         heapster                kubernetes/heapster:v0.6            name=heapster                1
monitoring-influx-grafana-controller   influxdb                kubernetes/heapster_influxdb:v0.3   name=influxGrafana           1
                                       grafana                 kubernetes/heapster_grafana:v0.3                                 
skydns                                 etcd                    quay.io/coreos/etcd:latest          k8s-app=skydns               1
                                       kube2sky                kubernetes/kube2sky:1.0                                          
                                       skydns                  kubernetes/skydns:2014-12-23-001                                 
test11                                 test11                  kubernetes/pause                    run-container=test11         0
test17                                 test17                  kubernetes/pause                    run-container=test17         0
test2                                  test2                   kubernetes/pause                    run-container=test2          0
test20                                 test20                  kubernetes/pause                    run-container=test20         0
test24                                 test24                  kubernetes/pause                    run-container=test24         0
test30                                 test30                  kubernetes/pause                    run-container=test30         0
test35                                 test35                  kubernetes/pause                    run-container=test35         0
test44                                 test44                  kubernetes/pause                    run-container=test44         0
test46                                 test46                  kubernetes/pause                    run-container=test46         0
test50                                 test50                  kubernetes/pause                    run-container=test50         0

real    0m1.401s
user    0m0.222s
sys 0m0.028s

It seems that there are two kinds of errors: not deleting the pods before stopping the RC (which I assume is supposed to happen), and failing to delete the RC itself. Doing the same loop sequentially seems to work, which suggests a race condition.
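For reference, the sequential form of the teardown loop, with no backgrounding so that only one stop call is in flight at a time, is the variant that works reliably here:

for i in `seq 50`; do cluster/kubectl.sh stop rc test$i; done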

@yujuhong added the kind/bug, priority/backlog, sig/api-machinery, and team/master labels Feb 18, 2015
@davidopp (Member) commented:

"All the given peers are not reachable" seems to happen a lot. I'm seeing it when I run e2e test (one of the tests in pods.go fails with this error), also we have seen it in #4188, #3950, and elsewhere. In #4188 Tim said this indicates an etcd error.

@dchen1107 (Member) commented:

@davidopp, that is an issue in the etcd client library.

@davidopp (Member) commented:

Yeah, Tim had made a similar comment. What is the best way to get more details about the error?

@davidopp (Member) commented:

etcd has these flags:
-v  - Enable verbose logging. Defaults to false.
-vv - Enable very verbose logging. Defaults to false.

Dawn said that to enable etcd logging we should change the salt config that starts it up.

@dchen1107 (Member) commented:

To debug a similar issue last time, I had to modify the etcd client library in Godeps/_workspace/src/github.com/coreos/go-etcd/etcd/ to add more logging as well.

@davidopp (Member) commented:

Thanks. And on the server side it looks like we want to add -vv to the command line in one or both of these files: cluster/saltbase/salt/etcd/{default,initd}
(though if, as you said, the problem is in the client, then this is less useful).
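A rough sketch of that edit, assuming the init script reads etcd's flags from a DAEMON_ARGS-style variable in the salt default file (the variable name and file layout are assumptions, not verified against the salt templates):

# Hypothetical: prepend -v to whatever variable holds etcd's command-line flags
# (DAEMON_ARGS is an assumed name) in the salt-managed default file.
sed -i 's/^DAEMON_ARGS="/DAEMON_ARGS="-v /' cluster/saltbase/salt/etcd/default
# Then restart etcd on the master via its init script (path taken from the monit log below):
sudo /etc/init.d/etcd restart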

@davidopp (Member) commented:

This is totally the wrong bug to be discussing this in, but I'm not sure we should create another bug.

Anyway, one thing that seems to be going on is that (at least in my cluster) etcd seems to keep crashing and getting restarted by monit. /var/log/monit.log looks like this:

[UTC Feb 19 05:17:31] info : monit: generated unique Monit id 0c489c0ae08f15498e0e67573ba7e33a and stored to '/var/lib/monit/id'
[UTC Feb 19 05:17:31] info : Starting monit daemon
[UTC Feb 19 05:17:31] error : State file '/var/lib/monit/state': Unable to read magic
[UTC Feb 19 05:17:31] info : 'system_e2e-test-davidopp-master.c.striped-torus-760.internal' Monit started
[UTC Feb 19 05:17:32] info : monit daemon with pid [5595] killed
[UTC Feb 19 05:17:32] info : 'system_e2e-test-davidopp-master.c.striped-torus-760.internal' Monit stopped
[UTC Feb 19 05:17:32] info : Starting monit daemon
[UTC Feb 19 05:17:32] info : 'system_e2e-test-davidopp-master.c.striped-torus-760.internal' Monit started
[UTC Feb 19 05:17:32] error : 'etcd' failed, cannot open a connection to INET[127.0.0.1:4001/v2/keys/] via TCP
[UTC Feb 19 05:17:32] info : 'etcd' trying to restart
[UTC Feb 19 05:17:32] info : 'etcd' stop: /etc/init.d/etcd
[UTC Feb 19 05:17:32] info : 'etcd' start: /etc/init.d/etcd
[UTC Feb 19 05:19:32] error : 'etcd' failed, cannot open a connection to INET[127.0.0.1:4001/v2/keys/] via TCP
[UTC Feb 19 05:19:32] info : 'etcd' trying to restart
[UTC Feb 19 05:19:32] info : 'etcd' stop: /etc/init.d/etcd
[UTC Feb 19 05:19:32] info : 'etcd' start: /etc/init.d/etcd
[UTC Feb 19 05:21:32] error : 'etcd' failed, cannot open a connection to INET[127.0.0.1:4001/v2/keys/] via TCP
[UTC Feb 19 05:21:32] info : 'etcd' trying to restart
[UTC Feb 19 05:21:32] info : 'etcd' stop: /etc/init.d/etcd
[UTC Feb 19 05:21:32] info : 'etcd' start: /etc/init.d/etcd
(... and it keeps doing this...)

/var/log/etcd.log also shows it restarting at the corresponding times.

This is surprising, though, since it would seem things would be more broken if etcd were really restarting frequently.

BTW apparently the -vv flag is no longer supported, so I'm going to retry with -v.

ghost (Author) commented Feb 19, 2015

It may be that this isn't causing more visible problems because once every 2 minutes is not very often. My interpretation of the fact that I see these errors only when running things in parallel is that under normal circumstances there's a pretty low probability of having an API call in flight when etcd restarts, but with 50 of them it becomes probable enough (especially because on the single-threaded master the 'stop' calls run for longer than 2 minutes).

ghost (Author) commented Feb 19, 2015

David's point was a good one. AFAICT PR #4587 fixes this problem, so it's possible that it'll help with other flakiness as well.

@bgrant0607 (Member) commented:

In addition to fixing the etcd problems, we should make the client more robust by retrying.
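Purely as an illustration of that idea (not the actual kubectl change), a client-side retry around the flaky stop call could look like this; the 3 attempts and 5-second back-off are arbitrary choices:

# Hypothetical retry wrapper; attempt count and sleep interval are arbitrary.
stop_rc() {
  for attempt in 1 2 3; do
    cluster/kubectl.sh stop rc "$1" && return 0
    sleep 5
  done
  return 1
}

for i in `seq 50`; do stop_rc test$i & done; wait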

@bprashanth (Contributor) commented:

@davidopp I put in retries for stop in #5193, and etcd stability is being addressed in #3853.

@goltermann modified the milestone: High Pri Bug fix-it, Mar 10, 2015
@davidopp modified the milestones: v1.0, High Pri Bug fix-it, Mar 19, 2015
@brendandburns (Contributor) commented:

I'm closing this, given the retries for stop, and better monitoring of etcd.
