
TestWatch {etcd} #31262

Closed
k8s-github-robot opened this issue Aug 23, 2016 · 41 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16135/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel
@k8s-github-robot k8s-github-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. kind/flake Categorizes issue or PR as related to a flaky test. labels Aug 23, 2016
@sttts sttts added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Aug 23, 2016
@sttts
Contributor

sttts commented Aug 23, 2016

It seems this is blocking the submit queue.

@sttts sttts mentioned this issue Aug 23, 2016
14 tasks
@spxtr
Contributor

spxtr commented Aug 23, 2016

Assigning according to this.

@spxtr spxtr assigned dchen1107 and unassigned spxtr Aug 23, 2016
@ncdc
Member

ncdc commented Aug 23, 2016

I haven't been able to reproduce locally, but I do see some odd output. It appears that TestWatch tries to connect to the sockets for the etcd nodes created in the other tests in this file.

=== RUN   TestWatch
2016-08-23 08:51:09.610505 I | integration: launching node4516792630069338985 (unix://localhost:node4516792630069338985.sock.bridge)
2016-08-23 08:51:09.615184 I | etcdserver: name = node4516792630069338985
2016-08-23 08:51:09.615307 I | etcdserver: data dir = /tmp/etcd487135095
2016-08-23 08:51:09.615389 I | etcdserver: member dir = /tmp/etcd487135095/member
2016-08-23 08:51:09.615463 I | etcdserver: heartbeat = 10ms
2016-08-23 08:51:09.615537 I | etcdserver: election = 100ms
2016-08-23 08:51:09.615698 I | etcdserver: snapshot count = 0
2016-08-23 08:51:09.615790 I | etcdserver: advertise client URLs = unix://127.0.0.1:21012.15635.sock
2016-08-23 08:51:09.615887 I | etcdserver: initial advertise peer URLs = unix://127.0.0.1:21011.15635.sock
2016-08-23 08:51:09.616072 I | etcdserver: initial cluster = node4516792630069338985=unix://127.0.0.1:21011.15635.sock
2016-08-23 08:51:09.625353 I | etcdserver: starting member 9f295cd627bdb62e in cluster df8194b67a5d422b
2016-08-23 08:51:09.625562 I | raft: 9f295cd627bdb62e became follower at term 0
2016-08-23 08:51:09.625670 I | raft: newRaft 9f295cd627bdb62e [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2016-08-23 08:51:09.625739 I | raft: 9f295cd627bdb62e became follower at term 1
2016-08-23 08:51:09.641947 I | etcdserver: set snapshot count to default 10000
2016-08-23 08:51:09.642128 I | etcdserver: starting server... [version: 3.0.6, cluster version: to_be_decided]
2016-08-23 08:51:09.644212 I | membership: added member 9f295cd627bdb62e [unix://127.0.0.1:21011.15635.sock] to cluster df8194b67a5d422b
2016-08-23 08:51:09.649572 I | integration: launched node4516792630069338985 (unix://localhost:node4516792630069338985.sock.bridge)
2016-08-23 08:51:09.706646 I | raft: 9f295cd627bdb62e is starting a new election at term 1
2016-08-23 08:51:09.706803 I | raft: 9f295cd627bdb62e became candidate at term 2
2016-08-23 08:51:09.706884 I | raft: 9f295cd627bdb62e received vote from 9f295cd627bdb62e at term 2
2016-08-23 08:51:09.707004 I | raft: 9f295cd627bdb62e became leader at term 2
2016-08-23 08:51:09.707092 I | raft: raft.node: 9f295cd627bdb62e elected leader 9f295cd627bdb62e at term 2
2016-08-23 08:51:09.708994 I | etcdserver: setting up the initial cluster version to 3.0
2016-08-23 08:51:09.709347 I | etcdserver: published {Name:node4516792630069338985 ClientURLs:[unix://127.0.0.1:21012.15635.sock]} to cluster df8194b67a5d422b
2016-08-23 08:51:09.709945 N | membership: set the initial cluster version to 3.0
2016-08-23 08:51:10.088592 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node4390910715892619484.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node4390910715892619484.sock.bridge" <nil>}
2016-08-23 08:51:10.174713 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node3201104790320849340.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node3201104790320849340.sock.bridge" <nil>}
2016-08-23 08:51:10.321057 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node1203876774733832651.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node1203876774733832651.sock.bridge" <nil>}
2016-08-23 08:51:10.457051 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node143222603956848319.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node143222603956848319.sock.bridge" <nil>}
2016-08-23 08:51:10.600320 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node235110591907564707.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node235110591907564707.sock.bridge" <nil>}
2016-08-23 08:51:11.509978 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node4390910715892619484.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node4390910715892619484.sock.bridge" <nil>}
2016-08-23 08:51:11.919793 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node3201104790320849340.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node3201104790320849340.sock.bridge" <nil>}
2016-08-23 08:51:12.127168 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node143222603956848319.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node143222603956848319.sock.bridge" <nil>}
2016-08-23 08:51:12.149581 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node1203876774733832651.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node1203876774733832651.sock.bridge" <nil>}
2016-08-23 08:51:12.331657 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node235110591907564707.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node235110591907564707.sock.bridge" <nil>}
2016-08-23 08:51:14.009577 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node4390910715892619484.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node4390910715892619484.sock.bridge" <nil>}
2016-08-23 08:51:14.188795 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node3201104790320849340.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node3201104790320849340.sock.bridge" <nil>}
2016-08-23 08:51:14.646463 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node235110591907564707.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node235110591907564707.sock.bridge" <nil>}
2016-08-23 08:51:14.680831 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node143222603956848319.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node143222603956848319.sock.bridge" <nil>}
2016-08-23 08:51:15.030171 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node1203876774733832651.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node1203876774733832651.sock.bridge" <nil>}
2016-08-23 08:51:17.694162 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node4390910715892619484.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node4390910715892619484.sock.bridge" <nil>}
2016-08-23 08:51:18.680946 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node3201104790320849340.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node3201104790320849340.sock.bridge" <nil>}
2016-08-23 08:51:18.787490 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node1203876774733832651.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node1203876774733832651.sock.bridge" <nil>}
2016-08-23 08:51:19.364232 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node143222603956848319.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node143222603956848319.sock.bridge" <nil>}
2016-08-23 08:51:19.447686 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node235110591907564707.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node235110591907564707.sock.bridge" <nil>}
2016-08-23 08:51:23.218406 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node4390910715892619484.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node4390910715892619484.sock.bridge" <nil>}
2016-08-23 08:51:25.070712 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node3201104790320849340.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node3201104790320849340.sock.bridge" <nil>}
2016-08-23 08:51:25.792412 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node1203876774733832651.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node1203876774733832651.sock.bridge" <nil>}
2016-08-23 08:51:26.697937 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node235110591907564707.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node235110591907564707.sock.bridge" <nil>}
2016-08-23 08:51:27.052865 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node143222603956848319.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node143222603956848319.sock.bridge" <nil>}
2016-08-23 08:51:34.104028 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node4390910715892619484.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node4390910715892619484.sock.bridge" <nil>}
2016-08-23 08:51:36.834508 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node3201104790320849340.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node3201104790320849340.sock.bridge" <nil>}
2016-08-23 08:51:37.381142 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node1203876774733832651.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node1203876774733832651.sock.bridge" <nil>}
2016-08-23 08:51:37.658273 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node235110591907564707.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node235110591907564707.sock.bridge" <nil>}
2016-08-23 08:51:38.809849 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node143222603956848319.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node143222603956848319.sock.bridge" <nil>}
2016-08-23 08:51:43.063297 I | integration: terminating node4516792630069338985 (unix://localhost:node4516792630069338985.sock.bridge)
2016-08-23 08:51:43.067599 I | v3rpc/grpc: transport: http2Client.notifyError got notified that the client transport was broken EOF.
2016-08-23 08:51:43.075521 I | integration: terminated node4516792630069338985 (unix://localhost:node4516792630069338985.sock.bridge)
--- FAIL: TestWatch (33.47s)
    resttest.go:1251: unexpected timeout from result channel

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16144/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16148/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16156/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16162/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16164/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel

@wojtek-t
Member

This is definitely related to PR #30890, which switched the unit tests to the etcd v3 API.

@hongchaodeng @xiang90 @timothysc - seems like a blocker.

@lavalamp - FYI

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16169/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel

@timothysc
Member

FWIW I ran unit-tests locally tens of times without failure.

I wonder if it's similar to the resource issue we found a while ago in our Jenkins environment. /cc @ixdy

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16183/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel

@lavalamp
Member

I made a rollback PR: #31371.

@hongchaodeng
Contributor

Some findings from the logs of the last failure -- the test infra is leaking clients.

Here are the logs for failed test:

=== RUN   TestWatch
...
2016-08-24 12:04:09.195133 I | integration: launched node417290545105324029 (unix://localhost:node417290545105324029.sock.bridge)
...
2016-08-24 12:04:09.210408 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node5478120498384533947.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node5478120498384533947.sock.bridge" <nil>}
2016-08-24 12:04:09.360063 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node1204988732873562.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node1204988732873562.sock.bridge" <nil>}
...
... a bunch of "localhost:nodeXXX.sock.bridge: connect: no such file or directory" failures
...
--- FAIL: TestWatch (32.21s)
    resttest.go:1251: unexpected timeout from result channel

Basically, the TestWatch client should only talk to "node417290545105324029".
"node1204988732873562" was created earlier by TestResourceLocation. The connection failed because TestResourceLocation had already finished and its unix socket files were cleaned up, yet a client kept trying to reach it.
The same holds for the other similar errors.
In other words, the tests are leaking clients.

@hongchaodeng
Contributor

After digging into the code: we are clearly not closing the clients in a lot of tests.
Even if we roll back to v2, clients will still leak.

@hongchaodeng
Contributor

Submitted a fix: #31390

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16194/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16195/

Failed: TestWatch {etcd}

resttest.go:1299: unexpected timeout from result channel

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-test-go/16198/

Failed: TestWatch {etcd}

resttest.go:1251: unexpected timeout from result channel

k8s-github-robot pushed a commit that referenced this issue Aug 25, 2016
Automatic merge from submit-queue

Revert "Enable v3 Client as the default on UTs"

Reverts #30890

Fix flake problems (#31262)
@lavalamp
Member

Looks like we stopped seeing these after #31371 merged.

k8s-github-robot pushed a commit that referenced this issue Aug 25, 2016
Automatic merge from submit-queue

return destroy func to clean up internal resources of storage

What?
Provide a destroy func to clean up internal resources of storage.
It changes **unit tests** to clean up resources. (Maybe fix integration test in another PR.)

Why?
Although apiserver is designed to be long running, there are some cases that it's not.
See #31262 (comment)
We need to gracefully shutdown and clean up resources.
@wojtek-t
Member

Which was expected BTW :)

@timothysc
Member

I've nearly got the deferFunc() plumbed through to the rest of the registry testing and will put up a PR, but I'm happy to wait till 1.5.

@timothysc
Member

Fix posted.

k8s-github-robot pushed a commit that referenced this issue Sep 15, 2016
Automatic merge from submit-queue

Etcd3 v3client + grpc client leak fix

This re-enables the client and plumbs through the destroyFunc to cleanup clients. 

Fix for #31262
This was referenced Sep 15, 2016
perotinus pushed a commit to kubernetes-retired/cluster-registry that referenced this issue Sep 2, 2017
Automatic merge from submit-queue

return destroy func to clean up internal resources of storage

What?
Provide a destroy func to clean up internal resources of storage.
It changes **unit tests** to clean up resources. (Maybe fix integration test in another PR.)

Why?
Although apiserver is designed to be long running, there are some cases that it's not.
See kubernetes/kubernetes#31262 (comment)
We need to gracefully shutdown and clean up resources.