This repository has been archived by the owner. It is now read-only.

Fix PD leader update issue #107

Merged
merged 7 commits into master from fix-change-pd-leader-bug on Sep 24, 2017

Conversation

2 participants
@zhexuany
Member

zhexuany commented Sep 22, 2017

The changes have not been tested yet, but you can start reviewing.


zhexuany added some commits Sep 22, 2017

Member

zhexuany commented Sep 22, 2017

fixing #88

@ilovesoup

When the leader is down, PDErrorHandler has no way to capture it, since we rely entirely on the errors PD returns, doesn't it?

zhexuany added some commits Sep 22, 2017

Member

zhexuany commented Sep 24, 2017

Ran TPC-H Query 17 at scale factor 100 while repeatedly killing the pd-server leader. The query hit exceptions, but the affected tasks were recovered by Spark.

See the log below for details:

scala> q17.show
[Stage 8:(282 + 64) / 563][Stage 10:>  (0 + 0) / 16][Stage 11:> (0 + 0) / 563]17/09/24 10:35:21 WARN PDClient: failed to get member from pd server.
io.grpc.StatusRuntimeException: UNAVAILABLE
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:227)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:208)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:141)
	at com.pingcap.tikv.kvproto.PDGrpc$PDBlockingStub.getMembers(PDGrpc.java:626)
	at com.pingcap.tikv.PDClient.getMembers(PDClient.java:233)
	at com.pingcap.tikv.PDClient.updateLeader(PDClient.java:277)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.netty4pingcap.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /172.16.10.7:2379
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at io.netty.netty4pingcap.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352)
	at io.netty.netty4pingcap.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
	at io.netty.netty4pingcap.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.netty4pingcap.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	... 1 more
Caused by: java.net.ConnectException: Connection refused
	... 11 more
[Stage 8:(288 + 64) / 563][Stage 10:>  (0 + 0) / 16][Stage 11:> (0 + 0) / 563]17/09/24 10:35:21 WARN PDClient: failed to get member from pd server.
io.grpc.StatusRuntimeException: UNAVAILABLE
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:227)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:208)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:141)
	at com.pingcap.tikv.kvproto.PDGrpc$PDBlockingStub.getMembers(PDGrpc.java:626)
	at com.pingcap.tikv.PDClient.getMembers(PDClient.java:233)
	at com.pingcap.tikv.PDClient.updateLeader(PDClient.java:277)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.netty4pingcap.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /172.16.10.7:2379
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at io.netty.netty4pingcap.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352)
	at io.netty.netty4pingcap.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
	at io.netty.netty4pingcap.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.netty4pingcap.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	... 1 more
Caused by: java.net.ConnectException: Connection refused
	... 11 more
17/09/24 10:35:21 WARN PDClient: failed to get member from pd server.
io.grpc.StatusRuntimeException: UNAVAILABLE
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:227)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:208)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:141)
	at com.pingcap.tikv.kvproto.PDGrpc$PDBlockingStub.getMembers(PDGrpc.java:626)
	at com.pingcap.tikv.PDClient.getMembers(PDClient.java:233)
	at com.pingcap.tikv.PDClient.updateLeader(PDClient.java:277)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.netty4pingcap.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /172.16.10.7:2379
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at io.netty.netty4pingcap.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352)
	at io.netty.netty4pingcap.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
	at io.netty.netty4pingcap.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
	at io.netty.netty4pingcap.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.netty4pingcap.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	... 1 more
Caused by: java.net.ConnectException: Connection refused
	... 11 more
+--------------------+
|          avg_yearly|
+--------------------+
|3.2087018998692103E7|
+--------------------+

Contributor

ilovesoup commented Sep 24, 2017

In the case of a client connection refusal, Spark will be fine since it has its own retry policy, but in general we should also handle this ourselves. We can leave it as-is here for now. RegionStoreClient has the same problem as well; let's deal with it in another PR.
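The fallback the comments above describe (on connection refusal, stop hammering the dead leader and try the other PD members) can be sketched roughly as below. This is a minimal illustrative sketch, not the actual PDClient/RegionStoreClient API; the names PdCall and tryEndpoints and the endpoint strings are invented for the example.

```java
import java.net.ConnectException;
import java.util.List;

public class LeaderRetrySketch {
    // Hypothetical stand-in for a blocking PD RPC (e.g. getMembers) against one endpoint.
    @FunctionalInterface
    interface PdCall<T> {
        T apply(String endpoint) throws Exception;
    }

    // Try each candidate PD member in turn; return the first successful result.
    // A connection refusal (the UNAVAILABLE case in the log above) just moves us
    // to the next member instead of failing the whole call.
    static <T> T tryEndpoints(List<String> endpoints, PdCall<T> call) throws Exception {
        Exception last = null;
        for (String ep : endpoints) {
            try {
                return call.apply(ep);
            } catch (Exception e) { // e.g. gRPC UNAVAILABLE / Connection refused
                last = e;           // remember the failure, fall through to the next member
            }
        }
        throw last != null ? last : new IllegalStateException("no PD endpoints configured");
    }

    public static void main(String[] args) throws Exception {
        // Simulate the scenario from the log: the first member is down, the second answers.
        String answered = tryEndpoints(
                List.of("172.16.10.7:2379", "172.16.10.8:2379"),
                ep -> {
                    if (ep.startsWith("172.16.10.7")) {
                        throw new ConnectException("Connection refused");
                    }
                    return ep;
                });
        System.out.println(answered); // prints 172.16.10.8:2379
        if (!answered.equals("172.16.10.8:2379")) {
            throw new AssertionError("unexpected endpoint: " + answered);
        }
    }
}
```

A real client would additionally cap retries and refresh the member list once a call succeeds, which is the part left to the follow-up PR.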


Member

zhexuany commented Sep 24, 2017

You are absolutely right. Let's deal with it in another PR.

@zhexuany zhexuany merged commit bf3de8f into master Sep 24, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@zhexuany zhexuany deleted the fix-change-pd-leader-bug branch Sep 24, 2017
