Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout #7

Closed
mattshma opened this issue Mar 14, 2016 · 1 comment
Labels

Comments

@mattshma
Copy link
Owner

报错如下:

16/03/15 12:01:00 INFO mapreduce.Job:  map 0% reduce 0%
16/03/15 12:05:11 INFO mapreduce.Job: Task Id : attempt_1449806584223_1303688_m_000003_0, Status : FAILED
Error: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
        at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:384)
        at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:227)
        at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: Expected nextCallSeq: 1 But the nextCallSeq got from client: 0; request=scanner_id: 40507 number_of_rows: 50 close_scanner: false next_call_seq: 0
        at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3088)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:745)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)

查看ScannerCallable.java

Client and RS maintain a nextCallSeq number during the scan. Every next() call from client to server will increment this number in both sides. Client passes this number along with the request and at RS side both the incoming nextCallSeq and its nextCallSeq will be matched. In case of a timeout this increment at the client side should not happen. If at the server side fetching of next batch of data was over, there will be mismatch in the nextCallSeq number. Server will throw OutOfOrderScannerNextException and then client will reopen the scanner with startrow as the last successfully retrieved row.
See HBASE-5974

RSRpcServices.java

// if nextCallSeq does not match throw Exception straight away. This needs to be
// performed even before checking of Lease.
// See HBASE-5974
        if (request.hasNextCallSeq()) {
          if (rsh != null) {
            if (request.getNextCallSeq() != rsh.getNextCallSeq()) {
              throw new OutOfOrderScannerNextException(
                "Expected nextCallSeq: " + rsh.getNextCallSeq()
                + " But the nextCallSeq got from client: " + request.getNextCallSeq() +
                "; request=" + TextFormat.shortDebugString(request));
            }
            // Increment the nextCallSeq value which is the next expected from client.
            rsh.incNextCallSeq();
          }
        }

Client的一个Scan可能需要多次向RS取数据,为了保证数据顺序,Client和RS各自维护一个nextCallSeq字段。RS每次返回一批数据,Client端都会将自已的nextCallSeq加1,供后续ScanRequest使用。RS端每次接受到ScanRequest也会将自己的nextCallSeq加1。如果客户端在获取数据超时,那么Client的nextCallSeq没有加1,后续RS收到ScanRequest发现nextCallSeq匹配不上,RS会抛出OutOfOrderScannerNextException。

既然问题是由client端超时引起的,那相应的减少客户端cache(hbase.client.scanner.caching)大小或者加大rpc timeout时间(hbase.rpc.timeout)即可。

参考

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants
@mattshma and others