
Cannot specify multiple column families for a new HBase table that shc creates on the fly #121

Closed
khampson opened this issue Apr 10, 2017 · 7 comments


@khampson
Contributor

When writing out a table to HBase from a DataFrame with the following code:

```scala
df.write
  .options(Map(
    HBaseTableCatalog.tableCatalog -> HbaseCommon.catalogMitLicResults(WriteTableName),
    HBaseTableCatalog.newTable -> HbaseNumRegions))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```

It works OK if I define all of the catalog's columns in the same column family -- let's say `i`. But I decided I wanted to put a couple of columns in a different column family, since they were larger fields that aren't needed most of the time, and I wanted a regular scan to skip them unless they were specifically requested.

However, when running with that catalog definition in place -- all columns in cf `i` except for two columns in another cf `m` -- the Spark job failed with `org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family m does not exist in region`. So it would appear that shc is not carrying through the column family definitions outlined in the catalog.
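For illustration, a two-family catalog of the kind described above might look like the following. This is a hypothetical sketch in shc's catalog JSON format; the table name `foo` matches the stack trace below, but the column names are invented, not taken from the issue:

```scala
// Hypothetical shc catalog sketch: most columns in family "i", two larger,
// rarely-needed fields in family "m". Column names are illustrative only.
val catalog =
  """{
    |  "table": {"namespace": "default", "name": "foo"},
    |  "rowkey": "key",
    |  "columns": {
    |    "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
    |    "count": {"cf": "i", "col": "count", "type": "int"},
    |    "name":  {"cf": "i", "col": "name",  "type": "string"},
    |    "body":  {"cf": "m", "col": "body",  "type": "string"},
    |    "notes": {"cf": "m", "col": "notes", "type": "string"}
    |  }
    |}""".stripMargin
```

With `HBaseTableCatalog.newTable` set, shc is expected to create the table with every column family named under `"columns"` (here both `i` and `m`), which is exactly what the `NoSuchColumnFamilyException` suggests was not happening.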

Thoughts?

Thanks!

@weiqingy
Contributor

weiqingy commented Apr 11, 2017

@khampson The column family issue should have been completely fixed in #114. Could you pull the new code and re-run your application? Let me know if you hit any issues.

@khampson
Contributor Author

@weiqingy : Thanks. I pulled the latest code for the 2.1 branch and rebuilt, but I still see the error. When I look at the commits in tig, I don't see #103 listed. Perhaps it did not get merged into the 2.1 branch after all?

@weiqingy
Contributor

The fix is in #114, which is already in branch 2.1. Could you please share the stack trace?

@khampson
Contributor Author

@weiqingy : Here is the stack trace. Thanks.

Wed Apr 12 14:56:41 EDT 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68217: row '' on table 'foo' at region=foo,,1491924147650.ed4f9f7e1e29d6751019508b52d26ad1., hostname=hostname,16020,1490506079287, seqNum=2

        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
        at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:327)
        at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:302)
        at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:167)
        at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:162)
        at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:794)
        at org.apache.spark.sql.execution.datasources.hbase.TableResource$$anonfun$getScanner$1.apply(HBaseResources.scala:145)
        at org.apache.spark.sql.execution.datasources.hbase.TableResource$$anonfun$getScanner$1.apply(HBaseResources.scala:145)
        at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:77)
        at org.apache.spark.sql.execution.datasources.hbase.TableResource.releaseOnException(HBaseResources.scala:120)
        at org.apache.spark.sql.execution.datasources.hbase.TableResource.getScanner(HBaseResources.scala:144)
        at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$9.apply(HBaseTableScan.scala:283)
        at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$9.apply(HBaseTableScan.scala:282)
        at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:657)
        at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
        at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
        at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
        at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
        at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:648)
        at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
        at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
        at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68217: row '' on table 'foo' at region=foo,,1491924147650.ed4f9f7e1e29d6751019508b52d26ad1., hostname=hostname,16020,1490506079287, seqNum=2
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
        at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68217: row '' on table 'foo' at region=foo,,1491924147650.ed4f9f7e1e29d6751019508b52d26ad1., hostname=hostname,16020,1490506079287, seqNum=2
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
        at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException): org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family m does not exist in region foo,,1491924147650.ed4f9f7e1e29d6751019508b52d26ad1. in table 'foo', {NAME => 'i', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
        at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:7605)
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2610)
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2595)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2282)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
        at java.lang.Thread.run(Thread.java:745)

        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1225)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32741)
        at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:379)
        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:201)
        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:63)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:364)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:338)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
        ... 4 more

@weiqingy
Contributor

It was supposed to work. I think your case is similar to the test cases here, but those test cases pass. Is there any difference between your catalog definition and the one in the test cases?

@khampson
Contributor Author

khampson commented Apr 14, 2017

@weiqingy : The catalog definition was all primitive types, so nothing different, I don't think. I re-pulled from the remote origin last evening and retried, and this time it appeared to work OK. After my first pull earlier in the week, I had noticed that I didn't see #114 in the git history via tig; after this re-pull, I do see #114 there. I'm not sure why it didn't come down in the first pull, since it's timestamped April 2nd, but it appears #114 only showed up in branch 2.1 sometime between my first pull and yesterday.

Thanks.

@weiqingy
Contributor

@khampson Great! Then let's close this issue. Thanks. :)
