
Presto Queries are not working with Hive Table data at S3 #5375

Closed
rajitsaha opened this issue May 27, 2016 · 6 comments

@rajitsaha

Hi

I have created a Hortonworks 2.4 cluster on AWS EC2 nodes with Ambari 2.2.2, and then deployed Teradata Presto "presto-server-rpm-0.141t-1.x86_64". I generated data with TPC-DS, and the same data is copied to both HDFS and S3. I have two Hive external tables, one pointing to the HDFS data (Hive table: tpcds_bin_partitioned_orc_10.web_sales) and one pointing to the S3 data (Hive table: s3_tpcds_bin_partitioned_orc_10.web_sales).
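For reference (this DDL is my own abbreviated illustration, not taken from the original report; the bucket path, column list, and s3a scheme are placeholders), the S3-backed external table would look roughly like the HDFS-backed one, differing only in LOCATION:

-- Hypothetical, abbreviated DDL for illustration only.
CREATE EXTERNAL TABLE s3_tpcds_bin_partitioned_orc_10.web_sales (
  ws_order_number BIGINT,
  ws_quantity     INT,
  ws_net_paid     DECIMAL(7,2)
)
PARTITIONED BY (ws_sold_date_sk BIGINT)
STORED AS ORC
LOCATION 's3a://<bucket>/tpcds_bin_partitioned_orc_10/web_sales';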

The Presto query against the Hive table pointing to HDFS data works fine, but the query against the Hive table pointing to S3 data fails with the following error:

com.facebook.presto.spi.PrestoException: No nodes available to run query
    at com.facebook.presto.execution.scheduler.SimpleNodeSelector.computeAssignments(SimpleNodeSelector.java:120)
    at com.facebook.presto.execution.scheduler.DynamicSplitPlacementPolicy.computeAssignments(DynamicSplitPlacementPolicy.java:42)
    at com.facebook.presto.execution.scheduler.SourcePartitionedScheduler.schedule(SourcePartitionedScheduler.java:97)
    at com.facebook.presto.execution.scheduler.SqlQueryScheduler.schedule(SqlQueryScheduler.java:326)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

@rajitsaha
Author

I have the following JVM config:

-server
-Xmx28G
-XX:+UseG1GC
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p

Hive connector config

{'hive':[ 'connector.name=hive-hadoop2', 'hive.force-local-scheduling=true', 'hive.metastore.uri=thrift://<Hive Metastore Hostname>:9083', 'hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml', 'hive.s3.aws-access-key=<AWS Access Key>', 'hive.s3.aws-secret-key=<AWS Secret Key>' ]}
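(For context, this sketch is my own addition rather than part of the original post: the same settings written as a standard Presto catalog properties file would look roughly like the following on each node; the hostname and keys are placeholders.)

# etc/catalog/hive.properties (sketch; same values as the connector config above)
connector.name=hive-hadoop2
hive.force-local-scheduling=true
hive.metastore.uri=thrift://<Hive Metastore Hostname>:9083
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
hive.s3.aws-access-key=<AWS Access Key>
hive.s3.aws-secret-key=<AWS Secret Key>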

@maciejgrzybek
Member

@rajitsaha this should be posted to the mailing list instead. Here we should post only well-defined (low-level) issues, e.g. "Presto fails planning when there are two identical window functions".
Your problem is not necessarily a bug. It could be some misconfiguration (to be investigated, though), but such a problem should be posted on the mailing list first. If it is identified as a bug, it should be reported here.

@rschlussel-zz
Member

I believe your issue is with hive.force-local-scheduling=true.
hive.force-local-scheduling requires that Presto is running on every node where your data is located. Since you do not have Presto on the S3 nodes, there are no nodes that can run those queries.
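(As a side note that is not part of the original comment: a quick way to confirm which workers the coordinator actually sees is to query the built-in system.runtime.nodes table from any Presto client.)

-- Lists the Presto nodes known to the coordinator and their state.
SELECT * FROM system.runtime.nodes;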

You can use node-scheduler.network-topology=flat for a less strict version of hive.force-local-scheduling. It will reserve 50% of the work queue on a given node for local splits.
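A minimal sketch of that change, assuming the default Presto file layout (the file locations are my assumption, not something stated in this thread):

# etc/config.properties on the coordinator
node-scheduler.network-topology=flat

# etc/catalog/hive.properties on every node: relax the strict locality requirement
hive.force-local-scheduling=false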

@kbajda

kbajda commented Jun 7, 2016

@rajitsaha : could you please tell us whether @rschlussel's suggestion worked for you? If so, please close this issue. Thanks!
Btw. more details on various config options can be found here: http://teradata.github.io/presto/docs/141t/admin/tuning.html

@rschlussel-zz
Member

This should also be addressed by https://github.com/prestodb/presto/pull/5417/files in the next release.
I'm closing this issue. You can post to the mailing list if you have more questions.

@wbgentleman

Setting the parameter node-scheduler.network-topology=flat works for me! Thx guys!
