Hadoop 3 compatibility #465
Comments
Hey @skhalid7, what error do you get with Dataproc? And how are you installing Glow? Any more details you can provide will be helpful.
#467 For now, can you use older versions of Glow and Dataproc that depend on older versions of Hadoop / Spark?
Just confirming from the CircleCI checks that changing the Hadoop version does break the Scala tests. It will not be possible to resolve this in the short term. Does Dataproc support Docker containers? If so, we can work with you to adapt the Glow Docker container to work on Dataproc.
Hi,
The error is:
ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[readingParquetFooters-ForkJoinPool-1-worker-1,5,main]
I install Glow using the Maven artifact glow-spark3_2.12:1.1.1 and pip install. Unfortunately, all Spark 3 Dataproc clusters come with Hadoop 3.2 pre-installed, so I can't use Glow 1.0+ versions on them. I've been using Glow 0.6 successfully on Dataproc. Docker is supported and sounds like a good option. Thank you!
The Glow Docker container is built in layers on top of the Databricks Runtime version of Spark. The relevant genomics layers can be adapted for Dataproc. I expect you can then override the Hadoop version, but I do not know how. Do you have a cloud engineer at Google who can help work on this? Please message your GCP account team to get them in the loop so we can chart a path forward.
Thanks,
Hey @skhalid7, after consulting internally, getting this working on Dataproc may take significant development effort, on the order of a few weeks of engineer time, and we do not have funding approved for this. I am sure we can find a way, but it will take a while. Are you able to use Databricks on GCP for this work, or does it have to be Dataproc?
Hey @skhalid7, we now have a container that should work on Google Cloud with GCS and includes Hadoop 3 compatibility: This was contributed by @edg1983, with some modifications. The container is now on the projectglow Docker Hub page here:
Sorry it took so long; hopefully this solution will work for you on Google Cloud.
Thank you very much!
Hi,
I'm running Glow on GCP, using a Dataproc cluster. Spark 3 is set up with Hadoop 3.2 there. Going over Glow's pom.xml, it seems that it's using Hadoop 2.7, which causes dependency conflicts. Unfortunately, changing the Dataproc configuration isn't that straightforward. Is there any way I can change the Hadoop dependency for Glow to Hadoop 3.2?
Thank you
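For anyone who wants to confirm the same mismatch on their own cluster, here is a minimal sketch, not an official Glow or Dataproc check. It assumes a running PySpark session and reads the Hadoop version through Hadoop's own `org.apache.hadoop.util.VersionInfo` class via the py4j gateway. The helper names (`hadoop_major`, `same_hadoop_major`, `cluster_hadoop_version`) are hypothetical, and comparing only the major version is a simplifying assumption about what counts as compatible.

```python
def hadoop_major(version: str) -> int:
    """Return the major version from a string such as '3.2.2' or '2.7'."""
    return int(version.split(".")[0])


def same_hadoop_major(cluster_version: str, build_version: str) -> bool:
    """True when both versions share a major version (assumed compatibility rule)."""
    return hadoop_major(cluster_version) == hadoop_major(build_version)


def cluster_hadoop_version(spark) -> str:
    """Ask the JVM behind an existing SparkSession which Hadoop it is running.

    `spark` is a pyspark.sql.SparkSession; the lookup goes through py4j to
    Hadoop's VersionInfo class, so it needs a live session to work.
    """
    return spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
```

On a cluster this could be used as `same_hadoop_major(cluster_hadoop_version(spark), "2.7")`, where "2.7" is the Hadoop version the issue reports seeing in Glow's pom.xml.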