Hadoop 3 compatibility #465

Closed
skhalid7 opened this issue Dec 20, 2021 · 9 comments

Comments

@skhalid7

Hi
I'm running Glow on GCP using Dataproc, where Spark 3 is set up with Hadoop 3.2. Going over Glow's pom.xml, it seems that it uses Hadoop 2.7, which causes dependency conflicts. Unfortunately, changing the Dataproc configuration isn't straightforward. Is there any way I can change Glow's Hadoop dependency to Hadoop 3.2?

Thank you
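
For anyone who wants to confirm which Hadoop version a published artifact declares, here is a minimal sketch that lists the `org.apache.hadoop` dependencies in a POM. The function name `hadoop_dependencies` and the sample POM are hypothetical and for illustration only; the sample is not copied from Glow's actual pom.xml.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Maven POM 4.0.0 files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def hadoop_dependencies(pom_text: str) -> dict:
    """Map each org.apache.hadoop artifactId declared in a POM to its version.

    Versions are returned as written in the file, so a property reference
    such as ${hadoop.version} is not resolved.
    """
    root = ET.fromstring(pom_text)
    found = {}
    for dep in root.iterfind(".//m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS).strip()
        if group == "org.apache.hadoop":
            artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS).strip()
            version = dep.findtext("m:version", default="", namespaces=POM_NS).strip()
            found[artifact] = version
    return found


# Hypothetical POM fragment (assumption), used only to exercise the function.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.4</version>
    </dependency>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>unrelated-lib</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
</project>"""

if __name__ == "__main__":
    print(hadoop_dependencies(SAMPLE_POM))
```

Running it against the real pom.xml (read from disk with `open(...).read()`) would show whether the declared Hadoop line differs from the one installed on the cluster.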

@williambrandler
Contributor

hey @skhalid7 what's the error you get with dataproc? And how are you installing Glow?

Any more details you can provide will be helpful

@williambrandler
Contributor

williambrandler commented Dec 22, 2021

I opened #467 to track this. This change may cause issues with other libraries that Glow depends on, such as hadoop-bam. We will test it for the next release of Glow (Spark 3.2). The next release will take some time, as we are working with the spark-core team to figure out breaking changes in Spark 3.2.

For now, can you use older versions of glow + dataproc that depend on older versions of hadoop / spark?

@williambrandler
Contributor

Just confirming from the CircleCI checks that changing the Hadoop version does break the Scala tests. It will not be possible to resolve this in the short term.

Does dataproc support docker containers? If so we can work with you to adapt the glow docker container to work on dataproc

@skhalid7
Author

Hi
Thanks for the confirmation. The error message I get when I try any I/O operation is:

'ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[readingParquetFooters-ForkJoinPool-1-worker-1,5,main]
java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator'

I install Glow using the Maven artifact glow-spark3_2.12:1.1.1 and pip install. Unfortunately, all Spark 3 Dataproc clusters come with Hadoop 3.2 pre-installed, so I can't use Glow 1.0+ versions on them. I've been using Glow 0.6 successfully on Dataproc.

Docker is supported and sounds like a good option.

Thank you!
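
An error like the one above is consistent with Hadoop 2 and Hadoop 3 classes ending up on the same classpath. As a rough sketch for checking this from PySpark, the snippet below reads the Hadoop version the JVM actually loaded and compares it with the version a library was built against. The helper names are hypothetical, comparing only major versions is a simplifying assumption rather than a real compatibility guarantee, and `_jvm` is a non-public PySpark attribute that could change.

```python
def hadoop_major(version: str) -> int:
    """Return the major component of a Hadoop version string such as '3.2.2'."""
    return int(version.split(".")[0])


def same_major(runtime_version: str, built_against: str) -> bool:
    """Heuristic (assumption): treat versions as compatible only if majors match."""
    return hadoop_major(runtime_version) == hadoop_major(built_against)


def runtime_hadoop_version(spark) -> str:
    """Ask the driver JVM which Hadoop version is loaded.

    `spark` is an existing pyspark.sql.SparkSession. VersionInfo is a Hadoop
    class; it is reached here through PySpark's Py4J gateway.
    """
    return spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()


if __name__ == "__main__":
    # Version strings taken from this thread: cluster on 3.2, library on 2.7.
    print(same_major("3.2", "2.7"))
```

On a live cluster, `same_major(runtime_hadoop_version(spark), "2.7")` would flag the mismatch before any read is attempted.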

@williambrandler
Contributor

The Glow Docker container is built in layers on top of the Databricks Runtime version of Spark. The relevant genomics layers can be adapted for Dataproc. I expect you can then override the Hadoop version, but I do not know how. Do you have a cloud engineer at Google who can help work on this? Please message your GCP account team to get them in the loop so we can chart a path forward.

@skhalid7
Author

Thanks,
I dropped you a message on the Glow Slack. Alternatively, is there an email address I should use to reach you?

@williambrandler
Contributor

hey @skhalid7, after consulting internally, it looks like getting this working on Dataproc may take significant development effort, a few weeks of engineer time, and we do not have funding approved for this.

I am sure we can find a way but it will take a while. Are you able to use Databricks on GCP for this work? Or does it have to be with dataproc?

@williambrandler
Contributor

hey @skhalid7, we now have a container that should work on Google Cloud with GCS and includes Hadoop 3 compatibility: #503

This was contributed by @edg1983 with some modifications (#494).

The container is now on the projectglow Docker Hub page:
https://hub.docker.com/r/projectglow/open-source-glow

projectglow/open-source-glow:1.1.2

Sorry it took so long, hopefully this solution will work for you on google cloud

@skhalid7
Author

Thank you very much!
