
H2O on a Multi-Node LSF Batch Submission System? #106

Open
DevinRouth opened this issue Mar 12, 2019 · 1 comment

@DevinRouth

Hello,

I work in the Crowther Lab at ETH Zürich, and we're starting to use H2O to crunch massive ecological datasets (1.2+ million rows with 75+ covariates) that we've collected. On the computing side, ETH uses an LSF batch system to manage the university clusters' resources. When we recently submitted some of the large models, we realized that H2O wasn't using all of the nodes assigned to the job. From the university cluster support staff: "it appears that the [script] can not use multiple nodes at the same time... I suggest you check your program's documentation to see whether it is possible to run it with distributed memory, so it can use more lower-memory nodes".

While diagnosing the issue, I found an article on running H2O across multiple nodes on a SLURM system, and another on running H2O on multi-node clusters in general.

Essentially, I'm unsure whether these approaches would work on an LSF system, because the nodes used for each job are only assigned after the full program script has been submitted via Bash using the batch submission functions. In other words, I don't know whether it's possible to access the IP addresses of the connected nodes before the program has been submitted to the cluster.
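One thing that may be worth checking (an assumption about LSF worth verifying against the ETH cluster's configuration): once a job is actually running, LSF exports the allocated hosts through `LSB_DJOB_HOSTFILE` (a file listing one hostname per allocated slot) and `LSB_HOSTS`. If that holds, the node addresses are available from *inside* the job script, even though they aren't known before submission. A minimal sketch:

```shell
#!/bin/sh
# Assumption (verify on your cluster): $LSB_DJOB_HOSTFILE points to a file
# with one hostname per allocated slot, so hosts repeat once per core.
# Print the unique hostnames granted to the running job.
list_job_hosts() {
    sort -u "${1:-$LSB_DJOB_HOSTFILE}"
}
```

Inside a `bsub` script this would be called with no argument, falling back to `$LSB_DJOB_HOSTFILE`; hostnames can then be resolved to IP addresses if H2O needs them.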

Has anyone else had experience with H2O on LSF-based clusters? Have I missed a critical or obvious step somewhere that would allow H2O to access/distribute memory across all of the nodes?

Thanks so much!

Cheers,
Devin Routh

@tomkraljevic
Contributor

No, there is no current support for LSF-based environments.

If there is a Spark-based way of running on LSF (sorry, I don't know), then you could try running Sparkling Water.

Otherwise, you will need to solve the "Cluster Formation" problem, where each of the nodes finds the others. H2O-3 does this for Hadoop with a dedicated driver program, whose source code is in the h2o-3 repository.

You would either need to write something similar to that, or write a stub that collects the IP addresses of the worker nodes, distributes a flatfile and the jar file, and then starts up each worker.
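As a rough illustration of that stub (a sketch only, not tested on a real LSF cluster), assuming `$LSB_DJOB_HOSTFILE` lists the job's allocated hosts, `blaunch` (LSF's remote task launcher) can start a command on a named host, and `h2o.jar` plus the flatfile sit on a filesystem shared by all nodes:

```shell
#!/bin/sh
# Sketch of the stub described above. Assumptions (verify for your site):
#   - $LSB_DJOB_HOSTFILE lists the job's allocated hosts, one per slot
#   - blaunch can launch a command on a given host of the job
#   - h2o.jar and the flatfile live on a shared filesystem

# Turn an LSF hostfile into an H2O flatfile: one "host:port" line per node.
build_flatfile() {
    sort -u "$1" | awk -v p="$2" '{print $1 ":" p}'
}

# Start one H2O worker per host listed in the flatfile.
launch_workers() {
    flatfile=$1; port=$2
    while read -r line; do
        host=${line%%:*}    # strip the ":port" suffix
        blaunch "$host" java -Xmx8g -jar h2o.jar \
            -name h2o_lsf -port "$port" -flatfile "$flatfile" &
    done < "$flatfile"
    wait                    # block until all workers exit
}
```

In the `bsub` script this would be roughly `build_flatfile "$LSB_DJOB_HOSTFILE" 54321 > flatfile.txt` followed by `launch_workers "$(pwd)/flatfile.txt" 54321`; you could then connect a client to any `host:54321` and confirm the reported cloud size equals the node count. The cluster name (`h2o_lsf`), port, and heap size here are all illustrative.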
