running on SLURM with MP and MPI #53
By MP, I assume you mean OpenMP. On a cluster you have OpenMP and MPI occurring simultaneously: within a node, you have OpenMP executing, and across nodes, you have MPI. We successfully ran a SLURM batch script on a cluster back in the day as a test of scalability. The use of a cluster is indicated if you want to grow more trees simultaneously. Hybrid computing is not indicated as a work-around for memory issues.
In your case, it's theoretically possible to grow 56 trees in parallel across two nodes, but each tree still needs to access all the data. A complicating issue is that one must give the cluster instructions to grow two sub-forests, one on each node; the problem is then one of combining the ensembles from each forest into a single forest. It's a non-trivial enterprise.
You are using the wrapper imbalanced.rfsrc() in a hybrid environment. Anything other than standard rfsrc() or predict.rfsrc() calls is not recommended on a cluster, as many of the other functions contain multiple calls to these two core functions. Combining sub-forest outputs from either of the two core functions into a single forest would be necessary before any other calculations could proceed.
At the end of the day, using a cluster requires writing some code: calling mpi.send.Robj(), mpi.recv.Robj(), and mpi.spawn.Rslaves(), and having the supervisor process parse the output sent back from the workers into a single forest output.
I don't think any of this is what you need to do at all. What are the original dimensions of your data? What is ntree? What, specifically, are the parameters in your function call?
I think I understand what you are saying about OpenMP/MPI (I’ve read the Supplementary Code at https://kogalur.github.io/randomForestSRC/theory.html#section9). But, I don’t quite understand why more nodes (memory) will not help me.
In any case, my data sets are on the order of a few multiples of 100,000 cases and a few multiples of 1,000 covariates (e.g., 250,000 cases by 3,000 covariates). Everything is binary (2-level factors), and the response indicates a rare disease (about 0.5% rate), so I'm interested in imbalanced methods. I thought Balanced Random Forests (Chen and Breiman 2004), as implemented in imbalanced.rfsrc() with the "brf" option, would help with memory because it selects such a relatively small bootstrap sample from the majority class. My call to imbalanced.rfsrc() is
testing.rfsrc <- imbalanced(NAS ~ .,
                            testing.df,
                            method = "brf",
                            importance = "permute")
so ntree = 3000 (the default) and, for this case, I reduced the size of my data set to 50,000 x 100, just until I can get things to run on the cluster (MPI). But, as I said, my real data sets are on the order of 250,000 x 3,000 (more or less).
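A quick back-of-the-envelope check (my own rough arithmetic, assuming the data end up as a numeric matrix of 8-byte doubles internally) suggests the full-size problem is large even before any copies are made:

```shell
# Rough footprint of a 250,000 x 3,000 numeric matrix of 8-byte doubles.
rows=250000
cols=3000
bytes=$((rows * cols * 8))
echo "raw matrix: ~$((bytes / 1000000000)) GB"   # prints "raw matrix: ~6 GB"
```

Model fitting typically needs several multiples of this for working copies and the forest itself, which would be consistent with running out of memory at 32 GB, and even at 128 GB.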
To be sure, I can get this smaller data set (testing.df, 50,000 x 100) to run, no problem, but when attempting to use my real, larger data sets, I get "out of memory" errors both on my Mac (32 GB, 8 cores) and on a CentOS Linux SLURM cluster (1 node, 28 cores, 128 GB, no MPI). And my attempt at MP/MPI, across multiple nodes on the cluster, failed because I did not understand (at the time) the more hands-on approach required to get MPI to run. Alas, you say this MP may not work anyway with imbalanced.rfsrc().
I wonder if I just have to ask for _fewer cores_ to get more memory per core, escaping the "out of memory" error at the expense of compute time. Your suggestions are welcomed.
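If it helps, the fewer-cores idea could be expressed with SLURM's per-CPU memory request. This is only a sketch: the directives are standard SLURM, but the specific values are illustrative and depend on the cluster's partition limits.

```shell
#!/bin/bash
# Sketch: one task, fewer cores, more memory per core (illustrative values).
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8       # fewer OpenMP threads...
#SBATCH --mem-per-cpu=16G       # ...but 8 x 16G = 128G total for the job
module load R/latest
Rscript /home/jjb485/neonatal/testing/testing.R
```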
Best,
Jay
J. Jay Barber
Associate Professor
Northern Arizona University
School of Informatics, Computing, and Cybersystems (SICCS)
Bldg. #90
Room 221
1295 S Knoles Drive
PO Box 5693
Flagstaff, AZ 86001
Phone: 928-523-6869
The request for importance is extremely computationally demanding. Try first running it without this option.
Hemant
--
Hemant Ishwaran
Deputy Statistical Editor, J. Thor. Cardio. Surg.
Director of Statistical Methodology
Professor, Division of Biostatistics
Director of Graduate Studies
Don Soffer Clinical Research Center, Room 1058
1120 NW 14th Street
University of Miami, Miami FL 33136
hemant.ishwaran@gmail.com (preferred)
hishwaran@med.miami.edu
(305) 243-5473 (office)
(305) 243-5544 (fax)
http://web.ccs.miami.edu/~hishwaran
Hi All,
I have a large data set that kept giving me out of memory errors. So, I've reduced it to 50000 rows by 100 columns (plus response column) for testing on a cluster managed by SLURM to explore if I can break things up among nodes to access more memory (and cores). Incidentally, all variables are 2-level factors with substantial imbalance in the response (output) factor, so I'm using imbalanced.rfsrc.
My R code and my SLURM batch files are below. Basically, I ask for 56 processes over 2 nodes (28 cores each node) as a way to explore if I'm getting the benefit of the memory of 2 nodes (plus the cores); I've asked for a minimum of 128GB each node. I'm wondering if my R code is structured correctly to exploit both the shared memory (MP) and distributed memory (MPI). I get the following output (56 times), which does not look good (indeed my batch job seems to be hanging as I type).
OUTPUT (so far...still running...seems to be hanging):
56 times I get:
randomForestSRC 2.9.1
Type rfsrc.news() to see new features, changes, and bug fixes.
followed by 56 instances of:
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[42790,0],2] (PID 302952)
If you are absolutely sure that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
followed by 56 instances of the detectCores() output:
[1] 28
But, so far, I cannot tell if imbalanced.rfsrc has been called yet. As the warnings suggest, my job may be hanging. I expected that, with Rmpi, etc., present, imbalanced.rfsrc would take care of the MPI + MP details behind the scenes. What's wrong? Your help is much appreciated.
Best, -- Jay
MY R CODE (testing.R):
# put my personal library first on the search path
lp <- .libPaths()
.libPaths(c("/home/jjb485/Rlib", lp))
library(parallel)
library(Rmpi, lib.loc = "/home/jjb485/Rlib")
library(randomForestSRC, lib.loc = "/home/jjb485/Rlib")
detectCores()
# let randomForestSRC use every core on the node (OpenMP)
options(rf.cores = detectCores(), mc.cores = detectCores())
load("/home/jjb485/neonatal/testing/testing.df.RData")
testing.rfsrc <- imbalanced(NAS ~ .,
                            testing.df,
                            method = "brf",
                            importance = "permute")
save("testing.rfsrc", file = "testing.RData")
MY SLURM BATCH SCRIPT:
#!/bin/bash
#SBATCH --job-name=testing
#SBATCH --output=/scratch/jjb485/neonatal/testing/testing.txt
#SBATCH --time=1-00:00:00
#SBATCH --chdir=/scratch/jjb485/neonatal/testing
#SBATCH --ntasks=56
#SBATCH --nodes=2-2
#SBATCH --mem=128G
module load openmpi
module load R/latest
srun Rscript /home/jjb485/neonatal/testing/testing.R
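Given that randomForestSRC's built-in parallelism is OpenMP within a node, a single-node submission without the 56-way srun fan-out might be a simpler baseline to compare against. This is a sketch only, not a tested script; the values mirror the one-node configuration mentioned above.

```shell
#!/bin/bash
# Sketch: single node, single task, OpenMP only (no MPI launch, so no
# fork() warning from Open MPI). Values are illustrative.
#SBATCH --job-name=testing
#SBATCH --output=/scratch/jjb485/neonatal/testing/testing.txt
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28
#SBATCH --mem=128G
module load R/latest
Rscript /home/jjb485/neonatal/testing/testing.R
```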