❓ [QUESTION] Nequip memory requirements #293
Is there any rule of thumb for NequIP memory requirements? I have a dataset of 7000 configurations (periodic Si structures, 64 atoms each) in npz format. I am trying to train on 3000 configs, but my jobs keep getting killed by the scheduler with an OOM error.

(I removed `forces: 1` from the `loss_coeffs` section, but it still says that it calculated the forces RMS for scaling; is this expected?) I am using the model in example.yaml, but with 3 layers instead of 4.

My last attempt was 1 core, 1 A100 GPU, 100 GB RAM.
Comments
Hi @ipcamit, 100 GB RAM is far, far more than you should need on the CPU side; I suspect you are actually running out of GPU memory due to a large cutoff, a large batch size, a large model, or all three. Can you post your actual error and that information?
The HPC specification says there is 80 GB of memory on the GPU cards. The error on the HPC is not very descriptive, I think:

Below is the diff of the things I changed.
Hm, never mind, that's a very reasonable set of parameters. It looks like it's dying before the model is ever called, during the neighbor-list preprocessing of the dataset. This preprocessing step can be run on a CPU node by calling `nequip-benchmark`.
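A minimal sketch of that two-step workflow (the config file name is an assumption; both commands must point at the same config):

```bash
# Step 1, on a CPU node: triggers dataset loading and neighbor-list
# preprocessing, caching the processed dataset on disk.
nequip-benchmark config.yaml

# Step 2, on the GPU node: training reuses the cached processed dataset.
nequip-train config.yaml
```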
OK, so if I understand correctly, I should first run `nequip-benchmark` with my training config on a CPU node to preprocess the dataset, and then launch the actual training on the GPU node; is that correct?
Yes, that's correct. Just be sure that nequip-benchmark actually uses the same dataset config as your final training, and that you use the same directory to store the data in. As a rule of thumb, I usually don't need more than 30 GB, in particular for a dataset as small as yours. If it OOMs on the CPU front, I increase to 50 GB or so; the absolute worst case I've used is 500 GB, but that was for a massive dataset. More cores by themselves won't help, but if they come with more memory, that will help.
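As an illustration, a hypothetical SLURM batch script for the CPU-side preprocessing step, with the memory request following the rule of thumb above (the partition name, core count, and config file name are all assumptions):

```bash
#!/bin/bash
#SBATCH --job-name=nequip-preprocess
#SBATCH --partition=cpu        # assumed name of a CPU-only partition
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16     # the ASE preprocessor parallelizes over cores
#SBATCH --mem=50G              # 30G usually suffices; bump to ~50G on OOM

# Runs dataset loading and neighbor-list preprocessing on the CPU node;
# the later GPU training job reuses the cached processed dataset.
nequip-benchmark config.yaml
```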
This is correct, but I just want to note a caveat for others reading this issue: the ASE dataset loader/preprocessor (used for extXYZ files, our general recommendation for dataset format) is parallelized over CPU cores and can use as many as you throw at it on a single node (it is not MPI parallelized). It should autodetect the available cores from SLURM environment variables, but you can also set that manually with the corresponding environment variable. Also note that the CPU RAM demands of training, after preprocessing is complete, are generally lower than those of preprocessing.
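A hedged sketch of that manual override; `NEQUIP_NUM_TASKS` is the variable recent NequIP versions read for the dataset-parsing worker count, but check the documentation for your version:

```bash
# Assumed override for the number of parallel dataset-parsing workers;
# without it, NequIP autodetects available cores (e.g. from SLURM).
export NEQUIP_NUM_TASKS=16
```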
Thanks for the help. In case anyone stumbles on this in the future: instead of using npz, taking a cue from the conversation above, I appended all my data to a single extXYZ file and used the ASE dataset loader. This processed the data easily, and training is now running. Thanks!
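For future readers, a minimal sketch of what such a dataset section can look like, following the style of NequIP's example configs (the file path is an assumption, and available keys can differ between versions):

```yaml
# Assumed path and file name; structure follows the NequIP example configs.
dataset: ase                          # use the ASE loader instead of npz
dataset_file_name: ./data/all_si.xyz  # single extXYZ file with all frames
ase_args:
  format: extxyz                      # tell ASE how to parse the file
```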
Great, glad this resolved your issue @ipcamit, and thank you for documenting it for future users!
Oh, and just noting: our PBC key is actually also lowercase (`pbc`).
Awesome, great to hear. |