
Nequip memory requirements ❓ [QUESTION] #293

Closed
ipcamit opened this issue Jan 18, 2023 · 10 comments
Labels
question Further information is requested

Comments

@ipcamit

ipcamit commented Jan 18, 2023

Is there any rule of thumb for NequIP memory requirements? I have a dataset of 7000 configurations (Si, 64 atoms each, periodic structure) in npz format. I am trying to train on 3000 configs, but my jobs keep getting killed by the scheduler with an OOM error.

  1. How much RAM are we supposed to provide?
  2. How can I reduce the RAM requirement?
  3. How do I train on energies only? (I tried removing forces: 1 from the loss_coeffs section, but it still says that it calculated the forces RMS for scaling; is this expected? See the sketch just after this list for what I changed.)
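
A sketch of the change from point 3: in example.yaml I simply deleted the forces: 1 line from loss_coeffs and kept the total_energy entry (the weight shown here is just illustrative):

loss_coeffs:
  total_energy: 1.0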

I am using the model from example.yaml but with 3 layers instead of 4.
My last attempt was 1 core, 1 A100 GPU, and 100 GB of RAM.

@ipcamit ipcamit added the question Further information is requested label Jan 18, 2023
@Linux-cpp-lisp
Collaborator

Hi @ipcamit,

100 GB of RAM is far, far more than you should need on the CPU side; I suspect you are actually running out of GPU memory due to a large cutoff, a large batch size, a large model, or all three. Can you post your actual error and that information?

@ipcamit
Author

ipcamit commented Jan 18, 2023

The HPC specification says there is 80 GB of memory on the GPU cards.
The cutoff was 4.0 (now submitted again with 3.77 to check); if I remember correctly, the average number of neighbors that NequIP computed was about 8.3 on my laptop (on a very small toy set of 4 samples). The batch size was 2, and the model is just a 3-layer version of example.yaml, which shows 87096 parameters on my laptop.

The error on the HPC is not that descriptive, I think:

Torch device: cuda
Processing dataset...
./run.sh: line 3: 729160 Killed                  python nequip/nequip/scripts/train.py example.yaml
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=29229954.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Below is the diff of things I changed.

- num_layers: 4                                    # number of interaction blocks, we find 3-5 to work best
+ num_layers: 3                                    # number of interaction blocks, we find 3-5 to work best

- dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip    # url to download the npz. optional
- dataset_file_name: ./benchmark_data/toluene_ccsd_t-train.npz                # path to data set file
+ #dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip   # url to download the npz. optional
+ dataset_file_name: ./Si4.npz

+   P: pbc

+   - pbc

-   - H
-   - C
+   - Si

- wandb: true                                      # we recommend using wandb for logging
+ wandb: false                                     # we recommend using wandb for logging

- n_train: 100                                     # number of training data
- n_val: 50                                        # number of validation data
+ n_train: 3000                                    # number of training data
+ n_val: 500                                       # number of validation data

- batch_size: 5
+ batch_size: 2

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Jan 18, 2023

Hm, never mind, that's a very reasonable set of parameters.

It looks like it's dying during the neighbor-list preprocessing of the dataset, before the model is ever called. This preprocessing step can be run on a CPU node by calling nequip-benchmark my-config.yaml; presumably you can allocate more CPU RAM to the SLURM job there?
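
For concreteness, a sketch of how that two-step submission could look; the partition names, core counts, and memory requests below are placeholders, not recommendations:

# Step 1: preprocess the dataset (neighbor lists) on a CPU node, no GPU needed.
sbatch --partition=cpu --cpus-per-task=8 --mem=50G \
    --wrap="nequip-benchmark my-config.yaml"

# Step 2: once preprocessing has finished, train on a GPU node with the same
# config, so the already-processed dataset is picked up and reused.
sbatch --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=30G \
    --wrap="nequip-train my-config.yaml"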

@ipcamit
Author

ipcamit commented Jan 18, 2023

OK, so if I understand correctly, I should run

nequip-benchmark example.yaml
nequip-train example.yaml

Is that correct?
Like I said, is there any rule of thumb on how much RAM to request?
Will allocating more cores help nequip-benchmark (is it parallel)?
Lastly, I can run these scripts explicitly, right?

python nequip/benchmark.py example.yaml
python nequip/train.py example.yaml

@simonbatzner
Collaborator

Yes, that's correct; just be sure that nequip-benchmark uses the same dataset config as your final training, and that you store the processed data in the same directory.

As a rule of thumb, I usually don't need more than 30 GB, in particular for a dataset as small as yours. If it OOMs on the CPU front, I increase to 50 GB or so; in the absolute worst case I've used 500 GB, but that was for a massive dataset. More cores themselves won't help, but if they come with more memory, that will help.

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Jan 18, 2023

Will allocating more cores help in nequip-benchmark (is it parallel)?
More cores themselves won't help, but if they come with more memory that will help.

This is correct, but I just want to note a caveat for others reading this issue: the ASE dataset loader/preprocessor (as used for extXYZ files, our general recommendation for the dataset format) is parallelized over CPU cores and can use as many as you throw at it on a single node (it is not MPI parallelized). It should autodetect the available cores from the SLURM environment variables, but you can also set the count manually with the NEQUIP_NUM_TASKS environment variable.
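
For example, in a batch script the core count can be pinned explicitly (the value 8 here is arbitrary):

export NEQUIP_NUM_TASKS=8    # override the autodetected core count
nequip-benchmark my-config.yaml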

Also note that the CPU RAM demands of training, once preprocessing is complete, are generally lower than those of preprocessing itself.

@ipcamit
Author

ipcamit commented Jan 18, 2023

Thanks for the help. In case anyone stumbles on this in the future: instead of using npz, and taking a cue from the conversation above, I appended all my data to a single xyz file, used the ase dataset type, and ran nequip-benchmark first. As it is also thread parallel, it processed the data with relatively modest resources of 5 cores and 50 GB total RAM (I think it can go much lower, but I didn't test; on my laptop I could process 1000 configs easily with 8 cores and 16 GB RAM). I made the following changes to my example.yaml file to map the fields properly.

dataset: ase
dataset_file_name: /path/to/Si_all.xyz
dataset_key_mapping:
  forces: forces
  Energy: total_energy # extxyz file has energy stored under the key Energy
  pbc: PBC

chemical_symbols:
  - Si

dataset_include_frames: !!python/object/apply:builtins.range
  - 0
  - 7400
  - 1

This processed the data easily, and nequip-train is now training the model on the GPU.

Thanks

@ipcamit ipcamit closed this as completed Jan 18, 2023
@Linux-cpp-lisp
Collaborator

Great, glad this resolved your issue, @ipcamit, and thank you for documenting it for future users!

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Jan 18, 2023

Oh, and just noting: our PBC key is actually also lowercase pbc, so I think you can remove the last mapping (ase should handle it anyway, though). And identity mappings like forces: forces should also be unnecessary; if you find one to be needed, please let me know, as I'd consider that a bug.

@simonbatzner
Collaborator

Awesome, great to hear.
