
Nequip memory requirements ❓ [QUESTION] #293

Closed
ipcamit opened this issue Jan 18, 2023 · 10 comments
Labels
question Further information is requested

Comments

@ipcamit

ipcamit commented Jan 18, 2023

Is there any rule of thumb for NequIP memory requirements? I have a dataset of 7000 configurations (Si, 64 atoms each, periodic structure) in npz format. I am trying to train on 3000 configs, but my jobs keep getting killed by the scheduler with an OOM error.

  1. How much RAM are we supposed to provide?
  2. How can I reduce the RAM requirement?
  3. How do I train on energies only? (I tried removing forces: 1 from the loss_coeffs section, but it still says that it calculated the forces RMS for scaling; is this expected? See the sketch just after this list for what I changed.)
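
A sketch of the change from point 3: in example.yaml I simply deleted the forces: 1 line from loss_coeffs and kept the total_energy entry (the weight shown here is just illustrative):

loss_coeffs:
  total_energy: 1.0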

I am using the model from example.yaml but with 3 layers instead of 4.
My last attempt was 1 core, 1 A100 GPU, and 100 GB of RAM.

@ipcamit ipcamit added the question Further information is requested label Jan 18, 2023
@Linux-cpp-lisp
Collaborator

Hi @ipcamit,

100 GB of RAM is far, far more than you should need on the CPU side; I suspect you are actually running out of GPU memory due to a large cutoff, a large batch size, a large model, or all three. Can you post your actual error and that information?

@ipcamit
Author

ipcamit commented Jan 18, 2023

The HPC specification says there is 80 GB of memory on the GPU cards.
The cutoff was 4.0 (now submitted again with 3.77 to check); if I remember correctly, the average number of neighbors that NequIP computed was about 8.3 on my laptop (on a very small toy set of 4 samples). The batch size was 2, and the model is just a 3-layer version of example.yaml, which shows 87096 parameters on my laptop.

The error on the HPC is not that descriptive, I think:

Torch device: cuda
Processing dataset...
./run.sh: line 3: 729160 Killed                  python nequip/nequip/scripts/train.py example.yaml
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=29229954.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Below is the diff of things I changed.

- num_layers: 4                                    # number of interaction blocks, we find 3-5 to work best
+ num_layers: 3                                    # number of interaction blocks, we find 3-5 to work best

- dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip    # url to download the npz. optional
- dataset_file_name: ./benchmark_data/toluene_ccsd_t-train.npz                # path to data set file
+ #dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip   # url to download the npz. optional
+ dataset_file_name: ./Si4.npz

+   P: pbc

+   - pbc

-   - H
-   - C
+   - Si

- wandb: true                                      # we recommend using wandb for logging
+ wandb: false                                     # we recommend using wandb for logging

- n_train: 100                                     # number of training data
- n_val: 50                                        # number of validation data
+ n_train: 3000                                    # number of training data
+ n_val: 500                                       # number of validation data

- batch_size: 5
+ batch_size: 2

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Jan 18, 2023

Hm, never mind, that's a very reasonable set of parameters.

It looks like it's dying during the neighbor-list preprocessing of the dataset, before the model is ever called. This preprocessing step can be run on a CPU node by calling nequip-benchmark my-config.yaml; presumably you can allocate more CPU RAM to the SLURM job there?
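
For concreteness, a sketch of how that two-step submission could look; the partition names, core counts, and memory requests below are placeholders, not recommendations:

# Step 1: preprocess the dataset (neighbor lists) on a CPU node, no GPU needed.
sbatch --partition=cpu --cpus-per-task=8 --mem=50G \
    --wrap="nequip-benchmark my-config.yaml"

# Step 2: once preprocessing has finished, train on a GPU node with the same
# config, so the already-processed dataset is picked up and reused.
sbatch --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=30G \
    --wrap="nequip-train my-config.yaml"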

@ipcamit
Author

ipcamit commented Jan 18, 2023

OK, so if I understand correctly, I should run

nequip-benchmark example.yaml
nequip-train example.yaml

Is that correct?
Like I said, is there any rule of thumb on how much RAM to request?
Will allocating more cores help nequip-benchmark (is it parallel)?
Lastly, I can run these scripts explicitly, right?

python nequip/benchmark.py example.yaml
python nequip/train.py example.yaml

@simonbatzner
Collaborator

Yes, that's correct; just be sure that nequip-benchmark uses the same dataset config as your final training, and that you store the processed data in the same directory.

As a rule of thumb, I usually don't need more than 30 GB, in particular for a dataset as small as yours. If it OOMs on the CPU front, I increase to 50 GB or so; in the absolute worst case I've used 500 GB, but that was for a massive dataset. More cores themselves won't help, but if they come with more memory, that will help.

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Jan 18, 2023

Will allocating more cores help in nequip-benchmark (is it parallel)?
More cores themselves won't help, but if they come with more memory that will help.

This is correct, but I just want to note a caveat for others reading this issue: the ASE dataset loader/preprocessor (as used for extXYZ files, our general recommendation for the dataset format) is parallelized over CPU cores and can use as many as you throw at it on a single node (it is not MPI parallelized). It should autodetect the available cores from the SLURM environment variables, but you can also set the count manually with the NEQUIP_NUM_TASKS environment variable.
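
For example, in a batch script the core count can be pinned explicitly (the value 8 here is arbitrary):

export NEQUIP_NUM_TASKS=8    # override the autodetected core count
nequip-benchmark my-config.yaml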

Also note that the CPU RAM demands of training, once preprocessing is complete, are generally lower than those of preprocessing itself.

@ipcamit
Author

ipcamit commented Jan 18, 2023

Thanks for the help. In case anyone stumbles on this in the future: instead of using npz, and taking a cue from the conversation above, I appended all my data to a single xyz file, used the ase dataset type, and ran nequip-benchmark first. As it is also thread parallel, it processed the data with relatively modest resources of 5 cores and 50 GB total RAM (I think it can go much lower, but I didn't test; on my laptop I could process 1000 configs easily with 8 cores and 16 GB RAM). I made the following changes to my example.yaml file to map the fields properly.

dataset: ase
dataset_file_name: /path/to/Si_all.xyz
dataset_key_mapping:
  forces: forces
  Energy: total_energy # extxyz file has energy stored under the key Energy
  pbc: PBC

chemical_symbols:
  - Si

dataset_include_frames: !!python/object/apply:builtins.range
  - 0
  - 7400
  - 1

This processed the data easily, and nequip-train is now training the model on the GPU.

Thanks

@ipcamit ipcamit closed this as completed Jan 18, 2023
@Linux-cpp-lisp
Collaborator

Great, glad this resolved your issue, @ipcamit, and thank you for documenting it for future users!

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Jan 18, 2023

Oh, and just noting: our PBC key is actually also lowercase pbc, so I think you can remove the last mapping (ase should handle it anyway, though). And identity mappings like forces: forces should also be unnecessary; if you find one to be needed, please let me know, as I'd consider that a bug.

@simonbatzner
Collaborator

Awesome, great to hear.
