Low GPU memory-usage and 0 GPU-Util #588

Closed · xuxiaochen1209 opened this issue Sep 6, 2022 · 2 comments
Labels: relax (Amber relaxation stage issues)

xuxiaochen1209 commented Sep 6, 2022

Hello,

I had a problem running AlphaFold. The first two hours went very smoothly, and I think the MSA part finished in those two hours. However, it then showed:

I0905 13:06:56.466166 140453353674560 model.py:175] Output shape was {'distogram': {'bin_edges': (63,), 'logits': (691, 691, 64)}, 'experimentally_resolved': {'logits': (691, 37)}, 'masked_msa': {'logits': (252, 691, 22)}, 'predicted_aligned_error': (691, 691), 'predicted_lddt': {'logits': (691, 50)}, 'structure_module': {'final_atom_mask': (691, 37), 'final_atom_positions': (691, 37, 3)}, 'plddt': (691,), 'aligned_confidence_probs': (691, 691, 64), 'max_predicted_aligned_error': (), 'ptm': (), 'iptm': (), 'ranking_confidence': ()}
I0905 13:06:56.467109 140453353674560 run_alphafold.py:202] Total JAX model model_1_multimer_v2_pred_0 on VHVL predict time (includes compilation time, see --benchmark): 246.2s

This step then seemed to take forever. I checked the CPU, memory, and GPU usage:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 35488 dell      20   0   69.9g   4.8g 594148 R 100.0  3.8   1591:11 python /h+

              total        used        free      shared  buff/cache   available
Mem:         128357        6557        1730         106      120069      121081

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 30%   33C    P2   101W / 320W |   5886MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 30%   25C    P0    88W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:B1:00.0 Off |                  N/A |
| 30%   25C    P0    89W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:D9:00.0 Off |                  N/A |
| 30%   25C    P0    94W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     35488      C   python                           1020MiB |
+-----------------------------------------------------------------------------+

The GPU memory usage is not very high, since I have seen other people's A100s with memory usage of over 20000 MiB. What's more, the GPU-Util is only 0-1%. I'm not sure whether it's because the graphics driver/CUDA/cuDNN/JAX versions are mismatched (driver version: 515.43.04, CUDA version: 11.7, cuDNN version: 8.4.1.50, jaxlib version: 0.3.15+cuda11.cudnn82, Python version: 3.8). I didn't see any error log, but it just didn't move on for over 30 hours. I also ran 'conda activate alphafold' and tested in python3:

>>> import torch
>>> print(torch.cuda.is_available())
True
>>> from torch.backends import cudnn
>>> print(cudnn.is_available())
True

It seems that CUDA and cuDNN work, so I'm confused. Has anyone had this problem before, and could you please kindly advise me on how to solve it? Thanks a lot for your kind guidance.
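
Since AlphaFold runs on JAX rather than PyTorch, a minimal additional check (a sketch, assuming a standard jaxlib install) is whether JAX itself detects the GPU; torch seeing CUDA does not guarantee that the installed jaxlib can use it:

import jax

# If this prints 'cpu', jaxlib has silently fallen back to the CPU backend,
# which would match the 0% GPU-Util shown in nvidia-smi above.
print(jax.default_backend())   # expect 'gpu'
print(jax.devices())           # expect GPU device(s), not only a CPU device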

xuxiaochen1209 (Author) commented

I also found that there is only one result_model_1_multimer_v2_pred_0.pkl, one unrelaxed_model_1_multimer_v2_pred_0.pdb, and one features.pkl file in the output directory, not the usual 5 relaxed and 5 unrelaxed PDB files.

Augustin-Zidek added the relax (Amber relaxation stage issues) label and removed the cuda label on Jan 17, 2023
alexanderimanicowenrivers (Collaborator) commented

Hey @xuxiaochen1209, GPU memory usage is highly dependent on sequence size, and given that you have a relatively small protein of 691 residues, you will see low memory utilisation. Your script terminates at the relax stage, which is a known issue (#667). Please update the codebase to v2.3.1 and use the latest script, as we have fixed issues with relax. See https://github.com/deepmind/alphafold/releases/tag/v2.3.1
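
For a rough sense of scale (an illustrative back-of-envelope only, not an exact profile of the model), the dominant pair activations grow with the square of the sequence length, so a 691-residue input stays well below the 10 GiB on each of these cards:

# Back-of-envelope only; the 128-channel pair representation (c_z) is an
# assumption taken from the AlphaFold paper, stored here as float32.
num_res = 691
pair_channels = 128
bytes_per_float = 4

pair_mib = num_res * num_res * pair_channels * bytes_per_float / 2**20
print(f"one pair representation: ~{pair_mib:.0f} MiB")   # roughly 233 MiB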
