
Output Killed with no other information #1

Closed
RodenLuo opened this issue Jul 1, 2022 · 7 comments

Comments

@RodenLuo commented Jul 1, 2022

Hi, thanks for the excellent work.

I installed the tool and can run the file generation command:
cgdms makeinput -i 1CRN.pdb -s 1CRN.ss2 > 1CRN.txt

But I cannot run the simulate command; the only output is the single word "Killed":

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
Killed

Could you please take a look and advise how to debug?

Best,
Roden

My conda environment is attached:

name: cgdms
channels:
  - pytorch
  - salilab
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - blas=1.0=mkl
  - bzip2=1.0.8=h7f98852_4
  - ca-certificates=2022.6.15=ha878542_0
  - cudatoolkit=10.2.89=h713d32c_10
  - freetype=2.10.4=h0708190_1
  - giflib=5.2.1=h36c2ea0_2
  - jpeg=9e=h166bdaf_2
  - lcms2=2.12=hddcbb42_0
  - ld_impl_linux-64=2.36.1=hea4e1c9_2
  - lerc=3.0=h9c3ff4c_0
  - libdeflate=1.12=h166bdaf_0
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=12.1.0=h8d9b700_16
  - libnsl=2.0.0=h7f98852_0
  - libpng=1.6.37=h21135ba_2
  - libstdcxx-ng=12.1.0=ha89aaad_16
  - libtiff=4.4.0=hc85c160_1
  - libuuid=2.32.1=h7f98852_1000
  - libwebp=1.2.2=h3452ae3_0
  - libwebp-base=1.2.2=h7f98852_1
  - libxcb=1.13=h7f98852_1004
  - libzlib=1.2.12=h166bdaf_1
  - llvm-openmp=14.0.4=he0ac6c6_0
  - lz4-c=1.9.3=h9c3ff4c_1
  - mkl=2021.4.0=h8d4b97c_729
  - mkl-service=2.4.0=py38h95df7f1_0
  - mkl_fft=1.3.1=py38h8666266_1
  - mkl_random=1.2.2=py38h1abd341_0
  - ncurses=6.3=h27087fc_1
  - ninja=1.11.0=h924138e_0
  - numpy=1.22.3=py38he7a7128_0
  - numpy-base=1.22.3=py38hf524024_0
  - openjpeg=2.4.0=hb52868f_1
  - openssl=3.0.4=h166bdaf_2
  - pillow=9.1.1=py38h0ee0e06_1
  - pip=22.1.2=pyhd8ed1ab_0
  - pthread-stubs=0.4=h36c2ea0_1001
  - python=3.8.13=ha86cf86_0_cpython
  - python_abi=3.8=2_cp38
  - pytorch=1.6.0=py3.8_cuda10.2.89_cudnn7.6.5_0
  - readline=8.1.2=h0f457ee_0
  - setuptools=62.6.0=py38h578d9bd_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.39.0=h4ff8645_0
  - tbb=2021.5.0=h924138e_1
  - tk=8.6.12=h27826a3_0
  - torchvision=0.7.0=py38_cu102
  - wheel=0.37.1=pyhd8ed1ab_0
  - xorg-libxau=1.0.9=h7f98852_0
  - xorg-libxdmcp=1.1.3=h7f98852_0
  - xz=5.2.5=h516909a_1
  - zlib=1.2.12=h166bdaf_1
  - zstd=1.5.2=h8a70e8d_2
  - pip:
    - biopython==1.79
    - cgdms==1.0
    - colorama==0.4.5
    - peptidebuilder==1.1.0
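As a side note on debugging: a bare "Killed" with no Python traceback usually means the Linux out-of-memory killer (or a cluster cgroup limit) terminated the process rather than the program itself crashing. A hedged way to confirm this on a typical Linux box or Slurm cluster (the sacct line is commented out and `<jobid>` is a placeholder to fill in):

```shell
# Recent OOM kills show up in the kernel log (reading it may require root;
# if there is no match, the pipeline still exits cleanly via tail):
dmesg 2>/dev/null | grep -iE 'killed process|out of memory' | tail -n 5

# On a Slurm cluster, job accounting also records OOM events:
# sacct -j <jobid> --format=JobID,State,MaxRSS
```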
@jgreener64 (Member)

Thanks for reporting. If you add the -d cpu argument does it run? The PyTorch version is correct but it could be some GPU issue. What GPU are you running on? Does PyTorch seem to work generally, i.e. does import torch; torch.rand(5, 5, device="cuda") work?
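The suggested `-d cpu` workaround amounts to choosing the compute device explicitly in PyTorch. A minimal sketch of that fallback logic (the helper name `pick_device` is my own, not part of cgdms), using the same smoke test suggested above:

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA when it is available and actually usable,
    otherwise fall back to CPU, mirroring what `-d cpu` forces."""
    if torch.cuda.is_available():
        try:
            torch.rand(5, 5, device="cuda")  # smoke test from this thread
            return torch.device("cuda")
        except RuntimeError:
            pass  # e.g. unsupported compute capability
    return torch.device("cpu")

device = pick_device()
print(device)
```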

@RodenLuo (Author) commented Jul 3, 2022

Thanks! -d cpu works. I'm on a V100, and PyTorch itself seems fine:

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7 -d cpu
    Step        1 / 12000000 - acc  0.005 - vel  0.024 - energy -44.03 ( -21.59 -15.59  -6.85 ) - Cα RMSD  32.59

$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch;
>>> torch.rand(5, 5, device="cuda")
tensor([[0.9750, 0.8992, 0.7012, 0.1458, 0.7875],
        [0.1238, 0.1129, 0.4178, 0.7608, 0.2411],
        [0.3505, 0.2031, 0.9376, 0.4649, 0.3073],
        [0.5086, 0.2415, 0.9404, 0.9678, 0.4551],
        [0.7188, 0.8842, 0.8739, 0.2875, 0.8161]], device='cuda:0')

@RodenLuo (Author) commented Jul 3, 2022

I'm on a cluster. I tested a few GPU types and found that a GTX 1080 Ti works... Still trying to get the V100 and A100 to work.

@jgreener64 (Member)

Seems like a weird one, possibly related to the CUDA version too.

I tried running simulate with PyTorch v1.11 and it seems to run fine, there is one performance warning which can be ignored. Perhaps give this a go? PyTorch v1.6 was recommended because it was used for development, but recent changes in PyTorch may fix the issue on V100/A100.

@RodenLuo (Author)

I went straight to trying PyTorch v1.12. I hit a CUDA compatibility issue on the A100 and an out-of-memory kill on the V100.

With A100:

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
/miniconda3/envs/cgdms3/lib/python3.9/site-packages/torch/cuda/__init__.py:146: UserWarning: 
NVIDIA A100-SXM-80GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA A100-SXM-80GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Killed
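The warning above can be reproduced programmatically: PyTorch exposes both the GPU's compute capability and the architecture list the installed binary was compiled for. A small sketch of that check (an A100 is sm_80, which is missing from the sm_37–sm_75 list in the warning):

```python
import torch

# List of architectures this PyTorch binary was compiled for,
# e.g. ['sm_37', 'sm_50', ..., 'sm_75'] (empty on a CPU-only build).
arch_list = torch.cuda.get_arch_list()
print("compiled arch list:", arch_list)

if torch.cuda.is_available():
    # Compute capability of the first visible GPU, e.g. (8, 0) for A100.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"device capability: sm_{major}{minor}")
    print("supported by this build:", f"sm_{major}{minor}" in arch_list)
```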

With V100:

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
/miniconda3/envs/cgdms3/lib/python3.9/site-packages/cgdms/cgdms.py:532: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  /opt/conda/conda-bld/pytorch_1656352616446/work/torch/csrc/utils/tensor_new.cpp:204.)
  coords[len(atoms) * i + ai] = torch.tensor(
Killed

$ logout
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21647000.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: gpu213-14: task 0: Out Of Memory

Installing with PyTorch v1.11 now.

@RodenLuo (Author)

Thanks! With PyTorch v1.11 I can run on the V100 (on a node with 256 GB of memory), though the tensor-creation warning above still appears.

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
/miniconda3/envs/cgdms4/lib/python3.9/site-packages/cgdms/cgdms.py:532: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  /opt/conda/conda-bld/pytorch_1646755849709/work/torch/csrc/utils/tensor_new.cpp:210.)
  coords[len(atoms) * i + ai] = torch.tensor(
    Step        1 / 12000000 - acc  0.005 - vel  0.025 - energy -43.97 ( -21.55 -15.57  -6.84 ) - Cα RMSD  32.59
    Step    10001 / 12000000 - acc  0.006 - vel  0.031 - energy  -9.18 (  -8.60   1.04  -1.62 ) - Cα RMSD  32.36
    Step    20001 / 12000000 - acc  0.005 - vel  0.029 - energy  -6.78 (  -9.48   2.15   0.55 ) - Cα RMSD  32.10
    Step    30001 / 12000000 - acc  0.005 - vel  0.027 - energy  -7.72 (  -8.87   2.08  -0.93 ) - Cα RMSD  31.89

With A100, I still have the CUDA capability issue.
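For reference, the UserWarning in the log above comes from calling torch.tensor on a Python list of NumPy arrays, which takes a slow element-by-element path. A minimal sketch of the fix the warning suggests (the `arrays` list here is a hypothetical stand-in for the per-atom coordinates in cgdms, not its actual code):

```python
import numpy as np
import torch

# Stand-in for a list of per-atom coordinate arrays.
arrays = [np.random.rand(3) for _ in range(4)]

# Slow path that triggers the UserWarning:
#   coords = torch.tensor(arrays)
# Faster: stack into one contiguous ndarray first, then convert once.
coords = torch.from_numpy(np.stack(arrays)).float()
print(coords.shape)
```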

@jgreener64 (Member)

Great. The CUDA compatibility issue sounds like a PyTorch issue rather than an issue with this software. Not sure about the memory issue but if it runs okay then it can probably be ignored.
