
Output Killed with no other information #1

Closed
RodenLuo opened this issue Jul 1, 2022 · 7 comments

Comments

@RodenLuo commented Jul 1, 2022

Hi, thanks for the excellent work.

I installed the tool and can run the file generation command:
cgdms makeinput -i 1CRN.pdb -s 1CRN.ss2 > 1CRN.txt

But I cannot run the simulate command; the only output is the single word "Killed":

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
Killed

Could you please take a look and advise how to debug?

Best,
Roden

My conda environment is attached:

name: cgdms
channels:
  - pytorch
  - salilab
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - blas=1.0=mkl
  - bzip2=1.0.8=h7f98852_4
  - ca-certificates=2022.6.15=ha878542_0
  - cudatoolkit=10.2.89=h713d32c_10
  - freetype=2.10.4=h0708190_1
  - giflib=5.2.1=h36c2ea0_2
  - jpeg=9e=h166bdaf_2
  - lcms2=2.12=hddcbb42_0
  - ld_impl_linux-64=2.36.1=hea4e1c9_2
  - lerc=3.0=h9c3ff4c_0
  - libdeflate=1.12=h166bdaf_0
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=12.1.0=h8d9b700_16
  - libnsl=2.0.0=h7f98852_0
  - libpng=1.6.37=h21135ba_2
  - libstdcxx-ng=12.1.0=ha89aaad_16
  - libtiff=4.4.0=hc85c160_1
  - libuuid=2.32.1=h7f98852_1000
  - libwebp=1.2.2=h3452ae3_0
  - libwebp-base=1.2.2=h7f98852_1
  - libxcb=1.13=h7f98852_1004
  - libzlib=1.2.12=h166bdaf_1
  - llvm-openmp=14.0.4=he0ac6c6_0
  - lz4-c=1.9.3=h9c3ff4c_1
  - mkl=2021.4.0=h8d4b97c_729
  - mkl-service=2.4.0=py38h95df7f1_0
  - mkl_fft=1.3.1=py38h8666266_1
  - mkl_random=1.2.2=py38h1abd341_0
  - ncurses=6.3=h27087fc_1
  - ninja=1.11.0=h924138e_0
  - numpy=1.22.3=py38he7a7128_0
  - numpy-base=1.22.3=py38hf524024_0
  - openjpeg=2.4.0=hb52868f_1
  - openssl=3.0.4=h166bdaf_2
  - pillow=9.1.1=py38h0ee0e06_1
  - pip=22.1.2=pyhd8ed1ab_0
  - pthread-stubs=0.4=h36c2ea0_1001
  - python=3.8.13=ha86cf86_0_cpython
  - python_abi=3.8=2_cp38
  - pytorch=1.6.0=py3.8_cuda10.2.89_cudnn7.6.5_0
  - readline=8.1.2=h0f457ee_0
  - setuptools=62.6.0=py38h578d9bd_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.39.0=h4ff8645_0
  - tbb=2021.5.0=h924138e_1
  - tk=8.6.12=h27826a3_0
  - torchvision=0.7.0=py38_cu102
  - wheel=0.37.1=pyhd8ed1ab_0
  - xorg-libxau=1.0.9=h7f98852_0
  - xorg-libxdmcp=1.1.3=h7f98852_0
  - xz=5.2.5=h516909a_1
  - zlib=1.2.12=h166bdaf_1
  - zstd=1.5.2=h8a70e8d_2
  - pip:
    - biopython==1.79
    - cgdms==1.0
    - colorama==0.4.5
    - peptidebuilder==1.1.0
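As a side note on debugging: a bare "Killed" with no Python traceback usually means the Linux out-of-memory killer (or a cluster cgroup limit) terminated the process rather than the program itself crashing. A hedged way to confirm this on a typical Linux box or Slurm cluster (the sacct line is commented out and `<jobid>` is a placeholder to fill in):

```shell
# Recent OOM kills show up in the kernel log (reading it may require root;
# if there is no match, the pipeline still exits cleanly via tail):
dmesg 2>/dev/null | grep -iE 'killed process|out of memory' | tail -n 5

# On a Slurm cluster, job accounting also records OOM events:
# sacct -j <jobid> --format=JobID,State,MaxRSS
```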
@jgreener64 (Member)

Thanks for reporting. If you add the -d cpu argument does it run? The PyTorch version is correct but it could be some GPU issue. What GPU are you running on? Does PyTorch seem to work generally, i.e. does import torch; torch.rand(5, 5, device="cuda") work?
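The suggested `-d cpu` workaround amounts to choosing the compute device explicitly in PyTorch. A minimal sketch of that fallback logic (the helper name `pick_device` is my own, not part of cgdms), using the same smoke test suggested above:

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA when it is available and actually usable,
    otherwise fall back to CPU, mirroring what `-d cpu` forces."""
    if torch.cuda.is_available():
        try:
            torch.rand(5, 5, device="cuda")  # smoke test from this thread
            return torch.device("cuda")
        except RuntimeError:
            pass  # e.g. unsupported compute capability
    return torch.device("cpu")

device = pick_device()
print(device)
```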

@RodenLuo (Author) commented Jul 3, 2022

Thanks! -d cpu works. I'm on a V100, and PyTorch itself seems fine:

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7 -d cpu
    Step        1 / 12000000 - acc  0.005 - vel  0.024 - energy -44.03 ( -21.59 -15.59  -6.85 ) - Cα RMSD  32.59

$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch;
>>> torch.rand(5, 5, device="cuda")
tensor([[0.9750, 0.8992, 0.7012, 0.1458, 0.7875],
        [0.1238, 0.1129, 0.4178, 0.7608, 0.2411],
        [0.3505, 0.2031, 0.9376, 0.4649, 0.3073],
        [0.5086, 0.2415, 0.9404, 0.9678, 0.4551],
        [0.7188, 0.8842, 0.8739, 0.2875, 0.8161]], device='cuda:0')

@RodenLuo (Author) commented Jul 3, 2022

I'm on a cluster. I tested a few GPU types and found that a GTX 1080 Ti works... Still trying to get the V100 and A100 to work.

@jgreener64 (Member)

Seems like a weird one, possibly related to the CUDA version too.

I tried running simulate with PyTorch v1.11 and it seems to run fine, there is one performance warning which can be ignored. Perhaps give this a go? PyTorch v1.6 was recommended because it was used for development, but recent changes in PyTorch may fix the issue on V100/A100.

@RodenLuo (Author)

I went straight to trying PyTorch v1.12. I hit a CUDA compatibility issue on the A100 and an out-of-memory kill on the V100.

With A100:

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
/miniconda3/envs/cgdms3/lib/python3.9/site-packages/torch/cuda/__init__.py:146: UserWarning: 
NVIDIA A100-SXM-80GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA A100-SXM-80GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Killed
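The warning above can be reproduced programmatically: PyTorch exposes both the GPU's compute capability and the architecture list the installed binary was compiled for. A small sketch of that check (an A100 is sm_80, which is missing from the sm_37–sm_75 list in the warning):

```python
import torch

# List of architectures this PyTorch binary was compiled for,
# e.g. ['sm_37', 'sm_50', ..., 'sm_75'] (empty on a CPU-only build).
arch_list = torch.cuda.get_arch_list()
print("compiled arch list:", arch_list)

if torch.cuda.is_available():
    # Compute capability of the first visible GPU, e.g. (8, 0) for A100.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"device capability: sm_{major}{minor}")
    print("supported by this build:", f"sm_{major}{minor}" in arch_list)
```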

With V100:

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
/miniconda3/envs/cgdms3/lib/python3.9/site-packages/cgdms/cgdms.py:532: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  /opt/conda/conda-bld/pytorch_1656352616446/work/torch/csrc/utils/tensor_new.cpp:204.)
  coords[len(atoms) * i + ai] = torch.tensor(
Killed

$ logout
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21647000.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: gpu213-14: task 0: Out Of Memory

Installing with PyTorch v1.11 now.

@RodenLuo (Author)

Thanks! With PyTorch v1.11 I can run on the V100 (on a node with 256 GB of memory), though the tensor-creation warning above still appears.

$ cgdms simulate -i 1CRN.txt -o traj.pdb -s predss -n 1.2e7
/miniconda3/envs/cgdms4/lib/python3.9/site-packages/cgdms/cgdms.py:532: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  /opt/conda/conda-bld/pytorch_1646755849709/work/torch/csrc/utils/tensor_new.cpp:210.)
  coords[len(atoms) * i + ai] = torch.tensor(
    Step        1 / 12000000 - acc  0.005 - vel  0.025 - energy -43.97 ( -21.55 -15.57  -6.84 ) - Cα RMSD  32.59
    Step    10001 / 12000000 - acc  0.006 - vel  0.031 - energy  -9.18 (  -8.60   1.04  -1.62 ) - Cα RMSD  32.36
    Step    20001 / 12000000 - acc  0.005 - vel  0.029 - energy  -6.78 (  -9.48   2.15   0.55 ) - Cα RMSD  32.10
    Step    30001 / 12000000 - acc  0.005 - vel  0.027 - energy  -7.72 (  -8.87   2.08  -0.93 ) - Cα RMSD  31.89

With A100, I still have the CUDA capability issue.
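For reference, the UserWarning in the log above comes from calling torch.tensor on a Python list of NumPy arrays, which takes a slow element-by-element path. A minimal sketch of the fix the warning suggests (the `arrays` list here is a hypothetical stand-in for the per-atom coordinates in cgdms, not its actual code):

```python
import numpy as np
import torch

# Stand-in for a list of per-atom coordinate arrays.
arrays = [np.random.rand(3) for _ in range(4)]

# Slow path that triggers the UserWarning:
#   coords = torch.tensor(arrays)
# Faster: stack into one contiguous ndarray first, then convert once.
coords = torch.from_numpy(np.stack(arrays)).float()
print(coords.shape)
```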

@jgreener64 (Member)

Great. The CUDA compatibility issue sounds like a PyTorch issue rather than an issue with this software. Not sure about the memory issue but if it runs okay then it can probably be ignored.
