Running error with deepvariant_1.6.0-gpu.sif #774

Closed
Ge-Lab opened this issue Feb 18, 2024 · 4 comments

Ge-Lab commented Feb 18, 2024

Hi,

I followed the instructions on deepvariant quick start (https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-quick-start.md) to create deepvariant_1.6.0.sif and deepvariant_1.6.0-gpu.sif successfully using apptainer.

Then, I followed the complete genomics T7 case study (https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-complete-t7-case-study.md) to have some test runs.

  1. CPU version
    I ran the following command:
  -B input:/input \
  -B output_apptainer_cpu:/output \
  deepvariant_1.6.0.sif \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=reference/GRCh38_no_alt_analysis_set.fasta \
  --reads=input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam \
  --output_vcf=output_apptainer_cpu/HG001.apptainer.cpu.output.vcf.gz \
  --output_gvcf=output_apptainer_cpu/HG001.apptainer.cpu.output.g.vcf.gz \
  --num_shards=$(nproc) \
  --customized_model=input/weights-51-0.995354.ckpt

It was successful. Both vcf and gvcf were generated.

  2. GPU version
    I ran the following command:
apptainer run --nv \
  -B input:/input \
  -B output_apptainer_gpu:/output \
  deepvariant_1.6.0-gpu.sif \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=reference/GRCh38_no_alt_analysis_set.fasta \
  --reads=input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam \
  --output_vcf=output_apptainer_gpu/HG001.apptainer.gpu.output.vcf.gz \
  --output_gvcf=output_apptainer_gpu/HG001.apptainer.gpu.output.g.vcf.gz \
  --num_shards=$(nproc) \
  --customized_model=input/weights-51-0.995354.ckpt

It seems there were some errors and the GPU was not used. This is the output (parts of it were removed because of the character limit for this post):

➜  t7 apptainer run --nv \
  -B input:/input \
  -B output_apptainer_gpu:/output \
  deepvariant_1.6.0-gpu.sif \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=reference/GRCh38_no_alt_analysis_set.fasta \
  --reads=input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam \
  --output_vcf=output_apptainer_gpu/HG001.apptainer.gpu.output.vcf.gz \
  --output_gvcf=output_apptainer_gpu/HG001.apptainer.gpu.output.g.vcf.gz \
  --num_shards=$(nproc) \
  --customized_model=input/weights-51-0.995354.ckpt
INFO:    /usr/local/etc/singularity/ exists; cleanup by system administrator is not complete (see https://apptainer.org/docs/admin/latest/singularity_migration.html)

==========
== CUDA ==
==========

CUDA Version 11.3.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2024-02-17 23:31:25.687399: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-17 23:31:39.809521: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2024-02-17 23:31:39.810043: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2024-02-17 23:31:59.620996: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I0217 23:31:59.623967 140288433825600 run_deepvariant.py:519] Re-using the directory for intermediate results in /tmp/tmpd74of138
I0217 23:31:59.629002 140288433825600 run_deepvariant.py:551] You set --customized_model. Instead of using the default model for WGS, `call_variants` step will load input/weights-51-0.995354.ckpt* instead.

***** Intermediate results will be written to /tmp/tmpd74of138 in docker. ****


***** Running the command:*****
time seq 0 15 | parallel -q --halt 2 --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "reference/GRCh38_no_alt_analysis_set.fasta" --reads "input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam" --examples "/tmp/tmpd74of138/make_examples.tfrecord@16.gz" --channels "insert_size" --gvcf "/tmp/tmpd74of138/gvcf.tfrecord@16.gz" --task {}

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "en_US:en",
	LC_ALL = (unset),
	LC_ADDRESS = "en_US.UTF-8",
	LC_NAME = "en_US.UTF-8",
	LC_MONETARY = "en_US.UTF-8",
	LC_PAPER = "en_US.UTF-8",
	LC_IDENTIFICATION = "en_US.UTF-8",
	LC_TELEPHONE = "en_US.UTF-8",
	LC_MEASUREMENT = "en_US.UTF-8",
	LC_CTYPE = "C.UTF-8",
	LC_TIME = "en_US.UTF-8",
	LC_NUMERIC = "en_US.UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "en_US:en",
	LC_ALL = (unset),
	LC_TIME = "en_US.UTF-8",
	LC_MONETARY = "en_US.UTF-8",
	LC_CTYPE = "C.UTF-8",
	LC_ADDRESS = "en_US.UTF-8",
	LC_TELEPHONE = "en_US.UTF-8",
	LC_NAME = "en_US.UTF-8",
	LC_MEASUREMENT = "en_US.UTF-8",
	LC_IDENTIFICATION = "en_US.UTF-8",
	LC_NUMERIC = "en_US.UTF-8",
	LC_PAPER = "en_US.UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
2024-02-17 23:32:31.107126: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2024-02-17 23:32:31.108506: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2024-02-17 23:32:31.006781: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2024-02-17 23:32:31.007601: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2024-02-17 23:32:31.110201: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
...
2024-02-17 23:33:25.887517: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I0217 23:33:25.933275 140533724936000 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
I0217 23:33:25.939588 140533724936000 make_examples_core.py:301] Task 15/16: Preparing inputs
I0217 23:33:25.967685 140533724936000 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
I0217 23:33:26.024591 140533724936000 make_examples_core.py:301] Task 15/16: Common contigs are ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']
2024-02-17 23:33:25.886408: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I0217 23:33:25.933485 139726133032768 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
I0217 23:33:25.940178 139726133032768 make_examples_core.py:301] Task 4/16: Preparing inputs
I0217 23:33:25.967752 139726133032768 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
...
2024-02-17 23:33:25.888518: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I0217 23:33:25.933323 140099871606592 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
I0217 23:33:25.939591 140099871606592 make_examples_core.py:301] Task 0/16: Preparing inputs
I0217 23:33:25.967773 140099871606592 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
I0217 23:33:26.024448 140099871606592 make_examples_core.py:301] Task 0/16: Common contigs are ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']
I0217 23:33:34.679437 140533724936000 make_examples_core.py:301] Task 15/16: Starting from v0.9.0, --use_ref_for_cram is default to true. If you are using CRAM input, note that we will decode CRAM using the reference you passed in with --ref
I0217 23:33:34.748554 140533724936000 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
I0217 23:33:34.253728 140616181937984 make_examples_core.py:301] Task 14/16: Starting from v0.9.0, --use_ref_for_cram is default to true. If you are using CRAM input, note that we will decode CRAM using the reference you passed in with --ref
I0217 23:33:34.663679 140616181937984 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
...
I0217 23:33:34.663670 140099871606592 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
I0217 23:33:34.887505 140533724936000 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
I0217 23:33:34.888105 140533724936000 make_examples_core.py:301] Task 15/16: Writing gvcf records to /tmp/tmpd74of138/gvcf.tfrecord-00015-of-00016.gz
I0217 23:33:34.888602 140533724936000 make_examples_core.py:301] Task 15/16: Writing examples to /tmp/tmpd74of138/make_examples.tfrecord-00015-of-00016.gz
I0217 23:33:34.888697 140533724936000 make_examples_core.py:301] Task 15/16: Overhead for preparing inputs: 8 seconds
I0217 23:33:34.912838 140533724936000 make_examples_core.py:301] Task 15/16: 0 candidates (0 examples) [0.02s elapsed]
I0217 23:33:35.455924 139726133032768 make_examples_core.py:301] Task 4/16: Starting from v0.9.0, --use_ref_for_cram is default to true. If you are using CRAM input, note that we will decode CRAM using the reference you passed in with --ref
I0217 23:33:35.526908 139726133032768 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
I0217 23:33:35.675196 139726133032768 genomics_reader.py:222] Reading input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam with NativeSamReader
...
I0217 23:33:34.810906 140099871606592 make_examples_core.py:301] Task 0/16: Writing gvcf records to /tmp/tmpd74of138/gvcf.tfrecord-00000-of-00016.gz
I0217 23:33:34.811542 140099871606592 make_examples_core.py:301] Task 0/16: Writing examples to /tmp/tmpd74of138/make_examples.tfrecord-00000-of-00016.gz
I0217 23:33:34.811659 140099871606592 make_examples_core.py:301] Task 0/16: Overhead for preparing inputs: 8 seconds
I0217 23:33:34.827609 140099871606592 make_examples_core.py:301] Task 0/16: 0 candidates (0 examples) [0.02s elapsed]
...
I0218 00:34:18.301548 140191938357056 make_examples_core.py:2958] example_shape = [100, 221, 7]
I0218 00:34:18.301738 140191938357056 make_examples_core.py:2959] example_channels = [1, 2, 3, 4, 5, 6, 19]
I0218 00:34:18.302148 140191938357056 make_examples_core.py:301] Task 3/16: Found 9819 candidate variants
I0218 00:34:18.302218 140191938357056 make_examples_core.py:301] Task 3/16: Created 10372 examples

real	62m19.124s
user	928m53.495s
sys	2m16.403s

***** Running the command:*****
time /opt/deepvariant/bin/call_variants --outfile "/tmp/tmpd74of138/call_variants_output.tfrecord.gz" --examples "/tmp/tmpd74of138/make_examples.tfrecord@16.gz" --checkpoint "input/weights-51-0.995354.ckpt"

2024-02-18 00:34:28.767569: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2024-02-18 00:34:28.768358: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
2024-02-18 00:34:45.482939: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I0218 00:34:45.513278 140119155529536 call_variants.py:471] Total 1 writing processes started.
I0218 00:34:45.536368 140119155529536 dv_utils.py:365] From /tmp/tmpd74of138/make_examples.tfrecord-00000-of-00016.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
I0218 00:34:45.536543 140119155529536 call_variants.py:506] Shape of input examples: [100, 221, 7]
I0218 00:34:45.537125 140119155529536 call_variants.py:510] Use saved model: False
Model: "inceptionv3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_1 (InputLayer)           [(None, 100, 221, 7  0           []                               
                                )]                                                                                                                                    
 ...
 classification (Dense)         (None, 3)            6147        ['dropout[0][0]']                
                                                                                                  
==================================================================================================
Total params: 21,810,083
Trainable params: 21,775,651
Non-trainable params: 34,432
__________________________________________________________________________________________________
/usr/local/lib/python3.8/dist-packages/keras/applications/inception_v3.py:138: UserWarning: This model usually expects 1 or 3 input channels. However, it was passed an input_shape with 7 input channels.
  input_shape = imagenet_utils.obtain_input_shape(
I0218 00:34:52.923406 140119155529536 keras_modeling.py:325] Number of l2 regularizers: 95.
I0218 00:34:52.923618 140119155529536 keras_modeling.py:337] inceptionv3: No initial checkpoint specified.
2024-02-18 00:34:57.911320: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 1330642944 exceeds 10% of free system memory.
2024-02-18 00:34:58.566676: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 842268672 exceeds 10% of free system memory.
I0218 00:35:01.595164 140119155529536 call_variants.py:583] Predicted 1024 examples in 1 batches [0.637 sec per 100].
2024-02-18 00:35:02.648043: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 1330642944 exceeds 10% of free system memory.
2024-02-18 00:35:03.234445: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 842268672 exceeds 10% of free system memory.
2024-02-18 00:35:07.222464: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 1330642944 exceeds 10% of free system memory.
I0218 00:38:56.687749 140119155529536 call_variants.py:583] Predicted 52224 examples in 51 batches [0.463 sec per 100].
I0218 00:42:59.116032 140119155529536 call_variants.py:583] Predicted 103424 examples in 101 batches [0.468 sec per 100].
I0218 00:46:58.822113 140119155529536 call_variants.py:583] Predicted 154624 examples in 151 batches [0.468 sec per 100].
I0218 00:47:39.156648 140119155529536 call_variants.py:623] Complete: call_variants.

real	13m21.231s
user	118m36.634s
sys	25m56.983s

***** Running the command:*****
time /opt/deepvariant/bin/postprocess_variants --ref "reference/GRCh38_no_alt_analysis_set.fasta" --infile "/tmp/tmpd74of138/call_variants_output.tfrecord.gz" --outfile "output_apptainer_gpu/HG001.apptainer.gpu.output.vcf.gz" --cpus "16" --gvcf_outfile "output_apptainer_gpu/HG001.apptainer.gpu.output.g.vcf.gz" --nonvariant_site_tfrecord_path "/tmp/tmpd74of138/gvcf.tfrecord@16.gz"

2024-02-18 00:47:52.195457: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2024-02-18 00:47:52.196245: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2024-02-18 00:48:10.043945: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I0218 00:48:10.133844 139719065552704 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: HG001
I0218 00:48:12.163552 139719065552704 postprocess_variants.py:1313] CVO sorting took 0.03374857902526855 minutes
I0218 00:48:12.163919 139719065552704 postprocess_variants.py:1316] Transforming call_variants_output to variants.
I0218 00:48:12.163960 139719065552704 postprocess_variants.py:1318] Using 16 CPUs for parallelization of variant transformation.
I0218 00:48:12.684920 139719065552704 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: HG001
I0218 00:48:18.996037 139719065552704 postprocess_variants.py:1386] Processing variants (and writing to temporary file) took 0.06664579312006633 minutes
I0218 00:48:39.012242 139719065552704 postprocess_variants.py:1407] Finished writing VCF and gVCF in 0.33359973033269247 minutes.

real	0m59.941s
user	0m58.218s
sys	0m5.086s

***** Running the command:*****
time /opt/deepvariant/bin/vcf_stats_report --input_vcf "output_apptainer_gpu/HG001.apptainer.gpu.output.vcf.gz" --outfile_base "output_apptainer_gpu/HG001.apptainer.gpu.output"

2024-02-18 00:48:50.006549: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2024-02-18 00:48:50.008250: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2024-02-18 00:48:57.417490: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I0218 00:48:57.421117 139673283618624 genomics_reader.py:222] Reading output_apptainer_gpu/HG001.apptainer.gpu.output.vcf.gz with NativeVcfReader

real	0m23.982s
user	0m12.056s
sys	0m2.006s



My system is Ubuntu 22.04. I have two GPUs.

nvidia-smi

Sat Feb 17 23:40:49 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     On   | 00000000:17:00.0 Off |                  N/A |
| 30%   27C    P8     9W / 125W |    110MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P4000        On   | 00000000:65:00.0  On |                  N/A |
| 46%   33C    P0    28W / 105W |   1048MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2236      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      3948      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      4070    C+G   ...ome-remote-desktop-daemon       96MiB |
|    1   N/A  N/A      2236      G   /usr/lib/xorg/Xorg                 21MiB |
|    1   N/A  N/A      2504      G   ...mviewer/tv_bin/TeamViewer       37MiB |
|    1   N/A  N/A      3948      G   /usr/lib/xorg/Xorg                509MiB |
|    1   N/A  N/A      4182      G   /usr/bin/gnome-shell              195MiB |
|    1   N/A  N/A     11053      G   uex                                 1MiB |
|    1   N/A  N/A   1285104      G   ...on=20240130-180151.247000      148MiB |
|    1   N/A  N/A   1287635      G   ...--variations-seed-version       19MiB |
+-----------------------------------------------------------------------------+

I did run export CUDA_VISIBLE_DEVICES=0 before running deepvariant, as you suggested in issue #761.

Why was the GPU not used in my run? Any help would be greatly appreciated!

@danielecook
Collaborator

@Ge-Lab if possible, please structure your issue using code fences. See this guide for details. It will make it easier to read and understand if you place logs inside code blocks, for example.

This guide suggests a few things to try. Interestingly, it appears you should set CUDA_VISIBLE_DEVICES=0 within the container itself, but SINGULARITYENV_CUDA_VISIBLE_DEVICES=0 outside of the container. Were you setting the variable appropriately, within the container or outside of it?

Additionally, since you have two GPUs you will want to set this variable to 0,1.

Do you have the nvidia-container-cli installed as suggested on the support page?
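A rough sketch of the two ways of setting the variable described above, assuming the image name from the original post and that `bash` is available in the image. The `APPTAINERENV_` prefix is Apptainer's newer counterpart of `SINGULARITYENV_`; setting both here is a precaution, not something the guide requires. The `INNER_CMD` variable is only an illustrative name, and the container command runs only if `apptainer` and the image are actually present.

```shell
#!/usr/bin/env bash
# Sketch only: the image name is taken from the post above.
SIF="deepvariant_1.6.0-gpu.sif"
DEVICES="0,1"   # both GPUs; use "0" to expose only the first GPU

# Outside the container: use the prefixed variable so the runtime
# passes CUDA_VISIBLE_DEVICES into the container environment.
export SINGULARITYENV_CUDA_VISIBLE_DEVICES="$DEVICES"
export APPTAINERENV_CUDA_VISIBLE_DEVICES="$DEVICES"

# Inside the container: set the plain variable in the shell that runs there,
# then list the GPUs that the driver reports.
INNER_CMD="export CUDA_VISIBLE_DEVICES=$DEVICES; nvidia-smi -L"

if command -v apptainer >/dev/null 2>&1 && [ -f "$SIF" ]; then
  apptainer exec --nv "$SIF" bash -c "$INNER_CMD"
else
  echo "apptainer or $SIF not found; would run: apptainer exec --nv $SIF bash -c '$INNER_CMD'"
fi
```

If `nvidia-smi -L` lists no devices inside the container, the problem is more likely in the `--nv` setup than in DeepVariant itself.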

@Ge-Lab
Author

Ge-Lab commented Feb 21, 2024

Hi Daniele,

Thanks for your response. I updated the post as you suggested.

How can I set CUDA_VISIBLE_DEVICES=0 within the container and SINGULARITYENV_CUDA_VISIBLE_DEVICES=0 outside of the container? I just typed export CUDA_VISIBLE_DEVICES=0 in the terminal and then typed the apptainer command to run.

I do have nvidia-container-cli.

@danielecook
Collaborator

@Ge-Lab first, let's try setting export SINGULARITYENV_CUDA_VISIBLE_DEVICES=0 before you run the container, since that is easier. Give that a try and let me know if you are still having issues.
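One possible way to test this before committing to another full run: export the variable on the host, then ask TensorFlow inside the container which GPUs it sees. This is a sketch under assumptions, not a confirmed fix. The log file name `gpu_check.log` and the helper `has_cuinit_error` are made up for illustration; the image name and the error string come from the original post.

```shell
#!/usr/bin/env bash
# Sketch only: file and image names are assumptions based on the post above.
SIF="deepvariant_1.6.0-gpu.sif"
LOG="gpu_check.log"   # hypothetical log file name

# Hypothetical helper: succeeds if the given log contains the
# cuInit failure reported in the original output.
has_cuinit_error() {
  grep -q "failed call to cuInit" "$1"
}

# Set the prefixed variable on the host before starting the container.
export SINGULARITYENV_CUDA_VISIBLE_DEVICES=0

if command -v apptainer >/dev/null 2>&1 && [ -f "$SIF" ]; then
  # Ask TensorFlow inside the container which GPUs it can see.
  apptainer exec --nv "$SIF" python3 -c \
    "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" \
    > "$LOG" 2>&1
  if has_cuinit_error "$LOG"; then
    echo "cuInit still fails; the GPU is not usable from the container"
  else
    cat "$LOG"
  fi
else
  echo "apptainer or $SIF not found; skipping the GPU check"
fi
```

An empty list or a repeated `cuInit` error would suggest the variable alone does not resolve the problem.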

@pichuan
Collaborator

pichuan commented Apr 15, 2024

Hi @Ge-Lab , this has been inactive for more than a month, so I'll close. But please feel free to reopen with more details now if you want to follow up!

@pichuan pichuan closed this as completed Apr 15, 2024