
CUDAOutOfMemoryError on A100 #5

Closed
vdejager opened this issue May 24, 2022 · 3 comments
@vdejager

I managed to compile on RedHat/CentOS 8, but I'm getting errors with the 'sup' models:
dna_r9.4.1_e8.1_sup@v3.3 and dna_r9.4.1_e8_sup@v3.3

The data are from a FLO-MIN106 flow cell with the SQK-DCS109 kit (dna_r9.4.1_450bps_hac).
It is an amplicon run, so there is no prior information other than that it should contain 16S sequences.

The following models work fine: dna_r9.4.1_e8_hac@v3.3, dna_r9.4.1_e8.1_hac@v3.3, dna_r9.4.1_e8_fast@v3.4 and dna_r9.4.1_e8.1_fast@v3.4.
However, which model would you suggest using, and what would the best method be to compare the results against the Bonito and Guppy output?

The error is below:

Creating basecall pipeline
@HD VN:1.5 SO:unknown
@PG ID:basecaller PN:dorado VN:0.0.1a0 CL:dorado basecaller dna_r9.4.1_e8_sup@v3.3 /projects/0/lwc2020006/nanopore/0_5cmSedAarhusBay/test
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
what(): CUDA out of memory. Tried to allocate 12.50 GiB (GPU 0; 39.59 GiB total capacity; 27.41 GiB already allocated; 9.90 GiB free; 27.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:536 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x148b5fee7d62 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libc10.so)
frame #1: + 0x257de (0x148b179577de in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libc10_cuda.so)
frame #2: + 0x264b2 (0x148b179584b2 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libc10_cuda.so)
frame #3: + 0x268e2 (0x148b179588e2 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x124 (0x148b797d25a4 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cpp.so)
frame #5: + 0x25aaed9 (0x148b1f00eed9 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cu.so)
frame #6: + 0x25ee6fd (0x148b1f0526fd in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cu.so)
frame #7: at::TensorIteratorBase::allocate_or_resize_outputs() + 0x25b (0x148b613277db in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #8: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x1d3 (0x148b61328b23 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #9: at::TensorIteratorBase::build_borrowing_binary_op(at::Tensor const&, at::Tensor const&, at::Tensor const&) + 0xd5 (0x148b6132a1f5 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #10: + 0x25e2cfd (0x148b1f046cfd in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cu.so)
frame #11: + 0x25e2dcf (0x148b1f046dcf in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cu.so)
frame #12: at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&) + 0x136 (0x148b61a43556 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #13: at::native::mul(at::Tensor const&, c10::Scalar const&) + 0xaf (0x148b614d581f in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #14: + 0x1e300bf (0x148b61f5e0bf in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #15: at::_ops::mul_Scalar::call(at::Tensor const&, c10::Scalar const&) + 0x12d (0x148b61dce4cd in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #16: dorado() [0x5360f5]
frame #17: dorado() [0x536467]
frame #18: dorado() [0x539be5]
frame #19: dorado() [0x559e28]
frame #20: dorado() [0x5593bf]
frame #21: dorado() [0x55860b]
frame #22: dorado() [0x555556]
frame #23: dorado() [0x544502]
frame #24: dorado() [0x5405c0]
frame #25: dorado() [0x53c3ee]
frame #26: dorado() [0x55a90e]
frame #27: dorado() [0x55989b]
frame #28: dorado() [0x558e21]
frame #29: dorado() [0x5576d2]
frame #30: dorado() [0x4cd878]
frame #31: dorado() [0x4cd4a7]
frame #32: dorado() [0x4ccee5]
frame #33: dorado() [0x56009a]
frame #34: dorado() [0x560be8]
frame #35: dorado() [0x56b723]
frame #36: dorado() [0x56b5a1]
frame #37: dorado() [0x56b489]
frame #38: dorado() [0x56b396]
frame #39: dorado() [0x56b320]
frame #40: + 0xc71f (0x148be296371f in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda.so)
frame #41: + 0x814a (0x148be2d7514a in /lib64/libpthread.so.0)
frame #42: clone + 0x43 (0x148b1bf99dc3 in /lib64/libc.so.6)
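
As the error message itself suggests, allocator fragmentation can sometimes be mitigated by setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF before launching dorado. A minimal sketch of that workaround (the 512 MiB value and the data path are illustrative, not tested recommendations):

# Cap the size at which the caching allocator splits blocks, to reduce fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
dorado basecaller dna_r9.4.1_e8_sup@v3.3 /path/to/fast5_dir > basecalls.sam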

System:
ThinkSystem SD650-N V2
2x Intel Xeon Platinum 8360Y, 36 cores/socket, 2.4 GHz (Speed Select SKU), 250 W
4x NVIDIA A100, 40 GiB HBM2 memory with 5 active memory stacks per GPU
16x 32 GiB DDR4, 3200 MHz (512 GiB system memory; 160 GiB HBM2 total)
2x HDR100 ConnectX-6 single port; 2x 25 GbE SFP28 LOM; 1x 1 GbE RJ45 LOM

@vellamike
Collaborator

Hi @vdejager - can you try reducing the batch size to 512 or 256 with the -b parameter? Dorado currently uses more memory than Guppy and Bonito; this is a known issue we are working on.

There is no particular model we suggest - the models trade accuracy against speed (SUP is the most accurate and Fast is the fastest).
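
For example, a reduced-batch invocation would look roughly like this (the data path is a placeholder, and redirecting stdout to a SAM file assumes the @HD/@PG header lines shown above go to stdout):

dorado basecaller -b 256 dna_r9.4.1_e8.1_sup@v3.3 /path/to/fast5_dir > basecalls.sam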

@vdejager
Author

vdejager commented May 24, 2022

I'm running it now with

dorado basecaller -b 256 --emit-fastq  ${MODEL} ${DATA_DIR}

using the dna_r9.4.1_e8.1_sup@v3.3 model. MODEL contains the model name and DATA_DIR the directory with the fast5 files.

This seems to work on a small fast5 file. I'm going to test it on a bigger dataset to see how it goes.
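
While that runs, GPU memory headroom can be watched from a second terminal; a simple sketch using nvidia-smi (a standard NVIDIA tool, independent of dorado):

# Poll per-GPU memory usage every 5 seconds
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5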

@vellamike
Collaborator

Thanks @vdejager - I am closing this issue. When we have updates on the memory footprint, we will note them in the release notes.

@Kirk3gaard Kirk3gaard mentioned this issue May 23, 2023
@krobik26 krobik26 mentioned this issue Jan 19, 2024