
CUDAOutOfMemoryError on A100 #5

Closed
vdejager opened this issue May 24, 2022 · 3 comments
@vdejager

I managed to compile on RedHat/CentOS 8, but I'm getting errors with the 'sup' models:
dna_r9.4.1_e8.1_sup@v3.3 and dna_r9.4.1_e8_sup@v3.3

The data are from a FLO-MIN106 flow cell with the SQK-DCS109 kit (dna_r9.4.1_450bps_hac).
It is an amplicon run, so there is no prior information other than that it should contain 16S sequences.

The following models work fine: dna_r9.4.1_e8_hac@v3.3, dna_r9.4.1_e8.1_hac@v3.3, dna_r9.4.1_e8_fast@v3.4 and dna_r9.4.1_e8.1_fast@v3.4.
However, which model would you suggest using, and what would the best method be to compare the results against the Bonito and Guppy output?

The error is below:

Creating basecall pipeline
@HD VN:1.5 SO:unknown
@PG ID:basecaller PN:dorado VN:0.0.1a0 CL:dorado basecaller dna_r9.4.1_e8_sup@v3.3 /projects/0/lwc2020006/nanopore/0_5cmSedAarhusBay/test
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
what(): CUDA out of memory. Tried to allocate 12.50 GiB (GPU 0; 39.59 GiB total capacity; 27.41 GiB already allocated; 9.90 GiB free; 27.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:536 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x148b5fee7d62 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libc10.so)
frame #1: + 0x257de (0x148b179577de in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libc10_cuda.so)
frame #2: + 0x264b2 (0x148b179584b2 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libc10_cuda.so)
frame #3: + 0x268e2 (0x148b179588e2 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x124 (0x148b797d25a4 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cpp.so)
frame #5: + 0x25aaed9 (0x148b1f00eed9 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cu.so)
frame #6: + 0x25ee6fd (0x148b1f0526fd in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cu.so)
frame #7: at::TensorIteratorBase::allocate_or_resize_outputs() + 0x25b (0x148b613277db in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #8: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x1d3 (0x148b61328b23 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #9: at::TensorIteratorBase::build_borrowing_binary_op(at::Tensor const&, at::Tensor const&, at::Tensor const&) + 0xd5 (0x148b6132a1f5 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #10: + 0x25e2cfd (0x148b1f046cfd in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cu.so)
frame #11: + 0x25e2dcf (0x148b1f046dcf in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda_cu.so)
frame #12: at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&) + 0x136 (0x148b61a43556 in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #13: at::native::mul(at::Tensor const&, c10::Scalar const&) + 0xaf (0x148b614d581f in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #14: + 0x1e300bf (0x148b61f5e0bf in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #15: at::_ops::mul_Scalar::call(at::Tensor const&, c10::Scalar const&) + 0x12d (0x148b61dce4cd in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cpu.so)
frame #16: dorado() [0x5360f5]
frame #17: dorado() [0x536467]
frame #18: dorado() [0x539be5]
frame #19: dorado() [0x559e28]
frame #20: dorado() [0x5593bf]
frame #21: dorado() [0x55860b]
frame #22: dorado() [0x555556]
frame #23: dorado() [0x544502]
frame #24: dorado() [0x5405c0]
frame #25: dorado() [0x53c3ee]
frame #26: dorado() [0x55a90e]
frame #27: dorado() [0x55989b]
frame #28: dorado() [0x558e21]
frame #29: dorado() [0x5576d2]
frame #30: dorado() [0x4cd878]
frame #31: dorado() [0x4cd4a7]
frame #32: dorado() [0x4ccee5]
frame #33: dorado() [0x56009a]
frame #34: dorado() [0x560be8]
frame #35: dorado() [0x56b723]
frame #36: dorado() [0x56b5a1]
frame #37: dorado() [0x56b489]
frame #38: dorado() [0x56b396]
frame #39: dorado() [0x56b320]
frame #40: + 0xc71f (0x148be296371f in /projects/0/lwc2020006/software/nanoporetech/dorado/dorado/3rdparty/torch-1.10.2-Linux/libtorch/lib/libtorch_cuda.so)
frame #41: + 0x814a (0x148be2d7514a in /lib64/libpthread.so.0)
frame #42: clone + 0x43 (0x148b1bf99dc3 in /lib64/libc.so.6)
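
As the error message itself suggests, allocator fragmentation can sometimes be mitigated by setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF before launching dorado. A minimal sketch of that workaround (the 512 MiB value and the data path are illustrative, not tested recommendations):

# Cap the size at which the caching allocator splits blocks, to reduce fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
dorado basecaller dna_r9.4.1_e8_sup@v3.3 /path/to/fast5_dir > basecalls.sam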

System:
ThinkSystem SD650-N V2
2x Intel Xeon Platinum 8360Y, 36 cores/socket, 2.4 GHz (Speed Select SKU), 250 W
4x NVIDIA A100, 40 GiB HBM2 memory with 5 active memory stacks per GPU
16x 32 GiB DDR4, 3200 MHz (512 GiB system memory; 160 GiB HBM2 total)
2x HDR100 ConnectX-6 single port; 2x 25 GbE SFP28 LOM; 1x 1 GbE RJ45 LOM

@vellamike
Collaborator

Hi @vdejager - can you try reducing the batch size to 512 or 256 with the -b parameter? Dorado currently uses more memory than Guppy and Bonito; this is a known issue we are working on.

There is no particular model we suggest - the models trade accuracy against speed (SUP is the most accurate and Fast is the fastest).
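
For example, a reduced-batch invocation would look roughly like this (the data path is a placeholder, and redirecting stdout to a SAM file assumes the @HD/@PG header lines shown above go to stdout):

dorado basecaller -b 256 dna_r9.4.1_e8.1_sup@v3.3 /path/to/fast5_dir > basecalls.sam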

@vdejager
Author

vdejager commented May 24, 2022

I'm running it now with

dorado basecaller -b 256 --emit-fastq  ${MODEL} ${DATA_DIR}

using the dna_r9.4.1_e8.1_sup@v3.3 model. MODEL contains the model name and DATA_DIR the directory with the fast5 files.

This seems to work on a small fast5 file. I'm going to test it on a bigger dataset to see how it goes.
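
While that runs, GPU memory headroom can be watched from a second terminal; a simple sketch using nvidia-smi (a standard NVIDIA tool, independent of dorado):

# Poll per-GPU memory usage every 5 seconds
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5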

@vellamike
Collaborator

Thanks @vdejager - I am closing this issue. When we have updates on the memory footprint, we will note them in the release notes.

@Kirk3gaard Kirk3gaard mentioned this issue May 23, 2023
@krobik26 krobik26 mentioned this issue Jan 19, 2024