Fatal error while training with Word Generator on multi GPU #2016

@neojg

Bug description

I run the training using the following command:

CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 \
references/recognition/train.py  \
 vitstr_base  \
 --vocab polish  \
 --output_dir ./polish_train  \
 --name FirstTrain  \
 --max-chars 32 \
 --epochs 20  \
 --train-samples 50000 \
 --val-samples 500 \
 --resume ./polish_train/FirstTrain.vl.0.0506558.pt  \
 --batch_size 192 \
  --font arial-2.ttf,DejaVuSansMono-Bold.ttf,Havana-Regular.ttf,LiberationSansNarrow-Italic.ttf,AYearWithoutRain.ttf,DejaVuSansMono-Oblique.ttf,Helvetica.ttf,...,LiberationSansNarrow-Bold.ttf \
--backend nccl

More than once I have hit the following error. It can occur during the first epoch or after a few epochs, so I cannot complete a full training run!

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/jacek/DocTR/doctr-main/references/recognition/train.py", line 726, in <module>
[rank0]:     main(args)
[rank0]:     ~~~~^^^^^^
[rank0]:   File "/home/jacek/DocTR/doctr-main/references/recognition/train.py", line 562, in main
[rank0]:     train_loss, actual_lr = fit_one_epoch(
[rank0]:                             ~~~~~~~~~~~~~^
[rank0]:         model,
[rank0]:         ^^^^^^
[rank0]:     ...<7 lines>...
[rank0]:         rank=rank,
[rank0]:         ^^^^^^^^^^
[rank0]:     )
[rank0]:     ^
[rank0]:   File "/home/jacek/DocTR/doctr-main/references/recognition/train.py", line 121, in fit_one_epoch
[rank0]:     for images, targets in pbar:
[rank0]:                            ^^^^
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/tqdm/std.py", line 1181, in __iter__
[rank0]:     for obj in iterable:
[rank0]:                ^^^^^^^^
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 734, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1516, in _next_data
[rank0]:     return self._process_data(data, worker_id)
[rank0]:            ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1551, in _process_data
[rank0]:     data.reraise()
[rank0]:     ~~~~~~~~~~~~^^
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/_utils.py", line 769, in reraise
[rank0]:     raise exception
[rank0]: RuntimeError: Caught RuntimeError in DataLoader worker process 7.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:             ~~~~~~~~~~~~^^^^^
[rank0]:   File "/home/jacek/DocTR/doctr-main/doctr/datasets/datasets/base.py", line 57, in __getitem__
[rank0]:     img = self.img_transforms(img)
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torchvision/transforms/v2/_container.py", line 52, in forward
[rank0]:     outputs = transform(*inputs)
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/jacek/DocTR/doctr-main/doctr/transforms/modules/pytorch.py", line 79, in forward
[rank0]:     img = F.resize(img, tmp_size, self.interpolation, antialias=True)
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torchvision/transforms/functional.py", line 479, in resize
[rank0]:     return F_t.resize(img, size=output_size, interpolation=interpolation.value, antialias=antialias)
[rank0]:            ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torchvision/transforms/_functional_tensor.py", line 467, in resize
[rank0]:     img = interpolate(img, size=size, mode=interpolation, align_corners=align_corners, antialias=antialias)
[rank0]:   File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/functional.py", line 4759, in interpolate
[rank0]:     return torch._C._nn._upsample_bilinear2d_aa(
[rank0]:            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
[rank0]:         input, output_size, align_corners, scale_factors
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:     )
[rank0]:     ^
[rank0]: RuntimeError: Input and output sizes should be greater than 0, but got input (H: 0, W: 29) output (H: 1, W: 128)
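
The traceback shows that the resize transform received a zero-height image (H: 0, W: 29) and that the exception was raised inside the dataset's `__getitem__`, which killed the DataLoader worker. A possible stopgap, not a fix for the root cause, is to skip the failing sample instead of aborting the run. The sketch below is a hypothetical wrapper and is not part of docTR: the name `RetryOnFailure` and its `max_retries` parameter are assumptions made for illustration. It catches the `RuntimeError`, tries the next index, and forwards other attribute lookups (for example a `collate_fn`, if the training script reads one from the dataset) to the wrapped dataset.

```python
class RetryOnFailure:
    """Hypothetical dataset wrapper (not a docTR API).

    If fetching a sample raises RuntimeError (e.g. the zero-height resize
    failure in the traceback above), try the following index instead.
    """

    def __init__(self, dataset, max_retries=10):
        self.dataset = dataset
        self.max_retries = max_retries

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        last_error = None
        for attempt in range(self.max_retries):
            # Wrap around so the shifted index always stays in range.
            candidate = (idx + attempt) % len(self.dataset)
            try:
                return self.dataset[candidate]
            except RuntimeError as err:
                last_error = err
        raise RuntimeError(
            f"no usable sample after {self.max_retries} attempts from index {idx}"
        ) from last_error

    def __getattr__(self, name):
        # Only reached when normal lookup fails. Never forward dunder names
        # or the wrapped dataset itself, to avoid recursion while unpickling.
        if name == "dataset" or name.startswith("__"):
            raise AttributeError(name)
        return getattr(self.dataset, name)
```

Under these assumptions, the training set would be wrapped once before it is handed to the `DataLoader`, e.g. `train_set = RetryOnFailure(train_set)`. This only hides the symptom; the generated sample with zero height still needs to be tracked down.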

Code snippet to reproduce the bug

CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 \
references/recognition/train.py  \
 vitstr_base  \
 --vocab polish  \
 --output_dir ./polish_train  \
 --name FirstTrain  \
 --max-chars 32 \
 --epochs 20  \
 --train-samples 50000 \
 --val-samples 500 \
 --resume ./polish_train/FirstTrain.vl.0.0506558.pt  \
 --batch_size 192 \
  --font arial-2.ttf,DejaVuSansMono-Bold.ttf,Havana-Regular.ttf,LiberationSansNarrow-Italic.ttf,AYearWithoutRain.ttf,DejaVuSansMono-Oblique.ttf,Helvetica.ttf,...,LiberationSansNarrow-Bold.ttf \
--backend nccl

Error traceback

[rank0]: RuntimeError: Input and output sizes should be greater than 0, but got input (H: 0, W: 29) output (H: 1, W: 128)

Environment

Collecting environment information...

DocTR version: 1.0.0 # manually added the code from github
PyTorch version: 2.6.0+debian (torchvision 0.21.0)
OpenCV version: N/A
OS: Ubuntu 25.04
Python version: 3.13.3
Is CUDA available (PyTorch): No
CUDA runtime version: 12.2.140
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version: 575.57.08
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.8.0
