Labels: type: bug (Something isn't working)
Description
Bug description
I run the training using the following command:
CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 \
references/recognition/train.py \
vitstr_base \
--vocab polish \
--output_dir ./polish_train \
--name FirstTrain \
--max-chars 32 \
--epochs 20 \
--train-samples 50000 \
--val-samples 500 \
--resume ./polish_train/FirstTrain.vl.0.0506558.pt \
--batch_size 192 \
--font arial-2.ttf,DejaVuSansMono-Bold.ttf,Havana-Regular.ttf,LiberationSansNarrow-Italic.ttf,AYearWithoutRain.ttf,DejaVuSansMono-Oblique.ttf,Helvetica.ttf,...,LiberationSansNarrow-Bold.ttf \
--backend nccl
More than once I have hit the following error. It can occur during the first epoch or after a few epochs, so I cannot complete a full training run.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/jacek/DocTR/doctr-main/references/recognition/train.py", line 726, in <module>
[rank0]: main(args)
[rank0]: ~~~~^^^^^^
[rank0]: File "/home/jacek/DocTR/doctr-main/references/recognition/train.py", line 562, in main
[rank0]: train_loss, actual_lr = fit_one_epoch(
[rank0]: ~~~~~~~~~~~~~^
[rank0]: model,
[rank0]: ^^^^^^
[rank0]: ...<7 lines>...
[rank0]: rank=rank,
[rank0]: ^^^^^^^^^^
[rank0]: )
[rank0]: ^
[rank0]: File "/home/jacek/DocTR/doctr-main/references/recognition/train.py", line 121, in fit_one_epoch
[rank0]: for images, targets in pbar:
[rank0]: ^^^^
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/tqdm/std.py", line 1181, in __iter__
[rank0]: for obj in iterable:
[rank0]: ^^^^^^^^
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 734, in __next__
[rank0]: data = self._next_data()
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1516, in _next_data
[rank0]: return self._process_data(data, worker_id)
[rank0]: ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1551, in _process_data
[rank0]: data.reraise()
[rank0]: ~~~~~~~~~~~~^^
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/_utils.py", line 769, in reraise
[rank0]: raise exception
[rank0]: RuntimeError: Caught RuntimeError in DataLoader worker process 7.
[rank0]: Original Traceback (most recent call last):
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
[rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: ~~~~~~~~~~~~^^^^^
[rank0]: File "/home/jacek/DocTR/doctr-main/doctr/datasets/datasets/base.py", line 57, in __getitem__
[rank0]: img = self.img_transforms(img)
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torchvision/transforms/v2/_container.py", line 52, in forward
[rank0]: outputs = transform(*inputs)
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/jacek/DocTR/doctr-main/doctr/transforms/modules/pytorch.py", line 79, in forward
[rank0]: img = F.resize(img, tmp_size, self.interpolation, antialias=True)
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torchvision/transforms/functional.py", line 479, in resize
[rank0]: return F_t.resize(img, size=output_size, interpolation=interpolation.value, antialias=antialias)
[rank0]: ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torchvision/transforms/_functional_tensor.py", line 467, in resize
[rank0]: img = interpolate(img, size=size, mode=interpolation, align_corners=align_corners, antialias=antialias)
[rank0]: File "/home/jacek/DocTR/doctr-main/env13/lib/python3.13/site-packages/torch/nn/functional.py", line 4759, in interpolate
[rank0]: return torch._C._nn._upsample_bilinear2d_aa(
[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
[rank0]: input, output_size, align_corners, scale_factors
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: )
[rank0]: ^
[rank0]: RuntimeError: Input and output sizes should be greater than 0, but got input (H: 0, W: 29) output (H: 1, W: 128)
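The last frame shows the resize being handed an image with height 0 (input H: 0, W: 29). To find out which samples are affected, a guard could be run just before the resize step. The following is a minimal sketch of such a check, not part of docTR: `check_nonempty` and its `sample_id` argument are hypothetical names, and it assumes the image exposes a `.shape` attribute whose last two entries are (H, W), as a channel-first tensor does.

```python
from typing import Any


def check_nonempty(img: Any, sample_id: object = None) -> Any:
    """Raise a descriptive error if the image has a zero-sized height or width.

    Hypothetical helper: assumes img.shape ends with (H, W).
    Returns the image unchanged when both dimensions are positive.
    """
    shape = tuple(img.shape)
    if len(shape) < 2:
        raise ValueError(
            f"expected at least 2 dims, got shape {shape} (sample {sample_id!r})"
        )
    height, width = shape[-2], shape[-1]
    if height <= 0 or width <= 0:
        raise ValueError(
            f"empty image before resize: H={height}, W={width}, "
            f"shape={shape} (sample {sample_id!r})"
        )
    return img
```

Called on each image ahead of the resize, this would replace the low-level interpolation error with a message that names the offending sample and its shape, which should help narrow down whether a particular font or generated word produces the empty crop.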
Code snippet to reproduce the bug
CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 \
references/recognition/train.py \
vitstr_base \
--vocab polish \
--output_dir ./polish_train \
--name FirstTrain \
--max-chars 32 \
--epochs 20 \
--train-samples 50000 \
--val-samples 500 \
--resume ./polish_train/FirstTrain.vl.0.0506558.pt \
--batch_size 192 \
--font arial-2.ttf,DejaVuSansMono-Bold.ttf,Havana-Regular.ttf,LiberationSansNarrow-Italic.ttf,AYearWithoutRain.ttf,DejaVuSansMono-Oblique.ttf,Helvetica.ttf,...,LiberationSansNarrow-Bold.ttf \
--backend nccl
Error traceback
[rank0]: RuntimeError: Input and output sizes should be greater than 0, but got input (H: 0, W: 29) output (H: 1, W: 128)
Environment
Collecting environment information...
DocTR version: 1.0.0 # manually added the code from github
PyTorch version: 2.6.0+debian (torchvision 0.21.0)
OpenCV version: N/A
OS: Ubuntu 25.04
Python version: 3.13.3
Is CUDA available (PyTorch): No
CUDA runtime version: 12.2.140
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
Nvidia driver version: 575.57.08
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.8.0