
Why does training in Docker show "Training in device: cpu" and "Bus error"? #30

Closed
Xl-wj opened this issue Jan 8, 2020 · 2 comments

Xl-wj commented Jan 8, 2020

Hi, thank you for open-sourcing this work.
When I run training in Docker I hit two problems; the training output is as follows.


INTERFACE:
dataset /bonnet/KITTI/
arch_cfg config/arch/squeezeseg.yaml
data_cfg config/labels/semantic-kitti.yaml
log /bonnet/lidar-bonnetal/logs/
pretrained None

Commit hash (training version): b'4233111'

Opening arch config file config/arch/squeezeseg.yaml
Opening data config file config/labels/semantic-kitti.yaml
No pretrained directory found.
Copying files to /bonnet/lidar-bonnetal/logs/ for further reference.
Sequences folder exists! Using sequences from /bonnet/KITTI/sequences
parsing seq 00
parsing seq 01
parsing seq 02
parsing seq 03
parsing seq 04
parsing seq 05
parsing seq 06
parsing seq 07
parsing seq 09
parsing seq 10
Using 2761 scans from sequences [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
Sequences folder exists! Using sequences from /bonnet/KITTI/sequences
parsing seq 05
Using 2761 scans from sequences [5]
Loss weights from content: tensor([ 0.0000, 22.9317, 857.5627, 715.1100, 315.9618, 356.2452, 747.6170,
887.2239, 963.8915, 5.0051, 63.6247, 6.9002, 203.8796, 7.4802,
13.6315, 3.7339, 142.1462, 12.6355, 259.3699, 618.9667])
Using SqueezeNet Backbone
Depth of backbone input = 5
Original OS: 16
New OS: 16
Strides: [2, 2, 2, 2]
Decoder original OS: 16
Decoder new OS: 16
Decoder strides: [2, 2, 2, 2]
Total number of parameters: 915540
Total number of parameters requires_grad: 915540
Param encoder 724032
Param decoder 179968
Param head 11540
No path to pretrained, using random init.
Training in device: cpu
Ignoring class 0 in IoU evaluation
[IOU EVAL] IGNORE: tensor([0])
[IOU EVAL] INCLUDE: tensor([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19])
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "./train.py", line 115, in
trainer.train()
File "../../tasks/semantic/modules/trainer.py", line 236, in train
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
show_scans=self.ARCH["train"]["show_scans"])
File "../../tasks/semantic/modules/trainer.py", line 307, in train_epoch
for i, (in_vol, proj_mask, proj_labels, _, path_seq, path_name, _, _, _, _, _, _, _, _, _) in enumerate(train_loader):
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 576, in next
idx, batch = self._get_batch()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 553, in _get_batch
success, data = self._try_get_batch()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/usr/lib/python3.5/multiprocessing/queues.py", line 104, in get
if timeout < 0 or not self._poll(timeout):
File "/usr/lib/python3.5/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.5/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 212) is killed by signal: Bus error.

Looking forward to your reply.
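
A note on the "Training in device: cpu" line in the log above: the trainer falls back to the CPU when CUDA is not visible from inside the container, which typically means the container was not started with GPU access (e.g. via nvidia-docker or docker run --gpus all). A minimal check, assuming only a standard PyTorch install:

import torch

# If this prints False, PyTorch cannot see a GPU inside the container and
# training falls back to CPU, matching the "Training in device: cpu" log line.
print("CUDA available:", torch.cuda.is_available())

# When CUDA is visible, list the devices the container can actually use.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))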

tano297 (Member) commented Jan 8, 2020

At first sight it looks like you are running out of memory; can you try with workers=0?
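
For context, a minimal sketch of what workers=0 changes on the PyTorch side (the dataset below is a toy stand-in, not the repo's actual parser): with num_workers=0 the DataLoader loads batches in the main process instead of in worker subprocesses that pass tensors through shared memory (/dev/shm), which is what the bus error points at. Docker's default /dev/shm is only 64 MB, so an alternative to workers=0 is to start the container with a larger --shm-size or with --ipc=host.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the projected range-image scans (hypothetical shapes).
dataset = TensorDataset(torch.randn(32, 5, 64, 512),
                        torch.zeros(32, dtype=torch.long))

# num_workers=0 keeps data loading in the main process: no worker
# subprocesses, no shared-memory transfer, so the shm bus error cannot occur.
train_loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)

for in_vol, labels in train_loader:
    print(in_vol.shape, labels.shape)
    break

In this repo the worker count is presumably read from the train section of the arch yaml (config/arch/squeezeseg.yaml), so setting workers: 0 there should have the same effect.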

Xl-wj (Author) commented Jan 9, 2020

Thanks, it works! @tano297


tano297 closed this as completed Jan 9, 2020