
Why does training in Docker show "Training in device: cpu" and "Bus error"? #30

Closed
Xl-wj opened this issue Jan 8, 2020 · 2 comments

Xl-wj commented Jan 8, 2020

Hi, thank you for open-sourcing this work.
When I run training in Docker I hit two problems; the training output is as follows.


INTERFACE:
dataset /bonnet/KITTI/
arch_cfg config/arch/squeezeseg.yaml
data_cfg config/labels/semantic-kitti.yaml
log /bonnet/lidar-bonnetal/logs/
pretrained None

Commit hash (training version): b'4233111'

Opening arch config file config/arch/squeezeseg.yaml
Opening data config file config/labels/semantic-kitti.yaml
No pretrained directory found.
Copying files to /bonnet/lidar-bonnetal/logs/ for further reference.
Sequences folder exists! Using sequences from /bonnet/KITTI/sequences
parsing seq 00
parsing seq 01
parsing seq 02
parsing seq 03
parsing seq 04
parsing seq 05
parsing seq 06
parsing seq 07
parsing seq 09
parsing seq 10
Using 2761 scans from sequences [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
Sequences folder exists! Using sequences from /bonnet/KITTI/sequences
parsing seq 05
Using 2761 scans from sequences [5]
Loss weights from content: tensor([ 0.0000, 22.9317, 857.5627, 715.1100, 315.9618, 356.2452, 747.6170,
887.2239, 963.8915, 5.0051, 63.6247, 6.9002, 203.8796, 7.4802,
13.6315, 3.7339, 142.1462, 12.6355, 259.3699, 618.9667])
Using SqueezeNet Backbone
Depth of backbone input = 5
Original OS: 16
New OS: 16
Strides: [2, 2, 2, 2]
Decoder original OS: 16
Decoder new OS: 16
Decoder strides: [2, 2, 2, 2]
Total number of parameters: 915540
Total number of parameters requires_grad: 915540
Param encoder 724032
Param decoder 179968
Param head 11540
No path to pretrained, using random init.
Training in device: cpu
Ignoring class 0 in IoU evaluation
[IOU EVAL] IGNORE: tensor([0])
[IOU EVAL] INCLUDE: tensor([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19])
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "./train.py", line 115, in
trainer.train()
File "../../tasks/semantic/modules/trainer.py", line 236, in train
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
show_scans=self.ARCH["train"]["show_scans"])
File "../../tasks/semantic/modules/trainer.py", line 307, in train_epoch
for i, (in_vol, proj_mask, proj_labels, _, path_seq, path_name, _, _, _, _, _, _, _, _, _) in enumerate(train_loader):
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 576, in next
idx, batch = self._get_batch()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 553, in _get_batch
success, data = self._try_get_batch()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/usr/lib/python3.5/multiprocessing/queues.py", line 104, in get
if timeout < 0 or not self._poll(timeout):
File "/usr/lib/python3.5/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.5/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 212) is killed by signal: Bus error.

Looking forward to your reply.
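
A note on the "Training in device: cpu" line in the log above: the trainer falls back to the CPU when CUDA is not visible from inside the container, which typically means the container was not started with GPU access (e.g. via nvidia-docker or docker run --gpus all). A minimal check, assuming only a standard PyTorch install:

import torch

# If this prints False, PyTorch cannot see a GPU inside the container and
# training falls back to CPU, matching the "Training in device: cpu" log line.
print("CUDA available:", torch.cuda.is_available())

# When CUDA is visible, list the devices the container can actually use.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))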

tano297 (Member) commented Jan 8, 2020

At first sight it looks like you are running out of memory; can you try with workers=0?
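
For context, a minimal sketch of what workers=0 changes on the PyTorch side (the dataset below is a toy stand-in, not the repo's actual parser): with num_workers=0 the DataLoader loads batches in the main process instead of in worker subprocesses that pass tensors through shared memory (/dev/shm), which is what the bus error points at. Docker's default /dev/shm is only 64 MB, so an alternative to workers=0 is to start the container with a larger --shm-size or with --ipc=host.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the projected range-image scans (hypothetical shapes).
dataset = TensorDataset(torch.randn(32, 5, 64, 512),
                        torch.zeros(32, dtype=torch.long))

# num_workers=0 keeps data loading in the main process: no worker
# subprocesses, no shared-memory transfer, so the shm bus error cannot occur.
train_loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)

for in_vol, labels in train_loader:
    print(in_vol.shape, labels.shape)
    break

In this repo the worker count is presumably read from the train section of the arch yaml (config/arch/squeezeseg.yaml), so setting workers: 0 there should have the same effect.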

Xl-wj (Author) commented Jan 9, 2020

Thanks, it works! @tano297


tano297 closed this as completed Jan 9, 2020