AssertionError in training #30

Open
SuperTyrael opened this issue Jul 22, 2020 · 8 comments

Comments

@SuperTyrael

I hit this AssertionError while training the model.
Can you guys help me?

Traceback (most recent call last):
  File "/anaconda3/envs/fasterRCNN/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/CrowdDet/tools/train.py", line 109, in train_worker
    do_train_epoch(net, data_iter, optimizer, rank, epoch_id, train_config)
  File "/CrowdDet/tools/train.py", line 58, in do_train_epoch
    assert torch.isfinite(total_loss).all(), outputs
AssertionError: {'loss_rpn_cls': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_loc': tensor(inf, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rcnn_loc': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rcnn_cls': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>)}
@xg-chu
Owner

xg-chu commented Jul 23, 2020

Try running it several times; this error is sometimes raised at the beginning of training.

@SuperTyrael
Author

> Try several times, sometimes this error raises at the beginning of training.

Sorry, but we have tried several times and get the same error at almost the same iteration in the first epoch.

@xg-chu
Owner

xg-chu commented Jul 31, 2020

Have you modified the code or data? This kind of error rarely occurs.
Try changing the dataset initialization sequence or decreasing the learning rate.
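
Since the failure reportedly recurs at almost the same iteration, it may help to record which loss terms are non-finite before the assertion fires. The following is a minimal sketch, not part of CrowdDet: `find_nonfinite` is a hypothetical helper, and it assumes each loss has already been converted to a plain float (for example with `.item()` on a scalar tensor).

```python
import math


def find_nonfinite(losses):
    """Return the sorted names of loss terms whose value is NaN or infinite."""
    return sorted(name for name, value in losses.items()
                  if not math.isfinite(value))


# Values mirroring the failure reported above, converted to plain floats.
losses = {
    "loss_rpn_cls": float("nan"),
    "loss_rpn_loc": float("inf"),
    "loss_rcnn_loc": float("nan"),
    "loss_rcnn_cls": float("nan"),
}
bad = find_nonfinite(losses)
print("non-finite loss terms:", bad)
```

Logging this together with the iteration index and the image IDs of the batch would show whether one specific sample is responsible.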

@chjXu

chjXu commented Oct 21, 2020

I have another AssertionError in training.
Can you help me?

Num of GPUs:3, learning rate:0.00750, mini batch size:2,
train_epoch:30, iter_per_epoch:2500, decay_epoch:[24, 27]
Init multi-processing training...
Traceback (most recent call last):
File "/media/xcj/data/xcj/CrowdDet/tools/train.py", line 174, in
run_train()
File "/media/xcj/data/xcj/CrowdDet/tools/train.py", line 171, in run_train
multi_train(args, config, Network)
File "/media/xcj/data/xcj/CrowdDet/tools/train.py", line 155, in multi_train
torch.multiprocessing.spawn(train_worker, nprocs=num_gpus, args=(train_config, network, config))
File "/home/xcj/anaconda3/envs/py_cpn/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/xcj/anaconda3/envs/py_cpn/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/xcj/anaconda3/envs/py_cpn/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/media/xcj/data/xcj/CrowdDet/tools/train.py", line 102, in train_worker
crowdhuman = CrowdHuman(config, if_train=True)
File "../lib/data/CrowdHuman.py", line 20, in __init__
self.records = misc_utils.load_json_lines(source)
File "../lib/utils/misc_utils.py", line 11, in load_json_lines
assert os.path.exists(fpath)
AssertionError

@xg-chu
Owner

xg-chu commented Oct 28, 2020

It looks like the annotation file path is wrong. Check "train_source" and "eval_source" in config.py.
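
A bare `assert os.path.exists(fpath)` does not say which path failed, so a small check that prints the absolute path can make the problem easier to spot. This is a minimal sketch under assumptions: `missing_sources` is a hypothetical helper, the config is assumed to expose `train_source` and `eval_source` as attributes, and the file names in the demo are made up. Relative paths resolve against the current working directory, which is why the absolute form is reported.

```python
import os
import tempfile
from types import SimpleNamespace


def missing_sources(config, names=("train_source", "eval_source")):
    """Return (name, absolute_path) pairs for configured files that do not exist."""
    missing = []
    for name in names:
        path = getattr(config, name)
        if not os.path.exists(path):
            missing.append((name, os.path.abspath(path)))
    return missing


# Demo with a stand-in config: one file exists, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "eval.json")
    with open(existing, "w"):
        pass
    demo_config = SimpleNamespace(
        train_source=os.path.join(tmp, "missing_train.json"),
        eval_source=existing,
    )
    result = missing_sources(demo_config)

for name, path in result:
    print(f"{name} does not exist: {path}")
```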

@yaru-w

yaru-w commented Nov 11, 2020

Hello, has your problem been solved? I seem to have run into a similar problem with load_json_lines: the path to the JSON file cannot be found.

@xg-chu
Owner

xg-chu commented Nov 11, 2020

Just check whether the file actually exists at that path.

@yaru-w

yaru-w commented Nov 12, 2020

OK. I was running this command: python3 eval_json.py -f your_json_path.json
and got the following error:
Traceback (most recent call last):
File "eval_json.py", line 36, in
run_eval()
File "eval_json.py", line 33, in run_eval
eval_all(args)
File "eval_json.py", line 19, in eval_all
res_line, JI = compute_JI.evaluation_all(args.json_file, 'box')
File "../lib/evaluate/compute_JI.py", line 18, in evaluation_all
records = misc_utils.load_json_lines(path)
File "../lib/utils/misc_utils.py", line 11, in load_json_lines
assert os.path.exists(fpath)
AssertionError
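Thank you. I don't know what your_json_path.json is, so I can't find where it is located. When I opened result_eval.md I found a line containing your_json_path.json, but I still don't know how to fix this error. Do you have any suggestions?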
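
The traceback above ends in the same `assert os.path.exists(fpath)` check, so the path passed with `-f` did not point to an existing file. Assuming `your_json_path.json` is only a placeholder for a file the user supplies, a command-line script can reject a missing file with a readable message instead of a bare AssertionError. This is a minimal sketch, not the actual eval_json.py: `existing_file` and `build_parser` are hypothetical names, and only the `-f` flag and the `json_file` attribute are taken from the thread.

```python
import argparse
import os
import tempfile


def existing_file(path):
    """argparse type: accept the path only if it names an existing file."""
    if not os.path.isfile(path):
        raise argparse.ArgumentTypeError(
            f"file not found: {os.path.abspath(path)}")
    return path


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of a checked -f argument")
    parser.add_argument("-f", dest="json_file", type=existing_file, required=True)
    return parser


# Demo: a temporary file stands in for a real result file.
with tempfile.TemporaryDirectory() as tmp:
    result_path = os.path.join(tmp, "result.json")
    with open(result_path, "w"):
        pass
    args = build_parser().parse_args(["-f", result_path])
    print("using", args.json_file)
```

With a path that does not exist, argparse prints the "file not found" message with the absolute path and exits with status 2.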
