problem #14

xiexu0210 · 2021-03-29T14:27:19Z

Hi, I ran into the following problem when running your model:

Traceback (most recent call last):
  File "/DATA/xiexu/yolo/cvpods/tools/train_net.py", line 109, in <module>
    args=(args,),
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/engine/launch.py", line 56, in launch
    main_func(*args)
  File "/DATA/xiexu/yolo/cvpods/tools/train_net.py", line 74, in main
    runner = runner_decrator(RUNNERS.get(cfg.TRAINER.NAME))(cfg, build_model)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/engine/runner.py", line 86, in __init__
    self.data_loader = self.build_train_loader(cfg)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/engine/runner.py", line 307, in build_train_loader
    return build_train_loader(cfg)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/data/build.py", line 130, in build_train_loader
    transform_gens = build_transform_gens(cfg.INPUT.AUG.TRAIN_PIPELINES)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/data/build.py", line 69, in build_transform_gens
    return build(pipelines)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/data/build.py", line 58, in build
    tfm = TRANSFORMS.get(aug)(**args)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/utils/registry.py", line 66, in get
    "No object named '{}' found in '{}' registry!".format(name, self._name)
KeyError: "No object named 'RandomShift' found in 'transforms' registry!"

chensnathan · 2021-03-30T02:46:18Z

Hi,

You can refer to #12 to re-install cvpods.

BTW, we are working on a cleaner implementation in PR #13. It will be merged when it is ready.

xiexu0210 · 2021-03-30T12:37:32Z

Thanks for your reply, but I have another problem. An error occurs during training:

ERROR [03/30 20:32:14 c2.engine.base_runner]: Exception during training:
Traceback (most recent call last):
  File "/DATA/xiexu/yf/YOLOF/cvpods/engine/base_runner.py", line 84, in train
    self.run_step()
  File "/DATA/xiexu/yf/YOLOF/cvpods/engine/base_runner.py", line 185, in run_step
    loss_dict = self.model(data)
  File "/home/xiexu/anaconda3/envs/yfb/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xiexu/anaconda3/envs/yfb/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xiexu/anaconda3/envs/yfb/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../yolof_base/yolof.py", line 131, in forward
    anchors, pred_anchor_deltas, gt_instances)
  File "/home/xiexu/anaconda3/envs/yfb/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "../yolof_base/yolof.py", line 245, in get_ground_truth
    box_pred = self.box2box_transform.apply_deltas(box_delta, all_anchors)
  File "/DATA/xiexu/yf/YOLOF/cvpods/modeling/box_regression.py", line 93, in apply_deltas
    deltas).all().item(), "Box regression deltas become infinite or NaN!"
AssertionError: Box regression deltas become infinite or NaN!

walynlee · 2021-03-30T13:25:46Z

Hello, when I run 'pods_train --num-gpus 1' I get the same error: KeyError: "No object named 'RandomShift' found in 'transforms' registry!". I also followed #12, and the output ended with:

.....Requirement already satisfied: certifi>=2020.06.20 in ./.local/lib/python3.6/site-packages (from matplotlib>=3.1.1->lvis) (2020.6.20)
Requirement already satisfied: pillow>=6.2.0 in ./.local/lib/python3.6/site-packages (from matplotlib>=3.1.1->lvis) (7.2.0)
Installing collected packages: lvis
Successfully installed lvis-0.5.3

Everything installed successfully, but the same error (No object named 'RandomShift' found in 'transforms' registry!) is still reported and the problem is not solved.
To the poster above: how did you solve the problem? I hope someone can tell me. Thanks so much.

xiexu0210 · 2021-03-30T13:51:34Z

Hi, I think you need to reinstall the environment. The steps are as follows, with pytorch=1.6 and python=3.7:
git clone https://github.com/thomasbrandon/mish-cuda
cd mish-cuda
python setup.py build install
cd ..
git clone git@github.com:megvii-model/YOLOF.git
cd YOLOF/
python setup.py develop
cd ./playground/detection/coco/yolof/yolof.res50.C5.1x
pods_train --num-gpus 2

chensnathan · 2021-03-30T15:52:48Z

@xiexu0210 Hi, have you modified any code in the repo? And does the bug occur every time you run YOLOF?

chensnathan · 2021-03-30T15:56:17Z

@walynlee Hi, maybe you should uninstall the previous cvpods first, then re-install YOLOF locally following the steps.
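
The tracebacks in this thread show cvpods being loaded from a user site-packages directory in one run and from the YOLOF checkout in another, so it may help to confirm which copy Python actually resolves before and after re-installing. The following is only a minimal sketch using the standard library; the helper name locate_package is made up for illustration and is not part of cvpods or YOLOF.

```python
import importlib.util


def locate_package(name):
    """Return the path Python would import `name` from, or None if not found.

    Hypothetical helper for illustration; it does not import the package.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Packages expose a list of directories; plain modules expose one file.
    if spec.submodule_search_locations:
        return list(spec.submodule_search_locations)[0]
    return spec.origin


if __name__ == "__main__":
    # If this prints a site-packages path instead of the YOLOF checkout,
    # an older cvpods install is probably shadowing the local one.
    print("cvpods resolves to:", locate_package("cvpods"))
```

If the printed path is not inside the YOLOF directory, uninstalling the old cvpods and running the setup step again, as suggested above, is the likely fix.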

walynlee · 2021-03-31T00:03:46Z

@chensnathan Hello, I haven't modified any code yet and it already reports errors. This is my environment. Should I uninstall PyTorch 1.7, install PyTorch 1.6, and update my Python version?

Environment info:

sys.platform linux
Python 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
numpy 1.19.3
cvpods 0.1 @/home/a303/cvpods/cvpods
cvpods compiler GCC 7.5
cvpods CUDA compiler 10.0
cvpods arch flags /home/a303/cvpods/cvpods/_C.cpython-36m-x86_64-linux-gnu.so; cannot find cuobjdump
cvpods_ENV_MODULE
PyTorch 1.7.0 @/home/a303/.local/lib/python3.6/site-packages/torch
PyTorch debug build True
CUDA available True
GPU 0 GeForce RTX 2080 Ti
CUDA_HOME :/usr/local/cuda-10.0
Pillow 7.2.0
torchvision 0.8.1 @/home/a303/.local/lib/python3.6/site-packages/torchvision
torchvision arch flags /home/a303/.local/lib/python3.6/site-packages/torchvision/_C.so; cannot find cuobjdump
cv2 4.4.0

PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 10.2
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
CuDNN 7.6.5
Magma 2.5.2

chensnathan · 2021-03-31T04:45:44Z

Could you post your training log?

xiexu0210 · 2021-03-31T06:54:36Z

Hi, the error was reported only once.
I ran your code with the number of GPUs changed to 3, and the results were very poor.

I have another question: how can I debug your code? So far I have only run it from the command line.

chensnathan · 2021-03-31T07:08:48Z

The model diverged during your training. When you use fewer GPUs, you should warm up for more iterations.
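
One way to turn this advice into numbers is the common linear scaling heuristic: keep the per-GPU batch size fixed, scale the learning rate down with the GPU count, and stretch the warm-up by the same factor. This is only a sketch of that heuristic, not a rule stated by the maintainers; the function name and the example values below are hypothetical and are not taken from the YOLOF configs.

```python
def rescale_schedule(base_lr, base_warmup_iters, ref_gpus, gpus):
    """Rescale learning rate and warm-up length for a different GPU count.

    Assumes the per-GPU batch size stays fixed, so the total batch size
    shrinks in proportion to gpus / ref_gpus (an assumption, not repo fact).
    """
    if ref_gpus <= 0 or gpus <= 0:
        raise ValueError("GPU counts must be positive")
    ratio = gpus / ref_gpus
    lr = base_lr * ratio                      # smaller batch, smaller step
    warmup_iters = int(round(base_warmup_iters / ratio))  # longer warm-up
    return lr, warmup_iters


if __name__ == "__main__":
    # Illustrative values only: a schedule tuned for 8 GPUs, run on 3.
    lr, warmup = rescale_schedule(base_lr=0.1, base_warmup_iters=1000,
                                  ref_gpus=8, gpus=3)
    print("lr:", lr, "warmup iters:", warmup)
```

The resulting values would then be written into the training config by hand; whether they are enough to avoid divergence still has to be checked in the training log.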

qijindao · 2021-03-31T09:18:47Z

Hi, I hit the same problem after training for a short while: 'Box regression deltas become infinite or NaN!' suddenly occurs. How did you solve it?

tangjiuqi097 · 2021-03-31T10:11:38Z

Hi, if you use PyCharm to debug, in Run/Debug Configurations you can

set the working directory to the code path which you want to run, e.g. YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x

set the script path to YOLOF/tools/train_net.py.
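
For anyone not using PyCharm, the same two settings (working directory plus script path) can be reproduced with a small launcher and then run under any Python debugger. This is a rough sketch using only the standard library; the function name run_script_in_dir is hypothetical, and passing --num-gpus to train_net.py is an assumption based on the pods_train commands earlier in this thread.

```python
import os
import runpy
import sys


def run_script_in_dir(script_path, workdir, argv=()):
    """Run `script_path` as __main__ with `workdir` as the current directory.

    Mirrors the "working directory" and "script path" fields of an IDE run
    configuration. Returns the script's globals; restores cwd and sys.argv.
    """
    script_path = os.path.abspath(script_path)  # resolve before changing cwd
    old_cwd, old_argv = os.getcwd(), sys.argv
    os.chdir(workdir)
    sys.argv = [script_path, *argv]
    try:
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        os.chdir(old_cwd)
        sys.argv = old_argv


# Example call with the paths from this thread (not executed here):
# run_script_in_dir(
#     "YOLOF/tools/train_net.py",
#     "YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x",
#     ["--num-gpus", "1"],  # assumed flag, taken from the pods_train usage
# )
```

Saving this as a separate file and starting it with python -m pdb, or adding breakpoint() inside the model code, should then allow stepping through a training run.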

shenhaibb · 2022-04-25T02:18:23Z

I set the working directory and script path in PyCharm as described, but the problem still occurs. Why?

zcl912 mentioned this issue Apr 8, 2021

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [121,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. #20

Open

chensnathan closed this as completed Apr 21, 2021
