problem #14

Closed
xiexu0210 opened this issue Mar 29, 2021 · 13 comments

Comments

@xiexu0210

Hi, I get the following error when running your model:

Traceback (most recent call last):
  File "/DATA/xiexu/yolo/cvpods/tools/train_net.py", line 109, in <module>
    args=(args,),
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/engine/launch.py", line 56, in launch
    main_func(*args)
  File "/DATA/xiexu/yolo/cvpods/tools/train_net.py", line 74, in main
    runner = runner_decrator(RUNNERS.get(cfg.TRAINER.NAME))(cfg, build_model)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/engine/runner.py", line 86, in __init__
    self.data_loader = self.build_train_loader(cfg)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/engine/runner.py", line 307, in build_train_loader
    return build_train_loader(cfg)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/data/build.py", line 130, in build_train_loader
    transform_gens = build_transform_gens(cfg.INPUT.AUG.TRAIN_PIPELINES)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/data/build.py", line 69, in build_transform_gens
    return build(pipelines)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/data/build.py", line 58, in build
    tfm = TRANSFORMS.get(aug)(**args)
  File "/home/xiexu/.local/lib/python3.7/site-packages/cvpods/utils/registry.py", line 66, in get
    "No object named '{}' found in '{}' registry!".format(name, self._name)
KeyError: "No object named 'RandomShift' found in 'transforms' registry!"
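
For readers unfamiliar with this kind of error: the last frames of the traceback show a transform being looked up by name in a registry. The following is a minimal sketch of that pattern, not the actual cvpods implementation. The `Registry` class, its `register` method, and the `RandomFlip` stand-in are assumptions made for illustration; only the error message format is taken from the traceback above.

```python
from typing import Dict


class Registry:
    """Hypothetical name -> class registry (illustration only)."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._obj_map: Dict[str, type] = {}

    def register(self, cls: type) -> type:
        # Store the class under its own name so a config can refer to it by string.
        self._obj_map[cls.__name__] = cls
        return cls

    def get(self, name: str) -> type:
        # Message format copied from the traceback above.
        if name not in self._obj_map:
            raise KeyError(
                "No object named '{}' found in '{}' registry!".format(name, self._name)
            )
        return self._obj_map[name]


TRANSFORMS = Registry("transforms")


@TRANSFORMS.register
class RandomFlip:  # stand-in transform that this sketch does register
    pass


try:
    TRANSFORMS.get("RandomShift")  # never registered in this sketch
except KeyError as err:
    print("lookup failed:", err)
```

In a design like this, a config that names a transform the imported package never registered fails at lookup time, even when the config itself is correct.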

@chensnathan
Collaborator

Hi,

You can refer to #12 to re-install cvpods.

BTW, we are working on a neat implementation in PR #13. It will be merged when it is ready.

@xiexu0210
Author

Thanks for your reply, but I have run into another problem. Training now fails with the following error:

ERROR [03/30 20:32:14 c2.engine.base_runner]: Exception during training:
Traceback (most recent call last):
  File "/DATA/xiexu/yf/YOLOF/cvpods/engine/base_runner.py", line 84, in train
    self.run_step()
  File "/DATA/xiexu/yf/YOLOF/cvpods/engine/base_runner.py", line 185, in run_step
    loss_dict = self.model(data)
  File "/home/xiexu/anaconda3/envs/yfb/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xiexu/anaconda3/envs/yfb/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xiexu/anaconda3/envs/yfb/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../yolof_base/yolof.py", line 131, in forward
    anchors, pred_anchor_deltas, gt_instances)
  File "/home/xiexu/anaconda3/envs/yfb/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "../yolof_base/yolof.py", line 245, in get_ground_truth
    box_pred = self.box2box_transform.apply_deltas(box_delta, all_anchors)
  File "/DATA/xiexu/yf/YOLOF/cvpods/modeling/box_regression.py", line 93, in apply_deltas
    deltas).all().item(), "Box regression deltas become infinite or NaN!"
AssertionError: Box regression deltas become infinite or NaN!
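
The assertion that fires here is a guard that every predicted box delta is a finite number. As a rough illustration of what it checks, here is a pure-Python stand-in. The real code operates on tensors inside `apply_deltas`; `non_finite_indices` and `check_deltas` are hypothetical names, and reporting the offending indices is an addition that may help when debugging.

```python
import math
from typing import List, Sequence


def non_finite_indices(values: Sequence[float]) -> List[int]:
    """Return the positions of NaN or +/-inf entries in a flat sequence."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def check_deltas(deltas: Sequence[float]) -> None:
    # Same intent as the failing assertion: every delta must be finite.
    bad = non_finite_indices(deltas)
    assert not bad, "Box regression deltas become infinite or NaN! (indices {})".format(bad)


# A sequence with one NaN and one infinity is reported by position.
print(non_finite_indices([0.1, float("nan"), 2.0, float("inf")]))
```

A check like this only reports the symptom. It fires once the network has already produced non-finite outputs, so the cause has to be looked for earlier in training.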

@walynlee

Hello, when I run 'pods_train --num-gpus 1' I get the same problem: KeyError: "No object named 'RandomShift' found in 'transforms' registry!". I also followed #12, and the output ends with:

.....Requirement already satisfied: certifi>=2020.06.20 in ./.local/lib/python3.6/site-packages (from matplotlib>=3.1.1->lvis) (2020.6.20)
Requirement already satisfied: pillow>=6.2.0 in ./.local/lib/python3.6/site-packages (from matplotlib>=3.1.1->lvis) (7.2.0)
Installing collected packages: lvis
Successfully installed lvis-0.5.3

Everything installs successfully, but it still reports the same error (No object named 'RandomShift' found in 'transforms' registry!), so the problem is not solved.
To the poster above: how did you solve this problem? I hope someone can tell me. Thanks so much.

@xiexu0210
Author

Hi @walynlee, I think you need to re-install the environment. The steps are as follows, using pytorch=1.6 and python==3.7:
git clone https://github.com/thomasbrandon/mish-cuda
cd mish-cuda
python setup.py build install
cd ..
git clone git@github.com:megvii-model/YOLOF.git
cd YOLOF/
python setup.py develop
cd ./playground/detection/coco/yolof/yolof.res50.C5.1x
pods_train --num-gpus 2

@chensnathan
Collaborator

@xiexu0210 Hi, have you modified any code in the repo? And does the bug occur every time you run YOLOF?

@chensnathan
Collaborator

@walynlee Hi, maybe you should uninstall the previous cvpods first, then re-install YOLOF locally by following the steps.
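
One way to check whether an older copy is still being picked up is to ask Python where it would import the package from. In the first traceback in this thread, cvpods is loaded from `~/.local/lib/python3.7/site-packages`, while in the later one it is loaded from the YOLOF checkout. The snippet below is a minimal sketch using the standard library; `package_location` is a hypothetical helper name, not part of cvpods.

```python
import importlib.util
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the file a top-level package would be imported from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Prints None if cvpods is not installed in the current environment.
print(package_location("cvpods"))
```

If the printed path points into site-packages rather than into the local YOLOF directory, the old install is probably still shadowing the local one.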

@walynlee

walynlee commented Mar 31, 2021

@chensnathan Hello, I haven't modified any code yet, and it already reports errors. This is my environment. Should I uninstall PyTorch 1.7, install PyTorch 1.6, and update my Python version?

Environment info:


sys.platform            linux
Python                  3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
numpy                   1.19.3
cvpods                  0.1 @/home/a303/cvpods/cvpods
cvpods compiler         GCC 7.5
cvpods CUDA compiler    10.0
cvpods arch flags       /home/a303/cvpods/cvpods/_C.cpython-36m-x86_64-linux-gnu.so; cannot find cuobjdump
cvpods_ENV_MODULE
PyTorch                 1.7.0 @/home/a303/.local/lib/python3.6/site-packages/torch
PyTorch debug build     True
CUDA available          True
GPU 0                   GeForce RTX 2080 Ti
CUDA_HOME               :/usr/local/cuda-10.0
Pillow                  7.2.0
torchvision             0.8.1 @/home/a303/.local/lib/python3.6/site-packages/torchvision
torchvision arch flags  /home/a303/.local/lib/python3.6/site-packages/torchvision/_C.so; cannot find cuobjdump
cv2                     4.4.0


PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  • CuDNN 7.6.5
  • Magma 2.5.2

@chensnathan
Collaborator

Could you post your training log?

@xiexu0210
Author

Hi, the error was reported only once.
I ran your code with the number of GPUs changed to 3, and the result was very bad.
image

I have another question: how can I debug your code? So far I have only run it from the command line.

@chensnathan
Collaborator

The model diverged during your training. When you use fewer GPUs, you should warm up for more iterations.
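
To make the warm-up advice concrete, here is a minimal sketch of a linear learning-rate warm-up. It is a common scheme and is not confirmed to match the cvpods scheduler; the function name, the `start_factor` default, and the iteration counts are all assumptions chosen for illustration.

```python
def warmup_factor(iteration: int, warmup_iters: int, start_factor: float = 0.001) -> float:
    """Linear warm-up multiplier for the base learning rate.

    Rises from start_factor at iteration 0 to 1.0 at warmup_iters, then stays at 1.0.
    """
    if warmup_iters <= 0 or iteration >= warmup_iters:
        return 1.0
    alpha = iteration / warmup_iters
    return start_factor * (1.0 - alpha) + alpha


# Compare a shorter and a longer warm-up at the same iteration (values are illustrative).
for iters in (1500, 3000):
    print(iters, round(warmup_factor(500, iters), 4))
```

With the longer warm-up, the learning rate at any early iteration is smaller, which gives the model more time to settle before the full learning rate is applied.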

@qijindao

Hi, I hit the same problem after my model had been training for a short time: 'Box regression deltas become infinite or NaN!' suddenly occurs. How did you solve the problem?

@tangjiuqi097

Hi, if you use PyCharm to debug, then in Run/Debug Configurations you can:

set the working directory to the code path you want to run, e.g. YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x

set the script path to YOLOF/tools/train_net.py.
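
The same two settings can be reproduced outside PyCharm with a small launcher, which may be convenient for other debuggers. This is a sketch under assumptions: `run_script_in_dir` is a hypothetical helper, and passing `--num-gpus` to `train_net.py` directly is assumed from the `pods_train --num-gpus` usage earlier in this thread.

```python
import os
import runpy
import sys
from typing import Dict, List


def run_script_in_dir(script_path: str, workdir: str, argv: List[str]) -> Dict[str, object]:
    """Run a script as __main__ with the given working directory and arguments."""
    script_path = os.path.abspath(script_path)  # resolve before changing directory
    old_cwd, old_argv = os.getcwd(), sys.argv
    os.chdir(workdir)
    sys.argv = [script_path] + list(argv)
    try:
        # Returns the script's global namespace once it finishes.
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        # Restore the caller's state even if the script raises.
        os.chdir(old_cwd)
        sys.argv = old_argv


# Example call (paths from the comment above; the flag is an assumption):
# run_script_in_dir(
#     "YOLOF/tools/train_net.py",
#     "YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x",
#     ["--num-gpus", "1"],
# )
```

Running this launcher under a debugger gives the script the same working directory it would have when started from the experiment folder.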

@shenhaibb

image

I set it up like this. Why does the problem still occur?
