Resizing 448 darknet: ./src/network.c:392: resize_network: Assertion `0' failed. #426

Open
AhaEdgar opened this issue Jan 21, 2018 · 12 comments

@AhaEdgar

Can anyone help me solve this issue?
I changed yolo.2.0cfg and added one CNN layer.

@Li-Lai

Li-Lai commented Jan 24, 2018

The way you ask for help is funny. You didn't post your changes, so only a ghost knows what you changed.

@christopher5106

I have the same issue on an NVIDIA V100 (I chose -gencode arch=compute_70,code=sm_70), while everything works well on an NVIDIA 1080 Ti:

Region Avg IOU: 0.141527, Class: 0.026888, Obj: 0.652683, No Obj: 0.566986, Avg Recall: 0.000000, count: 4
10: 363.796783, 444.599030 avg, 0.000000 rate, 0.045145 seconds, 10 images
Resizing
544
darknet: ./src/network.c:392: resize_network: Assertion `0' failed.
Aborted (core dumped)

@AlexeyAB
Collaborator

AlexeyAB commented Mar 8, 2018

@Ahagpp @christopher5106 You can try this fork, in which I fixed excessive memory allocation for several (unfortunate) network sizes: https://github.com/AlexeyAB/darknet

Also, if you use a V100 GPU, you can use Tensor Cores for mixed-precision calculations (mixed precision is now supported for both a single GPU and multi-GPU). How to use it: AlexeyAB#407

@christopher5106

christopher5106 commented Mar 8, 2018

Sounds good, it works well on a DGX-Station with V100. On Power9 with V100, I have a problem when using CUDNN=1 with cuDNN 7.0:

27 reorg / 2 26 x 26 x 64 -> 13 x 13 x 256
28 route 27 24
29 conv 1024 3 x 3 / 1 13 x 13 x1280 -> 13 x 13 x1024
30 conv 125 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 125
31 detection
Loading weights from darknet19_448.conv.23...
seen 32

after which it freezes. Without cuDNN it works well, but then I cannot benefit from half precision.

@AlexeyAB
Collaborator

AlexeyAB commented Mar 8, 2018

@christopher5106 To localize the problem, there are a few questions:

  • Does it freeze only during training, or during detection too?
  • Does it work with GPU=1 CUDNN=0 in the Makefile?
  • Does it work with GPU=0 CUDNN=0?
  • Do you use OpenCV?
  • Did you try mixed precision (-DCUDNN_HALF in the Makefile) to train on the V100? (It now supports multi-GPU for DGX.)
  • Do you use little-endian 64-bit Linux?

@christopher5106

It seems there was a performance issue; we did a complete reinstall and the problem seems to have disappeared. Thanks a lot, I'll tell you more about this next week.

@christopher5106

On some runs, I get:

Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 6
78444: -nan, -nan avg, 0.000010 rate, 0.180000 seconds, 78444 images

Is that normal? When I rerun it, it is OK.

@AlexeyAB
Collaborator

AlexeyAB commented Mar 12, 2018

@christopher5106

Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 6

If lines like this occur only occasionally, that is normal.
If at some point all of these lines contain nan, then the training went wrong.


78444: -nan, -nan avg, 0.000010 rate, 0.180000 seconds, 78444 images

A line like this always means the training went wrong.
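
The distinction between the two kinds of log line can be sketched as a small check. This is only an illustration, not code from darknet: the function names are hypothetical, and it assumes that an iteration summary line always starts with the iteration number followed by a colon, as in the logs quoted in this thread.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `line` looks like an iteration summary,
 * i.e. it starts with one or more digits followed by ':' as in
 * "78444: -nan, -nan avg, ...". "Region Avg IOU: ..." lines return 0. */
int is_iteration_summary(const char *line)
{
    const char *p = line;
    if (!isdigit((unsigned char)*p)) return 0;
    while (isdigit((unsigned char)*p)) p++;
    return *p == ':';
}

/* Hypothetical helper: returns 1 only when an iteration summary line
 * contains "nan" (which also matches "-nan"). A nan on a Region line
 * alone is not flagged, since that can happen occasionally. */
int summary_has_nan(const char *line)
{
    return is_iteration_summary(line) && strstr(line, "nan") != NULL;
}
```

Feeding each line of the training log to summary_has_nan would flag the run as soon as the loss itself becomes nan, while ignoring isolated nan values on Region lines.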

@christopher5106

The training did go wrong indeed... is that normal?

@AlexeyAB
Collaborator

@christopher5106 No, this is not normal. Something is wrong in the dataset, the model, or the source code.

@nurCoban

@Ahagpp I had the same problem. Increase subdivisions in the cfg file; that solved it for me.
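
A rough sketch of why this can help, assuming the number of images processed per forward/backward pass is batch divided by subdivisions, so that a larger subdivisions value means fewer images in GPU memory at once. The function name and the numbers in the usage note are illustrative only.

```c
/* Hypothetical helper: images processed per pass for the given cfg values.
 * Assumption: the per-pass batch is batch / subdivisions (integer division),
 * and a subdivisions value below 1 is treated as 1. */
int mini_batch_size(int batch, int subdivisions)
{
    if (subdivisions < 1) subdivisions = 1;
    return batch / subdivisions;
}
```

For example, with batch=64, going from subdivisions=8 to subdivisions=16 would halve the per-pass batch from 8 images to 4.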

@interface-bin

I had the same problem and solved it in two steps:

  1. Edit the Makefile and rebuild the project:
    ARCH= -gencode arch=compute_30,code=sm_30 \
          -gencode arch=compute_35,code=sm_35 \
          -gencode arch=compute_50,code=[sm_50,compute_50] \
          -gencode arch=compute_52,code=[sm_52,compute_52] \
          -gencode arch=compute_60,code=sm_60 \
          -gencode arch=compute_61,code=sm_61
    because my GPU is a GTX 1080, whose compute capability is 6.1.
  2. Edit src/network.c and comment out this statement:
    if(l.workspace_size > 2000000000) assert(0);

After these two steps, the problem was solved.
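
Commenting out the check also removes the only signal that a layer requested an unusually large workspace. A gentler variant, sketched below, reports which layer made the request and how large it was. This is not darknet code: workspace_fits and WORKSPACE_LIMIT are hypothetical names, and only the 2000000000 threshold comes from the statement quoted above.

```c
#include <stddef.h>
#include <stdio.h>

/* Threshold taken from the check quoted in this thread. */
#define WORKSPACE_LIMIT ((size_t)2000000000)

/* Hypothetical helper, not part of darknet: returns 1 when the requested
 * workspace is within the limit; otherwise prints which layer asked for
 * how many bytes and returns 0 so the caller can decide what to do. */
int workspace_fits(int layer_index, size_t workspace_size)
{
    if (workspace_size > WORKSPACE_LIMIT) {
        fprintf(stderr,
                "layer %d requests %zu bytes of workspace (limit %zu)\n",
                layer_index, workspace_size, WORKSPACE_LIMIT);
        return 0;
    }
    return 1;
}
```

Inside a loop over layers, a caller could test workspace_fits(i, l.workspace_size), where i is the loop's layer index, and then abort or skip that resize step instead of failing on a bare assert(0).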


6 participants