
Fail to fine-tune FlowNet2 #152

Closed
E1EV1 opened this issue Jul 9, 2018 · 12 comments

E1EV1 commented Jul 9, 2018

Hi,

I'm trying to fine-tune FlowNet2 with my own dataset. I converted my dataset to LMDB and modified FlowNet2_train.prototxt to fit my problem.

When I start training, I get a "Segmentation Fault" during CustomDataLayerPrefetch and I don't know where the error comes from.

Any suggestions?

[screenshot: capture_erreur_actuelle_sigsegv]
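
For reference, here is a minimal sketch of how one might sanity-check the LMDB entries, assuming the standard Caffe Datum encoding (the FlowNet2 custom data layer may expect a different record format, and the path is a placeholder):

```python
# Sketch: inspect a Caffe-style LMDB and report the shapes of its entries.
# Assumes the standard caffe.proto Datum encoding; FlowNet2's custom data layer
# may use a different record format. "my_dataset_lmdb" is a placeholder path.
import lmdb
from caffe.proto import caffe_pb2

env = lmdb.open("my_dataset_lmdb", readonly=True, lock=False)
shapes = set()
with env.begin() as txn:
    for key, value in txn.cursor():
        datum = caffe_pb2.Datum()
        datum.ParseFromString(value)
        shapes.add((datum.channels, datum.height, datum.width))
env.close()

print("distinct (channels, height, width):", shapes)
if len(shapes) > 1:
    print("WARNING: entries do not all have the same shape")
```

If the entries report different shapes, that would be the first thing to fix before suspecting the build itself.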


E1EV1 commented Jul 10, 2018

I just found a lead in the Linux terminal: there is a layer that is not created.
That could explain the segmentation fault; I think the layer pointer points to an undefined layer.
I'll keep you informed.

[screenshot: first abort]


E1EV1 commented Jul 10, 2018

Unfortunately, fixing the creation of the layer_flow_gt_aug_FlowAugmentation1_0_split layer didn't change anything.
Below is a gdb screenshot showing that Thread 6 received signal SIGSEGV.
[screenshot: erreur debug]

nikolausmayer (Contributor) commented

Hi,
is it possible that there is a difference between the libraries used at compile time and the ones used at runtime? For example, a "popular" error is that people have multiple Caffe installations which interfere with each other.
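
One rough way to check this, as a sketch (the binary path assumes the standard Makefile build layout and may need adjusting):

```python
# Sketch: list the caffe/cuda/cudnn shared libraries the binary resolves at runtime,
# to spot a second Caffe installation shadowing the one it was compiled against.
# The binary path assumes the standard Makefile build layout; adjust as needed.
import subprocess

binary = ".build_release/tools/caffe.bin"
output = subprocess.check_output(["ldd", binary], universal_newlines=True)
for line in output.splitlines():
    if any(name in line for name in ("caffe", "cudnn", "cuda")):
        print(line.strip())
```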


E1EV1 commented Jul 10, 2018

Thank you for your reply; normally there is no risk of that kind.
For the time being, I've installed only FlowNet2 with your Caffe version on this computer, precisely to avoid interference.


E1EV1 commented Jul 12, 2018

For two days I've tried a lot of the things one can read on forums: I recompiled all of FlowNet2, modified my .bashrc, and modified the Makefile.config, but without success.
I just found something else to try: the Nvidia documentation says that CUDA 8 doesn't work correctly with gcc versions newer than 5.3.1. My gcc is 5.4, so I will downgrade it for testing.

If anybody has an idea, I'm more than interested.


E1EV1 commented Jul 12, 2018

Changing the gcc version was useless: since CUDA 8.0.61, gcc 5.4 is allowed.
I'm now debugging the thread directly; I've put the debug output below in case anyone finds an explanation.

```
Thread 6 "caffe" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffbef4f700 (LWP 18797)]
0x00007ffff747fe60 in void* caffe::CustomDataLayerPrefetch(void*) ()
from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
(gdb) thread apply all bt

Thread 6 (Thread 0x7fffbef4f700 (LWP 18797)):
#0 0x00007ffff747fe60 in void* caffe::CustomDataLayerPrefetch(void*) ()
from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#1 0x00007fffe08286ba in start_thread (arg=0x7fffbef4f700)
at pthread_create.c:333
#2 0x00007ffff579d41d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7fffc5003700 (LWP 18795)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1 0x00007fffc6437a57 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffc63f02c7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fffc6436e80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fffe08286ba in start_thread (arg=0x7fffc5003700)
at pthread_create.c:333
#5 0x00007ffff579d41d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7fffc5804700 (LWP 18794)):
#0 0x00007ffff579174d in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007fffc643548b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffc649a78f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fffc6436e80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fffe08286ba in start_thread (arg=0x7fffc5804700)
at pthread_create.c:333
#5 0x00007ffff579d41d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7fffc6005700 (LWP 18793)):
#0 0x00007ffff579e8c8 in accept4 (fd=17, addr=..., addr_len=0x7fffc6004a68,
flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:40
#1 0x00007fffc6436216 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffc642a80d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fffc6436e80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fffe08286ba in start_thread (arg=0x7fffc6005700)
at pthread_create.c:333
#5 0x00007ffff579d41d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7fffc75f2700 (LWP 18791)):
#0 0x00007ffff579174d in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007fffd401f64c in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#2 0x00007fffe08286ba in start_thread (arg=0x7fffc75f2700)
at pthread_create.c:333
#3 0x00007ffff579d41d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7ffff7f6db00 (LWP 18787)):
#0 0x00007fffc600c454 in fatBinaryCtl ()
from /usr/lib/nvidia-390/libnvidia-fatbinaryloader.so.390.25
#1 0x00007fffc6418fb0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffc6419bf3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fffc6369de5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fffc636a0f0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007fffe1e7ddcd in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
#6 0x00007fffe1e737f0 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
#7 0x00007fffe1e80f31 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
#8 0x00007fffe1e84621 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
#9 0x00007fffe1e781bc in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
#10 0x00007fffe1e5fff2 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
#11 0x00007fffe1e9a15f in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
#12 0x00007fffe0fd6e21 in cudnnCreate ()
from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
#13 0x00007ffff749ec8f in caffe::CuDNNConvolutionLayer::LayerSetUp(std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&, std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&) ()
from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#14 0x00007ffff73b4065 in caffe::Net::Init(caffe::NetParameter const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#15 0x00007ffff73b5891 in caffe::Net::Net(caffe::NetParameter const&, caffe::Net const*) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#16 0x00007ffff73865ca in caffe::Solver::InitTrainNet() () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#17 0x00007ffff7387907 in caffe::Solver::Init(caffe::SolverParameter const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#18 0x00007ffff7387caa in caffe::Solver::Solver(caffe::SolverParameter const&, caffe::Solver const*) ()
from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#19 0x00007ffff7594c43 in caffe::Solver* caffe::Creator_AdamSolver(caffe::SolverParameter const&) ()
from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#20 0x000000000040a6e8 in train() ()
#21 0x00000000004075a8 in main ()
```

nikolausmayer (Contributor) commented

Your backtrace indicates that you are using CuDNN version 7. We've only ever used version 5. I know that it's relatively easy to make the code compatible with version 6, but I never tried 7.


E1EV1 commented Jul 13, 2018

Thank you for your reply, I will try downgrading my cuDNN version.
Judging by https://github.com/lmb-freiburg/flownet2/issues/92 it should not change much, but you never know.

It's very strange: I can build and run FlowNet2 without problems, but I can't train or fine-tune.

nikolausmayer (Contributor) commented

Hm, that's strange, but it really might be a problem with CuDNN. It might be worth asking the people in #92 whether they actually trained, or just tested 😉


E1EV1 commented Jul 13, 2018

Thank you for all your help @nikolausmayer,
Yes, that's why I downgraded my cuDNN version, but unfortunately I still get the same issue :(
I've put the error message below.

My setup: Ubuntu 16.04, 980 Ti, CUDA 8.0.61, cuDNN 5.1, gcc 5.4, Python 3.5.
If anyone has a suggestion I'm really interested :)

```
Thread 6 "caffe" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc9f02700 (LWP 10153)]
0x00007ffff7481460 in void* caffe::CustomDataLayerPrefetch(void*) ()
from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
(gdb) thread apply all bt

Thread 6 (Thread 0x7fffc9f02700 (LWP 10153)):
#0 0x00007ffff7481460 in void* caffe::CustomDataLayerPrefetch(void*) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#1 0x00007fffe776c6ba in start_thread (arg=0x7fffc9f02700) at pthread_create.c:333
#2 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7fffcbf47700 (LWP 10152)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1 0x00007fffcd37ba57 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffcd3342c7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fffcd37ae80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fffe776c6ba in start_thread (arg=0x7fffcbf47700) at pthread_create.c:333
#5 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7fffcc748700 (LWP 10151)):
#0 0x00007ffff579374d in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007fffcd37948b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffcd3de78f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fffcd37ae80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fffe776c6ba in start_thread (arg=0x7fffcc748700) at pthread_create.c:333
#5 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7fffccf49700 (LWP 10150)):
#0 0x00007ffff57a08c8 in accept4 (fd=17, addr=..., addr_len=0x7fffccf48a68, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:40
#1 0x00007fffcd37a216 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffcd36e80d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fffcd37ae80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fffe776c6ba in start_thread (arg=0x7fffccf49700) at pthread_create.c:333
#5 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7fffce536700 (LWP 10148)):
#0 0x00007ffff579374d in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007fffdaf6364c in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#2 0x00007fffe776c6ba in start_thread (arg=0x7fffce536700) at pthread_create.c:333
#3 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7ffff7f6db00 (LWP 10144)):
#0 0x00007fffccf50454 in fatBinaryCtl () from /usr/lib/nvidia-390/libnvidia-fatbinaryloader.so.390.25
#1 0x00007fffcd35cfb0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffcd35dbf3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fffcd2adde5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fffcd2ae0f0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007fffe842068d in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5
#6 0x00007fffe84160b0 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5
#7 0x00007fffe8423906 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5
#8 0x00007fffe8426f11 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5
#9 0x00007fffe841aa7c in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5
#10 0x00007fffe84072d2 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5
#11 0x00007fffe843ca4f in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5
#12 0x00007fffe7eea714 in cudnnCreate () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5
#13 0x00007ffff749edb1 in caffe::CuDNNConvolutionLayer::LayerSetUp(std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&, std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#14 0x00007ffff73b5e25 in caffe::Net::Init(caffe::NetParameter const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#15 0x00007ffff73b7651 in caffe::Net::Net(caffe::NetParameter const&, caffe::Net const*) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#16 0x00007ffff738838a in caffe::Solver::InitTrainNet() () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#17 0x00007ffff73896c7 in caffe::Solver::Init(caffe::SolverParameter const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#18 0x00007ffff7389a6a in caffe::Solver::Solver(caffe::SolverParameter const&, caffe::Solver const*) ()
from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#19 0x00007ffff7595983 in caffe::Solver* caffe::Creator_AdamSolver(caffe::SolverParameter const&) ()
from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3
#20 0x000000000040a6e8 in train() ()
```

nikolausmayer self-assigned this Jul 16, 2018

E1EV1 commented Jul 18, 2018

OK, I found why I had the error!
Some pictures in my dataset did not have the same size as the others.
Now all the images have the same size and I can fine-tune without a SIGSEGV.

Thank you @nikolausmayer for your help
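
For anyone who hits the same thing, here is a small sketch of the kind of check that would catch this before converting the dataset; the directory and glob pattern are placeholders:

```python
# Sketch: verify that every image in a folder has the same dimensions before
# building the training LMDB. The directory and glob pattern are placeholders.
import glob
from PIL import Image

sizes = {}
for path in glob.glob("my_dataset/*.png"):
    with Image.open(path) as img:
        sizes.setdefault(img.size, []).append(path)

if len(sizes) > 1:
    for size, paths in sizes.items():
        print(size, "->", len(paths), "images, e.g.", paths[0])
    raise SystemExit("Inconsistent image sizes; resize before converting to LMDB.")
if sizes:
    print("All images share the size", next(iter(sizes)))
```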

E1EV1 closed this as completed Jul 18, 2018
nikolausmayer (Contributor) commented

Nice job. I guess it would be good if the converters or data layers checked for this... 🙂
