Replies: 19 comments
-
Yes, I am working on that. It is extremely experimental, but if you want to try it out, that work is in the `libtorch-extension` branch.
-
That's nice. I am also wondering whether we can re-implement it with CUDA, but the synchronization is a big challenge. Let me keep this thread open and I will give feedback after using the new C++ interface. Thank you for sharing.
-
Does this only work on PyTorch 1.9? I tried torch 1.7 but got some compile errors.
-
Yes, only PyTorch 1.9.
-
Did you benchmark the performance? Is this much faster than the Cython version?
-
Thank you. For various reasons I can't upgrade torch. Can you share the benchmark timing results from your environment?
-
It varies, and everything is still preliminary. The flag …
-
Amazing, I will try that. Thank you.
-
I just tried the branch with PyTorch 1.9, but I found that the decoder speed is still very slow (about 40 ms); the input resolution is 961 x 721. Is this the line (https://github.com/openpifpaf/openpifpaf/blob/libtorch-extension/openpifpaf/decoder/cifcaf.py#L255) you used for profiling?
-
By the way, my CPU is an Intel E5-2620 v4. What kind of CPU did you use to get below 1 ms?
-
The important question is the model input size, not the original image size, or more precisely the tensor shape of the input when you are benchmarking. I tested with …
-
I just reimplemented the C++ part and got 1 ms for decoding on the E5-2620 v4, but only 80 ms on the Jetson Xavier NX CPU (https://developer.nvidia.com/embedded/jetson-xavier-nx-devkit). Compared to the Python part, C++ is about 60-70x faster. Maybe it would be good to reimplement it with GPU computing. So you got good performance with a low resolution (240x135)?
-
When you benchmark the C++ decoder, make sure to allow for a warm-up. One of the most time-consuming ops is now memory allocation, and that allocation depends on the input size: allocated memory grows dynamically when you feed larger images to the decoder. If you always feed the same image size, the allocation only happens for the first image.
-
No, I translated the Python implementation to C++ before the official implementation was available. My implementation only uses the STL and implements its own data structures for 2-, 3-, and 4-D arrays. For openpifpaf 17-keypoint detection it can work in real time on embedded boards; however, 133-keypoint detection runs slowly on my workstation and even slower on a board. I am still profiling my C++ implementation, but most of the time goes to memory allocations and, in my case, to filtering operations.
-
@vladimirmujagic My implementation uses a little trick to avoid memory allocations and memory resets for the accumulation buffers: it keeps a separate float that defines a lower bound. For the first prediction, values between 0 and 1 are written and the lower bound is 0. For the next image, only the lower bound is increased to 1; the memory is not overwritten or reset, it is just sparsely filled with values between 1 and 2. The offset is corrected when reading. Because writing and reading are sparse, this is a lot faster than a dense reset of all memory.
-
Well, I found that it's really hard to implement the cifhr accumulation with CUDA. The accumulation (https://github.com/openpifpaf/openpifpaf/blob/main/openpifpaf/functional.pyx#L212) is difficult to compute in parallel with CUDA threads because several threads may accumulate at the same point. Using …
Does there exist an approximated function which can be computed in parallel and would produce similar accuracy?
-
What part of the accumulation do you have problems with? Is it perhaps the filtering operations? In my case …
I will try to parallelize it with CUDA.
-
Which step do you use it with? In the whole pipeline, the first step is to accumulate the cifhr and the second step is to filter cifcaf; which part did you use it with? What I found is that it's difficult to do the accumulation into cifhr in parallel: in this line (https://github.com/openpifpaf/openpifpaf/blob/main/openpifpaf/functional.pyx#L212), several CUDA threads may accumulate at the same point (yy, xx), that is, several threads may modify the same memory location at the same time, which is a data race in parallel computing.
-
I found that the decoder performance is not fast compared to the network forward pass.
I also see that you use Cython to optimize the process, but on an edge device like the Jetson Nano this is still not fast enough.
Can we optimize the decoder with CUDA or some other parallel technique?