Replies: 19 comments
-
Yes, I am working on that. It is extremely experimental, but if you want to try it out, that work is in the `libtorch-extension` branch.
-
That's nice. I am also wondering whether we can re-implement it with CUDA, but the synchronization is a big challenge. Let me keep this thread open and I will give feedback after using the new C++ interface. Thank you for sharing.
-
Does this only work on PyTorch 1.9? I tried torch 1.7 but got some compile errors.
-
Yes, only PyTorch 1.9.
-
Did you benchmark the performance? Is this much faster than the Cython version?
-
Thank you. For various reasons I can't upgrade torch. Can you share the benchmark timing results from your environment?
-
It varies, and everything is still preliminary. The flag …
-
Amazing, I will try that. Thank you.
-
I just tried the branch with PyTorch 1.9, but I found that the decoder speed is still very slow (about 40 ms); the input resolution is 961 x 721. Is this the line (https://github.com/openpifpaf/openpifpaf/blob/libtorch-extension/openpifpaf/decoder/cifcaf.py#L255) you used for profiling?
-
By the way, my CPU is an Intel E5-2620 v4. What kind of CPU did you use to get below 1 ms?
-
The important question is the model input size, not the original image size, or more precisely the tensor shape of the input when you are benchmarking. I tested with …
-
I just reimplemented the C++ part and got 1 ms for decoding on the E5-2620 v4, but only 80 ms on the Jetson Xavier NX CPU (https://developer.nvidia.com/embedded/jetson-xavier-nx-devkit). Compared to the Python part, C++ is about 60-70x faster. Maybe it would be good to reimplement it with GPU computing. So you got good performance with a low resolution (240x135)?
-
When you benchmark the C++ decoder, make sure to allow for a warm-up. One of the most time-consuming ops is now memory allocation, and that allocation depends on the input size: allocated memory grows dynamically when you feed larger images to the decoder. If you always feed the same image size, the allocation only happens for the first image.
-
No, I translated the Python implementation to C++ before the official implementation was available. My implementation only uses the STL and implements its own data structures for 2-, 3-, and 4-D arrays. For openpifpaf 17-keypoint detection it can work in real time on embedded boards; however, 133-keypoint detection runs slowly on my workstation and even slower on a board. I am still profiling my C++ implementation, but most of the time goes to memory allocations and, in my case, to filtering operations.
-
@vladimirmujagic My implementation uses a little trick to avoid memory allocations and memory resets for the accumulation buffers: it keeps a separate float that defines a lower bound. For the first prediction, values between 0 and 1 are written and the lower bound is 0. For the next image, only the lower bound is increased to 1; the memory is not overwritten or reset, it is just sparsely filled with values between 1 and 2. The offset is corrected when reading. Because writing and reading are sparse, this is a lot faster than a dense reset of all memory.
-
Well, I found that it's really hard to implement the cifhr accumulation with CUDA. The accumulation (https://github.com/openpifpaf/openpifpaf/blob/main/openpifpaf/functional.pyx#L212) is difficult to compute in parallel with CUDA threads because several threads may accumulate at the same point. Using …
Does there exist an approximated function which can be computed in parallel and would produce similar accuracy?
-
What part of the accumulation do you have problems with? Is it perhaps the filtering operations? In my case …
I will try to parallelize it with CUDA.
-
Which step do you use it with? In the whole pipeline, the first step is to accumulate the cifhr and the second step is to filter cifcaf; which part did you use it with? What I found is that it's difficult to do the accumulation into cifhr in parallel: in this line (https://github.com/openpifpaf/openpifpaf/blob/main/openpifpaf/functional.pyx#L212), several CUDA threads may accumulate at the same point (yy, xx), that is, several threads may modify the same memory location at the same time, which is a data race in parallel computing.
-
I found that the decoder performance is not fast compared to the network forward pass.
I also see that you use Cython to optimize the process, but on an edge device like the Jetson Nano this is still not fast enough.
Can we optimize the decoder with CUDA or some other parallel technique?