Modern CPUs usually have multiple cores; for example, the benchmarking platform Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz has 12 logical cores (6 physical + 6 Hyper-Threading).
According to the current parallel task assignment algorithm (link), the parallel tasks are not divided uniformly, which leads to performance degradation: `int nstripes = cvCeil(szw.width/32.);`
Also, for object detection such as face finding, the load of each stripe is NOT uniform in nature. For example, for human faces, the stronger signals usually occur in the middle of the frame.
Experiments show that to reach better performance, there should be at least (3 * cores) parallel tasks (stripes).
Summary
- Performance degradation for portrait images (height larger than width).
- Performance degradation for images where the signal concentrates within specific regions.
- Strange implementation: stripes are divided by the image's width instead of its height.
Steps to reproduce
Please refer to the test image below.
Since the test image is 620x930 pixels, by the formula of the current algorithm the image is divided into ceil(620/32) = 20 slices.
The figure below shows the execution time of the parallel body [CascadeClassifierInvoker](https://github.com/opencv/opencv/blob/master/modules/objdetect/src/cascadedetect.cpp#L1016) for each slice. Notice that for human faces the hot area is usually located in the middle region of the input image; toward the bottom of the image, the response often drops sharply. For this test image the toes seem to invoke some degree of response, so there are two slices with strong response at the bottom.
Also notice that the execution times of the strongest and weakest slices differ by more than 7x.
The figure below shows the scheduling result for 20 slices onto the 12 (6 physical + 6 Hyper-Threading) cores.
Notice that the execution time is ~113 ms.
A meaningful metric is the CPU utilization rate; for this case it is about 67.8838%.
Notice that the inefficiency comes from non-uniform tasks; to achieve a finer-grained division, we double the number of slices to ensure the number of tasks > 3 * the number of cores.
The experiment result is as below: the CPU utilization rate rises to 91%, compared to the original 67.8838%.
The execution time becomes 85 ms; a 32.94% speedup is achieved.
General solution:
To resolve this issue, instead of adaptive task division by hot region, a general way is fine-grained task assignment. Experiments show that if the number of slices > 3 * the number of cores, even when the execution time of each slice differs greatly from the others, CPU utilization can still be kept above 80%.
Issue submission checklist
- [x] I report the issue, it's not a question
- [x] I checked the problem with documentation, FAQ, open issues, answers.opencv.org, Stack Overflow, etc. and have not found a solution
- [x] I updated to the latest OpenCV version and the issue is still there
- [x] There is reproducer code and related data files: videos, images, onnx, etc.