Modern CPUs usually have multiple cores; for example, the benchmarking platform Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz has 12 logical cores (6 physical + 6 Hyper-Threading).
According to the current parallel task assignment algorithm (link), the parallel tasks are not divided uniformly, which leads to performance degradation: `int nstripes = cvCeil(szw.width/32.);`
Also, for object detection such as face finding, the load of each stripe is NOT uniform in nature. For example, for human faces, the stronger signals usually occur in the middle of the frame.
Experiments show that to reach better performance, there should be at least (3 * cores) parallel tasks (stripes).
Summary
- Performance degradation for portrait images (height larger than width).
- Performance degradation for images where the signal concentrates within specific regions.
- Strange implementation: stripes are divided by the image's width instead of its height.
Steps to reproduce
Please refer to the test image below.
Since the test image is 620x930 pixels, by the formula of the current algorithm the image is divided into ceil(620/32) = 20 slices.
The figure below shows the execution time of the parallel body [CascadeClassifierInvoker](https://github.com/opencv/opencv/blob/master/modules/objdetect/src/cascadedetect.cpp#L1016) for each slice. Notice that for human faces the hot area is usually located in the middle region of the input image; toward the bottom of the image, the response often drops sharply. For this test image the toes seem to invoke some degree of response, so there are two slices with strong response at the bottom.
Also notice that the execution times of the strongest and weakest slices differ by more than 7x.
The figure below shows the scheduling result for 20 slices onto the 12 (6 physical + 6 Hyper-Threading) cores.
Notice that the execution time is ~113 ms.
A meaningful metric is the CPU utilization rate; for this case it is about 67.8838%.
Notice that the inefficiency comes from non-uniform tasks; to achieve a finer-grained division, we double the number of slices to ensure the number of tasks > 3 * the number of cores.
The experiment result is as below: the CPU utilization rate rises to 91%, compared to the original 67.8838%.
The execution time becomes 85 ms; a 32.94% speedup is achieved.
General solution:
To resolve this issue, instead of adaptive task division by hot region, a general way is fine-grained task assignment. Experiments show that if the number of slices > 3 * the number of cores, even when the execution time of each slice differs greatly from the others, CPU utilization can still be kept above 80%.
Issue submission checklist
- [x] I report the issue, it's not a question
- [x] I checked the problem with documentation, FAQ, open issues, answers.opencv.org, Stack Overflow, etc. and have not found a solution
- [x] I updated to the latest OpenCV version and the issue is still there
- [x] There is reproducer code and related data files: videos, images, onnx, etc.