Review Cascade detection parallel stripe allocation policy #3652

Open
WeiChungChang opened this issue Jul 29, 2020 · 2 comments
Labels
category: ml Classic Machine Learning optimization

Comments

WeiChungChang commented Jul 29, 2020

System information (version)
  • OpenCV => 4.2
  • Operating System / Platform => Linux
  • Compiler => GCC
Detailed description

Review the cascade detection policy for allocating parallel tasks (stripes).

  • Background
  1. Modern CPUs usually have multiple cores; for example, the benchmarking platform, an Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz, has 12 logical cores (6 physical + 6 Hyper-Threading).
  2. With the current parallel task assignment algorithm (link), the parallel tasks are not divided uniformly, which degrades performance.
    int nstripes = cvCeil(szw.width/32.);
  3. Also, for object detection such as face detection, the load of each stripe is inherently NOT uniform. For human faces, for example, the strongest responses usually occur in the middle of the frame.
  4. Experiments show that, to reach better performance, there should be at least (3 * cores) parallel tasks (stripes).
  • Summary
  1. Performance degradation for portrait images (height larger than width).
  2. Performance degradation for images where the signal concentrates in specific regions.
  3. The stripe count is, strangely, computed from the image's width instead of its height.
Steps to reproduce
  1. Please refer to the test image below.
    [image: test image]

Since the test image is 620x930 pixels, the current formula divides it into ceil(620/32) = 20 slices.
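To make the arithmetic concrete, here is a minimal sketch of the quoted formula. It uses `std::ceil` as a stand-in for `cvCeil`, and the function name `stripeCountFromWidth` is hypothetical, not an OpenCV API. It only illustrates that the height never enters the computation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper mirroring `cvCeil(szw.width/32.)` from the issue.
// std::ceil stands in for cvCeil; the height parameter is deliberately unused
// to show that it plays no role in the stripe count.
int stripeCountFromWidth(int width, int /*height*/)
{
    assert(width > 0);
    return static_cast<int>(std::ceil(width / 32.0));
}
```

For the 620x930 test image this gives 20 stripes, while the same image rotated to 930x620 would get 30, so a portrait image is split more coarsely than its landscape counterpart.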

The figure below shows the execution time of the parallel body [CascadeClassifierInvoker](https://github.com/opencv/opencv/blob/master/modules/objdetect/src/cascadedetect.cpp#L1016) for each slice.
[figure: execution time per slice]
Notice that for human faces the hot area is usually located in the middle region of the input image. At the bottom of the input image, the response often drops sharply. For this test image, the toes seem to trigger some degree of response, so two slices at the bottom have a strong response.

Also notice that the execution times of the strongest and weakest slices differ by more than 7 times.

The figure below shows the scheduling result for 20 slices on 12 logical cores (6 physical + 6 Hyper-Threading).
Notice that the execution time is about 113 ms.
[figure: scheduling result with the original 20 slices]

A meaningful metric is the CPU utilization rate; in this case it is about 67.8838%.
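The issue does not state how the utilization figure was computed. One common definition, assumed here, is the total busy time of all workers divided by (number of workers × wall-clock time of the parallel region). The helper name `cpuUtilization` is hypothetical.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: fraction of the available worker time spent on slices.
// busyMs[i] is the total time worker i spent executing slices;
// wallMs is the wall-clock duration of the whole parallel region.
double cpuUtilization(const std::vector<double>& busyMs, double wallMs)
{
    assert(!busyMs.empty() && wallMs > 0.0);
    const double totalBusy = std::accumulate(busyMs.begin(), busyMs.end(), 0.0);
    return totalBusy / (static_cast<double>(busyMs.size()) * wallMs);
}
```

Under this definition, two workers that are busy for 10 ms and 5 ms in a 10 ms region yield a utilization of 0.75.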

Notice that the inefficiency comes from non-uniform tasks; to achieve a finer-grained division, we double the number of slices to ensure that the number of tasks > 3 * number of cores.
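A minimal sketch of such a policy, under stated assumptions: the function names and the `tasksPerCore` parameter are hypothetical, `std::ceil` stands in for `cvCeil`, and the core count is passed in explicitly (in OpenCV it could plausibly come from `cv::getNumThreads()`). This is not the patch the author tested, which simply doubled the slice count.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Current formula quoted in the issue, with std::ceil standing in for cvCeil.
int currentStripeCount(int width)
{
    assert(width > 0);
    return static_cast<int>(std::ceil(width / 32.0));
}

// Hypothetical policy: keep the current count when it is already fine-grained,
// otherwise raise it so that stripes > tasksPerCore * cores (strictly greater).
int fineGrainedStripeCount(int width, int cores, int tasksPerCore = 3)
{
    assert(cores > 0 && tasksPerCore > 0);
    return std::max(currentStripeCount(width), tasksPerCore * cores + 1);
}
```

For the 620-pixel-wide test image on 12 logical cores, this raises the count from 20 to 37; doubling to 40, as done in the experiment, satisfies the same bound.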

The experimental result is shown below:
[figure: scheduling result with the doubled number of slices]
The CPU utilization rate rises to 91%, compared to the original 67.8838%.

The execution time drops to 85 ms, a 32.94% speedup (113 / 85 ≈ 1.3294).

General solution:
To resolve this issue, a general approach is fine-grained task assignment, rather than dividing tasks adaptively by hot region.
Experiments show that if the number of slices > 3 * number of cores, CPU utilization stays above 80% even when the execution times of the slices differ widely.
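The effect can be illustrated with a toy simulation. The sketch below assumes greedy dynamic scheduling (each task goes to the worker that becomes free first), which may differ from what a given OpenCV parallel backend does, and it assumes a slice's work divides evenly when split. All names are hypothetical and the durations used with it are synthetic, not the measurements from this issue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <vector>

// Greedy list scheduling: tasks are taken in order and each one starts on the
// worker that becomes free earliest. Returns the makespan (wall-clock time).
double simulateMakespan(const std::vector<double>& taskMs, int workers)
{
    assert(workers > 0);
    std::priority_queue<double, std::vector<double>, std::greater<double>> freeAt;
    for (int i = 0; i < workers; ++i) freeAt.push(0.0);
    double makespan = 0.0;
    for (double t : taskMs) {
        const double end = freeAt.top() + t;  // earliest-free worker runs the task
        freeAt.pop();
        freeAt.push(end);
        makespan = std::max(makespan, end);
    }
    return makespan;
}

// Utilization = total work / (workers * makespan); 0 for an empty task list.
double simulatedUtilization(const std::vector<double>& taskMs, int workers)
{
    const double total = std::accumulate(taskMs.begin(), taskMs.end(), 0.0);
    const double makespan = simulateMakespan(taskMs, workers);
    return makespan > 0.0 ? total / (static_cast<double>(workers) * makespan) : 0.0;
}

// Idealized refinement: every task is cut into `parts` equal pieces.
std::vector<double> splitEvenly(const std::vector<double>& taskMs, int parts)
{
    assert(parts > 0);
    std::vector<double> out;
    out.reserve(taskMs.size() * static_cast<std::size_t>(parts));
    for (double t : taskMs)
        for (int i = 0; i < parts; ++i)
            out.push_back(t / parts);
    return out;
}
```

With synthetic durations {1, 1, 7, 1} on 2 workers, the single heavy task dominates: the makespan is 8 and utilization is 0.625. Splitting every task in two brings the makespan to 5 and utilization to 1.0, which is the same mechanism the finer stripe division relies on.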

Issue submission checklist
  • V I report the issue, it's not a question
  • V I checked the problem with documentation, FAQ, open issues,
    answers.opencv.org, Stack Overflow, etc and have not found solution
  • V I updated to latest OpenCV version and the issue is still there
  • V There is reproducer code and related data files: videos, images, onnx, etc
@asmorkalov asmorkalov added category: ml Classic Machine Learning optimization labels Jul 29, 2020
@asmorkalov
Contributor

@vpisarev please take a look at the proposal.

@mshabunin
Contributor

@WeiChungChang, could you please propose a PR with your changes? Any performance measurements would be great too.

@asmorkalov asmorkalov transferred this issue from opencv/opencv Mar 4, 2024