Skip to content

Conversation

kallaballa
Copy link
Contributor

Pertaining Issue: #5697

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@vpisarev
Copy link
Contributor

@kallaballa, thank you for the pull request. I have some doubts regarding it.

  1. it references the issue opencl has no impact on the performance of face detection #5697 that complains about zero performance gain when using OpenCV + T-API for inference. However, a typical deep learning model includes a few dozens if not 100+ layers, so the preprocessing, which is just a few operations.
  2. in the experimental inference engine https://github.com/vpisarev/ficus/tree/master/lib/NN, which we slowly, part by part, migrate to OpenCV, there is a single-pass "blobFromImage" (it's called differently there) preprocessing operation (https://github.com/vpisarev/ficus/blob/master/lib/NN/Preprocess.fx) - I would prefer to migrate it and optimize using OpenCL rather than to use multi-pass collection of primitive operations.
  3. If the whole pipeline is to capture high-resolution images/video stream, downscale it and run a deep net, then resize might be one of the bottleneck. In such a case user could manually copy the input to UMat, call OpenCV's resize, download result to CPU (.getMat()) and then run the rest of preprocessing on a downscaled image.
  4. It would be nice to do a little research and provide numbers - by how much this patch improves performance on different hardware: 1. desktop CPU with discrete GPU, 2. mobile Intel/AMD CPU with integrated graphics, 3. ARM with Mali or another OpenCL-capable graphics.
  5. If the performance, as measured in item 4, is not noticeably better than with CPU (it could well be the case, see item 1), I would prefer to keep preprocessing on CPU. From time to time we get various problem with our OpenCL kernels. I would like to keep things more stable and predictable, unless there is clear advantage.
  6. In OpenCV 5 we plan, among other things, to make OpenCV much more friendly to various GPU/NPU accelerators. UMat should become universal data structure for buffers, directly accessible by various acceleration APIs, including OpenCL, Vulkan, GLSL, CUDA etc. Then we can get back to such a patch. But, as I said in 2., I would prefer the whole blobFromImage(s) to be converted into a single well-optimized parallel loop. Then it could be ported to OpenCL/GLSL/CUDA etc. and then user can directly call blobFromImages and pass UMat there. Now it's not the proper time yet, in my opinion.

@kallaballa
Copy link
Contributor Author

Alright. At least I'll provide some numbers based on my machine.

@kallaballa
Copy link
Contributor Author

kallaballa commented Jun 29, 2023

I modified one of the V4D demos to track face detection time using getTickCount().
My machine: 11th Gen Intel(R) Core(TM) i7-1160G7 @ 1.20GHz with Iris Xe Graphics
Video used (from 00:01:00): https://www.youtube.com/watch?v=hUAT8Jm_dvw

In this scenario 1000 iterations of FaceDetectorYN::detect take:

  • With patch: 25.7823s
  • Without patch: 34.0648s

I'll add flame graphs.

I understand your arguments but given the simplicity of the patch the risk/gain ratio isn't so bad :)

@kallaballa
Copy link
Contributor Author

Flame graphs zoomed in on FaceDetectorYN::detect
Without patch:
fg_without

With patch:
fg_with

@kallaballa
Copy link
Contributor Author

kallaballa commented Jun 29, 2023

Btw. I made similar patches e.g for TrackerKCF with considerable performance gain. I guess I should drop those?

@vpisarev
Copy link
Contributor

@kallaballa, thank you for the quick response. The acceleration in your case is noticeable, indeed! What's the resolution of images that you feed to blobFromImagesWithParams?

I think, I can propose a compromise solution that will make everybody happy.
Let's implement "compute follow data" principle (that is already used in OpenCV when it comes to T-API):
if inputs are UMat's (images.kind == _InputArray::STD_VECTOR_UMAT), use your new branch, otherwise use the original one.

  1. OpenCV video capture can return the frames in UMat's, not just Mat's, and some video capture backends support UMat output in very efficient way.
  2. You can use .getUMat() function to convert each processed image into UMat quickly.
  3. Yes, users will have to modify their apps to use your branch, but for now such explicit behaviour for this for the better.

@kallaballa
Copy link
Contributor Author

@kallaballa, thank you for the quick response. The acceleration in your case is noticeable, indeed! What's the resolution of images that you feed to blobFromImagesWithParams?

960x540

I think, I can propose a compromise solution that will make everybody happy. Let's implement "compute follow data" principle (that is already used in OpenCV when it comes to T-API): if inputs are UMat's (images.kind == _InputArray::STD_VECTOR_UMAT), use your new branch, otherwise use the original one.

  1. OpenCV video capture can return the frames in UMat's, not just Mat's, and some video capture backends support UMat output in very efficient way.
  2. You can use .getUMat() function to convert each processed image into UMat quickly.
  3. Yes, users will have to modify their apps to use your branch, but for now such explicit behaviour for this for the better.

Sounds good. Will implement it like that.

@kallaballa kallaballa force-pushed the blobFromImagesWithParams branch 2 times, most recently from c231091 to e026820 Compare July 5, 2023 11:16
@kallaballa kallaballa force-pushed the blobFromImagesWithParams branch from 4610c9d to 8f72b3b Compare July 16, 2023 05:04
fixed separate code paths for face detect
@kallaballa kallaballa force-pushed the blobFromImagesWithParams branch from 8f72b3b to 00804cb Compare July 16, 2023 06:43
@kallaballa
Copy link
Contributor Author

I think that's it. 00804cb

@kallaballa
Copy link
Contributor Author

I think that's it. 00804cb

If you are alright with the general approach, i'd improve the implementation a bit more.

@kallaballa
Copy link
Contributor Author

Also I wrote a test that compares detection. there are differences (some frames not detected with UMat) I am trying to track down.

void getChannelFromBlob(UMat& m, InputArray blob, int i, int j, int rows, int cols, int type) {
UMat ublob = blob.getUMat();
int offset = i * cols + j;
int offset = (i * ublob.step.p[0] + j * ublob.step.p[1]) / ublob.elemSize();
Copy link
Contributor Author

@kallaballa kallaballa Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to take into account step() and elemSize() for the offset. Now it works on par.

@asmorkalov asmorkalov added this to the 4.9.0 milestone Sep 15, 2023
@kallaballa
Copy link
Contributor Author

I have put other more code on the fast-path by porting NaryEltwiseLayer to UMat. I understand given the developments around 5.0 that this doesn't have priority but there is considerable gain. Should i post figures and make a PR? Only part left to port is ResizeLayer to keep it on the GPU all the time.

@kallaballa
Copy link
Contributor Author

Here a flame graph of the current state:
flame

if(blob_.kind() == _InputArray::UMAT)
blob = blob_.getUMat();
else if(blob_.kind() == _InputArray::MAT) {
blob = blob_.getMat().getUMat(flag);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current UMat design has limitation for storing results of .getMat() / .getUMat somewhere (should be used locally only) - upstream lifetime check should pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so clone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.clone() is overkill.

blob_.getMat() return temporary object - it should be alive till .getUMat(flag) release.


We don't need this method at all as there is _InputArray::getUMat(): https://github.com/opencv/opencv/blob/4.8.0/modules/core/src/matrix_wrap.cpp#L126C5-L126C27

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx!

void blobFromImagesWithParams(InputArrayOfArrays images, OutputArray blob, const Image2BlobParams& param) {
CV_TRACE_FUNCTION();

if (images.kind() == _InputArray::STD_VECTOR_UMAT) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broken indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asmorkalov
Copy link
Contributor

modules/dnn/src/dnn_utils.cpp:298: tab in indent.
+	if (images.kind() == _InputArray::STD_VECTOR_UMAT) {
modules/dnn/src/dnn_utils.cpp:299: tab in indent.
+		if(blob.kind() == _InputArray::UMAT) {
modules/dnn/src/dnn_utils.cpp:300: tab in indent.
+			UMat& u = blob.getUMatRef();
modules/dnn/src/dnn_utils.cpp:301: tab in indent.
+			blobFromImagesWithParamsImpl<cv::UMat>(images, u, param);
modules/dnn/src/dnn_utils.cpp:302: tab in indent.
+			return;
modules/dnn/src/dnn_utils.cpp:303: tab in indent.
+		} else if(blob.kind() == _InputArray::MAT) {
modules/dnn/src/dnn_utils.cpp:304: tab in indent.
+			UMat u = blob.getMatRef().getUMat(ACCESS_WRITE);
modules/dnn/src/dnn_utils.cpp:305: tab in indent.
+			blobFromImagesWithParamsImpl<cv::UMat>(images, u, param);
modules/dnn/src/dnn_utils.cpp:306: tab in indent.
+			u.copyTo(blob);
modules/dnn/src/dnn_utils.cpp:307: tab in indent.
+			return;
modules/dnn/src/dnn_utils.cpp:308: tab in indent.
+		}
modules/dnn/src/dnn_utils.cpp:309: tab in indent.
+	} else if (images.kind() == _InputArray::STD_VECTOR_MAT) {
modules/dnn/src/dnn_utils.cpp:310: tab in indent.
+		if(blob.kind() == _InputArray::UMAT) {
modules/dnn/src/dnn_utils.cpp:311: tab in indent.
+			Mat m = blob.getUMatRef().getMat(ACCESS_WRITE);
modules/dnn/src/dnn_utils.cpp:312: tab in indent.
+			blobFromImagesWithParamsImpl<cv::Mat>(images, m, param);
modules/dnn/src/dnn_utils.cpp:313: tab in indent.
+			m.copyTo(blob);
modules/dnn/src/dnn_utils.cpp:314: tab in indent.
+			return;
modules/dnn/src/dnn_utils.cpp:315: tab in indent.
+		} else if(blob.kind() == _InputArray::MAT) {
modules/dnn/src/dnn_utils.cpp:316: tab in indent.
+			Mat& m = blob.getMatRef();
modules/dnn/src/dnn_utils.cpp:317: tab in indent.
+			blobFromImagesWithParamsImpl<cv::Mat>(images, m, param);
modules/dnn/src/dnn_utils.cpp:318: tab in indent.
+			return;
modules/dnn/src/dnn_utils.cpp:319: tab in indent.
+		}
modules/dnn/src/dnn_utils.cpp:320: tab in indent.
+	}

@kallaballa
Copy link
Contributor Author

fbe2ccb

padWithDivisor(input_image, pad_image);
// Build blob from input image
input_blob = dnn::blobFromImage(pad_image);
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vpisarev T-API declares what we should not have such code separation on "user" side.

@asmorkalov
Copy link
Contributor

@opencv-alalek @vpisarev Is it ready for merge?

@asmorkalov asmorkalov merged commit c2f909f into opencv:4.x Oct 20, 2023
@kallaballa
Copy link
Contributor Author

Yay!

@asmorkalov asmorkalov mentioned this pull request Nov 3, 2023
IskXCr pushed a commit to Haosonn/opencv that referenced this pull request Dec 20, 2023
Pertaining Issue: opencv#5697

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
thewoz pushed a commit to thewoz/opencv that referenced this pull request Jan 4, 2024
Pertaining Issue: opencv#5697

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
thewoz pushed a commit to thewoz/opencv that referenced this pull request May 29, 2024
Pertaining Issue: opencv#5697

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants