Performance Loss from OpenCV 4.5.5 to 4.7.0 using CUDA backend #23278

Closed · Fixed by #23528
cesarpgouveia opened this issue Feb 19, 2023 · 9 comments

cesarpgouveia (Contributor) commented Feb 19, 2023

System Information

OpenCV versions tested: 4.5.5, 4.7.0
Operating System / Platform: Ubuntu 18.04
Device: NVIDIA Jetson TX2 DevKit
CUDA version: 10.2
CUDNN version: 8.2.1

Detailed description

Hi,

I was using OpenCV 4.5.5 with the CUDA backend on an NVIDIA Jetson TX2 DevKit with the specs listed above. A couple of days ago I decided to update to OpenCV 4.7.0 to check whether I would get a performance boost for the models I'm currently using. However, what I actually saw was a performance loss (in terms of execution time) for the majority of the models. Do you know the reason for this loss of performance?

These are the execution times obtained for both versions of OpenCV:

Test 1

  • Device: TX2 DevKit
  • CUDA version: 10.2
  • CUDNN version: 8.2.1
  • OpenCV version: 4.7.0


|                    | Model 1    | Model 2       | Model 3    | Model 4    |
|--------------------|------------|---------------|------------|------------|
| Input Size         | (112, 112) | (112, 112)    | (112, 112) | (112, 112) |
| Model Architecture | Resnet100  | MobileFaceNet | Resnet100  | Resnet18   |
| Jetson CPU         | 702        | 20.5          | 699        | 167        |
| Jetson GPU         | 91.7       | 10.5          | 91.6       | 52.2       |

Test 2

  • Device: TX2 DevKit
  • CUDA version: 10.2
  • CUDNN version: 8.2.1
  • OpenCV version: 4.5.5


|                    | Model 1    | Model 2       | Model 3    | Model 4    |
|--------------------|------------|---------------|------------|------------|
| Input Size         | (112, 112) | (112, 112)    | (112, 112) | (112, 112) |
| Model Architecture | Resnet100  | MobileFaceNet | Resnet100  | Resnet18   |
| Jetson CPU         | 1088       | 23.1          | 1096       | 257        |
| Jetson GPU         | 60.9       | 5.34          | 60.7       | 19.9       |

Note: Both builds used the same OpenCV flags and requirements; the only thing that changed was the version of opencv and opencv_contrib. All execution times in these tables are in ms.

Steps to reproduce

You can use this piece of code to reproduce this issue/loss of performance:

#include <iostream>
#include <chrono>
#include <fstream>
#include <thread>

#include <opencv2/imgproc.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>

int main(int argc, char** argv)
{
    auto imageToTest = argv[1];
    auto modelToTest = argv[2];
    int modelInputWidth = atoi(argv[3]);
    int modelInputHeight = atoi(argv[4]);

    cv::Size currSize = cv::Size(modelInputWidth, modelInputHeight);
    std::string modelToTestOnnx = modelToTest;
    std::string imagefilename = imageToTest;
    unsigned int num_inferences = 100;

    cv::dnn::Net net = cv::dnn::readNetFromONNX(modelToTestOnnx);

    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    cv::Mat img = cv::imread(imagefilename, cv::IMREAD_ANYCOLOR);
    cv::Mat resized;
    cv::resize(img, resized, currSize);
	
    std::vector<cv::Mat> imgBatch = { resized };
    bool swaprbchannels = false;
    cv::Mat blob = cv::dnn::blobFromImages(imgBatch, 1.0f / 255.0f, cv::Size(), cv::Scalar(), swaprbchannels, false, CV_32F);

    net.setInput(blob);

    std::vector<cv::String> unconnectedOutLayerNames = net.getUnconnectedOutLayersNames();

    std::vector<cv::Mat> outputs;
    outputs.clear();

    // First forward pass: includes one-time model initialization in addition to inference.
    auto timeLoadModelPlusInference1 = std::chrono::high_resolution_clock::now();

    net.forward(outputs, unconnectedOutLayerNames);

    auto timeLoadModelPlusInference2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> ms_doubleTimeLoadModelPlusInference = timeLoadModelPlusInference2 - timeLoadModelPlusInference1;

    std::cout << "Execution time (load model + inference): " << ms_doubleTimeLoadModelPlusInference.count() << std::endl; // in ms

    auto time1 = std::chrono::high_resolution_clock::now();

    // Run num_inferences forward passes to measure steady-state inference time.
    try {
        for (size_t i = 0; i < num_inferences; i++)
            net.forward(outputs, unconnectedOutLayerNames);
    }
    catch (std::exception& ex)
    {
        std::cout << ex.what() << std::endl;
    }

    auto time2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> ms_double = time2 - time1;
    std::cout << "Execution time inference only: " << ms_double.count() / num_inferences << std::endl; // in ms

    std::cout << "Outputs Size: " << outputs[0].size[0] << "x" << outputs[0].size[1] << std::endl;
    std::cout << "Outputs value: " << outputs[0] << std::endl;
}
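
If it helps to localize where the time goes, a per-layer breakdown of the last forward pass can be printed with cv::dnn::Net::getPerfProfile. This is a minimal sketch, assuming it is placed right after one of the net.forward calls above (it reuses the net variable from the reproducer):

    // getPerfProfile() reports the total and per-layer times of the last
    // forward pass in ticks; divide by the tick frequency to convert to ms.
    std::vector<double> layerTimes;
    double freqMs = cv::getTickFrequency() / 1000.0;
    double totalMs = net.getPerfProfile(layerTimes) / freqMs;
    std::vector<cv::String> layerNames = net.getLayerNames();

    std::cout << "total: " << totalMs << " ms" << std::endl;
    for (size_t i = 0; i < layerTimes.size() && i < layerNames.size(); i++)
        std::cout << layerNames[i] << "   " << layerTimes[i] / freqMs << " ms" << std::endl;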

Issue submission checklist

  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)
zihaomu added the category: gpu/cuda (contrib) label and removed the bug label on Feb 20, 2023
zihaomu (Member) commented Feb 20, 2023

Hi @cesarpgouveia, thanks for the detailed performance test. Please try the 4.x branch and rebuild OpenCV from source.

Duplicate of #23234

cesarpgouveia (Contributor, Author) commented Feb 20, 2023

So I just built OpenCV 4.x (4.7.0-dev); here is the updated table.

Test 3

  • Device: TX2 DevKit
  • CUDA version: 10.2
  • CUDNN version: 8.2.1
  • OpenCV version: 4.X (4.7.0-dev)


|                    | Model 1    | Model 2       | Model 3    | Model 4    |
|--------------------|------------|---------------|------------|------------|
| Input Size         | (112, 112) | (112, 112)    | (112, 112) | (112, 112) |
| Model Architecture | Resnet100  | MobileFaceNet | Resnet100  | Resnet18   |
| Jetson CPU (ms)    | ERROR      | ERROR         | ERROR      | 166        |
| Jetson GPU (ms)    | ERROR      | ERROR         | ERROR      | 34.7       |

ERROR: 

terminate called after throwing an instance of 'cv::Exception'
what(): OpenCV(4.7.0-dev) /home/vbuser/opencv/modules/core/src/matrix.cpp:1177: error: (-211:One of the arguments' values is out of range) Bad new number of rows in function 'reshape'

These are the architectures of Models 2 and 4, obtained from Netron:
Netron.zip

So, Models 1, 2, and 3 are crashing now (on branch 4.x, 4.7.0-dev), and although Model 4 now performs better than with release 4.7.0 (from 52.2 ms to 34.7 ms), its execution time is still worse than with release 4.5.5 (from 19.9 ms to 34.7 ms). Do you know why Models 1-3 are now crashing (they worked perfectly fine in the 4.5.5 and 4.7.0 releases)? Is there an issue with a certain layer?

zihaomu added this to the 4.8.0 milestone on Feb 23, 2023
WanliZhong (Member) commented:

Thanks for your report! This information is important for us. Could you paste your models? I will test each layer in a few days.

cesarpgouveia (Contributor, Author) commented:

Sorry for the late response @WanliZhong, here they are:
model1.zip
model3.zip
Models 2 and 4 are too big to attach here, even with compression.

WanliZhong (Member) commented:

@cesarpgouveia I ran model1 with onnxruntime and it throws an error. I wonder if something is wrong with your model?

2023-04-19 13:23:11.0813207 [E:onnxruntime:, sequential_executor.cc:368 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running PRelu node. Name:'conv_1_relu' Status Message: D:\a\_work\1\s\onnxruntime\core/providers/cpu/math/element_wise_ops.h:503 onnxruntime::BroadcastIterator::Init axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 56 by 64

Traceback (most recent call last):
  File "c:\Users\Zoom\Desktop\New folder\test.py", line 13, in <module>
    outputs = ort_sess.run(None, {'data': input})
  File "C:\Software\miniconda3\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running PRelu node. Name:'conv_1_relu' Status Message: D:\a\_work\1\s\onnxruntime\core/providers/cpu/math/element_wise_ops.h:503 onnxruntime::BroadcastIterator::Init axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 56 by 64

The code is:

import onnxruntime as ort
import onnx
import numpy as np

model_path = "model1.onnx"
input = np.random.rand(1, 3, 112, 112).astype(np.float32)

onnx_model = onnx.load(model_path)
onnx.checker.check_model(onnx_model)

ort_sess = ort.InferenceSession(model_path)
outputs = ort_sess.run(None, {'data': input})

WanliZhong (Member) commented Apr 20, 2023

@cesarpgouveia After testing, model3 runs correctly on both CPU and GPU. Please update your OpenCV to the latest version. I will test the inference time now.

WanliZhong (Member) commented:

Confirmed: the Mul op does not use CUDA because it is a broadcast operation. This causes switching between the GPU and CPU, which makes inference take longer. I will fix this bug.

zihaomu (Member) commented Apr 20, 2023

> Confirmed: the Mul op does not use CUDA because it is a broadcast operation. This causes switching between the GPU and CPU, which makes inference take longer. I will fix this bug.

Can you show the detailed performance test layer by layer?

WanliZhong (Member) commented Apr 21, 2023

Tested with the latest OpenCV dev version:

total: 26.7651 ms

onnx_node!ResNet18/0_conv/Conv2D   0.1515ms
onnx_node!ResNet18/0_PReLU/Relu   0.0193ms
onnx_node!ResNet18/0_PReLU/Neg_1   0.0145ms
onnx_node!ResNet18/0_PReLU/Relu_1   0.0121ms
ResNet18/0_PReLU/Neg:0   0.0167ms
onnx_node!ResNet18/0_PReLU/mul   2.071ms
onnx_node!ResNet18/0_PReLU/add   0.0585ms
onnx_node!ResNet18/stack1_block1_shortcut_conv/Conv2D   0.1643ms
onnx_node!ResNet18/stack1_block1_1_bn/FusedBatchNormV3   0.0166ms
onnx_node!ResNet18/stack1_block1_1_conv/Conv2D   0.1179ms
onnx_node!ResNet18/stack1_block1_2_PReLU/Relu   0.0192ms
onnx_node!ResNet18/stack1_block1_2_PReLU/Neg_1   0.0114ms
onnx_node!ResNet18/stack1_block1_2_PReLU/Relu_1   0.0095ms
onnx_node!ResNet18/stack1_block1_2_PReLU/mul   4.0522ms
onnx_node!ResNet18/stack1_block1_2_PReLU/add   0.0857ms
onnx_node!ResNet18/stack1_block1_2_conv/Conv2D   0.1803ms
onnx_node!ResNet18/stack1_block2_1_bn/FusedBatchNormV3   0.013ms
onnx_node!ResNet18/stack1_block2_1_conv/Conv2D   0.0533ms
onnx_node!ResNet18/stack1_block2_2_PReLU/Relu   0.0145ms
onnx_node!ResNet18/stack1_block2_2_PReLU/Neg_1   0.0116ms
onnx_node!ResNet18/stack1_block2_2_PReLU/Relu_1   0.0093ms
onnx_node!ResNet18/stack1_block2_2_PReLU/mul   2.346ms
onnx_node!ResNet18/stack1_block2_2_PReLU/add   0.0483ms
onnx_node!ResNet18/stack1_block2_2_conv/Conv2D   0.0748ms
onnx_node!ResNet18/stack2_block1_shortcut_conv/Conv2D   0.1015ms
onnx_node!ResNet18/stack2_block1_1_bn/FusedBatchNormV3   0.0135ms
onnx_node!ResNet18/stack2_block1_1_conv/Conv2D   0.0639ms
onnx_node!ResNet18/stack2_block1_2_PReLU/Relu   0.0161ms
onnx_node!ResNet18/stack2_block1_2_PReLU/Neg_1   0.0137ms
onnx_node!ResNet18/stack2_block1_2_PReLU/Relu_1   0.0133ms
ResNet18/stack2_block2_2_PReLU/Neg:0   0.0177ms
onnx_node!ResNet18/stack2_block1_2_PReLU/mul   2.7318ms
onnx_node!ResNet18/stack2_block1_2_PReLU/add   0.0643ms
onnx_node!ResNet18/stack2_block1_2_conv/Conv2D   0.1083ms
onnx_node!ResNet18/stack2_block2_1_bn/FusedBatchNormV3   0.0139ms
onnx_node!ResNet18/stack2_block2_1_conv/Conv2D   0.0496ms
onnx_node!ResNet18/stack2_block2_2_PReLU/Relu   0.0147ms
onnx_node!ResNet18/stack2_block2_2_PReLU/Neg_1   0.0115ms
onnx_node!ResNet18/stack2_block2_2_PReLU/Relu_1   0.0096ms
onnx_node!ResNet18/stack2_block2_2_PReLU/mul   1.79ms
onnx_node!ResNet18/stack2_block2_2_PReLU/add   0.045ms
onnx_node!ResNet18/stack2_block2_2_conv/Conv2D   0.0701ms
onnx_node!ResNet18/stack3_block1_shortcut_conv/Conv2D   0.0776ms
onnx_node!ResNet18/stack3_block1_1_bn/FusedBatchNormV3   0.016ms
onnx_node!ResNet18/stack3_block1_1_conv/Conv2D   0.0479ms
onnx_node!ResNet18/stack3_block1_2_PReLU/Relu   0.0159ms
onnx_node!ResNet18/stack3_block1_2_PReLU/Neg_1   0.0135ms
onnx_node!ResNet18/stack3_block1_2_PReLU/Relu_1   0.0121ms
ResNet18/stack3_block2_2_PReLU/Neg:0   0.0173ms
onnx_node!ResNet18/stack3_block1_2_PReLU/mul   2.1251ms
onnx_node!ResNet18/stack3_block1_2_PReLU/add   0.043ms
onnx_node!ResNet18/stack3_block1_2_conv/Conv2D   0.0793ms
onnx_node!ResNet18/stack3_block2_1_bn/FusedBatchNormV3   0.012ms
onnx_node!ResNet18/stack3_block2_1_conv/Conv2D   0.0458ms
onnx_node!ResNet18/stack3_block2_2_PReLU/Relu   0.0133ms
onnx_node!ResNet18/stack3_block2_2_PReLU/Neg_1   0.0106ms
onnx_node!ResNet18/stack3_block2_2_PReLU/Relu_1   0.0091ms
onnx_node!ResNet18/stack3_block2_2_PReLU/mul   1.7766ms
onnx_node!ResNet18/stack3_block2_2_PReLU/add   0.043ms
onnx_node!ResNet18/stack3_block2_2_conv/Conv2D   0.0751ms
onnx_node!ResNet18/stack4_block1_shortcut_conv/Conv2D   0.0758ms
onnx_node!ResNet18/stack4_block1_1_bn/FusedBatchNormV3   0.0153ms
onnx_node!ResNet18/stack4_block1_1_conv/Conv2D   0.048ms
onnx_node!ResNet18/stack4_block1_2_PReLU/Relu   0.0151ms
onnx_node!ResNet18/stack4_block1_2_PReLU/Neg_1   0.013ms
onnx_node!ResNet18/stack4_block1_2_PReLU/Relu_1   0.012ms
ResNet18/stack4_block1_2_PReLU/Neg:0   0.0176ms
onnx_node!ResNet18/stack4_block1_2_PReLU/mul   2.1163ms
onnx_node!ResNet18/stack4_block1_2_PReLU/add   0.0396ms
onnx_node!ResNet18/stack4_block1_2_conv/Conv2D   0.0751ms
onnx_node!ResNet18/stack4_block2_1_bn/FusedBatchNormV3   0.0121ms
onnx_node!ResNet18/stack4_block2_1_conv/Conv2D   0.0485ms
onnx_node!ResNet18/stack4_block2_2_PReLU/Relu   0.0158ms
onnx_node!ResNet18/stack4_block2_2_PReLU/Neg_1   0.013ms
onnx_node!ResNet18/stack4_block2_2_PReLU/Relu_1   0.0121ms
onnx_node!ResNet18/stack4_block2_2_PReLU/mul   2.0351ms
onnx_node!ResNet18/stack4_block2_2_PReLU/add   0.037ms
onnx_node!ResNet18/stack4_block2_2_conv/Conv2D   0.072ms
onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3   0.0142ms
onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3__210   0.0169ms
onnx_node!ResNet18/E_flatten/Reshape   0.0014ms
onnx_node!ResNet18/E_dense/MatMul   0.0445ms
ResNet18/E_batchnorm/ReadVariableOp_1:0   0.0165ms
onnx_node!ResNet18/pre_embedding/batchnorm/mul_1   0.0156ms
embedding   0.001ms
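
Reading these numbers, the PReLU mul entries alone account for roughly 21.0 ms of the 26.77 ms total, which is consistent with the broadcast Mul (presumably the slope-times-negative-part multiply of each decomposed PReLU) falling back to the CPU and dominating the inference time.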
