Performance Loss from OpenCV 4.5.5 to 4.7.0 using CUDA backend #23278

Closed · Fixed by #23528
cesarpgouveia opened this issue Feb 19, 2023 · 9 comments

cesarpgouveia (Contributor) commented Feb 19, 2023

System Information

OpenCV versions tested: 4.5.5, 4.7.0
Operating System / Platform: Ubuntu 18.04
Device: NVIDIA Jetson TX2 DevKit
CUDA version: 10.2
CUDNN version: 8.2.1

Detailed description

Hi,

I was using OpenCV 4.5.5 with the CUDA backend on an NVIDIA Jetson TX2 DevKit with the specs listed above. A couple of days ago I decided to update to OpenCV 4.7.0 to check whether I would get a performance boost for the models I'm currently using. However, what I actually saw was a performance loss (in terms of execution time) for the majority of the models. Do you know the reason for this loss of performance?

These are the execution times obtained for both versions of OpenCV:

Test 1

  • Device: TX2 DevKit
  • CUDA version: 10.2
  • CUDNN version: 8.2.1
  • OpenCV version: 4.7.0


|                    | Model 1    | Model 2       | Model 3    | Model 4    |
|--------------------|------------|---------------|------------|------------|
| Input Size         | (112, 112) | (112, 112)    | (112, 112) | (112, 112) |
| Model Architecture | Resnet100  | MobileFaceNet | Resnet100  | Resnet18   |
| Jetson CPU         | 702        | 20.5          | 699        | 167        |
| Jetson GPU         | 91.7       | 10.5          | 91.6       | 52.2       |

Test 2

  • Device: TX2 DevKit
  • CUDA version: 10.2
  • CUDNN version: 8.2.1
  • OpenCV version: 4.5.5


|                    | Model 1    | Model 2       | Model 3    | Model 4    |
|--------------------|------------|---------------|------------|------------|
| Input Size         | (112, 112) | (112, 112)    | (112, 112) | (112, 112) |
| Model Architecture | Resnet100  | MobileFaceNet | Resnet100  | Resnet18   |
| Jetson CPU         | 1088       | 23.1          | 1096       | 257        |
| Jetson GPU         | 60.9       | 5.34          | 60.7       | 19.9       |

Note: Both builds used the same OpenCV flags and requirements; the only thing that changed was the version of opencv and opencv_contrib. All execution times in these tables are in ms.

Steps to reproduce

You can use this piece of code to reproduce this issue/loss of performance:

#include <iostream>
#include <chrono>
#include <fstream>
#include <thread>

#include <opencv2/imgproc.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>

int main(int argc, char** argv)
{
    auto imageToTest = argv[1];
    auto modelToTest = argv[2];
    int modelInputWidth = atoi(argv[3]);
    int modelInputHeight = atoi(argv[4]);

    cv::Size currSize = cv::Size(modelInputWidth, modelInputHeight);
    std::string modelToTestOnnx = modelToTest;
    std::string imagefilename = imageToTest;
    unsigned int num_inferences = 100;

    cv::dnn::Net net = cv::dnn::readNetFromONNX(modelToTestOnnx);

    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    cv::Mat img = cv::imread(imagefilename, cv::IMREAD_ANYCOLOR);
    cv::Mat resized;
    cv::resize(img, resized, currSize);
	
    std::vector<cv::Mat> imgBatch = { resized };
    bool swaprbchannels = false;
    cv::Mat blob = cv::dnn::blobFromImages(imgBatch, 1.0f / 255.0f, cv::Size(), cv::Scalar(), swaprbchannels, false, CV_32F);

    net.setInput(blob);

    std::vector<cv::String> unconnectedOutLayerNames = net.getUnconnectedOutLayersNames();

    std::vector<cv::Mat> outputs;
    outputs.clear();

    // First forward pass: includes one-time model initialization in addition to inference.
    auto timeLoadModelPlusInference1 = std::chrono::high_resolution_clock::now();

    net.forward(outputs, unconnectedOutLayerNames);

    auto timeLoadModelPlusInference2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> ms_doubleTimeLoadModelPlusInference = timeLoadModelPlusInference2 - timeLoadModelPlusInference1;

    std::cout << "Execution time (load model + inference): " << ms_doubleTimeLoadModelPlusInference.count() << std::endl; // in ms

    auto time1 = std::chrono::high_resolution_clock::now();

    // Run num_inferences forward passes to measure steady-state inference time.
    try {
        for (size_t i = 0; i < num_inferences; i++)
            net.forward(outputs, unconnectedOutLayerNames);
    }
    catch (std::exception& ex)
    {
        std::cout << ex.what() << std::endl;
    }

    auto time2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> ms_double = time2 - time1;
    std::cout << "Execution time inference only: " << ms_double.count() / num_inferences << std::endl; // in ms

    std::cout << "Outputs Size: " << outputs[0].size[0] << "x" << outputs[0].size[1] << std::endl;
    std::cout << "Outputs value: " << outputs[0] << std::endl;
}
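
If it helps to localize where the time goes, a per-layer breakdown of the last forward pass can be printed with cv::dnn::Net::getPerfProfile. This is a minimal sketch, assuming it is placed right after one of the net.forward calls above (it reuses the net variable from the reproducer):

    // getPerfProfile() reports the total and per-layer times of the last
    // forward pass in ticks; divide by the tick frequency to convert to ms.
    std::vector<double> layerTimes;
    double freqMs = cv::getTickFrequency() / 1000.0;
    double totalMs = net.getPerfProfile(layerTimes) / freqMs;
    std::vector<cv::String> layerNames = net.getLayerNames();

    std::cout << "total: " << totalMs << " ms" << std::endl;
    for (size_t i = 0; i < layerTimes.size() && i < layerNames.size(); i++)
        std::cout << layerNames[i] << "   " << layerTimes[i] / freqMs << " ms" << std::endl;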

Issue submission checklist

  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)
zihaomu added the category: gpu/cuda (contrib) label and removed the bug label on Feb 20, 2023
zihaomu (Member) commented Feb 20, 2023

Hi @cesarpgouveia, thanks for the detailed performance test. Please try the 4.x branch and rebuild OpenCV from source.

Duplicate of #23234

cesarpgouveia (Contributor, Author) commented Feb 20, 2023

So I just built OpenCV 4.x (4.7.0-dev); here is the updated table.

Test 3

  • Device: TX2 DevKit
  • CUDA version: 10.2
  • CUDNN version: 8.2.1
  • OpenCV version: 4.X (4.7.0-dev)


|                    | Model 1    | Model 2       | Model 3    | Model 4    |
|--------------------|------------|---------------|------------|------------|
| Input Size         | (112, 112) | (112, 112)    | (112, 112) | (112, 112) |
| Model Architecture | Resnet100  | MobileFaceNet | Resnet100  | Resnet18   |
| Jetson CPU (ms)    | ERROR      | ERROR         | ERROR      | 166        |
| Jetson GPU (ms)    | ERROR      | ERROR         | ERROR      | 34.7       |

ERROR: 

terminate called after throwing an instance of 'cv::Exception'
what(): OpenCV(4.7.0-dev) /home/vbuser/opencv/modules/core/src/matrix.cpp:1177: error: (-211:One of the arguments' values is out of range) Bad new number of rows in function 'reshape'

These are the architectures of Models 2 and 4, obtained from Netron:
Netron.zip

So, Models 1, 2, and 3 are crashing now (on branch 4.x, 4.7.0-dev), and although Model 4 now performs better than with release 4.7.0 (from 52.2 ms to 34.7 ms), its execution time is still worse than with release 4.5.5 (from 19.9 ms to 34.7 ms). Do you know why Models 1-3 are now crashing (they worked perfectly fine in the 4.5.5 and 4.7.0 releases)? Is there an issue with a certain layer?

zihaomu added this to the 4.8.0 milestone on Feb 23, 2023
WanliZhong (Member) commented:

Thanks for your report! This information is important for us. Could you paste your models? I will test each layer in a few days.

cesarpgouveia (Contributor, Author) commented:

Sorry for the late response @WanliZhong, here they are:
model1.zip
model3.zip
Models 2 and 4 are too big to attach here, even with compression.

WanliZhong (Member) commented:

@cesarpgouveia I ran model1 with onnxruntime and it throws an error. I wonder if something is wrong with your model?

2023-04-19 13:23:11.0813207 [E:onnxruntime:, sequential_executor.cc:368 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running PRelu node. Name:'conv_1_relu' Status Message: D:\a\_work\1\s\onnxruntime\core/providers/cpu/math/element_wise_ops.h:503 onnxruntime::BroadcastIterator::Init axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 56 by 64

Traceback (most recent call last):
  File "c:\Users\Zoom\Desktop\New folder\test.py", line 13, in <module>
    outputs = ort_sess.run(None, {'data': input})
  File "C:\Software\miniconda3\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running PRelu node. Name:'conv_1_relu' Status Message: D:\a\_work\1\s\onnxruntime\core/providers/cpu/math/element_wise_ops.h:503 onnxruntime::BroadcastIterator::Init axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 56 by 64

The code is:

import onnxruntime as ort
import onnx
import numpy as np

model_path = "model1.onnx"
input = np.random.rand(1, 3, 112, 112).astype(np.float32)

onnx_model = onnx.load(model_path)
onnx.checker.check_model(onnx_model)

ort_sess = ort.InferenceSession(model_path)
outputs = ort_sess.run(None, {'data': input})

WanliZhong (Member) commented Apr 20, 2023

@cesarpgouveia After testing, model3 runs correctly on both CPU and GPU. Please update your OpenCV to the latest version. I will test the inference time now.

WanliZhong (Member) commented:

Confirmed: the Mul op does not use CUDA because it is a broadcast operation. This causes switching between the GPU and CPU, which makes inference take longer. I will fix this bug.

zihaomu (Member) commented Apr 20, 2023

> Confirmed: the Mul op does not use CUDA because it is a broadcast operation. This causes switching between the GPU and CPU, which makes inference take longer. I will fix this bug.

Can you show the detailed performance test layer by layer?

WanliZhong (Member) commented Apr 21, 2023

Tested with the latest OpenCV dev version:

total: 26.7651 ms

onnx_node!ResNet18/0_conv/Conv2D   0.1515ms
onnx_node!ResNet18/0_PReLU/Relu   0.0193ms
onnx_node!ResNet18/0_PReLU/Neg_1   0.0145ms
onnx_node!ResNet18/0_PReLU/Relu_1   0.0121ms
ResNet18/0_PReLU/Neg:0   0.0167ms
onnx_node!ResNet18/0_PReLU/mul   2.071ms
onnx_node!ResNet18/0_PReLU/add   0.0585ms
onnx_node!ResNet18/stack1_block1_shortcut_conv/Conv2D   0.1643ms
onnx_node!ResNet18/stack1_block1_1_bn/FusedBatchNormV3   0.0166ms
onnx_node!ResNet18/stack1_block1_1_conv/Conv2D   0.1179ms
onnx_node!ResNet18/stack1_block1_2_PReLU/Relu   0.0192ms
onnx_node!ResNet18/stack1_block1_2_PReLU/Neg_1   0.0114ms
onnx_node!ResNet18/stack1_block1_2_PReLU/Relu_1   0.0095ms
onnx_node!ResNet18/stack1_block1_2_PReLU/mul   4.0522ms
onnx_node!ResNet18/stack1_block1_2_PReLU/add   0.0857ms
onnx_node!ResNet18/stack1_block1_2_conv/Conv2D   0.1803ms
onnx_node!ResNet18/stack1_block2_1_bn/FusedBatchNormV3   0.013ms
onnx_node!ResNet18/stack1_block2_1_conv/Conv2D   0.0533ms
onnx_node!ResNet18/stack1_block2_2_PReLU/Relu   0.0145ms
onnx_node!ResNet18/stack1_block2_2_PReLU/Neg_1   0.0116ms
onnx_node!ResNet18/stack1_block2_2_PReLU/Relu_1   0.0093ms
onnx_node!ResNet18/stack1_block2_2_PReLU/mul   2.346ms
onnx_node!ResNet18/stack1_block2_2_PReLU/add   0.0483ms
onnx_node!ResNet18/stack1_block2_2_conv/Conv2D   0.0748ms
onnx_node!ResNet18/stack2_block1_shortcut_conv/Conv2D   0.1015ms
onnx_node!ResNet18/stack2_block1_1_bn/FusedBatchNormV3   0.0135ms
onnx_node!ResNet18/stack2_block1_1_conv/Conv2D   0.0639ms
onnx_node!ResNet18/stack2_block1_2_PReLU/Relu   0.0161ms
onnx_node!ResNet18/stack2_block1_2_PReLU/Neg_1   0.0137ms
onnx_node!ResNet18/stack2_block1_2_PReLU/Relu_1   0.0133ms
ResNet18/stack2_block2_2_PReLU/Neg:0   0.0177ms
onnx_node!ResNet18/stack2_block1_2_PReLU/mul   2.7318ms
onnx_node!ResNet18/stack2_block1_2_PReLU/add   0.0643ms
onnx_node!ResNet18/stack2_block1_2_conv/Conv2D   0.1083ms
onnx_node!ResNet18/stack2_block2_1_bn/FusedBatchNormV3   0.0139ms
onnx_node!ResNet18/stack2_block2_1_conv/Conv2D   0.0496ms
onnx_node!ResNet18/stack2_block2_2_PReLU/Relu   0.0147ms
onnx_node!ResNet18/stack2_block2_2_PReLU/Neg_1   0.0115ms
onnx_node!ResNet18/stack2_block2_2_PReLU/Relu_1   0.0096ms
onnx_node!ResNet18/stack2_block2_2_PReLU/mul   1.79ms
onnx_node!ResNet18/stack2_block2_2_PReLU/add   0.045ms
onnx_node!ResNet18/stack2_block2_2_conv/Conv2D   0.0701ms
onnx_node!ResNet18/stack3_block1_shortcut_conv/Conv2D   0.0776ms
onnx_node!ResNet18/stack3_block1_1_bn/FusedBatchNormV3   0.016ms
onnx_node!ResNet18/stack3_block1_1_conv/Conv2D   0.0479ms
onnx_node!ResNet18/stack3_block1_2_PReLU/Relu   0.0159ms
onnx_node!ResNet18/stack3_block1_2_PReLU/Neg_1   0.0135ms
onnx_node!ResNet18/stack3_block1_2_PReLU/Relu_1   0.0121ms
ResNet18/stack3_block2_2_PReLU/Neg:0   0.0173ms
onnx_node!ResNet18/stack3_block1_2_PReLU/mul   2.1251ms
onnx_node!ResNet18/stack3_block1_2_PReLU/add   0.043ms
onnx_node!ResNet18/stack3_block1_2_conv/Conv2D   0.0793ms
onnx_node!ResNet18/stack3_block2_1_bn/FusedBatchNormV3   0.012ms
onnx_node!ResNet18/stack3_block2_1_conv/Conv2D   0.0458ms
onnx_node!ResNet18/stack3_block2_2_PReLU/Relu   0.0133ms
onnx_node!ResNet18/stack3_block2_2_PReLU/Neg_1   0.0106ms
onnx_node!ResNet18/stack3_block2_2_PReLU/Relu_1   0.0091ms
onnx_node!ResNet18/stack3_block2_2_PReLU/mul   1.7766ms
onnx_node!ResNet18/stack3_block2_2_PReLU/add   0.043ms
onnx_node!ResNet18/stack3_block2_2_conv/Conv2D   0.0751ms
onnx_node!ResNet18/stack4_block1_shortcut_conv/Conv2D   0.0758ms
onnx_node!ResNet18/stack4_block1_1_bn/FusedBatchNormV3   0.0153ms
onnx_node!ResNet18/stack4_block1_1_conv/Conv2D   0.048ms
onnx_node!ResNet18/stack4_block1_2_PReLU/Relu   0.0151ms
onnx_node!ResNet18/stack4_block1_2_PReLU/Neg_1   0.013ms
onnx_node!ResNet18/stack4_block1_2_PReLU/Relu_1   0.012ms
ResNet18/stack4_block1_2_PReLU/Neg:0   0.0176ms
onnx_node!ResNet18/stack4_block1_2_PReLU/mul   2.1163ms
onnx_node!ResNet18/stack4_block1_2_PReLU/add   0.0396ms
onnx_node!ResNet18/stack4_block1_2_conv/Conv2D   0.0751ms
onnx_node!ResNet18/stack4_block2_1_bn/FusedBatchNormV3   0.0121ms
onnx_node!ResNet18/stack4_block2_1_conv/Conv2D   0.0485ms
onnx_node!ResNet18/stack4_block2_2_PReLU/Relu   0.0158ms
onnx_node!ResNet18/stack4_block2_2_PReLU/Neg_1   0.013ms
onnx_node!ResNet18/stack4_block2_2_PReLU/Relu_1   0.0121ms
onnx_node!ResNet18/stack4_block2_2_PReLU/mul   2.0351ms
onnx_node!ResNet18/stack4_block2_2_PReLU/add   0.037ms
onnx_node!ResNet18/stack4_block2_2_conv/Conv2D   0.072ms
onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3   0.0142ms
onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3__210   0.0169ms
onnx_node!ResNet18/E_flatten/Reshape   0.0014ms
onnx_node!ResNet18/E_dense/MatMul   0.0445ms
ResNet18/E_batchnorm/ReadVariableOp_1:0   0.0165ms
onnx_node!ResNet18/pre_embedding/batchnorm/mul_1   0.0156ms
embedding   0.001ms
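
Reading these numbers, the PReLU mul entries alone account for roughly 21.0 ms of the 26.77 ms total, which is consistent with the broadcast Mul (presumably the slope-times-negative-part multiply of each decomposed PReLU) falling back to the CPU and dominating the inference time.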
