
Up to 50% longer inference time for ONNX model on the same hardware, compiled the same way. What can be the reason? #23223

Open · 4 tasks done
TrueWodzu opened this issue Feb 6, 2023 · 20 comments
Labels: bug · category: dnn · category: gpu/cuda (contrib) · needs reproducer

Comments

@TrueWodzu

TrueWodzu commented Feb 6, 2023

System Information

OpenCV version: 4.6.0
Operating System / Platform: Ubuntu 20.04 / Windows / WSL2 / Docker
Compiler & compiler version: GCC 9, 11; MSVC 2017, 2019; Python 3.10
CUDA: 11.4, 11.8

Detailed description

A word of preface:

I am observing up to 50% longer inference times in one environment compared to another. Both environments run on the same hardware, and OpenCV has been compiled the same way in both. What is interesting is that the behavior is consistent regardless of operating system.
For example: I wrote a C++ executable and ran it on a WSL2 image (Ubuntu 20.04/CUDA 11.8), where I get 25 FPS, and in a Docker image (Ubuntu 20.04/CUDA 11.8), where I compiled OpenCV exactly the same way, I get 15 FPS.

I've tested it also on Windows. This time I wrote a Python script. I have two Python environments; in one I get 25 FPS, in the other 15 FPS.

So the problem is not within my code, and it does not depend on the operating system. Each time I've used the same OpenCV version with the same set of options. I suspect that maybe during compilation of OpenCV something is sometimes compiled differently?

I've dug deeper and profiled OpenCV; here are the results:

ID name                                                                      count thr        t-min        t-max     t-median        t-avg        total        t-IPP   %     t-OpenCL   %
  1 cv::dnn::dnn4_v20220524::Net::forward#net.cpp:93 (faster environment)      241   1       34.539     1569.779       35.574       43.279    10430.354        0.000   0        0.000   0
  1 cv::dnn::dnn4_v20220524::Net::forward#net.cpp:93 (slower environment)      241   1       58.370     1596.572       60.892       67.375    16237.484        0.000   0        0.000   0

This is the major difference, so I know the problem lies within dnn. I looked at the source and discovered that I can time individual layers, so I did that.

Here are the top 20 worst layer times (in seconds) in the environment where inference is slower:

6.0574 :onnx_node!Slice_19
5.3924 :onnx_node!Slice_29
5.3768 :onnx_node!Slice_9
2.8331 :onnx_node!Slice_4
2.6924 :onnx_node!Slice_24
2.6597 :onnx_node!Slice_14
2.6464 :onnx_node!Slice_34
0.1466 :onnx_node!Concat_40
0.1194 :onnx_node!Slice_39
0.0923 :onnx_node!Mul_43
0.0827 :onnx_node!Mul_271
0.0702 :onnx_node!Mul_46
0.0666 :onnx_node!Concat_190
0.0564 :onnx_node!Mul_297
0.0534 :onnx_node!Mul_180
0.0517 :onnx_node!Mul_70
0.048 :onnx_node!Mul_125
0.0459 :onnx_node!Mul_52
0.045 :onnx_node!Mul_59
0.0449 :onnx_node!Mul_93
...
Total: 36.8912

And here are the times in the environment where it is faster:

5.7526 :onnx_node!Slice_19
3.2699 :onnx_node!Slice_29
3.2128 :onnx_node!Slice_9
2.785 :onnx_node!Slice_14
2.7736 :onnx_node!Slice_24
1.5678 :onnx_node!Slice_34
1.5364 :onnx_node!Slice_4
0.144 :onnx_node!Concat_40
0.1209 :onnx_node!Slice_39
0.0768 :onnx_node!Reshape_342
0.0626 :onnx_node!Mul_243
0.0615 :onnx_node!Mul_46
0.0584 :onnx_node!Mul_70
0.0562 :onnx_node!Reshape_361
0.0551 :onnx_node!Mul_180
0.0541 :onnx_node!Mul_297
0.0536 :onnx_node!Mul_43
0.0507 :onnx_node!Mul_125
0.0494 :onnx_node!Mul_52
0.0474 :onnx_node!Mul_271

...
Total: 27.696
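
(For anyone who wants to produce similar per-layer lists without touching the OpenCV source, a minimal sketch using the public cv::dnn::Net::getPerfProfile API should give comparable numbers; this is an illustration, not the exact code used above.)

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

// Print the slowest layers of a network that has already run forward().
void printSlowestLayers(cv::dnn::Net& net, size_t topN = 20)
{
    std::vector<double> timings;               // per-layer timings, in ticks
    net.getPerfProfile(timings);
    std::vector<cv::String> names = net.getLayerNames();

    std::vector<std::pair<double, cv::String> > byTime;
    for (size_t i = 0; i < names.size() && i < timings.size(); ++i)
        byTime.push_back(std::make_pair(timings[i] / cv::getTickFrequency(), names[i]));

    std::sort(byTime.rbegin(), byTime.rend()); // slowest first

    for (size_t i = 0; i < byTime.size() && i < topN; ++i)
        std::cout << byTime[i].first << " :" << byTime[i].second << std::endl;
}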

I am not sure whether this is a bug, but the drop in performance is quite serious, and it would be good to know what can cause it so that it can be documented.

Steps to reproduce

This happens for any YOLOv5 model converted to ONNX. The difference can be observed on any model; the bigger the model, the bigger the difference. On my machine I can reproduce this every time by installing a fresh WSL2 image and a fresh Docker image based on Ubuntu 20.04. However, as said earlier, Docker is not the problem here, nor is the operating system or the compiler.

Build settings:

cmake .. -D CMAKE_BUILD_TYPE=RELEASE \
    -D WITH_IPP=OFF \
    -D WITH_OPENGL=OFF \
    -D WITH_QT=OFF \
    -D CMAKE_INSTALL_PREFIX=/usr/local \
    -D OPENCV_EXTRA_MODULES_PATH=../contrib/modules \
    -D OPENCV_ENABLE_NONFREE=ON \
    -D WITH_JASPER=OFF \
    -D WITH_TBB=ON \
    -D BUILD_JPEG=ON \
    -D WITH_SIMD=ON \
    -D WITH_FFMPEG=ON \
    -D ENABLE_LIBJPEG_TURBO_SIMD=ON \
    -D BUILD_DOCS=OFF \
    -D BUILD_EXAMPLES=OFF \
    -D BUILD_TESTS=OFF \
    -D BUILD_PERF_TESTS=OFF \
    -D BUILD_opencv_java=NO \
    -D BUILD_opencv_python=NO \
    -D BUILD_opencv_python2=NO \
    -D BUILD_opencv_python3=NO \
    -D BUILD_CUDA_STUBS=ON \
    -D OPENCV_DNN_CUDA=ON \
    -D WITH_CUDA=ON \
    -D CUDA_ARCH_BIN=7.5 \
    -D WITH_GTK=ON \
    -D OPENCV_GENERATE_PKGCONFIG=ON

Benchmarking code:

#include <opencv2/opencv.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
  cv::Mat img(1080, 1920, CV_8UC3);
  cv::randu(img, cv::Scalar(0, 0, 0), cv::Scalar(255, 255, 255));

  cv::dnn::Net net = cv::dnn::readNet("/home/test/dev/yolov5m_based.onnx"); // Modify accordingly.
  net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
  net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

  cv::Mat blob;
  cv::dnn::blobFromImage(img, blob, 1. / 255., cv::Size(640, 640), cv::Scalar(), false, false);
  net.setInput(blob);
  std::vector<cv::Mat> outputs;

  // Don't measure this, GPU needs to warm up.
  for (int i = 0; i < 5; ++i)
    net.forward(outputs, "output0");

  auto c1 = cv::getTickCount();
  for (int i = 0; i < 200; ++i)
    net.forward(outputs, "output0");
  auto c2 = cv::getTickCount();

  // Total time for 200 forward passes, in milliseconds.
  std::cout << "TOTAL TIME: " << ((c2 - c1) / cv::getTickFrequency() * 1000) << std::endl;

  return 0;
}
  1. You must have an Nvidia GPU.
  2. Install a WSL2 image (Ubuntu 20.04).
  3. Install CUDA 11.8 and cuDNN 8.7 (don't install the driver; install the WSL-specific CUDA packages).
  4. Build OpenCV with the provided options (compiler does not matter).
  5. Build the benchmarking program (compiler does not matter).
  6. Run the program; on my machine TOTAL TIME: 7260.
  7. Install a Docker image based on Ubuntu 20.04 (you can do this from within WSL2!).
  8. Execute steps 3, 4, and 5 inside the container.
  9. Run the program; on my machine TOTAL TIME: 11061.

Of course there will be some fluctuation in the timings, but it will be very small.
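
(For context, TOTAL TIME is the time for 200 forward passes, in milliseconds, so 7260 ms / 200 ≈ 36.3 ms per frame ≈ 27.5 FPS, while 11061 ms / 200 ≈ 55.3 ms per frame ≈ 18 FPS, consistent with the 25 vs 15 FPS figures mentioned at the top.)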

You can obtain the model from here

Issue submission checklist

  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)
@TrueWodzu TrueWodzu added the bug label Feb 6, 2023
@dkurt
Member

dkurt commented Feb 7, 2023

Docker and WSL are not perfect candidates for performance benchmarking. Both lack full access to the system resources, so this is not an OpenCV issue.

@TrueWodzu
Author

TrueWodzu commented Feb 7, 2023

Docker and WSL are not perfect candidates for performance benchmarking. Both lack full access to the system resources, so this is not an OpenCV issue.

I guess you missed the part where I wrote "I've tested it also on Windows. This time I wrote a Python script. I have two Python environments; in one I get 25 FPS, in the other 15 FPS." Both times I've compiled the Python bindings via OpenCV the same way (with the same set of options).

So I've already excluded the operating system, compiler, language, OpenCV version, and CUDA version.

I will work on a reproducer; I just hoped that maybe someone has an idea what the issue could be.

@zihaomu
Member

zihaomu commented Feb 8, 2023

Hi @TrueWodzu, can you try to reproduce this issue with OpenCV 4.x branch?

@TrueWodzu
Author

TrueWodzu commented Feb 8, 2023

Hi @zihaomu, thank you for your interest. The same issue happens on the 4.x branch. I did some more tests: I took the compiled application that runs slower in Docker and moved it to WSL, and it ran faster under WSL.

I've measured GPU clocks and I can see that under Docker the GPU is less utilized while the application runs:
WSL:

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0      4     68      -     0      0      0      0    405    300
    0     24     68      -     0      0      0      0   5500   1005
    0     25     69      -     7      1      0      0   5500   1005
    0     52     71      -    28     11      0      0   5500   1665
    0     85     73      -    42     25      0      0   5500   1875
    0     70     74      -    39     25      0      0   5500   1860
    0     73     75      -    39     25      0      0   5500   1875
    0     78     75      -    39     25      0      0   5500   1875
    0     80     75      -    38     24      0      0   5500   1875
    0     73     76      -    38     24      0      0   5500   1875
    0     73     77      -    40     25      0      0   5500   1860
    0     81     77      -    38     24      0      0   5500   1860
    0     79     78      -    39     24      0      0   5500   1860
    0     46     76      -    26     16      0      0   5500   1860
    0     24     74      -     0      0      0      0   5000    630
    0     23     73      -     0      0      0      0   5000    390
    0      9     72      -     0      0      0      0    810    360
    0      4     72      -     0      0      0      0    405    300

Docker:

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0      4     65      -     0      0      0      0    405    300
    0     17     65      -     1      0      0      0   5500   1005
    0     24     66      -     7      1      0      0   5500   1005
    0     33     67      -    27     10      0      0   5500   1005
    0     34     67      -    37     16      0      0   5500   1005
    0     30     67      -    38     16      0      0   5500   1005
    0     30     68      -    36     15      0      0   5500   1005
    0     30     68      -    36     15      0      0   5500   1005
    0     32     68      -    37     15      0      0   5500    960
    0     31     68      -    38     15      0      0   5500    945
    0     29     68      -    38     15      0      0   5500    945
    0     31     68      -    38     15      0      0   5500    945
    0     29     69      -    39     16      0      0   5500    945
    0     30     69      -    38     15      0      0   5500    945
    0     29     69      -    39     15      0      0   5500    930
    0     29     69      -    39     15      0      0   5500    930
    0     29     69      -    39     15      0      0   5500    930
    0     29     70      -    39     15      0      0   5500    930
    0     31     70      -    39     15      0      0   5500    930
    0     23     69      -    13      5      0      0   5000    690
    0     23     69      -     0      0      0      0   5000    435
    0      9     68      -     0      0      0      0    810    360
    0      4     68      -     0      0      0      0    405    360
    0      4     68      -     0      0      0      0    405    300

See how on WSL the clock rises to 1875 MHz while on Docker it stays at 1005 MHz? However, I want to stress that in my opinion it is not a Docker issue: I've run Nvidia examples on both systems and their performance is exactly the same.
In my opinion the problem lies in some third-party library that OpenCV/ONNX uses to do the math. Can you give me a hint which libraries I should check?
I even ran ldd my_app, and it links to the same libraries (including versions) on both systems.
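
One more thing worth ruling out is a runtime mismatch in the CUDA/cuDNN libraries that the two environments actually load. A minimal check (a sketch of my own, assuming the CUDA and cuDNN development headers are available; link with -lcudart -lcudnn):

#include <cuda_runtime_api.h>
#include <cudnn.h>
#include <cstdio>

int main()
{
    int runtimeVer = 0, driverVer = 0;
    cudaRuntimeGetVersion(&runtimeVer); // CUDA runtime the app is linked against
    cudaDriverGetVersion(&driverVer);   // highest CUDA version the driver supports
    std::printf("CUDA runtime: %d, driver: %d, cuDNN: %zu\n",
                runtimeVer, driverVer, cudnnGetVersion());
    return 0;
}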

@atinfinity
Contributor

atinfinity commented Feb 8, 2023

@TrueWodzu

Question1

Could you share the result of the following script? (WSL2 and Docker)

import cv2
print(cv2.getBuildInformation())
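
If it helps, the same information is available from the C++ side with the equivalent call (cv::getBuildInformation() is the C++ counterpart of the Python function above):

#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    std::cout << cv::getBuildInformation() << std::endl;
    return 0;
}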

Question2

Could you share benchmark script and ONNX model?

@TrueWodzu
Author

@atinfinity

Answer 1

General configuration for OpenCV 4.6.0 =====================================
  Version control:               4.6.0

  Extra modules:
    Location (extra):            /home/wodzu/dev/libs/opencv/contrib/modules
    Version control (extra):     4.6.0

  Platform:
    Timestamp:                   2023-02-07T08:30:08Z
    Host:                        Linux 5.15.79.1-microsoft-standard-WSL2 x86_64
    CMake:                       3.16.3
    CMake generator:             Unix Makefiles
    CMake build tool:            /usr/bin/make
    Configuration:               RELEASE

  CPU/HW features:
    Baseline:                    SSE SSE2 SSE3
      requested:                 SSE3
    Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
      requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
      SSE4_1 (16 files):         + SSSE3 SSE4_1
      SSE4_2 (1 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
      FP16 (0 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
      AVX (4 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
      AVX2 (31 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
      AVX512_SKX (5 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX

  C/C++:
    Built as dynamic libs?:      YES
    C++ standard:                11
    C++ Compiler:                /usr/bin/c++  (ver 9.4.0)
    C++ flags (Release):         -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
    C++ flags (Debug):           -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
    C Compiler:                  /usr/bin/cc
    C flags (Release):           -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
    C flags (Debug):             -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
    Linker flags (Release):      -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined
    Linker flags (Debug):        -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:          m pthread cudart_static dl rt nppc nppial nppicc nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cudnn cufft -L/usr/local/cuda/lib64 -L/usr/lib/x86_64-linux-gnu
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 aruco barcode bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann freetype fuzzy gapi hfs highgui img_hash imgcodecs imgproc intensity_transform line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot quality rapid reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking video videoio videostab wechat_qrcode xfeatures2d ximgproc xobjdetect xphoto
    Disabled:                    world
    Disabled by dependency:      -
    Unavailable:                 alphamat cvv hdf java julia matlab ovis python2 python3 sfm ts viz
    Applications:                apps
    Documentation:               NO
    Non-free algorithms:         YES

  GUI:                           GTK3
    GTK+:                        YES (ver 3.24.20)
      GThread :                  YES (ver 2.64.6)
      GtkGlExt:                  NO
    VTK support:                 NO

  Media I/O:
    ZLib:                        /usr/lib/x86_64-linux-gnu/libz.so (ver 1.2.11)
    JPEG:                        build-libjpeg-turbo (ver 2.1.2-62)
    WEBP:                        build (ver encoder: 0x020f)
    PNG:                         /usr/lib/x86_64-linux-gnu/libpng.so (ver 1.6.37)
    TIFF:                        build (ver 42 - 4.2.0)
    JPEG 2000:                   build (ver 2.4.0)
    OpenEXR:                     build (ver 2.3.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DC1394:                      YES (2.2.6)
    FFMPEG:                      YES
      avcodec:                   YES (58.54.100)
      avformat:                  YES (58.29.100)
      avutil:                    YES (56.31.100)
      swscale:                   YES (5.5.100)
      avresample:                YES (4.0.0)
    GStreamer:                   NO
    v4l/v4l2:                    YES (linux/videodev2.h)

  Parallel framework:            TBB (ver 2020.1 interface 11101)

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    VA:                          NO
    Lapack:                      NO
    Eigen:                       NO
    Custom HAL:                  NO
    Protobuf:                    build (3.19.1)

  NVIDIA CUDA:                   YES (ver 11.8, CUFFT CUBLAS)
    NVIDIA GPU arch:             75
    NVIDIA PTX archs:

  cuDNN:                         YES (ver 8.7.0)

  OpenCL:                        YES (no extra features)
    Include path:                /home/wodzu/dev/libs/opencv/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python (for build):            /usr/bin/python3

  Java:
    ant:                         NO
    JNI:                         NO
    Java wrappers:               NO
    Java tests:                  NO

  Install to:                    /usr/local
-----------------------------------------------------------------

I won't be sending the second config because it is identical; I diffed them.

Answer 2

See the updated "Steps to reproduce" section.

@TrueWodzu
Author

@alalek Since I've completed the missing steps, is this issue now viable to look into?

@TrueWodzu
Author

TrueWodzu commented Feb 10, 2023

I was able to recreate this on a different machine. My laptop has an RTX 2060; I've also recreated this on a GTX 1650.

@atinfinity
Contributor

I could reproduce this on the same hardware.

Software

  • Ubuntu 22.04 (native)
    • OpenCV 4.7.0
  • Ubuntu 22.04 (docker container)
    • OpenCV 4.7.0

Hardware

  • Intel(R) Core(TM) i7-9800X CPU @ 3.80GHz
  • GeForce RTX 2080 Ti

Execution time (TrueWodzu's code)

Ubuntu 22.04 (native)

TOTAL TIME: 8330.61

Ubuntu 22.04 (docker container)

TOTAL TIME: 12189.9

@TrueWodzu
Author

@atinfinity Thank you so much for your time. Just to make sure, you did this without the Windows and WSL layers? You did this on a pure Ubuntu system?

@atinfinity
Contributor

I also checked inference of YOLOv4 (Darknet).

In this case, there is no difference in execution time between "Ubuntu 22.04 (native)" and "Ubuntu 22.04 (docker container)".

@atinfinity
Contributor

@TrueWodzu

Just to make sure, you did this without the Windows and WSL layers? You did this on a pure Ubuntu system?

I tried it on a pure Ubuntu system.

@atinfinity
Contributor

I tried the cuDNN sample (mnistCUDNN) on Ubuntu 22.04 (native) and Ubuntu 22.04 (docker container).
https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#verify

cp -r /usr/src/cudnn_samples_v8/ $HOME
cd $HOME/cudnn_samples_v8/mnistCUDNN
make
./mnistCUDNN

As a result, the CUDA kernel processing time in the container is slower than on Ubuntu 22.04 (native).
So I think this may be affected by Docker, and that this is not an OpenCV problem.

@TrueWodzu
Author

@atinfinity I've run the mnistCUDNN example but I don't see any time differences. Also, I think these tests are way too short to draw any conclusions.

@ZJDATY

ZJDATY commented Jul 19, 2023

@TrueWodzu The 50% performance degradation is very similar to the phenomenon in my issue.
#23911

@TrueWodzu
Author

@ZJDATY I've read your post, and I see you have put a lot of energy into finding/reproducing the problem. The thing is that in my case this happens on the GPU, while your case is on the CPU.

@ukoehler

ukoehler commented Aug 9, 2023

I just stumbled across an issue that might be very similar, or the reason. The same inference on 4.8.0 compared to 4.5.2 is sometimes slower. I traced it down to memory usage: 4.5.2 uses a maximum of 553.6 MB RAM, while 4.8.0 uses 1.89 GB RAM. Could anybody seeing the problem be swapping? Is the very high RAM usage a known issue?
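
(For anyone who wants to check whether they are hitting the same thing, a minimal way to read peak memory use on Linux; a sketch using getrusage, not necessarily how the numbers above were measured:)

#include <sys/resource.h>
#include <iostream>

int main()
{
    struct rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    // On Linux, ru_maxrss is the peak resident set size, in kilobytes.
    std::cout << "Peak RSS: " << usage.ru_maxrss / 1024.0 << " MB" << std::endl;
    return 0;
}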

@dkurt
Member

dkurt commented Aug 9, 2023

@ukoehler, please add a link to the model. If possible, provide per-layer timings using getPerfProfile on both 4.5.2 and 4.8.0. This might help determine which layer has a regression.

std::vector<double> timings;
net.getPerfProfile(timings);
std::vector<String> names = net.getLayerNames();
CV_Assert(names.size() == timings.size());

for (int i = 0; i < names.size(); ++i)
{
    Ptr<dnn::Layer> l = net.getLayer(net.getLayerId(names[i]));
    std::cout << names[i] << " " << l->type << " " << timings[i] << std::endl;
}
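
Note that getPerfProfile reports timings in ticks; divide by cv::getTickFrequency() to convert to seconds before comparing absolute values across runs (as in the sketch earlier in the thread).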

@ukoehler

ukoehler commented Aug 9, 2023

@dkurt Hi Dimitry, I created a new issue with the requested information:
#24134

@ukoehler

I just ran more tests and see an increase from 1.368 s on version 4.5.2 to 4.351 s on version 4.8.0.

This version is just collecting show-stopper bugs.

@asmorkalov asmorkalov added this to the 4.10.0 milestone Feb 7, 2024
@asmorkalov asmorkalov modified the milestones: 4.10.0, 4.11.0 Jun 3, 2024