CapeReviso Fork of tkDNN

This is a fork of tkDNN for the project CapeReviso.

Changes made

  • Added scripts, parameters and code to run it with IDS cameras as well as Logitech Brio cameras via systemd
  • Implemented continuous exposure correction
  • Calculation of privacy-friendly background images
  • JSON output with the Unix time of each frame
  • Queuing of images for optimal performance in a live setup
  • Recording ends on day change, so that there is one file per day
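
The per-frame Unix time and the day-change rule can be illustrated with a small sketch. This is not the fork's actual code: the function names and the JSON field names are hypothetical, and the day boundary is computed in UTC, whereas the fork may well use local time.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Whole days elapsed since the Unix epoch (UTC). Assumes a non-negative timestamp.
std::int64_t utc_day_index(std::int64_t unix_seconds) {
    return unix_seconds / 86400;  // 86400 seconds per day
}

// True when two frame timestamps fall on different UTC days,
// i.e. the point at which the current output file would be closed.
bool day_changed(std::int64_t prev_unix_seconds, std::int64_t curr_unix_seconds) {
    return utc_day_index(prev_unix_seconds) != utc_day_index(curr_unix_seconds);
}

// One JSON record per frame. The field names are hypothetical.
std::string frame_record(long long frame_id, double unix_time) {
    char buf[96];
    std::snprintf(buf, sizeof(buf), "{\"frame\": %lld, \"unix_time\": %.3f}",
                  frame_id, unix_time);
    return std::string(buf);
}
```

A recording loop would call day_changed with the previous and current frame timestamps and start a new file whenever it returns true.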


  • main - still on TensorRT 7 for old devices
  • tensorrt8-branch - current branch for newly installed devices
  • deepsort - prototype for adding deepsort for subsequent tracking


tkDNN is a Deep Neural Network library built with cuDNN and TensorRT primitives, designed specifically to work on NVIDIA Jetson boards. It has been tested on TK1 (branch cudnn2), TX1, TX2, AGX Xavier, Nano and several discrete GPUs. The main goal of this project is to exploit NVIDIA boards as much as possible to obtain the best inference performance. It does not allow training.

If you use tkDNN in your research, please cite the following paper. For use in commercial solutions, write at and or refer to .

Verucchi, Micaela, Gianluca Brilli, Davide Sapienza, Mattia Verasani, Marco Arena, Francesco Gatti, Alessandro Capotondi, Roberto Cavicchioli, Marko Bertogna, and Marco Solieri. "A Systematic Assessment of Embedded Neural Networks for Object Detection." 2020 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA).

What's new

20 July 2021

  • Support for semantic segmentation (README)
  • Support for 2D/3D object detection and tracking (README)

24 November 2021

  • Support for semantic segmentation on CUDA 11
  • Support for TensorRT8

TensorRT8 (and therefore Jetpack 4.6) is currently supported only on the branch tensort8, due to performance issues with TensorRT8. We will merge it into master as soon as those issues are fixed (probably in future minor releases).

FPS Results

Inference FPS of yolov4 with tkDNN, averaged over 1200 images with the same dimensions as the input size, on:

  • RTX 2080Ti (CUDA 10.2, TensorRT 7.0.0, cuDNN 7.6.5);
  • Xavier AGX, Jetpack 4.3 (CUDA 10.0, cuDNN 7.6.3, TensorRT 6.0.1);
  • Xavier NX, Jetpack 4.4 (CUDA 10.2, cuDNN 8.0.0, TensorRT 7.1.0);
  • Tx2, Jetpack 4.2 (CUDA 10.0, cuDNN 7.3.1, TensorRT 5.0.6);
  • Jetson Nano, Jetpack 4.4 (CUDA 10.2, cuDNN 8.0.0, TensorRT 7.1.0).

| Platform | Network | FP32, B=1 | FP32, B=4 | FP16, B=1 | FP16, B=4 | INT8, B=1 | INT8, B=4 |
|---|---|---|---|---|---|---|---|
| RTX 2080Ti | yolo4 320 | 118.59 | 237.31 | 207.81 | 443.32 | 262.37 | 530.93 |
| RTX 2080Ti | yolo4 416 | 104.81 | 162.86 | 169.06 | 293.78 | 206.93 | 353.26 |
| RTX 2080Ti | yolo4 512 | 92.98 | 132.43 | 140.36 | 215.17 | 165.35 | 254.96 |
| RTX 2080Ti | yolo4 608 | 63.77 | 81.53 | 111.39 | 152.89 | 127.79 | 184.72 |
| AGX Xavier | yolo4 320 | 26.78 | 32.05 | 57.14 | 79.05 | 73.15 | 97.56 |
| AGX Xavier | yolo4 416 | 19.96 | 21.52 | 41.01 | 49.00 | 50.81 | 60.61 |
| AGX Xavier | yolo4 512 | 16.58 | 16.98 | 31.12 | 33.84 | 37.82 | 41.28 |
| AGX Xavier | yolo4 608 | 9.45 | 10.13 | 21.92 | 23.36 | 27.05 | 28.93 |
| Xavier NX | yolo4 320 | 14.56 | 16.25 | 30.14 | 41.15 | 42.13 | 53.42 |
| Xavier NX | yolo4 416 | 10.02 | 10.60 | 22.43 | 25.59 | 29.08 | 32.94 |
| Xavier NX | yolo4 512 | 8.10 | 8.32 | 15.78 | 17.13 | 20.51 | 22.46 |
| Xavier NX | yolo4 608 | 5.26 | 5.18 | 11.54 | 12.06 | 15.09 | 15.82 |
| Tx2 | yolo4 320 | 11.18 | 12.07 | 15.32 | 16.31 | - | - |
| Tx2 | yolo4 416 | 7.30 | 7.58 | 9.45 | 9.90 | - | - |
| Tx2 | yolo4 512 | 5.96 | 5.95 | 7.22 | 7.23 | - | - |
| Tx2 | yolo4 608 | 3.63 | 3.65 | 4.67 | 4.70 | - | - |
| Nano | yolo4 320 | 4.23 | 4.55 | 6.14 | 6.53 | - | - |
| Nano | yolo4 416 | 2.88 | 3.00 | 3.90 | 4.04 | - | - |
| Nano | yolo4 512 | 2.32 | 2.34 | 3.02 | 3.04 | - | - |
| Nano | yolo4 608 | 1.40 | 1.41 | 1.92 | 1.93 | - | - |

MAP Results

Results for COCO val 2017 (5k images), on RTX 2080Ti, with conf threshold=0.001

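In the column headers, the first name is the evaluation tool and the second is the framework that produced the detections.

| Network | CodaLab, tkDNN: MAP(0.5:0.95) | CodaLab, tkDNN: AP50 | CodaLab, darknet: MAP(0.5:0.95) | CodaLab, darknet: AP50 | tkDNN map, tkDNN: MAP(0.5:0.95) | tkDNN map, tkDNN: AP50 |
|---|---|---|---|---|---|---|
| Yolov3 (416x416) | 0.381 | 0.675 | 0.380 | 0.675 | 0.372 | 0.663 |
| yolov4 (416x416) | 0.468 | 0.705 | 0.471 | 0.710 | 0.459 | 0.695 |
| yolov3tiny (416x416) | 0.096 | 0.202 | 0.096 | 0.201 | 0.093 | 0.198 |
| yolov4tiny (416x416) | 0.202 | 0.400 | 0.201 | 0.400 | 0.197 | 0.395 |
| Cnet-dla34 (512x512) | 0.366 | 0.543 | - | - | 0.361 | 0.535 |
| mv2SSD (512x512) | 0.226 | 0.381 | - | - | 0.223 | 0.378 |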



This branch works on every NVIDIA GPU that supports the following (latest tested) dependencies:

  • CUDA 11.3 (or >= 10.2)
  • cuDNN 8.2.1 (or >= 8.0.4)
  • TensorRT 8.0.3 (or >= 7.2)
  • OpenCV 4.5.4 (or >= 4)
  • cmake 3.21 (or >= 3.15)
  • yaml-cpp 0.5.2
  • eigen3 3.3.4
  • curl 7.58

sudo apt install libyaml-cpp-dev curl libeigen3-dev

About OpenCV

To compile and install OpenCV 4 with contrib, use the script below. It downloads and compiles OpenCV in the Download folder.

bash scripts/

If you have OpenCV compiled with CUDA and contrib and want to use it with tkDNN, pass the ENABLE_OPENCV_CUDA_CONTRIB=ON flag when compiling tkDNN. If the flag is not passed, the preprocessing of the networks is computed on the CPU, otherwise on the GPU. In the latter case, some milliseconds are saved in the end-to-end latency.

How to compile this repo

Build with cmake. On Ubuntu 18.04 a newer version of cmake is needed (3.15 or above). On both Linux and Windows, the CMAKE_BUILD_TYPE variable needs to be defined as either Release or Debug.

git clone
cd tkDNN
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -G"Ninja" ..
cmake --build .        # compile with the generator selected above


Steps needed to do inference on tkDNN with a custom neural network.

  • Build and train a NN model with your favorite framework.
  • Export the weights and biases of each layer and save them in binary files (one per layer).
  • Export the outputs of each layer and save them in binary files (one per layer).
  • Create a new test and define the network layer by layer, using the extracted weights, and use the exported outputs to check the results.
  • Do inference.
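
The export and check steps can be sketched as follows, assuming each layer is stored as raw float32 values with no header. That layout, the file naming and the helper names are assumptions for illustration. The export tools referenced in the next section define the actual format.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <fstream>
#include <limits>
#include <string>
#include <vector>

// Write one layer's values as raw float32 with no header (assumed layout).
bool write_layer_bin(const std::string &path, const std::vector<float> &values) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char *>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(float)));
    out.flush();
    return out.good();
}

// Read back exactly `count` float32 values. Returns an empty vector on failure.
std::vector<float> read_layer_bin(const std::string &path, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    std::vector<float> values(count);
    const std::streamsize bytes = static_cast<std::streamsize>(count * sizeof(float));
    in.read(reinterpret_cast<char *>(values.data()), bytes);
    if (in.gcount() != bytes) return {};
    return values;
}

// Largest absolute difference between two buffers, e.g. a computed layer output
// versus the exported reference output. Returns infinity if the sizes differ.
float max_abs_diff(const std::vector<float> &a, const std::vector<float> &b) {
    if (a.size() != b.size()) return std::numeric_limits<float>::infinity();
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

Comparing each layer's output against its exported reference with a small tolerance helps locate the first layer where a custom network definition diverges from the original model.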

Exporting weights

For specific details on how to export weights see HERE.

Darknet Parser

tkDNN implements an easy parser for darknet cfg files. A network can be converted with tk::dnn::darknetParser:

// example of parsing yolo4
tk::dnn::Network *net = tk::dnn::darknetParser("yolov4.cfg", "yolov4/layers", "coco.names");

All models from darknet are now parsed directly from the cfg file. You still need to export the weights with the tools described in the previous section.

Supported layers: convolutional, maxpool, avgpool, shortcut, upsample, route, reorg, region, yolo
Supported activations: relu, leaky, mish

Run the demo

This is an example using yolov4.

To run object detection, first create the .rt file by running:

rm yolo4_fp32.rt        # delete (or move) old TensorRT files first
./test_yolo4            # run the yolo test (slow: it creates the .rt file)

If the creation fails, enable TensorRT debug output to inspect the error:

cmake .. -DDEBUG=True

Once you have successfully created your rt file, run the demo:

./demo yolo4_fp32.rt ../demo/yolo_test.mp4 y

In general, the demo program takes the following parameters:

./demo <network-rt-file> <path-to-video> <kind-of-network> <number-of-classes> <n-batches> <show-flag> <save-flag> <http-stream-port>


  • <network-rt-file> is the rt file generated by a test.
  • <path-to-video> is the path to a video file or a camera input.
  • <kind-of-network> is the type of network. Three types are currently supported: y (YOLO family), c (CenterNet family) and m (MobileNet-SSD family).
  • <number-of-classes> is the number of classes the network is trained on.
  • <n-batches> is the number of batches to use in inference (N.B. you should first export TKDNN_BATCHSIZE to the required n_batches and create the rt file for the network again).
  • <show-flag>: if set to 0, the demo will not show the visualization.
  • <save-flag>: if set to 0, the results will be saved into result.mp4 (if n-batches == 1); if set to 1, the results will not be saved.
  • <http-stream-port>: if set to 0, the HTTP stream is off; if a port number is given (e.g. 8090), the results can be viewed at https://localhost:8090
  • <extyolo flag>: if set to 1, YOLO-like coordinates of the bounding boxes are printed on the terminal.
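
To make the positional order concrete, here is a minimal sketch of how the eight arguments in the usage line could be read. It is not the demo's actual source: the struct, the function names and all default values are assumptions for illustration, and the extyolo flag is left out because the usage line does not show its position.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical holder for the eight positional arguments of the usage line.
// The default values are assumptions, not the demo's real defaults.
struct DemoArgs {
    std::string rt_file;   // <network-rt-file>
    std::string input;     // <path-to-video>
    char net_type = 'y';   // <kind-of-network>: y, c or m
    int n_classes = 80;    // <number-of-classes>
    int n_batches = 1;     // <n-batches>
    int show_flag = 1;     // <show-flag>
    int save_flag = 1;     // <save-flag>
    int http_port = 0;     // <http-stream-port>, 0 means the stream is off
};

// The three network kinds named in the parameter list.
bool is_supported_net_type(char t) {
    return t == 'y' || t == 'c' || t == 'm';
}

// Fill the struct from argv, keeping the default for every argument not given.
DemoArgs parse_demo_args(int argc, const char *const argv[]) {
    DemoArgs a;
    if (argc > 1) a.rt_file = argv[1];
    if (argc > 2) a.input = argv[2];
    if (argc > 3) a.net_type = argv[3][0];
    if (argc > 4) a.n_classes = std::atoi(argv[4]);
    if (argc > 5) a.n_batches = std::atoi(argv[5]);
    if (argc > 6) a.show_flag = std::atoi(argv[6]);
    if (argc > 7) a.save_flag = std::atoi(argv[7]);
    if (argc > 8) a.http_port = std::atoi(argv[8]);
    return a;
}
```

With the invocation shown earlier (./demo yolo4_fp32.rt ../demo/yolo_test.mp4 y), only the first three fields come from the command line and the rest keep their assumed defaults.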

N.B. FP32 inference is used by default.

For specific details on how to run:

  • 2D object detection demos, details on FP16, INT8 and batching see HERE.
  • segmentation demos see HERE.
  • 2D/3D object detection and tracking demos see HERE.
  • mAP demo to evaluate 2D object detectors see HERE.


tkDNN on Windows 10 or Windows 11

For specific details on how to run tkDNN on Windows 10/11 see HERE.

Existing tests and supported networks

Numbers in square brackets refer to the references listed below the table.

| Test Name | Network | Dataset | N Classes | Input size | Weights |
|---|---|---|---|---|---|
| yolo | YOLO v2 [1] | COCO 2014 | 80 | 608x608 | weights |
| yolo_224 | YOLO v2 [1] | COCO 2014 | 80 | 224x224 | weights |
| yolo_berkeley | YOLO v2 [1] | BDD100K | 10 | 416x736 | weights |
| yolo_relu | YOLO v2 (with ReLU, not Leaky) [1] | COCO 2014 | 80 | 416x416 | weights |
| yolo_tiny | YOLO v2 tiny [1] | COCO 2014 | 80 | 416x416 | weights |
| yolo_voc | YOLO v2 [1] | VOC | 21 | 416x416 | weights |
| yolo3 | YOLO v3 [2] | COCO 2014 | 80 | 416x416 | weights |
| yolo3_512 | YOLO v3 [2] | COCO 2017 | 80 | 512x512 | weights |
| yolo3_berkeley | YOLO v3 [2] | BDD100K | 10 | 320x544 | weights |
| yolo3_coco4 | YOLO v3 [2] | COCO 2014 | 4 | 416x416 | weights |
| yolo3_flir | YOLO v3 [2] | FREE FLIR | 3 | 320x544 | weights |
| yolo3_tiny | YOLO v3 tiny [2] | COCO 2014 | 80 | 416x416 | weights |
| yolo3_tiny512 | YOLO v3 tiny [2] | COCO 2017 | 80 | 512x512 | weights |
| dla34 | Deep Layer Aggregation (DLA) 34 [3] | COCO 2014 | 80 | 224x224 | weights |
| dla34_cnet | Centernet (DLA34 backend) [4] | COCO 2017 | 80 | 512x512 | weights |
| mobilenetv2ssd | MobileNet v2 SSD Lite [5] | VOC | 21 | 300x300 | weights |
| mobilenetv2ssd512 | MobileNet v2 SSD Lite [5] | COCO 2017 | 81 | 512x512 | weights |
| resnet101 | Resnet 101 [6] | COCO 2014 | 80 | 224x224 | weights |
| resnet101_cnet | Centernet (Resnet101 backend) [4] | COCO 2017 | 80 | 512x512 | weights |
| csresnext50-panet-spp | Cross Stage Partial Network [7] | COCO 2014 | 80 | 416x416 | weights |
| yolo4 | Yolov4 [8] | COCO 2017 | 80 | 416x416 | weights |
| yolo4_320 | Yolov4 [8] | COCO 2017 | 80 | 320x320 | weights |
| yolo4_512 | Yolov4 [8] | COCO 2017 | 80 | 512x512 | weights |
| yolo4_608 | Yolov4 [8] | COCO 2017 | 80 | 608x608 | weights |
| yolo4_berkeley | Yolov4 [8] | BDD100K | 10 | 540x320 | weights |
| yolo4tiny | Yolov4 tiny [9] | COCO 2017 | 80 | 416x416 | weights |
| yolo4x | Yolov4x-mish [9] | COCO 2017 | 80 | 640x640 | weights |
| yolo4tiny_512 | Yolov4 tiny [9] | COCO 2017 | 80 | 512x512 | weights |
| yolo4x-cps | Scaled Yolov4 [10] | COCO 2017 | 80 | 512x512 | weights |
| shelfnet | ShelfNet18_realtime [11] | Cityscapes | 19 | 1024x1024 | weights |
| shelfnet_berkeley | ShelfNet18_realtime [11] | DeepDrive | 20 | 1024x1024 | weights |
| dla34_cnet3d | Centernet3D (DLA34 backend) [4] | KITTI 2017 | 1 | 512x512 | weights |
| dla34_ctrack | CenterTrack (DLA34 backend) [12] | NuScenes 3D | 7 | 512x512 | weights |


  1. Redmon, Joseph, and Ali Farhadi. "YOLO9000: better, faster, stronger." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
  2. Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).
  3. Yu, Fisher, et al. "Deep layer aggregation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
  4. Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. "Objects as points." arXiv preprint arXiv:1904.07850 (2019).
  5. Sandler, Mark, et al. "Mobilenetv2: Inverted residuals and linear bottlenecks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
  6. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
  7. Wang, Chien-Yao, et al. "CSPNet: A New Backbone that can Enhance Learning Capability of CNN." arXiv preprint arXiv:1911.11929 (2019).
  8. Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934 (2020).
  9. Bochkovskiy, Alexey, "Yolo v4, v3 and v2 for Windows and Linux" (
  10. Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "Scaled-YOLOv4: Scaling Cross Stage Partial Network." arXiv preprint arXiv:2011.08036 (2020).
  11. Zhuang, Juntang, et al. "ShelfNet for fast semantic segmentation." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2019.
  12. Zhou, Xingyi, Vladlen Koltun, and Philipp Krähenbühl. "Tracking objects as points." European Conference on Computer Vision. Springer, Cham, 2020.