# YOLOv5 Quantization Test
This notebook is a quick walkthrough of the YOLOv5 quantization steps and doubles as a small test. If you encounter any issues, the training session ([PPTX](https://drive.google.com/file/d/1kTAOcGxkmKZKY0jRbUbwm8ti-YkjIlgo/view?usp=sharing) | [Recording](https://drive.google.com/file/d/1cdjDiWaRXNEyJ_kLtNXHBesTqJA2pL3l/view?usp=sharing)) is a good reference.  
PTQ is mostly a review; the focus is on partial quantization and QAT.    
![quant_workflow](./data/images/quant_workflow.png)

# Setup  

**Please refer to [README.md/Setup](https://gitlab-master.nvidia.com/weihuaz/yolov5_quant_sample#setup) for installation. Before you launch the Jupyter notebook, complete the following steps.**  
1. Clone the [yolov5_quant_sample](https://gitlab-master.nvidia.com/weihuaz/yolov5_quant_sample).
2. Prepare the coco2017 dataset.
3. Download the YOLOv5s pretrained model.
4. Build and launch the docker container.

A ready-to-use copy is available on the server (10.23.206.202:/raid/quant_quiz_yolov5), so you can skip the setup steps if you use it.
You can now launch the Jupyter notebook inside the docker container; see [connect server's jupyter notebook](https://blog.csdn.net/Accepted_Lam/article/details/103837677). 

# Experiments

PTQ, QAT, and partial quantization are all implemented in this sample, so we can compare the accuracy and speed of the different methods. Let's start the quiz with PTQ (Post-Training Quantization). Run the following steps from the `yolov5_quant_sample` directory.

## 1. PTQ 

1)  `export.py` exports a PyTorch model to ONNX format.

In [None]:
!python models/export.py --weights ./weights/yolov5s.pt --img 640 --batch 1 --device 0

2) `onnx_to_trt.py` builds a TensorRT engine from an ONNX model file and saves it to the `weights` folder.
    You can choose the build precision (fp32/fp16/int8). 

   - Example 1: Build an int8 engine using TensorRT's native PTQ.   
   Notes: If you change the int8 calibrator settings, delete `trt/yolov5s_calibration.cache` first. Otherwise, the change may not take effect. The calibrator settings include the calibration batch size, the number of calibration batches, and the calibrator type.

In [None]:
!rm trt/yolov5s_calibration.cache
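
If the cache file does not exist yet, `rm` reports an error. As a minimal sketch (the helper name `clear_calibration_cache` is hypothetical and not part of the sample), the same cleanup can be done from Python in a way that is safe to rerun:

```python
from pathlib import Path


def clear_calibration_cache(cache_path="trt/yolov5s_calibration.cache"):
    """Delete the calibration cache if it exists.

    Returns True when a file was removed and False when there was nothing
    to remove, so the cell can be rerun without raising an error.
    """
    path = Path(cache_path)
    if path.is_file():
        path.unlink()
        return True
    return False
```

Calling `clear_calibration_cache()` before rebuilding the int8 engine has the same effect as the `rm` cell above.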

In [None]:
!python trt/onnx_to_trt.py --model ./weights/yolov5s.onnx --dtype int8 --batch-size 32 --num-calib-batch 16

   - Example 2: Build an fp16 engine (no int8 calibration is involved).

In [None]:
!python trt/onnx_to_trt.py --model ./weights/yolov5s.onnx --dtype fp16

3) Evaluate the accuracy of the TensorRT inference results, taking the post-PTQ int8 model as an example.  
Notes: Change the TensorRT engine name to match the output of the previous step.

In [None]:
!python trt/eval_yolo_trt.py --model ./weights/yolov5s-int8-32-16-minmax.trt -l

Please write down the current evaluation accuracy, which will serve as the baseline for our subsequent optimization.

### Test 1 
By default, we use `IInt8MinMaxCalibrator`. Change it to `IInt8EntropyCalibrator2` and see how the accuracy changes. 
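
To build some intuition for why the calibrator choice can matter, here is a toy sketch in plain Python. It is not TensorRT's algorithm: the min-max range mirrors the idea behind a min-max calibrator, while the 99.9th-percentile clip is only a simple stand-in for a range-clipping calibrator (the entropy calibrator chooses its range with a KL-divergence-based search instead). The activation values are synthetic.

```python
import random


def quantize_dequantize(values, amax, num_bits=8):
    """Symmetric int8-style fake quantization with dynamic range [-amax, amax]."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    scale = amax / qmax
    result = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax, min(qmax, q))  # clamp to the representable range
        result.append(q * scale)
    return result


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


random.seed(0)
# Synthetic activations: a bell-shaped bulk plus a single large outlier.
activations = [random.gauss(0.0, 1.0) for _ in range(10000)] + [40.0]

# Min-max style range: the outlier alone decides the scale.
amax_minmax = max(abs(v) for v in activations)

# Clipped range (stand-in): ignore the top 0.1% of magnitudes.
sorted_abs = sorted(abs(v) for v in activations)
amax_clipped = sorted_abs[int(0.999 * len(sorted_abs))]

err_minmax = mean_abs_error(activations, quantize_dequantize(activations, amax_minmax))
err_clipped = mean_abs_error(activations, quantize_dequantize(activations, amax_clipped))
print(f"min-max amax={amax_minmax:.2f}, mean abs error={err_minmax:.4f}")
print(f"clipped amax={amax_clipped:.2f}, mean abs error={err_clipped:.4f}")
```

On this synthetic data the clipped range gives a lower average error because the resolution is spent on the bulk of the values. Whether a real model behaves the same way depends on its activation distributions, which is what the test asks you to check.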

## 2. PTQ with Partial Quantization

`trt/onnx_to_trt_partialquant.py` aims to build a TensorRT engine with partial quantization.  

1) Get the onnx model with `export.py`.

In [None]:
!python models/export.py --weights ./weights/yolov5s.pt --img 640 --batch 1 --device 0

2) Simplify the onnx model and delete useless layers or nodes. 

In [None]:
!python -m onnxsim ./weights/yolov5s.onnx ./weights/yolov5s-simple.onnx

3) Choose the sensitive layers. This step requires some manual work; please refer to the code.   
    a) Print all the layer ids;   
    b) Use the ONNX model structure to choose the sensitive layers;  

In [None]:
!python trt/onnx_to_trt_partialquant.py --model ./weights/yolov5s-simple.onnx --dtype int8 --batch-size 32 --num-calib-batch 16

4) Evaluate the accuracy of the TensorRT inference results.  
Notes: Change the TensorRT engine name to match the output of the previous step.

In [None]:
!python trt/eval_yolo_trt.py --model ./weights/yolov5s-simple.trt -l

Normally, when you skip some quantization sensitive layers, you will see an improvement in accuracy.

### Test 2  
How do we configure the build so that some quantization-sensitive layers fall back to fp16?

## 3. Sensitivity Profile 
`yolo_quant_flow.py` is the main script for the QAT experiments.  
First we need to insert the QDQ nodes; then we can run the QAT-related experiments. Two files are modified to insert QDQ nodes (`models/common.py` and `models/yolo.py`). The main changes are:  
    a) Change `nn.Conv2d` to `quant_nn.QuantConv2d`;     
    b) Change `nn.MaxPool2d` to `quant_nn.QuantMaxPool2d`;  
  
We could use the `quant_modules.initialize()` function to replace the modules automatically. For efficient inference with TensorRT, however, quantizers must be inserted manually in the residual blocks. Please refer to [TensorRT OSS/tool/pytorch-quantization/Further optimization](https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/docs/source/tutorials/quant_resnet50.rst#further-optimization) for details. Since residual blocks are used in the backbone of yolov5s, we will discuss this in later steps.

### 1) Do Sensitivity Profile  
You can run the sensitivity profile by specifying the `--sensitivity` flag, which calls `build_sensitivity_profile()`.
The full analysis takes a long time, so please be patient. You can also skip this step; it does not affect the following steps.
![Sensitivity profile of yolov5s](./data/sensitivity%20profile%20of%20yolov5s.png)

In [None]:
!python yolo_quant_flow.py --data data/coco.yaml --cfg models/yolov5s.yaml --ckpt-path weights/yolov5s.pt --hyp data/hyp.qat.yaml --sensitivity
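
The idea behind a sensitivity profile can be sketched as a loop that quantizes one layer at a time and records how much the accuracy drops. The sketch below is an illustration of that idea only, not the code of `build_sensitivity_profile()`; the `evaluate` callback and the function name `build_profile` are hypothetical.

```python
def build_profile(layer_names, evaluate):
    """Quantize one layer at a time and record the resulting accuracy drop.

    `evaluate` is a hypothetical callback: it receives the set of layers
    that are quantized for this run and returns an accuracy metric
    (for example mAP), where higher is better.
    """
    baseline = evaluate(frozenset())  # nothing quantized
    profile = {}
    for name in layer_names:
        accuracy = evaluate(frozenset([name]))  # only this layer quantized
        profile[name] = baseline - accuracy
    return baseline, profile
```

Each call to `evaluate` is a full evaluation pass, so the cost grows linearly with the number of layers. That is why the complete analysis takes so long.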

### 2) Skip Sensitive Layers  
Add the `--skip-layers` flag, which calls `skip_sensitive_layers()`.  We will skip 4 quantization-sensitive layers based on the sensitivity profile.
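
As a rough sketch of how such a choice could be made from a profile, the snippet below ranks layers by accuracy drop and keeps the top 4. The layer names and numbers are made up for illustration, and `pick_layers_to_skip` is a hypothetical helper; the layers actually skipped by `skip_sensitive_layers()` are defined in the script.

```python
def pick_layers_to_skip(accuracy_drop, k=4):
    """Return the k layer names with the largest accuracy drop."""
    ranked = sorted(accuracy_drop, key=accuracy_drop.get, reverse=True)
    return ranked[:k]


# Made-up profile: accuracy drop when only that layer is quantized.
example_drop = {
    "layer_a": 0.002,
    "layer_b": 0.031,
    "layer_c": 0.001,
    "layer_d": 0.018,
    "layer_e": 0.025,
    "layer_f": 0.009,
}
layers_to_skip = pick_layers_to_skip(example_drop, k=4)
print(layers_to_skip)
```

Skipping more layers usually recovers more accuracy but leaves less of the network in int8, so `k` is a trade-off between accuracy and throughput.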

## 4. QAT Finetuning and Deployment

`yolo_quant_flow.py` is the main script for the QAT experiment. See the code comments for details.   
Run the script as below. It performs QDQ insertion, calibration, QAT fine-tuning, and evaluation.  

   - 1) QAT fine-tuning  
   If a CUDA out-of-memory error is reported, try a GPU with more memory or decrease the batch size.   
   QAT fine-tuning takes a long time; you can skip this step and download the [post-QAT model](https://drive.google.com/file/d/1Q1u81E0yLVrwHgazTN-l38ZyEFL78ggz/view?usp=sharing) directly.     
 

In [None]:
!python yolo_quant_flow.py --data data/coco.yaml --cfg models/yolov5s.yaml --ckpt-path weights/yolov5s.pt --hyp data/hyp.qat.yaml --skip-layers

   - 2) Build TensorRT engine

In [None]:
!python trt/onnx_to_trt.py --model ./weights/yolov5s-qat.onnx --dtype int8 --qat

   - 3) Evaluate the accuracy of the TensorRT engine  
   Notes: Change the TensorRT engine name to match the output of the previous step.

In [None]:
!python trt/eval_yolo_trt.py --model ./weights/yolov5s-qat.trt -l

### Test 3  
Do not skip the quantization-sensitive layers; quantize all the convolution layers and see how the accuracy changes.

## 5. Dynamic Shape Support(Optional)

We can export the model with dynamic shapes, deferring the specification of some or all tensor dimensions until runtime. The inference shape can then be adjusted at runtime.

   - 1) Export to ONNX with dynamic shape support(with `--dynamic`)

In [None]:
!python models/export.py --weights ./weights/yolov5s.pt --img 640 --dynamic --device 0 

   - 2) Build the TensorRT engine with dynamic shape support. The fp16 model is used as an example; the same applies to post-QAT models. 

In [None]:
!python trt/onnx_to_trt.py --model ./weights/yolov5s.onnx --dtype fp16 --dynamic-shape

   - 3) Specify the inference shape and evaluate the engine  

In [None]:
!python trt/trt_dynamic/eval_yolo_trt_dynamic.py --model weights/yolov5s.trt -l 
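
Building a dynamic-shape engine in TensorRT requires an optimization profile that gives the minimum, optimal, and maximum shape of each dynamic input. The sketch below shows the general pattern. It is not taken from `onnx_to_trt.py`: the helper name, the input name `images`, and the three shapes are assumptions for illustration. `builder` and `config` are expected to be a `tensorrt.Builder` and its builder config, passed in so the snippet itself does not import TensorRT.

```python
def add_dynamic_profile(builder, config, input_name="images",
                        min_shape=(1, 3, 320, 320),
                        opt_shape=(1, 3, 640, 640),
                        max_shape=(1, 3, 1280, 1280)):
    """Register min/opt/max shapes for one dynamic input.

    The input name and the shapes are placeholders; check the exported
    ONNX model for the real input name and pick shapes that match the
    deployment.
    """
    for low, opt, high in zip(min_shape, opt_shape, max_shape):
        if not low <= opt <= high:
            raise ValueError("each dimension must satisfy min <= opt <= max")
    profile = builder.create_optimization_profile()
    profile.set_shape(input_name, min_shape, opt_shape, max_shape)
    config.add_optimization_profile(profile)
    return profile
```

At inference time, any input shape between the minimum and the maximum is accepted. Kernels are tuned for the optimal shape, so it should be the shape used most often.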

## 6. Further Optimization (Improve QAT Throughput)  

Residual blocks are used in the backbone of yolov5s, and TensorRT has extra runtime optimizations for the residual add. To maximize QAT throughput, it is recommended to add extra quantizers to the `BasicBlock` and `Bottleneck` when inserting QDQ nodes. Please refer to [TensorRT OSS/tool/pytorch-quantization/Further optimization](https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/docs/source/tutorials/quant_resnet50.rst#further-optimization) for details. It is also highly recommended to read the [Q/DQ Layer-Placement Recommendations](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qdq-placement-recs) part of the `TensorRT Developer Guide` before you start.  

### Test 4  
How do we add the extra quantizers to the residual block for efficient inference? Try to modify the code, export a TensorRT engine, and compare the throughput with `trtexec`.

# Answers  

### Test 1

Open the file 'trt/calibrator.py' and change lines 24~26 to:   

```
class Calibrator(trt.IInt8EntropyCalibrator2):  
    def __init__(self, stream, cache_file=""):  
        trt.IInt8EntropyCalibrator2.__init__(self)  
```

### Test 2

One method for your reference.  
1) Uncomment lines 152~158 in 'trt/onnx_to_trt_partialquant.py' to print the layer names and their ids.  
2) Use Netron to inspect the ONNX model and confirm the names of the layers you want to skip.  
3) Modify line 228 in 'trt/onnx_to_trt_partialquant.py' and set `fp16_lay_ids` to the layers you chose.  
You may need to adjust the settings several times to meet your accuracy and throughput requirements. 
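
For reference, pinning layers to fp16 through the TensorRT network API usually looks like the sketch below. This is an illustration under assumptions, not the code in 'trt/onnx_to_trt_partialquant.py': the helper name `fall_back_to_fp16` is hypothetical, `network` is expected to behave like a `tensorrt.INetworkDefinition`, and `fp16_dtype` like `tensorrt.float16`. Both are passed in so the snippet does not import TensorRT itself.

```python
def fall_back_to_fp16(network, layer_ids, fp16_dtype):
    """Pin the listed layers and their outputs to fp16.

    Returns the names of the pinned layers so they can be compared with
    the names seen in Netron.
    """
    pinned = []
    for idx in layer_ids:
        if not 0 <= idx < network.num_layers:
            raise IndexError(f"layer id {idx} is out of range")
        layer = network.get_layer(idx)
        layer.precision = fp16_dtype  # compute precision of the layer
        for out_idx in range(layer.num_outputs):
            layer.set_output_type(out_idx, fp16_dtype)  # output tensor type
        pinned.append(layer.name)
    return pinned
```

For these constraints to be honored, the builder config typically also needs the FP16 flag and a precision-constraint flag (`STRICT_TYPES` in TensorRT 8.0, `OBEY_PRECISION_CONSTRAINTS` in later releases). Check which flags the script already sets before adding them.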

### Test 3

Without the `--skip-layers` flag, all convolution blocks are quantized. You will then see the evaluation accuracy after calibration. 

In [None]:
!python yolo_quant_flow.py --data data/coco.yaml --cfg models/yolov5s.yaml --ckpt-path weights/yolov5s.pt --hyp data/hyp.qat.yaml

### Test 4

Modify the `Bottleneck` in `models/common.py`. We can insert extra quantization/dequantization nodes as below.

```python
class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

        # Added by maggie for QDQ debugging in order to improve throughput
        if self.add:
            self.residual_quantizer_1 = quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input)
            self.residual_quantizer_2 = quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input)

    def forward(self, x):
        out = self.cv2(self.cv1(x))
        if not self.add:
            return out
        # Compatible with the PTQ path: models built without the extra
        # residual quantizers fall back to the plain residual add.
        if hasattr(self, 'residual_quantizer_1'):
            return self.residual_quantizer_1(x) + self.residual_quantizer_2(out)
        return x + out

```

During the optimization, the following commands are needed. You can compare the results before and after the extra QDQ insertion. 

Insert the QDQ nodes, run calibration, and export to an ONNX file.

In [None]:
!python yolo_quant_flow.py --data data/coco.yaml --cfg models/yolov5s.yaml --ckpt-path weights/yolov5s.pt --hyp data/hyp.qat.yaml --num-finetune-epochs=0 --skip-eval-accuracy

Rename the ONNX file (to distinguish it from other models).

In [None]:
!mv weights/yolov5s.onnx ./weights/yolov5s_with_residual_quant.onnx

Export to TensorRT engine.

In [None]:
!python trt/onnx_to_trt.py --model ./weights/yolov5s_with_residual_quant.onnx --dtype int8 --qat --verbose

Test the throughput with trtexec.

In [None]:
!trtexec --loadEngine=./weights/yolov5s_with_residual_quant.trt

Check the ONNX file with Netron and you will see the extra QDQ nodes, as shown below. Then we can test the throughput with `trtexec`.
![further_optimization](./data/images/further_optimization.png)


From the engine layer information, we can see that some additional `Reformat` and `Scale` operations are gone (validated on A30).  
![tactic_selection](./data/images/tactic_selection.png)

# Notes

In practice, there are some common bugs around new features (such as QAT and mixed precision), which may frustrate customers. The TensorRT team is working on them, but fixes will take time.   
If you encounter problems, please contact us (reduced-precision-SA-vteam <reduced-precision-SA-vteam@exchange.nvidia.com>).
Some known TensorRT bugs that were not fixed in nvcr.io/nvidia/tensorrt:21.09-py3 are listed below.  
1. [200778538](https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/200778538) [Alibaba Cloud] PTQ accuracy drop caused by the failed fallback of some sensitive layers to FP16 on A10.    
2. [200774263](https://nvbugswb.nvidia.com/NVBugs5/redir.aspx?url=/200774263) TensorRT 8 cannot reproduce the same accuracy as onnxruntime.