
YOLOv6 low FPS on .trt model with FP16 #22

Closed
SHIVAM3052 opened this issue Aug 2, 2022 · 20 comments

@SHIVAM3052 commented Aug 2, 2022

Using the export.py file, I am able to convert my model from .onnx to .trt. While using the trt.py file, CUDA memory utilization is only 354 MB and I am getting only 23-27 FPS. Can you help me resolve this issue to increase my FPS?
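For reference, a minimal sketch of the FP16 conversion step using NVIDIA's trtexec tool (file names here are placeholders; the repo's export.py may take different options):

```bash
trtexec --onnx=yolov6s.onnx --saveEngine=yolov6_3.trt --fp16
```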

@Linaom1214 (Owner)

Please select a smaller model; the current model is too big.

@SHIVAM3052 (Author)

I trained my model on yolov6s.pt.

CUDA version 11.4 and cuDNN 8.2.4.

@Linaom1214 (Owner)

> I trained my model on yolov6s.pt. CUDA version 11.4 and cuDNN 8.2.4.

Can you show your code, so I can see more?

@SHIVAM3052 (Author)

Which Python script do I have to share? E.g., trt.py or train.py?

@Linaom1214 (Owner)

> Which Python script do I have to share? E.g., trt.py or train.py?

trt.py

@SHIVAM3052 (Author)

```python
# trt.py

import sys
sys.path.append('../')
from trt_utils import preproc, vis
from trt_utils import BaseEngine
import numpy as np
import cv2
import time
import os

class Predictor(BaseEngine):
    def __init__(self, engine_path, imgsz=(640, 640)):
        super(Predictor, self).__init__(engine_path)
        self.imgsz = imgsz
        self.n_classes = 1
        self.class_names = ['license']

if __name__ == '__main__':
    pred = Predictor(engine_path='./yolov6_3.trt')
    # img_path = '../src/3.jpg'
    # origin_img = pred.inference(img_path)
    # cv2.imwrite("%s_yolov6.jpg" % os.path.splitext(
    #     os.path.split(img_path)[-1])[0], origin_img)
    pred.detect_video('./720.mp4')  # pass 0 to use a webcam
    # pred.detect_video(0)

    pred.get_fps()
```

@Linaom1214 (Owner) commented Aug 2, 2022

> *(quoted trt.py from the previous comment)*

Is the output of the pred.get_fps() function also 23-27 FPS?
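For a steadier number than a single timed call, the measurement could average over many runs; a hypothetical helper along these lines:

```python
import time
import numpy as np

def measure_fps(pred, iters=200):
    """Hypothetical helper: average pure-inference FPS over many runs."""
    img = np.ones((1, 3, 640, 640), dtype=np.float32)  # dummy engine-sized input
    for _ in range(20):  # warmup
        pred.infer(img)
    t0 = time.perf_counter()
    for _ in range(iters):
        pred.infer(img)
    return iters / (time.perf_counter() - t0)
```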

@SHIVAM3052 (Author)

```python
# trt_utils.py

import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np
import cv2
import time

class BaseEngine(object):
    def __init__(self, engine_path, imgsz=(640, 640)):
        self.imgsz = imgsz
        self.mean = None
        self.std = None
        self.n_classes = 1
        self.class_names = ['license']

        # Deserialize the engine and allocate all buffers it requires,
        # i.e. host/device inputs/outputs.
        logger = trt.Logger(trt.Logger.WARNING)
        trt.init_libnvinfer_plugins(logger, '')
        runtime = trt.Runtime(logger)
        with open(engine_path, "rb") as f:
            serialized_engine = f.read()
        engine = runtime.deserialize_cuda_engine(serialized_engine)
        self.context = engine.create_execution_context()
        self.inputs, self.outputs, self.bindings = [], [], []
        self.stream = cuda.Stream()
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host (page-locked) and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            self.bindings.append(int(device_mem))
            if engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def infer(self, img):
        self.inputs[0]['host'] = np.ravel(img)
        # transfer input data to the GPU
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)
        # run inference
        self.context.execute_async_v2(
            bindings=self.bindings,
            stream_handle=self.stream.handle)
        # fetch outputs from the GPU
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)
        # synchronize the stream
        self.stream.synchronize()

        data = [out['host'] for out in self.outputs]
        return data

    def detect_video(self, video_path):
        start_time = time.time()
        cap = cv2.VideoCapture(video_path)
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            # per-frame FPS from the time between consecutive frames
            end_time = time.time()
            fps = 1 / (end_time - start_time)
            start_time = end_time
            fps_text = "FPS : {:.2f}".format(fps)
            cv2.putText(frame, fps_text, (5, 30), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 255, 255), 1)
            blob, ratio = preproc(frame, self.imgsz, self.mean, self.std)
            data = self.infer(blob)
            predictions = np.reshape(data, (1, -1, int(5 + self.n_classes)))[0]
            dets = self.postprocess(predictions, ratio)
            if dets is not None:
                final_boxes, final_scores, final_cls_inds = dets[:, :4], dets[:, 4], dets[:, 5]
                frame = vis(frame, final_boxes, final_scores, final_cls_inds,
                            conf=0.5, class_names=self.class_names)
            cv2.imshow('frame', frame)
            if cv2.waitKey(25) & 0xFF == ord('q'):
                break
        cap.release()
        cv2.destroyAllWindows()

    def inference(self, img_path, conf=0.5):
        origin_img = cv2.imread(img_path)
        img, ratio = preproc(origin_img, self.imgsz, self.mean, self.std)
        data = self.infer(img)
        predictions = np.reshape(data, (1, -1, int(5 + self.n_classes)))[0]
        dets = self.postprocess(predictions, ratio)
        if dets is not None:
            final_boxes, final_scores, final_cls_inds = dets[:, :4], dets[:, 4], dets[:, 5]
            origin_img = vis(origin_img, final_boxes, final_scores, final_cls_inds,
                             conf=conf, class_names=self.class_names)
        return origin_img

    @staticmethod
    def postprocess(predictions, ratio):
        boxes = predictions[:, :4]
        scores = predictions[:, 4:5] * predictions[:, 5:]
        boxes_xyxy = np.ones_like(boxes)
        # convert (cx, cy, w, h) to (x1, y1, x2, y2)
        boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2.
        boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2.
        boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2.
        boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2.
        boxes_xyxy /= ratio
        dets = multiclass_nms(boxes_xyxy, scores, nms_thr=0.45, score_thr=0.1)
        return dets

    def get_fps(self):
        # warm up, then time a single inference
        img = np.ones((1, 3, self.imgsz[0], self.imgsz[1]))
        img = np.ascontiguousarray(img, dtype=np.float32)
        for _ in range(20):
            _ = self.infer(img)
        t1 = time.perf_counter()
        _ = self.infer(img)
        print(1 / (time.perf_counter() - t1), 'FPS')

def nms(boxes, scores, nms_thr):
    """Single class NMS implemented in Numpy."""
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]

    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        ovr = inter / (areas[i] + areas[order[1:]] - inter)

        inds = np.where(ovr <= nms_thr)[0]
        order = order[inds + 1]

    return keep

def multiclass_nms(boxes, scores, nms_thr, score_thr):
    """Multiclass NMS implemented in Numpy."""
    final_dets = []
    num_classes = scores.shape[1]
    for cls_ind in range(num_classes):
        cls_scores = scores[:, cls_ind]
        valid_score_mask = cls_scores > score_thr
        if valid_score_mask.sum() == 0:
            continue
        valid_scores = cls_scores[valid_score_mask]
        valid_boxes = boxes[valid_score_mask]
        keep = nms(valid_boxes, valid_scores, nms_thr)
        if len(keep) > 0:
            cls_inds = np.ones((len(keep), 1)) * cls_ind
            dets = np.concatenate(
                [valid_boxes[keep], valid_scores[keep, None], cls_inds], 1
            )
            final_dets.append(dets)
    if len(final_dets) == 0:
        return None
    return np.concatenate(final_dets, 0)

def preproc(image, input_size, mean, std, swap=(2, 0, 1)):
    if len(image.shape) == 3:
        padded_img = np.ones((input_size[0], input_size[1], 3)) * 114.0
    else:
        padded_img = np.ones(input_size) * 114.0
    img = np.array(image)
    r = min(input_size[0] / img.shape[0], input_size[1] / img.shape[1])
    resized_img = cv2.resize(
        img,
        (int(img.shape[1] * r), int(img.shape[0] * r)),
        interpolation=cv2.INTER_LINEAR,
    ).astype(np.float32)
    padded_img[: int(img.shape[0] * r), : int(img.shape[1] * r)] = resized_img

    padded_img = padded_img[:, :, ::-1]  # BGR -> RGB
    padded_img /= 255.0
    if mean is not None:
        padded_img -= mean
    if std is not None:
        padded_img /= std
    padded_img = padded_img.transpose(swap)  # HWC -> CHW
    padded_img = np.ascontiguousarray(padded_img, dtype=np.float32)
    return padded_img, r

_COLORS = np.array(
[
0.000, 0.447, 0.741,
0.850, 0.325, 0.098,
0.929, 0.694, 0.125,
0.494, 0.184, 0.556,
0.466, 0.674, 0.188,
0.301, 0.745, 0.933,
0.635, 0.078, 0.184,
0.300, 0.300, 0.300,
0.600, 0.600, 0.600,
1.000, 0.000, 0.000,
1.000, 0.500, 0.000,
0.749, 0.749, 0.000,
0.000, 1.000, 0.000,
0.000, 0.000, 1.000,
0.667, 0.000, 1.000,
0.333, 0.333, 0.000,
0.333, 0.667, 0.000,
0.333, 1.000, 0.000,
0.667, 0.333, 0.000,
0.667, 0.667, 0.000,
0.667, 1.000, 0.000,
1.000, 0.333, 0.000,
1.000, 0.667, 0.000,
1.000, 1.000, 0.000,
0.000, 0.333, 0.500,
0.000, 0.667, 0.500,
0.000, 1.000, 0.500,
0.333, 0.000, 0.500,
0.333, 0.333, 0.500,
0.333, 0.667, 0.500,
0.333, 1.000, 0.500,
0.667, 0.000, 0.500,
0.667, 0.333, 0.500,
0.667, 0.667, 0.500,
0.667, 1.000, 0.500,
1.000, 0.000, 0.500,
1.000, 0.333, 0.500,
1.000, 0.667, 0.500,
1.000, 1.000, 0.500,
0.000, 0.333, 1.000,
0.000, 0.667, 1.000,
0.000, 1.000, 1.000,
0.333, 0.000, 1.000,
0.333, 0.333, 1.000,
0.333, 0.667, 1.000,
0.333, 1.000, 1.000,
0.667, 0.000, 1.000,
0.667, 0.333, 1.000,
0.667, 0.667, 1.000,
0.667, 1.000, 1.000,
1.000, 0.000, 1.000,
1.000, 0.333, 1.000,
1.000, 0.667, 1.000,
0.333, 0.000, 0.000,
0.500, 0.000, 0.000,
0.667, 0.000, 0.000,
0.833, 0.000, 0.000,
1.000, 0.000, 0.000,
0.000, 0.167, 0.000,
0.000, 0.333, 0.000,
0.000, 0.500, 0.000,
0.000, 0.667, 0.000,
0.000, 0.833, 0.000,
0.000, 1.000, 0.000,
0.000, 0.000, 0.167,
0.000, 0.000, 0.333,
0.000, 0.000, 0.500,
0.000, 0.000, 0.667,
0.000, 0.000, 0.833,
0.000, 0.000, 1.000,
0.000, 0.000, 0.000,
0.143, 0.143, 0.143,
0.286, 0.286, 0.286,
0.429, 0.429, 0.429,
0.571, 0.571, 0.571,
0.714, 0.714, 0.714,
0.857, 0.857, 0.857,
0.000, 0.447, 0.741,
0.314, 0.717, 0.741,
0.50, 0.5, 0
]
).astype(np.float32).reshape(-1, 3)

def vis(img, boxes, scores, cls_ids, conf=0.5, class_names=None):
    for i in range(len(boxes)):
        box = boxes[i]
        cls_id = int(cls_ids[i])
        score = scores[i]
        if score < conf:
            continue
        x0 = int(box[0])
        y0 = int(box[1])
        x1 = int(box[2])
        y1 = int(box[3])

        color = (_COLORS[cls_id] * 255).astype(np.uint8).tolist()
        text = '{}:{:.1f}%'.format(class_names[cls_id], score * 100)
        txt_color = (0, 0, 0) if np.mean(_COLORS[cls_id]) > 0.5 else (255, 255, 255)
        font = cv2.FONT_HERSHEY_SIMPLEX

        txt_size = cv2.getTextSize(text, font, 0.4, 1)[0]
        cv2.rectangle(img, (x0, y0), (x1, y1), color, 2)

        txt_bk_color = (_COLORS[cls_id] * 255 * 0.7).astype(np.uint8).tolist()
        cv2.rectangle(
            img,
            (x0, y0 + 1),
            (x0 + txt_size[0] + 1, y0 + int(1.5 * txt_size[1])),
            txt_bk_color,
            -1
        )
        cv2.putText(img, text, (x0, y0 + txt_size[1]), font, 0.4, txt_color, thickness=1)

    return img
```

@SHIVAM3052 (Author)

From the detect_video(self, video_path) function I am getting only 23-27 FPS.

pred.get_fps() gives more than 500 FPS overall for the video.

@Linaom1214 (Owner)

> From the detect_video(self, video_path) function I am getting only 23-27 FPS.
>
> pred.get_fps() gives more than 500 FPS overall for the video.

What is the result of the get_fps() function?

@Linaom1214 (Owner)

Is your platform a 1080 Ti?

@SHIVAM3052 (Author)

I am using an RTX 3070 8 GB.

@Linaom1214 (Owner)

> Sharing screenshot *(Screenshot from 2022-08-02 16-35-47: 515 FPS)*
>
> I am using an RTX 3070 8 GB.

The get_fps() result is right. I think detect_video may need a warmup; this is only a simple demo, and I will update the code later!

You can also use your camera to test the model.
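A minimal sketch of such a warmup, placed just before the capture loop in detect_video (assumption: a handful of dummy inferences is enough to absorb one-time CUDA/TensorRT initialization):

```python
# Run a few throwaway inferences before timing the first video frames,
# so one-time CUDA/TensorRT initialization does not drag down the FPS readout.
dummy = np.ascontiguousarray(
    np.ones((1, 3, self.imgsz[0], self.imgsz[1])), dtype=np.float32)
for _ in range(10):
    _ = self.infer(dummy)
```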

@Linaom1214 (Owner)

I think you can set the waitKey delay smaller:
https://github.com/Linaom1214/tensorrt-python/blob/08bb095c022044e254361db18a2f4f4040976c23/utils/utils.py#L78

You can try setting it to 1.
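That is, the display wait in detect_video would change from 25 ms to 1 ms per frame:

```python
# Wait at most 1 ms for a key press instead of 25 ms; the 25 ms wait
# alone caps the display loop at roughly 40 FPS.
if cv2.waitKey(1) & 0xFF == ord('q'):
    break
```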

@SHIVAM3052 (Author)

Sure, I am eagerly waiting for the code update for the camera test.

@Linaom1214 (Owner)

> Sure, I am eagerly waiting for the code update for the camera test.

Try this solution: #22 (comment)

@SHIVAM3052 (Author)

Thanks for the solution.

FPS increased to 90-98.

CUDA memory utilization is still 354 MB.

Is there any solution for utilizing the full GPU/CUDA?

@SHIVAM3052 (Author)

So that I can increase my FPS further.

@Linaom1214 (Owner)

> Thanks for the solution. FPS increased to 90-98. CUDA memory utilization is still 354 MB. Is there any solution for utilizing the full GPU/CUDA?

OpenCV frame updates are slower than the inference itself; in fact, your model may run faster than the displayed FPS suggests. Using a larger batch size may increase GPU usage.
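A rough sketch of the idea, assuming hypothetically that the engine is re-exported with a static batch size of 4 (BaseEngine sizes its buffers from the engine bindings, so they would grow to match):

```python
import cv2
import numpy as np
from trt_utils import preproc

def detect_video_batched(pred, video_path, batch_size=4):
    """Hypothetical batched variant of detect_video.

    Assumes the .trt engine was rebuilt with a static batch dimension
    of `batch_size`.
    """
    cap = cv2.VideoCapture(video_path)
    frames, blobs = [], []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        blob, ratio = preproc(frame, pred.imgsz, pred.mean, pred.std)
        frames.append(frame)
        blobs.append(blob)
        if len(blobs) == batch_size:
            # One engine call for batch_size frames: shape (N, 3, H, W).
            batch = np.ascontiguousarray(np.stack(blobs), dtype=np.float32)
            data = pred.infer(batch)
            predictions = np.reshape(data, (batch_size, -1, int(5 + pred.n_classes)))
            for f, preds in zip(frames, predictions):
                dets = pred.postprocess(preds, ratio)
                # ... draw dets on f and display, as in detect_video
            frames.clear()
            blobs.clear()
    cap.release()
```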

@SHIVAM3052 (Author)

Kindly provide me some links or sources for implementing multi-batch video inference.
