
slow with batch size > 1 #56

Closed
PythonImageDeveloper opened this issue Feb 11, 2020 · 5 comments

@PythonImageDeveloper

Hi,
I set max_batch_size = 3 because I want to speed up the model by processing 3 input images in parallel instead of serially. I converted the model with batch_size = 3 correctly, but when I run trt_ssd.py I get these results on a 1080 Ti:
batch_size = 1: process time is 0.002 sec
batch_size = 3: process time is 0.006 sec
That means the images are processed serially, not in parallel. Why?

@jkjung-avt (Owner)

Does your processing time include image preprocessing? (Do you get a different result with trt_ssd_async.py?)

@PythonImageDeveloper (Author) commented Feb 11, 2020

The time is only for trt_ssd.detect(img, 0.3). That call includes the preprocessing step, but for simplicity I concatenate the same image 3 times before feeding it to np.copyto(self.host_inputs[0], img_resized.ravel()), like this:
img_pred = np.concatenate([img_pred, img_pred, img_pred])
I haven't tested with trt_ssd_async.py.
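
For reference, a minimal sketch of that batched-input layout (shapes assumed from the SSD demo; `host_input` here is a stand-in for `self.host_inputs[0]`, not code from this repo):

```python
import numpy as np

# Stand-in for one preprocessed CHW float32 frame, as in the SSD demo.
img_resized = np.random.rand(3, 300, 300).astype(np.float32)

# The host buffer must be sized for the whole batch (3x one image's volume).
host_input = np.empty(3 * img_resized.size, dtype=np.float32)

# Concatenating along axis 0 yields (9, 300, 300); raveled, this is the
# back-to-back per-image layout an implicit-batch engine expects.
img_pred = np.concatenate([img_resized, img_resized, img_resized])
np.copyto(host_input, img_pred.ravel())
```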

I did the following steps (see the sketch after this list):
1- changed the input size from (1,3,300,300) to (3,3,300,300)
2- changed to builder.max_batch_size = 3
3- changed to self.context.execute_async(
batch_size=3,
bindings=self.bindings,
stream_handle=self.stream.handle)
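
Roughly, the engine-building side of steps 1 and 2 looks like this (a sketch against the implicit-batch UFF path of TensorRT 6/7 that this repo targets; the file name and tensor names are placeholders, and note the registered input shape stays per-image because the batch dimension is carried by builder.max_batch_size):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
MAX_BATCH = 3  # step 2: build a batch-3 engine

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:
    builder.max_batch_size = MAX_BATCH
    builder.max_workspace_size = 1 << 28
    # Step 1: with the implicit-batch API, register the per-image CHW shape;
    # the batch dimension is not part of the registered shape.
    parser.register_input('Input', (3, 300, 300))
    parser.register_output('MarkOutput_0')
    parser.parse('model.uff', network)  # placeholder path
    engine = builder.build_cuda_engine(network)

# Step 3 then enqueues the whole batch in one call:
#   context.execute_async(batch_size=MAX_BATCH,
#                         bindings=bindings,
#                         stream_handle=stream.handle)
```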

Note that when I comment out self.stream.synchronize() in ssd.py, the first few results take 0.002 sec and then the time grows until it reaches 0.06; when self.stream.synchronize() remains uncommented, I get 0.06 for all results. Why?
In my opinion, self.stream.synchronize() should probably be asynchronous, not synchronous, if possible.

@jkjung-avt (Owner)

Instead of timing the whole trt_ssd.detect() function, I think it makes more sense for you to only time the "cuda.memcpy_xxx"s, "context.execute_async" and "cuda.stream.synchronize" in that function.

By the way, the "self.stream.synchronize" call cannot be commented out. Otherwise, you cannot be sure the GPU has finished processing the image.
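
Something along these lines (a sketch only; it assumes the buffers, bindings and stream were allocated as in ssd.py):

```python
import time
import pycuda.driver as cuda

# Sketch: per-stage timing inside detect(). Assumes host_inputs, cuda_inputs,
# host_outputs, cuda_outputs, bindings, stream and context were set up as in
# ssd.py. Note that memcpy_*_async and execute_async only ENQUEUE work on the
# CUDA stream and return immediately; stream.synchronize() blocks until the
# GPU has actually finished, so most of the real GPU time shows up there.
# (That also explains the growing times seen with synchronize() commented
# out: once the stream's work queue backs up, the enqueue calls themselves
# start to block.)
def timed_infer(context, bindings, host_inputs, cuda_inputs,
                host_outputs, cuda_outputs, stream, batch_size):
    t0 = time.time()
    cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
    t1 = time.time()
    context.execute_async(batch_size=batch_size,
                          bindings=bindings,
                          stream_handle=stream.handle)
    t2 = time.time()
    cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
    t3 = time.time()
    stream.synchronize()
    t4 = time.time()
    print('htod %.6f | execute %.6f | dtoh %.6f | sync %.6f'
          % (t1 - t0, t2 - t1, t3 - t2, t4 - t3))
```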

@PythonImageDeveloper (Author) commented Feb 19, 2020

This is my custom TensorRT OCR model: with batch_size = 1 I get 0.02 sec, and with batch_size = 10 I get 0.2 sec, which means the batched input images are being processed serially, not in parallel. Why?

All times are in seconds:

| Stage | batch_size = 1 | batch_size = 10 |
| --- | --- | --- |
| TensorRT total time | 0.02888178825378418 | 0.22867369651794434 |
| cuda.memcpy_htod_async (cuda_inputs) | 0.00016927719116210938 | 0.00018334388732910156 |
| self.context.execute_async | 0.0031588077545166016 | 0.0013976097106933594 |
| cuda.memcpy_dtoh_async (host_outputs) | 9.1552734375e-05 | 9.894371032714844e-05 |
| stream.synchronize() | 0.018606901168823242 | 0.20677971839904785 |

@jkjung-avt (Owner)

Duplicate issue: #106
