
slow with batch size > 1 #56

Closed
PythonImageDeveloper opened this issue Feb 11, 2020 · 5 comments

@PythonImageDeveloper

Hi,
I set max_batch_size = 3 because I want to speed up the model by processing 3 input images in parallel instead of serially. I converted the model with batch_size = 3 correctly, but when I run trt_ssd.py I get these results on a 1080 Ti:
batch_size = 1: process time is 0.002 sec
batch_size = 3: process time is 0.006 sec
That means the images are processed serially, not in parallel. Why?

@jkjung-avt (Owner)

Does your processing time include image preprocessing? (Do you get a different result with trt_ssd_async.py?)

@PythonImageDeveloper (Author) commented Feb 11, 2020

The time is only for trt_ssd.detect(img, 0.3). That call includes the preprocessing step, but for simplicity I concatenate the same image 3 times before feeding it to np.copyto(self.host_inputs[0], img_resized.ravel()), like this:
img_pred = np.concatenate([img_pred, img_pred, img_pred])
I haven't tested with trt_ssd_async.py.
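
For reference, a minimal sketch of that batched-input layout (shapes assumed from the SSD demo; `host_input` here is a stand-in for `self.host_inputs[0]`, not code from this repo):

```python
import numpy as np

# Stand-in for one preprocessed CHW float32 frame, as in the SSD demo.
img_resized = np.random.rand(3, 300, 300).astype(np.float32)

# The host buffer must be sized for the whole batch (3x one image's volume).
host_input = np.empty(3 * img_resized.size, dtype=np.float32)

# Concatenating along axis 0 yields (9, 300, 300); raveled, this is the
# back-to-back per-image layout an implicit-batch engine expects.
img_pred = np.concatenate([img_resized, img_resized, img_resized])
np.copyto(host_input, img_pred.ravel())
```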

I did the following steps (see the sketch after this list):
1- changed the input size from (1,3,300,300) to (3,3,300,300)
2- changed to builder.max_batch_size = 3
3- changed to self.context.execute_async(
batch_size=3,
bindings=self.bindings,
stream_handle=self.stream.handle)
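
Roughly, the engine-building side of steps 1 and 2 looks like this (a sketch against the implicit-batch UFF path of TensorRT 6/7 that this repo targets; the file name and tensor names are placeholders, and note the registered input shape stays per-image because the batch dimension is carried by builder.max_batch_size):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
MAX_BATCH = 3  # step 2: build a batch-3 engine

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:
    builder.max_batch_size = MAX_BATCH
    builder.max_workspace_size = 1 << 28
    # Step 1: with the implicit-batch API, register the per-image CHW shape;
    # the batch dimension is not part of the registered shape.
    parser.register_input('Input', (3, 300, 300))
    parser.register_output('MarkOutput_0')
    parser.parse('model.uff', network)  # placeholder path
    engine = builder.build_cuda_engine(network)

# Step 3 then enqueues the whole batch in one call:
#   context.execute_async(batch_size=MAX_BATCH,
#                         bindings=bindings,
#                         stream_handle=stream.handle)
```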

Note that when I comment out self.stream.synchronize() in ssd.py, the first few results take 0.002 sec and then the time grows until it reaches 0.06; when self.stream.synchronize() remains uncommented, I get 0.06 for all results. Why?
In my opinion, self.stream.synchronize() should probably be asynchronous, not synchronous, if possible.

@jkjung-avt (Owner)

Instead of timing the whole trt_ssd.detect() function, I think it makes more sense for you to only time the "cuda.memcpy_xxx"s, "context.execute_async" and "cuda.stream.synchronize" in that function.

By the way, the "self.stream.synchronize" call cannot be commented out. Otherwise, you cannot be sure the GPU has finished processing the image.
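
Something along these lines (a sketch only; it assumes the buffers, bindings and stream were allocated as in ssd.py):

```python
import time
import pycuda.driver as cuda

# Sketch: per-stage timing inside detect(). Assumes host_inputs, cuda_inputs,
# host_outputs, cuda_outputs, bindings, stream and context were set up as in
# ssd.py. Note that memcpy_*_async and execute_async only ENQUEUE work on the
# CUDA stream and return immediately; stream.synchronize() blocks until the
# GPU has actually finished, so most of the real GPU time shows up there.
# (That also explains the growing times seen with synchronize() commented
# out: once the stream's work queue backs up, the enqueue calls themselves
# start to block.)
def timed_infer(context, bindings, host_inputs, cuda_inputs,
                host_outputs, cuda_outputs, stream, batch_size):
    t0 = time.time()
    cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
    t1 = time.time()
    context.execute_async(batch_size=batch_size,
                          bindings=bindings,
                          stream_handle=stream.handle)
    t2 = time.time()
    cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
    t3 = time.time()
    stream.synchronize()
    t4 = time.time()
    print('htod %.6f | execute %.6f | dtoh %.6f | sync %.6f'
          % (t1 - t0, t2 - t1, t3 - t2, t4 - t3))
```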

@PythonImageDeveloper (Author) commented Feb 19, 2020

This is my custom TensorRT OCR model: with batch_size = 1 I get 0.02 sec, and with batch_size = 10 I get 0.2 sec, which means the batched input images are being processed serially, not in parallel. Why?

All times are in seconds:

| Stage | batch_size = 1 | batch_size = 10 |
| --- | --- | --- |
| TensorRT total time | 0.02888178825378418 | 0.22867369651794434 |
| cuda.memcpy_htod_async (cuda_inputs) | 0.00016927719116210938 | 0.00018334388732910156 |
| self.context.execute_async | 0.0031588077545166016 | 0.0013976097106933594 |
| cuda.memcpy_dtoh_async (host_outputs) | 9.1552734375e-05 | 9.894371032714844e-05 |
| stream.synchronize() | 0.018606901168823242 | 0.20677971839904785 |

@jkjung-avt (Owner)

Duplicate issue: #106
