
I'd like to make some performance-related changes #356

Open
TRex22 opened this issue Nov 5, 2022 · 7 comments
TRex22 commented Nov 5, 2022

Hey there,

In my research I need to generate a lot of Grad-CAM heat-maps, which I then do further processing on. I'm using the GradCAM class as I don't need the other implementations right now, although I may decide to use some in the future. I've performance-tuned my work about as much as I can. Using the torch profiler, I discovered that the greatest cost comes from moving data between the CUDA device and the CPU.
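For reference, a minimal sketch of how that transfer cost shows up under torch.profiler (a stand-in forward pass plus the device-to-host copy that a NumPy result forces; the model and shapes here are just placeholders):

```python
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

# Sketch: profile a forward pass plus the device-to-host copy that a NumPy
# result forces, to see where the transfer cost shows up.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50(weights=None).to(device).eval()
batch = torch.randn(8, 3, 224, 224, device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        out = model(batch)
    result = out.cpu().numpy()  # the copy back to the host shows up here

# Device-to-host copies appear as "Memcpy DtoH" / "aten::to" entries.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```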

I'm already batching the GradCAM calls (a batch size of 8 seemed to be the most performant; larger batches showed diminishing returns).

Where I think there is room for improvement is in allowing the GradCAM class (and maybe others) to use the torch equivalents of the NumPy functions, to limit when data is moved off the device. I believe many of the operations could be done before moving the results back, which should give a nice speedup. In my case, getting back a tensor result on the device rather than a NumPy result on the CPU would also directly help my work.
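To illustrate the idea (a sketch only; the `output_as_tensor` keyword is hypothetical and not part of the current API, and constructor details vary across library versions):

```python
import torch
import torchvision.models as models
from pytorch_grad_cam import GradCAM

# Sketch only: `output_as_tensor` is a hypothetical opt-in, not a real kwarg.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50(weights=None).to(device).eval()
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])
batch = torch.randn(8, 3, 224, 224, device=device)

heatmaps_np = cam(input_tensor=batch)                      # today: NumPy on the CPU
heatmaps = cam(input_tensor=batch, output_as_tensor=True)  # hypothetical: CUDA tensor
```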

I'm happy to investigate implementing this as an optional / configurable change, if that's more agreeable, so that you have to opt in somehow. I just wanted to know whether that's something others would like before I fork and do it, as I don't want to maintain a fork that diverges from the base work.

I'd have to make changes to the BaseCAM and GradCAM classes at a minimum.

My time right now is also very limited, which is why I wanted to ask whether there is an appetite for these changes; I might get a working concept that does what I need but not be able to change all the other CAM implementations too.

jacobgil (Owner) commented Nov 5, 2022

Hey,
I'm definitely open to changing everything to be done in pure torch on the device instead of numpy.
But before investing in this large change, should we get some data points on how much faster Grad-CAM would be this way? A quick-and-dirty implementation would be fine, just to build motivation and see whether it's worth it.
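A minimal timing harness along these lines could work (a sketch; `cam` and `batch` are placeholders for either branch, and the `torch.cuda.synchronize()` calls matter, since GPU timings are misleading without them):

```python
import time
import torch

def time_cam(cam, batch, warmup=3, iters=20):
    """Rough wall-clock timing for a CAM call on a CUDA batch (sketch)."""
    for _ in range(warmup):      # warm up kernels and the allocator first
        cam(input_tensor=batch)
    torch.cuda.synchronize()     # drain queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        cam(input_tensor=batch)
    torch.cuda.synchronize()     # ensure all timed work has finished
    return (time.perf_counter() - start) / iters
```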


TRex22 commented Nov 5, 2022

Okay, awesome. I'll put together a very rough fork and see if I can time the difference between master and the branch.


TRex22 commented Nov 5, 2022

Cool, so I have made a proof of concept. It's quite crude.
It appears to be somewhat faster in my research code out of the box (I'll post more concrete results later).

I will continue this later. It's very late where I live, so I'm calling it a night.

I have tried to add comments / documentation as I go. My aim in the draft PR is to get just enough working that I can build some proper before-and-after benchmarks, so we can see whether this is worth implementing.

One potential change is around the use of cv2.resize. I have not found a one-to-one torch equivalent, and at least for my particular use case I managed to remove it. This will have to be looked at more closely when completing the PR.
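For what it's worth, `torch.nn.functional.interpolate` is probably the closest torch-native candidate, though it is not bit-for-bit identical to cv2.resize (border handling and `align_corners` semantics differ):

```python
import torch
import torch.nn.functional as F

# Sketch: resize a batch of (N, H, W) heat-maps on-device with torch instead
# of cv2. Not bit-identical to cv2.resize(..., interpolation=cv2.INTER_LINEAR).
cams = torch.rand(8, 7, 7)                 # dummy low-resolution CAM batch
resized = F.interpolate(
    cams.unsqueeze(1),                     # interpolate expects (N, C, H, W)
    size=(224, 224),
    mode="bilinear",
    align_corners=False,
).squeeze(1)                               # back to (N, H, W)
```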


TRex22 commented Nov 5, 2022

I'm thinking of benchmarking a dataset like Cityscapes on a default ResNet50 or something similar. I'll post the code in the PR when I write it, too.

I want it to be as generic as possible and to make use of the torchvision stuff, so that it's a fair before-and-after comparison.
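Perhaps something along these lines, paired with the timing helper above (a sketch; the Cityscapes root path is a placeholder, and the weights string and CAM constructor details vary across torchvision and library versions):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import Cityscapes
from pytorch_grad_cam import GradCAM

# Sketch of a generic before-and-after benchmark; paths are placeholders.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50(weights="IMAGENET1K_V2").to(device).eval()
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])

transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
dataset = Cityscapes(root="path/to/cityscapes", split="val", transform=transform)
loader = DataLoader(dataset, batch_size=8, num_workers=4)

for images, _ in loader:
    heatmaps = cam(input_tensor=images.to(device))  # time this loop per branch
```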


TRex22 commented Nov 6, 2022

Okay, some good and some bad news.
My benchmark shows it's faster, but only marginally.

I'll do a more extensive benchmark across some more models later.

The reason my own code is noticeably faster is possibly that I use the tensor on the GPU and don't have to convert the NumPy result back into a tensor after computing it, removing all of those operations.
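In other words, the round trip below disappears (a sketch continuing the earlier examples; the on-device variant is hypothetical):

```python
# Today: the result arrives as NumPy on the CPU, so GPU post-processing
# pays for a host-to-device copy on top of the device-to-host one.
heatmaps_np = cam(input_tensor=batch)                # CUDA -> CPU copy inside
heatmaps = torch.from_numpy(heatmaps_np).to(device)  # CPU -> CUDA copy back

# With an on-device result (hypothetical), both copies disappear:
heatmaps = cam(input_tensor=batch)                   # stays a CUDA tensor
```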

I'll create some nice-looking Markdown tables and maybe some graphs later.

I also want to make the benchmark more advanced and try scaling the batch size and other bits to see what happens. Luckily, I have access to a fairly good compute machine (and can schedule far bigger machines if need be).


TRex22 commented Nov 6, 2022

Also, the tests are failing badly right now 😭 ... but that's to be expected.


TRex22 commented Dec 1, 2022

Sorry about going quiet. I'm working full time while writing my thesis draft + some papers, and presenting at a conference in two weeks ...

But my experiments are running slower than expected ... by days, so I will be investing more time into this soonish.
