[LArbys/cafef] IO optimization #80

Open
drinkingkazu opened this issue Dec 30, 2016 · 1 comment
drinkingkazu commented Dec 30, 2016

After benchmarking with a Pascal Titan X and comparing its speed against a Maxwell Titan X, I realized the speed-up does not scale consistently from one network to another. It turned out that, for the network I was testing with, IO wait became significant on Pascal because of the increased compute power (i.e. GPU computation time dropped while the IO wait on the CPU stayed the same, so IO became a larger fraction of the total).

This comes down to two components: one is the memory copy from CPU to GPU SRAM, about which I cannot do anything simple, and the other is an IO optimization that is possible.

The IO optimization itself has two components:
a) memory allocation of the blob at each mini-batch data loading
b) copying data from the larcv IO manager into the blob

To optimize performance, both should be threaded.
This requires a total of 3 threads:
0) LArCV IO thread (already exists and in use)
1) blob memory allocation thread ... to cover a)
2) a thread that copies data from larcv into the blob once 0) and 1) complete, which means this thread manages the above two threads

I plan to implement the above in the root data layer, where the main thread instantiates 2), which in turn instantiates 0) and 1); a rough sketch follows.
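
A minimal sketch of that three-thread scheme, using plain `std::thread` / `std::future` (not the actual caffe/larcv API; `read_larcv_batch` and `allocate_blob` are hypothetical stand-ins for the real IO and allocation work):

```cpp
#include <algorithm>
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real work; the actual layer would call the
// LArCV IOManager and caffe::Blob APIs here instead.
std::vector<float> read_larcv_batch(size_t n) {   // thread 0): LArCV IO
  return std::vector<float>(n, 1.f);
}
std::vector<float> allocate_blob(size_t n) {      // thread 1): blob allocation
  return std::vector<float>(n);
}

int main() {
  // Mini-batch of 10 single-channel 576x576 images.
  const size_t batch_elems = 10 * 1 * 576 * 576;

  // Thread 2): owns 0) and 1) and performs the copy once both are done.
  std::thread copier([batch_elems] {
    auto io_fut    = std::async(std::launch::async, read_larcv_batch, batch_elems); // 0)
    auto alloc_fut = std::async(std::launch::async, allocate_blob,    batch_elems); // 1)

    std::vector<float> data = io_fut.get();    // wait for LArCV IO
    std::vector<float> blob = alloc_fut.get(); // wait for blob allocation
    std::copy(data.begin(), data.end(), blob.begin());  // larcv -> blob copy
    std::printf("copied %zu elements into the blob\n", blob.size());
  });

  // The main thread (the data layer's forward pass) would only block here
  // if the prefetched batch is not ready yet.
  copier.join();
  return 0;
}
```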
It would also be nice to implement an option to measure the time spent at each stage and report it periodically, so that anyone can notice this kind of problem in the future.
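
For the per-stage timing, a rough idea (the stage names and report interval are placeholders, not an existing caffe or larcv option):

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

// Accumulates wall-clock time per stage and prints a summary every
// `report_every` mini-batches.
class StageTimer {
 public:
  explicit StageTimer(int report_every) : report_every_(report_every) {}

  // Run `fn`, adding its wall-clock time to the running total for `stage`.
  template <class F>
  void time(const std::string& stage, F&& fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    total_[stage] += std::chrono::duration<double>(t1 - t0).count();
  }

  // Call once per mini-batch; reports and resets the totals periodically.
  void tick() {
    if (++batches_ % report_every_ != 0) return;
    for (const auto& kv : total_)
      std::printf("[%s] %.3f s over last %d batches\n",
                  kv.first.c_str(), kv.second, report_every_);
    total_.clear();
  }

 private:
  int report_every_;
  int batches_ = 0;
  std::map<std::string, double> total_;
};
```

The data layer would then wrap each stage, e.g. `timer.time("larcv_io", [&]{ ... });`, `timer.time("blob_alloc", [&]{ ... });`, `timer.time("copy", [&]{ ... });`, and call `timer.tick()` once per mini-batch.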

drinkingkazu commented Dec 30, 2016

Additional note...

The IO in this test consists of 576 x 576 single-channel images.
The network used is 5 resnet modules, where each module has 3 bottleneck (1-3-1) sets, for a total of 14 layers.
The mini-batch size is 10.
If IO is fully optimized, the CPU => GPU SRAM data transfer would currently account for about 64% of the training time on a Maxwell Titan X.
When I say we should optimize IO, the goal is to keep that transfer the dominating cost (i.e. much larger than the IO wait).
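
For scale, a back-of-the-envelope estimate of the per-batch payload under these settings (assuming the images are stored as 32-bit floats; the actual storage type may differ):

```cpp
#include <cstdio>

int main() {
  const long pixels_per_image = 576L * 576L * 1;      // 331,776 pixels, single channel
  const long bytes_per_image  = pixels_per_image * 4; // ~1.3 MB per image as float32
  const long bytes_per_batch  = bytes_per_image * 10; // mini-batch of 10
  std::printf("%ld bytes (~%.1f MB) per mini-batch\n",
              bytes_per_batch, bytes_per_batch / 1.0e6);  // ~13.3 MB
  return 0;
}
```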
