After benchmarking with a Pascal Titan X and comparing its speed against a Maxwell Titan X, I realized the speedup does not always scale from one network to another. It turned out that, for the network I was testing with, IO wait became significant on Pascal due to the increased compute power: GPU computation time went down while the IO wait on the CPU stayed the same, so the IO fraction of each iteration grew (for example, if the GPU compute per iteration halves while the IO wait is unchanged, the IO share of an iteration can nearly double).
This comes down to two components: one is the memory copy from CPU to GPU memory, about which I cannot do anything simple; the other is an IO optimization that is possible.
IO optimization has two components:
a) memory allocation of the blob at each mini-batch data load
b) copying data from the LArCV IO manager into the blob
To optimize performance, both should be threaded.
This requires a total of three threads:
0) LArCV IO thread (already exists and is in use)
1) blob memory allocation thread ... to cover a)
2) data copy from LArCV into the blob upon completion of 0) and 1), which means this thread manages the two threads above
I will implement the above in the ROOT data layer, where the main thread instantiates 2), which in turn instantiates 0) and 1); a rough sketch of the scheme is below.
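For reference, here is a minimal sketch of the 3-thread scheme, assuming std::async is an acceptable way to spawn the helper threads. The Batch struct and the functions read_next_entries(), allocate_blob(), and fill_blob_from_larcv() are hypothetical stand-ins, not actual LArCV or Caffe API:

```cpp
// Sketch only: Batch stands in for a Caffe blob, and the three functions
// below are hypothetical placeholders for the real LArCV read, blob
// allocation, and fill steps.
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

struct Batch { std::vector<float> data; };

// 0) stand-in for the existing LArCV IO thread reading one mini-batch
Batch read_next_entries(int batch_size, std::size_t image_size) {
  return Batch{std::vector<float>(batch_size * image_size)};
}

// 1) stand-in for pre-allocating the blob's memory for the next mini-batch
Batch allocate_blob(std::size_t elements) {
  return Batch{std::vector<float>(elements)};
}

// 2) copy from the LArCV-side buffer into the blob once both are ready
void fill_blob_from_larcv(Batch& blob, const Batch& io) {
  std::copy(io.data.begin(), io.data.end(), blob.data.begin());
}

// Body of thread 2): it launches 0) and 1), waits for both, then copies.
Batch next_batch(int batch_size, std::size_t image_size) {
  auto io_future    = std::async(std::launch::async, read_next_entries,
                                 batch_size, image_size);
  auto alloc_future = std::async(std::launch::async, allocate_blob,
                                 batch_size * image_size);
  Batch io   = io_future.get();     // wait for 0)
  Batch blob = alloc_future.get();  // wait for 1)
  fill_blob_from_larcv(blob, io);
  return blob;
}

int main() {
  const int         batch_size = 10;
  const std::size_t image_size = 576 * 576;  // single-channel image
  // Main thread instantiates 2) in the background so GPU work on the
  // current batch overlaps with IO preparation of the next one.
  auto copy_future = std::async(std::launch::async, next_batch,
                                batch_size, image_size);
  // ... forward/backward pass on the current batch would run here ...
  Batch next = copy_future.get();
  (void)next;
  return 0;
}
```

In the real data layer the buffers would be reused (double-buffered) rather than returned by value, but the ordering constraint is the same: 2) only copies once both 0) and 1) have finished.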
It would also be nice to implement an option to measure the time spent at each stage and report it periodically, so that anyone can try this and notice this kind of problem in the future; one possible shape is sketched below.
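Just a sketch, assuming a plain per-stage wall-clock timer is enough; StageTimer and the stage labels are hypothetical, not existing Caffe/LArCV code:

```cpp
// Sketch only: StageTimer and the stage labels are hypothetical, not
// existing Caffe/LArCV code.
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

class StageTimer {
 public:
  // Accumulate elapsed seconds under a stage label.
  void add(const std::string& stage, double seconds) { total_[stage] += seconds; }

  // Print accumulated per-stage times every `period` iterations.
  void report(int iteration, int period = 100) const {
    if (iteration == 0 || iteration % period != 0) return;
    std::printf("[iter %d]", iteration);
    for (const auto& kv : total_)
      std::printf(" %s=%.3fs", kv.first.c_str(), kv.second);
    std::printf("\n");
  }

 private:
  std::map<std::string, double> total_;
};

// Time one callable and record it under `stage`.
template <typename F>
void timed(StageTimer& timer, const std::string& stage, F&& f) {
  auto t0 = std::chrono::steady_clock::now();
  f();
  std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
  timer.add(stage, dt.count());
}
```

Each of stages 0) through 2) above (and the CPU => GPU transfer, if it can be measured from the data layer) could be wrapped in timed(), with report() called once every N iterations.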
The IO in this test is 576 x 576 single-channel images.
The network used is 5 ResNet modules, where each module has 3 bottleneck (1-3-1) sets, for a total of 14 layers.
Mini-batch size is 10.
Currently, the CPU => GPU memory data transfer costs about 64% of the training time on the Maxwell Titan X if IO can be optimized.
When I said we should optimize IO, I meant we want to keep that transfer dominating (i.e. remaining much larger than) the IO effect.