Specify the device in multiple GPUs system #29
I made it possible to specify which GPU to use.
@lltcggie Speaking of video processing, is it possible to merge several input images into one four-dimensional data blob during inference? Would this improve efficiency compared to processing a single image at a time?
You mean that processing multiple images at the same time would be more efficient, right?
Yes, though merging multiple images may not make too much sense, and integrating that into their framework might be a totally different story. As for the crop size you mentioned, it is true that a larger size gives you faster speed. I ran a denoising experiment on three generated 3072x3072 images; VRAM usage for other applications plus the Caffe model is around 300MB. The table below shows the processing time and total VRAM on my GTX960.
Although we can get an improvement by increasing the crop size, the gain diminishes. What really matters is that real images have arbitrary widths and heights, especially at video resolutions, and in most cases the crop size cannot divide both the width and the height. A very large crop can therefore waste a lot of computation on some data blobs (the rightmost and bottommost patches). For example, 384 is the best crop for denoising at 1920-by-1080 (which requires processing 1920-by-1152), and 432 is best for denoising at 3840-by-2160 (which requires processing 3888-by-2160, better than 384 here, which would require 3840-by-2304). I believe assigning the crop size for width and height separately might be the best solution to this problem.

Another thought is still about batch size. In another Caffe CNN project of mine, training an FCN (forward+backward on 500-by-300 images) on a dual Titan X server, I observed a 20% speed-up with batch-size=10 over batch-size=2 (which is batch-size=5 versus batch-size=1 on each card). This improvement might be more obvious in a dual-GPU system and during training, but a single GPU may still benefit even in forward-only inference. This might be one reason AlexNet and SRCNN use batch-size=128 in training. Since you already crop the image into smaller patches, and each patch only needs 300MB, I believe it's worth trying batch-size>1: combine 2 or 4 patches into one four-dimensional data blob, run the forward pass, and recover them from the four-dimensional output. In this way, VRAM usage stays under control, and we don't need multiple images. I believe the efficiency of waifu2x-caffe can be further optimized, since I have seen other CNNs drawing more power on my GPU ;)
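The batching idea above can be sketched in a few lines of numpy. This is a minimal illustration only (`make_batch_blob` and `split_batch_blob` are hypothetical helper names, not part of waifu2x-caffe): it stacks equally sized H×W×C patches into the N×C×H×W layout that Caffe blobs use, and splits the output blob back into patches.

```python
import numpy as np

def make_batch_blob(patches):
    """Stack equally sized HxWxC patches into an NxCxHxW blob (Caffe layout)."""
    # Each patch is an (H, W, C) image crop; transpose to channel-first,
    # then stack along a new leading batch axis.
    return np.stack([p.transpose(2, 0, 1) for p in patches], axis=0).astype(np.float32)

def split_batch_blob(blob):
    """Recover individual HxWxC patches from an NxCxHxW output blob."""
    return [im.transpose(1, 2, 0) for im in blob]

# Example: four 128x128 RGB patches -> one (4, 3, 128, 128) blob.
patches = [np.zeros((128, 128, 3), dtype=np.float32) for _ in range(4)]
blob = make_batch_blob(patches)
assert blob.shape == (4, 3, 128, 128)
assert split_batch_blob(blob)[0].shape == (128, 128, 3)
```

In a real pipeline the blob would be fed to the network's input layer in one forward pass instead of one patch per call.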
Maybe Caffe and cuDNN themselves do a better job on the batch dimension than on the width and height dimensions. Anyway, it's possible to try optimizing it this way :)
But in my experiment, the -batch_size option improves the processing time by only 0–13% on cuDNN v4 and Torch7.
I see, so by "four-dims data blob" you meant batching.
I tried the CUI version. I have to say that batch size sometimes helps and sometimes doesn't. Also, this benchmark of the CUI version is not reliable or comparable, because it is much slower than the GUI version (maybe net initialization and teardown cost too much time on every run), and the timings fluctuate a little in the CUI version.
In the last example (1920x1080), if the crop size cannot divide the width or height, we really waste a lot of computation. A single large image may not care about this, but across thousands of frames in video processing that time adds up. crop-size=120 is too small and cannot fully utilize the GPU with batch-size=1, and yet even crop-size=120, with some padding waste, is faster than crop-size=384 with batch-size=4; crop-size=384 wastes 1/9 of the computation. So, could you create APIs for separate width and height crop sizes for @HolyWu's project (not necessarily as a GUI option)? Then we could assign the sizes separately and not waste too much.
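The padding arithmetic behind these numbers is easy to check. The sketch below (helper names are mine, not from the project) computes the padded size the crop tiling forces and the fraction of wasted computation, reproducing the 1920-by-1152 and 3888-by-2160 figures quoted above.

```python
import math

def padded_size(dim, crop):
    """Smallest multiple of `crop` that covers `dim` pixels."""
    return math.ceil(dim / crop) * crop

def waste_ratio(w, h, crop):
    """Fraction of computed pixels that fall outside the real image."""
    pw, ph = padded_size(w, crop), padded_size(h, crop)
    return 1.0 - (w * h) / (pw * ph)

print(padded_size(1080, 384))   # -> 1152: 1920x1080 is processed as 1920x1152
print(padded_size(3840, 432))   # -> 3888: 3840x2160 is processed as 3888x2160
print(round(waste_ratio(1920, 1080, 384), 4))
print(round(waste_ratio(3840, 2160, 432), 4))
```

Separate width/height crop sizes would let `waste_ratio` be minimized per dimension instead of with a single compromise value.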
If you use a folder as the input, I think you can measure the CUI and GUI versions under the same conditions.
Yeah, I totally agree with you that adding batch size to the GUI is not a good choice for average users. And the results don't show clear evidence of a benefit from batch size, so your current design of batch-size=1 actually seems to be the best choice in practice and avoids a lot of trouble for users.
Hi lltcggie,
Could you please add an option to the interface so that a user with more than one GPU in the system can specify which device to use? We use this library for video upscaling, so if the processing could be distributed across different GPUs at the same time, the time needed would be greatly reduced. Thank you very much.
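The multi-GPU request could be driven from the caller's side once a device-selection option exists. The sketch below is only an illustration under that assumption: `upscale_frame` is a hypothetical placeholder for invoking the upscaler pinned to one device, and frames are routed round-robin across the available GPUs.

```python
from concurrent.futures import ThreadPoolExecutor

GPU_IDS = [0, 1]  # device indices in this system (assumed)

def upscale_frame(frame_path, gpu_id):
    # Placeholder for the real work: launch the upscaler for this frame
    # pinned to `gpu_id` (e.g. via a device-selection flag, once added).
    return (frame_path, gpu_id)

def distribute(frames, gpu_ids):
    """Assign frames to GPUs round-robin and process them concurrently."""
    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        futures = [pool.submit(upscale_frame, f, gpu_ids[i % len(gpu_ids)])
                   for i, f in enumerate(frames)]
        return [fut.result() for fut in futures]

results = distribute([f"frame_{i:04d}.png" for i in range(6)], GPU_IDS)
# Frames alternate between device 0 and device 1: 0, 1, 0, 1, 0, 1
```

One worker per GPU keeps each device saturated with its own stream of frames; since video frames are independent, the speed-up should be close to linear in the number of devices.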