Specify the device in multiple GPUs system #29
I made it possible to specify which GPU to use.
@lltcggie Speaking of video processing, is it possible to merge several input images into one four-dimensional data blob during inference? Would this improve efficiency compared to processing a single image at a time?
You mean that processing multiple images at the same time would be more efficient, right?
Yes, though merging multiple images may not make too much sense, and integrating that into their framework might be a totally different story. As for the crop size you mentioned, it is true that a larger size gives you faster speed. I ran a denoising experiment on three generated 3072x3072 images; VRAM usage for other applications plus the Caffe model is around 300MB. The table below shows the processing time and total VRAM on my GTX960.
Although we can get an improvement by increasing the crop size, the gain diminishes. What really matters is that real images have arbitrary widths and heights, especially at video resolutions, and in most cases the crop size cannot divide both the width and the height. A very large crop can therefore waste a lot of computation on some data blobs (the rightmost and bottommost patches). For example, 384 is the best crop for denoising at 1920-by-1080 (which requires processing 1920-by-1152), and 432 is best for denoising at 3840-by-2160 (which requires processing 3888-by-2160, better than 384 here, which would require 3840-by-2304). I believe assigning the crop size for width and height separately might be the best solution to this problem.

Another thought is still about batch size. In another Caffe CNN project of mine, training an FCN (forward+backward on 500-by-300 images) on a dual Titan X server, I observed a 20% speed-up with batch-size=10 over batch-size=2 (which is batch-size=5 versus batch-size=1 on each card). This improvement might be more obvious in a dual-GPU system and during training, but a single GPU may still benefit even in forward-only inference. This might be one reason AlexNet and SRCNN use batch-size=128 in training. Since you already crop the image into smaller patches, and each patch only needs 300MB, I believe it's worth trying batch-size>1: combine 2 or 4 patches into one four-dimensional data blob, run the forward pass, and recover them from the four-dimensional output. In this way, VRAM usage stays under control, and we don't need multiple images. I believe the efficiency of waifu2x-caffe can be further optimized, since I have seen other CNNs drawing more power on my GPU ;)
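The batching idea above can be sketched in a few lines of numpy. This is a minimal illustration only (`make_batch_blob` and `split_batch_blob` are hypothetical helper names, not part of waifu2x-caffe): it stacks equally sized H×W×C patches into the N×C×H×W layout that Caffe blobs use, and splits the output blob back into patches.

```python
import numpy as np

def make_batch_blob(patches):
    """Stack equally sized HxWxC patches into an NxCxHxW blob (Caffe layout)."""
    # Each patch is an (H, W, C) image crop; transpose to channel-first,
    # then stack along a new leading batch axis.
    return np.stack([p.transpose(2, 0, 1) for p in patches], axis=0).astype(np.float32)

def split_batch_blob(blob):
    """Recover individual HxWxC patches from an NxCxHxW output blob."""
    return [im.transpose(1, 2, 0) for im in blob]

# Example: four 128x128 RGB patches -> one (4, 3, 128, 128) blob.
patches = [np.zeros((128, 128, 3), dtype=np.float32) for _ in range(4)]
blob = make_batch_blob(patches)
assert blob.shape == (4, 3, 128, 128)
assert split_batch_blob(blob)[0].shape == (128, 128, 3)
```

In a real pipeline the blob would be fed to the network's input layer in one forward pass instead of one patch per call.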
Maybe Caffe and cuDNN themselves do a better job on the batch dimension than on the width and height dimensions. Anyway, it's possible to try optimizing it this way :)
But in my experiment, the -batch_size option improves the processing time by only 0–13% on cuDNN v4 and Torch7.
I see, so by "four-dims data blob" you meant batching.
I tried the CUI version. I have to say that batch size sometimes helps and sometimes doesn't. Also, this benchmark of the CUI version is not reliable or comparable, because it is much slower than the GUI version (maybe net initialization and teardown cost too much time on every run), and the timings fluctuate a little in the CUI version.
In the last example (1920x1080), if the crop size cannot divide the width or height, we really waste a lot of computation. A single large image may not care about this, but across thousands of frames in video processing that time adds up. crop-size=120 is too small and cannot fully utilize the GPU with batch-size=1, and yet even crop-size=120, with some padding waste, is faster than crop-size=384 with batch-size=4; crop-size=384 wastes 1/9 of the computation. So, could you create APIs for separate width and height crop sizes for @HolyWu's project (not necessarily as a GUI option)? Then we could assign the sizes separately and not waste too much.
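The padding arithmetic behind these numbers is easy to check. The sketch below (helper names are mine, not from the project) computes the padded size the crop tiling forces and the fraction of wasted computation, reproducing the 1920-by-1152 and 3888-by-2160 figures quoted above.

```python
import math

def padded_size(dim, crop):
    """Smallest multiple of `crop` that covers `dim` pixels."""
    return math.ceil(dim / crop) * crop

def waste_ratio(w, h, crop):
    """Fraction of computed pixels that fall outside the real image."""
    pw, ph = padded_size(w, crop), padded_size(h, crop)
    return 1.0 - (w * h) / (pw * ph)

print(padded_size(1080, 384))   # -> 1152: 1920x1080 is processed as 1920x1152
print(padded_size(3840, 432))   # -> 3888: 3840x2160 is processed as 3888x2160
print(round(waste_ratio(1920, 1080, 384), 4))
print(round(waste_ratio(3840, 2160, 432), 4))
```

Separate width/height crop sizes would let `waste_ratio` be minimized per dimension instead of with a single compromise value.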
If you use a folder as the input, I think you can measure the CUI and GUI versions under the same conditions.
Yeah, I totally agree with you that adding batch size to the GUI is not a good choice for average users. And the results don't show clear evidence of a benefit from batch size, so your current design of batch-size=1 actually seems to be the best choice in practice and avoids a lot of trouble for users.
Hi lltcggie,
Could you please add an option to the interface so that a user with more than one GPU in the system can specify which device to use? We use this library for video upscaling, so if the processing could be distributed across different GPUs at the same time, the time needed would be greatly reduced. Thank you very much.
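The multi-GPU request could be driven from the caller's side once a device-selection option exists. The sketch below is only an illustration under that assumption: `upscale_frame` is a hypothetical placeholder for invoking the upscaler pinned to one device, and frames are routed round-robin across the available GPUs.

```python
from concurrent.futures import ThreadPoolExecutor

GPU_IDS = [0, 1]  # device indices in this system (assumed)

def upscale_frame(frame_path, gpu_id):
    # Placeholder for the real work: launch the upscaler for this frame
    # pinned to `gpu_id` (e.g. via a device-selection flag, once added).
    return (frame_path, gpu_id)

def distribute(frames, gpu_ids):
    """Assign frames to GPUs round-robin and process them concurrently."""
    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        futures = [pool.submit(upscale_frame, f, gpu_ids[i % len(gpu_ids)])
                   for i, f in enumerate(frames)]
        return [fut.result() for fut in futures]

results = distribute([f"frame_{i:04d}.png" for i in range(6)], GPU_IDS)
# Frames alternate between device 0 and device 1: 0, 1, 0, 1, 0, 1
```

One worker per GPU keeps each device saturated with its own stream of frames; since video frames are independent, the speed-up should be close to linear in the number of devices.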