Preview server for new waifu2x #122

Closed
nagadomi opened this issue Jun 20, 2016 · 21 comments

Comments

@nagadomi
Owner

nagadomi commented Jun 20, 2016

I published http://waifu2x-dev.udp.jp/. This server supports the new waifu2x models.
(code and pretrained models are available on the upconv branch)

EDIT: These changes have been merged into http://waifu2x.udp.jp/

  • the new scale method is 2.4x faster than the current scale method, with better quality
  • the new noise_scale method is 3x faster than the current noise_scale method, with better quality
  • the new noise reduction models support color restoration for YUV420 JPEG (chroma subsampling)

The photo model is not trained yet; I will add photo models, and http://waifu2x.udp.jp/ will switch to the new models in a few weeks.

@YukiSakuma

YukiSakuma commented Jun 20, 2016

I saw TTA.

Finally!
But what do you mean by n-pass auto? I know n is an integer, but how does it run?

@chungexcy

Thanks for the new method. I tried your new model in Caffe and also found a 2.4x improvement for 2x scaling.
Since cuDNN v5.1 mentions a new forward algorithm, I changed the default algorithm in conv_layer to each of the other 7 algorithms, but didn't find any improvement under Caffe. I'm not sure whether I did it correctly or whether there really is no difference. Have you done such an experiment before?

typedef enum
{
    CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM         = 0,
    CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM = 1,
    CUDNN_CONVOLUTION_FWD_ALGO_GEMM                  = 2,
    CUDNN_CONVOLUTION_FWD_ALGO_DIRECT                = 3,
    CUDNN_CONVOLUTION_FWD_ALGO_FFT                   = 4,
    CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING            = 5,
    CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD              = 6,
    CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED     = 7
} cudnnConvolutionFwdAlgo_t;
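
For reference, here is a minimal standalone sketch of what trying a specific algorithm means at the cuDNN level (assuming the cuDNN v5 C API; the layer shape and file name are illustrative assumptions, not the Caffe code I actually modified). It asks each forward algorithm for its workspace size on one 3x3 conv layer; unsupported algorithms report CUDNN_STATUS_NOT_SUPPORTED.

// Illustrative sketch (cuDNN v5 API assumed): probe every forward algorithm
// for one 3x3 conv layer and print its workspace requirement.
// Build with something like: nvcc probe_fwd_algos.cpp -lcudnn
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    if (cudnnCreate(&handle) != CUDNN_STATUS_SUCCESS) return 1;

    cudnnTensorDescriptor_t x, y;
    cudnnFilterDescriptor_t w;
    cudnnConvolutionDescriptor_t conv;
    cudnnCreateTensorDescriptor(&x);
    cudnnCreateTensorDescriptor(&y);
    cudnnCreateFilterDescriptor(&w);
    cudnnCreateConvolutionDescriptor(&conv);

    // Illustrative layer: 1x32x124x124 input, 64 output channels, 3x3 kernel, pad=0, stride=1.
    cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 32, 124, 124);
    cudnnSetFilter4dDescriptor(w, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 64, 32, 3, 3);
    cudnnSetConvolution2dDescriptor(conv, 0, 0, 1, 1, 1, 1, CUDNN_CROSS_CORRELATION);

    int n, c, h, wd;
    cudnnGetConvolution2dForwardOutputDim(conv, x, w, &n, &c, &h, &wd);
    cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, wd);

    // Ask every algorithm for its workspace size; unsupported ones return an error status.
    for (int a = 0; a <= 7; ++a) {
        size_t bytes = 0;
        cudnnStatus_t s = cudnnGetConvolutionForwardWorkspaceSize(
            handle, x, w, conv, y, (cudnnConvolutionFwdAlgo_t)a, &bytes);
        if (s == CUDNN_STATUS_SUCCESS)
            std::printf("algo %d: workspace %zu bytes\n", a, bytes);
        else
            std::printf("algo %d: not supported here (status %d)\n", a, (int)s);
    }

    cudnnDestroyConvolutionDescriptor(conv);
    cudnnDestroyFilterDescriptor(w);
    cudnnDestroyTensorDescriptor(y);
    cudnnDestroyTensorDescriptor(x);
    cudnnDestroy(handle);
    return 0;
}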

@nagadomi
Owner Author

nagadomi commented Jun 21, 2016

@chungexcy

I also tested WINOGRAD.
A long time ago I tested WINOGRAD on a GTX 760 with cuDNN v5 RC, and the result was that WINOGRAD was very slow.
Recently I tested WINOGRAD on a GTX 1080 with cuDNN v5, and WINOGRAD was 2.7x faster than IMPLICIT_GEMM (test code). I am not sure whether the cause is the cuDNN version or the hardware.
In my implementation (waifu2x in this repo), the forward algorithm is automatically selected by cudnnFindConvolutionForwardAlgorithm (a feature of cudnn.torch) when batch converting or running the web service, so WINOGRAD is probably selected.

@nagadomi
Owner Author

nagadomi commented Jun 21, 2016

But what do you mean by n-pass auto? I know n is an integer, but how does it run?

TTA (test-time augmentation) averages the upscaling results of the following 8 augmented inputs.
[image: the 8 augmented input patterns]

This is called TTA 8-pass. TTA 4-pass uses 4 of these augmented inputs, and TTA 2-pass uses 2.
TTA 8-pass achieves the best upscaling quality, but it is 8x slower than the non-TTA method; TTA 4-pass is 4x slower and TTA 2-pass is 2x slower.
n-pass auto selects the number of augmented patterns from the input image size: 8-pass for small images, 1-pass for large images. It is best effort.
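
Here is a rough C++ sketch of the idea, not the actual waifu2x code: the 8 augmentations are the 4 rotations of the input and of its horizontally flipped copy, and each model output is de-augmented before averaging. The `model` callback is a placeholder (an identity function in the example), and the image is assumed square and single-channel just to keep the sketch short.

// Illustrative TTA n-pass averaging (not the waifu2x implementation).
#include <vector>
#include <functional>
#include <cstdio>

using Image = std::vector<float>;  // square, single-channel, row-major, n*n values

static Image rot90(const Image& src, int n) {   // rotate 90 degrees clockwise
    Image dst(src.size());
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
            dst[x * n + (n - 1 - y)] = src[y * n + x];
    return dst;
}

static Image flip_h(const Image& src, int n) {  // horizontal flip
    Image dst(src.size());
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
            dst[y * n + (n - 1 - x)] = src[y * n + x];
    return dst;
}

// `model` stands in for the upscaler; it must preserve the size here so the sketch
// stays self-contained (a real 2x model would also scale n before the undo step).
Image tta(const Image& input, int n, int passes,
          const std::function<Image(const Image&)>& model) {
    Image sum(input.size(), 0.0f);
    int used = 0;
    for (int flip = 0; flip < 2 && used < passes; ++flip) {
        for (int rot = 0; rot < 4 && used < passes; ++rot) {
            Image aug = flip ? flip_h(input, n) : input;
            for (int r = 0; r < rot; ++r) aug = rot90(aug, n);
            Image out = model(aug);
            for (int r = 0; r < (4 - rot) % 4; ++r) out = rot90(out, n);  // undo rotation
            if (flip) out = flip_h(out, n);                               // undo flip
            for (size_t i = 0; i < sum.size(); ++i) sum[i] += out[i];
            ++used;
        }
    }
    for (float& v : sum) v /= (float)used;  // average of the de-augmented outputs
    return sum;
}

int main() {
    int n = 4;
    Image img(n * n);
    for (int i = 0; i < n * n; ++i) img[i] = (float)i;
    // With an identity "model", the 8-pass average reproduces the input exactly.
    Image out = tta(img, n, 8, [](const Image& x) { return x; });
    std::printf("img[5] = %.1f, tta out[5] = %.1f\n", img[5], out[5]);
    return 0;
}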

@chungexcy

@nagadomi Thanks for your testing and coding. Last time I thought there was no improvement... but I had forgotten to comment out the cudnnGetConvolutionForwardAlgorithm call in Caffe, so every layer was still being set automatically...

So I retried your new model on the latest Caffe with cuDNN 5.1 RC. The input is split into many 480x360 patches with batch=1.

Mode    Name                                            Time            Crop_size   VRAM
auto                                                    0:00:04.137760  480x360     609MiB
0   CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM            0:00:06.197489  480x360     609MiB
1   CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM    0:00:04.124851  480x360     609MiB
2   CUDNN_CONVOLUTION_FWD_ALGO_GEMM                     0:00:06.731829  240x270     2862MiB
3   CUDNN_CONVOLUTION_FWD_ALGO_DIRECT                   status == CUDNN_STATUS_SUCCESS (9 vs. 0)  CUDNN_STATUS_NOT_SUPPORTED
4   CUDNN_CONVOLUTION_FWD_ALGO_FFT                      error == cudaSuccess (2 vs. 0)  out of memory
5   CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING               0:00:31.901851  480x360
6   CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD                 0:00:03.845546  480x360     611MiB
7   CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED        error == cudaSuccess (2 vs. 0)  out of memory

Here, WINOGRAD is 1.6x faster than IMPLICIT_GEMM and 1.07x faster than IMPLICIT_PRECOMP_GEMM (which should be the default setting). Since Caffe does not seem to support the deconv layer with cuDNN, my gains should be higher if only the conv layers were benchmarked, as in your code. Also, the FFT algorithms do not seem to be a good fit for small kernels.

Could you also try IMPLICIT_PRECOMP_GEMM on your new GTX 1080?

@chungexcy

Hi nagadomi, I seem to be able to reproduce your WINOGRAD result.
I tried these 3 algorithms on the Very Deep Super Resolution network (Accurate Image Super-Resolution Using Very Deep Convolutional Networks). The structure is one 1x64x3x3 conv, followed by 18 64x64x3x3 convs, followed by one 64x1x3x3 conv. Here are the results.

0   CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM            0:00:28.527458
1   CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM    0:00:15.827436
6   CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD                 0:00:10.990773

WINOGRAD is 2.60x faster than IMPLICIT_GEMM and 1.44x faster than IMPLICIT_PRECOMP_GEMM. I guess the GTX 1080 would behave similarly to this one.

@nagadomi
Owner Author

nagadomi commented Jun 22, 2016

Here are my results. They are similar to yours. code

algo    mean processing time
IMPLICIT_GEMM   0.26035809516907
IMPLICIT_PRECOMP_GEMM   0.13740830421448
WINOGRAD    0.090363383293152

@chungexcy

Thanks, nagadomi.
I also found that the behavior of cudnnGetConvolutionForwardAlgorithm is really weird.

In the 20-layer 3x3 conv network I mentioned before, this function returns 1 (IMPLICIT_PRECOMP_GEMM) for the first 1x64x3x3 conv and 6 (WINOGRAD) for the 18 64x64x3x3 convs plus the final 64x1x3x3 conv,
while in your network design, all 6 conv layers in the new model and all 7 conv layers in the old model are set to 1 (IMPLICIT_PRECOMP_GEMM).

This doesn't make much sense, since I found:

if pad == 1
    // found this in VGG
    if some condition is satisfied
        WINOGRAD
    else
        IMPLICIT_PRECOMP_GEMM
else if pad == 0
    // found this in your model
    IMPLICIT_PRECOMP_GEMM

So pad=0 disables the acceleration in cuDNN v5.

After testing under various conditions, I found the following results, which are not always consistent with cuDNN's decisions.

1. For the first layer (3 -> ch),
    if ch <= 96,
        IMPLICIT_PRECOMP_GEMM is faster (same as cuDNN).
2. For the remaining layers (m -> n),
    if m, n >= 32,
        WINOGRAD is faster (cuDNN uses >= 48).
3. For the remaining layers (m -> n),
    if m, n >= 512,
        IMPLICIT_PRECOMP_GEMM is slightly slower,
        but requires much less memory (same as cuDNN).

On my GTX 960, using WINOGRAD for all layers achieves +8.1% on the new waifu2x model and +22.8% on the old waifu2x model.
If I set the first layer to IMPLICIT_PRECOMP_GEMM, the new model achieves +8.8%.
If I set the first two layers to IMPLICIT_PRECOMP_GEMM, the old model achieves +24.7%.
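
As a rough illustration only, the rules above could be encoded like this (`pick_fwd_algo` is a made-up helper name, not from Caffe or waifu2x; the thresholds are the ones I measured on a GTX 960 and may not transfer to other GPUs or cuDNN versions):

// Rough encoding of the empirical rules above (illustrative sketch only).
#include <cudnn.h>
#include <cstdio>

cudnnConvolutionFwdAlgo_t pick_fwd_algo(int in_ch, int out_ch, bool prefer_low_memory) {
    if (in_ch <= 3)                            // first RGB layer: PRECOMP_GEMM was faster (up to ch <= 96)
        return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    if (in_ch < 32 || out_ch < 32)             // WINOGRAD only pays off from roughly 32 channels
        return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    if (prefer_low_memory && in_ch >= 512 && out_ch >= 512)
        return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;  // slightly slower, far less memory
    return CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD;
}

int main() {
    // Channel progression of the new waifu2x model (3->16->32->64->128->128->256),
    // as seen in the autotuning log later in this thread.
    int ch[] = {3, 16, 32, 64, 128, 128, 256};
    for (int i = 0; i + 1 < 7; ++i)
        std::printf("conv%d (%3d -> %3d): algo %d\n", i + 1, ch[i], ch[i + 1],
                    (int)pick_fwd_algo(ch[i], ch[i + 1], false));
    return 0;
}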

@chungexcy

Correction:
If I set the first two layers to IMPLICIT_PRECOMP_GEMM, the new model achieves +8.8%.
If I set only the first layer to IMPLICIT_PRECOMP_GEMM, the old model achieves +24.7%.

@nagadomi
Owner Author

nagadomi commented Jun 22, 2016

Thanks for reporting.
I think pad=1 can be made to work for the waifu2x models:

  • set pad=1 on all conv layers
  • manually remove a (number of conv layers * 2) px border from the output (see the size check below)

The result is the same as pad=0.
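
A quick size check may make the equivalence concrete. This is a sketch only; it assumes the new model's 6 conv layers and interprets the (number of conv layers * 2) px as the total removed per dimension, i.e. one conv layer's worth of pixels from each side.

// Size check for the pad=1 + crop idea (illustrative; 6 3x3 conv layers as in the new model).
#include <cstdio>

int main() {
    int in = 128;        // input patch size
    int n_conv = 6;      // 3x3 conv layers before the deconv in the new model
    int pad0_out = in - 2 * n_conv;          // pad=0: each 3x3 conv trims 1 px on every side
    int pad1_out = in;                       // pad=1: the spatial size is preserved
    int cropped  = pad1_out - 2 * n_conv;    // crop n_conv px from every border afterwards
    // The cropped border is exactly the region influenced by the zero padding,
    // so the remaining pixels match the pad=0 output.
    std::printf("pad=0: %dx%d, pad=1 + crop: %dx%d\n", pad0_out, pad0_out, cropped, cropped);
    return 0;
}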

@chungexcy

chungexcy commented Jun 23, 2016

Thanks for the suggestion. I tested this idea and here are the results.

Since cuDNN v5 does not turn on WINOGRAD if the input channel count is <= 32, the first 3 layers do not require padding. Only the last 3 convs in the new model and the last 4 convs in the old model need padding to trigger the acceleration.

  • On the old model (120 forward passes on a 480x360 patch):
Algo(Pad)                           Time        +       Note
1(0)-6(0)-6(0)-6(0)-6(0)-6(0)-6(0)  26.122208   24.7%   Best    ***
6(0)-6(0)-6(0)-6(0)-6(0)-6(0)-6(0)  26.518610   22.8%   Force all WINOGRAD
1(0)-1(0)-1(0)-6(0)-6(0)-6(0)-6(0)  26.916889   21.0%   No padding, compared with auto-optimization
1(0)-1(0)-1(0)-6(1)-6(1)-6(1)-6(1)  27.137580   20.0%   Padding enabled, cuDNN v5 auto-optimization
1(0)-1(0)-1(0)-1(0)-1(0)-1(0)-1(0)  32.566156    0.0%   Baseline, current implementation
  • On the new model (360 forward passes on a 480x360 patch):
Algo(Pad)                           Time        +       Note
1(0)-1(0)-6(0)-6(0)-6(0)-6(0)       36.792309   8.78%   Best    ***
6(0)-6(0)-6(0)-6(0)-6(0)-6(0)       37.015582   8.13%   Force all WINOGRAD
1(0)-1(0)-1(0)-6(0)-6(0)-6(0)       37.026379   8.10%   No padding, compared with auto-optimization
1(0)-1(0)-1(0)-6(1)-6(1)-6(1)       37.391198   7.04%   Padding enabled, cuDNN v5 auto-optimization
1(0)-1(0)-1(0)-1(0)-1(0)-1(0)       40.023677   0.00%   Baseline

All results are pure forward performance over many 480x360 patches (Python is much slower than C++ for the rest of the processing).

Padding only costs about 1% at this resolution, and we can expect about a 2% loss at a smaller 240x180 size.
Manually tuning the configuration is even faster than the default behavior of cuDNN v5.

I have reported these two findings (the padding requirement, and WINOGRAD not turning on at channel == 32) to the cuDNN team. Hopefully I can get some feedback from them and solve this at the source.

@nagadomi
Owner Author

nagadomi commented Jun 23, 2016

Here are the auto-tuning results for the new waifu2x model using cudnn.torch.

Autotuning SC     Forward: Time: 0.02048 Memory:    95264 Algorithm: 1 Weight:           16x27 Input:     1x3x128x128 Output:    1x16x126x126   
Autotuning backwardFilter: Time: 0.06758 Memory:     9736 Algorithm: 3 Weight:           16x27 Input:     1x3x128x128 Output:    1x16x126x126   
Autotuning   backwardData: Time: 0.04096 Memory:    32768 Algorithm: 4 Weight:           16x27 Input:     1x3x128x128 Output:    1x16x126x126


Autotuning SC     Forward: Time: 0.04416 Memory:    32768 Algorithm: 6 Weight:          32x144 Input:    1x16x126x126 Output:    1x32x124x124   
Autotuning backwardFilter: Time: 0.09830 Memory:     9616 Algorithm: 3 Weight:          32x144 Input:    1x16x126x126 Output:    1x32x124x124   
Autotuning   backwardData: Time: 0.05222 Memory:    65536 Algorithm: 4 Weight:          32x144 Input:    1x16x126x126 Output:    1x32x124x124


Autotuning SC     Forward: Time: 0.09571 Memory:   131072 Algorithm: 6 Weight:          64x288 Input:    1x32x124x124 Output:    1x64x122x122   
Autotuning backwardFilter: Time: 0.21018 Memory:     9496 Algorithm: 3 Weight:          64x288 Input:    1x32x124x124 Output:    1x64x122x122   
Autotuning   backwardData: Time: 0.08227 Memory:   131072 Algorithm: 4 Weight:          64x288 Input:    1x32x124x124 Output:    1x64x122x122


Autotuning SC     Forward: Time: 0.20733 Memory:   524288 Algorithm: 6 Weight:         128x576 Input:    1x64x122x122 Output:   1x128x120x120   
Autotuning backwardFilter: Time: 0.51818 Memory:     9376 Algorithm: 3 Weight:         128x576 Input:    1x64x122x122 Output:   1x128x120x120   
Autotuning   backwardData: Time: 0.20275 Memory:   524288 Algorithm: 4 Weight:         128x576 Input:    1x64x122x122 Output:   1x128x120x120


Autotuning SC     Forward: Time: 0.34358 Memory:  1048576 Algorithm: 6 Weight:        128x1152 Input:   1x128x120x120 Output:   1x128x118x118   
Autotuning backwardFilter: Time: 0.81974 Memory:     9292 Algorithm: 1 Weight:        128x1152 Input:   1x128x120x120 Output:   1x128x118x118   
Autotuning   backwardData: Time: 0.33488 Memory:  1048576 Algorithm: 4 Weight:        128x1152 Input:   1x128x120x120 Output:   1x128x118x118


Autotuning SC     Forward: Time: 0.66362 Memory:  2097152 Algorithm: 6 Weight:        256x1152 Input:   1x128x118x118 Output:   1x256x116x116   
Autotuning backwardFilter: Time: 1.35894 Memory:     9208 Algorithm: 1 Weight:        256x1152 Input:   1x128x118x118 Output:   1x256x116x116   
Autotuning   backwardData: Time: 0.59699 Memory:  2097152 Algorithm: 4 Weight:        256x1152 Input:   1x128x118x118 Output:   1x256x116x116

Autotuning        Forward: Time: 0.09728 Memory:    80744 Algorithm: 1 Weight:       256x3x4x4 Input:   1x256x116x116 Output:     1x3x228x228   
Autotuning backwardFilter: Time: 0.29498 Memory:    14008 Algorithm: 3 Weight:       256x3x4x4 Input:   1x256x116x116 Output:     1x3x228x228   
Autotuning   backwardData: Time: 0.16925 Memory:        0 Algorithm: 0 Weight:       256x3x4x4 Input:   1x256x116x116 Output:     1x3x228x228

Forward Algorithm 6 = WINOGRAD, so WINOGRAD is selected.
I am not sure what the difference is; the code is here.

@chungexcy

I read your output code. Is it related to cudnnFindConvolutionForwardAlgorithm within the benchmark=true branch? I thought auto-tuning would be related to cudnnGetConvolutionForwardAlgorithm.

In Caffe, I added an if condition for this: either get the algorithm automatically, or read it from a self-added algorithm line in the prototxt, in order to conduct the previous tests.

@nagadomi
Owner Author

nagadomi commented Jun 23, 2016

In cudnn.torch, when cudnn.benchmark=true, cudnnFindConvolutionForwardAlgorithm is used; when cudnn.benchmark=false (the default), cudnnGetConvolutionForwardAlgorithm is used. But if the user specifies a forward algorithm with setMode() (self.fmode is not nil), that specific algorithm is used.

In my understanding, cudnnFindConvolutionForwardAlgorithm(fastest=true) selects the fastest algorithm using a runtime benchmark (it caches the results, so the selected algo is reused), while cudnnGetConvolutionForwardAlgorithm(fastest=true) selects the fastest algorithm using rule-based logic built on NVIDIA's knowledge.
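
For reference, here is a minimal sketch that calls both entry points for a single layer shape taken from the autotuning log above (assuming the cuDNN v5 C API; error checking is omitted for brevity, and this is not the cudnn.torch or Caffe code):

// Sketch: rule-based choice (Get) vs. benchmarked choice (Find) for one 3x3 conv layer.
// Build with something like: nvcc get_vs_find.cpp -lcudnn
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t x, y;
    cudnnFilterDescriptor_t w;
    cudnnConvolutionDescriptor_t conv;
    cudnnCreateTensorDescriptor(&x);
    cudnnCreateTensorDescriptor(&y);
    cudnnCreateFilterDescriptor(&w);
    cudnnCreateConvolutionDescriptor(&conv);

    // Last conv of the new model in the log above: 1x128x118x118 input, 256 filters, 3x3, pad=0.
    cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 128, 118, 118);
    cudnnSetFilter4dDescriptor(w, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 256, 128, 3, 3);
    cudnnSetConvolution2dDescriptor(conv, 0, 0, 1, 1, 1, 1, CUDNN_CROSS_CORRELATION);

    int n, c, h, wd;
    cudnnGetConvolution2dForwardOutputDim(conv, x, w, &n, &c, &h, &wd);
    cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, wd);

    // Rule-based selection (the cudnn.benchmark=false path).
    cudnnConvolutionFwdAlgo_t get_algo;
    cudnnGetConvolutionForwardAlgorithm(handle, x, w, conv, y,
                                        CUDNN_CONVOLUTION_FWD_PREFER_FASTEST, 0, &get_algo);

    // Benchmark-based selection (the cudnn.benchmark=true path); perf[0] is the fastest.
    cudnnConvolutionFwdAlgoPerf_t perf[8];
    int returned = 0;
    cudnnFindConvolutionForwardAlgorithm(handle, x, w, conv, y, 8, &returned, perf);

    std::printf("Get  -> algo %d\n", (int)get_algo);
    if (returned > 0)
        std::printf("Find -> algo %d (%.3f ms, %zu bytes workspace)\n",
                    (int)perf[0].algo, perf[0].time, perf[0].memory);

    cudnnDestroyConvolutionDescriptor(conv);
    cudnnDestroyFilterDescriptor(w);
    cudnnDestroyTensorDescriptor(y);
    cudnnDestroyTensorDescriptor(x);
    cudnnDestroy(handle);
    return 0;
}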

@chungexcy

chungexcy commented Jun 26, 2016

Thanks for your explanation. I followed your method and switched to cudnnFindConvolutionForwardAlgorithm in Caffe to benchmark and make a runtime decision. The code is here.

The result below is the algorithm choice for the new waifu2x model on one forward pass of a 480x360 patch on my GTX 960 (one round of initialization and one round of forward).

Autotuning SC     Forward: Time: 0.06118 Memory:   117608 Algorithm: 1
Autotuning SC     Forward: Time: 0.13574 Memory:   114272 Algorithm: 1
Autotuning SC     Forward: Time: 0.28048 Memory:   131072 Algorithm: 6
Autotuning SC     Forward: Time: 0.87158 Memory:   524288 Algorithm: 6
Autotuning SC     Forward: Time: 1.79267 Memory:  1048576 Algorithm: 6
Autotuning SC     Forward: Time: 3.66102 Memory:  2097152 Algorithm: 6

Autotuning SC     Forward: Time: 0.36982 Memory:  1098152 Algorithm: 1
Autotuning SC     Forward: Time: 1.05338 Memory:  1087808 Algorithm: 1
Autotuning SC     Forward: Time: 2.39629 Memory:   131072 Algorithm: 6
Autotuning SC     Forward: Time: 8.80400 Memory:   524288 Algorithm: 6
Autotuning SC     Forward: Time: 18.22976 Memory:  1048576 Algorithm: 6
Autotuning SC     Forward: Time: 42.40477 Memory:  2097152 Algorithm: 6

Here my optimized algo for conv2 is 1 instead of 6, which is consistent with my previous result that a 0.5% improvement is observed on real-image inference.

I believe your benchmark result makes sense: algo 6 is best for conv2 on your GTX 1080. Since we found the 1080 benefits by 51% on that 20-layer network (versus 44% on the GTX 960), WINOGRAD seems more efficient on Pascal and is indeed the better conv2 setting for future GPUs.

Anyway, if NVIDIA changes the policy in cuDNN, everything will be fine.

Could you try comparing the overall inference speed of the entire network between all IMPLICIT_PRECOMP_GEMM and your best configuration (1-6-6-6-6-6-6) on your GTX 1080, to see how much improvement you can get (hopefully more than 8.7%)?

@nagadomi
Owner Author

nagadomi commented Jun 26, 2016

 (3x128x128 patch by waifu2x default setting, average of 100 times)
1-1-1-1-1-1-1(waifu2x new): 0.002844
1-6-6-6-6-6-1(waifu2x new): 0.002189
1-1-1-6-6-6-1(waifu2x new): 0.002257
1-1-1-1-1-1(conv only, remove deconv&relus): 0.002605
1-6-6-6-6-6(conv only, remove deconv&relus): 0.001829
1-1-1-6-6-6(conv only, remove deconv&relus): 0.001899

 (3x480x360 patch, average of 20 times)
1-1-1-1-1-1-1(waifu2x new): 0.02715
1-6-6-6-6-6-1(waifu2x new): 0.02068
1-1-1-6-6-6-1(waifu2x new): 0.02028 (best decision of cudnnFindConvolutionForwardAlgorithm in this patch size)
1-1-1-1-1-1(conv only, remove deconv&relus): 0.02350
1-6-6-6-6-6(conv only, remove deconv&relus): 0.01725
1-1-1-6-6-6(conv only, remove deconv&relus): 0.01700

It seems that the best forward algo depends on the patch size.

@chungexcy

Wow... that's a lot of improvement, far beyond my expectation. Definitely worth optimizing, especially for the Pascal architecture.

1. The corresponding benchmark on my GTX 960 is below for your reference. Times are measured on the new model, all without padding.

(3x128x128 patch by waifu2x default setting, average of 100 times)
6-6-6-6-6-6         0.010411      cudnnGetConvolutionForwardAlgorithm(if padding)
1-6-6-6-6-6         0.010319      best from test
1-1-6-6-6-6         0.010341      cudnnFindConvolutionForwardAlgorithm
1-1-1-6-6-6         0.010496
1-1-1-1-1-1         0.011422

(3x480x360 patch, average of 20 times)
6-6-6-6-6-6         0.103201
1-6-6-6-6-6         0.102130
1-1-6-6-6-6         0.101858      cudnnFindConvolutionForwardAlgorithm / best from test
1-1-1-6-6-6         0.102811      cudnnGetConvolutionForwardAlgorithm(if padding)
1-1-1-1-1-1         0.110872

2. I also tried to understand how cudnnGetConvolutionForwardAlgorithm works. Here are my findings.

128x128+padding=1
Autotuning SC     Algorithm: 6
Autotuning SC     Algorithm: 6
Autotuning SC     Algorithm: 6
Autotuning SC     Algorithm: 6
Autotuning SC     Algorithm: 6
Autotuning SC     Algorithm: 6

480x360+padding=1
Autotuning SC     Algorithm: 1
Autotuning SC     Algorithm: 1
Autotuning SC     Algorithm: 1
Autotuning SC     Algorithm: 6
Autotuning SC     Algorithm: 6
Autotuning SC     Algorithm: 6

Without padding, I get Algo: 1 for all layers.

With padding, the threshold lies between 209x209 and 210x210. The total number of pixels (width x height) is what matters; width and height do not really matter individually.

Could you verify this finding on your GTX 1080? Thanks a lot.

From your previous results, cudnnGetConvolutionForwardAlgorithm in cuDNN v5 seems to be optimized for Pascal. If cudnnGetConvolutionForwardAlgorithm behaves the same on Pascal and Maxwell, adding padding on all or some layers may be the easiest way to address this, while still keeping compatibility with future versions (without changing the default Caffe behavior to assign our pre-defined algos).

@nagadomi
Owner Author

nagadomi commented Jul 3, 2016

waifu2x-caffe now supports the new models: https://github.com/lltcggie/waifu2x-caffe/releases/tag/1.1.5
waifu2x.udp.jp is still using the old models 😞

Also, I have now added new models for photo.

@YukiSakuma

YukiSakuma commented Jul 4, 2016

What's the difference between the RGB model and the new UpRGB model?
edit: I did some initial testing on the new version (1.1.4) of waifu2x-caffe; it produces some artifacting when using denoise (with TTA) and then upscaling, whether it's RGB or UpRGB.
edit: it looks like the same issue as this one: lltcggie/waifu2x-caffe#7

@nagadomi
Owner Author

nagadomi commented Jul 4, 2016

I don't know much about waifu2x-caffe. You should create an issue on the waifu2x-caffe repo if the problem only happens with waifu2x-caffe.

I think UpRGB is the same model as http://waifu2x-dev.udp.jp/ .

@nagadomi
Owner Author

I merged these changes into the master branch.
Thanks.
