# ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks

<img src="https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbuEqNc%2FbtrqN2L0EbI%2FbpM7c6SG3ewk5AOxhLHY40%2Fimg.png">
Baboon (test image)

- Three key components of SRGAN that ESRGAN improves
    - Network architecture (RRDB: Residual-in-Residual Dense Block)
    - Adversarial loss (idea from the relativistic GAN)
    - Perceptual loss (using the features before activation)

## Proposed Methods
##### MAIN AIM: To improve the overall perceptual quality of SR
### Network Architecture

<img src="https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbMn9n7%2FbtrqOgjBfNo%2FBsqG1joBXE4Wfcs7gPH5j1%2Fimg.png">

<img src = "https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FthFaB%2FbtrqNLRE0Va%2FydtkURqs9B0Ni4tQKRiYxk%2Fimg.png">

- We employ the basic architecture of SRResNet.
- Left: the BN layers in the residual block of SRGAN are removed.
- Right: the RRDB block. $\beta$ is the residual scaling parameter.

##### Removing Batch Normalization layers
- When the statistics of training and testing datasets differ a lot, **BN layers tend to introduce unpleasant artifacts and limit the generalization ability.**
- Furthermore, removing BN layers helps to improve generalization ability and to reduce computational complexity and memory usage.
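The train/test mismatch can be illustrated with a toy sketch. It assumes the usual BN behaviour of normalizing with batch statistics during training and with statistics estimated from the training data at test time. The function name `batch_norm_with_stats`, the sample values, and `eps` are illustrative assumptions, not taken from the paper.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence


def batch_norm_with_stats(x: Sequence[float], mean: float, var: float,
                          eps: float = 1e-5) -> List[float]:
    """Normalize x with a fixed mean and variance, as BN does at test time."""
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Statistics estimated on toy "training" features.
train_feats = [0.0, 1.0, 2.0, 3.0, 4.0]
train_mean, train_var = fmean(train_feats), pvariance(train_feats)

# Toy "test" features whose statistics differ a lot from the training ones.
test_feats = [10.0, 11.0, 12.0, 13.0, 14.0]

train_out = batch_norm_with_stats(train_feats, train_mean, train_var)
test_out = batch_norm_with_stats(test_feats, train_mean, train_var)

# Training-like inputs come out centred near zero; the shifted test inputs do not.
print(f"mean after BN (train-like input):   {fmean(train_out):.3f}")
print(f"mean after BN (shifted test input): {fmean(test_out):.3f}")
```

The later layers were trained on roughly zero-centred features, so a shifted input like `test_out` is one plausible way such artifacts can arise.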

- Based on the observation that more layers and connections could always boost performance, the proposed RRDB employs a **deeper and more complex structure** than the original residual block in SRGAN.  


- Techniques used to make the very deep network easier to train
    - 1) Residual scaling: multiply the residuals by a constant $\beta$ between 0 and 1 before adding them to the main path, to prevent instability
    - 2) Smaller initialization (see the supplementary material): we empirically find that a residual architecture is easier to train when the initial parameter variance becomes smaller
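The residual-in-residual structure with residual scaling can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: feature maps are flat lists of floats, `toy_layer` is a hypothetical stand-in for a convolution plus activation, and the block counts and the value `beta=0.2` are assumptions chosen for the example.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# A toy "layer" maps the list of all features seen so far to a new feature vector.
Layer = Callable[[Sequence[Vector]], Vector]


def residual_add(x: Vector, residual: Vector, beta: float) -> Vector:
    """Residual scaling: scale the residual by beta before adding it to the main path."""
    return [xi + beta * ri for xi, ri in zip(x, residual)]


def dense_block(x: Vector, layers: Sequence[Layer], beta: float) -> Vector:
    """Dense block: every layer receives the input and all earlier layer outputs."""
    feats: List[Vector] = [x]
    for layer in layers:
        feats.append(layer(feats))
    return residual_add(x, feats[-1], beta)


def rrdb(x: Vector, blocks: Sequence[Sequence[Layer]], beta: float = 0.2) -> Vector:
    """Residual-in-residual: dense blocks (inner residuals) wrapped by an outer residual."""
    out = x
    for layers in blocks:
        out = dense_block(out, layers, beta)
    return residual_add(x, out, beta)


def toy_layer(feats: Sequence[Vector]) -> Vector:
    """Hypothetical stand-in for conv + activation: element-wise mean of all inputs."""
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


x = [1.0, -2.0, 0.5]
blocks = [[toy_layer] * 5 for _ in range(3)]  # 3 dense blocks of 5 layers (assumed sizes)
y = rrdb(x, blocks, beta=0.2)
print(y)
```

With `beta=0.0` every residual branch is switched off and `rrdb` returns its input unchanged, which shows how a small $\beta$ keeps the block close to an identity mapping.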

### Relativistic Discriminator

- Enhance the discriminator based on the Relativistic GAN
- A relativistic discriminator tries to predict the probability that a real image $x_r$ is relatively more realistic than a fake one $x_f$.

<img src= "https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdGCpa0%2FbtrqN2LYSeE%2FFqwx2bud9hd0FeWx2eD2V0%2Fimg.png">

- Relativistic average Discriminator RaD [2], denoted as $D_{Ra}$
- $\sigma$ is the sigmoid function and $C(x)$ is the non-transformed discriminator output.
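For reference, the standard discriminator and the relativistic average discriminator can be written out as follows. This is a transcription of the definitions in the ESRGAN paper, assumed to match the figure above; $\mathbb{E}_{x_f}[\cdot]$ denotes the average over all fake data in the mini-batch.

$$D(x) = \sigma(C(x))$$

$$D_{Ra}(x_r, x_f) = \sigma(C(x_r) - \mathbb{E}_{x_f}[C(x_f)])$$

$$D_{Ra}(x_f, x_r) = \sigma(C(x_f) - \mathbb{E}_{x_r}[C(x_r)])$$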

#### Discriminator loss
$$L_D^{Ra} = - \mathbb{E}_{x_r}[\log(D_{Ra}(x_r, x_f))] - \mathbb{E}_{x_f}[\log(1 - D_{Ra}(x_f, x_r))]$$

#### Adversarial loss for generator
$$L_G^{Ra} = - \mathbb{E}_{x_r}[\log(1 - D_{Ra}(x_r, x_f))] - \mathbb{E}_{x_f}[\log(D_{Ra}(x_f, x_r))]$$

where $x_f = G(x_i)$ and $x_i$ is the input LR image.
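A minimal sketch of the two relativistic average losses, computed on raw discriminator outputs $C(x)$ for one mini-batch. The function name `relativistic_average_losses` and the `eps` guard inside the logarithms are assumptions for this example; expectations are approximated by batch means.

```python
import math
from statistics import fmean
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    # Fine for toy values; a production version would guard against overflow.
    return 1.0 / (1.0 + math.exp(-z))


def relativistic_average_losses(c_real: Sequence[float], c_fake: Sequence[float],
                                eps: float = 1e-12) -> Tuple[float, float]:
    """Return (L_D^Ra, L_G^Ra) from raw discriminator outputs C(x_r) and C(x_f)."""
    mean_real = fmean(c_real)
    mean_fake = fmean(c_fake)
    d_real = [sigmoid(c - mean_fake) for c in c_real]  # D_Ra(x_r, x_f)
    d_fake = [sigmoid(c - mean_real) for c in c_fake]  # D_Ra(x_f, x_r)

    loss_d = (-fmean([math.log(p + eps) for p in d_real])
              - fmean([math.log(1.0 - p + eps) for p in d_fake]))
    loss_g = (-fmean([math.log(1.0 - p + eps) for p in d_real])
              - fmean([math.log(p + eps) for p in d_fake]))
    return loss_d, loss_g


# Toy scores: the discriminator currently rates real images above fake ones.
loss_d, loss_g = relativistic_average_losses([5.0, 5.0], [-5.0, -5.0])
print(f"L_D^Ra = {loss_d:.4f}, L_G^Ra = {loss_g:.4f}")
```

Both losses use real and fake scores, so in this formulation the generator receives gradients from real as well as generated data.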

### Perceptual Loss

- Use features before the activation layers
    - First, the activated features are very sparse, especially after a very deep network, so they provide weak supervision (Fig.)
    - Second, using features after activation also causes inconsistent reconstructed brightness compared with the ground-truth image (Sec. 4.4)
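The sparsity argument can be made concrete with a toy sketch. The feature values below are made up, ReLU stands in for the activation layer, and a mean absolute difference is assumed as the feature distance; none of these specifics come from the notes.

```python
from typing import List, Sequence


def relu(feats: Sequence[float]) -> List[float]:
    """Activation: negative responses are clipped to zero."""
    return [max(0.0, f) for f in feats]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two feature vectors (assumed distance)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def sparsity(feats: Sequence[float]) -> float:
    """Fraction of features that are exactly zero."""
    return sum(1 for f in feats if f == 0.0) / len(feats)


# Toy pre-activation features of the SR output and of the ground truth.
pre_sr = [-1.0, 0.5, -2.0, 3.0]
pre_hr = [-0.5, 0.5, -3.0, 2.0]

loss_before = l1_distance(pre_sr, pre_hr)              # every channel contributes
loss_after = l1_distance(relu(pre_sr), relu(pre_hr))   # negative channels are zeroed out

print("sparsity after activation:", sparsity(relu(pre_hr)))
print("loss before activation:", loss_before)
print("loss after activation: ", loss_after)
```

In this toy case half of the activated features are zero, so differences in those channels no longer reach the loss, which is the weak-supervision effect described above.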

**The total loss for the generator**
$$L_G = L_{percep} + \lambda L_G^{Ra} + \eta L_1$$
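where $L_1 = \mathbb{E}_{x_i}||G(x_i) - y||_1$ is the content loss, i.e. the 1-norm distance between the recovered image $G(x_i)$ and the ground truth $y$  
$\lambda, \eta$: coefficients that balance the different loss terms
($\lambda = 5 \times 10^{-3}$, $\eta = 1 \times 10^{-2}$)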
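As a small worked example, the weighted sum can be written as a function. The names `total_generator_loss` and `l1_content_loss` are hypothetical, pixels are flat lists of floats, and averaging the absolute differences over pixels is an assumed normalization.

```python
from typing import Sequence

LAMBDA = 5e-3  # weight of the adversarial term, as given above
ETA = 1e-2     # weight of the L1 content term, as given above


def l1_content_loss(sr: Sequence[float], hr: Sequence[float]) -> float:
    """Mean absolute pixel difference between the SR output and the ground truth."""
    return sum(abs(s - h) for s, h in zip(sr, hr)) / len(sr)


def total_generator_loss(l_percep: float, l_adv: float, l_1: float,
                         lam: float = LAMBDA, eta: float = ETA) -> float:
    """L_G = L_percep + lambda * L_G^Ra + eta * L_1."""
    return l_percep + lam * l_adv + eta * l_1


# Toy values for the three terms.
l_1 = l1_content_loss([0.0, 1.0], [1.0, 1.0])
print(total_generator_loss(l_percep=1.0, l_adv=2.0, l_1=l_1))
```

With these coefficients the perceptual term dominates, and the adversarial and content terms act as small corrections.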


<img src="https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb0TySQ%2FbtrqRTuGsGE%2FOlNBJNgK1yoHHg32kXeUWK%2Fimg.png">

### Network Interpolation
- We first train a PSNR-oriented network $G_{PSNR}$ and then obtain a GAN-based network $G_{GAN}$ by fine-tuning.

#### Interpolated model $G_{INTERP}$
$$\theta_{G}^{INTERP} = (1-\alpha)\theta_G^{PSNR} + \alpha \theta_G^{GAN}$$
  
where $\theta_{G}^{INTERP}, \theta_G^{PSNR}, \theta_G^{GAN}$ are the parameters of $G_{INTERP}, G_{PSNR}, G_{GAN}$, respectively, and  
$\alpha \in [0,1]$ is the interpolation parameter.
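The interpolation formula amounts to blending the two parameter sets entry by entry. A minimal sketch, assuming each network's parameters are stored as a dictionary from a (hypothetical) parameter name to a flat list of floats, and that both networks share the same architecture:

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def interpolate_params(theta_psnr: Params, theta_gan: Params, alpha: float) -> Params:
    """theta_interp = (1 - alpha) * theta_psnr + alpha * theta_gan, per parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if theta_psnr.keys() != theta_gan.keys():
        raise ValueError("both networks must have the same parameter names")
    return {
        name: [(1.0 - alpha) * p + alpha * g
               for p, g in zip(theta_psnr[name], theta_gan[name])]
        for name in theta_psnr
    }


# Toy parameter sets with made-up names and values.
theta_psnr = {"conv.weight": [0.0, 2.0], "conv.bias": [1.0]}
theta_gan = {"conv.weight": [2.0, 4.0], "conv.bias": [3.0]}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, interpolate_params(theta_psnr, theta_gan, alpha))
```

Setting `alpha=0.0` recovers the PSNR-oriented parameters and `alpha=1.0` the GAN-based ones; values in between trade one off against the other without any further training.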

#### Network interpolation enjoys two merits.
- First, the interpolated model is able to produce meaningful results for any feasible $\alpha$ without introducing artifacts.
- Second, we can continuously balance perceptual quality and fidelity **without re-training the model.**

## Experiments

### Training Detail
- Mini-batch size: 16
- Spatial size of cropped HR patch: 128 × 128

- Training process
    - First, we train a PSNR-oriented model with the L1 loss.
    - We then employ the trained PSNR-oriented model as an initialization for the generator.
- Pre-training with pixel-wise loss helps GAN-based methods to obtain more visually pleasing results.
    - Reason 1: It avoids undesired local optima for the generator.
    - Reason 2: After pre-training, the discriminator receives relatively good super-resolved images instead of extreme fake ones (black or noisy images) at the very beginning, which helps it focus more on texture discrimination.


### Qualitative Results

<img src="https://cdn-images-1.medium.com/max/1010/1*w_Fm9_Z6ou4W195hmsOiWQ.png">

- Compared with SRCNN [4], EDSR [20], RCAN [12], SRGAN [1], and EnhanceNet
- ESRGAN produces more natural texture and gets rid of artifacts

### Ablation Study
<img src="https://cdn-images-1.medium.com/max/1024/1*7WGOHvc0_gJiD73qIJ6gqQ.png">

**BN removal** 
 - When a network is deeper and more complicated, the model with BN layers is more likely to introduce unpleasant artifacts.
 
**Before activation in perceptual loss**
<img src="https://cdn-images-1.medium.com/max/1024/1*no2qgNjByseQsq3pXVUplQ.png">

  
**RaGAN**
- Helps the network learn sharper edges and more detailed textures.

**Deeper network with RRDB**
- Improves the recovered textures, especially for regular structures like the roof of image 6.

### Network Interpolation
<img src="https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FzcMW0%2FbtrqUSIzoga%2FmmBwsdYIjje4j8dhwXKfrk%2Fimg.png">

- The pure GAN-based method produces sharp edges and richer textures, but with some unpleasant artifacts.
- The pure PSNR-oriented method outputs cartoon-style blurry images.
- Network interpolation reduces the artifacts while maintaining textures. By contrast, image interpolation (blending the two output images directly) fails to remove these artifacts effectively.
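To make the distinction between the two schemes concrete, here is a toy sketch. The one-neuron `toy_generator` and its parameter values are hypothetical. The example only shows that blending parameters and blending outputs give different results for a nonlinear model; it says nothing about which one looks better.

```python
from typing import Dict

Params = Dict[str, float]


def toy_generator(params: Params, x: float) -> float:
    """Hypothetical one-neuron 'network': ReLU(w * x + b)."""
    return max(0.0, params["w"] * x + params["b"])


def network_interpolation(theta_psnr: Params, theta_gan: Params,
                          alpha: float, x: float) -> float:
    """Blend the parameters first, then run the blended network once."""
    theta = {k: (1.0 - alpha) * theta_psnr[k] + alpha * theta_gan[k]
             for k in theta_psnr}
    return toy_generator(theta, x)


def image_interpolation(theta_psnr: Params, theta_gan: Params,
                        alpha: float, x: float) -> float:
    """Run both networks, then blend their outputs."""
    return ((1.0 - alpha) * toy_generator(theta_psnr, x)
            + alpha * toy_generator(theta_gan, x))


theta_psnr = {"w": 1.0, "b": -1.0}
theta_gan = {"w": -1.0, "b": 1.0}
x = 2.0

net = network_interpolation(theta_psnr, theta_gan, 0.5, x)
img = image_interpolation(theta_psnr, theta_gan, 0.5, x)
print("network interpolation:", net)
print("image interpolation:  ", img)
```

Image interpolation can only mix whatever the two outputs already contain, including any artifacts, whereas network interpolation produces a new model whose output is not a pixel-wise mixture of the two.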