# Zoom to Learn, Learn to Zoom
#### Xuaner Zhang
#### 2019 CVPR 

## Key idea
- Raw Sensor data
- Optical magnification

<img src="https://ceciliavision.github.io/images/project-page/cvpr19.jpg">

## Abstract
- This paper shows that when applying machine learning to __digital zoom__, it is beneficial to operate on real, **RAW sensor data**
- Existing __learning-based super-resolution methods__ do not use real sensor data, instead operating on processed **RGB images**.
- Ground-truth data is obtained via optical zoom and contributed as a dataset, SR-RAW
- **Novel contextual bilateral loss** that is robust to mild misalignment between input and output images
- Synthesizing sensor data by resampling high-resolution RGB images is an oversimplified approximation of **real sensor data and noise, resulting in worse image quality**.

## 1. Introduction
- We propose to improve the quality of super-resolution by starting with real raw sensor data.
- Recently, single image super-resolution has progressed with deep models and learned image priors from large-scale datasets
- These methods are **constrained** in two respects:
    - The input image is a downsampled version of the high-resolution image, which indirectly reduces the noise level in the input.
    - The input is an 8-bit RGB image that has been processed by the camera’s image signal processor (ISP), which trades off high-frequency signal in the higher-bit raw sensor data for other objectives.
    
    
- The **fundamental challenge** is obtaining ground truth for this task.
    - One approach is to synthesize sensor data from 8-bit RGB images that are passed through some synthetic noise model.
    - This is hard to do realistically because sensor noise comes from a variety of sources.

    
- To enable learning from real raw sensor data for better computational zoom, we propose to **capture real data with a zoom lens**
- SR-RAW contains ground-truth high-resolution images taken with high optical zoom levels.
  
  
- SR-RAW brings up a new challenge: the input and ground-truth images are taken with different camera configurations.
- Mildly misaligned input-output image pairs make pixel-wise loss functions unsuitable for training
- We thus introduce a novel contextual bilateral loss (CoBi) that is robust to such mild misalignment
    - CoBi draws inspiration from the recently proposed contextual loss (CX)

### Our contributions
- We demonstrate **the utility of using real high-bit sensor data for computational zoom, rather than processed 8-bit RGB images or synthetic sensor models.**
- **SR-RAW, the first dataset for super-resolution from raw data, with optical ground truth**
- We propose a **novel contextual bilateral loss (CoBi)** that handles slightly misaligned image pairs

## 2. Related Work 
- Image super-resolution has advanced from traditional filtering to learning-based methods.
    - Recently, deep neural networks have been applied to super-resolution, trained with a **variety of losses**
- Deep super-resolution models (several of them GAN-based): RGB as input
    - SRGAN, ESRGAN, LapSRN
- Image processing with raw data: prior work relies on synthetic Bayer mosaics
    - This paper proposes a method to super-resolve images by jointly optimizing for the camera image processing pipeline and super-resolution from raw sensor data.

## 3. Dataset With Optical Zoom Sequences

- SR-RAW contains raw sensor data and **ground-truth high-resolution images** taken with a zoom lens at **various zoom levels**
- For data preprocessing, we **align** the captured images with different zoom levels via field of view (FOV) matching and geometric transformation. 


<img src="./figure2_1.png">

### 3.1. Data Capture with a Zoom Lens
- Each pair of images forms an input-output pair for training a model: the short-focal-length raw sensor image is used as input and the long-focal-length RGB image is regarded as the ground truth for super-resolution.

    - For example, the RGB image taken with a 70mm focal length serves as the 2X zoom ground truth for the raw sensor data taken with a 35mm focal length.
    - We collect 7 images under 7 optical zoom settings per scene for data collection efficiency.
- In total, we collect 500 sequences in indoor and outdoor scenes. ISO ranges from 100 to 400.
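
The pairing rule above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: `zoom_factor` and `training_pairs` are hypothetical helper names, the tolerance `rel_tol` is an assumption, and the focal lengths in the usage note are made-up values apart from the 35mm/70mm example from the text.

```python
from itertools import combinations


def zoom_factor(input_focal_mm, target_focal_mm):
    """Optical zoom ratio between a short-focal-length input and a long-focal-length target."""
    if input_focal_mm <= 0 or target_focal_mm <= 0:
        raise ValueError("focal lengths must be positive")
    return target_focal_mm / input_focal_mm


def training_pairs(focal_lengths_mm, zoom, rel_tol=0.05):
    """List (input, target) focal-length pairs whose ratio is close to `zoom`.

    `rel_tol` is an assumed relative tolerance, not a value from the paper.
    """
    pairs = []
    for short, long in combinations(sorted(focal_lengths_mm), 2):
        if abs(zoom_factor(short, long) - zoom) <= rel_tol * zoom:
            pairs.append((short, long))
    return pairs
```

For instance, `zoom_factor(35, 70)` gives the 2X case from the text, and `training_pairs([35, 70, 140], 2)` lists the two 2X pairs available in that hypothetical three-image sequence.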
  
  
- During data capture, camera settings are important.
     - First, depth of field (DOF) changes with focal length and it is **not practical** to adjust aperture size for each focal length to make DOF identical.
     - We choose a small aperture size (at least f/20) to minimize the DOF difference (still noticeable in Figure 2 B2), using a tripod to capture indoor scenes with a long exposure time. 
     - Second, we use **the same exposure time for all images in a sequence** so that noise level is not affected by focal length change.
     - But we still observe noticeable illumination variations. This color variation is another motivation for us to **avoid using pixel-to-pixel losses for training.**
     
### 3.2. Data Preprocessing
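- Notation: RGB-L and RAW-L are the low-resolution image and its raw sensor data, RGB-H and RAW-H are the high-resolution counterparts, and FOV is the field of view.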
- We apply a Euclidean motion model that allows image rotation and translation via enhanced correlation coefficient minimization.
- A scale offset is applied to the image if the optical zoom does not perfectly match the target zoom ratio.
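
A rough sketch of the geometry behind these steps, assuming an idealized pinhole camera so that matching the FOV of a `zoom`-times longer focal length amounts to a centered crop of `1/zoom` of each dimension. The function names are hypothetical, and the estimation of the motion parameters (the enhanced correlation coefficient step) is not shown; only the form of the Euclidean warp is.

```python
import math


def fov_crop_box(width, height, zoom):
    """Return (left, top, right, bottom) of the centered crop that covers
    1/zoom of each image dimension (idealized pinhole assumption)."""
    if zoom < 1:
        raise ValueError("zoom must be >= 1")
    crop_w = round(width / zoom)
    crop_h = round(height / zoom)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


def euclidean_warp(theta_rad, tx, ty):
    """2x3 matrix of a Euclidean motion: rotation by theta_rad, then translation (tx, ty)."""
    c, s = math.cos(theta_rad), math.sin(theta_rad)
    return [[c, -s, tx],
            [s, c, ty]]


def apply_warp(matrix, x, y):
    """Apply a 2x3 warp matrix to a single point (x, y)."""
    (a, b, tx), (c, d, ty) = matrix
    return a * x + b * y + tx, c * x + d * y + ty
```

A Euclidean model has only three degrees of freedom (one angle, two translations), which is why a separate scale offset is needed when the optical zoom does not match the target ratio exactly.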


### 3.3. Misalignment Analysis
- Misalignment is unavoidable during data capture and can hardly be eliminated by the preprocessing step.
- Misalignment is inherently caused by perspective changes.
- The described misalignment in SR-RAW usually causes **40-80 pixel shifts in an 8-megapixel image pair.**

## 4. Contextual Bilateral Loss
- The contextual loss was proposed to train with unaligned data pairs.
- Source image $P$: feature points $\{p_{i}\}_{i=1}^{N}$ and target image $Q$: feature points $\{q_{j}\}_{j=1}^{M}$  
- For each source image feature $p$, it searches for the nearest neighbor (NN) feature match $q$ such that $q = \arg\min_{q_j} \mathbb{D}(p,q_j)$, $j = 1,\ldots,M$, under some distance measure $\mathbb{D}(p,q)$
- Given input image P and its target Q, the contextual loss tries to minimize the summed distance of all matched feature pairs, formulated as
$$\mathrm{CX}(P,Q) = \frac{1}{N}\sum_{i=1}^{N} \min_{j=1,\ldots,M}\left(\mathbb{D}_{p_{i},q_{j}}\right)$$
- Training with the contextual loss alone produces images with significant artifacts. We hypothesize that these artifacts are caused by inaccurate feature matching in the contextual loss.
- We thus analyze the percentage of features that are matched uniquely
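
The nearest-neighbor form of the contextual loss written above, together with one way to measure how many matches are unique, might look like this in NumPy. It is a sketch of the simplified formula in these notes, not the reference implementation: features are assumed to be rows of a matrix, cosine distance is assumed as $\mathbb{D}$, and the definition of "uniquely matched" (a source feature whose nearest target is chosen by no other source feature) is an assumption.

```python
import numpy as np


def cosine_distance_matrix(P, Q, eps=1e-8):
    """Pairwise cosine distance between rows of P (N, C) and rows of Q (M, C)."""
    P = np.asarray(P, dtype=np.float64)
    Q = np.asarray(Q, dtype=np.float64)
    Pn = P / (np.linalg.norm(P, axis=1, keepdims=True) + eps)
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)
    return 1.0 - Pn @ Qn.T  # shape (N, M)


def contextual_loss(P, Q):
    """Mean over source features of the distance to their nearest target feature."""
    D = cosine_distance_matrix(P, Q)
    return float(D.min(axis=1).mean())


def unique_match_fraction(P, Q):
    """Fraction of source features whose nearest target is matched by no other source."""
    D = cosine_distance_matrix(P, Q)
    nearest = D.argmin(axis=1)                      # index of the NN target per source
    hits = np.bincount(nearest, minlength=D.shape[1])
    return float((hits[nearest] == 1).mean())
```

A low `unique_match_fraction` means many source features collapse onto the same few target features, which is the kind of inaccurate matching the hypothesis refers to.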

- Inspired by the edge-preserving bilateral filter [28], we integrate the spatial pixel coordinates and pixel-level RGB information into the image features.
$$\mathrm{CoBi}(P,Q)= \frac{1}{N}\sum_{i=1}^{N} \min_{j=1,\ldots,M}\left(\mathbb{D}_{p_{i},q_{j}} + w_s\mathbb{D}'_{p_{i},q_{j}}\right)$$
where $\mathbb{D}'_{p_{i},q_{j}} = \lVert (x_i, y_i) - (x_j,y_j) \rVert_2$  

- $(x_i,y_i)$ and $(x_j,y_j)$ are spatial coordinates of feature $p_i$ and $q_j$
- $w_s$ denotes the weight of spatial awareness for the nearest neighbor search.
- We use pretrained VGG-19 features and select 'conv1_2', 'conv2_2' and 'conv3_2' as our deep features.
- Cosine distance (one minus cosine similarity) is used to measure feature similarity: $\mathrm{similarity}(x,y) = \cos(\theta) = \frac{x \cdot y}{\lVert x \rVert \lVert y \rVert}$
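
To make the CoBi formula concrete, here is a minimal NumPy sketch. It assumes features are given as rows of a matrix with one $(x, y)$ coordinate per feature, and uses cosine distance for $\mathbb{D}$ and Euclidean distance for $\mathbb{D}'$ as in the notes. The name `cobi_loss`, the default `w_s=0.1`, and the choice of coordinate units are illustrative assumptions, not settings confirmed by these notes.

```python
import numpy as np


def cobi_loss(P, Q, P_xy, Q_xy, w_s=0.1, eps=1e-8):
    """Contextual bilateral loss sketch.

    P: (N, C) source features, Q: (M, C) target features.
    P_xy: (N, 2) and Q_xy: (M, 2) spatial coordinates of those features.
    w_s: weight of the spatial term (assumed default, for illustration only).
    """
    P = np.asarray(P, dtype=np.float64)
    Q = np.asarray(Q, dtype=np.float64)
    P_xy = np.asarray(P_xy, dtype=np.float64)
    Q_xy = np.asarray(Q_xy, dtype=np.float64)

    # Feature term: cosine distance between every source/target pair, shape (N, M).
    Pn = P / (np.linalg.norm(P, axis=1, keepdims=True) + eps)
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)
    d_feat = 1.0 - Pn @ Qn.T

    # Spatial term: Euclidean distance between feature coordinates, shape (N, M).
    d_spatial = np.linalg.norm(P_xy[:, None, :] - Q_xy[None, :, :], axis=2)

    # Nearest neighbor under the combined distance, averaged over source features.
    return float((d_feat + w_s * d_spatial).min(axis=1).mean())
```

With `w_s=0` this reduces to the plain nearest-neighbor contextual loss; a larger `w_s` penalizes matches that are far apart in the image, which discourages many source features from matching the same distant target feature.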

#### Our final loss function is defined as

$$\mathrm{CoBi}_{\mathrm{RGB}}(P,Q,n) + \lambda \mathrm{CoBi}_{\mathrm{VGG}}(P,Q)$$

where we use $n \times n$ RGB patches as features for $\mathrm{CoBi}_{\mathrm{RGB}}$
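
As an illustration of what "$n \times n$ RGB patches as features" could mean in practice, the sketch below flattens each patch into one feature vector and records its center coordinate, then combines two already-computed CoBi terms with the weight $\lambda$. The helper names, the patch stride, and the use of patch centers as coordinates are assumptions; the notes do not give values for $n$ or $\lambda$, so none are filled in here.

```python
import numpy as np


def rgb_patch_features(image, n, stride=1):
    """Flatten every n x n patch of an (H, W, 3) image into a feature vector.

    Returns (features, coords): features has shape (K, n*n*3) and coords has
    shape (K, 2) holding the (x, y) center of each patch. The stride is an
    assumed parameter.
    """
    image = np.asarray(image, dtype=np.float64)
    h, w, _ = image.shape
    feats, coords = [], []
    for y in range(0, h - n + 1, stride):
        for x in range(0, w - n + 1, stride):
            feats.append(image[y:y + n, x:x + n].reshape(-1))
            coords.append((x + (n - 1) / 2.0, y + (n - 1) / 2.0))
    return np.stack(feats), np.array(coords)


def final_loss(cobi_rgb, cobi_vgg, lam):
    """Weighted sum of the RGB-patch term and the VGG-feature term."""
    return cobi_rgb + lam * cobi_vgg
```

The returned `features` and `coords` are in the shape a CoBi-style loss needs: one row per feature plus its spatial position.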



## 5. Experimental Setup
- We pack each 2 × 2 block in the raw Bayer mosaic into 4 channels as input for our model. The packing reduces the spatial resolution of the image by a factor of two in width and height, without any loss of signal.
- We adopt a 16-layer ResNet architecture followed by $\log_{2} N + 1$ up-convolution layers where N is the zoom factor.
- We split the 500 sequences in SR-RAW into training, validation, and test sets with a ratio of 80:10:10, i.e. 400 training, 50 validation, and 50 test sequences.
- For a 4X zoom model, we get 3 input-output pairs per sequence for training, and for an 8X zoom model, we get 1 pair per sequence.
- We randomly crop 64 × 64 patches from a full-resolution Bayer mosaic as input for training.
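
The input preparation described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the channel order of the packed output depends on the sensor's Bayer pattern and is arbitrary here, crop offsets are kept even so the Bayer phase is preserved (an assumption, not stated in the notes), and `num_upconv_layers` simply evaluates $\log_{2} N + 1$ for power-of-two zoom factors.

```python
import math

import numpy as np


def pack_bayer(mosaic):
    """Pack each 2 x 2 block of an (H, W) Bayer mosaic into 4 channels.

    Output shape is (H/2, W/2, 4); every input sample appears exactly once.
    """
    mosaic = np.asarray(mosaic)
    h, w = mosaic.shape
    if h % 2 or w % 2:
        raise ValueError("mosaic height and width must be even")
    return np.stack(
        [mosaic[0::2, 0::2], mosaic[0::2, 1::2],
         mosaic[1::2, 0::2], mosaic[1::2, 1::2]],
        axis=-1,
    )


def random_bayer_crop(mosaic, size=64, rng=None):
    """Random size x size crop with even offsets, so the Bayer phase is unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = mosaic.shape
    top = 2 * int(rng.integers(0, (h - size) // 2 + 1))
    left = 2 * int(rng.integers(0, (w - size) // 2 + 1))
    return mosaic[top:top + size, left:left + size]


def num_upconv_layers(zoom):
    """log2(N) + 1 up-convolution layers for a power-of-two zoom factor N."""
    return int(math.log2(zoom)) + 1
```

One plausible reading of the extra `+ 1` layer, assuming each up-convolution doubles the resolution: packing halves the width and height, so one doubling is needed to get back to sensor resolution before the $N$-times zoom is applied.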

### 5.1. Baselines
- SRGAN, SRResNet, LapSRN and ESRGAN
- We first try to fine-tune their models on SR-RAW. However, we notice little difference in average performance.

### 5.2. Controlled Experiments on Our Model

- “Ours-png”: For comparison, we also train a copy of our model using 8-bit processed RGB images to evaluate the benefits of having real raw sensor data.
- “Ours-syn-raw”: To test whether synthesized raw data can replace real sensor data for training, we train another copy of our model on synthesized raw data.

## 6. Results

- To quantitatively evaluate the presented approach, we use the standard **SSIM** and **PSNR** metrics, as well as the recently proposed learned perceptual metric **LPIPS**
- GAN-based methods often generate noisy artifacts and lead to low PSNR and SSIM scores
- Bicubic upsampling and SRResNet produce blurry results and get a low score in LPIPS.
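
Of the three metrics, PSNR is simple enough to write out directly. The sketch below uses the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ and assumes 8-bit images by default (`max_value=255`). SSIM and LPIPS need a library or a pretrained network and are not shown.

```python
import numpy as np


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because PSNR is driven purely by pixel-wise squared error, it penalizes the noisy artifacts of GAN-based methods but not blur, which is why the perceptual LPIPS metric is reported alongside it.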

<img src="./table_1.png">

<img src="./figure_5.png">
<img src="./figure_6.png">