# VIDEO DECOMPOSITION PRIOR: EDITING VIDEOS LAYER BY LAYER

Paper Authors: 

| Name | School | Email |
| ---- | ------ | ----- |
| Gaurav Shrivastava | University of Maryland, College Park | gauravsh@umd.edu |
| Abhinav Shrivastava | University of Maryland, College Park | abhinav@cs.umd.edu |
| Ser-Nam Lim | University of Central Florida | sernam@ucf.edu |


Project Authors: Alper Bahçekapılı, Furkan Küçük


Paper Summary: The paper presents a deep learning framework for editing videos without supervision. It addresses the following three tasks:

* Video Relighting
* Video Dehazing
* Unsupervised Video Object Segmentation



## Overall Logic of the Paper:

The paper draws its intuition from video editing programs, which treat a video as a composition of multiple layers. For relighting, one layer is the relight map and the other is the dark frame. Dehazing is similar: one of the layers is the transmission map (t-map). For segmentation, one layer holds the foreground objects and the other holds the background.
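The layer view for segmentation can be illustrated with ordinary alpha compositing. This is a minimal sketch in plain Python on tiny grayscale grids; it assumes the standard blend `alpha * foreground + (1 - alpha) * background`, and the `composite` helper is a hypothetical name, not a function from this repository.

```python
def composite(alpha, foreground, background):
    """Blend two layers per pixel: out = alpha * fg + (1 - alpha) * bg."""
    return [
        [a * f + (1.0 - a) * b for a, f, b in zip(a_row, f_row, b_row)]
        for a_row, f_row, b_row in zip(alpha, foreground, background)
    ]


# Toy 2x2 example: alpha = 1 keeps the foreground, alpha = 0 keeps the background.
alpha = [[1.0, 0.0], [0.5, 0.25]]
foreground = [[0.8, 0.8], [0.8, 0.8]]
background = [[0.2, 0.2], [0.2, 0.2]]

frame = composite(alpha, foreground, background)
print(frame)
```

Under this assumption, the opacity layer acts as a soft segmentation mask: pixels with alpha close to 1 belong to the foreground layer.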

All optimization happens at inference time, so the models are trained from scratch for each video. The paper implements its solutions with two main modules: RGB-net and $\alpha$-net. The number of these models (for example, 1 RGB-net for relighting, 2 RGB-nets for segmentation) and their purpose vary with the task.

These models exploit the optical flow between frames. Including optical flow captures motion effectively and makes the model significantly more resilient to variations in lighting.

## Modules Overview

**RGBnet:** Because the weights are optimized over a single video only, a shallow convolutional U-Net is sufficient for the task. This model takes a frame $X_t$ of the video sequence and outputs an RGB layer.

**$\alpha$ Layer:** As with RGBnet, the architecture is a shallow U-Net, here predicting the t-maps or the opacity layer. It takes the RGB representation of the forward optical flow ($F^{RGB}_{t\rightarrow t-1}$) as input.
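To make "RGB representation of the optical flow" concrete, here is a minimal sketch of one common visualization: direction is mapped to hue and magnitude to saturation on an HSV color wheel. This is an assumption about the general idea only; the repository may use a different mapping, and `flow_to_rgb` and `max_magnitude` are hypothetical names.

```python
import colorsys
import math


def flow_to_rgb(dx, dy, max_magnitude):
    """Map one flow vector (dx, dy) to an RGB triple in [0, 1].

    Hue encodes the direction of motion, saturation encodes its magnitude,
    so a zero vector (no motion) becomes white.
    """
    angle = math.atan2(dy, dx)                  # direction in [-pi, pi]
    hue = (angle + math.pi) / (2.0 * math.pi)   # rescaled to [0, 1]
    saturation = min(math.hypot(dx, dy) / max_magnitude, 1.0)
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)


# Toy flow field with one vector per pixel, as (dx, dy) pairs.
flow = [[(0.0, 0.0), (3.0, 4.0)],
        [(-8.0, -1.0), (0.0, 7.0)]]
flow_rgb = [[flow_to_rgb(dx, dy, max_magnitude=10.0) for dx, dy in row]
            for row in flow]
print(flow_rgb[0][0])
```

The resulting three-channel image has the same layout as a video frame, which is why it can be fed to a U-Net that expects RGB input.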

## Video Relighting

<center>
    <figure>
        <img src="figures/figure-1.png" alt="Video Relighting" title="Figure 1" width="1000">
        <figcaption>Figure 1: Video Relighting</figcaption>
    </figure>
</center>

$F^{(1)}_{RGB}$, $F^{(1)}_{\alpha}$, and $\gamma^{-1}$ are optimized with the following loss objectives. These are the general definitions; each task module adapts them slightly.

**Overall Loss Objective:** $L_{final} = \lambda_{rec} L_{rec} + \lambda_{warp} L_{warp}$ (1)

**Reconstruction Loss:** $L_{rec} = \sum_t ||X_t - \hat{X}_t||_1 + ||\phi (X_t) - \phi (\hat{X}_t)||_1$ (2)

**Optical Flow Warp Loss:** $L_{warp} = \sum_t || F_{t-1 \rightarrow t} (X_{t-1}^o) - X_{t}^o ||$ (3)
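The way these terms combine can be sketched on toy data. The example below is a simplified illustration in plain Python, not the project's implementation: it omits the perceptual term $\phi$ (which needs a pretrained feature network), uses an L1 norm for the warp term, warps with integer displacements and border clamping, and picks hypothetical values for $\lambda_{rec}$ and $\lambda_{warp}$.

```python
def l1(a, b):
    """Sum of absolute differences between two equally sized 2D frames."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def warp(frame, flow):
    """Sample `frame` at (x + dx, y + dy) for every pixel, clamping at borders.

    Assumption: `flow[y][x]` holds integer (dx, dy) sampling offsets.
    """
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(x + dx, 0), w - 1)
            sy = min(max(y + dy, 0), h - 1)
            row.append(frame[sy][sx])
        out.append(row)
    return out


def final_loss(inputs, recons, outputs, flows, lambda_rec=1.0, lambda_warp=0.1):
    """Weighted sum of reconstruction and warp terms (hypothetical weights)."""
    rec = sum(l1(x, x_hat) for x, x_hat in zip(inputs, recons))
    # flows[t - 1] aligns outputs[t - 1] with outputs[t].
    warp_term = sum(
        l1(warp(outputs[t - 1], flows[t - 1]), outputs[t])
        for t in range(1, len(outputs))
    )
    return lambda_rec * rec + lambda_warp * warp_term


# Two 1x3 frames; the content shifts left by one pixel between them.
inputs = [[[1, 2, 3]], [[2, 3, 3]]]
recons = [[[1, 2, 3]], [[2, 3, 4]]]      # one pixel is off by 1
outputs = inputs
flows = [[[(1, 0), (1, 0), (1, 0)]]]

loss = final_loss(inputs, recons, outputs, flows)
print(loss)
```

Here the warp term is zero because the warped first frame matches the second exactly, so the total comes only from the single reconstruction error.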



The relit video is reconstructed with the following equation.

$X_t^{out} = A_t * (X_t^{in})^{\gamma}$,  $\forall t \in (1,T] $ (4)


For the VDP framework, the authors rewrite Eq. 4 as follows:

$\log(X_t^{in}) = \gamma^{-1}(\log(1/A_t)+\log(X_t^{out}))$, $\forall t \in (1,T] $ (5)
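Eq. 5 follows from Eq. 4 by taking logarithms of both sides and solving for $\log(X_t^{in})$. The short check below confirms numerically, for a single pixel, that the two forms agree. The values of `a` and `gamma` are arbitrary illustrative numbers, not values from the paper.

```python
import math


def relight(x_in, a, gamma):
    """Eq. 4 for one pixel: x_out = a * x_in ** gamma."""
    return a * x_in ** gamma


def log_input_from_output(x_out, a, gamma):
    """Eq. 5 for one pixel: log(x_in) = (log(1 / a) + log(x_out)) / gamma."""
    return (math.log(1.0 / a) + math.log(x_out)) / gamma


x_in, a, gamma = 0.25, 2.0, 0.5            # arbitrary illustrative values
x_out = relight(x_in, a, gamma)            # 2.0 * sqrt(0.25) = 1.0
recovered = math.exp(log_input_from_output(x_out, a, gamma))
print(x_out, recovered)
```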




The relighting task is evaluated on the SDSD dataset, which provides each video in both a dark and a relit version. SSIM and PSNR are used for quantitative evaluation.
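For reference, PSNR is defined as $10 \log_{10}(MAX^2 / MSE)$. The following is a minimal sketch in plain Python for a single grayscale frame with pixel values in $[0, 1]$; the `psnr` helper is a hypothetical name, and the project's actual evaluation code may rely on a metrics library instead.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D frames."""
    squared_error = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        for r, e in zip(ref_row, est_row):
            squared_error += (r - e) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf                    # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[0.0, 1.0]]
estimate = [[0.1, 0.9]]
value = psnr(reference, estimate)          # MSE = 0.01, so about 20 dB
print(value)
```

Higher values mean the relit output is closer to the ground-truth frame.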

## Unsupervised Video Object Segmentation

## Video Dehazing


## Conclusion




In [1]:
from video_dip.models.modules.segmentation import SegmentationVDPModule
model = SegmentationVDPModule.load_from_checkpoint("/home/alpfischer/METU-Courses/VideoDIP/video_dip_segmentation/0dsjbs02/checkpoints/epoch=6-step=560.ckpt")
model.eval()






SegmentationVDPModule(
  (rgb_net): UNet(
    (encoder): Sequential(
      (0): Sequential(
        (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): LeakyReLU(negative_slope=0.2)
      )
      (1): Sequential(
        (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): LeakyReLU(negative_slope=0.2)
      )
      (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (3): Sequential(
        (0): Conv2d(64, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): LeakyReLU(negative_slope=0.2)
      )
      (4): Sequential(
        (0): Conv2d(96, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
     

In [10]:
from video_dip.data.datamodule import VideoDIPDataModule
from video_dip.models.optical_flow.raft import RAFT, RAFTModelSize

data_module = VideoDIPDataModule(
    input_path="datasets/input/blackswan", 
    target_path="datasets/GT/blackswan",
    flow_model=RAFT(RAFTModelSize.LARGE),
    flow_path="datasets/input/blackswan_flow",
    batch_size=2, 
    num_workers=8
)
data_module.setup()

100%|██████████| 49/49 [00:11<00:00,  4.43it/s]


In [11]:

val_dataloader = data_module.val_dataloader()
val_dataloader

<torch.utils.data.dataloader.DataLoader at 0x7ff4dc7ac040>

In [12]:
next(iter(val_dataloader))

{'flow': tensor([[[[ -8.4796,  -8.4772,  -8.4681,  ...,  -8.2424,  -8.2456,  -8.2439],
           [ -8.4801,  -8.4803,  -8.4708,  ...,  -8.2435,  -8.2433,  -8.2443],
           [ -8.4708,  -8.4735,  -8.4692,  ...,  -8.2423,  -8.2421,  -8.2438],
           ...,
           [ -1.0452,  -1.0462,  -1.0372,  ..., -13.7213, -13.6594, -13.5899],
           [ -1.0232,  -1.0244,  -1.0187,  ..., -13.7206, -13.6522, -13.5704],
           [ -0.9952,  -0.9988,  -0.9951,  ..., -13.7151, -13.6375, -13.5460]],
 
          [[ -0.9110,  -0.9079,  -0.9097,  ...,  -0.9562,  -0.9684,  -0.9751],
           [ -0.9150,  -0.9101,  -0.9111,  ...,  -0.9581,  -0.9669,  -0.9745],
           [ -0.9181,  -0.9116,  -0.9116,  ...,  -0.9608,  -0.9675,  -0.9736],
           ...,
           [  0.7273,   0.7126,   0.7090,  ...,   7.0085,   6.8805,   6.7654],
           [  0.7456,   0.7349,   0.7285,  ...,   7.0139,   6.8489,   6.7055],
           [  0.7647,   0.7598,   0.7563,  ...,   7.0242,   6.8311,   6.6463]]],
 
 
   

In [None]:
#https://drive.google.com/drive/folders/1aXPEo17npP45v2Fv2TBCRtfdnjb-AqYt?usp=sharing

In [14]:
output = model(next(iter(val_dataloader))["input"].to(model.device))

In [18]:
output["rgb"][0].shape

torch.Size([2, 3, 480, 856])