# Background

## Change of Variable Theorem

$z \sim \pi(z)$ # given a random variable $z$ and its known pdf

$z = f^{-1}(x)$     $x = f(z)$ # new random variable defined through a one-to-one mapping function

$p(x) = \pi(z)\left|\frac{dz}{dx}\right| = \pi(f^{-1}(x))\left|(f^{-1})'(x)\right|$ # single-variable case

$p(X) = \pi(Z)\left|\det\left(\frac{dZ}{dX}\right)\right|$ # multivariable case; the trailing Jacobian matrix carries the information about how each variable changes
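
As a quick sanity check of the single-variable formula, the sketch below uses a hypothetical affine map $x = 2z + 1$ with a standard normal base density (both are illustrative assumptions, not taken from the paper) and confirms numerically that the resulting $p(x)$ still integrates to 1. It is plain Python so that it runs without PyTorch.

```python
import math

def standard_normal_pdf(z):
    """Base density pi(z): standard normal (assumed for illustration)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# Hypothetical bijection: x = f(z) = 2z + 1, so z = f^{-1}(x) = (x - 1) / 2.
def f_inverse(x):
    return (x - 1.0) / 2.0

def f_inverse_derivative(x):
    return 0.5  # d f^{-1} / dx is constant for an affine map

def p_x(x):
    """p(x) = pi(f^{-1}(x)) * |(f^{-1})'(x)|"""
    return standard_normal_pdf(f_inverse(x)) * abs(f_inverse_derivative(x))

def integrate(density, lo, hi, steps=20000):
    """Midpoint-rule integral, used only to check normalization."""
    width = (hi - lo) / steps
    return sum(density(lo + (k + 0.5) * width) for k in range(steps)) * width

total_mass = integrate(p_x, -20.0, 22.0)
print(total_mass)   # should be very close to 1
```

Without the $|(f^{-1})'(x)|$ factor the integral would come out as 2 instead of 1, which is exactly the stretching that the Jacobian term corrects for.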



## Normalizing Flows

<img src="https://user-images.githubusercontent.com/66329748/160296144-d6d6a1ba-64da-48f9-8211-8b86618d8f01.png" width="500"/>

Unlike a VAE, the encoder and decoder are not separate functions; they are f(x) and its inverse

Once the distribution of z is known, p(x) can be computed as well

<img src="https://user-images.githubusercontent.com/66329748/160296149-b4677b26-659a-48be-b6e1-2d5bd61043c6.jpg" width="500"/>

*The expression is derived in the same way as in the Change of Variable Theorem*

<img src="https://user-images.githubusercontent.com/66329748/160296148-d52fed58-8ae1-4c16-bfd0-3b6781645176.jpg" width="500"/>

Steps of the derivation

Step (1) → (2): $\frac{df_{i}^{-1}}{dz_i} = \frac{dz_{i-1}}{dz_i} = \left(\frac{dz_i}{dz_{i-1}}\right)^{-1} = \left(\frac{df_i}{dz_{i-1}}\right)^{-1}$

Step (2) → (3): $\det(M)\det(M^{-1}) = \det(I) = 1 \quad \therefore \det(M^{-1}) = \frac{1}{\det(M)} = \det(M)^{-1}$

<img src="https://user-images.githubusercontent.com/66329748/160296147-69a68cf5-5ad8-4431-90e6-2bd3314c5f8a.jpg" width="500"/>

Applying this step repeatedly yields the final expression

$\therefore$ The determinant of the Jacobian of f must be easy to compute, and f must be invertible.
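
The repeated application can be sketched as a loop that walks back from $x = z_K$ to $z_0$ and subtracts one log-determinant per layer. The `AffineFlow` class, the standard normal base density, and the layer parameters below are hypothetical choices made only to keep the example checkable by hand.

```python
import math

def log_standard_normal(z):
    """log pi(z_0) for a standard normal base density (an assumption)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

class AffineFlow:
    """Hypothetical 1-D layer z_i = a * z_{i-1} + b with a != 0."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def forward(self, z):
        return self.a * z + self.b

    def inverse(self, z):
        return (z - self.b) / self.a

    def log_abs_det(self, z_prev):
        # log |d f_i / d z_{i-1}|, constant for an affine map
        return math.log(abs(self.a))

def log_prob(x, flows):
    """log p(x) = log pi(z_0) - sum_i log |det d f_i / d z_{i-1}|"""
    z = x
    total_log_det = 0.0
    for flow in reversed(flows):          # undo the last layer first
        z = flow.inverse(z)               # z is now z_{i-1}
        total_log_det += flow.log_abs_det(z)
    return log_standard_normal(z) - total_log_det

flows = [AffineFlow(2.0, 1.0), AffineFlow(-0.5, 3.0)]
x = 1.7
flow_log_prob = log_prob(x, flows)
print(flow_log_prob)
```

The two layers compose to $x = -z_0 + 2.5$, so the result can be compared against the log-density of a normal distribution with mean 2.5 and unit variance.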

# NICE: Non-linear Independent Components Estimation Review

## Abstract

NICE ← deep learning framework modeling complex high-dimensional densities

good representation ← the data has a distribution that is easy to model

learn a non-linear deterministic transformation that maps the data to a latent space → while satisfying the two conditions above

criterion ← exact log-likelihood. tractable!

## Introduction

a good representation ← the distribution of the data is easy to model

find a transformation *h = f(x)*

resulting distribution(new pdf) factorizes 

<img src="https://user-images.githubusercontent.com/66329748/160296143-99aee7b4-a74f-4aae-9e99-22b7dd8b5352.png" width="300"/>

By the Change of Variable Theorem

<img src="https://user-images.githubusercontent.com/66329748/160296140-2674d40c-1919-43ef-9e0b-5fe4343810fc.png" width="300"/>

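Why this formula differs from the one in the Background section: there the map was written as x = f(z) (with z in the role of h), so the function simply points in the opposite direction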

> In this paper, we choose f such that the determinant of the Jacobian is trivially obtained. Moreover, its inverse $f^{-1}$ is also trivially obtained

core idea ← split x into two blocks (x1, x2) and apply as building block a transformation from (x1, x2) to (y1, y2). ***Coupling Layer***

<img src="https://user-images.githubusercontent.com/66329748/160296138-f475ed1a-7f78-420c-962b-52f45c46e720.png" width="300"/>

m is an arbitrarily complex function (a ReLU MLP in the paper's experiments) → a neural network can be used for it
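
A minimal sketch of why $m$ can be arbitrary: the inverse of the coupling layer only ever evaluates $m$, it never has to invert it. The function `m` below is a made-up, non-invertible stand-in for the MLP, and lists of floats stand in for tensors.

```python
import math

def m(x1):
    """Hypothetical coupling function; it does not need to be invertible."""
    return [math.tanh(3.0 * v) + v * v for v in x1]

def coupling_forward(x1, x2):
    # y1 = x1, y2 = x2 + m(x1)
    return list(x1), [b + shift for b, shift in zip(x2, m(x1))]

def coupling_inverse(y1, y2):
    # x1 = y1, x2 = y2 - m(y1); m is only evaluated, never inverted
    return list(y1), [b - shift for b, shift in zip(y2, m(y1))]

x1, x2 = [0.3, -1.2], [2.0, 0.5]
y1, y2 = coupling_forward(x1, x2)
x1_rec, x2_rec = coupling_inverse(y1, y2)
print(x1_rec, x2_rec)   # matches (x1, x2) up to floating-point rounding
```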

## Learning Bijective Transformations of Continuous Probabilities

To train by maximum likelihood

<img src="https://user-images.githubusercontent.com/66329748/160296132-0d403c69-34f3-46df-ab2b-5b4aefc0d7c3.png" width="500"/>

The expression is rewritten using the first equation from the Introduction

<img src="https://user-images.githubusercontent.com/66329748/160296128-b66b3844-ec74-4d1f-90a9-cbaf5d818d9a.png" width="500"/>

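NICE → learns an invertible preprocessing transform of the dataset → an invertible preprocessing could increase likelihood arbitrarily just by contracting the data, and the Jacobian determinant term of the change of variables formula counteracts this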

<img src="https://user-images.githubusercontent.com/66329748/160296137-8e5a095f-9662-4557-8949-221fde59a494.png" width="500"/>

Computational graph of a coupling layer

## Architecture

### Triangular Structure

build a neural network with triangular weight matrices and bijective activation functions 

→ limiting design choices to depth and selection of non-linearities 

→ consider a family of functions with triangular Jacobian

→ determinant of the Jacobian is also made easy to compute
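
The reason a triangular Jacobian helps is that its determinant is just the product of the diagonal entries. The small check below compares that product against a general (and much slower) Laplace expansion on an arbitrary, made-up 3×3 lower-triangular matrix.

```python
def determinant(matrix):
    """Determinant by Laplace expansion along the first row (tiny matrices only)."""
    n = len(matrix)
    if n == 1:
        return matrix[0][0]
    total = 0.0
    for col in range(n):
        # Minor: drop the first row and the current column
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * matrix[0][col] * determinant(minor)
    return total

# Illustrative lower-triangular matrix (values chosen arbitrarily)
lower_triangular = [
    [2.0, 0.0, 0.0],
    [5.0, -1.0, 0.0],
    [7.0, 4.0, 0.5],
]

full_det = determinant(lower_triangular)

diag_product = 1.0
for i in range(len(lower_triangular)):
    diag_product *= lower_triangular[i][i]

print(full_det, diag_product)   # both are -1.0
```

The diagonal product costs O(n) operations, while a general determinant costs roughly O(n^3), which matters when n is the data dimension.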

### Coupling Layer

<img src="https://user-images.githubusercontent.com/66329748/160296136-2450901c-d83f-47d0-a33d-af8b7864cf05.png" width="500"/>
<br>
<img src="https://user-images.githubusercontent.com/66329748/160296135-14abe03c-9c62-4c2d-983a-b8294d7baa7f.png" width="500"/>
<br>
<img src="https://user-images.githubusercontent.com/66329748/160296133-9edbe5fb-9f8d-40d0-9063-dff0b461e618.png" width="500"/>

there is no restriction placed on the choice of coupling function $m$

Choosing g to be a simple additive coupling law makes the computation very easy → the Jacobian determinant becomes exactly 1
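
The unit determinant can be checked numerically on a two-dimensional toy case. The coupling function `m` is again a hypothetical stand-in, and the Jacobian is approximated with central finite differences rather than derived analytically.

```python
import math

def m(x1):
    return math.sin(2.0 * x1) + x1 ** 3   # hypothetical coupling function

def additive_coupling(x):
    x1, x2 = x
    return [x1, x2 + m(x1)]               # y1 = x1, y2 = x2 + m(x1)

def numerical_jacobian(func, x, eps=1e-6):
    """Central finite differences; jac[i][j] approximates d y_i / d x_j."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus, minus = list(x), list(x)
        plus[j] += eps
        minus[j] -= eps
        y_plus, y_minus = func(plus), func(minus)
        for i in range(n):
            jac[i][j] = (y_plus[i] - y_minus[i]) / (2.0 * eps)
    return jac

jac = numerical_jacobian(additive_coupling, [0.7, -0.4])
det = jac[0][0] * jac[1][1] - jac[0][1] * jac[1][0]
print(jac)
print(det)   # close to 1 regardless of how steep m is
```

The Jacobian is lower triangular with ones on the diagonal; the derivative of `m` only shows up in the off-diagonal entry, which never enters the determinant.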

### Allowing Rescaling

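Gives some dimensions more weight and others less. Because every additive coupling layer has unit Jacobian determinant (it is volume preserving), their composition does too, so the paper adds a diagonal scaling matrix as the top layer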
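
A rough sketch of the diagonal scaling step, assuming the scale is parameterized through its logarithm $s_d$ so that $z_d = e^{s_d} h_d$ and the log-determinant is simply $\sum_d s_d$. The input values and log-scales below are invented for illustration.

```python
import math

def scale_forward(h, log_scale):
    """z_d = exp(s_d) * h_d; log|det| of the diagonal Jacobian is sum_d s_d."""
    z = [math.exp(s) * v for s, v in zip(log_scale, h)]
    return z, sum(log_scale)

def scale_inverse(z, log_scale):
    """h_d = exp(-s_d) * z_d; the log-determinant flips sign."""
    h = [math.exp(-s) * v for s, v in zip(log_scale, z)]
    return h, -sum(log_scale)

h = [0.5, -1.0, 2.0]            # hypothetical output of the coupling stack
log_scale = [0.3, -0.7, 0.1]    # hypothetical learned log-scales s_d

z, log_det = scale_forward(h, log_scale)
h_rec, inv_log_det = scale_inverse(z, log_scale)
print(z, log_det)
```

A positive $s_d$ stretches dimension $d$ and a negative one contracts it, which is how the model can assign different importance to different latent dimensions.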

# NICE Implementation

## import

In [None]:
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import transforms, datasets
from torch.distributions import Distribution, Uniform

## configs

In [None]:
cfg = {
  'MODEL_SAVE_PATH': './saved_models/',

  'USE_CUDA': torch.cuda.is_available(),  # fall back to CPU when no GPU is present

  'TRAIN_BATCH_SIZE': 256,

  'TRAIN_EPOCHS': 75,

  'NUM_COUPLING_LAYERS': 4,

  'NUM_NET_LAYERS': 6,  # neural net layers for each coupling layer

  'NUM_HIDDEN_UNITS': 1000
}

## modules

In [None]:
class CouplingLayer(nn.Module):
  """
  Implementation of the additive coupling layer from section 3.2 of the NICE
  paper.
  """

  def __init__(self, data_dim, hidden_dim, mask, num_layers=4):
    super().__init__()

    assert data_dim % 2 == 0

    self.mask = mask

    modules = [nn.Linear(data_dim, hidden_dim), nn.LeakyReLU(0.2)]
    for i in range(num_layers - 2):
      modules.append(nn.Linear(hidden_dim, hidden_dim))
      modules.append(nn.LeakyReLU(0.2))
    modules.append(nn.Linear(hidden_dim, data_dim))

    self.m = nn.Sequential(*modules)

  def forward(self, x, logdet, invert=False):
    if not invert:
      x1, x2 = self.mask * x, (1. - self.mask) * x
      y1, y2 = x1, x2 + (self.m(x1) * (1. - self.mask))
      return y1 + y2, logdet

    # Inverse additive coupling layer
    y1, y2 = self.mask * x, (1. - self.mask) * x
    x1, x2 = y1, y2 - (self.m(y1) * (1. - self.mask))
    return x1 + x2, logdet


class ScalingLayer(nn.Module):
  """
  Implementation of the scaling layer from section 3.3 of the NICE paper.
  """
  def __init__(self, data_dim):
    super().__init__()
    self.log_scale_vector = nn.Parameter(torch.randn(1, data_dim, requires_grad=True))

  def forward(self, x, logdet, invert=False):
    log_det_jacobian = torch.sum(self.log_scale_vector)

    if invert:
      return torch.exp(-self.log_scale_vector) * x, logdet - log_det_jacobian

    return torch.exp(self.log_scale_vector) * x, logdet + log_det_jacobian


class LogisticDistribution(Distribution):
  def __init__(self):
    # No arg_constraints are defined, so disable argument validation explicitly
    super().__init__(validate_args=False)

  def log_prob(self, x):
    return -(F.softplus(x) + F.softplus(-x))

  def sample(self, size):
    if cfg['USE_CUDA']:
      z = Uniform(torch.cuda.FloatTensor([0.]), torch.cuda.FloatTensor([1.])).sample(size)
    else:
      z = Uniform(torch.FloatTensor([0.]), torch.FloatTensor([1.])).sample(size)

    return torch.log(z) - torch.log(1. - z)
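
The two methods of `LogisticDistribution` can be sanity-checked outside PyTorch. The sketch below, in plain Python with hypothetical helper names, confirms that `-(softplus(x) + softplus(-x))` equals the log of the standard logistic density $\sigma(x)(1 - \sigma(x))$, and that `log(u) - log(1 - u)` with uniform `u` is inverse-CDF sampling from that distribution.

```python
import math
import random

def softplus(v):
    """Numerically stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def logistic_log_prob(x):
    return -(softplus(x) + softplus(-x))

def sigmoid(x):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_sample(rng):
    u = rng.random()
    while u == 0.0:              # avoid log(0)
        u = rng.random()
    return math.log(u) - math.log(1.0 - u)   # logit(u), the inverse CDF

rng = random.Random(0)
samples = [logistic_sample(rng) for _ in range(20000)]
empirical_cdf_at_1 = sum(s <= 1.0 for s in samples) / len(samples)

print(logistic_log_prob(0.5), math.log(sigmoid(0.5) * (1.0 - sigmoid(0.5))))
print(empirical_cdf_at_1, sigmoid(1.0))
```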


## model

In [None]:
class NICE(nn.Module):
  def __init__(self, data_dim, num_coupling_layers=3):
    super().__init__()

    self.data_dim = data_dim

    # alternating mask orientations for consecutive coupling layers
    masks = [self._get_mask(data_dim, orientation=(i % 2 == 0))
                                            for i in range(num_coupling_layers)]

    self.coupling_layers = nn.ModuleList([CouplingLayer(data_dim=data_dim,
                                hidden_dim=cfg['NUM_HIDDEN_UNITS'],
                                mask=masks[i], num_layers=cfg['NUM_NET_LAYERS'])
                              for i in range(num_coupling_layers)])

    self.scaling_layer = ScalingLayer(data_dim=data_dim)

    self.prior = LogisticDistribution()

  def forward(self, x, invert=False):
    if not invert:
      z, log_det_jacobian = self.f(x)
      log_likelihood = torch.sum(self.prior.log_prob(z), dim=1) + log_det_jacobian
      return z, log_likelihood

    return self.f_inverse(x)

  def f(self, x):
    z = x
    log_det_jacobian = 0
    for i, coupling_layer in enumerate(self.coupling_layers):
      z, log_det_jacobian = coupling_layer(z, log_det_jacobian)
    z, log_det_jacobian = self.scaling_layer(z, log_det_jacobian)
    return z, log_det_jacobian

  def f_inverse(self, z):
    x = z
    x, _ = self.scaling_layer(x, 0, invert=True)
    for i, coupling_layer in reversed(list(enumerate(self.coupling_layers))):
      x, _ = coupling_layer(x, 0, invert=True)
    return x

  def sample(self, num_samples):
    z = self.prior.sample([num_samples, self.data_dim]).view(num_samples, self.data_dim)
    return self.f_inverse(z)

  def _get_mask(self, dim, orientation=True):
    mask = np.zeros(dim)
    mask[::2] = 1.
    if orientation:
      mask = 1. - mask     # flip mask orientation
    mask = torch.tensor(mask)
    if cfg['USE_CUDA']:
      mask = mask.cuda()
    return mask.float()
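
To see what the alternating masks look like, here is a plain-Python re-statement of the `_get_mask` logic on a 6-dimensional toy input (the function name `get_mask` and the small dimension are only for illustration). Entries equal to 1 mark the dimensions that pass through unchanged and feed `m`; entries equal to 0 mark the dimensions that get shifted.

```python
def get_mask(dim, orientation=True):
    # 1.0 at even indices, 0.0 at odd indices
    mask = [1.0 if i % 2 == 0 else 0.0 for i in range(dim)]
    if orientation:
        mask = [1.0 - v for v in mask]   # flip mask orientation
    return mask

masks = [get_mask(6, orientation=(i % 2 == 0)) for i in range(4)]
for layer_index, mask in enumerate(masks):
    print(layer_index, mask)
# 0 [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
# 1 [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
# 2 [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
# 3 [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```

Consecutive layers use complementary masks, so every dimension is modified by at least one of any two neighboring coupling layers.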

## train

In [None]:
# Data
transform = transforms.ToTensor()
dataset = datasets.MNIST(root='./data/mnist', train=True, transform=transform, download=True)
dataloader = torch.utils.data.DataLoader(dataset=dataset, batch_size=cfg['TRAIN_BATCH_SIZE'],
                                         shuffle=True, pin_memory=True)

model = NICE(data_dim=784, num_coupling_layers=cfg['NUM_COUPLING_LAYERS'])
if cfg['USE_CUDA']:
  device = torch.device('cuda')
  model = model.to(device)

# Train the model
model.train()

opt = optim.Adam(model.parameters())

for i in range(cfg['TRAIN_EPOCHS']):
  mean_likelihood = 0.0
  num_minibatches = 0

  for batch_id, (x, _) in enumerate(dataloader):
      x = x.view(-1, 784)
      x = x + torch.rand_like(x) / 256.   # independent dequantization noise per sample and pixel
      if cfg['USE_CUDA']:
        x = x.cuda()

      x = torch.clamp(x, 0, 1)

      z, likelihood = model(x)
      loss = -torch.mean(likelihood)   # NLL

      opt.zero_grad()
      loss.backward()
      opt.step()

      mean_likelihood -= loss.item()   # accumulate a plain float, not a tensor
      num_minibatches += 1

  mean_likelihood /= num_minibatches
  print('Epoch {} completed. Log Likelihood: {}'.format(i, mean_likelihood))

  # if i % 5 == 0:
    # save_path = os.path.join(cfg['MODEL_SAVE_PATH'], '{}.pt'.format(i))
    # torch.save(model.state_dict(), save_path)
    # print(model.state_dict())


Epoch 0 completed. Log Likelihood: -1046.8690185546875
Epoch 1 completed. Log Likelihood: -854.7686767578125
Epoch 2 completed. Log Likelihood: -672.8143920898438
Epoch 3 completed. Log Likelihood: -494.6536560058594
Epoch 4 completed. Log Likelihood: -320.71441650390625
Epoch 5 completed. Log Likelihood: -151.33949279785156
Epoch 6 completed. Log Likelihood: 12.598835945129395
Epoch 7 completed. Log Likelihood: 170.87037658691406
Epoch 8 completed. Log Likelihood: 322.28997802734375
Epoch 9 completed. Log Likelihood: 466.75970458984375
Epoch 10 completed. Log Likelihood: 603.1981201171875
Epoch 11 completed. Log Likelihood: 731.8975830078125
Epoch 12 completed. Log Likelihood: 852.1170043945312
Epoch 13 completed. Log Likelihood: 964.4251098632812
Epoch 14 completed. Log Likelihood: 1068.567138671875
Epoch 15 completed. Log Likelihood: 1165.2664794921875
Epoch 16 completed. Log Likelihood: 1254.1492919921875
Epoch 17 completed. Log Likelihood: 1335.5655517578125
Epoch 18 completed. Lo

### In the paper, the log-likelihood on the MNIST dataset is 1980.50