# Invertible Networks

The invertnn library contains many implementations of invertible or bijective layer types & activations for the construction of Invertible Neural Networks (INNs). For a reference of the available classes, see

 * [Invertible Transformations](../api_ref.html#module-invertnn.pytorch.invertible_transforms)
 * [Orthogonal Transformation](../api_ref.html#module-invertnn.pytorch.orthogonal_transform)

## Why invert Neural Networks?

There are several uses for invertible Neural Networks. The most important are in the field of Probabilistic Modeling, Variational Inference and Generative Adversarial Networks (GANs).

In the context of GANs, but also in Variational Autoencoders (VAEs), a part of the model consists of a Generator Network $G$, which is a deterministic function which maps (in the most common form, without loss of generality) a random vector $Z \in \mathbb{R}^n$ to a transformed random vector $X \in \mathbb{R}^m$.

The random vector $Z$ usually follows a known parametric distribution $P_z$, i.e. $Z \sim P_z$, such as a multivariate normal with diagonal covariance matrix. The parameters of this distribution ($\theta_z$) can in turn be a function of some input (for example in VAEs or conditional models). What is important here is that the distribution of $Z$, and specifically the logarithm of its probability density function $\log{p(z)}$, can be calculated efficiently.

If the function $G$ were smooth and invertible, we could apply the so-called **Change of Variables** technique (see the section below, or the [Wikipedia Article](https://en.wikipedia.org/wiki/Probability_density_function#Dependent_variables_and_change_of_variables)) to calculate the density of $G(Z)$ from the density of $Z$ and the determinant of the Jacobian of the inverse mapping.

Therefore, if we can construct and train invertible neural networks while retaining their universal function approximator property, we can approximate arbitrary probability densities. Not only that, but we can also sample from them and reconstruct the latent code (e.g. $Z$) which was used to generate any sample $X$.

With such a network we can do both of the following. First, we can apply techniques from Variational Inference, such as minimizing the divergence between the empirical distribution of the reconstructed latent code $\hat{Z} = G^{-1}(X)$ and the prior: if we assume $\hat{Z} \sim Q_z$, we can attempt to minimize, for example, the KL divergence $\mathrm{KL}(Q_z \,\|\, P_z)$.

Second, we can apply methods from Generative Adversarial Learning, using Discriminator Functions or Critics to train the Generator Network $G$ to generate samples $\hat{X} = G(Z)$ such that both of the following hold:

 * The distribution of $\hat{X}$ is indistinguishable from the distribution of observed training examples $X$
 * The empirical distribution of the reconstructed latent code $\hat{Z}$ is indistinguishable from the distribution $P_z$
 

### Change of Variables Technique

The following paragraph is a short extract from the [Wikipedia article on probability density functions](https://en.wikipedia.org/wiki/Probability_density_function).

If the probability density function of a random variable $X$ is given as $f_X(x)$, it is possible (but often not necessary; see below) to calculate the probability density function of some variable $Y = g(X)$. This is also called a “change of variable” and is in practice used to generate a random variable of arbitrary shape $f_{g(X)} = f_Y$  using a known (for instance, uniform) random number generator.

If the function $g$ is monotonic, then the resulting density function is

$$
f_Y(y) = \left| \frac{d}{dy} \big(g^{-1}(y)\big) \right| \cdot f_X\big(g^{-1}(y)\big)
$$

Here $g^{-1}$ denotes the inverse function.

This follows from the fact that the probability contained in a differential area must be invariant under change of variables. That is,

$$
\left| f_Y(y)\, dy \right| = \left| f_X(x)\, dx \right|
$$
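As a small numeric illustration of the formula above, the following sketch uses $g(x) = e^x$ applied to a standard normal $X$ (this particular choice of $g$ and the helper names are assumptions made for the example, not part of invertnn). The resulting density should integrate to one, which the crude midpoint rule at the end checks.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density f_X of a univariate normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def g_inv(y):
    """Inverse of the monotonic transform g(x) = exp(x); defined for y > 0."""
    return math.log(y)


def g_inv_derivative(y):
    """d/dy g^{-1}(y) = 1 / y."""
    return 1.0 / y


def density_of_y(y):
    """f_Y(y) = |d/dy g^{-1}(y)| * f_X(g^{-1}(y))."""
    return abs(g_inv_derivative(y)) * normal_pdf(g_inv(y))


def integrate(f, lo, hi, steps=200_000):
    """Midpoint rule, used only as a sanity check."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h


# Probability mass of Y = exp(X) on (0, 200]; should be very close to 1
total_mass = integrate(density_of_y, 1e-9, 200.0)
print(total_mass)
```

At $y = 1$ the inverse is $g^{-1}(1) = 0$ and the derivative factor is 1, so `density_of_y(1.0)` equals the standard normal density at zero.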

This technique can also be applied in a chained manner. In a layered neural network, for example, each layer can calculate the determinant of the Jacobian of its inverse function, and the determinants of all layers are multiplied (or, equivalently, their logarithms are summed).
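As a sketch of the layered case (the layer names $g_k$ and intermediate values $h_k$ are notation introduced here for illustration, not taken from the library): let $G = g_K \circ \dots \circ g_1$ with $h_0 = z$, $h_k = g_k(h_{k-1})$ and $x = h_K$. In the multivariate case the derivative is replaced by the Jacobian $J$, and the chain rule applied to $G^{-1} = g_1^{-1} \circ \dots \circ g_K^{-1}$ gives

$$
\log p_X(x) = \log p_Z\big(G^{-1}(x)\big) + \sum_{k=1}^{K} \log \left| \det J_{g_k^{-1}}(h_k) \right|
$$

so each layer only has to contribute the log of the absolute Jacobian determinant of its own inverse, evaluated at its own output.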

This leads us to the following question:

## Which kinds of Neural Networks are invertible?

For practical purposes, we require the inversion to be tractable and efficient. It must be easy to calculate gradients with respect to the network parameters in both directions, and to calculate the log of the absolute value of the determinant of the Jacobian in the inverse direction. **It should be approximately equally efficient to forward and backward propagate through the network in either direction.**

Obviously, compositions of invertible functions are invertible themselves. Therefore, we can tackle this problem layerwise. 

Within invertnn, we require invertible layers to conform to a well defined **interface**. They need to be subclasses of **torch.nn.Module** and also derive from the following (partially) abstract base class:

```python
import abc


class InvertibleModule(object, metaclass=abc.ABCMeta):

    @abc.abstractmethod
    def invert(self, output):
        """Calculate the inverted transformation of forward"""
        pass

    @abc.abstractmethod
    def inv_jacobian_logabsdet(self, output):
        """
        Log of the absolute value of the determinant of the Jacobian of
        the inverted transform at output

        returns: inverted output, log of absolute value of the Jacobian determinant
        """
        pass

    def inverted_module(self):
        '''Return a module that's the inversion of this module'''
        # InvertedModule is a wrapper class from invertnn (not shown here)
        return InvertedModule(self)

```

To make this a bit simpler, invertnn provides mixin classes that add default functionality, either for layers which do not change the density (i.e. they are "volume preserving"):

```python
import torch


class InvertibleVolumePreservingMixin(object):

    def inv_jacobian_logabsdet(self, output):
        """
        Log of the absolute value of the determinant of the Jacobian of
        the inverted transform at output
        """
        # |det J| = 1 for a volume-preserving map, so its logarithm is 0
        return self.invert(output), torch.zeros((output.shape[0], 1), device=output.device)
```

or for those layer types which apply a purely componentwise transformation. In that case the Jacobian is diagonal, so it is sufficient to work with the elementwise gradient instead of the full Jacobian:

```python

import torch
from torch import autograd


class InvertibleComponentwiseMixin(object):

    def inv_jacobian_logabsdet(self, output):
        """
        Log of the absolute value of the determinant of the Jacobian of
        the inverted transform at output
        """
        # Detached copy of the output that tracks gradients
        ovar = output.detach().requires_grad_(True)
        inverse = self.invert(ovar)
        # For a componentwise map, backpropagating a tensor of ones yields the Jacobian diagonal
        grad = autograd.grad([inverse], [ovar], [torch.ones_like(inverse)])
        # Sum the log absolute derivatives over all non-batch dimensions
        jacobian_log_abs_det = torch.sum(grad[0].abs().log().view(output.shape[0], -1), dim=1)
        return inverse, jacobian_log_abs_det
```

Based on this functionality, some invertible transformations can be implemented rather trivially, such as in this example:

```python
class InvertibleTanh(InvertibleComponentwiseMixin, InvertibleModule, nn.Tanh):

    def invert(self, output):
        return 0.5 * (torch.log(1+output) - torch.log(1-output))
```
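The inverse formula used above is the closed form of $\operatorname{atanh}$, whose derivative is $1 / (1 - y^2)$. The following NumPy sketch (a stand-alone illustration with hypothetical function names, not the library's code path, which uses autograd) checks the round trip and compares the analytic log-determinant of the inverse against a finite-difference estimate.

```python
import numpy as np


def tanh_inverse(y):
    # atanh(y) = 0.5 * (log(1 + y) - log(1 - y)), as in InvertibleTanh.invert
    return 0.5 * (np.log1p(y) - np.log1p(-y))


def tanh_inv_logabsdet(y):
    # d/dy atanh(y) = 1 / (1 - y^2); the Jacobian is diagonal, so the log of
    # its absolute determinant is the sum of the log derivatives per sample
    return -np.sum(np.log1p(-y * y), axis=1)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 samples, 3 features
y = np.tanh(x)                       # forward pass
x_rec = tanh_inverse(y)              # reconstructed input
logdet = tanh_inv_logabsdet(y)       # one value per sample

# Central finite differences of the inverse as an independent check
eps = 1e-6
fd = (tanh_inverse(y + eps) - tanh_inverse(y - eps)) / (2 * eps)
logdet_fd = np.sum(np.log(np.abs(fd)), axis=1)
```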

### Invertible Activation Functions

For an activation function to be invertible, it needs to be strictly monotonic. This is true, for example, of the *Sigmoid*, *Tanh* and *LeakyReLU* activation functions, but not of *ReLU*, which maps all negative inputs to zero. For the former, invertnn provides corresponding invertible activation functions in the [Invertible Transformations Module](../api_ref.html#module-invertnn.pytorch.invertible_transforms).

The following invertible Activation Functions have been implemented in invertnn:

 * [Invertible Sigmoid](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleSigmoid)
 * [Invertible Tanh](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleTanh)
 * [Invertible LeakyReLU](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleLeakyReLU)
 * [Invertible Baird Activation](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleBairdActivation) - see [Baird et al.: One-Step Neural Network Inversion with PDF Learning and Emulation](http://leemon.com/papers/2005bsi.pdf)
 
Please note that we recommend against using LeakyReLU if you want to learn probability distortions: its curvature is zero everywhere except at the origin, where it is undefined, so there is no informative gradient with respect to how to increase or decrease the density.
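To make the LeakyReLU case concrete, here is a minimal NumPy sketch (function names and the slope value are assumptions for illustration, not the invertnn implementation). It also shows why the log-determinant carries so little information: it depends only on how many components are negative.

```python
import numpy as np

NEGATIVE_SLOPE = 0.01  # assumed slope for the negative half-axis


def leaky_relu(x, slope=NEGATIVE_SLOPE):
    return np.where(x >= 0, x, slope * x)


def leaky_relu_inverse(y, slope=NEGATIVE_SLOPE):
    # Strictly monotonic, so the sign of y identifies the branch to invert
    return np.where(y >= 0, y, y / slope)


def leaky_relu_inv_logabsdet(y, slope=NEGATIVE_SLOPE):
    # Derivative of the inverse is 1 for y >= 0 and 1 / slope for y < 0
    return np.sum(np.where(y >= 0, 0.0, -np.log(slope)), axis=1)


x = np.array([[-2.0, 0.5, -1.0]])
y = leaky_relu(x)
x_rec = leaky_relu_inverse(y)
logdet = leaky_relu_inv_logabsdet(y)   # two negative entries: 2 * log(1 / slope)

# Plain ReLU is not injective: different inputs collapse to the same output
relu_collision = np.maximum(-1.0, 0.0) == np.maximum(-2.0, 0.0)
```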

### Invertible Common Layer Types

 * [Invertible Sequential](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleSequential) - the inverse here is to just apply the inverse in reversed order and sum the jacobian log abs determinant.
 * [Invertible Shuffle](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleShuffle) - An invertible shuffling operation. Useful in combination with Coupling Layers (see further below)
 * [Invertible PixelShuffle](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertiblePixelShuffle) - Invertible Variant of [PixelShuffle Operation](https://pytorch.org/docs/master/nn.html#torch.nn.PixelShuffle) - which can be used to implement Deconvolution-Like Operations. Basically it trades channel depth for resolution in one direction or another.
 * [Invertible Concat](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleConcat) - On the forward pass, this concatenates additional input. On the inverse pass, it removes that additional input and stores it in the "restored_input" property.
 * [Invertible Concat Noise](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleConcatNoise) - Like Invertible Concat, but on every forward pass it samples the input to concatenate from an instance of **torch.distributions.Distribution** and restores it on the inverse pass (subtracting the log probability of the restored input from the log abs det of the Jacobian).
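The behaviour described for Invertible Sequential (apply the inverses in reversed order and sum the log abs determinants) can be sketched with a few toy layers. The classes below are simplified NumPy stand-ins written for this example; they only mimic the `inv_jacobian_logabsdet` convention and are not the invertnn classes.

```python
import numpy as np


class Scale:
    """Elementwise scaling y = x * factor with non-zero factors."""

    def __init__(self, factor):
        self.factor = np.asarray(factor, dtype=float)

    def forward(self, x):
        return x * self.factor

    def inv_jacobian_logabsdet(self, y):
        # The inverse divides by the factors, so its log |det| is -sum(log|factor|)
        logdet = np.full(y.shape[0], -np.sum(np.log(np.abs(self.factor))))
        return y / self.factor, logdet


class Shuffle:
    """Fixed permutation of the feature axis (volume preserving)."""

    def __init__(self, permutation):
        self.perm = np.asarray(permutation)
        self.inverse_perm = np.argsort(self.perm)

    def forward(self, x):
        return x[:, self.perm]

    def inv_jacobian_logabsdet(self, y):
        return y[:, self.inverse_perm], np.zeros(y.shape[0])


class Sequential:
    def __init__(self, *layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def inv_jacobian_logabsdet(self, y):
        total = np.zeros(y.shape[0])
        # Undo the layers last-to-first and accumulate their log-determinants
        for layer in reversed(self.layers):
            y, logdet = layer.inv_jacobian_logabsdet(y)
            total = total + logdet
        return y, total


model = Sequential(Scale([2.0, 0.5, 4.0]), Shuffle([2, 0, 1]), Scale([1.0, 3.0, 1.0]))
x = np.arange(6, dtype=float).reshape(2, 3)
y = model.forward(x)
x_rec, logdet = model.inv_jacobian_logabsdet(y)
```

The product of all scale factors here is 12, so the summed log-determinant of the inverse is $-\log 12$ for every sample.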
 
### Adding & Extracting (White-) Noise 

An invertible network has to introduce or extract random noise whenever the degrees of freedom increase or decrease (e.g. if the number of neurons or, for convolutional networks, CHANNELS x WIDTH x HEIGHT changes between layers).

When injecting noise, we usually first generate randomness by repeatedly drawing independent samples from **univariate distributions** such as uniform, univariate gaussian or bernoulli distributions. 

On the backward pass, the independence assumption is very likely not to hold anymore. We cannot expect the noise we extract on the backward pass to be componentwise independent from each other. To be realistic, the noise we get back is probably highly correlated. 

If we then **assume independence** and calculate things like KL-Divergence or do maximum likelihood learning it will simply not work well, because the assumption is severely violated.

One approach to solve this is to apply a so-called **Whitening Transform**. Whitening means that we decorrelate the samples. This does not give us true statistical independence, but it gets us much closer.

More on Statistical Whitening can be found in this [excellent blog post on statistical whitening by Joe Louis Marino (archived on archive.org)](https://web.archive.org/web/20180813034201/http://joelouismarino.github.io/blog_posts/blog_whitening.html)

As can be seen from the above post, whitening can be implemented by multiplying random noise vectors with, in the case of PCA Whitening, an orthogonal and a diagonal matrix, or in the case of ZCA Whitening, an orthogonal matrix, a diagonal matrix and the transpose of the first orthogonal matrix.

Luckily, all of these are invertible operations. We just need to find the right matrices.
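A minimal NumPy sketch of both whitening variants follows. It estimates the matrices from an eigendecomposition of the sample covariance; the variable names and the synthetic correlated noise are assumptions for this example, and this is not how `AdaptiveInverseWhiteningTransformer` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated "extracted" noise: independent Gaussians pushed through a mixing matrix
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.5, 1.0, 0.0],
                   [-1.0, 0.5, 0.3]])
noise = rng.normal(size=(5000, 3)) @ mixing.T

mean = noise.mean(axis=0)
centered = noise - mean
cov = np.cov(centered, rowvar=False)
eigvals, U = np.linalg.eigh(cov)           # cov = U @ diag(eigvals) @ U.T

# PCA whitening: rotate with U.T (orthogonal), then rescale (diagonal)
pca_whiten = np.diag(eigvals ** -0.5) @ U.T
# ZCA whitening: additionally rotate back with U
zca_whiten = U @ pca_whiten

white_pca = centered @ pca_whiten.T
white_zca = centered @ zca_whiten.T

# The inverse ("coloring") transform re-introduces the correlations
zca_color = U @ np.diag(eigvals ** 0.5) @ U.T
restored = white_zca @ zca_color.T + mean

# Orthogonal factors do not change volume, so only the diagonal part matters
whiten_logabsdet = -0.5 * np.sum(np.log(eigvals))
```

Both whitened samples have (up to rounding) an identity sample covariance, and the coloring transform recovers the original noise exactly.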

In order to make it easier to inject & extract the right kind of noise from the network, invertnn provides the class [AdaptiveInverseWhiteningTransformer](../api_ref.html#invertnn.pytorch.white_noise.AdaptiveInverseWhiteningTransformer).

This class can provide a parametric **torch.distributions.Transform**, which can correlate statistical noise on the forward pass (when injecting noise) and decorrelate it on the way back (when reconstructing the original, independent noise).


### Coupling Layers

The [Invertible Coupling Layer](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleCouplingLayer) is an implementation of the so-called Coupling Layer from [Dinh et al.: Density estimation using Real NVP](https://arxiv.org/pdf/1605.08803.pdf).

This coupling layer can be applied to both spatial (i.e. 1D, 2D, 3D) and plain input. It operates on a per-channel basis. If we have **D** input channels, it couples the first **d** channels with the last **D-d** channels using arbitrary nonlinear transformation networks $s: \mathbb{R}^d \to \mathbb{R}^{D-d}$ (scale) and $t: \mathbb{R}^d \to \mathbb{R}^{D-d}$ (translation). The output consists of **d** unchanged channels and **D-d** coupled channels.
        
More on these coupling layers is to follow. For now, please refer to the paper above. 
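Until then, the following NumPy sketch shows the affine coupling transform as described in the Real NVP paper: $y_{1:d} = x_{1:d}$ and $y_{d+1:D} = x_{d+1:D} \odot \exp(s(x_{1:d})) + t(x_{1:d})$. The tiny single-layer $s$ and $t$ functions and all names are placeholders chosen for this example; the actual `InvertibleCouplingLayer` may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 5, 2                                  # total channels, unchanged channels

# Placeholder "networks": s and t may be arbitrary, they are never inverted
W_s = rng.normal(scale=0.5, size=(d, D - d))
W_t = rng.normal(scale=0.5, size=(d, D - d))


def s(x1):
    return np.tanh(x1 @ W_s)                 # log-scale, bounded for stability


def t(x1):
    return x1 @ W_t                          # translation


def coupling_forward(x):
    x1, x2 = x[:, :d], x[:, d:]
    y2 = x2 * np.exp(s(x1)) + t(x1)
    return np.concatenate([x1, y2], axis=1)


def coupling_inv_jacobian_logabsdet(y):
    y1, y2 = y[:, :d], y[:, d:]
    log_scale = s(y1)                        # y1 == x1, so s and t can be recomputed
    x2 = (y2 - t(y1)) * np.exp(-log_scale)
    x = np.concatenate([y1, x2], axis=1)
    # The inverse Jacobian is triangular with diagonal (1, ..., 1, exp(-s))
    return x, -np.sum(log_scale, axis=1)


x = rng.normal(size=(3, D))
y = coupling_forward(x)
x_rec, logdet = coupling_inv_jacobian_logabsdet(y)
```

Because only the first **d** channels feed $s$ and $t$, the Jacobian is triangular and its log-determinant is a plain sum, which is why such layers are usually interleaved with shuffles so that every channel gets transformed eventually.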

### Invertible Linear Layers (Orthogonal, Diagonal & Singular Value Composition Layers)

The following Layers are intended to be more or less Drop-In Replacements for (square) Linear Layers, as well as for Pixelwise-Convolutions with a Kernel-Size of 1.

 * [Orthogonal Transform](../api_ref.html#invertnn.pytorch.orthogonal_transform.OrthogonalTransform) - Orthogonal Transformation using parametrization as product of Householder Reflectors.
 * [Orthogonal Transform 2D](../api_ref.html#invertnn.pytorch.orthogonal_transform.OrthogonalTransform2D) - Orthogonal Spatial Transformation using parametrization as product of Householder Reflectors.
 * [Diagonal Linear Transform](../api_ref.html#invertnn.pytorch.orthogonal_transform.DiagonalLinearTransform) - Diagonal Matrix Transformation.
 * [Diagonal Linear Transform 2D](../api_ref.html#invertnn.pytorch.orthogonal_transform.DiagonalLinearTransform2D) - Diagonal Spatial Transformation.
 
These layers can also be composed (for example into an SVD-based composition) using the [InvertibleSequential](../api_ref.html#invertnn.pytorch.invertible_transforms.InvertibleSequential) layer.

These layer types are one of the core contributions of the invertnn library so far. They are mostly based on the orthogonal and SVD reparametrizations from these papers: [Mhammedi et al.: Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections](https://arxiv.org/pdf/1612.00188.pdf) and [Zhang et al.: Stabilizing Gradients for Deep Neural Networks via Efficient SVD Parameterization](https://arxiv.org/pdf/1803.09327.pdf).

#### Background for Orthogonal Layers

These layers enable invertible Neural Networks in the sense described above, without most of the limitations imposed by the Coupling Layers of the Real NVP paper. The following subsections provide some background.

#### Orthogonal Matrices

An [Orthogonal Matrix](https://en.wikipedia.org/wiki/Orthogonal_matrix) is a matrix whose inverse is its transpose, which means it is **trivially invertible**. Multiplication with an orthogonal matrix preserves distances and angles and is therefore **volume preserving**: the absolute value of the Jacobian determinant is always 1, so its logarithm is always 0 and trivial to calculate.

#### Diagonal Matrices

A **Diagonal Matrix** in turn is a matrix with zeros everywhere except on the diagonal. It has a compact representation if we store only the diagonal elements. Its inverse is just the diagonal matrix of componentwise reciprocals, therefore it is also **trivially invertible** (even though we can run into division by zero or numerical problems if coefficients get close to zero). Multiplication with a compactly represented diagonal matrix is also efficient, since we can use the componentwise tensor product. Likewise, **calculation of the log of the determinant of the Jacobian is trivial**: it is the sum of the logarithms of the absolute values of the diagonal elements.

#### Singular Value Decomposition

Using Singular Value Decomposition (SVD), **any real matrix** can be decomposed via stable numerical methods into the product of an orthogonal, a diagonal and another orthogonal matrix. If the original matrix is square and non-singular, this also allows us to invert it by taking the transposes of the orthogonal matrices and the inverse of the diagonal matrix, and applying them in reverse order. If the original matrix is not square, we at least get a least-squares solution to the inverse problem.
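The following sketch illustrates both statements with `numpy.linalg.svd` on random matrices (the matrices and variable names are made up for the example): inversion of a square matrix from its factors, and the least-squares solution for a non-square one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Square case: A = U @ diag(sing) @ Vt, so A^{-1} = Vt.T @ diag(1 / sing) @ U.T
A = rng.normal(size=(4, 4))
U, sing, Vt = np.linalg.svd(A)
A_inv = Vt.T @ np.diag(1.0 / sing) @ U.T

# The orthogonal factors have |det| = 1, so log|det A| is the sum of log singular values
logabsdet = np.sum(np.log(sing))

# Non-square case: the same recipe yields the pseudoinverse, which gives the
# least-squares solution of B @ x = target
B = rng.normal(size=(6, 3))
Ub, sing_b, Vbt = np.linalg.svd(B, full_matrices=False)
B_pinv = Vbt.T @ np.diag(1.0 / sing_b) @ Ub.T
target = rng.normal(size=6)
x_ls = B_pinv @ target
```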

#### QR Decomposition using Householder Reflectors

The **QR** decomposition algorithm based on Householder matrices makes it possible to factorize any square matrix (with shape $N \times N$) into the product of an orthogonal matrix **Q** and an upper triangular matrix **R**. As a by-product, we also get a factorization of the orthogonal matrix Q into $N$ so-called Householder reflectors, each of which can be represented as a so-called Householder matrix of shape (N x N), or more compactly as a vector of shape N.

If the original matrix was orthogonal to begin with, we can use this to factorize it completely into Householder matrices (i.e. the resulting matrix $R$ is provably going to be the identity matrix, up to the signs of its diagonal entries).

Using just some trivial constant constraints on the householder reflection vectors (some entries need to be zero, or are limited to -1 or +1), the remaining free parameters can be varied arbitrarily, and the product can be **any** Orthogonal Matrix.

This means, these Householder Reflectors are suitable as a reparametrization method for Orthogonal Matrices.
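As an illustration, the sketch below builds an orthogonal matrix as a product of Householder matrices $H = I - 2 v v^T / (v^T v)$. For simplicity it uses unconstrained random reflection vectors rather than the constrained pattern mentioned above, and the function names are assumptions made for this example.

```python
import numpy as np


def householder_matrix(v):
    """Dense Householder matrix H = I - 2 v v^T / (v^T v) for a non-zero vector v."""
    v = np.asarray(v, dtype=float)
    return np.eye(v.size) - 2.0 * np.outer(v, v) / (v @ v)


def orthogonal_from_reflectors(vectors):
    """Multiply the Householder matrices of all rows of `vectors`."""
    Q = np.eye(vectors.shape[1])
    for v in vectors:
        Q = Q @ householder_matrix(v)
    return Q


rng = np.random.default_rng(0)
N = 4
vectors = rng.normal(size=(N, N))   # N reflection vectors of shape N: the free parameters
Q = orthogonal_from_reflectors(vectors)

# Each reflector has determinant -1, so a product of N = 4 of them has determinant +1
det_Q = np.linalg.det(Q)

# QR decomposition of an orthogonal matrix: R is diagonal with entries +1 or -1
_, R = np.linalg.qr(Q)
```

Every reflector is symmetric and its own inverse, so whatever values the parameter vectors take during training, the product stays exactly orthogonal.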

#### Singular Value Composition

Combining the two, **we can factorize any real matrix** of shape (N x M) into the product of **N** Householder reflection vectors of shape **N**, one diagonal matrix represented as a vector of shape min(N, M), and **M** Householder reflection vectors of shape **M**.

This again means, we can construct a reparametrization of any square matrix, which is trivially invertible and fulfills our requirements for an Invertible Transformation. **It is approximately equally efficient to forward and backward propagate through the network in either direction**. Also, the parametrization is universal in the sense that it is smooth, and can represent any real matrix.
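A minimal sketch of such a layer for the square case is shown below. It applies the reflectors directly to the data without ever forming a dense matrix. The class name `SVDLinear` and the choice to store the singular values as logarithms (which keeps them away from zero) are assumptions made for this example, not the invertnn implementation.

```python
import numpy as np


def reflect(x, v):
    """Apply the Householder matrix of v to every row of x in O(N) per row."""
    return x - 2.0 * np.outer(x @ v, v) / (v @ v)


def apply_reflectors(x, vectors):
    for v in vectors:
        x = reflect(x, v)
    return x


class SVDLinear:
    """Square linear map parametrized as orthogonal * diagonal * orthogonal."""

    def __init__(self, n, rng):
        self.u_vectors = rng.normal(size=(n, n))    # reflectors of the first orthogonal factor
        self.v_vectors = rng.normal(size=(n, n))    # reflectors of the second orthogonal factor
        self.log_sing = rng.normal(scale=0.1, size=n)

    def forward(self, x):
        x = apply_reflectors(x, self.v_vectors)
        x = x * np.exp(self.log_sing)
        return apply_reflectors(x, self.u_vectors)

    def inv_jacobian_logabsdet(self, y):
        # Each reflector is its own inverse, so undo them in reversed order
        x = apply_reflectors(y, self.u_vectors[::-1])
        x = x * np.exp(-self.log_sing)
        x = apply_reflectors(x, self.v_vectors[::-1])
        # Only the diagonal part changes volume
        return x, np.full(y.shape[0], -np.sum(self.log_sing))


layer = SVDLinear(5, np.random.default_rng(0))
x = np.random.default_rng(1).normal(size=(3, 5))
y = layer.forward(x)
x_rec, logdet = layer.inv_jacobian_logabsdet(y)
```

The forward and inverse passes perform the same number of reflections and one elementwise scaling each, which is what makes both directions approximately equally expensive.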






## Related Papers

### Density Estimation using Invertible Neural Networks

It is possible to learn and approximate probability density functions using invertible neural networks, based on the fact that we can apply the change of variables method.

 * [Baird et al.: One-Step Neural Network Inversion with PDF Learning and Emulation](http://leemon.com/papers/2005bsi.pdf)
Baird et al. demonstrate the applicability of the Change of Variables Technique to learn PDFs, given invertible neural networks. They introduce a novel invertible activation function (available in invertnn as **InvertibleBairdActivation** in the **invertible_transforms** module). They also develop a PDF learning algorithm, which we do not use.

 * [Dinh et al.: Density estimation using Real NVP](https://arxiv.org/pdf/1605.08803.pdf)
Dinh et al. introduce a new method for the construction of Invertible Neural Networks using so-called **Coupling Layers** (implemented in invertnn as class **InvertibleCouplingLayer** in the **invertible_transforms** module). They also demonstrate the applicability of these invertible networks to image generation tasks.

### Parametrization of Orthogonal Matrices using Householder Reflectors

 * [Mhammedi et al.: Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections](https://arxiv.org/pdf/1612.00188.pdf)
 * [Zhang et al.: Stabilizing Gradients for Deep Neural Networks via Efficient SVD Parameterization](https://arxiv.org/pdf/1803.09327.pdf)

### GAN Architectures involving approximate inversions / decoder architectures

 * ALI: [Dumoulin et al.: Adversarially Learned Inference](https://arxiv.org/abs/1606.00704)
 * BiGAN: [Donahue et al.: Adversarial Feature Learning](https://arxiv.org/abs/1605.09782)
 * InfoGAN: [Chen et al.: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets](https://arxiv.org/abs/1606.03657)

### Variational Autoencoders

 * VAEs: [Kingma & Welling: Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114)
 


In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributions as dists
import invertnn.pytorch.invertible_transforms as invertible
import invertnn.pytorch.orthogonal_transform as ortho

In [28]:
torch.sigmoid(torch.ones(1, dtype=torch.float64)*-50.0)

tensor([1.9287e-22], dtype=torch.float64)

In [11]:
def create_mlp(in_features, out_features, hidden=32, final_act=nn.Softsign):
    # Hypothetical stand-in for the `self.create_mlp` helper, which is not defined in this notebook
    return nn.Sequential(nn.Linear(in_features, hidden), nn.Tanh(),
                         nn.Linear(hidden, out_features), final_act())

model = invertible.InvertibleSequential(
            *[ortho.OrthogonalTransform(10, bias=True), ortho.DiagonalLinearTransform(10, bias=False),
                ortho.OrthogonalTransform(10, bias=True), invertible.InvertibleBairdActivation(layer_shape=(10,)),
                ortho.OrthogonalTransform(10, bias=True), invertible.InvertibleShuffle(list(reversed(range(10)))),
                invertible.InvertibleConcat(10, 10), ortho.DiagonalLinearTransform(20, bias=False),
                invertible.InvertibleShuffle([0,2,4,6,8,10,12,14,16,18,1,3,5,7,9,11,13,15,17,19]),
                invertible.InvertibleCouplingLayer(8, 12,
                                                   create_mlp(8, 12, final_act=nn.Softsign),
                                                   create_mlp(8, 12, final_act=nn.Softsign)),
                invertible.InvertibleShuffle([0,2,4,6,8,10,12,14,16,18,1,3,5,7,9,11,13,15,17,19]),
                invertible.InvertibleCouplingLayer(12, 8,
                                                   create_mlp(12, 8, final_act=nn.Softsign),
                                                   create_mlp(12, 8, final_act=nn.Softsign)),
                ortho.OrthogonalTransform(20, bias=True), invertible.InvertibleLeakyReLU(), ])
# Additional input that the InvertibleConcat layer (index 6) concatenates on the forward pass
model._modules['6'].input_b = torch.eye(10)

In [4]:
# Assumed batch-first layout: 10 samples with 10 features each, matching the first layer and input_b
a = torch.randn((10, 10))
b = model.forward(a)
c = model.invert(b)

Author: 
    *[Kai Londenberg](mailto:Kai.Londenberg@googlemail.com), 2018*