# Self-Normalizing Neural Networks

### Papers and references

* Paper:
    * https://arxiv.org/abs/1706.02515
* Official GitHub repository (tutorials):
    * https://github.com/bioinf-jku/SNNs
* Paper notes:
    * https://github.com/kevinzakka/research-paper-notes/blob/master/snn.md
* Activation visualization:
    * https://github.com/shaohua0116/Activation-Visualization-Histogram
* Reddit threads:
    * https://www.reddit.com/r/MachineLearning/comments/6gd704/d_tutorials_and_implementations_for/
    * https://www.reddit.com/r/MachineLearning/comments/6g5tg1/r_selfnormalizing_neural_networks_improved_elu/ (more replies)


### Summary

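* Instead of using BN, the paper tunes the parameters of the ELU activation function so that the mean and variance of the activations are normalized automatically.
    * How does this even work...?
* Unlike CNNs/RNNs, FNNs do not work well with BN; the paper gets good results with SNNs, i.e. FNNs that use SELU.
    * Unlike CNNs/RNNs, FNNs have mostly lost out to RF-style algorithms such as XGBoost, so the paper points to room for FNNs to improve.
    * But why does BN not work well for FNNs? The paper explains it briefly, but I did not follow the explanation.
    * And does SELU then have no advantage over BN for CNNs/RNNs? Let's test this.
* The paper also proposes a dropout variant for SNNs.
    * Presumably because something about it has to be different?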


### Scaled Exponential Linear Units (SELUs)

![selu](selu_eq.png)

* Without the scale factor $\lambda$, this is just the ELU (Exponential Linear Unit) formula.
* On top of that, SELU uses `alpha = 1.6732` and `lambda = 1.0507` to get zero mean and unit variance.
    * Other values can be chosen to target a different mean/variance; the official code also provides a way to compute them.
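
For the default zero-mean / unit-variance target, the two constants can be recovered numerically. The sketch below is my own derivation, not the routine from the official code, and it assumes the pre-activation `z` is exactly standard normal: it solves `E[selu(z)] = 0` and `E[selu(z)^2] = 1` for `z ~ N(0, 1)`. `std_normal_cdf` and `selu_constants` are hypothetical helper names.

```python
import math

def std_normal_cdf(t):
    """Phi(t), the CDF of a standard normal variable."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def selu_constants():
    """Solve E[selu(z)] = 0 and E[selu(z)^2] = 1 for z ~ N(0, 1)."""
    # Partial moments of z over the negative half-line:
    #   E[e^z   ; z < 0] = e^(1/2) * Phi(-1)
    #   E[e^(2z); z < 0] = e^2     * Phi(-2)
    m1 = math.exp(0.5) * std_normal_cdf(-1.0)
    m2 = math.exp(2.0) * std_normal_cdf(-2.0)
    # Mean condition: 1/sqrt(2*pi) + alpha * (m1 - 1/2) = 0
    alpha = (1.0 / math.sqrt(2.0 * math.pi)) / (0.5 - m1)
    # Second-moment condition: lambda^2 * (1/2 + alpha^2 * (m2 - 2*m1 + 1/2)) = 1
    lam = 1.0 / math.sqrt(0.5 + alpha ** 2 * (m2 - 2.0 * m1 + 0.5))
    return alpha, lam

alpha, lam = selu_constants()
print("alpha = {:.6f}, lambda = {:.6f}".format(alpha, lam))
```

The printed values should agree with the `alpha` / `lambda` constants quoted above to the digits shown there.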

* Properties
    * negative and positive values to control the mean
    * derivatives approaching 0 to dampen variance
    * slope larger than 1 to increase variance
    * continuous curve
    * Honestly, I do not fully understand these properties yet. I need to read the paper more closely.
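
One way to get a feel for these properties is to push random data through a deep stack of SELU layers and watch the statistics. This is only a minimal NumPy sketch under my own assumptions: i.i.d. Gaussian weights with variance `1 / fan_in`, no biases, and arbitrary sizes (width 1024, batch 256, 24 layers). `propagate` is a hypothetical helper, not something from the paper or the official repository.

```python
import numpy as np

ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    # expm1 is only evaluated on the clipped negative part, so large x cannot overflow
    return SCALE * np.where(x >= 0.0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))

def propagate(x, n_layers, rng):
    """Push x through n_layers random dense SELU layers; return [(mean, var), ...]."""
    width = x.shape[1]
    stats = []
    for _ in range(n_layers):
        # weights ~ N(0, 1/fan_in): zero mean, and squared weights sum to about 1 per unit
        w = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
        x = selu(np.dot(x, w))
        stats.append((x.mean(), x.var()))
    return stats

rng = np.random.RandomState(0)
wide = propagate(0.5 + 2.0 * rng.normal(size=(256, 1024)), 24, rng)   # starts too wide
narrow = propagate(0.2 * rng.normal(size=(256, 1024)), 24, rng)       # starts too narrow

for layer in (1, 4, 12, 24):
    print("layer {:2d} | wide: mean {:+.3f} var {:.3f} | narrow: mean {:+.3f} var {:.3f}".format(
        layer, wide[layer - 1][0], wide[layer - 1][1],
        narrow[layer - 1][0], narrow[layer - 1][1]))
```

When the input is too wide, the saturating negative side (derivative approaching 0) pulls the variance down; when it is too narrow, the slope larger than 1 pushes it up. Under the assumptions above, both runs should drift toward mean 0 and variance 1.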


### What I want to do

* As mentioned above, implement SNN in a CNN/BN setup, understand it, and compare the results.
* Also implement the dropout variant, understand it, and compare the results.

## Conclusion

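* Just as the paper makes no such claim, SELU does not seem to give noticeably better results than BN.
* It looks like a reasonable alternative for cases where BN is hard to apply.
* The official repository has comparisons on MNIST/CIFAR10 where SELU outperforms ELU/ReLU, but there is no comparison against BN.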

# Questions

* One puzzling thing I noticed while running this experiment: with BN applied, the performance stays poor until a fair number of steps have passed.
* Why...? Is there a reason for that...?
* Hmm...
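
One possible explanation, which I have not verified in this notebook: `tf.layers.batch_normalization` keeps exponential moving averages of the batch statistics (default `momentum=0.99`, moving mean initialized to 0) and uses them whenever `training=False`. With so few updates, those averages are still far from the real statistics, so evaluation looks bad even if training is going fine. The sketch below only simulates that update rule in plain Python; the batch statistic of `5.0` and the helper name `ema_trace` are made up for illustration.

```python
def ema_trace(batch_stat, init, momentum, n_steps):
    """Moving average after each step when every batch reports `batch_stat`."""
    value = init
    trace = []
    for _ in range(n_steps):
        # BN-style update: moving = momentum * moving + (1 - momentum) * batch
        value = momentum * value + (1.0 - momentum) * batch_stat
        trace.append(value)
    return trace

trace = ema_trace(batch_stat=5.0, init=0.0, momentum=0.99, n_steps=500)
for step in (10, 100, 500):
    print("after {:3d} steps: moving mean = {:.3f} (batch mean is 5.0)".format(
        step, trace[step - 1]))
```

After 10 updates the moving mean has covered less than 10% of the distance to the batch statistic. If this is indeed the cause, a smaller `momentum` or more training steps before evaluating should make the effect go away.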

In [3]:
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

In [4]:
from tensorflow.python.framework import ops

In [5]:
mnist = input_data.read_data_sets("../MNIST_data/", one_hot=True)

Extracting ../MNIST_data/train-images-idx3-ubyte.gz
Extracting ../MNIST_data/train-labels-idx1-ubyte.gz
Extracting ../MNIST_data/t10k-images-idx3-ubyte.gz
Extracting ../MNIST_data/t10k-labels-idx1-ubyte.gz


In [6]:
def selu(x):
    with ops.name_scope('elu') as scope:
        alpha = 1.6732632423543772848170429916717
        scale = 1.0507009873554804934193349852946
        return scale*tf.where(x>=0.0, x, alpha*tf.nn.elu(x))

In [7]:
# In the official tutorial, the SNN init uses FAN_IN / factor=1.0 / normal distribution.
# That is what the function below does.
# MSRA (He) init is the same but with factor=2.0; ReLU uses that one.
def snn_init():
    return tf.contrib.layers.variance_scaling_initializer(factor=1.0)

In [8]:
# There is nowhere to attach dropout in the test model below.
# It would have to go on the CNN layers. Does that even work?

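# numbers, tensor_shape and utils are needed by dropout_selu below
import numbers

from tensorflow.python.framework import ops
from tensorflow.python.framework import tensor_shape
from tensorflow.python.framework import tensor_util
from tensorflow.python.layers import utils
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import random_ops
from tensorflow.python.ops import array_ops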

def dropout_selu(x, rate, alpha= -1.7580993408473766, fixedPointMean=0.0, fixedPointVar=1.0, 
                 noise_shape=None, seed=None, name=None, training=False):
    """Dropout to a value with rescaling."""

    def dropout_selu_impl(x, rate, alpha, noise_shape, seed, name):
        keep_prob = 1.0 - rate
        x = ops.convert_to_tensor(x, name="x")
        if isinstance(keep_prob, numbers.Real) and not 0 < keep_prob <= 1:
            raise ValueError("keep_prob must be a scalar tensor or a float in the "
                                             "range (0, 1], got %g" % keep_prob)
        keep_prob = ops.convert_to_tensor(keep_prob, dtype=x.dtype, name="keep_prob")
        keep_prob.get_shape().assert_is_compatible_with(tensor_shape.scalar())

        alpha = ops.convert_to_tensor(alpha, dtype=x.dtype, name="alpha")
        alpha.get_shape().assert_is_compatible_with(tensor_shape.scalar())

        if tensor_util.constant_value(keep_prob) == 1:
            return x

        noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
        random_tensor = keep_prob
        random_tensor += random_ops.random_uniform(noise_shape, seed=seed, dtype=x.dtype)
        binary_tensor = math_ops.floor(random_tensor)
        ret = x * binary_tensor + alpha * (1-binary_tensor)

        a = tf.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * tf.pow(alpha-fixedPointMean,2) + fixedPointVar)))

        b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
        ret = a * ret + b
        ret.set_shape(x.get_shape())
        return ret

    with ops.name_scope(name, "dropout", [x]) as name:
        return utils.smart_cond(training,
            lambda: dropout_selu_impl(x, rate, alpha, noise_shape, seed, name),
            lambda: array_ops.identity(x))
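
To see what the `a` / `b` correction in `dropout_selu` is for, here is a small NumPy re-implementation of the training branch only, restricted to the default fixed point (mean 0, variance 1). It is a sketch for checking the arithmetic, not a replacement for the TF op; `alpha_dropout_np` is a hypothetical name and the standard-normal input is an assumption.

```python
import numpy as np

def alpha_dropout_np(x, rate, alpha_prime=-1.7580993408473766, rng=None):
    """NumPy sketch of dropout_selu's training branch (fixed point: mean 0, var 1)."""
    rng = rng or np.random.RandomState(0)
    keep_prob = 1.0 - rate
    keep = rng.uniform(size=x.shape) < keep_prob   # True where the unit survives
    dropped = np.where(keep, x, alpha_prime)       # dropped units are set to alpha', not 0
    # Affine correction that restores zero mean and unit variance:
    #   Var[dropped] = keep_prob * (1 + (1 - keep_prob) * alpha'^2)
    #   E[dropped]   = (1 - keep_prob) * alpha'
    a = 1.0 / np.sqrt(keep_prob * (1.0 + (1.0 - keep_prob) * alpha_prime ** 2))
    b = -a * (1.0 - keep_prob) * alpha_prime
    return a * dropped + b

rng = np.random.RandomState(42)
x = rng.normal(size=1000000)                       # assumed: activations already at mean 0 / var 1
y = alpha_dropout_np(x, rate=0.1, rng=rng)
print("mean {:+.4f}, var {:.4f}".format(y.mean(), y.var()))
```

The default `alpha` of about `-1.7581` is `-lambda * alpha` from `selu`, i.e. the value SELU saturates to for very negative inputs. Dropping to that value instead of 0, followed by the affine rescaling, is what keeps the output statistics at the fixed point.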

In [6]:
class Model(object):
    def __init__(self, name, activ_fn, kernel_init=None, use_BN=False, lr=0.001):
        self.name = name
        
        with tf.variable_scope(name):
            self.X = tf.placeholder(tf.float32, shape=[None, 784], name='X')
            self.y = tf.placeholder(tf.float32, shape=[None, 10], name='y')
            self.training = tf.placeholder(tf.bool, name='training')

            net = tf.reshape(self.X, [-1, 28, 28, 1])

            n_filters = 32
            for i in range(3):
                with tf.variable_scope("conv{}".format(i)):
                    # conv
                    net = tf.layers.conv2d(net, n_filters, [3,3], padding='same', kernel_initializer=kernel_init)
                    if use_BN:
                        net = tf.layers.batch_normalization(net, training=self.training)
                    net = activ_fn(net)

                with tf.variable_scope("maxpool{}".format(i)):
                    # max_pool
                    net = tf.layers.max_pooling2d(net, [2,2], strides=2, padding='same')

                n_filters *= 2
                # [14, 14, 32], [7, 7, 64], [4, 4, 128]
                # 2048

            with tf.variable_scope("dense"):
                net = tf.contrib.layers.flatten(net)
                self.logits = tf.layers.dense(net, 10)

            with tf.variable_scope("prob"):
                self.prob = tf.nn.softmax(self.logits)

            with tf.variable_scope("accuracy"):
                self.accuracy = tf.equal(tf.argmax(self.logits, axis=1), tf.argmax(self.y, axis=1))
                self.accuracy = tf.reduce_mean(tf.cast(self.accuracy, tf.float32))

            with tf.variable_scope("loss"):
                self.loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.y)
                self.loss = tf.reduce_mean(self.loss)

            with tf.variable_scope("optimizer"):
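#                 if use_BN:  # No need to check use_BN here: without BN, UPDATE_OPS is just empty.
                # The trailing "/" keeps the prefix match from also picking up e.g. "ReLU_BN" for "ReLU".
                update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope=name + "/")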
                with tf.control_dependencies(update_ops):
                    self.train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(self.loss)

            # summaries
            # Caution: when several models are built in a single graph,
            # `tf.summary.merge_all` would merge the summaries of every model, so merge explicitly.
            self.summary_op = tf.summary.merge([
                tf.summary.scalar("loss", self.loss),
                tf.summary.scalar("accuracy", self.accuracy)
            ])
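
A caveat about collecting `UPDATE_OPS` by scope when several models share one graph: as far as I can tell from the TF 1.x docs, `tf.get_collection(key, scope)` filters with `re.match`, i.e. by name prefix. A scope of `ReLU` is therefore also a prefix of everything under `ReLU_BN/`. The plain-Python sketch below mimics that filter on made-up op names (the names and `filter_by_scope` are hypothetical); matching on the scope plus a trailing slash avoids the clash.

```python
import re

# Hypothetical op names, shaped like what the BN models would register
op_names = [
    "ReLU_BN/conv0/batch_normalization/AssignMovingAvg",
    "ELU_BN/conv0/batch_normalization/AssignMovingAvg",
]

def filter_by_scope(names, scope):
    """Mimic the prefix filter: keep names for which re.match(scope, name) succeeds."""
    return [n for n in names if re.match(scope, n)]

print(filter_by_scope(op_names, "ReLU"))    # also picks up the ReLU_BN op
print(filter_by_scope(op_names, "ReLU/"))   # empty list: the plain ReLU model has no BN ops
```

In this notebook the clash only stays hidden if each plain model is built before its `_BN` counterpart, because the collection is read at construction time.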

In [7]:
tf.reset_default_graph()

snn = Model("SNN", activ_fn=selu, use_BN=False, kernel_init=snn_init())
relu = Model("ReLU", activ_fn=tf.nn.relu, use_BN=False)
relu_bn = Model("ReLU_BN", activ_fn=tf.nn.relu, use_BN=True)
# how about ELU?
elu = Model("ELU", activ_fn=tf.nn.elu, use_BN=False)
elu_bn = Model("ELU_BN", activ_fn=tf.nn.elu, use_BN=True)

models = [snn, relu, relu_bn, elu, elu_bn]

In [18]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())

batch_size = 100
epoch_n = 10
N = mnist.train.num_examples
n_iter = N // batch_size

n_iter = 10  # quick test: only 10 batches per epoch (overrides the full pass computed above)
models = [snn, relu, relu_bn, elu, elu_bn]
# models = [snn, relu_bn, elu_bn]

for model in models:
    print(model.name)
    for epoch in range(epoch_n):
        avg_acc = 0.
        avg_loss = 0.
        for _ in range(n_iter):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            _, cur_acc, cur_loss = sess.run([model.train_op, model.accuracy, model.loss], 
                                            {model.X: batch_x, model.y: batch_y, model.training: True})
            avg_acc += cur_acc
            avg_loss += cur_loss

        avg_acc /= n_iter
        avg_loss /= n_iter

        test_acc, test_loss = sess.run([model.accuracy, model.loss], 
                                       {model.X: mnist.test.images, model.y: mnist.test.labels, model.training: False})

        print("[{}/{}] (train) acc: {:.2%}, loss: {:.3f} | (test) acc: {:.2%}, loss: {:.3f}".
              format(epoch+1, epoch_n, avg_acc, avg_loss, test_acc, test_loss))
    print("")

SNN
[1/10] (train) acc: 43.20%, loss: 1.816 | (test) acc: 66.48%, loss: 1.132
[2/10] (train) acc: 74.60%, loss: 0.826 | (test) acc: 85.36%, loss: 0.539
[3/10] (train) acc: 85.60%, loss: 0.501 | (test) acc: 90.27%, loss: 0.341
[4/10] (train) acc: 90.00%, loss: 0.325 | (test) acc: 92.03%, loss: 0.270
[5/10] (train) acc: 90.70%, loss: 0.287 | (test) acc: 93.93%, loss: 0.221
[6/10] (train) acc: 92.60%, loss: 0.244 | (test) acc: 94.09%, loss: 0.196
[7/10] (train) acc: 94.20%, loss: 0.175 | (test) acc: 95.17%, loss: 0.160
[8/10] (train) acc: 95.40%, loss: 0.167 | (test) acc: 95.20%, loss: 0.161
[9/10] (train) acc: 95.00%, loss: 0.168 | (test) acc: 95.45%, loss: 0.145
[10/10] (train) acc: 95.60%, loss: 0.142 | (test) acc: 95.22%, loss: 0.151

ReLU
[1/10] (train) acc: 26.20%, loss: 2.187 | (test) acc: 45.68%, loss: 1.973
[2/10] (train) acc: 60.50%, loss: 1.616 | (test) acc: 77.54%, loss: 1.048
[3/10] (train) acc: 79.20%, loss: 0.779 | (test) acc: 80.83%, loss: 0.578
[4/10] (train) acc: 83.90%,