**Activation functions**

As we did with `optimizers`, let's examine the logic of activation functions and their effect on model loss and accuracy. We'll use our simple four-layer CNN trained on MNIST and swap in different functions for our default `relu` activations. This network is optimized with `SGD`, which we've found isn't a great optimizer for the task (better to use SGD with some form of momentum). But maybe we'll get more variation in performance across activation functions with higher loss and lower accuracy? 

There are a ton of activation functions defined in `torch.nn` and `torch.nn.functional`. We'll use `nn.functional` and follow the order of the functions in the [documentation](https://pytorch.org/docs/stable/nn.functional.html). For some of these activations, we have the option of running the operation in-place by adding a `_`, or by passing `inplace=True` as an argument. We won't worry about that here. Let's get started!

**Threshold** keeps each input element that is above a specified `threshold` and replaces everything else with a `value`. There are no defaults for `threshold` or `value`. Let's arbitrarily set our `threshold` at 1 and set the rest to -1. Thus we'll end up with a modified (and discontinuous) version of `relu`. Kinda weird, and probably not the best activation function. Let's examine performance:

`Epoch       Loss       Accuracy
   1         .1607      .9526
   2         .1033      .9690
   3         .0777      .9763
   4         .0598      .9814
   5         .0669      .9794
   6         .0516      .9836
   7         .0490      .9846
   8         .0473      .9852
   9         .0435      .9864
   10        .0463      .9861`
   
Not great initially, but not terrible for an odd function with somewhat arbitrary arguments.
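As a rough illustration of the rule used here, this is a plain-Python sketch of the threshold function applied to single numbers. The name `threshold_fn` and the sample inputs are made up for the example, the defaults mirror the threshold of 1 and replacement value of -1 chosen above, and this is not the `torch.nn.functional` implementation itself.

```python
def threshold_fn(x: float, thresh: float = 1.0, value: float = -1.0) -> float:
    """Return x when it is strictly above thresh; otherwise return value."""
    return x if x > thresh else value


# Hypothetical sample inputs on both sides of the threshold.
samples = [-2.0, 0.5, 1.0, 1.5, 3.0]
outputs = [threshold_fn(x) for x in samples]
print(outputs)  # [-1.0, -1.0, -1.0, 1.5, 3.0]
```

The jump from -1 to values just above 1 is the discontinuity mentioned earlier.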

**ReLU** is simply $\max(0, x)$, zeroing out the negative values of the input. We've already tested this with the `SGD` optimizer in the `optimizers` notebook, so we can enter those values here:

Loss:

`Epoch       Threshold(-1,1)       ReLU  
   1          .1607                .1017
   2          .1033                .0614
   3          .0777                .0562
   4          .0598                .0409
   5          .0669                .0384
   6          .0516                .0336
   7          .0490                .0343
   8          .0473                .0391
   9          .0435                .0288
   10         .0463                .0314`

Accuracy:

`Epoch       Threshold(-1,1)       ReLU  
   1          .9526                .9669
   2          .9690                .9828
   3          .9763                .9809
   4          .9814                .9864
   5          .9794                .9873
   6          .9836                .9893
   7          .9846                .9876
   8          .9852                .9878
   9          .9864                .9909
   10         .9861                .9895`
   
ReLU is pretty good, and it clearly outperforms the threshold function (though `threshold` is getting close by epoch 10!)

**Hardtanh** is a (very) coarse approximation of `tanh`: the output is $1$ if $x>1$, $-1$ if $x<-1$, and $x$ itself otherwise.
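To see how coarse the approximation is, here is a small scalar sketch that prints Hardtanh next to `math.tanh`. The function name, the `min_val`/`max_val` parameter names and the sample points are choices made for this illustration, not taken from the training script.

```python
import math


def hardtanh(x: float, min_val: float = -1.0, max_val: float = 1.0) -> float:
    """Clamp x to the interval [min_val, max_val]."""
    return max(min_val, min(max_val, x))


for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    # Print the piecewise-linear value beside the smooth tanh value.
    print(f"x={x:+.1f}  hardtanh={hardtanh(x):+.4f}  tanh={math.tanh(x):+.4f}")
```

The two agree at 0 and in the saturated tails, and differ most around the corners at -1 and 1.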

Loss:

`Epoch     Threshold(-1,1)    ReLU    Hardtanh  
   1        .1607            .1017    .2441
   2        .1033            .0614    .1427
   3        .0777            .0562    .1027
   4        .0598            .0409    .0820
   5        .0669            .0384    .0698
   6        .0516            .0336    .0610
   7        .0490            .0343    .0551
   8        .0473            .0391    .0515
   9        .0435            .0288    .0471
   10       .0463            .0314    .0458`

Accuracy:

`Epoch     Threshold(-1,1)    ReLU    Hardtanh  
   1        .9526            .9669    .9320
   2        .9690            .9828    .9598
   3        .9763            .9809    .9706
   4        .9814            .9864    .9770
   5        .9794            .9873    .9798
   6        .9836            .9893    .9827
   7        .9846            .9876    .9839
   8        .9852            .9878    .9850
   9        .9864            .9909    .9864
   10       .9861            .9895    .9862`
   
Loss and accuracy for `hardtanh` aren't even as good as those for the threshold function for most of the 10 epochs, though it slightly outperforms `threshold` by the end of training. ReLU remains the best option.

**ReLU6** caps the ReLU output at 6, computing $\min(\max(0,x),6)$.

Loss:

`Epoch     Threshold    ReLU    Hardtanh   ReLU6
   1        .1607       .1017    .2441     .1694   
   2        .1033       .0614    .1427     .0989
   3        .0777       .0562    .1027     .0710
   4        .0598       .0409    .0820     .0571
   5        .0669       .0384    .0698     .0545
   6        .0516       .0336    .0610     .0424
   7        .0490       .0343    .0551     .0421
   8        .0473       .0391    .0515     .0416
   9        .0435       .0288    .0471     .0342
   10       .0463       .0314    .0458     .0383`

Accuracy:

`Epoch     Threshold    ReLU    Hardtanh   ReLU6  
   1        .9526       .9669    .9320     .9497
   2        .9690       .9828    .9598     .9721
   3        .9763       .9809    .9706     .9778
   4        .9814       .9864    .9770     .9822
   5        .9794       .9873    .9798     .9828
   6        .9836       .9893    .9827     .9866
   7        .9846       .9876    .9839     .9853
   8        .9852       .9878    .9850     .9864
   9        .9864       .9909    .9864     .9888
   10       .9861       .9895    .9862     .9869`

ReLU (not ReLU6) still looks like our best bet.

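**ELU** matches ReLU for positive inputs but, instead of zeroing out negative inputs, maps them onto a smooth exponential curve that saturates at $-\alpha$ ($\alpha = 1$ by default):

$ELU(x) = \max(0,x) + \min(0, \alpha * (\exp(x) - 1))$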
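A scalar sketch of the formula, set beside ReLU, may make the difference on negative inputs easier to see. The function names and sample inputs are illustrative only, and `alpha=1.0` is assumed as the default.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU on a scalar: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    # For alpha >= 0 this branch form equals
    # max(0, x) + min(0, alpha * (exp(x) - 1)),
    # and it avoids calling exp() on large positive inputs.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def relu(x: float) -> float:
    return max(0.0, x)


for x in [-5.0, -1.0, 0.0, 2.0]:
    print(f"x={x:+.1f}  relu={relu(x):.4f}  elu={elu(x):+.4f}")
```

Negative inputs give small negative outputs that level off near $-\alpha$ rather than being cut to 0.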

Loss:

`Epoch     Threshold    ReLU    Hardtanh   ReLU6     ELU
   1        .1607       .1017    .2441     .1694    .1603
   2        .1033       .0614    .1427     .0989    .0928
   3        .0777       .0562    .1027     .0710    .0689
   4        .0598       .0409    .0820     .0571    .0548
   5        .0669       .0384    .0698     .0545    .0585
   6        .0516       .0336    .0610     .0424    .0417
   7        .0490       .0343    .0551     .0421    .0411
   8        .0473       .0391    .0515     .0416    .0449
   9        .0435       .0288    .0471     .0342    .0333
   10       .0463       .0314    .0458     .0383    .0379`

Accuracy:

`Epoch     Threshold    ReLU    Hardtanh   ReLU6     ELU  
   1        .9526       .9669    .9320     .9497    .9528
   2        .9690       .9828    .9598     .9721    .9744
   3        .9763       .9809    .9706     .9778    .9782
   4        .9814       .9864    .9770     .9822    .9827
   5        .9794       .9873    .9798     .9828    .9816
   6        .9836       .9893    .9827     .9866    .9868
   7        .9846       .9876    .9839     .9853    .9858
   8        .9852       .9878    .9850     .9864    .9850
   9        .9864       .9909    .9864     .9888    .9898
   10       .9861       .9895    .9862     .9869    .9863`
   
ELU is a solid second or third, but ReLU is still best.

**SELU** is computed as $scale * (max(0,x) + min(0, \alpha * (exp(x) - 1)))$

Here $\alpha = 1.6732632423543772848170429916717$
and $scale = 1.0507009873554804934193349852946$
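As a quick check on what these constants do, here is a scalar sketch of SELU. The constants are the ones quoted above, truncated to double precision; the function and variable names are chosen for this example.

```python
import math

# Constants as quoted above, truncated to double precision.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(x: float) -> float:
    """SELU on a scalar: SCALE times ELU(x) with the fixed ALPHA above."""
    return SCALE * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))


print(selu(1.0))     # equals SCALE: positive inputs are stretched slightly
print(selu(-100.0))  # approaches -SCALE * ALPHA, roughly -1.7581
```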

Loss:

`Ep  Threshold   ReLU    Hardtanh   ReLU6     ELU      SELU
 1    .1607      .1017    .2441     .1694    .1603    .1389
 2    .1033      .0614    .1427     .0989    .0928    .0780
 3    .0777      .0562    .1027     .0710    .0689    .0607
 4    .0598      .0409    .0820     .0571    .0548    .0496
 5    .0669      .0384    .0698     .0545    .0585    .0574
 6    .0516      .0336    .0610     .0424    .0417    .0393
 7    .0490      .0343    .0551     .0421    .0411    .0389
 8    .0473      .0391    .0515     .0416    .0449    .0446
 9    .0435      .0288    .0471     .0342    .0333    .0328
 10   .0463      .0314    .0458     .0383    .0379    .0361`

Accuracy:

`Ep  Threshold    ReLU    Hardtanh   ReLU6     ELU      SELU
 1    .9526       .9669    .9320     .9497    .9528    .9591
 2    .9690       .9828    .9598     .9721    .9744    .9777
 3    .9763       .9809    .9706     .9778    .9782    .9806
 4    .9814       .9864    .9770     .9822    .9827    .9840
 5    .9794       .9873    .9798     .9828    .9816    .9810
 6    .9836       .9893    .9827     .9866    .9868    .9878
 7    .9846       .9876    .9839     .9853    .9858    .9869
 8    .9852       .9878    .9850     .9864    .9850    .9853
 9    .9864       .9909    .9864     .9888    .9898    .9903
 10   .9861       .9895    .9862     .9869    .9863    .9874`

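**CELU** is defined as $\max(0,x) + \min(0, \alpha*(\exp(x/\alpha)-1))$, and confusingly rhymes with SELU? With the default $\alpha = 1$ this is exactly the ELU formula, which is why the CELU column below is identical to the ELU column.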
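The relationship between CELU and ELU can be checked numerically. This sketch assumes both functions are run with $\alpha = 1$ (the value used when no `alpha` is passed) and uses made-up sample inputs; it compares scalar versions of the two formulas, not the PyTorch kernels.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    return max(0.0, x) + min(0.0, alpha * (math.exp(x) - 1.0))


def celu(x: float, alpha: float = 1.0) -> float:
    return max(0.0, x) + min(0.0, alpha * (math.exp(x / alpha) - 1.0))


xs = [-3.0, -1.0, -0.1, 0.0, 0.5, 2.0]  # hypothetical sample inputs
same_at_default = all(celu(x) == elu(x) for x in xs)
differ_at_two = any(celu(x, 2.0) != elu(x, 2.0) for x in xs)
print(same_at_default, differ_at_two)  # True True
```

The two only diverge once $\alpha$ moves away from 1, so a CELU run at the default setting should reproduce the ELU numbers.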

Loss:

`Ep  Thld  ReLU    Hrdtnh   ReLU6    ELU     SELU    CELU
 1   .161  .1017   .2441    .1694   .1603   .1389   .1603
 2   .103  .0614   .1427    .0989   .0928   .0780   .0928
 3   .078  .0562   .1027    .0710   .0689   .0607   .0689
 4   .060  .0409   .0820    .0571   .0548   .0496   .0548
 5   .067  .0384   .0698    .0545   .0585   .0574   .0585
 6   .052  .0336   .0610    .0424   .0417   .0393   .0417
 7   .049  .0343   .0551    .0421   .0411   .0389   .0411
 8   .047  .0391   .0515    .0416   .0449   .0446   .0449
 9   .044  .0288   .0471    .0342   .0333   .0328   .0333
 10  .046  .0314   .0458    .0383   .0379   .0361   .0379`

Accuracy:

`Ep  Thld  ReLU    Hrdtnh   ReLU6    ELU     SELU    CELU
 1   .953  .9669   .9320    .9497   .9528   .9591   .9528
 2   .969  .9828   .9598    .9721   .9744   .9777   .9744
 3   .976  .9809   .9706    .9778   .9782   .9806   .9782
 4   .981  .9864   .9770    .9822   .9827   .9840   .9827
 5   .979  .9873   .9798    .9828   .9816   .9810   .9816
 6   .984  .9893   .9827    .9866   .9868   .9878   .9868
 7   .985  .9876   .9839    .9853   .9858   .9869   .9858
 8   .985  .9878   .9850    .9864   .9850   .9853   .9850
 9   .986  .9909   .9864    .9888   .9898   .9903   .9898
 10  .986  .9895   .9862    .9869   .9863   .9874   .9863`
 
Still haven't found an activation function that can beat ReLU.

**Leaky ReLU** augments ReLU by adding a small slope (default: 0.01) to negative values, yielding $\max(0,x) + \text{negative\_slope} * \min(0,x)$ for input $x$.

We'll record Leaky ReLU and the activation function with the top accuracy (ReLU) to four digits and the rest to three. (ReLU is winning at every epoch.)

Loss:

`Ep  Thld   ReLU   Hdtnh  ReLU6  ELU  SELU  CELU   LRLU
 1   .161  .1017   .244   .169  .160  .139  .160  .1679
 2   .103  .0614   .143   .099  .093  .078  .093  .1001
 3   .078  .0562   .103   .071  .069  .061  .069  .0729
 4   .060  .0409   .082   .057  .055  .050  .055  .0584
 5   .067  .0384   .070   .055  .059  .057  .059  .0612
 6   .052  .0336   .061   .042  .042  .039  .042  .0451
 7   .049  .0343   .055   .042  .041  .039  .041  .0438
 8   .047  .0391   .052   .042  .045  .045  .045  .0465
 9   .044  .0288   .047   .034  .033  .033  .033  .0347
 10  .046  .0314   .046   .038  .038  .036  .038  .0398`

Accuracy:

`Ep  Thld   ReLU   Hdtnh  ReLU6  ELU  SELU  CELU   LRLU
 1   .953  .9669   .932   .950  .953  .959  .953  .9504
 2   .969  .9828   .960   .972  .974  .978  .974  .9710
 3   .976  .9809   .971   .978  .978  .981  .978  .9772
 4   .981  .9864   .977   .982  .983  .984  .983  .9821
 5   .979  .9873   .980   .983  .982  .981  .982  .9812
 6   .984  .9893   .983   .987  .987  .988  .987  .9854
 7   .985  .9876   .984   .985  .986  .987  .986  .9848
 8   .985  .9878   .985   .986  .985  .985  .985  .9846
 9   .986  .9909   .986   .989  .990  .990  .990  .9888
 10  .986  .9895   .986   .987  .986  .987  .986  .9862`
 
Intuitively, it seems that Leaky ReLU, if not an improvement over ReLU, should at least be close. This test suggests otherwise: it's generally worse than the other ReLU variants.

**PReLU** modifies Leaky ReLU, replacing the constant *negative_slope* with a learnable *weight* parameter. This isn't entirely straightforward to implement, and there is no suggested default for the initial weight in the `nn.functional.prelu` implementation, so we'll skip this one.

**RReLU** is Leaky ReLU, but with a randomized slope for $x<0$, uniformly sampled with lower bound 0.125 and upper bound 0.333.
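Here is a minimal scalar sketch of the randomized slope. The `training` flag follows the PyTorch documentation for RReLU, which describes a sampled slope during training and a fixed slope of `(lower + upper) / 2` during evaluation; the function name, the `rng` argument and the seed are assumptions made for this illustration.

```python
import random
from typing import Optional


def rrelu(x: float, lower: float = 1 / 8, upper: float = 1 / 3,
          training: bool = True, rng: Optional[random.Random] = None) -> float:
    """Randomized leaky ReLU on a scalar."""
    if x >= 0:
        return x
    if training:
        # Sample a fresh negative slope for this element.
        slope = (rng or random).uniform(lower, upper)
    else:
        # Fixed midpoint slope, as documented for evaluation mode.
        slope = (lower + upper) / 2
    return slope * x


rng = random.Random(0)  # seeded only so the sketch is repeatable
print([rrelu(-1.0, rng=rng) for _ in range(3)])  # slopes drawn from [1/8, 1/3]
print(rrelu(-1.0, training=False))               # -(1/8 + 1/3) / 2
```

With a fixed slope, RReLU is just Leaky ReLU with a steeper negative side. If the functional call was left in its evaluation-mode behavior, that may explain results close to Leaky ReLU's.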

Let's simplify the tables further and round everything to two digits, except for the top performer (ReLU) and the newcomer (`RReLU`), which stay at four.

Loss:

`Ep  Thld  ReLU  Htnh  RLU6 ELU  SELU CELU LRLU RReLU
 1   .16  .1017  .24   .17  .16  .14  .16  .17  .1730
 2   .10  .0614  .14   .10  .09  .08  .09  .10  .1027
 3   .08  .0562  .10   .07  .07  .06  .07  .07  .0746
 4   .06  .0409  .08   .06  .06  .05  .06  .06  .0594
 5   .07  .0384  .07   .06  .06  .06  .06  .06  .0627
 6   .05  .0336  .06   .04  .04  .04  .04  .05  .0452
 7   .05  .0343  .06   .04  .04  .04  .04  .04  .0450
 8   .05  .0391  .05   .04  .05  .05  .05  .05  .0475
 9   .04  .0288  .05   .03  .03  .03  .03  .03  .0351
 10  .05  .0314  .05   .04  .04  .04  .04  .04  .0434`

Accuracy:

`Ep  Thld  ReLU  Htnh  RLU6 ELU  SELU CELU LRLU RReLU
 1   .95  .9669  .93   .95  .95  .96  .95  .95  .9495
 2   .97  .9828  .96   .97  .97  .98  .97  .97  .9704
 3   .98  .9809  .97   .98  .98  .98  .98  .98  .9773
 4   .98  .9864  .98   .98  .98  .98  .98  .98  .9815
 5   .98  .9873  .98   .98  .98  .98  .98  .98  .9806
 6   .98  .9893  .98   .99  .99  .99  .99  .99  .9855
 7   .99  .9876  .98   .99  .99  .99  .99  .99  .9838
 8   .99  .9878  .99   .99  .99  .99  .99  .99  .9848
 9   .99  .9909  .99   .99  .99  .99  .99  .99  .9888
 10  .99  .9895  .99   .99  .99  .99  .99  .99  .9853`
 
The results are similar to Leaky ReLU.

**GLU** is a "gated linear unit." We split the input in half into $a$ and $b$, pass $b$ through a sigmoid function $\sigma$, and compute the elementwise product of $a$ and $\sigma(b)$. We'll skip this one because it doesn't work with our network's given input size.
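The split-and-gate step can be sketched on a flat Python list. This is only an illustration of the arithmetic: the helper names and the sample input are invented here, and the real function splits a tensor along a chosen dimension.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def glu(values: List[float]) -> List[float]:
    """Gate the first half of the list by the sigmoid of the second half."""
    if len(values) % 2 != 0:
        raise ValueError("GLU needs an even number of elements to split in half")
    half = len(values) // 2
    a, b = values[:half], values[half:]
    return [a_i * sigmoid(b_i) for a_i, b_i in zip(a, b)]


print(glu([1.0, 2.0, 0.0, 0.0]))  # [0.5, 1.0]: four inputs become two outputs
```

The output is half the size of the input, so GLU can't be swapped in for `relu` unless the following layers are resized. That is one plausible reading of the size problem mentioned above.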

**GeLU** is the product of the input $x$ and $\Phi(x)$, the cumulative distribution function of the standard normal (Gaussian) distribution evaluated at $x$.
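For a concrete feel, here is a scalar sketch that takes the distribution to be the standard normal and writes its CDF with the error function. The helper names and sample inputs are chosen for this example; this is the exact form, not a tanh-based approximation.

```python
import math


def normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    return x * normal_cdf(x)


for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}")
```

Large positive inputs pass through almost unchanged, large negative inputs go to roughly 0, and the region near 0 is a smooth blend of the two.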

For the rest of the activation functions, we'll just compare to the top performing function (as of now, it's all ReLU). We'll also accelerate things by training for just five epochs.

`Loss                          Accuracy`

`Ep    ReLU     GeLU           ReLU     GeLU
 1    .1017    .1567          .9669    .9538
 2    .0614    .0930          .9828    .9739
 3    .0562    .0682          .9809    .9787
 4    .0409    .0558          .9864    .9826
 5    .0384    .0589          .9873    .9818`

**Log Sigmoid** is the (natural) log of the sigmoid function: $\log\left(\frac{1}{1+\exp(-x_{i})}\right)$
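One property worth noticing is that the log of a value in $(0, 1)$ is never positive. The scalar sketch below rearranges the formula so that `exp()` never receives a large positive argument; the function name and sample inputs are illustrative, and the rearrangement is a common numerical trick rather than a claim about PyTorch's internals.

```python
import math


def log_sigmoid(x: float) -> float:
    """log(1 / (1 + exp(-x))), arranged to avoid overflow in exp()."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


for x in [-5.0, 0.0, 5.0]:
    print(f"x={x:+.1f}  log_sigmoid={log_sigmoid(x):+.4f}")
```

Every output is at or below 0, so unlike ReLU this activation never passes a positive value on to the next layer.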

`Loss                          Accuracy`

`Ep    ReLU   LogSigmoid       ReLU   LogSigmoid
 1    .1017    .3961          .9669    .8748
 2    .0614    .2659          .9828    .9198
 3    .0562    .2245          .9809    .9364
 4    .0409    .1799          .9864    .9421
 5    .0384    .1465          .9873    .9570`

**Hardshrink** uses the parameter $\lambda$ (default: 0.5) and returns the input $x$ if $|x| > \lambda$, 0 otherwise.

If this or any subsequent function doesn't come close to ReLU after one epoch, we'll stop there and move on.
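For reference, a scalar sketch of Hardshrink as PyTorch documents it: values whose magnitude exceeds $\lambda$ pass through on both the positive and negative side, and everything else becomes 0. The parameter name `lambd` and the sample inputs are choices made for this illustration.

```python
def hardshrink(x: float, lambd: float = 0.5) -> float:
    """Pass x through when |x| > lambd; otherwise return 0."""
    return x if abs(x) > lambd else 0.0


print([hardshrink(x) for x in [-2.0, -0.5, 0.1, 0.5, 0.6]])
# [-2.0, 0.0, 0.0, 0.0, 0.6]
```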

`Loss                          Accuracy`

`Ep    ReLU   Hardshrink       ReLU   Hardshrink
 1    .1017    .2039          .9669    .9415`

**Tanhshrink** is simply $x - \tanh(x)$ for input $x$

`Loss                          Accuracy`

`Ep    ReLU   Tanhshrink       ReLU   Tanhshrink
 1    .1017    2.3013          .9669   .1135`
 
I'm fairly certain `tanhshrink` isn't supposed to plug into this network, at least not without a change to the output of the prior layers.

**Softsign** is $\frac{x}{1+|x|}$ for input $x$

`Loss                          Accuracy`

`Ep    ReLU   Softsign         ReLU   Softsign
 1    .1017    .5148          .9669    .8710`

**Softplus** "is a smooth approximation of the ReLU function" (`nn` documentation), defined as follows:

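Softplus$(x) = \frac{1}{\beta} * \log(1 + \exp(\beta * x))$

By default, $\beta=1$. Above a threshold (default: 20), "the implementation reverts to the linear function": when $\beta * x$ exceeds the threshold, the function simply returns $x$, which is numerically safer and nearly identical in value at that point.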
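A scalar sketch of the formula, including the switch to the linear regime, shows how little is lost by the shortcut. The function name and sample inputs are illustrative, and the defaults mirror the $\beta$ and threshold values quoted above.

```python
import math


def softplus(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
    """Softplus on a scalar, switching to the identity once beta * x > threshold."""
    if beta * x > threshold:
        return x  # linear regime: log(1 + exp(beta * x)) / beta is already ~x here
    return math.log1p(math.exp(beta * x)) / beta


for x in [-2.0, 0.0, 2.0, 25.0]:
    print(f"x={x:+.1f}  softplus={softplus(x):.4f}  relu={max(0.0, x):.4f}")
```

At the default threshold the exact formula and $x$ differ by only a few billionths, so the switch is effectively invisible in the output.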

`Loss                          Accuracy`

`Ep    ReLU   Softplus         ReLU   Softplus
 1    .1017    .2312          .9669    .9291`

**Softmin** is equal to softmax(-$x$) for input $x$, or $\frac{\exp(-x_{i})}{\sum_{j}\exp(-x_{j})}$. The elements of the softmin and softmax tensors are all in the range $[0,1]$ and sum to 1.
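The relationship between the two can be sketched on a plain list of numbers. The helper names and the input scores are made up for this example, and subtracting the maximum before exponentiating is a standard stability trick assumed here, not something taken from the source.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    # Subtracting the max leaves the result unchanged and keeps exp() in range.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmin(xs: List[float]) -> List[float]:
    return softmax([-x for x in xs])


scores = [1.0, 2.0, 3.0]  # hypothetical inputs
print([round(p, 4) for p in softmax(scores)])  # largest input gets the most weight
print([round(p, 4) for p in softmin(scores)])  # smallest input gets the most weight
```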

`Loss                          Accuracy`

`Ep    ReLU   Softmin          ReLU   Softmin
 1    .1017   2.3013           .9669   0.1135`
 
The loss and accuracy after one epoch are identical to those for `tanhshrink`.

**Softmax** is $\frac{\exp(x_{i})}{\sum_{j}\exp(x_{j})}$ (see Softmin)

`Loss                          Accuracy`

`Ep    ReLU   Softmax          ReLU   Softmax
 1    .1017   2.3013           .9669   .1135`
 
The network appears simply not to train with some of these activation functions.
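The repeated 2.3013 loss is worth a second look. Assuming the network is trained with a cross-entropy style loss over the 10 digit classes (the loss function isn't stated in this section), a model that spreads its probability evenly across classes would score about $\ln(10)$:

```python
import math

num_classes = 10  # MNIST digits 0-9

# Cross-entropy when every class is assigned probability 1 / num_classes.
uniform_loss = -math.log(1.0 / num_classes)
print(round(uniform_loss, 4))  # 2.3026
```

That is very close to the losses recorded for these runs. The matching accuracy of .1135 is what predicting the digit 1 for every image would give on the standard MNIST test set, which has 1,135 ones among its 10,000 images. Both numbers fit a network stuck at chance level.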

**Softshrink** is $x - \lambda$ if $x > \lambda$, $x + \lambda$ if $x < -\lambda$, and 0 otherwise ($\lambda = 0.5$ by default).
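A scalar sketch of the three cases follows, with $\lambda = 0.5$ assumed as the default. The parameter name `lambd` and the sample inputs are chosen for this illustration.

```python
def softshrink(x: float, lambd: float = 0.5) -> float:
    """Shrink x toward zero by lambd; anything within lambd of zero becomes 0."""
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0


print([softshrink(x) for x in [-2.0, -0.5, 0.0, 0.3, 0.5, 2.0]])
# [-1.5, 0.0, 0.0, 0.0, 0.0, 1.5]
```

Every input within $\lambda$ of zero maps to exactly 0, and larger inputs are pulled toward zero by $\lambda$.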

`Loss                          Accuracy`

`Ep    ReLU   Softshrink       ReLU   Softshrink
 1    .1017    2.3013         .9669    .1135`

**Gumbel Softmax** "samples from the Gumbel-Softmax distribution" (link in `nn.functional` documentation).

This one also doesn't work for `main.py`

**Log Softmax** computes the natural logarithm of the softmax function.

This one not only doesn't work for `main.py` - it returns `nan` loss early on in the first epoch.

**Tanh** is the hyperbolic tangent of the input.

`Loss                          Accuracy`

`Ep    ReLU    Tanh            ReLU    Tanh
 1    .1017   .3013           .9669   .9164`
 
Ah finally, the network is training again, but still not up to par with ReLU.

**Sigmoid** (saving the tried and true for last) is defined in the description of Log Sigmoid: $\frac{1}{1+\exp(-x_{i})}$

`Loss                          Accuracy`

`Ep    ReLU   Sigmoid          ReLU   Sigmoid
 1    .1017   2.3040          .9669   .0982`
 
Even sigmoid doesn't work?

Whew! That's a lot of activation functions. All that work to confirm that ReLU is awesome (for simple CNNs at least). We didn't get into their use for different tasks - perhaps ReLU wouldn't have been so dominant if we weren't just running tests on a simple convolutional neural network. That's for another time.