# Neural Network Introduction

This notebook gives you an introduction to the basics of neural networks. We will discuss:
1. Deep Learning Basics
2. Layer structure of neural networks
3. Loss functions
4. Optimizers
5. Activation functions

We will also introduce [TensorFlow](https://www.tensorflow.org/) as our deep learning framework. TensorFlow is an end-to-end open source platform for machine learning.

Some helpful **Cheat-Sheets** for **Python/Numpy/Tensorflow**: 
* [Tensorflow_Keras_CheatSheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf)

## Import libraries

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.keras as tfk
from tensorflow.keras.layers import Input,Dense,Activation,Conv2D
from tensorflow.keras.models import Model,Sequential

%matplotlib inline

# Use the public Keras backend API instead of the private tensorflow.python namespace
from tensorflow.keras import backend as K
K.clear_session()

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

## Auxiliary functions
This is just a helper function for plotting activation functions.

In [None]:
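# Auxiliary function that plots an activation function and, optionally, its derivative
def make_plot(x, f, df=None, name='Enter Name', f_name=None, df_name=None):
    # Set the style before creating the figure so that it applies to this plot
    sns.set_style("whitegrid")
    # Create exactly one figure (a second plt.figure() call would leave an empty figure behind)
    plt.figure(figsize=(12,6))
    plt.title(name, fontsize=20, fontweight='bold')
    plt.xlabel('z', fontsize=15)
    plt.ylabel('Activation function value', fontsize=15)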
    
    if f_name is None:
        plt.plot(x, f, label="f (z)")
    else:
        plt.plot(x, f, label=f_name)
    if df is not None:
        if df_name is None:    
            plt.plot(x, df, label="f '(z)")
        else:
            plt.plot(x, df, label=df_name)
        
    plt.legend(loc=4, prop={'size': 15}, frameon=True,shadow=True, facecolor="white", edgecolor="black")
    #plt.savefig(f'..\\doc\\pics\\activation_functions\\{name}.png')
    plt.show()

# Deep Learning Basics

This tutorial accompanies the [lecture on Deep Learning Basics](https://www.youtube.com/watch?list=PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf&v=O5xeyoRL95U) given as part of [MIT Deep Learning](https://deeplearning.mit.edu).

In this tutorial, we mention seven important types/concepts/approaches in deep learning, introducing the first 2 and providing pointers to tutorials on the others. Here is a visual representation of the seven:

![Deep learning concepts](https://i.imgur.com/EAl47rp.png)

At a high level, neural networks are either encoders, decoders, or a combination of both. Encoders find patterns in raw data to form compact, useful representations. Decoders generate new data or high-resolution useful information from those representations. As the lecture describes, deep learning discovers ways to **represent** the world so that we can reason about it. The rest is clever methods that help us deal effectively with visual information, language, sound (#1-6) and even act in a world based on this information and occasional rewards (#7).

1. **Feed Forward Neural Networks (FFNNs)** - classification and regression based on features. See [Part 1](#Part-1:-Boston-Housing-Price-Prediction-with-Feed-Forward-Neural-Networks) of this tutorial for an example.
2. **Convolutional Neural Networks (CNNs)** - image classification, object detection, video action recognition, etc. See [Part 2](#Part-2:-Classification-of-MNIST-Dreams-with-Convolution-Neural-Networks) of this tutorial for an example.
3. **Recurrent Neural Networks (RNNs)** - language modeling, speech recognition/generation, etc. See [this TF tutorial on text generation](https://www.tensorflow.org/tutorials/sequences/text_generation) for an example.
4. **Encoder Decoder Architectures** - semantic segmentation, machine translation, etc. See [our tutorial on semantic segmentation](https://github.com/lexfridman/mit-deep-learning/blob/master/tutorial_driving_scene_segmentation/tutorial_driving_scene_segmentation.ipynb) for an example.
5. **Autoencoder** - unsupervised embeddings, denoising, etc.
6. **Generative Adversarial Networks (GANs)** - unsupervised generation of realistic images, etc. See [this TF tutorial on DCGANs](https://github.com/tensorflow/tensorflow/blob/r1.11/tensorflow/contrib/eager/python/examples/generative_examples/dcgan.ipynb) for an example.
7. **Deep Reinforcement Learning** - game playing, robotics in simulation, self-play, neural architecture search, etc. We'll be releasing notebooks on this soon and will link them here.

There are selective omissions and simplifications throughout these tutorials, hopefully without losing the essence of the underlying ideas.

## Neural Network Layers

A neural network consists of several layers and can use different architectures depending on the problem you want to solve. What all networks share are the input and output layers. Those have to be fitted specifically to your problem.

<img src="Pictures/small_fcn.PNG" width="500px" >

Now we want to build the network model described in the picture using TensorFlow.
To do so, we create the same model twice, once with the `functional` API and once with the `sequential` API.

**1. Functional API:** The functional API gives you a lot more flexibility, as you can easily define models in which layers connect to more than just the previous and next layers. In fact, you can connect a layer to (literally) any other layer. As a result, complex networks such as siamese networks and residual networks become possible.

**2. Sequential API:**
The sequential API allows you to create models layer by layer, which is sufficient for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs.

In [None]:
# Resets all state generated by Keras.
K.clear_session()

# With the keras functional api:
x = Input(shape=(3,))
z = Dense(4)(x)
g = Activation(activation="relu")(z)
y = Dense(2)(g)
model = Model(x, y)
model.summary()

print()

# With the keras sequential api:
model = Sequential()
model.add(Input(shape=(3,)))
model.add(Dense(4))
model.add(Activation(activation='relu'))
model.add(Dense(2))

model.summary()

### Input Layer
The input layer is always the first layer in the network. It declares the shape of the input data (feature vectors, pictures, ...) so that the data can be passed through the network as tensors.
The basic implementation in TensorFlow/Keras looks like this:

``` python
x = Input(shape=(32,))
x = Input(shape=(28,28,1))
x = Input(shape=(512,512,3))
```


### Dense Layer
A dense layer consists of multiple neurons, each of which is connected to all neurons in the previous layer. In TensorFlow/Keras you can create it with the `Dense()` layer class, where you specify how many neurons the layer should have and which activation it should use.

In [None]:
# Resets all state generated by Keras.
K.clear_session()

x = tf.convert_to_tensor(np.random.uniform(low=-1,high=1,size=(3,1)),dtype=tf.float32)
# Create a dense layer with linear activation and calculate the outputs for random inputs
layer = Dense(3, activation='linear')
y = layer(x)

print(f"inputs:\n {x.numpy().flatten()}")
print(f"weights:\n {layer.weights[0].numpy().flatten()}")
print(f"y:\n {y.numpy()}")

<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  

           1. Calculate the output by hand! 
           
**Hint**
`layer.weights` returns weights and bias

<br>
<details>
<summary><b>Click here for one possible solution</b></summary>
    
```python
w = layer.weights[0] # Weights
b = layer.weights[1] # Bias
y_by_hand = tf.matmul(x, w) + b # dot(input, weights) + bias

print(f"y_by_hand:\n {y_by_hand}")

# True if both results agree (up to floating point rounding)
np.allclose(y.numpy(), y_by_hand.numpy())
```
    
</details>
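
The same computation can be sketched in plain NumPy, independent of Keras. This is a minimal illustration, not the Keras implementation: the function name `dense_forward` and all numbers below are made up for the example. Each row of `x_demo` is one sample, each column of `w_demo` belongs to one neuron.

```python
import numpy as np

def dense_forward(x, w, b):
    """Linear dense layer: dot(input, weights) + bias."""
    return x @ w + b

# Hypothetical values: 3 samples with 1 feature each, a layer with 3 neurons
x_demo = np.array([[0.5], [-1.0], [2.0]])   # shape (samples, features)
w_demo = np.array([[1.0, -2.0, 0.5]])       # shape (features, neurons)
b_demo = np.array([0.0, 1.0, -1.0])         # one bias per neuron

y_demo = dense_forward(x_demo, w_demo, b_demo)
print(y_demo.shape)  # (3, 3): one output per sample and neuron
print(y_demo)
```

With more than one input feature, the matrix product sums over the features, which an element-wise `x*w` would not do.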


In [None]:
############ YOUR CODE HERE ############
y_by_hand = 

print(f"y_by_hand:\n {y_by_hand}") # dot(input, weights) + bias

# Evaluates to True if y and y_by_hand are equal (up to floating point rounding)
np.allclose(y.numpy(), y_by_hand.numpy())

In [None]:
# Resets all state generated by Keras.
tfk.backend.clear_session()

# Dense
x = Input(shape=(32,))
y = Dense(16, activation='relu')(x)
y = Dense(32, activation='relu')(y)
y = Dense(64, activation='linear')(y)
model = Model(x, y)

model.summary()

<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  

           1. Can you calculate the number of parameters (Param #) yourself? 
           2. How are they computed? 
<details>
<summary><b>Click here for one possible solution</b></summary>

dense_params   = 32*16+16 : inputs*neurons + biases (one bias per neuron) <br>
dense_1_params = 16*32+32 : inputs*neurons + biases (one bias per neuron) <br>
dense_2_params = 32*64+64 : inputs*neurons + biases (one bias per neuron)
</details>
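
The counting rule can also be written as a small helper. This is only a sketch; the name `dense_param_count` is hypothetical and not part of Keras. It assumes every neuron has one weight per input plus one bias.

```python
def dense_param_count(n_inputs, n_units):
    """Trainable parameters of a dense layer: one weight per (input, neuron) pair plus one bias per neuron."""
    return n_inputs * n_units + n_units

# Layer widths of the model above: 32 inputs -> 16 -> 32 -> 64 neurons
layer_sizes = [32, 16, 32, 64]
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [528, 544, 2112]
print(sum(counts))  # 3184
```

The sum should match the total number of parameters reported by `model.summary()`.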


In [None]:
############ YOUR CODE HERE ############
dense_params = 0
dense_1_params = 0
dense_2_params = 0

print(f"Layer dense: {dense_params}")
print(f"Layer dense_1: {dense_1_params}")
print(f"Layer dense_2: {dense_2_params}")

### Convolution Layers


The basic structure is a little bit different now, as the input is a two-dimensional picture.
We use so-called **convolutions** to compress (and shrink) the information in the picture. They use filters to 'scan' the picture and detect edges or other important parts of the image.

<img src="https://s3-us-west-2.amazonaws.com/static.pyimagesearch.com/keras-conv2d/keras_conv2d_padding.gif" width="500px" >

If the information is dense enough after those convolutions, we can flatten the remaining picture (the so-called feature map) and use a normal dense layer to make the prediction.


<img src="https://missinglink.ai/wp-content/uploads/2019/03/Frame-16.1.png" width="700px" >
 
Those convolution layers have two effects: they extract information from the picture, and they shrink the picture if you don't apply padding, i.e. some sort of frame around the picture (see first gif). This shrinking is actually a good thing, as we want the picture to get smaller while we compress the information.
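
How much a convolution shrinks the picture can be estimated with a short helper. This is a sketch under the usual TensorFlow conventions for `padding='valid'` (no frame) and `padding='same'` (frame added); the function name `conv_output_size` is made up for this example and is applied per spatial dimension.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension of a convolution."""
    if padding == "same":
        # With a frame around the picture only the stride shrinks the output: ceil(input / stride)
        return -(-input_size // stride)
    # Without a frame the filter has to fit completely inside the picture
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(28, 3))                  # 26: a 3x3 filter shrinks 28x28 to 26x26
print(conv_output_size(28, 3, padding="same"))  # 28: padding keeps the size
print(conv_output_size(28, 3, stride=2))        # 13
```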
 
In TensorFlow/Keras the convolution layer is called `Conv2D` (for two-dimensional convolutions). You need to specify the number of filters you want to have (each filter creates one feature map, i.e. one output channel) and how big the filter is.

`Calculation of params = (kernel_height * kernel_width * input_image_channels + 1) * number_of_filters`
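
The formula translates directly into code. The helper name `conv2d_param_count` is hypothetical; the second call uses made-up layer sizes purely as an illustration.

```python
def conv2d_param_count(kernel_height, kernel_width, input_channels, n_filters):
    """Each filter has one weight per kernel position and input channel, plus one bias."""
    return (kernel_height * kernel_width * input_channels + 1) * n_filters

# 3 filters of size 3x3 on a grayscale (1 channel) image
print(conv2d_param_count(3, 3, 1, 3))    # 30

# Illustrative second case: 16 filters of size 3x3 on an RGB (3 channel) image
print(conv2d_param_count(3, 3, 3, 16))   # 448
```

Note that the picture size (e.g. 28x28) does not appear in the formula: the same filter weights are reused at every position.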

In [None]:
# Resets all state generated by Keras.
tfk.backend.clear_session()

# A single convolution layer with 3 filters of size 3x3 on a 28x28 grayscale image
x = Input(shape=(28,28,1))
y = Conv2D(filters=3,kernel_size=(3,3), activation='softmax')(x)
model = Model(x, y)
model.summary()

### Pooling Layers

Pooling layers help reduce the computational effort in the network by 'shrinking' the input. This is done by replacing each small area with its max, min or mean value, which preserves the most important information while reducing the input size!
<img src="Pictures/pooling.png" width="700px" > 
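
To make the grouping concrete, here is a small NumPy sketch of non-overlapping pooling on a single-channel image. The helper `pool2d` and the example values are made up for illustration; the sketch assumes that the image height and width are divisible by the pool size.

```python
import numpy as np

def pool2d(image, size, reduce_fn):
    """Non-overlapping pooling: split the image into size x size blocks and reduce each block."""
    h, w = image.shape
    if h % size or w % size:
        raise ValueError("image height and width must be divisible by the pool size")
    # Axes after reshape: (block row, row inside block, block column, column inside block)
    blocks = image.reshape(h // size, size, w // size, size)
    return reduce_fn(blocks, axis=(1, 3))

image = np.array([[1, 3, 2, 4],
                  [5, 7, 6, 8],
                  [9, 2, 1, 0],
                  [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(image, 2, np.max)
avg_pooled = pool2d(image, 2, np.mean)
print(max_pooled)  # [[7. 8.] [9. 6.]]
print(avg_pooled)  # [[4.  5. ] [4.5 3. ]]
```

The 4x4 input shrinks to 2x2, i.e. to a quarter of the original number of values.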

In [None]:
# Resets all state generated by Keras.
tfk.backend.clear_session()

# A convolution layer followed by a 2x2 max pooling layer
x = Input(shape=(28,28,1))
y = Conv2D(filters=3,kernel_size=(3,3))(x)
y = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(y)
model = Model(x, y)
model.summary()

#### Max Pooling

In [None]:
N=1
H=4
W=4
C=1
# Create a tensor with random integers of shape [N,H,W,C], i.e. a single 4x4 matrix
tensor = tf.convert_to_tensor(np.random.choice(np.arange(1,20),[N,H,W,C]),dtype=tf.float32)
x = tensor.numpy()[0,:,:,:].transpose(2,0,1)
print('Tensor before max_pool2d operation:')
print(x)

In [None]:
# Do MaxPooling on the tensor
max_pool = tf.nn.max_pool2d(tensor,ksize=[1,2,2,1],
                            strides=[1,2,2,1],padding='SAME')
print('Tensor after max_pool2d operation:')
print(max_pool.numpy()[0,:,:,:].transpose(2,0,1))

#### Avg Pooling

<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  

**Exercise:** 

               1. Try the same operation with the `tf.nn.avg_pool2d` operation
               2. Which pooling operation do you think is better?

In [None]:
############ YOUR CODE HERE ############


#### Avg Pooling vs. Max Pooling
Compare min, average and max pooling on a custom binary image.

In [None]:
from skimage.measure import block_reduce
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread

# Just play with the kernel-sizes
H = 9
W = 9

# Read two images
imgs = [imread('Pictures/black_x.png'), imread('Pictures/white_x.png')]

for img in imgs:

    max_pool = tf.nn.max_pool2d(np.expand_dims(img,0),ksize=[1,H,W,1],
                                 strides=[1,H,W,1],padding='VALID').numpy()[0,:]
    avg_pool = tf.nn.avg_pool2d(np.expand_dims(img,0),ksize=[1,H,W,1],
                                 strides=[1,H,W,1],padding='VALID').numpy()[0,:]
    
#     avg_pool=block_reduce(img, block_size=(H,W,1), func=np.mean)
#     max_pool=block_reduce(img, block_size=(H,W,1), func=np.max)
    min_pool=block_reduce(img, block_size=(H,W,1), func=np.min) # Not included in tensorflow :-(
    
    plt.figure(figsize=(12,8))
    plt.subplot(141)
    imgplot = plt.imshow(img,cmap=plt.cm.binary)
    plt.title('Original Image')

    plt.subplot(142)
    imgplot3 = plt.imshow(min_pool, cmap=plt.cm.binary)
    plt.title('Min pooling')

    plt.subplot(143)
    imgplot1 = plt.imshow(avg_pool, cmap=plt.cm.binary)
    plt.title('Avg pooling')
    
    plt.subplot(144)
    imgplot1 = plt.imshow(max_pool, cmap=plt.cm.binary)
    plt.title('Max pooling')

    plt.show()

* Min pooling gives better results for images with a white background and a black object
* Avg pooling gives the same results regardless of whether the background is black or white
* Max pooling gives better results for images with a black background and a white object (e.g. the MNIST dataset)

When classifying the `MNIST` digits dataset with a CNN, max pooling is used because these images show bright digits on a background that was made black to reduce the computation cost.
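
The observations above can be reproduced on a tiny synthetic image. This is a hedged toy example, not the images used in the notebook: a white diagonal stroke (1) on a black background (0) and its inverted counterpart. The helper `pool_blocks` is made up for this sketch.

```python
import numpy as np

def pool_blocks(image, size, reduce_fn):
    """Non-overlapping pooling on a single-channel image."""
    h, w = image.shape
    return reduce_fn(image.reshape(h // size, size, w // size, size), axis=(1, 3))

white_on_black = np.eye(4)               # white stroke on black background
black_on_white = 1.0 - white_on_black    # same stroke, colors inverted

# Max pooling keeps the white stroke, min pooling erases it
print(pool_blocks(white_on_black, 2, np.max))  # [[1. 0.] [0. 1.]]
print(pool_blocks(white_on_black, 2, np.min))  # all zeros

# For the inverted image it is the other way round
print(pool_blocks(black_on_white, 2, np.max))  # all ones
print(pool_blocks(black_on_white, 2, np.min))  # [[0. 1.] [1. 0.]]
```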

### Dropout 

If the training set for your neural network has very few samples, your network tends to overfit this data. That means that it is likely to learn features that only the training data includes, resulting in bad predictions for data that was not part of the training set.

Dropout is used to prevent exactly that by randomly dropping out different neurons in each layer at every training step. This forces the network to take different paths even if the input data is the same, which makes the training more robust and reduces the risk of overfitting!

<img src="Pictures/dropout.png" width="500" align="middle" />  
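
As a rough sketch of the idea (not the TensorFlow implementation), dropout can be written in a few lines of NumPy. The function name `dropout` and the values are made up; the sketch assumes the common "inverted dropout" variant, in which the surviving values are scaled by `1/(1-rate)` during training.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero each element with probability `rate` and rescale the survivors by 1/(1-rate)."""
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(seed=0)
activations = np.ones(10_000)          # hypothetical layer outputs
dropped = dropout(activations, rate=0.3, rng=rng)

print("fraction zeroed:   ", np.mean(dropped == 0.0))  # close to the dropout rate
print("mean after dropout:", dropped.mean())           # close to the original mean of 1.0
```

Because of the rescaling, the expected value of every element stays the same, so the next layer sees inputs of similar magnitude with and without dropout.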

In [None]:
# Imagine this Tensor (Vector) as one Layer with weights (random numbers) in your Network
tensor = tf.keras.backend.random_normal(shape=[10,1],mean=0.,stddev=1.)

# Specify the dropout rate (the fraction of elements that is dropped)
# Input elements are randomly set to zero (and the other elements are rescaled by 1/(1-rate)).
# This encourages each node to be independently useful, as it cannot rely on the output of other nodes.
dropout_rate = 0.3
drop_out_tensor = tf.nn.dropout(tensor,rate=dropout_rate)


print("Input Tensor:\n",tensor.numpy())
print("----------------")
print("Dropout Tensor:\n",drop_out_tensor.numpy())

One important thing to notice is that the expected sum of the weights has to stay the same. Therefore the remaining (non-zero) weights are rescaled by the factor `1/(1-dropout_rate)`.

In [None]:
print(f'Original sum of weights: {tensor.numpy().sum()}')
scale = 1/(1-dropout_rate)

print(f'Dropout Tensor sum: {np.sum(drop_out_tensor[drop_out_tensor!=0].numpy())}')
print(f'Original Tensor scaled: {np.sum((tensor*scale)[drop_out_tensor!=0].numpy())}')

In [None]:
imgs = [imread('Pictures/black_x.png'), imread('Pictures/white_x.png')]

dropout_rate = 0.3
for img in imgs:

    # Randomly zero out pixels, then rescale the result to the range 0..255 for plotting
    do_img = tf.nn.dropout(img,rate=dropout_rate).numpy()
    do_img = np.uint8((255*do_img)/do_img.max())
    
    plt.figure(figsize=(12,8))
    plt.subplot(141)
    imgplot = plt.imshow(img)
    plt.title('Original Image')

    plt.subplot(142)
    imgplot3 = plt.imshow(do_img)
    plt.title('Dropout')
    
    plt.show()

## Loss Functions

As part of the optimization algorithm, the error for the current state of the model must be estimated repeatedly. This requires the choice of an error function, conventionally called a loss function, that can be used to estimate the loss of the model so that the weights can be updated to reduce the loss on the next evaluation.

Neural network models learn a mapping from inputs to outputs from examples and the choice of loss function must match the framing of the specific predictive modeling problem, such as classification or regression. Further, the configuration of the output layer must also be appropriate for the chosen loss function.

### Mean Squared Error
Mean squared error (MSE) is calculated as the average of the squared differences between the predicted and actual values. The result is never negative, regardless of the sign of the predicted and actual values, and a perfect value is 0.0. The squaring means that larger mistakes result in disproportionately more error than smaller mistakes, so the model is punished more for making larger mistakes.

<img src="https://miro.medium.com/max/404/1*TCE9Kui4fbyZl5u3ARRBJw.png" align="center" style="width:30%" />  
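
The definition can be checked by hand with NumPy. This is a minimal sketch that uses the same example values as the Keras cell in this section; the variable name `mse_by_hand` is made up.

```python
import numpy as np

y_true = np.array([[0., 1.], [0., 1.]])
y_pred = np.array([[0., 1.5], [0., 1.2]])

# Squared differences: 0, 0.25, 0 and 0.04, averaged over all four values
mse_by_hand = np.mean((y_true - y_pred) ** 2)
print(mse_by_hand)  # approximately 0.0725
```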

In [None]:
y_true = [[0., 1.], [0., 1.]]
y_pred = [[0., 1.5], [0., 1.2]]

mse = tf.keras.losses.MeanSquaredError()
mse(y_true, y_pred).numpy()

### Binary Crossentropy
Binary crossentropy is the loss function to use when solving a problem involving just **two classes**,
for example when predicting whether a banknote is real or fake.

<img src="https://miro.medium.com/max/548/1*PDtIfRHpMfbbXbhj26I-OA.png" align="center" style="width:50%" />  
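
A hand-written version of the formula might look as follows. It is a sketch, not the Keras implementation: the function name and the example values are made up, and the clipping constant `eps` is an assumption that keeps `log(0)` from occurring.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all predictions."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the probabilities so that log(0) can never occur
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

labels = [0, 1, 0, 0]   # hypothetical labels, e.g. 1 = banknote is fake

good = binary_crossentropy(labels, [0.1, 0.8, 0.2, 0.3])
bad = binary_crossentropy(labels, [0.9, 0.2, 0.8, 0.7])
print(good, bad)  # confident wrong predictions are punished much harder
```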

In [None]:
y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 0.], [0., 0.]]
bce = tf.keras.losses.BinaryCrossentropy()
bce(y_true, y_pred).numpy()

### Categorical Crossentropy
Categorical crossentropy is used when you have to classify between **multiple classes** (non-binary). Here the true values are usually given as `one-hot-encoded vectors`, where the true class has a 1 and all the other classes have a 0.

<img src="https://miro.medium.com/max/700/1*LIil7qwrVehQ8RxpwNXeUw.png" align="center" style="width:50%" />  
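
Because the true values are one-hot encoded, only the predicted probability of the true class contributes to the loss. The following NumPy sketch shows this with the same example values as the Keras cell in this section; the function name and the clipping constant `eps` are assumptions made for this illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the probabilities so that log(0) can never occur
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=-1)
    return float(np.mean(per_sample))

y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]

# Sample 1: -log(0.95), sample 2: -log(0.1)
loss = categorical_crossentropy(y_true, y_pred)
print(loss)  # approximately 1.177
```

The second sample dominates the loss because the network assigns only 10% probability to the true class.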

In [None]:
y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
cce = tf.keras.losses.CategoricalCrossentropy()
cce(y_true, y_pred).numpy()

## Optimizers

After defining the loss function we need to minimize it. This is what the optimizer does: it takes the loss and adjusts the parameters step by step until the loss reaches a minimum (careful: this may be a local minimum rather than the global one). To do so we use gradient descent. The gradient of a function f(x) is the vector of its partial derivatives. It points in the direction of the greatest rate of increase of that function, so gradient descent takes a small step in the opposite direction.

<img src="Pictures/optimizers.png" width="700px" > 
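
The update rule can be sketched in plain Python for a one-dimensional example. This is a minimal illustration with a hand-derived gradient, using the function `x**2 - 4x` with its minimum at `x = 2`; the helper names, the learning rate and the number of steps are arbitrary choices for this sketch.

```python
def f(x):
    return x**2 - 4*x

def grad_f(x):
    # Derivative of x**2 - 4x
    return 2*x - 4

def gradient_descent(x0, learning_rate, steps):
    """Repeatedly step against the gradient: x <- x - learning_rate * f'(x)."""
    x = x0
    history = [x]
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)
        history.append(x)
    return x, history

x_min, history = gradient_descent(x0=15.0, learning_rate=0.1, steps=100)
print(x_min, f(x_min))  # close to x = 2 and f(x) = -4
```

Every step moves `x` a fraction of the way towards the minimum. With a learning rate of 0.1 the distance to `x = 2` shrinks by a factor of 0.8 per step.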


In [None]:
def f(x1):
    return x1**2.0 -4*x1

def f2(x1,x2):
    return x1**2.0 -x1*5 + x2**2.0

x = np.linspace(-5, 5, 300, dtype=np.float32)


plt.title('Function x**2 - 4x')
plt.plot(x,f(x))
plt.show()

In [None]:
#starting value (randomly selected)
x1 = tf.Variable(15.0)
x2 = tf.Variable(5.0)

# # Using the built-in optimizer 
# def f_opt():
#     return x1**2.0 -x1*5 + x2**2.0

# opt = tf.keras.optimizers.SGD(learning_rate=0.1)
# loss_fn = f_opt
# var_list_fn = [x1,x2]
# for i in range(0,100):
#     if i%10 == 0:
#         print(f"y={loss_fn().numpy()} x1={x1.numpy()} x2={x2.numpy()}")
#     opt.minimize(loss_fn, var_list_fn)


# # Apply gradient to optimizer
# for i in range(0,100):
#     with tf.GradientTape() as t:
#         y = f2(x1,x2)
#     g = t.gradient(y,[x1,x2])
#     if i%10 == 0:
#         print(f"y={y.numpy()} x1={x1.numpy()} x2={x2.numpy()}")
#     g_and_v = zip(g,[x1,x2])
#     opt.apply_gradients(g_and_v)


# Gradient descent by hand: tf.GradientTape computes the gradient, we apply the update ourselves
l = []
learning_rate = 0.01
for i in range(0,500):
    #Calculating the gradient at each step!
    with tf.GradientTape() as t:
        y = f(x1)
    # calculated gradient
    g = t.gradient(y,[x1])
    # new x value calculated using the gradient and the learning rate
    x1.assign(x1-learning_rate*g[0].numpy())

    l.append(x1.numpy())


# Plotting the result
plt.figure(figsize=(20,10))
plt.subplot(1,2,1)
# l contains the value of x1 after each update step (not the loss itself)
plt.title('x1 after each update step')
plt.plot(l)
plt.subplot(1,2,2)
plt.title('calculated minimum')
plt.plot(x,f(x))
plt.plot(x1.numpy(),f(x1.numpy()),'o')
plt.legend(['function','minimum'])
plt.tight_layout()
print(f"Minimum of x**2 - 4x: x1={x1.numpy()}")

<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  

**Small Exercise:** Play with the learning rate and see what happens if you make it smaller or larger. What happens if the learning rate is > 1?

# Activation Functions

Activation functions are mathematical functions that determine the output of each neuron in a neural network. The function is attached to each neuron in the network and determines whether it should be activated (“fired”) or not, based on whether the neuron’s input is relevant for the model’s prediction. Some activation functions also normalize the output of each neuron to a fixed range, e.g. between 0 and 1 (sigmoid) or between -1 and 1 (tanh).

An additional aspect of activation functions is that they must be computationally efficient because they are calculated across thousands or even millions of neurons for each data sample. Modern neural networks use a technique called backpropagation to train the model, which places an increased computational strain on the activation function, and its derivative function.

## Why we use Activation Functions

Activation functions are one of the key elements of the neural network. Without them, our neural network would become a combination of linear functions, so it would be just a linear function itself. 
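
The collapse into a single linear function can be verified numerically. The following NumPy sketch uses small made-up weights: two dense layers without activation give exactly the same outputs as one layer with combined weights, while inserting a ReLU between them breaks this equivalence.

```python
import numpy as np

# Hypothetical weights of two dense layers (2 inputs -> 2 hidden neurons -> 1 output)
x = np.array([[1.0, -1.0], [2.0, 0.5]])
w1, b1 = np.array([[1.0, -1.0], [2.0, 0.0]]), np.array([0.0, 1.0])
w2, b2 = np.array([[1.0], [1.0]]), np.array([0.5])

# Two linear layers applied one after the other
two_layers = (x @ w1 + b1) @ w2 + b2

# The same mapping expressed as a single linear layer
w_combined = w1 @ w2
b_combined = b1 @ w2 + b2
one_layer = x @ w_combined + b_combined
print(np.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the network is no longer equal to that single layer
with_relu = np.maximum(x @ w1 + b1, 0.0) @ w2 + b2
print(np.allclose(with_relu, one_layer))   # False
```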

The non-linearity element allows for greater flexibility and creation of complex functions during the learning process. The activation function also has a significant impact on the speed of learning, which is one of the main criteria for their selection. Currently, the most popular one for hidden layers is probably `ReLU`.

## Linear

$${\displaystyle f(x)={x}}
\hspace{1cm}
{\displaystyle f'(x) = 1}$$

In [None]:
z = np.arange(-10, 10, 0.01)
f = z
df = np.ones_like(z)
make_plot(z, f, df, "Linear")

## Sigmoid

$${\displaystyle f(x)={\frac {1}{1+e^{-x}}}}
\hspace{1cm}
{\displaystyle f'(x) = f(x)\cdot(1-f(x))}$$
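
Since backpropagation relies on the derivative, it can be reassuring to check the closed form numerically. The sketch below compares `f(x)(1-f(x))` with a central finite difference; the step size `h` and the test points are arbitrary choices for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5.0, 5.0, 11)
h = 1e-5

# Central finite difference approximation of the derivative
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_error = np.max(np.abs(numeric - sigmoid_grad(z)))

print(sigmoid_grad(0.0))  # 0.25, the largest value the derivative can take
print(max_error)          # tiny, both derivatives agree
```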

In [None]:
z = np.arange(-10, 10, 0.01)   
f = 1/(1 + np.exp(-z))
df = f*(1 - f)
make_plot(z, f, df, "Sigmoid")

## Tanh

$${\displaystyle f(x) = \tanh(x)={\frac {\sinh(x)}{\cosh (x)}}={\frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}}}
\hspace{1cm}
{\displaystyle f'(x) = 1 - f(x)^{2}}$$

In [None]:
z = np.arange(-10, 10, 0.01)
f = np.tanh(z)
df = 1 - f*f
make_plot(z, f, df, "tanh")

## ELU

$${\displaystyle f(x) = \begin{cases}
x & \text{ if } x>0 \\ 
\alpha(e^x-1) & \text{ if } x\leq 0 
\end{cases}}
\hspace{1cm}
{\displaystyle f'(x) = \begin{cases}
1 & \text{ if } x>0 \\ 
\alpha e^x & \text{ if } x\leq 0 
\end{cases}}$$

In [None]:
alpha=1.
z = np.arange(-10, 10, 0.01)
f = np.where(z > 0, z, alpha*(np.exp(z) - 1))
df = np.where(z > 0, 1, alpha*(np.exp(z)))
make_plot(z, f, df, "ELU")

## ReLU

$${\displaystyle f(x) = \begin{cases}
0 & \text{ if } x<0 \\ 
x & \text{ if } x\geq 0 
\end{cases}}
\hspace{1cm}
{\displaystyle f'(x) = \begin{cases}
0 & \text{ if } x<0 \\ 
1 & \text{ if } x\geq 0 
\end{cases}}$$

In [None]:
z = np.arange(-10, 10, 0.01)
f = z * (z > 0)
df = 1. * (z > 0)
make_plot(z, f, df, "ReLU")

## Leaky ReLU

$${\displaystyle f(x) = \begin{cases}
0.01 \cdot x & \text{ if } x<0 \\ 
x & \text{ if } x\geq 0 
\end{cases}}
\hspace{1cm}
{\displaystyle f'(x) = \begin{cases}
0.01 & \text{ if } x<0 \\ 
1 & \text{ if } x\geq 0 
\end{cases}}$$

In [None]:
z = np.arange(-10, 10, 0.01)
f = np.where(z > 0, z, z * 0.01)
df = np.where(z > 0, 1, 0.01)
make_plot(z, f, df, "Leaky_ReLU")

## Softplus

$${\displaystyle f(x) = \ln(1+e^{x})}
\hspace{1cm}
{\displaystyle f'(x)={\frac {e^x}{1+e^{x}}}={\frac {1}{1+e^{-x}}}}$$
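
One practical caveat: evaluating the formula literally as `np.log(1 + np.exp(z))` overflows for large inputs, because `np.exp(z)` exceeds the float64 range for `z` above roughly 709. A numerically safer sketch uses `np.logaddexp`, which computes `log(exp(a) + exp(b))` without forming the exponentials explicitly. The function name `softplus` is made up for this example.

```python
import numpy as np

def softplus(z):
    """Numerically stable softplus: log(exp(0) + exp(z)) = log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

print(softplus(0.0))      # log(2), approximately 0.693
print(softplus(1000.0))   # 1000.0, no overflow
print(softplus(-1000.0))  # 0.0
```

For the plotting range of -10 to 10 used in this notebook both variants give the same curve.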

In [None]:
z = np.arange(-10, 10, 0.01)
f = np.log(1 + np.exp(z))
df = 1/(1 + np.exp(-z)) # = Sigmoid
make_plot(z, f, df, "Softplus")

In [None]:
z = np.arange(-10, 10, 0.01)
f = np.log(1 + np.exp(z))
df = z * (z > 0)
make_plot(z, f, df, "Softplus vs Relu",f_name='Softplus',df_name='Relu')