# Chapter 4: Activation Functions

In [2]:
# Preface: import the necessary packages
import numpy as np
import matplotlib.pyplot as plt
import nnfs
import math
nnfs.init()
from nnfs.datasets import spiral_data
from resources.classes import DenseLayer, ReLU, SoftMax

We use activation functions because a non-linear activation function is what allows a deep neural network to map non-linear functions.

Your average neural network will only really consist of two kinds of activation functions: your hidden layer activation function, and your output activation function. You don't "need" to use the same activation function across all your hidden layers, but it's pretty standard practice.

## Section 1: Different Types of Activation Functions
### Section 1.1: Step Activation Function
The step activation function most closely mimics how a neuron fires in the brain -- in the sense that it's all or nothing (0 or 1). Basically, if the (weight*input + bias) is >= 0, it will fire and output 1, otherwise, it will output a 0. 
This was commonly used a long time ago, but it is no longer commonplace: an all-or-nothing output carries too little information to train a network on complex datasets.
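
To make the firing rule concrete, here is a minimal sketch of a single neuron with a step activation. The inputs, weights, and bias are hand-picked placeholder values for illustration, not anything from the book.

```python
import numpy as np

def step(z):
    # Fires (1) when the pre-activation is >= 0, otherwise outputs 0.
    return np.where(z >= 0, 1, 0)

# Hypothetical neuron: two inputs with made-up weights and bias.
inputs = np.array([1.0, -2.0])
weights = np.array([0.5, 0.3])
bias = 0.2

z = np.dot(weights, inputs) + bias  # 0.5 - 0.6 + 0.2, roughly 0.1
step_out = step(z)

print(f"Pre-activation: {z}")
print(f"Step output: {step_out}")
```

Because the pre-activation is slightly above zero, the neuron fires. Nudging the bias down to 0.0 would flip the output to 0, with nothing in between.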

### Section 1.2: Linear Activation Function
The linear activation function is just the equation of a line: meaning it is quite literally y=x and where the output is equal to the input. This function is commonly used in the output layer of a regression model.

### Section 1.3: Sigmoid Activation Function
The problem with the traditional step function is it's not a very helpful representation of progress. It either does or does not fire (0 or 1) so it's difficult to say how the current and/or future steps may impact things. So, we sought to have something a little more informative. Hence the sigmoid activation function.
The sigmoid was the original "specific and granular" activation function used for neural networks. It has the equation y = 1/(1+e^(-x)). Its outputs range from 0 (as the input goes toward -inf) to 1 (as the input goes toward +inf), centering at 0.5 with an input of 0. The sigmoid function has now commonly been replaced with ReLU (rectified linear units -- which I'll dive into next).
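
As a quick sanity check on those limits, here is a small NumPy sketch of the sigmoid written as 1/(1+e^(-x)). The function name `sigmoid` and the test inputs are my own choices for illustration.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the interval between 0 and 1.
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, 0.0, 10.0])
sigmoid_out = sigmoid(xs)

# Very negative inputs land near 0, an input of 0 lands at 0.5,
# and very positive inputs land near 1.
print(sigmoid_out)
```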

### Section 1.4: Rectified Linear Units (ReLU)
The ReLU function is just y=x clipped to 0 on the left side of the y axis. That means it's really just y = max(0, x).
This is the most commonly used activation function - primarily because of its efficiency and speed. It's much simpler to compute than a Sigmoid function, but it still preserves non-linearity because of its piecewise definition.

## Section 2: Why do we use activation functions? + Why does ReLU work?

One of the main reasons why we use activation functions is to model non-linear functions. Without an activation function specified, the activation function is just inherently y=x, meaning it can only model linear relationships. 
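
Here is a small sketch of why that matters: two dense layers with the implicit y=x activation collapse into a single dense layer. The shapes and random values are arbitrary assumptions, and I'm using plain NumPy arrays rather than the DenseLayer class.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))  # 5 samples, 2 features

# Two hypothetical dense layers: 2 -> 3 -> 2
W1, b1 = rng.standard_normal((2, 3)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((3, 2)), rng.standard_normal(2)

# Forward pass through both layers with no non-linearity in between
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping expressed as ONE layer with merged weights and biases
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_layer = X @ W_merged + b_merged

print(np.allclose(two_layers, one_layer))
```

No matter how many linear layers you stack, the result is still one linear mapping, so the extra depth buys nothing without a non-linear activation in between.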

It's certainly unintuitive that you can use a piecewise linear function to model non-linear relationships, but it actually works when you think about it. That's because each ReLU contributes a single bend, and you can add and combine any number of ReLU activations, each shifted and scaled by its own weights and bias, to build a shape with as many bends as you need. I probably botched that, but it makes sense to me. If you need a better explanation, I'd recommend looking [here](https://blog.dailydoseofds.com/p/a-visual-and-intuitive-guide-to-what).
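
To see the idea in miniature, here is a sketch that adds three shifted ReLUs together to make a "bump" that no single straight line could produce. The shifts (0, 1, 2) and weights (+1, -2, +1) are hand-picked for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def bump(x):
    # Three shifted ReLUs weighted +1, -2, +1:
    # rises from 0 to 1 on [0, 1], falls back to 0 on [1, 2], flat elsewhere.
    return relu(x) - 2 * relu(x - 1) + relu(x - 2)

xs = np.array([-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0])
print(bump(xs))
# [0.  0.  0.5 1.  0.5 0.  0. ]
```

Each ReLU adds one bend, so more neurons means more bends and a closer fit to a curvy target.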



## Section 3: Coding a ReLU activation function

The ReLU activation function can be coded as shown below:

In [3]:
# Our placeholder inputs
inputs = [0, 3, -4, -5, 10, 5, 10, 3]

# A list comprehension that replaces any value less than 0 with 0 and otherwise keeps the value.
outputs = [(max(0,x)) for x in inputs]

print(f"Before ReLU: {inputs}")
print(f"After ReLU: {outputs}")

# So, let's define a ReLU class! P.S. this is just for an example, I'll also define it in a separate python file so we can reference it everywhere, which will just be called the "ReLU" class.
class ReLUExample:
    # Forward pass
    def forward(self, inputs):
        # We can leverage the np.maximum method, which just carries out the ReLU operation for us on the whole list in one go.  
        self.output = np.maximum(0, inputs)

Before ReLU: [0, 3, -4, -5, 10, 5, 10, 3]
After ReLU: [0, 3, 0, 0, 10, 5, 10, 3]


So, let's combine this with our knowledge and implementation of dense layers to create a working layer with a ReLU activation function:

In [4]:
# Creating non-linear placeholder data as shown in chapter three 
X, y = spiral_data(samples=100, classes=3)

# Instantiating our first dense layer and ReLU as its activation function
Layer1 = DenseLayer(2, 3)
Activation1 = ReLU()

# One forward pass with our training data
Layer1.forward(X)
Activation1.forward(Layer1.output)

# Let's just view the first 5 rows of outputs...
print(Activation1.output[:5])

[[0.         0.         0.        ]
 [0.         0.00011395 0.        ]
 [0.         0.00031729 0.        ]
 [0.         0.00052666 0.        ]
 [0.         0.00071401 0.        ]]


## Section 4: The Softmax Activation Function

Most of the activation functions that we've seen before (such as the ReLU, Sigmoid, etc...) are typically used as hidden layer activation functions. The Softmax activation function is the first one that we've seen which is normalized. That means the outputs for each sample add up to 1, with each output being similar to a confidence score for the corresponding category. As a result, the Softmax is particularly useful as an activation function for the output layer of a classifier, where the outputs can be used to classify something.
The formula for a Softmax is:
$$
S(i, j) = \frac{e^{z(i, j)}}{\sum_{l=1}^{L} e^{z(i, l)}}
$$

Let's simplify the above down into a python function:

In [5]:
# Let's create some random inputs, using what we previously learned about np.random.randn(): 3 values drawn from a standard normal distribution (mean 0, variance 1)
layer_outputs = np.random.randn(3)

# Just making a variable e for easier reference 
e = math.e

# I'm going to create bits and pieces so we don't have one giant terrifying formula...
exp_vals = [(float(e ** x)) for x in layer_outputs] # just the values from the layer_outputs, but exponentiated
sum_exp = sum(exp_vals)
normalized_vals = [(x/sum_exp) for x in exp_vals]

print(normalized_vals)
print(sum(normalized_vals))

[0.10431084574758263, 0.10301930069728654, 0.7926698535551309]
1.0


Great, now that was doing it the hard way; we can do this using NumPy as well to streamline our code, as shown below:

In [6]:
# Again creating a random input array of size 3, drawn from a standard normal distribution
layer_outputs = np.random.randn(3)

#exponentiating the values again
exp_vals = np.exp(layer_outputs)
#normalizing the values
normalized_vals = exp_vals / np.sum(exp_vals) 

print(normalized_vals)
print(sum(normalized_vals))

[0.18272969 0.18751894 0.6297514 ]
1.0


The next thing to do here is to modify our softmax function so that it accepts layer outputs in batches, which will be done below: 

In [7]:
layer_outputs = np.random.randn(3, 3)

exp_vals = np.exp(layer_outputs)
probabilities = exp_vals/np.sum(exp_vals, axis=1, keepdims=True)

print(probabilities)

[[0.21804984 0.32404155 0.45790863]
 [0.391328   0.34138915 0.2672828 ]
 [0.40432692 0.38627318 0.20939988]]


So, I'm sure you're asking, just as I was about 20 minutes before I wrote this, what do the "axis" and "keepdims" arguments mean? I'll break the two down.

The axis argument tells np.sum on which dimension to sum. The examples I'll run through are "axis=None," "axis=0," and "axis=1." For each example, before I show you using python what that means, I'll (try to) explain it in English. 

The "axis=None" argument tells np.sum to sum all elements. So it'll go across the whole array and return a single number, which we can easily check against the built-in sum function.
To build some intuition here, let's take a 3 x 1 array called Array. Let's say Array = [[a], [b], [c]], then doing np.sum(Array, axis=None) will just return (a+b+c).

In [8]:
layer_outputs = np.random.randn(3, 3)

print(f"Sum of all elements with a normal sum {sum(sum(layer_outputs))}")
print(f"Sum with an np.sum: {np.sum(layer_outputs, axis=None)}")

Sum of all elements with a normal sum -0.5614225268363953
Sum with an np.sum: -0.5614225268363953


Next, let's take a look at the "axis=0" argument, which sums down the rows, giving one total per column. We can visualize this using an example 2 x 2 array called Array. Let Array = [[a, b], [c, d]], then doing np.sum(Array, axis=0) will return [a+c, b+d].

In [9]:
layer_outputs = np.random.randn(2, 2)

print(f"Plain layer_outputs: \n{layer_outputs}")
print(f"Sum of the columns with an np.sum: {np.sum(layer_outputs, axis=0)}")

Plain layer_outputs: 
[[ 0.7471883  -1.1889449 ]
 [ 0.77325296 -1.1838807 ]]
Sum of the columns with an np.sum: [ 1.5204413 -2.3728256]


Lastly, we can take a look at the "axis=1" argument, which sums across the columns, giving one total per row. Let's again build some intuition here, using an example of a 2 x 2 array called Array. Let's say that Array = [[a, b], [c, d]], then doing np.sum(Array, axis=1) will return [a+b, c+d].

In [10]:
layer_outputs = np.random.randn(2, 2)

print(f"Plain layer_outputs: \n{layer_outputs}")
print(f"Sum of all row elements: {[float(sum(layer_outputs[x])) for x in range(len(layer_outputs))]}")
print(f"Sum of the rows with an np.sum: {np.sum(layer_outputs, axis=1)}")

Plain layer_outputs: 
[[-2.6591723   0.60631955]
 [-1.7558906   0.45093447]]
Sum of all row elements: [-2.0528526306152344, -1.3049561977386475]
Sum of the rows with an np.sum: [-2.0528526 -1.3049562]


The last thing to talk about here is the "keepdims=True" argument, which basically just tells np.sum to keep the summed axis in the output as a dimension of size 1, so the result has the same number of dimensions as the input. I'll provide two examples below to make it entirely clear, just in case.

In [11]:
layer_outputs = np.random.randn(2, 2)

# keepdims=False is the default
print(f"Sum of the rows with keepdims=False: \n{np.sum(layer_outputs, axis=1, keepdims=False)}")
print(f"Sum of the rows with keepdims=True: \n{np.sum(layer_outputs, axis=1, keepdims=True)}")

Sum of the rows with keepdims=False: 
[0.97553986 0.6151235 ]
Sum of the rows with keepdims=True: 
[[0.97553986]
 [0.6151235 ]]
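
The reason keepdims matters for Softmax is broadcasting. Here is a small sketch with a deliberately non-square batch (2 samples, 3 values each); the numbers are made up, and the variable names are my own.

```python
import numpy as np

# Hypothetical batch: 2 samples, 3 values each (deliberately not square).
exp_vals = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])

row_sums_flat = np.sum(exp_vals, axis=1)                 # shape (2,)
row_sums_kept = np.sum(exp_vals, axis=1, keepdims=True)  # shape (2, 1)

# (2, 3) / (2, 1) broadcasts each row's sum across that same row.
normalized = exp_vals / row_sums_kept
print(normalized.sum(axis=1))

# (2, 3) / (2,) cannot be broadcast, so NumPy raises a ValueError.
try:
    exp_vals / row_sums_flat
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(f"Division without keepdims failed: {broadcast_failed}")
```

With a square batch the version without keepdims would not raise an error at all. It would quietly divide each column by a row's sum instead, which is a much nastier bug to track down.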


With all this knowledge, now we can create a Softmax class. I'll write one out here as an example, but there'll also be one imported from classes.py.

In [12]:
class SoftMaxExample:
    def forward(self, inputs):
        exp_vals = np.exp(inputs - np.max(inputs, axis=1, keepdims=True)) # I'll explain why we subtract the largest entry in just a second.
        probabilities = exp_vals/np.sum(exp_vals, axis=1, keepdims=True) # We've shown this before!
        self.output = probabilities # Just setting the output...

I have a little bit of explaining to do here. The book states there are two main problems that neural networks face: dead neurons and very large numbers ("exploding" values). We know what the first one does, but the exploding values problem happens when we have numbers too big and the computer experiences overflow (if you don't know what that is, look it up). Exponentiation makes this easy to trigger, because np.exp grows extremely fast.

I'll show you a little example here, and we'll have to think back to our limit days of calculus. If you think about the limit of the function y=e^x as x approaches -inf, you'll realize that the limit is 0. That is because what's really happening is that it's becoming 1/e^inf, where e^inf is just becoming inf, and therefore 1/inf is just 0. On the other hand, np.exp(0) just returns 1, because (anything nonzero)^0 is just 1. We can take advantage of these properties to prevent the computer from overflowing! We do so by subtracting the largest number in each row from every entry in that row, which leaves our entries bounded by -inf on the left and 0 on the right, and therefore makes every np.exp(x) land between 0 and 1. Best of all, it has no impact on our probability distribution: subtracting the max multiplies the numerator and the denominator by the same factor, e^(-max), which cancels out in the division. I hope that makes sense -- took me a bit to understand as well.
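
Here is a sketch comparing a naive Softmax with the max-subtracted version. The function names `softmax_naive` and `softmax_stable` and the input values are my own, chosen just to force an overflow.

```python
import numpy as np

def softmax_naive(z):
    exp_vals = np.exp(z)
    return exp_vals / np.sum(exp_vals, axis=1, keepdims=True)

def softmax_stable(z):
    # Subtract each row's max so the largest exponent is exactly 0.
    exp_vals = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_vals / np.sum(exp_vals, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])

# On small inputs, both versions agree.
print(np.allclose(softmax_naive(small), softmax_stable(small)))

# On large inputs, np.exp overflows to inf and inf/inf becomes nan.
# (errstate just silences the overflow warnings for this demo.)
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)
print(naive_large)

# The stable version still works, and matches the small case because
# the two inputs only differ by a constant shift.
print(softmax_stable(large))
```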

So now, let's see our SoftMax class in action!

In [13]:
softmax = SoftMax()
softmax.forward([[1,2,3]])
print(f"The probabilities are: {softmax.output}")

The probabilities are: [[0.09003057 0.24472847 0.66524096]]


Let's build on all of that to create an (almost) fully functioning neural network!

In [14]:
X, y = spiral_data(samples = 100, classes = 3)

#Let's initialize our neural net, with layers 1 and 2 being dense layers, but adding a ReLU on the outputs of L1 and a SoftMax on our output layer, L2. 
Layer1 = DenseLayer(2, 3)
Layer2 = DenseLayer(3, 3)
Activation1 = ReLU()
Activation2 = SoftMax()

Layer1.forward(X)
Activation1.forward(Layer1.output)
Layer2.forward(Activation1.output)
Activation2.forward(Layer2.output)

print(Activation2.output[:5])

[[0.33333334 0.33333334 0.33333334]
 [0.33333248 0.33333376 0.33333376]
 [0.33333203 0.3333339  0.33333406]
 [0.33333176 0.33333403 0.3333342 ]
 [0.3333314  0.33333418 0.3333344 ]]


This is working well, with a near 33% distribution across the classes. That comes as a result of the random initialization of the weights with np.random.randn, which draws from a normal distribution centered at zero: the layer outputs end up very close to zero (as the tiny ReLU outputs earlier showed), so the Softmax assigns nearly equal probability to every class. If this were an actual model, we could then apply an argmax, which would just return the index of the most likely classification.
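
For reference, here is what that argmax step could look like. The probabilities below are made-up Softmax outputs for three samples, not results from our network.

```python
import numpy as np

# Hypothetical Softmax outputs: 3 samples, 3 classes each
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

# axis=1 picks the index of the largest probability within each row
predictions = np.argmax(probs, axis=1)
print(predictions)
# [0 2 1]
```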

So clearly, we now have an almost functioning model - but it's still totally random. So, what we need to do now is learn about loss functions that can quantify to our model how wrong it is, so it can change its weights to become accurate!

### Anyways, that's it for this chapter! Thanks for following along with my annotations of *Neural Networks from Scratch* by Kinsley and Kukieła!