<h1 style="color:white;background-color:rgb(255, 108, 0);padding-top:1em;padding-bottom:0.7em;padding-left:1em;">2.1 Artificial Neuron Structure</h1>
<hr>

<h2>Introduction</h2>

In this lesson we will learn how the basic unit of a neural network, the single neuron, works.
<br>
First we discuss the theory behind the artificial neuron, then we implement code examples
<br>
with the help of the NumPy module.

First of all, let's import NumPy, along with Matplotlib for plotting:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

<h2>Artificial neuron</h2>

The functioning of an artificial neuron can be best understood through its analogy with the biological neuron.

<center>
<img src="https://cdn-images-1.medium.com/max/1600/0*v4f4-nMoRMNrtUZG.png" width="80%"/>
</center>

Biological neurons receive impulses from other neurons through their dendrites. If the combined intensity of these impulses
<br>
exceeds a given level, the neuron emits a signal of its own on its axon. The impulses from other neurons can be either
<br>
excitatory or inhibitory, so the neuron can weigh different inputs with different significance when producing
<br>
its own output. In the artificial neuron model, the inputs of the model $(x_i)$ represent the intensity of impulses
<br>
coming from the other neurons. The excitation/inhibition is modelled by assigning weights $(w_i)$ to the inputs.
<br>
The cell body accumulates the input impulses, which is modelled with a summation in the
<br>
artificial neuron model. The property that the biological neuron only activates when the intensity of the inputs
<br>
is greater than a certain level is modelled with a nonlinear function, called the activation function, in the artificial
<br>
neuron. The activation function serves as a threshold on the output. The final output of the neuron can be
<br>
fed to other neurons through the axon terminals.

So an artificial neuron does the following:

 - Take an input vector $\mathbf{x}$ and a weight vector $\mathbf{w}$
 - Compute the sum of the weighted inputs $net$ as the dot product of the vectors $\mathbf{x}$ and $\mathbf{w}$:
 
 $$net = \sum_{i=0}^n w_ix_i$$
 
 - Apply the activation function $f()$ on $net$ to compute the final result $y$:
 
 $$y = f(net)$$
 
Here $n$ is the number of inputs. Notice that there is an extra input (indexed with $0$). This input does not count
<br>
toward the number of inputs $(n)$ and its value is always $1$. The associated weight $w_0$ is often referred to as the bias
<br>
and is denoted by $b$ instead of $w_0$.

So, in compact form, with the bias written separately, the output of a single artificial neuron can be computed as:

$$y=f\left(b+\sum_{i=1}^n w_ix_i\right)$$
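
As a minimal sketch of the weighted sum, the snippet below computes $net$ in two equivalent ways: with the bias added separately, and with the bias folded in as the weight of a constant input $x_0 = 1$. The input values, weights and bias are made up for illustration only.

```python
import numpy as np

# Hypothetical example values (not taken from the lesson)
x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_n
w = np.array([0.8, 0.2, -0.5])   # weights w_1..w_n
b = 0.1                          # bias (plays the role of w_0)

# Form 1: bias added separately to the weighted sum
net_separate = b + np.dot(w, x)

# Form 2: bias folded in as the weight of a constant input x_0 = 1
x_ext = np.concatenate(([1.0], x))
w_ext = np.concatenate(([b], w))
net_folded = np.dot(w_ext, x_ext)

print(net_separate, net_folded)  # both are approximately -0.7
```

Both forms give the same $net$; which one to use is a matter of convenience in the implementation.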

The only thing left to discuss is the activation function $f()$.
<br>
The purpose of the activation function is to threshold the weighted sum of the inputs and, in most cases, to squash the
<br>
output into a bounded range, so that it does not become too large or too small, which would make further computations difficult.

The first idea that comes to mind is a simple step function (with a hard threshold) like:

$$
f(\mathrm{x})=    
\begin{cases}
      1 & \text{if $\mathrm{x} >$ threshold}\\
      0 & \text{otherwise}
\end{cases}
$$
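
A hard-threshold step function like the one above could be sketched with NumPy as follows. The function name `step` and the default threshold of `0.0` are illustrative choices, not something the lesson prescribes.

```python
import numpy as np

def step(x, threshold=0.0):
    """Hard-threshold activation: 1 where x > threshold, 0 otherwise."""
    return np.where(np.asarray(x) > threshold, 1, 0)

values = [-2.0, 0.0, 0.5, 3.0]
print(step(values))        # [0 0 1 1]
print(step(values, 1.0))   # [0 0 0 1]
```

Note that an input exactly equal to the threshold maps to $0$, matching the strict inequality in the definition.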

But it turns out that other activation functions work better for more complex problems.
<br>
The most popular activation functions are the sigmoid, the hyperbolic tangent, the ReLU, the Leaky ReLU and the softmax
<br>
activations. These can be computed as follows:

Sigmoid:

$$\sigma (\mathrm{x})=\frac{1}{1+e^{-\mathrm{x}}}$$

Hyperbolic tangent:

$$\tanh{(\mathrm{x})}=\frac{\sinh{(\mathrm{x})}}{\cosh{(\mathrm{x})}}=\frac{e^\mathrm{x}-e^{-\mathrm{x}}}{e^\mathrm{x}+e^{-\mathrm{x}}}$$

ReLU:

$$R(\mathrm{x}) = \max{(0,\mathrm{x})}$$


Leaky ReLU:

$$LR(\mathrm{x}) = \max{(\alpha \mathrm{x},\mathrm{x})}, \text{ where $\alpha = 0.01$}$$

The softmax function is the odd one out among these. We will discuss it a bit later. First we have to see what these
<br>
activation functions look like and what they are good for.

Let's plot the activation functions mentioned above:

In [None]:
#The inputs for the functions
x = np.arange(-4.0, 4.0, 0.01)

#Activation functions applied to the inputs
th = np.copy(x)
th[th>0] = 1
th[th<0] = 0
sigm = 1/ (1 + np.exp(-x))
tanh = np.tanh(x)
ReLU = np.clip(x, 0, None)
LR = np.copy(x)
LR[LR<0]=0.01*LR[LR<0]

#Plot the outputs of the activation functions
plt.figure(1)
plt.plot(x, x, label='Linear')
plt.plot(x, th, label='Step')
plt.plot(x, sigm, label='Sigmoid')
plt.plot(x, tanh, label='Hyperbolic tangent')
plt.title('Activation functions1', fontsize=20)
plt.xlabel('x', fontsize=10)
plt.ylabel('f(x)', fontsize=10)
plt.legend()
plt.xlim(-4,4)
plt.ylim(-1.2,1.2)

plt.figure(2)
plt.plot(x, ReLU, label='ReLU')
plt.plot(x, LR, label='Leaky ReLU')
plt.title('Activation functions2', fontsize=20)
plt.xlabel('x', fontsize=10)
plt.ylabel('f(x)', fontsize=10)
plt.legend()
plt.xlim(-4,4)
plt.ylim(-0.1,1.2)

plt.show()

From the output, the properties of the different activation functions can be seen.

The linear activation function is used if the output should not be bounded and the prediction of
<br>
continuous values is required, like in regression problems.

The sigmoid function will output values between $0$ and $1$ $(0<\sigma (\mathrm{x})<1)$.
<br>
The output of this activation function can be interpreted as a probability, so it is widely used in
<br>
classification problems, where the output means the probability of belonging to a class.

The hyperbolic tangent activation function takes values between $-1$ and $1$ $(-1<\tanh{(\mathrm{x})}<1)$.
<br>
This activation function is widely used in binary classification problems, where the output should tell
<br>
whether the sample belongs to one class or the other.

The ReLU activation function mitigates the saturation problem of the sigmoid and hyperbolic tangent functions: for positive inputs its slope is constant, so the output does not flatten out.

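The Leaky ReLU solves the dying gradient problem of the ReLU activation function: for negative inputs the gradient of the ReLU is exactly zero, whereas the Leaky ReLU keeps a small slope $\alpha$. (A zero gradient is also the reason
<br>
why we don't use the step function for activation.)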
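
To make saturation and the dying gradient concrete, the sketch below evaluates the derivatives of the four activations at a few sample points. The helper names are illustrative, and the derivative at exactly $0$ for the ReLU variants is set by convention to the left-hand slope here.

```python
import numpy as np

def d_sigmoid(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x))
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def d_tanh(x):
    # Derivative of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def d_relu(x):
    # Slope is 1 for positive inputs, 0 otherwise
    return np.where(x > 0, 1.0, 0.0)

def d_leaky_relu(x, alpha=0.01):
    # Slope is 1 for positive inputs, alpha otherwise
    return np.where(x > 0, 1.0, alpha)

x = np.array([-10.0, -1.0, 1.0, 10.0])
for name, grad in [('sigmoid', d_sigmoid), ('tanh', d_tanh),
                   ('ReLU', d_relu), ('Leaky ReLU', d_leaky_relu)]:
    print(name, grad(x))
```

The sigmoid and tanh gradients are close to zero at $x = \pm 10$ (saturation), the ReLU gradient is zero for all negative inputs, and the Leaky ReLU gradient never drops below $\alpha$.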


Now it is time to discuss the Softmax function.

Imagine that you would like to classify the inputs into three classes. Based on the activation functions introduced so far,
<br>
you would choose the sigmoid activation and use three neurons to calculate the outputs,
<br>
where each neuron represents a single class and its output is the probability that the given sample belongs
<br>
to that class. However, suppose your samples cannot belong to multiple classes at the same time, that is, the
<br>
class labels are mutually exclusive. If you apply sigmoid activation, then there is no guarantee that two or three
<br>
neurons will not activate for the same sample. This is where the softmax function is used.

The input of the softmax function is a **vector**, not a scalar, and it outputs a **probability distribution** instead of
<br>
single probabilities. This means that the output has the same dimension as the input vector, its values are between
<br>
zero and one, and the values of the output vector sum to one.

The softmax function can be computed as:

$$Softmax(\mathbf{x})_j = \frac{e^{x_j}}{\sum_{k=1}^Ke^{x_k}}$$

where $j = 1 \dots K$ and $\mathbf{x}=[x_1,\dots,x_K]$.
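
A possible NumPy sketch of the softmax function is shown below. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; it is an addition here, not part of the formula above. The score vector is a made-up example.

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D vector of scores."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x)      # same result, but avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical net values of three output neurons
scores = np.array([2.0, 1.0, 0.1])

p = softmax(scores)
print('softmax:', p, 'sum:', p.sum())
print('sigmoid:', sigmoid(scores), 'sum:', sigmoid(scores).sum())
```

The softmax outputs sum to one, so they can be read as a distribution over mutually exclusive classes, while the independent sigmoid outputs for the same scores sum to more than one.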

<p style="margin-top:2em;">Now that we know how the output of an artificial neuron can be computed, it is time to implement one in which we specify the weights:</p>

In [None]:
#Create functions implementing the activation functions
def sigmoid(x):
    return (1/ (1 + np.exp(-x)))

def tanh(x):
    return (np.tanh(x))

def relu(x):
    return (np.clip(x, 0, None))

#Define a function for calculating the output of a single artificial neuron expecting the inputs and weights
def neuron(inputs, weights, activation):
    '''Compute the output of a single artificial neuron
    
    Inputs:
        inputs (numpy.ndarray) - The inputs of the neuron
        weights (numpy.ndarray) - The weights of the neuron (the length of the weights array must be len(inputs)+1
                                    because the last element of weights is the bias)
        activation (function) - The activation function to use
        
    Returns:
        (float) - The output of the neuron
    '''
    
    #Append the constant input 1 that is multiplied by the bias (the last weight)
    inputs = np.append(inputs, 1)
    return (activation(weights.dot(inputs)))

#Compute the output of some neurons with fixed weights and arbitrary activations
#for four samples: [0,0], [0,1], [1,0] and [1,1]

samples = np.array([[0,0], [0,1], [1,0], [1,1]])

#Weights: [2.5, 2.8, -4.2], activation: sigmoid
weights = np.array([2.5, 2.8, -4.2])

for sample in samples:
    out = neuron(sample, weights, sigmoid)
    print('Input:', sample, 'output:', out, 'sample is member of this class? (1=true, 0=false)', int(out>0.5))
    
print('\n')

#Plot the neuron activations:
y, x = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
z=np.zeros((100,100))
for i in range(100):
    for j in range(100):
        z[(i,j)]=neuron(np.array([x[i,j],y[i,j]]), weights, sigmoid)
    
plt.figure(1)
z_min, z_max = z.min(), z.max()
plt.pcolormesh(x, y, z, cmap='RdBu', vmin=z_min, vmax=z_max)
plt.title('First neuron activations')
plt.colorbar()
    
#Weights: [2.5, 2.8, -1.8], activation: sigmoid
weights = np.array([2.5, 2.8, -1.8])

for sample in samples:
    out = neuron(sample, weights, sigmoid)
    print('Input:', sample, 'output:', out, 'sample is member of this class? (1=true, 0=false)', int(out>0.5))
    
#Plot the neuron activations:
z=np.zeros((100,100))
for i in range(100):
    for j in range(100):
        z[(i,j)]=neuron(np.array([x[i,j],y[i,j]]), weights, sigmoid)
    
plt.figure(2)
z_min, z_max = z.min(), z.max()
plt.pcolormesh(x, y, z, cmap='RdBu', vmin=z_min, vmax=z_max)
plt.title('Second neuron activations')
plt.colorbar()

plt.show()

Notice that single neurons with only two inputs are already able to implement logical functions: with a threshold of $0.5$, the first neuron above behaves like AND and the second like OR.

By applying several neurons in parallel to the same data we can compute many outputs at the same time. However, it is
<br>
important to notice that the decision boundary (the threshold level) belonging to a single neuron will always be linear.
<br>
So, functions that are not linearly separable, such as the exclusive or (XOR), cannot be represented with a single neuron.
<br>
The XOR function can, however, be computed as a combination of simpler linear functions like the ones we already implemented.
<br>
This is the idea from which neural networks, and thus deep learning, originate.
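
The linearity of the decision boundary can be checked numerically. The sketch below reuses the weights of the first example neuron and assumes a decision threshold of $0.5$ on the sigmoid output; since $\sigma(net) > 0.5$ exactly when $net > 0$, the neuron's decision should coincide with the side of the line $w_1x_1 + w_2x_2 + b = 0$ on which the sample lies. The random test points are generated only for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Weights of the first example neuron; the last element is the bias
weights = np.array([2.5, 2.8, -4.2])

# Random test points in the unit square (fixed seed for reproducibility)
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(1000, 2))

# Weighted sum for every point: w1*x1 + w2*x2 + b
net = points @ weights[:2] + weights[2]

fires = sigmoid(net) > 0.5   # decision taken from the neuron output
above_line = net > 0         # which side of the straight line the point is on

print(np.array_equal(fires, above_line))
```

Whatever the weights, this boundary is a straight line in the input plane, which is why a single neuron cannot separate the XOR samples.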

<h2>Exercise 2.1</h2>
Based on the previously implemented artificial neurons, create a network of these neurons to compute the XOR function
<br>
of two logical variables. For this you need to:

 - Express the XOR function as a combination of linear logical functions
 - Implement the missing functions with artificial neurons
 - Combine the neurons into a network that computes the XOR function

See the solution here: [Exercise 2.1 solution](Excersise_2_1.ipynb)

Continue: [2.2 Training Process and Gradient Descent](Training_Gradient_Descent.ipynb)