# Forward Propagation

To build a strong foundational understanding of how feedforward propagation
works, we'll go through a toy example of training a neural network where the input
to the neural network is (1, 1) and the corresponding (expected) output is 0. Here, we
are going to find the optimal weights of the neural network based on this single
input-output pair. Note, however, that in practice an artificial neural network (ANN)
is typically trained on thousands of data points.

The following figure shows the architecture of the neural network that we are going to use:

![img](./imgs/ANN6.png)

Every arrow in the preceding diagram carries exactly one adjustable float value (a weight).
There are 9 such floats to find (6 on the connections from the input layer to the hidden
layer and 3 on the connections from the hidden layer to the output layer), so that when
the input is (1, 1), the output is as close to 0 as possible. This is what we mean by
training the neural network. For simplicity, we have not introduced a bias value yet;
the underlying logic remains the same.

## Calculating the hidden layer unit values

We'll now assign weights to all of the connections. As a first step, we assign them
randomly; in general, neural networks are initialized with random weights before
training starts. Again, for simplicity, we leave out the bias value while introducing
feedforward propagation and backpropagation, but we will include it when
implementing both from scratch.

Let's start with initial weights drawn randomly between 0 and 1; note, however,
that the final weights after training are not restricted to any particular range.
A formal representation of the weights and values in the network is shown in the
left half of the following diagram, and the randomly initialized weights are shown
in the right half.

![img](./imgs/ANN7.png)
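
As a minimal sketch of this initialization, the snippet below draws the 9 weights uniformly from [0, 1) with NumPy. It assumes the 2-3-1 layout implied by the 6 + 3 weights; the variable names and the fixed seed are our own illustrative choices, so the values will not match those in the diagram.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)

# Assumed layout: 2 inputs -> 3 hidden units, one weight per connection (6 weights)
w_input_hidden = rng.random((2, 3))
# 3 hidden units -> 1 output (3 weights)
w_hidden_output = rng.random((3, 1))

n_weights = w_input_hidden.size + w_hidden_output.size
print(n_weights)  # 9
```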

The hidden layer's unit values before activation are obtained as follows:

![imgs](./imgs/ANN8.png)

The hidden layer's unit values (before activation) that are calculated here are also
shown in the following diagram:

![imgs](./imgs/ANN9.png)
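
The same computation can be sketched in NumPy. The weight values below are assumptions chosen for illustration and may differ from those in the diagram; each hidden unit's pre-activation value is the sum of the products of the inputs and the weights on its incoming connections.

```python
import numpy as np

x = np.array([1.0, 1.0])  # the single input (1, 1)

# Assumed input-to-hidden weights: one row per input, one column per hidden unit
w_input_hidden = np.array([[0.8, 0.4, 0.3],
                           [0.2, 0.9, 0.5]])

# Sum of products for each hidden unit (no bias term yet)
pre_hidden = np.dot(x, w_input_hidden)
print(pre_hidden)  # approximately [1.0, 1.3, 0.8]
```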

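The full forward pass, including the bias terms, the activation, and the output and
loss steps discussed in the following sections, can be written as follows:

```python
import numpy as np

def feed_forward(inputs, outputs, weights):
    # weights = [input-to-hidden weights, hidden biases,
    #            hidden-to-output weights, output biases]
    # Pre-activation hidden values: inputs times weights, plus bias
    pre_hidden = np.dot(inputs, weights[0]) + weights[1]
    # Pass through the sigmoid activation function
    hidden = 1 / (1 + np.exp(-pre_hidden))
    # Output layer: hidden values times weights, plus bias
    pred_out = np.dot(hidden, weights[2]) + weights[3]
    # Return the mean squared error between prediction and target
    mean_squared_error = np.mean(np.square(pred_out - outputs))
    return mean_squared_error
```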
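
A possible way to call this function on the toy example is sketched below. `feed_forward` is repeated so that the snippet runs on its own; the weight shapes assume a 2-3-1 network, and the random seed and variable names are our own illustrative choices.

```python
import numpy as np

def feed_forward(inputs, outputs, weights):
    pre_hidden = np.dot(inputs, weights[0]) + weights[1]
    hidden = 1 / (1 + np.exp(-pre_hidden))
    pred_out = np.dot(hidden, weights[2]) + weights[3]
    return np.mean(np.square(pred_out - outputs))

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
weights = [
    rng.random((2, 3)),  # input-to-hidden weights
    rng.random(3),       # hidden-layer biases
    rng.random((3, 1)),  # hidden-to-output weights
    rng.random(1),       # output bias
]

x = np.array([[1.0, 1.0]])  # one data point with two features
y = np.array([[0.0]])       # its expected output

loss = feed_forward(x, y, weights)
print(loss)  # a non-negative scalar; the exact value depends on the seed
```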

### Activation function examples

![imgs](./imgs/ann10.png)

```python
def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.where(x > 0, x, 0)

def linear(x):
    return x
```

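```python
def softmax(x):
    # Shift by the max for numerical stability; normalize along the last axis
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
```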
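
To see how these functions behave, the sketch below applies them to a few sample values, alongside the sigmoid used in `feed_forward`. The input values are arbitrary, and NumPy built-ins (`np.tanh`, `np.maximum`) stand in for the hand-written versions so that the snippet runs on its own.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

# Arbitrary sample pre-activation values (not taken from the figures)
z = np.array([-2.0, 0.0, 2.0])

sigmoid_out = sigmoid(z)      # values in (0, 1); exactly 0.5 at z = 0
tanh_out = np.tanh(z)         # values in (-1, 1); exactly 0 at z = 0
relu_out = np.maximum(z, 0)   # negative values are clipped to 0

# Softmax over the whole vector: non-negative values that sum to 1
exp_z = np.exp(z - z.max())
softmax_out = exp_z / exp_z.sum()
```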

## Calculating the output layer values

So far, we have calculated the final hidden layer values after applying the sigmoid
activation. Using the hidden layer values after activation, and the weight values
(which are randomly initialized in the first iteration), we will calculate the output
value for our network:

![imgs](./imgs/ANN11.png)

We compute the sum of the products of the hidden layer values and the weight values
to obtain the output value. As a reminder, we have excluded the bias terms that need
to be added at each unit (node) only to keep the working details of feedforward
propagation and backpropagation simple for now; we will include them when coding
up both:

![imgs](./imgs/ANN12.png)

Because we started with a random set of weights, the value of the output node is very
different from the target. In this case, the difference is 1.235 (remember, the target is
0).
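
The output calculation can be sketched in the same way. The pre-activation hidden values and the hidden-to-output weights below are assumed for illustration, so the result need not match the figure's value exactly.

```python
import numpy as np

# Assumed pre-activation hidden values and hidden-to-output weights
pre_hidden = np.array([1.0, 1.3, 0.8])
w_hidden_output = np.array([0.3, 0.5, 0.9])

# Sigmoid activation of the hidden layer
hidden = 1 / (1 + np.exp(-pre_hidden))     # approximately [0.731, 0.786, 0.690]

# Sum of products of hidden values and weights (no bias term yet)
pred_out = np.dot(hidden, w_hidden_output)  # approximately 1.233

target = 0.0
difference = pred_out - target              # equals pred_out because the target is 0
```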

## Loss Function

### Calculating loss during continuous variable prediction

<b>Mean squared error:</b> The mean squared error is the average of the squared differences between the actual and the predicted values of the output. We square the error because it can be positive or negative (the predicted value can be greater or smaller than the actual value), and squaring ensures that positive and negative errors do not offset each other. We take the mean of the squared errors so that the error over two datasets of different sizes is comparable.

The mean squared error is typically used when trying to predict a value that
is continuous in nature.

```python
def mse(p, y):
    return np.mean((p - y) ** 2)
```
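
As a quick check of these properties, the sketch below uses made-up predictions whose errors are +1 and -1: their plain average cancels to 0, while the mean squared error does not. It also evaluates the toy example's output of 1.235 against its target of 0. `mse` is repeated so that the snippet runs on its own.

```python
import numpy as np

def mse(p, y):
    return np.mean((p - y) ** 2)

p = np.array([2.0, -1.0])  # made-up predictions
y = np.array([1.0, 0.0])   # made-up actual values; the errors are +1 and -1

print(np.mean(p - y))      # 0.0: positive and negative errors cancel out
print(mse(p, y))           # 1.0: squared errors cannot cancel

# Single-pair loss for the toy example (output 1.235, target 0)
toy_loss = mse(np.array([1.235]), np.array([0.0]))  # approximately 1.525
```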

<b>Mean absolute error:</b> The mean absolute error works in a manner that is
very similar to the mean squared error. The mean absolute error ensures
that positive and negative errors do not offset each other by taking an
average of the absolute difference between the actual and predicted values
across all data points.

Like the mean squared error, the mean absolute error is generally employed on
continuous variables. Further, it is generally preferable to use the mean absolute
error as the loss function when the outputs to predict have a magnitude less than 1.
In that case the errors are typically also smaller than 1 in magnitude, and squaring
them shrinks the loss considerably (the square of a number between -1 and 1 is
smaller in magnitude than the number itself).

```python
def mae(p, y):
    return np.mean(np.abs(p - y))
```
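
The effect of squaring small errors can be seen in a short sketch with made-up values. Both loss functions are repeated so that the snippet runs on its own.

```python
import numpy as np

def mse(p, y):
    return np.mean((p - y) ** 2)

def mae(p, y):
    return np.mean(np.abs(p - y))

y = np.array([0.5, 0.5])  # made-up actual values
p = np.array([0.6, 0.3])  # made-up predictions; the errors are +0.1 and -0.2

mae_loss = mae(p, y)      # approximately 0.15
mse_loss = mse(p, y)      # approximately 0.025, much smaller than the MAE
```

Here the mean squared error is six times smaller than the mean absolute error, even though both are computed from the same two errors.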

### Calculating loss during categorical variable prediction

<b>Binary cross-entropy: </b>Cross-entropy is a measure of the difference between
two different distributions: actual and predicted. Binary cross-entropy is
applied to binary output data, unlike the previous two loss functions that
we discussed (which are applied during continuous variable prediction).

Note that binary cross-entropy loss has a high value when the predicted
value is far away from the actual value and a low value when the predicted
and actual values are close.

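```python
def binary_cross_entropy(p, y):
    # Clip predictions away from 0 and 1 so that np.log never receives 0
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```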
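
A small sketch with made-up predictions illustrates this behavior. The function is defined here in its plain form so that the snippet runs on its own.

```python
import numpy as np

def binary_cross_entropy(p, y):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0])  # the actual class is 1

# Prediction close to the actual value: low loss
close = binary_cross_entropy(np.array([0.9]), y)  # -log(0.9), about 0.105
# Prediction far from the actual value: high loss
far = binary_cross_entropy(np.array([0.1]), y)    # -log(0.1), about 2.303
```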

<b>Categorical cross-entropy:</b> Categorical cross-entropy between an array of
predicted probabilities ( p , with one row per data point and one column per class)
and an array of actual class labels ( y , given as integer class indices) is
implemented as follows:

```python
def categorical_cross_entropy(p, y):
    return -np.mean(np.log(p[np.arange(len(y)), y]))
```
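
The indexing `p[np.arange(len(y)), y]` picks, for each data point, the probability predicted for its actual class. The sketch below uses made-up probabilities to show this, and checks the result against the equivalent form with one-hot encoded labels; the function is repeated so that the snippet runs on its own.

```python
import numpy as np

def categorical_cross_entropy(p, y):
    return -np.mean(np.log(p[np.arange(len(y)), y]))

# Made-up predicted probabilities: one row per data point, one column per class
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
y = np.array([0, 1, 2])  # integer class index of each data point

# Mean of -log(0.7), -log(0.8), and -log(0.6)
loss = categorical_cross_entropy(p, y)

# Equivalent form with one-hot encoded labels
one_hot = np.eye(p.shape[1])[y]
loss_one_hot = -np.mean(np.sum(one_hot * np.log(p), axis=1))
```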