# How to Build a Feed-Forward Neural Network

This notebook outlines two different neural network simulations. 

In the first simulation we will train a **2-layer feed-forward network** on different logic gates (AND/OR/XOR). We will train the network using the **delta-rule** and show that it can learn both the *AND* rule and the *OR* rule. Finally, we will investigate the limitations of this network by training it on the *XOR* rule.

In the second simulation we will train a **3-layer neural network** with a set of non-linear hidden units. We will use **backpropagation** to train the network and show that it is able to learn the XOR rule. Finally, we will attempt to inspect the network's learned weights in order to understand its learned solution.


## Import Statements

We need NumPy in order to do math conveniently in Python.

In [10]:
import numpy as np

## Training Environment

Before we build the network, we define what the network will be tasked with. All tasks provided to the network are expressed as patterns in the training environment. There are two types of patterns. **Input patterns** are provided to the input layer of the network. **Training patterns** represent the 'correct' response to the input patterns.

We will define 3 different training environments that are based on the *boolean functions* AND, OR and XOR. The input patterns are the same in all three environments. Each input pattern consists of 2 bits that are passed to the network. That is, the network takes as input two values that can be either 0 or 1. Each of the three boolean functions (AND/OR/XOR) defines how these two inputs are combined into a single binary output value.

Boolean functions can be expressed in a *truth table*. Every row in a truth table represents a single input pattern with its corresponding output pattern. Below you will find the truth table for each of the boolean functions that we use for training.


<br>
<center>**AND**</center>

| Input 1 | Input 2 | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 0      |
| 1       | 0       | 0      |
| 1       | 1       | 1      |

<br>
<center>**OR**</center>

| Input 1 | Input 2 | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 1      |
| 1       | 0       | 1      |
| 1       | 1       | 1      |


<br>
<center>**XOR**</center>

| Input 1 | Input 2 | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 1      |
| 1       | 0       | 1      |
| 1       | 1       | 0      |



In [11]:
# define input patterns (same across task environments)
input_patterns = np.array([[ 0.,  0.],
                           [ 0.,  1.],
                           [ 1.,  0.],
                           [ 1.,  1.]])

# define output patterns for AND rule
output_patterns_AND = np.array([[ 0.],
                                [ 0.],
                                [ 0.],
                                [ 1.]])

# define output patterns for OR rule
output_patterns_OR = np.array([[ 0.],
                               [ 1.],
                               [ 1.],
                               [ 1.]])

# define output patterns for XOR rule
output_patterns_XOR = np.array([[ 0.],
                                [ 1.],
                                [ 1.],
                                [ 0.]])
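
As a quick sanity check on the truth tables, the three output columns can also be derived from NumPy's element-wise logical operators instead of being typed by hand. This is only a sketch; the names `inputs` and `targets` are introduced here for illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# the four 2-bit input patterns, in the same row order as the truth tables
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=bool)

# derive each output column from NumPy's element-wise logical operators
targets = {
    'AND': np.logical_and(inputs[:, 0], inputs[:, 1]),
    'OR':  np.logical_or(inputs[:, 0], inputs[:, 1]),
    'XOR': np.logical_xor(inputs[:, 0], inputs[:, 1]),
}

for name, column in targets.items():
    # convert the boolean column to floats, matching the dtype of the pattern arrays
    print(name, column.astype(float))
```

Each printed column should match the `Output` column of the corresponding truth table.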

## Two-Layer Neural Network & Delta-Rule

### Network Training

Let's build the function of the actual network. Our desired network looks like this:

<img src="2LNetwork.png" alt="2 Layer Network" style="width: 250px;"/>

The input layer of the network encompasses two units. These two input units project to the unit of the output layer. Your task is to fill in the code in the function `train2LayerNetwork` that is used to train the network. The function will perform the following computations (steps marked in *italic* are the focus of this exercise):

1) Initialization: This step involves initializing layers and weights.

*2) Feedforward Pass*: In this step we will compute the network's activity based on its input pattern and its weights.

*3) Backward Pass*: In this step we will adjust the weights of the network based on the output patterns produced by the network and the feedback provided by the training patterns.

Steps 2) and 3) will be performed for each input pattern in each training iteration. The following sections discuss each step in detail.

#### Forward Pass ####

In the forward pass we will propagate activity through the network, layer by layer. In feedforward networks, we already know the activity of the input layer as it corresponds to the input pattern. We will therefore begin by computing the activity of the units in the second layer. Let's say that the second layer has $N$ units. Let $y_j$ be the activation of a unit in the second layer that we want to compute, where $j \in \{1,...,N\}$. The activation of a unit is a function of its net input. We are usually interested in differentiable, non-linear functions, such as the sigmoidal activation function:

\begin{equation}
y_j = \frac{1}{1+e^{-net_{y_j}}}
\end{equation}

This function makes sure that the activation of a unit is bounded between 0 and 1. However, in order to compute the activity $y_j$ we need to know the net input of the unit. The net input of unit $y_j$ is simply the sum of the activity of the sending units in the previous layer, weighted by their projection weights. Let's say that the sending input layer has $M$ units. Then the net input $net_{y_j}$ of a unit in the receiving layer corresponds to

\begin{equation}
net_{y_j} = \sum_{i=1}^M x_i w_{j,i} 
\end{equation}


where $x_i$ corresponds to the activity of input unit $i$ and $w_{j,i}$ corresponds to the weight of input unit $i$ to unit $j$ in the second layer. 

In a network with more than two layers, one would then proceed with computing the activation of the units in the third layer. This is done the same way as with the second layer: The activity of each unit in the third layer is some activation function of its net input. The net input is the weighted sum of the activities of the second layer where the weights correspond to the projection weights from the second to the third layer. In the feedforward pass one can apply this procedure layer by layer until the activity of the final (output) layer is computed. 
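
The two equations above can be sketched in a few lines of NumPy. The helper names `sigmoid` and `forward_layer`, as well as the example weights, are assumptions made for this illustration and do not appear in the exercise code.

```python
import numpy as np

def sigmoid(net):
    """Logistic activation: squashes any net input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-net))

def forward_layer(x, W):
    """Activation of a receiving layer with N units.

    x: activities of the M sending units, shape (M,)
    W: weights, shape (N, M); W[j, i] is the weight from sending unit i to receiving unit j
    """
    net = W @ x           # net_j = sum_i x_i * w_{j,i}
    return sigmoid(net)

# hypothetical example: 3 sending units projecting to 2 receiving units
x_example = np.array([1.0, 0.0, 1.0])
W_example = np.array([[0.5, -1.0, 0.5],
                      [0.0,  2.0, 0.0]])
y_example = forward_layer(x_example, W_example)
print(y_example)
```

For a deeper network, the same function could be applied repeatedly, feeding the output of one call in as the `x` of the next.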

Now that we know the general rule to compute the activity of a neural network we can apply this to our example above. Say we want to compute the network's activity for the input pattern $[1,0]$. Since this is a very simple network, all we need to do is compute the activity of the unit $y_1$ in the second (output) layer. Let's assume that the network has the weights

$w_{1,1} = -0.5$ 

$w_{1,2} = 2$.

Since the input pattern is $[1,0]$, the input layer units take on the following values:

$x_1 = 1$

$x_2 = 0$.

Now we can compute the net input of $y_1$:

$net_{y_1} = w_{1,1} x_1 + w_{1,2} x_2 = -0.5 \cdot 1 + 2 \cdot 0 = -0.5$

Finally we can compute the activation of the unit for the input $[1,0]$:

$y_1 = \frac{1}{1+e^{-net_{y_1}}} = \frac{1}{1+e^{-(-0.5)}} = 0.3775406688$

What does this output mean? It means that the network implements a function that maps the input pattern $[1,0]$ to the value $0.3775406688$. What if we wanted the network to implement a different function like the OR-rule? In this case we would prefer the network to produce an output closer to $1$. We could do this by changing the weights of the network in a smart way. What if we don't want to think about which weights to choose? What if we could make the network *learn* the right weights itself? We will discuss how this can be done for our two-layer network in the next section.
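
The worked example can be reproduced numerically. The sketch below simply plugs in the values $x_1 = 1$, $x_2 = 0$, $w_{1,1} = -0.5$ and $w_{1,2} = 2$ from above; the variable names are chosen for this illustration only.

```python
import numpy as np

w = np.array([-0.5, 2.0])   # w_{1,1}, w_{1,2}
x = np.array([1.0, 0.0])    # x_1, x_2

net_y1 = np.dot(w, x)                 # -0.5 * 1 + 2 * 0
y1 = 1.0 / (1.0 + np.exp(-net_y1))    # sigmoidal activation

print(net_y1)  # -0.5
print(y1)      # approximately 0.3775406688
```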


#### Backward Pass ####

Our goal is to teach the network a particular function (e.g. the OR-rule). That is, we want it to produce the correct output (e.g. $[1]$) for a given input pattern (e.g. $[1,0]$). Let's assume that we already have a network with some weights, like the ones from the forward pass. We know from the forward pass in the previous section that its output for the input pattern $[1,0]$ (about $0.38$) isn't very close to the output we want, $[1]$. So let's teach the network by providing it some feedback about how well it did. We will do this by computing the error $E_{y_j}$ of output unit $j$. One way to compute the error is by taking the squared difference between the output of the network and the correct training pattern for a given output unit $j$:

\begin{equation}
E_{y_j} = 0.5(y_j - t_j)^2
\end{equation}

where $t_j$ is the correct training output for unit $j$. Note that the squared error is scaled by 0.5. This factor cancels the exponent 2 when we take the derivative below, which keeps the resulting expressions simple.

We can then use the error as a feedback signal to adjust the network's weights. We know that the error is a function of the weights of the network, e.g. a function of $w_{1,1}$. Let's assume that this <font color="#C00000">error function of $w_{1,1}$ </font> looks like the solid red line in the following plot:

<img src="ErrorSurface.png" alt="Error Surface" style="width: 400px;"/>

where <font color="#000000"> $w_{1,1}^t$ </font> is the current weight of the network  at time step $t$ (e.g. -0.5) and <font color="#2F5597">$w_{1,1}^*$</font> is the <font color="#2F5597">optimal weight</font> of the network that provides the minimum error. In order to minimize the error we want to change our current weight <font color="#000000"> $w_{1,1}^t$ </font> so that it becomes <font color="#2F5597">$w_{1,1}^*$</font>. 

Perhaps we can compute <font color="#2F5597">$w_{1,1}^*$</font> directly by finding the minimum of the <font color="#C00000">error function</font>? Well, the problem is that we don't know the error function. On top of that, the error also depends on the other weight $w_{1,2}^t$. However, what if we had knowledge about the <font color="#7030A0">derivative of the error with respect to $w_{1,1}^t$ given the current weights, i.e. $\frac{\partial E_{y_1}^t}{\partial w_{1,1}^t}$</font>? Then we could use the sign of the derivative as an indicator of the direction in which we have to change the weight $w_{1,1}^t$. In the example above, the error function has a negative slope at $w_{1,1}^t$. This means that, in order to minimize the error, we have to increase the weight $w_{1,1}^t$ by some small amount <font color="#00B050">$\Delta w_{1,1}^t$</font>. That is, our <font color="#C55A11">weight for the next time step $w_{1,1}^{t+1}$</font> is computed as

\begin{equation}
w_{1,1}^{t+1} = \underbrace{w_{1,1}^t}_\text{current weight} + \underbrace{\Delta w_{1,1}^t}_\text{weight change}
\end{equation}

where 

\begin{equation}
\Delta w_{1,1}^t = - \alpha \underbrace{\frac{\partial E_{y_1}^t}{\partial w_{1,1}^t}}_\text{slope of the error function}
\end{equation}

where $\alpha$ is a constant that defines the step size of the weight change (how far we move the weight against the slope of the error function). We call $\alpha$ the *learning rate*.

We still need to figure out how to compute <font color="#7030A0"> $\frac{\partial E_{y_1}^t}{\partial w_{1,1}^t}$ </font>, that is, the partial derivative of the error with respect to $w_{1,1}^t$. We know that the error of unit $y_1$ is a function of its output activity $y_1$, the unit's activity $y_1$ is a function of its net input $net_{y_1}$, and its net input is a function of the current weight $w_{1,1}^t$. Knowing this, we can apply the chain rule:

\begin{equation}
\frac{\partial E_{y_1}^t}{\partial w_{1,1}^t} = \underbrace{\frac{\partial E_{y_1}^t}{\partial y_1}}_\text{derivative 1} \quad
\underbrace{\frac{\partial y_1}{\partial net_{y_1}}}_\text{derivative 2} \quad
\underbrace{\frac{\partial net_{y_1}}{\partial w_{1,1}^t}}_\text{derivative 3}
\end{equation}

The first partial derivative $\frac{\partial E_{y_1}^t}{\partial y_1}$ is easy to compute:

\begin{equation}
\frac{\partial E_{y_1}^t}{\partial y_1} =  \frac{\partial 0.5(y_1 - t_1)^2}{\partial y_1} = (y_1 - t_1)
\end{equation}

The second partial derivative $\frac{\partial y_1}{\partial net_{y_1}}$ is a bit more complicated since we are dealing with a sigmoidal activation function $y_1 = 1/(1+e^{-net_{y_1}})$. However, it turns out that its derivative can be computed simply as a function of $y_1$:

\begin{equation}
\frac{\partial y_1}{\partial net_{y_1}} = y_1 (1 - y_1)
\end{equation}
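
For readers who want to see where this identity comes from, here is a short derivation. The symbol $s$ is introduced here only as shorthand for $net_{y_1}$:

\begin{equation}
\frac{\partial y_1}{\partial s} = \frac{\partial}{\partial s} \left(1+e^{-s}\right)^{-1} = \frac{e^{-s}}{\left(1+e^{-s}\right)^2} = \underbrace{\frac{1}{1+e^{-s}}}_{y_1} \cdot \underbrace{\frac{e^{-s}}{1+e^{-s}}}_{1 - y_1} = y_1 (1 - y_1)
\end{equation}

The last step uses $1 - y_1 = 1 - \frac{1}{1+e^{-s}} = \frac{e^{-s}}{1+e^{-s}}$.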

Finally, we compute the third partial derivative $\frac{\partial net_{y_1}}{\partial w_{1,1}^t}$ as 

\begin{equation}
\frac{\partial net_{y_1}}{\partial w_{1,1}^t} =  x_1
\end{equation}

Now let's put all pieces together in order to compute the final weight change for $w_{1,1}$:

\begin{equation}
\Delta w_{1,1}^t = - \alpha (y_1 - t_1) y_1 (1 - y_1) x_1
\end{equation}

A similar update rule can be applied for the other weight $w_{1,2}$.
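
To make the update rule concrete, here is a minimal sketch of a single weight update for the example weights $w_{1,1} = -0.5$, $w_{1,2} = 2$ with inputs $x_1 = 1$, $x_2 = 0$ and target $t_1 = 1$. The learning rate of $0.3$ and all variable names are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

alpha = 0.3                    # learning rate (assumed value for illustration)
w = np.array([-0.5, 2.0])      # current weights w_{1,1}, w_{1,2}
x = np.array([1.0, 0.0])       # input activities x_1, x_2
t = 1.0                        # training pattern for this input under the OR rule

y = sigmoid(np.dot(w, x))
error_before = 0.5 * (y - t) ** 2

# chain rule: dE/dw_i = (y - t) * y * (1 - y) * x_i
grad = (y - t) * y * (1.0 - y) * x
w_new = w - alpha * grad

y_new = sigmoid(np.dot(w_new, x))
error_after = 0.5 * (y_new - t) ** 2

print(w_new)                      # w_{1,1} increases; w_{1,2} is unchanged because x_2 = 0
print(error_before, error_after)  # the error is smaller after the update
```

Note that a weight only changes when its sending unit is active: since $x_2 = 0$, the update for $w_{1,2}$ is zero for this pattern.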

#### Additional Remarks ####
Note that if we use a linear activation function, e.g. $y_j = m net_{y_j} + n$, then the weight change reduces to 

\begin{equation}
\Delta w_{j,i}^t = - \alpha (y_j - t_j) x_i
\end{equation}

where the constant $m$ gets absorbed in the learning rate. This weight update rule is termed the delta rule.
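
The following sketch contrasts the two update rules for one input pattern. It assumes a linear unit with $m = 1$ and $n = 0$, a learning rate of $0.3$, and the example weights used earlier; these values are illustrative assumptions rather than part of the exercise.

```python
import numpy as np

alpha = 0.3                    # learning rate (assumed value for illustration)
w = np.array([-0.5, 2.0])      # weights w_{1,1}, w_{1,2}
x = np.array([1.0, 0.0])       # input activities x_1, x_2
t = 1.0                        # training pattern

# linear unit (m = 1, n = 0): the delta rule
y_linear = np.dot(w, x)
delta_w_linear = -alpha * (y_linear - t) * x

# sigmoidal unit: same rule with the extra factor y * (1 - y)
y_sigmoid = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
delta_w_sigmoid = -alpha * (y_sigmoid - t) * y_sigmoid * (1.0 - y_sigmoid) * x

print(delta_w_linear)
print(delta_w_sigmoid)
```

Both updates point in the same direction here; the sigmoidal version is scaled down by the factor $y_1 (1 - y_1)$, which is at most $0.25$.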

### Exercise 1.1

The function `train2LayerNetwork` initializes a simple 2-layer neural network with 2 input units and 1 output unit (like in the example above). Your task is to fill in the missing code for the computation of its output (forward pass), as well as the missing code for network training (backward pass). The missing code is marked with '`...`'.

Once you have completed the code, run the simulation below. For each learning epoch, the simulation will output the mean squared error across all training patterns. Try to train the network on the two different training patterns (AND, OR). Why do you think the network does not learn the task well? You may set the `debug` argument to `True` for a more detailed output of the weight adjustments for each pattern.


In [12]:
def train2LayerNetwork(input_patterns, output_patterns, learning_rate, MSE_threshold, num_epochs, debug=False):

    ### network initialization ###

    # let's define the number of input and output units as a function of the dimension of the input and output patterns respectively
    NInputUnits = input_patterns.shape[1]
    NOutputUnits = output_patterns.shape[1]

    # let's also log the error of the network
    MSE_log = np.zeros((1, num_epochs))

    # we will also randomly initialize the weights between the input and output layer, 
    # as well as the bias weights to the output layer
    # the weight matrix will have as many rows as there are units in the output layer
    # and as many columns as there are units in the input layer
    # weights will be initialized with small random values, uniformly sampled between 0 and 0.1
    W_yx = np.random.uniform(0, 0.1,(NOutputUnits, NInputUnits))
    W_ybias = np.random.uniform(0, 0.1,(NOutputUnits, 1))
    
    ### network training ### 

    # the network will be trained in epochs.
    for epoch in range(num_epochs):

        # initialize mean squared error log for all patterns
        MSE_patterns = np.zeros((output_patterns.shape[0],))

        # within each training epoch, we will loop through every training pattern. 
        for pattern in range(input_patterns.shape[0]):

            # FORWARD PASS #

            # assign values to input layer
            x = input_patterns[pattern,]

            # compute net input of output layer
            y_net = np.dot(W_yx, x.T) + 1 * W_ybias

            # compute activation of output layer using sigmoidal activation function
            y = 1/(np.exp(- (y_net))+1)

            # ERROR BACKPROPAGATION #

            # compute the squared error of the output with respect to the correct training pattern
            # (np.sum collapses the 1x1 result to a scalar before it is stored)
            MSE_patterns[pattern] = np.sum((y-output_patterns[pattern,])**2)

            # compute derivative of the error with respect to the output unit activation
            dError_dAct = (y-output_patterns[pattern,])

            # compute derivative of output unit activation with respect to its net input. 
            # Note that the derivative of a sigmoidal function y(x) = 1/(1+exp(-x)) 
            # is dy = y(x) * (1-y(x))
            dAct_dNet = y * (1-y)

            # compute the derivative of the net input of the output layer with respect to its weights to the input layer
            dNet_dW_yx = x
            
            # compute the derivative of the net input of the output layer with respect to its weights to the bias unit
            dNet_dW_ybias = 1

            # compute the error gradient with respect to each weight (chain rule)
            delta_W_yx = dError_dAct * dAct_dNet * dNet_dW_yx
            delta_W_ybias = dError_dAct * dAct_dNet * dNet_dW_ybias

            # For debugging
            if debug:
                print('----------')
                print('pattern:')
                print(x)
                print('weights:')
                print(W_yx)
                print('output:')
                print(y)
                print('MSE:')
                print(MSE_patterns[pattern])
                print('-')
                print('output pattern:')
                print(output_patterns[pattern,])
                print('dError:')
                print(dError_dAct)
                print('dAct_dNet:')
                print(dAct_dNet)
                print('weight adjustment:')
                print(- learning_rate * delta_W_yx)

            # adjust weights based on learning rate
            W_yx = W_yx - learning_rate * delta_W_yx
            W_ybias = W_ybias - learning_rate * delta_W_ybias


        # log mean squared error for current epoch 
        MSE_log[0,epoch] = np.sum(MSE_patterns)/MSE_patterns.size

        # print mean squared error
        if epoch == 0:
            print('Training MSE:')
        print(MSE_log[0,epoch])
        
        # break if the error threshold is reached
        if MSE_log[0,epoch] < MSE_threshold:
            break
            
            
    
    

### Run Simulation

We will start with defining critical simulation parameters:
- `learning_rate` corresponds to the step size for each weight change
- `MSE_threshold` defines the mean squared error at which training is stopped
- `num_epochs` corresponds to the maximum number of training iterations (in case `MSE_threshold` is not reached).


In [6]:
learning_rate = 0.3
MSE_threshold = 0.05
num_epochs = 100

train2LayerNetwork(input_patterns, output_patterns_AND, learning_rate, MSE_threshold, num_epochs, False)

Training MSE:
0.256930970518
0.246846104977
0.238148551492
0.230600943688
0.223999907416
0.218175743215
0.212989094443
0.208326576898
0.204096389057
0.200224356979
0.196650564164
0.193326568218
0.190213142868
0.187278464137
0.184496660265
0.18184665413
0.17931123852
0.176876335878
0.174530403994
0.172263957305
0.170069180035
0.167939612612
0.165869896808
0.163855568208
0.161892887009
0.159978700021
0.158110328154
0.156285474817
0.15450215149
0.152758617426
0.151053330975
0.149384910461
0.147752102879
0.146153758995
0.144588813634
0.143056270192
0.14155518852
0.140084675527
0.138643877925
0.137231976663
0.135848182687
0.134491733722
0.133161891828
0.131857941557
0.13057918855
0.129324958461
0.128094596122
0.126887464872
0.12570294601
0.124540438326
0.123399357687
0.122279136653
0.121179224112
0.120099084929
0.119038199595
0.117996063877
0.116972188463
0.115966098614
0.114977333797
0.114005447334
0.113050006034
0.112110589838
0.111186791451
0.11027821599
0.109384480623
0.108505214219
0.1

### The Role of Bias Units

All three boolean functions have something in common: they require mapping the input $[0,0]$ to the output $[0]$. Can a neural network without a bias term actually learn this mapping? Let's compute the network's output in response to the input pattern $x_1=0, x_2=0$ for any given weights $w_{1,1}, w_{1,2}$:

$net_{y_1} = w_{1,1} x_1 + w_{1,2} x_2 = 0$

$y_1 = \frac{1}{1+e^{(-net_{y_1})}} = \frac{1}{1+e^{(-0)}} = 0.5$

That is, the network will not be able to produce an output of 0 for any given weights $w_{1,1}, w_{1,2}$. This is because the inputs $x_1=0, x_2=0$ lead to $net_{y_1} = 0$, which in turn yields an activation of $y_1 = 0.5$ due to the sigmoidal activation function:

<img src="BiasEffect.png" alt="effect of bias on sigmoidal activation function" style="width: 250px;"/>

What if we could shift the sigmoidal activation function to the right? Then a net input of 0 would yield an activation close to 0. Shifting the sigmoidal activation function corresponds to adding a negative bias term to the net input, i.e.


\begin{equation}
net_{y_1} = w_{1,1} x_1 + w_{1,2} x_2 \underbrace{-4}_\text{bias term}
\end{equation}

\begin{equation}
y_1 = \frac{1}{1+e^{(-(w_{1,1} x_1 + w_{1,2} x_2 - 4))}} = \frac{1}{1+e^{(4)}} = 0.0180
\end{equation}

Since the sigmoidal activation function is bounded between 0 and 1, its activation will never be exactly 0 or 1. However, we can get it as close to 0 as possible in order to minimize the error.
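
The effect of the bias can be checked numerically. In the sketch below, the two weight vectors are arbitrary examples and the bias of $-4$ is the value used above; the variable names are chosen for this illustration only.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

x = np.array([0.0, 0.0])                 # input pattern [0, 0]

# without a bias, the weights are multiplied by zero and cannot change the output
w_a = np.array([-0.5, 2.0])              # arbitrary example weights
w_b = np.array([10.0, -10.0])            # a very different set of weights
y_a = sigmoid(np.dot(w_a, x))
y_b = sigmoid(np.dot(w_b, x))
print(y_a, y_b)                          # both are 0.5

# adding a negative bias term shifts the net input and pulls the output toward 0
bias = -4.0
y_biased = sigmoid(np.dot(w_a, x) + bias)
print(y_biased)                          # approximately 0.0180
```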

What if we could teach the network to find the right bias to the output unit $y_j$? As shown above, the bias is just an additional term in the net input of a unit. We can therefore treat it as a separate input unit with value $b_{y_j}$ and a weight $w_{j,b}$ that projects to the output unit $y_j$:

<img src="2LNetworkBias.png" alt="2-layer network with bias unit" style="width: 400px;"/>

The net input of $y_j$ then amounts to

\begin{equation}
net_{y_j} = w_{j,1} x_1 + w_{j,2} x_2 + \underbrace{b_{y_j} w_{j,b}}_\text{bias term}
\end{equation}

Now that we have expressed the bias term as another unit with a projection weight to $y_j$, we can simply set the bias input unit $b_{y_j} = 1$ for every input pattern and let the network learn its weight $w_{j,b}$ in order to find the optimal bias of the network:

\begin{equation}
w_{j,b}^{t+1} = \underbrace{w_{j,b}^t}_\text{current weight} + \underbrace{\Delta w_{j,b}^t}_\text{weight change}
\end{equation}

where 

\begin{equation}
\Delta w_{j,b}^t = - \alpha \underbrace{\frac{\partial E_{y_j}^t}{\partial w_{j,b}^t}}_\text{slope of the error function}
\end{equation}

Note that any unit (except input units) in the network can have a bias unit. It is indeed usually the case that every unit in every layer of a network (except the input layer) has its own bias weight.
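
Treating the bias as an extra input that is always 1 can be sketched as follows. The weight values below are hypothetical and serve only to show that the explicit bias term and the extra input column give the same net input.

```python
import numpy as np

input_patterns = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

# append a constant column of ones: the bias unit is active for every pattern
bias_column = np.ones((input_patterns.shape[0], 1))
augmented_patterns = np.hstack([input_patterns, bias_column])

W_yx = np.array([[1.0, 1.0]])      # hypothetical weights w_{1,1}, w_{1,2}
W_ybias = np.array([[-1.5]])       # hypothetical bias weight w_{1,b}

# explicit bias term: one net input per pattern
net_explicit = input_patterns @ W_yx.T + 1 * W_ybias

# the same net input written as a single matrix product over the augmented patterns
W_augmented = np.hstack([W_yx, W_ybias])
net_augmented = augmented_patterns @ W_augmented.T

print(np.allclose(net_explicit, net_augmented))  # True
```

Because the bias weight is just another column of the weight matrix in this view, it can be learned with the same update rule as the other weights.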



### Exercise 1.2

Modify `train2LayerNetwork` to implement a bias unit on the output unit, as well as a mechanism for learning the weight of that bias unit. Check if implementing a bias unit improves learning on the AND & OR rules.

Finally, test the network's ability to learn the XOR rule. Why does it still have trouble learning the rule? How could we modify the network even further in order to make it learn the XOR rule?
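
One way to explore this question is to search for weights by brute force instead of by learning. The sketch below is an illustration under stated assumptions: it replaces the sigmoidal unit with a simple threshold on the net input (the sigmoid exceeds $0.5$ exactly when the net input is positive) and only tries weights on a coarse grid. The function name `find_separating_weights` is hypothetical.

```python
import itertools
import numpy as np

input_patterns = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
targets = {
    'AND': np.array([0, 0, 0, 1]),
    'OR':  np.array([0, 1, 1, 1]),
    'XOR': np.array([0, 1, 1, 0]),
}

grid = np.linspace(-2.0, 2.0, 21)  # candidate values for w_1, w_2 and the bias

def find_separating_weights(target):
    """Return the first (w1, w2, bias) on the grid that classifies every pattern, else None."""
    for w1, w2, bias in itertools.product(grid, repeat=3):
        net = input_patterns @ np.array([w1, w2]) + bias
        # output 1 when the net input is positive, 0 otherwise
        if np.array_equal((net > 0).astype(int), target):
            return w1, w2, bias
    return None

for name, target in targets.items():
    print(name, find_separating_weights(target))
```

The search finds weights for AND and OR but returns `None` for XOR: no single weighted sum of the two inputs plus a bias can put $[0,1]$ and $[1,0]$ on one side of the threshold and $[0,0]$ and $[1,1]$ on the other.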


## Three-Layer Neural Network & Backpropagation
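
Before turning to the code, it may help to write down how the chain rule from the two-layer network extends to one hidden layer. The notation below is an assumption introduced for this sketch: $h_k$ denotes the activation of hidden unit $k$, $w_{1,k}$ the weight from hidden unit $k$ to the output unit $y_1$, and $v_{k,i}$ the weight from input unit $i$ to hidden unit $k$. With sigmoidal units in both layers, the error signal of the output unit and the resulting weight change are

\begin{equation}
\delta_{y_1} = (y_1 - t_1) \, y_1 (1 - y_1), \qquad \Delta w_{1,k} = - \alpha \, \delta_{y_1} h_k
\end{equation}

The error signal of each hidden unit is obtained by passing $\delta_{y_1}$ backwards through the weight $w_{1,k}$ and the derivative of the hidden unit's activation:

\begin{equation}
\delta_{h_k} = \delta_{y_1} \, w_{1,k} \, h_k (1 - h_k), \qquad \Delta v_{k,i} = - \alpha \, \delta_{h_k} x_i
\end{equation}

These two error signals mirror the quantities `delta_y` and `delta_h` in the function below. The bias weights follow the same rules with the sending activity set to 1.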

In [13]:
def train3LayerNetwork(input_patterns, output_patterns, learning_rate, MSE_threshold, num_epochs, debug=False):

    ### network initialization ###

    # let's define the number of input and output units as a function of the dimension of the input and output patterns respectively
    NInputUnits = input_patterns.shape[1]
    NOutputUnits = output_patterns.shape[1]
    # set number of hidden units to 3
    NHiddenUnits = 3

    # now we can initialize the three layers of the network...
    x = np.zeros((1, NInputUnits)) # input layer
    h = np.zeros((1, NHiddenUnits)) # hidden layer
    y = np.zeros((1, NOutputUnits)) # output layer

    # let's also log the error of the network
    MSE_log = np.zeros((1, num_epochs))

    # initialize weights from input layer to the hidden layer
    W_hx = np.random.uniform(0, 0.1,(h.shape[1],x.shape[1]))
    
    # initialize weights from hidden layer to the output layer
    W_yh = np.random.uniform(0, 0.1,(y.shape[1],h.shape[1]))
    
    # initialize weights from bias unit to hidden layer and output layer
    W_hbias = np.random.uniform(0, 0.1,(h.shape[1],1))
    W_ybias = np.random.uniform(0, 0.1,(y.shape[1],1))

    ### network training ### 

    # the network will be trained in epochs.
    for epoch in range(num_epochs):

        # initialize mean squared error log for all patterns
        MSE_patterns = np.zeros((output_patterns.shape[0],))

        # shuffle input patterns
        # np.random.shuffle(input_patterns)

        # within each training epoch, we will loop through every training pattern. 
        for pattern in range(input_patterns.shape[0]):

            # FORWARD PASS #

            # assign values to input layer
            x = input_patterns[pattern,]

            # compute net input of hidden layer
            hbias = 1 * W_hbias
            h_net = np.dot(W_hx, x.transpose()) + hbias.transpose()
            
            # compute activation of hidden layer using sigmoidal activation function
            h = 1/(np.exp(- (h_net))+1)
            
            # compute net input of output layer
            ybias = 1 * W_ybias
            y_net = np.dot(W_yh, h.transpose()) + ybias.transpose()
            
            # compute activation of output layer using sigmoidal activation function
            y = 1/(np.exp(- (y_net))+1)

            # ERROR BACKPROPAGATION #

            # compute the squared error of the output with respect to the correct training pattern
            # (np.sum collapses the 1x1 result to a scalar before it is stored)
            MSE_patterns[pattern] = np.sum((y-output_patterns[pattern,])**2)

            # compute derivative of the error with respect to the output unit activation
            dError_dAct = (y-output_patterns[pattern,])

            # compute derivative of output unit activation with respect to its net input. 
            # Note that the derivative of a sigmoidal function y(x) = 1/(1+exp(-x)) 
            # is dy = y(x) * (1-y(x))
            dAct_dNet = y * (1-y)
            
            # compute delta over output units
            delta_y = dError_dAct * dAct_dNet

            # compute the derivative of the net input of the output layer with respect to its weights to the hidden layer
            dNet_dW_yh = h
            
            # compute the derivative of the net input of the output and hidden layers with respect to their weights to the bias unit
            dNet_dW_ybias = 1
            dNet_dW_hbias = 1

            # compute weight adjustments from hidden to output layer
            delta_W_yh = delta_y * dNet_dW_yh
            delta_W_ybias = delta_y * dNet_dW_ybias
            
            # derivatives needed to propagate the error back to the hidden layer
            dNet_y_dh = W_yh         # net input of output layer w.r.t. hidden activations
            dh_dNet = h * (1-h)      # hidden activations w.r.t. their net input
            dNet_h_dx = x            # net input of hidden layer w.r.t. its weights to the input layer
        
            # compute delta over hidden units (one row per hidden unit)
            delta_h = delta_y * np.multiply(dNet_y_dh, dh_dNet).transpose()

            # compute weight adjustments from input to hidden layer using the current input pattern
            delta_W_hx = np.dot(delta_h, np.reshape(dNet_h_dx, (1, NInputUnits)))
        
            delta_W_hbias = delta_h * dNet_dW_hbias
            
            # adjust weights to output layer
            W_yh = W_yh - learning_rate * delta_W_yh
            W_ybias = W_ybias - learning_rate * delta_W_ybias
            
            # adjust weights to hidden layer
            W_hx = W_hx - learning_rate * delta_W_hx
            W_hbias = W_hbias - learning_rate * delta_W_hbias

        # log mean squared error for current epoch 
        MSE_log[0,epoch] = np.sum(MSE_patterns)/MSE_patterns.size

        # print mean squared error
        print(MSE_log[0,epoch])

        # break if the error threshold is reached
        if MSE_log[0,epoch] < MSE_threshold:
            break

In [16]:
learning_rate = 0.3
MSE_threshold = 0.05
num_epochs = 2000

train3LayerNetwork(input_patterns, output_patterns_XOR, learning_rate, MSE_threshold, num_epochs, False)

0.259337616472
0.259206538434
0.259106102524
0.25902906691
0.258969886823
0.258924322611
0.258889135897
0.258861855081
0.258840594958
0.258823918298
0.25881072983
0.258800195157
0.258791678805
0.258784696928
0.258778881222
0.258773951403
0.258769694236
0.258765947536
0.258762587994
0.258759521888
0.258756678017
0.258754002316
0.258751453755
0.258749001217
0.258746621119
0.258744295608
0.258742011181
0.258739757636
0.258737527278
0.258735314314
0.258733114387
0.258730924233
0.258728741409
0.258726564097
0.258724390945
0.258722220956
0.258720053397
0.258717887734
0.25871572358
0.258713560661
0.258711398782
0.25870923781
0.258707077656
0.258704918262
0.258702759595
0.258700601637
0.258698444382
0.258696287831
0.258694131991
0.258691976874
0.25868982249
0.258687668852
0.258685515975
0.25868336387
0.258681212551
0.25867906203
0.258676912316
0.25867476342
0.258672615352
0.258670468121
0.258668321733
0.258666176195
0.258664031515
0.258661887698
0.258659744748
0.258657602671
0.258655461471
0.2