# Training an AND gate

This notebook trains a neural network from scratch. The network we train is a very simple one: an AND gate.  
The only external library used here is numpy.

We will be creating the following neural network
<img src = "https://raw.githubusercontent.com/lightknight64bit/Neural-network-from-scratch/ac0680cc7472b192dbc5c6dbd636c9b6f2193263/pic.svg">

As we can see the neural network has 1 input layer with 2 neurons, 1 hidden layer with 5 neurons and 1 output layer with only one neuron.   

The sizes of the input and output layers are fixed by the number of inputs and the output format, which in this case are 2 and 1 respectively (we give 2 bits as input and expect the AND of the two as output). The size of the hidden layer can be whatever we wish; this notebook uses 5.

## Let us now look at the equations required to train the neural network.

## Activation function

We are using the sigmoid activation function, which is given by:    
## $f(x) = \frac{1}{1 + exp(-x)}$  
Let us denote this by
## $\sigma(x)$
The derivative of this function is 
## $\frac{exp(-x)}{(1+exp(-x))^2} = (\sigma(x))(1-\sigma(x))$
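
The identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$ can be checked numerically. The sketch below compares it against a central finite difference; the sample points `xs` and the step size `h` are arbitrary choices made for this check and are not used elsewhere in the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_der(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Assumed sample points and step size, chosen only for this check.
xs = np.linspace(-5.0, 5.0, 11)
h = 1e-5

# Central finite difference approximation of the derivative.
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2.0 * h)

# The same derivative written directly in terms of exp(-x).
closed_form = np.exp(-xs) / (1.0 + np.exp(-xs)) ** 2

# Largest disagreement between the approximation and the identity.
max_gap = np.max(np.abs(numeric - sigmoid_der(xs)))
print(max_gap)
```

The printed gap should be tiny compared with the derivative itself, which peaks at $0.25$ for $x = 0$.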

## Feed forward
Now let the weights for the hidden layer be $W_{1}$ with bias $B_{1}$ and the weights for the output layer be $W_{2}$ with bias $B_{2}$.
The size of $W_{1}$ will be $m \times n$, where $m$ is the number of columns in the input matrix and $n$ is the number of neurons in the hidden layer. Our input has size $4 \times 2$, since there are only 4 combinations of $0$s and $1$s for 2 bits. Our input matrix $X$ looks like this:

$X =  
\begin{pmatrix}
0 & 0 \\
0 & 1 \\
1 & 0 \\
1 & 1
\end{pmatrix}
$ 
 
Therefore the size of $W_1$ should be $2 \times 5$, as the input has 2 columns and we are using 5 neurons.  
  
The bias size should be $1 \times 5$ (one bias per neuron).

Similarly, we can infer that the weights $W_{2}$ for the output layer should be $5 \times 1$ and the bias $B_{2}$ should be $1 \times 1$.  
Now when we give the inputs to the hidden layer, we get a new matrix $Z$ (with $B_{1}$ added to every row of $XW_{1}$), which is given by the equation 
### $Z = \sigma(XW_{1} + B_{1})$

The output $y$ is then given by 
### $y = \sigma(ZW_{2} + B_{2})$

In this notebook we will initialize every entry of our weight and bias matrices to $1$.
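
As a quick check of the shapes above, here is a minimal sketch of a single feed-forward pass with every weight and bias set to $1$. The upper-case variable names mirror the symbols in the equations and are chosen for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# All four 2-bit inputs, one per row (4x2).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Every weight and bias starts at 1, as described in the text.
W1 = np.ones((2, 5))
B1 = np.ones((1, 5))
W2 = np.ones((5, 1))
B2 = np.ones((1, 1))

# Hidden layer: numpy broadcasting adds the 1x5 bias to each of the 4 rows.
Z = sigmoid(X @ W1 + B1)
# Output layer: one value per input row.
y = sigmoid(Z @ W2 + B2)

print(Z.shape, y.shape)
print(y.ravel())
```

The shapes come out as $4 \times 5$ for $Z$ and $4 \times 1$ for $y$. Before any training, all four outputs are close to $1$, so the first three rows are far from their AND targets.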

## Backpropagation  
Backpropagation is the crux of training neural networks. We will use gradient descent to estimate the weights. This notebook does not explain gradient descent; if you are not familiar with it, refer to this article: [Introduction to gradient descent.](https://montjoile.medium.com/an-introduction-to-gradient-descent-algorithm-34cf3cee752b)

Let the error function denoted by $E$ be 
### $E = \frac{1}{2}\sum{(y - Y)^2}$ 

where the sum runs over the four input rows, $y$ is our output and $Y$ is our ground truth matrix, which is
$Y = 
\begin{pmatrix}
0 \\
0 \\
0 \\
1
\end{pmatrix}
$
We will compute the derivative of the error with respect to each weight and bias matrix, and subtract from each matrix its derivative times the learning rate. For example:
Let the derivative of $E$ with respect to $W_{2}$ be 
### $\frac{dE}{dW_{2}}$

Then we will subtract the learning rate times $\frac{dE}{dW_{2}}$ from $W_{2}$, i.e.
### $W_{2} = W_{2} - \alpha*\frac{dE}{dW_{2}}$ 
where $\alpha$ is our learning rate.
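
To see the update rule in isolation, here is a one-parameter sketch. The error function $E(w) = (w-3)^2$, the starting point and the learning rate are all hypothetical and chosen only for illustration; they are not part of the AND network.

```python
# Hypothetical one-parameter error function with its minimum at w = 3.
def error(w):
    return (w - 3.0) ** 2

def error_der(w):
    # dE/dw for the function above.
    return 2.0 * (w - 3.0)

alpha = 0.1   # learning rate (assumed value)
w = 0.0       # starting point (assumed value)

for step in range(100):
    # Same rule as in the text: w = w - alpha * dE/dw
    w = w - alpha * error_der(w)

print(w, error(w))
```

Each step moves `w` a fraction of the way towards the minimum, so after enough steps `w` settles near $3$. The network is trained the same way, with matrices in place of the single number.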

The equations for the weight and bias derivatives are given by:
### $\frac{dE}{dW_{2}} = Z^T*\delta_{2}$
### $\frac{dE}{dW_{1}} = X^T*\delta_{1}$
### $\frac{dE}{dB_{2}} = \delta_{2}$
### $\frac{dE}{dB_{1}} = \delta_{1}$
where 
### $\delta_{2} = (y-Y) \odot \sigma' (ZW_{2} + B_{2})$
and 
### $\delta_{1} = \delta_{2}W_{2}^T \odot \sigma' (XW_{1} + B_{1})$
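$\odot$ represents the Hadamard (element-wise) product of matrices.
For the bias derivatives, sum each column of the respective $\delta$ matrix over its rows (one row per input sample), so that the result matches the size of the bias: $1 \times 1$ for $B_{2}$ and $1 \times 5$ for $B_{1}$.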

For the derivation of these equations, refer to this article: [Proof](https://sudeepraja.github.io/Neural/)
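
One way to gain confidence in these equations without following the full derivation is a finite-difference gradient check. The sketch below is not part of the original notebook. It assumes random starting weights (seeded for repeatability) instead of all ones, so that the check is not run on a saturated network, and the helper names `error`, `gradients` and `numerical_gradient` are made up for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_der(x):
    s = sigmoid(x)
    return s * (1.0 - s)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [0], [0], [1]], dtype=float)

def error(W1, B1, W2, B2):
    # E = 1/2 * sum((y - Y)^2)
    Z = sigmoid(X @ W1 + B1)
    y = sigmoid(Z @ W2 + B2)
    return 0.5 * np.sum((y - Y) ** 2)

def gradients(W1, B1, W2, B2):
    # Analytic gradients from the backpropagation equations.
    A1 = X @ W1 + B1
    Z = sigmoid(A1)
    A2 = Z @ W2 + B2
    y = sigmoid(A2)
    d2 = (y - Y) * sigmoid_der(A2)
    d1 = (d2 @ W2.T) * sigmoid_der(A1)
    return {
        "W1": X.T @ d1,
        "B1": d1.sum(axis=0, keepdims=True),  # summed over the 4 samples
        "W2": Z.T @ d2,
        "B2": d2.sum(axis=0, keepdims=True),  # summed over the 4 samples
    }

def numerical_gradient(params, name, h=1e-6):
    # Central finite difference, one matrix entry at a time.
    grad = np.zeros_like(params[name])
    for idx in np.ndindex(params[name].shape):
        original = params[name][idx]
        params[name][idx] = original + h
        e_plus = error(**params)
        params[name][idx] = original - h
        e_minus = error(**params)
        params[name][idx] = original  # restore the entry
        grad[idx] = (e_plus - e_minus) / (2.0 * h)
    return grad

# Assumption: random weights, used only for this check.
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(2, 5)),
    "B1": rng.normal(size=(1, 5)),
    "W2": rng.normal(size=(5, 1)),
    "B2": rng.normal(size=(1, 1)),
}

analytic = gradients(**params)
max_gaps = {
    name: float(np.max(np.abs(analytic[name] - numerical_gradient(params, name))))
    for name in params
}
print(max_gaps)
```

If the equations are implemented correctly, every gap should be many orders of magnitude smaller than the gradients themselves.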


# Python implementation using numpy

In [2]:
import numpy as np

## Matrix definitions

In [3]:
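# All four input combinations, one per row (4x2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# Ground truth AND output for each row (4x1)
Y = np.array([0, 0, 0, 1]).reshape(4, 1)
# Hidden layer: weights (2x5) and biases (1x5), all initialized to 1
w1 = np.ones([2, 5])
b1 = np.ones([1, 5])
# Output layer: weights (5x1) and bias (1x1), all initialized to 1
w2 = np.ones([5, 1])
b2 = np.ones([1, 1])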

## Activation function

In [4]:
# Activation function and its derivative

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_der(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1 - s)

## Training

In [5]:

epochs = 40000
lr = 0.3  # learning rate
for i in range(epochs):
    # Feed forward
    z = sigmoid(np.dot(X, w1) + b1)  # hidden layer output (4x5)
    y = sigmoid(np.dot(z, w2) + b2)  # network output (4x1)
    # Backpropagation
    E = y - Y
    d2 = np.multiply(E, sigmoid_der(np.dot(z, w2) + b2))
    d1 = np.multiply(np.dot(d2, w2.T), sigmoid_der(np.dot(X, w1) + b1))
    w2_err = np.dot(z.T, d2)
    w1_err = np.dot(X.T, d1)
    b2_err = d2
    b1_err = d1

    # Gradient descent update; bias gradients are summed over the samples
    w2 -= lr * w2_err
    w1 -= lr * w1_err
    b2 = b2 - lr * np.sum(b2_err)
    b1 = b1 - lr * np.sum(b1_err, axis=0)




## Prediction

In [6]:
# The network has been trained, let's check out the output
x = [1,1]
z = sigmoid(np.dot(x, w1) + b1)
y = sigmoid(np.dot(z, w2)+ b2)
y

array([[0.99223727]])

We see that the output is very close to 1, which is the expected result of 1 AND 1.
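
The check above covers only the input `[1, 1]`. The self-contained sketch below repeats the same training procedure (all-ones initialization, learning rate 0.3, 40000 epochs) and then evaluates all four rows of the truth table at once. The names `total_error`, `initial_error`, `final_error` and `preds` are introduced here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_der(x):
    s = sigmoid(x)
    return s * (1.0 - s)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [0], [0], [1]], dtype=float)

# Same initialization as in the notebook: every weight and bias is 1.
w1 = np.ones((2, 5))
b1 = np.ones((1, 5))
w2 = np.ones((5, 1))
b2 = np.ones((1, 1))

def total_error():
    # E = 1/2 * sum((y - Y)^2) for the current weights.
    y = sigmoid(sigmoid(X @ w1 + b1) @ w2 + b2)
    return 0.5 * np.sum((y - Y) ** 2)

initial_error = total_error()

lr = 0.3
epochs = 40000
for _ in range(epochs):
    # Feed forward, keeping the pre-activations for the derivatives.
    a1 = X @ w1 + b1
    z = sigmoid(a1)
    a2 = z @ w2 + b2
    y = sigmoid(a2)
    # Backpropagation: both deltas are computed before any weight changes.
    d2 = (y - Y) * sigmoid_der(a2)
    d1 = (d2 @ w2.T) * sigmoid_der(a1)
    # Gradient descent update.
    w2 -= lr * (z.T @ d2)
    w1 -= lr * (X.T @ d1)
    b2 -= lr * d2.sum(axis=0, keepdims=True)
    b1 -= lr * d1.sum(axis=0, keepdims=True)

final_error = total_error()

# Predict all four input rows in a single pass.
preds = sigmoid(sigmoid(X @ w1 + b1) @ w2 + b2)
for row, p in zip(X, preds):
    print(row, p[0])
print(initial_error, final_error)
```

If training has converged, the first three outputs should be close to 0 and the last one close to 1, matching the ground truth $Y$.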