# Simple 2-layer Neural Net with Vectorization
+ In forward and backward propagation, data are processed in batches using matrix multiplication.
+ The MSE loss function is used and, for simplicity, bias terms are omitted.
+ The network is shown to learn the XOR operation.
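
To make the batch processing concrete, the following is a minimal sketch of one vectorized forward pass and the resulting loss. The variable names (`X`, `W1`, `W2`, `hidden`), the fixed seed, and the 1/2 factor in the loss are assumptions made for illustration; the 1/2 factor is chosen so that the gradient with respect to the output is simply `output - target`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)  # fixed seed, used only for reproducibility

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # 4 samples, 2 features
y = np.array([0, 1, 1, 0])                      # XOR targets
W1 = rng.random((4, 2))  # hidden layer: 4 units, 2 inputs, no bias
W2 = rng.random((1, 4))  # output layer: 1 unit, 4 hidden inputs, no bias

# Each column of X.T is one sample, so a single matrix product
# processes the whole batch at once.
hidden = sigmoid(W1 @ X.T)     # shape (4, 4): hidden units x samples
output = sigmoid(W2 @ hidden)  # shape (1, 4): one prediction per sample

# Squared error averaged over the batch (1/2 factor is an assumption).
loss = 0.5 * np.mean((output - y) ** 2)
print(hidden.shape, output.shape, loss)
```

No Python loop over samples is needed: the sample index lives in the column dimension of every intermediate array.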

In [None]:
import numpy as np
import random

eta = 0.9  # learning rate
epoch = 10000

### 2-layer Neural Network Model
+ Sigmoid activation functions (a.k.a. logistic functions) are used at the outputs of both layer 1 and layer 2.
+ The total error is obtained by summing the individual errors and averaging them.
+ The weights are updated by summing the error contributions of all samples and taking the derivative of the total error with respect to each weight.
+ This calculation is performed as a dot product of each layer's delta and that layer's input (the output of the previous layer).
+ For more details on the definition of the delta terms, refer to the class notes.
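
As a sanity check on the dot-product formulation, the sketch below compares the vectorized weight gradients with an explicit loop that sums one outer product per sample. The helper names and the fixed seed are illustrative assumptions; the delta expressions follow the `backprop` method in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(a):
    # Expects the sigmoid *output* a = sigmoid(z), not the pre-activation z.
    return a * (1.0 - a)

rng = np.random.default_rng(1)  # fixed seed, used only for reproducibility
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T  # shape (2, 4)
y = np.array([0.0, 1.0, 1.0, 0.0])
W1 = rng.random((4, 2))
W2 = rng.random((1, 4))

# Forward pass over the whole batch.
layer1 = sigmoid(W1 @ X)
output = sigmoid(W2 @ layer1)

# Deltas: error signal at each layer, one column per sample.
delta2 = (output - y) * sigmoid_deriv(output)     # shape (1, 4)
delta1 = (W2.T @ delta2) * sigmoid_deriv(layer1)  # shape (4, 4)

# Vectorized gradients: the matrix product sums over samples implicitly.
dW2_batch = delta2 @ layer1.T
dW1_batch = delta1 @ X.T

# Equivalent explicit sum of per-sample outer products.
dW2_loop = np.zeros_like(W2)
dW1_loop = np.zeros_like(W1)
for n in range(X.shape[1]):
    dW2_loop += np.outer(delta2[:, n], layer1[:, n])
    dW1_loop += np.outer(delta1[:, n], X[:, n])

print(np.allclose(dW2_batch, dW2_loop), np.allclose(dW1_batch, dW1_loop))
```

Both comparisons agree because a matrix product of the form `delta @ activations.T` is exactly the sum over samples of the per-sample outer products.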

In [None]:
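def sigmoid(x):
    # logistic function, applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    # derivative of the sigmoid, expressed in terms of the sigmoid output:
    # x must already be sigmoid(z), not the pre-activation z
    return x * (1.0 - x)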

class neuralnetwork:
    # neural network model
    
    def __init__(self, x, w1, w2, y):
        self.inputs   = x.T
#        print(self.inputs)
        self.weights1 = w1
        self.weights2 = w2
        self.target   = y
        self.output   = np.zeros(self.target.shape)

    def forwardprop(self):
        # forward processing of inputs and weights using sigmoid activation function
        self.layer1 = sigmoid(np.dot(self.weights1, self.inputs))
        self.output = sigmoid(np.dot(self.weights2, self.layer1))

    def backprop(self):
        # backward pass: apply the chain rule to find the derivative of the loss function with respect to the weights
        delta2 = (self.output - self.target) * sigmoid_deriv(self.output)
        delta1 = np.dot(self.weights2.T, delta2) * sigmoid_deriv(self.layer1)
        dw2 = np.dot(delta2, self.layer1.T)
        dw1 = np.dot(delta1, self.inputs.T)

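        # update the weights with the batch-averaged derivative of the loss function
        # (batch size is read from the stored inputs; eta is the global learning rate)
        batchsize = self.inputs.shape[1]
        self.weights1 -= eta * dw1 / batchsize
        self.weights2 -= eta * dw2 / batchsize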

    def predict(self, x):
        # predict the output for a given input x
        self.layer1 = sigmoid(np.dot(self.weights1, x))
        self.output = sigmoid(np.dot(self.weights2, self.layer1))
        return (self.output)
        
    def calculaterror(self):
        # calculate error
        error = self.target - self.output
#        print("Output: ", self.output)
        return str(np.mean(np.abs(error)))


### Batch Gradient Descent Optimization
+ All the input data are processed in one batch during both forward and backward propagation.
+ Compared with SGD, batch GD follows a smoother optimization path because each weight update is computed from the whole batch.
+ In SGD, the weights are updated separately for each input sample, and the weights produced by the previous sample's update serve as the starting point for the next step.
+ This makes the SGD optimization path less smooth than that of BGD (batch gradient descent).
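
The difference between the two update schedules can be sketched as follows. This is an illustrative comparison over a single epoch, not part of the notebook's training code; the `gradients` helper, the fixed seed, and the random sample order in the SGD loop are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(W1, W2, X, y):
    """Return (dW1, dW2) summed over the columns (samples) of X."""
    layer1 = sigmoid(W1 @ X)
    output = sigmoid(W2 @ layer1)
    delta2 = (output - y) * output * (1.0 - output)
    delta1 = (W2.T @ delta2) * layer1 * (1.0 - layer1)
    return delta1 @ X.T, delta2 @ layer1.T

eta = 0.9
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T  # shape (2, 4)
y = np.array([[0.0, 1.0, 1.0, 0.0]])                           # shape (1, 4)

rng = np.random.default_rng(2)  # fixed seed, used only for reproducibility
W1_init = rng.random((4, 2))
W2_init = rng.random((1, 4))

# Batch GD: one update per epoch, using the gradient averaged over all samples.
W1_b, W2_b = W1_init.copy(), W2_init.copy()
dW1, dW2 = gradients(W1_b, W2_b, X, y)
W1_b -= eta * dW1 / X.shape[1]
W2_b -= eta * dW2 / X.shape[1]

# SGD: one update per sample; each step starts from the weights
# left behind by the previous sample's update.
W1_s, W2_s = W1_init.copy(), W2_init.copy()
for n in rng.permutation(X.shape[1]):
    x_n = X[:, [n]]  # keep 2-D shape (2, 1)
    y_n = y[:, [n]]  # keep 2-D shape (1, 1)
    dW1, dW2 = gradients(W1_s, W2_s, x_n, y_n)
    W1_s -= eta * dW1
    W2_s -= eta * dW2

print("batch GD step:", np.linalg.norm(W2_b - W2_init))
print("SGD epoch:    ", np.linalg.norm(W2_s - W2_init))
```

After one epoch, batch GD has moved the weights once along the averaged gradient, while SGD has moved them four times along individual-sample gradients, which is the source of its noisier path.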

In [None]:
if __name__ == "__main__":
 
    inputdata = np.array([[0,0],
                          [0,1],
                          [1,0],
                          [1,1]])
    
    batchsize = inputdata.shape[0]
    w2 = np.random.rand(1, 4)
    w1 = np.random.rand(4, inputdata.shape[1]) 

    targetvalue = np.array([0, 1, 1, 0])

    nn = neuralnetwork(inputdata, w1, w2, targetvalue)
  
    # training 
    for i in range(epoch):    
        nn.forwardprop()
        nn.backprop()
        if (i % 1000) == 0:
            print("Error: ", nn.calculaterror())
        
    print("output after training")   
    print(nn.output)

### Testing and Prediction
+ After training, you can verify that the network generates the required target for a given input.
+ In particular, you can check that the XOR operation is reproduced.
+ Although the above neural net has only one hidden layer, such a network can model much more complex input data.
+ In theory, according to the universal approximation theorem, a neural net with a single hidden layer of sufficient width can approximate an arbitrary continuous function on a compact domain, under mild assumptions on the activation function.
+ For more details on the universal approximation theorem, refer to
+ https://en.wikipedia.org/wiki/Universal_approximation_theorem
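
One way to check all four XOR cases at once is to pass the whole input table through the network and threshold the outputs at 0.5. The sketch below uses hand-picked, hypothetical weights rather than the trained ones, so that it runs on its own; it also illustrates that a bias-free network of this shape is able to represent XOR. To test the trained model instead, substitute `nn.weights1` and `nn.weights2`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(W1, W2, x):
    # Same two-layer, bias-free forward pass as in the notebook.
    return sigmoid(W2 @ sigmoid(W1 @ x))

# Hand-picked illustrative weights (an assumption, NOT the trained values).
W1 = np.array([[ 10.0,   0.0],
               [  0.0,  10.0],
               [ 10.0,  10.0],
               [-10.0, -10.0]])
W2 = np.array([[-20.0, -20.0, 35.0, -5.0]])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
probs = predict(W1, W2, X.T)        # shape (1, 4), one column per input row
labels = (probs > 0.5).astype(int)  # threshold at 0.5 to get 0/1 decisions

for x, p, label in zip(X, probs[0], labels[0]):
    print(x, round(float(p), 3), label)
```

With these weights the thresholded outputs are 0, 1, 1, 0 for the four inputs, matching the XOR truth table.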

In [None]:
# predict and test the output for a given input
x_prediction = np.array([[1, 1]])
predicted_output = nn.predict(x_prediction.T)
print("Predicted data based on trained weights: ")
print("Input: ", x_prediction)
print("Output: ", predicted_output)