    In this jupyter notebook, we are going to implement the various optimizers and regularizers.
    
    Let's start with importing the libraries that we're going to need

In [1]:
import numpy as np
import sklearn.datasets

    Let's write a neural network model for 2 hidden layers.
      
    First we want to write a function that initializes the parameters W1, b1, W2, b2 randomly.


In [2]:
''' inputs: n_x (input feature dim), n_h (num of hidden layer units), n_y (num of output units)'''
def two_layer_init(n_x, n_h, n_y):
    
    # initialize W's with gaussian random variables with variance 0.01
    # initialize b's with zeros, because they don't need symmetry broken
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    
    # store the weights in a dictionary for easy access in future uses of the weights
    parameters = {
        'W1': W1,
        'b1': b1,
        'W2': W2,
        'b2': b2
    }
    
    return parameters

    Now let's test our two layer init function with small dimensions.

In [3]:
parameters = two_layer_init(5, 4, 3)
n_x, n_h, n_y = 5, 4, 3
# check that we use zeros for b's and randn for W's
for key, value in parameters.items():
    print(key)
    print(value)

# check shapes are correct
right_dims = [(n_h, n_x), (n_h, 1), (n_y, n_h), (n_y, 1)]
i = 0
for key, value in parameters.items():
    print(key, str(right_dims[i] == value.shape))
    i += 1

W1
[[-0.00653288 -0.00819667  0.00031673  0.00889022  0.01301598]
 [ 0.00206776  0.0089669   0.0127582  -0.00828307 -0.00073103]
 [-0.0018422   0.00376808 -0.01355366  0.01301257  0.0128692 ]
 [ 0.01187923 -0.0020258  -0.00485629  0.01572953 -0.00682283]]
b1
[[0.]
 [0.]
 [0.]
 [0.]]
W2
[[ 6.17140506e-03  5.63390061e-03  6.90247743e-03 -7.49576961e-03]
 [-9.64550975e-03  6.91329132e-03 -1.47138253e-02  1.79246889e-02]
 [-2.12096969e-03  1.15474087e-02 -1.95564450e-02  7.49957548e-05]]
b2
[[0.]
 [0.]
 [0.]]
W1 True
b1 True
W2 True
b2 True


    Let's now write a function to do forward propagation. 
    
    We're going to use ReLU for the hidden layer and sigmoid for the output layer.
    
    We also need to write a small ReLU and sigmoid function right before our forward propagation implementation.

In [30]:
def relu(Z):
    return np.maximum(Z, 0)

In [34]:
X = np.random.randn(100, 100)
print(X.shape)
Z = relu(X)
print(Z.shape)

(100, 100)
(100, 100)


In [31]:
def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

    We're going to need to save a cache for backward propagation. 

In [6]:
def two_layer_forward_propagation(X, parameters):
    
    # for backward propagation
    cache = {}
    
    Z1 = np.dot(parameters['W1'], X) + parameters['b1']
    A1 = relu(Z1)
    
    Z2 = np.dot(parameters['W2'], Z1) + parameters['b2']
    A2 = sigmoid(Z2)
    
    # turns out you don't need the last layers A and Z, but you do need X
    # you also need everything in the hidden layers
    cache['Z1'] = Z1
    cache['A1'] = A1
    
    return A2, cache

    We're almost at writing backward propagation. We need to write the function for computing cost.

In [7]:
def compute_cost(A2, Y):
    m = Y.shape[1]
    return -1 / m * np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

    Now let's write the code for backward propagation.
    
    However, before that, we need to write a function that calculates the relu derivative

In [8]:
def relu_derivative(Z):
    grad = np.where(Z >= 0, 1, 0)
    return grad

In [9]:
print("relu_derivative(-1): %d" % relu_derivative(-1))
print("relu_derivative(1): %d" % relu_derivative(1))

relu_derivative(-1): 0
relu_derivative(1): 1


In [10]:
def two_layer_backward_propagation(A2, Y, X, parameters, cache, learning_rate = 0.1):
    
    # we should get the things we need from fprop's cache now
    A1 = cache['A1']
    Z1 = cache['Z1']
    W1, b1, W2, b2 = parameters.values()
    
    # calculate the gradients
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T)
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * relu_derivative(Z1)
    dW1 = np.dot(dZ1, X.T)
    db1 = np.sum(dZ1, axis=1, keepdims=True)
    
    # now that we have the gradients, let's perform gradient descent
    parameters['W1'] = parameters['W1'] - learning_rate * dW1
    parameters['b1'] = parameters['b1'] - learning_rate * db1
    parameters['W2'] = parameters['W2'] - learning_rate * dW2
    parameters['b2'] = parameters['b2'] - learning_rate * db2
    
    # what do we return?
    return parameters

    We're done with backward propagation now. 
    
    Let's write the function accuracy, because that is the metric that we care about.
    
    Cost is important, but accuracy is the most important for us.

In [11]:
def compute_accuracy(Y_pred, Y):
    print(Y_pred * Y)
    accuracy = np.mean(Y_pred * Y)
    return accuracy

In [12]:
a = np.array([np.random.randint(0, 2) for i in range(10)])
b = np.array([np.random.randint(0, 2) for i in range(10)])
print(a, b)
print(compute_accuracy(a, b))

[0 0 1 0 1 1 1 0 1 0] [1 1 1 0 1 1 0 0 0 1]
[0 0 1 0 1 1 0 0 0 0]
0.3


    We also need a function for prediction.

In [13]:
def predict(A2):
    return A2 >= 0.5

In [14]:
def two_layer_model(X_train, Y_train, X_test, Y_test, learning_rate = 0.1, num_epochs = 101):

    n_x, m = X_train.shape
    n_y, _ = Y_train.shape
    n_h = 6
    
    # initialize variables
    parameters = two_layer_init(n_x, n_h, n_y)
    
    # initialize list of cost and accuracy
    costs = []
    training_accuracies = []
    
    for epoch in range(num_epochs):
        
        # forward propagation
        A2, cache = two_layer_forward_propagation(X_train, parameters)

        # compute cost
        cost = compute_cost(A2, Y_train)

        # backward_propagation
        parameters = two_layer_backward_propagation(A2, Y_train, X_train, parameters, cache, learning_rate = learning_rate)
        
        # every 100 epochs, add cost to the list
        if epoch % 100 == 0:
            costs.append(cost)
            
            training_predictions = predict(A2)
            training_accuracy = compute_accuracy(training_predictions, Y_train)
            training_accuracies.append(training_accuracy)
            
            
            
    return parameters, training_accuracies

    We need a dataset to break our code on.
   

In [15]:
data = sklearn.datasets.load_iris()
X = data['data']
Y = data['target']
Y = Y.reshape((Y.shape[0], 1))
print(X.shape)
print(Y.shape)

(150, 4)
(150, 1)


    Now, we have the iris dataset.
    
    We should split it into a train and test set.

In [16]:
permutation = np.random.permutation(X.shape[0])
X = X[permutation]
Y = Y[permutation]
X = X.T
Y = Y.T
print(X.shape, Y.shape)
X_train, X_test = X[:, 0:100], X[:, 100:]
Y_train, Y_test = Y[:, 0:100], Y[:, 100:]

(4, 150) (1, 150)


In [17]:
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(4, 100) (4, 50) (1, 100) (1, 50)


In [19]:
parameters, training_accuracies = two_layer_model(X_train, Y_train, Y_train, Y_test)

[[0 1 2 0 0 1 0 1 1 1 2 1 1 2 2 0 0 1 2 2 1 2 0 0 0 2 1 1 2 2 2 0 1 1 0 0
  0 2 2 2 1 2 2 1 2 1 1 0 0 0 0 2 2 2 1 0 0 1 2 2 0 2 2 2 0 1 1 2 0 0 2 2
  1 1 0 1 2 1 0 2 2 2 1 2 0 2 2 0 0 1 1 1 1 2 2 2 0 1 1 0]]
[[0 1 2 0 0 1 0 1 1 1 2 1 1 2 2 0 0 1 2 2 1 2 0 0 0 2 1 1 2 2 2 0 1 1 0 0
  0 2 2 2 1 2 2 1 2 1 1 0 0 0 0 2 2 2 1 0 0 1 2 2 0 2 2 2 0 1 1 2 0 0 2 2
  1 1 0 1 2 1 0 2 2 2 1 2 0 2 2 0 0 1 1 1 1 2 2 2 0 1 1 0]]
[1.08, 1.08]


  This is separate from the ipykernel package so we can avoid doing imports until
  This is separate from the ipykernel package so we can avoid doing imports until
  return umr_sum(a, axis, dtype, out, keepdims)


    Now we should generalize this to n layer neural networks. 
    
    We're going to need to rewrite initialize parameters.
   
    Let's also use He initialization

In [71]:
def he_initialization(layer_dims):
    '''
    input: layer_dims
    (n_x, n_1, n_2, ..., n_L)
    len = L + 1
    output: parameters
    {'W1', 'b1', 'W2', 'b2', ..., 'WL', 'bL'}
    weights : 1 -> L
    '''
    
    L = len(layer_dims) - 1
    parameters = {}
    
    for l in range(1, L + 1): # we want to include L
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * np.sqrt(2.0 / layer_dims[l - 1])
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        
    return parameters

In [72]:
layer_dims = (100, 50, 6, 1)
parameters = he_initialization(layer_dims)
print(len(parameters) // 2)
#print(parameters)

3


    Now let's write a function to do forward propagation.

In [77]:
def forward_propagation(X, parameters):
    '''inputs: X, parameters
    outputs: AL, cache
    cache contains A, Z for layers 1:L for backprop
    '''
    
    L = len(parameters) // 2
    cache = {}
    
    A = X
    
    for l in range(1, L):
        Z = np.dot(parameters['W' + str(l)], A) + parameters['b' + str(l)]
        A = relu(Z)
        cache['Z' + str(l)] = Z 
        cache['A' + str(l)] = A
        # use relu for every layer but the last
        
    # last layer
    Z = np.dot(parameters['W' + str(L)], A) + parameters['b' + str(L)]
    A = sigmoid(Z)
    # dont add the Z and A for last layer, because they're not necessary for the backward propagation calculation
    
    return A, cache

In [78]:
X = np.random.rand(100, 100)
A, cache = forward_propagation(X, parameters)