# DL Course 1, Week 3: Shallow Neural Networks

## Neural Networks Overview

The neural diagrams can be confusing (for me at least), but the math is actually pretty simple:

- A single neuron supplies a vector of weights $w$, which is dotted with the input vector $x$; a bias $b$ is then added and an activation function $\sigma$ is applied.

$z = w^{T}x + b$

$a = \sigma(z)$

- In general, neurons can be stacked on top of each other to form a layer, which mathematically is a "stack" of weight vectors, i.e. a matrix that multiplies the input via matrix multiplication.

$z^{[l]} = W^{[l]}x + b^{[l]}$

$a^{[l]} = \sigma(z^{[l]})$

$z^{[l+1]} = W^{[l+1]}a^{[l]} + b^{[l+1]}$

$a^{[l+1]} = \sigma(z^{[l+1]})$

That's all there is to it!
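
As a quick sanity check of the "stack of weight vectors" picture, here is a minimal NumPy sketch. The sizes (3 inputs, 4 neurons), the random values, and the variable names are assumptions chosen for illustration, not something from the course. It computes four neurons one at a time and then as a single matrix multiplication, and compares the results.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                      # one input vector with 3 components

# Four individual neurons, each with its own weight vector and bias.
weight_vectors = [rng.normal(size=3) for _ in range(4)]
biases = rng.normal(size=4)
a_one_by_one = np.array(
    [sigmoid(w @ x + b) for w, b in zip(weight_vectors, biases)]
)

# The same four neurons as one layer: the weight vectors become the rows of W.
W = np.stack(weight_vectors)                # shape (4, 3)
a_layer = sigmoid(W @ x + biases)           # shape (4,)

print(np.allclose(a_one_by_one, a_layer))   # the two computations agree
```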

Note: The "Neural networks as functions" section of the Wikipedia article on artificial neural networks gives helpful big-picture intuition to keep in mind. Neural networks are just functions that we build up with lots of parameters (e.g. weights stored in matrices).
https://en.wikipedia.org/wiki/Artificial_neural_network#Neural_networks_as_functions

## Neural Network Representation 

![alt text](nn_rep.png "nn_rep")

![alt text](nn_rep2.png "nn_rep2")

- These visual "neuron" representations are really just a crude way of expressing vectors. The input here is a vector of 3 components, the output of the hidden layer is a vector with 4 components, and the final output is a vector with a single component, i.e. just a regular old number.

- The heaps of arrows are just a messy way of saying that each layer of inputs, that is each vector, is dotted (in the sense of the dot product) with a vector of weights in order to get one component of the next vector. [Technically the resulting component has a linear bias term added to it and then an activation function applied to it.] 

- Basically, the diagram can be read as saying "a transformation maps a vector of 3 components to one of 4 components to one of 1 component." In fact, these transformations can be expressed as matrices of appropriate dimension [with a bias term then added and a nonlinear activation applied elementwise]. The "two layers" of this neural network correspond to these 2 transformations and, essentially, to these 2 matrices. [This matrix representation of a layer's stacked weights is considered "vectorization," but thinking of the matrices on their own merit seems simpler and clearer to me.]
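
To make the "3 components to 4 components to 1 component" reading concrete, here is a small sketch that tracks only the shapes. The tanh hidden activation, sigmoid output, zero biases, and random weights are assumptions for illustration; the function name `forward` is made up for this example.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass through a 3 -> 4 -> 1 network (tanh hidden, sigmoid output)."""
    a1 = np.tanh(W1 @ x + b1)                    # hidden layer output
    a2 = 1.0 / (1.0 + np.exp(-(W2 @ a1 + b2)))   # final output
    return a1, a2

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # first matrix: maps 3 -> 4
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # second matrix: maps 4 -> 1

x = np.array([1.0, 2.0, 3.0])
a1, a2 = forward(x, W1, b1, W2, b2)
print(x.shape, a1.shape, a2.shape)               # (3,) (4,) (1,)
```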

## Vectorizing Across Multiple Examples

![alt text](vec1.png "vec1")

![alt text](vec2.png "vec2")

- Jargon: hidden units = neurons = rows in weight matrices

- The idea is simply to package input vectors into matrices so that computations through the network can be done more efficiently. [Since matrices multiply row by column, this amounts to lining up the input vectors next to each other, column by column, in a matrix and multiplying by the weight matrix of the layer.]

https://en.wikipedia.org/wiki/Matrix_multiplication
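
A minimal sketch of the packaging idea, assuming a single tanh layer with 4 units and 5 random example vectors (all sizes and names here are illustrative). Looping over the examples and doing one matrix multiplication on the column-stacked inputs should give the same activations.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))                 # one layer: 3 inputs -> 4 hidden units
b = rng.normal(size=(4, 1))                 # column vector, broadcasts over examples
examples = [rng.normal(size=3) for _ in range(5)]

# One example at a time.
A_loop = np.column_stack([np.tanh(W @ x + b[:, 0]) for x in examples])

# All examples at once: each input vector is one column of X.
X = np.column_stack(examples)               # shape (3, 5)
A_vec = np.tanh(W @ X + b)                  # shape (4, 5), one column per example

print(A_vec.shape, np.allclose(A_loop, A_vec))
```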

## Activation Functions

![alt text](act_fun1.png "act_fun1")

- Examples: sigmoid, tanh, relu ($\max(0,z)$), generally some non-linear function

- Unlike sigmoid which is between 0 and 1, tanh is centered at 0 and between -1 and 1. 

- Discussion question: When might you use each of these activation functions? 

- Discussion question: Why might relu be a popular activation function? [What does non-zero derivative have to do with learning?]

- The point of activation functions is to enable the construction of nonlinear functions. Without them, each layer only multiplies by a matrix and adds a vector, and a composition of such layers collapses to a single transformation of the same form, so the network is limited to purely linear (strictly speaking, affine) maps no matter how many layers it has.
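
The last point can be checked numerically. The sketch below, with assumed layer sizes 3 -> 4 -> 1 and random weights, first confirms the output ranges of sigmoid and tanh on a grid, then shows that two layers without an activation equal one combined matrix and bias, while inserting tanh breaks that equivalence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Output ranges on a grid of inputs.
z = np.linspace(-5.0, 5.0, 101)
print(sigmoid(z).min() > 0.0, sigmoid(z).max() < 1.0)    # sigmoid stays in (0, 1)
print(np.tanh(z).min() > -1.0, np.tanh(z).max() < 1.0)   # tanh stays in (-1, 1)
print(relu(np.array([-2.0, 0.0, 3.0])))                  # negatives are clipped to 0

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
x = rng.normal(size=3)

# Two layers with no activation ...
two_layers = W2 @ (W1 @ x + b1) + b2
# ... are the same as one layer with a combined matrix and bias.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined
print(np.allclose(two_layers, one_layer))

# With a tanh in between, the result generally differs from the collapsed layer.
with_tanh = W2 @ np.tanh(W1 @ x + b1) + b2
print(np.allclose(with_tanh, one_layer))
```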

## Gradient Descent

![alt text](grad_desc1.png "grad_desc1")

- To optimize the parameters (weights and biases), we minimize the cost function (effectively the total error on the training set), i.e. we look for a point where the derivatives of the cost function with respect to the parameters are 0.

- "Gradient descent" then refers to a method for finding the points of vanishing derivatives by modifying the parameters by an amount proportional to minus the present derivatives (this makes sense since if the derivatives are 0 we're good and don't want to change anything, but if not then we want to move in the direction opposite to the slope in order to get to the minimum).
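
Here is a minimal sketch of that update rule on a toy cost, not on a neural network. The quadratic cost, the `target` point, the learning rate of 0.1, and the 100 steps are all assumptions picked so the behavior is easy to follow: the derivative vanishes exactly at `target`, and each step moves the parameters against the slope.

```python
import numpy as np

target = np.array([3.0, -1.0])          # location of the minimum (assumed for the toy cost)

def cost(theta):
    """Toy cost: squared distance from the target point."""
    return np.sum((theta - target) ** 2)

def grad(theta):
    """Derivative of the toy cost with respect to each parameter."""
    return 2.0 * (theta - target)

theta = np.zeros(2)                     # starting parameters
learning_rate = 0.1

for step in range(100):
    # Move by an amount proportional to minus the current derivative.
    theta = theta - learning_rate * grad(theta)

print(theta, cost(theta))               # theta ends up very close to target
```

With this cost, every step shrinks the distance to the minimum by a constant factor, and once the derivative is zero the update leaves the parameters unchanged.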

## Backpropagation

![alt text](back_prop1.png "back_prop1")
![alt text](back_prop2.png "back_prop2")

- To calculate the derivatives needed to minimize the cost function, we must use the chain rule, since a neural network is really a composition of functions.
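
As a hedged illustration of the chain rule at work, the sketch below backpropagates through a 3 -> 4 -> 1 network and compares the result against a finite-difference estimate. The choices here are assumptions for the example and may differ from the course slides: tanh hidden layer, sigmoid output, squared-error loss on a single example, and the helper names `loss`, `backprop`, and `numerical_grad_W1`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(params, x, y):
    """Squared-error loss of the network on one example."""
    W1, b1, W2, b2 = params
    a1 = np.tanh(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

def backprop(params, x, y):
    """Gradients of the loss via the chain rule, working from output to input."""
    W1, b1, W2, b2 = params
    a1 = np.tanh(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    dz2 = (a2 - y) * a2 * (1.0 - a2)          # dL/dz2, using sigmoid'(z) = a(1 - a)
    dW2 = np.outer(dz2, a1)                   # dL/dW2
    dz1 = (W2.T @ dz2) * (1.0 - a1 ** 2)      # dL/dz1, using tanh'(z) = 1 - a^2
    dW1 = np.outer(dz1, x)                    # dL/dW1
    return dW1, dz1, dW2, dz2                 # bias gradients equal dz1 and dz2

def numerical_grad_W1(params, x, y, eps=1e-5):
    """Central-difference estimate of dL/dW1, used only as a check."""
    W1, b1, W2, b2 = params
    grad = np.zeros_like(W1)
    for i in range(W1.shape[0]):
        for j in range(W1.shape[1]):
            W_plus, W_minus = W1.copy(), W1.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (loss((W_plus, b1, W2, b2), x, y)
                          - loss((W_minus, b1, W2, b2), x, y)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(4)
params = (rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(1, 4)), rng.normal(size=1))
x, y = np.array([1.0, 0.5, -2.0]), np.array([1.0])

dW1, db1, dW2, db2 = backprop(params, x, y)
print(np.allclose(dW1, numerical_grad_W1(params, x, y), atol=1e-7))
```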

## Random Initialization

![alt text](rand_init1.png "rand_init1")

![alt text](rand_init2.png "rand_init2")

- If one initializes the weight matrices to zeros, then training will only ever produce weight matrices that have identical rows. We want our neural nets to be capable of producing more general functions, so we randomly initialize the parameters to "break symmetry."

- We don't want the initial random weights to be too large, though, since that could push the activations into regions where the derivative is near zero, making gradient descent inefficient (the parameters are shifted by an amount proportional to the derivatives of the cost function, which we know from backpropagation depend on the derivatives of the activation functions).

- Discussion question: Why is it okay to initialize the bias as zero? 

- Discussion question: Is there still a problem if the weights are all initially set to a non-zero but constant value? 
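
The first two points above can be probed with a small experiment. This is a sketch under several assumptions that are not from the course: sigmoid hidden units (so that zero-initialized weights actually move during training), a linear output, mean squared error, plain gradient descent, synthetic data, and the hypothetical helpers `train` and `rows_identical`. It trains the same 3 -> 4 -> 1 network from zero weights and from small random weights, then prints the tanh derivative at a small and a large input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, W2, X, Y, steps=200, lr=0.1):
    """Gradient descent on a 3 -> 4 -> 1 network; returns the trained W1.

    Hidden units are sigmoid, the output is linear, biases start at zero,
    and the cost is mean squared error over the columns of X.
    """
    W1, W2 = W1.copy(), W2.copy()
    b1, b2 = np.zeros((W1.shape[0], 1)), np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        A1 = sigmoid(W1 @ X + b1)               # (4, m)
        A2 = W2 @ A1 + b2                       # (1, m)
        dZ2 = (A2 - Y) / m
        dZ1 = (W2.T @ dZ2) * A1 * (1.0 - A1)    # chain rule through the hidden layer
        W2 -= lr * dZ2 @ A1.T
        b2 -= lr * dZ2.sum(axis=1, keepdims=True)
        W1 -= lr * dZ1 @ X.T
        b1 -= lr * dZ1.sum(axis=1, keepdims=True)
    return W1

def rows_identical(W):
    """True if every row of W matches the first row (up to rounding)."""
    return bool(np.allclose(W, W[0]))

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 20))                     # 20 synthetic examples as columns
Y = np.linalg.norm(X, axis=0, keepdims=True)     # target: distance from the origin

W1_zero = train(np.zeros((4, 3)), np.zeros((1, 4)), X, Y)
W1_rand = train(0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(1, 4)), X, Y)
print(rows_identical(W1_zero), rows_identical(W1_rand))

# Saturation: the tanh derivative 1 - tanh(z)^2 is close to 1 for small |z|
# and close to 0 for large |z|.
for z in (0.1, 10.0):
    print(z, 1.0 - np.tanh(z) ** 2)
```

With zero initialization every hidden unit receives the same gradient at every step, so the rows of the first weight matrix move together; the small random start lets them diverge.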

## Example: Can the 2-layer, 4+1 neuron network learn 3D distance?

In [47]:
import tensorflow as tf
import keras
import numpy as np

x_train = [np.array([0, 0, 0]), np.array([1, 0, 0]), np.array([0, 1, 0]), np.array([0, 0, 1]), np.array([1, 1, 1])]  # 3D training points
y_train = np.array([0, 1, 1, 1, np.sqrt(3)])  # Euclidean distance of each point from the origin

x_test = [np.array([2, 0, 0]), np.array([1, 3, 5]), np.array([9.1, np.pi, -2])]  # points farther out than any training point
y_test = np.array([2, np.sqrt(35), np.sqrt(9.1**2 + np.pi**2 + 4)])

model = keras.Sequential([
    keras.layers.Dense(4, input_shape = np.shape(x_train[0]), activation=tf.nn.tanh),  # hidden layer: 3 inputs -> 4 units
    keras.layers.Dense(1, activation=tf.nn.tanh)  # output layer: tanh is bounded in (-1, 1), so targets above 1 are out of reach
])

model.compile(optimizer='adam',
             loss='mean_squared_error',
             metrics=['accuracy'])  # 'accuracy' is a classification metric; it is not informative for this regression target

model.fit(np.array(x_train), y_train, epochs=10, batch_size=1)
test_loss, test_acc = model.evaluate(np.array(x_test), y_test)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [48]:
print(test_loss, test_acc)

37.925724029541016 0.0


In [49]:
model.predict(np.array(x_test))

array([[0.51054347],
       [0.88064134],
       [0.54801834]], dtype=float32)

In [45]:
y_test

array([2.        , 5.91607978, 9.83257873])

In [52]:
model.predict(np.array(x_train))

array([[0.09721245],
       [0.4206852 ],
       [0.470064  ],
       [0.42558667],
       [0.7324475 ]], dtype=float32)

In [51]:
y_train

array([0.        , 1.        , 1.        , 1.        , 1.73205081])