# Creating a Small 2-Layer Neural Network
Please note that when counting neural network layers, you don't count the input layer.
Also, please note that a[0], a[1] and a[2] are used to refer to the depth (or the layer) of the network in question.
A3[1] will be the third node in the first hidden layer.

## Activation Functions
Recursive network uses sigmoid()
Instead of using sigmoid, there are other activation functions which are popular choices. 

While sigmoid is equal to 1 / (1 + e^-z), the TanH function almost always works better, where

A = tanh(z)

which is equal to:

A = (e^z - e^-z) / (e^z + e^-z)

This is effectively a shifted and rescaled sigmoid function, which goes between -1 and 1 where sigmoid goes between 0 and 1.

For hidden layers, the use of TanH is almost always better than using sigmoid.
This is in part because TanH outputs are centered around 0, so the mean activation
passed to the next layer is close to 0, which has an effect similar to centering the data.
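
To make the comparison concrete, here is a minimal sketch of both functions in NumPy. The helper name `sigmoid` and the input range are assumptions chosen for illustration. It checks the identity tanh(z) = 2 * sigmoid(2z) - 1 and compares the mean output of each function on inputs that are symmetric around 0.

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^-z), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Inputs symmetric around 0 (illustrative range)
z = np.linspace(-5, 5, 101)
a_sigmoid = sigmoid(z)
a_tanh = np.tanh(z)

# tanh is a shifted and rescaled sigmoid
print(np.allclose(a_tanh, 2 * sigmoid(2 * z) - 1))

# tanh outputs average to about 0, sigmoid outputs to about 0.5
print(a_tanh.mean(), a_sigmoid.mean())
```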


## Exception: The Output Layer
For the final "output" layer, a sigmoid function usually works better, as it outputs a number between 0 and 1, which can be read as a probability in binary classification problems.
So usually TanH is used for the hidden layers while sigmoid is used for the output layer.
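
A forward pass with this combination might look like the sketch below. The layer sizes, the variable names (`W1`, `b1`, `W2`, `b2`) and the convention that each column of `X` is one example are assumptions for illustration, not something fixed by these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5  # hypothetical sizes: features, hidden units, examples

X = rng.standard_normal((n_x, m))           # a[0]: the input layer
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

# Hidden layer uses tanh, so a[1] lies in (-1, 1)
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)

# Output layer uses sigmoid, so a[2] lies in (0, 1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

print(A1.shape, A2.shape)
```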

## Problems with Both Sigmoid and TanH
When Z is either very large or very negative, the derivative (slope) of the activation function becomes very small (almost 0), which can make training slow down.
To address this particular problem, another activation function is often used.
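
The saturation is easy to see numerically. The sketch below uses the standard derivatives, sigmoid'(z) = a(1 - a) and tanh'(z) = 1 - tanh(z)^2; the function names and the sample values of z are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of sigmoid: a * (1 - a)
    a = sigmoid(z)
    return a * (1.0 - a)

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

# Slopes are largest at z = 0 and shrink toward 0 as |z| grows
for z in [0.0, 2.0, 10.0, -10.0]:
    print(z, sigmoid_grad(z), tanh_grad(z))
```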

## ReLU - Rectified Linear Unit
This is perhaps the most commonly used activation function these days

A = max(0, z)

This has a derivative of 0 when Z is below 0 and a derivative of 1 when Z is positive.
One problem with this is that when Z is equal to exactly 0, the slope is not well defined, but in practice this occurs rarely.
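
A minimal NumPy sketch of the function and its slope follows. The function names are illustrative, and treating the slope at exactly 0 as 0 is an assumed convention (choosing 1 would work as well).

```python
import numpy as np

def relu(z):
    # A = max(0, z), applied element-wise
    return np.maximum(0, z)

def relu_grad(z):
    # 1 where z > 0, else 0 (assumed convention: slope 0 at z == 0)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
print(relu_grad(z))
```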

## Leaky ReLU

A = max(0.01*z, z)

Very similar to ReLU, but produces a very small derivative (0.01) when Z is negative instead of 0.
Usually performs better than ReLU in practice.
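
As a sketch, the leaky variant only changes what happens on the negative side. The parameter name `slope` and its default of 0.01 mirror the formula above; the function names are assumptions for illustration.

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    # A = max(slope * z, z); valid for 0 < slope < 1
    return np.maximum(slope * z, z)

def leaky_relu_grad(z, slope=0.01):
    # 1 where z > 0, else the small negative-side slope
    return np.where(z > 0, 1.0, slope)

z = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(z))
print(leaky_relu_grad(z))
```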

The advantage of both the Leaky ReLU and standard ReLU functions is that they allow the network to train much faster than the other activation functions.

## Linear Activation Function (also known as the identity activation function)
This is equal to:

A = Z

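This does not work and cannot be used for the hidden layers of a network, as this would defeat the whole purpose of those layers: a stack of linear layers is itself just one linear function of the input, so the hidden layers add nothing.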
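
The collapse can be checked directly. In the sketch below (sizes and variable names are assumptions for illustration), two layers with A = Z give exactly the same output as a single layer with weights W2 W1 and bias W2 b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))

W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4))
b2 = rng.standard_normal((1, 1))

# Two layers with identity activation
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))
```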

One use-case is to use this in the final output layer when the network is attempting to calculate a real number (such as the price of a house for instance)

## Initializing Weights
In neural networks it is important to randomly initialize the weights (unlike in logistic regression, where the weights can be initialized as 0).

You are able to initialize the bias values as 0 and that is fine, but not the weights.

The problem with initializing the weights as 0 is that your A1[1] and A2[1] values will be the same (as they are calculating the same thing).

Furthermore, when calculating the derivatives, dz1[1] will be equal to dz2[1].

So effectively, you are flattening all your nodes within that layer to a single node. The nodes will be referred to as being **symmetric**.

My intuition is that this would be true if the weights were all assigned the same value (say 0.25 or 1.0) as well as with 0

#### This is called the **symmetry breaking** problem
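
The sketch below illustrates the effect with a constant initial value of 0.25 rather than 0. The data, layer sizes, variable names and the use of a binary cross-entropy loss (which gives dZ2 = A2 - Y) are all assumptions for illustration. Both hidden nodes produce the same activations and receive the same gradients, so a gradient step would leave them identical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))        # 2 features, 5 examples
Y = rng.integers(0, 2, size=(1, 5))    # made-up binary labels
m = X.shape[1]

# Every weight gets the same value: the nodes are symmetric
W1 = np.full((2, 2), 0.25)
b1 = np.zeros((2, 1))
W2 = np.full((1, 2), 0.25)
b2 = np.zeros((1, 1))

# Forward pass (tanh hidden layer, sigmoid output)
A1 = np.tanh(W1 @ X + b1)
A2 = sigmoid(W2 @ A1 + b2)

# Backward pass
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
dW1 = dZ1 @ X.T / m

print(np.allclose(A1[0], A1[1]))    # same activations
print(np.allclose(dZ1[0], dZ1[1]))  # same dz
print(np.allclose(dW1[0], dW1[1]))  # same weight updates
```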

Typically what you want to do is generate the weights using a noise function and then multiply the value by a small number so the weights are close to 0, but not 0, and different from each other.

In [3]:
import numpy as np

w1 = np.random.randn(2,2) * 0.01
print(w1)

[[ 0.00882016  0.01501583]
 [-0.00329492  0.00791642]]


### Cool, but why multiply by 0.01 and not a number like 10 or 100?

Well, because we are usually using a sigmoid or TanH function, we want the initial values of Z to be close to 0, as that is where the slope is the greatest. Large weights would produce large values of Z, where the slope is close to 0, so the network would train more slowly at first.
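
A rough way to see this is to compare the average TanH slope for the same random weights scaled by 0.01 and by 100. The input data, sizes and names below are assumptions for illustration, and the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 1000))    # made-up inputs: 2 features, 1000 examples
noise = rng.standard_normal((2, 2))   # same noise, different scales

mean_slope = {}
for scale in [0.01, 100]:
    W1 = noise * scale
    Z1 = W1 @ X
    # Slope of tanh at each pre-activation value
    slope = 1 - np.tanh(Z1) ** 2
    mean_slope[scale] = slope.mean()
    print(scale, mean_slope[scale])
```

With the small scale the mean slope stays close to 1, while with the large scale most units are saturated and the mean slope is much smaller.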

There are times when other values can work better. 

## Dimensions of Matrices

For any variable foo, the derivative of foo (dfoo) has the same dimensions as foo.
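
The sketch below checks this kind of shape bookkeeping for one layer: the gradients `dW1` and `db1` come out with the same shapes as `W1` and `b1`. The sizes, the names and the random stand-in for `dZ1` are assumptions for illustration; `keepdims=True` keeps `db1` as a column instead of a flat array.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5  # hypothetical sizes: features, hidden units, examples

X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))

# Stand-in for the gradient flowing back into layer 1
dZ1 = rng.standard_normal((n_h, m))

dW1 = dZ1 @ X.T / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

print(W1.shape, dW1.shape)
print(b1.shape, db1.shape)
```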

In [5]:
x = np.random.randn(3, 2)
y = np.sum(x, axis=0, keepdims=True)
print(y.shape)

(1, 2)


In [7]:
x = np.random.randn(8, 1)
y = x.reshape(2, 2, 2)
print(y)

[[[ 1.15001352 -0.10311311]
  [ 0.95117643 -0.12162957]]

 [[ 0.57263723 -0.12357659]
  [ 0.83648332  1.61872106]]]


In [8]:
a = np.random.randn(3, 4)
b = np.random.randn(1, 4)
c = a + b
print(c.shape)

(3, 4)


In [10]:
d = np.random.randn(1, 3)
e = np.random.randn(3, 3)
f = d * e
print(f.shape)

(3, 3)


In [12]:
a = np.random.randn(12288, 150)
b = np.random.randn(150, 45)
c = np.dot(a, b)
print(c.shape)

(12288, 45)
