## Deep Neural Networks
Logistic regression can be viewed as a neural network with no hidden layers: a single output unit applied directly to the input features.

There is no strict cutoff, but networks with roughly 3 or more hidden layers are usually called deep.


### How Deep does my network need to be?

## Notations for Deep Networks

**L** is the number of layers in the network

**n[l]** is the number of units in layer l

n[1] is the number of nodes in the first layer (After the input data)

n[0] is equal to Nx, the number of input features fed into the network

X is the matrix of input features fed into the network

X = a[0]

Yhat = a[L]

m is the number of training examples

a[l] = output (activation) of the **l**-th layer

z[l] = weighted input to the activation function of the **l**-th layer, computed before the activation is applied

J() = the cost function for the network

## Parameter Dimensions

It is critical to understand the expected dimensions of your parameters when coding neural networks. Improper vector and matrix sizes are the source of many bugs, especially for beginners. Here are some guidelines for how to think about the dimensions of the various parameters.

### Z and A
**z[l], a[l] = (n[l], 1)**

**Z[l], A[l] = (n[l], m)**

The dimensions of z[l] and a[l] are (n[l], 1), where n[l] is the number of neurons in the l-th layer. This is because z[l] and a[l] hold one value per neuron in the l-th layer for a single training example, stacked into a column vector with n[l] rows.

The dimensions of Z[l] and A[l] are (n[l], m), where m is the number of examples in the dataset. This is because Z[l] and A[l] represent the activations of all neurons in the l-th layer for all examples in the dataset. In other words, each column of Z[l] and A[l] represents the activation of all neurons in the l-th layer for a single example in the dataset.

It's worth noting that these dimensions may vary depending on the specific architecture of the neural network and the conventions used in its implementation. However, the dimensions provided are common conventions that are often used in neural network literature.

**dZ[l], dA[l] = (n[l], m)**

### How About W?

**W[l] = (n[l], n[l - 1])**

The shape of W[l] is (n[l], n[l-1]), where n[l] is the number of neurons in the l-th layer, and n[l-1] is the number of neurons in the previous (l-1)-th layer. This is because each neuron in the l-th layer is connected to all neurons in the previous (l-1)-th layer via a weight parameter, resulting in a weight matrix of shape (n[l], n[l-1]).

### X 

**x = (n[0], 1)**

**X = (n[0], m)**

X corresponds to the input data and therefore to layer 0, the input layer of the network. A single example x has n[0] rows, where n[0] is the number of input features, and stacking the m training examples as columns gives X.

### b

**b[l] = (n[l], 1)**

Likewise, because the bias b[l] is added to the weighted input of each node, it contains n[l] rows, where each row is the bias for a particular node in the layer. When m examples are processed at once, the same column is added to every column of Z[l].
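The shape rules in this section can be checked mechanically. The sketch below is only an illustration: the layer sizes, the `init_params` helper, and the use of tanh are assumptions chosen for the example, not part of the notes.

```python
import numpy as np

def init_params(layer_dims, seed=0):
    """Return {'W1': ..., 'b1': ..., ...} for layer sizes [n0, n1, ..., nL]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # W[l] has shape (n[l], n[l-1]); small random values break symmetry.
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        # b[l] has shape (n[l], 1) and broadcasts across the m columns.
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

layer_dims = [4, 5, 3, 1]   # n[0] = 4 input features, two hidden layers, one output unit
m = 10                      # number of training examples
params = init_params(layer_dims)

A = np.ones((layer_dims[0], m))   # A[0] = X, shape (n[0], m)
shapes = []
for l in range(1, len(layer_dims)):
    # (n[l], n[l-1]) @ (n[l-1], m) + (n[l], 1) -> (n[l], m)
    Z = params[f"W{l}"] @ A + params[f"b{l}"]
    assert Z.shape == (layer_dims[l], m)
    A = np.tanh(Z)                # the activation keeps the shape of Z
    shapes.append(Z.shape)
```

Placing assertions like the one inside the loop next to each matrix product is a cheap way to catch a transposed W or a bias with the wrong number of rows early.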



## Why go Deep? Intuition about Networks
Suppose we are building a neural network for face detection. The network can be conceptualized as a series of layers, where each layer processes the output of the previous layer to extract increasingly complex features from the input image.

In the initial layers, the network may learn to identify simple patterns such as edges or corners in the image by examining a small region of pixels at a time. As the information is passed to the subsequent layers, the network begins to combine these simple patterns to form more complex features, such as face-like shapes with rough approximations of eyes, noses, or ears.

In the later layers, the network can further refine these features and start to recognize different face configurations by gathering together these features in a hierarchical manner. By examining a larger and larger part of the original image with each subsequent layer, the network can capture increasingly complex and abstract representations of the input.

It's worth noting that this process of feature extraction is not limited to face detection but is a general principle that underlies the operation of many types of neural networks. Additionally, the specific architectures and techniques used to implement these networks can vary widely and can have a significant impact on their performance.

## Circuit Theory and Deep Learning
Circuit theory provides a mathematical framework for understanding the behavior of neural networks with many hidden layers, and can help explain why deep networks often outperform shallow networks with a similar number of parameters.

One key insight from circuit theory is that adding more layers to a neural network increases the effective "depth" of the composite function that the network computes. Each layer of the network applies a nonlinear transformation to its inputs, which can be viewed as a type of filter that extracts useful features from the input data. By stacking many of these filters together, a deep network can extract increasingly complex and abstract features from the input, which can help improve the network's ability to learn complex patterns and relationships.

In contrast, a shallow network with a similar number of parameters would need to rely on fewer, more complex filters to extract useful features, which can lead to overfitting or underfitting of the training data. By adding more layers, a deep network can learn a more hierarchical representation of the input data, which can help it generalize better to unseen data and improve its performance on a variety of tasks.

The circuits in question are Boolean logic circuits (networks of AND, OR, NOT and XOR gates), not electrical ones. The standard illustration is the parity of n input bits: a tree of XOR gates computes it with a depth on the order of log n and about n gates in total, whereas a network limited to a single hidden layer needs on the order of 2^(n-1) hidden units, because it has to account for nearly every input combination separately. This result concerns what a network can represent compactly. It says nothing about how the weights are found, which is still done with backpropagation and gradient descent.

In summary, circuit theory gives a theoretical reason why deep networks with many hidden layers can outperform shallow networks that have fewer hidden layers but more nodes in each layer: some functions that a deep network computes with few units require exponentially many units in a shallow one.
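Arguments of this kind are usually illustrated with the parity (XOR) of n bits. The sketch below counts the depth and the number of 2-input XOR gates in a balanced tree and compares them with the commonly cited 2^(n-1) hidden units for a one-hidden-layer network. The `xor_tree` helper is hypothetical, it assumes a non-empty input, and it treats each XOR gate as a single unit even though a real network would need a few neurons per gate.

```python
from functools import reduce
from operator import xor

def xor_tree(bits):
    """Compute parity with a balanced tree of 2-input XOR gates.

    Returns (parity, depth, gate_count). Assumes bits is non-empty.
    """
    level = list(bits)
    depth = 0
    gates = 0
    while len(level) > 1:
        nxt = []
        # Pair up neighbours; each pair costs one XOR gate.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])
            gates += 1
        # An unpaired last element is carried to the next level unchanged.
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth, gates

bits = [1, 0, 1, 1, 0, 0, 1, 0]
parity, depth, gates = xor_tree(bits)

# Sanity check against a plain left-to-right XOR.
assert parity == reduce(xor, bits)

# Commonly cited size of a single-hidden-layer solution for n-bit parity.
shallow_units = 2 ** (len(bits) - 1)
```

For these 8 bits the tree needs 3 levels and 7 gates, while the shallow count is 128. The gap widens exponentially as n grows.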

## Forward Propagation
Note that x in the first equation is the same as A[0], so the first layer follows exactly the same pattern as every other layer

Z[1] = W[1]x + b[1]

A[1] = g[1](Z[1])

Z[2] = W[2]A[1] + b[2]

A[2] = g[2](Z[2])

Z[3] = W[3]A[2] + b[3]

A[3] = g[3](Z[3])

etc...

In general, the computation is the same for every layer l, where
- Input = a[l - 1]
- Output = a[l], cache z[l], W[l], b[l]
- Calculate z[l] = W[l] * a[l - 1] + b[l]
- a[l] = g[l](z[l])
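The per-layer step above can be sketched in NumPy as follows. The function names, the choice of ReLU and sigmoid, and the hand-picked weights are assumptions made for the example. The cache also stores a[l - 1], which is an addition here because the backward pass needs it.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def layer_forward(A_prev, W, b, g):
    """One forward step: returns A[l] and a cache for backpropagation."""
    Z = W @ A_prev + b          # Z[l] = W[l] A[l-1] + b[l]
    A = g(Z)                    # A[l] = g[l](Z[l])
    cache = (A_prev, W, b, Z)
    return A, cache

def model_forward(X, params, activations):
    """Run every layer in order; params maps 'W1', 'b1', ... to arrays."""
    A = X                       # A[0] = X
    caches = []
    for l, g in enumerate(activations, start=1):
        A, cache = layer_forward(A, params[f"W{l}"], params[f"b{l}"], g)
        caches.append(cache)
    return A, caches

# Tiny 2-layer example: n[0] = 2 features, m = 2 examples (one per column).
X = np.array([[1.0, -1.0],
              [2.0,  0.5]])
params = {
    "W1": np.array([[1.0, 0.0], [0.0, -1.0]]), "b1": np.zeros((2, 1)),
    "W2": np.array([[1.0, 1.0]]),              "b2": np.zeros((1, 1)),
}
Y_hat, caches = model_forward(X, params, [relu, sigmoid])
```

Passing the activation in as a function keeps `layer_forward` identical for every layer, which mirrors the point that only the index l changes from one layer to the next.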

## Backwards Propagation
As with forward propagation, backward propagation applies the same basic logic to every layer, this time moving from the last layer back to the first
- Input = da[l]
- Output = da[l - 1], dW[l], db[l]
- dz[l] = da[l] * g[l]'(z[l]), where * is an element-wise product
- dW[l] = dz[l] * a[l - 1].T
- db[l] = dz[l]
- da[l - 1] = W[l].T * dz[l]
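The list above gives the single-example form. A vectorized sketch over m examples is shown below. It assumes the cost is an average over the examples, which is where the division by m in dW and db comes from. The function names and the choice of ReLU are assumptions for the example.

```python
import numpy as np

def relu_grad(Z):
    """Derivative of ReLU, evaluated element-wise."""
    return (Z > 0).astype(float)

def layer_backward(dA, cache, g_prime):
    """One backward step for a batch of m examples."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                          # element-wise product
    dW = (dZ @ A_prev.T) / m                      # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m    # same shape as b
    dA_prev = W.T @ dZ                            # same shape as A_prev
    return dA_prev, dW, db

# Layer with n[l-1] = 3 inputs, n[l] = 2 units, m = 2 examples.
A_prev = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
W = np.array([[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]])
b = np.zeros((2, 1))
Z = W @ A_prev + b
dA = np.ones_like(Z)            # pretend gradient arriving from the next layer

dA_prev, dW, db = layer_backward(dA, (A_prev, W, b, Z), relu_grad)
```

Each gradient has the same shape as the quantity it belongs to, which is a quick sanity check when debugging.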

## Understanding dz[l]
In a neural network, the term "dz[l]" is shorthand for the derivative of the cost function J with respect to z[l], the pre-activation value of the l-th layer. It is not the derivative of the activation function by itself, although that derivative is one of its factors.

To understand this, let's start with how a layer computes its output. The linear part of the l-th layer is:

**z[l] = W[l] * a[l-1] + b[l]**

where W[l] is the weight matrix for the l-th layer, a[l-1] is the output of the (l-1)-th layer (i.e., the input to the l-th layer), and b[l] is the bias vector for the l-th layer.

Once z[l] is computed, it is passed through the activation function g[l] to produce the output of the layer:

**a[l] = g[l](z[l])**

During backpropagation, we need to compute the gradients of the cost function with respect to the parameters of the network (i.e., the weights and biases). Every one of those gradients passes through z[l], which is why dz[l] is computed first.

To compute these gradients, we use the chain rule of differentiation. The chain rule tells us that if a variable y depends on a variable x, which in turn depends on a variable z, then the derivative of y with respect to z can be computed as:

**dy/dz = dy/dx * dx/dz**

In the case of a neural network, the cost J depends on a[l], which depends on z[l], which in turn depends on W[l], b[l] and a[l-1], and a[l-1] depends on the layers before it. So to compute the gradients of the cost function with respect to the parameters of the network, we apply the chain rule one link at a time, from the output layer back towards the input.

This is where dz[l] comes in. Applying the chain rule to the link from a[l] to z[l] gives:

**dz[l] = da[l] * g[l]'(z[l])**

where da[l] is the derivative of J with respect to a[l], g[l]'(z[l]) is the derivative of the activation function evaluated at z[l], and * is an element-wise product. Once we have computed dz[l], the remaining gradients follow directly: dW[l] = dz[l] * a[l-1].T, db[l] = dz[l], and da[l-1] = W[l].T * dz[l].

In summary, dz[l] is an important quantity in computing the gradients during backpropagation in a neural network. It represents the derivative of the cost function with respect to the pre-activation z[l] of the l-th layer, and it is the starting point for the gradients of the cost function with respect to that layer's weights and biases.
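One way to build confidence in a dz[l] computation is to compare it with a numerical derivative of the cost with respect to z. The sketch below does this for a single sigmoid layer. The squared-error cost and the sample values are assumptions chosen only to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(z, y):
    """Squared-error cost on a = sigmoid(z), used here purely for illustration."""
    a = sigmoid(z)
    return 0.5 * np.sum((a - y) ** 2)

z = np.array([[0.3], [-1.2], [2.0]])
y = np.array([[1.0], [0.0], [1.0]])

a = sigmoid(z)
da = a - y                    # dJ/da for the squared-error cost
dz = da * a * (1 - a)         # dz = da * g'(z), with g'(z) = a(1 - a) for sigmoid

# Central finite differences approximate dJ/dz one component at a time.
eps = 1e-6
dz_numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    step = np.zeros_like(z)
    step[i, 0] = eps
    dz_numeric[i, 0] = (cost(z + step, y) - cost(z - step, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(dz - dz_numeric))
```

If the analytic and numeric values agree to many decimal places, the chain-rule expression is very likely implemented correctly.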

## Parameters vs Hyper-Parameters
Parameters are the values learned by the algorithm during training and include:
- W
- b

Hyper-parameters are settings which you must choose yourself before training. They control how W and b are learned and include:
- learning_rate (alpha)
- number of iterations 
- number of hidden layers (which sets L)
- number of nodes within each layer (n[1], n[2], etc)
- choice of activation function
- momentum
- mini-batch size
- regularization parameters

Because they are settings that determine the final values of the parameters, they are called hyper-parameters
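The distinction can be seen in even the smallest training loop. In the sketch below, `learning_rate` and `num_iterations` are fixed by hand, while `w` and `b` are updated by gradient descent. The single linear unit and the toy data from y = 2x + 1 are assumptions made to keep the example tiny.

```python
import numpy as np

# Hyper-parameters: chosen by hand before training starts.
learning_rate = 0.5
num_iterations = 1000

# Toy data generated from y = 2x + 1 (an assumption made for this sketch).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0

# Parameters: start at zero and are updated by the algorithm.
w, b = 0.0, 0.0
for _ in range(num_iterations):
    y_hat = w * x + b
    dw = np.mean((y_hat - y) * x)   # gradient of 0.5 * mean squared error w.r.t. w
    db = np.mean(y_hat - y)         # gradient of the same cost w.r.t. b
    w -= learning_rate * dw
    b -= learning_rate * db
```

Changing `learning_rate` or `num_iterations` changes how close `w` and `b` get to 2 and 1, which is the sense in which hyper-parameters affect the parameters.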

## What does Deep Learning have to do with the brain?
Not much really

## Training / Dev / Test sets
- The training set is used to fit the parameters and is always the largest of the three
- The dev set is what you use to quickly compare the performance of different network configurations
- The test set is held-out data, used at the end to estimate the performance of the final model
- You don't really need a test set unless you need an unbiased estimate of that performance
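A minimal way to produce the three sets is to shuffle the examples once and slice. The sketch below assumes the examples are stored as columns, as elsewhere in these notes. The `split_dataset` helper and the 80/10/10 fractions are assumptions for the example, not a recommendation.

```python
import numpy as np

def split_dataset(X, Y, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the m columns of X and Y, then cut them into train/dev/test."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)            # random order of example indices
    n_dev = int(m * dev_frac)
    n_test = int(m * test_frac)
    dev_idx = order[:n_dev]
    test_idx = order[n_dev:n_dev + n_test]
    train_idx = order[n_dev + n_test:]    # everything left over is training data
    return ((X[:, train_idx], Y[:, train_idx]),
            (X[:, dev_idx], Y[:, dev_idx]),
            (X[:, test_idx], Y[:, test_idx]))

X = np.arange(200, dtype=float).reshape(2, 100)   # n[0] = 2 features, m = 100
Y = (np.arange(100) % 2).reshape(1, 100)
(train_X, train_Y), (dev_X, dev_Y), (test_X, test_Y) = split_dataset(X, Y)
```

Using the same index arrays for X and Y keeps each label attached to its example after shuffling.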

## Bias and Variance
- High bias (underfitting) is poor performance on the training set data
- High variance (overfitting) is performance on held-out data (the dev or test set) that is much worse than on the training set data

Solutions for High Bias
- larger network
- longer training sessions
- (research new NN configurations)

Solutions for High Variance
- more training data
- regularization
- (research new NN configurations)
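The diagnosis can be written as a small rule of thumb on the training and dev error rates. The `diagnose` helper, the `target_error` baseline, and the 0.02 tolerance below are hypothetical values for illustration. Sensible thresholds depend on the task.

```python
def diagnose(train_error, dev_error, target_error=0.0, tolerance=0.02):
    """Label a model as having high bias, high variance, both, or neither.

    target_error and tolerance are hypothetical knobs for this sketch.
    """
    issues = []
    if train_error - target_error > tolerance:
        issues.append("high bias")        # struggles even on the training data
    if dev_error - train_error > tolerance:
        issues.append("high variance")    # much worse on data it was not trained on
    return issues or ["ok"]

underfit = diagnose(train_error=0.15, dev_error=0.16)
overfit = diagnose(train_error=0.01, dev_error=0.11)
both = diagnose(train_error=0.15, dev_error=0.30)
healthy = diagnose(train_error=0.005, dev_error=0.01)
```

The labels map onto the remedies listed above: "high bias" points to a larger network or longer training, and "high variance" points to more data or regularization.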