# Neural Nets
## Motivation
* Most real-world problems involve a large number of features:
  * Say we start with 100 original features. If we add quadratic, cubic or higher-order polynomial features, the feature count grows very quickly.
  * For 100 original features, including all polynomial terms up to 6th order gives on the order of 10^9 features.
  * Fitting this many features with linear or logistic regression is prohibitively expensive.
  * Neural nets offer a different way of learning that can offset the problem of a large number of features.
		
## Model Representation
![Neural Network](img/NeuralNet.png)
### Terminologies
$$a_{i}^{(j)} \textrm{ - "activation" of unit } i \textrm{ in layer } j$$
$$\Theta^{(j)} \textrm{ - matrix of weights controlling the function mapping from layer } j \textrm{ to layer } j + 1$$

Hence
$$a_{1}^{(2)}=g(\Theta_{10}^{(1)}x_{0}+\Theta_{11}^{(1)}x_{1}+\Theta_{12}^{(1)}x_{2}+\Theta_{13}^{(1)}x_{3})=g(z_{1}^{(2)})$$
$$a_{2}^{(2)}=g(\Theta_{20}^{(1)}x_{0}+\Theta_{21}^{(1)}x_{1}+\Theta_{22}^{(1)}x_{2}+\Theta_{23}^{(1)}x_{3})=g(z_{2}^{(2)})$$
$$a_{3}^{(2)}=g(\Theta_{30}^{(1)}x_{0}+\Theta_{31}^{(1)}x_{1}+\Theta_{32}^{(1)}x_{2}+\Theta_{33}^{(1)}x_{3})=g(z_{3}^{(2)})$$
$$h_{\Theta}(x)=a_{1}^{(3)}=g(\Theta_{10}^{(2)}a_{0}^{(2)}+\Theta_{11}^{(2)}a_{1}^{(2)}+\Theta_{12}^{(2)}a_{2}^{(2)}+\Theta_{13}^{(2)} a_{3}^{(2)})=g(z_{1}^{(3)})$$
$$\Theta^{(j)} \textrm{ therefore is an } s_{j+1} \times (s_{j} + 1) \textrm{ dimensional matrix, where } s_{j} \textrm{ is the number of units in layer } j$$
The extra column accounts for the bias unit of layer j.
	
### Vectorized implementation (forward propagation)
$$a^{(1)}=x, \quad a_{0}^{(1)}=x_{0}=1$$
$$z^{(2)}=\Theta^{(1)} a^{(1)}$$
$$a^{(2)}=g(z^{(2)})$$

$$a_{0}^{(2)}=1$$
$$z^{(3)}=\Theta^{(2)} a^{(2)}$$
$$a^{(3)}=g(z^{(3)})$$

and so on…
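
The forward pass above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `forward_propagate`, the list-of-matrices layout for the weights, and the all-zero 3-3-1 example network are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """Return the activations of every layer for one input vector x.

    thetas is a list of weight matrices; each has one more column than the
    number of units in the layer it reads from (the bias column).
    """
    a = np.asarray(x, dtype=float)
    activations = []
    for theta in thetas:
        a = np.concatenate(([1.0], a))   # prepend the bias unit a_0 = 1
        activations.append(a)
        a = sigmoid(theta @ a)           # z = Theta a, then a = g(z)
    activations.append(a)                # output layer: no bias unit added
    return activations

# Hypothetical 3-3-1 network with all-zero weights (illustration only)
thetas = [np.zeros((3, 4)), np.zeros((1, 4))]
acts = forward_propagate([0.5, -1.0, 2.0], thetas)
h = acts[-1]   # with zero weights every unit outputs g(0) = 0.5
```

Storing the activations of every layer, rather than only the final output, is a deliberate choice here because back propagation needs them later.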

## Multiclass Classification
![Multiclass classification](img/NeuralNetClassification.png)

http://blog.davidsingleton.org/nnrccar/

### Training: Cost function
This is a generalization of the logistic regression cost function:

$$J(\Theta)=-\frac{1}{m}\Big[\sum_{i=1}^m\sum_{k=1}^K y_{k}^{(i)}\log\big(h_{\Theta}(x^{(i)})\big)_{k}+(1-y_{k}^{(i)})\log\Big(1-\big(h_{\Theta}(x^{(i)})\big)_{k}\Big)\Big]+\frac{\lambda}{2m} \sum_{l=1}^{L-1}\sum_{i=1}^{s_{l}}\sum_{j=1}^{s_{l+1}}\big(\Theta_{ji}^{(l)}\big)^2$$
Where
* K is the total number of output units; 1 for binary classification and ≥3 for multiclass classification
* L is the total number of layers in the neural net
* s<sub>l</sub> is the number of neurons in layer l, not counting the bias unit
* m is the number of training examples
* λ is the regularization parameter; the weights that multiply the bias units are not regularized
* h<sub>Θ</sub>(x) ∈ R<sup>K</sup>
* (h<sub>Θ</sub>(x<sup>(i)</sup>))<sub>k</sub> = k<sup>th</sup> output of the neural net for the i<sup>th</sup> training example
* y<sub>k</sub><sup>(i)</sup> = k<sup>th</sup> component of the target (label) vector for the i<sup>th</sup> training example
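
As a rough sketch, the cost above could be computed in NumPy as shown below. The function name `nn_cost`, the argument layout (outputs and one-hot labels stacked row-wise), and the tiny example values are assumptions for illustration; the first column of each weight matrix is assumed to hold the bias weights and is left out of the regularization term.

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Regularized cross-entropy cost.

    H: (m, K) network outputs, one row per training example
    Y: (m, K) one-hot labels
    thetas: list of weight matrices whose first column is the bias column
    lam: regularization parameter lambda
    """
    m = Y.shape[0]
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Skip column 0 so that the bias weights are not regularized
    reg_term = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term

# Hypothetical single example with K = 2 outputs
H = np.array([[0.5, 0.5]])
Y = np.array([[1.0, 0.0]])
thetas = [np.array([[1.0, 2.0]])]     # one bias weight, one regular weight
cost = nn_cost(H, Y, thetas, lam=1.0)  # 2*ln(2) from the data term + 2 from regularization
```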

### Training: Back propagation
$$\textrm{Let }\delta_{j}^{(l)} \textrm{ = "error" of node } j \textrm{ in layer } l$$
Therefore, for the output layer (layer 4 in a four-layer network)
$$\delta^{(4)}=a^{(4)}-y$$

For the other layers, the error is calculated as (with ⊙ denoting element-wise multiplication)
$$\delta^{(l)}=(\Theta^{(l)})^{T}\delta^{(l+1)}\odot g'(z^{(l)})$$

Each delta term is essentially a partial derivative of the cost with respect to the weighted input z of that node. The intuition is to compute the gradient of the cost to find out in which direction each weight has to move so as to reduce the cost. The g′ factor is the derivative of the activation function. With the sigmoid activation, its derivative is:
$$g'(z^{(l)})=a^{(l)}\odot(1-a^{(l)})$$

Note: There is no δ<sup>(1)</sup>, because that would signify an error in the input, which cannot exist.
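
The recurrences above can be put together for a single training example roughly as follows. This is a sketch under assumptions: the name `backprop_single`, the list-of-matrices weight layout, and the zero-weight 3-3-1 example are hypothetical, and the returned values are the per-example, unregularized gradients δ<sup>(l+1)</sup>(a<sup>(l)</sup>)<sup>T</sup>.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(x, y, thetas):
    """Return the per-example gradient of the unregularized cost for each Theta."""
    # Forward pass, storing each layer's activations with the bias unit prepended
    activations = []
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.concatenate(([1.0], a))
        activations.append(a)
        a = sigmoid(theta @ a)

    # Output-layer error: delta = a - y
    delta = a - np.asarray(y, dtype=float)
    grads = [None] * len(thetas)
    for l in range(len(thetas) - 1, -1, -1):
        # Gradient for this weight matrix: outer product of the next layer's
        # error with this layer's activations
        grads[l] = np.outer(delta, activations[l])
        if l > 0:   # no error term is computed for the input layer
            a_l = activations[l]
            # Theta^T delta, times the sigmoid derivative a * (1 - a);
            # the bias entry is dropped with [1:]
            delta = ((thetas[l].T @ delta) * a_l * (1 - a_l))[1:]
    return grads

# Hypothetical 3-3-1 network with all-zero weights (illustration only)
thetas = [np.zeros((3, 4)), np.zeros((1, 4))]
grads = backprop_single([0.5, -1.0, 2.0], [1.0], thetas)
```

With all-zero weights the output is 0.5, so the output error is −0.5 and the hidden-layer errors vanish, which makes the result easy to check by hand.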

### Random Initialization
* Initializing theta to zero, or to any single shared value, does not work for neural nets because
  * With the same weight everywhere, all the units in a layer compute the same activation
  * This implies that the deltas (errors) in back prop are also the same, so the weights stay identical after every update
  * So we essentially cut the neural net down to just one neuron per layer.
* To break symmetry, initialize each Θ<sub>ij</sub><sup>(l)</sup> to a random value in [−ε,ε]
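
A minimal sketch of this initialization in NumPy is shown below. The helper name `random_init`, the default ε = 0.12, and the fixed seed are assumptions chosen for the example, not values taken from these notes.

```python
import numpy as np

def random_init(s_in, s_out, epsilon=0.12, rng=None):
    """Return an (s_out, s_in + 1) weight matrix with entries drawn
    uniformly from [-epsilon, epsilon); the extra column is for the bias unit."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-epsilon, epsilon, size=(s_out, s_in + 1))

# Weights mapping a 3-unit layer to a 3-unit layer; seeded only for repeatability
theta1 = random_init(3, 3, rng=np.random.default_rng(0))
```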

## Architecture
1. Choose the number of inputs (based on number of features)
2. Choose the number of outputs (classes)
3. Choose the number of layers (typically 1 hidden) and number of neurons for each layer
  * If more than 1 hidden layer is used, typically all hidden layers have the same number of neurons
  * With respect to the number of hidden neurons, more is generally better, but this has to be weighed against the computational cost
  * The number of hidden neurons is typically comparable to the number of input features
4. Randomly initialize the weights
5. Loop over all training examples
  * Implement forward prop to find h<sub>Θ</sub>(x<sup>(i)</sup>) and all activations a<sup>(l)</sup> for each x<sup>(i)</sup>
  * Implement code to compute the cost function J(Θ)
  * Implement back prop to get δ<sup>(l)</sup> and compute the partial derivatives of J(Θ) with respect to Θ<sub>jk</sub><sup>(l)</sup>
6. Compute the D terms (the accumulated partial derivatives averaged over the training set), including the regularization terms.
7. (Optionally during debug) use gradient checking to ensure that the D terms are correct
8. Use an advanced optimization method to minimize J(Θ)
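
The gradient check in step 7 could look roughly like this. It is a sketch using central differences on a toy cost whose gradient is known; the name `numerical_gradient`, the step size `eps=1e-4`, and the toy cost are assumptions for illustration. In practice `cost_fn` would evaluate J(Θ) and `analytic` would hold the D terms from back prop.

```python
import numpy as np

def numerical_gradient(cost_fn, theta, eps=1e-4):
    """Approximate dJ/dtheta with central differences, one entry at a time."""
    grad = np.zeros_like(theta, dtype=float)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta, dtype=float)
        step[idx] = eps
        grad[idx] = (cost_fn(theta + step) - cost_fn(theta - step)) / (2 * eps)
    return grad

# Toy cost J(theta) = sum(theta^2), whose exact gradient is 2 * theta
cost = lambda t: np.sum(t ** 2)
theta = np.array([[1.0, -2.0], [0.5, 3.0]])
approx = numerical_gradient(cost, theta)
analytic = 2 * theta
max_diff = np.max(np.abs(approx - analytic))   # should be very small
```

Because this loop evaluates the cost twice per weight, it is slow and is usually switched off once the back prop gradients have been verified.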