# Week2 Notes

It's possible to cast logistic regression as a simple neural network. This is what we do here, along with introducing the forward pass and backward pass as part of the computation graph. Finally, we introduce a vectorized implementation and demonstrate its improved efficiency.


## Logistic Regression as a Neural Network
Casting logistic regression as a neural network.

### Binary Classification

Binary classification is the problem of finding a mapping from a feature vector $\mathbf{x}$ to a target $y$, where $y$ must be 1 or 0, using examples in a historical dataset.

The 'hello world' problem in machine learning (at least for computer vision) is one of learning to map an image ($x$) to a label: the image either has a cat in it ($y=1$) or it does not ($y=0$).

An image is an array of numbers (from 0 to 255) made up of 3 channels (one for each of Red, Green and Blue) across the height and width of the image. Each number is the intensity of one color channel at a given pixel location. We can think of this as a 3-dimensional array, $A$. The way to get a feature vector $x$ from $A$ is to unroll, unravel, or reshape $A$ into a vector.

For an image that is $64 \times 64$ pixels, there are $64 \times 64 \times 3 = 12288$ numbers that make up the feature vector. The length of the feature vector is referred to as $n = n_{X}$.
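
As a minimal sketch of the unrolling step, assuming NumPy and a randomly generated stand-in for a real image (the names `image`, `x`, and `n_x` are illustrative, not from the course code):

```python
import numpy as np

# Hypothetical stand-in for a 64 x 64 RGB image: integers in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Unroll the (height, width, channels) array into a single column vector.
x = image.reshape(-1, 1)

n_x = x.shape[0]   # length of the feature vector
print(x.shape)     # (12288, 1)
```

The exact order in which pixels are unrolled does not matter much, as long as the same order is used for every image.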

Let's say we have $m = m_{train}$ example images, some with cats in them and some without. Then we can define the data as a feature matrix $X$ and a target vector $Y$ as below:

$$
X = \left [ x^{(1)}, \ldots,  x^{(m)} \right ]
$$
$$
Y = \left [ y^{(1)}, \ldots,  y^{(m)} \right ]
$$

In this setting `X` is a matrix with `X.shape = (n, m)`, and `Y` is a row vector with `Y.shape = (1, m)`. This means that the individual $x^{(i)}$ are column vectors, stacked side by side as the columns of `X`.
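
A small sketch of this layout, assuming NumPy and made-up data (the values of `m`, `examples`, and `labels` are hypothetical), with each example stored as one column of `X`:

```python
import numpy as np

m = 5          # hypothetical number of training examples
n_x = 12288    # feature length for a 64 x 64 RGB image

rng = np.random.default_rng(0)
# Each example x^(i) is a column vector of shape (n_x, 1).
examples = [rng.random((n_x, 1)) for _ in range(m)]
labels = [1, 0, 0, 1, 1]   # made-up targets, one per example

# Stack the column vectors side by side, and the labels into a row vector.
X = np.hstack(examples)               # shape (n_x, m)
Y = np.array(labels).reshape(1, m)    # shape (1, m)

print(X.shape, Y.shape)
```

With this convention, `X[:, i]` holds the features of example `i` and `Y[0, i]` holds its label.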

### Logistic Regression

In the logistic regression model, we pass a linear combination of the features, $x$, as input to a non-linear function, in this case the sigmoid function. It produces an estimate of the probability that $y$ is 1, $\hat{y} = P(y=1|x)$, as opposed to explicitly trying to model the target as a function of $x$.

If we tried to model the target directly, we could not get a smooth function that maps only to 1 or 0. By using the sigmoid we ensure that the probability of $y$ being 1 is explicitly modelled, and bounded between 0 and 1. The sigmoid function is:

$$
\sigma(z) = \left( 1 + e^{-z} \right)^{-1}
$$


We will see later that for hidden layers it might make sense to allow the non-linear function, also called the activation function, to be a hyperparameter.

If the linear combination is a large positive number, then $P(y=1|x)$ gets very close to 1. Using a certain decision framework (maximum a posteriori, or MAP) we can predict that in such cases $y$ will be 1.

If the linear combination is a large negative number, then $P(y=1|x)$ gets very close to 0. Using the MAP framework we can predict that in such cases $y$ will be 0.
 
If the linear combination is close to 0, then $P(y=1|x)$ is close to 0.5. Under MAP we can still make a prediction based on which side of 0.5 the probability falls (predict 1 if it is above 0.5, predict 0 if it is below 0.5), but we can't be very sure about which it is.
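
The three regimes above can be checked numerically. This is a minimal sketch assuming NumPy; the function name `sigmoid` and the sample values of `z` are chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative, zero, and large positive linear combinations.
z = np.array([-10.0, 0.0, 10.0])
p = sigmoid(z)   # estimates of P(y=1|x): close to 0, exactly 0.5, close to 1

# Threshold at 0.5 to turn probabilities into 0/1 predictions.
# A probability of exactly 0.5 is mapped to 0 here, an arbitrary tie-break.
predictions = (p > 0.5).astype(int)
print(predictions)   # [0 0 1]
```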

#### with explicit bias `b`

Using parameters $\mathbf{w} = [w_1, \ldots, w_n]$ (a column vector) and $b$, a scalar bias parameter, we model $\hat{y}$ as $\hat{y} = \sigma(w^{T}x + b)$.
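
As a sketch of this model for a single example, assuming NumPy and small hand-picked values for `w`, `b`, and `x` (the helper name `predict_proba` is hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for column vectors w and x."""
    z = w.T @ x + b          # (1, n) @ (n, 1) gives shape (1, 1)
    return sigmoid(z).item()

# Made-up parameters and features with n = 3.
w = np.array([[0.5], [-0.25], [1.0]])
b = -0.5
x = np.array([[1.0], [2.0], [0.5]])

y_hat = predict_proba(w, b, x)
print(y_hat)   # 0.5, because w^T x + b = 0.5 - 0.5 + 0.5 - 0.5 = 0
```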

#### with implicit bias $\theta_0$

If we extend $x$ by adding $x_0=1$, so that an example's feature vector is $x = (x_0, \ldots, x_n)$, and define $\mathbf{\theta} = (b, w)$, still a column vector, with $\theta_0 = b$, then we can write this model as:

$\hat{y} = \sigma(\mathbf{\theta}^{T}x)$

This formulation is not preferred here: keeping the bias term explicitly available makes some of the later arithmetic easier to understand.
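
A quick numerical check that the two formulations agree, sketched with NumPy and made-up values (the names `theta` and `x_ext` are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and features with n = 3.
w = np.array([[0.2], [-0.4], [0.1]])
b = 0.3
x = np.array([[1.5], [0.5], [-2.0]])

# Implicit-bias form: prepend theta_0 = b to w, and x_0 = 1 to x.
theta = np.vstack((np.array([[b]]), w))      # shape (4, 1)
x_ext = np.vstack((np.array([[1.0]]), x))    # shape (4, 1)

explicit = sigmoid(w.T @ x + b).item()
implicit = sigmoid(theta.T @ x_ext).item()

# The two values should match up to floating-point rounding.
print(np.isclose(explicit, implicit))
```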

### Logistic Regression Cost Function


### Gradient Descent
### Derivatives
### More Derivatives Examples
### Computation Graph
### Derivatives with a Computation Graph
### Gradient Descent on `m` examples

## Python and Vectorization
### Vectorization
### More Vectorization Examples
### Vectorizing Logistic Regression
### Vectorizing Logistic Regression's Gradient Output
### Broadcasting in Numpy
### A Note on Numpy Vectors
### Explanation of Logistic Regression Cost Function
