# Week2 Notes

It's possible to cast logistic regression as a simple neural network. That is what we do here, along with introducing the forward pass and backward pass as part of the computation graph. Finally, we introduce a vectorized implementation and demonstrate its improved efficiency.


## Logistic Regression as a Neural Network
Casting logistic regression as a neural network.

### Binary Classification

Binary classification is the problem of finding a mapping from a feature vector $\mathbf{x}$ to a target $y$, where $y$ must be 1 or 0, from examples in a historical dataset.

The 'hello world' problem in machine learning (at least for computer vision) is learning to classify an image ($x$) as either containing a cat ($y=1$) or not ($y=0$).

An image is an array of numbers (from 0 to 255) made up of 3 channels (one for each of Red, Green and Blue) across the height and width of the image. Each number is the intensity of the pixel in each color channel at a given location in the image. We can think of this as a 3 dimensional array, $A$. The way to get a feature vector $x$ from $A$ is to unroll, unravel, or reshape $A$ into a vector. 

For an image that is $64 \times 64$ pixels, there are $64 \times 64 \times 3 = 12288$ numbers that make up the feature vector. The length of the feature vector is denoted $n = n_{X}$.
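The unrolling step can be sketched with NumPy. This is a minimal illustration, assuming a randomly generated array stands in for a real image; the names `image` and `x` are hypothetical.

```python
import numpy as np

# A made-up 64x64 RGB "image": integer intensities in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Unroll the 3-dimensional array into a single column vector.
x = image.reshape(-1, 1)

n = x.shape[0]
print(x.shape)  # (12288, 1)
```

The exact order in which pixels are unrolled does not matter, as long as the same order is used for every image.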

Let's say we have $m = m_{train}$ examples of images with cats and without cats in them. Then we can define the data as a feature matrix $X$ and a target vector $Y$ as below:

$$
X = \left [ x^{(1)}, \ldots,  x^{(m)} \right ]
$$
$$
Y = \left [ y^{(1)}, \ldots,  y^{(m)} \right ]
$$

In this setting `X` is a matrix with `X.shape = (n, m)`, and `Y` is a row vector with `Y.shape = (1, m)`. This means that the individual $x^{(i)}$ are column vectors, stacked side by side as the columns of `X`.
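To see how the definition of $X$ as a row of column vectors translates into array shapes, here is a small sketch. The data and labels are made up for illustration, and `examples` and `labels` are hypothetical names.

```python
import numpy as np

n, m = 12288, 5  # feature length and an illustrative number of examples
rng = np.random.default_rng(0)

# Each example x^(i) is a column vector of shape (n, 1).
examples = [rng.random((n, 1)) for _ in range(m)]
labels = [1, 0, 1, 1, 0]  # made-up targets, one per example

# Stack the column vectors side by side, as in X = [x^(1), ..., x^(m)].
X = np.hstack(examples)
Y = np.array(labels).reshape(1, m)

print(X.shape, Y.shape)  # (12288, 5) (1, 5)
```

With this layout, column `i` of `X` and entry `i` of `Y` belong to the same example.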

### Logistic Regression

In the logistic regression model, a linear combination of the features $x$ is passed as input to a non-linear function, in this case the sigmoid function. The output is an estimate of the probability that $y$ is 1, $\hat{y} = P(y=1|x)$, as opposed to explicitly trying to model the target as a function of $x$.

If we tried to model the target directly, we could not get a smooth function that maps to 1 or 0 only. By using the sigmoid we ensure that the probability of $y$ being 1 is explicitly modelled and bounded between 0 and 1. The sigmoid function is:

$$
\sigma(z) = \left( 1 + e^{-z} \right)^{-1}
$$


We will see later that for hidden layers it might make sense to treat the choice of non-linear function, also called the activation function, as a hyperparameter.

If the linear combination is a large positive number, then $P(y=1|x)$ gets very close to 1. Using a decision rule that picks the most probable class (maximum a posteriori, or MAP), we predict that in such cases $y$ will be 1.

If the linear combination is a large negative number, then $P(y=1|x)$ gets very close to 0. Under the MAP rule we predict that in such cases $y$ will be 0.

If the linear combination is close to 0, then $P(y=1|x)$ is close to 0.5. Under MAP we still predict according to which side of 0.5 the probability falls (above 0.5 gives 1, below 0.5 gives 0), but we cannot be very confident in the prediction.
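The three regimes described above can be checked numerically. This is a minimal sketch, assuming NumPy; the function name `sigmoid` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative, slightly negative, slightly positive, large positive.
z = np.array([-10.0, -0.1, 0.1, 10.0])
p = sigmoid(z)

# Threshold at 0.5 to turn probabilities into 0/1 predictions.
predictions = (p > 0.5).astype(int)

print(p)
print(predictions)
```

The values near `z = 0` produce probabilities close to 0.5, so the predictions there flip with a small change in the input.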

#### with explicit bias `b`

Using parameters $\mathbf{w} = [w_1, \ldots, w_n]^T$ (a column vector) and $b$, a scalar bias parameter, we model $\hat{y}$ as $\hat{y} = \sigma(w^{T}x + b)$.

#### with implicit bias $\theta_0$

If we extend $x$ by adding $x_0=1$, so that an example's feature vector is $x = (x_0, \ldots, x_n)$, and define $\mathbf{\theta} = (b, w)$, still a column vector, with $\theta_0 = b$, then we can write the model as:

$\hat{y} = \sigma(\mathbf{\theta}^{T}x)$

This formulation is not preferred here, because keeping the bias term explicit makes some of the later arithmetic easier to follow.
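The two formulations give the same prediction, which a short sketch can confirm. The values of `w`, `b` and `x` below are random and purely illustrative, and the names `x_ext` and `theta` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=(n, 1))  # one example as a column vector
w = rng.normal(size=(n, 1))  # weights as a column vector
b = 0.3                      # scalar bias

# Explicit bias: y_hat = sigmoid(w^T x + b)
y_hat_explicit = sigmoid(w.T @ x + b)

# Implicit bias: prepend x_0 = 1 to x and theta_0 = b to w.
x_ext = np.vstack([np.ones((1, 1)), x])
theta = np.vstack([np.array([[b]]), w])
y_hat_implicit = sigmoid(theta.T @ x_ext)

print(np.allclose(y_hat_explicit, y_hat_implicit))  # True
```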

### Logistic Regression Cost Function

As we learn the parameters of the logistic regression, we want each estimate $\hat{y}^{(i)}$ to be as close as possible to its target $y^{(i)}$, for every example $i$ in the dataset. We can measure the loss, or distance, of each estimate $\hat{y}^{(i)}$ from the ideal target $y^{(i)}$ using a function $L$. Some candidate $L$ functions are:

1. $L(y, \hat{y}) = |y - \hat{y}|$

1. $L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$

1. $L(y, \hat{y}) = - \left [ y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \right ]$

Of these, the third loss function has many desirable properties such as: 

* differentiable, smooth and convex in the parameters, so every local minimum is also a global minimum (convexity can be confirmed by taking the second derivative wrt the parameters)
* a natural mapping to the negative log-likelihood (discussed later)

The first property is the most compelling for now; the second, the link to the negative log-likelihood, is explained later. Because the loss is smooth and convex, we can adjust the parameters $w$ and $b$ until we reach a low loss.

Since we want to reduce the loss over the whole dataset, we use the individual losses to define an average loss over all examples. This is called the cost function $J$ of the dataset. That is:

$$
J(\mathbf{w}, b) = \frac{1}{m}\sum_{i=1}^{m} L(y^{(i)}, \hat{y}^{(i)})
= -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}) \right]
$$

Because of our choice of loss function (convex and smooth), $J$ is also convex, so any local minimum we find is a global minimum.

By iterating an update rule derived from a Taylor series approximation (known as gradient descent) we converge towards the parameters that minimize the cost function.

These parameters define a classification rule, or classifier to model the binary data.
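The third loss function and the resulting cost can be sketched as follows. This is a minimal illustration, assuming predictions lie strictly between 0 and 1 so the logarithms are finite; the function names and the sample predictions are made up.

```python
import numpy as np

def cross_entropy_loss(y, y_hat):
    """Per-example loss: -[y log(y_hat) + (1 - y) log(1 - y_hat)]."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y, Y_hat):
    """Cost J: the average of the per-example losses."""
    return float(np.mean(cross_entropy_loss(Y, Y_hat)))

Y = np.array([[1, 0, 1, 0]])                # targets, shape (1, m)
Y_good = np.array([[0.9, 0.1, 0.8, 0.2]])   # estimates close to the targets
Y_bad = np.array([[0.1, 0.9, 0.2, 0.8]])    # estimates far from the targets

print(cost(Y, Y_good), cost(Y, Y_bad))
```

Estimates that sit close to their targets give a low cost, while confident wrong estimates are penalized heavily.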

### Gradient Descent

#### Deriving The Algorithm

By Taylor's theorem, for a well-behaved smooth multivariate function $F$, with $\mathbf{x}$ and $\mathbf{h}$ column vectors:

$$
F(\mathbf{x} + \mathbf{h}) \approx  F(\mathbf{x}) + \mathbf{h}^T \nabla F(\mathbf{x}) + \frac{1}{2} \mathbf{h}^T \nabla^2 F(\mathbf{x}) \, \mathbf{h}
$$

We can rewrite this with the substitution $\mathbf{x}_t = \mathbf{x} + \mathbf{h}$ and $\mathbf{x}_{t-1} = \mathbf{x}$:

$$
F(\mathbf{x}_t) \approx  F(\mathbf{x}_{t-1}) + (\mathbf{x}_{t} -\mathbf{x}_{t-1})^T \nabla F(\mathbf{x}_{t-1}) + \frac{1}{2} (\mathbf{x}_{t} -\mathbf{x}_{t-1})^T \nabla^2 F(\mathbf{x}_{t-1}) (\mathbf{x}_{t} -\mathbf{x}_{t-1})
$$

At an optimum $\mathbf{x}_{opt}$, $\nabla F(\mathbf{x}_{opt}) = \mathbf{0}$. So if we want $\mathbf{x}_{t}$ to be close to the optimum, we choose it so that the gradient of this approximation with respect to $\mathbf{x}_{t}$ vanishes:

$$
\nabla F(\mathbf{x}_{t-1}) + \nabla^2 F(\mathbf{x}_{t-1}) (\mathbf{x}_{t} -\mathbf{x}_{t-1}) \approx \mathbf{0}
$$

Rearranging, we have:

$$
 - \nabla F(\mathbf{x}_{t-1}) \approx \nabla^2 F(\mathbf{x}_{t-1}) (\mathbf{x}_{t} -\mathbf{x}_{t-1})
$$

Next we approximate the inverse Hessian by a constant multiple of the identity, $\alpha I$ with $\alpha > 0$ (checking the dimensions of the array multiplications confirms that this works):

$$
- \alpha  \nabla F(\mathbf{x}_{t-1}) \approx\ (\mathbf{x}_{t} -\mathbf{x}_{t-1})
$$

Then, rearranging, we have the famous gradient descent algorithm. From a good starting point $\mathbf{x}_{0}$ near the optimum, and with a small enough $\alpha$, we repeat this iteration:

$$
\mathbf{x}_{t} = \mathbf{x}_{t-1}  - \alpha \nabla F(\mathbf{x}_{t-1})
$$

Then, for any desired tolerance, there is some $R$ such that every $\mathbf{x}_{T}$ with $T > R$ is within that tolerance of $\mathbf{x}_{opt}$.
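The iteration can be tried on a simple convex function. This is a sketch under stated assumptions: the test function $F(\mathbf{x}) = \|\mathbf{x} - \mathbf{c}\|^2$, the step size and the iteration count are all chosen here for illustration, and `gradient_descent` is a hypothetical helper.

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, num_iters=200):
    """Repeat x_t = x_{t-1} - alpha * grad(x_{t-1}) a fixed number of times."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad(x)
    return x

# F(x) = ||x - c||^2 has its unique minimum at x = c, with gradient 2 (x - c).
c = np.array([3.0, -1.0])

def grad_F(x):
    return 2.0 * (x - c)

x_final = gradient_descent(grad_F, x0=[0.0, 0.0])
print(x_final)
```

For this function each step shrinks the distance to `c` by a constant factor of `1 - 2 * alpha`, so a step size that is too large (here, above 1) would make the iterates diverge instead.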

#### Cost Function

With the logistic regression loss, $J$ is a convex surface over the parameter space. Because of this convexity, we can initialize our parameters anywhere in the space; conventionally the zero vector is chosen for $w$ and 0 for $b$.

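Now we run the iterative process, repeating the following updates until the cost stops decreasing:

$$
\mathbf{w} := \mathbf{w} - \alpha \frac{\partial J}{\partial \mathbf{w}}, \qquad b := b - \alpha \frac{\partial J}{\partial b}
$$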


### Derivatives

Here Professor Ng provides an understanding of the gradient as a rate of change, using simple analytic examples with numbers. Essentially the point being driven home is:

$$
F(a + h) - F(a) \approx h \left. \frac{\partial F}{\partial x} \right|_{x=a}
$$
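A quick numerical check of this approximation, using $F(x) = x^3$ as an example function chosen here for illustration (it is not taken from the lecture):

```python
def F(x):
    return x ** 3

def dF(x):
    """Analytic derivative of F."""
    return 3 * x ** 2

a, h = 2.0, 1e-3

actual_change = F(a + h) - F(a)   # true change in F
linear_estimate = h * dF(a)       # change predicted by the gradient

print(actual_change, linear_estimate)
```

The two numbers agree to several decimal places, and the gap shrinks further as `h` gets smaller.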

### More Derivatives Examples
More of the same, some more complex examples with non-constant gradients.


### Computation Graph

Here Professor Ng shows how a function can be evaluated by breaking it into a graph of simpler composite expressions. The major reason this is useful is that intermediate results can be cached and reused in the forward and the backward propagation of the neural network algorithm. Note that:

- Forward propagation: Substituting intermediate quantities to calculate the cost function.
- Backward propagation: Applying the chain rule (using intermediate derivatives) to calculate derivatives of the cost function with respect to the parameters.

### Derivatives with a Computation Graph

A detailed example of derivatives using the chain rule, numerical examples on a simple compound function. The substitutions can be viewed on the computation graph - a directed acyclic graph.

In code, we note the convention that, for the principal function being optimized, say $J$, we denote its derivative with respect to some variable $a$, $\frac{\partial J}{\partial a}$, as `da`. This avoids repeatedly referencing the function to be optimized in lengthy token names such as `dJ_over_da`.
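A forward and backward pass over a small computation graph, using the `da` naming convention, might look like the sketch below. The function $J = 3(a + bc)$ and the input values are assumptions chosen for illustration, as are the intermediate names `u` and `v`.

```python
# Inputs (illustrative values).
a, b, c = 5.0, 3.0, 2.0

# Forward pass: compute and cache each intermediate node.
u = b * c      # u = bc
v = a + u      # v = a + u
J = 3.0 * v    # J = 3v

# Backward pass: apply the chain rule from J back to the inputs.
# Each name dX holds the derivative of J with respect to X.
dv = 3.0       # dJ/dv
du = dv * 1.0  # dJ/du = dJ/dv * dv/du
da = dv * 1.0  # dJ/da = dJ/dv * dv/da
db = du * c    # dJ/db = dJ/du * du/db
dc = du * b    # dJ/dc = dJ/du * du/dc

print(J, da, db, dc)  # 33.0 3.0 6.0 9.0
```

Note how `du` is computed once and reused for both `db` and `dc`, which is the caching benefit of working along the graph.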


### Logistic Regression Gradient Descent

Here we calculate the intermediate derivative quantities needed to complete the chain rule. That is, we take the derivatives of the component functions to obtain the derivative of the loss function with respect to the parameters $\mathbf{w}$ and $b$.

Firstly note that the loss function can be expressed as a composite of:

- $z(w,b;t) = w^Tt +b$

- $\sigma(t) = (1 + e^{-t})^{-1}$

- $\hat{y} = \sigma(z)$

- $g(t,\hat{t}) = t\log(\hat{t})$

- $L(t, \hat{t}) = -g(t, \hat{t}) - g(1-t, 1-\hat{t})$

In fact we can see that $L$ is:

$$L(y,\hat{y}) = - \left[ g(y, \sigma(z(w,b;x))) + g(1-y, 1 - \sigma(z(w,b;x))) \right ]$$

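The derivatives of these functions are:

$$
\frac{\partial z}{\partial w} = t, \qquad \frac{\partial z}{\partial b} = 1, \qquad \frac{d\sigma}{dt} = \sigma(t)\left(1 - \sigma(t)\right), \qquad \frac{\partial g}{\partial \hat{t}} = \frac{t}{\hat{t}}
$$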


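Now we compute the derivatives of the loss function with respect to $w$ and $b$. Applying the chain rule, with $\hat{y} = \sigma(z)$ and $z = w^Tx + b$:

$$
\frac{\partial}{\partial w} L(y,\hat{y}) = \frac{\partial L}{\partial \hat{y}} \frac{d\hat{y}}{dz} \frac{\partial z}{\partial w} = \left( -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} \right) \hat{y}(1-\hat{y}) \, x = (\hat{y} - y) \, x
$$

$$
\frac{\partial}{\partial b} L(y,\hat{y}) = \frac{\partial L}{\partial \hat{y}} \frac{d\hat{y}}{dz} \frac{\partial z}{\partial b} = \left( -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} \right) \hat{y}(1-\hat{y}) = \hat{y} - y
$$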
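Working through the chain rule for this loss gives the standard results $\frac{\partial L}{\partial w} = (\hat{y} - y)\,x$ and $\frac{\partial L}{\partial b} = \hat{y} - y$. The sketch below checks them against finite differences for a single example. It uses 1-D arrays for simplicity, and the data, the names `loss` and `gradients`, and the step `eps` are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss for one example."""
    y_hat = sigmoid(w @ x + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradients(w, b, x, y):
    """Analytic gradients of the loss wrt w and b via the chain rule."""
    y_hat = sigmoid(w @ x + b)  # forward pass
    dz = y_hat - y              # dL/dz
    dw = dz * x                 # dL/dw = dL/dz * dz/dw
    db = dz                     # dL/db = dL/dz * dz/db
    return dw, db

rng = np.random.default_rng(0)
w, x = rng.normal(size=3), rng.normal(size=3)
b, y = 0.1, 1.0

dw, db = gradients(w, b, x, y)

# Central finite differences as an independent check.
eps = 1e-6
dw_numeric = np.array([
    (loss(w + eps * e, b, x, y) - loss(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(3)
])
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(np.allclose(dw, dw_numeric, atol=1e-6))
```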

### Gradient Descent on `m` examples

## Python and Vectorization

### Vectorization

### More Vectorization Examples

### Vectorizing Logistic Regression

### Vectorizing Logistic Regression's Gradient Output

### Broadcasting in Numpy

### A Note on Numpy Vectors

### Explanation of Logistic Regression Cost Function
