# Logistic Regression
https://www.coursera.org/learn/neural-networks-deep-learning/lecture/yWaRd/logistic-regression-cost-function

# Notation
Andrew Ng's Deeplearning.ai Neural Networks course describes the input in terms of `m` examples and `n` features (the course stacks examples as columns, so `X` has shape `n_x` x `m`). Other courses describe an `N`x`D` input matrix.

$i = Denotes \ the \ i^{th} \ training \ example.$<br>
$l = Denotes \ the \ l^{th} layer.$<br>
$m = Number \ of \ samples \ (examples)$ <br>
$n_{x} = Number \ of \ input \ features$<br>
$n_{y} = Number \ of \ output \ classes$<br>
$\hat{y} = \ Estimate \ (prediction)$<br>
$y = Ground \ truth \ value $<br>
$L = Loss \ function$<br>
$J = Cost \ function$<br>
$\alpha = learning \ rate$<br>
$z^{(i)} = w^{T}x^{(i)} + b$

# Estimator

$$\hat{y}^{(i)} = \sigma (w^{T}x^{(i)} + b), \ where \ \sigma (z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}} $$

$$Given \ \{(x^{(1)}, y^{(1)}),...,(x^{(m)}, y^{(m)})\}, \ want \ \hat{y}^{(i)} \approx  y^{(i)}  $$
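The estimator can be sketched in a few lines of NumPy. This is a minimal illustration rather than the course's own code: the function names `sigmoid` and `predict_proba` and the sample values of `w`, `b`, and `x` are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigma(w^T x + b) for one example x of shape (n_x,)."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Hypothetical parameters and input, chosen only for illustration.
w = np.array([0.5, -0.25])
b = 0.0
x = np.array([2.0, 4.0])

# Here z = 0.5*2 - 0.25*4 + 0 = 0, so y_hat = sigma(0) = 0.5.
y_hat = predict_proba(w, b, x)
```

A prediction of 0.5 means the model is undecided between the two classes for this input.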

# Loss Function, Cost Function
The loss function measures the discrepancy between the predicted value and the actual value. In linear regression, squared error ("least squares") is an example of a loss function. It does not work well for logistic regression because the resulting optimization problem is non-convex: it has multiple local minima, and gradient descent can get stuck in one of them instead of reaching the global minimum.

## Loss
Loss is computed for a single training example.

$L(\hat{y}^{(i)}, y^{(i)}) = Loss \ function. $

$L(\hat{y}^{(i)}, y^{(i)}) = - (y^{(i)}log(\hat{y}^{(i)}) + (1 - y^{(i)})log(1-\hat{y}^{(i)}))$

If $y^{(i)} = 1: L(\hat{y}^{(i)}, y^{(i)}) =  -log(\hat{y}^{(i)})$. Minimizing the loss means making $log(\hat{y}^{(i)})$ as large as possible, so $\hat{y}^{(i)}$ should be close to 1.

If $y^{(i)} = 0: L(\hat{y}^{(i)}, y^{(i)}) =  -log(1 - \hat{y}^{(i)})$. Minimizing the loss means making $log(1 - \hat{y}^{(i)})$ as large as possible, so $\hat{y}^{(i)}$ should be close to 0.
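A small numeric sketch makes the two cases concrete. The function name `loss` and the probabilities used below are illustrative assumptions, not values from the course.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one example; y is 0 or 1 and 0 < y_hat < 1."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# With y = 1, only the -log(y_hat) term survives.
confident_right = loss(0.9, 1)   # y_hat near 1: small loss
confident_wrong = loss(0.1, 1)   # y_hat near 0: large loss

# With y = 0, only the -log(1 - y_hat) term survives.
mirror_case = loss(0.1, 0)       # same loss as predicting 0.9 when y = 1
```

The loss grows without bound as the prediction approaches the wrong label, which is what pushes $\hat{y}$ toward the ground truth.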

## Cost
The cost function is the average of the loss function over the entire training set. The cost function `J` is convex, meaning that any local minimum is also the global minimum, so gradient descent is not trapped by spurious local minima.

## $J(w, b) = $ Cost function

### $$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)})$$ <br>
### $$ = -\frac{1}{m}\sum_{i=1}^{m}[(y^{(i)}log(\hat{y}^{(i)}) + (1 - y^{(i)})log(1-\hat{y}^{(i)}))]$$
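As a rough sketch, the cost is the mean of the per-example losses. The function name `cost` and the three sample predictions are assumptions made for this illustration.

```python
import numpy as np

def cost(y_hat, y):
    """Average cross-entropy loss over m examples; both arrays have shape (m,)."""
    m = y.shape[0]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(losses) / m

# Hypothetical labels and predictions for m = 3 examples.
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])

# Equals -(log(0.9) + log(0.8) + log(0.6)) / 3.
J = cost(y_hat, y)
```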

## Gradient Descent
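Gradient descent repeatedly computes the partial derivatives of the cost `J` with respect to `w` and `b`, then moves each parameter a small step, scaled by the learning rate, in the direction that decreases `J`.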

$\alpha = learning \ rate$<br>
$J(w, b) = Cost \ Function$

### $$ w := w - \alpha * \frac{\partial{J(w, b)}}{\partial{w}} $$

### $$ b := b - \alpha * \frac{\partial{J(w, b)}}{\partial{b}} $$

# Chain Rule

# Logistic Regression Derivatives

## Notation

$z = w^{T}x + b$<br>
$\hat{y} = a = \sigma(z)$<br>
$L(a, y) = - (y*log(a) + (1 - y)log(1-a))$<br>


## Shorthand

In the context of loss functions, Professor Ng uses a shorthand, which he encourages for use in variable names in code. The partial derivative of the loss with respect to a variable is written as `d` followed by the name of that variable only.

$dz = shorthand \ for \ \frac{\partial{L}}{\partial{z}}$<br>
$da = shorthand \ for \ \frac{\partial{L}}{\partial{a}}$<br>
$dw = shorthand \ for \ \frac{\partial{L}}{\partial{w}}$<br>

## Computation Graph
![comp_graph.jpg](attachment:comp_graph.jpg)

## Derivative Formulas

### $$ \frac{\partial{L}}{\partial{z}} = a - y $$

### $$ \frac{\partial{L}}{\partial{a}} = - \frac{y}{a} + \frac{1-y}{1-a} $$

### $$ \frac{\partial{L}}{\partial{w_{1}}} = x_{1} \frac{\partial{L}}{\partial{z}} $$

### $$ \frac{\partial{L}}{\partial{w_{2}}} = x_{2} \frac{\partial{L}}{\partial{z}} $$

### $$ \frac{\partial{L}}{\partial{b}} = \frac{\partial{L}}{\partial{z}} $$
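These formulas can be sanity-checked numerically. The sketch below assumes the standard sigmoid derivative $\frac{\partial{a}}{\partial{z}} = a(1 - a)$, which is not written out above, and uses arbitrary values for `z` and `y`; the helper `loss_from_z` is a name introduced only for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_from_z(z, y):
    """Loss L(a, y) expressed as a function of z, with a = sigmoid(z)."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Arbitrary test point, chosen only for illustration.
z, y = 0.3, 1.0
a = sigmoid(z)

# Closed-form result from the formulas above.
analytic_dz = a - y

# Chain rule: dL/dz = dL/da * da/dz, assuming da/dz = a * (1 - a).
da = -y / a + (1 - y) / (1 - a)
chain_dz = da * a * (1 - a)

# Central finite difference as an independent numerical estimate.
eps = 1e-6
numeric_dz = (loss_from_z(z + eps, y) - loss_from_z(z - eps, y)) / (2 * eps)
```

All three values should agree to within floating-point and finite-difference error, which is a quick way to catch sign mistakes in hand-derived gradients.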

## How to Update `w` and `b`

### $$ w_{1} := w_{1} - \alpha \frac{\partial{L}}{\partial{w_{1}}} $$

### $$ w_{2} := w_{2} - \alpha \frac{\partial{L}}{\partial{w_{2}}} $$

### $$ b := b - \alpha \frac{\partial{L}}{\partial{b}} $$
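Put together, one update on a single example with two features might look like the following sketch. The function name `single_example_step`, the zero initialization, and the learning rate of 0.1 are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_example_step(w, b, x, y, alpha):
    """One gradient descent update using the loss of a single example."""
    a = sigmoid(np.dot(w, x) + b)   # forward pass: a = y_hat
    dz = a - y                      # dL/dz
    dw = x * dz                     # dL/dw_j = x_j * dz for every feature j
    db = dz                         # dL/db
    return w - alpha * dw, b - alpha * db

# Hypothetical starting point and training example.
w = np.zeros(2)
b = 0.0
x = np.array([1.0, 2.0])
y = 1.0

# With w = 0 and b = 0: a = 0.5, dz = -0.5, dw = [-0.5, -1.0], db = -0.5.
w_new, b_new = single_example_step(w, b, x, y, alpha=0.1)
```

Because the label is 1 and the initial prediction is 0.5, both weights and the bias increase, which raises the prediction for this example on the next pass.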

# Logistic Regression on `m` Examples

## Pseudocode without Vectorization

This is an example of logistic regression on `m` examples without vectorization.

$dz = shorthand \ for \ \frac{\partial{L}}{\partial{z}}$<br>
$da = shorthand \ for \ \frac{\partial{L}}{\partial{a}}$<br>
$dw = shorthand \ for \ \frac{\partial{L}}{\partial{w}}$<br>

![m_examples.png](attachment:m_examples.png)
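For readers without access to the figure, the loop-based approach can be sketched in plain Python as follows. This is an assumption about the general shape of the algorithm, not a transcription of the image; the function name `gradient_step_loops` and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step_loops(w, b, X, Y, alpha):
    """One gradient descent step with explicit loops.

    X is a list of m examples (each a list of n_x features), Y a list of m labels.
    """
    m = len(X)
    n_x = len(w)
    J = 0.0
    dw = [0.0] * n_x
    db = 0.0
    for i in range(m):                       # loop over examples
        z = sum(w[j] * X[i][j] for j in range(n_x)) + b
        a = sigmoid(z)
        J += -(Y[i] * math.log(a) + (1 - Y[i]) * math.log(1 - a))
        dz = a - Y[i]
        for j in range(n_x):                 # loop over features
            dw[j] += X[i][j] * dz
        db += dz
    # Average the accumulated loss and gradients over the m examples.
    J /= m
    dw = [d / m for d in dw]
    db /= m
    # Apply the update rule to every parameter.
    w = [w[j] - alpha * dw[j] for j in range(n_x)]
    b = b - alpha * db
    return w, b, J

# Toy data: two examples with two features each.
X = [[1.0, 2.0], [2.0, 1.0]]
Y = [1.0, 0.0]
w, b, J = gradient_step_loops([0.0, 0.0], 0.0, X, Y, alpha=0.1)
```

The two nested `for` loops are the cost of this version: one pass over the examples and, inside it, one pass over the features.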

## Pseudocode with Vectorization

This is an example of logistic regression on `m` examples with vectorization.

![m_examples_vector.png](attachment:m_examples_vector.png)
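A NumPy sketch of the vectorized approach is shown below for readers who cannot view the figure. It assumes examples are stored as columns, so `X` has shape `(n_x, m)` and `Y` has shape `(1, m)`; the function name `gradient_step_vectorized` and the toy data are hypothetical, and the code is not a transcription of the image.

```python
import numpy as np

def gradient_step_vectorized(w, b, X, Y, alpha):
    """One gradient descent step with no explicit loops.

    w: (n_x, 1), b: scalar, X: (n_x, m) with one example per column, Y: (1, m).
    """
    m = X.shape[1]
    Z = np.dot(w.T, X) + b                  # (1, m): all z values at once
    A = 1.0 / (1.0 + np.exp(-Z))            # (1, m): all predictions
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                              # (1, m)
    dw = np.dot(X, dZ.T) / m                # (n_x, 1): averaged gradient for w
    db = np.sum(dZ) / m                     # scalar: averaged gradient for b
    return w - alpha * dw, b - alpha * db, J

# Toy data: columns are the examples (1, 2) and (2, 1).
X = np.array([[1.0, 2.0],
              [2.0, 1.0]])
Y = np.array([[1.0, 0.0]])
w0 = np.zeros((2, 1))
w, b, J = gradient_step_vectorized(w0, 0.0, X, Y, alpha=0.1)
```

Matrix products replace both the loop over examples and the loop over features, which is typically much faster in NumPy than iterating in Python.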