# Logistic Regression Gradient Descent

## Notation

- $m$ denotes the number of examples in the dataset  
- $n_x$ denotes the input size, or the number of features, per example  
- $n_y$ denotes the output size  

- $X \in \mathbb{R}^{n_x \times m}$ is the input matrix

$$
X =
\begin{bmatrix}
| & | & & | \\
\textbf{x}^{(1)} & \textbf{x}^{(2)} & \ldots & \textbf{x}^{(m)} \\
| & | & & | 
\end{bmatrix}
$$

- $Y \in \mathbb{R}^{n_y \times m}$ is the label matrix

$$
Y = 
\begin{bmatrix}
| & | & & | \\
\textbf{y}^{(1)} & \textbf{y}^{(2)} & \ldots & \textbf{y}^{(m)} \\
| & | & & |
\end{bmatrix}
$$

- $\textbf{x}^{(i)} \in \mathbb{R}^{n_x}$ is the input feature vector for the $i^{th}$ example
- $\textbf{y}^{(i)} \in \mathbb{R}^{n_y}$ is the output label for the $i^{th}$ example  
- $\hat{\textbf{y}}^{(i)} \in \mathbb{R}^{n_y}$ is the predicted output vector for the $i^{th}$ example, also denoted $\textbf{a}^{(i)}$  
- $\textbf{w} \in \mathbb{R}^{n_x}$ refers to the weight vector (a column vector, so that $\textbf{w}^T \textbf{x}$ is a scalar), and $b$ to the scalar bias

- $J(\textbf{w}, b)$ denotes the cost function 

### Coding Notation

- $dw_1 := \frac{\partial J(\textbf{w}, b)}{\partial w_1}$  

- $dw_2 := \frac{\partial J(\textbf{w}, b)}{\partial w_2}$  

- $db := \frac{\partial J(\textbf{w}, b)}{\partial b}$  

Equations (1) through (4) refer to a single example; equation (5) averages the loss over all $m$ examples:

$$z = \textbf{w}^T \textbf{x} + b  \tag{1} $$
$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{2} $$
$$\hat{y} = a = \sigma(z) \tag{3} $$
$$\mathcal{L}(a, y) = -(y \log(a) + (1 - y) \log(1 - a)) \tag{4} $$
$$J(\textbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \tag{5} $$
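
Equations (1) through (5) translate almost line for line into plain Python. The following is a minimal sketch; the function names (`sigmoid`, `predict`, `loss`, `cost`) and the toy data are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Equation (2): the logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    """Equations (1) and (3): a = sigma(w^T x + b) for one example."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

def loss(a, y):
    """Equation (4): cross-entropy loss for one example."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def cost(w, b, xs, ys):
    """Equation (5): mean loss over all m examples."""
    m = len(xs)
    return sum(loss(predict(w, x, b), y) for x, y in zip(xs, ys)) / m

# Toy data (assumed for illustration): m = 2 examples, n_x = 2 features.
xs = [[1.0, 2.0], [-1.0, 0.5]]
ys = [1, 0]
w0 = [0.0, 0.0]
b0 = 0.0
J0 = cost(w0, b0, xs, ys)  # with zero parameters every a is 0.5, so J0 = log(2)
```

Starting from all-zero parameters is a convenient sanity check: every prediction is 0.5, so the cost equals $\log 2$ whatever the data.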


## Pseudocode

### Basic Implementation: Looping over `m` training examples

```
J = 0; dw_1 = 0; dw_2 = 0; db = 0

// loop over all training examples
for i = 1 to m
    z_i = w.T * x_i + b
    a_i = sigma(z_i)
    J += -[y_i * log(a_i) + (1 - y_i) * log(1 - a_i)]
    dz_i = a_i - y_i

    // accumulate the gradient for each feature (written out for n_x = 2;
    // in general this is an inner loop over j = 1 to n_x)
    dw_1 += x_1i * dz_i
    dw_2 += x_2i * dz_i
    db += dz_i

// average over the m examples
J /= m
dw_1 /= m
dw_2 /= m
db /= m

// gradient descent update
w_1 = w_1 - alpha * dw_1
w_2 = w_2 - alpha * dw_2
b = b - alpha * db
```
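
The pseudocode above could be written as runnable Python roughly as follows. This is a sketch under stated assumptions: the function name `gradient_step_loops`, the toy data, the learning rate, and the number of iterations are all illustrative. The inner loop runs over an arbitrary number of features instead of the two written out above.

```python
import math

def gradient_step_loops(w, b, xs, ys, alpha):
    """One gradient-descent step using explicit loops (no vectorization).

    w: list of n_x weights, b: bias, xs: list of m feature lists,
    ys: list of m labels in {0, 1}, alpha: learning rate.
    Returns the updated (w, b) and the cost J measured before the update.
    """
    m, n_x = len(xs), len(w)
    J, dw, db = 0.0, [0.0] * n_x, 0.0

    for x, y in zip(xs, ys):                  # loop over training examples
        z = sum(w[j] * x[j] for j in range(n_x)) + b
        a = 1.0 / (1.0 + math.exp(-z))
        J += -(y * math.log(a) + (1 - y) * math.log(1 - a))
        dz = a - y
        for j in range(n_x):                  # loop over features
            dw[j] += x[j] * dz
        db += dz

    # Average over the m examples.
    J /= m
    dw = [g / m for g in dw]
    db /= m

    # Gradient-descent update.
    w = [w[j] - alpha * dw[j] for j in range(n_x)]
    b = b - alpha * db
    return w, b, J

# Toy data, assumed for illustration only.
xs = [[0.5, 1.5], [1.0, 1.0], [-1.0, -0.5], [-1.5, -1.0]]
ys = [1, 1, 0, 0]
w, b = [0.0, 0.0], 0.0
costs = []
for _ in range(100):
    w, b, J = gradient_step_loops(w, b, xs, ys, alpha=0.1)
    costs.append(J)
```

Recording `J` at every step gives a quick check that the learning rate is reasonable: the cost should shrink from one iteration to the next.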

### Vectorized Implementation

*$z$ now becomes $\textbf{z}$, $\hat{y} = a$ becomes $\hat{\textbf{y}} = \textbf{a}$, and $\textbf{x}$ becomes $X$*

Rewrite $z^{(1)}, z^{(2)}, \; \ldots , \; z^{(m)} $ (individual $z$ for multiple examples) as

$$
Z =
\begin{bmatrix}
\textbf{z}^{(1)} & \textbf{z}^{(2)} & \ldots & \textbf{z}^{(m)}
\end{bmatrix}
= w^T X + b, \tag{6}
$$

Similarly, rewrite $a^{(1)}, a^{(2)}, \; \ldots , \; a^{(m)}$ as

$$
A = \sigma(Z) = \sigma(w^T X + b) =
\begin{bmatrix}
\sigma(\textbf{z}^{(1)}) & \sigma(\textbf{z}^{(2)}) & \ldots & \sigma(\textbf{z}^{(m)})
\end{bmatrix}
=
\begin{bmatrix}
\textbf{a}^{(1)} & \textbf{a}^{(2)} & \ldots & \textbf{a}^{(m)}
\end{bmatrix} \tag{7}
$$

#### Expanding $w^T X$

Recall that $X \in \mathbb{R}^{n_x \times m}$ and that $w^T$ is a $1 \times n_x$ row vector, so $w^T X$ is a $1 \times m$ row vector whose $i^{th}$ entry is $w^T \textbf{x}^{(i)} = \sum_{j=1}^{n_x} w_j x^{(i)}_j$. The scalar $b$ is then added to every entry.

$$
\begin{bmatrix}
w_1 & w_2 & \ldots & w_{n_x}
\end{bmatrix}
\begin{bmatrix}
| & | & & | \\
\textbf{x}^{(1)} & \textbf{x}^{(2)} & \ldots & \textbf{x}^{(m)} \\
| & | & & | 
\end{bmatrix}
=
\begin{bmatrix}
w^T \textbf{x}^{(1)} & w^T \textbf{x}^{(2)} & \ldots & w^T \textbf{x}^{(m)}
\end{bmatrix}
$$
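
A quick NumPy check of the shapes involved may help here. The sizes, the random seed, and the bias value below are arbitrary assumptions; the point is only that `np.dot(w.T, X) + b` yields one $z$ per column of $X$ and matches a column-by-column computation.

```python
import numpy as np

n_x, m = 3, 4                      # assumed sizes for the demonstration
rng = np.random.default_rng(0)
w = rng.normal(size=(n_x, 1))      # column vector of weights
X = rng.normal(size=(n_x, m))      # one example per column
b = 0.5                            # scalar bias

Z = np.dot(w.T, X) + b             # b is broadcast across all m columns

# Compare against computing w^T x^(i) + b one column at a time.
Z_loop = np.array([[np.dot(w[:, 0], X[:, i]) + b for i in range(m)]])
```

Both `Z` and `Z_loop` have shape `(1, m)`: a single row with one entry per training example.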

#### Derivation of $\frac{d\mathcal{L}}{dz}$

First note that $dz := \frac{d\mathcal{L}}{dz}$  
By the chain rule

$$\frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{da} \times \frac{da}{dz} $$

Using the loss (4) and the sigmoid (2):

$$\mathcal{L}(a, y) = -(y \log(a) + (1 - y) \log(1 - a)) $$
$$\frac{d\mathcal{L}}{da} \; = \; -y \times \frac{1}{a} \; - \; (1 - y) \times \frac{1}{1 - a} \times (-1) $$
$$\frac{d\mathcal{L}}{da} = \frac{-y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)} \tag{8} $$
$$\frac{da}{dz} = \frac{d}{dz} \sigma(z) = \sigma(z) \times (1 - \sigma(z)) = a(1 - a) \tag{9} $$
$$\frac{d\mathcal{L}}{dz} = \frac{a - y}{a(1 - a)} \times a(1 - a) = a - y $$

Hence 

$$dz = a - y \tag{10} $$
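
Result (10) can be sanity-checked numerically by comparing $a - y$ with a finite-difference estimate of $\frac{d\mathcal{L}}{dz}$. The sketch below assumes a central difference with step `eps = 1e-6` and a handful of arbitrary test points; the helper names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_of_z(z, y):
    """Loss (4) written as a function of z, with a = sigma(z)."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def numeric_dz(z, y, eps=1e-6):
    """Central finite-difference estimate of dL/dz."""
    return (loss_of_z(z + eps, y) - loss_of_z(z - eps, y)) / (2 * eps)

# Arbitrary test points, assumed for illustration.
# Each tuple holds (z, y, analytic a - y, numeric estimate).
checks = [(z, y, sigmoid(z) - y, numeric_dz(z, y))
          for z in (-2.0, 0.3, 1.7) for y in (0, 1)]
max_error = max(abs(analytic - numeric) for _, _, analytic, numeric in checks)
```

If the derivation is right, `max_error` should be tiny compared with the gradients themselves.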


#### TL;DR

$$
dZ =
\begin{bmatrix}
\textbf{dz}^{(1)} & \textbf{dz}^{(2)} & \ldots & \textbf{dz}^{(m)}
\end{bmatrix} = 
\begin{bmatrix}
\textbf{a}^{(1)} - \textbf{y}^{(1)} & \textbf{a}^{(2)} - \textbf{y}^{(2)} & \ldots & \textbf{a}^{(m)} - \textbf{y}^{(m)}
\end{bmatrix} = 
A - Y
$$
$$db = \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)}) = \frac{1}{m} \sum_{i=1}^{m} d\textbf{z}^{(i)} \tag{11} $$
$$dw = \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T = \frac{1}{m} X dZ^T \tag{12} $$
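
Equations (11) and (12) are the per-example accumulations written as matrix operations. A small NumPy sketch can confirm that the two forms agree; the sizes, seed, and random data are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, m = 3, 5                                  # assumed sizes
X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(1, m))            # random 0/1 labels
w = rng.normal(size=(n_x, 1))
b = 0.1

A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))
dZ = A - Y                                     # shape (1, m)

dw_vec = np.dot(X, dZ.T) / m                   # equation (12), shape (n_x, 1)
db_vec = np.sum(dZ) / m                        # equation (11), scalar

# The same quantities accumulated one example at a time.
dw_sum = np.zeros((n_x, 1))
db_sum = 0.0
for i in range(m):
    dw_sum += X[:, [i]] * dZ[0, i]             # x^(i) as a column, times dz^(i)
    db_sum += dZ[0, i]
dw_sum /= m
db_sum /= m
```

The transpose in `dZ.T` is what turns the sum over examples into a single matrix product: column $i$ of $X$ is weighted by $dz^{(i)}$ and the results are added up.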

#### 1. Removing the inner loop

```
J = 0; dw = np.zeros((n_x, 1)); db = 0

// loop over all training examples
for i = 1 to m
    z_i = w.T * x_i + b
    a_i = sigma(z_i)
    J += -[y_i * log(a_i) + (1 - y_i) * log(1 - a_i)]
    dz_i = a_i - y_i

    // x_i is the whole feature vector, so this updates every dw_j at once
    dw += x_i * dz_i
    db += dz_i
J /= m
dw /= m
db /= m

w = w - alpha * dw
b = b - alpha * db
```

#### 2. Removing the outer loop

```
# forward pass over all m examples at once
Z = np.dot(w.T, X) + b          # shape (1, m); b is broadcast
A = sigma(Z)
J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

# backward pass
dZ = A - Y
dw = np.dot(X, dZ.T) / m        # shape (n_x, 1)
db = np.sum(dZ) / m

# gradient descent update
w = w - alpha * dw
b = b - alpha * db
```
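
Putting the vectorized step inside an iteration loop gives a complete, if minimal, training routine. The sketch below is one possible arrangement, not a reference implementation: the function name `train`, the zero initialization, the default `alpha` and `num_iterations`, and the toy data are all assumptions.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def train(X, Y, alpha=0.1, num_iterations=200):
    """Batch gradient descent for logistic regression, fully vectorized.

    X: (n_x, m) inputs, Y: (1, m) labels in {0, 1}.
    Returns the learned w of shape (n_x, 1), the bias b, and the cost history.
    """
    n_x, m = X.shape
    w, b = np.zeros((n_x, 1)), 0.0
    costs = []
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)                        # shape (1, m)
        J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
        dZ = A - Y
        dw = np.dot(X, dZ.T) / m
        db = np.sum(dZ) / m
        w = w - alpha * dw
        b = b - alpha * db
        costs.append(J)
    return w, b, costs

# Toy, linearly separable data (assumed): columns are examples.
X = np.array([[0.5, 1.0, -1.0, -1.5],
              [1.5, 1.0, -0.5, -1.0]])
Y = np.array([[1, 1, 0, 0]])
w, b, costs = train(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
```

The only remaining Python loop is over iterations; every per-example and per-feature loop has been replaced by a matrix operation.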
