# Logistic Regression

Given $ x \in \mathbb{R}^{n_x} $,  
we want $ \hat{y} = P \big( y=1 \ \big| \ x \big) , \ 0 \le \hat{y} \le 1 $  
Parameters : $ w \in \mathbb{R}^{n_x},\ b \in \mathbb{R} $

Output: $$ \hat{y} = \sigma \Big( w^T x + b \Big) $$

Sigmoid function: $ \sigma \big( z \big) = \frac{1}{1+e^{-z}} $.  
When $z$ is a large positive number, $\sigma(z)$ approaches 1; when $z$ is a large negative number, $\sigma(z)$ approaches 0.

Logistic Regression learns w and b so that $ \hat{y} $ is a good estimate of the probability that y equals 1.
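The output formula above can be sketched in a few lines of plain Python. The helper names (`sigmoid`, `predict_proba`) and the parameter values are assumptions chosen only for illustration; the branch inside `sigmoid` is one common way to avoid overflow when z is very negative.

```python
import math

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for one example x (lists of equal length)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters and input, not taken from the notes.
w = [0.5, -0.25]
b = 0.1
x = [2.0, 4.0]
y_hat = predict_proba(w, b, x)  # here z = 1.0 - 1.0 + 0.1 = 0.1
print(y_hat)
```

Whatever the inputs, `y_hat` stays strictly between 0 and 1, which is what lets it be read as a probability.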

### Loss Function : L

To learn good values of w, b, we need to define a Loss Function : $ \mathcal{L} $

$$
\mathcal{L} \big( \hat{y}, y \big) =
- \Big( y \log\hat{y} + (1-y) \log\big( 1 - \hat{y} \big) \Big)
$$

Squared error is not used, because in this setting it does not give a convex function of the parameters.

### Intuition: 

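If y=1, $ \mathcal{L} = - \log\hat{y} $. To make L as small as possible we want $ \log\hat{y} $ as large as possible, i.e. $ \hat{y} $ as large as possible. Because the sigmoid function keeps $ \hat{y} $ below 1, this means pushing $ \hat{y} $ as close to 1 as possible.

If y=0, $ \mathcal{L} = - \log \big( 1 - \hat{y} \big) $. To make L as small as possible we want $ \log(1-\hat{y}) $ as large as possible, i.e. $ ( 1 - \hat{y} ) $ as large as possible. Because the sigmoid function keeps $ \hat{y} $ above 0, this means pushing $ \hat{y} $ as close to 0 as possible.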
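A small numeric sketch of this intuition, assuming a hypothetical helper `loss` that implements the formula for $ \mathcal{L} $ directly. It prints the loss for a few values of $ \hat{y} $ under both labels.

```python
import math

def loss(y_hat, y):
    """Cross-entropy loss for one example; y is 0 or 1 and 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Illustrative predictions only.
for y_hat in (0.1, 0.5, 0.9, 0.99):
    print(f"y_hat={y_hat:4.2f}  "
          f"loss if y=1: {loss(y_hat, 1):.4f}  "
          f"loss if y=0: {loss(y_hat, 0):.4f}")
```

The loss for y=1 shrinks as `y_hat` moves toward 1, and the loss for y=0 shrinks as `y_hat` moves toward 0.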

### Explanation:

$$
\begin{align}
\text{IF } y = 1: & \ p\big( y \ \big| \ x \big) = \hat{y} \\
\text{IF } y = 0: & \ p\big( y \ \big| \ x \big) = 1 - \hat{y}
\end{align}
$$

The two binary cases above can be combined into the single expression below; substituting y=0 or y=1 recovers the corresponding case:

$$
p\big( y \ \big| \ x \big) = \hat{y}^y \big( 1 - \hat{y} \big)^{1-y}
$$

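We want to maximize this probability. Because log is monotonically increasing, taking the log of the expression still leaves a quantity to maximize:

$$
\log p\big( y \ \big| \ x \big) = y \log \hat{y} + (1-y) \log \big( 1 - \hat{y} \big)
$$

The right-hand side is exactly $ -\mathcal{L} \big( \hat{y}, y \big) $, so minimizing the loss is the same as maximizing the log-probability.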
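As a quick check of the combined expression, the sketch below evaluates $ \hat{y}^y (1-\hat{y})^{1-y} $ for both labels and compares its log with the loss. The function names and the value 0.8 are assumptions made for illustration.

```python
import math

def likelihood(y_hat, y):
    """p(y | x) = y_hat^y * (1 - y_hat)^(1 - y) for y in {0, 1}."""
    return (y_hat ** y) * ((1 - y_hat) ** (1 - y))

def loss(y_hat, y):
    """Cross-entropy loss for one example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y_hat = 0.8  # hypothetical prediction
for y in (0, 1):
    p = likelihood(y_hat, y)
    # log p and -loss should agree up to floating-point rounding
    print(y, p, math.log(p), -loss(y_hat, y))
```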

### Cost Function : J

The Loss function L is defined for a single training example $ x^{(i)} $. Extending it to the whole training dataset gives a Cost Function : J

$$
\mathcal{J} \big( w, b \big) = 
\frac{1}{m} 
\sum_{i=1}^m \mathcal{L} \big( \hat{y}^{(i)}, y^{(i)} \big) =
- \frac{1}{m} 
\sum_{i=1}^m \Big[ 
y^{(i)} \log\hat{y}^{(i)} + 
\big( 1 - y^{(i)} \big) \log \big( 1 - \hat{y}^{(i)} \big)
\Big]
$$

m: number of training examples.
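The cost can be sketched as the average of the per-example losses. The dataset below is made up for illustration, and examples are stored as a list of rows `X[i]`, a layout chosen here for plain Python.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, X, Y):
    """J(w, b): mean cross-entropy loss over the m examples in X with labels Y."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        a = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        total += -(y * math.log(a) + (1 - y) * math.log(1 - a))
    return total / m

# Hypothetical toy dataset: 4 examples, 2 features each.
X = [[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]]
Y = [0, 0, 1, 1]

J_zero = cost([0.0, 0.0], 0.0, X, Y)    # every y_hat is 0.5, so J = log 2
J_better = cost([1.0, 1.0], 0.0, X, Y)  # parameters that separate the classes
print(J_zero, J_better)
```

With all-zero parameters every prediction is 0.5, so the cost equals $ \log 2 $; parameters that fit the labels give a lower cost.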

### Explanation:

$$
\begin{align}
\log p\big( \text{ labels in training set } \big) & = \log \prod_{i=1}^m p \big( y^{(i)} \ \big| \ x^{(i)} \big) \\
& = \sum_{i=1}^m \log p \big( y^{(i)} \ \big| \ x^{(i)} \big) 
\text{ ...using Maximum Likelihood Estimation} \\
& = - \sum_{i=1}^m \mathcal{L} \big( \hat{y}^{(i)}, y^{(i)} \big) \\
\text{Cost } J \big( w, b \big) & = \frac{1}{m} \sum_{i=1}^m \mathcal{L} \big( \hat{y}^{(i)}, y^{(i)} \big)
\end{align} 
$$

### Gradient Descent

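Simplify $ \mathcal{J}(w,b) $ to $ \mathcal{J}(w) $.  
Picture it as a surface in a multi-dimensional space:  
the vertical axis y is J(w),  
the axes $ x_1, x_2, x_3, \dots $ are the components of w.

From multivariable calculus, repeatedly replacing w with $ w - \alpha \frac{d\ J(w)}{d\ w} $ moves step by step toward the minimum of J(w).  
Here $ \alpha $ is the `learning rate`.  
Writing $ \frac{d\ J(w)}{d\ w} $ as `dw` in code, the pseudo-code below shows how gradient descent reaches the best solution: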

```code
Repeat {
  w := w - alpha * dw
}
```

In terms of $ J(w,b) $, the updates are:

$$
\begin{align}
w & := w - \alpha \times \frac{\partial \ J(w, b)}{\partial w} \\
b & := b - \alpha \times \frac{\partial \ J(w, b)}{\partial b}
\end{align}
$$
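The update rules can be tried on a toy convex function before applying them to the logistic cost. The function `J` below is an assumption chosen so that the minimum (w = 3, b = -1) is known in advance; it is not the logistic regression cost, and the learning rate is an arbitrary illustrative value.

```python
def J(w, b):
    """Toy convex cost with its minimum at w = 3, b = -1 (not the logistic cost)."""
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def gradients(w, b):
    """Partial derivatives of the toy cost with respect to w and b."""
    dw = 2.0 * (w - 3.0)
    db = 2.0 * (b + 1.0)
    return dw, db

alpha = 0.1      # learning rate (hypothetical value)
w, b = 0.0, 0.0  # starting point
for step in range(200):
    dw, db = gradients(w, b)
    w = w - alpha * dw
    b = b - alpha * db

print(w, b, J(w, b))
```

Each step shrinks the distance to the minimum by a constant factor here, so w and b end up very close to 3 and -1.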

- Computation Graph
- Derivatives
- Chain Rule

$$
\begin{align}
& \text{Inputs: } a, b, c \\
u & = b \times c \\
v & = a + u \\
J & = 3v
\end{align}
$$

Chain Rule:

$$
\frac{d\ J}{d a} = \frac{d\ J}{d v} \ \frac{d\ v}{d a}
$$
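A minimal sketch of this computation graph, with input values picked only for illustration. The forward pass computes J from left to right, the backward pass applies the chain rule from right to left, and a central finite difference checks one of the results.

```python
def forward(a, b, c):
    """Forward pass through the graph: u = b*c, v = a + u, J = 3*v."""
    u = b * c
    v = a + u
    return 3 * v

a, b, c = 5.0, 3.0, 2.0  # hypothetical inputs
J = forward(a, b, c)

# Backward pass: local derivatives multiplied along each path (chain rule).
dJ_dv = 3.0
dv_da = 1.0
dv_du = 1.0
du_db = c
du_dc = b
dJ_da = dJ_dv * dv_da
dJ_db = dJ_dv * dv_du * du_db
dJ_dc = dJ_dv * dv_du * du_dc

# Numerical check of dJ/da with a central difference.
eps = 1e-6
num_dJ_da = (forward(a + eps, b, c) - forward(a - eps, b, c)) / (2 * eps)
print(J, dJ_da, dJ_db, dJ_dc, num_dJ_da)
```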

### Computation Graph

An example with 2 features: $ x_1, x_2 $

$$
\begin{bmatrix}
x_1 \\ w_1 \\ x_2 \\ w_2 \\ b
\end{bmatrix} \Rightarrow 
\begin{bmatrix}
z = w_1 x_1 + w_2 x_2 + b
\end{bmatrix} \Rightarrow 
\begin{bmatrix}
a = \sigma\big( z \big)
\end{bmatrix} \Rightarrow
\begin{bmatrix}
\mathcal{L} \big( a, y \big)
\end{bmatrix}
$$

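Starting from the Loss Function L on the far right and working backward with derivatives, we obtain the effect of each intermediate variable (z, a) on the final loss.

From L back to a: "da"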

$$
\text{"da"} = 
\frac{d}{da} \mathcal{L} \big( a,y \big) = 
- \frac{y}{a} + \frac{1-y}{1-a}
$$

From L back to z: "dz"

$$
\begin{align}
\text{"dz"} = \frac{d}{dz} \mathcal{L}(a,y)
& = \frac{da}{dz} \times \frac{d}{da} \mathcal{L}(a,y) \\
& = \frac{d}{dz} \sigma(z) \times \frac{d}{da} \mathcal{L}(a,y) \\
& = \Big( a(1-a) \Big) \times \Big( - \frac{y}{a} + \frac{1-y}{1-a} \Big) \\
& = -y(1-a) + a(1-y) = a - y
\end{align}
$$

From L back to w1: "dw1"

$$
\begin{align}
\text{"dw1"} = \frac{d}{d w_1} \mathcal{L}(a,y)
& = \frac{dz}{d w_1} \times \frac{d}{dz} \mathcal{L}(a,y) \\
& = \frac{d}{d w_1} \big( w_1 x_1 + w_2 x_2 + b \big) \ \times \ (a-y) \\
& = x_1 \times (a-y)
\end{align}
$$

From L back to b: "db"

$$
\text{"db"} = \frac{dz}{db} \times \frac{d}{dz} \mathcal{L}(a,y) = (a-y)
$$
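The derivatives for a single example with 2 features can be checked numerically. In the sketch below the parameter and input values are hypothetical; `backward` uses the closed forms derived above (dz = a - y, dw1 = x1 * dz, dw2 = x2 * dz, db = dz), and central finite differences on the loss confirm them.

```python
import math

def forward_loss(w1, w2, b, x1, x2, y):
    """Forward pass: z -> a -> L for one example. Returns (a, L)."""
    z = w1 * x1 + w2 * x2 + b
    a = 1.0 / (1.0 + math.exp(-z))
    L = -(y * math.log(a) + (1 - y) * math.log(1 - a))
    return a, L

def backward(a, x1, x2, y):
    """Backward pass using the closed-form derivatives."""
    dz = a - y
    return x1 * dz, x2 * dz, dz  # dw1, dw2, db

# Hypothetical values for illustration.
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 1.5, 2.0, 1

a, L = forward_loss(w1, w2, b, x1, x2, y)
dw1, dw2, db = backward(a, x1, x2, y)

# Central finite differences on the loss as an independent check.
eps = 1e-6
num_dw1 = (forward_loss(w1 + eps, w2, b, x1, x2, y)[1]
           - forward_loss(w1 - eps, w2, b, x1, x2, y)[1]) / (2 * eps)
num_dw2 = (forward_loss(w1, w2 + eps, b, x1, x2, y)[1]
           - forward_loss(w1, w2 - eps, b, x1, x2, y)[1]) / (2 * eps)
num_db = (forward_loss(w1, w2, b + eps, x1, x2, y)[1]
          - forward_loss(w1, w2, b - eps, x1, x2, y)[1]) / (2 * eps)
print(dw1, num_dw1, dw2, num_dw2, db, num_db)
```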

## Forward and Backward propagation

Forward Propagation:
- You get X
- You compute $A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(m-1)}, a^{(m)})$
- You calculate the cost function: $J = -\frac{1}{m}\sum_{i=1}^{m} \Big[ y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)}) \Big]$

Backward Propagation: here are the two formulas you will be using:

$$ \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T$$
$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})$$

gradient descent (for each parameter $ \theta \in \{ w, b \} $): $ \theta := \theta - \alpha \text{ } d\theta$

prediction: $\hat{Y} = A = \sigma(w^T X + b)$
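Putting the pieces together, the following is a minimal sketch of forward propagation, backward propagation, gradient descent, and prediction in plain Python. The function names (`propagate`, `optimize`, `predict`), the 0.5 threshold, the hyperparameters, and the toy dataset are all assumptions made for illustration. Examples are stored as rows `X[i]` rather than as columns of a matrix, so the gradient for `w` is written as an explicit sum instead of $ \frac{1}{m}X(A-Y)^T $.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def activation(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def propagate(w, b, X, Y):
    """Forward pass (activations, cost) and backward pass (dw, db)."""
    m = len(X)
    A = [activation(w, b, x) for x in X]
    cost = -sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                for a, y in zip(A, Y)) / m
    dw = [sum((a - y) * x[j] for a, x, y in zip(A, X, Y)) / m
          for j in range(len(w))]
    db = sum(a - y for a, y in zip(A, Y)) / m
    return cost, dw, db

def optimize(w, b, X, Y, alpha=0.1, num_iterations=1000):
    """Plain gradient descent; returns the learned w, b and the cost history."""
    costs = []
    for _ in range(num_iterations):
        cost, dw, db = propagate(w, b, X, Y)
        w = [wj - alpha * dwj for wj, dwj in zip(w, dw)]
        b = b - alpha * db
        costs.append(cost)
    return w, b, costs

def predict(w, b, X, threshold=0.5):
    """Label an example 1 when its activation exceeds the threshold."""
    return [1 if activation(w, b, x) > threshold else 0 for x in X]

# Hypothetical, linearly separable toy dataset.
X = [[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]]
Y = [0, 0, 1, 1]

w, b, costs = optimize([0.0, 0.0], 0.0, X, Y)
predictions = predict(w, b, X)
print(costs[0], costs[-1], predictions)
```

On this toy data the cost falls from $ \log 2 $ at the all-zero starting point, and the learned parameters classify all four training examples correctly.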