## Logistic Regression

**Soft** Binary Classification:

We want to know the probability that the event (+1) occurs, a value between 0 and 1.

$ f(x) = P(+1 \ \big| \ x) \in [0,1] $

$$
y = \frac{1}{1+\exp(-x)} = \frac{ e^x }{1 + e^x } = \theta( x ) = 1 - \theta(-x)
$$

![img](imgs/c10-logistic-func-graph.png)
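As a quick numerical check of the identities above, here is a minimal sketch of the logistic function in plain Python. The function name `theta` and the sample inputs are illustrative choices, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math

def theta(s: float) -> float:
    """Logistic function 1 / (1 + exp(-s)), split by sign to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)          # s < 0, so exp(s) cannot overflow
    return e / (1.0 + e)

# Compare theta(s) with 1 - theta(-s) at a few sample points.
for s in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"s={s:+.1f}  theta(s)={theta(s):.4f}  1-theta(-s)={1 - theta(-s):.4f}")
```

The output stays inside [0, 1] and the two columns agree, which is the symmetry $ \theta(x) = 1 - \theta(-x) $.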

Ideally, the data would list the probabilities themselves (noiseless data):

$$
\begin{align}
x_1, y'_1 & = 0.9 & = P(+1 \ \big| \ x_1) \\
x_2, y'_2 & = 0.2 & = P(+1 \ \big| \ x_2) \\
\cdots \\
x_N, y'_N & = 0.6 & = P(+1 \ \big| \ x_N)
\end{align}
$$

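In practice, the data do not reveal the probability, only whether the event happened (1) or not (0, encoded as $ -1 $ in the derivation that follows);  
the labels behave like noisy samples of the probability (noisy data):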

$$
\begin{align}
x_1, y'_1 & = 1 \\
x_2, y'_2 & = 0 \\
\cdots \\
x_N, y'_N & = 0
\end{align}
$$
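The relation between the two kinds of data can be sketched by simulation. The probabilities below reuse the example values 0.9, 0.2, 0.6; the seed and variable names are assumptions made for illustration only.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical target probabilities P(+1 | x_n); in practice these are never observed.
true_prob = [0.9, 0.2, 0.6]

# What is observed instead: a single 0/1 outcome drawn with that probability.
observed = [1 if rng.random() < p else 0 for p in true_prob]
print(observed)
```

Each observed label is one noisy sample of the underlying probability, which is why the learner has to recover $ f(x) $ from outcomes alone.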

Features $ x = (x_0, x_1, x_2, \cdots, x_d) $

Compute a weighted risk score $ s $:

$ s = \sum_{i=0}^d w_i x_i $

Use the logistic function $ \theta $ to convert the score $ s $ into a probability between 0 and 1:

$$ h(x) = \theta(w^T x) = \frac{1}{1 + \exp(- w^T x)} $$
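A minimal sketch of this hypothesis in plain Python follows. The weights and the feature vector are made-up values, with $ x_0 = 1 $ assumed as the constant feature.

```python
import math

def theta(s):
    """Logistic function 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def h(w, x):
    """Logistic hypothesis: risk score s = w^T x passed through theta."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return theta(s)

w = [-1.0, 0.5, 2.0]   # hypothetical weights (w_0, w_1, w_2)
x = [1.0, 2.0, 0.5]    # features, with x_0 = 1 as the constant term
print(h(w, x))         # the score here is s = -1 + 1 + 1 = 1
```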

### Error Measure $ E_{in}(w) $

target function: $ f(x) = P( +1 \ \big| \ x) $  
This can be rewritten as:

$ P(\ y\ \big| \ x\ ) = 
\begin{cases}
f(x) & \text{ for } y = +1 \\
1 - f(x) & \text{ for } y = -1 \\
\end{cases}
$

Given a data set $ D = \big\{ (x_1, +1), (x_2, -1), \cdots, (x_N, -1) \big\} $,  
the probability that $ D $ is generated is:

$$
\begin{align}
& P(x_1) \ P(+1 \ | \ x_1) \times P(x_2) \ P(-1 \ | \ x_2) \times \cdots \times P(x_N) \ P(-1 \ | \ x_N) \\
= \ & P(x_1) \ f(x_1) \times P(x_2) \ (1-f(x_2)) \times \cdots \times P(x_N) \ (1-f(x_N))
\end{align}
$$

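If a hypothesis h is close to the target function f, the likelihood of $ D $ under h should be close to its probability under f, which is usually large. So choose as g the hypothesis that maximizes this likelihood.  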

$ g = \text{argmax}_h \ \ \text{likelihood}(h) $

Symmetry of the logistic function: $ 1 - h(x) = h(-x) $

$ P(x_1) \ h(x_1) \ \times \ P(x_2) \ (1-h(x_2)) \ \cdots \ \times \ P(x_N) \ (1-h(x_N)) $

$ = P(x_1) \ h(+x_1) \ \times \ P(x_2) \ (h(-x_2)) \ \cdots \ \times \ P(x_N) \ h(-x_N) $

$ \varpropto \prod_{n=1}^N h(y_n x_n) $
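The step from the case-by-case product to $ \prod_n h(y_n x_n) $ can be checked numerically. The sketch below uses made-up weights and three made-up points with labels in $ \{+1, -1\} $; the factors $ P(x_n) $ are dropped because they do not depend on h.

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical weights and a tiny data set.
w = [0.5, -1.0]
X = [[1.0, -2.0], [1.0, 1.5], [1.0, 3.0]]
Y = [+1, -1, -1]

by_cases = 1.0   # h(x_n) when y_n = +1, and 1 - h(x_n) when y_n = -1
compact = 1.0    # theta(y_n w^T x_n), using 1 - theta(s) = theta(-s)
for x, y in zip(X, Y):
    hx = theta(dot(w, x))
    by_cases *= hx if y == +1 else 1.0 - hx
    compact *= theta(y * dot(w, x))

print(by_cases, compact)
```

Both products agree up to rounding, so the likelihood can be written with a single factor per example.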


We want to maximize the probability that h generates the data set $ D $:
$$ \max_h \text{likelihood(h)} \varpropto \prod_{n=1}^N h(y_n x_n) = $$

$$ \max_w \text{likelihood(w)} \varpropto \prod_{n=1}^N \theta \big(y_n w^T x_n \big) \to $$

$$ \max_w \ln \prod_{n=1}^N \theta \big(y_n w^T x_n \big) = $$

$$ \max_w \sum_{n=1}^N \ln \theta \big(y_n w^T x_n \big) \to $$

$$ \min_w \sum_{n=1}^N - \ln \theta \big(y_n w^T x_n \big) \to $$

$$ \min_w \frac{1}{N} \sum_{n=1}^N - \ln \theta \big(y_n w^T x_n \big) \to $$

$$ \min_w \frac{1}{N} \sum_{n=1}^N \ln \big(1 + \exp(-y_n w^T x_n) \big) \to $$

$$ \min_w \underbrace{ \frac{1}{N} \sum_{n=1}^N \text{err} \big(w, x_n, y_n \big)}_{E_{in}(w)} $$

### Cross-Entropy Error

$$ \text{err}(w,x,y) = \ln \big( 1 + \exp(-y w^T x) \big) $$
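The sketch below evaluates the cross-entropy error and $ E_{in}(w) $ on a small made-up data set, and compares the result with the negative mean log-likelihood $ -\frac{1}{N} \sum_n \ln \theta(y_n w^T x_n) $. Function names and data are illustrative assumptions; the `log1p` form is one common way to keep the logarithm numerically stable.

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def err(w, x, y):
    """Cross-entropy error ln(1 + exp(-y w^T x)) for one example."""
    z = -y * dot(w, x)
    # ln(1 + e^z) = max(z, 0) + ln(1 + e^{-|z|}), which never overflows
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def e_in(w, X, Y):
    """Average cross-entropy error over the data set."""
    return sum(err(w, x, y) for x, y in zip(X, Y)) / len(X)

X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]   # made-up inputs, x_0 = 1
Y = [+1, -1, -1]                            # labels in {+1, -1}
w = [0.0, 1.0]                              # made-up weights

nll = -sum(math.log(theta(y * dot(w, x))) for x, y in zip(X, Y)) / len(X)
print(e_in(w, X, Y), nll)
```

At $ w = 0 $ every example contributes $ \ln 2 $, which is a handy baseline when reading $ E_{in} $ values.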

### Minimizing $ E_{in}(w) $ - Gradient

$ E_{in}(w) $ is Continuous, Differentiable, Twice-Differentiable, Convex.

Because $ E_{in} $ is convex, its minimum is at the $ w $ where $ \nabla E_{in}(w) = 0 $.

To derive $ \nabla E_{in}(w) $, use the chain rule:

$$ E_{in}(w) = \frac{1}{N} \sum_{n=1}^N \ln \Big( \underbrace{1 + \exp(\overbrace{-y_n w^T x_n}^{b})}_{a} \Big) $$

$$ 
\frac{\partial E_{in}(w)}{\partial w_i} =
\frac{1}{N} \sum_{n=1}^{N} \Big( \frac{\partial \ln(a)}{\partial a} \Big) \ 
\Big( \frac{\partial(1+\exp(b))}{\partial b} \Big) \ 
\Big( \frac{\partial(- y_n w^T x_n)}{\partial w_i} \Big)
$$

$$
= \frac{1}{N} \sum_{n=1}^{N} \Big( \frac{1}{a} \Big) \ 
\Big( \exp(b) \Big) \ 
\Big( -y_n x_{n,i} \Big)
$$

$$
= \frac{1}{N} \sum_{n=1}^{N} \Big( \frac{\exp(b)}{1 + \exp(b)} \Big) \ 
\Big( -y_n x_{n,i} \Big)
$$

$$
= \frac{1}{N} \sum_{n=1}^{N} \theta \Big( b \Big) \ 
\Big( -y_n x_{n,i} \Big)
$$


$$
\nabla E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \theta \Big( - y_n w^T x_n \Big) \ 
\Big( -y_n x_n \Big)
$$
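One way to gain confidence in this gradient formula is to compare it with a central finite difference of $ E_{in} $. The sketch below does that on made-up data; the helper names, the evaluation point and the step `eps` are assumptions.

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def e_in(w, X, Y):
    return sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in zip(X, Y)) / len(X)

def grad_e_in(w, X, Y):
    """(1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        coeff = theta(-y * dot(w, x))
        for i, xi in enumerate(x):
            g[i] += coeff * (-y * xi) / len(X)
    return g

X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]   # made-up inputs, x_0 = 1
Y = [+1, -1, -1]
w = [0.2, -0.3]                             # arbitrary evaluation point

analytic = grad_e_in(w, X, Y)

# Central difference approximation of each partial derivative.
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps
    w_minus = list(w)
    w_minus[i] -= eps
    numeric.append((e_in(w_plus, X, Y) - e_in(w_minus, X, Y)) / (2 * eps))

print(analytic, numeric)
```

The gradient is a $ \theta $-weighted average of the vectors $ -y_n x_n $, so examples with a large positive $ y_n w^T x_n $ contribute little.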

### PLA Revisited: Iterative Optimization

Pick some n, and update $ w_t $ by

$ w_{t+1} = w_t + \underbrace{1}_{\eta} \times \underbrace{\text{boolean} \Big( \text{sign}( w_t^T x_n ) \ne y_n  \Big) \cdot y_n x_n}_{v} $

When the iteration stops, return the last w as g.

The choice of $ \big( \eta, v \big) $ and of the stopping condition defines an **Iterative Optimization Approach**.
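The PLA update above, viewed as an iterative optimization with $ \eta = 1 $ and $ v = y_n x_n $ on a mistake, can be sketched as follows. The data set is made up and linearly separable; always picking the first misclassified point and treating $ \text{sign}(0) $ as $ -1 $ are assumptions of this sketch.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(s):
    return 1 if s > 0 else -1   # assumption: sign(0) counts as -1

def pla(X, Y, max_updates=1000):
    """Iterate w <- w + eta * v with eta = 1 and v = y_n x_n on a mistake."""
    w = [0.0] * len(X[0])
    for _ in range(max_updates):
        mistakes = [(x, y) for x, y in zip(X, Y) if sign(dot(w, x)) != y]
        if not mistakes:          # stopping condition: no misclassified point
            break
        x, y = mistakes[0]        # "pick some n": here, the first mistake
        w = [wi + 1.0 * y * xi for wi, xi in zip(w, x)]
    return w

# Made-up, linearly separable data with x_0 = 1.
X = [[1.0, 2.0, 2.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
Y = [+1, +1, -1, -1]
g = pla(X, Y)
print(g)
```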

### Linear Approximation

A greedy approach for some given $ \eta \gt 0 $

$$ \min_{\Vert v \Vert = 1} \ \ \ E_{in}(w_t + \eta v) $$

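A local linear approximation makes the problem easier.  
Handling all of $ E_{in} $ at once is hard, but over a short segment it is nearly linear: its value near $ w_t $ is the value at $ w_t $ plus a small step times the slope along $ v $.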

$$ E_{in}(w_t + \eta v) \approx E_{in}(w_t) + \eta v^T \nabla E_{in} (w_t) $$

This holds when $ \eta $ is small enough (first-order Taylor expansion).
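How good the linear approximation is for a given $ \eta $ can be checked numerically. The sketch below measures the gap between $ E_{in}(w_t + \eta v) $ and its first-order estimate for shrinking $ \eta $; the data, the point $ w_t $ and the unit direction $ v $ are made up.

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def e_in(w, X, Y):
    return sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in zip(X, Y)) / len(X)

def grad_e_in(w, X, Y):
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        coeff = theta(-y * dot(w, x))
        for i, xi in enumerate(x):
            g[i] += coeff * (-y * xi) / len(X)
    return g

X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]   # made-up inputs, x_0 = 1
Y = [+1, -1, -1]
w_t = [0.2, -0.3]                           # arbitrary current point
v = [0.6, 0.8]                              # a unit-length direction

grad = grad_e_in(w_t, X, Y)
gaps = {}
for eta in (0.1, 0.01, 0.001):
    exact = e_in([wi + eta * vi for wi, vi in zip(w_t, v)], X, Y)
    linear = e_in(w_t, X, Y) + eta * dot(v, grad)
    gaps[eta] = abs(exact - linear)
    print(f"eta={eta}: gap={gaps[eta]:.3e}")
```

The gap shrinks roughly with $ \eta^2 $, which is why the approximation is trusted only for small steps.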

$$
\min_{\Vert v \Vert = 1} \Big( \ \underbrace{ E_{in}(w_t) }_\text{known} + \underbrace{ \eta }_{\text{given positive}} v^T \underbrace{ \nabla E_{in} (w_t) }_\text{known} \Big)
$$

$$
\to \min_{\Vert v \Vert = 1} \Big( \ v^T \underbrace{ \nabla E_{in} (w_t) }_\text{known} \Big)
$$


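**Optimal v**: $ v^T \nabla E_{in}(w_t) $ is smallest when the unit vector $ v $ points in exactly the opposite direction to $ \nabla E_{in}(w_t) $, so the optimal choice is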

$$
v = - \frac{\nabla E_{in}(w_t)}{\Vert \nabla E_{in}(w_t) \Vert}
$$
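The claim that the negative normalized gradient minimizes $ v^T \nabla E_{in}(w_t) $ over unit vectors can be illustrated by brute force. The gradient vector below is a made-up stand-in, and random unit directions in 2D are sampled by angle.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

grad = [3.0, -4.0]                       # hypothetical gradient, norm 5
norm = math.sqrt(dot(grad, grad))
v_star = [-gi / norm for gi in grad]     # v = -grad / ||grad||
best = dot(v_star, grad)                 # equals -||grad||

# Try many random unit directions and record the smallest inner product found.
rng = random.Random(0)
smallest_random = float("inf")
for _ in range(1000):
    angle = rng.uniform(0.0, 2.0 * math.pi)
    v = [math.cos(angle), math.sin(angle)]
    smallest_random = min(smallest_random, dot(v, grad))

print(best, smallest_random)
```

No sampled direction does better than `v_star`, in line with the Cauchy-Schwarz bound $ v^T \nabla E_{in} \ge -\Vert \nabla E_{in} \Vert $ for $ \Vert v \Vert = 1 $.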

**Gradient descent**: for small $ \eta $ 

$$
w_{t+1} = w_t - \eta \frac{\nabla E_{in}(w_t)}{\Vert \nabla E_{in}(w_t) \Vert}
$$

**Gradient descent** is a simple and popular optimization tool.

### Choice of $ \eta $

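The step size $ \eta $ should be monotonically increasing in $ \Vert \nabla E_{in}(w_t) \Vert $: take larger steps where the slope is steep and smaller steps where it is flat.

Ideally $ \eta $ scales in proportion to the gradient norm, so define the ratio $ \eta'= \frac{\eta}{\Vert \nabla E_{in}(w_t) \Vert} $ and hold it fixed: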

$$
w_{t+1} = w_t - \eta \frac{\nabla E_{in}(w_t)}{\Vert \nabla E_{in}(w_t) \Vert}
$$

$$
w_{t+1} = w_t - \eta' \nabla E_{in}(w_t)
$$

$ \eta' $ is also called the fixed learning rate.

### Logistic Regression Algorithm

Initialize $ w_0 $,

For t = 0, 1, ...

#### STEP ONE:

Compute

$$
\nabla E_{in}(w_t) = \frac{1}{N} \sum_{n=1}^N \theta \Big( -y_n w_t^T x_n \Big) \big( -y_n x_n \big)
$$

#### STEP TWO:

Update by (here $ \eta $ denotes the fixed learning rate, written $ \eta' $ in the previous section)

$$
w_{t+1} = w_t - \eta \nabla E_{in}(w_t)
$$

Until $ \nabla E_{in}(w_{t+1}) \approx 0 $ or **enough iterations** have run,

return the last $ w_{t+1} $ as g.
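The two steps above can be put together in a minimal plain-Python sketch. The data set, the learning rate `eta=0.1`, the tolerance `tol` and the iteration cap are all assumptions chosen for illustration, not tuned values.

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def e_in(w, X, Y):
    return sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in zip(X, Y)) / len(X)

def grad_e_in(w, X, Y):
    """STEP ONE: (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        coeff = theta(-y * dot(w, x))
        for i, xi in enumerate(x):
            g[i] += coeff * (-y * xi) / len(X)
    return g

def logistic_regression(X, Y, eta=0.1, max_iters=10000, tol=1e-6):
    w = [0.0] * len(X[0])                     # initialize w_0
    for _ in range(max_iters):                # stop after enough iterations
        g = grad_e_in(w, X, Y)
        if math.sqrt(dot(g, g)) < tol:        # stop when the gradient is near 0
            break
        # STEP TWO: fixed learning rate update
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Made-up, non-separable 1-D data with x_0 = 1 and labels in {+1, -1}.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
Y = [-1, -1, +1, -1, +1, +1]

w = logistic_regression(X, Y)
print("w =", w)
print("E_in(w) =", e_in(w, X, Y))
print("P(+1 | x = 1.5) =", theta(dot(w, [1.0, 1.5])))
```

The data are deliberately not linearly separable, so $ E_{in} $ has a finite minimizer and the gradient-norm stopping condition can be reached.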