# Kernel Logistic Regression

For any $ (b, w) $, define the margin violation $ \xi_n = \max \big( 1 - y_n (w^T z_n + b), 0 \big) $:

- Violating the margin: $ \xi_n = 1 - y_n (w^T z_n + b) $
- Not violating the margin: $ \xi_n = 0 $

The original soft-margin SVM primal problem:

$
\min_{b,w,\xi} \frac{1}{2} w^T w + C \cdot \sum_{n=1}^N \xi_n \\
s.t. \ \ y_n \big( w^T z_n + b \big) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all n }
$

Substituting $ \xi_n $ turns this into an unconstrained problem:

$
\min_{b,w} \ \ \Big( \frac{1}{2} w^T w + C \cdot \sum_{n=1}^N \max \big( 1 - y_n (w^T z_n + b), 0 \big) \Big)
$
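As a minimal sketch of the unconstrained form above, the objective can be evaluated directly for a given $ (b, w) $. The function names and the toy data are hypothetical, chosen only for illustration.

```python
def margin_violation(w, b, z, y):
    """xi_n = max(1 - y_n (w^T z_n + b), 0) for a single example."""
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return max(1.0 - y * score, 0.0)


def soft_margin_objective(w, b, Z, Y, C):
    """Unconstrained soft-margin SVM objective: 0.5 * w^T w + C * sum_n xi_n."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_violation = sum(margin_violation(w, b, z, y) for z, y in zip(Z, Y))
    return regularizer + C * total_violation


# Hypothetical toy data: three points in 2-D with labels in {-1, +1}.
Z = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]
Y = [+1, +1, -1]
w, b = (1.0, 0.0), 0.0

# Only the second point lies inside the margin (y * score = 0.5 < 1).
violations = [margin_violation(w, b, z, y) for z, y in zip(Z, Y)]
objective = soft_margin_objective(w, b, Z, Y, C=10.0)
```

Here only the point with $ y_n (w^T z_n + b) = 0.5 $ contributes a non-zero $ \xi_n $; the other two sit on or outside the margin.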


### Regularization

Minimize the error while also limiting the length of $ w $.

$ \min \ \ \frac{1}{2} w^T w + C \sum \widehat{err} $

This has the same form as L2 regularization:

$ \min \ \ \frac{\lambda}{N} w^T w + \frac{1}{N} \sum err $


|                              | minimize | constraint |
|-|-|-|
| Regularization by constraint | $ E_{in} $ | $ w^T w \le C $ |
| Hard-Margin SVM              | $ w^T w $ | $ E_{in} = 0 $, and more... |
| L2 Regularization            | $ \frac{\lambda}{N} w^T w + E_{in} $ | |
| Soft-Margin SVM              | $ \frac{1}{2} w^T w + C_{soft} N \widehat{E_{in}} $ | ... |

Larger C = smaller $ \lambda $ = less regularization
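To make the correspondence concrete, divide the soft-margin objective by $ CN $, which rescales the objective without changing its minimizer:

$$
\frac{1}{CN} \Big( \frac{1}{2} w^T w + C \sum_{n=1}^N \widehat{err}_n \Big) = \frac{1}{2CN} w^T w + \frac{1}{N} \sum_{n=1}^N \widehat{err}_n
$$

Matching the first term with $ \frac{\lambda}{N} w^T w $ gives $ \lambda = \frac{1}{2C} $, so doubling $ C $ halves the effective $ \lambda $.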

Viewing SVM as a regularized model lets us extend the same idea to other learning models.

### Algorithmic Error Measure of SVM

<img src="imgs/c205-errx3.png" style="float:right;" />

Let the linear score be $ s = w^T z_n + b $.

$ err_{0/1} (s,y) = \big[ y s \le 0 \big]_{boolean} $

$ \widehat{err}_{SVM} (s,y) = \max\big( 1 - ys, 0 \big) $ : upper bound of $ err_{0/1} $  
also called **hinge error measure**

$ err_{SCE} (s,y) = \log_2 \big( 1 + \exp(-ys) \big) $ : upper bound of $ err_{0/1} $  
used in **logistic regression**


| $ ys \to -\infty $ | error measure | $ ys \to +\infty $ |
|-|-|-|
| $ \approx -ys $ | $ \widehat{err}_{SVM}(s,y) $ | $ = 0 $ |
| $ \approx -ys $ | $ (\ln 2) \cdot err_{SCE}(s,y) $ | $ \approx 0 $ |
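The three error measures and their limiting behavior can be checked numerically. The following is a small sketch; the function names are hypothetical, and the grid of $ ys $ values is an arbitrary choice.

```python
import math


def err_01(s, y):
    """0/1 error: 1 if the sign of the score disagrees with y (ys <= 0)."""
    return 1.0 if y * s <= 0 else 0.0


def err_hinge(s, y):
    """Hinge error used by soft-margin SVM: max(1 - ys, 0)."""
    return max(1.0 - y * s, 0.0)


def err_sce(s, y):
    """Scaled cross-entropy error used by logistic regression: log2(1 + exp(-ys))."""
    return math.log2(1.0 + math.exp(-y * s))


# Check the upper-bound claims on a grid of ys values in [-5, 5] (y fixed to +1).
grid = [k / 10.0 for k in range(-50, 51)]
hinge_bounds = all(err_hinge(s, 1) >= err_01(s, 1) for s in grid)
sce_bounds = all(err_sce(s, 1) >= err_01(s, 1) for s in grid)

# Far to the left, hinge and (ln 2) * SCE both behave like -ys.
s_far = -30.0
hinge_far = err_hinge(s_far, 1)            # exactly 1 - ys
sce_far = math.log(2) * err_sce(s_far, 1)  # approximately -ys
```

At $ ys = -30 $ the hinge error is $ 1 - ys = 31 $ and $ (\ln 2) \cdot err_{SCE} $ is within rounding of $ 30 $, so both grow linearly in $ -ys $.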

From this perspective, regularized logistic regression is almost the same as soft-margin SVM.

Conversely, can soft-margin SVM be treated as an approximate regularized logistic regression?

### SVM for Soft Binary Classification

#### Naïve Idea 1:

1. run SVM and get $ (b_{SVM}, w_{SVM}) $

2. return $ g(x) = \theta ( w_{SVM}^T x + b_{SVM} ) $

Directly uses the similarity between SVM and LogReg; works reasonably well.

No LogReg flavor (no maximum-likelihood fitting).

#### Naïve Idea 2:

1. run SVM and get $ (b_{SVM}, w_{SVM}) $

2. run LogReg with $ (b_{SVM}, w_{SVM}) $ as the initial point $ w_0 $

3. return LogReg solution as $ g(x) $

Not really 'easier' than original LogReg.

SVM flavor (kernel) lost?

Can we blend the two approaches and get the flavors of both?

### Two-Level Learning

$$
g(x) = \theta \Big( A \cdot \big( w_{SVM}^T \Phi(x) + b_{SVM} \big) + B \Big)
$$

- SVM flavor: fix hyperplane direction by $ w_{SVM} $ - kernel applies.
- LogReg flavor: fine-tune hyperplane to match maximum likelihood by scaling (A) and shifting (B).

- often $ A \gt 0 $ if $ w_{SVM} $ reasonably good.
- often $ B \approx 0 $ if $ b_{SVM} $ reasonably good.

### New LogReg Problem

$$
\min_{A,B} \frac{1}{N} \sum_{n=1}^{N} \log \Big( \ \ 1 + \exp \Big( -y_n \big( A \cdot \underbrace{( w_{SVM}^T \Phi(x_n) + b_{SVM} )}_{\Phi_{SVM}(x_n)} + B \big) \Big) \ \ \Big)
$$

### Probabilistic SVM - Platt's Model of Probabilistic SVM for Soft Binary Classification

STEP 1: run SVM on D to get $ (b_{SVM}, w_{SVM}) $ or the equivalent $ \alpha $,  
and transform D to $ z_n' = w_{SVM}^T \Phi(x_n) + b_{SVM} $

STEP 2: run LogReg on $ \big\{ \big( z_n', y_n \big) \big\}_{n=1}^N $ to get (A,B)

STEP 3: return $ g(x) = \theta \Big( A \cdot \big( w_{SVM}^T \Phi(x) + b_{SVM} \big) + B \Big) $

NOTE: the actual Platt's model is somewhat more complicated than this.

- The soft binary classifier does not necessarily share the SVM classifier's boundary, because of $ B $.
- How to solve the LogReg step: GD / SGD / or anything else (there are only 2 variables, $ A $ and $ B $).
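STEP 2 and STEP 3 can be sketched as a two-variable gradient descent over precomputed SVM scores. This is a simplified illustration, not the full Platt procedure: the function names, the learning rate, the step count, and the toy scores are all assumptions, and the SVM itself (STEP 1) is taken as already done.

```python
import math


def sigmoid(s):
    """Logistic function theta(s) = 1 / (1 + exp(-s)), written to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def fit_scale_shift(scores, labels, lr=0.1, steps=2000):
    """Fit (A, B) by gradient descent on the mean logistic loss of A * score + B."""
    A, B = 1.0, 0.0  # start from the raw SVM score
    n = len(scores)
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for z, y in zip(scores, labels):
            # d/du log(1 + exp(-y u)) = -y * theta(-y u), with u = A z + B
            g = -y * sigmoid(-y * (A * z + B))
            grad_A += g * z
            grad_B += g
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B


# Hypothetical SVM scores z'_n = w_SVM^T Phi(x_n) + b_SVM and their labels.
# The two points nearest zero are on the wrong side, so the data is not separable.
svm_scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [-1, -1, -1, +1, -1, +1, +1, +1]

A, B = fit_scale_shift(svm_scores, labels)
prob_positive = sigmoid(A * 1.5 + B)   # estimated P(y = +1) at score 1.5
prob_negative = sigmoid(A * -1.5 + B)  # estimated P(y = +1) at score -1.5
```

On this symmetric toy set the fit ends with $ A > 0 $ and $ B \approx 0 $, matching the expectation for a reasonably good $ (b_{SVM}, w_{SVM}) $. Because only $ A $ and $ B $ are learned, the cost of this step does not depend on the dimension of $ \Phi(x) $.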

kernel SVM = approx. LogReg in Z-space.

Can the optimal $ w_* $ be represented as a linear combination of the $ z_n $?

## Representer Theorem

claim: for any L2-regularized linear model

$$
\min_w \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N err \big( y_n, w^T z_n \big)
$$

optimal $ w_{*} = \sum_{n=1}^N \beta_n z_n $
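A short proof sketch, assuming $ \lambda > 0 $: write the optimal solution as $ w_* = w_\parallel + w_\perp $, where $ w_\parallel $ lies in the span of the $ z_n $ and $ w_\perp $ is orthogonal to that span, and suppose $ w_\perp \ne 0 $. The error term cannot tell $ w_* $ and $ w_\parallel $ apart:

$$
w_*^T z_n = ( w_\parallel + w_\perp )^T z_n = w_\parallel^T z_n
$$

but the regularizer is strictly larger for $ w_* $:

$$
w_*^T w_* = w_\parallel^T w_\parallel + 2 w_\parallel^T w_\perp + w_\perp^T w_\perp = w_\parallel^T w_\parallel + w_\perp^T w_\perp \gt w_\parallel^T w_\parallel
$$

So $ w_\parallel $ would achieve a strictly smaller objective than $ w_* $, a contradiction. Hence $ w_\perp = 0 $ and $ w_* $ lies in the span of the $ z_n $.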

#### Any L2-regularized linear model can be kernelized.

## Kernel Logistic Regression

solving L2-regularized logistic regression

$$
\min_w \ \ \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N \log \Big( 1 + \exp \big( - y_n w^T z_n \big) \Big)
$$

yields optimal solution 

$$
w_* = \sum_{n=1}^N \beta_n z_n
$$

Without loss of generality, we can solve for the optimal $ \beta $ instead of $ w $. Substituting $ w = \sum_{n=1}^N \beta_n z_n $ and using $ z_n^T z_m = K(x_n, x_m) $:

$$
\min_\beta \ \ \frac{\lambda}{N} \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m) + \frac{1}{N} \sum_{n=1}^N \log \Big( 1 + \exp \big( - y_n \sum_{m=1}^N \beta_m K(x_m, x_n) \big) \Big)
$$

Then apply GD / SGD / ... , since this is an unconstrained optimization problem.
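A minimal sketch of solving the $ \beta $ problem with plain gradient descent follows. The Gaussian kernel, the values of `lam`, `lr`, and `steps`, the XOR-style toy data, and all function names are assumptions made for illustration; any valid kernel could be swapped in.

```python
import math


def gaussian_kernel(x, x2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2); the kernel choice is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x2)))


def sigmoid(s):
    """Logistic function theta(s), written to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def klr_objective(beta, K, Y, lam):
    """(lam / N) * beta^T K beta + (1 / N) * sum_n log(1 + exp(-y_n s_n))."""
    N = len(Y)
    scores = [sum(beta[m] * K[m][n] for m in range(N)) for n in range(N)]
    reg = sum(beta[n] * beta[m] * K[n][m] for n in range(N) for m in range(N))
    loss = sum(math.log(1.0 + math.exp(-y * s)) for y, s in zip(Y, scores))
    return lam / N * reg + loss / N


def fit_klr(X, Y, lam=0.01, lr=0.5, steps=500, kernel=gaussian_kernel):
    """Gradient descent on beta, starting from beta = 0."""
    N = len(X)
    K = [[kernel(X[n], X[m]) for m in range(N)] for n in range(N)]
    beta = [0.0] * N
    for _ in range(steps):
        # s_n = sum_m beta_m K(x_m, x_n)
        scores = [sum(beta[m] * K[m][n] for m in range(N)) for n in range(N)]
        # Derivative of the n-th loss term with respect to s_n.
        g = [-Y[n] * sigmoid(-Y[n] * scores[n]) for n in range(N)]
        grad = [
            (2.0 * lam / N) * scores[m]  # K is symmetric, so (K beta)_m = s_m
            + sum(g[n] * K[m][n] for n in range(N)) / N
            for m in range(N)
        ]
        beta = [b - lr * d for b, d in zip(beta, grad)]
    return beta, K


def predict_proba(x, X, beta, kernel=gaussian_kernel):
    """Estimated P(y = +1 | x) = theta(sum_m beta_m K(x_m, x))."""
    return sigmoid(sum(b * kernel(xm, x) for b, xm in zip(beta, X)))


# Hypothetical XOR-style data, which is not linearly separable in the input space.
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
Y = [+1, +1, -1, -1]
lam = 0.01
beta, K = fit_klr(X, Y, lam=lam)
probs = [predict_proba(x, X, beta) for x in X]
```

On this toy set every entry of `beta` ends up non-zero, and each training point gets a probability on the correct side of 0.5. The sketch recomputes full sums on every step, so it costs $ O(N^2) $ per iteration and is meant only for small $ N $.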

### Kernel Logistic Regression (KLR) : Another View

$ \sum_{m=1}^N \beta_m K(x_m, x_n) $ : inner product between $ \beta $ and transformed data
$ \big( K(x_1, x_n), K(x_2, x_n), \cdots , K(x_N, x_n) \big) $

$ \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m) $ : a special regularizer $ \beta^T K \beta $

KLR = linear model of $ \beta $ with kernel as transform and kernel regularizer.

KLR = linear model of w with embedded-in-kernel transform and L2 regularizer

WARNING: unlike the coefficients $ \alpha_n $ in SVM, the coefficients $ \beta_n $ in KLR are often non-zero, so the solution is dense.