# Radial Basis Function Network

Gaussian kernel: $ K(x, x') = \exp \big( - \gamma \Vert x - x' \Vert^2 \big), \text{ with } \gamma > 0 $

$$
\begin{align}
g_{SVM}(x) & = sign \Big( \sum_{SV} \alpha_n y_n K(x_n, x) + b \Big) \\
           & = sign \Big( \sum_{SV} \alpha_n y_n \exp \big( - \gamma \Vert x - x_n \Vert^2 \big) + b \Big)
\end{align}
$$

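The Gaussian SVM can be viewed as a linear combination of Gaussians centered at the SVs $ x_n $, with coefficients $ \alpha_n y_n $.

- radial: the function depends only on the distance between x and the center $ x_n $
- basis function: the functions to be linearly combined, here with coefficients $ \alpha_n y_n $

Given a point x, each SV looks at the distance between x and its center $ x_n $, then casts its vote $ y_n $ weighted by that distance: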

$ g_n(x) = y_n \exp \big( - \gamma \big\Vert x - x_n \big\Vert^2 \big) $

Then combine the results of all SVs:

$$
g_{SVM}(x) = sign \Big( \sum_{SV} \alpha_n g_n(x) + b \Big)
$$

The result is a linear aggregation of selected (SV) radial hypotheses $ g_n(x) $.

### RBF Network hypothesis

<img src="imgs/c214-rbf-network-hypo.png" style="float:right;width:350px;" >

$
h(x) = Output \Big( \sum_{m=1}^M \beta_m \ RBF(x, \mu_m) + b \Big)
$

centers: $ \mu_m : $ SVM SVs $ x_m $

(signed) votes: $ \beta_m : \alpha_m y_m $ from SVM Dual 

RBF: Gaussian

Output: sign (binary classification)

M = # of SVs

**Learning**: given RBF and Output, decide: $ \mu_m, \beta_m $
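The hypothesis above can be sketched in a few lines, assuming a Gaussian RBF and sign output. The names `gaussian_rbf` and `rbf_network`, the toy centers, and the weights are illustrative choices, not part of the lecture.

```python
import math

def gaussian_rbf(x, mu, gamma):
    """Gaussian RBF: exp(-gamma * ||x - mu||^2)."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-gamma * sq_dist)

def rbf_network(x, centers, betas, b=0.0, gamma=1.0):
    """h(x) = sign(sum_m beta_m * RBF(x, mu_m) + b), for binary classification."""
    score = b + sum(beta * gaussian_rbf(x, mu, gamma)
                    for beta, mu in zip(betas, centers))
    return 1 if score >= 0 else -1

# Two hypothetical centers with signed votes +1 and -1.
centers = [(0.0, 0.0), (2.0, 2.0)]
betas = [1.0, -1.0]

near_first = rbf_network((0.1, 0.2), centers, betas)   # close to the +1 center
near_second = rbf_network((1.9, 2.1), centers, betas)  # close to the -1 center
print(near_first, near_second)
```

Learning then amounts to choosing `centers` and `betas`; the remaining sections discuss different ways to do that.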

### RBF and Similarity

kernel: similarity via Z-space inner product

- governed by Mercer's condition.

$ Poly(x, x') = (1 + x^T x')^2 $

$ Gaussian(x, x') = \exp \big( - \gamma \Vert x - x' \Vert^2 \big) $

RBF: similarity via X-space distance

general similarity function between x and x':

Neuron(x, x') = $ tanh( \gamma x^T x' + 1) $

DNASim(x, x') = $ EditDistance(x, x') $

RBF Network: distance-based similarity to centers as feature transform.

### Full RBF Network

$$
h(x) = Output \Big( \sum_{m=1}^M \beta_m \ RBF(x, \mu_m) + b \Big)
$$

full RBF Network: M=N and each $ \mu_m = x_m $

physical meaning: each $ x_m $ influences similar x by $ \beta_m $

e.g. uniform influence with $ \beta_m = 1 \cdot y_m $, for binary classification.

$$
g_{uniform}(x) = sign \Big( \sum_{m=1}^N y_m \exp \big( - \gamma \Vert x - x_m \Vert^2 \big) \Big)
$$

- aggregate each example's opinion subject to similarity.

full RBF Network: lazy way to decide $ \mu_m $
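A minimal sketch of $ g_{uniform} $, assuming labels in {+1, -1} and a made-up 1-D data set; the function name `g_uniform` and the data are illustrative only.

```python
import math

def g_uniform(x, X, y, gamma=1.0):
    """Every training example votes y_m, weighted by its Gaussian similarity to x."""
    total = 0.0
    for x_m, y_m in zip(X, y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_m))
        total += y_m * math.exp(-gamma * sq_dist)
    return 1 if total >= 0 else -1

# Hypothetical 1-D examples: two positives on the left, two negatives on the right.
X = [(0.0,), (1.0,), (4.0,), (5.0,)]
y = [1, 1, -1, -1]

left = g_uniform((0.5,), X, y)    # dominated by the nearby positive examples
right = g_uniform((4.5,), X, y)   # dominated by the nearby negative examples
print(left, right)
```

No training step is needed: the centers are the data themselves, which is why this is called the lazy way.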

### Nearest Neighbor

$$
g_{uniform}(x) = sign \Big( \sum_{m=1}^N y_m \exp \big( - \gamma \Vert x - x_m \Vert^2 \big) \Big)
$$

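$ \exp \big( - \gamma \Vert x - x_m \Vert^2 \big) $ : largest when x is closest to $ x_m $, and it decays quickly with distance, so  
the maximum one often dominates the $ \sum_{m=1}^N $ term.

Use only the $ y_m $ of the nearest point (the one with maximum exp(...)) instead of letting all points vote:  
selection instead of aggregation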

physical meaning:

$ g_{nbor} \big( x \big) = y_m $, such that x closest to $ x_m $

called nearest neighbor model.

can uniformly aggregate k neighbors also: **k nearest neighbor**

k nearest neighbor: also lazy but intuitive.
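A small sketch of the (k) nearest neighbor model, assuming labels in {+1, -1} and an odd k so that the uniform vote cannot tie. The name `knn_predict` and the data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_predict(x, X, y, k=1):
    """Uniformly aggregate the labels of the k training points closest to x."""
    nearest = sorted(range(len(X)), key=lambda m: sq_dist(x, X[m]))[:k]
    vote = sum(y[m] for m in nearest)
    return 1 if vote >= 0 else -1

# Hypothetical 1-D data.
X = [(0.0,), (1.0,), (2.0,), (5.0,), (6.0,)]
y = [1, 1, -1, -1, -1]

one_nn = knn_predict((1.8,), X, y, k=1)    # selection: only the closest point (2.0) decides
three_nn = knn_predict((1.8,), X, y, k=3)  # aggregation over the points 2.0, 1.0, 0.0
print(one_nn, three_nn)
```

With k=1 the single closest example decides; with k=3 its two positive neighbors outvote it, which illustrates the difference between selection and aggregation.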

### Interpolation by Full RBF Network

full RBF Network for squared error regression:

$$
h(x) = Output \Big( \sum_{m=1}^N \beta_m \ \ RBF(x, x_m) \Big)
$$

If Output simply passes the value through (identity), this is just linear regression on RBF-transformed data, with vectors $ z_n \in \mathcal{R}^N $

$$
z_n = \Big[ RBF(x_n,x_1), \ RBF(x_n,x_2), \ \cdots \ , \ RBF(x_n,x_N) \Big] \in \mathcal{R}^N
$$

closed-form solution of linear regression: optimal $ \beta = \big( Z^T Z \big)^{-1} Z^T y $, if $ Z^T Z $ is invertible

size of Z? $ Z_{N \times N} $

Z is a symmetric square matrix. For the Gaussian RBF there is a theoretical fact:

if $ x_n $ all different, Z with Gaussian RBF is invertible

symmetric: $ Z = Z^T $

$$
\begin{align}
\beta & = \big( Z^T \ Z \big)^{-1} Z^T y \\
      & = \big( Z \ Z \big)^{-1} Z y \\
      & = Z^{-1} \  Z^{-1} \  Z \ y \\
      & = Z^{-1} \ y
\end{align}
$$
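To make $ \beta = Z^{-1} y $ concrete, here is a sketch under stated assumptions: three distinct made-up 1-D points, $ \gamma = 1 $, and a hand-written Gaussian elimination helper `solve` (a hypothetical stand-in for any linear solver). It checks that the fitted network passes through every training point.

```python
import math

def gaussian_rbf(x, mu, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, mu)))

def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * v[c] for c in range(r + 1, n))
        v[r] = (M[r][n] - s) / M[r][r]
    return v

# Hypothetical regression data with all x_n different.
X = [(0.0,), (1.0,), (2.0,)]
y = [1.0, 3.0, 2.0]
gamma = 1.0

# Z[n][m] = RBF(x_n, x_m); symmetric N x N.
Z = [[gaussian_rbf(xn, xm, gamma) for xm in X] for xn in X]
beta = solve(Z, y)  # beta = Z^{-1} y

def g_rbf(x):
    return sum(b * gaussian_rbf(x, xm, gamma) for b, xm in zip(beta, X))

fitted = [g_rbf(xn) for xn in X]
print(fitted)  # matches y up to floating-point error
```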

### Regularized RBF Network

full Gaussian RBF Network for regression: $ \beta = Z^{-1} y $

$$
g_{RBF} (x_1) = \beta^T z_1 = y^T Z^{-1} \big( \text{ first column of Z } \big) = 
y^T 
\begin{bmatrix}
1 \\ 0 \\ \vdots \\ 0
\end{bmatrix}
= y_1
$$

By the same argument, $ g_{RBF} (x_n) = y_n $ for every n, so $ E_{in} \big( g_{RBF} \big) = 0 $

called exact interpolation for function approximation.

but overfitting for learning.

may use regularization, e.g. ridge regression for $ \beta $ instead

optimal $ \beta = \big( Z^T Z + \lambda I \big)^{-1} Z^T y $

Here $ Z = \big[ Gaussian(x_n, x_m) \big] $ is exactly the kernel matrix K of the Gaussian SVM.

regularized full RBFNet: $ \beta = \big( Z^T Z + \lambda I \big)^{-1} Z^T y $ ...(regularization on the finite, N-dimensional transform)

kernel ridge regression: $ \beta = \big( K + \lambda I \big)^{-1} y $ ...(regularization on the infinite-dimensional transform)
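The two formulas can be compared numerically. The sketch below assumes the same kind of toy setup as before (three made-up 1-D points, $ \gamma = 1 $, $ \lambda = 0.1 $, and a hypothetical `solve` helper); it only illustrates that both regularized solutions give up exact interpolation and that they are different regularizers.

```python
import math

def gaussian_rbf(x, mu, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, mu)))

def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * v[c] for c in range(r + 1, n))
        v[r] = (M[r][n] - s) / M[r][r]
    return v

X = [(0.0,), (1.0,), (2.0,)]   # hypothetical data
y = [1.0, 3.0, 2.0]
gamma, lam = 1.0, 0.1
N = len(X)

Z = [[gaussian_rbf(xn, xm, gamma) for xm in X] for xn in X]  # Z equals K here

# Regularized full RBFNet: beta = (Z^T Z + lam I)^{-1} Z^T y
ZtZ = [[sum(Z[k][i] * Z[k][j] for k in range(N)) + (lam if i == j else 0.0)
        for j in range(N)] for i in range(N)]
Zty = [sum(Z[k][i] * y[k] for k in range(N)) for i in range(N)]
beta_rbf = solve(ZtZ, Zty)

# Kernel ridge regression: beta = (K + lam I)^{-1} y
K_reg = [[Z[i][j] + (lam if i == j else 0.0) for j in range(N)] for i in range(N)]
beta_krr = solve(K_reg, y)

def residuals(beta):
    """y_n minus the fitted value at x_n, for each training point."""
    return [yn - sum(b * z for b, z in zip(beta, row)) for row, yn in zip(Z, y)]

res_rbf, res_krr = residuals(beta_rbf), residuals(beta_krr)
print(res_rbf, res_krr)  # nonzero: E_in is no longer 0 once lam > 0
```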

### Fewer Centers as Regularization

$$
g_{SVM}(x) = sign \Big( \sum_{SV} \alpha_m y_m \exp \big( - \gamma \Vert x - x_m \Vert^2 \big) + b \Big)
$$

Only the SVs are needed as centers.

next: use $ M \ll N $ centers instead of M = N

effect: regularization by constraining number of centers and voting weights.

physical meaning of centers $ \mu_m $ : prototypes

### Good prototypes: Clustering Problem

if $ x_1 \approx x_2 $

- no need both RBF(x, x1) and RBF(x, x2) in RBFNet
- cluster x1 and x2 by one prototype $ \mu \approx x_1 \approx x_2 $

Clustering with prototype:

- partition $ \big\{ x_n \big\} $ to disjoint sets $ S_1, S_2, \cdots, S_M $
- choose $ \mu_m $ for each $ S_m $
- hope: $ x_1, x_2 $ both $ \in S_m \iff \mu_m \approx x_1 \approx x_2 $

cluster error with squared error measure:

$$
E_{in} \big( S_1, \cdots, S_M; \ \mu_1, \cdots, \mu_M \big) = \frac{1}{N} \sum_{n=1}^N \sum_{m=1}^M \big[ x_n \in S_m \big]_{boolean}
\Vert x_n - \mu_m \Vert^2
$$

### Partition Optimization

with $ S_1, \cdots, S_M $ being a partition of $ \big\{ x_n \big\} $,

$$
\min_{\big\{S_1, \cdots, S_M; \ \mu_1, \cdots, \mu_M \big\}}
\sum_{n=1}^N \sum_{m=1}^M \big[ x_n \in S_m \big]_{boolean}
\Vert x_n - \mu_m \Vert^2
$$

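This is a hard combinatorial optimization problem, because it couples two questions:

- which x go into the same group S?
- how is the center $ \mu $ of each group S determined?

The approach is to optimize the two sets of variables alternately, fixing one while optimizing the other:

if $ \mu_1, \cdots, \mu_M $ fixed, for each $ x_n $

- $ \big[ x_n \in S_m \big]_{boolean} $ : choose one and only one subset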
- $ \Vert x_n - \mu_m \Vert^2 $ : distance to each prototype.

optimal chosen subset $ S_m $ = the one with minimum $ \Vert x_n - \mu_m \Vert^2 $

With the partition S determined as above, next find the center $ \mu_m $ of each group S.

if $ S_1, \cdots, S_M $ fixed, just unconstrained optimization for each $ \mu_m $

$$
\nabla_{\mu_m} E_{in} = -\frac{2}{N} \sum_{n=1}^N
\big[ x_n \in S_m \big]_{boolean} \big( x_n - \mu_m \big) =
-\frac{2}{N} \Big( \big( \sum_{x_n \in S_m} x_n \big) - \big| S_m \big| \ \mu_m \Big)
$$

Setting the gradient to zero gives $ \mu_m = \frac{1}{| S_m |} \sum_{x_n \in S_m} x_n $, i.e. optimal prototype $ \mu_m $ = average of $ x_n $ within $ S_m $

## K-Means Algorithm

k prototypes

STEP 0: choose k initial prototypes $ \mu_1, \mu_2, \cdots, \mu_k $

then repeat STEP 1~2 until convergence

STEP 1: optimize $ S_1, S_2, \cdots, S_k $, each $ x_n $ partitioned using closest $ \mu_m $

STEP 2: optimize $ \mu_1, \mu_2, \cdots, \mu_k $, each $ \mu_m $ computed as the average of $ S_m $
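The steps above can be sketched as follows. The initialization (k randomly chosen training points), the stopping rule (assignments no longer change), and the handling of empty clusters are assumptions of this sketch; the name `k_means` and the data are illustrative.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(X, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # STEP 0: assumed initialization, k distinct training points as prototypes.
    mus = [list(x) for x in rng.sample(X, k)]
    assign = None
    for _ in range(max_iters):
        # STEP 1: each x_n joins the subset of its closest prototype.
        new_assign = [min(range(k), key=lambda m: sq_dist(x, mus[m])) for x in X]
        if new_assign == assign:  # partition unchanged, so we have converged
            break
        assign = new_assign
        # STEP 2: each prototype becomes the average of its subset.
        for m in range(k):
            members = [x for x, a in zip(X, assign) if a == m]
            if members:  # an empty subset keeps its old prototype
                mus[m] = [sum(coord) / len(members) for coord in zip(*members)]
    return mus, assign

# Hypothetical 2-D data with two well-separated groups.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
mus, assign = k_means(X, k=2)
print(sorted(mus), assign)
```

Each step can only decrease (or keep) the clustering error, which is why the alternation settles down; the result may still depend on the initial prototypes.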

## RBF Network using k-means

STEP 1: run k-Means with k=M to get $ \big\{ \mu_m \big\} $

STEP 2: construct transform $ \Phi(x) $ from RBF (say, Gaussian) at $ \mu_m $

$$
\Phi(x) = \big[ RBF(x, \mu_1), \ RBF(x, \mu_2), \ \cdots, \ RBF(x, \mu_M) \big]
$$

STEP 3: run linear model on $ \Big\{ \big( \Phi(x_n), y_n \big) \Big\} $ to get $ \beta $

STEP 4: return $ g_{RBFNET} (x) = \text{ LinearHypothesis } \big( \beta, \Phi(x) \big) $
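An end-to-end sketch of the four steps, under several assumptions not fixed by the text: Gaussian RBF, least-squares linear regression on ±1 labels (solved through the normal equations, no bias term) as the linear model, and sign as the final output. All function names and the data are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(X, k, max_iters=100, seed=0):
    """Alternate between assigning points and averaging subsets."""
    rng = random.Random(seed)
    mus = [list(x) for x in rng.sample(X, k)]
    for _ in range(max_iters):
        assign = [min(range(k), key=lambda m: sq_dist(x, mus[m])) for x in X]
        new_mus = []
        for m in range(k):
            members = [x for x, a in zip(X, assign) if a == m]
            new_mus.append([sum(c) / len(members) for c in zip(*members)]
                           if members else mus[m])
        if new_mus == mus:
            break
        mus = new_mus
    return mus

def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * v[c] for c in range(r + 1, n))
        v[r] = (M[r][n] - s) / M[r][r]
    return v

def phi(x, mus, gamma):
    """STEP 2: feature transform [RBF(x, mu_1), ..., RBF(x, mu_M)]."""
    return [math.exp(-gamma * sq_dist(x, mu)) for mu in mus]

def fit_rbfnet(X, y, M, gamma):
    mus = k_means(X, M)                                   # STEP 1
    Phi = [phi(x, mus, gamma) for x in X]                 # STEP 2
    # STEP 3: least squares, (Phi^T Phi) beta = Phi^T y
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(M)] for i in range(M)]
    rhs = [sum(row[i] * yn for row, yn in zip(Phi, y)) for i in range(M)]
    return mus, solve(A, rhs)

def g_rbfnet(x, mus, beta, gamma):                        # STEP 4
    score = sum(b * z for b, z in zip(beta, phi(x, mus, gamma)))
    return 1 if score >= 0 else -1

# Hypothetical data: one positive group and one negative group.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
y = [1, 1, 1, -1, -1, -1]
gamma = 0.5
mus, beta = fit_rbfnet(X, y, M=2, gamma=gamma)
train_pred = [g_rbfnet(x, mus, beta, gamma) for x in X]
print(train_pred)
```

Here M = 2 prototypes stand in for N = 6 examples, which is the "fewer centers as regularization" idea; in practice M and $ \gamma $ would be chosen by validation.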