In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

# 1. polynomial regression

basic linear regression with one variable:

$$y=\theta_{0} + \theta_{1}x$$

what if a linear model cannot fit the training examples well?<br>
we can naturally extend the linear model to a polynomial model, for example:

$$y=\theta_{0} + \theta_{1}x + \theta_{2}x^{2} + \theta_{3}x^{3}$$

this method can be summarized as: map the original attributes $x$ to a new set of quantities $\phi(x)$ (called features), and fit the same linear model on $\phi(x)$ instead of $x$.
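as a minimal sketch of the idea above, the following maps a scalar $x$ to cubic polynomial features and fits $\theta$ with ordinary least squares. the helper name `phi`, the toy data and the use of `np.linalg.lstsq` are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def phi(x, degree=3):
    """Map scalars x to polynomial features [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, degree + 1, increasing=True)

# toy data generated from a known cubic (illustrative values only)
x = np.linspace(-2.0, 2.0, 20)
y = 1.0 - 2.0 * x + 0.5 * x**3

# the model is still linear in theta, so ordinary least squares applies
theta, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
```

because the data is noise-free, `theta` should recover the generating coefficients `[1, -2, 0, 0.5]` up to rounding error.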

## 1.1 least mean squares with features
let $\phi : \mathbb{R}^{d} \to \mathbb{R}^{p}$ be a feature map. the original batch gradient descent update is:

$$\theta := \theta + \alpha\sum_{i=1}^{n}(y^{(i)} - \theta^{T}x^{(i)})x^{(i)}$$

using features:

$$\theta := \theta + \alpha\sum_{i=1}^{n}(y^{(i)} - \theta^{T}\phi(x^{(i)}))\phi(x^{(i)})$$

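the update above becomes computationally expensive when $\phi(x)$ is high dimensional, since every step costs $O(np)$.<br>
however, observe that if at some point $\theta$ can be represented as a linear combination of the training features (which holds at initialization, since $\theta = 0$ corresponds to $\beta = 0$):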

$$\theta = \sum_{i=1}^{n}\beta_{i}\phi(x^{(i)})$$

then in the next round:

$$
\begin{equation}
\begin{split}
\theta &:= \theta + \alpha\sum_{i=1}^{n}(y^{(i)} - \theta^{T}\phi(x^{(i)}))\phi(x^{(i)}) \\
&=\sum_{i=1}^{n}\beta_{i}\phi(x^{(i)}) + \alpha\sum_{i=1}^{n}(y^{(i)} - \theta^{T}\phi(x^{(i)}))\phi(x^{(i)}) \\
&=\sum_{i=1}^{n}(\beta_{i} + \alpha(y^{(i)} - \theta^{T}\phi(x^{(i)})))\phi(x^{(i)})
\end{split}
\end{equation}
$$

so after the update, $\theta$ is again a linear combination of the $\phi(x^{(i)})$.<br>
substituting $\theta = \sum_{j=1}^{n}\beta_{j}\phi(x^{(j)})$, we can derive the update rule for $\beta$:

$$\beta_{i} := \beta_{i} + \alpha(y^{(i)} - \sum_{j=1}^{n}\beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)}))$$

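we only need the inner products $\left \langle \phi(x^{(j)}), \phi(x^{(i)}) \right \rangle = \phi(x^{(j)})^{T}\phi(x^{(i)})$ to update the parameters, no matter how large the feature dimension $p$ is. these inner products do not change between iterations, so they can be computed once for all pairs $(i, j)$.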
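the $\beta$ update can be sketched as follows. the inner products are precomputed once into an $n \times n$ matrix `G`, and the loop never touches $\theta$. the feature map `phi`, the toy data, the step size and the iteration count are all hypothetical choices for illustration.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for scalar x: [1, x, x^2]."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

def fit_beta(G, y, alpha=0.01, n_iter=5000):
    """Batch update beta_i += alpha * (y_i - sum_j beta_j * G[j, i])."""
    beta = np.zeros(len(y))
    for _ in range(n_iter):
        beta = beta + alpha * (y - G @ beta)
    return beta

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = 2.0 * x**2 - 1.0

Phi = phi(x)        # n x p matrix whose rows are phi(x_i)
G = Phi @ Phi.T     # G[i, j] = <phi(x_i), phi(x_j)>, computed once
beta = fit_beta(G, y)

# only for checking: recover theta = sum_i beta_i * phi(x_i)
theta = Phi.T @ beta
```

on this noise-free data the recovered `theta` should be close to `[-1, 0, 2]`, the coefficients that generated `y`.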

## 1.2 kernel

we define the kernel corresponding to the feature map $\phi$ as the function $K$ satisfying:

$$K(x, z) = \left \langle \phi(x), \phi(z) \right \rangle$$
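as an illustration (the quadratic kernel is an example chosen here, not taken from the notes), $K(x, z) = (x^{T}z)^{2}$ corresponds to the feature map that lists all $d^{2}$ products $x_{i}x_{j}$. the sketch below checks numerically that both sides of the definition agree, while the kernel itself needs only $O(d)$ work.

```python
import numpy as np

def K(x, z):
    """Quadratic kernel K(x, z) = (x^T z)^2, computed in O(d) time."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map for K: all d^2 products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

lhs = K(x, z)                        # no features are ever built
rhs = float(np.dot(phi(x), phi(z)))  # inner product in R^(d^2)
```

here $x^{T}z = 4.5$, so both `lhs` and `rhs` equal `20.25`.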

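for the least mean squares problem, the update rule for $\beta$ can then be written purely in terms of the kernel:

$$\beta_{i} := \beta_{i} + \alpha(y^{(i)} - \sum_{j=1}^{n}\beta_{j}K(x^{(i)}, x^{(j)}))$$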

kernel's condition (mercer): let $K: \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$. for $K$ to be a valid kernel, it is necessary and sufficient that for any finite set $\left \{ x^{(1)},...,x^{(n)} \right \}$, the corresponding kernel matrix with entries $K_{ij} = K(x^{(i)}, x^{(j)})$ is symmetric positive semi-definite.
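the condition can be probed numerically on a sample of points. this is only a sketch: a failed check proves that a function is not a valid kernel, but a passed check on one sample does not prove validity. the helper names, the gaussian kernel and the counterexample $-x^{T}z$ are assumptions for illustration.

```python
import numpy as np

def kernel_matrix(K, X):
    """Return the n x n matrix with entries K(x_i, x_j)."""
    n = len(X)
    return np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_symmetric_psd(M, tol=1e-8):
    """Check symmetry and that all eigenvalues are >= -tol."""
    if not np.allclose(M, M.T):
        return False
    return bool(np.linalg.eigvalsh(M).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))   # six random points in R^2

gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
not_a_kernel = lambda x, z: -np.dot(x, z)

ok = is_symmetric_psd(kernel_matrix(gaussian, X))       # passes on this sample
bad = is_symmetric_psd(kernel_matrix(not_a_kernel, X))  # has negative eigenvalues
```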

# 2. support vector machine

## 2.1 margins

consider logistic regression, where the probability $p(y=1|x;\theta)$ is modeled by $h_{\theta}(x)=\sigma(\theta^{T}x)$. we predict 1 on an input $x$ if and only if $h_{\theta}(x) \geq 0.5$, or equivalently $\theta^{T}x \geq 0$. the larger $\theta^{T}x$ is, the more confident we are that $y=1$.<br>
in other words, the distance of a point from the separating hyperplane $\theta^{T}x = 0$ indicates how confident the prediction is.

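functional margin of $(w, b)$ with respect to a training example $(x^{(i)}, y^{(i)})$, with labels $y^{(i)} \in \{-1, 1\}$: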

$$\hat{\gamma}^{(i)}=y^{(i)}(w^{T}x^{(i)} + b)$$

geometric margin:

$$\gamma^{(i)}=\frac{y^{(i)}(w^{T}x^{(i)} + b)}{\left \| w \right \| }$$

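the geometric margin is the signed euclidean distance from $x^{(i)}$ to the hyperplane $w^{T}x + b = 0$: it is positive when the example is classified correctly and, unlike the functional margin, it does not change when $(w, b)$ is rescaled.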
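a small numeric sketch of the two definitions, using made-up values for $w$, $b$ and a single example. scaling $(w, b)$ by 10 multiplies the functional margin by 10 but leaves the geometric margin unchanged.

```python
import numpy as np

def functional_margin(w, b, x, y):
    """y * (w^T x + b)"""
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    """Functional margin divided by the euclidean norm of w."""
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# illustrative values only
w, b = np.array([3.0, 4.0]), -5.0
x, y = np.array([3.0, 4.0]), 1

f1 = functional_margin(w, b, x, y)            # 25 - 5 = 20
g1 = geometric_margin(w, b, x, y)             # 20 / 5 = 4
f2 = functional_margin(10 * w, 10 * b, x, y)  # scales with (w, b)
g2 = geometric_margin(10 * w, 10 * b, x, y)   # unchanged by the scaling
```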

## 2.2 the optimal margin classifier

we want to maximize the smallest geometric margin over the training set:

$$
\begin{equation}
\begin{split}
\underset{\gamma, w, b}{\max}\ &\gamma \\
\text{s.t.}\quad &\frac{y^{(i)}(w^{T}x^{(i)} + b)}{\left \| w \right \| } \geq \gamma,\quad i=1,\dots,n
\end{split}
\end{equation}
$$

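without loss of generality, we can rescale $(w, b)$ so that $\gamma\left \| w \right \|=1$, i.e. the smallest functional margin equals 1. maximizing $\gamma = 1/\left \| w \right \|$ is then the same as minimizing $\frac{1}{2}\left \| w \right \|^{2}$, so the above is equivalent to: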

$$
\begin{equation}
\begin{split}
\underset{w, b}{\min}\ &\frac{1}{2}{\left \| w \right \|}^2 \\
\text{s.t.}\quad &{y^{(i)}(w^{T}x^{(i)} + b)} \geq 1,\quad i=1,\dots,n
\end{split}
\end{equation}
$$
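the rescaling step can be checked numerically. the sketch below takes a hypothetical separating hyperplane on toy data (not the optimal one), rescales $(w, b)$ so that the smallest functional margin is 1, and confirms that the smallest geometric margin then equals $1/\left \| w \right \|$.

```python
import numpy as np

# toy linearly separable data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# a hypothetical separating hyperplane, not necessarily the optimal one
w, b = np.array([2.0, 2.0]), -3.0

margins = y * (X @ w + b)                  # functional margins: [5, 5, 3, 3]
gamma = margins.min() / np.linalg.norm(w)  # smallest geometric margin

# rescale so that the smallest functional margin is exactly 1
scale = margins.min()
w_s, b_s = w / scale, b / scale
margins_s = y * (X @ w_s + b_s)

gamma_s = 1.0 / np.linalg.norm(w_s)        # equals gamma after rescaling
```

the hyperplane itself is unchanged by the rescaling, which is why the constraint $y^{(i)}(w^{T}x^{(i)} + b) \geq 1$ loses no generality.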

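the lagrangian of this problem, with one multiplier $\alpha_{i} \geq 0$ per constraint (not to be confused with the learning rate $\alpha$ of section 1):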

$$
\mathcal{L}(w,b,\alpha )=\frac{1}{2}\left \| w \right \|^{2} - \sum_{i=1}^{n}\alpha_{i}\left [ y^{(i)}(w^{T}x^{(i)} + b) - 1 \right ]     
$$
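
as a sketch of the standard next step (assuming multipliers $\alpha_{i} \geq 0$), setting the derivatives of $\mathcal{L}$ with respect to $w$ and $b$ to zero gives:

$$\nabla_{w}\mathcal{L} = w - \sum_{i=1}^{n}\alpha_{i}y^{(i)}x^{(i)} = 0 \quad\Rightarrow\quad w = \sum_{i=1}^{n}\alpha_{i}y^{(i)}x^{(i)}$$

$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{n}\alpha_{i}y^{(i)} = 0$$

so at the optimum $w$ is a linear combination of the training inputs, which mirrors the representation $\theta = \sum_{i=1}^{n}\beta_{i}\phi(x^{(i)})$ used in section 1.1.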