Commit b7b3df2: notes lec may 7
zhudotexe committed May 7, 2020 (1 parent: 09f6cd6)
Showing 11 changed files with 188 additions and 1 deletion.
Binary files added (not shown): _static/.DS_Store, _static/kernel/ex1.png, _static/svm/ex5.png, _static/svm/ex6.png, _static/svm/ex7.png, _static/svm/ex8.png, _static/svm/ex9.png, _static/svm/ex10.png
1 change: 1 addition & 0 deletions in index.rst

@@ -19,6 +19,7 @@ Welcome to cse142-notes's documentation!
linearmodels
tree
svm
kernel



111 changes: 111 additions & 0 deletions in kernel.rst

@@ -0,0 +1,111 @@
Kernel Methods
==============

- step 1: use a special type of mapping function
    - still maps the feature space to a higher-dimensional space
    - but computing :math:`\phi(x) \cdot \phi(z)` is easy (recall the dot products between examples in the SVM dual optimization)
- step 2: rewrite the model s.t.
    - the mapping :math:`\phi(x)` never needs to be explicitly computed, i.e. we never compute terms like :math:`w \cdot \phi(x) + b`
    - we only work with :math:`\phi(x) \cdot \phi(z)`, which we call the **kernel** function

**Ex.**

.. math::

    \mathbf{x} = <x_1, x_2> \to \phi(x) = <x_1^2, \sqrt{2}x_1 x_2, x_2^2>

We can compute :math:`\phi(x) \cdot \phi(z)` easily:

.. math::

    \phi(x) \cdot \phi(z) & = <x_1^2, \sqrt{2}x_1 x_2, x_2^2> \cdot <z_1^2, \sqrt{2}z_1 z_2, z_2^2> \\
    & = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 \\
    & = (x_1 z_1 + x_2 z_2)^2 \\
    & = (x \cdot z)^2 \\
    & = K(x, z)

We call this a **kernel** function.
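
As a quick numerical sanity check (our own sketch, not from the lecture; ``phi`` and ``K`` are illustrative names), the explicit mapping and the kernel really do agree:

.. code-block:: python

    import numpy as np

    def phi(x):
        """Explicit feature map for 2-D inputs: <x1^2, sqrt(2)*x1*x2, x2^2>."""
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def K(x, z):
        """Kernel form: (x . z)^2, computed without ever building phi."""
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print(np.dot(phi(x), phi(z)))  # 1.0
    print(K(x, z))                 # 1.0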

What about the quadratic feature mapping?

.. math::

    \phi(x) \cdot \phi(z) = ... = (1+x \cdot z)^2 = K(x, z)

Kernel in SVM
-------------

Rewriting the primal form of the SVM to use the mapped space doesn't seem easy.

But we can rewrite the dual form easily!

.. math::

    \max & \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m,n=1}^N \alpha_m \alpha_n y_m y_n (\phi(x_m)^T \phi(x_n)) \\
    \text{subject to } & \sum_{n=1}^N \alpha_n y_n = 0, \alpha_n \geq 0; n = 1..N

**Kernelized SVM**

Now the SVM computes a linear boundary in the higher-dimensional space without ever explicitly transforming the vectors:

.. math::

    \max & \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m,n=1}^N \alpha_m \alpha_n y_m y_n (K(x_m, x_n)) \\
    \text{subject to } & \sum_{n=1}^N \alpha_n y_n = 0, \alpha_n \geq 0; n = 1..N

Predictions:

.. math::

    \hat{y} & = sign(\sum \alpha_n y_n x_n \cdot x' + b) \text{ (in the old space)} \\
    & \to sign(\sum \alpha_n y_n \phi(x_n) \cdot \phi(x') + b) \\
    & = sign(\sum \alpha_n y_n K(x_n, x') + b)
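
A minimal sketch of this dual-form prediction (our own illustration; ``X_sv``, ``y_sv``, ``alpha``, ``b``, and ``K`` are assumed to come from an already-trained SVM):

.. code-block:: python

    import numpy as np

    def kernel_svm_predict(x_new, X_sv, y_sv, alpha, b, K):
        """Dual-form prediction: sign(sum_n alpha_n * y_n * K(x_n, x') + b)."""
        s = sum(a * y * K(x, x_new) for a, y, x in zip(alpha, y_sv, X_sv)) + b
        return np.sign(s)

Only the support vectors (the points with :math:`\alpha_n > 0`) contribute to this sum, so prediction never touches :math:`\phi` explicitly.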
Formal Definition
-----------------
Let's use the example kernel :math:`K(x, z) = (x \cdot z)^2` for :math:`\phi(x) = <x_1^2, \sqrt{2}x_1x_2, x_2^2>`

- a kernel function is implicitly associated with some :math:`\phi`
- :math:`\phi` maps input :math:`x\in X` to a higher dimensional space :math:`F`:
    - :math:`\phi: X \to F`
- the kernel takes 2 inputs from :math:`X` and outputs their similarity in :math:`F`:
    - :math:`K: X \times X \to R`
- once you have a kernel, you **don't need to know** :math:`\phi`

**Mercer's Condition**

- a function :math:`K` can be a kernel function if a suitable :math:`\phi` exists
    - :math:`\exists \phi` s.t. :math:`K(x, z) = \phi(x) \cdot \phi(z)`
- mathematically: :math:`K` should be positive semi-definite; i.e. for all square integrable :math:`f` (i.e. :math:`\int f(x)^2 dx < \infty`),
    - :math:`\iint f(x)K(x, z)f(z)\,dx\,dz \geq 0`
- this condition is both sufficient and necessary for :math:`K` to be a kernel function

**Constructing Kernels**

We already know some proven basic kernel functions - given Mercer's condition, we can construct new kernels!

- :math:`k(x, z) = k_1(x, z) + k_2(x, z)` (direct sum)
- :math:`k(x, z) = \alpha k_1(x, z)` (:math:`\forall \alpha > 0` - scalar product)


.. note::
    Example: given that :math:`k_1` and :math:`k_2` are kernels, prove :math:`k(x, z) = k_1(x, z) + k_2(x, z)` is a kernel.

.. math::

    \iint f(x)K(x, z)f(z)dxdz & = \iint f(x)[K_1(x, z) + K_2(x, z)]f(z)dxdz \\
    & = \iint f(x)K_1(x, z)f(z)dxdz + \iint f(x)K_2(x, z)f(z)dxdz \\
    & \geq 0 + 0 = 0

.. image:: _static/svm/ex10.png

(the :math:`\phi` for the linear kernel is the identity: :math:`\phi(x) = x`)
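
On a finite sample, Mercer's condition amounts to the kernel (Gram) matrix being positive semi-definite. A small sketch (our own, not from the lecture) that checks this for the sum of two known kernels:

.. code-block:: python

    import numpy as np

    def gram_matrix(X, K):
        """Pairwise kernel values K(x_i, x_j) for a finite sample X."""
        n = len(X)
        return np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])

    def is_psd(G, tol=1e-10):
        """Finite-sample analogue of Mercer's condition: all eigenvalues >= 0."""
        return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

    X = np.random.randn(20, 2)
    k1 = lambda x, z: np.dot(x, z)               # linear kernel
    k2 = lambda x, z: (1 + np.dot(x, z)) ** 2    # quadratic kernel
    k_sum = lambda x, z: k1(x, z) + k2(x, z)     # direct sum of the two

    print(is_psd(gram_matrix(X, k_sum)))  # True: the sum of two kernels is a kernel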

Perceptron
----------
We can also apply kernels to the perceptron!

Naively, we can just replace :math:`\mathbf{x}` with :math:`\phi(x)` in the algorithm - but that requires knowledge
of :math:`\phi`

.. image:: _static/kernel/ex1.png

**Prediction**

Since :math:`w = \sum_m \alpha_m \phi(x_m)`, prediction on :math:`x_n` is easy and needs only kernel values:

.. math::

    \hat{y_n} & = sign(\sum_m \alpha_m \phi(x_m) \cdot \phi(x_n) + b) \\
    & = sign(\sum_m \alpha_m K(x_m, x_n) + b)
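
A minimal kernel perceptron sketch (our own illustration, assuming the usual update in which :math:`\alpha_m` accumulates the label :math:`y_m` on every mistake):

.. code-block:: python

    import numpy as np

    def kernel_perceptron(X, y, K, epochs=10):
        """Kernel perceptron sketch: store one alpha per training point instead of w."""
        n = len(X)
        alpha, b = np.zeros(n), 0.0
        for _ in range(epochs):
            for i in range(n):
                # prediction uses only kernel evaluations, never phi explicitly
                s = sum(alpha[m] * K(X[m], X[i]) for m in range(n)) + b
                if np.sign(s) != y[i]:   # mistake: add a signed copy of x_i to w
                    alpha[i] += y[i]
                    b += y[i]
        return alpha, b

Training and prediction both depend on the data only through :math:`K(x_m, x_n)`, which is exactly what lets us swap in any valid kernel.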
77 changes: 76 additions & 1 deletion in svm.rst

@@ -148,8 +148,13 @@ Hard-Margin SVM
- constraints are linear
- this is called the *primal form*, but most people solve the *dual form*

We can encode the primal form algebraically:

.. math::

    \min_{w, b} \max_{\alpha \geq 0} \frac{1}{2}(w \cdot w) + \sum_i \alpha_i (1-y_i(w \cdot x_i + b))

Dual Form
^^^^^^^^^

- does not change the solution
- introduces new variables :math:`\alpha_n` for each training instance
@@ -179,3 +184,73 @@ For soft-margin SVMs, support vectors are:
- points on the wrong side of the hyperplane (:math:`\xi \geq 1`)

**Conclusion**: w and b only depend on the support vectors

Derivation
""""""""""

Given the algebraically encoded primal form:

.. math::

    \min_{w, b} \max_{\alpha \geq 0} \frac{1}{2}(w \cdot w) + \sum_i \alpha_i (1-y_i(w \cdot x_i + b))

We can switch the order of the min and max (strong duality holds here, since the objective is convex and the constraints are linear):

.. math::

    \max_{\alpha \geq 0} \min_{w, b} \frac{1}{2}(w \cdot w) + \sum_i \alpha_i (1-y_i(w \cdot x_i + b)) \\
    = \max_{\alpha \geq 0} \min_{w, b} L(w, b, \alpha)

To solve the inner min, differentiate :math:`L` with respect to :math:`w` and :math:`b` and set the derivatives to zero:

.. math::

    \frac{\partial L(w, b, \alpha)}{\partial w_k} & = w_k - \sum_i \alpha_i y_i x_{i,k} & \\
    \frac{\partial L(w, b, \alpha)}{\partial w} & = w - \sum_i \alpha_i y_i x_i & \to w = \sum_i \alpha_i y_i x_i \\
    \frac{\partial L(w, b, \alpha)}{\partial b} & = - \sum_i \alpha_i y_i & \to \sum_i \alpha_i y_i = 0

- :math:`w = \sum_i \alpha_i y_i x_i` means **w** is a weighted sum of examples
- :math:`\sum_i \alpha_i y_i = 0` means positive and negative examples have the same total weight
- :math:`\alpha_i > 0` only when :math:`x_i` is a support vector, so **w** is a sum of signed *support vectors*
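
As a hedged sanity check of this conclusion (using scikit-learn, which is not part of the lecture; its ``dual_coef_`` attribute stores :math:`\alpha_i y_i` for the support vectors only):

.. code-block:: python

    import numpy as np
    from sklearn.svm import SVC

    # toy, roughly linearly separable data
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
    y = np.array([1] * 20 + [-1] * 20)

    clf = SVC(kernel="linear").fit(X, y)

    # w = sum_i alpha_i * y_i * x_i, summed over the support vectors only
    w_from_dual = clf.dual_coef_ @ clf.support_vectors_
    print(np.allclose(w_from_dual, clf.coef_))  # True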

.. image:: _static/svm/ex5.png

.. image:: _static/svm/ex6.png

**Conclusion**

.. image:: _static/svm/ex7.png

**Soft-Margin**

.. image:: _static/svm/ex8.png

subject to :math:`0 \leq \alpha_i \leq c \; (\forall i)`

Non-Linearly-Separable
----------------------

What if our data is not linearly separable?

- use a non-linear classifier
- transform our data so that it is separable, somehow
    - e.g. adding a dummy dimension based on a quadratic formula of the real dimension (see the sketch below)
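
Following the dummy-dimension idea above, a minimal sketch (our own illustration, with made-up 1-D data):

.. code-block:: python

    import numpy as np

    # 1-D data: positives are the points with |x| > 1, so no single threshold separates them
    x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
    y = np.sign(np.abs(x) - 1.0)

    # add a quadratic dummy dimension; in (x, x^2) space the classes split along x^2 = 1
    X_mapped = np.column_stack([x, x ** 2])
    w, b = np.array([0.0, 1.0]), -1.0        # the hyperplane x^2 - 1 = 0
    print(np.sign(X_mapped @ w + b) == y)    # True for every point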

.. image:: _static/svm/ex9.png
:width: 500

Feature Mapping
^^^^^^^^^^^^^^^
We can map the original feature vector :math:`x` to a higher-dimensional feature vector :math:`\phi(x)`.

e.g. quadratic feature mapping:

.. math::

    \phi(x) = < & 1, 2x_1, 2x_2, ..., 2x_D, \\
    & x_1^2, x_1x_2, ..., x_1x_D, \\
    & x_2x_1, x_2^2, ..., x_2x_D, \\
    & ... >

Pros: this improves separability, so you can apply a linear model more confidently.

Cons: there are a lot more features now (and a lot of repeated features), which means more computation and a greater risk of overfitting.
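
A small sketch (our own, following the mapping written above) that makes the cost concrete: the feature count grows from :math:`D` to :math:`1 + D + D^2`:

.. code-block:: python

    import numpy as np

    def quadratic_feature_map(x):
        """Explicit quadratic map from the notes: <1, 2*x_i, all pairwise x_i*x_j>."""
        D = len(x)
        feats = [1.0]
        feats += [2 * xi for xi in x]                                # linear terms
        feats += [x[i] * x[j] for i in range(D) for j in range(D)]   # pairwise products
        return np.array(feats)

    x = np.arange(1.0, 6.0)  # D = 5
    print(len(x), "->", len(quadratic_feature_map(x)))  # 5 -> 31 features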


