Commit b7b3df2: notes lec may 7
zhudotexe committed May 7, 2020 (1 parent: 09f6cd6)
Showing 11 changed files with 188 additions and 1 deletion.
Binary files added (not shown): _static/.DS_Store, _static/kernel/ex1.png, _static/svm/ex5.png, _static/svm/ex6.png, _static/svm/ex7.png, _static/svm/ex8.png, _static/svm/ex9.png, _static/svm/ex10.png
1 change: 1 addition & 0 deletions in index.rst

@@ -19,6 +19,7 @@ Welcome to cse142-notes's documentation!
linearmodels
tree
svm
kernel



111 changes: 111 additions & 0 deletions in kernel.rst

@@ -0,0 +1,111 @@
Kernel Methods
==============

- step 1: use a special type of mapping function
    - still maps the feature space to a higher-dimensional space
    - but computing :math:`\phi(x) \cdot \phi(z)` is easy (recall the dot products between examples in the SVM dual optimization)
- step 2: rewrite the model s.t.
    - the mapping :math:`\phi(x)` never needs to be explicitly computed, i.e. we never compute terms like :math:`w \cdot \phi(x) + b`
    - we only work with :math:`\phi(x) \cdot \phi(z)`, which we call the **kernel** function

**Ex.**

.. math::

    \mathbf{x} = <x_1, x_2> \to \phi(x) = <x_1^2, \sqrt{2}x_1 x_2, x_2^2>

We can compute :math:`\phi(x) \cdot \phi(z)` easily:

.. math::

    \phi(x) \cdot \phi(z) & = <x_1^2, \sqrt{2}x_1 x_2, x_2^2> \cdot <z_1^2, \sqrt{2}z_1 z_2, z_2^2> \\
    & = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 \\
    & = (x_1 z_1 + x_2 z_2)^2 \\
    & = (x \cdot z)^2 \\
    & = K(x, z)

We call this a **kernel** function.
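
As a quick numerical sanity check (our own sketch, not from the lecture; ``phi`` and ``K`` are illustrative names), the explicit mapping and the kernel really do agree:

.. code-block:: python

    import numpy as np

    def phi(x):
        """Explicit feature map for 2-D inputs: <x1^2, sqrt(2)*x1*x2, x2^2>."""
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def K(x, z):
        """Kernel form: (x . z)^2, computed without ever building phi."""
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print(np.dot(phi(x), phi(z)))  # 1.0
    print(K(x, z))                 # 1.0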

What about the quadratic feature mapping?

.. math::

    \phi(x) \cdot \phi(z) = ... = (1+x \cdot z)^2 = K(x, z)

Kernel in SVM
-------------

Rewriting the primal form of the SVM to use the mapped space doesn't seem easy.

But we can rewrite the dual form easily!

.. math::

    \max & \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m,n=1}^N \alpha_m \alpha_n y_m y_n (\phi(x_m)^T \phi(x_n)) \\
    \text{subject to } & \sum_{n=1}^N \alpha_n y_n = 0, \alpha_n \geq 0; n = 1..N

**Kernelized SVM**

Now the SVM computes a linear boundary in the higher-dimensional space without ever explicitly transforming the vectors:

.. math::

    \max & \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m,n=1}^N \alpha_m \alpha_n y_m y_n (K(x_m, x_n)) \\
    \text{subject to } & \sum_{n=1}^N \alpha_n y_n = 0, \alpha_n \geq 0; n = 1..N

Predictions:

.. math::

    \hat{y} & = sign(\sum \alpha_n y_n x_n \cdot x' + b) \text{ (in the old space)} \\
    & \to sign(\sum \alpha_n y_n \phi(x_n) \cdot \phi(x') + b) \\
    & = sign(\sum \alpha_n y_n K(x_n, x') + b)
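
A minimal sketch of this dual-form prediction (our own illustration; ``X_sv``, ``y_sv``, ``alpha``, ``b``, and ``K`` are assumed to come from an already-trained SVM):

.. code-block:: python

    import numpy as np

    def kernel_svm_predict(x_new, X_sv, y_sv, alpha, b, K):
        """Dual-form prediction: sign(sum_n alpha_n * y_n * K(x_n, x') + b)."""
        s = sum(a * y * K(x, x_new) for a, y, x in zip(alpha, y_sv, X_sv)) + b
        return np.sign(s)

Only the support vectors (the points with :math:`\alpha_n > 0`) contribute to this sum, so prediction never touches :math:`\phi` explicitly.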
Formal Definition
-----------------
Let's use the example kernel :math:`K(x, z) = (x \cdot z)^2` for :math:`\phi(x) = <x_1^2, \sqrt{2}x_1x_2, x_2^2>`

- a kernel function is implicitly associated with some :math:`\phi`
- :math:`\phi` maps input :math:`x\in X` to a higher dimensional space :math:`F`:
    - :math:`\phi: X \to F`
- the kernel takes 2 inputs from :math:`X` and outputs their similarity in :math:`F`:
    - :math:`K: X \times X \to R`
- once you have a kernel, you **don't need to know** :math:`\phi`

**Mercer's Condition**

- a function :math:`K` can be a kernel function if a suitable :math:`\phi` exists
    - :math:`\exists \phi` s.t. :math:`K(x, z) = \phi(x) \cdot \phi(z)`
- mathematically: :math:`K` should be positive semi-definite; i.e. for all square integrable :math:`f` (i.e. :math:`\int f(x)^2 dx < \infty`),
    - :math:`\iint f(x)K(x, z)f(z)\,dx\,dz \geq 0`
- this condition is both sufficient and necessary for :math:`K` to be a kernel function

**Constructing Kernels**

We already know some proven basic kernel functions - given Mercer's condition, we can construct new kernels!

- :math:`k(x, z) = k_1(x, z) + k_2(x, z)` (direct sum)
- :math:`k(x, z) = \alpha k_1(x, z)` (:math:`\forall \alpha > 0` - scalar product)


.. note::
    Example: given that :math:`k_1` and :math:`k_2` are kernels, prove :math:`k(x, z) = k_1(x, z) + k_2(x, z)` is a kernel.

.. math::

    \iint f(x)K(x, z)f(z)dxdz & = \iint f(x)[K_1(x, z) + K_2(x, z)]f(z)dxdz \\
    & = \iint f(x)K_1(x, z)f(z)dxdz + \iint f(x)K_2(x, z)f(z)dxdz \\
    & \geq 0 + 0 = 0

.. image:: _static/svm/ex10.png

(the :math:`\phi` for the linear kernel is the identity: :math:`\phi(x) = x`)
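
On a finite sample, Mercer's condition amounts to the kernel (Gram) matrix being positive semi-definite. A small sketch (our own, not from the lecture) that checks this for the sum of two known kernels:

.. code-block:: python

    import numpy as np

    def gram_matrix(X, K):
        """Pairwise kernel values K(x_i, x_j) for a finite sample X."""
        n = len(X)
        return np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])

    def is_psd(G, tol=1e-10):
        """Finite-sample analogue of Mercer's condition: all eigenvalues >= 0."""
        return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

    X = np.random.randn(20, 2)
    k1 = lambda x, z: np.dot(x, z)               # linear kernel
    k2 = lambda x, z: (1 + np.dot(x, z)) ** 2    # quadratic kernel
    k_sum = lambda x, z: k1(x, z) + k2(x, z)     # direct sum of the two

    print(is_psd(gram_matrix(X, k_sum)))  # True: the sum of two kernels is a kernel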

Perceptron
----------
We can also apply kernels to the perceptron!

Naively, we can just replace :math:`\mathbf{x}` with :math:`\phi(x)` in the algorithm - but that requires knowledge
of :math:`\phi`

.. image:: _static/kernel/ex1.png

**Prediction**

Since :math:`w = \sum_m \alpha_m \phi(x_m)`, prediction on :math:`x_n` is easy and needs only kernel values:

.. math::

    \hat{y_n} & = sign(\sum_m \alpha_m \phi(x_m) \cdot \phi(x_n) + b) \\
    & = sign(\sum_m \alpha_m K(x_m, x_n) + b)
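
A minimal kernel perceptron sketch (our own illustration, assuming the usual update in which :math:`\alpha_m` accumulates the label :math:`y_m` on every mistake):

.. code-block:: python

    import numpy as np

    def kernel_perceptron(X, y, K, epochs=10):
        """Kernel perceptron sketch: store one alpha per training point instead of w."""
        n = len(X)
        alpha, b = np.zeros(n), 0.0
        for _ in range(epochs):
            for i in range(n):
                # prediction uses only kernel evaluations, never phi explicitly
                s = sum(alpha[m] * K(X[m], X[i]) for m in range(n)) + b
                if np.sign(s) != y[i]:   # mistake: add a signed copy of x_i to w
                    alpha[i] += y[i]
                    b += y[i]
        return alpha, b

Training and prediction both depend on the data only through :math:`K(x_m, x_n)`, which is exactly what lets us swap in any valid kernel.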
77 changes: 76 additions & 1 deletion in svm.rst

@@ -148,8 +148,13 @@ Hard-Margin SVM
- constraints are linear
- this is called the *primal form*, but most people solve the *dual form*

We can encode the primal form algebraically:

.. math::

    \min_{w, b} \max_{\alpha \geq 0} \frac{1}{2}(w \cdot w) + \sum_i \alpha_i (1-y_i(w \cdot x_i + b))

Dual Form
^^^^^^^^^

- does not change the solution
- introduces new variables :math:`\alpha_n` for each training instance
@@ -179,3 +184,73 @@ For soft-margin SVMs, support vectors are:
- points on the wrong side of the hyperplane (:math:`\xi \geq 1`)

**Conclusion**: w and b only depend on the support vectors

Derivation
""""""""""

Given the algebraically encoded primal form:

.. math::

    \min_{w, b} \max_{\alpha \geq 0} \frac{1}{2}(w \cdot w) + \sum_i \alpha_i (1-y_i(w \cdot x_i + b))

We can switch the order of the min and max (strong duality holds here, since the objective is convex and the constraints are linear):

.. math::

    \max_{\alpha \geq 0} \min_{w, b} \frac{1}{2}(w \cdot w) + \sum_i \alpha_i (1-y_i(w \cdot x_i + b)) \\
    = \max_{\alpha \geq 0} \min_{w, b} L(w, b, \alpha)

To solve the inner min, differentiate :math:`L` with respect to :math:`w` and :math:`b` and set the derivatives to zero:

.. math::

    \frac{\partial L(w, b, \alpha)}{\partial w_k} & = w_k - \sum_i \alpha_i y_i x_{i,k} & \\
    \frac{\partial L(w, b, \alpha)}{\partial w} & = w - \sum_i \alpha_i y_i x_i & \to w = \sum_i \alpha_i y_i x_i \\
    \frac{\partial L(w, b, \alpha)}{\partial b} & = - \sum_i \alpha_i y_i & \to \sum_i \alpha_i y_i = 0

- :math:`w = \sum_i \alpha_i y_i x_i` means **w** is a weighted sum of examples
- :math:`\sum_i \alpha_i y_i = 0` means positive and negative examples have the same total weight
- :math:`\alpha_i > 0` only when :math:`x_i` is a support vector, so **w** is a sum of signed *support vectors*
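
As a hedged sanity check of this conclusion (using scikit-learn, which is not part of the lecture; its ``dual_coef_`` attribute stores :math:`\alpha_i y_i` for the support vectors only):

.. code-block:: python

    import numpy as np
    from sklearn.svm import SVC

    # toy, roughly linearly separable data
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
    y = np.array([1] * 20 + [-1] * 20)

    clf = SVC(kernel="linear").fit(X, y)

    # w = sum_i alpha_i * y_i * x_i, summed over the support vectors only
    w_from_dual = clf.dual_coef_ @ clf.support_vectors_
    print(np.allclose(w_from_dual, clf.coef_))  # True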

.. image:: _static/svm/ex5.png

.. image:: _static/svm/ex6.png

**Conclusion**

.. image:: _static/svm/ex7.png

**Soft-Margin**

.. image:: _static/svm/ex8.png

subject to :math:`0 \leq \alpha_i \leq c \; (\forall i)`

Non-Linearly-Separable
----------------------

What if our data is not linearly separable?

- use a non-linear classifier
- transform our data so that it is separable, somehow
    - e.g. adding a dummy dimension based on a quadratic formula of the real dimension (see the sketch below)
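
Following the dummy-dimension idea above, a minimal sketch (our own illustration, with made-up 1-D data):

.. code-block:: python

    import numpy as np

    # 1-D data: positives are the points with |x| > 1, so no single threshold separates them
    x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
    y = np.sign(np.abs(x) - 1.0)

    # add a quadratic dummy dimension; in (x, x^2) space the classes split along x^2 = 1
    X_mapped = np.column_stack([x, x ** 2])
    w, b = np.array([0.0, 1.0]), -1.0        # the hyperplane x^2 - 1 = 0
    print(np.sign(X_mapped @ w + b) == y)    # True for every point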

.. image:: _static/svm/ex9.png
:width: 500

Feature Mapping
^^^^^^^^^^^^^^^
We can map the original feature vector :math:`x` to a higher-dimensional feature vector :math:`\phi(x)`.

e.g. quadratic feature mapping:

.. math::

    \phi(x) = < & 1, 2x_1, 2x_2, ..., 2x_D, \\
    & x_1^2, x_1x_2, ..., x_1x_D, \\
    & x_2x_1, x_2^2, ..., x_2x_D, \\
    & ... >

Pros: this improves separability, so you can apply a linear model more confidently.

Cons: there are a lot more features now (and a lot of repeated features), which means more computation and a greater risk of overfitting.
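
A small sketch (our own, following the mapping written above) that makes the cost concrete: the feature count grows from :math:`D` to :math:`1 + D + D^2`:

.. code-block:: python

    import numpy as np

    def quadratic_feature_map(x):
        """Explicit quadratic map from the notes: <1, 2*x_i, all pairwise x_i*x_j>."""
        D = len(x)
        feats = [1.0]
        feats += [2 * xi for xi in x]                                # linear terms
        feats += [x[i] * x[j] for i in range(D) for j in range(D)]   # pairwise products
        return np.array(feats)

    x = np.arange(1.0, 6.0)  # D = 5
    print(len(x), "->", len(quadratic_feature_map(x)))  # 5 -> 31 features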


