Ensemble/Boosting
=================

Ensemble Methods
----------------

- use an ensemble/group of hypotheses
- diversity important; an ensemble of "yes-men" is useless
- get diverse hypotheses by using
    - different data
    - different algorithms
    - different hyperparameters
Why?

- averaging reduces variance
    - the ensemble is more stable than an individual hypothesis
    - consider a biased coin with p(head) = 1/3
        - variance of one flip = 1/3 - 1/9 = 2/9
        - variance of the average of 2 flips = (2/9) / 2 = 1/9
- averaging makes fewer mistakes
    - consider 3 classifiers with accuracy 0.8, 0.7, and 0.7
    - the probability that the majority vote is correct = (f1 correct, f2 correct, f3 wrong) + (f1 correct, f2 wrong, f3 correct) + (f1 wrong, f2 correct, f3 correct) + (f1 correct, f2 correct, f3 correct)
    - = :math:`(.8*.7*.3) + (.8*.3*.7) + (.2*.7*.7) + (.8*.7*.7) \approx 0.83`
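
As a quick sanity check (not in the original notes), we can enumerate the eight outcomes directly; the accuracies 0.8/0.7/0.7 are the ones assumed above:

.. code-block:: python

    from itertools import product

    # accuracies of the three independent classifiers assumed above
    acc = [0.8, 0.7, 0.7]

    # probability that the majority vote (at least 2 of 3) is correct
    p_majority = 0.0
    for outcome in product([True, False], repeat=3):   # each classifier right/wrong
        p = 1.0
        for a, correct in zip(acc, outcome):
            p *= a if correct else (1 - a)
        if sum(outcome) >= 2:                          # majority is correct
            p_majority += p

    print(p_majority)  # 0.826 -- better than any single classifier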

Creation
^^^^^^^^

- use different training sets
    - bootstrap sampling: pick *m* examples from the labeled data with replacement
    - cross-validation sampling
    - reweight data (boosting, later)
- use different features
    - random forests "hide" some features
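
A minimal sketch of one bootstrap replicate (the toy data and names here are illustrative, not from the notes):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))        # toy labeled data (features)
    y = rng.integers(0, 2, size=100)     # toy labels

    # one bootstrap replicate: draw m examples *with replacement*
    m = len(X)
    idx = rng.integers(0, m, size=m)
    X_boot, y_boot = X[idx], y[idx]
    # train one ensemble member on (X_boot, y_boot); repeat for each member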

Prediction
^^^^^^^^^^

- unweighted vote
- weighted vote
    - pay more attention to better predictors
- cascade
    - filter examples; each level predicts or passes on
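
A rough sketch of a weighted vote over binary (:math:`\pm 1`) predictions; the weights ``alpha`` are whatever the ensemble method assigns (e.g. AdaBoost's :math:`\alpha_t` below):

.. code-block:: python

    import numpy as np

    def weighted_vote(predictions, alpha):
        """predictions: (T, n) array of +/-1 votes from T classifiers;
        alpha: (T,) array of per-classifier weights."""
        scores = alpha @ predictions      # weighted sum of votes per example
        return np.sign(scores)            # +1 / -1 by weight of the vote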

Boosting
--------

- "boosts" the performance of another learning algorithm
- creates a set of classifiers and predicts with a weighted vote
- uses different distributions on the sample to get diversity
    - up-weight hard examples, down-weight easy examples
- note: ensembles were used before to reduce variance

.. image:: _static/boosting/ex1.png
    :width: 450

If boosting is possible, then:

- can use fairly wild guesses to produce highly accurate predictions
- if you can learn "part way" you can learn "all the way"
- should be able to improve any learning algorithm
- for any learning problem:
    - either can always learn with nearly perfect accuracy
    - or there exist cases where one cannot learn even slightly better than random guessing

Ada-Boost
^^^^^^^^^

Given a training sample :math:`S` with labels +/- and a learning algorithm :math:`L`:

1. for t from 1 to T do
    1. create distribution :math:`D_t` on :math:`S`
    2. call :math:`L` with :math:`D_t` on :math:`S` to get hypothesis :math:`h_t`
        - i.e. :math:`\min \sum_n D_t(n) l(f(x_n), y_n)` where :math:`D_t(n)` is the weight of sample :math:`n`
    3. calculate weight :math:`\alpha_t` for :math:`h_t`
2. the final hypothesis is :math:`F(x) = \sum_t \alpha_t h_t(x)`, or :math:`H(x)` = the value with the most weight

formally:

.. image:: _static/boosting/ex2.png
    :width: 450

So how do we pick :math:`D_t` and :math:`\alpha_t`?

- :math:`D_1(i) = 1/m` - the weight assigned to :math:`(x_i, y_i)` at :math:`t=1`
- given :math:`D_t` and :math:`h_t`:
    - :math:`D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i))`
        - if misclassified, :math:`y_i h_t(x_i) < 0`, so the exponent is positive and the weight increases by a factor > 1
        - if correct, :math:`y_i h_t(x_i) > 0`, so the exponent is negative and the weight decreases by a factor < 1
    - where :math:`Z_t` is a normalization factor
    - where :math:`\alpha_t = \frac{1}{2} \ln (\frac{1-\epsilon_t}{\epsilon_t}) > 0` (since :math:`\epsilon_t < 1/2`)
- :math:`H_{final}(x) = sign(\sum_t \alpha_t h_t(x))`
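
A minimal sketch of these updates (not from the notes), using depth-1 decision stumps as the weak learner :math:`L`; labels are assumed to be in :math:`\{-1, +1\}` and the stump learner is an illustrative choice:

.. code-block:: python

    import numpy as np

    def train_stump(X, y, D):
        """Weak learner: single-feature threshold with lowest weighted error."""
        n, d = X.shape
        best = None
        for j in range(d):
            for thresh in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thresh, 1, -1)
                    err = np.sum(D[pred != y])          # weighted error under D
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign)
        return best  # (weighted error, feature, threshold, sign)

    def stump_predict(stump, X):
        _, j, thresh, sign = stump
        return sign * np.where(X[:, j] > thresh, 1, -1)

    def adaboost(X, y, T=10):
        n = len(y)
        D = np.full(n, 1.0 / n)                  # D_1(i) = 1/m
        stumps, alphas = [], []
        for t in range(T):
            stump = train_stump(X, y, D)         # h_t minimizes weighted loss
            eps = max(stump[0], 1e-10)           # epsilon_t
            alpha = 0.5 * np.log((1 - eps) / eps)
            pred = stump_predict(stump, X)
            D = D * np.exp(-alpha * y * pred)    # up-weight mistakes, down-weight correct
            D /= D.sum()                         # Z_t normalization
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, np.array(alphas)

    def predict(stumps, alphas, X):
        F = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
        return np.sign(F)                        # H_final(x) = sign(sum_t alpha_t h_t(x))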

**Example**:

.. image:: _static/boosting/ex3.png
    :width: 300

now the weights become :math:`\frac{1}{10} e^{0.42}` for the misclassified and :math:`\frac{1}{10} e^{-0.42}` for the correct
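
The factor 0.42 is the first round's :math:`\alpha`; assuming the first stump misclassifies 3 of the 10 points in the figure (so :math:`\epsilon_1 = 0.3`, consistent with the :math:`e^{0.42}` above):

.. math::

    \alpha_1 = \frac{1}{2} \ln \frac{1 - \epsilon_1}{\epsilon_1} = \frac{1}{2} \ln \frac{0.7}{0.3} \approx 0.42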

.. image:: _static/boosting/ex4.png
    :width: 400

.. image:: _static/boosting/ex5.png
    :width: 400

.. image:: _static/boosting/ex6.png
    :width: 400

Analyzing Error
^^^^^^^^^^^^^^^

Thm: Write :math:`\epsilon_t` as :math:`1/2 - \gamma_t`, where :math:`\gamma_t` = "edge" = how much better than random guessing.
Then:

.. math::

    \text{training error}(H_{final}) & \leq \prod_t [2 \sqrt{\epsilon_t (1-\epsilon_t)}] \\
    & = \prod_t \sqrt{1-4\gamma_t^2} \\
    & \leq \exp(-2\sum_t \gamma_t^2)

So if :math:`\forall t: \gamma_t \geq \gamma > 0`, then :math:`\text{training error}(H_{final}) \leq e^{-2\gamma^2 T}`.

Therefore, as :math:`T \to \infty`, the training error :math:`\to 0`.
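
To get a feel for the rate, a quick numeric check with an assumed edge of :math:`\gamma = 0.1` over :math:`T = 100` rounds:

.. code-block:: python

    import math

    gamma, T = 0.1, 100                      # assumed edge and number of rounds
    bound = math.exp(-2 * gamma**2 * T)
    print(f"training error <= {bound:.3f}")  # ~0.135 after 100 rounds; -> 0 as T grows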

Proof
^^^^^

Let :math:`F(x) = \sum_t \alpha_t h_t(x) \to H_{final}(x) = sign(F(x))`

Step 1: unwrapping the recurrence

.. math::

    D_{final}(i) & = \frac{1}{m} \frac{\exp(-y_i \sum_t \alpha_t h_t(x_i))}{\prod_t Z_t} \\
    & = \frac{1}{m} \frac{\exp(-y_i F(x_i))}{\prod_t Z_t}

Step 2: :math:`\text{training error}(H_{final}) \leq \prod_t Z_t`

.. math::

    \text{training error}(H_{final}) & = \frac{1}{m} \sum_i \begin{cases} 1 & \text{ if } y_i \neq H_{final}(x_i) \\ 0 & \text{ otherwise} \end{cases} \\
    & = \frac{1}{m} \sum_i \begin{cases} 1 & \text{ if } y_i F(x_i) \leq 0 \\ 0 & \text{ otherwise} \end{cases} \\
    & \leq \frac{1}{m} \sum_i \exp(-y_i F(x_i)) \\
    & = \sum_i D_{final}(i) \prod_t Z_t \\
    & = \prod_t Z_t

Step 3: :math:`Z_t = 2 \sqrt{\epsilon_t (1-\epsilon_t)}`

.. math::

    Z_t & = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) \\
    & = \sum_{i:y_i \neq h_t(x_i)} D_t(i)e^{\alpha_t} + \sum_{i:y_i = h_t(x_i)} D_t(i) e^{-\alpha_t} \\
    & = \epsilon_t e^{\alpha_t} + (1-\epsilon_t) e^{-\alpha_t} \\
    & = 2 \sqrt{\epsilon_t (1-\epsilon_t)}

Discussion
^^^^^^^^^^

We might expect that, even as the training error approaches 0 as :math:`T` increases, the test error won't - overfitting!

We can actually bound the "generalization error" (basically the test error):

.. math::

    \text{generalization error} \leq \text{training error} + \tilde{O}(\sqrt{\frac{dT}{m}})

where :math:`m` = # of training samples, :math:`d` = "complexity" of the weak classifiers, :math:`T` = # of rounds

But in reality, it's not always a tradeoff between training error and test error.

Margin Approach
"""""""""""""""

- training error only measures whether classifications are right or wrong
    - should also consider the confidence of classifications
- :math:`H_{final}` is a weighted majority vote of weak classifiers
    - measure confidence by the *margin* = strength of the vote
    - = (weighted fraction voting correctly) - (weighted fraction voting incorrectly)
- so as we train more, we increase the margin, which leads to a decrease in test loss

- both AdaBoost and SVMs
    - work by maximizing margins
    - find a linear threshold function in a high-dimensional space
    - but they use different norms

AdaBoost is:

- fast
- simple, easy to program
- no hyperparameters (except T)
- flexible, can combine with any learning algorithm
- no prior knowledge needed about the weak learner
- provably effective (provided a rough rule of thumb)
- versatile

But:

- performance depends on the data and the weak learner
- consistent with theory, AdaBoost can fail if:
    - weak classifiers are too complex (overfitting)
    - weak classifiers are too weak (basically random guessing) - underfitting, or low margins -> overfitting
- susceptible to uniform noise

EM/GMM
======

Gaussian Mixture Models: estimate a mixture of :math:`K` Gaussians

- pick a sample from Gaussian :math:`k` with probability :math:`\pi_k`
- generative distribution :math:`p(\mathbf{x}) = \sum_{k=1}^K \pi_k N(\mathbf{x} | \mathbf{\mu}_k, \mathbf{\Sigma}_k)`
    - where :math:`N` is the Gaussian distribution
    - a *mixture distribution* with mixture coefficients :math:`\mathbf{\pi}`
- for an iid sample :math:`\mathbf{X}` and parameters :math:`\theta = \{\mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}\}`, we have:

.. math::

    p(\mathbf{X} | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) & = \prod_{n=1}^N p(\mathbf{x}_n | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) \\
    & = \prod_{n=1}^N \sum_{k=1}^K \pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)

What is :math:`\theta`?

Log-Likelihood
--------------

.. math::

    L(\pi, \mu, \Sigma) & = \ln p(\mathbf{X} | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) \\
    & = \ln (\prod_{n=1}^N \sum_{k=1}^K \pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)) \\
    & = \sum_{n=1}^N \ln (\sum_{k=1}^K \pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k))

This is very hard to solve directly! (the log of a sum does not separate, so there is no closed-form maximizer)

Iteratively
-----------

- which Gaussian is picked for :math:`x_i \in X` is a latent variable :math:`z_i \in \{0, 1\}^K` (one-of-K encoding)
    - :math:`Z` is the vector of :math:`z_i`'s
- note that :math:`p(\mathbf{X} | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) = \sum_Z p(\mathbf{X}, Z | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma})`
- the complete data is :math:`\{X, Z\}`
    - the incomplete data is just :math:`X`
- we don't know :math:`Z`, but from :math:`\theta^{old}` we can infer its (posterior) distribution
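
The exact E/M update equations are given in the images below; as a rough sketch of the standard EM loop for a GMM (the updates and variable names here are the usual textbook ones, assumed rather than transcribed from the slides):

.. code-block:: python

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = np.full(K, 1.0 / K)                     # mixture coefficients
        mu = X[rng.choice(n, K, replace=False)]      # initialize means from data
        sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

        for _ in range(n_iter):
            # E step: responsibilities gamma(z_nk) = p(z_k | x_n, theta_old)
            dens = np.column_stack([
                pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k]) for k in range(K)
            ])
            gamma = dens / dens.sum(axis=1, keepdims=True)

            # M step: re-estimate pi, mu, sigma from the responsibilities
            Nk = gamma.sum(axis=0)
            pi = Nk / n
            mu = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        return pi, mu, sigma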

.. image:: _static/emgmm/ex1.png

.. image:: _static/emgmm/ex2.png

.. image:: _static/emgmm/ex3.png

Maximize this to get the new parameters.

.. image:: _static/emgmm/ex4.png

Neural Nets
===========

Neural nets can be used to approximate nonlinear functions with a hypothesis!

Neural nets are an intuitive extension of the perceptron - think of the perceptron as a 1-layer neural net.

But in the perceptron, the output is just the sign of the linear combination of the inputs - in a NN, we use an
*activation function*:

.. math::

    sign(W_i \cdot x) \to f(W_i \cdot x)

Activations should be nonlinear - if they're linear, the extra layers are redundant (a composition of linear maps is still linear)

Activations
-----------

- sign: -1, 0, or 1
    - not differentiable
    - often used for the output layer
- tanh: :math:`\frac{e^{2x}-1}{e^{2x}+1}`
- sigmoid: :math:`\frac{e^x}{1+e^x}`
- ReLU: :math:`\max(0, x)`
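
These are straightforward to write down; a quick sketch:

.. code-block:: python

    import numpy as np

    def tanh(x):     # (e^{2x} - 1) / (e^{2x} + 1)
        return np.tanh(x)

    def sigmoid(x):  # e^x / (1 + e^x), written in the numerically stabler form
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):     # max(0, x), element-wise
        return np.maximum(0.0, x)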

Training
--------

Let's consider the following loss objective on a 2-layer neural net with weights :math:`W` and :math:`v`, to be minimized over :math:`W` and :math:`v`:

.. math::

    L(W, v) = \sum_n \frac{1}{2} (y^n - score^n)^2

To minimize it, we just need to find

.. math::

    \frac{\partial L}{\partial W}, \frac{\partial L}{\partial v}

We do this using *backpropagation*.
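
A minimal sketch of that 2-layer net with manual backpropagation, assuming a tanh hidden layer and the squared loss above (the data, hidden width, and learning rate are illustrative assumptions):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                 # toy inputs
    y = np.sin(X @ np.array([1.0, -2.0, 0.5]))    # toy targets

    H = 16                                        # hidden width (assumed)
    W = rng.normal(scale=0.1, size=(3, H))        # first-layer weights
    v = rng.normal(scale=0.1, size=H)             # second-layer weights
    lr = 0.01

    for _ in range(500):
        # forward pass: score = v . tanh(W^T x)
        a = X @ W                  # pre-activations
        h = np.tanh(a)             # hidden activations
        score = h @ v
        err = score - y            # dL/dscore for L = 1/2 (y - score)^2

        # backward pass (chain rule)
        dv = h.T @ err                   # dL/dv
        dh = np.outer(err, v)            # dL/dh
        da = dh * (1 - h**2)             # tanh'(a) = 1 - tanh(a)^2
        dW = X.T @ da                    # dL/dW

        v -= lr * dv / len(X)
        W -= lr * dW / len(X)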

.. image:: _static/nn/ex1.png
    :width: 450

.. image:: _static/nn/ex2.png
    :width: 450

.. image:: _static/nn/ex3.png
    :width: 450

VAE
---

Whereas a normal AE turns an image into a latent vector, a VAE learns the parameters of a Gaussian distribution
from which the latent vector is sampled:

.. math::

    c_i = \exp(\sigma_i)e_i + m_i

where :math:`c_i` is a component of the latent vector, :math:`e_i` is a noise term sampled from a standard Gaussian, and
:math:`m_i` and :math:`\sigma_i` are the learned mean and log-standard-deviation.
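
A tiny sketch of that reparameterization (the numeric values of ``m`` and ``sigma`` are illustrative; in a real VAE they come from the encoder):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    m = np.array([0.3, -1.2])         # encoder's mean output (illustrative values)
    sigma = np.array([-0.5, 0.1])     # encoder's log-std output (illustrative values)

    e = rng.standard_normal(m.shape)  # noise sampled from N(0, I)
    c = np.exp(sigma) * e + m         # reparameterized latent: c_i = exp(sigma_i) e_i + m_i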

KL Divergence
-------------

Roughly, a measure of how close two distributions are to each other (always >= 0)

.. image:: _static/nn/ex4.png
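
For concreteness, a discrete-case sketch (the definition used here is the standard one, not taken from the slide above):

.. code-block:: python

    import numpy as np

    def kl_divergence(q, p):
        """KL(q || p) = sum_i q_i * log(q_i / p_i), for discrete distributions."""
        q, p = np.asarray(q, float), np.asarray(p, float)
        mask = q > 0                    # terms with q_i = 0 contribute 0
        return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.51 > 0: distributions differ
    print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions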

Random
------

Random note: the GAN objective can also be written :math:`\max_D V(G,D) = -2 \log 2 + 2 JSD(P_{data}(x) || P_G(x))`

f-GAN
^^^^^

Uses a generalized divergence function:

.. math::

    D_f(q\|p) = \int p(x) f\left[\frac{q(x)}{p(x)}\right]dx

by making :math:`f(t) = t \log t`, this is the KL divergence :math:`KL(q\|p)`
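
A one-line check of that special case (standard algebra, not from the notes):

.. math::

    D_f(q\|p) = \int p(x) \, \frac{q(x)}{p(x)} \ln \frac{q(x)}{p(x)} \, dx = \int q(x) \ln \frac{q(x)}{p(x)} \, dx = KL(q\|p)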