Commit 808f3f3: final commit
zhudotexe committed Jun 22, 2020 (1 parent: e002248)
Showing 19 changed files with 428 additions and 0 deletions.
Binary file added _static/boosting/ex1.png
Binary file added _static/boosting/ex2.png
Binary file added _static/boosting/ex3.png
Binary file added _static/boosting/ex4.png
Binary file added _static/boosting/ex5.png
Binary file added _static/boosting/ex6.png
Binary file added _static/emgmm/ex1.png
Binary file added _static/emgmm/ex2.png
Binary file added _static/emgmm/ex3.png
Binary file added _static/emgmm/ex4.png
Binary file added _static/nn/ex1.png
Binary file added _static/nn/ex2.png
Binary file added _static/nn/ex3.png
Binary file added _static/nn/ex4.png
201 changes: 201 additions & 0 deletions boosting.rst
Ensemble/Boosting
=================

Ensemble Methods
----------------

- use an ensemble/group of hypotheses
- diversity important; ensemble of "yes-men" is useless
- get diverse hypotheses by using
- different data
- different algorithms
- different hyperparameters

Why?

- averaging reduces variance
- an ensemble is more stable than an individual hypothesis
- consider biased coin p(head) = 1/3
- variance on one flip = 1/3 - 1/9 = 2/9
- variance of the average of 2 independent flips = (2/9) / 2 = 1/9
- averaging makes fewer mistakes
- consider 3 classifiers with accuracy 0.8, 0.7, and 0.7
- assuming independent errors, the probability that the majority vote is correct = P(f1 correct, f2 correct, f3 wrong) + P(f1 correct, f2 wrong, f3 correct) + P(f1 wrong, f2 correct, f3 correct) + P(f1 correct, f2 correct, f3 correct)
- = :math:`(.8 \cdot .7 \cdot .3) + (.8 \cdot .3 \cdot .7) + (.2 \cdot .7 \cdot .7) + (.8 \cdot .7 \cdot .7) \approx 0.83` (checked in code below)
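
A quick way to check that arithmetic, assuming the three classifiers err independently:

.. code-block:: python

    from itertools import product

    accuracies = [0.8, 0.7, 0.7]

    # sum P(outcome) over every pattern where at least 2 of the 3 classifiers are correct
    p_majority_correct = 0.0
    for pattern in product([True, False], repeat=3):
        if sum(pattern) >= 2:
            p = 1.0
            for acc, correct in zip(accuracies, pattern):
                p *= acc if correct else (1 - acc)
            p_majority_correct += p

    print(p_majority_correct)  # 0.826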

Creation
^^^^^^^^

- use different training sets
- bootstrap example: pick *m* examples from labeled data with replacement
- cross-validation sampling
- reweight data (boosting, later)
- use different features
- random forests "hide" some features

Prediction
^^^^^^^^^^

- unweighted vote
- weighted vote
- pay more attention to better predictors
- cascade
- filter examples; each level predicts or passes on

Boosting
--------

- "boosts" the performance of another learning algorithm
- creates a set of classifiers and predicts with weighted vote
- use different distributions on sample to get diversity
- up-weight hard, down-weight easy examples
- note: ensembles used before to reduce variance

.. image:: _static/boosting/ex1.png
:width: 450

If boosting is possible, then:

- can use fairly wild guesses to produce highly accurate predictions
- if you can learn "part way" you can learn "all the way"
- should be able to improve any learning algorithm
- for any learning problem:
- either can learn always with nearly perfect accuracy
- or there exist cases where cannot learn even slightly better than random guessing

Ada-Boost
^^^^^^^^^

Given a training sample S with labels +/-, and a learning algorithm L:

1. for t from 1 to T do
1. create distribution :math:`D_t` on :math:`S`
2. call :math:`L` with :math:`D_t` on :math:`S` to get hypothesis :math:`h_t`
1. i.e. :math:`h_t = \arg\min_h \sum_n D_t(n)\, l(h(x_n), y_n)`, where :math:`D_t(n)` is the weight of sample :math:`n`
3. calculate weight :math:`\alpha_t` for :math:`h_t`
2. final hypothesis is :math:`F(x) = \sum_t \alpha_t h_t(x)`, or :math:`H(x)` = value with most weight

formally:

.. image:: _static/boosting/ex2.png
:width: 450

So how do we pick :math:`D_t` and :math:`\alpha_t`?

- :math:`D_1(i) = 1/m` - the weight assigned to :math:`(x_i, y_i)` at :math:`t=1`
- given :math:`D_t` and :math:`h_t`:
- :math:`D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i))`
- if :math:`h_t` misclassifies :math:`(x_i, y_i)` (so :math:`y_i h_t(x_i) = -1`), the weight is multiplied by :math:`e^{\alpha_t} > 1`
- if it classifies correctly, the weight is multiplied by :math:`e^{-\alpha_t} < 1`
- where :math:`Z_t` is a normalization factor
- where :math:`\alpha_t = \frac{1}{2} \ln (\frac{1-\epsilon_t}{\epsilon_t}) > 0`
- :math:`H_{final}(x) = \text{sign}(\sum_t \alpha_t h_t(x))` (see the sketch after this list)
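
Putting these updates together, a minimal sketch (assuming a hypothetical ``weak_learner(X, y, weights)`` that returns a classifier with a ``.predict`` method; not the course's reference code):

.. code-block:: python

    import numpy as np

    def adaboost(X, y, weak_learner, T):
        """y has labels in {-1, +1}; assumes every round achieves 0 < epsilon_t < 1/2."""
        m = len(y)
        D = np.full(m, 1.0 / m)              # D_1(i) = 1/m
        hypotheses, alphas = [], []
        for t in range(T):
            h = weak_learner(X, y, D)        # train on the weighted sample
            pred = h.predict(X)
            eps = D[pred != y].sum()         # weighted error epsilon_t
            alpha = 0.5 * np.log((1 - eps) / eps)
            D *= np.exp(-alpha * y * pred)   # up-weight mistakes, down-weight correct
            D /= D.sum()                     # normalize by Z_t
            hypotheses.append(h)
            alphas.append(alpha)
        # H_final(x) = sign(sum_t alpha_t h_t(x))
        return lambda X_new: np.sign(
            sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
        )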

**Example**:

.. image:: _static/boosting/ex3.png
:width: 300

Now the weights become :math:`\frac{1}{10} e^{0.42}` for the misclassified points and :math:`\frac{1}{10} e^{-0.42}` for the correct ones (before normalizing by :math:`Z_1`).
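
The 0.42 comes from the round-1 error, a step the notes leave implicit (assuming 3 of the 10 points are misclassified, which matches the :math:`e^{\pm 0.42}` factors):

.. math::

    \epsilon_1 = \frac{3}{10}, \qquad
    \alpha_1 = \frac{1}{2} \ln \frac{1-\epsilon_1}{\epsilon_1} = \frac{1}{2} \ln \frac{7}{3} \approx 0.42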

.. image:: _static/boosting/ex4.png
:width: 400

.. image:: _static/boosting/ex5.png
:width: 400

.. image:: _static/boosting/ex6.png
:width: 400

Analyzing Error
^^^^^^^^^^^^^^^

Theorem: write :math:`\epsilon_t = 1/2 - \gamma_t`, where :math:`\gamma_t` (the "edge") measures how much better :math:`h_t` does than random guessing.
Then:

.. math::

    \text{training error}(H_{final}) & \leq \prod_t [2 \sqrt{\epsilon_t (1-\epsilon_t)}] \\
    & = \prod_t \sqrt{1-4\gamma_t^2} \\
    & \leq \exp(-2\sum_t \gamma_t^2)

So if :math:`\forall t: \gamma_t \geq \gamma > 0`, then :math:`\text{training error}(H_{final}) \leq e^{-2\gamma^2 T}`.

therefore, as :math:`T \to \infty`, training error :math:`\to 0`
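
For a sense of scale (numbers chosen here purely for illustration): with an edge of :math:`\gamma = 0.1` on every round,

.. math::

    \text{training error}(H_{final}) \leq e^{-2(0.1)^2 T} = e^{-0.02T},
    \qquad e^{-0.02 \cdot 500} = e^{-10} \approx 4.5 \times 10^{-5}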

Proof
^^^^^

Let :math:`F(x) = \sum_t \alpha_t h_t(x)`, so that :math:`H_{final}(x) = \text{sign}(F(x))`.

Step 1: unwrapping recurrence

.. math::

    D_{final}(i) & = \frac{1}{m} \frac{\exp(-y_i \sum_t \alpha_t h_t(x_i))}{\prod_t Z_t} \\
    & = \frac{1}{m} \frac{\exp(-y_i F(x_i))}{\prod_t Z_t}

Step 2: :math:`\text{training error}(H_{final}) \leq \prod_t Z_t`

.. math::

    \text{training error}(H_{final}) & = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i \neq H_{final}(x_i) \\ 0 & \text{otherwise} \end{cases} \\
    & = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i F(x_i) \leq 0 \\ 0 & \text{otherwise} \end{cases} \\
    & \leq \frac{1}{m} \sum_i \exp(-y_i F(x_i)) \\
    & = \sum_i D_{final}(i) \prod_t Z_t \\
    & = \prod_t Z_t

(the inequality uses :math:`\mathbf{1}[u \leq 0] \leq e^{-u}`; the last two lines use Step 1)

Step 3: :math:`Z_t = 2 \sqrt{\epsilon_t (1-\epsilon_t)}`

.. math::

    Z_t & = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) \\
    & = \sum_{i:y_i \neq h_t(x_i)} D_t(i)e^{\alpha_t} + \sum_{i:y_i = h_t(x_i)} D_t(i) e^{-\alpha_t} \\
    & = \epsilon_t e^{\alpha_t} + (1-\epsilon_t) e^{-\alpha_t} \\
    & = \epsilon_t \sqrt{\frac{1-\epsilon_t}{\epsilon_t}} + (1-\epsilon_t) \sqrt{\frac{\epsilon_t}{1-\epsilon_t}} \qquad \text{(plugging in } \alpha_t \text{)} \\
    & = 2 \sqrt{\epsilon_t (1-\epsilon_t)}

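The notes don't spell out where this :math:`\alpha_t` comes from: it is the value that minimizes :math:`Z_t`,

.. math::

    \frac{\partial Z_t}{\partial \alpha_t} = \epsilon_t e^{\alpha_t} - (1-\epsilon_t) e^{-\alpha_t} = 0
    \implies e^{2\alpha_t} = \frac{1-\epsilon_t}{\epsilon_t}
    \implies \alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t}
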
Discussion
^^^^^^^^^^

We might expect that even as the training error approaches 0 with increasing :math:`T`, the test error will not keep dropping, i.e. overfitting!

We can actually predict "generalization error" (basically test error):

.. math::

    \text{generalization error} \leq \text{training error} + \tilde{O}\left(\sqrt{\frac{dT}{m}}\right)

where :math:`m` = # of training samples, :math:`d` = "complexity" of the weak classifiers, :math:`T` = # of rounds.

But in reality, the test error often keeps decreasing even after the training error reaches 0, so it is not simply a tradeoff between training error and test error.

Margin Approach
"""""""""""""""

- training error only measures whether classifications are right or wrong
- should also consider confidence of classifications
- :math:`H_{final}` is weighted majority vote of weak classifiers
- measure confidence by *margin* = strength of the vote
- = (weighted fraction voting correctly) - (weighted fraction voting incorrectly)
- so as we train more, we increase the margin, which leads to a decrease in test loss

- both AdaBoost and SVMs
- work by maximizing margins
- find linear threshold function in high-dimensional space
- but they use different norms

AdaBoost is:

- fast
- simple, easy to program
- no hyperparameters (except T)
- flexible, can combine with any learning algorithm
- no prior knowledge needed about weak learner
- provably effective (provided the weak learner can consistently find rough rules of thumb)
- versatile

But:

- performance depends on data and weak learner
- consistent with theory, AdaBoost can fail if:
- weak classifiers too complex (overfitting)
- weak classifiers too weak (basically random guessing): underfitting, or low margins -> overfitting
- susceptible to uniform noise
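
For practical use, a library implementation is usually enough; a minimal scikit-learn example (assuming scikit-learn is available; its default weak learner is a depth-1 decision tree, i.e. a decision stump):

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = AdaBoostClassifier(n_estimators=100, random_state=0)  # T = 100 rounds
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # test accuracy
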
45 changes: 45 additions & 0 deletions emgmm.rst
EM/GMM
======

Gaussian Mixture Models: Estimate Mixtures of :math:`K` Gaussians

- pick sample from gaussian :math:`k` with prob. :math:`\pi_k`
- generative distribution :math:`p(\mathbf{x}) = \sum_{k=1}^K \pi_k N(\mathbf{x} | \mathbf{\mu}_k, \mathbf{\Sigma}_k)`
- where :math:`N` is the gaussian distribution
- a *mixture distribution* with mixture coefficients :math:`\mathbf{\pi}`
- for iid sample :math:`\mathbf{X}` and parameters :math:`\theta = \{\mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}\}`, we have:

.. math::

    p(\mathbf{X} | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) & = \prod_{n=1}^N p(\mathbf{x}_n | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) \\
    & = \prod_{n=1}^N \sum_{k=1}^K \pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)

What is :math:`\theta`? We estimate it from the data by maximizing the likelihood.

Log-Likelihood
--------------

.. math::

    L(\pi, \mu, \Sigma) & = \ln p(\mathbf{X} | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) \\
    & = \ln \left(\prod_{n=1}^N \sum_{k=1}^K \pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)\right) \\
    & = \sum_{n=1}^N \ln \left(\sum_{k=1}^K \pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)\right)

This is very hard to solve directly (the sum inside the log prevents a closed-form solution)!

Iteratively
-----------

- which gaussian was picked for :math:`x_i \in X` is a latent variable :math:`z_i \in \{0, 1\}^K` (1-of-K encoding)
- :math:`Z` is vector of :math:`z_i`'s
- note that :math:`p(\mathbf{X} | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) = \sum_Z p(\mathbf{X}, Z | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma})`
- complete data is :math:`\{X, Z\}`
- incomplete data is just :math:`X`
- we don't know :math:`Z`, but we can infer its posterior from :math:`\theta^{old}` (see the responsibilities below)
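
For reference, the posterior "responsibility" of component :math:`k` for point :math:`\mathbf{x}_n` (the figures below walk through the full derivation) takes the standard form

.. math::

    \gamma(z_{nk}) = p(z_{nk} = 1 | \mathbf{x}_n, \theta^{old})
    = \frac{\pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)}{\sum_{j=1}^K \pi_j N(\mathbf{x}_n | \mathbf{\mu}_j, \mathbf{\Sigma}_j)}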

.. image:: _static/emgmm/ex1.png

.. image:: _static/emgmm/ex2.png

.. image:: _static/emgmm/ex3.png

Maximize this to get the new parameters.

.. image:: _static/emgmm/ex4.png
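
A minimal NumPy/SciPy sketch of the resulting EM loop (a sketch under standard assumptions, with only a small ridge for numerical stability; not the course's reference code):

.. code-block:: python

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iters=100, seed=0):
        """Fit a K-component GMM to X of shape (N, d) with EM; returns (pi, mu, Sigma)."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        pi = np.full(K, 1.0 / K)                       # mixture coefficients
        mu = X[rng.choice(N, K, replace=False)]        # init means at random data points
        Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

        for _ in range(n_iters):
            # E step: responsibilities gamma[n, k] under theta^old
            gamma = np.stack(
                [pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)],
                axis=1,
            )
            gamma /= gamma.sum(axis=1, keepdims=True)

            # M step: re-estimate pi, mu, Sigma from the responsibilities
            Nk = gamma.sum(axis=0)
            pi = Nk / N
            mu = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)

        return pi, mu, Sigma
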
4 changes: 4 additions & 0 deletions index.rst
Welcome to cse142-notes's documentation!
svm
kernel
unsupervised
pca
emgmm
boosting
nn



80 changes: 80 additions & 0 deletions nn.rst
Neural Nets
===========

Neural nets let our hypotheses approximate nonlinear functions!

Neural nets are an intuitive extension of perceptron - think of perceptron as a 1-layer neural net.

But in perceptron, the output is just the sign of the linear combination of the inputs - in NN, we use an
*activation function*

.. math::

    \text{sign}(W_i \cdot x) \to f(W_i \cdot x)

Activations should be nonlinear: if they're linear, extra layers are redundant (a composition of linear maps is still linear).

Activations
-----------

- sign: -1, 0, or 1
- not differentiable
- often used for output layer
- tanh: :math:`\frac{e^{2x}-1}{e^{2x}+1}`
- sigmoid: :math:`\frac{e^x}{1+e^x}`
- ReLU: :math:`\max(0, x)` (see the code sketch after this list)
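
These map directly to code; a quick NumPy reference (``np.sign`` and ``np.tanh`` are built in):

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        # e^x / (1 + e^x), written as 1 / (1 + e^{-x}) for numerical stability
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        return np.maximum(0.0, x)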

Training
--------
Let's consider the following loss objective on a 2-layer neural net with weights W and v:

.. math::

    \min_{W, v} L(W, v) = \min_{W, v} \sum_n \frac{1}{2} (y^n - \text{score}^n)^2

To minimize it with gradient descent, we just need to find

.. math::

    \frac{\partial L}{\partial W}, \frac{\partial L}{\partial v}

We do this using *backpropagation*.

.. image:: _static/nn/ex1.png
:width: 450

.. image:: _static/nn/ex2.png
:width: 450

.. image:: _static/nn/ex3.png
:width: 450
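
A small NumPy sketch of the forward and backward pass for such a 2-layer net (squared loss, tanh hidden units; a sketch of the idea, not the course's reference derivation):

.. code-block:: python

    import numpy as np

    def forward(x, W, v):
        """x: input (d,), W: hidden weights (h, d), v: output weights (h,)."""
        a = W @ x                              # hidden pre-activations
        z = np.tanh(a)                         # hidden activations
        score = v @ z                          # scalar output
        return z, score

    def backward(x, y, W, v):
        """Gradients of L = 0.5 * (y - score)**2 with respect to W and v."""
        z, score = forward(x, W, v)
        d_score = -(y - score)                 # dL/dscore
        grad_v = d_score * z                   # dL/dv
        d_hidden = d_score * v * (1 - z**2)    # chain rule through tanh
        grad_W = np.outer(d_hidden, x)         # dL/dW
        return grad_W, grad_v

A gradient-descent step is then ``W -= lr * grad_W`` and ``v -= lr * grad_v``.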

VAE
---
Whereas a normal autoencoder (AE) encodes an image into a single latent vector, a VAE learns the parameters of a
Gaussian distribution over the latent vector and samples the code from it

.. math::

    c_i = \exp(\sigma_i)e_i + m_i

where :math:`c_i` is a component of the latent vector, :math:`e_i` is noise sampled from a standard normal, and
:math:`m_i` and :math:`\sigma_i` are the gaussian parameters (mean and log-scale) output by the encoder.
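
This is the reparameterization trick, which keeps the sampling step differentiable with respect to :math:`m_i` and :math:`\sigma_i`. A minimal sketch of that step (hypothetical NumPy helper, not tied to any particular framework):

.. code-block:: python

    import numpy as np

    def sample_latent(m, sigma, rng):
        """m, sigma: encoder outputs of shape (latent_dim,); returns one latent code c."""
        e = rng.standard_normal(m.shape)  # e_i ~ N(0, 1)
        return np.exp(sigma) * e + m      # c_i = exp(sigma_i) * e_i + m_i

    # usage: c = sample_latent(m, sigma, np.random.default_rng(0))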

KL Divergence
-------------

Roughly, a measure of how close two distributions are to each other (>= 0)
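
For reference, the standard definition (not spelled out in the notes):

.. math::

    D_{KL}(q \| p) = \int q(x) \ln \frac{q(x)}{p(x)} \, dx \geq 0

with equality iff :math:`q = p`.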

.. image:: _static/nn/ex4.png

Random
------

Random note: GAN objective can also be written :math:`\max_D V(G,D) = -2 \log 2 + 2 JSD(P_{data}(x) || P_G(x))`
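
Here :math:`JSD` is the Jensen-Shannon divergence; its standard definition (included for reference) is

.. math::

    JSD(P || Q) = \frac{1}{2} KL\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2} KL\left(Q \,\Big\|\, \frac{P+Q}{2}\right)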

f-GAN
^^^^^

Uses a generalized divergence function:

.. math::

    D_f(q||p) = \int p(x) f\left[\frac{q(x)}{p(x)}\right]dx

where :math:`f` is convex with :math:`f(1) = 0`; by making :math:`f(t) = t \log t`, this is KL divergence
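
Spelling out that substitution (a step the notes skip):

.. math::

    f(t) = t \log t \implies D_f(q||p) = \int p(x) \frac{q(x)}{p(x)} \log \frac{q(x)}{p(x)} dx
    = \int q(x) \log \frac{q(x)}{p(x)} dx = KL(q||p)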
