Ensemble/Boosting
=================

Ensemble Methods
----------------

- use an ensemble/group of hypotheses
- diversity important; an ensemble of "yes-men" is useless
- get diverse hypotheses by using
    - different data
    - different algorithms
    - different hyperparameters
Why?

- averaging reduces variance
    - the ensemble is more stable than an individual hypothesis
    - consider a biased coin with p(head) = 1/3
        - variance of one flip = 1/3 - 1/9 = 2/9
        - variance of the average of 2 flips = (2/9) / 2 = 1/9
- averaging makes fewer mistakes
    - consider 3 classifiers with accuracy 0.8, 0.7, and 0.7
    - the probability that the majority vote is correct = (f1 correct, f2 correct, f3 wrong) + (f1 correct, f2 wrong, f3 correct) + (f1 wrong, f2 correct, f3 correct) + (f1 correct, f2 correct, f3 correct)
    - = :math:`(.8*.7*.3) + (.8*.3*.7) + (.2*.7*.7) + (.8*.7*.7) \approx 0.83`
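
As a quick sanity check (not in the original notes), we can enumerate the eight outcomes directly; the accuracies 0.8/0.7/0.7 are the ones assumed above:

.. code-block:: python

    from itertools import product

    # accuracies of the three independent classifiers assumed above
    acc = [0.8, 0.7, 0.7]

    # probability that the majority vote (at least 2 of 3) is correct
    p_majority = 0.0
    for outcome in product([True, False], repeat=3):   # each classifier right/wrong
        p = 1.0
        for a, correct in zip(acc, outcome):
            p *= a if correct else (1 - a)
        if sum(outcome) >= 2:                          # majority is correct
            p_majority += p

    print(p_majority)  # 0.826 -- better than any single classifier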

Creation
^^^^^^^^

- use different training sets
    - bootstrap sampling: pick *m* examples from the labeled data with replacement
    - cross-validation sampling
    - reweight data (boosting, later)
- use different features
    - random forests "hide" some features
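
A minimal sketch of one bootstrap replicate (the toy data and names here are illustrative, not from the notes):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))        # toy labeled data (features)
    y = rng.integers(0, 2, size=100)     # toy labels

    # one bootstrap replicate: draw m examples *with replacement*
    m = len(X)
    idx = rng.integers(0, m, size=m)
    X_boot, y_boot = X[idx], y[idx]
    # train one ensemble member on (X_boot, y_boot); repeat for each member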

Prediction
^^^^^^^^^^

- unweighted vote
- weighted vote
    - pay more attention to better predictors
- cascade
    - filter examples; each level predicts or passes on
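
A rough sketch of a weighted vote over binary (:math:`\pm 1`) predictions; the weights ``alpha`` are whatever the ensemble method assigns (e.g. AdaBoost's :math:`\alpha_t` below):

.. code-block:: python

    import numpy as np

    def weighted_vote(predictions, alpha):
        """predictions: (T, n) array of +/-1 votes from T classifiers;
        alpha: (T,) array of per-classifier weights."""
        scores = alpha @ predictions      # weighted sum of votes per example
        return np.sign(scores)            # +1 / -1 by weight of the vote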

Boosting
--------

- "boosts" the performance of another learning algorithm
- creates a set of classifiers and predicts with a weighted vote
- uses different distributions on the sample to get diversity
    - up-weight hard examples, down-weight easy examples
- note: ensembles were used before to reduce variance

.. image:: _static/boosting/ex1.png
    :width: 450

If boosting is possible, then:

- can use fairly wild guesses to produce highly accurate predictions
- if you can learn "part way" you can learn "all the way"
- should be able to improve any learning algorithm
- for any learning problem:
    - either can always learn with nearly perfect accuracy
    - or there exist cases where one cannot learn even slightly better than random guessing

Ada-Boost
^^^^^^^^^

Given a training sample :math:`S` with labels +/- and a learning algorithm :math:`L`:

1. for t from 1 to T do
    1. create distribution :math:`D_t` on :math:`S`
    2. call :math:`L` with :math:`D_t` on :math:`S` to get hypothesis :math:`h_t`
        - i.e. :math:`\min \sum_n D_t(n) l(f(x_n), y_n)` where :math:`D_t(n)` is the weight of sample :math:`n`
    3. calculate weight :math:`\alpha_t` for :math:`h_t`
2. the final hypothesis is :math:`F(x) = \sum_t \alpha_t h_t(x)`, or :math:`H(x)` = the value with the most weight

formally:

.. image:: _static/boosting/ex2.png
    :width: 450

So how do we pick :math:`D_t` and :math:`\alpha_t`?

- :math:`D_1(i) = 1/m` - the weight assigned to :math:`(x_i, y_i)` at :math:`t=1`
- given :math:`D_t` and :math:`h_t`:
    - :math:`D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i))`
        - if misclassified, :math:`y_i h_t(x_i) < 0`, so the exponent is positive and the weight increases by a factor > 1
        - if correct, :math:`y_i h_t(x_i) > 0`, so the exponent is negative and the weight decreases by a factor < 1
    - where :math:`Z_t` is a normalization factor
    - where :math:`\alpha_t = \frac{1}{2} \ln (\frac{1-\epsilon_t}{\epsilon_t}) > 0` (since :math:`\epsilon_t < 1/2`)
- :math:`H_{final}(x) = sign(\sum_t \alpha_t h_t(x))`
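
A minimal sketch of these updates (not from the notes), using depth-1 decision stumps as the weak learner :math:`L`; labels are assumed to be in :math:`\{-1, +1\}` and the stump learner is an illustrative choice:

.. code-block:: python

    import numpy as np

    def train_stump(X, y, D):
        """Weak learner: single-feature threshold with lowest weighted error."""
        n, d = X.shape
        best = None
        for j in range(d):
            for thresh in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thresh, 1, -1)
                    err = np.sum(D[pred != y])          # weighted error under D
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign)
        return best  # (weighted error, feature, threshold, sign)

    def stump_predict(stump, X):
        _, j, thresh, sign = stump
        return sign * np.where(X[:, j] > thresh, 1, -1)

    def adaboost(X, y, T=10):
        n = len(y)
        D = np.full(n, 1.0 / n)                  # D_1(i) = 1/m
        stumps, alphas = [], []
        for t in range(T):
            stump = train_stump(X, y, D)         # h_t minimizes weighted loss
            eps = max(stump[0], 1e-10)           # epsilon_t
            alpha = 0.5 * np.log((1 - eps) / eps)
            pred = stump_predict(stump, X)
            D = D * np.exp(-alpha * y * pred)    # up-weight mistakes, down-weight correct
            D /= D.sum()                         # Z_t normalization
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, np.array(alphas)

    def predict(stumps, alphas, X):
        F = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
        return np.sign(F)                        # H_final(x) = sign(sum_t alpha_t h_t(x))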

**Example**:

.. image:: _static/boosting/ex3.png
    :width: 300

now the weights become :math:`\frac{1}{10} e^{0.42}` for the misclassified and :math:`\frac{1}{10} e^{-0.42}` for the correct
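
The factor 0.42 is the first round's :math:`\alpha`; assuming the first stump misclassifies 3 of the 10 points in the figure (so :math:`\epsilon_1 = 0.3`, consistent with the :math:`e^{0.42}` above):

.. math::

    \alpha_1 = \frac{1}{2} \ln \frac{1 - \epsilon_1}{\epsilon_1} = \frac{1}{2} \ln \frac{0.7}{0.3} \approx 0.42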

.. image:: _static/boosting/ex4.png
    :width: 400

.. image:: _static/boosting/ex5.png
    :width: 400

.. image:: _static/boosting/ex6.png
    :width: 400

Analyzing Error
^^^^^^^^^^^^^^^

Thm: Write :math:`\epsilon_t` as :math:`1/2 - \gamma_t`, where :math:`\gamma_t` = "edge" = how much better than random guessing.
Then:

.. math::

    \text{training error}(H_{final}) & \leq \prod_t [2 \sqrt{\epsilon_t (1-\epsilon_t)}] \\
    & = \prod_t \sqrt{1-4\gamma_t^2} \\
    & \leq \exp(-2\sum_t \gamma_t^2)

So if :math:`\forall t: \gamma_t \geq \gamma > 0`, then :math:`\text{training error}(H_{final}) \leq e^{-2\gamma^2 T}`.

Therefore, as :math:`T \to \infty`, the training error :math:`\to 0`.
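
To get a feel for the rate, a quick numeric check with an assumed edge of :math:`\gamma = 0.1` over :math:`T = 100` rounds:

.. code-block:: python

    import math

    gamma, T = 0.1, 100                      # assumed edge and number of rounds
    bound = math.exp(-2 * gamma**2 * T)
    print(f"training error <= {bound:.3f}")  # ~0.135 after 100 rounds; -> 0 as T grows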

Proof
^^^^^

Let :math:`F(x) = \sum_t \alpha_t h_t(x) \to H_{final}(x) = sign(F(x))`

Step 1: unwrapping the recurrence

.. math::

    D_{final}(i) & = \frac{1}{m} \frac{\exp(-y_i \sum_t \alpha_t h_t(x_i))}{\prod_t Z_t} \\
    & = \frac{1}{m} \frac{\exp(-y_i F(x_i))}{\prod_t Z_t}

Step 2: :math:`\text{training error}(H_{final}) \leq \prod_t Z_t`

.. math::

    \text{training error}(H_{final}) & = \frac{1}{m} \sum_i \begin{cases} 1 & \text{ if } y_i \neq H_{final}(x_i) \\ 0 & \text{ otherwise} \end{cases} \\
    & = \frac{1}{m} \sum_i \begin{cases} 1 & \text{ if } y_i F(x_i) \leq 0 \\ 0 & \text{ otherwise} \end{cases} \\
    & \leq \frac{1}{m} \sum_i \exp(-y_i F(x_i)) \\
    & = \sum_i D_{final}(i) \prod_t Z_t \\
    & = \prod_t Z_t

Step 3: :math:`Z_t = 2 \sqrt{\epsilon_t (1-\epsilon_t)}`

.. math::

    Z_t & = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) \\
    & = \sum_{i:y_i \neq h_t(x_i)} D_t(i)e^{\alpha_t} + \sum_{i:y_i = h_t(x_i)} D_t(i) e^{-\alpha_t} \\
    & = \epsilon_t e^{\alpha_t} + (1-\epsilon_t) e^{-\alpha_t} \\
    & = 2 \sqrt{\epsilon_t (1-\epsilon_t)}

Discussion
^^^^^^^^^^

We might expect that, even as the training error approaches 0 as :math:`T` increases, the test error won't - overfitting!

We can actually bound the "generalization error" (basically the test error):

.. math::

    \text{generalization error} \leq \text{training error} + \tilde{O}(\sqrt{\frac{dT}{m}})

where :math:`m` = # of training samples, :math:`d` = "complexity" of the weak classifiers, :math:`T` = # of rounds

But in reality, it's not always a tradeoff between training error and test error.

Margin Approach
"""""""""""""""

- training error only measures whether classifications are right or wrong
    - should also consider the confidence of classifications
- :math:`H_{final}` is a weighted majority vote of weak classifiers
    - measure confidence by the *margin* = strength of the vote
    - = (weighted fraction voting correctly) - (weighted fraction voting incorrectly)
- so as we train more, we increase the margin, which leads to a decrease in test loss

- both AdaBoost and SVMs
    - work by maximizing margins
    - find a linear threshold function in a high-dimensional space
    - but they use different norms

AdaBoost is:

- fast
- simple, easy to program
- no hyperparameters (except T)
- flexible, can combine with any learning algorithm
- no prior knowledge needed about the weak learner
- provably effective (provided a rough rule of thumb)
- versatile

But:

- performance depends on the data and the weak learner
- consistent with theory, AdaBoost can fail if:
    - weak classifiers are too complex (overfitting)
    - weak classifiers are too weak (basically random guessing) - underfitting, or low margins -> overfitting
- susceptible to uniform noise

EM/GMM
======

Gaussian Mixture Models: estimate a mixture of :math:`K` Gaussians

- pick a sample from Gaussian :math:`k` with probability :math:`\pi_k`
- generative distribution :math:`p(\mathbf{x}) = \sum_{k=1}^K \pi_k N(\mathbf{x} | \mathbf{\mu}_k, \mathbf{\Sigma}_k)`
    - where :math:`N` is the Gaussian distribution
    - a *mixture distribution* with mixture coefficients :math:`\mathbf{\pi}`
- for an iid sample :math:`\mathbf{X}` and parameters :math:`\theta = \{\mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}\}`, we have:

.. math::

    p(\mathbf{X} | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) & = \prod_{n=1}^N p(\mathbf{x}_n | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) \\
    & = \prod_{n=1}^N \sum_{k=1}^K \pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)

What is :math:`\theta`?

Log-Likelihood
--------------

.. math::

    L(\pi, \mu, \Sigma) & = \ln p(\mathbf{X} | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) \\
    & = \ln (\prod_{n=1}^N \sum_{k=1}^K \pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k)) \\
    & = \sum_{n=1}^N \ln (\sum_{k=1}^K \pi_k N(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k))

This is very hard to solve directly! (the log of a sum does not separate, so there is no closed-form maximizer)

Iteratively
-----------

- which Gaussian is picked for :math:`x_i \in X` is a latent variable :math:`z_i \in \{0, 1\}^K` (one-of-K encoding)
    - :math:`Z` is the vector of :math:`z_i`'s
- note that :math:`p(\mathbf{X} | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma}) = \sum_Z p(\mathbf{X}, Z | \mathbf{\pi}, \mathbf{\mu}, \mathbf{\Sigma})`
- the complete data is :math:`\{X, Z\}`
    - the incomplete data is just :math:`X`
- we don't know :math:`Z`, but from :math:`\theta^{old}` we can infer its (posterior) distribution
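
The exact E/M update equations are given in the images below; as a rough sketch of the standard EM loop for a GMM (the updates and variable names here are the usual textbook ones, assumed rather than transcribed from the slides):

.. code-block:: python

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = np.full(K, 1.0 / K)                     # mixture coefficients
        mu = X[rng.choice(n, K, replace=False)]      # initialize means from data
        sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

        for _ in range(n_iter):
            # E step: responsibilities gamma(z_nk) = p(z_k | x_n, theta_old)
            dens = np.column_stack([
                pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k]) for k in range(K)
            ])
            gamma = dens / dens.sum(axis=1, keepdims=True)

            # M step: re-estimate pi, mu, sigma from the responsibilities
            Nk = gamma.sum(axis=0)
            pi = Nk / n
            mu = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        return pi, mu, sigma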

.. image:: _static/emgmm/ex1.png

.. image:: _static/emgmm/ex2.png

.. image:: _static/emgmm/ex3.png

Maximize this to get the new parameters.

.. image:: _static/emgmm/ex4.png

Neural Nets
===========

Neural nets can be used to approximate nonlinear functions with a hypothesis!

Neural nets are an intuitive extension of the perceptron - think of the perceptron as a 1-layer neural net.

But in the perceptron, the output is just the sign of the linear combination of the inputs - in a NN, we use an
*activation function*:

.. math::

    sign(W_i \cdot x) \to f(W_i \cdot x)

Activations should be nonlinear - if they're linear, the extra layers are redundant (a composition of linear maps is still linear)

Activations
-----------

- sign: -1, 0, or 1
    - not differentiable
    - often used for the output layer
- tanh: :math:`\frac{e^{2x}-1}{e^{2x}+1}`
- sigmoid: :math:`\frac{e^x}{1+e^x}`
- ReLU: :math:`\max(0, x)`
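
These are straightforward to write down; a quick sketch:

.. code-block:: python

    import numpy as np

    def tanh(x):     # (e^{2x} - 1) / (e^{2x} + 1)
        return np.tanh(x)

    def sigmoid(x):  # e^x / (1 + e^x), written in the numerically stabler form
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):     # max(0, x), element-wise
        return np.maximum(0.0, x)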

Training
--------

Let's consider the following loss objective on a 2-layer neural net with weights :math:`W` and :math:`v`, to be minimized over :math:`W` and :math:`v`:

.. math::

    L(W, v) = \sum_n \frac{1}{2} (y^n - score^n)^2

To minimize it, we just need to find

.. math::

    \frac{\partial L}{\partial W}, \frac{\partial L}{\partial v}

We do this using *backpropagation*.
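
A minimal sketch of that 2-layer net with manual backpropagation, assuming a tanh hidden layer and the squared loss above (the data, hidden width, and learning rate are illustrative assumptions):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                 # toy inputs
    y = np.sin(X @ np.array([1.0, -2.0, 0.5]))    # toy targets

    H = 16                                        # hidden width (assumed)
    W = rng.normal(scale=0.1, size=(3, H))        # first-layer weights
    v = rng.normal(scale=0.1, size=H)             # second-layer weights
    lr = 0.01

    for _ in range(500):
        # forward pass: score = v . tanh(W^T x)
        a = X @ W                  # pre-activations
        h = np.tanh(a)             # hidden activations
        score = h @ v
        err = score - y            # dL/dscore for L = 1/2 (y - score)^2

        # backward pass (chain rule)
        dv = h.T @ err                   # dL/dv
        dh = np.outer(err, v)            # dL/dh
        da = dh * (1 - h**2)             # tanh'(a) = 1 - tanh(a)^2
        dW = X.T @ da                    # dL/dW

        v -= lr * dv / len(X)
        W -= lr * dW / len(X)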

.. image:: _static/nn/ex1.png
    :width: 450

.. image:: _static/nn/ex2.png
    :width: 450

.. image:: _static/nn/ex3.png
    :width: 450

VAE
---

Whereas a normal AE turns an image into a latent vector, a VAE learns the parameters of a Gaussian distribution
from which the latent vector is sampled:

.. math::

    c_i = \exp(\sigma_i)e_i + m_i

where :math:`c_i` is a component of the latent vector, :math:`e_i` is a noise term sampled from a standard Gaussian, and
:math:`m_i` and :math:`\sigma_i` are the learned mean and log-standard-deviation.
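
A tiny sketch of that reparameterization (the numeric values of ``m`` and ``sigma`` are illustrative; in a real VAE they come from the encoder):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    m = np.array([0.3, -1.2])         # encoder's mean output (illustrative values)
    sigma = np.array([-0.5, 0.1])     # encoder's log-std output (illustrative values)

    e = rng.standard_normal(m.shape)  # noise sampled from N(0, I)
    c = np.exp(sigma) * e + m         # reparameterized latent: c_i = exp(sigma_i) e_i + m_i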

KL Divergence
-------------

Roughly, a measure of how close two distributions are to each other (always >= 0)

.. image:: _static/nn/ex4.png
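
For concreteness, a discrete-case sketch (the definition used here is the standard one, not taken from the slide above):

.. code-block:: python

    import numpy as np

    def kl_divergence(q, p):
        """KL(q || p) = sum_i q_i * log(q_i / p_i), for discrete distributions."""
        q, p = np.asarray(q, float), np.asarray(p, float)
        mask = q > 0                    # terms with q_i = 0 contribute 0
        return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.51 > 0: distributions differ
    print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions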

Random
------

Random note: the GAN objective can also be written :math:`\max_D V(G,D) = -2 \log 2 + 2 JSD(P_{data}(x) || P_G(x))`

f-GAN
^^^^^

Uses a generalized divergence function:

.. math::

    D_f(q\|p) = \int p(x) f\left[\frac{q(x)}{p(x)}\right]dx

by making :math:`f(t) = t \log t`, this is the KL divergence :math:`KL(q\|p)`
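
A one-line check of that special case (standard algebra, not from the notes):

.. math::

    D_f(q\|p) = \int p(x) \, \frac{q(x)}{p(x)} \ln \frac{q(x)}{p(x)} \, dx = \int q(x) \ln \frac{q(x)}{p(x)} \, dx = KL(q\|p)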