using some log properties, we can...

.. math::
    \ln L(f) & = \sum_{i=1}^m \ln p(y^{(i)} | \mathbf{x}^{(i)}, f) \\
    \ln L(f) & = m \ln (\frac{1}{\sqrt{2 \pi} \sigma}) - \frac{1}{2 \sigma^2} \sum_{i=1}^m (y^{(i)} - f(\mathbf{x}^{(i)}))^2

assuming a Gaussian distribution for :math:`p(y^{(i)} | \mathbf{x}^{(i)}, f)`.

So to maximize the likelihood of *f*, we're just minimizing the squared error!
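
A quick numerical sanity check of that equivalence (a sketch, not from the lecture; the toy data, the noise level
:math:`\sigma = 1`, and the candidate slopes are made up): under a fixed Gaussian noise model, whichever candidate
:math:`f` has the smaller squared error also has the larger log-likelihood.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.0

    # toy 1-D data generated from y = 2x + Gaussian noise
    x = rng.uniform(-3, 3, size=50)
    y = 2.0 * x + rng.normal(0.0, sigma, size=50)

    def sse(w):
        """Squared error of the candidate f(x) = w * x."""
        return np.sum((y - w * x) ** 2)

    def log_likelihood(w):
        """Gaussian log-likelihood ln L(f) for f(x) = w * x."""
        m = len(x)
        return m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - sse(w) / (2 * sigma ** 2)

    for w in [0.5, 1.5, 2.0, 2.5]:
        print(f"w={w}: SSE={sse(w):9.2f}  ln L={log_likelihood(w):9.2f}")
    # the candidate with the smallest SSE always has the largest ln L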

Convexity and Gradients
-----------------------
We know that the squared error is convex, so we want to find the bottom of the bowl, where the gradient is 0.

.. note::
    For review: minimizing :math:`f(x) = (x - 5)^2`

    - :math:`f'(x) = 2(x - 5)`
    - :math:`f'(x) = 0`
    - :math:`x = 5` is the minimizer
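
As an illustration (a sketch, not part of the notes; the starting point and step size are arbitrary), plain gradient
descent on the review example walks to the point where the gradient is 0:

.. code-block:: python

    # gradient descent on f(x) = (x - 5)^2, whose gradient is f'(x) = 2(x - 5)
    x = 0.0      # arbitrary starting point
    eta = 0.1    # arbitrary step size
    for _ in range(100):
        x = x - eta * 2 * (x - 5)
    print(x)     # ~5.0, the minimizer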

So, we can derive the closed form from the ML formula:

.. math::
    \nabla_w \ln L(f) & = \nabla_w (m \ln (\frac{1}{\sqrt{2 \pi} \sigma}) - \frac{1}{2 \sigma^2} \sum_{i=1}^m (y^{(i)} - \mathbf{w} \cdot \mathbf{x}^{(i)})^2) \\
    & = \frac{1}{\sigma^2} \sum_{i=1}^m (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}) (\mathbf{x}^{(i)})^T \\
    \mathbf{0}^T & = \sum_{i=1}^m y^{(i)} (\mathbf{x}^{(i)})^T - \mathbf{w}^T \sum_{i=1}^m \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^T \\
    ... & \text{ some matrix magic...} \\
    \mathbf{w}_{ML} & = (X^T X)^{-1} X^T \vec{y}

.. note::
    Some of the matrix magic:

    .. math::
        \sum_{i=1}^m y^{(i)} \mathbf{x}^{(i)} = X^T \vec{y}
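
A sketch of the closed form in numpy (the toy data and dimensions are made up, not from the lecture); solving the
normal equations is preferred over forming the inverse explicitly:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    m, d = 100, 3                      # m samples, d features
    X = rng.normal(size=(m, d))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + rng.normal(scale=0.1, size=m)

    # w_ML = (X^T X)^{-1} X^T y, computed by solving the normal equations
    w_ml = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_ml)                        # close to w_true
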
We can also use **SGD** to learn :math:`\mathbf{w}` instead of finding the minimum directly:

Cycle through samples, taking a step in the negative gradient direction for each example.

.. math::
    \omega_{new} & = \omega_{old} - \eta \nabla \mathrm{Error}(\mathbf{x}^{(i)}, y^{(i)}) \\
    & = \omega_{old} - \eta \nabla \frac{1}{2} (y^{(i)} - \omega_{old} \cdot \mathbf{x}^{(i)})^2 \\
    & = \omega_{old} + \eta (y^{(i)} - \omega_{old} \cdot \mathbf{x}^{(i)}) \mathbf{x}^{(i)}

In this formula, :math:`\eta` is the learning rate. This is the LMS algorithm.
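
A minimal LMS sketch (the toy data, learning rate, and number of passes are assumptions, not from the notes):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(1)

    m, d = 200, 3
    X = rng.normal(size=(m, d))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + rng.normal(scale=0.1, size=m)

    w = np.zeros(d)
    eta = 0.01                          # learning rate
    for epoch in range(20):             # cycle through the samples repeatedly
        for i in range(m):
            error = y[i] - w @ X[i]
            w = w + eta * error * X[i]  # LMS update
    print(w)                            # close to w_true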

Bias-Variance Decomposition
---------------------------
Suppose we have training instances :math:`\mathbf{x}^{(1)}..\mathbf{x}^{(m)}` and a true target function :math:`f`.

Let the labels :math:`y^{(i)}` in the sample be :math:`f(\mathbf{x}^{(i)}) + \epsilon^{(i)}`, where the
:math:`\epsilon^{(i)}` are i.i.d. noise.

Assume :math:`E[\epsilon ^{(i)}] = 0`, so :math:`E[y ^{(i)}] = f(\mathbf{x} ^{(i)})`

Let's examine the expected squared error between the regression function :math:`g` learned from the sample and
the "true" function :math:`f` at a particular test point :math:`\mathbf{x}`:

.. math::
    E_{noise} [(g(\mathbf{x}) - f(\mathbf{x}))^2]

where :math:`E_{noise}` is the expectation over training label noise.

Let :math:`\bar{g}(\mathbf{x})` be :math:`E_{noise} [g(\mathbf{x})]`. We can rewrite
:math:`E_{noise} [(g(\mathbf{x}) - f(\mathbf{x}))^2]` as

.. math::
    & = E_{noise} [(g(\mathbf{x}) - \bar{g} + \bar{g} - f(\mathbf{x}))^2] \\
    & = E_{noise} [(g(\mathbf{x}) - \bar{g})^2 + (\bar{g} - f(\mathbf{x}))^2 + 2(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))] \\
    & = E_{noise} [(g(\mathbf{x}) - \bar{g})^2] + E_{noise} [(\bar{g} - f(\mathbf{x}))^2] + 2E_{noise} [(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))] \\
    & = E_{noise} [(g(\mathbf{x}) - \bar{g})^2] + (\bar{g} - f(\mathbf{x}))^2 + 2E_{noise} [(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))] \\
    & = \text{variance of } g(\mathbf{x}) + (\text{bias of } g(\mathbf{x}))^2 + 2E_{noise} [(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))]

Since :math:`f(\mathbf{x})` and :math:`\bar{g}` are constant with respect to noise,

.. math::
    E_{noise} [(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))] & = (\bar{g} - f(\mathbf{x})) E_{noise}[(g(\mathbf{x}) - \bar{g})] \\
    & = (\bar{g} - f(\mathbf{x})) (E_{noise}[g(\mathbf{x})] - E_{noise}[\bar{g}]) \\
    & = (\bar{g} - f(\mathbf{x})) (\bar{g} - \bar{g}) \\
    & = 0.

Ergo,

.. math::
    E_{noise} [(g(\mathbf{x}) - f(\mathbf{x}))^2] = \text{variance of } g(\mathbf{x}) + (\text{bias of } g(\mathbf{x}))^2
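
A quick Monte Carlo check of this decomposition (a sketch with made-up choices: :math:`f(x) = \sin x`, a degree-1
polynomial learner, 2000 redraws of the label noise). The left-hand side matches variance plus squared bias up to
sampling error:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    f = np.sin                                   # "true" target function
    x_train = np.linspace(0, 3, 10)              # fixed training inputs
    x_test = 1.7                                 # the test point x
    sigma = 0.3                                  # label noise std dev

    preds = []
    for _ in range(2000):                        # redraw the label noise many times
        y_train = f(x_train) + rng.normal(0, sigma, size=x_train.shape)
        coeffs = np.polyfit(x_train, y_train, deg=1)    # learn g from this sample
        preds.append(np.polyval(coeffs, x_test))        # g(x) at the test point
    preds = np.array(preds)

    g_bar = preds.mean()                         # E_noise[g(x)]
    lhs = np.mean((preds - f(x_test)) ** 2)      # E_noise[(g(x) - f(x))^2]
    variance = np.mean((preds - g_bar) ** 2)
    bias_sq = (g_bar - f(x_test)) ** 2
    print(lhs, variance + bias_sq)               # approximately equal
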
Regularized Least Squares
-------------------------

- Regularization penalizes complexity and reduces variance (but increases bias)
- adds a penalty term, weighted by the regularization coefficient :math:`\lambda`, to the squared error

.. math::
    \frac{1}{2} \sum_{i=1}^m (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)})^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}

which is minimized by

.. math::
    \mathbf{w} = (\lambda \mathbf{I} + X^T X)^{-1} X^T \vec{y}
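
A ridge regression sketch under the same kind of assumed toy setup as before (:math:`\lambda = 0.1` is an arbitrary
choice):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(2)

    m, d = 100, 3
    X = rng.normal(size=(m, d))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + rng.normal(scale=0.1, size=m)

    lam = 0.1
    # w = (lambda * I + X^T X)^{-1} X^T y
    w_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
    print(w_ridge)   # shrunk slightly toward 0 relative to the unregularized solution
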
More generally, we can use a different regularizer:

.. math::
    \frac{1}{2} \sum_{i=1}^m (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)})^2 + \frac{\lambda}{2} \sum_{j} |w_j|^q
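
For intuition (the weight vector and :math:`\lambda` here are made up), the penalty term for a few choices of
:math:`q`; :math:`q=1` gives the lasso penalty and :math:`q=2` gives ridge:

.. code-block:: python

    import numpy as np

    w = np.array([0.5, -2.0, 0.0, 1.5])   # an arbitrary example weight vector
    lam = 0.1

    for q in (1, 2, 4):
        penalty = (lam / 2) * np.sum(np.abs(w) ** q)
        print(f"q={q}: penalty = {penalty:.3f}")
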
Logistic Regression
-------------------

Logistic Regression learns a parameter vector :math:`\mathbf{\theta}` (or weights :math:`\mathbf{w}`)

Idea behind 2-class logistic regression:

- two labels, so :math:`y \in \{0, 1\}` (binary classification)
- discriminative model gives :math:`p(y = 1 | \mathbf{x}; \mathbf{\theta})`
- labels softly separated by hyperplane
- maximum confusion at hyperplane: when :math:`\theta^T \mathbf{x} = 0` then :math:`p(y=1| \mathbf{x}; \theta) = 1/2`
- use the add-a-dimension trick (append a constant 1 feature) to shift the hyperplane off :math:`\mathbf{0}`
- assume :math:`p(y = 1| \mathbf{x}; \theta)` is some :math:`g(\theta \cdot \mathbf{x})`

Properties of :math:`g(\theta \cdot \mathbf{x}) = g(\theta^T \mathbf{x}) = p(y = 1| \mathbf{x}; \theta)`:

- :math:`g(-\infty) = 0`
- :math:`g(\infty) = 1`
- :math:`g(0) = 1/2`
- confidence of label increases as you move away from the boundary - :math:`g(\theta^T \mathbf{x})` is monotonically increasing
- :math:`g(-a) = 1 - g(a)` (symmetry)

For example, one function that satisfies these properties is the logistic sigmoid:

.. math::
    h_\theta(\mathbf{x}) = g(\theta \cdot \mathbf{x}) = \frac{1}{1 + \exp(-\theta \cdot \mathbf{x})} = \frac{\exp(\theta \cdot \mathbf{x})}{1 + \exp(\theta \cdot \mathbf{x})}

And :math:`g()` has a simple derivative: :math:`g'(a) = g(a)(1-g(a))`.
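
A small numeric check of those properties and the derivative identity (the test points are arbitrary):

.. code-block:: python

    import numpy as np

    def g(a):
        """Logistic sigmoid."""
        return 1.0 / (1.0 + np.exp(-a))

    print(g(-50), g(0), g(50))          # ~0, 0.5, ~1
    print(g(-2), 1 - g(2))              # symmetry: g(-a) = 1 - g(a)

    # derivative check: g'(a) = g(a)(1 - g(a)), compared to a finite difference
    a, eps = 0.7, 1e-6
    numeric = (g(a + eps) - g(a - eps)) / (2 * eps)
    print(numeric, g(a) * (1 - g(a)))   # approximately equal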

Likelihood
^^^^^^^^^^

.. math::
    L(\theta) & = p(\vec{y} | X; \theta) \\
    & = \prod_{i = 1}^m p(y ^{(i)} | \mathbf{x} ^{(i)}; \theta) \\
    & = \prod_{i = 1}^m p(y = 1 | \mathbf{x} ^{(i)}; \theta) ^{y ^{(i)}} p(y = 0 | \mathbf{x} ^{(i)}; \theta) ^{1 - y ^{(i)}} \\
    & = \prod_{i=1}^m h_\theta (\mathbf{x} ^{(i)}) ^{y ^{(i)}} (1 - h_\theta (\mathbf{x} ^{(i)})) ^{1-y ^{(i)}}

The exponents encode the if-test on :math:`y^{(i)}`: exactly one of the two factors is active for each example.
This is a convex problem: the negative log-likelihood is convex in :math:`\theta`!

As before, log-likelihood is easier:

.. math::
    l(\theta) & = \log(L(\theta)) \\
    & = \sum_{i=1}^m y ^{(i)} \log (h_\theta (\mathbf{x} ^{(i)})) + (1-y ^{(i)}) \log (1 - h_\theta (\mathbf{x} ^{(i)}))

Taking the derivative for one sample with respect to one dimension of :math:`\theta`,

.. math::
    \frac{\partial}{\partial \theta_j} l(\theta) = (y ^{(i)} - h_\theta (\mathbf{x} ^{(i)})) x_j ^{(i)}

We can then plug that, generalized to all dimensions, into SGD (ascending the gradient, since we are maximizing):

.. math::
    \theta := \theta + \alpha (y ^{(i)} - h_\theta (\mathbf{x} ^{(i)})) \mathbf{x} ^{(i)}

Looks pretty similar to LMS, but :math:`h_\theta()` replaces :math:`\theta^T \mathbf{x}`.
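
A minimal logistic-regression SGD sketch (the toy data, learning rate, and epoch count are assumptions, not from the
lecture):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(3)

    def g(a):
        return 1.0 / (1.0 + np.exp(-a))

    # toy data with labels in {0, 1}
    m, d = 200, 2
    X = np.hstack([rng.normal(size=(m, d)), np.ones((m, 1))])   # add-a-dimension trick
    theta_true = np.array([2.0, -1.0, 0.5])
    y = (g(X @ theta_true) > rng.uniform(size=m)).astype(float)

    theta = np.zeros(d + 1)
    alpha = 0.1                                  # learning rate
    for epoch in range(50):
        for i in range(m):
            h = g(theta @ X[i])
            theta = theta + alpha * (y[i] - h) * X[i]   # gradient-ascent update
    print(theta)   # a rough estimate of theta_true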

Multiclass
^^^^^^^^^^
We can extend logistic regression to multiple classes:

- learn weights :math:`\theta_k` for each class :math:`k \in \{1..K\}`
- class-k-ness of instance :math:`\mathbf{x}` is estimated by :math:`\theta_k \cdot \mathbf{x}`
- estimate :math:`p(Class = k | \mathbf{x}; \theta_1 .. \theta_K)` for instance :math:`\mathbf{x}` using the softmax function:

.. math::
    h_k(\mathbf{x}; \theta_1 .. \theta_K) = \frac{\exp(\theta_k \cdot \mathbf{x})}{\sum_{r=1}^K \exp(\theta_r \cdot \mathbf{x})}

- want weights that maximize likelihood of the sample
- use one-of-K encoding for labels
- make each label :math:`\mathbf{y} ^{(i)}` a K-vector with :math:`y ^{(i)}_k = 1` iff class = k
- likelihood of :math:`m` labels in sample is:

.. math::
    L(\theta_1 .. \theta_K) & = p(\mathbf{y} ^{(1)} .. \mathbf{y} ^{(m)} | X; \theta_1 .. \theta_K) \\
    & = \prod_{i=1}^m p(\mathbf{y} ^{(i)} | \mathbf{x} ^{(i)}; \theta_1 .. \theta_K) \\
    & = \prod_{i=1}^m \prod_{k=1}^K h_k(\mathbf{x} ^{(i)}; \theta_1 .. \theta_K) ^{y_k ^{(i)}}

- iterative methods maximize the log likelihood (there is no closed form); the softmax estimate and the one-of-K likelihood are sketched in code after the note below

.. note::
    The weight vector :math:`\theta_K` is actually redundant, since
    :math:`p(class = K | \mathbf{x}) = 1 - \sum_{k=1}^{K-1} p(class = k | \mathbf{x})`.
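
A sketch of the softmax estimate and the one-of-K likelihood term for a single instance (the weights, instance, and
label below are made up for illustration):

.. code-block:: python

    import numpy as np

    def softmax_probs(x, thetas):
        """h_k(x; theta_1..theta_K) for every class k."""
        scores = np.array([theta @ x for theta in thetas])
        exp_scores = np.exp(scores - scores.max())       # subtract max for numerical stability
        return exp_scores / exp_scores.sum()

    # K = 3 classes, 2 features plus the add-a-dimension constant
    thetas = [np.array([1.0, 0.0, 0.2]),
              np.array([-0.5, 1.0, 0.0]),
              np.array([0.0, -1.0, -0.2])]
    x = np.array([0.8, -0.3, 1.0])

    probs = softmax_probs(x, thetas)
    print(probs, probs.sum())                             # sums to 1

    # one-of-K encoded label, e.g. class 2 out of {1, 2, 3}
    y = np.array([0.0, 1.0, 0.0])
    likelihood_term = np.prod(probs ** y)                 # picks out probs[1]
    print(likelihood_term)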


