Unsupervised Learning
=====================

Getting supervised (labeled) data is expensive and time-consuming; let's make sense of the data without labels.

Today: K-means clustering

- Algorithm
- Convergence
- Initialization
- Hyperparameter selection

  - how many clusters?

In some cases, different clustering algorithms can output different clusterings of the same data (e.g. for
multicolored squares and circles - do we split on shape or on color?).

Algorithm
---------
In a clustering problem, each instance is just a data vector :math:`\mathbf{x}` without a label.

Let's use k-means clustering.

Pseudocode:

.. code-block:: text

    assign cluster centers randomly
    while centers are changing:
        assign each example to the closest center (smallest squared distance)
        compute the new centers according to the assignments

But note that there can be many stop conditions - a set number of iterations, the centers no longer changing,
or some other rule.

In Python (a concrete NumPy version of the pseudocode; the line numbers referenced in the convergence proof
below are line 7, the assignment step, and line 11, the mean update):

.. code-block:: python
    :linenos:

    import numpy as np

    def k_means(D, K, n_iters=100):
        N = len(D)  # D: (N, d) array of examples
        mu = D[np.random.choice(N, K, replace=False)].astype(float)  # random init: K examples as centers
        for _ in range(n_iters):  # loop while the mus are changing
            z = np.argmin(((D[:, None] - mu) ** 2).sum(-1), axis=1)  # assign each example to the closest center
            new_mu = mu.copy()
            for k in range(K):  # compute new centers from the assignments
                X_k = D[z == k]
                new_mu[k] = X_k.mean(axis=0) if len(X_k) else mu[k]  # mean of the points assigned to cluster k
            if np.allclose(new_mu, mu):  # centers stopped changing
                break
            mu = new_mu
        return z
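
A quick sanity check of this sketch (a hypothetical toy run, not from the lecture):

.. code-block:: python

    import numpy as np

    # two well-separated 2-D blobs; k_means should recover them (up to cluster relabeling)
    rng = np.random.default_rng(0)
    D = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
    z = k_means(D, K=2)
    print(np.bincount(z))  # expect roughly [50 50]
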
Convergence
-----------
K-means is guaranteed to converge - i.e. there is some loss function that it never increases, and that stops
changing after a finite number of iterations. So what does it optimize? The sum of the squared distances of
points from their cluster means:

.. math::
    L(z, \mu; D) = \sum_n ||x_n - \mu_{z_n}||^2 = \sum_k \sum_{n:z_n=k} ||x_n - \mu_k||^2

where :math:`z` is the assignment of each point and :math:`\mu` is the location of each cluster center.
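
As a concrete reading of this loss, a tiny (hypothetical) helper in the style of the ``k_means`` sketch above:

.. code-block:: python

    import numpy as np

    def kmeans_loss(D, z, mu):
        """Sum of squared distances from each point to its assigned cluster center."""
        return float(((D - mu[z]) ** 2).sum())  # mu[z] selects each point's center
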
**Lemma 1**: Given a set of points, the single point :math:`m` that minimizes the sum of squared distances from
all of the points to :math:`m` is the mean of those points.

**Theorem**: For any dataset and any number of clusters, the K-means algorithm converges in a finite number of
iterations.
How to prove:

- focus on where in the algorithm the loss can change: lines 7 and 11 of the code above
- prove that the loss can only decrease at those points

  - at line 7 (reassignment), :math:`||x_n - \mu_b|| \leq ||x_n - \mu_a||`, where :math:`a` is the old center and
    :math:`b` is the new one - a point is only reassigned to a closer center
  - at line 11 (recomputing centers), by Lemma 1, the total sum of squared distances decreases (or stays the
    same), since the mean is the point that minimizes the squared distances to all the points in each cluster

- prove that the loss cannot decrease forever - there exists a lower bound, reachable in a finite number of steps

  - the lower bound is 0 (reached, for example, when each training point is the center of its own cluster)
  - each :math:`\mu_k` is always the mean of some subset of the data, and there are only finitely many subsets,
    so :math:`\mu` can only take finitely many values

This proves convergence.

But... what does it converge to?

- it's not guaranteed to converge to the "right answer"
- it's not even guaranteed to converge to the same answer each time (it depends on the initialization)

Initialization
--------------
How do we find the best answer, given that the result depends on the initialization?

1. run the algorithm many times and pick the solution with the minimum loss
2. pick the initialization points more carefully
Furthest-First
^^^^^^^^^^^^^^
- pick the initial means as far from each other as possible:

  - pick a random example and set it as the mean of the first cluster
  - for ``2..K``, pick the example furthest away from the means already picked

Note: this initialization is sensitive to outliers, but that effect is usually reduced after the first few
iterations.

**Probabilistic Tweak**

- pick a random example and set it as the mean of the first cluster
- for ``2..K``, sample :math:`\mu_k` from the data points, with the probability of choosing a point proportional
  to its distance from the closest already-chosen mean
- this has better theoretical guarantees

A sketch of both initializations is given below.
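
A minimal sketch of both initializations (my own illustration, not the lecture's code; the probabilistic variant
samples proportionally to distance, as described above):

.. code-block:: python

    import numpy as np

    def furthest_first_init(D, K):
        """Pick a random first mean, then repeatedly the point furthest from all chosen means."""
        mus = [D[np.random.randint(len(D))]]
        for _ in range(1, K):
            dist = np.min([((D - m) ** 2).sum(-1) for m in mus], axis=0)  # squared distance to closest chosen mean
            mus.append(D[np.argmax(dist)])
        return np.array(mus)

    def probabilistic_init(D, K):
        """Like furthest-first, but sample each new mean with probability proportional to its distance."""
        mus = [D[np.random.randint(len(D))]]
        for _ in range(1, K):
            dist = np.sqrt(np.min([((D - m) ** 2).sum(-1) for m in mus], axis=0))  # distance to closest chosen mean
            mus.append(D[np.random.choice(len(D), p=dist / dist.sum())])
        return np.array(mus)
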
How to choose K
---------------
How do we choose K when the number of clusters is unknown?

- if you just pick the K with the smallest loss, K = n (one cluster per training point) has loss 0
- instead, "regularize" K via the BIC/AIC (Bayesian Information Criterion / Akaike Information Criterion)
- increasing K decreases the first term below, but increases the second term

BIC (where D is the # of training instances):

.. math::
    \arg \min_K \hat{L}_K + K \log D

AIC:

.. math::
    \arg \min_K \hat{L}_K + 2KD
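
A small sketch of choosing K this way (my own illustration, reusing the hypothetical ``k_means`` and
``kmeans_loss`` helpers above, and reading D as the number of training instances as in the note):

.. code-block:: python

    import numpy as np

    def choose_k(data, k_max=10):
        """Run k-means for each K and keep the K that minimizes the BIC-style score L_K + K log D."""
        D = len(data)  # number of training instances, as in the BIC formula above
        best_k, best_score = None, np.inf
        for K in range(1, k_max + 1):
            z = k_means(data, K)
            # recompute the centers from the final assignments (fall back to the global mean for empty clusters)
            mu = np.array([data[z == k].mean(axis=0) if np.any(z == k) else data.mean(axis=0)
                           for k in range(K)])
            score = kmeans_loss(data, z, mu) + K * np.log(D)
            if score < best_score:
                best_k, best_score = K, score
        return best_k
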
Probabilistic Clustering
------------------------
**Unobserved Variables**

- also called hidden or latent variables
- not missing values; just not observed in the data
- examples:

  - an imaginary quantity meant to provide simplification
  - a real-world object/phenomenon that is difficult or impossible to measure
  - an object/phenomenon that was not measured (e.g. due to faulty sensors)

- can be discrete or continuous
**On to the clustering**

Let's assume that each cluster is generated by a Gaussian distribution:

.. image:: _static/unsupervised/ex1.png
    :width: 300

The three clusters are defined by :math:`\langle \mu_x, \sigma^2_x \rangle` for each cluster :math:`x`.
Additionally, we add a parameter :math:`\theta = \langle \theta_R, \theta_G, \theta_B \rangle`, which is the
probability of being assigned to each cluster.

So now our objective is to find :math:`(\theta, \langle \mu_R, \sigma^2_R \rangle, \langle \mu_G, \sigma^2_G \rangle, \langle \mu_B, \sigma^2_B \rangle)` -
but to find these parameters we need to know the assignments, and the assignments in turn depend on the parameters!
Setup
^^^^^
Assume that samples are drawn from K Gaussian distributions.

1. If we knew the cluster assignments, could we find the parameters?
2. If we knew the parameters, could we find the cluster assignments?

Just as in non-probabilistic K-means, we initialize the parameters randomly, then solve 1 and 2 iteratively.
Params from Assignments
"""""""""""""""""""""""
Let the cluster assignment for a point :math:`x_n` be :math:`y_n`, and let :math:`N` denote the normal
(Gaussian) density.

.. math::
    P(D|\theta, \mu s, \sigma s) & = P(X, Y | \theta, \mu s, \sigma s) \\
    & = \prod_n P(x_n, y_n | \theta, \mu s, \sigma s) \\
    & = \prod_n P(y_n | \theta, \mu s, \sigma s) P(x_n | y_n, \theta, \mu s, \sigma s) \\
    & = \prod_n P(y_n | \theta) P(x_n| \mu_{y_n}, \sigma_{y_n}) \\
    & = \prod_n \theta_{y_n} N(x_n | \mu_{y_n}, \sigma_{y_n})

To find the parameters, we just argmax this expression. Since it's all products, we can take the log and set the
derivatives to 0:

.. math::
    \log P(D|\theta, \mu s, \sigma s) = \sum_n \log \theta_{y_n} + \sum_n \log N(x_n | \mu_{y_n}, \sigma_{y_n})

.. image:: _static/unsupervised/ex2.png
    :width: 300
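
Maximizing gives the usual closed-form estimates: the cluster priors are the fraction of points assigned to each
cluster, and the means/variances are the per-cluster sample means and variances. A minimal sketch, assuming 1-D
data ``x`` and hard assignments ``y`` (hypothetical names, not from the lecture):

.. code-block:: python

    import numpy as np

    def params_from_assignments(x, y, K):
        """MLE of (theta, mus, sigma2s) for 1-D data x given hard assignments y (each cluster non-empty)."""
        theta = np.array([(y == k).mean() for k in range(K)])    # fraction of points in cluster k
        mus = np.array([x[y == k].mean() for k in range(K)])     # per-cluster sample mean
        sigma2s = np.array([x[y == k].var() for k in range(K)])  # per-cluster sample variance
        return theta, mus, sigma2s
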
Assignments from Params
"""""""""""""""""""""""
Assign a point to the most likely cluster:

.. math::
    y_n & = \arg \max_k P(y_n = k | x_n) \\
    & = \arg \max_k \frac{P(y_n = k, x_n)}{P(x_n)} \\
    & = \arg \max_k P(y_n = k, x_n) \\
    & = \arg \max_k P(y_n = k) P(x_n | y_n = k) \\
    & = \arg \max_k \theta_k N(x_n|\mu_k, \sigma^2_k)
Note, though, that this gives each point exactly one cluster assignment. But we live in a probabilistic world -
so instead, assign each point to *every* cluster with some probability!

**Soft Assignment**

:math:`z_{n,k}` is a real number in the range :math:`[0, 1]`, and :math:`Z_n` is a normalization constant.

.. math::
    z_{n,k} & = P(y_n = k | x_n) \\
    & = \frac{P(y_n = k, x_n)}{P(x_n)} \\
    & = \frac{1}{Z_n} P(y_n = k, x_n) \\
    & = \frac{1}{Z_n} P(y_n = k) P(x_n | y_n = k) \\
    & = \frac{1}{Z_n} \theta_k N(x_n|\mu_k, \sigma^2_k)

and :math:`\sum_k z_{n,k} = 1`.
**What about assignments -> params?**

- find the parameters that maximize the likelihood of the data

.. image:: _static/unsupervised/ex3.png
    :width: 300

This procedure is called **Expectation Maximization** (EM), where step 1 (params from assignments) is the
Maximization step and step 2 (assignments from params) is the Expectation step. Specifically, this model is the
Gaussian Mixture Model (GMM). A sketch of one EM iteration follows the summary below.

In conclusion, EM:

1. computes the *expected* values of the latent variables (here, the soft assignments :math:`z_{n,k}`), given the
   current parameters
2. *maximizes* the expected complete log likelihood of the data to estimate the parameters (for Gaussians,
   :math:`\theta, \mu, \sigma^2`)
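
Putting the two steps together, a minimal sketch of one EM iteration for a 1-D Gaussian mixture (my own
illustration of the updates above, with hypothetical names):

.. code-block:: python

    import numpy as np

    def gaussian_pdf(x, mu, sigma2):
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    def em_step(x, theta, mus, sigma2s):
        """One EM iteration for a 1-D Gaussian mixture; x is (N,), the parameters are (K,)."""
        # E step: soft assignments z[n, k] = theta_k * N(x_n | mu_k, sigma2_k) / Z_n
        z = theta * gaussian_pdf(x[:, None], mus, sigma2s)  # shape (N, K)
        z /= z.sum(axis=1, keepdims=True)                   # normalize so each row sums to 1
        # M step: re-estimate the parameters from the soft assignments
        Nk = z.sum(axis=0)                                  # "effective" number of points per cluster
        theta = Nk / len(x)
        mus = (z * x[:, None]).sum(axis=0) / Nk
        sigma2s = (z * (x[:, None] - mus) ** 2).sum(axis=0) / Nk
        return theta, mus, sigma2s

Iterating ``em_step`` until the parameters stop changing gives the full clustering procedure.
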
.. note::
    K-means is a special form of GMM, and GMM is a special form of EM-clustering.
