Unsupervised Learning
=====================

Getting supervised (labeled) data is expensive and time-consuming; let's make sense of the data without labels.

Today: K-means clustering

- Algorithm
- Convergence
- Initialization
- Hyperparameter selection

  - how many clusters?

In some cases, different clustering algorithms can output different clusterings of the same data (e.g. for
multicolored squares and circles - do we split on shape or on color?).

Algorithm
---------
In a clustering problem, each instance is just a data vector :math:`\mathbf{x}` without a label.

Let's use k-means clustering.

Pseudocode:

.. code-block:: text

    assign cluster centers randomly
    while centers are changing:
        assign each example to the closest center (smallest squared distance)
        compute the new centers according to the assignments

But note that there can be many stop conditions - a set number of iterations, the centers no longer changing,
or some other rule.

In Python (a concrete NumPy version of the pseudocode; the line numbers referenced in the convergence proof
below are line 7, the assignment step, and line 11, the mean update):

.. code-block:: python
    :linenos:

    import numpy as np

    def k_means(D, K, n_iters=100):
        N = len(D)  # D: (N, d) array of examples
        mu = D[np.random.choice(N, K, replace=False)].astype(float)  # random init: K examples as centers
        for _ in range(n_iters):  # loop while the mus are changing
            z = np.argmin(((D[:, None] - mu) ** 2).sum(-1), axis=1)  # assign each example to the closest center
            new_mu = mu.copy()
            for k in range(K):  # compute new centers from the assignments
                X_k = D[z == k]
                new_mu[k] = X_k.mean(axis=0) if len(X_k) else mu[k]  # mean of the points assigned to cluster k
            if np.allclose(new_mu, mu):  # centers stopped changing
                break
            mu = new_mu
        return z
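
A quick sanity check of this sketch (a hypothetical toy run, not from the lecture):

.. code-block:: python

    import numpy as np

    # two well-separated 2-D blobs; k_means should recover them (up to cluster relabeling)
    rng = np.random.default_rng(0)
    D = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
    z = k_means(D, K=2)
    print(np.bincount(z))  # expect roughly [50 50]
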
Convergence
-----------
K-means is guaranteed to converge - i.e. there is some loss function that it never increases, and that stops
changing after a finite number of iterations. So what does it optimize? The sum of the squared distances of
points from their cluster means:

.. math::
    L(z, \mu; D) = \sum_n ||x_n - \mu_{z_n}||^2 = \sum_k \sum_{n:z_n=k} ||x_n - \mu_k||^2

where :math:`z` is the assignment of each point and :math:`\mu` is the location of each cluster center.
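
As a concrete reading of this loss, a tiny (hypothetical) helper in the style of the ``k_means`` sketch above:

.. code-block:: python

    import numpy as np

    def kmeans_loss(D, z, mu):
        """Sum of squared distances from each point to its assigned cluster center."""
        return float(((D - mu[z]) ** 2).sum())  # mu[z] selects each point's center
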
**Lemma 1**: Given a set of points, the single point :math:`m` that minimizes the sum of squared distances from
all of the points to :math:`m` is the mean of those points.

**Theorem**: For any dataset and any number of clusters, the K-means algorithm converges in a finite number of
iterations.
How to prove:

- focus on where in the algorithm the loss can change: lines 7 and 11 of the code above
- prove that the loss can only decrease at those points

  - at line 7 (reassignment), :math:`||x_n - \mu_b|| \leq ||x_n - \mu_a||`, where :math:`a` is the old center and
    :math:`b` is the new one - a point is only reassigned to a closer center
  - at line 11 (recomputing centers), by Lemma 1, the total sum of squared distances decreases (or stays the
    same), since the mean is the point that minimizes the squared distances to all the points in each cluster

- prove that the loss cannot decrease forever - there exists a lower bound, reachable in a finite number of steps

  - the lower bound is 0 (reached, for example, when each training point is the center of its own cluster)
  - each :math:`\mu_k` is always the mean of some subset of the data, and there are only finitely many subsets,
    so :math:`\mu` can only take finitely many values

This proves convergence.

But... what does it converge to?

- it's not guaranteed to converge to the "right answer"
- it's not even guaranteed to converge to the same answer each time (it depends on the initialization)

Initialization
--------------
How do we find the best answer, given that the result depends on the initialization?

1. run the algorithm many times and pick the solution with the minimum loss
2. pick the initialization points more carefully
Furthest-First
^^^^^^^^^^^^^^
- pick the initial means as far from each other as possible:

  - pick a random example and set it as the mean of the first cluster
  - for ``2..K``, pick the example furthest away from the means already picked

Note: this initialization is sensitive to outliers, but that effect is usually reduced after the first few
iterations.

**Probabilistic Tweak**

- pick a random example and set it as the mean of the first cluster
- for ``2..K``, sample :math:`\mu_k` from the data points, with the probability of choosing a point proportional
  to its distance from the closest already-chosen mean
- this has better theoretical guarantees

A sketch of both initializations is given below.
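
A minimal sketch of both initializations (my own illustration, not the lecture's code; the probabilistic variant
samples proportionally to distance, as described above):

.. code-block:: python

    import numpy as np

    def furthest_first_init(D, K):
        """Pick a random first mean, then repeatedly the point furthest from all chosen means."""
        mus = [D[np.random.randint(len(D))]]
        for _ in range(1, K):
            dist = np.min([((D - m) ** 2).sum(-1) for m in mus], axis=0)  # squared distance to closest chosen mean
            mus.append(D[np.argmax(dist)])
        return np.array(mus)

    def probabilistic_init(D, K):
        """Like furthest-first, but sample each new mean with probability proportional to its distance."""
        mus = [D[np.random.randint(len(D))]]
        for _ in range(1, K):
            dist = np.sqrt(np.min([((D - m) ** 2).sum(-1) for m in mus], axis=0))  # distance to closest chosen mean
            mus.append(D[np.random.choice(len(D), p=dist / dist.sum())])
        return np.array(mus)
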
How to choose K
---------------
How do we choose K when the number of clusters is unknown?

- if you just pick the K with the smallest loss, K = n (one cluster per training point) has loss 0
- instead, "regularize" K via the BIC/AIC (Bayesian Information Criterion / Akaike Information Criterion)
- increasing K decreases the first term below, but increases the second term

BIC (where D is the # of training instances):

.. math::
    \arg \min_K \hat{L}_K + K \log D

AIC:

.. math::
    \arg \min_K \hat{L}_K + 2KD
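
A small sketch of choosing K this way (my own illustration, reusing the hypothetical ``k_means`` and
``kmeans_loss`` helpers above, and reading D as the number of training instances as in the note):

.. code-block:: python

    import numpy as np

    def choose_k(data, k_max=10):
        """Run k-means for each K and keep the K that minimizes the BIC-style score L_K + K log D."""
        D = len(data)  # number of training instances, as in the BIC formula above
        best_k, best_score = None, np.inf
        for K in range(1, k_max + 1):
            z = k_means(data, K)
            # recompute the centers from the final assignments (fall back to the global mean for empty clusters)
            mu = np.array([data[z == k].mean(axis=0) if np.any(z == k) else data.mean(axis=0)
                           for k in range(K)])
            score = kmeans_loss(data, z, mu) + K * np.log(D)
            if score < best_score:
                best_k, best_score = K, score
        return best_k
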
Probabilistic Clustering
------------------------
**Unobserved Variables**

- also called hidden or latent variables
- not missing values; just not observed in the data
- examples:

  - an imaginary quantity meant to provide simplification
  - a real-world object/phenomenon that is difficult or impossible to measure
  - an object/phenomenon that was not measured (e.g. due to faulty sensors)

- can be discrete or continuous
**On to the clustering**

Let's assume that each cluster is generated by a Gaussian distribution:

.. image:: _static/unsupervised/ex1.png
    :width: 300

The three clusters are defined by :math:`\langle \mu_x, \sigma^2_x \rangle` for each cluster :math:`x`.
Additionally, we add a parameter :math:`\theta = \langle \theta_R, \theta_G, \theta_B \rangle`, which is the
probability of being assigned to each cluster.

So now our objective is to find :math:`(\theta, \langle \mu_R, \sigma^2_R \rangle, \langle \mu_G, \sigma^2_G \rangle, \langle \mu_B, \sigma^2_B \rangle)` -
but to find these parameters we need to know the assignments, and the assignments in turn depend on the parameters!
Setup
^^^^^
Assume that samples are drawn from K Gaussian distributions.

1. If we knew the cluster assignments, could we find the parameters?
2. If we knew the parameters, could we find the cluster assignments?

Just as in non-probabilistic K-means, we initialize the parameters randomly, then solve 1 and 2 iteratively.
Params from Assignments
"""""""""""""""""""""""
Let the cluster assignment for a point :math:`x_n` be :math:`y_n`, and let :math:`N` denote the normal
(Gaussian) density.

.. math::
    P(D|\theta, \mu s, \sigma s) & = P(X, Y | \theta, \mu s, \sigma s) \\
    & = \prod_n P(x_n, y_n | \theta, \mu s, \sigma s) \\
    & = \prod_n P(y_n | \theta, \mu s, \sigma s) P(x_n | y_n, \theta, \mu s, \sigma s) \\
    & = \prod_n P(y_n | \theta) P(x_n| \mu_{y_n}, \sigma_{y_n}) \\
    & = \prod_n \theta_{y_n} N(x_n | \mu_{y_n}, \sigma_{y_n})

To find the parameters, we just argmax this expression. Since it's all products, we can take the log and set the
derivatives to 0:

.. math::
    \log P(D|\theta, \mu s, \sigma s) = \sum_n \log \theta_{y_n} + \sum_n \log N(x_n | \mu_{y_n}, \sigma_{y_n})

.. image:: _static/unsupervised/ex2.png
    :width: 300
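
Maximizing gives the usual closed-form estimates: the cluster priors are the fraction of points assigned to each
cluster, and the means/variances are the per-cluster sample means and variances. A minimal sketch, assuming 1-D
data ``x`` and hard assignments ``y`` (hypothetical names, not from the lecture):

.. code-block:: python

    import numpy as np

    def params_from_assignments(x, y, K):
        """MLE of (theta, mus, sigma2s) for 1-D data x given hard assignments y (each cluster non-empty)."""
        theta = np.array([(y == k).mean() for k in range(K)])    # fraction of points in cluster k
        mus = np.array([x[y == k].mean() for k in range(K)])     # per-cluster sample mean
        sigma2s = np.array([x[y == k].var() for k in range(K)])  # per-cluster sample variance
        return theta, mus, sigma2s
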
Assignments from Params
"""""""""""""""""""""""
Assign a point to the most likely cluster:

.. math::
    y_n & = \arg \max_k P(y_n = k | x_n) \\
    & = \arg \max_k \frac{P(y_n = k, x_n)}{P(x_n)} \\
    & = \arg \max_k P(y_n = k, x_n) \\
    & = \arg \max_k P(y_n = k) P(x_n | y_n = k) \\
    & = \arg \max_k \theta_k N(x_n|\mu_k, \sigma^2_k)
Note, though, that this gives each point exactly one cluster assignment. But we live in a probabilistic world -
so instead, assign each point to *every* cluster with some probability!

**Soft Assignment**

:math:`z_{n,k}` is a real number in the range :math:`[0, 1]`, and :math:`Z_n` is a normalization constant.

.. math::
    z_{n,k} & = P(y_n = k | x_n) \\
    & = \frac{P(y_n = k, x_n)}{P(x_n)} \\
    & = \frac{1}{Z_n} P(y_n = k, x_n) \\
    & = \frac{1}{Z_n} P(y_n = k) P(x_n | y_n = k) \\
    & = \frac{1}{Z_n} \theta_k N(x_n|\mu_k, \sigma^2_k)

and :math:`\sum_k z_{n,k} = 1`.
**What about assignments -> params?**

- find the parameters that maximize the likelihood of the data

.. image:: _static/unsupervised/ex3.png
    :width: 300

This procedure is called **Expectation Maximization** (EM), where step 1 (params from assignments) is the
Maximization step and step 2 (assignments from params) is the Expectation step. Specifically, this model is the
Gaussian Mixture Model (GMM). A sketch of one EM iteration follows the summary below.

In conclusion, EM:

1. computes the *expected* values of the latent variables (here, the soft assignments :math:`z_{n,k}`), given the
   current parameters
2. *maximizes* the expected complete log likelihood of the data to estimate the parameters (for Gaussians,
   :math:`\theta, \mu, \sigma^2`)
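
Putting the two steps together, a minimal sketch of one EM iteration for a 1-D Gaussian mixture (my own
illustration of the updates above, with hypothetical names):

.. code-block:: python

    import numpy as np

    def gaussian_pdf(x, mu, sigma2):
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    def em_step(x, theta, mus, sigma2s):
        """One EM iteration for a 1-D Gaussian mixture; x is (N,), the parameters are (K,)."""
        # E step: soft assignments z[n, k] = theta_k * N(x_n | mu_k, sigma2_k) / Z_n
        z = theta * gaussian_pdf(x[:, None], mus, sigma2s)  # shape (N, K)
        z /= z.sum(axis=1, keepdims=True)                   # normalize so each row sums to 1
        # M step: re-estimate the parameters from the soft assignments
        Nk = z.sum(axis=0)                                  # "effective" number of points per cluster
        theta = Nk / len(x)
        mus = (z * x[:, None]).sum(axis=0) / Nk
        sigma2s = (z * (x[:, None] - mus) ** 2).sum(axis=0) / Nk
        return theta, mus, sigma2s

Iterating ``em_step`` until the parameters stop changing gives the full clustering procedure.
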
.. note::
    K-means is a special form of GMM, and GMM is a special form of EM-clustering.
