Unsupervised learning works with an unlabeled training set: the algorithm is tasked with finding structure in the data, for example by grouping it into clusters.

Clustering is good for:
- Market Segmentation
- Social network analysis
- Astronomical data analysis

##### K-Means
It's the most popular algorithm for automatically grouping data into coherent subsets (clusters). It works by:

1. Initialising _K_ points at random, known as cluster centroids.
2. Assigning each example to the cluster whose centroid is closest. This is known as the cluster assignment step.
3. Moving the centroids: compute the average of all the points assigned to each cluster centroid, then move that centroid to the average.
4. Repeating steps 2 and 3 until the centroids stop moving (the assignments no longer change).

Main variables are:

- K (Number of clusters)
- Training set $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, where $x^{(i)} \in \mathbb{R}^n$

Repeat:

```
for i = 1 to m:
    c^(i) := index (from 1 to K) of cluster centroid closest to x^(i)
for k = 1 to K:
    mu_k := average (mean) of points assigned to cluster k
```
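The loop above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names, the `max_iters` cap, the fixed `seed`, and the choice to leave an empty cluster's centroid where it is are all assumptions made for the example. Cluster indices here run from 0 to K-1 instead of 1 to K.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(points, k, max_iters=100, seed=0):
    """Run K-means once; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Initialise the centroids as K distinct training examples.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Cluster assignment step: index of the closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the centroids will too
        assignments = new_assignments
        # Move centroid step: mean of the points assigned to each cluster.
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Two well-separated groups of 2-D points (made-up toy data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignments, centroids = k_means(points, k=2)
print(assignments)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialisation.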



#### K-Means Optimization Objective

Notation:

- $c^{(i)}$ = index of the cluster $(1, 2, \dots, K)$ to which example $x^{(i)}$ is currently assigned
- $\mu_k$ = cluster centroid _k_ ($\mu_k \in \mathbb{R}^n$)
- $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned

Using the above notation we can define the *cost function* (also called the *distortion*) that K-means tries to minimise:

$$ J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \dfrac{1}{m}\sum_{i=1}^m ||x^{(i)} - \mu_{c^{(i)}}||^2$$
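As a quick sanity check of the formula, the cost can be computed directly from a set of assignments and centroids. The sketch below uses made-up points and a hypothetical `distortion` helper, with 0-based cluster indices.

```python
def distortion(points, assignments, centroids):
    """Mean squared distance between each point and its assigned centroid."""
    total = 0.0
    for x, c in zip(points, assignments):
        mu = centroids[c]  # centroid of the cluster that x is assigned to
        total += sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return total / len(points)

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centroids = [(1.0, 0.0), (11.0, 0.0)]  # mu_k
assignments = [0, 0, 1, 1]             # c^(i), 0-based

cost = distortion(points, assignments, centroids)
print(cost)  # every point is at distance 1 from its centroid, so J = 1.0

# A worse assignment for the same centroids gives a higher cost.
worse = distortion(points, [0, 1, 0, 1], centroids)
print(worse)
```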


To randomly initialise K cluster centroids:

- Choose _K_ < _m_ (fewer clusters than training examples), e.g. _K_ = 2.
- Randomly pick _K_ training examples.
- Set $\mu_1, \dots, \mu_K$ equal to these _K_ examples.

Depending on the initial centroids, K-means can get stuck in a local minimum of the cost function. To reduce that risk, we should try multiple random initialisations and keep the best result:

```
for i = 1 to 100 {
    Randomly initialise K-means.
    Run K-means. Get c^(1), ..., c^(m), mu_1, ..., mu_K.
    Compute cost function (distortion) J.
}
Pick the clustering that gave the lowest cost (distortion).
```
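A rough Python sketch of this restart procedure follows. The helper names (`nearest`, `run_k_means`, `best_of_restarts`), the fixed number of iterations per run, and the toy data are assumptions chosen for illustration. The data has three obvious groups, yet a single unlucky initialisation (for example two centroids starting in the same group) can leave K-means in a local minimum.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def run_k_means(points, k, rng, iters=50):
    """One K-means run from a random start; returns (cost, assignments, centroids)."""
    centroids = [list(p) for p in rng.sample(points, k)]  # K random training examples
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, c in zip(points, assign) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    assign = [nearest(p, centroids) for p in points]
    cost = sum(sq_dist(p, centroids[c]) for p, c in zip(points, assign)) / len(points)
    return cost, assign, centroids

def best_of_restarts(points, k, restarts=100, seed=0):
    """Run K-means `restarts` times and keep the run with the lowest distortion."""
    rng = random.Random(seed)
    runs = [run_k_means(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (20.0, 0.0), (21.0, 0.0)]
cost, assign, centroids = best_of_restarts(points, k=3)
print(cost, assign)
```

The best clustering pairs up neighbouring points, so every point sits 0.5 away from its centroid and the lowest distortion is 0.25.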

Multiple random initialisations help most when the number of clusters is small, roughly _K_ between 2 and 10. For larger values of _K_, running many random initialisations is normally not necessary.

#### Choosing the number of clusters

What is the right value of _K_? Most often the answer is ambiguous.

The elbow method is one possible way of choosing the value of _K_.

##### Elbow Method:

Start by running K-means with a single cluster and computing the cost function _J_. Then run K-means with 2 clusters, then 3, and keep going until the distortion only declines slowly.

Plot the distortion against _K_. The value of _K_ after which the distortion stops dropping sharply and starts to decrease more slowly is called the 'elbow', because the plot resembles a bent human arm. That value is then chosen as a plausible ideal value of _K_.

More often than not, the plot won't look like an elbow and the ideal value won't be as obvious.
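To make the procedure concrete, here is a small sketch that computes the distortion for K = 1 to 5 on made-up one-dimensional data with three obvious groups. The helper names, the number of restarts per K, and the restriction to 1-D values (to keep the code short) are assumptions made for this example.

```python
import random

def k_means_cost(xs, k, rng, iters=20):
    """Distortion of one K-means run on 1-D data from a random initialisation."""
    centroids = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            closest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            groups[closest].append(x)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        centroids = [sum(g) / len(g) if g else centroids[j] for j, g in enumerate(groups)]
    return sum(min((x - c) ** 2 for c in centroids) for x in xs) / len(xs)

def elbow_curve(xs, max_k, restarts=50, seed=0):
    """Best distortion found for each K from 1 to max_k."""
    rng = random.Random(seed)
    return {
        k: min(k_means_cost(xs, k, rng) for _ in range(restarts))
        for k in range(1, max_k + 1)
    }

xs = [0.0, 0.4, 0.8, 10.0, 10.4, 10.8, 20.0, 20.4, 20.8]
curve = elbow_curve(xs, max_k=5)
for k, cost in curve.items():
    print(k, round(cost, 3))
```

On this data the distortion falls steeply up to K = 3 and barely changes afterwards, so the elbow sits at K = 3. Real data rarely produces such a clean bend.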

A more useful way of choosing _K_ often depends on the purpose K-means is being used for: evaluate K-means with a metric for how well each value of _K_ serves that later purpose. In other words, choose _K_ by hand and then evaluate its performance against the actual goal.

#### Dimensionality Reduction

Dimensionality reduction is a form of data compression: reducing the number of dimensions simplifies the process of calculating results and can improve performance at the same time.

One simple approach is to make assumptions based on domain knowledge. If you know some dimensions are more important than others, you can work with just those.

##### Principal Component Analysis



PCA is normally used to:

- Find latent (not directly observable) features driving patterns in the data
- Reduce dimensionality, in order to:
    - Visualise high-dimensional data (e.g. reduce from 4D to 2D to draw a scatter plot)
    - Reduce noise in the data
    - Prepare the data for other algorithms that work better with fewer inputs
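
As a small illustration of the dimensionality-reduction use, the sketch below reduces made-up 2-D points to a single coordinate along their first principal component. It is only one way to do this: it finds the direction with power iteration on the covariance matrix, whereas libraries typically use an SVD or eigendecomposition. The function names and the iteration count are assumptions.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (mean, unit direction) of the first principal component."""
    m, n = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / m for j in range(n)]
    centred = [[r[j] - mean[j] for j in range(n)] for r in rows]
    # Covariance matrix (n x n) of the mean-centred data.
    cov = [[sum(r[a] * r[b] for r in centred) / m for b in range(n)] for a in range(n)]
    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, v):
    """1-D coordinate of each row along the direction v."""
    return [sum((r[j] - mean[j]) * v[j] for j in range(len(v))) for r in rows]

# Points lying close to the line y = 2x (made-up data).
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
mean, v = first_principal_component(rows)
z = project(rows, mean, v)  # one number per example instead of two
print(v)
print(z)
```

The direction found has a slope close to 2, matching the line the points were scattered around, and `z` keeps most of the variation in the data using a single dimension.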
 