# 9. Unsupervised Learning

Although most machine learning applications today rely on supervised learning, the vast majority of available data is unlabeled.

This is where unsupervised learning shines. In this chapter, we will look at three unsupervised learning tasks:

1. **Clustering**: group similar instances into clusters
2. **Anomaly detection**: learn what normal data looks like in order to detect abnormal instances
3. **Density estimation**: estimate the probability density function (PDF) of the random process that generated the dataset

### 1. Clustering

Applications of clustering include:

* Segmentation
* Data analysis
* Dimensionality reduction
* Anomaly detection
* Semi-supervised learning
* Search engines
* Image compression

Let's now look at two particular algorithms.

#### K-Means

K-means is a relatively simple yet powerful algorithm that will try to find each cluster’s center and assign each instance to the closest cluster.

Let's try it out on a synthetic dataset of blobs generated with `make_blobs`:

In [4]:
from sklearn.datasets import make_blobs
import numpy as np

blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

In [5]:
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

In [6]:
from sklearn.cluster import KMeans

k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)

In the context of clustering, an instance’s label is the index of the cluster that this instance gets assigned to by the algorithm. 

In [7]:
y_pred

array([0, 4, 2, ..., 3, 2, 4])

In [8]:
y_pred is kmeans.labels_

True

We can also have a look at the centroids:

In [9]:
kmeans.cluster_centers_

And we can use them to quickly assign new instances:

In [10]:
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])

In [11]:
kmeans.predict(X_new)

array([2, 2, 3, 3])

Instead of assigning each instance to a single cluster (**hard clustering**), it can be useful to just give each instance a score per cluster (**soft clustering**).
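One common choice of score is the distance from the instance to each centroid, which `KMeans.transform()` returns. The sketch below is only an illustration: the dataset, the variable names (`X_demo`, `kmeans_demo`, `X_new_demo`) and the parameter values are assumptions made for this example, not the blobs used above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small synthetic dataset (illustrative values, not the chapter's blobs)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                       random_state=42)
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

X_new_demo = np.array([[0.0, 2.0], [3.0, 2.0]])

# Soft clustering: one column per cluster, holding the Euclidean
# distance from each instance to that cluster's centroid
distances = kmeans_demo.transform(X_new_demo)

# Hard clustering is recovered by picking the closest centroid
hard_labels = distances.argmin(axis=1)

print(distances.shape)  # (number of instances, number of clusters)
print(hard_labels)
```

Taking the `argmin` of each row gives the same assignment as `predict()`, so the distance matrix carries strictly more information than the hard labels.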

#### K-means algorithm

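The algorithm starts by placing the centroids randomly, then alternates between assigning each instance to its closest centroid and moving each centroid to the mean of its assigned instances, until the centroids stop moving. Convergence usually takes only a few iterations, and the computational complexity is generally linear with regard to the number of instances _m_, the number of clusters _k_ and the number of dimensions _n_.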
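The loop described above (random initial centroids, then iteration until convergence) can be sketched in plain NumPy. This is a minimal illustration, not Scikit-Learn's implementation: the function name `kmeans_sketch`, its parameters, and the demo data are all hypothetical, and empty clusters are handled naively by leaving their centroid where it was.

```python
import numpy as np

def nearest_centroid(X, centroids):
    # Index of the closest centroid for every instance
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct instances as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each instance with its closest centroid
        labels = nearest_centroid(X, centroids)
        # Update step: move each centroid to the mean of its instances
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Final assignment so the labels match the returned centroids
    return centroids, nearest_centroid(X, centroids)

# Hypothetical demo data: two tight blobs
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                    rng.normal(5.0, 0.1, size=(50, 2))])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(centroids)
```

Each iteration computes _m_ × _k_ distances in _n_ dimensions, which is where the linear cost per iteration comes from.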

However, guaranteed convergence does not mean convergence to the global optimum: depending on the initial centroids, the algorithm may settle on a local optimum. Improving centroid initialization can therefore lead to better results.

We could do this by:

1. Setting the initial centroids ourselves through the `init` parameter, for example using approximate positions found in an earlier run
2. Running the algorithm multiple times with different random initializations and keeping the best solution (the one with minimal _inertia_, the sum of squared distances between each instance and its closest centroid)
3. Using K-means++ initialization, which selects initial centroids that are distant from one another

The last option, proposed by David Arthur and Sergei Vassilvitskii in 2006, is the default initialization method of Scikit-Learn's `KMeans`.
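The three initialization strategies listed above map onto the `init` and `n_init` parameters of `KMeans`. The following is a hedged sketch: the dataset, the variable names, and the choice of the first four instances as manual centroids are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset (not the chapter's blobs)
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6,
                       random_state=0)

# Option 1: explicit starting centroids, so a single run is enough.
# Using the first four instances is an arbitrary, hypothetical choice.
manual_init = X_demo[:4]
km_manual = KMeans(n_clusters=4, init=manual_init, n_init=1).fit(X_demo)

# Option 2: ten purely random initializations, keeping the lowest inertia
km_random = KMeans(n_clusters=4, init="random", n_init=10,
                   random_state=0).fit(X_demo)

# Option 3: K-means++ initialization (spread-out starting centroids)
km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10,
               random_state=0).fit(X_demo)

for name, km in [("manual", km_manual), ("random", km_random),
                 ("k-means++", km_pp)]:
    print(name, km.inertia_)
```

The `inertia_` attribute is what `KMeans` compares across the `n_init` runs, so printing it is a quick way to see whether an initialization strategy led to a better or worse solution.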

#### Accelerated K-means and Mini-batch K-means 

Another improvement to the algorithm was proposed by Charles Elkan in 2003. It takes advantage of the triangle inequality ($AC \leq AB + BC$) to avoid many unnecessary distance calculations.
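To see why the triangle inequality saves work, note that if an instance $x$ is currently assigned to centroid $c$ and another centroid $c'$ satisfies $d(c, c') \geq 2\,d(x, c)$, then $d(x, c') \geq d(c, c') - d(x, c) \geq d(x, c)$, so $c'$ cannot be closer and its distance need not be computed. The sketch below illustrates only this single pruning rule; it is a simplified assumption-laden example, not Elkan's full algorithm (which also maintains distance bounds across iterations), and the function name and data are hypothetical.

```python
import numpy as np

def assign_with_pruning(X, centroids):
    # Centroid-to-centroid distances are computed once (k x k values)
    cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.zeros(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best = 0
        best_dist = np.linalg.norm(x - centroids[0])
        for j in range(1, len(centroids)):
            # Triangle inequality: if d(c_best, c_j) >= 2 * d(x, c_best),
            # centroid j cannot be strictly closer, so skip its distance
            if cc[best, j] >= 2 * best_dist:
                skipped += 1
                continue
            d = np.linalg.norm(x - centroids[j])
            if d < best_dist:
                best, best_dist = j, d
        labels[i] = best
    return labels, skipped

# Hypothetical demo: three well-separated centroids with points around them
rng = np.random.default_rng(0)
demo_centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(0.0, 0.5, size=(100, 2))
                    for c in demo_centroids])

labels, skipped = assign_with_pruning(X_demo, demo_centroids)

# Brute-force assignment for comparison
brute = np.linalg.norm(
    X_demo[:, None, :] - demo_centroids[None, :, :], axis=2).argmin(axis=1)
print(np.array_equal(labels, brute), skipped)
```

The pruned assignment matches the brute-force one while computing fewer instance-to-centroid distances, and the saving grows as clusters become better separated.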

