# 9. Unsupervised Learning

Although most applications today are in supervised learning, most of the data available is actually unlabeled. 

This is where unsupervised learning shines. In this chapter, we will look at three unsupervised learning tasks:

1. **Clustering**: group similar instances into clusters
2. **Anomaly detection**: learn what normal data looks like in order to detect abnormal instances
3. **Density estimation**: estimate the probability density function (PDF) of the random process that generated the dataset

### 1. Clustering

Clustering is used in a wide variety of applications, including:

* Segmentation
* Data analysis
* Dimensionality reduction
* Anomaly detection
* Semi-supervised learning
* Search engines
* Image compression

Let's now look at two particular algorithms.

#### K-Means

K-means is a relatively simple yet powerful algorithm that will try to find each cluster’s center and assign each instance to the closest cluster.

Let's try it out on a synthetic dataset of blobs generated with `make_blobs`:

In [4]:
from sklearn.datasets import make_blobs
import numpy as np

blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

In [5]:
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

In [6]:
from sklearn.cluster import KMeans

k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)

In the context of clustering, an instance’s label is the index of the cluster that this instance gets assigned to by the algorithm. 

In [7]:
y_pred

array([0, 4, 2, ..., 3, 2, 4])

In [8]:
y_pred is kmeans.labels_

True

We can also have a look at the centroids:

In [9]:
kmeans.cluster_centers_

And we can use them to quickly assign new instances:

In [10]:
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])

In [11]:
kmeans.predict(X_new)

array([2, 2, 3, 3])

Instead of assigning each instance to a single cluster (**hard clustering**), it can be useful to just give each instance a score per cluster (**soft clustering**).
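As a minimal sketch of soft clustering, the `transform()` method of `KMeans` returns the distance from each instance to every centroid, which can serve as a per-cluster score. The dataset and the names `X_demo`, `kmeans_demo` and `X_new_demo` below are illustrative assumptions, not part of the notebook above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small synthetic dataset (illustrative; any 2D data would do)
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

X_new_demo = np.array([[0.0, 2.0], [3.0, 2.0]])

# transform() returns one row per instance and one column per cluster,
# each entry being the Euclidean distance to that cluster's centroid
distances = kmeans_demo.transform(X_new_demo)

# The hard assignment is simply the closest centroid
hard_labels = distances.argmin(axis=1)
print(distances.round(2))
print(hard_labels)
```

Taking the `argmin` of each row recovers the hard assignment that `predict()` would return.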

#### K-means algorithm

The algorithm starts by placing the centroids at random, then alternates between assigning each instance to its closest centroid and moving each centroid to the mean of its instances, until the centroids stop moving. Convergence usually takes only a few iterations, and the computational complexity is generally linear in the number of instances _m_, the number of clusters _k_ and the number of dimensions _n_.
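The loop described above can be sketched in plain NumPy. This is a simplified illustration, not Scikit-Learn's implementation: the function name `lloyd_kmeans`, its parameters, and the toy data are assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: random init, then alternate assignment and update."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids on k distinct, randomly chosen instances
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each instance with its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its instances
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Toy data: two tight, well-separated groups around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centroids, labels = lloyd_kmeans(X_toy, k=2)
print(centroids.round(2))
```

Each iteration costs one distance computation per instance, centroid and dimension, which is where the linear complexity comes from.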

However, guaranteed convergence does not guarantee a global optimum: depending on the initialization, the algorithm may converge to a local optimum. Improving the centroid initialization can therefore lead to better results.

We could do this by:

1. Set the initial centroids ourselves, typically based on the result of an earlier run with random initialization
2. Run the algorithm multiple times with different random initializations and keep the best solution, i.e. the one with the lowest _inertia_ (the sum of squared distances between each instance and its closest centroid, as reported by Scikit-Learn's `inertia_` attribute)
3. Use the K-Means++ initialization, which works by selecting initial centroids that are distant from one another

The last option, proposed by David Arthur and Sergei Vassilvitskii in 2006, is the default initialization method in Scikit-Learn.
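The three options map onto the `init` and `n_init` parameters of `KMeans`. The sketch below uses hypothetical toy data whose true centers are known (`true_centers`, `X_demo` and the `km_*` names are assumptions for illustration only).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: four blobs whose true centers we happen to know
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0]])
X_demo, _ = make_blobs(n_samples=500, centers=true_centers,
                       cluster_std=0.5, random_state=1)

# Option 1: provide the starting centroids ourselves (a single run is enough)
km_manual = KMeans(n_clusters=4, init=true_centers, n_init=1).fit(X_demo)

# Option 2: 10 random initializations; the run with the lowest inertia is kept
km_random = KMeans(n_clusters=4, init="random", n_init=10, random_state=1).fit(X_demo)

# Option 3: K-Means++ initialization (spreads the initial centroids apart)
km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=1).fit(X_demo)

for name, km in [("manual", km_manual), ("random", km_random), ("k-means++", km_pp)]:
    print(f"{name:>10}: inertia = {km.inertia_:.1f}")
```

On data this well separated, all three strategies should reach very similar inertia values; the differences matter more on harder datasets.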

#### Accelerated K-means and Mini-batch K-means 

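Another improvement to the algorithm was proposed by Charles Elkan in 2003. It takes advantage of the triangle inequality ($AC \le AB + BC$) to avoid many unnecessary distance computations. This is the variant that `algorithm='auto'` selects for dense data in the Scikit-Learn release used in this notebook; more recent releases default to the classic algorithm instead.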
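To see why the triangle inequality saves work, consider one of the bounds it provides: if $d(c_1, c_2) \ge 2\,d(x, c_1)$, then $d(x, c_2) \ge d(x, c_1)$, so $c_2$ cannot be closer to $x$ than $c_1$ and $d(x, c_2)$ never needs to be computed. The snippet below only checks this single bound on made-up points; it is not Elkan's full algorithm, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
c1, c2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])  # two hypothetical centroids
X_demo = rng.normal(loc=c1, scale=1.0, size=(1000, 2))

d_x_c1 = np.linalg.norm(X_demo - c1, axis=1)
d_c1_c2 = np.linalg.norm(c1 - c2)

# Instances for which c2 is provably not closer than c1,
# using only d(x, c1) and the centroid-to-centroid distance
can_skip = d_c1_c2 >= 2 * d_x_c1

# Verify the bound by computing the distances we could have skipped
d_x_c2 = np.linalg.norm(X_demo - c2, axis=1)
assert np.all(d_x_c2[can_skip] >= d_x_c1[can_skip])
print(f"distance computations that could be skipped: {can_skip.mean():.0%}")
```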

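In 2010, David Sculley proposed using mini-batches: instead of using the full dataset at each iteration, the algorithm uses a small random batch and moves the centroids only slightly each time. This typically speeds up training by a factor of 3 to 4 and reduces memory requirements, since the full dataset never has to be processed at once. Although the Mini-batch K-Means algorithm is much faster than the regular K-Means algorithm, its inertia is generally slightly worse, especially as $k$ increases.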
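Scikit-Learn exposes this variant as `MiniBatchKMeans`. Here is a minimal comparison sketch on synthetic data; the dataset, the batch size of 256 and the variable names are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=5000, centers=5, random_state=7)

# Regular K-Means uses the whole dataset at every iteration
full = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_demo)

# Mini-batch K-Means updates the centroids from 256 instances at a time
mini = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=3,
                       random_state=42).fit(X_demo)

print(f"K-Means inertia:            {full.inertia_:.1f}")
print(f"Mini-batch K-Means inertia: {mini.inertia_:.1f}")
```

Comparing the two printed values on your own data is a quick way to check how much quality is traded for speed.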

#### Finding the optimal number of clusters

Unfortunately, inertia is not a good criterion for choosing $k$, because it keeps decreasing as $k$ increases. Let's look at a few alternatives:

1. A common approach is the **elbow rule**, named after the resemblance of the inertia-versus-$k$ curve to a human arm. We choose the $k$ at which the slope changes most sharply (the _elbow_ of the _arm_): beyond that point, adding clusters reduces the inertia only slightly.

2. A more precise but more computationally expensive method is to use the **silhouette score**, which is the mean of the silhouette coefficient over all the instances. The coefficient varies from -1 to +1, where:
   * -1 = instance may have been assigned to the wrong cluster
   * 0 = instance close to a cluster boundary
   * +1 = instance well inside its own cluster and far from other clusters

**Note:** The silhouette coefficient of an instance is equal to $\frac{b - a}{\max(a, b)}$, where:
* $a$ = mean distance to the other instances in the same cluster
* $b$ = mean distance to the instances of the next closest cluster
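Both criteria can be computed by fitting one model per candidate $k$. The sketch below assumes a toy dataset with four well-separated blobs; `X_demo`, `ks` and `best_k` are illustrative names, and on real data the picture is rarely this clean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: four well-separated blobs
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0], [6.0, 0.0]])
X_demo, _ = make_blobs(n_samples=400, centers=centers,
                       cluster_std=0.5, random_state=0)

ks = range(2, 8)
models = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo) for k in ks]

# Elbow rule: inspect (or plot) how the inertia drops as k grows
inertias = [m.inertia_ for m in models]

# Silhouette score: mean silhouette coefficient over all instances
silhouettes = [silhouette_score(X_demo, m.labels_) for m in models]

best_k = ks[int(np.argmax(silhouettes))]
for k, inertia, sil in zip(ks, inertias, silhouettes):
    print(f"k={k}: inertia={inertia:8.1f}  silhouette={sil:.3f}")
print("k with the highest silhouette score:", best_k)
```

The silhouette score requires pairwise distances between instances, which is why it is noticeably more expensive than reading off the inertia.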

#### Limits of K-means

Despite its advantages (it is fast and scalable), K-Means has strong limitations when the clusters have varying sizes, different densities, or non-spherical shapes. In addition, depending on the initialization, it may converge to a suboptimal solution.

**Note**: Scaling the inputs may help, but does not guarantee optimal results. 
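As a rough illustration of why scaling can help, the sketch below builds made-up data in which two small-scale features carry the group structure and a third, large-scale feature is pure noise. All names and values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 200)  # true group of each instance
signal = 4.0 * group
X_demo = np.column_stack([
    rng.normal(loc=signal, scale=0.5),           # informative, small scale
    rng.normal(loc=signal, scale=0.5),           # informative, small scale
    rng.normal(loc=0.0, scale=100.0, size=400),  # pure noise, large scale
])

def agreement(labels):
    """Fraction of instances matching the true groups, up to label swapping."""
    acc = np.mean(labels == group)
    return max(acc, 1 - acc)

# Without scaling, distances are dominated by the noisy large-scale feature
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

# With scaling, every feature contributes comparably to the distances
scaled_labels = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
).fit_predict(X_demo)

print(f"agreement without scaling: {agreement(raw_labels):.2f}")
print(f"agreement with scaling:    {agreement(scaled_labels):.2f}")
```

Scaling fixes the imbalance between feature ranges here, but it does nothing for the other limitations listed above, such as non-spherical clusters.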

#### Application: Preprocessing

Clustering can be an efficient approach to dimensionality reduction, in particular as a preprocessing step before a supervised learning algorithm.

For example, using Scikit-Learn's digits dataset, a small MNIST-like dataset of 8×8 grayscale images of digits:

In [12]:
from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y=True)

In [13]:
# splitting in training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits)

In [14]:
# fitting logistic regression
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [15]:
# evaluate accuracy
log_reg.score(X_test, y_test)

0.9555555555555556

The baseline accuracy is about 95.6%, not bad. Let’s see if we can do better by using K-Means as a preprocessing step. We will create a pipeline that first clusters the training set into 50 clusters and replaces each image with its distances to the 50 centroids, then trains a logistic regression model on these distances.

In [16]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50)),
    ("log_reg", LogisticRegression()),
])

pipeline.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('kmeans',
                 KMeans(algorithm='auto', copy_x=True, init='k-means++',
                        max_iter=300, n_clusters=50, n_init=10, n_jobs=None,
                        precompute_distances='auto', random_state=None,
                        tol=0.0001, verbose=0)),
                ('log_reg',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='warn', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [17]:
pipeline.score(X_test, y_test)

0.9777777777777777

We have already halved the error rate (from about 4.4% to about 2.2%), even though we chose $k$ arbitrarily. Let’s use `GridSearchCV` to find the optimal number of clusters:

In [22]:
%%capture
# %%capture must be the first line of the cell; it suppresses the verbose output
from sklearn.model_selection import GridSearchCV

param_grid = dict(kmeans__n_clusters=range(2, 100))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)


In [23]:
# best value of k
grid_clf.best_params_

{'kmeans__n_clusters': 57}

In [24]:
grid_clf.score(X_test, y_test)

0.9777777777777777

Final accuracy: 97.8%, identical to the score obtained with the arbitrary choice of $k = 50$.

#### Application: Semi-Supervised learning

