# MACHINE LEARNING FOR RESEARCHERS

# Notebook 3. Unsupervised learning methods



This notebook describes **clustering** and **outlier detection** methods in the unsupervised learning domain.

The following contents are covered: 

- $K$-means 
- GMM
- Model-based outlier detection (based on Mahalanobis distance)
- Local Outlier Factor

It is highly recommended to read and run this notebook after a first reading of the theory, in parallel with the slides available in AV. 
Note also that you are not required to develop any code. All examples are fully implemented, so these notebooks should be regarded as demonstrative material. The goal is to understand how the algorithms operate. The notebook contains several questions that have to be submitted through AV. 

As with Notebooks 1 and 2, the code used for generating and plotting some of the data sets has been adapted from <a href=https://github.com/ageron/handson-ml2>Geron (Github site)</a>. Please consult the textbook for reference. 

## K-means

**Clustering** algorithms aim at identifying groups (clusters) of *similar* data points.  $K$-means is a clustering algorithm whose intuitive idea is grouping together *close* data points while separating *far* ones.

Let $\pmb{\mu}_k$ be a $D$-dimensional point representing the $k^\text{th}$ cluster center (**centroid**), for $k$=$1,2,\ldots,K$, and let $\pmb{r}_{nk}$ be binary variables with value 1 if data point $\pmb{x}_n$ is assigned to cluster $k$, or 0 otherwise.

To characterize how good a cluster assignment is, the **cost** $J$ is defined:

\begin{equation}
J = \sum_{n=1}^N \sum_{k=1}^K \pmb{r}_{nk} \|\pmb{x}_n - \pmb{\mu}_k\|^2
\end{equation}
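
As a quick illustration of this cost, the sketch below evaluates $J$ for a one-hot assignment matrix. The function name `kmeans_cost` and the toy arrays are illustrative assumptions, not part of the notebook code.

```python
import numpy as np

def kmeans_cost(X, R, mu):
    """Cost J: sum of squared distances from each point to its assigned centroid.

    X: (N, D) data, R: (N, K) one-hot assignments, mu: (K, D) centroids.
    """
    # Squared distance between every point and every centroid, shape (N, K)
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # Multiplying by R keeps only the distance to the assigned centroid
    return float((R * sq_dist).sum())

# Toy example (hypothetical data): 4 points on a line, 2 centroids
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
mu_toy = np.array([[0.5, 0.0], [4.5, 0.0]])
R_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# Each point is at distance 0.5 from its centroid: 4 * 0.25 = 1.0
print(kmeans_cost(X_toy, R_toy, mu_toy))
```

Swapping the two columns of `R_toy` assigns every point to the far centroid and gives a much larger cost, which is the behavior the minimization relies on.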

The goal of $K$-means is to find the values $\{\pmb{r}_{nk}\}$ and $\{\pmb{\mu}_k\}$ minimizing cost $J$.

For that, the next iterative algorithm is used:

1. Choose random centroids $\{\pmb{\mu}_k\}$

2. For each point, select the closest cluster assuming fixed centroids $\{\pmb{\mu}_k\}$, updating $\{\pmb{r}_{nk}\}$: 

\begin{equation}
\pmb{r}_{nk} = \begin{cases}
1&\text{if $k$=$\underset{j}{\operatorname{argmin}} \|\pmb{x}_n - \pmb{\mu}_j\|^2$}\\
0&\text{otherwise}.
\end{cases}
\end{equation}

3. Update the centroids $\{\pmb{\mu}_k\}$ assuming the $\{\pmb{r}_{nk}\}$ are fixed. That is,

\begin{equation}
\pmb{\mu}_k = \frac{\sum_{n=1}^N \pmb{r}_{nk} \pmb{x}_n}{\sum_{n=1}^N \pmb{r}_{nk}}
\end{equation}

4. Go to step 2 if the solution has not converged.

Let's implement this procedure and see its operation with a toy data set. First, some common functions for data visualization: 

In [None]:
import numpy as np

from IPython.display import clear_output
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

def plot_data(X, R, colors=['r.', 'g.', 'b.', 'k.', 'y.','c.','m.'], markersize=2):
    Raux = R.copy()
    if R.ndim>1:
        K = R.shape[1]
        for n in range(R.shape[0]):
            i = np.argmax(R[n])
            Raux[n] = np.zeros([1,K])
            Raux[n,i] = 1
    else:
        minR = np.amin(R)
        K = np.amax(R)-minR+1
        Raux = np.zeros([X.shape[0],K])
        for n in range(R.shape[0]):
            i = R[n] - minR 
            Raux[n] = np.zeros([1,K])
            Raux[n,i] = 1 
            
    for k in range(Raux.shape[1]):
        pattern = np.zeros([K,])
        pattern[k] = 1
        matches = (Raux==pattern).all(axis=1).nonzero()
        plt.plot(X[matches, 0], X[matches, 1], colors[k], markersize=markersize)
        
def plot_centroids(centroids, weights=None, circle_color='w', cross_color='k'):
    if weights is not None:
        centroids = centroids[weights > weights.max() / 10]
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='o', s=30, linewidths=8,
                color=circle_color, zorder=10, alpha=0.9)
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=50, linewidths=50,
                color=cross_color, zorder=11, alpha=1)

def plot_decision_boundaries(X, resolution=1000, show_centroids=True,
                             show_xlabels=True, show_ylabels=True, 
                             method='sklearn', clusterer=None, 
                             centroids=None, Sigma=None, pi=None, methodpredict=None):
    mins = X.min(axis=0) - 0.1
    maxs = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                         np.linspace(mins[1], maxs[1], resolution))
    if method=='sklearn':
        Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    else:
        if pi is None:
            Z = methodpredict(np.c_[xx.ravel(), yy.ravel()],centroids)
        else:
            Z = methodpredict(np.c_[xx.ravel(), yy.ravel()],centroids,Sigma,pi)
        
    Z = Z.reshape(xx.shape)

    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                cmap="Pastel2")
    plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                linewidths=1, colors='k')
    
    if method=='sklearn':
        plot_data(X, clusterer.predict(X))
        if show_centroids:
            try: 
                plot_centroids(clusterer.cluster_centers_)
            except:
                plot_centroids(clusterer.means_)

        if show_xlabels:
            plt.xlabel("$x_1$", fontsize=14)
        else:
            plt.tick_params(labelbottom=False)
        if show_ylabels:
            plt.ylabel("$x_2$", fontsize=14, rotation=0)
        else:
            plt.tick_params(labelleft=False)
    else:
        plot_data(X,selectcluster(X,centroids))


def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

### K-means implementation

In [None]:
def selectcluster(X,mu):
    R = np.zeros([X.shape[0],mu.shape[0]])
    for i in range(X.shape[0]):
        V = X[i,:]-mu
        dist = np.linalg.norm(V, axis=1)
        R[i,np.argmin(dist)] = 1
    return R

def updatecentroids(X,R):
    K = R.shape[1]
    mu = np.zeros([K, X.shape[1]])
    for k in range(K):
        pattern = np.zeros([K,])
        pattern[k] = 1
        matches = (R==pattern).all(axis=1).nonzero()
        mu[k,:] = np.mean(X[matches,:], axis=1)
    return mu
 
def kmeans(X,K,plot=False):
    mu = np.random.permutation(X)[:K]
    R = np.zeros([X.shape[0],mu.shape[0]])
    
    while 1:
        R_old = R.copy()
        R = selectcluster(X,mu)
        mu_old = mu.copy()
        mu = updatecentroids(X,R)
        if plot:
            clear_output(wait=True)
            plot_data(X,R)
            plot_centroids(mu, circle_color='r', cross_color='y')
            plt.show()
        if (R_old==R).all() and (mu_old==mu).all(): # Result has converged
            break
    
    return mu,R

def predict(X,mu): 
    R = selectcluster(X,mu)
    label = np.argmax(R,axis=1)
    return label
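
The loop over points in `selectcluster` is easy to follow but slow for large inputs such as the grid used to draw the boundaries. As an optional sketch, the same assignment step can be written with NumPy broadcasting. The name `selectcluster_vec` and the toy arrays are hypothetical; the remaining cells keep using the loop version.

```python
import numpy as np

def selectcluster_vec(X, mu):
    """Vectorized assignment step: returns a one-hot matrix R of shape (N, K)."""
    # Squared distances between all points and all centroids, shape (N, K)
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dist.argmin(axis=1)           # index of the closest centroid
    R = np.zeros((X.shape[0], mu.shape[0]))
    R[np.arange(X.shape[0]), nearest] = 1      # one-hot encoding
    return R

# Toy check (hypothetical data): two points near the first centroid, one near the second
X_toy = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]])
mu_toy = np.array([[0.0, 0.0], [5.0, 5.0]])
print(selectcluster_vec(X_toy, mu_toy))
```

The intermediate array has shape (N, K, D), so this trades memory for speed; for very large N it may need to be applied in chunks.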

Now, let's run this code with an example data set. Note that the figure shows the evolution of the clustering process. **Since the centroids are chosen at random at the beginning, the results change from execution to execution**. After the iterative run, the boundary surfaces separating the clusters (by distance to the corresponding centroid) are shown for 2 runs of the algorithm.

In [None]:
# Let's create some clustered data 
from sklearn.datasets import make_blobs

blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

X, t = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

# First we can see an interactive K-means running
kmeans(X,5,plot=True)

# Next, show solutions to M runs 
M = 2 # Set the number of runs
plt.figure(figsize=(8,8))

for m in range(M):
    mu, R = kmeans(X,5,plot=False)
    plt.subplot(211+m) # Set the "2" to the M used to address the right plot
    plot_decision_boundaries(X, method='mymethod', resolution=500, centroids=mu, methodpredict=predict)
    plot_data(X,R)
    plot_centroids(mu, circle_color='r', cross_color='y')

<b style="color:red"> Previous cell may take a long time (~2 min) to execute and represent the boundaries. Please, allow enough time for running. </b>

***
### Question 1
> **As you have seen in the figures above, the result of a single K-means run changes from execution to execution. Propose a way to select the best solution among all the runs.**
***

## Gaussian mixture model

A GMM is a superposition of $K$ Gaussian distributions over $D$-dimensional data, with density:
\begin{equation}
p(\pmb{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\pmb{x}|\pmb{\mu}_k, \pmb{\Sigma}_k)
\end{equation}

where the $\{\pi_k\}$ are non-negative and satisfy $\sum_{k=1}^K \pi_k$ = 1, and $\mathcal{N}(\pmb{x}|\pmb{\mu},\pmb{\Sigma})$ denotes the density of a multivariate Gaussian random variable with mean $\pmb{\mu}$ and covariance $\pmb{\Sigma}$.

The GMM model can also be understood by assuming a latent (hidden) variable
$z$, whose value $k \in \{1,\ldots,K\}$ selects the $k^\text{th}$ Gaussian
variable $\mathcal{N}(\pmb{\mu}_k, \pmb{\Sigma}_k)$ from which the variable $\pmb{x}$
is subsequently drawn.
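
This latent-variable view translates directly into a two-step sampling procedure: first draw $z$ with probabilities $\{\pi_k\}$, then draw $\pmb{x}$ from the selected Gaussian. The following is a minimal sketch; the function name `sample_gmm` and the mixture parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component mixture in 2-D
pi_demo = np.array([0.3, 0.7])
mu_demo = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma_demo = np.array([0.1 * np.eye(2), 0.5 * np.eye(2)])

def sample_gmm(n, pi, mu, Sigma, rng):
    """Draw n samples from a GMM through its latent variable z."""
    # Step 1: choose the component of each sample with probabilities pi
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: draw each point from the Gaussian selected by z
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

X_demo, z_demo = sample_gmm(1000, pi_demo, mu_demo, Sigma_demo, rng)
# Empirical component frequencies, which should be close to pi_demo
print(np.bincount(z_demo, minlength=2) / len(z_demo))
```

In clustering the situation is reversed: only the points are observed and $z$ has to be inferred, which is what the responsibilities introduced next do.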


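The parameters of the model are fitted with the following iterative expectation-maximization (EM) algorithm:

1. Set initial parameters $\{\pi_k, \pmb{\mu}_k,\pmb{\Sigma}_k\}$.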
2. **Expectation**. With the current parameters, evaluate the *responsibilities* $\{\gamma_{nk}\}$: the
  probability that the $n^\text{th}$ data point has been generated by cluster $k$:
  \begin{equation}
  \gamma_{nk} = \frac{\pi_k\mathcal{N}(\pmb{x}_n|\pmb{\mu}_k,\pmb{\Sigma}_k)}{\sum_{j=1}^K \pi_j\mathcal{N}(\pmb{x}_n|\pmb{\mu}_j,\pmb{\Sigma}_j)}
  \end{equation}

3. **Maximization**. Set $N_k$ = $\sum_{n=1}^N \gamma_{nk}$ and re-estimate the parameters:
  \begin{equation}
  \begin{split}
  \pmb{\mu}_k^\text{new} & =  \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk}\pmb{x}_n \\
  \pmb{\Sigma}_k^\text{new} & =  \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} (\pmb{x}_n -   \pmb{\mu}_k^\text{new}) (\pmb{x}_n - \pmb{\mu}_k^\text{new})^T \\
  \pi_k^\text{new} & =  \frac{N_k}{N}
  \end{split}
  \end{equation}

4. Compute the log-likelihood of the data set:
\begin{equation}
\sum_{n=1}^N \ln \left( \sum_{k=1}^K \pi_k \mathcal{N}(\pmb{x}_n|\pmb{\mu}_k, \pmb{\Sigma}_k) \right)
\end{equation}

   Go to step 2 if the solution has not converged.
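
The expectation step and the log-likelihood share the same sums, so both can be obtained in one pass. The sketch below works in the log domain with `scipy.special.logsumexp`, which avoids underflow when a point is far from every component. The function name `responsibilities` and the toy parameters are assumptions for illustration; this is one possible vectorized formulation, not the implementation used in the next cell.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, mu, Sigma, pi):
    """Return gamma (N, K) and the data log-likelihood for a GMM."""
    K = mu.shape[0]
    # log(pi_k) + log N(x_n | mu_k, Sigma_k), one column per component
    log_joint = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
        for k in range(K)
    ])
    # log p(x_n) = log sum_k exp(log_joint[n, k]), computed stably
    log_evidence = logsumexp(log_joint, axis=1)
    gamma = np.exp(log_joint - log_evidence[:, None])
    return gamma, log_evidence.sum()

# Toy check (hypothetical parameters): each point sits on one of the two means
X_toy = np.array([[0.0, 0.0], [3.0, 3.0]])
mu_toy = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma_toy = np.array([0.1 * np.eye(2), 0.1 * np.eye(2)])
pi_toy = np.array([0.5, 0.5])

gamma_toy, loglik_toy = responsibilities(X_toy, mu_toy, Sigma_toy, pi_toy)
print(gamma_toy.round(3))  # each row sums to 1
```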






In [None]:
from scipy.stats import multivariate_normal

def expectation(X,mu,Sigma,pi):
    pi = pi.ravel()
    gamma = np.zeros([X.shape[0], mu.shape[0]])
    for i in range(X.shape[0]):
        aux = np.zeros([mu.shape[0],])
        for j in range(mu.shape[0]):
            try: 
                aux[j] = pi[j]*multivariate_normal.pdf(X[i], mu[j], Sigma[j])
            except np.linalg.LinAlgError as err:
                aux[j] = pi[j]/mu.shape[0]  # K is not defined in this scope; use the number of components
                
        gamma[i] = aux/np.sum(aux)
    return gamma 

def maximization(X,gamma):
    K = gamma.shape[1]
    N = np.sum(gamma,axis=0)
    mu = np.zeros([gamma.shape[1], X.shape[1]])
    Sigma = np.zeros([gamma.shape[1], X.shape[1], X.shape[1]])
    pi = np.zeros([X.shape[1],])
    for k in range(K):
        for j in range(X.shape[0]):
            mu[k] += gamma[j,k]*X[j]
        mu[k] = mu[k]/N[k]
        
        for j in range(X.shape[0]):
            aux = (X[j]-mu[k]).reshape([-1,1])  # column vector, valid for any dimension D
            Sigma[k] += gamma[j,k]*aux.dot(aux.T)
        Sigma[k] = Sigma[k]/N[k]
        if np.sum(Sigma[k])>((np.max(X)-np.min(X))/3)**2: # Reset Sigma
            Sigma[k] = Sigma[k]/np.max(Sigma[k]) #np.eye(X.shape[1])
        
    pi = N/sum(N)
    return mu,Sigma,pi
 
def gmm(X,K,plot=False):
    # First K-means is run to select an initial solution
    mu, R = kmeans(X,K,plot=False)
    Sigma = np.zeros([K, X.shape[1], X.shape[1]])
    for k in range(K):
        Sigma[k] = np.eye(X.shape[1])
    pi = (np.ones(K)/K).reshape(-1,1)
    gamma = R
    J = float('Inf')
    
    while 1:
        gamma_old = gamma.copy()
        gamma = expectation(X,mu,Sigma,pi)
        mu_old = mu.copy()
        Sigma_old = Sigma.copy()
        pi_old = pi.copy()
        mu, Sigma, pi = maximization(X,gamma)
        if plot:
            clear_output(wait=True)
            plot_data(X,gamma)
            plot_centroids(mu, circle_color='r', cross_color='y')
            plt.show()
            
        J_old = J
        aux = np.zeros([X.shape[0],])
        for i in range(X.shape[0]):
            aux[i] = 0
            for j in range(mu.shape[0]):
                try: 
                    aux[i] += pi[j]*multivariate_normal.pdf(X[i], mu[j], Sigma[j])
                except np.linalg.LinAlgError as err:
                    aux[i] += pi[j]/K
            aux[i] = np.log(aux[i])
        J = np.sum(aux)
            
        if (J-J_old)**2 < 0.01: # Convergence
            break
    
    return mu,Sigma,pi,gamma

def predictgmm(X,mu,Sigma,pi): 
    gamma = expectation(X,mu,Sigma,pi)
    label = np.argmax(gamma,axis=1)
    return label

plt.figure(figsize=(8,4))
mu, Sigma, pi, gamma = gmm(X,5,plot=True)
plot_decision_boundaries(X, method='mymethod', resolution=500, centroids=mu, Sigma=Sigma, pi=pi, methodpredict=predictgmm)
plot_data(X,gamma)
plot_centroids(mu, circle_color='r', cross_color='y')

<b style="color:red"> Previous cell may take a long time (~5 min) to execute and represent the boundaries. Please, allow enough time for running. </b>

## Clustering with sklearn

Next, we will examine how to create clusters directly with the sklearn library. For that, the classes **KMeans** and **GaussianMixture** are used. Their syntax is straightforward and very similar to that of the supervised methods. The running times are also **much shorter** than those of the pure-Python versions shown above, since sklearn relies internally on optimized compiled code (Cython, C and C++).

In [None]:
plt.figure(figsize=(8,8))

# Names end in _sk so that they do not overwrite the kmeans() and gmm() functions defined above
from sklearn.cluster import KMeans
kmeans_sk = KMeans(n_clusters=5, random_state=0).fit(X)
plt.subplot(211)
plot_decision_boundaries(X, method='sklearn', clusterer=kmeans_sk)

from sklearn.mixture import GaussianMixture
gmm_sk = GaussianMixture(n_components=5, random_state=0).fit(X)
plt.subplot(212)
plot_decision_boundaries(X, method='sklearn', clusterer=gmm_sk)
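
One practical difference between the two sklearn classes is the kind of assignment they return: `KMeans.predict` gives a hard label, while `GaussianMixture.predict_proba` gives the responsibilities of each component. The sketch below illustrates this on a small synthetic data set; the data and the variable names (`X_demo`, `km_demo`, `gm_demo`, `x_new`) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Hypothetical data: two well separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [3, 3]],
                       cluster_std=0.5, random_state=0)

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
gm_demo = GaussianMixture(n_components=2, random_state=0).fit(X_demo)

x_new = np.array([[1.5, 1.5]])  # a point halfway between the two blobs

label = km_demo.predict(x_new)        # hard assignment: a single cluster index
proba = gm_demo.predict_proba(x_new)  # soft assignment: one probability per component
print(label, proba)
```

For a point between clusters, the soft assignment keeps the information that the point is ambiguous, which the hard label discards.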

## Model (distance-based) outlier detection


The core idea of these methods is to determine how unexpected an observation $\pmb{x}$ is, given a previously observed data set $\{\pmb{x}_n\}$, by computing the [Mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance): 

\begin{equation}
d_\mathcal{M} =  [(\pmb{x}-\widehat{\pmb{\mu}})^T \widehat{\pmb{\Sigma}}^{-1} (\pmb{x}-\widehat{\pmb{\mu}})]^{1/2}
\end{equation}

where:

\begin{equation}
\begin{split}
\widehat{\pmb{\mu}} & = \frac{1}{N}\sum_{n=1}^N\pmb{x}_n \\
\widehat{\pmb{\Sigma}} & = \frac{1}{N-1} \sum_{n=1}^N (\pmb{x}_n -\widehat{\pmb{\mu}})(\pmb{x}_n -\widehat{\pmb{\mu}})^T
\end{split}
\end{equation}

This distance is commonly used as a measure of the novelty of a point compared with a clustered data set or distribution. The next examples show how this method detects outliers in different setups:

In [None]:
from sklearn.datasets import make_blobs

blob_centers = np.array([[0.0,  0.0]])
blob_std = np.array([[0.2]])

X, t = make_blobs(n_samples=200, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
n_outliers = X_outliers.shape[0]
X = np.r_[X, X_outliers]

plt.figure(figsize=(10,8))
plot_clusters(X,'b')

from scipy.stats import multivariate_normal

gmm = GaussianMixture(n_components=1, random_state=0).fit(X) # 1 Gaussian = Mahalanobis
#y = multivariate_normal.pdf(X, gmm.means_[0], gmm.covars_[0])
y = gmm.score_samples(X)
condition = y<np.percentile(y,10)
outliers = X[condition] # 10% of the most anomalous points
plt.scatter(outliers[:, 0], outliers[:, 1], s=10*y[condition]/np.percentile(y,10), edgecolors='y', facecolors='y', label='Outlier scores')
plot_clusters(X_outliers,'k')
plt.show()

This plot marks the 10% most unlikely points as outliers. The **bigger the circled area, the higher the chances that the point is a real outlier**. Circled points whose inner dot is black are those not generated from the Gaussian distribution (real outliers). The case of multi-cluster data is shown next:

In [None]:
from sklearn.datasets import make_blobs

blob_centers = np.array([[0.0,  0.0], [1.0, 1.0], [2.0, 0.0]])
blob_std = np.array([[0.2], [0.3], [0.2]])

X, t = make_blobs(n_samples=200, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

X_outliers = np.random.uniform(low=-1, high=3, size=(20, 2))
n_outliers = X_outliers.shape[0]
X = np.r_[X, X_outliers]

plt.figure(figsize=(10,8))
plot_clusters(X,'b')

from scipy.stats import multivariate_normal

gmm = GaussianMixture(n_components=1, random_state=0).fit(X) # 1 Gaussian = Mahalanobis
#y = multivariate_normal.pdf(X, gmm.means_[0], gmm.covars_[0])
y = gmm.score_samples(X)
condition = y<np.percentile(y,10)
outliers = X[condition] # 10% of the most anomalous points
plt.scatter(outliers[:, 0], outliers[:, 1], s=10*y[condition]/np.percentile(y,10), edgecolors='y', facecolors='y', label='Outlier scores')
plot_clusters(X_outliers,'k')
plt.show()
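
For reference, the Mahalanobis distance defined at the beginning of this section can also be computed directly with NumPy from the sample mean and the sample covariance (with the $1/(N-1)$ normalization). The following is a minimal sketch; the function name `mahalanobis_distances` and the toy data are assumptions for illustration.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X to the sample mean."""
    mu_hat = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False)  # sample covariance, 1/(N-1) normalization
    diff = X - mu_hat
    # Solve Sigma_hat * v = diff^T instead of inverting the matrix explicitly
    sol = np.linalg.solve(Sigma_hat, diff.T).T
    # Row-wise quadratic form diff_n^T Sigma_hat^{-1} diff_n
    return np.sqrt(np.einsum('nd,nd->n', diff, sol))

# Hypothetical data: a tight Gaussian blob plus one injected far point (last row)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.2, size=(200, 2)), [[3.0, 3.0]]])

d_demo = mahalanobis_distances(X_demo)
print(d_demo.argmax(), d_demo.max())  # the injected point has the largest distance
```

Note that the outlier itself takes part in the estimation of the mean and covariance, so a few extreme points can inflate $\widehat{\pmb{\Sigma}}$ and shrink everybody's distance.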

## Local Outlier Factor

Local Outlier Factor (LOF) is a **local density** method that uses the distances to the $K$ nearest neighbors to estimate the density around each point. When that density is much lower than that of the neighbors, the point is considered an outlier.

Let $K$-distance$(\pmb{x})$ be the distance of a point $\pmb{x}$ to its $K^\text{th}$ nearest neighbor, and let $N_K(\pmb{x})$ be the set of neighbors at a distance less than or equal to $K$-distance$(\pmb{x})$. In addition, let the **reachability** distance of $\pmb{x}$ from a point $\pmb{x}'$ be $r_K(\pmb{x},\pmb{x}')$=$\max\{K\text{-distance}(\pmb{x}'), d(\pmb{x},\pmb{x}')\}$.

Then, the **local reachability density** of $\pmb{x}$ is defined as:
\begin{equation}
\rho_K(\pmb{x}) = \left(\frac{\sum_{\pmb{x}'\in N_K(\pmb{x})} r_K(\pmb{x},\pmb{x}'
)}{|N_K(\pmb{x})|}\right)^{-1}
\end{equation}

and represents the inverse of the average reachability distance of $\pmb{x}$ **from** its neighbors.

Finally, it is possible to compare the local density with that of the neighbors by computing:

\begin{equation}
\text{LOF}_K(\pmb{x}) = \frac{\sum\limits_{\pmb{x}' \in N_K(\pmb{x})} \rho_K(\pmb{
x}')}{|N_K(\pmb{x})| \rho_K(\pmb{x})}
\end{equation}

**A ratio greater than a selected LOF$_{\text{critical}}$ identifies an outlier**. 
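
To make these definitions concrete, the sketch below computes the LOF score with plain NumPy. It uses the reachability of the original LOF formulation, $\max\{K\text{-distance}(\pmb{x}'), d(\pmb{x},\pmb{x}')\}$, and assumes there are no ties in the distances, so that every point has exactly $K$ neighbors. The function name `lof_scores` and the toy data are illustrative assumptions; for real use, the sklearn class shown in the next cell is the better choice.

```python
import numpy as np

def lof_scores(X, K):
    """LOF of every row of X, computed from the definitions (assumes no distance ties)."""
    # Pairwise Euclidean distances, shape (N, N)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor

    neigh = np.argsort(dist, axis=1)[:, :K]     # indices of the K nearest neighbors
    neigh_dist = np.take_along_axis(dist, neigh, axis=1)
    k_distance = neigh_dist[:, -1]              # distance to the K-th neighbor

    # Reachability of x from each neighbor x': max{K-distance(x'), d(x, x')}
    reach = np.maximum(k_distance[neigh], neigh_dist)
    lrd = 1.0 / reach.mean(axis=1)              # local reachability density
    # Average density of the neighbors divided by the density of the point
    return lrd[neigh].mean(axis=1) / lrd

# Hypothetical data: a tight Gaussian blob plus one far point (last row)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.2, size=(100, 2)), [[4.0, 4.0]]])

lof_demo = lof_scores(X_demo, K=10)
print(lof_demo.argmax(), lof_demo.max())  # the far point gets the largest LOF
```

Points inside the blob get scores close to 1, since their density is similar to that of their neighbors, while the isolated point gets a much larger value.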


In [None]:
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs

blob_centers = np.array([[0.0,  0.0]])
blob_std = np.array([[0.2]])

X, t = make_blobs(n_samples=200, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X, X_outliers]

# fit the model for outlier detection (default)
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = clf.fit_predict(X)
X_scores = clf.negative_outlier_factor_

plt.figure(figsize=(10,8))
plt.title("Local Outlier Factor (LOF)")
plt.scatter(X[:, 0], X[:, 1], color='b', s=3., label='Data points')
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
cond = radius>np.percentile(radius,90) # 10% of the highest radius
plt.scatter(X[cond, 0], X[cond, 1], s=radius[cond]*1000, edgecolors='y', facecolors='y', label='Outlier scores')
plot_clusters(X_outliers,'k')
plt.show()

As shown, **in LOF we can use the local density to provide a score of how likely it is that a point is a real outlier**. The same analysis for multi-cluster data is shown next:

In [None]:
from sklearn.datasets import make_blobs

blob_centers = np.array([[0.0,  0.0], [1.0, 1.0], [2.0, 0.0]])
blob_std = np.array([[0.2], [0.3], [0.2]])

X, t = make_blobs(n_samples=200, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

X_outliers = np.random.uniform(low=-1, high=3, size=(20, 2))
X = np.r_[X, X_outliers]

# fit the model for outlier detection (default)
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = clf.fit_predict(X)
X_scores = clf.negative_outlier_factor_

plt.figure(figsize=(10,8))
plt.title("Local Outlier Factor (LOF)")
plt.scatter(X[:, 0], X[:, 1], color='b', s=3., label='Data points')
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
cond = radius>np.percentile(radius,90) # 10% of the highest radius
plt.scatter(X[cond, 0], X[cond, 1], s=radius[cond]*1000, edgecolors='y', facecolors='y', label='Outlier scores')
plot_clusters(X_outliers,'k')
plt.show()

***
### Question 2
> **In a real problem, it is often useful to combine clustering and outlier filtering on unlabeled data. In which order should these operations be performed, and why?**
***

## Full example on the MNIST data set

In this last section we show how to use unsupervised learning on a more complex data set: MNIST. The following procedure is used: 

1. To reduce run time, a random subset of the instances is selected (10000 instances).
2. GMM clustering is run on this data and the images clustered together are shown.
3. The 0.1% most likely outliers are detected using LOF.

In [None]:
from IPython.display import clear_output
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

from sklearn.datasets import fetch_openml
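mnist = fetch_openml('mnist_784', version=1, as_frame=False) # as_frame=False returns NumPy arrays, as the slicing and reshape calls below expect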
# print(mnist.DESCR) # Uncomment to show description
X, t = mnist["data"], mnist["target"] # N = 70000, D = 784 

from sklearn.utils import shuffle
X, t = shuffle(X, t, random_state=0)
X, t = X[:10000], t[:10000]

def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.binary,
               interpolation="nearest")
    plt.axis("off")
    
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.plasma, **options)
    plt.axis("off")

plt.figure(figsize=(9,9))
example_images = X[:100]
plot_digits(example_images, images_per_row=10)
plt.show()

### Clustering

In [None]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=10, random_state=0).fit(X)
y = gmm.predict(X) # Assigns to each image a cluster

<b style="color:red"> The previous cell may take a long time (~2 min) to execute. Please allow enough time for it to run. </b>

In [None]:
# Let's show the first 100 images assigned to one cluster 
# to see how good the result is

plt.figure(figsize=(20,5))
plot_digits(X[y==9][:100], images_per_row=10)
plt.show()

In [None]:
# Let's show the first 100 images assigned to another cluster 
# to see how good the result is

plt.figure(figsize=(20,5))
plot_digits(X[y==1][:100], images_per_row=10)
plt.show()

**Not bad!! This process can be used for data pre-labeling**
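
Since MNIST comes with true labels, one hedged way to quantify this visual impression is to look, for each cluster, at the most frequent true digit and at the fraction of the cluster it covers. The helper below is a sketch; the name `majority_label_per_cluster` and the toy arrays are hypothetical.

```python
import numpy as np

def majority_label_per_cluster(y, t):
    """For each cluster in y, return (most frequent true label, fraction of the cluster)."""
    summary = {}
    for k in np.unique(y):
        labels, counts = np.unique(t[y == k], return_counts=True)
        best = counts.argmax()
        summary[int(k)] = (labels[best], counts[best] / counts.sum())
    return summary

# Toy check (hypothetical data): cluster 0 is mostly '3', cluster 1 is mostly '1'
y_toy = np.array([0, 0, 0, 1, 1, 1, 1])
t_toy = np.array(['3', '3', '8', '1', '1', '1', '7'])
summary_toy = majority_label_per_cluster(y_toy, t_toy)
print(summary_toy)
```

In this notebook it could be called as `majority_label_per_cluster(y, np.asarray(t))`, assuming `y` holds the cluster indices and `t` the MNIST targets (which are stored as strings). A cluster dominated by a single digit is a good candidate for pre-labeling.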

### Outliers 

In [None]:
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = clf.fit_predict(X)
X_scores = clf.negative_outlier_factor_
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
cond = radius>np.percentile(radius,99.9) # 0.1% of the highest radius -> about 10 images out of 10000

outliers = X[cond]
plot_digits(outliers, images_per_row=10)
plt.show()

<b style="color:red"> The previous cell may take a long time (~2 min) to execute. Please allow enough time for it to run. </b>

**See how the result reveals some oddly drawn characters!!**

***
### Question 3
> **Provide some example of unsupervised learning usage in your research topic, and explain which of the previous algorithms you would use.**
***