In [None]:
# Copyright 2021 Google LLC
# Use of this source code is governed by an MIT-style
# license that can be found in the LICENSE file or at
# https://opensource.org/licenses/MIT.

# Author(s): Kevin P. Murphy (murphyk@gmail.com) and Mahmoud Soliman (mjs@aucegypt.edu)

<a href="https://opensource.org/licenses/MIT" target="_parent"><img src="https://img.shields.io/github/license/probml/pyprobml"/></a>

<a href="https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/figures//chapter21_figures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cloning the pyprobml repo

In [None]:
!git clone https://github.com/probml/pyprobml 
%cd pyprobml/scripts

# Installing required software (this may take a few minutes)

In [None]:
!apt install octave  -qq > /dev/null
!apt-get install liboctave-dev -qq > /dev/null

## Figure 21.2:

  (a) An example of single link clustering using city block distance. Pairs (1,3) and (4,5) are both distance 1 apart, so they are merged first. (b) The resulting dendrogram. Adapted from Figure 7.5 of \citep{Alpaydin04}.  
Figure(s) generated by [agglomDemo.m](https://github.com/probml/pmtk3/blob/master/demos/agglomDemo.m) 

In [None]:
!octave -W agglomDemo.m >> _
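
The merge order described in the caption can be reproduced with a few lines of plain Python. This is only a sketch of single link agglomerative clustering, not a port of agglomDemo.m. The five coordinates are hypothetical, chosen so that pairs (1,3) and (4,5) are both at city block distance 1.

```python
from itertools import combinations

def cityblock(p, q):
    """City block (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def single_link(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [frozenset([i]) for i in points]
    merges = []
    while len(clusters) > 1:
        # Single link: the distance between two clusters is the smallest
        # distance between any pair of their members.
        a, b, d = min(
            ((a, b, min(cityblock(points[i], points[j]) for i in a for j in b))
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[2],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b), d))
    return merges

# Hypothetical coordinates (not taken from the book figure).
points = {1: (0, 0), 2: (0, 3), 3: (1, 0), 4: (4, 3), 5: (4, 4)}
merges = single_link(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

With these coordinates the first two merges are (1,3) and (4,5), both at distance 1. The merge distances are the heights at which the corresponding branches would join in a dendrogram.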

## Figure 21.4:

  Hierarchical clustering of yeast gene expression data. (a) Single linkage. (b) Complete linkage. (c) Average linkage.  
Figure(s) generated by [hclustYeastDemo.m](https://github.com/probml/pmtk3/blob/master/demos/hclustYeastDemo.m) 

In [None]:
!octave -W hclustYeastDemo.m >> _
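
The three panels differ only in how the distance between two clusters is defined. The sketch below computes the three linkage distances for a pair of small 1d clusters. The function name `linkage_distance` and the data are hypothetical, used for illustration only.

```python
def linkage_distance(A, B, kind):
    """Distance between clusters A and B (lists of 1-D points)."""
    d = [abs(a - b) for a in A for b in B]  # all pairwise distances
    if kind == "single":
        return min(d)           # closest pair
    if kind == "complete":
        return max(d)           # farthest pair
    if kind == "average":
        return sum(d) / len(d)  # mean over all pairs
    raise ValueError(f"unknown linkage: {kind}")

A, B = [0.0, 1.0], [3.0, 7.0]
distances = {kind: linkage_distance(A, B, kind)
             for kind in ("single", "complete", "average")}
print(distances)
```

Single linkage always gives the smallest of the three values and complete linkage the largest, with average linkage in between.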

## Figure 21.5:

  (a) Some yeast gene expression data plotted as a heat map. (b) Same data plotted as a time series.  
Figure(s) generated by [kmeansYeastDemo.m](https://github.com/probml/pmtk3/blob/master/demos/kmeansYeastDemo.m) 

In [None]:
!octave -W kmeansYeastDemo.m >> _

## Figure 21.6:

  Hierarchical clustering applied to the yeast gene expression data. (a) The rows are permuted according to a hierarchical clustering scheme (average link agglomerative clustering), in order to bring similar rows close together. (b) 16 clusters induced by cutting the average linkage tree at a certain height.  
Figure(s) generated by [hclustYeastDemo.m](https://github.com/probml/pmtk3/blob/master/demos/hclustYeastDemo.m) 

In [None]:
!octave -W hclustYeastDemo.m >> _

## Figure 21.7:

  Illustration of K-means clustering in 2d. We show the result of using two different random seeds. Adapted from Figure 9.5 of \citep{Geron2019}.  
Figure(s) generated by [kmeans_voronoi.py](https://github.com/probml/pyprobml/blob/master/scripts/kmeans_voronoi.py) 

In [None]:
%run ./kmeans_voronoi.py
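
To see why the random seed matters, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not the code in kmeans_voronoi.py. The synthetic data, the function names, and the choice of initializing the centers at randomly sampled data points are all assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(X, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
            for x in X]

def kmeans(X, K, seed, n_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(X, K)  # the seed only affects this initialization
    for _ in range(n_iter):
        z = assign(X, centers)
        new_centers = []
        for k in range(K):
            members = [x for x, zk in zip(X, z) if zk == k]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centers[k]
            )
        if new_centers == centers:  # converged
            break
        centers = new_centers
    z = assign(X, centers)
    distortion = sum(sq_dist(x, centers[zk]) for x, zk in zip(X, z))
    return centers, z, distortion

# Hypothetical 2d data: three Gaussian blobs.
data_rng = random.Random(0)
X = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
     for cx, cy in [(0, 0), (4, 0), (2, 3)] for _ in range(30)]

for seed in (0, 1):
    centers, z, distortion = kmeans(X, K=3, seed=seed)
    print(f"seed={seed}: distortion={distortion:.2f}")
```

Because the objective is non-convex, different initializations can converge to different local optima, which is the effect the figure illustrates.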

## Figure 21.8:

  Clustering the yeast data from \cref{fig:yeast} using K-means clustering with $K=16$. (a) Visualizing all the time series assigned to each cluster. (b) Visualizing the 16 cluster centers as prototypical time series.  
Figure(s) generated by [kmeansYeastDemo.m](https://github.com/probml/pmtk3/blob/master/demos/kmeansYeastDemo.m) 

In [None]:
!octave -W kmeansYeastDemo.m >> _

## Figure 21.9:

  An image compressed using vector quantization with a codebook of size $K$. (a) $K=2$. (b) $K=4$.  
Figure(s) generated by [vqDemo.m](https://github.com/probml/pmtk3/blob/master/demos/vqDemo.m) 

In [None]:
!octave -W vqDemo.m >> _
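
The idea behind vector quantization can be sketched on a handful of grayscale values: each pixel is replaced by the index of its nearest codeword, and decoding looks the codeword up again. The pixel values and both codebooks below are hypothetical and hand-picked. In practice the codebook would be fit to the image, for example with K-means.

```python
import math

def encode(pixels, codebook):
    """Map each pixel to the index of its nearest codeword."""
    return [min(range(len(codebook)), key=lambda k: (p - codebook[k]) ** 2)
            for p in pixels]

def decode(codes, codebook):
    """Replace each index by its codeword."""
    return [codebook[k] for k in codes]

def sq_error(pixels, decoded):
    return sum((p - q) ** 2 for p, q in zip(pixels, decoded))

pixels = [12, 15, 200, 210, 90, 100, 14, 205]   # hypothetical 8-bit values
codebook2 = [50, 200]                            # K = 2
codebook4 = [13, 95, 202, 210]                   # K = 4

codes2 = encode(pixels, codebook2)
codes4 = encode(pixels, codebook4)
err2 = sq_error(pixels, decode(codes2, codebook2))
err4 = sq_error(pixels, decode(codes4, codebook4))

for K, err in ((2, err2), (4, err4)):
    bits = len(pixels) * math.ceil(math.log2(K))  # cost of the indices only
    print(f"K={K}: {bits} bits for the codes, squared error {err}")
```

A larger codebook costs more bits per pixel but reconstructs the signal more accurately, which is the trade-off visible between panels (a) and (b).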

## Figure 21.10:

  Illustration of batch vs mini-batch K-means clustering on the 2d data from \cref{fig:kmeansVoronoi}. Left: distortion vs $K$. Right: Training time vs $K$. Adapted from Figure 9.6 of \citep{Geron2019}.  
Figure(s) generated by [kmeans_minibatch.py](https://github.com/probml/pyprobml/blob/master/scripts/kmeans_minibatch.py) 

In [None]:
%run ./kmeans_minibatch.py
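
For intuition about the mini-batch variant, the sketch below implements one common formulation in plain Python: each step draws a small batch, assigns its points to the nearest center, and moves that center towards each point with a per-center step size of 1/count. This is an illustrative assumption and may differ from what kmeans_minibatch.py or any library implementation does. The data and function name are hypothetical.

```python
import random

def minibatch_kmeans(X, K, batch_size, n_steps, seed=0):
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(X, K)]
    counts = [0] * K  # number of points each center has absorbed so far
    for _ in range(n_steps):
        batch = rng.sample(X, batch_size)
        # Assign the whole batch using the centers as they are now.
        nearest = [
            min(range(K),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centers[k])))
            for x in batch
        ]
        for x, k in zip(batch, nearest):
            counts[k] += 1
            eta = 1.0 / counts[k]  # step size shrinks as the center matures
            centers[k] = [(1 - eta) * c + eta * a
                          for c, a in zip(centers[k], x)]
    return centers

# Hypothetical 2d data: two Gaussian blobs.
data_rng = random.Random(1)
X = [(data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
     for cx, cy in [(0, 0), (5, 5)] for _ in range(50)]

centers = minibatch_kmeans(X, K=2, batch_size=10, n_steps=200)
print(centers)
```

Each step touches only `batch_size` points instead of the full dataset, which is why training time grows more slowly, usually at the cost of slightly higher distortion.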

## Figure 21.11:

  Performance of K-means and GMM vs $K$ on the 2d dataset from \cref{fig:kmeansVoronoi}. (a) Distortion on validation set vs $K$.  
Figure(s) generated by [kmeans_silhouette.py](https://github.com/probml/pyprobml/blob/master/scripts/kmeans_silhouette.py) [gmm_2d.py](https://github.com/probml/pyprobml/blob/master/scripts/gmm_2d.py) 

In [None]:
%run ./kmeans_silhouette.py

In [None]:
%run ./gmm_2d.py

## Figure 21.12:

  Voronoi diagrams for K-means for different $K$ on the 2d dataset from \cref{fig:kmeansVoronoi}.  
Figure(s) generated by [kmeans_silhouette.py](https://github.com/probml/pyprobml/blob/master/scripts/kmeans_silhouette.py) 

In [None]:
%run ./kmeans_silhouette.py

## Figure 21.13:

  Silhouette diagrams for K-means for different $K$ on the 2d dataset from \cref{fig:kmeansVoronoi}.  
Figure(s) generated by [kmeans_silhouette.py](https://github.com/probml/pyprobml/blob/master/scripts/kmeans_silhouette.py) 

In [None]:
%run ./kmeans_silhouette.py
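
A silhouette diagram plots one silhouette coefficient per point. For point $i$, with $a_i$ the mean distance to the other members of its own cluster and $b_i$ the smallest mean distance to the members of any other cluster, the coefficient is $(b_i - a_i)/\max(a_i, b_i)$. The sketch below computes it for 1d data in plain Python. The data and function name are hypothetical, and singleton clusters are given the value 0 by convention.

```python
def silhouette_values(X, labels):
    """Silhouette coefficient of every point (1-D data, at least 2 clusters)."""
    clusters = {}
    for x, l in zip(X, labels):
        clusters.setdefault(l, []).append(x)
    values = []
    for x, l in zip(X, labels):
        own = clusters[l]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members (the point itself adds 0).
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # Mean distance to the nearest other cluster.
        b = min(sum(abs(x - y) for y in pts) / len(pts)
                for m, pts in clusters.items() if m != l)
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

X = [0.0, 1.0, 10.0, 11.0]
good = silhouette_values(X, [0, 0, 1, 1])  # clusters match the two groups
bad = silhouette_values(X, [0, 1, 0, 1])   # clusters cut across the groups
print(sum(good) / len(good), sum(bad) / len(bad))
```

Values near 1 indicate points that sit well inside their cluster, while negative values indicate points that are closer on average to a different cluster.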

## Figure 21.14:

  Some data in 2d fit using a GMM with $K=5$ components. Left column: marginal distribution $p(\mathbf{x})$. Right column: visualization of each mixture distribution, and the hard assignment of points to their most likely cluster. (a-b) Full covariance. (c-d) Tied full covariance. (e-f) Diagonal covariance. (g-h) Spherical covariance. Color coding is arbitrary.  
Figure(s) generated by [gmm_2d.py](https://github.com/probml/pyprobml/blob/master/scripts/gmm_2d.py) 

In [None]:
%run ./gmm_2d.py
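
The four covariance structures differ in how many free covariance parameters they use. The sketch below counts them for $K$ components in $D$ dimensions. The names follow the `covariance_type` values of scikit-learn's `GaussianMixture`. Whether gmm_2d.py uses the same names is an assumption, and "spherical" is taken here to mean one variance per component.

```python
def n_covariance_params(K, D, covariance_type):
    """Number of free covariance parameters in a K-component GMM in D dims."""
    full = D * (D + 1) // 2  # entries of one symmetric D x D matrix
    counts = {
        "full": K * full,    # one full matrix per component
        "tied": full,        # a single full matrix shared by all components
        "diag": K * D,       # one variance per dimension per component
        "spherical": K,      # one variance per component
    }
    return counts[covariance_type]

for kind in ("full", "tied", "diag", "spherical"):
    print(kind, n_covariance_params(K=5, D=2, covariance_type=kind))
```

For the setting in the figure ($K=5$, $D=2$) this gives 15, 3, 10, and 5 covariance parameters respectively. The means and mixing weights add the same number of parameters in all four cases.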

## Figure 21.15:

  Some 1d data, with a kernel density estimate superimposed. Adapted from Figure 6.2 of \citep{Martin2018}.  
Figure(s) generated by [gmm_identifiability_pymc3.py](https://github.com/probml/pyprobml/blob/master/scripts/gmm_identifiability_pymc3.py) 

In [None]:
%run ./gmm_identifiability_pymc3.py

## Figure 21.16:

  Illustration of the label switching problem when performing posterior inference for the parameters of a GMM. We show a KDE of the posterior marginals derived from 1000 samples from 4 HMC chains. (a) Unconstrained model. Posterior is symmetric. (b) Constrained model, where we add a penalty to ensure $\mu_0 < \mu_1$. Adapted from Figures 6.6-6.7 of \citep{Martin2018}.  
Figure(s) generated by [gmm_identifiability_pymc3.py](https://github.com/probml/pyprobml/blob/master/scripts/gmm_identifiability_pymc3.py) 

In [None]:
%run ./gmm_identifiability_pymc3.py
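
The symmetry in panel (a) arises because the mixture likelihood does not change when the component labels are permuted. The sketch below checks this for a two-component 1d GMM with hypothetical data and parameters. The `relabel` helper, which sorts components by their mean, is a post-hoc alternative shown for illustration. It is not the penalty used for panel (b), and it is not taken from the script.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_loglik(data, weights, means, sigma=1.0):
    """Log-likelihood of a 1-D GMM with a shared, fixed standard deviation."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, sigma) for w, m in zip(weights, means)))
        for x in data
    )

def relabel(weights, means):
    """Reorder the components so that the means are increasing."""
    order = sorted(range(len(means)), key=lambda k: means[k])
    return [weights[k] for k in order], [means[k] for k in order]

data = [-2.1, -1.9, -2.3, 1.8, 2.2, 2.0, 2.4]  # hypothetical observations

ll_a = gmm_loglik(data, [0.4, 0.6], [-2.0, 2.0])
ll_b = gmm_loglik(data, [0.6, 0.4], [2.0, -2.0])  # same model, labels swapped
print(ll_a, ll_b)
print(relabel([0.6, 0.4], [2.0, -2.0]))
```

Both parameter settings receive the same likelihood, so without a constraint each chain is free to settle on either labelling.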

## Figure 21.17:

  Fitting GMMs with different numbers of clusters $K$ to the data in \cref{fig:gmmIdentifiabilityData}. Black solid line is KDE fit. Solid blue line is posterior mean; faint blue lines are posterior samples. Dotted lines show the individual Gaussian mixture components, evaluated by plugging in their posterior mean parameters. Adapted from Figure 6.8 of \citep{Martin2018}.  
Figure(s) generated by [gmm_chooseK_pymc3.py](https://github.com/probml/pyprobml/blob/master/scripts/gmm_chooseK_pymc3.py) 

In [None]:
%run ./gmm_chooseK_pymc3.py

## Figure 21.18:

  WAIC scores for the different GMMs. The empty circle is the posterior mean WAIC score for each model, and the black lines represent the standard error of the mean. The solid circle is the in-sample deviance of each model, i.e., the unpenalized log-likelihood. The dashed vertical line corresponds to the maximum WAIC value. The gray triangle is the difference in WAIC score for that model compared to the best model. Adapted from Figure 6.10 of \citep{Martin2018}.  
Figure(s) generated by [gmm_chooseK_pymc3.py](https://github.com/probml/pyprobml/blob/master/scripts/gmm_chooseK_pymc3.py) 

In [None]:
%run ./gmm_chooseK_pymc3.py
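
As a rough guide to what a WAIC score measures, the sketch below computes it from a table of pointwise log-likelihoods, one row per posterior draw and one column per data point. It uses the log scale (log pointwise predictive density minus the sum of per-point posterior variances), so higher is better. The scale and sign convention used by the script is an assumption, and the input numbers are hypothetical toy values, not model output.

```python
import math
from statistics import variance

def waic(log_lik):
    """log_lik[s][i] = log p(y_i | theta_s) for draw s and data point i.

    Returns (elpd_waic, p_waic) on the log scale.
    """
    S, N = len(log_lik), len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(N):
        col = [log_lik[s][i] for s in range(S)]
        mx = max(col)
        # log of the posterior-mean likelihood, computed stably
        lppd += mx + math.log(sum(math.exp(v - mx) for v in col) / S)
        # effective number of parameters: sample variance over the draws
        p_waic += variance(col)
    return lppd - p_waic, p_waic

# Hypothetical log-likelihoods: 2 posterior draws, 2 data points.
elpd, p = waic([[-1.0, -2.0], [-1.5, -2.5]])
print(elpd, p)
```

The penalty term grows when the pointwise log-likelihood varies a lot across posterior draws, which is how the score discourages overly flexible models.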

## Figure 21.19:

  We fit a mixture of 20 Bernoullis to the binarized MNIST digit data. We visualize the estimated cluster means $\boldsymbol{\mu}_k$. The numbers on top of each image represent the estimated mixing weights $\pi_k$. No labels were used when training the model.  
Figure(s) generated by [mixBerMnistEM.m](https://github.com/probml/pmtk3/blob/master/demos/mixBerMnistEM.m) 

In [None]:
!octave -W mixBerMnistEM.m >> _
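
For readers without Octave, the EM updates for a mixture of Bernoullis can be sketched in plain Python. This is not a port of mixBerMnistEM.m. The tiny binary dataset, the random initialization, and the smoothing constant `eps` (which keeps the parameters strictly between 0 and 1) are assumptions made for illustration.

```python
import math
import random

def em_mix_bernoulli(X, K, n_iter=20, seed=0, eps=1e-6):
    rng = random.Random(seed)
    N, D = len(X), len(X[0])
    pi = [1.0 / K] * K
    mu = [[rng.uniform(0.25, 0.75) for _ in range(D)] for _ in range(K)]
    log_liks = []
    for _ in range(n_iter):
        # E step: responsibilities r[n][k], computed in log space.
        R, ll = [], 0.0
        for x in X:
            logp = [
                math.log(pi[k])
                + sum(math.log(m if xd else 1.0 - m) for xd, m in zip(x, mu[k]))
                for k in range(K)
            ]
            mx = max(logp)
            lse = mx + math.log(sum(math.exp(v - mx) for v in logp))
            ll += lse
            R.append([math.exp(v - lse) for v in logp])
        log_liks.append(ll)  # log-likelihood under the current parameters
        # M step: weighted counts, lightly smoothed by eps.
        for k in range(K):
            Nk = sum(r[k] for r in R) + eps
            pi[k] = Nk / (N + K * eps)
            mu[k] = [
                (sum(r[k] * x[d] for r, x in zip(R, X)) + 0.5 * eps) / Nk
                for d in range(D)
            ]
    return pi, mu, log_liks

# Hypothetical binary data with two obvious prototypes plus two noisy rows.
X = [[1, 1, 0, 0]] * 4 + [[0, 0, 1, 1]] * 4 + [[1, 0, 0, 0], [0, 0, 0, 1]]
pi, mu, log_liks = em_mix_bernoulli(X, K=2)
print([round(p, 3) for p in pi])
print([[round(m, 3) for m in row] for row in mu])
```

Each row of `mu` plays the role of one cluster mean image in the figure, and `pi` holds the corresponding mixing weights.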

## Figure 21.20:

  Clustering data consisting of 2 spirals. (a) K-means. (b) Spectral clustering.  
Figure(s) generated by [spectral_clustering_demo.py](https://github.com/probml/pyprobml/blob/master/scripts/spectral_clustering_demo.py) 

In [None]:
%run ./spectral_clustering_demo.py