***

# Practical activity: Gaussian mixture models

***

- In this activity you will explore and analyze a dataset of flowers from the Iris genus
- We will use data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/)
- We will use the following libraries: 
    - Python [pandas](https://pandas.pydata.org) to handle data tables
    - matplotlib for visualization
    - Scikit-learn for the modeling

### Instructions
1. Follow the steps in this notebook, complete the activities and answer the questions marked with a **Q**
1. Pre-process the data and inspect it
1. Train K-means and Gaussian mixture models (GMM) to separate the three classes of flowers in the Iris dataset

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

***
### Getting the data
- Use the following block to obtain the data
- If you don't have `wget`, download the file manually from the URL in the cell below and place it in the notebook's working directory, which is where `iris.data` is read from

In [None]:
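# Download the raw data (-nc skips the download if the file already exists)
!wget -nc http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
# The file has no header row, so column names are given explicitly; "?" marks missing values
df = pd.read_csv("iris.data", na_values="?",
                 names=["sepal length", "sepal width", "petal length", "petal width", "class"])
# Remove the downloaded file once it is loaded into memory
!rm iris.data
df.head()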

*** 
### Data exploration

- **Q:** How many features and samples are in the table?
- **Q:** How many classes?
- **Q:** Are the classes balanced?
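
The questions above can be approached with a few pandas calls. The following is a minimal sketch, not the expected solution. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`) as an offline stand-in for the UCI file, and `df_demo` is a hypothetical name; in the notebook, the same calls can be applied to `df`.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy stands in for the UCI file so the
# sketch runs offline; in the notebook, use the `df` loaded above instead.
iris = load_iris(as_frame=True)
df_demo = iris.frame.rename(columns={"target": "class"})

# Rows are samples; every column except "class" is a feature
n_samples, n_columns = df_demo.shape
n_features = n_columns - 1

# Number of samples per class: equal counts mean the classes are balanced
class_counts = df_demo["class"].value_counts()

print(f"{n_samples} samples, {n_features} features")
print(class_counts)
```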

Inspect the features
- **Q:** Do they follow a known probability density function (pdf)?
- **Q:** How many modes do they have?

In [None]:
fig, ax = plt.subplots(figsize=(5, 3))
# Histogram of one feature, read from the `df` table loaded above
ax.hist(df["sepal length"].values, bins=10);

## Cluster analysis

Cluster the data using GMM with spherical covariance for $K=1, 2, 3, 4$ and visualize the clusters in the reduced data space (PCA). Plot the Bayesian Information Criterion (BIC) as a function of $K$. 

- **Q:** What would be an appropriate value of $K$ in this case?
- **Q:** Do the clusters coincide with the actual labels (cluster purity)?
- **Q:** Repeat for GMM with (a) diagonal covariance and (b) full covariance and compare the results
- **Q:** Repeat for the Variational GMM with full covariance. Use an equal concentration prior. How many clusters are selected by the algorithm?
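
One possible way to organize the BIC sweep over $K$ is sketched below. It is a starting point, not the reference solution. As assumptions, it uses scikit-learn's bundled Iris data as an offline stand-in for the notebook's feature table, and the values of `n_init` and `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Assumption: bundled Iris features stand in for the notebook's feature table
X = load_iris().data

ks = [1, 2, 3, 4]
bics = []
for k in ks:
    # n_init restarts EM several times and keeps the best fit (arbitrary value)
    gmm = GaussianMixture(n_components=k, covariance_type="spherical",
                          n_init=5, random_state=0).fit(X)
    bics.append(gmm.bic(X))  # lower BIC indicates a better trade-off

for k, b in zip(ks, bics):
    print(f"K={k}: BIC={b:.1f}")
```

The list `bics` can then be plotted against `ks` with `ax.plot(ks, bics, marker="o")` to read off the value of $K$ where the criterion stops improving.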

In [None]:
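# Reduced dimensionality space, use this to visualize your results
# `data` holds the four numeric features and `label` the integer-coded class
data = df.drop(columns="class").values
label = df["class"].astype("category").cat.codes
data_reduced = PCA(n_components=2).fit_transform(data)
fig, ax = plt.subplots(figsize=(5, 3))
ax.scatter(data_reduced[:, 0], data_reduced[:, 1], c=label, cmap=plt.cm.Accent);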

In [None]:
# GMM spherical covariance
GaussianMixture?
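
To compare clusters with the true labels, one option is a contingency table and a purity score (the fraction of samples that fall in the majority class of their cluster). The sketch below assumes scikit-learn's bundled Iris data as an offline stand-in, fixes $K=3$ for illustration only, and uses arbitrary values for `n_init` and `random_state`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Assumption: bundled Iris data stands in for the notebook's data and labels
iris = load_iris()
X, y = iris.data, iris.target

gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      n_init=5, random_state=0)
clusters = gmm.fit_predict(X)  # hard assignment to the most probable component

# Rows are clusters, columns are true classes
table = pd.crosstab(pd.Series(clusters, name="cluster"),
                    pd.Series(y, name="class"))

# Purity: sum of the majority-class counts of each cluster, divided by N
purity = table.max(axis=1).sum() / len(y)

print(table)
print(f"purity = {purity:.3f}")
```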

In [None]:
# GMM diagonal covariance

In [None]:
# GMM full covariance
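
The three covariance types differ in how many parameters each component carries, which shows up in the shape of the fitted `covariances_` attribute. The following sketch illustrates this under the same assumptions as before: bundled Iris data as a stand-in, and $K=3$ chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Assumption: bundled Iris features stand in for the notebook's feature table
X = load_iris().data

shapes = {}
bic_at_3 = {}
for cov in ("spherical", "diag", "full"):
    gmm = GaussianMixture(n_components=3, covariance_type=cov,
                          n_init=5, random_state=0).fit(X)
    # spherical: one variance per component
    # diag: one variance per feature per component
    # full: one d x d matrix per component
    shapes[cov] = gmm.covariances_.shape
    bic_at_3[cov] = gmm.bic(X)

for cov in shapes:
    print(f"{cov:9s} covariances_ shape={shapes[cov]}  BIC={bic_at_3[cov]:.1f}")
```

Because BIC penalizes the number of free parameters, the richer covariance models only win if their gain in likelihood outweighs the extra parameters.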

In [None]:
# Variational GMM
BayesianGaussianMixture?
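
A possible starting point for the variational model is shown below. It interprets "equal concentration prior" as a symmetric Dirichlet distribution over the weights (`weight_concentration_prior_type="dirichlet_distribution"`), which is an assumption. The upper bound of 10 components, the prior value 0.01, and the 0.01 weight threshold are hypothetical choices, and the bundled Iris data again serves as an offline stand-in.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import BayesianGaussianMixture

# Assumption: bundled Iris features stand in for the notebook's feature table
X = load_iris().data

bgm = BayesianGaussianMixture(
    n_components=10,                  # upper bound on clusters (hypothetical)
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=0.01,  # small value favors few active components (hypothetical)
    max_iter=1000,
    random_state=0,
).fit(X)

# Components whose mixing weight is effectively zero are "switched off"
weights = np.round(bgm.weights_, 3)
n_effective = int(np.sum(bgm.weights_ > 0.01))  # threshold is arbitrary

print("weights:", weights)
print("components with weight > 0.01:", n_effective)
```

Varying `weight_concentration_prior` and watching how `n_effective` changes is a useful way to see how strongly the prior drives the number of selected clusters.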