***

# Practical activity: Gaussian mixture models

***

- In this activity you will explore and analyze a dataset of flowers from the Iris genus
- We will use data from the [UCI](https://archive.ics.uci.edu/ml/) repository 
- We will use the following libraries: 
    - Python [pandas](https://pandas.pydata.org) to handle data tables
    - matplotlib for visualization
    - scikit-learn for modeling

### Instructions
1. Follow the steps in this notebook, complete the activities and answer the questions marked with a **Q**
1. Pre-process the data and inspect it
1. Train K-means and GMM models to separate the three classes of flowers in the Iris dataset

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn.cluster import KMeans
%matplotlib notebook
import matplotlib.pyplot as plt

***
### Getting the data
- Use the following block to obtain the data
- If you don't have `wget`, follow the link, download the file manually and put it in the `data` folder

In [None]:
!wget -nc http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -P data

*** 
### Explore the data

- Import the table as a pandas dataframe and explore it
- **Q:** How many features and samples are in the table?
- **Q:** How many classes?

In [None]:
# Use help(pd.read_table) to understand the parameters of read_table
df = pd.read_table("data/iris.data", delimiter=",",  na_values="?",
                   names= ["sepal length", "sepal width", "petal length", "petal width", "class"])

df

**Q:** Are the classes balanced?

In [None]:
# Make class numeric: map each class name to an integer code 0, 1, 2
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
df["class"] = df["class"].map({name: k for k, name in enumerate(labels)})
label = df.iloc[:, -1]
# Get the data: the four feature columns, without the class
data = df.iloc[:, 0:4]
# plot histogram
fig, ax = plt.subplots(figsize=(5, 3))
ax.hist(label);
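
The questions so far (table size, number of classes, class balance) can also be checked numerically. The following is a minimal sketch, not the notebook's own solution. So that it runs without the downloaded file, it assumes scikit-learn's bundled copy of Iris (`load_iris`) in place of `data/iris.data`. The column names are copied from the cell above, and `df_check` is a hypothetical name chosen to avoid overwriting `df`.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the bundled copy of Iris stands in for data/iris.data
iris = load_iris()
columns = ["sepal length", "sepal width", "petal length", "petal width"]
df_check = pd.DataFrame(iris.data, columns=columns)
df_check["class"] = iris.target  # integer codes 0, 1, 2

n_samples, n_columns = df_check.shape
n_features = n_columns - 1  # the last column holds the class

# Number of samples per class, ordered by class code
class_counts = df_check["class"].value_counts().sort_index()

print(f"{n_samples} samples, {n_features} features")
print(class_counts)
```

Equal counts across the classes would indicate a balanced dataset.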

- Inspect the features. Do they follow a known pdf? Do they have more than one mode?

In [None]:
fig, ax = plt.subplots(figsize=(5, 3))
ax.hist(data["sepal length"].values);
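
One way to see where several modes might come from is to summarize each feature within each class. The sketch below is only an illustration under the assumption that scikit-learn's bundled copy of Iris (`load_iris`) can stand in for the downloaded file; `features` and `per_class` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the bundled copy of Iris stands in for data/iris.data
iris = load_iris()
columns = ["sepal length", "sepal width", "petal length", "petal width"]
features = pd.DataFrame(iris.data, columns=columns)

# Mean and standard deviation of every feature within each class
per_class = features.groupby(iris.target).agg(["mean", "std"])

print(per_class["petal length"])
```

If the class means of a feature are far apart compared with the within-class standard deviations, the pooled histogram of that feature is likely to show more than one mode.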

- Use Principal Component Analysis (PCA) to reduce the dimensionality to 2
- Plot the data using color for the class
- **Q:** Is the data separable in two dimensions?
- **Q:** What type of covariance would you use to model the clusters? Why?

In [None]:
data_reduced = PCA(n_components=2).fit_transform(data)
fig, ax = plt.subplots(figsize=(5, 3))
ax.scatter(data_reduced[:, 0], data_reduced[:, 1], c=label, cmap=plt.cm.Accent);
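
To support the questions on separability and covariance type, it may help to check how much variance the two components retain and how each class is shaped in the reduced space. This is a hedged sketch that assumes scikit-learn's bundled copy of Iris (`load_iris`) instead of the downloaded file; `X_reduced` and `class_covariances` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: the bundled copy of Iris stands in for data/iris.data
X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance kept by each principal component
explained = pca.explained_variance_ratio_
print("explained variance ratio:", explained, "total:", explained.sum())

# 2x2 covariance matrix of each class in the reduced space
class_covariances = {
    k: np.cov(X_reduced[y == k], rowvar=False) for k in np.unique(y)
}
for k, cov in class_covariances.items():
    print(f"class {k}:\n{cov}")
```

Off-diagonal terms that are clearly non-zero, or matrices that differ a lot between classes, would suggest that a shared or diagonal covariance is too restrictive.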

### Cluster analysis

- **Q:** Cluster the data with k-means for $K=1, 2, 3, 4$ and visualize the clusters in the reduced data space (PCA). Plot the sum of squared errors as a function of $K$. What would be an appropriate value of $K$ in this case?
- **Q:** Do the clusters coincide with the actual labels?
- **Q:** Repeat with the GMM with (a) diagonal covariance and (b) full covariance. Compare with the k-means results
- **Q:** (BONUS) Repeat with the Variational GMM with (a) diagonal covariance and (b) full covariance. Try an equal concentration prior. How many clusters are selected by the algorithm?

In [None]:
# Kmeans
help(KMeans)
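
As a possible starting point for the k-means questions, the sketch below fits `KMeans` for each $K$ and records the sum of squared errors, which scikit-learn exposes as `inertia_`. It assumes the bundled copy of Iris (`load_iris`) in place of the downloaded file. The adjusted Rand index is one of several ways to compare clusters with labels and is a choice made here, not something the activity prescribes; `n_init=10` and `random_state=0` are likewise illustrative settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# Assumption: the bundled copy of Iris stands in for data/iris.data
X, y = load_iris(return_X_y=True)

ks = [1, 2, 3, 4]
sse = []  # sum of squared errors for each K
ari = []  # agreement between clusters and true labels for each K
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)
    ari.append(adjusted_rand_score(y, km.labels_))

for k, s, a in zip(ks, sse, ari):
    print(f"K={k}: SSE={s:.2f}, ARI={a:.3f}")
```

Plotting `sse` against `ks` gives the curve asked for above, and `km.labels_` can replace `label` as the color argument of the PCA scatter plot to visualize the clusters.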

In [None]:
# GMM
help(GaussianMixture)
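
The GMM comparison could be sketched along the same lines. The example assumes the bundled copy of Iris (`load_iris`) in place of the downloaded file and again uses the adjusted Rand index as one possible agreement measure. `n_init=5` and `random_state=0` are illustrative settings intended to reduce the dependence on initialization.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Assumption: the bundled copy of Iris stands in for data/iris.data
X, y = load_iris(return_X_y=True)

ari = {}
for cov_type in ["diag", "full"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          n_init=5, random_state=0)
    clusters = gmm.fit_predict(X)  # hard assignment per sample
    ari[cov_type] = adjusted_rand_score(y, clusters)
    print(f"{cov_type}: ARI = {ari[cov_type]:.3f}")
```

Comparing these values with the k-means result for $K=3$ gives a first impression of whether the extra flexibility of the covariance model helps on this dataset.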

In [None]:
# Variational GMM
help(BayesianGaussianMixture)
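
For the bonus question, a rough sketch is to give the variational model more components than needed and count how many keep a non-negligible weight. Several choices below are assumptions rather than part of the activity: the bundled copy of Iris (`load_iris`) replaces the downloaded file, the "equal concentration prior" is read as a symmetric Dirichlet prior (`weight_concentration_prior_type="dirichlet_distribution"`), and `max_components` and `weight_threshold` are hypothetical values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import BayesianGaussianMixture

# Assumption: the bundled copy of Iris stands in for data/iris.data
X, _ = load_iris(return_X_y=True)

max_components = 6       # hypothetical upper bound on the number of clusters
weight_threshold = 0.01  # hypothetical cutoff for an "active" component

active = {}
for cov_type in ["diag", "full"]:
    bgm = BayesianGaussianMixture(
        n_components=max_components,
        covariance_type=cov_type,
        weight_concentration_prior_type="dirichlet_distribution",
        max_iter=500,
        random_state=0,
    ).fit(X)
    # Components whose mixing weight stays above the cutoff
    active[cov_type] = int(np.sum(bgm.weights_ > weight_threshold))
    print(cov_type, np.round(bgm.weights_, 3), "active:", active[cov_type])
```

The count depends on the cutoff and on `weight_concentration_prior`, so it is worth varying both before drawing conclusions.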