# Exercise: Clustering methods with real data

The aim of this exercise is to apply k-means, hierarchical clustering, and Gaussian mixture model (GMM) clustering to several multivariate datasets.

## A. Auto MPG dataset:

A dataset of cars with the following features:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete (Japan, USA, Europe)
9. car name: string (unique for each instance)

You can load the data from different sources:

- UCI repository: https://archive.ics.uci.edu/ml/datasets/auto+mpg
- Kaggle datasets: https://www.kaggle.com/uciml/autompg-dataset
- Built-in data from the seaborn library: https://seaborn.pydata.org/generated/seaborn.load_dataset.html

**NOTE**: It is important to remove the non-numerical variables before applying the different clustering methods. It is also important to remove the observations that contain a NaN. Here is an example of how to proceed:


```python
import pandas as pd
import seaborn as sns

mpg = sns.load_dataset("mpg")
mpg_num = mpg.select_dtypes(include='number')  # keep only numeric variables
mpg_num_nonans = mpg_num.dropna()  # remove observations with NaNs

print('Original data = {}'.format(mpg.shape))
print('Numerical variables = {}'.format(mpg_num.shape))
print('dataset without NaNs = {}'.format(mpg_num_nonans.shape))
```

```
Original data = (398, 9)
Numerical variables = (398, 7)
dataset without NaNs = (392, 7)
```
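
Once the data is numeric and free of NaNs, the three methods named in this exercise can be run side by side. The following is only a minimal sketch, not a prescribed solution: the helper name `cluster_three_ways`, the choice of three clusters, Ward linkage, and the standardization step are all assumptions made for illustration, and a small synthetic table stands in for `mpg_num_nonans` so the snippet runs without downloading anything.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def cluster_three_ways(df, n_clusters=3, random_state=0):
    """Hypothetical helper: one column of cluster labels per method."""
    # Assumption: features are standardized first so that variables with
    # large ranges (e.g. weight) do not dominate the distances.
    X = StandardScaler().fit_transform(df)
    labels = {
        "kmeans": KMeans(
            n_clusters=n_clusters, n_init=10, random_state=random_state
        ).fit_predict(X),
        "hierarchical": AgglomerativeClustering(
            n_clusters=n_clusters, linkage="ward"
        ).fit_predict(X),
        "gmm": GaussianMixture(
            n_components=n_clusters, random_state=random_state
        ).fit_predict(X),
    }
    return pd.DataFrame(labels, index=df.index)


# Synthetic stand-in for mpg_num_nonans: three well-separated groups.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 3)) for c in (0.0, 5.0, 10.0)]),
    columns=["a", "b", "c"],
)
demo_labels = cluster_three_ways(demo, n_clusters=3)
print(demo_labels.nunique())
```

To try it on the real data, `cluster_three_ways(mpg_num_nonans)` should work in the same way; the number of clusters is something to explore rather than a given.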


## B. The Iris dataset: 

This is arguably the most famous dataset in machine learning. It contains observations of iris flowers with the following attributes:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm

In addition, the dataset includes a class label indicating the species, which is one of three classes:

5. class:
   - Iris Setosa
   - Iris Versicolour
   - Iris Virginica

**NOTE**: Since clustering methods are unsupervised, you should discard the class label when applying clustering algorithms. 

You can get the data from multiple sources:

- UCI dataset: https://archive.ics.uci.edu/ml/datasets/iris
- Kaggle: https://www.kaggle.com/arshid/iris-flower-dataset
- Built-in datasets from seaborn: https://seaborn.pydata.org/generated/seaborn.load_dataset.html
- Built-in data from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
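
As an illustration of discarding the class label, here is a minimal sketch that uses the sklearn loader listed above. The variable names, the use of k-means with three clusters, and the adjusted Rand index as an after-the-fact comparison are assumptions for this example; the species labels are kept aside and never passed to the clustering algorithm.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris(as_frame=True)
X_iris = iris.data    # the four numeric features, without the label
y_iris = iris.target  # species labels, set aside and not used for fitting

# Cluster on the features only.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
iris_clusters = kmeans.fit_predict(X_iris)

# The labels are used only afterwards, to see how well clusters match species.
ari = adjusted_rand_score(y_iris, iris_clusters)
print(f"Adjusted Rand index vs. species: {ari:.2f}")
```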


## C. Choose it yourself:

You can choose any dataset that interests you from Kaggle or the UCI repository. Make sure it meets the following requirements:

- Keep only numerical features, and avoid numerically coded categorical features that take only a few discrete values
- Remove observations with missing values or NaNs
- Make sure you take out the class label before you cluster data 
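
The requirements above can be sketched as a small pandas helper. This is only one possible approach: the function name `prepare_for_clustering`, the `label_col` argument, and the `min_unique=10` threshold for deciding that a numeric column is really categorical are all hypothetical choices that you should adapt to your dataset.

```python
import numpy as np
import pandas as pd


def prepare_for_clustering(df, label_col=None, min_unique=10):
    """Hypothetical helper applying the three requirements listed above."""
    # Take out the class label, if one is given.
    out = df.drop(columns=[label_col]) if label_col is not None else df.copy()
    # Keep only numerical features.
    out = out.select_dtypes(include="number")
    # Assumption: numeric columns with fewer than `min_unique` distinct
    # values are treated as categorical codes and dropped.
    keep = [c for c in out.columns if out[c].nunique() >= min_unique]
    # Remove observations with missing values.
    return out[keep].dropna()


# Toy table, invented for illustration only.
toy = pd.DataFrame({
    "length": np.linspace(1.0, 3.0, 20),
    "width": np.linspace(0.5, 1.5, 20),
    "n_doors": [2, 4] * 10,          # numeric, but only two distinct values
    "colour": ["red", "blue"] * 10,  # non-numeric
    "label": [0, 1] * 10,            # class label
})
toy.loc[3, "width"] = np.nan         # one observation with a missing value

clean = prepare_for_clustering(toy, label_col="label")
print(clean.shape)  # (19, 2)
```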
