# Clustering Exercise  
  
In the `clustering.ipynb` notebook, you were shown how dimensionality reduction and clustering can be used for exploratory data analysis. In this exercise, you will carry out this analysis yourself on a dataset that we have synthetically generated.
  
You are given measurements for 300 patients. The data contains a number of patient subtypes. Your task is to identify these subtypes and interpret what they may correspond to clinically.
  
You can use the skills and code from the `clustering.ipynb` notebook to help you with this exercise. If you have any questions, put your hand up and a course instructor will come over to help you.

## Method  
  
Follow these steps to identify the clusters:
- Use `pandas` to read in the `synthetic_clusters.csv` dataset. It contains a number of measurements for each of the 300 patients.
- Use UMAP to project this high-dimensional dataset down to 2 dimensions for visualisation. Make sure to record the `n_neighbors` and `min_dist` values that you use.
- On the standardised data in its original dimensions (not the UMAP projection), perform K-Means or hierarchical clustering (your choice). Vary the number of clusters to identify a suitable number.
- Use the UMAP projection to visualise how these different numbers of clusters appear in the projected space. What seems to be the correct number of clusters?
- Interpret the different clusters: for example, compare the average glucose level or BMI. What might they correspond to clinically?
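
As a rough illustration of the clustering step above, the sketch below varies the number of clusters on toy data rather than on `synthetic_clusters.csv`. The toy data from `make_blobs`, the chosen cluster centres, the range of `k`, and the use of the silhouette score as a selection heuristic are all assumptions made for illustration. The exercise itself only asks you to vary the number of clusters and judge the result, so treat this as one possible approach, not the expected solution.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the exercise data (assumption): 300 points in 3 separated groups
centers = np.array([[0, 0, 0, 0], [8, 0, 8, 0], [0, 8, 0, 8]], dtype=float)
X_raw, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# K-Means: try several values of k and score each clustering
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest silhouette score (a heuristic, not a guarantee)
best_k = max(scores, key=scores.get)
print(scores, best_k)

# Hierarchical alternative: build a Ward linkage tree, then cut it into best_k clusters
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=best_k, criterion="maxclust")
```

On real data the scores are rarely this clear-cut, which is why the next step compares candidate clusterings visually in the UMAP projection.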

We have got you started below.

In [None]:
# Installing the UMAP and seaborn libraries (if missing) to make sure we can use them

src_dir = "../../../src"
import sys
sys.path.append(src_dir)

from install_if_missing import install_if_missing

install_if_missing("umap-learn==0.5.1", verbose=True)
install_if_missing("seaborn", verbose=True)

In [None]:
# Importing any libraries that we need to solve the problem

import pandas as pd 
import numpy as np 

import umap.umap_ as umap
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

from sklearn.preprocessing import StandardScaler

For clustering and UMAP to give meaningful results, all of the features have to be on the same scale. Both methods rely on distances between patients, so features with naturally large values (like heart rate) would otherwise influence the result more than features with naturally small values (like BMI). We rescale the data using `StandardScaler` from the `sklearn` library. This performs a z-score normalisation: for each feature, it subtracts the mean and divides by the standard deviation, so that every feature has a mean of 0 and a standard deviation of 1. This puts all of the features on the same scale without changing the shape of their distributions.
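
To make the z-score concrete, here is a minimal sketch on made-up numbers. The column names `heart_rate` and `bmi` and their values are hypothetical and are not taken from the exercise dataset. The sketch checks that `StandardScaler` gives the same result as subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements on very different scales (not the exercise data)
toy = pd.DataFrame({
    "heart_rate": [60.0, 72.0, 85.0, 110.0],
    "bmi": [19.5, 24.0, 27.5, 33.0],
})

scaled = StandardScaler().fit_transform(toy)  # returns a NumPy array

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Wrapping the result back into a DataFrame, for example with `pd.DataFrame(scaled, columns=toy.columns)`, keeps the column names available when you interpret the clusters later.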

In [None]:
df_raw = pd.read_csv("dataset/synthetic_clusters.csv")

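# Standardize the data (z-score each column)
scaler = StandardScaler()
# Note: fit_transform returns a NumPy array, not a DataFrame;
# the column names remain available in df_raw.columns
df = scaler.fit_transform(df_raw)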

In [None]:
# The first step is to use UMAP to project the data into 2D

