# Assignment
To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the tasks below, and plan to discuss them with your mentor. You can also take a look at these example solutions.

1. Get the silhouette coefficient of the two-cluster k-means solution. You'll notice that it is greater than the silhouette coefficient above, where the number of clusters is three. We know that the Iris dataset consists of three classes. How do you explain that the solution whose number of clusters matches the correct number of classes has a lower silhouette score than a solution with a different number of clusters?

2. In this assignment, you'll be working with the heart disease dataset from the UC Irvine Machine Learning Repository.

Load the dataset from Thinkful's database. Here are the credentials you can use to connect to the database:

        postgres_user = 'dsbc_student'
        postgres_pw = '7*.8G9QH21'
        postgres_host = '142.93.121.174'
        postgres_port = '5432'
        postgres_db = 'heartdisease'
        
The dataset needs some preprocessing, so apply the following code before working with it:

Define the features and the outcome

        X = heartdisease_df.iloc[:, :13]
        y = heartdisease_df.iloc[:, 13]

Replace missing values (marked by ?) with a 0

        X = X.replace(to_replace='?', value=0)

Binarize y so that 1 means heart disease diagnosis and 0 means no diagnosis

        y = np.where(y > 0, 1, 0)
        
Here, X holds the features and y holds the labels. If y is equal to 1, the corresponding patient has heart disease; if y is equal to 0, the patient doesn't have heart disease.
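As a quick sanity check on the binarization described above, the sketch below applies `np.where` to a small made-up label array. It assumes the raw target uses 0 for "no disease" and 1 through 4 for increasing degrees of disease, as in the UCI documentation; the `raw` and `y_demo` names are hypothetical and are not part of the assignment code.

```python
import numpy as np

# Hypothetical raw labels: 0 = no disease, 1-4 = disease present (assumed encoding).
raw = np.array([0, 2, 0, 1, 4, 3])

# Every positive value collapses to 1 (diagnosis); 0 stays 0 (no diagnosis).
y_demo = np.where(raw > 0, 1, 0)
print(y_demo)  # [0 1 0 1 1 1]
```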

1. Apply GMM to the heart disease data by setting n_components=2. Get the ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the assignments of the previous checkpoints. Which algorithm performs better?

2. GMM implementation of scikit-learn has a parameter called covariance_type. This parameter determines the type of covariance parameters to use. Specifically, there are four types you can specify:

    1. full: This is the default. Each component has its own general covariance matrix.
    2. tied: All components share the same general covariance matrix.
    3. diag: Each component has its own diagonal covariance matrix.
    4. spherical: Each component has its own single variance.
    
Try all of these. Which one performs best in terms of ARI and silhouette scores?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn import datasets, metrics
from sqlalchemy import create_engine

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'heartdisease'

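engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
# Engine.execute() was removed in SQLAlchemy 2.0, so let pandas run the query instead.
heart_disease = pd.read_sql_query('SELECT * FROM heartdisease', con=engine)
# Close the connection pool once the data is loaded.
engine.dispose()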

In [3]:
heartdisease_df = pd.DataFrame(heart_disease)
X = heartdisease_df.iloc[:, :13]
y = heartdisease_df.iloc[:, 13]
X = X.replace(to_replace='?', value=0)
# 1 = heart disease diagnosis, 0 = no diagnosis (ARI is unaffected by which label is which)
y = np.where(y > 0, 1, 0)

In [6]:
X_std = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_std)

# #1
Apply GMM to the heart disease data by setting n_components=2. Get the ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the assignments of the previous checkpoints. Which algorithm performs better?

In [7]:
gmm_cluster = GaussianMixture(n_components=2)

In [8]:
clusters = gmm_cluster.fit_predict(X_std)

In [9]:
metrics.adjusted_rand_score(y, clusters)

0.4207322145049338
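One point worth keeping in mind when reading the ARI: the cluster ids that GMM assigns are arbitrary, so cluster 0 need not correspond to class 0. The adjusted Rand index only looks at which points are grouped together, so swapping the ids does not change the score. A minimal illustration on made-up labels (the `truth`, `same_ids` and `swapped_ids` lists are hypothetical):

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]
same_ids = [0, 0, 0, 1, 1, 1]     # same grouping, same ids
swapped_ids = [1, 1, 1, 0, 0, 0]  # same grouping, ids swapped

ari_same = adjusted_rand_score(truth, same_ids)
ari_swapped = adjusted_rand_score(truth, swapped_ids)
print(ari_same, ari_swapped)
```

Both calls describe a perfect match, so the two scores are identical.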

In [10]:
metrics.silhouette_score(X_std, clusters)

0.16118591340148433
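The task also asks for a comparison with k-means and hierarchical clustering. The sketch below shows one way such a comparison could be laid out. It runs on synthetic data from `make_blobs`, not on the heart disease data, so the numbers it prints say nothing about this dataset. The `X_demo`, `y_demo`, `models` and `scores` names, the Ward linkage, and the `random_state` values are all assumptions made for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_std and y (NOT the heart disease data).
X_demo, y_demo = make_blobs(n_samples=300, centers=2, n_features=5, random_state=123)
X_demo = StandardScaler().fit_transform(X_demo)

# Three two-cluster models; all of them expose fit_predict.
models = {
    'k-means': KMeans(n_clusters=2, n_init=10, random_state=123),
    'agglomerative (ward)': AgglomerativeClustering(n_clusters=2, linkage='ward'),
    'GMM': GaussianMixture(n_components=2, random_state=123),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_demo)
    # ARI compares against the known labels; silhouette uses only the features.
    scores[name] = (adjusted_rand_score(y_demo, labels),
                    silhouette_score(X_demo, labels))
    print('{}: ARI={:.3f}, silhouette={:.3f}'.format(name, *scores[name]))
```

To compare the algorithms on the actual assignment data, the same loop could be run with `X_std` and `y` in place of the synthetic arrays.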

# #2
GMM implementation of scikit-learn has a parameter called covariance_type. This parameter determines the type of covariance parameters to use. Specifically, there are four types you can specify:

1. full: This is the default. Each component has its own general covariance matrix.
2. tied: All components share the same general covariance matrix.
3. diag: Each component has its own diagonal covariance matrix.
4. spherical: Each component has its own single variance.

Try all of these. Which one performs best in terms of ARI and silhouette scores?
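To make the four options more concrete, the sketch below fits a two-component GMM to a small synthetic dataset with three features and prints the shape of the fitted `covariances_` attribute for each type. The data and the `X_demo` and `shapes` names are made up for illustration; only the shapes matter here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two groups of 100 points in 3 dimensions (illustration only).
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0, 1, size=(100, 3)),
                    rng.normal(4, 1, size=(100, 3))])

shapes = {}
for cov_type in ['full', 'tied', 'diag', 'spherical']:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=0).fit(X_demo)
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])

# full      -> (2, 3, 3): one full matrix per component
# tied      -> (3, 3):    one full matrix shared by all components
# diag      -> (2, 3):    one variance per feature per component
# spherical -> (2,):      one variance per component
```

The shapes show how many free covariance parameters each option estimates, which is one way to think about the flexibility each type allows.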

In [11]:
cov_types = ['full', 'tied', 'diag', 'spherical']

In [13]:
for cov_type in cov_types:
    clusters = GaussianMixture(n_components=2, covariance_type=cov_type).fit_predict(X_std)
    print('Covariance Type: {}'.format(cov_type))
    print('ARI: {}'.format(metrics.adjusted_rand_score(y, clusters)))
    print('Silhouette: {}'.format(metrics.silhouette_score(X_std, clusters)))
    print()

Covariance Type: full
ARI: 0.4207322145049338
Silhouette: 0.16118591340148433

Covariance Type: tied
ARI: 0.4558104186161976
Silhouette: 0.1671559472293439

Covariance Type: diag
ARI: 0.3870024156200561
Silhouette: 0.1604139815113049

Covariance Type: spherical
ARI: 0.20765243525722465
Silhouette: 0.12468753110276876



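On both metrics, the 'tied' covariance type scores highest (ARI ≈ 0.456, silhouette ≈ 0.167), followed by 'full', 'diag', and 'spherical' in that order. The silhouette scores are low for every covariance type, which suggests that none of these solutions produces well-separated clusters.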