# 04: Cluster evaluation and interpretation

**Author:** Grace Akatsu

**Class:** CPBS 7602, Fall 2025

---
## Overview
This notebook evaluates and interprets the k-means and DBSCAN clustering performed in notebooks 02 and 03.

## Table of Contents
*   [Import libraries](#import_libraries)
*   [Set paths and seed](#set_paths)
*   [Read in data](#read_data)
*   [Cluster evaluation: external metrics](#internal)
    *   [Adjusted Rand Index (ARI)](#ari)
    *   [Normalized Mutual Information (NMI)](#nmi)
*   [Cluster evaluation: internal metrics](#external)
    *   [Silhouette coefficient](#silhouette)
    *   [Calinski–Harabasz Index (Variance Ratio Criterion)](#calinski)
    *   [Davies–Bouldin Index](#davies)


---

## Import libraries <a class="anchor" id="import_libraries"></a>

In [1]:
import numpy as np
import pandas as pd
import os
from sklearn.metrics import adjusted_rand_score as ari
from sklearn.metrics import normalized_mutual_info_score as nmi 
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import matplotlib.pyplot as plt

## Set paths and seed <a class="anchor" id="set_paths"></a>

In [2]:
DATA_FILE = "/Users/akatsug/OneDrive - The University of Colorado Denver/CPBS_7602_big_data_in_biomedical_informatics/assignment01/clean_data/gtex_top10_tissues_top5000_variable_genes_standardized.csv"
K_MEANS_OUTPUTS = "/Users/akatsug/OneDrive - The University of Colorado Denver/CPBS_7602_big_data_in_biomedical_informatics/assignment01/k_means_outputs"
DBSCAN_OUTPUTS = "/Users/akatsug/OneDrive - The University of Colorado Denver/CPBS_7602_big_data_in_biomedical_informatics/assignment01/DBSCAN_outputs"

In [3]:
np.random.seed(0)

## Read in data <a class="anchor" id="read_data"></a>

In [4]:
data_raw = pd.read_csv(
    DATA_FILE,
    index_col="SAMPID"
)

data_raw.head()

SAMPID,Tissue,ENSG00000244734.3,ENSG00000188536.12,ENSG00000198804.2,ENSG00000198938.2,ENSG00000163220.10,ENSG00000198899.2,ENSG00000198886.2,ENSG00000198712.1,ENSG00000143632.14,...,ENSG00000261236.7,ENSG00000188112.8,ENSG00000170035.15,ENSG00000024862.17,ENSG00000213619.9,ENSG00000176087.14,ENSG00000115596.3,ENSG00000138386.16,ENSG00000182872.15,ENSG00000070669.16
GTEX-1117F-0226-SM-5GZZ7,Adipose - Subcutaneous,-0.317682,-0.320624,-0.822484,-0.623593,-0.278153,-0.845429,-0.920044,-0.94437,-0.332153,...,-0.259958,0.087855,0.714803,0.76994,0.143898,-0.47663,-0.245165,-0.547999,-0.027921,-0.037272
GTEX-1117F-0426-SM-5EGHI,Muscle - Skeletal,-0.319849,-0.32218,0.219063,1.660759,-0.278587,1.28664,0.307966,0.663645,1.117576,...,1.470513,-0.589025,-0.033767,-0.57209,0.467826,-0.715005,-0.381079,-1.292842,-0.339644,-1.050707
GTEX-1117F-0526-SM-5EGHJ,Artery - Tibial,-0.31943,-0.321842,-0.872736,-0.647149,-0.278998,-0.710659,-0.839425,-0.911312,-0.332987,...,-0.509545,-0.527738,-0.000939,-0.229877,-0.396138,-1.214898,-0.315997,-0.447965,1.078165,0.048887
GTEX-1117F-2926-SM-5GZYI,Skin - Not Sun Exposed (Suprapubic),-0.320043,-0.321507,-0.525812,-0.609139,-0.276681,-0.63397,-0.646396,-0.870145,-0.332046,...,-0.344541,0.708257,-0.008799,0.094763,-0.049997,-0.492368,-0.070677,1.017824,0.276392,0.139678
GTEX-111CU-0226-SM-5GZXC,Thyroid,-0.321842,-0.323617,-0.064829,0.022578,-0.267405,0.605461,0.046808,0.438473,-0.332147,...,0.941758,0.128075,0.421663,0.986367,0.335941,1.52109,-0.36826,-0.127483,0.448233,5.554269


In [5]:
k_means_outputs = pd.read_csv(
    os.path.join(K_MEANS_OUTPUTS, "kmeans_k10_cluster_assignments.csv"),
    index_col="SAMPID"
)

dbscan_outputs = pd.read_csv(
    os.path.join(DBSCAN_OUTPUTS, "dbscan_cluster_assignments.csv"),
    index_col="SAMPID"
)

# Replace the dbscan cluster assignments that are "-1" (noise) with NA
dbscan_outputs['dbscan_cluster_assignments'] = dbscan_outputs['dbscan_cluster_assignments'].replace(-1, np.nan)

# Merge the two clustering outputs on SAMPID
clustering_outputs = pd.merge(k_means_outputs, dbscan_outputs, on='SAMPID', how='inner')

clustering_outputs.head()

SAMPID,k10_cluster_assignemnts,dbscan_cluster_assignments
GTEX-1117F-0226-SM-5GZZ7,1,
GTEX-1117F-0426-SM-5EGHI,2,
GTEX-1117F-0526-SM-5EGHJ,7,
GTEX-1117F-2926-SM-5GZYI,3,
GTEX-111CU-0226-SM-5GZXC,8,0.0


In [6]:
# Now combine to make a final dataframe with ground truth and clustering outputs
data = pd.merge(
    data_raw['Tissue'],
    clustering_outputs,
    on='SAMPID',
    how='inner'
)

data.head()

SAMPID,Tissue,k10_cluster_assignemnts,dbscan_cluster_assignments
GTEX-1117F-0226-SM-5GZZ7,Adipose - Subcutaneous,1,
GTEX-1117F-0426-SM-5EGHI,Muscle - Skeletal,2,
GTEX-1117F-0526-SM-5EGHJ,Artery - Tibial,7,
GTEX-1117F-2926-SM-5GZYI,Skin - Not Sun Exposed (Suprapubic),3,
GTEX-111CU-0226-SM-5GZXC,Thyroid,8,0.0

## Cluster evaluation: external metrics (comparison with ground truth) <a class="anchor" id="internal"></a>

We first compare the clusters produced by k-means and DBSCAN against the ground truth (tissue of origin). Metrics that score a clustering against known labels, such as ARI and NMI, are external metrics. We can use them here only because the tissue of origin is known for every sample, which is not always the case.

### Adjusted Rand Index (ARI) <a class="anchor" id="ari"></a>

In [None]:
# Calculate the Adjusted Rand Index (ARI) between k-means clusters and the ground truth
# Note: the k-means column name is misspelled ("assignemnts") in the saved CSV
kmeans_ari = ari(data['Tissue'], data['k10_cluster_assignemnts'])
print(f"Adjusted Rand Index (K-Means vs Ground Truth): {kmeans_ari}")

# Calculate the ARI between DBSCAN clusters and the ground truth
# Only use the samples where DBSCAN assigned a cluster (i.e., not noise)
data_no_dbscan_noise = data.dropna(subset=['dbscan_cluster_assignments'])

dbscan_ari = ari(data_no_dbscan_noise['Tissue'], data_no_dbscan_noise['dbscan_cluster_assignments'])
print(f"Adjusted Rand Index (DBSCAN vs Ground Truth): {dbscan_ari}")

print("\nAn ARI close to 0 indicates agreement no better than chance (it can be slightly negative), while an ARI of 1 indicates perfect agreement between the clustering and the ground truth.")

Adjusted Rand Index (K-Means vs Ground Truth): 0.8440047343695338
Adjusted Rand Index (DBSCAN vs Ground Truth): 0.32326880409646136

An ARI close to 0 indicates agreement no better than chance (it can be slightly negative), while an ARI of 1 indicates perfect agreement between the clustering and the ground truth.
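
The DBSCAN score above is computed only on samples that were assigned to a cluster, so it does not reflect the samples labelled as noise. The toy example below is a minimal sketch of how much that choice can matter. The labels are made up for illustration and are not GTEx data. It compares the ARI with noise dropped against the ARI with noise kept as its own group, and reports the noise fraction.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy labels (not GTEx data): two true groups,
# and a DBSCAN-style labelling where -1 marks noise
truth = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
dbscan_labels = np.array([0, 0, -1, -1, 1, 1, 1, -1])

# Fraction of samples flagged as noise
is_noise = dbscan_labels == -1
noise_fraction = is_noise.mean()

# ARI on clustered samples only (the approach used in this notebook)
ari_noise_dropped = adjusted_rand_score(truth[~is_noise], dbscan_labels[~is_noise])

# ARI when noise is kept and treated as one extra cluster
ari_noise_kept = adjusted_rand_score(truth, dbscan_labels)

print(f"Noise fraction: {noise_fraction:.3f}")
print(f"ARI, noise dropped: {ari_noise_dropped:.3f}")
print(f"ARI, noise kept as a cluster: {ari_noise_kept:.3f}")
```

Reporting the noise fraction alongside the ARI makes it clear how many samples the score actually covers.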


### Normalized Mutual Information (NMI) <a class="anchor" id="nmi"></a>

In [None]:
# Calculate the Normalized Mutual Information (NMI) between k-means clusters and the ground truth
# Note: the k-means column name is misspelled ("assignemnts") in the saved CSV
kmeans_nmi = nmi(data['Tissue'], data['k10_cluster_assignemnts'])
print(f"Normalized Mutual Information (K-Means vs Ground Truth): {kmeans_nmi}")

# Calculate the NMI between DBSCAN clusters and the ground truth
# Again, only use samples where DBSCAN assigned a cluster
data_no_dbscan_noise = data.dropna(subset=['dbscan_cluster_assignments'])

dbscan_nmi = nmi(data_no_dbscan_noise['Tissue'], data_no_dbscan_noise['dbscan_cluster_assignments'])
print(f"Normalized Mutual Information (DBSCAN vs Ground Truth): {dbscan_nmi}")

print("\nAn NMI score of 0 indicates no mutual information, while a score of 1 indicates perfect agreement between the clustering and the ground truth.")

Normalized Mutual Information (K-Means vs Ground Truth): 0.9270642984573485
Normalized Mutual Information (DBSCAN vs Ground Truth): 0.6914949490964558

An NMI score of 0 indicates no mutual information, while a score of 1 indicates perfect agreement between the clustering and the ground truth.
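
NMI and ARI can rank the same clustering quite differently, because NMI is not adjusted for chance and is more forgiving when true groups are split into several pure clusters. The sketch below uses made-up labels (not GTEx data) in which each true group is split cleanly in two, and computes both scores on the same labelling.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical toy labels: two true groups of four samples each
truth = [0, 0, 0, 0, 1, 1, 1, 1]

# A clustering that splits each true group into two pure clusters
over_split = [0, 0, 1, 1, 2, 2, 3, 3]

# Both metrics compare the same pair of labellings
toy_nmi = normalized_mutual_info_score(truth, over_split)
toy_ari = adjusted_rand_score(truth, over_split)

print(f"NMI for the over-split clustering: {toy_nmi:.3f}")
print(f"ARI for the over-split clustering: {toy_ari:.3f}")
```

Every cluster here contains a single true group, so NMI stays fairly high, while ARI penalizes the pairs of same-group samples that were separated.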


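## Cluster evaluation: internal metrics (no ground truth required) <a class="anchor" id="external"></a>

Internal metrics score a clustering using only the expression data and the cluster labels, so they can be computed even when the tissue of origin is unknown.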

### Silhouette coefficient <a class="anchor" id="silhouette"></a>

In [15]:
# Align the expression matrix to the samples in `data` and drop the label column
features = data_raw.loc[data.index].drop(columns='Tissue')
# Note: the k-means column name is misspelled ("assignemnts") in the saved CSV
kmeans_silhouette = silhouette_score(features, data['k10_cluster_assignemnts'])
print(f"K-Means Silhouette Score: {kmeans_silhouette}")

print("\nA silhouette score close to 1 indicates that samples are well clustered, while a score close to -1 indicates that samples may have been assigned to the wrong cluster.")
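
The same score could also be computed for DBSCAN, provided the noise points are excluded first, as was done for ARI and NMI. The sketch below is one way to do that under stated assumptions: it uses synthetic blobs from `make_blobs` as a stand-in for the expression matrix, and the `eps` and `min_samples` values are illustrative choices for this toy data, not the parameters used in notebook 03.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the expression matrix: three well-separated 2-D blobs
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# eps and min_samples are illustrative values chosen for this toy data only
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# Keep only samples assigned to a cluster (DBSCAN marks noise as -1)
clustered = labels != -1
n_clusters = len(set(labels[clustered]))

# silhouette_score requires at least two distinct cluster labels
if n_clusters >= 2:
    dbscan_silhouette = silhouette_score(X[clustered], labels[clustered])
    print(f"DBSCAN silhouette (noise excluded, {n_clusters} clusters): {dbscan_silhouette:.3f}")
else:
    dbscan_silhouette = None
    print("Fewer than two clusters found; silhouette is undefined.")
```

Because noise is excluded, this score describes only the clustered samples and is not directly comparable to a k-means score computed on every sample.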

### Calinski–Harabasz Index (Variance Ratio Criterion) <a class="anchor" id="calinski"></a>

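In [None]:
# Note: the k-means column name is misspelled ("assignemnts") in the saved CSV
kmeans_ch = calinski_harabasz_score(data_raw.loc[data.index].drop(columns='Tissue'), data['k10_cluster_assignemnts'])
print(f"K-Means Calinski-Harabasz Score: {kmeans_ch}")

print("\nA higher Calinski-Harabasz score indicates denser, better-separated clusters; the index has no upper bound.")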
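
To make the "variance ratio" in the name concrete, the sketch below computes the index by hand as the between-cluster dispersion divided by the within-cluster dispersion, each scaled by its degrees of freedom, and checks the result against `calinski_harabasz_score`. The data are random toy points, not GTEx data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian groups of 20 points in 3 dimensions
X = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 3)),
    rng.normal(5.0, 1.0, size=(20, 3)),
])
labels = np.repeat([0, 1], 20)

n_samples = X.shape[0]
cluster_ids = np.unique(labels)
k = len(cluster_ids)
overall_mean = X.mean(axis=0)

between = 0.0  # between-cluster dispersion
within = 0.0   # within-cluster dispersion
for c in cluster_ids:
    members = X[labels == c]
    centroid = members.mean(axis=0)
    # Squared distance of the centroid from the overall mean, weighted by cluster size
    between += len(members) * np.sum((centroid - overall_mean) ** 2)
    # Squared distances of the members from their own centroid
    within += np.sum((members - centroid) ** 2)

ch_manual = (between / (k - 1)) / (within / (n_samples - k))
ch_sklearn = calinski_harabasz_score(X, labels)

print(f"Manual: {ch_manual:.3f}, scikit-learn: {ch_sklearn:.3f}")
```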

### Davies–Bouldin Index <a class="anchor" id="davies"></a>

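In [None]:
# Note: the k-means column name is misspelled ("assignemnts") in the saved CSV
kmeans_db = davies_bouldin_score(data_raw.loc[data.index].drop(columns='Tissue'), data['k10_cluster_assignemnts'])
print(f"K-Means Davies-Bouldin Score: {kmeans_db}")

print("\nA Davies-Bouldin score closer to 0 indicates better clustering; 0 is the lowest possible value.")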
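
Since lower Davies-Bouldin values are better, one common use of the index is to compare several candidate values of k. The sketch below illustrates this on synthetic blobs from `make_blobs`, not on the GTEx data. The range of k and the `KMeans` settings are illustrative assumptions rather than the settings used in notebook 02.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in data: three well-separated 2-D blobs
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit k-means for several candidate k and record the Davies-Bouldin score
db_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db_by_k[k] = davies_bouldin_score(X, labels)

# The lowest score marks the most compact, best-separated solution
best_k = min(db_by_k, key=db_by_k.get)

for k, score in db_by_k.items():
    print(f"k={k}: Davies-Bouldin = {score:.3f}")
print(f"Lowest Davies-Bouldin score at k={best_k}")
```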