## Clustering
### Author: Thi Quy T. Tran
### UH ID: 2021505

In [63]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from collections import Counter


### Function for MinMax feature normalization
The input `x` is the raw data as a 2-D array of shape `(number of data points, number of features)`.

The output `x_norm` is the normalized data of the input `x` with the same shape as the input.

This function will be used for normalizing data before using DBSCAN for clustering.


In [64]:
df = pd.read_csv("clinical_records_dataset.csv") 

features = df.drop(columns=["time", "DEATH_EVENT"])

y_true = df["DEATH_EVENT"].to_numpy()

In [65]:
def feature_norm(x):
    # x is a 2-D array of shape (number of data points, number of features)
    eps = np.finfo(float).eps
    x_norm = x - np.expand_dims(x.min(0), axis=0)
    x_norm = x_norm / (np.expand_dims((x.max(0) - x.min(0)), axis=0) + eps)
    
    return x_norm

normalized_data = feature_norm(features.to_numpy())
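
As a quick sanity check of the min-max scaling, the sketch below applies the same formula to a small made-up array. The name `minmax_toy` and the values are illustrative only and do not come from the dataset. Each column should land in the range [0, 1], and a constant column should map to 0 rather than trigger a division by zero, because of the `eps` term.

```python
import numpy as np

def minmax_toy(x):
    # Same formula as feature_norm: (x - min) / (max - min + eps), per column
    eps = np.finfo(float).eps
    col_min = x.min(axis=0, keepdims=True)
    col_max = x.max(axis=0, keepdims=True)
    return (x - col_min) / (col_max - col_min + eps)

# Made-up data: 3 points, 3 features; the last feature is constant
toy = np.array([[1.0, 10.0, 5.0],
                [2.0, 20.0, 5.0],
                [3.0, 30.0, 5.0]])

toy_norm = minmax_toy(toy)
print(toy_norm.round(3))
```

Scaling matters for distance-based methods because, without it, features measured on large scales dominate the Euclidean distance.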


### Task 1: Function for computing purity

In [66]:
def compute_purity(y_true, y_pred):
    # Purity = (sum over clusters of the majority-class count) / (number of points)
    # Convert both inputs so that boolean masking also works for plain lists
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    total_correct = 0
    unique_clusters = np.unique(y_pred)

    for cluster in unique_clusters:
        cluster_indices = np.where(y_pred == cluster)[0]
        true_labels_in_cluster = y_true[cluster_indices]
        class_counts = Counter(true_labels_in_cluster)
        majority_class_count = max(class_counts.values())
        total_correct += majority_class_count

    purity = total_correct / len(y_true)
    return purity
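
To make the purity definition concrete, here is a minimal check on made-up labels that can be verified by hand. The helper `purity_toy` is a hypothetical stand-in that follows the same logic as `compute_purity`; the labels are invented for illustration.

```python
import numpy as np
from collections import Counter

def purity_toy(y_true, y_pred):
    # Sum the majority-class count of every cluster, then divide by n
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]
        correct += max(Counter(members.tolist()).values())
    return correct / len(y_true)

# Made-up labels: cluster 0 holds classes [0, 0, 1], cluster 1 holds [1, 1, 1]
toy_true = [0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1]

toy_purity = purity_toy(toy_true, toy_pred)
print(toy_purity)  # (2 + 3) / 6
```

Cluster 0 contributes 2 correctly grouped points and cluster 1 contributes 3, so the purity is 5/6.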


### Task 2: K-means Clustering with k = 2

Apply K-means clustering with k=2

In [67]:
kmeans = KMeans(n_clusters=2, random_state=0)
y_pred = kmeans.fit_predict(features)

# Add cluster labels to the dataframe for analysis (optional)
df["Cluster"] = y_pred



Calculate percentage of data points in each cluster

In [68]:
cluster_counts = Counter(y_pred)
total_points = len(y_pred)

for cluster, count in cluster_counts.items():
    percentage = (count / total_points) * 100
    print(f"Cluster {cluster}: {percentage:.2f}% of data points")

Cluster 0: 78.26% of data points
Cluster 1: 21.74% of data points


Compute overall purity

In [69]:
overall_purity = compute_purity(y_true, y_pred)
print(f"Overall Purity: {overall_purity:.4f}")

Overall Purity: 0.6789


Calculate purity for each cluster

In [70]:
for cluster in cluster_counts.keys():
    # Get indices of points in the current cluster
    cluster_indices = np.where(y_pred == cluster)[0]
    # True labels for points in this cluster
    true_labels_in_cluster = y_true[cluster_indices]
    # Majority class count
    class_counts = Counter(true_labels_in_cluster)
    majority_class_count = max(class_counts.values())
    # Purity for this cluster
    cluster_purity = majority_class_count / len(true_labels_in_cluster)
    print(f"Cluster {cluster} Purity: {cluster_purity:.4f}")


Cluster 0 Purity: 0.6923
Cluster 1 Purity: 0.6308


Cluster with the highest purity
- Cluster 0 has the highest purity (0.6923), so it is the more homogeneous of the two clusters with respect to the true class labels.

Analysis:

- In Cluster 0, about 69% of the data points belong to the majority class, which makes it the better-defined cluster.
- Cluster 1 has a lower purity of 0.6308, so its majority class is less dominant and the cluster is more mixed.

The gap between the two purities is small (about 0.06), and the overall purity of 0.6789 is the size-weighted average of the two (0.7826 × 0.6923 + 0.2174 × 0.6308 ≈ 0.6789). Roughly a third of the points in each cluster belong to the minority class, so K-means with k = 2 on the raw features separates the two classes only weakly. Cluster 0 may capture a group with somewhat more distinct characteristics, while Cluster 1 appears to be a more ambiguous group that is harder to separate with the features used for clustering.
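
One way to check how well the clusters line up with the classes is to tabulate the true labels per cluster and compare the purity with a single-cluster baseline. The sketch below uses made-up labels and a hypothetical helper `cluster_class_table`; with the notebook's variables, one would pass `y_true` and `y_pred` instead.

```python
import numpy as np

def cluster_class_table(y_true, y_pred):
    # Rows are clusters, columns are classes; each entry is a point count
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    clusters = np.unique(y_pred)
    classes = np.unique(y_true)
    table = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            table[i, j] = np.sum((y_pred == c) & (y_true == k))
    return clusters, classes, table

# Made-up labels: cluster 0 holds [0, 0, 0, 1], cluster 1 holds [0, 0, 1]
toy_true = np.array([0, 0, 0, 1, 0, 0, 1])
toy_pred = np.array([0, 0, 0, 0, 1, 1, 1])

clusters, classes, table = cluster_class_table(toy_true, toy_pred)
majority = classes[table.argmax(axis=1)]          # majority class per cluster
baseline = table.sum(axis=0).max() / table.sum()  # purity of one big cluster
purity = table.max(axis=1).sum() / table.sum()    # purity of the clustering
print(table)
print(majority, baseline, purity)
```

In this toy case both clusters have the same majority class, so the purity equals the baseline. When that happens, purity alone does not show that the clustering separates the classes.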

### Task 3: K-Means Clustering with Varying k and Evaluation

In [71]:
k_values = [2, 10, 30, 50, 100]
num_runs = 10
results = []

for k in k_values:
    purities = []
    silhouette_scores = []
    
    for run in range(num_runs):
        # Run K-means
        kmeans = KMeans(n_clusters=k, random_state=run)
        kmeans.fit(features)
        
        # Get predicted clusters and compute purity
        y_pred = kmeans.labels_
        purity = compute_purity(y_true, y_pred)
        purities.append(purity)
        
        # Compute Silhouette coefficient
        sil_score = silhouette_score(features, y_pred, metric='euclidean')
        silhouette_scores.append(sil_score)
    
    # Calculate average purity and average silhouette coefficient
    avg_purity = np.mean(purities)
    avg_silhouette = np.mean(silhouette_scores)
    
    # Store the results
    results.append([k, avg_purity, avg_silhouette])

# Create a dataframe to display the results
results_df = pd.DataFrame(results, columns=["k", "Purity", "Silhouette Coefficient"])

print(results_df)



     k    Purity  Silhouette Coefficient
0    2  0.678930                0.582893
1   10  0.685284                0.593803
2   30  0.703679                0.560786
3   50  0.719732                0.572983
4  100  0.765552                0.520991


Best for Purity: 
- k = 100 (Purity ~= 0.766). Purity rises steadily with k in the table, because smaller clusters are more likely to be dominated by a single class.

Best for Silhouette Coefficient: 
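- k = 10 (Silhouette ~= 0.594), indicating the best balance of cohesion and separation among the tested values. The margin over k = 2 (~= 0.583) is small, and the score is lowest at k = 100 (~= 0.521).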
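
For reference, the silhouette coefficient of a point is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below is a hand-rolled illustration on made-up 1-D data, not the routine used in the notebook (which calls `silhouette_score` from scikit-learn); the name `silhouette_toy` is hypothetical.

```python
import numpy as np

def silhouette_toy(x, labels):
    # Mean over all points of (b - a) / max(a, b); needs at least 2 clusters
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    scores = []
    for i in range(len(x)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # singleton cluster: score taken as 0
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Made-up data: two tight, well-separated groups on a line
toy_x = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_toy(toy_x, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_toy(toy_x, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

Labels that match the two groups give a score close to 1, while labels that cut across them give a negative score.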

Purity vs. k:
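- Purity increases with k because more clusters allow finer grouping, which improves class homogeneity within clusters. In the limit where every point is its own cluster, purity is 1 regardless of how meaningful the clustering is, so a high purity at large k can reflect over-partitioning rather than better structure. The lower silhouette coefficient at k = 100 is consistent with this.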
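
The dependence of purity on k can be reproduced even when the cluster assignment ignores the data. The following sketch uses made-up random labels and round-robin cluster ids (all names are illustrative). It relies on two general properties of purity: it never falls below the majority-class fraction, and it reaches 1.0 when each point is its own cluster.

```python
import numpy as np

def purity_toy(y_true, y_pred):
    # Sum of per-cluster majority counts divided by the number of points
    correct = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        correct += counts.max()
    return correct / len(y_true)

rng = np.random.default_rng(0)
n = 200
toy_true = rng.integers(0, 2, size=n)  # made-up binary class labels

purities = {}
for k in [1, 2, 10, 50, n]:
    # Round-robin cluster ids: the assignment ignores the data entirely
    toy_pred = np.arange(n) % k
    purities[k] = purity_toy(toy_true, toy_pred)
    print(f"k = {k:3d}: purity = {purities[k]:.3f}")
```

For this reason, purity is most informative when compared across clusterings with the same number of clusters, or when read together with an internal measure such as the silhouette coefficient.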

### Task 4: DBSCAN Clustering Experiments with Varying eps

In [72]:
def run_dbscan(eps_value):
    db = DBSCAN(eps=eps_value, min_samples=5, metric='euclidean')
    y_pred = db.fit_predict(normalized_data)
    
    # Count number of clusters and anomalies
    n_clusters = len(set(y_pred)) - (1 if -1 in y_pred else 0)
    n_anomalies = np.sum(y_pred == -1)
    
    # Compute purity
    purity = compute_purity(df["DEATH_EVENT"].to_numpy(), y_pred)
    
    return n_clusters, n_anomalies, purity

# Run DBSCAN for different eps values
eps_values = [0.3, 0.5, 0.7]
results = []

for eps_value in eps_values:
    n_clusters, n_anomalies, purity = run_dbscan(eps_value)
    results.append([eps_value, n_clusters, n_anomalies, purity])

# Create a DataFrame to display the results
results_df = pd.DataFrame(results, columns=["eps", "Number of Clusters", "Number of Anomalies", "Purity"])
print(results_df)

   eps  Number of Clusters  Number of Anomalies    Purity
0  0.3                  18                  146  0.688963
1  0.5                  22                   21  0.688963
2  0.7                  22                   13  0.695652


The best clustering result in terms of purity is obtained with eps = 0.7, which gives a purity of approximately 0.696:
- Purity is unchanged between eps = 0.3 and eps = 0.5 (~0.689 in both cases) and rises only slightly at eps = 0.7 (~0.696).
- A higher eps value lets more data points join clusters: the number of anomalies drops from 146 (eps = 0.3) to 21 (eps = 0.5) and 13 (eps = 0.7). The purity gain is small, however, and too high an eps may merge distinct groups into larger clusters, which can lower purity.
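
DBSCAN labels anomalies as -1, and `compute_purity` loops over every distinct label in `y_pred`, so the anomalies are scored as if they formed one additional cluster. A possible alternative, sketched here on made-up labels with a hypothetical helper `purity_with_noise_options`, is to also report purity over the clustered points only, together with the fraction of points that were clustered.

```python
import numpy as np

def purity_with_noise_options(y_true, y_pred, noise_label=-1):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    def _purity(t, p):
        # Sum of per-cluster majority counts divided by the number of points
        correct = 0
        for c in np.unique(p):
            _, counts = np.unique(t[p == c], return_counts=True)
            correct += counts.max()
        return correct / len(t)

    # Option 1: treat the noise points as one extra cluster
    as_cluster = _purity(y_true, y_pred)

    # Option 2: drop the noise points and score only the clustered ones
    keep = y_pred != noise_label
    clustered_only = _purity(y_true[keep], y_pred[keep]) if keep.any() else float("nan")

    coverage = keep.mean()  # fraction of points assigned to a cluster
    return as_cluster, clustered_only, coverage

# Made-up labels: two pure clusters plus four mixed noise points
toy_true = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
toy_pred = [0, 0, 0, 1, 1, 1, -1, -1, -1, -1]

as_cluster, clustered_only, coverage = purity_with_noise_options(toy_true, toy_pred)
print(as_cluster, clustered_only, coverage)
```

Reporting coverage alongside purity makes runs with very different anomaly counts, such as eps = 0.3 and eps = 0.7, easier to compare.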