# Clustering - Ensemble Methods

This section uses the dataset `new_df_without_outliers_copy_smote_resampled.xlsx` and proceeds in four steps:
- Train an Ensemble Model: Train an ensemble classifier (e.g., Random Forest or Gradient Boosting) to predict from the features whether a person has diabetes.
- Get Predictions: Use the trained model to predict, for each data point in the dataset, the class (diabetes or non-diabetes) or the class probabilities.
- Perform Clustering: Apply a clustering algorithm (e.g., K-means or hierarchical clustering) to the model's predictions. The code below clusters the predicted class probabilities.
- Evaluate Clustering: Check whether the resulting clusters correspond to the presence or absence of diabetes.

In [7]:
# Import packages
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import silhouette_score

In [5]:
# We will be using the new_df_without_outliers_copy_smote_resampled.xlsx
df = pd.read_excel('new_df_without_outliers_copy_smote_resampled.xlsx')
df.sample(5)

| Unnamed: 0 | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | smoking_history_encoded | gender_encoded | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 128745 | 1.22984 | 0 | 0 | -0.137284 | 0.414153 | 1.014928 | 0.26373 | -0.974068 | 1 |
| 11667 | -2.065704 | 0 | 0 | -1.945771 | 0.52909 | -0.628168 | -1.579747 | 1.211318 | 0 |
| 11098 | 0.057253 | 0 | 0 | 0.01697 | 0.128449 | -1.427512 | 0.26373 | 1.211318 | 0 |
| 7227 | -0.404259 | 0 | 0 | -0.241118 | -1.173634 | -1.427512 | 0.331039 | -0.974068 | 0 |
| 18113 | 1.257185 | 1 | 0 | -0.241118 | 0.22861 | -1.427512 | 1.636068 | 1.211318 | 0 |
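
The `Unnamed: 0` column in the sample output above is typically what pandas produces when a file is saved with its row index and read back without `index_col`. The following is a minimal sketch of how such a column arises and two ways to keep it out of the feature matrix. It assumes a toy frame and an in-memory CSV purely for illustration (the values are hypothetical); `pd.read_excel` accepts the same `index_col` argument.

```python
import io

import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"age": [0.5, -1.2, 0.3], "diabetes": [1, 0, 0]})

# Saving with the default index=True writes the row index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading the file back without index_col exposes that column as 'Unnamed: 0'.
buffer.seek(0)
with_index_column = pd.read_csv(buffer)
print(with_index_column.columns.tolist())

# Option 1: tell the reader that the first column is the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# Option 2: drop the column after loading.
dropped = with_index_column.drop(columns=["Unnamed: 0"])
print(restored.columns.tolist(), dropped.columns.tolist())
```

If the column really is a leftover index, leaving it in `X` lets the classifier use row position as a feature, which may matter for a resampled dataset whose synthetic rows are appended at the end.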


In [6]:
# Separate the features (X) from the target variable (y: diabetes or not).
# Note: X still includes 'Unnamed: 0', which appears to be a saved row index
# rather than a real feature; the classifier below is trained with it.
X = df.drop('diabetes', axis=1)
y = df['diabetes']

# Train a Gradient Boosting Classifier
clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Predict probabilities of each data point belonging to each class
probabilities = clf.predict_proba(X)

# Perform clustering on the predicted probabilities
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(probabilities)

# Get cluster labels
cluster_labels = kmeans.labels_

# Analyze clustering results
cluster_analysis = pd.DataFrame({'Cluster': cluster_labels, 'Actual Diabetes': y})
print(cluster_analysis)

        Cluster  Actual Diabetes
0             1                0
1             1                0
2             1                0
3             1                0
4             1                0
...         ...              ...
181139        0                1
181140        0                1
181141        0                1
181142        1                1
181143        0                1

[181144 rows x 2 columns]
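
The row-by-row printout above makes it hard to judge how well the clusters line up with the actual diabetes labels. One way to summarize the agreement is a contingency table plus the adjusted Rand index. The sketch below runs the same pipeline on synthetic data from `make_classification`, which is an assumption made so the example is self-contained; the `*_demo` names are hypothetical stand-ins for `X`, `y`, `probabilities`, and `cluster_labels`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the diabetes data (assumption: 1,000 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=42)

# Same pipeline as above: classifier probabilities, then K-means with two clusters.
clf_demo = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
proba_demo = clf_demo.predict_proba(X_demo)
labels_demo = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(proba_demo)

# Contingency table: rows are clusters, columns are the actual classes.
contingency = pd.crosstab(
    pd.Series(labels_demo, name="Cluster"),
    pd.Series(y_demo, name="Actual"),
)
print(contingency)

# Adjusted Rand index: 1.0 means identical partitions, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(y_demo, labels_demo)
print(f"Adjusted Rand index: {ari:.3f}")
```

On the real data, the equivalent calls would be `pd.crosstab(cluster_labels, y)` and `adjusted_rand_score(y, cluster_labels)`. Because the classifier is evaluated on the same rows it was trained on, high agreement here mostly reflects the classifier's training fit.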


In [8]:
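# Compute the silhouette score of the K-means labels in the original feature space (X).
# Note: the clusters were fit on the predicted probabilities, not on X.
silhouette_avg = silhouette_score(X, cluster_labels)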

print(f"Silhouette Score: {silhouette_avg}")

Silhouette Score: 0.14371949001054107
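
The score above measures cluster separation in the feature space `X`, although K-means was fit on the predicted probabilities. A low value therefore says the two groups overlap in feature space, not that K-means failed to separate the probabilities. The sketch below compares the two spaces on synthetic data (an assumption made to keep the example self-contained; all `*_demo` names are hypothetical). It also uses the `sample_size` argument of `silhouette_score`, which estimates the score from a random subsample and avoids the full pairwise-distance computation on large datasets.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the diabetes data (assumption: 2,000 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=2000, n_features=8, random_state=42)

# Classifier probabilities followed by K-means, mirroring the cells above.
clf_demo = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
proba_demo = clf_demo.predict_proba(X_demo)
labels_demo = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(proba_demo)

# Silhouette in the probability space, where the clusters were actually formed.
score_proba = silhouette_score(proba_demo, labels_demo, sample_size=500, random_state=42)

# Silhouette of the same labels in the original feature space.
score_features = silhouette_score(X_demo, labels_demo, sample_size=500, random_state=42)

print(f"Silhouette (probability space): {score_proba:.3f}")
print(f"Silhouette (feature space):     {score_features:.3f}")
```

Reporting both numbers makes it clear which space a given silhouette value refers to.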
