# Diabetes Risk: Predicting Diabetes Prevalence Through Unsupervised Learning
---

### 1 Introduction

---
#### 1.1 Overview

Diabetes is a metabolic disorder that affects an individual's blood sugar levels, often because the body cannot produce or use insulin effectively. Diabetes has become quite prevalent in the modern world, given changes in diet, the prevalence of other diseases, and many other possible external factors. Based on known and presumed risk factors, it would be useful to know whether these factors can be detected and used to predict whether someone is prone to diabetes. To explore this question, this notebook applies unsupervised learning through two clustering models: K-Means Clustering and Hierarchical Clustering.


#### 1.2 Goal

As an individual with a genetic predisposition to diabetes, it is essential to me to understand the risk factors that may predict diabetes, as well as the risks that come from having diabetes. Being able to group and identify those prone to diabetes, based on potential risk factors, would enable better awareness and long-term preventative care. To approach this problem, different clustering approaches are evaluated to determine which model best predicts diabetes given the presence of certain factors. The main approaches investigated here are Hierarchical Clustering (via Agglomerative Clustering) and K-Means Clustering.

#### 1.3 About the Data

**1.3.1 Load The Data**

In [None]:
import numpy as np # for numerical operations
import pandas as pd # to process data and load the csv
from sklearn.preprocessing import StandardScaler # to standardize features
from sklearn.model_selection import train_test_split, learning_curve
import copy # to copy data as needed
import matplotlib.pyplot as plt # to create plots
import seaborn as sns # to visualize pairplots and correlation matrices
from sklearn.decomposition import PCA, NMF # for PCA and non-negative matrix factorization
from sklearn.cluster import KMeans, AgglomerativeClustering # for both clustering methods
from sklearn.linear_model import LogisticRegression # to create logistic regressions
from sklearn.tree import DecisionTreeClassifier # to create a decision tree
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix # to compute accuracy, precision, and confusion matrices
from sklearn.tree import plot_tree # to plot decision tree
from sklearn.metrics import silhouette_score # to evaluate how well separated the clusters are

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# acquire the diabetes data
diabetes = pd.read_csv("/kaggle/input/diabetes-risk-prediction/diabetes_risk_prediction_dataset.csv")

# display the first 5 rows of the diabetes data
diabetes.head()

In [None]:
# print number of columns
print("Number of Columns:", 
      len(diabetes.columns))

#print number of rows
print("Number of Rows:", 
      len(diabetes))

**Observations:** At a glance, every variable except `Age` is a two-level categorical variable. Since the main focus is determining how risk factors can predict diabetes, `class` is treated as the response variable: it records whether an individual has diabetes, so it is the label that the cluster assignments will later be compared against. Every column other than `Age` is stored as text, so those columns need to be converted to binary (0/1) values. The data has 17 columns (16 features plus `class`) and 520 observations.
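
The shape and the two-level columns can also be confirmed programmatically rather than by eye. The snippet below is a minimal sketch on a small made-up frame (`toy` and its values are illustrative stand-ins, not rows from the real data); the same two calls could be applied to `diabetes`.

```python
import pandas as pd

# Toy frame standing in for `diabetes` (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [40, 58, 41, 45],
    "Gender": ["Male", "Male", "Female", "Male"],
    "Polyuria": ["No", "No", "Yes", "No"],
    "class": ["Positive", "Positive", "Negative", "Positive"],
})

# Number of distinct values per column; two-level columns report 2
n_unique = toy.nunique()
binary_cols = n_unique[n_unique == 2].index.tolist()

print(toy.shape)    # (rows, columns)
print(binary_cols)  # columns with exactly two distinct values
```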

**1.3.2 Data Source and Citation**

**_Citation:_**
Himanshu (rcratos).(n.d.). Diabetes Risk Prediction [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/rcratos/diabetes-risk-prediction

The dataset used is Diabetes Risk Prediction from Kaggle. It can be found on the [Diabetes Risk Prediction](https://www.kaggle.com/datasets/rcratos/diabetes-risk-prediction/data) page, which is cited above. It provides information on individuals with different potential risk factors for diabetes, which can be used for risk prediction.

**1.3.3 Data Description**

This data overall provides information on individuals and whether they have acquired the potential risk factors of diabetes, as well as whether they have diabetes themselves.

This data consists of 520 rows and 17 columns. Here are all the features of this data and what they are:

- **Age**: The age of the individual. (Numeric)
- **Gender**: The gender of the individual. (Categorical: Male or Female)
- **Polyuria**: A common diabetes symptom in which an individual urinates excessively. (Binary Classification: Yes or No)
- **Polydipsia**: A common diabetes symptom in which an individual is excessively thirsty. (Binary Classification: Yes or No)
- **Sudden weight loss**: A diabetes symptom in which an individual experiences severe, unexplained weight loss. (Binary Classification: Yes or No)
- **Weakness**: Whether the individual experiences physical weakness. (Binary Classification: Yes or No)
- **Polyphagia**: A common diabetes symptom in which an individual is excessively hungry. (Binary Classification: Yes or No)
- **Genital thrush**: A yeast infection that causes irritation in the genital area. (Binary Classification: Yes or No)
- **Visual blurring**: A loss or blur in vision. (Binary Classification: Yes or No)
- **Itching**: Whether an individual experiences itching or irritation. (Binary Classification: Yes or No)
- **Irritability**: Whether the individual experiences irritability or not. (Binary Classification: Yes or No)
- **Delayed healing**: Whether the individual experiences slow healing of wounds. (Binary Classification: Yes or No)
- **Partial paresis**: A symptom of diabetes in which there is partial loss of voluntary movement. (Binary Classification: Yes or No)
- **Muscle stiffness**: Whether the individual experiences muscle stiffness. (Binary Classification: Yes or No)
- **Alopecia**: Whether the individual suffers from hair loss. (Binary Classification: Yes or No)
- **Obesity**: Whether the individual is obese or not (Binary Classification: Yes or No)
- **Class**: Whether the individual has diabetes or not (Binary Classification: Yes or No)

This data will be used to evaluate vital symptoms/risks that contribute to diabetes, as well as predict causes of diabetes risk/risk factors that come from diabetes.


### 2 Data Cleaning

--- 
#### 2.1 Data Exploration

The first part of the exploratory data analysis is to determine what kind of data is being used in this investigation and whether any of it needs processing. First, the columns should match the data types they are described as having. Next, it is useful to know whether there are null values that need to be handled before analysis. Finally, since clustering and dimensionality reduction will be applied, it is worth evaluating whether to normalize the data, so that all features contribute on a comparable scale, both to the distance metrics used in clustering and to dimensionality reduction with Non-Negative Matrix Factorization (NMF).

**2.1.1: Types of Usable Data**

In [None]:
# display the first 5 rows of the diabetes data
diabetes.head()

**Observations:** Based on this table, there is evidently categorical data that need to be converted to binary values, especially when it comes to any feature aside from "Age".

In [None]:
# print information on the datasets
# check for null values
print("Diabetes Information")
print(diabetes.info(), "\n")

**Observations:** Based on the `Diabetes Information` above, it is evident that age is the only numeric variable present in the data, which means every other column needs to be converted into binary values.

**2.1.2: Evaluate Potential Need To Remove Null Values**

In [None]:
# print the number of null values present in each column
print("Diabetes Information: Null Values")
print(diabetes.isnull().sum(), "\n")

**Observations:** In the `Diabetes Information: Null Values` output, every column has zero null values, so no rows or columns need to be removed or imputed.

**2.1.3: Evaluate Potential Need For Normalization**

In [None]:
# visualize box plots for "age" relative to class before normalization
sns.boxplot(x = "class", y = "Age", 
            data = diabetes, palette = ["dodgerblue", "r"])
plt.xlabel("Class")
plt.ylabel("Age")
plt.title("Boxplot of Age by Class Before Normalization")
plt.show()

**Observations:** According to the boxplot, the two classes have different median ages, but the boxes are of similar size, which suggests a similar spread of ages in both classes. Note that `Age` is still measured in years here, while every other feature will be encoded as 0 or 1.

**Suggestions for Pre-Processing:** The data evaluation shows there are no null values to remove for any feature. However, the text-valued columns need to be converted to binary values before conducting exploratory data analysis. The boxplot for `Age` shows a similar spread in both classes, but `Age` is on a different scale from the binary features, so it will still be standardized.

#### 2.2 Pre-Process Data for Exploratory Analysis

**2.2.1 Approach** 

To better understand the data at hand, it must first be made usable for the analysis. The first step is to convert all binary categorical labels into values, using the following mappings: {`Yes = 1`, `No = 0`}, {`Male = 1`, `Female = 0`}, and {`Positive = 1`, `Negative = 0`}. Next, `Age` is standardized so that it is on a scale comparable to the binary features. The last step is to check whether the data can be reduced to only the important features before Non-Negative Matrix Factorization, to avoid the curse of dimensionality.

**2.2.2 Reassigning Categorical Data to Binary Values** 

In [None]:
# acquire the binary categorical variables and convert to numeric
# select all yes or no binary classifications; not age, gender, and class
bi_col = diabetes.columns.drop(["Age", "Gender", "class"])
# create new diabetes dataframe with new values
db = diabetes.copy()
# ensure yes = 1; no = 0
db[bi_col] = db[bi_col].apply(lambda x: x.map({"Yes": 1,
                                               "No": 0}))
# ensure male = 1; female = 0
db["Gender"] = db["Gender"].map({"Male": 1,
                                 "Female": 0})
# ensure positive = 1; negative = 0
db["class"] = db["class"].map({"Positive": 1,
                               "Negative": 0})
# make sure the new data no longer has the labels
db
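
One thing worth checking after a `.map(...)` call like the ones above is that no value was silently left unmapped: pandas returns `NaN` for any value that is missing from the mapping dictionary (for example, a differently capitalized label). The sketch below uses a made-up column to illustrate the check; the stray lower-case `"yes"` is a hypothetical example, not something observed in this data.

```python
import pandas as pd

# Made-up column with one label that is not in the mapping ("yes" instead of "Yes")
toy = pd.DataFrame({"Polyuria": ["Yes", "No", "yes"]})

# Values absent from the dictionary become NaN
encoded = toy["Polyuria"].map({"Yes": 1, "No": 0})

# Count unmapped entries; 0 means every label was converted
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

Running the equivalent of `db.isna().sum()` after the conversion would confirm that the real columns were fully mapped.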

**2.2.3 Normalizing the Data** 

To ensure that the data will be usable for clustering and dimension reduction, normalizing the dataset will still be conducted.

In [None]:
# isolate age
age = ["Age"]

# standardize using StandardScaler()
scaler = StandardScaler()
# write the standardized values back into the age column
db[age] = scaler.fit_transform(db[age])
# display db
db

In [None]:
# visualize box plots for "age" relative to class after normalization
sns.boxplot(x = "class", y = "Age", 
            data = db, palette = ["dodgerblue", "r"])
plt.xlabel("Class")
plt.ylabel("Age")
plt.title("Boxplot of Age by Class After Normalization")
plt.show()

**Observations:** Comparing the plots before and after normalization, the `Age` axis is now centered near 0 and expressed in standard deviations rather than years. Standardization is a linear rescaling, so the relative position and shape of the two boxes should be unchanged; only the scale differs.
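
To make the effect of standardization concrete, the sketch below applies the same formula that `StandardScaler` uses, z = (x - mean) / std with the population standard deviation, to a handful of made-up ages. The values are illustrative only.

```python
import numpy as np

# Made-up ages in years (illustrative only)
ages = np.array([30.0, 40.0, 50.0, 60.0, 70.0])

# Same formula StandardScaler applies: subtract the mean, divide by the
# population standard deviation (ddof=0, NumPy's default)
z = (ages - ages.mean()) / ages.std()

print(z.mean(), z.std())  # approximately 0 and 1
```

Because the transformation is linear, the ordering of individuals by age is preserved; only the units change.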

**2.2.4 Data Reduction for Higher Dimensionality** 

For clustering, using the full standardized data is useful because it can form clusters based on all the variables; however, the models might perform better with fewer features. Therefore, PCA is used here to check whether any features contribute so little that they can be dropped, which would also help avoid the curse of dimensionality. NMF will then be used later, in the modeling section, to reduce the dimensionality of the remaining features.

In [None]:
# keep enough principal components to explain 95% of the variance
pca = PCA(n_components = 0.95)  
pca.fit(db)

# transform db into principal components
pca_db = pca.transform(db)

# acquire components of each principal component
pca_comp = pca.components_

# create DataFrame of loadings: one row per original feature, one column per PC
feat_imp = pd.DataFrame(pca_comp.T,
                        index=db.columns,
                        columns=[f'PC{i+1}' for i in range(pca.n_components_)])

# select features whose largest absolute loading on any kept component is below 0.1
feat_remove = feat_imp[feat_imp.abs().max(axis = 1) < 0.1].index

# reduce db dataset
db_reduced = db.drop(columns = feat_remove)

# display reduced data
db_reduced

**Observations:** PCA did not flag any feature for removal: `db_reduced` has the same dimensions as the scaled `db`. This suggests all the features are useful for both the clustering and dimensionality reduction models. However, the standardized `Age` column contains negative values, and NMF requires non-negative input. The simplest option from here is to remove the `Age` column.
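
As a quick check on what `PCA(n_components = 0.95)` keeps, the cumulative explained variance can be inspected. The sketch below runs on random synthetic data (`X_demo` is a stand-in, not the diabetes data), so the number of components it reports says nothing about this dataset; it only shows which attributes to look at.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 100 rows, 5 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))

# A float n_components keeps the fewest components whose cumulative
# explained variance exceeds that fraction
pca_demo = PCA(n_components=0.95).fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

print(pca_demo.n_components_)  # number of components kept
print(cum_var)                 # cumulative share of variance explained
```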

In [None]:
db1 = db_reduced.drop(columns = ["Age"])
db1
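
Dropping `Age` is what this notebook does. One possible alternative, shown here only as a sketch and not applied in the rest of the analysis, would be to rescale `Age` to the range [0, 1] with `MinMaxScaler`, which keeps the column non-negative and therefore compatible with NMF. The ages below are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up ages in years, as a single-column 2D array (illustrative only)
ages = np.array([[16.0], [35.0], [48.0], [90.0]])

# Rescale to [0, 1]: (x - min) / (max - min)
age_01 = MinMaxScaler().fit_transform(ages)

print(age_01.ravel())
```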

#### 2.3 Data Cleaning Overview

**Summary:** Overall, there were no null data points, so the data was already clean in that respect. However, the classification labels were stored as strings, so before proceeding to Exploratory Data Analysis it was necessary to convert all the Yes/No, Male/Female, and Positive/Negative labels into binary values. This allows correlation matrices and other exploratory analysis to be computed. Data reduction was also attempted; however, PCA indicated that all the features are worth keeping. 

### 3 Exploratory Data Analysis (EDA)
In this portion, visualizations and analysis are conducted in order to understand the data better and simplify it further before modeling. The first part is to find correlations between the features, as well as to get an idea of the relationship each chosen feature has with the diabetes classification.

#### 3.1 Data Statistics
This allows for an understanding of the demographics of who was observed in this data set. This should not influence the prediction model, but rather help evaluate whether the later models can predict even if the given demographics are different.

**3.1.1 Summary Statistics**

In [None]:
summary_stats = db1.describe()
print(summary_stats)

**3.1.2 Pie Chart: Visualization of Summary Statistics**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Find every column feature, except "Age"
bi_col = [col for col in db1.columns]

# Create subplots with a layout of 4 rows and 4 columns
fig, axes = plt.subplots(4, 4, figsize=(15, 10))

# iterate through each column feature to create a pie chart
for i, col in enumerate(bi_col):
    # within the figure dimension
    dim = axes[i//4, i%4]
    # count each value; sort by value so the wedge order is always 0, then 1
    class_counts = db1[col].value_counts().sort_index()
    
    # Plot the pie chart
    dim.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', colors=["dodgerblue", "r"])
    dim.set_title(f"Percentage of {col.capitalize()}")
    # legend follows the wedge order: 0 first, then 1
    dim.legend(title = "Value", 
               loc = "upper right", 
               labels=["0 (No/Female/Negative)", "1 (Yes/Male/Positive)"], 
               prop={"size": 8})
        
plt.tight_layout()
plt.show()

**Observations:** The summary and the pie charts show the overall make-up of the observations in the data set. For each trait, the shares and the majority value are:
- **Gender**: 63.1% Male; 36.9% Female = Male
- **Polyuria**: 49.6% Yes; 50.4% No = No Polyuria
- **Polydipsia**: 44.8% Yes; 55.2% No = No Polydipsia
- **Sudden weight loss**: 41.7% Yes; 58.3% No = No sudden weight loss
- **Weakness**: 58.7% Yes; 41.3% No = Weakness
- **Polyphagia**: 45.6% Yes; 54.4% No = No Polyphagia
- **Genital thrush**: 22.3% Yes; 77.7% No = No genital irritation from yeast infection
- **Visual blurring**: 44.8% Yes; 55.2% No = No visual blurring
- **Itching**: 48.7% Yes; 51.3% No = No itching
- **Irritability**: 24.2% Yes; 75.8% No = No irritability
- **Delayed healing**: 46.0% Yes; 54.0% No = No slow healing
- **Partial paresis**: 43.1% Yes; 56.9% No = No loss of movement
- **Muscle stiffness**: 37.5% Yes; 62.5% No = No muscle stiffness
- **Alopecia**: 34.4% Yes; 65.6% No = No hair loss
- **Obesity**: 16.9% Yes; 83.1% No = No obesity
- **Class**: 61.5% Positive; 38.5% Negative = Diabetic

**Observations:** Most symptom features are either close to balanced or more often absent than present, so the dataset does not appear to be dominated by highly symptomatic individuals. This suggests the symptom features can help categorize individuals in modeling without being heavily skewed towards diabetes risks. Note, however, that `class` itself is split 61.5% / 38.5%, so the two classes are not balanced, which should be kept in mind when reading accuracy later on.

To further evaluate this, it is best to see the breakdown of each feature in respect to `class`.

#### 3.2 Evaluate Diabetic Risks by Class 

**3.2.1 Bar Plot Comparisons**

It is uncertain whether the composition of the data will affect the model, so it is best to identify which risk traits are found in those with diabetes. Bar plots are used to break down each binary category by class.

In [None]:
# conduct count plot for each binary category; exclude age
# find every column feature, except age
bi_col = [col for col in db1.columns]
bi_col

# visualize bar plots for binary variables, relative to class
fig, axes = plt.subplots(4, 4, 
                         figsize=(15, 10))

# iterate through each column feature to create bar plot
for i, col in enumerate(bi_col):
    # within the figure dimension
    dim = axes[i//4, i%4]
    # create count plot by class
    sns.countplot(x = col, hue = "class", 
                  data = db, ax = dim, 
                  palette = ["dodgerblue", "r"])
    dim.set_title(f"Distribution of {col.capitalize()} by Class")
    
plt.tight_layout()
plt.show()

**Observations:** The bar plots show how each trait is distributed within each class. Diabetics were more likely to: 
- **Gender (Key: Male = 1, Female = 0)**: Be Female
- **Polyuria (Key: Yes = 1, No = 0)**: Have Polyuria
- **Polydipsia (Key: Yes = 1, No = 0)**: Have Polydipsia
- **Sudden weight loss (Key: Yes = 1, No = 0)**: Have sudden weight loss
- **Weakness (Key: Yes = 1, No = 0)**: Have weakness
- **Polyphagia (Key: Yes = 1, No = 0)**: Have Polyphagia
- **Genital thrush (Key: Yes = 1, No = 0)**: Have genital irritation from yeast infection
- **Visual blurring (Key: Yes = 1, No = 0)**: Have visual blurring
- **Itching (Key: Yes = 1, No = 0)**: Have itching
- **Irritability (Key: Yes = 1, No = 0)**: Have irritability
- **Delayed healing (Key: Yes = 1, No = 0)**: Have slow healing
- **Partial paresis (Key: Yes = 1, No = 0)**: Have loss of movement
- **Muscle stiffness (Key: Yes = 1, No = 0)**: Have muscle stiffness
- **Alopecia (Key: Yes = 1, No = 0)**: Have hair loss
- **Obesity (Key: Yes = 1, No = 0)**: Have obesity

**Class** could not be evaluated due to it being compared to itself.

**Observations:** The data suggest that `class` can be broken down by the different binary categories to help identify who will have diabetes and who will not. As a result, modeling this with dimensionality reduction and clustering can help create a prediction that uses this structure of categorization.

#### 3.3 Visualizing Relationships Between Variables

Although the class breakdown alone indicates which risk factors are more likely to accompany diabetes, it is also interesting to evaluate whether relationships exist among all the features, especially considering that PCA found all the features important.

**3.3.1 Correlation Matrix**

In [None]:
# compute the pairwise correlation matrix
corr = db1.corr()

# create a heatmap of the correlations among all features and class
plt.figure(figsize=(10, 8))
# create heatmap from sns library
sns.heatmap(corr, annot = True, # include correlation values
            cmap = "coolwarm", 
            square = True)
plt.title('Diabetes Feature Correlation Matrix')
plt.xticks(rotation = 45, 
           ha='right')
plt.tight_layout()  # ensure the data is not too big
plt.show()

**Observations:** The variables with the highest correlation with `class` are `Polyuria` and `Polydipsia`. The variables with the lowest correlation with `class` are `Genital thrush`, `visual blurring`, `Itching`, `weakness`, `delayed healing`, `muscle stiffness`, `Alopecia`, and `Obesity`. Apart from `Polyuria` and `Polydipsia`, the features have correlations with `class` below 0.5. This indicates that `class` does not have a strong linear correlation with the majority of the features, but these two features are worth noting for analysis. The correlation matrix does not call for dropping any further features, given that PCA already kept all of them; instead, NMF will be used to reduce the dimensionality.
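
Reading individual values off a large heatmap can be error-prone. A ranked list of each feature's correlation with `class` can be pulled straight from the correlation matrix, as in the sketch below. The frame `toy` is made up for illustration, so its values do not reflect the real data; with the notebook's data the same expression could be applied to `corr`.

```python
import pandas as pd

# Made-up binary data (illustrative only)
toy = pd.DataFrame({
    "Polyuria": [1, 1, 0, 0, 1, 0],
    "Itching":  [1, 0, 1, 0, 1, 0],
    "class":    [1, 1, 0, 0, 1, 0],
})

# Take the `class` column of the correlation matrix, drop the
# self-correlation, and sort from strongest positive to weakest
corr_with_class = (
    toy.corr()["class"]
    .drop("class")
    .sort_values(ascending=False)
)
print(corr_with_class)
```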

#### 3.4 Pre-Processing Test and Train Data for Models
Now that the target and predictor variables are understood, it is time to create prediction models based on them. First, however, the data must be split into features and target, and then into train and test sets.

In [None]:
# assign class as response variable, and the rest into X
X = db1.drop(columns = ["class"])  # Features
y = db1["class"]  # Target variable

# display X variable
X

In [None]:
# display the y variable
y 

In [None]:
# now split data using train_test_split; make sure 20% goes to test data
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, 
                                                    shuffle = True,
                                                    random_state = 42) # for reproducibility

#### 3.5 EDA Overview

**Summary:** Overall, the correlation matrix and bar plots pointed to `Age`, `gender`, `Polyuria`, `Polydipsia`, `sudden weight loss`, and `partial paresis` as the features most worth attention when predicting `class`. Individuals classified as diabetic were more likely to be older than 35, be female, and have Polyuria, Polydipsia, sudden weight loss, and partial paresis. Note that `Age` was dropped before modeling (Section 2.2), so it does not enter the models below. After simplifying the data set to a usable data frame, the data was split into train and test sets for X and y, respectively.

### 4 Modeling Analysis

---

#### 4.1 Reducing Dimensionality
Given the number of features in this data, the dimensionality may be too high to cluster effectively. Dimensionality reduction is therefore conducted first.

In [None]:
# reduce dimensionality with non-negative matrix factorization
nmf = NMF(n_components=2, random_state=42)
X_train_nmf = nmf.fit_transform(X_train)
X_test_nmf = nmf.transform(X_test)

# create a visualization of the data in the two NMF components
sns.scatterplot(x = X_train_nmf[:, 0], y = X_train_nmf[:, 1], 
                hue = y_train, palette = "Set1", 
                legend="full")
plt.title("Non-Negative Matrix Factorization")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.legend(title="Class")
plt.show()

**Observations:** After reducing the dimensionality, the two classes appear partly separated in the two NMF components, although they overlap. The plot shows how NMF transformed the data into a form that is more usable for K-Means and Hierarchical Clustering.
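
For readers less familiar with NMF: it approximates a non-negative matrix X (samples by features) as the product of two smaller non-negative matrices, W (samples by components) and H (components by features). The sketch below illustrates the shapes involved and one way to gauge how much information two components retain. It runs on random binary data (`X_demo`), so the error it prints is not the error for the diabetes data; the `init` and `max_iter` settings are choices made for this sketch, not the notebook's settings.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic binary stand-in data: 50 samples, 15 features
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(50, 15)).astype(float)

# Factorize X ≈ W @ H with 2 components
nmf_demo = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
W = nmf_demo.fit_transform(X_demo)  # shape (50, 2): per-sample weights
H = nmf_demo.components_            # shape (2, 15): per-component feature loadings

# Relative reconstruction error: 0 would be a perfect reconstruction
rel_err = np.linalg.norm(X_demo - W @ H) / np.linalg.norm(X_demo)
print(W.shape, H.shape, rel_err)
```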

#### 4.2 Plot K-Means and Hierarchical Clustering
In this section, the K-Means plot and the Hierarchical Clustering dendrogram are produced to show the underlying model for each clustering method. The clustering uses the `X_train` values transformed by NMF, with the aim of helping the clustering separate the groups better. 

**4.2.1 K-Means Plot**

In [None]:
# cluster into two groups using k-means
kmeans = KMeans(n_clusters = 2, 
                n_init=10, 
                random_state=42)
# fit on the reduced-dimensionality data and get each point's cluster label
kmeans_labels = kmeans.fit_predict(X_train_nmf)

# plot for kmeans
# first get unique labels
uniq_labels = np.unique(kmeans_labels)

# iterate through each unique label, plotting where 0 and 1's were identified
for i in uniq_labels:
    plt.scatter(X_train_nmf[kmeans_labels == i, 0], X_train_nmf[kmeans_labels == i, 1], label=i)
plt.title("KMeans Clustering")
plt.legend()
plt.show()

**Observations:** From the plot, K-Means was able to split the data into two groups reasonably well; however, the clusters are not clearly enough separated to say it is the best model for this data set. 
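
The number of clusters is fixed at 2 here because the target has two classes. If one wanted to check whether the data itself favors two clusters, a common approach is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs (`X_demo` from `make_blobs`, not the NMF-transformed diabetes data), so its output is illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with two blobs
X_demo, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=42)

# Fit K-Means for several k and record the silhouette score of each
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(scores, best_k)
```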

**4.2.2 Hierarchical Clustering**

In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
import numpy as np
import matplotlib.pyplot as plt

def plot_dendrogram(model, **kwargs):
    # acquire the children nodes of the model
    children = model.children_
    # placeholder heights: use the merge order as a stand-in for merge distance
    distance = np.arange(children.shape[0])
    # placeholder counts: stand-in for the number of observations in each merged cluster
    observations = np.arange(2, children.shape[0] + 2)

    # build a linkage matrix from the values above
    linkage_matrix = np.column_stack([children, 
                                      distance, 
                                      observations]).astype(float)

    # plot the dendrogram
    dendrogram(linkage_matrix, **kwargs)

# conduct agglomerative clustering for modeling; 2 clusters, adjust linkage to highest accuracy
agg_clustering = AgglomerativeClustering(n_clusters = 2,
                                         distance_threshold = None,
                                         linkage = "complete")
agg_labels = agg_clustering.fit_predict(X_train_nmf)

# plot the dendrogram
plot_dendrogram(agg_clustering, truncate_mode='level', p=6)
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.title('Agglomerative Clustering Dendrogram')
plt.show()

**Observations:** In this dendrogram, there are some similar groupings, but there are visible differences between data points, indicating limited similarity.
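
The dendrogram above is drawn with merge order standing in for distance. If actual merge distances are wanted, scikit-learn can record them when `compute_distances=True` is passed to `AgglomerativeClustering`, exposing them as `distances_`. The sketch below follows the pattern used in scikit-learn's own dendrogram example; `linkage_from_model` is a helper name introduced here for illustration, and the data is random, so it is not a re-run of this notebook's result.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def linkage_from_model(model):
    """Build a SciPy-style linkage matrix from a fitted model with distances_."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        current = 0
        for child in merge:
            if child < n_samples:
                current += 1  # leaf node: one original sample
            else:
                current += counts[child - n_samples]  # previously merged cluster
        counts[i] = current
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)

# Random stand-in data: 20 points in 2 dimensions
rng = np.random.default_rng(0)
X_demo = rng.random((20, 2))

model = AgglomerativeClustering(n_clusters=2, linkage="complete",
                                compute_distances=True).fit(X_demo)
Z = linkage_from_model(model)

# no_plot=True returns the tree layout without drawing; drop it to plot
tree = dendrogram(Z, no_plot=True)
print(Z[-1])  # final merge: child ids, distance, total sample count
```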

#### 4.3 Evaluate Efficiency of K-Means and Hierarchical Clustering
From the models above, it is unclear how well either model predicts the clustering. To examine this further, accuracy, precision, and confusion matrices will be evaluated for both, based on the best permutation (label ordering) of the predicted cluster labels. Helper functions are needed to map cluster labels to class labels before evaluating these metrics. 

The K-Means and Agglomerative Clustering plots suggest that the clusters are not well separated. To check this observation, the silhouette score will be computed for both as well.
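
For reference, the silhouette coefficient of a single point $i$ is defined as

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the mean distance from $i$ to the points in the nearest other cluster. The silhouette score reported by `silhouette_score` is the mean of $s(i)$ over all points. It ranges from $-1$ to $1$: values near $1$ indicate compact, well-separated clusters, values near $0$ indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.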

**4.3.1 Create Functions to Evaluate Accuracy, Precision, and Confusion Matrix**

In [None]:
import itertools
# check permutations of cluster labels to find the mapping that best matches the true classes
def permute_labels(ytrain, model_labels, n = 2):
    # every possible ordering of the n labels
    permutations = itertools.permutations(range(n))

    # placeholders to store the best permutation and its scores
    best_perm = None
    best_acc = 0.0
    best_prec = 0.0
    
    # iterate over the permutations; p maps cluster id -> class label
    for p in permutations:
        # permute the predicted labels
        permuted_labels = [p[label] for label in model_labels]
        # find accuracy and precision between the permuted values and the expected y values
        acc = accuracy_score(ytrain, permuted_labels)
        prec = precision_score(ytrain, permuted_labels)

        # keep the permutation with the best accuracy
        if acc > best_acc:
            best_acc = acc
            best_prec = prec
            best_perm = p

    return best_perm, best_acc, best_prec

In [None]:
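# create confusion matrix for the true labels and predicted labels
def create_conf_mx(y_train, model_labels, permutation = None):
    # map cluster ids onto class labels when a permutation is given;
    # otherwise use the cluster ids as they are
    if permutation is not None:
        pred_labels = [permutation[label] for label in model_labels]
    else:
        pred_labels = list(model_labels)

    # rows are true classes (0, 1); columns are predicted classes (0, 1)
    return confusion_matrix(y_train, pred_labels)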
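
The reason label permutations are needed at all is that cluster ids are arbitrary: a clustering model may call the diabetic group `0` and the non-diabetic group `1`, or the reverse. The pure-Python sketch below shows the effect on a tiny made-up example (`y_true` and `clusters` are hypothetical values, not model output).

```python
import itertools

# Made-up true classes and cluster ids from a hypothetical model
y_true = [1, 1, 0, 0, 1]
clusters = [0, 0, 1, 1, 1]

def accuracy(truth, pred):
    """Fraction of positions where the two label lists agree."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

# Score every mapping from cluster id to class label
results = {
    p: accuracy(y_true, [p[c] for c in clusters])
    for p in itertools.permutations(range(2))
}
best = max(results, key=results.get)
print(results, best)
```

Here the identity mapping `(0, 1)` scores 0.2 while the swapped mapping `(1, 0)` scores 0.8, so the same clustering looks poor or good depending only on how the ids are named.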

**4.3.2 Evaluate Metrics for K-Means** 

In [None]:
# find accuracy and label ordering/permutation
kmeans_perm, kmeans_acc, kmeans_prec = permute_labels(y_train, kmeans_labels)
print("K-Means Clustering Label Ordering:", kmeans_perm) 
print("K-Means Clustering Accuracy:", kmeans_acc)
print("K-Means Clustering Precision:", kmeans_prec)

In [None]:
# create confusion matrix for k-means
km_mx_best = create_conf_mx(y_train, kmeans_labels, kmeans_perm)
print("Best True Values Confusion Matrix for K-Means Clustering:")
print(km_mx_best)

# plot confusion matrix using sns
plt.figure(figsize=(8, 6))
sns.heatmap(km_mx_best, annot=True, cmap = "Purples", 
            xticklabels=["Non-Diabetic", "Diabetic"], 
            yticklabels=["Non-Diabetic", "Diabetic"])
plt.xlabel("Predicted Values")
plt.ylabel("True Values")
plt.title("Confusion Matrix for K-Means Clustering")
plt.show()

In [None]:
km_score = silhouette_score(X_train_nmf, kmeans_labels)
print("Silhouette Score: ", km_score)

**Observations:** The following K-Means Clustering Metrics were acquired: 
- Accuracy: 0.8052884615384616
- Precision: 0.9666666666666667
- Silhouette Score:  0.44072574501920675

Based on these values, K-Means Clustering can somewhat accurately and precisely predict the groupings of who would be at risk for diabetes; however, the clusters themselves are not clearly separated. This is shown by the modest silhouette score of about 0.44. 
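
Precision can be noticeably higher than accuracy, as it is here, when a model makes few false-positive errors but a larger number of false-negative ones. The sketch below works through the arithmetic with made-up confusion-matrix counts (they are not this notebook's results); recall is included only to show where the missed positives appear.

```python
# Made-up counts for illustration only (not the notebook's results)
tn, fp, fn, tp = 50, 5, 20, 25

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: share of predicted positives that are truly positive
precision = tp / (tp + fp)

# Recall: share of true positives that were found
recall = tp / (tp + fn)

print(accuracy, precision, recall)
```

With these counts, precision is higher than accuracy because only 5 predictions are false positives, while the 20 false negatives pull accuracy and recall down.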

**4.3.3 Evaluate Metrics for Hierarchical Clustering** 

In [None]:
# find accuracy and label ordering/permutation
agg_perm, agg_acc, agg_prec = permute_labels(y_train, agg_labels)

# print output
print("Hierarchical Clustering Label Ordering:", agg_perm) 
print("Hierarchical Clustering Accuracy:", agg_acc)
print("Hierarchical Clustering Precision:", agg_prec)

In [None]:
# create confusion matrix for hierarchical
agg_mx_best = create_conf_mx(y_train, agg_labels, agg_perm)
print("Best True Values Confusion Matrix for Hierarchical Clustering:")
print(agg_mx_best)

# plot confusion matrix using sns
plt.figure(figsize=(8, 6))
sns.heatmap(agg_mx_best, annot=True, cmap = "Purples", 
            xticklabels=["Non-Diabetic", "Diabetic"], 
            yticklabels=["Non-Diabetic", "Diabetic"])
plt.xlabel("Predicted Values")
plt.ylabel("True Values")
plt.title("Confusion Matrix for Hierarchical Clustering")
plt.show()

In [None]:
agg_score = silhouette_score(X_train_nmf, agg_labels)
print("Silhouette Score: ", agg_score)

**Observations:** The following Hierarchical Clustering Metrics were acquired: 
- Accuracy: 0.8173076923076923
- Precision: 0.9625668449197861
- Silhouette Score: 0.37396613190254996

Based on these values, similar to K-Means, Hierarchical Clustering can somewhat accurately and precisely predict the groupings of who would be at risk for diabetes; however, with a silhouette score of about 0.37, the clusters are only weakly separated. This is consistent with the observations from the dendrogram. 

### 5 Results and Analysis

---

#### 5.1 Results
The accuracy and precision scores for K-Means Clustering were 0.805 and 0.967, respectively, while those for Hierarchical Clustering were 0.817 and 0.963, respectively.
Furthermore, the confusion matrices showed that both models produced more True Positives and True Negatives than False Positives and False Negatives.

Despite these promising numbers, the plots for each model showed that the clusterings were not clean, especially for hierarchical clustering. In the dendrogram, there was minimal similarity between the clusterings, which is what caused the dendrogram to look distorted and tall. As for K-Means, compared with the Non-Negative Matrix Factorization plot colored by class, many data points were misclassified. 

Lastly, the silhouette score, which measures how well separated the clusters are, was 0.441 for K-Means and 0.374 for Hierarchical Clustering.



#### 5.2 Analysis
The accuracy scores, precision scores, and confusion matrices suggest that both K-Means and Hierarchical Clustering have the potential to predict those at risk for diabetes, with accuracies greater than 80% and precisions greater than 90%. However, the plots of these clustering approaches show that the clusters are not well separated. Furthermore, the silhouette scores for both models were below 0.5, indicating that cohesion within clusters and separation between them were limited. Given this, although the models show potential to predict accurately and precisely, they do not separate the groups in a clear way. This indicates that more work needs to be done in order to predict reliably in the future.

### 6 Closing 

---

#### 6.1 Things to Improve
Although there is potential to continue using either clustering method to group diabetes-prone individuals by diabetes risk, there is definitely a need to reevaluate the approach for this type of unsupervised learning. One possible change is to simplify the data further based on an evaluation of the confusion matrix. Furthermore, adjusting the models' hyperparameters could possibly contribute to an improved model. Lastly, another suggestion would be to use other unsupervised learning or dimension reduction approaches that may suit the data set better.
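
As one example of the hyperparameter adjustment mentioned above, the linkage criterion of `AgglomerativeClustering` could be varied and compared by silhouette score. The sketch below does this on synthetic blobs (`X_demo` from `make_blobs`, a stand-in for the NMF-transformed training data), so which linkage comes out on top here says nothing about the diabetes data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with two blobs
X_demo, _ = make_blobs(n_samples=150, centers=2, random_state=0)

# Compare the four linkage criteria supported by AgglomerativeClustering
linkage_scores = {}
for linkage in ["ward", "complete", "average", "single"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X_demo)
    linkage_scores[linkage] = silhouette_score(X_demo, labels)

print(linkage_scores)
```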

#### 6.2 Conclusion
In conclusion, the use of K-Means and Hierarchical Clustering shows that predicting who is diabetic is possible; however, the model selection needs further work and investigation. Nonetheless, based on the evaluation of both models' metrics, all the features have the potential to contribute towards predicting diabetes. Overall, "Gender", "Polyuria", "Polydipsia", "Sudden weight loss", "Weakness", "Polyphagia", "Genital thrush", "Visual blurring", "Itching", "Irritability", "Delayed healing", "Partial paresis", "Muscle stiffness", "Alopecia", and "Obesity" are all potential risk predictors for diabetes via clustering, but the approach needs more work and adjustment before its predictions can be made with confidence.

### 7 Source

---

**7.1 Citation**

Himanshu (rcratos).(n.d.). Diabetes Risk Prediction [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/rcratos/diabetes-risk-prediction

### 8 Github

---

**GitHub Link:** https://github.com/kpnguyenco/Unsupervised-Data-Analysis-Diabetes-Risk.git