<a href="https://colab.research.google.com/github/klmartinez/DSF/blob/main/activities/diabetes_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Learning Algorithm for Predicting Diabetes

Kiana Lee Martinez\
kianalee@arizona.edu\
last updated: 2022-11-23

This notebook adapts code from Hossein Faridnasr's [Diabetes Prediction & Model Selection Accuracy>83%](https://www.kaggle.com/code/hosseinfaridnasr/diabetes-prediction-model-selection-accuracy-83) and Carlos Lizarraga's [Introduction to Unsupervised Learning Algorithms](https://github.com/clizarraga-UAD7/Notebooks/blob/main/Intro_UnsupervisedLearning.ipynb).

# Loading necessary packages and our data

In [1]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier, RidgeClassifierCV
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
sns.set(rc={'figure.figsize':(11.7,8.27)})

In [2]:
# read in diabetes dataset
df = pd.read_csv('../input/predict-diabities/diabetes.csv')

# Getting to know our dataset

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

All of our features are numeric, so there is no need to encode categorical features.
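
The claim above can also be checked programmatically rather than by reading the `df.info()` output. The following is a minimal sketch on a toy DataFrame; the column names and values are hypothetical placeholders, not the actual dataset.

```python
import pandas as pd

# Toy stand-in for the real data; column names and values are hypothetical.
demo = pd.DataFrame({"Glucose": [148, 85], "BMI": [33.6, 26.6]})

# Columns whose dtype is not numeric; an empty list means every feature is numeric.
non_numeric = demo.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Running the same `select_dtypes` call on `df` would list any column that needs encoding before modeling.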

In [None]:
print(f'number of duplicate rows: {df.duplicated().sum()}\nnumber of null values:\n{df.isna().sum()}')

Fortunately, there are no null or duplicate values in our dataset so we can continue.

## Exploratory Data Analysis (EDA)

In [None]:
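# Color points by the target, taken to be the last column as in the modeling section;
# the two palette colors map to the two levels of hue.
sns.pairplot(df, hue=df.columns[-1], palette=["#8000ff","#da8829"])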

In [None]:
sns.heatmap(df.corr(),annot=True)
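
A heatmap shows every pairwise correlation at once, but for modeling the correlations with the target are usually the most interesting. The sketch below, on synthetic data with hypothetical column names, ranks features by the absolute value of their correlation with the last column, which is assumed to be the target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: feature_a drives the target, feature_b is pure noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
demo["target"] = (demo["feature_a"] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Assume the target is the last column, then rank the features by |correlation|.
target_col = demo.columns[-1]
corr_with_target = (
    demo.corr()[target_col]
    .drop(target_col)
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```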

# Modeling  
First of all, we split our dataset into a training and a test dataset.

In [None]:
data = df.values
X, y = data[:,:-1], data[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)
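
The split above is purely random, so the share of positive cases can differ between the training and test sets. One optional variation is to pass `stratify` to `train_test_split`. The sketch below uses synthetic data (100 rows, 30% positives) as an assumption, not the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data: 100 rows, 8 features, 30 positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio the same in the training and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```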

Here we are going to define all of our models:

In [None]:
lr_model = LogisticRegression(max_iter = 10000)
ridge_model = RidgeClassifier()
ridgecv_model = RidgeClassifierCV()
gpc = GaussianProcessClassifier()
tr = tree.DecisionTreeClassifier()
knn = KNeighborsClassifier(n_neighbors=3)
svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))
rf = RandomForestClassifier(max_depth=2)

The model_metrics function below fits a model on the training data and returns its test-set accuracy, precision, recall, f1_score and AUC as a dictionary.

In [None]:
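def model_metrics(model, X_test, y_test, decimals = 5):
    """Fit `model` on the training split and return its test-set metrics as a dict."""
    # Note: X_train and y_train are read from the notebook's global scope.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Round each metric to `decimals` places
    acc = np.round(accuracy_score(y_test, y_pred),decimals)
    pre = np.round(precision_score(y_test, y_pred),decimals)
    rec = np.round(recall_score(y_test, y_pred),decimals)
    f1 = np.round(f1_score(y_test, y_pred),decimals)
    # AUC is computed from the hard class predictions, not from probability scores
    auc = np.round(roc_auc_score(y_test, y_pred),decimals)
    return {'accuracy': acc, 'precision': pre, 'recall': rec, 'f1_score': f1, 'auc': auc}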
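
Note that `model_metrics` passes hard 0/1 predictions to `roc_auc_score`. AUC is more commonly computed from continuous scores, which lets it measure how well the model ranks cases. The sketch below compares the two on synthetic data from `make_classification`; the data and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)

# AUC from hard class predictions (what model_metrics computes).
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the predicted probability of the positive class.
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(auc_from_labels, auc_from_scores)
```

Not every model exposes `predict_proba`; `RidgeClassifier`, for example, provides `decision_function` instead, whose output can be passed to `roc_auc_score` in the same way.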

We combine the metrics for our models into a dataframe:

In [None]:
results = pd.DataFrame(
    [
        model_metrics(lr_model, X_test, y_test),
        model_metrics(ridge_model, X_test, y_test),
        model_metrics(ridgecv_model, X_test, y_test),
        model_metrics(gpc, X_test, y_test),
        model_metrics(tr, X_test, y_test),
        model_metrics(knn, X_test, y_test),
        model_metrics(svc, X_test, y_test),
        model_metrics(rf, X_test, y_test)
    ], 
    index = ['LogisticRegression', 'RidgeClassifier', 'RidgeClassifierCV', 'GaussianProcessClassifier', 'DecisionTreeClassifier', 'KNeighborsClassifier', 'SupportVectorClassification', 'RandomForestClassifier']) \
.reset_index() \
.rename(columns={'index':'model'})
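
Because the model names and the metric rows are listed separately, they can drift out of order when a model is added or removed. One possible alternative, sketched here on synthetic data with only two models, keeps each name next to its estimator in a dictionary. The names `demo_models` and `demo_results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Each display name is stored alongside its estimator.
demo_models = {
    "LogisticRegression": LogisticRegression(max_iter=10000),
    "RidgeClassifier": RidgeClassifier(),
}

rows = []
for name, model in demo_models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_pred),
        "f1_score": f1_score(y_te, y_pred),
    })

demo_results = pd.DataFrame(rows)
print(demo_results)
```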

## Model Selection  
Now let's see which of our models performed best.

In [None]:
results.sort_values(['accuracy', 'f1_score', 'auc'],
              ascending = [False, False, False])

Let's visualize the performance of our models:

In [None]:
results = results.sort_values('accuracy', ascending = False)
plt.xticks(rotation=45)
sns.barplot(x = results['model'], y=results['accuracy']).set_title('Model Performance based on Accuracy')

In [None]:
results = results.sort_values('f1_score', ascending = False)
plt.xticks(rotation=45)
sns.barplot(x = results['model'], y=results['f1_score']).set_title('Model Performance based on the f1_score')

In [None]:
results = results.sort_values('auc', ascending = False)
plt.xticks(rotation=45)
sns.barplot(x = results['model'], y=results['auc']).set_title('Model Performance based on AUC(Area Under Curve)')

# Conclusion and future projects  
Under the conditions of our modeling process, the RidgeClassifier outperformed the other models in terms of accuracy, f1_score and AUC, so in this setting it is the best model. However, the performance of our models depends on the random_state argument used when splitting the data, and the ranking could change with a different value.
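
One common way to reduce the dependence on a single split is k-fold cross-validation, which averages the score over several train/test partitions. The sketch below applies it to a `RidgeClassifier` on synthetic data; the data and the choice of five folds are assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Five stratified folds: every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RidgeClassifier(), X_demo, y_demo, cv=cv, scoring="accuracy")

# The mean summarizes performance; the standard deviation shows the split-to-split spread.
print(scores.mean(), scores.std())
```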

In future projects we could do more Exploratory Data Analysis (EDA) on our dataset to better understand the relationships between the variables. We could also try a neural network for classifying the patients and compare its performance with the models above. Tuning the hyperparameters of our current models could improve their performance as well.
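
As an illustration of adjusting model arguments, the following sketch searches over `n_neighbors` for a scaled k-nearest-neighbors pipeline using `GridSearchCV`. The synthetic data and the candidate values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}

# Evaluate every candidate with 5-fold cross-validation and keep the best one.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.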
