### **Problem :**

Predicting type 2 diabetes mellitus (T2DM) at an early stage can lead to improved treatment and a better quality of life. Early diagnosis is challenging, however: no single clinical parameter is sufficient to diagnose diabetes accurately, and relying on one can mislead the decision-making process. Several parameters therefore need to be combined to predict diabetes effectively at an early stage. Your task is to develop a machine learning model that predicts whether a person has diabetes based on a subset of their clinical data.


Read more:

https://www.nature.com/articles/s41598-020-61123-x

https://www.sciencedirect.com/science/article/pii/S2352914819300176 (dataset description)



### **Dataset :**

The dataset is part of a larger dataset held at the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), which belongs to the NIH. The target variable is "Outcome": 1 indicates a positive diabetes test result, 0 a negative one.

### **Variables :**
* Pregnancies    : Number of pregnancies
* Glucose        : 2-hour plasma glucose concentration in the oral glucose tolerance test
* BloodPressure  : Diastolic blood pressure (mm Hg)
* SkinThickness  : Triceps skin fold thickness (mm)
* Insulin        : 2-hour serum insulin (mu U/ml)
* DiabetesPedigreeFunction : Score summarizing the likelihood of diabetes based on family history
* BMI            : Body mass index (weight in kg / (height in m)^2)
* Age            : Age (years)
* Outcome        : Have the disease(1) or not (0)

### 1. Import Libraries (Nothing to do)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import os
# !pip install missingno
import missingno as msno
from datetime import date
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, RobustScaler

In [None]:
# some adjustments
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.width', 500)

### 🔎 1. Reading the Dataset (Nothing to do)

Alternatively, you can use Colab's [import file feature](https://colab.research.google.com/notebooks/io.ipynb).

In [None]:


from google.colab import files
uploaded = files.upload()

df_ = pd.read_csv('diabetes.csv')
df = df_.copy()

In [None]:
# auxiliary functions
def check_df(dataframe):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(3))
    print("##################### Tail #####################")
    print(dataframe.tail(3))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    # numeric_only=True keeps this working if non-numeric columns are present
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1], numeric_only=True).T)


def grab_col_names(dataframe, cat_th=10, car_th=20):
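    """
    Split the columns of a dataframe into categorical, numerical, and
    categorical-but-cardinal groups.

    Parameters
    ------
        dataframe: dataframe
                Dataframe whose columns are to be grouped
        cat_th: int, optional
                Numeric columns with fewer unique values than this are treated as categorical
        car_th: int, optional
                Object columns with more unique values than this are treated as cardinal

    Returns
    ------
        cat_cols: list
                Categorical features
        num_cols: list
                Numerical features
        cat_but_car: list
                Categorical-looking features with high cardinality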

    Examples
    ------
        import seaborn as sns
        df = sns.load_dataset("iris")
        print(grab_col_names(df))

    """

    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # num_cols
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f'cat_cols: {len(cat_cols)}')
    print(f'num_cols: {len(num_cols)}')
    print(f'cat_but_car: {len(cat_but_car)}')
    print(f'num_but_cat: {len(num_but_cat)}')
    return cat_cols, num_cols, cat_but_car




def missing_values_table(dataframe, na_name=False):
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)
    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    missing_df = pd.concat([n_miss, np.round(ratio, 2)], axis=1, keys=['n_miss', 'ratio'])
    print(missing_df, end="\n")
    if na_name:
        return na_columns

def missing_vs_target(dataframe, target, na_columns):
    temp_df = dataframe.copy()

    for col in na_columns:
        temp_df[col + '_NA_FLAG'] = np.where(temp_df[col].isnull(), 1, 0)

    na_flags = temp_df.loc[:, temp_df.columns.str.contains("_NA_")].columns

    for col in na_flags:
        print(pd.DataFrame({"TARGET_MEAN": temp_df.groupby(col)[target].mean(),
                            "Count": temp_df.groupby(col)[target].count()}), end="\n\n\n")

df.columns = [col.upper() for col in df.columns]

### 🔎 2. Data preparation

In [None]:
check_df(df)

In [None]:
cat_cols, num_cols, cat_but_car = grab_col_names(df)

### 2.1 Data exploration

❗**Plot the label distribution. What is the ratio of positive cases?**
(Hint: use the function cat_summary, which uses seaborn's countplot: https://seaborn.pydata.org/generated/seaborn.countplot.html.

Alternatively, you can use the histplot method directly: https://seaborn.pydata.org/generated/seaborn.histplot.html.)

In [None]:
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("##########################################")
    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show()


cat_summary(df, "OUTCOME",plot=True)






❗**Summarize the features. Plot the distribution of each feature. Do you notice anything weird?**

(Hint: use the histplot seaborn method: https://seaborn.pydata.org/generated/seaborn.histplot.html)

Try out the pairplot method from seaborn as well!

In [None]:

def num_summary(dataframe, numerical_col, plot= False):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot:
        dataframe[numerical_col].hist(bins=20)
        plt.xlabel(numerical_col)
        plt.title(numerical_col)
        plt.show()

#for col in num_cols:
#    num_summary(df, col,plot=True)

#sns.pairplot(df, hue='OUTCOME')

for col in df.columns:

  sns.histplot(data=df, x=col)
  plt.show()



❗**Are patients diagnosed with diabetes older or younger than the rest?** (You can use the function below, or write your own script directly.)

In [None]:
def target_summary_with_num(dataframe, target, numerical_col):
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n\n")

target_summary_with_num(df,"OUTCOME","AGE")


❗**Plot the correlation matrix between the different features (outcome included). Which feature is most strongly correlated with the target?** (Hint: use the dataframe's `corr` method. You can plot the result with a seaborn heatmap.)
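
To read the answer off without inspecting every heatmap cell, one option is to sort the target column of the correlation matrix. The sketch below uses a small made-up frame (the values are illustrative and not taken from the dataset); on the notebook's data you would apply the same expression to `df`.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the diabetes dataset.
toy = pd.DataFrame({
    "GLUCOSE": [85, 90, 150, 160, 120, 170],
    "AGE":     [25, 40, 30, 55, 35, 45],
    "OUTCOME": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of every feature with the target,
# ordered by absolute strength (strongest first).
target_corr = (
    toy.corr()["OUTCOME"]
    .drop("OUTCOME")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which a plain descending sort would push to the bottom.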

In [None]:
f, ax = plt.subplots(figsize=[7, 5])
sns.heatmap(df.corr(), annot=True, fmt=".2f", ax=ax, cmap="YlGnBu")
ax.set_title("Correlation Matrix", fontsize=20)
plt.show()

### 🔎 2.2 Missing Value Analysis

At first glance there are no missing values (NA values).



In [None]:
df.isna().sum()


However, if you look at BLOODPRESSURE (or BMI or INSULIN, for example), you will notice zero values. These are not physiologically plausible and most likely stand for missing measurements.

In [None]:
(df==0).sum()

Let's replace those values with NaN

In [None]:
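# Zeros in these columns are treated as missing measurements.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
zero_as_missing = ["GLUCOSE", "BLOODPRESSURE", "SKINTHICKNESS", "INSULIN", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)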

In [None]:
na_cols = missing_values_table(df, True)

❗**How would you handle missing data?** Hint: you can use the function below.
Alternatively, use the fillna method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html.

Some examples:
https://towardsdatascience.com/8-methods-for-handling-missing-values-with-python-pandas-842544cdf891
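
Filling a feature with the median of its outcome group is convenient for exploration, but it requires the label, which is not known for a new patient. As a minimal alternative sketch, assuming scikit-learn's `SimpleImputer` and two toy frames with made-up values, the medians can be learned from the training rows only and then reused on unseen rows:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training and test frames (illustrative values only).
train = pd.DataFrame({"INSULIN": [100.0, np.nan, 140.0, 180.0],
                      "BMI": [22.0, 30.0, np.nan, 34.0]})
test = pd.DataFrame({"INSULIN": [np.nan, 90.0],
                     "BMI": [28.0, np.nan]})

imputer = SimpleImputer(strategy="median")
# Learn the per-column medians from the training rows only...
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# ...and reuse those same medians for unseen rows.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(imputer.statistics_)  # medians learned from the training data
print(test_filled)
```

Whether this or the group-wise median is preferable depends on the goal; the point of the sketch is only that the fill values never depend on the test rows or on the label.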




In [None]:
def median_target(variable):
    temp = df[df[variable].notna()]
    temp = temp[[variable, 'OUTCOME']].groupby(['OUTCOME'])[[variable]].median().reset_index()
    return temp

In [None]:



columns = df.columns
columns = columns.drop("OUTCOME")

for col in columns:
    df.loc[(df['OUTCOME'] == 0) & (df[col].isna()), col] = median_target(col)[col][0]
    df.loc[(df['OUTCOME'] == 1) & (df[col].isna()), col] = median_target(col)[col][1]




In [None]:
cat_cols, num_cols, cat_but_car = grab_col_names(df)

### 2.3 One-Hot Encoding

In [None]:
df = pd.get_dummies(df[cat_cols + num_cols], drop_first=True)

### 2.4 Feature Standardization

❗**Why do we need feature standardization?** (Hint: to implement it, use a scaler object from scikit-learn.)
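
As a rough illustration of what a scaler does, the sketch below standardizes two made-up columns that live on very different scales. `StandardScaler` applies $z = (x - \mu) / \sigma$ to each column, so every column ends up with mean 0 and unit variance. This tends to matter for models that are sensitive to feature scale, such as regularized logistic regression, SVMs, or k-nearest neighbours.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X = np.array([[80.0, 0.2],
              [120.0, 0.5],
              [160.0, 0.8],
              [200.0, 1.1]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand: subtract the column mean, divide by the column std
# (StandardScaler uses the population standard deviation, i.e. ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```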

In [None]:
# Standardization for numerical cols
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
df.head()


### 🔎 3. Model training and evaluation


### 3.1 Simple model

❗**Train an ML model!** The cell below splits the data into training and test sets (70/30). First, try out a [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression). What accuracy do you get?

In [None]:
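y = df["OUTCOME"]
X = df.drop(["OUTCOME"], axis=1)
# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=17)


from sklearn.linear_model import LogisticRegression
# The variable is called rf_model because later cells refer to that name;
# at this point it holds a logistic regression model.
rf_model = LogisticRegression(random_state=46).fit(X_train, y_train)

y_pred = rf_model.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred) in that order
accuracy_score(y_test, y_pred)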






### 3.2 Evaluation metrics

❗**Plot the results. What other metrics than accuracy can you think of?** Hint: [ROC](https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python). What about precision-recall? Confusion m...?
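
To make the relation between these metrics concrete, the sketch below derives precision, recall, and F1 from a confusion matrix. The labels and predictions are made up for ten hypothetical patients and do not come from the model in this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels and predictions for ten patients (1 = diabetic).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn), also called sensitivity
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall

print(tn, fp, fn, tp)
print(precision, recall, f1)
```

In this toy case accuracy is 0.7 even though half of the diabetic patients are missed (recall 0.5), which is why accuracy alone can be misleading when the classes are imbalanced.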

In [None]:
import sklearn.metrics as metrics
# calculate the fpr and tpr for all thresholds of the classification
probs = rf_model.predict_proba(X_test)
y_pred_proba = probs[:,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Logistic Regression, auc="+str(auc))
plt.legend(loc=4)
plt.show()


from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test,  y_pred_proba)
ap=metrics.average_precision_score(y_test, y_pred_proba)
plt.plot(recall,precision,label="Logistic Regression, ap="+str(ap))
plt.legend(loc=4)
plt.show()

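from sklearn.metrics import confusion_matrix
# Get the confusion matrix (rows: true class, columns: predicted class)
cf_matrix = confusion_matrix(y_test, y_pred)
# fmt="d" shows the counts as integers instead of scientific notation
sns.heatmap(cf_matrix, annot=True, fmt="d")
plt.show()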



**Robust evaluation**

❗Evaluate your model's performance on multiple train-test splits. What is the average performance of your model? Try out different metrics.  

Hint: [Cross Validation](https://scikit-learn.org/stable/modules/cross_validation.html). You can either implement it on your own (for loop) or use the functions available in sklearn.
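
For the "implement it on your own" route, a hand-written loop could look like the sketch below. It assumes a binary target and uses synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`; `StratifiedKFold` is one possible splitter, chosen here because it keeps the class ratio similar in every fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the notebook's X and y (not the diabetes data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in cv.split(X_demo, y_demo):
    # Fit a fresh model on the training folds only.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score the held-out fold with ROC AUC on the predicted probabilities.
    proba = model.predict_proba(X_demo[test_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_demo[test_idx], proba))

print(np.mean(fold_scores), np.std(fold_scores))
```

With a pandas dataframe, the row selection would use `X.iloc[train_idx]` and `y.iloc[train_idx]` instead of plain indexing.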

In [None]:

#cross validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
alt_cv=ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=alt_cv, scoring='roc_auc')
print(scores.mean(), scores.std())

### 3.3 Compare different types of classifiers

❗Try an SVM or a Random Forest. Hint: Check out the [sklearn](https://scikit-learn.org/stable/supervised_learning.html) classifiers.
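
One way to compare classifiers on equal footing is to score each of them with the same cross-validation setup. The sketch below does this for an SVM and a random forest on synthetic data (a stand-in for the notebook's `X` and `y`, not the diabetes data). The SVM is wrapped in a pipeline with a scaler, an assumption made here because SVMs are sensitive to feature scale; the hyperparameters are scikit-learn defaults, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in candidates.items():
    # Same folds and same metric for every candidate.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```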

In [None]:
y = df["OUTCOME"]
X = df.drop(["OUTCOME"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=17)
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(random_state=46).fit(X_train, y_train)


y_pred = rf_model.predict(X_test)
# print explicitly: only the last expression of a cell is displayed automatically
print(accuracy_score(y_test, y_pred))


import sklearn.metrics as metrics
# calculate the fpr and tpr for all thresholds of the classification
probs = rf_model.predict_proba(X_test)
y_pred_proba = probs[:,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Random Forest, auc="+str(auc))
plt.legend(loc=4)
plt.show()


from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test,  y_pred_proba)
ap=metrics.average_precision_score(y_test, y_pred_proba)
plt.plot(recall,precision,label="Random Forest, ap="+str(ap))
plt.legend(loc=4)
plt.show()

from sklearn.metrics import confusion_matrix
# Get the confusion matrix (rows: true class, columns: predicted class)
cf_matrix = confusion_matrix(y_test, y_pred)
# fmt="d" shows the counts as integers instead of scientific notation
sns.heatmap(cf_matrix, annot=True, fmt="d")
plt.show()



## 4. Discussion:

❗ **What is the problem you want to solve? Why is it important?**

❗ **Are there any challenges with the data?**

❗ **What are the challenges of deploying such a model in a clinical setting? Would it work?**

-----