# Pima Diabetes Database

The dataset for this notebook is available at: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Based on the analysis by https://www.kaggle.com/code/lipinor/eda-feature-engineering-0-83-auc

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import sys

sys.path.append("../")

We load the data and check out what the variables look like: 

In [8]:
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
print(df.shape)  # (number of rows, number of columns)
df.head()

### Exploratory Data Analysis

We'll check for NULL values and missing data that we might need to impute:

In [4]:
df.isnull().sum()
# no null values

In [5]:
df.drop(["Outcome"], axis=1).hist(bins=50, figsize=(20, 15))
plt.show()

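The histograms show zeros for [Glucose], [BloodPressure], [SkinThickness], [Insulin], and [BMI]. A value of 0 is not physiologically plausible for these measurements, so the zeros most likely stand for missing data. We don't want to drop these rows, since there is a considerable number of them: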

In [9]:
miss_columns = ["Glucose", "BloodPressure", "SkinThickness", "BMI", "Insulin"]
df[miss_columns].isin([0]).sum()

We'll replace these zeros with NaN so that SimpleImputer can fill them in later.

In [10]:
df[miss_columns] = df[miss_columns].replace(0, np.nan)  # np.NaN was removed in NumPy 2.0
df.isnull().sum()
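
To make the role of the NaN markers concrete, here is a minimal sketch of what `SimpleImputer(strategy="mean")` does with them. The toy column is made up for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value (made-up numbers, not from the dataset).
toy = np.array([[100.0], [120.0], [np.nan], [140.0]])

imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(toy)

# The learned statistic is the mean of the observed values:
# (100 + 120 + 140) / 3 = 120
print(imputer.statistics_)
print(filled.ravel())
```

Had the zeros been left in place, they would have been treated as real measurements and pulled the column mean down instead of being filled in.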

We should look at the distribution of diabetics and non-diabetics:

In [15]:
print("Percentage of 1 (diabetic): {:.0%}".format(np.mean(df["Outcome"])))
print("Percentage of 0 (non-diabetic): {:.0%}".format(1 - np.mean(df["Outcome"])))

The classes are imbalanced, so we need to take that into account when splitting the data into training and test sets (for example, by stratifying on the target).
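
Besides stratifying the split, another common way to account for imbalance is to weight the classes. This notebook does not do that; the sketch below only illustrates the idea with scikit-learn's `compute_class_weight`, and the 65/35 toy labels are an assumption made for the example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a made-up 65/35 split (assumption, not the real data).
y_toy = np.array([0] * 65 + [1] * 35)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
print(weight_by_class)
```

A dictionary like `weight_by_class` could be passed as `class_weight` to classifiers that accept it, such as `RandomForestClassifier`.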

### Visual exploration
We'll quickly explore the data visually to get a sense of which features are most useful for separating the target variable.

In [17]:
p = sns.pairplot(df, hue = 'Outcome')
plt.show()

It seems like [Glucose] will separate the target variable nicely, as we'll see later. 
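
One rough way to put a number on how well a single feature separates the classes is to use the raw feature values as scores for ROC AUC. The sketch below uses made-up toy values, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and glucose readings (made up for illustration).
y_toy = np.array([0, 0, 0, 1, 1, 1])
glucose_toy = np.array([85, 90, 120, 110, 150, 160])

# AUC is the fraction of (diabetic, non-diabetic) pairs in which the
# diabetic patient has the higher value: here 8 of 9 pairs.
auc = roc_auc_score(y_toy, glucose_toy)
print(f"Single-feature ROC AUC: {auc:.3f}")
```

On the real data, rows where the feature is NaN would have to be dropped or imputed first, since `roc_auc_score` does not accept missing values.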

### Baseline model

We split the data into train and test sets, stratifying on the target so that both sets keep the same class proportions:

In [18]:
from sklearn.model_selection import train_test_split

target_df = df["Outcome"]
features_df = df.drop(["Outcome"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    features_df, target_df, test_size=0.3, random_state=42, stratify=target_df
)

In [23]:
print("% of 0 (non-diabetic) on train set: {:.2%}".format(1-np.mean(y_train)))
print("% of 1 (diabetic) on train set: {:.2%}".format(np.mean(y_train)))
print("% of 0 (non-diabetic) on test set: {:.2%}".format(1-np.mean(y_test)))
print("% of 1 (diabetic) on test set: {:.2%}".format(np.mean(y_test)))

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score, confusion_matrix, precision_score, recall_score

def make_pipeline(classifier):
    """Create a pipeline for a classifier."""
    steps = [
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("classifier", classifier),
    ]

    return Pipeline(steps)


def score_model(pipeline, test_set, labels=None):
    """Score a fitted pipeline using accuracy, ROC AUC, precision and recall.

    If labels is None, the global y_test is used as ground truth.
    """
    if labels is None:
        labels = y_test
    predict = pipeline.predict(test_set)
    # Probability of the positive class (diabetic), needed for ROC AUC
    predict_proba = pipeline.predict_proba(test_set)[:, 1]

    metrics = {"accuracy": pipeline.score(test_set, labels),
               "roc_auc": roc_auc_score(labels, predict_proba),
               "precision": precision_score(labels, predict),
               "recall": recall_score(labels, predict)}

    return metrics

def plot_cm(labels, predictions):
    """Plot the confusion matrix as an annotated heatmap."""
    cm = confusion_matrix(labels, predictions)
    plt.figure(figsize=(5, 5))
    sns.heatmap(cm, annot=True, fmt="d")
    plt.title("Confusion matrix")
    plt.ylabel("Actual label")
    plt.xlabel("Predicted label")

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

pipe_rfc = make_pipeline(rfc)

pipe_rfc.fit(X_train, y_train)

score_rfc = score_model(pipe_rfc, X_test)

print("Accuracy: {:.4f}".format(score_rfc['accuracy']))
print("ROC AUC: {:.4f}".format(score_rfc['roc_auc']))
print("Precision: {:.4f}".format(score_rfc['precision']))
print("Recall: {:.4f}".format(score_rfc['recall']))
plot_cm(y_test, pipe_rfc.predict(X_test))
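
To read the heatmap alongside the printed metrics, it helps to see how precision and recall fall out of the four cells of the confusion matrix. The sketch below uses made-up toy predictions, not this model's output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy ground truth and predictions (made up for illustration).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are actual labels, columns are predicted labels, so for a binary
# problem ravel() yields: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted diabetics, how many are right
recall = tp / (tp + fn)     # of the actual diabetics, how many were found
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

For a screening task, a low recall means diabetic patients are being missed, which is why the notebook reports it next to accuracy.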

Let's check the importance of each feature:

In [None]:
# Pair each column name with its importance in the fitted forest
for name, importance in zip(X_train.columns, rfc.feature_importances_):
    print("{0:s}: {1:.2f}".format(name, importance))
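
A sorted view often makes the ranking easier to read than a plain loop. The sketch below shows one way to do that with a pandas Series; it runs on synthetic data from `make_classification`, and the column names `f0` to `f3` are placeholders, not the dataset's features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the Pima dataset).
X_arr, y_arr = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_arr)

# Label each importance with its column name and sort from high to low.
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances.round(2))
```

Impurity-based importances sum to 1, so each value can be read as a share of the total.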

### Feature Engineering

Here we build some new features to try to improve the results. Since Glucose, BMI and Age are the three most important features, we focus on these.

In [None]:
X_train_new = X_train.copy()
X_test_new = X_test.copy()

In [None]:
def plot_scatter(feature1, feature2):
    """Scatter two training features against each other, colored by outcome."""
    sns.scatterplot(data=X_train, x=feature1, y=feature2, hue=y_train)
    plt.show()

In [None]:
plot_scatter("Glucose", "Age")

It seems that most non-diabetic patients have Age <= 30 and Glucose <= 150. The flag below uses a stricter Glucose cutoff of 125:

In [None]:
X_train_new.loc[:, "N1"] = 0
X_train_new.loc[(X_train["Age"] <= 30) & (X_train["Glucose"] <= 125), "N1"] = 1

X_test_new.loc[:, "N1"] = 0
X_test_new.loc[(X_test["Age"] <= 30) & (X_test["Glucose"] <= 125), "N1"] = 1

X_train_new["N1"].value_counts()

In [None]:
plot_scatter("BMI", "Age")

In [None]:
X_train_new.loc[:, "N2"] = 0
X_train_new.loc[(X_train["Age"] <= 30) & (X_train["BMI"] <= 30), "N2"] = 1

X_test_new.loc[:, "N2"] = 0
X_test_new.loc[(X_test["Age"] <= 30) & (X_test["BMI"] <= 30), "N2"] = 1

X_train_new["N2"].value_counts()

In [None]:
plot_scatter("Glucose", "BMI")

In [None]:
X_train_new.loc[:, "N3"] = 0
X_train_new.loc[(X_train["Glucose"] <= 100) & (X_train["BMI"] <= 40), "N3"] = 1

X_test_new.loc[:, "N3"] = 0
X_test_new.loc[(X_test["Glucose"] <= 100) & (X_test["BMI"] <= 40), "N3"] = 1

X_train_new["N3"].value_counts()
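
The three flag cells above follow the same pattern, which could be factored into a small helper. The sketch below is one possible version; `add_flag` is a hypothetical name and the toy frame is made up. It also shows a detail worth knowing: comparisons against NaN evaluate to False, so rows with a missing value in either column get a flag of 0.

```python
import numpy as np
import pandas as pd

def add_flag(frame, name, mask):
    """Return a copy of frame with a 0/1 column built from a boolean mask."""
    out = frame.copy()
    out[name] = mask.astype(int)
    return out

# Toy rows (made up); the last one has a missing Glucose value.
toy = pd.DataFrame({
    "Age": [25, 45, 28, 22],
    "Glucose": [110.0, 100.0, 160.0, np.nan],
})

# NaN <= 125 is False, so the last row is flagged 0 despite Age <= 30.
mask = (toy["Age"] <= 30) & (toy["Glucose"] <= 125)
toy_flagged = add_flag(toy, "N1", mask)
print(toy_flagged)
```

Applying the same helper to the train and test frames keeps the thresholds in one place, so the two sets cannot drift apart.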

In [None]:
pipe_rfc.fit(X_train_new, y_train)

score_rfc = score_model(pipe_rfc, X_test_new)

print("Accuracy: {:.4f}".format(score_rfc['accuracy']))
print("ROC AUC: {:.4f}".format(score_rfc['roc_auc']))
print("Precision: {:.4f}".format(score_rfc['precision']))
print("Recall: {:.4f}".format(score_rfc['recall']))
plot_cm(y_test, pipe_rfc.predict(X_test_new))

In [None]:
# Pair each column name (including N1-N3) with its importance in the refitted forest
for name, importance in zip(X_train_new.columns, rfc.feature_importances_):
    print("{0:s}: {1:.2f}".format(name, importance))

Compared with the metrics of the baseline model, the feature engineering seems to have improved our results.
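
A single train/test split can be noisy, so a comparison like this is often double-checked with cross-validation. The sketch below shows the mechanics on synthetic data; the data, the engineered flag and its thresholds are all assumptions made for illustration, and it makes no claim about which variant wins on the Pima dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the Pima dataset).
X_base, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Hypothetical engineered flag, analogous to N1-N3: 1 when two features
# are both at or below a threshold.
flag = ((X_base[:, 0] <= 0) & (X_base[:, 1] <= 0)).astype(int).reshape(-1, 1)
X_eng = np.hstack([X_base, flag])

# Same folds and same model settings for both feature sets.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=42)

auc_base = cross_val_score(model, X_base, y_syn, cv=cv, scoring="roc_auc")
auc_eng = cross_val_score(model, X_eng, y_syn, cv=cv, scoring="roc_auc")

print(f"baseline:   {auc_base.mean():.3f} +/- {auc_base.std():.3f}")
print(f"engineered: {auc_eng.mean():.3f} +/- {auc_eng.std():.3f}")
```

If the difference between the two means is small relative to the fold-to-fold spread, the improvement may not be reliable.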