# Getting Started

This notebook covers the following:

* Exploratory data analysis.
* Predicting the outcome of a diabetes diagnosis from patients' symptoms, using seven different models plus one voting ensemble.
* Comparing each model's prediction performance.
* Experimenting with different feature selection methods.

With this notebook, I hope to learn more about data visualization and predicting outcomes using Python. Any upvotes, comments, and suggestions are much appreciated!

# Importing Libraries and Dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import seaborn as sns
sns.set_palette("bwr")
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
%matplotlib inline

### Taking a look at our dataset

In [None]:
df = pd.read_csv("/kaggle/input/early-stage-diabetes-risk-prediction-dataset/diabetes_data_upload.csv")
df.head()

In [None]:
df.describe(include="all")

# Cleaning Data

### Checking null values

In [None]:
sns.heatmap(df.isnull(), cbar=False)
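
The heatmap makes gaps easy to spot, but exact counts per column can be handy as well. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `df`, and its values are invented) to show one way of getting those counts with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df`; the values are made up.
toy = pd.DataFrame({
    "Age": [40, 58, np.nan, 45],
    "Polyuria": ["No", "Yes", "No", None],
    "class": ["Positive", "Positive", "Negative", "Negative"],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# True only if the frame has no missing values at all
has_no_nulls = not toy.isnull().values.any()
print("No missing values:", has_no_nulls)
```

Running the same two expressions on `df` should give the numeric counterpart of the heatmap above.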

### Standardizing column names and mapping boolean values

In [None]:
# Strip whitespace, make column names lowercase, and convert spaces to underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

In [None]:
# Map yes/no values
one_values = ["Male", "Positive", "Yes"]
zero_values = ["Female", "Negative", "No"]

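for column in df.columns:
    # Pass the flat lists directly; wrapping them in another list would make
    # pandas look for the whole list as a single value
    df[column] = df[column].replace(to_replace=one_values, value=1)
    df[column] = df[column].replace(to_replace=zero_values, value=0)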
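
An alternative to two `replace` calls is a single explicit lookup table. This is only a sketch on made-up data: `toy`, `binary_map`, and `encoded` are hypothetical names, and the approach assumes every text value appears in the table.

```python
import pandas as pd

# Hypothetical toy frame; column names mimic the cleaned dataset.
toy = pd.DataFrame({
    "age": [40, 58, 41],
    "gender": ["Male", "Female", "Male"],
    "polyuria": ["No", "Yes", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

# One explicit lookup table for every label
binary_map = {"Male": 1, "Positive": 1, "Yes": 1,
              "Female": 0, "Negative": 0, "No": 0}

# Map only the text columns; the numeric age column is left alone
encoded = toy.copy()
for column in toy.columns.drop("age"):
    encoded[column] = encoded[column].map(binary_map)

# Labels missing from the table would become NaN, so this is a sanity check
assert not encoded.isnull().values.any()
print(encoded.dtypes)
```

A side benefit of `map` is that an unexpected label surfaces as a missing value instead of silently staying a string.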

We will also rename the "class" column to "status".

In [None]:
df = df.rename({"class": "status"}, axis = "columns")
df.head()

# Exploring Our Data

### Status and gender distribution

In [None]:
# Defining a function to plot a simple pie chart
def plotPie(value, title, label):
    # Count the 1s first, then the 0s, so the slices always match the label order
    # (value_counts() alone sorts by frequency, which can swap the labels)
    counts = value.value_counts().reindex([1, 0], fill_value=0)
    plt.figure(figsize=(4,4))
    plt.pie(
        counts,
        startangle=90,
        labels = label,
        autopct=(lambda p:f'{p:.2f}%\n{p*counts.sum()/100 :.0f} items')
    )
    plt.title(title)
    plt.show()

plotPie(df["status"], "Status distribution", ["Positive", "Negative"])
plotPie(df["gender"], "Gender distribution", ["Male", "Female"])

### Age distribution

In [None]:
plt.figure(figsize=(5,5))

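# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
ax = sns.histplot(df["age"], kde=True, color="r")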

### Status in relation with gender

In [None]:
ax = sns.countplot(x="status", data=df, hue="gender")

### Status in relation with age

In [None]:
ax = sns.violinplot(x="status", y="age", data=df)

In [None]:
# Divide data into positive and negative class data
df_pos = df[df["status"] == 1]
df_neg = df[df["status"] == 0]

In [None]:
print("Average positive age:", df_pos["age"].mean())
print("Average negative age:", df_neg["age"].mean())
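
The two means above can also be computed in one step with a group-by, which makes it easy to add other statistics. A minimal sketch on invented numbers (`toy` is a hypothetical stand-in for `df`):

```python
import pandas as pd

# Made-up ages; status 1 = positive, 0 = negative
toy = pd.DataFrame({
    "age": [30, 40, 50, 60, 35, 45],
    "status": [1, 1, 1, 0, 0, 0],
})

# Mean, median, and standard deviation of age for each status
age_by_status = toy.groupby("status")["age"].agg(["mean", "median", "std"])
print(age_by_status)
```

Comparing the median with the mean gives a quick hint about skew, which the violin plot shows graphically.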

### Correlation visualization between symptoms and diabetes status

In [None]:
df_symptoms = df[df.columns.difference(["age", "status", "gender"])]

for column in df_symptoms.columns:
    plt.figure(figsize=(4,4))
    ax = sns.barplot(x=column, y="status", data=df)
    ax.set_xticklabels(["No", "Yes"])
    ax.set_ylabel("Diabetes risk")
    ax.set_xlabel(None)
    title = column.capitalize()
    plt.title(title)
    plt.show()

### Occurrences of symptoms in all patients

In [None]:
# Select only the symptom columns
df_symptoms = df[df.columns.difference(["age", "status", "gender"])]
# plotPie creates its own figure, so no extra plt.figure() call is needed here
for column in df_symptoms.columns:
    plotPie(df_symptoms[column], column.capitalize(), ["Yes", "No"])

### Correlation heatmap

In [None]:
plt.figure(figsize=(8,8))

corr = df.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 160, n=256),
    square=True,
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=50,
    horizontalalignment="right"
);

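At a glance, polyuria and polydipsia have the strongest positive correlations with status, while gender has the strongest negative correlation. Since male is encoded as 1, a negative value means that female patients are positive more often than male patients in this dataset.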
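
Because almost every column here is a 0/1 flag, the Pearson coefficient in the heatmap reduces to the phi coefficient of a 2x2 contingency table. The sketch below checks this on made-up data; `symptom` and `status` are hypothetical arrays, not values from the dataset.

```python
import numpy as np

# Made-up binary data: 1 = symptom present / positive status
symptom = np.array([1, 1, 1, 0, 0, 0, 1, 0])
status = np.array([1, 1, 0, 0, 0, 1, 1, 0])

# Cells of the 2x2 contingency table
n11 = np.sum((symptom == 1) & (status == 1))
n10 = np.sum((symptom == 1) & (status == 0))
n01 = np.sum((symptom == 0) & (status == 1))
n00 = np.sum((symptom == 0) & (status == 0))

# Phi coefficient from the table counts
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)

# Pearson correlation computed directly on the 0/1 vectors
pearson = np.corrcoef(symptom, status)[0, 1]
print(phi, pearson)
```

The two numbers agree, so each heatmap cell can be read as a measure of how unevenly the 2x2 table for that pair of flags is filled.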

# Building Models

## (1) Feature selection: with Pearson's correlation

We will select the 10 features with the highest absolute Pearson correlation coefficient with status.

In [None]:
feat_corr = df.corr()["status"].to_frame()
feat_corr

In [None]:
# Rank features by the absolute value of their correlation with status
feat_corr["status"] = abs(feat_corr["status"])
feat_corr = feat_corr.sort_values(by="status", ascending=False).reset_index(drop=False)
# Row 0 is status itself (correlation 1), so keep rows 1 to 10
feat_corr = feat_corr[1:11]["index"].to_numpy()
feat_corr
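
The selection above relies on the target being the first row after sorting. A slightly more defensive variant drops the target by name instead. This is a sketch only: `top_correlated_features` is a hypothetical helper and the data is made up.

```python
import pandas as pd

def top_correlated_features(frame, target, k):
    """Return the k column names with the largest absolute correlation to target."""
    # Drop the target by name rather than by position
    corr = frame.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.to_numpy()

# Made-up data: "a" copies status, "c" is its mirror image, "b" is weakly related
toy = pd.DataFrame({
    "a": [1, 1, 0, 0, 1, 0],
    "b": [0, 1, 0, 1, 0, 1],
    "c": [0, 0, 1, 1, 0, 1],
    "status": [1, 1, 0, 0, 1, 0],
})

top_two = top_correlated_features(toy, "status", 2)
print(top_two)
```

Note that "c" is selected even though its correlation is negative, because the ranking uses absolute values, just as in the cell above.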

### (1.a) Dividing dataset into training and test set

In [None]:
from sklearn.model_selection import train_test_split

x = df[feat_corr]
y = df["status"]

(x_train, x_test, y_train, y_test) = train_test_split(x, y, test_size = 0.2, random_state=1)

### (1.b) Scaling and standardizing the data

In [None]:
from sklearn.preprocessing import StandardScaler

scl = StandardScaler()
x_train = scl.fit_transform(x_train)
x_test = scl.transform(x_test)
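
To make the scaling step concrete: `StandardScaler` subtracts the training mean and divides by the training standard deviation, and the test set is transformed with those same training statistics. A minimal check on made-up numbers (`train` and `test` are hypothetical arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[30.0], [40.0], [50.0]])
test = np.array([[40.0], [60.0]])

# Fit on the training data only, then reuse the fitted statistics on the test data
scaler = StandardScaler().fit(train)
scaled_test = scaler.transform(test)

# The same computation by hand
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
manual_test = (test - mean) / std

print(scaled_test.ravel(), manual_test.ravel())
```

Fitting on the training split only keeps information about the test rows out of the preprocessing.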

### (1.c) Baseline validation

We'd like to see how the different models perform with their default parameters, so we'll use ten-fold cross-validation on the training set to get a baseline.

In [None]:
# Defining objects for the models and creating a list to iterate the process

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import VotingClassifier

nb = GaussianNB()
lr = LogisticRegression(max_iter = 2000)
dt = tree.DecisionTreeClassifier(random_state = 1)
rf = RandomForestClassifier(random_state = 1)
svc = SVC(probability = True)
knn = KNeighborsClassifier()
xgb = XGBClassifier(random_state =1)
vot = VotingClassifier(
    estimators = [('nb',nb), ('lr',lr), ('dt',dt), ('rf',rf), ('svc',svc), ('knn',knn), ('xgb',xgb)],
    voting = 'soft'
)

models = [nb, lr, dt, rf, svc, knn, xgb, vot]
models_name = [
    "Naive Bayes",
    "Logistic Regression",
    "Decision Tree",
    "Random Forest",
    "SVM",
    "K-Nearest Neighbor",
    "XGBoost",
    "Voting"
]
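
The voting model above uses `voting = 'soft'`, which averages the predicted class probabilities of its members and picks the class with the highest average. The sketch below illustrates the idea with invented probabilities from three hypothetical models and contrasts it with a simple majority (hard) vote.

```python
import numpy as np

# Hypothetical probabilities [P(negative), P(positive)] for two patients
proba_a = np.array([[0.90, 0.10], [0.40, 0.60]])
proba_b = np.array([[0.60, 0.40], [0.45, 0.55]])
proba_c = np.array([[0.70, 0.30], [0.80, 0.20]])

stacked = np.stack([proba_a, proba_b, proba_c])  # shape: (models, patients, classes)

# Soft voting: average the probabilities, then take the most probable class
soft_vote = stacked.mean(axis=0).argmax(axis=1)

# Hard voting for comparison: each model votes for its own top class
votes = stacked.argmax(axis=2)  # shape: (models, patients)
hard_vote = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

print("soft:", soft_vote, "hard:", hard_vote)
```

For the second patient the two schemes disagree: two models lean slightly positive, but the third is confidently negative, and soft voting lets that confidence count.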

In [None]:
results_base = {}

for index, model in enumerate(models):
    cv = cross_val_score(model, x_train, y_train, cv=10)
    results_base[models_name[index]] = cv.mean() * 100.0
    print("Baseline using", models_name[index], "=", cv.mean() * 100.0, "%", "with std:", cv.std())
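
One caveat with the baseline above: the scaler was fitted on all of `x_train` before cross-validation, so each validation fold contributed to the scaling statistics. A possible refinement is to put the scaler and the model in a pipeline, so the scaler is refitted inside every fold. The sketch below uses synthetic data (`X_demo`, `y_demo`, and `pipe` are hypothetical names) and is not the setup that produced the numbers in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=1)

# The scaler is fitted on the training part of each fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=10)

print(f"mean accuracy: {scores.mean():.3f}, std: {scores.std():.3f}")
```

With mostly binary features the difference is likely small, but the pipeline form keeps the validation folds fully unseen.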

### (1.d) Predicting data

We'll also take a look at the confusion matrix of each model.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

results = {}

for index, model in enumerate(models):
    model.fit(x_train, y_train)
    predict = model.predict(x_test)
    confuse = confusion_matrix(y_test, predict)
    accur = accuracy_score(y_test, predict)
    results[models_name[index]] = accur * 100.0
    
    title = models_name[index] + ": " + "{:.3f}%".format(accur*100) + " accurate\n"
    ax = sns.heatmap(confuse/np.sum(confuse), annot=True, fmt='.1%', cmap="Greens")
    ax.set_title(title)
    plt.show()
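
Accuracy alone hides which kind of error a model makes, and for a screening task a missed positive case is usually the costlier one. The sketch below derives recall and precision from a confusion matrix built on made-up labels (`y_true` and `y_pred` are hypothetical, not results from this notebook).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1])

# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that were correct

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

The same three lines can be applied to `confuse` inside the loop above to compare models on recall as well as accuracy.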

### (1.e) Comparing performance with baseline

In [None]:
x = np.arange(len(results))
width = 0.4

plt.figure(figsize=(9,5))
ax = plt.subplot(111)
# Side-by-side bars: cross-validation baseline vs. test-set accuracy
ax.bar(x, list(results_base.values()), width=width, color="c", align="center")
ax.bar(x + width, list(results.values()), width=width, color="r", align="center")
ax.legend(("Baseline (CV)", "Test set"))
plt.ylim((85, 100))
# Center each tick between its pair of bars and rotate the labels
plt.xticks(x + width / 2, list(results_base.keys()), rotation=40, horizontalalignment="right")
plt.title("Performance comparison")
plt.show()

## (2) Feature selection: with Chi-Squared test

We will select the 10 features with the highest chi-squared score.

In [None]:
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

x = df[df.columns.difference(["status"])]
y = df["status"]

# Score every feature against status with the chi-squared test
selector = SelectKBest(score_func=chi2, k=10)
fit = selector.fit(x, y)

# Pair each column name with its score and keep the 10 highest
feat_chi = pd.DataFrame({"column": x.columns, "score": fit.scores_})
feat_chi = feat_chi.sort_values(by="score", ascending=False).reset_index(drop=True)
feat_chi = feat_chi[0:10]["column"].to_numpy()
print(feat_chi)
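
One property worth keeping in mind when reading these scores: scikit-learn's `chi2` treats each feature as a non-negative count, so the score grows with the feature's magnitude. A column such as age, measured in years, may therefore rank highly partly because of its scale. The sketch below shows the effect on random made-up data (`flag` and `y_demo` are hypothetical).

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)   # made-up binary target
flag = rng.integers(0, 2, size=200)     # made-up 0/1 symptom flag

X_small = flag.reshape(-1, 1)
X_large = 10 * X_small                  # same information, ten times the magnitude

score_small, _ = chi2(X_small, y_demo)
score_large, _ = chi2(X_large, y_demo)

# The score scales linearly with the feature's magnitude
print(score_small, score_large)
```

If that matters for the comparison, one option is to bin age into categories before scoring, though this notebook does not do so.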

Compare this with the features we selected earlier using Pearson's correlation coefficient. The two lists differ slightly, so let's check which features appear in only one of them.

In [None]:
print(feat_corr)
print()
print("Difference:", list(set(feat_corr).symmetric_difference(set(feat_chi))))

### (2.a) Dividing dataset into training and test set

In [None]:
x = df[feat_chi]
y = df["status"]

(x_train, x_test, y_train, y_test) = train_test_split(x, y, test_size = 0.2, random_state=1)

### (2.b) Scaling and standardizing the data

In [None]:
scl = StandardScaler()
x_train = scl.fit_transform(x_train)
x_test = scl.transform(x_test)

### (2.c) Baseline validation

In [None]:
results_base = {}

for index, model in enumerate(models):
    cv = cross_val_score(model, x_train, y_train, cv=10)
    results_base[models_name[index]] = cv.mean() * 100.0
    print("Baseline using", models_name[index], "=", cv.mean() * 100.0, "%", "with std:", cv.std())

### (2.d) Predicting data

In [None]:
results = {}

for index, model in enumerate(models):
    model.fit(x_train, y_train)
    predict = model.predict(x_test)
    confuse = confusion_matrix(y_test, predict)
    accur = accuracy_score(y_test, predict)
    results[models_name[index]] = accur * 100.0
    
    title = models_name[index] + ": " + "{:.3f}%".format(accur*100) + " accurate\n"
    ax = sns.heatmap(confuse/np.sum(confuse), annot=True, fmt='.1%', cmap="Greens")
    ax.set_title(title)
    plt.show()

### (2.e) Comparing performance with baseline

In [None]:
x = np.arange(len(results))
width = 0.4

plt.figure(figsize=(9,5))
ax = plt.subplot(111)
# Side-by-side bars: cross-validation baseline vs. test-set accuracy
ax.bar(x, list(results_base.values()), width=width, color="c", align="center")
ax.bar(x + width, list(results.values()), width=width, color="r", align="center")
ax.legend(("Baseline (CV)", "Test set"))
plt.ylim((85, 100))
# Center each tick between its pair of bars and rotate the labels
plt.xticks(x + width / 2, list(results_base.keys()), rotation=40, horizontalalignment="right")
plt.title("Performance comparison")
plt.show()

# Conclusion

So, in this notebook, I experimented with different feature selection methods and models. It turns out that chi-squared selection works best here, which fits its usual use for categorical inputs with a categorical output. After comparing the models above, the best models for predicting diabetes in this dataset are XGBoost, Random Forest, and Decision Tree, under both feature selection methods.

The best accuracy I got was 97.1%, reached by both XGBoost and Random Forest with chi-squared feature selection.

Any upvotes, comments, and suggestions are much appreciated! Thank you.