Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content
The datasets consists of several medical predictor variables and one target variable, Outcome. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Our goal is to build a classifier using training data, such that given a test sample, we can classify 
(or essentially predict) whether its outcome is 0 (no diabetes) or 1 (diabetes).

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
df = pd.read_csv('diabetes.csv')
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [3]:
df.shape

(768, 9)

In [4]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [5]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

split data to 60-20-20 train, val, test

In [6]:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)#they use random state 42#does this matter
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
#print(df_train)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train['Outcome'].values
y_val = df_val['Outcome'].values
y_test = df_test['Outcome'].values

del df_train['Outcome']
del df_val['Outcome']
del df_test['Outcome']

DictVectorizer -> apply one-hot encoding to categorical features and get the feature matrix
(turn train and validation into matrices)

In [7]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

train a decision tree classifier to predict the Class variable

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

fit the tree with default parameters

In [9]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

DecisionTreeClassifier()

evaluating the quality of the model with AUC

In [24]:
from sklearn.metrics import roc_auc_score
y_pred = dt.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, y_pred)
print('train auc: %.3f' % auc)

y_pred = dt.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)
print('val auc: %.3f' % auc)

train auc: 0.717
val auc: 0.717


fit tree with max_depth=2

In [25]:
dt2 = DecisionTreeClassifier(max_depth=2)
dt2.fit(X_train, y_train)

y_pred = dt2.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print('train auc: %.3f' % auc)

y_pred = dt2.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print('val auc: %.3f' % auc)

train auc: 0.782
val auc: 0.717


auc is the same, so will use the model with default parameters

In [34]:
#logistic regression

In [26]:
#from sklearn.linear_model import LogisticRegression

In [27]:
#model = LogisticRegression(solver='lbfgs')
# solver='lbfgs' is the default solver in newer version of sklearn
# for older versions, you need to specify it explicitly
#model.fit(X_train, y_train)

In [28]:
#y_pred = model.predict_proba(X_val)[:, 1]

In [29]:
#divorce_decision = (y_pred >= 0.9)

In [30]:
#(y_val == divorce_decision).mean()

In [31]:
#df_pred = pd.DataFrame()
#df_pred['probability'] = y_pred
#df_pred['prediction'] = divorce_decision.astype(int)
#df_pred['actual'] = y_val

In [32]:
#df_pred['correct'] = df_pred.prediction == df_pred.actual

In [33]:
#df_pred.correct.mean()