This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

In [18]:
import pandas as pd
df = pd.read_csv("C:\\Users\\Administrator\\Desktop\\diabetes.csv")
print(df.shape)
# print head of data set
df.head()

(768, 9)


,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [19]:
X = df.drop('Outcome', axis=1)
y = df['Outcome']

In [20]:
from sklearn.model_selection import train_test_split
# implementing train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)
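As a quick sanity check on the split above, the sizes of the two subsets can be worked out by hand. This is a minimal sketch that assumes the test count is the requested fraction rounded up and that the training set receives the remaining rows; the variable names are illustrative only.

```python
import math

n_samples = 768   # number of rows reported by df.shape
test_size = 0.33  # fraction passed to train_test_split

# Assumption: the test count is rounded up, the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train rows:", n_train)
print("test rows:", n_test)
```

The test count of 254 agrees with the total support shown in the classification report further down.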

In [21]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)
# predictions
rfc_predict = rfc.predict(X_test)
X_test



,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
149,2,90,70,17,0,27.3,0.085,22
416,1,97,68,21,0,27.2,1.095,22
275,2,100,70,52,57,40.5,0.677,25
742,1,109,58,18,116,28.5,0.219,22
145,0,102,75,23,0,0.0,0.572,21
...,...,...,...,...,...,...,...,...
572,3,111,58,31,44,29.5,0.430,22
498,7,195,70,33,145,25.1,0.163,55
562,1,87,68,34,77,37.6,0.401,24
196,1,105,58,0,0,24.3,0.187,21


In [22]:
#Evaluating Performance
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report,confusion_matrix

In [23]:
rfc_cv_score = cross_val_score(rfc, X, y, cv=10, scoring='roc_auc')

In [24]:
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))
print('\n')
print("=== All AUC Scores ===")
print(rfc_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())

=== Confusion Matrix ===
[[148  28]
 [ 40  38]]


=== Classification Report ===
              precision    recall  f1-score   support

           0       0.79      0.84      0.81       176
           1       0.58      0.49      0.53        78

    accuracy                           0.73       254
   macro avg       0.68      0.66      0.67       254
weighted avg       0.72      0.73      0.73       254



=== All AUC Scores ===
[0.76962963 0.79259259 0.84111111 0.74148148 0.76407407 0.84148148
 0.79925926 0.85592593 0.73538462 0.80384615]


=== Mean AUC Score ===
Mean AUC Score - Random Forest:  0.7944786324786325
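The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes the class 1 (diabetic) precision, recall, and F1 score and the overall accuracy from the four counts printed above, assuming the usual layout in which rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[148, 28],
      [40, 38]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

precision = tp / (tp + fp)                    # share of predicted positives that are correct
recall = tp / (tp + fn)                       # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

A recall of roughly one half means that about half of the diabetic patients in the test set were missed, which the overall accuracy alone does not reveal.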


In [9]:
# Next, we tune the hyperparameters (parameters whose values are set before
# the learning process begins) to try to improve the performance of the model
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
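# number of features at every split
# ('auto' means the same as 'sqrt' for classifiers; newer scikit-learn releases no longer accept it)
max_features = ['auto', 'sqrt']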

# max depth
max_depth = [int(x) for x in np.linspace(100, 500, num = 11)]
max_depth.append(None)
# create random grid
random_grid = {'n_estimators': n_estimators,'max_features': max_features,'max_depth': max_depth}
# Random search of parameters
rfc_random = RandomizedSearchCV(estimator = rfc, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the model
rfc_random.fit(X_train, y_train)
# print results
print(rfc_random.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   35.8s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  4.4min finished


{'n_estimators': 200, 'max_features': 'sqrt', 'max_depth': 140}
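To put the search above in perspective, it helps to count how many combinations the grid contains and how many of them the random search actually tries. The following is a rough standard-library sketch, not scikit-learn's own sampling code: `evenly_spaced` is a hypothetical stand-in for the `np.linspace` calls, and the sampling is assumed to be without replacement because every parameter is given as a list.

```python
import itertools
import random

def evenly_spaced(start, stop, num):
    # Hypothetical helper: integer equivalent of the np.linspace calls above.
    return [start + i * (stop - start) // (num - 1) for i in range(num)]

n_estimators = evenly_spaced(200, 2000, 10)
max_features = ['auto', 'sqrt']
max_depth = evenly_spaced(100, 500, 11) + [None]

# Every possible combination in the grid.
grid = list(itertools.product(n_estimators, max_features, max_depth))

# Assumption: 100 distinct candidates are drawn, each fitted on 3 folds.
rng = random.Random(42)
candidates = rng.sample(grid, 100)
total_fits = len(candidates) * 3

print(len(grid), "combinations in the grid")
print(len(candidates), "candidates sampled,", total_fits, "fits in total")
```

Under these assumptions the search covers 100 of 240 combinations, so the reported best parameters are the best among the sampled candidates rather than a guaranteed optimum of the full grid.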


In [10]:
# My results were: 'n_estimators' = 200; 'max_features' = 'sqrt'; 'max_depth' = 140.
# Now we can plug these back into the model to see if it improved our performance
rfc = RandomForestClassifier(n_estimators=200, max_depth=140, max_features='sqrt')
rfc.fit(X_train,y_train)
rfc_predict = rfc.predict(X_test)
rfc_cv_score = cross_val_score(rfc, X, y, cv=10, scoring='roc_auc')
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))
print('\n')
print("=== All AUC Scores ===")
print(rfc_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())

=== Confusion Matrix ===
[[150  26]
 [ 34  44]]


=== Classification Report ===
             precision    recall  f1-score   support

          0       0.82      0.85      0.83       176
          1       0.63      0.56      0.59        78

avg / total       0.76      0.76      0.76       254



=== All AUC Scores ===
[0.78259259 0.83592593 0.82259259 0.74962963 0.80518519 0.86592593
 0.8562963  0.9062963  0.81038462 0.85192308]


=== Mean AUC Score ===
Mean AUC Score - Random Forest:  0.8286752136752137
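Since both runs report one AUC value per fold, the two sets of scores can be compared fold by fold rather than only by their means. The sketch below copies the printed scores (rounded to eight decimals) and assumes that both runs used the same ten folds, which should hold when `cv=10` is passed as an integer without shuffling.

```python
import statistics

# Per-fold AUC scores copied from the two outputs above.
default_scores = [0.76962963, 0.79259259, 0.84111111, 0.74148148, 0.76407407,
                  0.84148148, 0.79925926, 0.85592593, 0.73538462, 0.80384615]
tuned_scores = [0.78259259, 0.83592593, 0.82259259, 0.74962963, 0.80518519,
                0.86592593, 0.8562963, 0.9062963, 0.81038462, 0.85192308]

mean_default = statistics.mean(default_scores)
mean_tuned = statistics.mean(tuned_scores)
spread_default = statistics.stdev(default_scores)

# Count the folds on which the tuned model scored higher.
folds_improved = sum(t > d for d, t in zip(default_scores, tuned_scores))

print(f"mean AUC: {mean_default:.3f} -> {mean_tuned:.3f}")
print(f"fold-to-fold std (default model): {spread_default:.3f}")
print(f"folds improved: {folds_improved} of {len(default_scores)}")
```

Because the random forests here are fitted without a fixed `random_state`, the exact numbers will vary somewhat from run to run.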


Our mean roc_auc score improved from 0.79 to 0.83.
The confusion matrix improved as well: false positives fell from 28 to 26 and false negatives fell from 40 to 34.
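The change in each kind of error can be read off the two confusion matrices printed above. This short sketch subtracts the counts, again assuming rows are true labels and columns are predicted labels.

```python
# Confusion matrices from the default and the tuned model (test set of 254 rows).
before = [[148, 28],
          [40, 38]]
after = [[150, 26],
         [34, 44]]

# Off-diagonal entries are the errors: [0][1] = false positives, [1][0] = false negatives.
fp_change = after[0][1] - before[0][1]
fn_change = after[1][0] - before[1][0]

print("false positives:", before[0][1], "->", after[0][1], f"({fp_change:+d})")
print("false negatives:", before[1][0], "->", after[1][0], f"({fn_change:+d})")
```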

In [17]:
# predict the outcome for a single patient; values follow the column order of X
A = rfc.predict([[1,97,68,21,0,27.2,1.95,22]])
A

array([0], dtype=int64)