# Diabetes Prediction Using Random Forests

This notebook builds a model to predict diabetes using random forests.

### First, import the relevant libraries and load the data

In [10]:
# Core data and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# scikit-learn: models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.svm import LinearSVC
# scikit-learn: splitting, tuning, preprocessing and metrics
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler, label_binarize
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.metrics import precision_score as precision
from sklearn.metrics import recall_score as recall
# Load the dataset
df = pd.read_csv('diabetes_coursera.csv')

### Separate X and Y Variables

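Note: unlike logistic regression or SVM models, random forests do not require the features to be standardized. Tree splits depend only on the ordering of values within each feature, so rescaling a feature does not change which splits are available.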
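The scale-invariance claim can be checked empirically. The following is a minimal sketch on synthetic data (the toy features and labels are assumptions, not the diabetes dataset): it fits two forests with the same `random_state`, one on raw features and one on min-max-scaled features, and compares their test predictions. If scaling does not matter, the two forests should agree on essentially every test row.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): three features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3)) * np.array([1.0, 100.0, 10000.0])
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = MinMaxScaler().fit(X_tr)

raw_model = RandomForestClassifier(n_estimators=9, max_depth=5, random_state=0)
raw_model.fit(X_tr, y_tr)

scaled_model = RandomForestClassifier(n_estimators=9, max_depth=5, random_state=0)
scaled_model.fit(scaler.transform(X_tr), y_tr)

# Fraction of test rows on which the two forests give the same prediction.
agreement = np.mean(
    raw_model.predict(X_te) == scaled_model.predict(scaler.transform(X_te))
)
print("Prediction agreement:", agreement)
```

Any rare disagreement would come from floating-point rounding at a split threshold rather than from the scaling itself.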

In [12]:
X_rf = df.drop(columns='Diabetes_binary')
y_rf = df['Diabetes_binary']



### Create train and test sets, holding out 20% of the data for testing

Set a constant random state so that the split is reproducible.

In [18]:
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X_rf, y_rf, test_size=0.2, random_state=4)
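Fixing `random_state` makes the split repeatable, which can be verified directly. This is a small sketch on a toy frame (the data and the feature column names are illustrative assumptions, not the notebook's dataset). The `stratify` argument shown at the end is an optional addition that the notebook does not use; it keeps the class ratio similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's DataFrame (assumption: illustrative columns).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "BMI": rng.integers(15, 45, size=200),
    "Age": rng.integers(1, 14, size=200),
    "Diabetes_binary": rng.integers(0, 2, size=200),
})
X_toy = toy.drop(columns="Diabetes_binary")
y_toy = toy["Diabetes_binary"]

# Same random_state -> identical split on every call.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=4)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=4)
same_split = split_a[1].index.equals(split_b[1].index)
print("Identical test rows:", same_split)

# Optional: stratify on the label so both splits keep a similar class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)
print("Train positive rate:", round(y_tr.mean(), 3))
print("Test positive rate:", round(y_te.mean(), 3))
```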


### Create the classifier object and initialize the grid search hyperparameters

Instruct the grid search to use recall as its scoring criterion.

In [None]:
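rf = RandomForestClassifier()
rf.get_params().keys()  # hyperparameter names (only displayed if last in the cell)
# Odd values only: n_estimators in 1..9, max_depth in 1..13
param_grid = {
    'n_estimators': [2*n + 1 for n in range(5)],
    'max_depth': [2*n + 1 for n in range(7)],
}
# Rank candidate settings by cross-validated recall
search = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='recall')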
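To see how much work this grid implies, the candidate values can be enumerated. This is a small sketch, assuming `GridSearchCV` is left at its default of 5-fold cross-validation (the notebook does not set `cv`):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [2 * n + 1 for n in range(5)],  # 1, 3, 5, 7, 9
    "max_depth": [2 * n + 1 for n in range(7)],     # 1, 3, 5, ..., 13
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # GridSearchCV default when cv is not specified
print("Candidate combinations:", n_candidates)
print("Total cross-validation fits:", n_candidates * n_folds)
```

With 5 values of `n_estimators` and 7 of `max_depth`, this gives 35 combinations and 175 cross-validation fits, plus one final refit of the best combination on the full training set.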

### Fit the model and print the best parameters

In [24]:
search.fit(X_train_rf, y_train_rf)
print("Best Parameters:", search.best_params_)
print("Best recall Score:", search.best_score_)

Best Parameters: {'max_depth': 5, 'n_estimators': 9}
Best recall Score: 0.7917465293187597
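Note that `best_score_` is the mean cross-validated recall over the training folds, not a score on the held-out test set. A hedged sketch of a held-out evaluation follows, using synthetic data from `make_classification` and a reduced grid as stand-ins (both are assumptions so the example runs on its own; in the notebook the same calls would presumably use `search`, `X_test_rf` and `y_test_rf`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), not the diabetes dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)

# Reduced grid (assumption) to keep the example fast.
demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 9], "max_depth": [3, 5]},
    scoring="recall",
)
demo_search.fit(X_tr, y_tr)

# With the default refit=True, predict() uses the best estimator,
# refit on the whole training set.
y_pred = demo_search.predict(X_te)
test_recall = recall_score(y_te, y_pred)
test_precision = precision_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)

print("Test recall:", test_recall)
print("Test precision:", test_precision)
print("Confusion matrix:")
print(cm)
```

Reporting precision alongside recall is a design choice here: a search that optimizes recall alone can favor models that over-predict the positive class.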


### See the [PDF Report](classification_report.pdf) for detailed conclusions of all three models (Logistic, SVM, Random Forests)