## <font color='#2F4F4F'>1. Specify the question and project goal</font>


### a) Specifying the Data Analysis Question

What is your research question? What problem is it that you are trying to solve?

The task is to predict whether a patient will be diagnosed with diabetes based on the provided variables.



### b) Defining the Metric for Success

What will convince you that your project has succeeded?

A model with an accuracy greater than 0.85.
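
For reference, accuracy is the share of predictions that match the true labels. The sketch below spells out that definition and the success check against the 0.85 threshold. The helper names `accuracy` and `meets_goal` and the toy labels are illustrative assumptions, not part of the project data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def meets_goal(y_true, y_pred, threshold=0.85):
    """True when accuracy is strictly greater than the project threshold."""
    return accuracy(y_true, y_pred) > threshold


# Toy labels for illustration only (hypothetical, not taken from the dataset)
toy_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]
toy_pred = [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]

toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy, meets_goal(toy_true, toy_pred))
```

In the modelling cells the same quantity comes from scikit-learn's `accuracy_score` or a classifier's `score()` method.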

### c) Understanding the context 

The background information surrounding the problem or research question.

The data was collected and shared by a pharmaceutical company whose aim is to predict a patient's diabetes diagnosis with high accuracy.


### d) Recording the Experimental Design

The steps you will take from the beginning to the end of this project.

Data collection, cleaning and understanding, followed by creating a model, training and testing it, reviewing the outcome and challenging the results.

### e) Data Relevance

Is your data relevant to the problem or research question?

Yes, the data provided is relevant, as these are the key factors in determining a diabetes diagnosis.

## <font color='#2F4F4F'>2. Data Cleaning & Preparation</font>

Step 1: Load libraries

In [1]:
#Loading libraries
import pandas as pd

Step 2: Import dataset and explore

In [2]:
#Import data and explore

df=pd.read_csv('https://bit.ly/DiabetesDS')

df.sample(10)

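| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 654 | 1 | 106 | 70 | 28 | 135 | 34.2 | 0.142 | 22 | 0 |
| 292 | 2 | 128 | 78 | 37 | 182 | 43.3 | 1.224 | 31 | 1 |
| 18 | 1 | 103 | 30 | 38 | 83 | 43.3 | 0.183 | 33 | 0 |
| 624 | 2 | 108 | 64 | 0 | 0 | 30.8 | 0.158 | 21 | 0 |
| 69 | 4 | 146 | 85 | 27 | 100 | 28.9 | 0.189 | 27 | 0 |
| 745 | 12 | 100 | 84 | 33 | 105 | 30.0 | 0.488 | 46 | 0 |
| 93 | 4 | 134 | 72 | 0 | 0 | 23.8 | 0.277 | 60 | 1 |
| 394 | 4 | 158 | 78 | 0 | 0 | 32.9 | 0.803 | 31 | 1 |
| 461 | 1 | 71 | 62 | 0 | 0 | 21.8 | 0.416 | 26 | 0 |
| 89 | 1 | 107 | 68 | 19 | 0 | 26.5 | 0.165 | 24 | 0 |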


In [3]:
#preview the dataset shape
df.shape

(768, 9)

In [4]:
#look for duplicates
sum(df.duplicated())
#there are no duplicated records in the dataset

0

In [5]:
#look for missing values
df.isnull().sum()

#there are no missing values in the dataset

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
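
`isnull()` only catches values stored as NaN. In the sample above, `SkinThickness` and `Insulin` contain zeros, which may be placeholders for unrecorded measurements rather than real readings. A minimal sketch for counting them follows. The column list is an assumption about where zero is not a plausible reading, and the small frame is made up for illustration.

```python
import pandas as pd

# Columns where a zero is assumed to be a placeholder, not a real measurement
ZERO_SUSPECT_COLUMNS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zero_placeholders(frame, columns=ZERO_SUSPECT_COLUMNS):
    """Return the number of zero entries in each suspect column."""
    return (frame[columns] == 0).sum()


# Hypothetical mini-frame; for the real data call count_zero_placeholders(df)
demo = pd.DataFrame({
    "Glucose": [106, 128, 0],
    "BloodPressure": [70, 78, 64],
    "SkinThickness": [28, 0, 0],
    "Insulin": [135, 0, 0],
    "BMI": [34.2, 43.3, 30.8],
})

zero_counts = count_zero_placeholders(demo)
print(zero_counts)
```

If the counts on the real data are large, one option would be to treat those zeros as missing and impute them before modelling. Whether that helps here has not been tested.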

In [6]:
#preview datatypes
df.dtypes

#all columns are numeric, so no encoding is needed before modelling

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

## <font color='#2F4F4F'>3. Modelling</font>

We will model the diabetes dataset using three models (decision tree, random forest and logistic regression) and evaluate each against the 0.85 accuracy goal.

In [7]:
#Step 1 is to split the data into training and testing data sets using a split of 75/25

from sklearn.model_selection import train_test_split

df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345)

features_train =df_train.drop(['Outcome'], axis=1)
target_train = df_train['Outcome']
features_valid = df_valid.drop(['Outcome'], axis=1)
target_valid = df_valid['Outcome']

print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)

(576, 8)
(576,)
(192, 8)
(192,)
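
The split above is purely random. With a binary target, passing `stratify` to `train_test_split` keeps the share of positive cases similar in both parts. A sketch on a synthetic stand-in frame follows (`demo_df` and its values are hypothetical, not the diabetes data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 120 negatives and 80 positives
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Glucose": rng.integers(70, 200, size=200),
    "BMI": rng.uniform(18, 45, size=200),
    "Outcome": np.repeat([0, 1], [120, 80]),
})

# stratify keeps the Outcome ratio close to equal in both splits
demo_train, demo_valid = train_test_split(
    demo_df, test_size=0.25, random_state=12345, stratify=demo_df["Outcome"]
)

train_rate = demo_train["Outcome"].mean()
valid_rate = demo_valid["Outcome"].mean()
print(train_rate, valid_rate)
```

On the notebook's data the equivalent call would add `stratify=df['Outcome']`. That changes which rows land in each split, so the scores reported below would change as well.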


### Decision Tree Modelling


Steps 1 & 2: Data Modelling & Evaluation

In [11]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier(random_state=12345, max_depth=20)
model.fit(features_train, target_train)

#predict on the validation features and wrap the resulting array in a Series
predicted_valid = pd.Series(model.predict(features_valid))

#compare actual values vs predicted values to determine the accuracy of the model
accuracy_valid = accuracy_score(target_valid, predicted_valid)

print(accuracy_valid)

0.78125


Step 3: Hyperparameter Tuning

In [19]:
for depth in range(1, 30):
    #recreate the model, specifying max_depth=depth
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)

    #train the model
    model.fit(features_train, target_train)

    #find the predictions using the validation set
    predictions_valid = model.predict(features_valid)

    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7708333333333334
max_depth = 2 : 0.7708333333333334
max_depth = 3 : 0.7604166666666666
max_depth = 4 : 0.75
max_depth = 5 : 0.8177083333333334
max_depth = 6 : 0.8229166666666666
max_depth = 7 : 0.765625
max_depth = 8 : 0.7604166666666666
max_depth = 9 : 0.7552083333333334
max_depth = 10 : 0.734375
max_depth = 11 : 0.765625
max_depth = 12 : 0.7604166666666666
max_depth = 13 : 0.7447916666666666
max_depth = 14 : 0.7447916666666666
max_depth = 15 : 0.7395833333333334
max_depth = 16 : 0.7447916666666666
max_depth = 17 : 0.7604166666666666
max_depth = 18 : 0.7760416666666666
max_depth = 19 : 0.78125
max_depth = 20 : 0.78125
max_depth = 21 : 0.78125
max_depth = 22 : 0.78125
max_depth = 23 : 0.78125
max_depth = 24 : 0.78125
max_depth = 25 : 0.78125
max_depth = 26 : 0.78125
max_depth = 27 : 0.78125
max_depth = 28 : 0.78125
max_depth = 29 : 0.78125


Observations:
From the above hyperparameter tuning, the highest validation accuracy is about 0.82 (0.8229), achieved at a max_depth of 6. From a max_depth of 19 upward, the accuracy stays at 0.78125, which suggests that deeper limits no longer change the fitted tree.

The decision tree model does not yet meet our goal of 0.85, so it is not the optimal model for the required task.
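
With 192 validation rows, a single row is worth about 0.5 percentage points of accuracy, so small differences between depths may be noise. Cross-validation averages the score over several folds of the training data. The sketch below shows this with `GridSearchCV` on synthetic data from `make_classification`. The data, the depth range and the fold count are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for features_train / target_train (not the diabetes data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=12345
)

# Score every max_depth from 1 to 10 with 5-fold cross-validation
depth_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=12345),
    param_grid={"max_depth": list(range(1, 11))},
    scoring="accuracy",
    cv=5,
)
depth_search.fit(X_demo, y_demo)

print(depth_search.best_params_, depth_search.best_score_)
```

Here `best_score_` is the mean accuracy across folds for the selected depth. The validation split can then be kept aside for one final check.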

### Random Forest Modelling


Steps 1 & 2: Data Modelling & Evaluation

In [42]:
from sklearn.ensemble import RandomForestRegressor

#create and train the model
regressor = RandomForestRegressor(n_estimators=1000, random_state=0)

#fit the regressor with X and Y data
regressor.fit(features_train,target_train)

#generate predictions for the validation features
regressor.predict(features_valid)

#score the model on the validation data set
#note: for a regressor, score() returns R^2 (coefficient of determination), not classification accuracy
regressor.score(features_valid,target_valid)

#the R^2 score is 0.36 with the n_estimators parameter at 1000

0.35958591515151506

Step 3: Hyperparameter Tuning

In [43]:
from sklearn.model_selection import GridSearchCV

regressor = RandomForestRegressor(n_estimators=100, random_state=0)
#recreate the model

regressor.fit(features_train,target_train)

param_grid = { 
    'n_estimators': [200,400,600,1000]
}

CV_regressor = GridSearchCV(estimator=regressor, param_grid=param_grid, cv= 5)
CV_regressor.fit(features_train,target_train)
print(CV_regressor.best_params_)

#of the options tried, the best n_estimators value is 1000

{'n_estimators': 1000}


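Observations: with n_estimators at 1000, the model scores only 0.36 on the validation set. This figure is the R² returned by `RandomForestRegressor.score()`, not a classification accuracy, so it cannot be compared directly with the 0.85 goal. The grid search selected 1000 as the best of the n_estimators values tried; parameters that would reach an accuracy of 0.85 or higher are not known. Higher n_estimators values were not evaluated because of long run times.

On this evidence, Random Forest is not the optimal model for predicting a diabetes diagnosis.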
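
Because `Outcome` is a binary label, a classifier reports accuracy directly, whereas a regressor's `score()` reports R². A sketch of the classifier variant on synthetic data follows. The data and the `n_estimators` value are assumptions, and no result for the diabetes data is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes features and Outcome column
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345
)

# Train a random forest classifier and predict class labels
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_tr, y_tr)
forest_accuracy = accuracy_score(y_va, forest.predict(X_va))

# For a classifier, score() returns the same mean accuracy
print(forest_accuracy, forest.score(X_va, y_va))
```

In the notebook, the same pattern would use `features_train`, `target_train`, `features_valid` and `target_valid` in place of the synthetic arrays.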

### Logistic Regression Modelling


Steps 1 & 2: Data Modelling & Evaluation

In [45]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state=12345,solver='liblinear')
#train the model
lr_model.fit(features_train,target_train)

#check the model's accuracy on the validation set with the score() method
accuracy_valid = lr_model.score(features_valid,target_valid)

print(accuracy_valid)

0.7916666666666666


Step 3: Hyperparameter Tuning

In [48]:
from sklearn.model_selection import RepeatedStratifiedKFold
#create a model and define a dictionary with solver parameters 
model1 = LogisticRegression(random_state=12345)
solvers = ['newton-cg', 'lbfgs', 'liblinear']

# define grid search
grid = dict(solver=solvers)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model1, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
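#note: the search is fitted on the validation split here; tuning on features_train/target_train is the usual practice
grid_result = grid_search.fit(features_valid,target_valid)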

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.774123 using {'solver': 'newton-cg'}


Observation: logistic regression does not achieve our accuracy goal of 0.85. The untuned liblinear model scores 0.79 on the validation set, and the best cross-validated accuracy from the solver search is 0.77, using the newton-cg solver.

Logistic regression is not the optimal model for reaching our accuracy goal in predicting a diabetes diagnosis.
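
A common pattern is to tune on the training split only and keep the validation split for the final score. The sketch below does that on synthetic data. Scaling the features and searching over `C` as well as the solver are choices made here for illustration and are not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features and Outcome column
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=1
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345, stratify=y_demo
)

# Scale the features, then fit logistic regression
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=12345),
)

# make_pipeline names the step "logisticregression"
param_grid = {
    "logisticregression__solver": ["newton-cg", "lbfgs", "liblinear"],
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],
}

lr_search = GridSearchCV(pipeline, param_grid=param_grid, scoring="accuracy", cv=5)
lr_search.fit(X_tr, y_tr)                      # tune on the training split only
valid_accuracy = lr_search.score(X_va, y_va)   # score on untouched validation data

print(lr_search.best_params_, valid_accuracy)
```

Placing the scaler inside the pipeline means it is refitted within each fold, so the held-out fold does not influence the scaling.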

## <font color='#2F4F4F'>4. Findings and Recommendations</font>


Of the three models tested, the best performer is the decision tree model, which gives the highest accuracy of 82% with a max_depth parameter of 6. None of the models reaches the 0.85 goal.

Random forest modelling might come closer to the 0.85 accuracy with a better n_estimators hyperparameter, but this was not confirmed. In the above analysis, n_estimators values higher than 1000 caused the cell to run for long periods and did not return a best parameter above 1000.

**Recommendation**: Use the decision tree classifier model to predict diabetes diagnosis.