## 1. Defining the Question

### a)  Problem statement:
As a data professional working for a pharmaceutical company, you need to develop a
model that predicts whether a patient will be diagnosed with diabetes. The model needs
to have an accuracy score greater than 0.85.


### b) Data Analysis Question

Can we predict, from a patient's diagnostic measurements, whether the patient will be diagnosed with diabetes?

### c) Metric of success

The model should achieve an accuracy score greater than 0.85 on the validation set.
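For reference, accuracy is the fraction of predictions that match the true labels. The sketch below uses made-up labels (not taken from the dataset) to show that a manual count agrees with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only (1 = diabetic, 0 = not diabetic)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

Six of the eight illustrative predictions are correct, so both values are 0.75, which would fall short of the 0.85 target.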

### Steps to follow:
● Data Importation
● Data Exploration
● Data Cleaning
● Data Preparation
● Data Modeling (Using Decision Trees, Random Forest and Logistic Regression)
● Model Evaluation
● Hyperparameter Tuning
● Findings and Recommendations


### Data Importation

In [1]:
# Importing our libraries
# ---
# import pandas for data manipulation
import pandas as pd
# import numpy for arithmetic operations
import numpy as np
# import the scikit-learn models, accuracy metric, and train/validation splitter
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
# Load the data below
# --- 
# Dataset url =  https://bit.ly/DiabetesDS

# reading the data and storing it in a dataframe named diabetes_df
diabetes_df = pd.read_csv("https://bit.ly/DiabetesDS")

# 

### Data Exploration

In [3]:
# Checking the first 5 rows of data
# ---
diabetes_df.head()
#

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
# Checking the last 5 rows of data
# ---
diabetes_df.tail()
#

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


In [None]:
# Sample 10 rows of data
# ---
diabetes_df.sample(10)
#

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
631,0,102,78,40,90,34.5,0.238,24,0
230,4,142,86,0,0,44.0,0.645,22,1
415,3,173,84,33,474,35.7,0.258,22,1
384,1,125,70,24,110,24.3,0.221,25,0
103,1,81,72,18,40,26.6,0.283,24,0
456,1,135,54,0,0,26.7,0.687,62,0
149,2,90,70,17,0,27.3,0.085,22,0
609,1,111,62,13,182,24.0,0.138,23,0
28,13,145,82,19,110,22.2,0.245,57,0
565,2,95,54,14,88,26.1,0.748,22,0


In [5]:
# Checking number of rows and columns
# ---

diabetes_df.shape
#  

(768, 9)

In [6]:
# Checking datatypes
# ---
diabetes_df.dtypes
# 

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

### Data Cleaning

In [7]:
# Checking for missing entries in each variable
# ---
#

diabetes_df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

No missing (NaN) values were found in any column.
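Note that `isnull()` only counts NaN entries. The preview rows shown earlier contain zeros in `SkinThickness` and `Insulin`, which may be placeholders for unrecorded measurements rather than real readings; that interpretation is an assumption, not something the dataset states. A minimal sketch for counting such zeros follows, using a small hand-built frame (values copied from the `head()` output) so that it runs without downloading the data. The list `suspect_columns` is an illustrative choice.

```python
import pandas as pd

# First five rows of selected columns, rebuilt by hand from the head() output
sample_df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a value of 0 is assumed to be implausible (illustrative choice)
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column; isnull() reports nothing for these entries
zero_counts = (sample_df[suspect_columns] == 0).sum()
print(zero_counts)
```

The same expression can be applied to `diabetes_df` to see how many zeros each column holds in the full dataset before deciding whether to impute or drop them.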

In [8]:
# Standardizing the dataset, i.e. renaming variables
# ---
# Convert all column headers to lower case

diabetes_df.columns =  diabetes_df.columns.str.lower()
diabetes_df.columns

Index(['pregnancies', 'glucose', 'bloodpressure', 'skinthickness', 'insulin',
       'bmi', 'diabetespedigreefunction', 'age', 'outcome'],
      dtype='object')

In [9]:
# Checking how many duplicate rows there are in the data
# ---
#
diabetes_duplicated_df = diabetes_df[diabetes_df.duplicated()]
diabetes_duplicated_df

Unnamed: 0,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome


### Observation:

No duplicates

In [10]:
# Checking the shape of the clean data
diabetes_df.shape

(768, 9)

### Solution Implementation

In [36]:
# Splitting the dataset into training (60%) and validation (40%) sets
diabetes_train, diabetes_valid = train_test_split(diabetes_df, test_size=0.40, random_state=12345)

# Create features and target for the training and validation sets
features_train = diabetes_train.drop(['outcome'], axis=1)
target_train = diabetes_train['outcome']
features_valid = diabetes_valid.drop(['outcome'], axis=1)
target_valid = diabetes_valid['outcome']


# Decision tree modeling: try max_depth values from 1 to 6
for depth in range(1, 7):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)

    predictions_valid = model.predict(features_valid)
    print('max_depth =', depth, ': ', end='')
    print(accuracy_score(target_valid, predictions_valid))


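# Accuracy test
# Note: `model` is the last tree fitted in the loop above (max_depth=6),
# so the scores below are for that depth, not for the best-scoring depth.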
train_decision_predictions = model.predict(features_train)
test_decision_predictions = model.predict(features_valid)

print('Accuracy')
print('Training set:', accuracy_score(target_train, train_decision_predictions))
print('Test set:', accuracy_score(target_valid, test_decision_predictions))  
decision_tree = accuracy_score(target_valid, test_decision_predictions)

max_depth = 1 : 0.7727272727272727
max_depth = 2 : 0.788961038961039
max_depth = 3 : 0.7857142857142857
max_depth = 4 : 0.7824675324675324
max_depth = 5 : 0.762987012987013
max_depth = 6 : 0.7402597402597403
Accuracy
Training set: 0.8826086956521739
Test set: 0.7402597402597403
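After the loop finishes, `model` holds the last tree fitted (max_depth = 6), so the summary accuracy above describes that tree rather than the best-scoring one (max_depth = 2 in the printed results). One possible way to keep the best tree is sketched below. It uses synthetic data from `make_classification` as a stand-in so that it runs on its own, and the names `best_depth`, `best_score`, and `best_model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes features and target (assumption)
X, y = make_classification(n_samples=768, n_features=8, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.40, random_state=12345
)

best_depth, best_score, best_model = None, -1.0, None
for depth in range(1, 7):
    candidate = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    candidate.fit(X_train, y_train)
    score = accuracy_score(y_valid, candidate.predict(X_valid))
    # Keep this tree only if it beats the best validation score so far
    if score > best_score:
        best_depth, best_score, best_model = depth, score, candidate

print("best max_depth =", best_depth, "validation accuracy =", best_score)
```

Because the depth is chosen on the same validation set that is used to report accuracy, the reported score is likely to be slightly optimistic; a separate test set or cross-validation would give a cleaner estimate.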


In [42]:
# Random forest with 40 trees
model = RandomForestClassifier(random_state=12345, n_estimators=40)
model.fit(features_train, target_train)
model.score(features_valid, target_valid)
train_predictions = model.predict(features_train)
test_predictions = model.predict(features_valid)

print('Accuracy')
print('Training set:', accuracy_score(target_train, train_predictions))
print('Test set:', accuracy_score(target_valid, test_predictions))
 
random_forest = accuracy_score(target_valid, test_predictions)

Accuracy
Training set: 1.0
Test set: 0.7987012987012987


In [46]:
# Logistic regression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
model.score(features_valid, target_valid)

train_predictions = model.predict(features_train)
test_predictions = model.predict(features_valid)

print('Accuracy')
print('Training set:', accuracy_score(target_train, train_predictions))
print('Test set:', accuracy_score(target_valid, test_predictions))

logistic_regression = accuracy_score(target_valid, test_predictions)

Accuracy
Training set: 0.758695652173913
Test set: 0.8051948051948052
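Logistic regression with the `liblinear` solver applies an L2 penalty by default, and penalized models can be sensitive to feature scale. The columns here span quite different ranges (for example, `diabetespedigreefunction` versus `insulin`), so standardizing the features first may be worth trying. Whether it helps on this dataset is untested here. The sketch below uses synthetic data as a stand-in, and `scaled_logreg` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features and target (assumption)
X, y = make_classification(n_samples=768, n_features=8, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.40, random_state=12345
)

# Standardize features to zero mean and unit variance, then fit the classifier.
# The scaler is fitted on the training data only, inside the pipeline.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver="liblinear"),
)
scaled_logreg.fit(X_train, y_train)

scaled_accuracy = scaled_logreg.score(X_valid, y_valid)
print("Validation accuracy:", scaled_accuracy)
```

Wrapping the scaler in a pipeline keeps validation data out of the scaling statistics, which avoids a subtle form of leakage.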


In [47]:
print('Test accuracy decision tree', decision_tree)
print('Test accuracy Random forest', random_forest)
print('Test accuracy logistic regression', logistic_regression)

Test accuracy decision tree 0.7402597402597403
Test accuracy Random forest 0.7987012987012987
Test accuracy logistic regression 0.8051948051948052


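## Findings and Recommendations:

From the above model analysis, the best-performing model is logistic regression, with a validation accuracy of about 0.805, followed by random forest (0.799) and the decision tree (0.740 at max_depth = 6, although max_depth = 2 reached 0.789). The random forest scored 1.0 on the training set, which suggests it is overfitting. None of the models meets the target accuracy of 0.85, so further improvement, such as hyperparameter tuning, is required.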
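The steps listed at the start include hyperparameter tuning, which the notebook does not yet carry out. One possible starting point is a small cross-validated grid search over the random forest settings, sketched below. The parameter grid is an illustrative assumption rather than a recommended configuration, and synthetic data stands in for the diabetes features so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the diabetes features and target (assumption)
X, y = make_classification(n_samples=768, n_features=8, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.40, random_state=12345
)

# Illustrative grid: number of trees and maximum tree depth
param_grid = {"n_estimators": [20, 40], "max_depth": [3, 5, None]}

# 3-fold cross-validation on the training set, scored by accuracy
search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

# The best estimator is refitted on the full training set by default
valid_accuracy = search.score(X_valid, y_valid)
print(search.best_params_, valid_accuracy)
```

Limiting `max_depth` is included in the grid because a training accuracy of 1.0 hints at overfitting; the search selects settings using cross-validation only, so the validation set stays untouched until the final score.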