<h1>In this notebook I use logistic regression and a random forest to analyse a dataset provided by the National Institute of Diabetes and Digestive and Kidney Diseases. I use this data to predict whether a patient should be classified as diabetic or non-diabetic. I then measure how accurately the model makes this prediction and, if necessary, adjust the model to make it more accurate.</h1>

In [133]:
import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escape sequences
diabetes = pd.read_csv(r'G:\Data Science & PowerBI\Diabetes.csv')
diabetes.head(6)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0


In [141]:
# Create boxplots to get a visual impression of how each feature is distributed per outcome.
# This should give an immediate idea of which features are useful in predicting the outcome.
from matplotlib import pyplot as plt
%matplotlib inline

# Group the boxplots by the 'Outcome' column (the label)
boxplot = diabetes.boxplot(by='Outcome', layout=(10,1), figsize=(10,70), fontsize=15)
plt.ylim([0,150])
plt.show()

# The boxplots for Pregnancies and Age show differences in distribution between the two outcomes. This may help with the prediction.

In [142]:
# The next step is to train a model to predict the outcome.
# Start by dividing the dataset into features (x) and labels (y), then into training and test sets.

x = diabetes[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']].values
y = diabetes['Outcome'].values

from sklearn.model_selection import train_test_split

#split the dataset 70/30

X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=0)



In [143]:
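# Train the model
from sklearn.linear_model import LogisticRegression

# Regularization rate; scikit-learn takes its inverse (C), so a small rate means weak regularization
reg = 0.01

model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)
print(model)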

LogisticRegression(C=100.0, solver='liblinear')


In [144]:
#make predictions based on test data
Predictions = model.predict(X_test)

#compare actual data with predicted data
print('Actual values:', y_test)
print('Predicted values:', Predictions)

Actual values: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
 0 0 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 0 0 0 1 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0
 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0
 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 1 1
 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0
 1 1 0 0 1 1 0 0 0]
Predicted values: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1
 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0
 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
 0 0 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0
 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 1 1 1 0 0 0 0 0]


In [145]:
#Use the accuracy score to determine how accurate the predictions are

from sklearn.metrics import accuracy_score

score = accuracy_score(y_test, Predictions) * 100
print('the accuracy of the model is:', score,"%")

the accuracy of the model is: 77.92207792207793 %


<h1>Let's apply some more metrics to check how accurate this model is. The confusion matrix should provide insight into the number of true positives and true negatives.</h1>

In [127]:
from sklearn.metrics import confusion_matrix

cm=confusion_matrix(y_test, Predictions)
cm

array([[141,  16],
       [ 35,  39]], dtype=int64)
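
The four counts in this matrix can be turned into further metrics by hand. The sketch below is a minimal illustration that hard-codes the counts from the output above and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, class 0 first). The variable names are illustrative and not part of the notebook.

```python
# Counts copied from the confusion matrix printed above.
# Assumed layout (scikit-learn's convention): rows = actual, columns = predicted.
tn, fp = 141, 16   # actual non-diabetic: predicted 0, predicted 1
fn, tp = 35, 39    # actual diabetic:     predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted diabetics that are diabetic
recall = tp / (tp + fn)               # share of actual diabetics that are found
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / total          # accuracy of always predicting non-diabetic

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
print(f"baseline:  {baseline:.4f}")
```

The accuracy computed this way matches the score printed earlier. Recall is the weaker number: of the 74 diabetic patients in the test set, 35 are predicted as non-diabetic.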

<h1>The relatively low accuracy of this model might be caused by the fact that I did not preprocess the data. Let's add some preprocessing to check whether this increases the accuracy.</h1>

In [129]:
# Use a pipeline to preprocess the numeric and categorical features, then train the model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Define preprocessing for numeric columns (standardize columns 0-6, Pregnancies through DiabetesPedigreeFunction)
numeric_features = [0,1,2,3,4,5,6]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Define preprocessing for categorical features (one-hot encode column 7, Age, so each distinct age becomes its own category)
categorical_features = [7]
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('logregressor', LogisticRegression(C=1/reg, solver="liblinear"))])


# Fit the pipeline to train a logistic regression model on the training set
model = pipeline.fit(X_train, y_train)
print(model)


Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  [0, 1, 2, 3, 4, 5, 6]),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  [7])])),
                ('logregressor',
                 LogisticRegression(C=100.0, solver='liblinear'))])
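
To make the 'scaler' step concrete: StandardScaler rescales each numeric column to zero mean and unit variance, z = (x - mean) / std, using the population standard deviation. The sketch below reproduces that arithmetic in plain Python on the six Glucose values from the first rows shown at the top of the notebook. In the pipeline the mean and standard deviation come from the whole training set, so the real scaled values will differ.

```python
from statistics import fmean, pstdev

# Glucose values of the first six rows printed by diabetes.head(6)
glucose = [148, 85, 183, 89, 137, 116]

mean = fmean(glucose)    # column mean
std = pstdev(glucose)    # population standard deviation (ddof=0), as StandardScaler uses

# Standardize: subtract the mean, divide by the standard deviation
scaled = [(x - mean) / std for x in glucose]

for raw, z in zip(glucose, scaled):
    print(f"{raw:>4} -> {z:+.3f}")
```

After this transformation the values have mean 0 and standard deviation 1, which puts features measured on very different scales (for example Insulin and DiabetesPedigreeFunction) on a comparable footing.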


In [130]:
#using this pipeline, let's make a prediction to see if the model is more accurate

predictions = model.predict(X_test)
score2 = accuracy_score(y_test, predictions) * 100
print('the accuracy of the model is:', score2,"%")

the accuracy of the model is: 75.32467532467533 %
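
One thing worth checking when reading this result is how the one-hot encoded Age column behaves for ages that occur only in the test set. With handle_unknown='ignore', such a value is encoded as a row of zeros, so the model receives no age information for it. The sketch below shows this on made-up ages (not taken from the dataset); whether this effect contributes to the lower accuracy here has not been verified.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical ages, for illustration only
train_ages = np.array([[21], [30], [50]])
test_ages = np.array([[30], [45]])   # 45 never appears in the training data

# Same encoder settings as the 'onehot' step in the pipeline
encoder = OneHotEncoder(handle_unknown='ignore').fit(train_ages)
encoded = encoder.transform(test_ages).toarray()

print(encoder.categories_)   # the ages learned from the training data
print(encoded)               # one row per test age; the unseen age is all zeros
```

If this turns out to matter, scaling Age together with the other numeric columns is one alternative to try.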


<h1>To see whether a different model is more accurate than logistic regression currently is, I will apply a random forest to make predictions.</h1>

In [147]:
from sklearn.ensemble import RandomForestClassifier

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('logregressor', RandomForestClassifier(n_estimators=100))])

# Fit the pipeline to train a random forest model on the training set
model = pipeline.fit(X_train, y_train)
print(model)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  [0, 1, 2, 3, 4, 5, 6]),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  [7])])),
                ('logregressor', RandomForestClassifier())])


In [148]:
predictions_rf = model.predict(X_test)
score3 = accuracy_score(y_test, predictions_rf) * 100
print('the accuracy of the model is:', score3,"%")

the accuracy of the model is: 75.75757575757575 %


In [149]:
# Save the model as a pickle file for reuse

import joblib

# Raw string so the backslashes in the Windows path are not treated as escape sequences
filename = r'G:\Data Science & PowerBI\Diabetes_model.pkl'
joblib.dump(model, filename)

['G:\\Data Science & PowerBI\\Diabetes_model.pkl']
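
A saved model can be read back with joblib.load and used for predictions straight away. The sketch below is a minimal round trip under stated assumptions: it uses a small stand-in model trained on toy data and a temporary directory instead of the pipeline and path above, and the names toy_model and loaded_model are illustrative only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data and stand-in model; in the notebook this would be the fitted pipeline `model`
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)        # write the fitted model to disk
    loaded_model = joblib.load(path)    # read it back into memory

# The reloaded model should give the same predictions as the original
same = np.array_equal(toy_model.predict(X_toy), loaded_model.predict(X_toy))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.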