Project Scenario:
You are a data scientist in an epidemiology department.

The government is waging a war on diabetes, and you are on the front line. Your weapon is your Python skills, and your bullets are data.

In this project, you will train a machine learning model to predict whether an individual is at risk of getting diabetes.

# Part 3: Machine Learning Model Training

Aim: Predict whether an individual is at risk of getting diabetes

Thus, it is a classification problem.

In [2]:
# import libraries
import pandas as pd

# We use scikit-learn (sklearn) for all the machine learning tools we need

# Split the dataset into training and test sets:
from sklearn.model_selection import train_test_split

# Dummy classifier, which serves as a baseline:
from sklearn.dummy import DummyClassifier

# Logistic regression classifier:
from sklearn.linear_model import LogisticRegression

# Tree-based classifiers:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Confusion Matrix & Classification Report:
from sklearn.metrics import confusion_matrix, classification_report

In [4]:
df = pd.read_csv("diabetes_data_clean.csv")
df

Unnamed: 0,age,ismale,polyuria,polydipsia,sudden weight loss,weakness,polyphagia,genital thrush,visual blurring,itching,irritability,delayed healing,partial paresis,muscle stiffness,alopecia,obesity,class
0,40,1,0,1,0,1,0,0,0,1,0,1,0,1,1,1,1
1,58,1,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1
2,41,1,1,0,0,1,1,0,0,1,0,1,0,1,1,0,1
3,45,1,0,0,1,1,1,1,0,1,0,1,0,0,0,0,1
4,60,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1
5,55,1,1,1,0,1,1,0,1,1,0,1,0,1,1,1,1
6,57,1,1,1,0,1,1,1,0,0,0,1,1,0,0,0,1
7,66,1,1,1,1,1,0,0,1,1,1,0,1,1,0,0,1
8,67,1,1,1,0,1,1,1,0,1,1,0,1,1,0,1,1
9,70,1,0,1,1,1,1,0,1,1,1,0,0,0,1,0,1


In [44]:
# First, we need to split our data into the independent and dependent variables.
# Let X be all the independent variables
# and y be the dependent variable (class, which indicates whether someone has diabetes or not)

X = df.drop("class", axis = 1)
y = df["class"]

In [45]:
X

Unnamed: 0,age,ismale,polyuria,polydipsia,sudden weight loss,weakness,polyphagia,genital thrush,visual blurring,itching,irritability,delayed healing,partial paresis,muscle stiffness,alopecia,obesity
0,40,1,0,1,0,1,0,0,0,1,0,1,0,1,1,1
1,58,1,0,0,0,1,0,0,1,0,0,0,1,0,1,0
2,41,1,1,0,0,1,1,0,0,1,0,1,0,1,1,0
3,45,1,0,0,1,1,1,1,0,1,0,1,0,0,0,0
4,60,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1
5,55,1,1,1,0,1,1,0,1,1,0,1,0,1,1,1
6,57,1,1,1,0,1,1,1,0,0,0,1,1,0,0,0
7,66,1,1,1,1,1,0,0,1,1,1,0,1,1,0,0
8,67,1,1,1,0,1,1,1,0,1,1,0,1,1,0,1
9,70,1,0,1,1,1,1,0,1,1,1,0,0,0,1,0


In [46]:
y

0      1
1      1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1
10     1
11     1
12     1
13     1
14     1
15     1
16     1
17     1
18     1
19     1
20     1
21     1
22     1
23     1
24     1
25     1
26     1
27     1
28     1
29     1
      ..
490    0
491    0
492    0
493    0
494    0
495    0
496    0
497    0
498    1
499    0
500    1
501    0
502    0
503    0
504    0
505    0
506    0
507    0
508    0
509    0
510    0
511    0
512    0
513    1
514    1
515    1
516    1
517    1
518    0
519    0
Name: class, Length: 520, dtype: int64

In [47]:
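# Now we need to split the data into training data and test data
# using the train_test_split function.
# test_size = 0.2 holds out 20% of the rows; stratify = y keeps the class ratio the same in both sets.
# No random_state is set, so the split (and every result below) changes from run to run.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, stratify = y)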
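
To see what `stratify = y` does, here is a minimal sketch on synthetic labels. The 320/200 class counts are an assumption inferred from the test-set support figures reported further down (64 and 40 out of 104), and `X_demo` and `y_demo` are hypothetical stand-ins for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real labels: 320 positives and 200 negatives
# (assumed counts, not read from diabetes_data_clean.csv).
y_demo = np.array([1] * 320 + [0] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder single-column features

# random_state is fixed here only to make the sketch reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# With stratification, the share of positives is the same in every subset.
full_ratio = y_demo.mean()
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(f"positives: full={full_ratio:.3f}, train={train_ratio:.3f}, test={test_ratio:.3f}")
```

Without `stratify`, the class shares in the two subsets would drift apart by chance, which matters more the smaller the test set is.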

In [48]:
# Now we have X_train and y_train, which are our training data,
# and X_test and y_test, which act as our testing data

## Alright, let us begin our model training
We shall start with a DummyClassifier to establish a baseline
- A dummy classifier ignores the input features and, as run here, guesses the class labels at random. It gives a minimum baseline that our machine learning models should beat
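
As a sketch of what such a baseline does, the snippet below fits two dummy strategies on synthetic labels. `strategy="stratified"` guesses at random in proportion to the class frequencies, while `strategy="most_frequent"` always predicts the majority class. The default strategy depends on the scikit-learn version, so passing it explicitly is the safer choice. The 320/200 labels and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels (assumed 320 positives, 200 negatives). A dummy
# classifier ignores the features, so a column of zeros is enough.
y_demo = np.array([1] * 320 + [0] * 200)
X_demo = np.zeros((len(y_demo), 1))

# Random guesses drawn in proportion to the training class frequencies.
random_guess = DummyClassifier(strategy="stratified", random_state=0)
random_guess.fit(X_demo, y_demo)
random_acc = random_guess.score(X_demo, y_demo)

# Always predict the most common class (1 in this synthetic data).
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_demo, y_demo)
majority_acc = majority.score(X_demo, y_demo)

print(f"stratified accuracy:    {random_acc:.2f}")
print(f"most_frequent accuracy: {majority_acc:.2f}")
```

A useful model should beat both numbers, not just the random guesses.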


In [49]:
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

In [50]:
# Let us assess our DummyClassifier model
confusion_matrix(y_test, dummy_pred) 

array([[14, 26],
       [30, 34]], dtype=int64)

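This is how to read the above array: rows are the true classes and columns are the predicted classes, so the layout is [[true negatives, false positives], [false negatives, true positives]].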
![image.png](attachment:image.png)
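
The four counts can also be unpacked in code. This is a small worked check in plain Python using the numbers printed above, not part of the original notebook.

```python
# Dummy-classifier confusion matrix, copied from the output above.
# Rows are the true classes (0, 1); columns are the predicted classes (0, 1).
matrix = [[14, 26],
          [30, 34]]

(tn, fp), (fn, tp) = matrix  # true neg., false pos., false neg., true pos.

total = tn + fp + fn + tp
accuracy = (tn + tp) / total  # share of correct predictions

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy = {accuracy:.2f}")
```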

In [51]:
# If you don't wish to use a confusion matrix, you can use a classification report instead:
print(classification_report(y_test, dummy_pred))

              precision    recall  f1-score   support

           0       0.32      0.35      0.33        40
           1       0.57      0.53      0.55        64

   micro avg       0.46      0.46      0.46       104
   macro avg       0.44      0.44      0.44       104
weighted avg       0.47      0.46      0.47       104

## Logistic Regression

In [52]:
# Now let us start with Logistic Regression:
logr = LogisticRegression(max_iter = 10000)
logr.fit(X_train, y_train)
logr_pred = logr.predict(X_test)



In [53]:
confusion_matrix(y_test,logr_pred)

array([[39,  1],
       [ 9, 55]], dtype=int64)

As you can see, there is a noticeable improvement over the baseline: accuracy on the test set rises from 0.46 to 0.90. We can also use the classification report to gauge the improvement:

In [54]:
print(classification_report(y_test,logr_pred))

              precision    recall  f1-score   support

           0       0.81      0.97      0.89        40
           1       0.98      0.86      0.92        64

   micro avg       0.90      0.90      0.90       104
   macro avg       0.90      0.92      0.90       104
weighted avg       0.92      0.90      0.91       104
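
The figures in this report follow directly from the logistic regression confusion matrix. As a worked check (plain arithmetic, not part of the original notebook), the snippet below recomputes precision, recall, and F1 for both classes from the four counts.

```python
# Logistic-regression confusion matrix, copied from the output above.
(tn, fp), (fn, tp) = [[39, 1],
                      [9, 55]]

# Class 1 (diabetes).
precision_1 = tp / (tp + fp)  # of all predicted positives, how many were right
recall_1 = tp / (tp + fn)     # of all actual positives, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0: the same formulas with the roles of the two classes swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"class 1: precision={precision_1:.3f} recall={recall_1:.3f} f1={f1_1:.3f}")
print(f"class 0: precision={precision_0:.3f} recall={recall_0:.3f} f1={f1_0:.3f}")
```

The 9 false negatives are what pull recall for class 1 down to 0.86: these are individuals with diabetes whom the model labels as negative.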



## Decision Tree

In [83]:
tree = DecisionTreeClassifier()
tree.fit(X_train,y_train)
tree_pred = tree.predict(X_test)

In [84]:
confusion_matrix(y_test, tree_pred)

array([[39,  1],
       [ 4, 60]], dtype=int64)

In [85]:
print(classification_report(y_test, tree_pred))

              precision    recall  f1-score   support

           0       0.91      0.97      0.94        40
           1       0.98      0.94      0.96        64

   micro avg       0.95      0.95      0.95       104
   macro avg       0.95      0.96      0.95       104
weighted avg       0.95      0.95      0.95       104



## Random Forest

In [64]:
forest = RandomForestClassifier()
forest.fit(X_train,y_train)
forest_pred = forest.predict(X_test)



In [86]:
confusion_matrix(y_test, forest_pred)

array([[39,  1],
       [ 2, 62]], dtype=int64)

In [87]:
print(classification_report(y_test, forest_pred))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96        40
           1       0.98      0.97      0.98        64

   micro avg       0.97      0.97      0.97       104
   macro avg       0.97      0.97      0.97       104
weighted avg       0.97      0.97      0.97       104



### Conclusions from the three models:
1. All three models outperformed the baseline set by the dummy classifier
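2. Random Forest was the best-performing model of the three on this test split, with 0.97 accuracy versus 0.95 for the decision tree and 0.90 for logistic regression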
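
A ranking based on a single 104-row test split can shift from run to run. One way to make the comparison more robust is k-fold cross-validation, which this notebook does not use. The sketch below runs it on synthetic data from `make_classification` (an assumption standing in for the diabetes data), so the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as X (520 rows, 16 features).
X_demo, y_demo = make_classification(
    n_samples=520, n_features=16, n_informative=6, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=10000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validation: each model is trained and scored on five
# different train/validation splits, and the accuracies are averaged.
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`.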

### Identifying the Most Important Features
- Since Random Forest was the best-performing model of the three,
- we shall use it to judge the importance of each feature

In [88]:
pd.DataFrame({"feature": X.columns,
            "importance": forest.feature_importances_}).sort_values("importance",ascending = False)

Unnamed: 0,feature,importance
2,polyuria,0.231687
3,polydipsia,0.190876
0,age,0.118526
4,sudden weight loss,0.080394
1,ismale,0.078598
12,partial paresis,0.039714
10,irritability,0.03896
8,visual blurring,0.038676
9,itching,0.030022
11,delayed healing,0.025318


Thus, according to the random forest, polyuria is the most important feature for predicting diabetes (importance of about 0.23), followed by polydipsia and age
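
`feature_importances_` reports impurity-based importances, which are computed from the training data. A common cross-check is permutation importance on held-out data, sketched below as an alternative technique that the notebook itself does not use. It assumes scikit-learn 0.22 or newer (for `sklearn.inspection.permutation_importance`), and the synthetic columns `signal` and `noise_*` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): one binary column drives
# the label, the other three are pure noise.
rng = np.random.default_rng(0)
n_rows = 520
X_demo = pd.DataFrame(
    rng.integers(0, 2, size=(n_rows, 4)),
    columns=["signal", "noise_a", "noise_b", "noise_c"],
)
flipped = rng.random(n_rows) < 0.1  # flip about 10% of the labels
y_demo = np.where(flipped, 1 - X_demo["signal"], X_demo["signal"])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
forest_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (what feature_importances_ reports); they sum to 1.
impurity = pd.Series(forest_demo.feature_importances_, index=X_demo.columns)

# Permutation importance: mean drop in test accuracy when one column is shuffled.
result = permutation_importance(forest_demo, X_te, y_te, n_repeats=10, random_state=0)
permutation = pd.Series(result.importances_mean, index=X_demo.columns)

print(pd.DataFrame({"impurity": impurity, "permutation": permutation})
      .sort_values("permutation", ascending=False))
```

If both measures rank the same features at the top, the ranking is less likely to be an artifact of how the trees were grown.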
____________________________________________________________________________

Summary:
1. Trained a baseline model
2. Trained three different models: logistic regression, decision tree, and random forest
3. Identified the important features using the best-performing model