In [None]:
You are a data scientist working for a healthcare company, and you have been tasked with creating a
decision tree to help identify patients with diabetes based on a set of clinical variables. You have been
given a dataset (diabetes.csv) with the following variables:
1. Pregnancies: Number of times pregnant (integer)
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (integer)
3. BloodPressure: Diastolic blood pressure (mm Hg) (integer)
4. SkinThickness: Triceps skin fold thickness (mm) (integer)
5. Insulin: 2-Hour serum insulin (mu U/ml) (integer)
6. BMI: Body mass index (weight in kg/(height in m)^2) (float)
7. DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes
based on family history) (float)
8. Age: Age in years (integer)
9. Outcome: Class variable (0 if non-diabetic, 1 if diabetic) (integer)

Q1. Import the dataset and examine the variables. Use descriptive statistics and visualizations to
    understand the distribution and relationships between the variables.
Q2. Preprocess the data by cleaning missing values, removing outliers, and transforming categorical
    variables into dummy variables if necessary.
Q3. Split the dataset into a training set and a test set. Use a random seed to ensure reproducibility.
Q4. Use a decision tree algorithm, such as ID3 or C4.5, to train a decision tree model on the training set. Use
    cross-validation to optimize the hyperparameters and avoid overfitting.
Q5. Evaluate the performance of the decision tree model on the test set using metrics such as accuracy,
    precision, recall, and F1 score. Use confusion matrices and ROC curves to visualize the results.
Q6. Interpret the decision tree by examining the splits, branches, and leaves. Identify the most important
    variables and their thresholds. Use domain knowledge and common sense to explain the patterns and
    trends.
Q7. Validate the decision tree model by applying it to new data or testing its robustness to changes in the
    dataset or the environment. Use sensitivity analysis and scenario testing to explore the uncertainty and
    risks.
Here’s the dataset link:
https://drive.google.com/file/d/1Q4J8KS1wm4-_YTuc389enPh6O-eTNcx2/view?usp=sharing

Your goal is to create a decision tree to predict whether a patient has diabetes based on the other
variables, following steps Q1 to Q7 above.

By following these steps, you can develop a comprehensive understanding of decision tree modeling and
its applications to real-world healthcare problems. Good luck!

In [27]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [28]:
from sklearn.datasets import load_diabetes

In [29]:
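# scikit-learn's built-in diabetes dataset: 442 patients with a continuous target.
# It is a different dataset from the diabetes.csv file described above.
dataset = load_diabetes()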

In [30]:
print(dataset.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

In [31]:
df = pd.read_csv('diabetes.csv')
df.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


In [32]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
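isnull() reports no missing values, yet the first rows already contain zeros for Insulin and SkinThickness, which cannot be real measurements. The sketch below assumes that a zero in these clinical columns means "not recorded" (an assumption about this file, not something stated in the task). It counts the zeros and fills them with the column median. The small frame is copied from the df.head() rows so the block runs without the CSV.

```python
import numpy as np
import pandas as pd

# First five rows shown by df.head(), used as a stand-in for diabetes.csv.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in any of these columns encodes a missing measurement.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the suspicious zeros per column.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Mark zeros as NaN, then impute each column with its median.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
cleaned = cleaned.fillna(cleaned.median())
print(cleaned)
```

When this feeds a model, the medians would normally be computed on the training split only and then reused for the test split.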

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
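Q1 also asks for descriptive statistics. A minimal sketch of the kind of summary that helps before modelling: per-column spread, the class balance of Outcome, and feature means per class. It uses the five df.head() rows as a stand-in so it runs on its own; on the real file the same calls would be made on df.

```python
import pandas as pd

# Stand-in frame: the five rows shown by df.head().
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Location and spread of every column.
summary = sample.describe().T[["mean", "std", "min", "max"]]

# Share of each class; a skewed balance affects how accuracy should be read.
class_balance = sample["Outcome"].value_counts(normalize=True)

# Mean of selected features within each class.
group_means = sample.groupby("Outcome")[["Glucose", "BMI", "Age"]].mean()

print(summary)
print(class_balance)
print(group_means)
```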


In [34]:
dataset.target

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

In [35]:
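## Independent and dependent features
# X: all nine columns of diabetes.csv (the Outcome column is still included)
X = df.iloc[:, :]
# y: target of scikit-learn's built-in dataset (442 continuous values), not the Outcome column of diabetes.csv
y = dataset.target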

In [36]:
X

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
...          ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1


In [37]:
y.shape

(442,)

In [38]:
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (768, 9)
Shape of y: (442,)
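The two shapes differ because X is read from diabetes.csv while y comes from load_diabetes(). The task statement names Outcome as the class variable, so one alternative is to take features and label from the same frame. The sketch below does that, again on the five df.head() rows as a stand-in. The names features and label are illustrative.

```python
import pandas as pd

# Stand-in frame: the five rows shown by df.head().
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Features: every column except the label, so the label cannot leak into X.
features = sample.drop(columns="Outcome")
# Label: the binary class variable (0 = non-diabetic, 1 = diabetic).
label = sample["Outcome"]

# Both come from the same rows, so lengths and indices agree by construction.
print(features.shape, label.shape)
print(features.index.equals(label.index))
```

With this construction no length matching is needed, and each row of features stays paired with that patient's own label.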


In [40]:
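# Determine the number of samples to match
min_samples = min(X.shape[0], y.shape[0])

# Randomly subsample whichever is larger so that the lengths agree.
# This only equalizes lengths: X (diabetes.csv) and y (load_diabetes) describe
# different patients, so rows of X are not matched to entries of y.
if X.shape[0] > y.shape[0]:
    X = X.sample(n=min_samples, random_state=123)  # seeded for reproducibility
else:
    # y is a NumPy array and has no .sample(); draw without replacement instead
    y = np.random.default_rng(123).choice(y, size=min_samples, replace=False)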
    

In [41]:
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (442, 9)
Shape of y: (442,)


In [42]:
from sklearn.model_selection import train_test_split

# Split into training and test sets; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
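For a binary label such as Outcome, a stratified split keeps the class proportions the same in both parts. The sketch below shows the stratify argument of train_test_split. It uses synthetic data from make_classification (768 rows, 8 features, chosen to mirror the size of diabetes.csv) as a stand-in, so the names X_demo and y_demo and the 65/35 class weights are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same size as diabetes.csv (768 rows, 8 features).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=123,
)

# stratify=y_demo preserves the class ratio in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123,
)

print(X_tr.shape, X_te.shape)
print("positive share (train/test):", y_tr.mean(), y_te.mean())
```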

In [44]:
X_train.shape

(353, 9)

In [45]:
y_train.shape

(353,)

In [46]:
from sklearn.tree import DecisionTreeClassifier

In [47]:
## Unpruned baseline tree with default hyperparameters (post-pruning would set ccp_alpha)
treeclassifier = DecisionTreeClassifier()

In [48]:
treeclassifier.fit(X_train,y_train)
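Q4 asks for cross-validated hyperparameter tuning. scikit-learn's DecisionTreeClassifier implements an optimized CART rather than ID3 or C4.5; criterion="entropy" is the closest setting to their information-gain splits. A minimal sketch of a grid search with 5-fold cross-validation follows. The data is a synthetic stand-in, and the grid values and the F1 scoring choice are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and Outcome label of diabetes.csv.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=123,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123,
)

# Candidate settings; depth, leaf size and ccp_alpha all limit tree complexity.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 0.005, 0.01],  # cost-complexity (post-)pruning strength
}

# 5-fold cross-validation on the training set only; the test set stays untouched.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=123), param_grid, cv=5, scoring="f1",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("mean CV F1:", round(search.best_score_, 3))
```

After fitting, search.best_estimator_ is the tree refit on the full training set with the selected settings.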

In [49]:
X_train.head()

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
345            8      126             88             36      108  38.5                     0.349   49        0
703            2      129              0              0        0  38.5                     0.304   41        0
566            1       99             72             30       18  38.6                     0.412   21        0
63             2      141             58             34      128  25.4                     0.699   24        0
716            3      173             78             39      185  33.8                     0.970   31        1


In [None]:
from sklearn import tree
plt.figure(figsize=(15, 10))  # plt.figsize is not a pyplot setting; pass figsize to figure()
tree.plot_tree(treeclassifier, filled=True)

[Tree plot output, truncated: the root node splits on x[1] <= 194.5 (gini = 0.993, samples = 353), followed by x[0] <= 10.5 (gini = 0.993, samples = 348); each node lists one count per distinct target value.]
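For Q6, a shallow tree printed as text is often easier to read than a full plot. The sketch below uses export_text for the split rules and feature_importances_ for a ranking. The data is synthetic and only borrows the column names from diabetes.csv, so the printed thresholds carry no clinical meaning; max_depth=3 is an arbitrary choice for readability.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
         "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Synthetic stand-in; only the column names come from diabetes.csv.
X_arr, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=123,
)
X_demo = pd.DataFrame(X_arr, columns=names)

# A depth-limited tree keeps the rule list short enough to read.
small_tree = DecisionTreeClassifier(max_depth=3, random_state=123)
small_tree.fit(X_demo, y_demo)

# Split rules as indented text, one line per node.
rules = export_text(small_tree, feature_names=names)
print(rules)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(small_tree.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances.head(3))
```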

In [53]:
##  Prediction
y_pred = treeclassifier.predict(X_test)

In [54]:
y_pred

array([185.,  71., 281.,  90., 177., 233., 200., 196., 202., 102.,  71.,
       220.,  83., 139., 341.,  91.,  71.,  71., 144., 128., 281., 270.,
       131., 163.,  60., 144., 272.,  75., 200., 201., 118.,  77.,  91.,
        60., 270., 220., 179., 229., 245.,  95., 179., 189.,  71.,  70.,
        72., 160.,  66., 122.,  97.,  73., 276., 233.,  71., 103.,  72.,
       107., 216.,  87., 185.,  47.,  60., 252., 258., 141., 217.,  49.,
        63., 190., 137.,  73.,  42.,  90.,  84., 151.,  94., 168.,  65.,
        59., 222., 136., 152., 132., 139., 272., 124., 160.,  71.,  43.,
       295.])

In [55]:
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix

In [56]:
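# scikit-learn metrics take (y_true, y_pred) in that order
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows = true labels, columns = predictions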

              precision    recall  f1-score   support

        31.0       0.00      0.00      0.00         1
        39.0       0.00      0.00      0.00         1
        42.0       0.00      0.00      0.00         1
        43.0       0.00      0.00      0.00         0
        45.0       0.00      0.00      0.00         1
        47.0       0.00      0.00      0.00         0
        49.0       0.00      0.00      0.00         0
        51.0       0.00      0.00      0.00         1
        53.0       0.00      0.00      0.00         2
        54.0       0.00      0.00      0.00         1
        55.0       0.00      0.00      0.00         1
        59.0       0.00      0.00      0.00         2
        60.0       0.00      0.00      0.00         1
        61.0       0.00      0.00      0.00         1
        63.0       0.00      0.00      0.00         1
        65.0       0.00      0.00      0.00         0
        66.0       0.00      0.00      0.00         1
        68.0       0.00    

  _warn_prf(average, modifier, msg_start, len(result))
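The report above has one row per distinct target value because y holds continuous scores, which is why most rows have zero support or zero precision. With a binary label the same metrics reduce to two classes, and a ROC curve becomes available. The sketch below shows that evaluation on synthetic stand-in data; the class names passed to target_names follow the Outcome coding in the task statement, and max_depth=4 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and Outcome label of diabetes.csv.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=123,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123,
)

clf = DecisionTreeClassifier(max_depth=4, random_state=123).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)                # hard 0/1 predictions
y_score = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)
print(classification_report(y_te, y_hat,
                            target_names=["non-diabetic", "diabetic"]))

# ROC points and area under the curve, computed from the class-1 scores.
fpr, tpr, thresholds = roc_curve(y_te, y_score)
auc = roc_auc_score(y_te, y_score)
print("ROC AUC:", round(auc, 3))
```

The fpr and tpr arrays can be passed to plt.plot to draw the ROC curve.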
