### Diabetes Prediction Using Decision Tree, Random Forest, and K-Fold Cross-Validation

In this project, we will build two models — a **Decision Tree** and a **Random Forest** — to predict diabetes for subjects in the **Pima Indians dataset**. The prediction will be based on predictor variables such as **age**, **blood pressure**, and **BMI (Body Mass Index)**.

### Dataset Overview

The **Pima Indians dataset** is a subset of the data from the **UCI Machine Learning Repository**; a version of it also ships as a built-in dataset in R's **MASS library**. The version used here is loaded from `diabetes.csv` and contains **768 records**.

The data will be divided into two subsets:
- **Training dataset (80%)**
- **Test dataset (20%)**

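Note that in the CSV file used here, some features still contain zeros that don't make sense (e.g., a zero for skin thickness or insulin, as the first rows displayed later show). These zeros act as placeholders for missing values and have not been removed from the dataset.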

### Decision Tree Model

A **Decision Tree** is a flowchart-like structure used for decision-making. Here's how it works:
- **Internal Nodes**: Each node represents a **feature** (or attribute) from the dataset.
- **Branches**: Each branch represents a **decision rule** based on the feature values.
- **Leaf Nodes**: The leaf nodes represent the **outcome** or prediction.

<center><img src='https://i.imgur.com/6Fam41M.png'></center>

The topmost node in a decision tree is known as the **root node**. The decision tree model learns to partition the data based on attribute values through a process called **recursive partitioning**. This means that at each step, the dataset is split into subsets based on a chosen feature that best divides the data.

The tree-like structure mimics human-level thinking by breaking down complex decisions into simple "yes" or "no" questions. This makes **Decision Trees** easy to understand, interpret, and visualize, which is why they are widely used in machine learning.
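To make "a chosen feature that best divides the data" concrete, here is a minimal sketch of how a single split could be selected. It assumes Gini impurity as the splitting criterion (one common choice) and uses hypothetical toy values rather than the real dataset; the helper names `gini` and `best_split` are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' rule."""
    n = len(values)
    distinct = sorted(set(values))
    best = (None, gini(labels))  # fall back to "no split" if nothing improves
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2  # midpoint between neighbouring values
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: glucose readings and diabetes labels (not from the dataset)
glucose = [85, 89, 95, 110, 137, 148, 160, 183]
label = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(glucose, label)
print(f"Best rule: glucose <= {threshold} (weighted Gini = {impurity:.3f})")
```

Recursive partitioning repeats this search on each resulting subset until the subsets are pure or a stopping rule is reached.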


### Load Libraries

In this section, we are importing the necessary libraries that will help us load, manipulate, and model the data.

- **`import pandas as pd`**:
  - We import the **Pandas** library, which is essential for data manipulation and analysis. It provides data structures like `DataFrame` that allow us to work efficiently with structured data (e.g., tables).
  
- **`import numpy as np`**:
  - **NumPy** is a library for numerical computing. It provides support for large, multi-dimensional arrays and matrices, and includes mathematical functions to operate on them. It is especially useful when working with numerical data.

- **`from sklearn.model_selection import train_test_split`**:
  - This imports the `train_test_split` function from **scikit-learn**. It is used to split the dataset into two parts: the training data (used to train the model) and the test data (used to evaluate the model). This helps in assessing the model's performance on unseen data.

- **`from sklearn import metrics`**:
  - We import the **metrics** module from **scikit-learn**. This module provides several functions to evaluate the performance of machine learning models, such as calculating accuracy, precision, recall, F1 score, and other evaluation metrics.

- **`from sklearn import tree`**:
  - This imports the **tree** module from **scikit-learn**, which includes the tools needed to build and work with **Decision Trees**. A Decision Tree is a popular machine learning algorithm used for classification tasks, where the goal is to predict a category (such as whether a person has diabetes).


In [1]:
# Load libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import tree

### Load the Dataset

In [2]:
data = pd.read_csv(r"diabetes.csv")

### View the First Few Rows of the Dataset

In [3]:
data.head()

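| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |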


### Check the Shape of the Dataset

In [4]:
data.shape

(768, 9)

### Define Column Names

In [5]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

### Assign Column Names to the Dataset

In [6]:
data.columns = col_names

### View the Updated Dataset

In [7]:
data.head()

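| | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |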


### Split the Dataset into Features and Target Variable

In this step, we are dividing the dataset into two parts:

1. **Features (`X`)**:
   - The **features** are the input variables that the machine learning model uses to make predictions.
   - We create a new variable `X` by excluding the **`label`** column from the dataset. This leaves all the other columns (such as `pregnant`, `glucose`, `bmi`, etc.) which will be used as features for the model.

2. **Target Variable (`y`)**:
   - The **target variable** is the outcome we want to predict. In this case, it is the **`label`** column, which indicates whether a person has diabetes (1) or not (0).
   - The **`label`** column is assigned to the variable `y`, representing the target variable.

By splitting the dataset this way, we prepare it for model training. The model will use `X` (the features) to learn patterns and predict the values in `y` (the target).


In [8]:
#split dataset in features and target variable

X = data.drop(columns = 'label') # Features
y = data['label'] # Target variable

### Split the Dataset into Training Set and Test Set

In this step, we are splitting the dataset into two parts:
- **Training Set**: This part of the data is used to train the machine learning model.
- **Test Set**: This part of the data is kept aside and used to evaluate the performance of the trained model.

The `train_test_split` function is used to split the data, and here's what each argument means:

1. **`X` and `y`**:
   - `X` contains the feature variables (input data), and `y` contains the target variable (the outcome we want to predict).

2. **`test_size=0.2`**:
   - This argument specifies the proportion of the dataset to include in the test set. In this case, `0.2` means 20% of the data will be used for testing, while 80% will be used for training the model.

3. **`random_state=1`**:
   - This sets the random seed to ensure that the split is reproducible. By using the same `random_state`, you ensure that the data is split in the same way every time the code is run. This helps in obtaining consistent results.

After this step:
- `X_train` and `y_train` will be used to train the model.
- `X_test` and `y_test` will be used to evaluate the model’s performance.


In [9]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### Check the Shape of the Training Set

In [10]:
X_train.shape

(614, 8)

### Check for Missing Values in the Training Set

In [11]:
X_train.isna().sum()

| | 0 |
|---|---|
| pregnant | 0 |
| glucose | 0 |
| bp | 0 |
| skin | 0 |
| insulin | 0 |
| bmi | 0 |
| pedigree | 0 |
| age | 0 |


### Check the Shape of the Test Set

In [12]:
X_test.shape

(154, 8)
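The two shapes follow from simple arithmetic on the 768 rows. As a rough check, assuming `train_test_split` rounds the test-set size up (which is consistent with the 614 and 154 rows printed above), the split sizes can be reproduced like this:

```python
import math

n_samples = 768      # rows reported by data.shape
test_size = 0.2      # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # 153.6 rounds up to 154
n_train = n_samples - n_test                # remaining rows: 614

print(n_train, n_test)
```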

### Cross Validation

**Cross-validation** is a technique used to evaluate the performance of machine learning models. It involves training multiple models on different subsets of the available data and then testing them on the remaining complementary subsets. This helps in assessing how well a model generalizes to unseen data.

The key points about **cross-validation** are:

1. **Train and Test on Different Subsets**:
   - The data is split into several subsets (or folds). In each iteration, the model is trained on some folds and evaluated on the remaining fold(s). This process repeats multiple times, with each fold serving as the test set exactly once.

2. **Detect Overfitting**:
   - Cross-validation is commonly used to detect **overfitting**, which happens when a model learns to perform well on the training data but fails to generalize to new, unseen data. By testing the model on different subsets of the data, we get a better estimate of its true performance.
   
3. **Improved Model Evaluation**:
   - Instead of relying on a single train-test split, cross-validation provides a more reliable estimate of how the model will perform on new, unseen data by using multiple splits and evaluations.

4. **Cross-validation and Model Tuning**:
   - Cross-validation can be useful in **hyperparameter tuning**, where different sets of hyperparameters are tested, and the model's performance is evaluated on different data splits to choose the best configuration.

Common types of cross-validation include:
- **K-Fold Cross Validation**: The dataset is divided into `k` subsets, and the model is trained and evaluated `k` times.
- **Stratified K-Fold Cross Validation**: Similar to K-Fold, but it ensures that each fold has the same distribution of target classes, which is useful for imbalanced datasets.

Overall, cross-validation helps in selecting models that generalize well to new data, ensuring that they don't just memorize the training data (which leads to overfitting).


## K-fold Cross Validation

Let’s say that you have trained a machine learning model. Now, you need to find out how well this model performs. Is it accurate enough to be used? How does it compare to another model? There are several evaluation methods to determine this. One such method is called K-fold cross validation.
![](https://i.imgur.com/NJkv5Tk.jpg)

Cross validation is an evaluation method used in machine learning to find out how well your machine learning model can predict the outcome of unseen data. It is a method that is easy to comprehend, works well for a limited data sample and also offers an evaluation that is less biased, making it a popular choice.

The data sample is split into **‘k’ number of smaller samples**, hence the name: K-fold Cross Validation. You may also hear terms like four fold cross validation, or ten fold cross validation, which essentially means that the sample data is being split into four or ten smaller samples respectively.


## How is k-fold cross validation performed?

The general strategy is quite straightforward, and the following steps can be used:

- First, shuffle the dataset and split it into k subsamples. (It is important to make the subsamples as equal in size as possible and to ensure k is less than or equal to the number of elements in the dataset.)
- In the first iteration, the first subset is used as the test data while all the other subsets are used as the training data.
- Train the model with the training data and evaluate it using the test subset. Keep the evaluation score or error rate, and discard the model.
- In the next iteration, select a different subset as the test data set, and make everything else (including the test set used in the previous iteration) part of the training data.
- Re-train the model with the training data and test it using the new test data set. Keep the evaluation score and discard the model.
- Continue iterating until this has been done k times, so that each subsample has served as the test set exactly once. You will end up with k evaluation scores.
- The overall score (or error rate) is the average of these k individual evaluation scores.

![](https://i.imgur.com/6aZw9So.png)
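The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the labels are hypothetical toy data, and `majority_class` is a stand-in "model" that always predicts the most common training label, so the focus stays on the fold bookkeeping rather than on the learner.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class(labels):
    """Hypothetical stand-in 'model': always predict the most common label."""
    return max(set(labels), key=labels.count)

# Hypothetical toy labels (not the diabetes data): 13 negatives, 7 positives
y = [0] * 13 + [1] * 7
folds = k_fold_indices(len(y), k=5)

scores = []
for test_idx in folds:
    test_set = set(test_idx)
    train_labels = [y[i] for i in range(len(y)) if i not in test_set]
    prediction = majority_class(train_labels)       # "train" on the other k-1 folds
    correct = sum(y[i] == prediction for i in test_idx)
    scores.append(correct / len(test_idx))          # keep the score, drop the model

mean_accuracy = sum(scores) / len(scores)
print("Fold accuracies:", scores)
print("Mean accuracy:", mean_accuracy)
```

Every index appears in exactly one test fold, so each observation contributes to the final average exactly once.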


## How to determine the best value for ‘k’ in K-Fold Cross Validation?

Choosing a good value for k is important. A poor value for k can result in a poor evaluation of the model’s abilities. In other words, it can cause the measured ability of the model to be overestimated (high bias) or to change widely depending on the training data used (high variance).

Generally, there are three ways to select k:

1. Let k = 5 or k = 10. Through experimentation, it has been found that these values usually give sufficiently good results.
2. Let k = n, where n is the size of the dataset. Each sample then forms a test set on its own, which is leave-one-out cross validation.
3. Choose k so that every fold is sufficiently large to be statistically representative of the larger dataset.

## Types of cross validation

Cross validation can be divided into two major categories:

1. **Exhaustive cross validation** methods, which learn and test on every possible way of dividing the dataset into training and testing subsets.
2. **Non-exhaustive cross validation** methods, which do not compute all ways of splitting the sample.

**Exhaustive cross-validation**

**Leave-p-out cross validation** is a method of exhaustive cross validation. Here, p observations (or elements in the sample dataset) are left out as the test dataset, and everything else is considered part of the training data. This is repeated for every possible way of choosing the p observations. For more clarity, if you look at the above image, p is equal to 5, as shown by the 5 circles in the ‘test data’.

**Leave-one-out cross validation (LOOCV)** is a special form of the leave-p-out exhaustive cross validation method, where p = 1. It is also a specific case of k-fold cross validation, where k = N (the number of elements in the sample dataset).
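To see why exhaustive methods become expensive, the following sketch enumerates every leave-p-out split for a hypothetical dataset of 6 samples and compares the count with the binomial coefficient. The generator name `leave_p_out` is illustrative and not a library function.

```python
from itertools import combinations
from math import comb

def leave_p_out(n_samples, p):
    """Yield (train_indices, test_indices) for every way of holding out p samples."""
    all_indices = range(n_samples)
    for test_idx in combinations(all_indices, p):
        held_out = set(test_idx)
        train_idx = [i for i in all_indices if i not in held_out]
        yield train_idx, list(test_idx)

n = 6  # hypothetical tiny dataset size, chosen so the splits can be enumerated
lpo_splits = list(leave_p_out(n, p=2))
loo_splits = list(leave_p_out(n, p=1))

print(len(lpo_splits), comb(n, 2))   # number of leave-2-out splits vs. "n choose 2"
print(len(loo_splits))               # leave-one-out: one split per observation
print(comb(768, 2))                  # leave-2-out splits for a 768-row dataset
```

For a dataset with 768 rows, leave-one-out needs 768 model fits, while leave-2-out already needs 294,528.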

***Mathematical Expression***

LOOCV involves one fold per observation, i.e., each observation by itself plays the role of the validation set, and the remaining (N-1) observations play the role of the training set. With least-squares linear regression, the cost of LOOCV is the same as that of a single model fit: refitting the model N times can be avoided, because the LOOCV MSE (mean squared error) can be calculated from a single fit on the complete dataset.

![](https://i.imgur.com/b6zqrU4.png)

In the above formula, $h_i$ represents how much influence an observation has on its own fit (its leverage), and lies between 0 and 1. Because each residual is divided by $(1 - h_i)$, a number smaller than 1, the residual is inflated, and more so for high-leverage observations.
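The shortcut can be checked numerically. The sketch below assumes the formula in the image is the standard leverage form, $\frac{1}{N}\sum_i \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$, and uses simple linear regression with one predictor on hypothetical toy data. It compares the shortcut with brute-force LOOCV, where the line is refitted N times.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    return y_mean - slope * x_mean, slope

# Hypothetical toy data, not taken from the diabetes dataset
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
n = len(xs)

# Brute force: refit the line n times, each time leaving one point out
brute_errors = []
for i in range(n):
    a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    brute_errors.append((ys[i] - (a + b * xs[i])) ** 2)
loocv_brute = sum(brute_errors) / n

# Shortcut: fit once, then divide each residual by (1 - h_i)
a, b = fit_line(xs, ys)
x_mean = sum(xs) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
shortcut_errors = []
for x, y in zip(xs, ys):
    leverage = 1 / n + (x - x_mean) ** 2 / sxx   # h_i for simple linear regression
    shortcut_errors.append(((y - (a + b * x)) / (1 - leverage)) ** 2)
loocv_shortcut = sum(shortcut_errors) / n

print(loocv_brute, loocv_shortcut)
```

The two numbers agree up to floating-point rounding, even though the shortcut fits the model only once.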


**Non-exhaustive cross-validation**

K-fold cross validation where k is not equal to N, Stratified cross validation and repeated random sub-sampling validation are non-exhaustive cross validation methods.

**Stratified cross validation:** Plain k-fold validation can run into a tricky situation. Since we randomly shuffle the data and divide it into folds, we may end up with highly imbalanced folds, which can bias the training.

![](https://i.imgur.com/96J2TrZ.jpg)

Let's say we somehow get a fold in which the majority of samples belong to one class (say, the dog class) and only a few belong to the cat class. This will certainly create problems in training; to avoid this, we build stratified folds using stratification.

**Stratification** is a process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset.

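For example, in binary classification where the two classes are equally common, every fold contains roughly 50% elements of class 0 and 50% of class 1. More generally, each fold keeps approximately the class proportions of the complete dataset.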


Partitions are selected such that each partition contains roughly the same amount of elements for each class label.
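A simplified way to build such folds is to group the sample indices by class and deal each group out across the folds in turn. The sketch below illustrates the idea on hypothetical labels; it is not the algorithm used by scikit-learn's `StratifiedKFold`, and the function name `stratified_folds` is illustrative.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical imbalanced labels: 30 of class 0 and 10 of class 1
y = [0] * 30 + [1] * 10
folds = stratified_folds(y, k=5)

for number, fold in enumerate(folds, start=1):
    positives = sum(y[i] for i in fold)
    print(f"Fold {number}: size={len(fold)}, class 1 share={positives / len(fold):.2f}")
```

With these counts every fold receives the same 3:1 class ratio as the full dataset; when class counts are not divisible by k, the shares are only approximately equal.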

**Repeated random sub-sampling validation (Monte Carlo cross validation):** The data is split into multiple random subsets and the model is trained and evaluated for each split. The results are averaged over the splits. Unlike in k-fold cross validation, the proportion of the training and test set sizes does not depend on the number of iterations, which is an advantage. However, a disadvantage is that some data elements may never be selected as part of the test set, while others may be selected multiple times. As the number of random splits increases and approaches infinity, the results tend toward those of leave-p-out cross validation.

**Steps:**

- Split the data randomly into training and test sets (for example, a 70–30%, 62.5–37.5%, or 86.3–13.7% split). The split is drawn anew in each iteration, and the split percentage may differ between iterations.
- Fit the model on the training set for that iteration and calculate the test error using the fitted model on the test set.
- Repeat for many iterations (say 100, 500, or even 1000) and take the average of the test errors.

![](https://i.imgur.com/Raz2ovj.png)

**Note - the same data can be selected more than once in the test set or even never at all.**
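The following sketch walks through these steps in plain Python. For simplicity it assumes a fixed 70/30 proportion in every iteration, hypothetical toy labels, and a stand-in "model" that predicts the majority class of the training split. It also counts how often each sample lands in a test set, which makes the note above visible.

```python
import random

def monte_carlo_splits(n_samples, test_fraction, n_iterations, seed=0):
    """Yield (train_indices, test_indices) from repeated independent random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * n_samples))
    for _ in range(n_iterations):
        shuffled = rng.sample(range(n_samples), n_samples)
        yield shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy labels; the "model" simply predicts the training majority class
y = [0] * 14 + [1] * 6
test_counts = [0] * len(y)   # how often each sample lands in a test set
errors = []

for train_idx, test_idx in monte_carlo_splits(len(y), test_fraction=0.3, n_iterations=100):
    train_labels = [y[i] for i in train_idx]
    prediction = max(set(train_labels), key=train_labels.count)
    errors.append(sum(y[i] != prediction for i in test_idx) / len(test_idx))
    for i in test_idx:
        test_counts[i] += 1

average_error = sum(errors) / len(errors)
print("Average test error:", average_error)
print("Test-set appearances per sample (min, max):", min(test_counts), max(test_counts))
```

Because the splits are independent, the appearance counts differ from sample to sample, unlike k-fold where every sample is tested exactly once.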

### Import Cross-Validation and Stratified K-Fold

In [13]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

### Explanation for Feature Scaling

In this step, we are performing **feature scaling** to standardize the feature data. Feature scaling is important because many machine learning algorithms perform better when the features are on a similar scale. For example, **KNN** or **SVM** algorithms can be sensitive to the magnitude of the features.

1. **Standardization**:
   - We are using **StandardScaler**, which standardizes the data by transforming each feature to have a **mean of 0** and a **standard deviation of 1**. This is done to avoid any one feature from dominating the model due to its larger range of values.
   
2. **Fitting and Transforming the Data**:
   - The **`fit_transform`** function calculates the mean and standard deviation of each feature and then uses these values to standardize the dataset. The result is a scaled version of the dataset where each feature is centered around 0 with a spread of 1.
   
3. **Why is this important?**:
   - Scaling the data ensures that all features contribute equally to the model's performance. For example, without scaling, features like **age** (ranging from 0 to 100) may dominate over features like **BMI** (ranging from 10 to 40) in distance-based algorithms.
   
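After this step, the scaled feature data will be used for training the machine learning models. Note that tree-based models such as Random Forest split on feature thresholds and are largely insensitive to feature scaling, so this step matters mainly for distance-based algorithms.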


In [14]:
# Feature scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
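Fitting the scaler on all rows before cross-validation lets the held-out folds influence the computed mean and standard deviation. One common way to avoid this is to wrap the scaler and the model in a pipeline, so that scaling is re-fitted on the training folds only. The sketch below illustrates this on synthetic stand-in data generated with `make_classification` (an assumption made for self-containment; in the notebook the unscaled `X` and `y` would be passed instead).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); in the notebook this would be X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The scaler is re-fitted on the training folds only, inside each CV iteration
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", pipeline_scores)
print("Mean accuracy:", pipeline_scores.mean())
```

For a Random Forest the difference is expected to be small, but the same pattern matters more for scale-sensitive models such as KNN or SVM.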

### Random Forest with Stratified K-Fold Cross-Validation

In this step, we use **Stratified K-Fold Cross-Validation** with a **Random Forest Classifier** to evaluate the model's performance.

1. **Stratified K-Fold Cross-Validation**:
   - We use **StratifiedKFold** to split the dataset into 5 subsets (folds), ensuring that the distribution of target classes (diabetic vs. non-diabetic) is maintained in each fold. This is important in imbalanced datasets to ensure each fold has the same proportion of classes.
   - **`n_splits=5`**: This divides the dataset into 5 parts.
   - **`shuffle=True`**: The data is shuffled before splitting to avoid any order bias in the dataset.
   - **`random_state=42`**: This ensures that the data splits are reproducible. Using the same value for `random_state` will produce the same train-test splits every time the code is run.

2. **Random Forest Classifier**:
   - A **Random Forest Classifier** is used as the machine learning model. It is an ensemble learning technique that builds multiple decision trees and combines their results to improve accuracy and reduce overfitting.
   - **`random_state=42`** ensures that the random processes (like random feature selection) are reproducible.

3. **Cross-Validation with `cross_val_score`**:
   - **`cross_val_score`** is used to evaluate the model using cross-validation. It trains the **RandomForestClassifier** on different training sets (using the Stratified K-Fold split) and evaluates it on the complementary test sets.
   - **`cv=kfold`** specifies that **Stratified K-Fold Cross-Validation** should be used.
   - **`scoring='accuracy'`** calculates the **accuracy** of the model for each fold, i.e., the proportion of correctly predicted instances.

4. **Output**:
   - The accuracy scores for each fold are stored in **`rf_scores`**, and the mean accuracy across all folds is calculated and printed.
   - The print statements show the accuracy scores for each fold and the overall mean accuracy of the model.

This process shows how the model performs on different subsets of the data, which makes the evaluation less dependent on any single train-test split and therefore more robust.


In [15]:
# Use StratifiedKFold to maintain class balance
from sklearn.ensemble import RandomForestClassifier
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf_model = RandomForestClassifier(random_state=42)
rf_scores = cross_val_score(rf_model, X_scaled, y, cv=kfold, scoring='accuracy')

print("Random Forest Cross-Validation Accuracy Scores:", rf_scores)
print("Mean Accuracy:", np.mean(rf_scores))


Random Forest Cross-Validation Accuracy Scores: [0.77922078 0.79220779 0.76623377 0.75163399 0.77124183]
Mean Accuracy: 0.7721076309311604
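Besides the mean, the spread of the fold scores gives a sense of how stable the estimate is. As a small illustration, the fold accuracies printed above can be summarised with the standard library (the values are copied from the output, so they are rounded to eight decimals):

```python
from statistics import mean, stdev

# Fold accuracies copied from the output above
rf_fold_scores = [0.77922078, 0.79220779, 0.76623377, 0.75163399, 0.77124183]

mean_accuracy = mean(rf_fold_scores)
spread = stdev(rf_fold_scores)   # sample standard deviation across folds

print(f"Mean accuracy: {mean_accuracy:.4f}")
print(f"Std across folds: {spread:.4f}")
print(f"Range: {min(rf_fold_scores):.4f} to {max(rf_fold_scores):.4f}")
```

A small spread relative to the mean suggests that the accuracy estimate does not depend strongly on which rows end up in which fold.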
