<br>

<br>

<br>

# 🚀 **PREDICTING DIABETES** 🚀

**BOOSTING ALGORITHM (XGBOOST)**

<br>

## **INDEX**

- **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**
- **STEP 2: DATA EXPLORATION AND CLEANING**
- **STEP 3: UNIVARIATE VARIABLE ANALYSIS**
- **STEP 4: MULTIVARIATE VARIABLE ANALYSIS**
- **STEP 5: FEATURE ENGINEERING**
- **STEP 6: FEATURE SELECTION**
- **STEP 7: MACHINE LEARNING**
- **STEP 8: CONCLUSIONS**

<br>

### **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**

- 1.1. Problem Definition
- 1.2. Library Importing
- 1.3. Data Collection

**1.1. PROBLEM DEFINITION**

Diabetes is a chronic health condition that affects millions of people worldwide. Early detection and diagnosis are crucial for effective management and for preventing complications. In this study, we aim to develop a predictive model that identifies individuals at risk of developing diabetes from a set of diagnostic measures, using a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.

**RESEARCH QUESTIONS**

**Feature Importance**
- Which diagnostic measures (e.g., glucose levels, BMI) are the strongest predictors of diabetes?
- How does the relative importance of these features compare?

**Feature Interactions**
- Are there significant interactions between diagnostic measures that influence diabetes risk?
- How do these interactions affect the predictive model?

**Clinical Implications**
- Can the model identify subgroups of patients with distinct risk profiles?
- How can the model be used to improve clinical decision-making and early intervention?

**Model Performance**
- How well does the **`BOOSTING ALGORITHM (XGBoost)`** generalize to new, unseen data?
- What is the impact of different hyperparameter settings on model performance?


**Methodology**
- **`Extreme Gradient Boosting`**
- XGBoost, or Extreme Gradient Boosting, is a powerful machine learning algorithm that is widely used for both classification and regression tasks. It's part of a family of algorithms known as gradient boosting machines.

**How does XGBoost work?**

- **Sequential Model Building**:  XGBoost constructs a model sequentially. It starts by building a simple model (like a decision tree) and then adds new models one by one.
- **Minimizing Loss**: Each new model is trained to correct the errors made by the previous models. It does this by minimizing a loss function, which measures how well the model fits the training data.
- **Regularization**: XGBoost incorporates regularization techniques to prevent overfitting. This helps the model generalize better to unseen data.
- **Parallel Processing**: Although trees are added one after another, XGBoost parallelizes the work inside each tree (such as the search for the best split) across multiple CPU cores or on a GPU.
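
The sequential, error-correcting idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not XGBoost's actual implementation: it uses squared-error loss, one-split decision stumps as weak learners, and a made-up toy dataset. The names `fit_stump`, `predict_stump`, and `boost` are hypothetical helpers introduced only for this example.

```python
# Minimal gradient boosting for regression with squared-error loss.
# Each weak learner is a one-split decision stump on a single feature.

def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals (least squares)."""
    best = None
    # Exclude the largest value so the right side of the split is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate=0.3):
    """Return the base prediction, the fitted stumps, and the training MSE per round."""
    base = sum(ys) / len(ys)          # start from a constant model
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken copy of the new stump to the running prediction
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


# Toy data (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

base, stumps, mse_history = boost(xs, ys, n_rounds=20)
print(f"MSE after round 1:  {mse_history[0]:.4f}")
print(f"MSE after round 20: {mse_history[-1]:.4f}")
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error shrinks as rounds are added. The learning rate plays the same role as `eta` in XGBoost: smaller values take more cautious steps.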


**`XGBoost` vs. `Random Forest` vs. `Decision Tree`**
- **Decision Tree**: A decision tree is a basic machine learning model that makes decisions by splitting the data based on certain conditions. It's a single tree-like model.
- **Random Forest**: A random forest is an ensemble method that combines multiple decision trees. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split. The final prediction is a majority vote across the trees for classification, or an average for regression.
- **XGBoost**: XGBoost is also an ensemble method, but it differs from random forest in several ways:
    - **Sequential vs. Parallel**: Random forest builds trees independently, while XGBoost builds trees sequentially.
    - **Optimization**: XGBoost fits each new tree using the first and second derivatives of the loss function, so every tree directly targets the error that remains.
    - **Regularization**: XGBoost incorporates regularization techniques to prevent overfitting.
    - **Handling Missing Values**: XGBoost has built-in mechanisms for handling missing values.
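
To make the optimization and regularization points more concrete, here is a small sketch of how a regularized leaf value can be computed for log loss. It follows the formula from the XGBoost paper, w = -G / (H + λ), where G and H are the sums of first and second derivatives of the loss over the samples in a leaf. The helper names and the four-sample toy leaf are assumptions made for illustration; the library itself does considerably more (split search, `gamma` pruning, and so on).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(y_true, raw_score):
    """First and second derivative of log loss with respect to the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


def leaf_weight(labels, raw_scores, reg_lambda):
    """Regularized optimal leaf value: w = -G / (H + lambda)."""
    pairs = [grad_hess(y, s) for y, s in zip(labels, raw_scores)]
    grad_sum = sum(g for g, _ in pairs)
    hess_sum = sum(h for _, h in pairs)
    return -grad_sum / (hess_sum + reg_lambda)


# Toy leaf: three positive samples and one negative, all starting at raw score 0
labels = [1, 1, 1, 0]
raw_scores = [0.0, 0.0, 0.0, 0.0]

for lam in (0.0, 1.0, 10.0):
    weight = leaf_weight(labels, raw_scores, lam)
    print(f"lambda={lam:>4}: leaf weight = {weight:.4f}")
```

For this toy leaf the weight shrinks from 1.0 with no regularization to 0.5 at λ = 1 and roughly 0.09 at λ = 10. A larger `reg_lambda` therefore makes every tree contribute a smaller correction, which is one way XGBoost limits overfitting.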

**To summarize:**
- Decision trees are the building blocks of more complex models like random forests and XGBoost.
- Random forests combine multiple decision trees to improve accuracy and reduce overfitting.
- XGBoost is a highly optimized gradient boosting algorithm that builds models sequentially and incorporates regularization to prevent overfitting.

<br>

**1.2. LIBRARY IMPORTING**

In [29]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle  # For saving the model
from pickle import dump
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

**1.3. DATA COLLECTION**

In [30]:
# URLs for the processed datasets (adjust these URLs with the correct RAW paths from GitHub)
X_train_url = "https://raw.githubusercontent.com/jenuzho/PREDICTING-DIABETES-decision-tree/main/data/processed/X_train_without_outliers_minmax_sel.csv"
X_test_url = "https://raw.githubusercontent.com/jenuzho/PREDICTING-DIABETES-decision-tree/main/data/processed/X_test_without_outliers_minmax_sel.csv"
y_train_url = "https://raw.githubusercontent.com/jenuzho/PREDICTING-DIABETES-decision-tree/main/data/processed/y_train.csv"
y_test_url = "https://raw.githubusercontent.com/jenuzho/PREDICTING-DIABETES-decision-tree/main/data/processed/y_test.csv"

# Load the datasets
X_train = pd.read_csv(X_train_url)
X_test = pd.read_csv(X_test_url)
y_train = pd.read_csv(y_train_url)
y_test = pd.read_csv(y_test_url)

# Check the first few rows of the training data
print(X_train.head())
print(y_train.head())


   Pregnancies   Glucose   Insulin       BMI  DiabetesPedigreeFunction  \
0     0.176471  0.577889  0.188172  0.641414                  0.064171   
1     0.176471  0.567839  0.114247  0.496633                  0.488414   
2     0.294118  0.793970  0.282258  0.663300                  0.282531   
3     0.176471  0.391960  0.000000  0.547138                  0.171123   
4     0.000000  0.507538  0.000000  0.601010                  0.106952   

        Age  
0  0.116667  
1  0.066667  
2  0.133333  
3  0.300000  
4  0.083333  
   Outcome
0        0
1        0
2        1
3        0
4        0


In [31]:
# Create the base XGBoost model
xgb_model = XGBClassifier(random_state=42, eval_metric="logloss")

# Train the model on the training data
xgb_model.fit(X_train, y_train.values.ravel())

# Make predictions on the test data
y_pred = xgb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Base Model Accuracy: {accuracy}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Base Model Accuracy: 0.7727272727272727

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.79      0.81        96
           1       0.68      0.74      0.71        58

    accuracy                           0.77       154
   macro avg       0.76      0.77      0.76       154
weighted avg       0.78      0.77      0.77       154



In [32]:
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Convert data to DMatrix (XGBoost's optimized data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define parameters
param = {
    'max_depth': 3,
    'eta': 0.1,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
}

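# Perform 3-fold cross-validation over all 100 boosting rounds (no early stopping is used)
cv_results = xgb.cv(param, dtrain, num_boost_round=100, nfold=3, metrics='logloss', as_pandas=True)

# Train the final model. cv_results has one row per boosting round, so without
# early stopping len(cv_results) equals num_boost_round (100 here)
final_model = xgb.train(param, dtrain, num_boost_round=len(cv_results))

# Make predictions: with 'binary:logistic' the booster returns probabilities,
# so threshold them at 0.5 to get class labels
y_pred = final_model.predict(dtest)
y_pred_binary = [1 if i > 0.5 else 0 for i in y_pred]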

# Evaluate the predictions
accuracy = accuracy_score(y_test, y_pred_binary)
print(f"Final Model Accuracy: {accuracy}")


Final Model Accuracy: 0.7987012987012987
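
The cell above trains for as many rounds as the cross-validation table has rows. A possible refinement is to pick the round with the lowest mean validation loss instead. The sketch below shows that selection on a plain list; the loss values are invented for illustration and are not results from this notebook, and `best_num_rounds` is a hypothetical helper.

```python
# Hypothetical mean validation log loss per boosting round
# (illustrative values only, not results from this notebook)
test_logloss_mean = [0.66, 0.61, 0.57, 0.54, 0.52, 0.51, 0.505, 0.507, 0.512, 0.52]


def best_num_rounds(losses):
    """Return the 1-based number of rounds with the lowest validation loss."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return best_index + 1


best_rounds = best_num_rounds(test_logloss_mean)
print(f"Rounds evaluated: {len(test_logloss_mean)}, best number of rounds: {best_rounds}")
```

In the notebook, the equivalent list would be the `test-logloss-mean` column of `cv_results`. Another option is to pass `early_stopping_rounds` to `xgb.cv`, which truncates the table once the validation loss stops improving, so that `len(cv_results)` becomes a meaningful round count.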


In [33]:
from xgboost import XGBClassifier

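# Get the best parameters from GridSearchCV.
# grid_search must be a fitted GridSearchCV object; it is not defined in the cells shown here.
best_params = grid_search.best_params_

# Train a new XGBClassifier with the best parameters.
# Caution: the keys still carry the "classifier__" prefix, so XGBoost ignores them
# (see the warning in the output) and the model effectively uses default settings.
best_xgb_model = XGBClassifier(**best_params, eval_metric="logloss", random_state=42)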
best_xgb_model.fit(X_train, y_train.values.ravel())

# Make predictions on the test set
y_pred_optimized = best_xgb_model.predict(X_test)


Parameters: { "classifier__learning_rate", "classifier__max_depth", "classifier__n_estimators" } are not used.
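
The warning lists parameter names with a `classifier__` prefix. That is the naming scheme scikit-learn uses for parameters of a `Pipeline` step, which suggests the grid search was run on a pipeline whose final step was named `classifier` (an assumption, since that cell is not shown). The sketch below shows how such keys can be mapped back to plain estimator arguments; the parameter values are made up for illustration and `strip_step_prefix` is a hypothetical helper.

```python
# Hypothetical best_params_ from a grid search over a Pipeline whose final step
# is named "classifier" (values are illustrative, not from this notebook)
pipeline_best_params = {
    "classifier__learning_rate": 0.1,
    "classifier__max_depth": 3,
    "classifier__n_estimators": 100,
}


def strip_step_prefix(params):
    """Drop the '<step>__' prefix so keys match the estimator's own argument names."""
    return {key.split("__")[-1]: value for key, value in params.items()}


estimator_params = strip_step_prefix(pipeline_best_params)
print(estimator_params)
```

Without this step, `XGBClassifier(**best_params)` receives names it does not recognize and silently falls back to its defaults, which is what the warning reports.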



In [34]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# Clean the best parameters by removing the prefix if needed
best_params = {key.split('__')[-1]: value for key, value in grid_search.best_params_.items()}

# Train a new XGBClassifier with the cleaned parameters
best_xgb_model = XGBClassifier(**best_params, eval_metric="logloss", random_state=42)
best_xgb_model.fit(X_train, y_train.values.ravel())

# Make predictions on the test set
y_pred_optimized = best_xgb_model.predict(X_test)

# Evaluate the optimized model
accuracy_optimized = accuracy_score(y_test, y_pred_optimized)
print(f"Optimized Model Accuracy: {accuracy_optimized}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_optimized))


Optimized Model Accuracy: 0.7077922077922078

Classification Report:
              precision    recall  f1-score   support

           0       0.68      1.00      0.81        96
           1       1.00      0.22      0.37        58

    accuracy                           0.71       154
   macro avg       0.84      0.61      0.59       154
weighted avg       0.80      0.71      0.64       154



In [35]:
from sklearn.metrics import confusion_matrix, classification_report

# Evaluate the optimized model
accuracy_optimized = accuracy_score(y_test, y_pred_optimized)
print(f"Optimized Model Accuracy: {accuracy_optimized}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_optimized)
print("\nConfusion Matrix:")
print(conf_matrix)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_optimized))


Optimized Model Accuracy: 0.7077922077922078

Confusion Matrix:
[[96  0]
 [45 13]]

Classification Report:
              precision    recall  f1-score   support

           0       0.68      1.00      0.81        96
           1       1.00      0.22      0.37        58

    accuracy                           0.71       154
   macro avg       0.84      0.61      0.59       154
weighted avg       0.80      0.71      0.64       154
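
As a cross-check, the figures in the classification report can be reproduced by hand from the confusion matrix printed above. The sketch below assumes the scikit-learn layout, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
conf_matrix = [[96, 0],
               [45, 13]]

tn, fp = conf_matrix[0]   # true class 0: predicted 0, predicted 1
fn, tp = conf_matrix[1]   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 (diabetes)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (no diabetes)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"Accuracy: {accuracy:.4f}")
print(f"Class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"Class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The numbers match the report: the model never raises a false alarm (precision 1.00 for class 1) but detects only 13 of the 58 diabetic patients, which is why recall for class 1 is 0.22.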

