<br>

<br>

<br>

# 🚀 **PREDICTING DIABETES** 🚀

**BOOSTING ALGORITHM (XGBOOST)**

<br>

## **INDEX**

- **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**
- **STEP 2: DATA EXPLORATION AND CLEANING**
- **STEP 3: UNIVARIATE VARIABLE ANALYSIS**
- **STEP 4: MULTIVARIATE VARIABLE ANALYSIS**
- **STEP 5: FEATURE ENGINEERING**
- **STEP 6: FEATURE SELECTION**
- **STEP 7: MACHINE LEARNING**
- **STEP 8: CONCLUSIONS**

<br>

### **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**

- 1.1. Problem Definition
- 1.2. Library Importing
- 1.3. Data Collection

**1.1. PROBLEM DEFINITION**

Diabetes is a chronic health condition that affects millions of people worldwide. Early detection and diagnosis are crucial for effective management and for preventing complications. In this study, we aim to develop a predictive model that identifies individuals at risk of developing diabetes from a set of diagnostic measures, using a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.

**RESEARCH QUESTIONS**

**Feature Importance**
- Which diagnostic measures (e.g., glucose levels, BMI) are the strongest predictors of diabetes?
- How does the relative importance of these features compare?

**Feature Interactions**
- Are there significant interactions between diagnostic measures that influence diabetes risk?
- How do these interactions affect the predictive model?

**Clinical Implications**
- Can the model identify subgroups of patients with distinct risk profiles?
- How can the model be used to improve clinical decision-making and early intervention?

**Model Performance**
- How well does the **`BOOSTING ALGORITHM (XGBoost)`** generalize to new, unseen data?
- What is the impact of different hyperparameter settings on model performance?


**Methodology**
- **`Extreme Gradient Boosting`**
- XGBoost, or Extreme Gradient Boosting, is a powerful machine learning algorithm that is widely used for both classification and regression tasks. It's part of a family of algorithms known as gradient boosting machines.

**How does XGBoost work?**

- **Sequential Model Building**: XGBoost constructs its model sequentially. It starts from a simple initial prediction and then adds new base models (typically decision trees) one by one.
- **Minimizing Loss**: Each new tree is trained to correct the errors made by the trees before it. It does this by reducing a loss function, which measures how far the current predictions are from the training labels.
- **Regularization**: XGBoost penalizes overly complex trees to prevent overfitting. This helps the model generalize better to unseen data.
- **Parallel Processing**: Although the trees are added one after another, the work of building each tree (such as searching for the best split) can be spread across multiple CPU cores or run on a GPU.
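
The sequential idea described above can be illustrated with a deliberately small sketch. It is not XGBoost itself: it uses plain Python, squared-error loss, and one-split trees ("stumps"), and it leaves out regularization, second-order gradients, and parallelism. The function names (`fit_stump`, `fit_boosting`, `predict`) and the toy data are assumptions made up for this illustration.

```python
# Toy gradient boosting with squared-error loss and one-split trees (stumps).
# Illustrative only: names and data are hypothetical, and real XGBoost does
# considerably more (regularization, second-order gradients, parallelism).

def fit_stump(x, residuals):
    """Find the threshold on x that best splits the residuals (least squares)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def fit_boosting(x, y, n_rounds=10, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)              # initial prediction: the mean label
    predictions = [base] * len(y)
    stumps, losses = [], []
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
        losses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))
    return {"base": base, "stumps": stumps,
            "learning_rate": learning_rate, "losses": losses}


def predict(model, xi):
    """Sum the base value and the scaled contribution of every stump."""
    total = model["base"]
    for stump in model["stumps"]:
        total += model["learning_rate"] * predict_stump(stump, xi)
    return total


# Made-up one-feature data: higher x tends to go with label 1.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0, 0, 0, 1, 0, 1, 1, 1]
model = fit_boosting(toy_x, toy_y)
print([round(loss, 4) for loss in model["losses"]])  # training loss per round
```

Each round fits a new stump to whatever error the previous rounds left behind, so the training loss shrinks as rounds are added. The learning rate scales down each stump's contribution, which is one simple way to keep any single tree from dominating.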


**`XGBoost` vs. `Random Forest` vs. `Decision Tree`**
- **Decision Tree**: A decision tree is a basic machine learning model that makes decisions by splitting the data based on certain conditions. It's a single tree-like model.
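- **Random Forest**: A random forest is an ensemble method that combines multiple decision trees. Each tree in the forest is trained on a different random sample of the data and considers a random subset of the features. The final prediction is made by combining the predictions of all the trees: averaging for regression, majority vote (or averaged probabilities) for classification.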
- **XGBoost**: XGBoost is also an ensemble method, but it differs from random forest in several ways:
    - **Sequential vs. Parallel**: Random forest builds trees independently, while XGBoost builds trees sequentially.
    - **Optimization**: XGBoost fits each new tree using the gradient of a differentiable loss function, so every tree directly targets the error that the current ensemble still makes.
    - **Regularization**: XGBoost incorporates regularization techniques to prevent overfitting.
    - **Handling Missing Values**: XGBoost has built-in mechanisms for handling missing values.
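
The optimization and regularization points can be written down more precisely. As a sketch in the notation of the original XGBoost paper (the symbols below do not appear elsewhere in this notebook), the quantity being minimized is the training loss plus a complexity penalty for every tree:

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

Here $l$ is the per-sample loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, and $w_j$ is the score of leaf $j$. A larger $\gamma$ discourages extra leaves and a larger $\lambda$ shrinks leaf scores; in the Python package these correspond to the `gamma` and `reg_lambda` parameters.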

**To summarize:**
- Decision trees are the building blocks of more complex models like random forests and XGBoost.
- Random forests combine multiple decision trees to improve accuracy and reduce overfitting.
- XGBoost is a highly optimized gradient boosting algorithm that builds models sequentially and incorporates regularization to prevent overfitting.

<br>

**1.2. LIBRARY IMPORTING**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle  # For saving the model
from pickle import dump
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

**1.3. DATA COLLECTION**

In [2]:
pd.options.display.max_columns = None  # Show every column when displaying a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/decision-tree-project-tutorial/main/diabetes.csv")
df.head()

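|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |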


In [3]:
df.to_csv("../data/raw/diabetes_data.csv", index=False)
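
Saving the raw copy fails if the target folder does not exist yet. One possible safeguard, sketched below, is to create the parent directory before writing. The helper name `ensure_parent_dir` is hypothetical and not part of the original notebook, and the demonstration writes into a temporary folder rather than the project's real `../data/raw/` path.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway directory (hypothetical layout mirroring data/raw).
with tempfile.TemporaryDirectory() as tmp_dir:
    target = ensure_parent_dir(Path(tmp_dir) / "data" / "raw" / "diabetes_data.csv")
    print(target.parent.is_dir())

# Assumed usage with the notebook's DataFrame:
# df.to_csv(ensure_parent_dir("../data/raw/diabetes_data.csv"), index=False)
```

Because `exist_ok=True` is set, calling the helper again when the folder is already there is harmless, so the cell can be re-run safely.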