# Cardio Disease Prediction using XGBoost

This notebook trains an XGBoost model on the processed cardio dataset,
evaluates it, and saves the trained model locally.

## 1. Import Required Libraries
Import the necessary libraries for data manipulation, machine learning, visualization, and model serialization.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

from xgboost import XGBClassifier
import xgboost as xgb

import matplotlib.pyplot as plt
import joblib

## 2. Load the Dataset
Load the cleaned cardiovascular disease dataset from the processed data folder.

In [None]:
df = pd.read_csv("../data/processed/clean_cardio.csv")
df.head()

## 3. Dataset Information
Display the structure and data types of the dataset.

In [None]:
df.info()

## 4. Target Variable Distribution
Check the class balance of the target variable (cardio).

In [None]:
df['cardio'].value_counts(normalize=True)
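
To make the output of `value_counts(normalize=True)` concrete, here is a minimal sketch on a toy label series. The values are hypothetical stand-ins for `df['cardio']`, not taken from the dataset, and `imbalance_gap` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['cardio']: 3 negatives, 2 positives
labels = pd.Series([0, 1, 0, 1, 0], name="cardio")

# normalize=True returns the fraction of rows per class instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares)

# The gap between the largest and smallest share is a quick imbalance indicator
imbalance_gap = shares.max() - shares.min()
print("Imbalance gap:", round(imbalance_gap, 3))
```

A gap close to 0 suggests roughly balanced classes, in which case accuracy is a reasonable headline metric.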

## 5. Prepare Features and Target
Separate the dataset into features (X) and target variable (y).

In [None]:
target = 'cardio'

X = df.drop(columns=[target])
y = df[target]

X.shape, y.shape

## 6. Train-Test Split
Split the data into training (80%) and testing (20%) sets with stratification.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
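
The effect of `stratify=y` can be checked by comparing the positive-class rate in each split. The sketch below uses synthetic data (1000 rows, 30% positives) rather than the cardio dataset. The names `X_demo`, `y_demo`, `train_rate`, and `test_rate` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 30% positive class (not the cardio dataset)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series([1] * 300 + [0] * 700, name="cardio")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits should keep roughly the original class ratio
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}")
print(f"test positive rate:  {test_rate:.3f}")
```

Running the same comparison on `y_train` and `y_test` from the split above is a cheap way to confirm the split behaved as intended.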

## 7. Train XGBoost Model
Initialize the XGBoost classifier with a fixed set of hyperparameters and fit it on the training set. No hyperparameter search is run in this notebook.

In [None]:
model = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.85,
    colsample_bytree=0.85,
    min_child_weight=1,
    gamma=0.1,
    eval_metric='logloss',
    random_state=42
)

model.fit(X_train, y_train)

## 8. Model Evaluation
Evaluate the trained model using accuracy, ROC AUC, classification report, and confusion matrix.

In [None]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print('Accuracy:', accuracy_score(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_prob))

print('\nClassification Report:\n')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:\n')
print(confusion_matrix(y_test, y_pred))
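
The numbers in the classification report follow directly from the confusion matrix. As a worked example, the sketch below derives accuracy, precision, and recall from hypothetical labels and predictions. These arrays are made up for illustration and are not results from the cardio model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, not results from the cardio model
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In a disease-screening setting, recall on the positive class is often the number to watch, since a false negative means a missed case.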

## 9. Feature Importance Visualization
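Visualize the top 12 features ranked by importance. By default, `plot_importance` uses the `weight` importance type, which counts how often each feature is used to split across all trees.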

In [None]:
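# Create the axes first and pass them in; otherwise plot_importance
# opens its own figure and the figsize set here is ignored
fig, ax = plt.subplots(figsize=(10, 6))
xgb.plot_importance(model, max_num_features=12, ax=ax)
ax.set_title('Top Feature Importances')
plt.show()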

## 10. Save the Trained Model
Save the trained XGBoost model to a pickle file for later use.

In [None]:
joblib.dump(model, 'cardio_xgboost_model.pkl')
print('Model saved as cardio_xgboost_model.pkl')

## 11. Load the Saved Model
Load the saved model from the pickle file to verify it works correctly.

In [None]:
loaded_model = joblib.load('cardio_xgboost_model.pkl')
print('Model loaded successfully')
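
A successful `joblib.load` only shows that the file could be read. A stronger check is to confirm that the restored model reproduces the original predictions. The sketch below shows this round trip with synthetic data and a `LogisticRegression` stand-in in place of the XGBoost model, on the assumption that any fitted scikit-learn-compatible estimator can be checked the same way. The file name `demo_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a stand-in estimator (not the cardio data or the XGBoost model)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
original = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(original, path)
    restored = joblib.load(path)

# The restored model should reproduce the original probabilities
round_trip_ok = np.allclose(
    original.predict_proba(X_demo), restored.predict_proba(X_demo)
)
print("Round trip OK:", round_trip_ok)
```

In this notebook, the equivalent check would compare `model.predict_proba(X_test)` with `loaded_model.predict_proba(X_test)`.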

## 12. Make Sample Prediction
Test the loaded model by making a prediction on a sample from the test set.

In [None]:
sample = X_test.iloc[:1]

prediction = loaded_model.predict(sample)
probability = loaded_model.predict_proba(sample)

print('Prediction:', prediction[0])
print('Probability:', probability)
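
`predict_proba` returns one column per class, and for a binary problem the predicted label is the more probable class, which amounts to a 0.5 cutoff on the positive-class probability. The sketch below illustrates that relationship and shows how a different cutoff could be applied. The probabilities and the `threshold` value are hypothetical.

```python
import numpy as np

# Hypothetical predict_proba output: columns are P(class 0), P(class 1)
proba = np.array([
    [0.80, 0.20],
    [0.45, 0.55],
    [0.65, 0.35],
])

positive_prob = proba[:, 1]

# Default-style decision: pick the class with the higher probability
default_pred = proba.argmax(axis=1)

# Hypothetical lower threshold that flags more samples as positive
threshold = 0.3
custom_pred = (positive_prob >= threshold).astype(int)

print("Default labels:", default_pred)
print("Labels at threshold 0.3:", custom_pred)
```

Lowering the threshold generally raises recall at the cost of precision, so any cutoff other than 0.5 should be chosen on validation data rather than on the test set.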