In [75]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/heart-disease-dataset/heart.csv


# <span style="color:lightblue;">**Binary Classification with XGBoost**</span>

---

In this notebook, we dive into the **XGBoost** algorithm and look at why it often trains faster than the **Random Forest Classifier** on certain datasets. **XGBoost**, short for *Extreme Gradient Boosting*, combines boosting, which adds trees one at a time so that each new tree corrects the errors of the ensemble so far, with gradient-based optimization of the loss. We will apply **XGBClassifier** to a binary classification task to showcase its effectiveness.

**Table of contents**
1. [Import Libraries](#Import-Libraries)
2. [Load Heart Disease Dataset](#Load-Dataset)
3. [Train XGBClassifier](#Train-Classifier)
4. Evaluation of XGBClassifier
5. Comparison with RandomForestClassifier

# Import Libraries

In [76]:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import time

# Load Dataset

In [77]:
df = pd.read_csv("/kaggle/input/heart-disease-dataset/heart.csv")
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [78]:
X = df.drop("target", axis = 1)
y = df["target"]

df["target"].value_counts()

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.4, random_state = 123)
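
Since the class balance is inspected with `value_counts()` above, one option worth considering is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below is not part of the original run. It uses small synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`) so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, 30% positive class.
rng = np.random.default_rng(123)
X_demo = pd.DataFrame({"feature_a": rng.normal(size=100),
                       "feature_b": rng.normal(size=100)})
y_demo = pd.Series([1] * 30 + [0] * 70, name="target")

# stratify=y_demo asks scikit-learn to preserve the class ratio in both splits.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=123, stratify=y_demo
)

print("Positive rate overall: ", y_demo.mean())
print("Positive rate in train:", yd_train.mean())
print("Positive rate in test: ", yd_test.mean())
```

In this notebook the equivalent change would be passing `stratify = y` to the existing `train_test_split` call. The scores reported below were produced without it.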

# Train Classifier

In [80]:
xgb_start = time.time()
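xgb = XGBClassifier(n_estimators = 100,       # number of boosting rounds (trees)
              colsample_bynode = 0.6,         # fraction of features sampled at each split
              gamma = 10,                     # minimum loss reduction required to make a split
              random_state = 123)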
xgb.fit(X_train,y_train)

xgb_stop = time.time()


In [81]:
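def scoring(model = xgb, X_train = X_train, X_test = X_test, y_train = y_train, y_test = y_test):
    """Print mean accuracy on the test and training splits (defaults are bound when this cell runs)."""
    print("Test Score:", model.score(X_test,y_test))
    print("Training Score",model.score(X_train,y_train))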

In [82]:
scoring()

Test Score: 0.8634146341463415
Training Score 0.8975609756097561
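
Accuracy alone hides which class the errors fall on. `classification_report` and `confusion_matrix` are imported earlier in this notebook but are not used in the cells shown, so here is a minimal sketch of how they could be applied. The helper name `evaluate` and the synthetic data from `make_classification` are assumptions made so the block runs on its own; a `LogisticRegression` stands in for the fitted classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def evaluate(model, X_eval, y_eval):
    """Print accuracy, per-class metrics and the confusion matrix; return the matrix."""
    y_pred = model.predict(X_eval)
    print("Accuracy:", accuracy_score(y_eval, y_pred))
    print(classification_report(y_eval, y_pred))
    cm = confusion_matrix(y_eval, y_pred)  # rows: true class, columns: predicted class
    print(cm)
    return cm


# Synthetic binary problem (assumption), standing in for the heart disease data.
X_syn, y_syn = make_classification(n_samples=500, n_features=13, random_state=123)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.4, random_state=123
)
demo_model = LogisticRegression(max_iter=1000).fit(Xs_train, ys_train)
cm_demo = evaluate(demo_model, Xs_test, ys_test)
```

With the objects defined in this notebook, the call would be `evaluate(xgb, X_test, y_test)`.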


In [83]:
from sklearn.ensemble import RandomForestClassifier
rf_start = time.time()
rf = RandomForestClassifier(n_estimators = 100, random_state = 42,
                                 max_depth = 4)
rf.fit(X_train,y_train)
rf_stop = time.time()
scoring(model = rf)

Test Score: 0.8878048780487805
Training Score 0.9170731707317074


In [84]:
xgb_train = xgb_stop-xgb_start
rf_train = rf_stop - rf_start

print("XGB Train Time:",xgb_train)
print("RF Train time: ", rf_train)
print(rf_train/xgb_train)

XGB Train Time: 0.03485918045043945
RF Train time:  0.21311521530151367
6.113603720675741


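In this run XGBoost trained about 6 times faster than Random Forest (roughly 0.035 s versus 0.213 s). This is a single timing on a small dataset, so the exact ratio will vary between runs. Likely contributing factors:

1. XGBoost builds its trees sequentially, each one fitted to the errors of the ensemble so far. That alone does not make training faster, so the speed-up comes from the points below.
2. Regularization with `gamma` rejects splits whose loss reduction falls below the threshold, which leads to smaller trees, i.e. weak learners.
3. With `colsample_bynode = 0.6`, only a sample of the features is evaluated at each split.
4. XGBoost parallelizes the split search within each tree.
5. Random Forest builds each tree independently and, by default, grows it to full depth without pruning. Here `max_depth = 4` caps the depth, but with the default `n_jobs = None` scikit-learn fits the 100 trees one after another on a single core, which adds to the training time.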
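
A single wall-clock measurement on a dataset this small can vary noticeably between runs. One hedged way to check the comparison is to repeat each fit several times and report the median. The helper `median_fit_time` is hypothetical, the data is synthetic, and only scikit-learn models are timed here so the block runs on its own.

```python
import statistics
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def median_fit_time(make_model, X, y, repeats=5):
    """Fit a fresh model `repeats` times and return the median fit time in seconds."""
    durations = []
    for _ in range(repeats):
        model = make_model()             # new, unfitted estimator each round
        start = time.perf_counter()      # monotonic clock suited to short intervals
        model.fit(X, y)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Synthetic data (assumption), not the heart disease dataset.
X_time, y_time = make_classification(n_samples=600, n_features=13, random_state=123)

rf_single = median_fit_time(
    lambda: RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42, n_jobs=1),
    X_time, y_time,
)
rf_parallel = median_fit_time(
    lambda: RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42, n_jobs=-1),
    X_time, y_time,
)
print("RF median fit time, n_jobs=1: ", rf_single)
print("RF median fit time, n_jobs=-1:", rf_parallel)
```

Passing `lambda: XGBClassifier(n_estimators = 100, colsample_bynode = 0.6, gamma = 10, random_state = 123)` as `make_model` would time the XGBoost configuration from this notebook in the same way.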