# XGBoost vs LightGBM vs CatBoost vs AdaBoost vs GBM

This notebook compares the performance of five boosting ensemble implementations available in Python:

* xgboost
* lightgbm
* catboost
* adaboost (scikit-learn)
* gbm (scikit-learn)

These implementations differ both in the boosting algorithm they use and in how that algorithm is implemented.

Let's compare them in terms of predictive performance, training time, and the data preparation each requires on a simple classification dataset.

In [1]:
import time
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

## Read in and Prepare Data

We will use the Breast Cancer dataset from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/dataset/14/breast+cancer):

Zwitter, Matjaz and Soklic, Milan. (1988). Breast Cancer. UCI Machine Learning Repository. https://doi.org/10.24432/C51P4M.

In [2]:
data = pd.read_csv(
    './breast+cancer/breast-cancer.data',
    names=[
        'class',
        'age',
        'menopause',
        'tumor-size',
        'inv-nodes',
        'node-caps',
        'deg-malig',
        'breast',
        'breast-quad',
        'irradiat'
    ]
)

In [3]:
data.shape

(286, 10)

In [4]:
data.head(5)

|  | class | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | no-recurrence-events | 30-39 | premeno | 30-34 | 0-2 | no | 3 | left | left_low | no |
| 1 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | right | right_up | no |
| 2 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | left | left_low | no |
| 3 | no-recurrence-events | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | right | left_up | no |
| 4 | no-recurrence-events | 40-49 | premeno | 0-4 | 0-2 | no | 2 | right | right_low | no |


The 'class' column is our target, whereas all the other columns are predictor features. We can make some quick observations:

* features 'age', 'menopause', 'tumor-size', 'inv-nodes', and 'deg-malig' are ordinal
* features 'node-caps', 'breast', 'breast-quad', and 'irradiat' are categorical

In addition, the dataset's UCI page states that 'node-caps' and 'breast-quad' contain missing values. Let's check:

In [5]:
data['node-caps'].unique()

array(['no', 'yes', '?'], dtype=object)

In [6]:
data[data['node-caps']=='?'].shape[0]

8

In [7]:
data['breast-quad'].unique()

array(['left_low', 'right_up', 'left_up', 'right_low', 'central', '?'],
      dtype=object)

In [8]:
data[data['breast-quad']=='?'].shape[0]

1

Let's fill in the missing values with the mode for each feature:

In [9]:
data.loc[data['node-caps']=='?','node-caps'] = data['node-caps'].mode().values[0]

In [10]:
data.loc[data['breast-quad']=='?','breast-quad'] = data['breast-quad'].mode().values[0]
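
As a variation on the two cells above, here is a minimal sketch of mode imputation on a toy column. It first turns the `'?'` marker into a real missing value, so that the mode is computed from observed values only; this guards against the marker itself being the most frequent entry. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy column that uses '?' as its missing-value marker, like 'node-caps'.
toy = pd.DataFrame({"node-caps": ["no", "yes", "?", "no", "?"]})

# Replace the marker with NaN, then compute the mode of the observed values.
col = toy["node-caps"].mask(toy["node-caps"] == "?")
mode_value = col.mode().iloc[0]

# Fill the missing entries with that mode.
toy["node-caps"] = col.fillna(mode_value)

print(toy["node-caps"].tolist())
```

On the real data the result should match the cells above whenever `'?'` is not the most common value, which is the case for both features here.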

Now we can separate the target and predictor features:

In [11]:
y = data['class']
X = data.drop('class',axis=1)

Ordinal columns need to be transformed into numerical values that preserve their order:

In [12]:
age = ['10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-99']
menopause = ['lt40', 'ge40', 'premeno']
tumor_size = ['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59']
inv_nodes = ['0-2', '3-5', '6-8', '9-11', '12-14', '15-17', '18-20', '21-23', '24-26', '27-29', '30-32', '33-35', '36-39']

In [13]:
encoder = OrdinalEncoder(categories=[age,menopause,tumor_size,inv_nodes])

In [14]:
X.loc[:,['age','menopause','tumor-size','inv-nodes']] = encoder.fit_transform(X[['age','menopause','tumor-size','inv-nodes']])

In [15]:
X.head(5)

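|  | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.0 | 2.0 | 6.0 | 0.0 | no | 3 | left | left_low | no |
| 1 | 3.0 | 2.0 | 4.0 | 0.0 | no | 2 | right | right_up | no |
| 2 | 3.0 | 2.0 | 4.0 | 0.0 | no | 2 | left | left_low | no |
| 3 | 5.0 | 1.0 | 3.0 | 0.0 | no | 2 | right | left_up | no |
| 4 | 3.0 | 2.0 | 0.0 | 0.0 | no | 2 | right | right_low | no |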
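
To make the encoding concrete: with an explicit category list, each label is replaced by its position in that list. The plain-Python sketch below reproduces this mapping for 'age' without scikit-learn; the `age_to_code` and `sample` names are illustrative only.

```python
# Ordered category list copied from the 'age' definition above.
age = ['10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-99']

# Each label maps to its zero-based position in the ordered list.
age_to_code = {label: code for code, label in enumerate(age)}

# Age labels of rows 0, 1, and 3 from data.head(5).
sample = ['30-39', '40-49', '60-69']
codes = [age_to_code[label] for label in sample]

print(codes)  # [2, 3, 5]
```

These codes agree with the 'age' values 2.0, 3.0, and 5.0 shown for the same rows in the encoded output.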


Next, cast the nominal features to the pandas category dtype and all other columns to integers:

In [16]:
for cat in X.columns.values:
    if cat in ['node-caps', 'breast', 'breast-quad', 'irradiat']:
        X[cat] = X[cat].astype('category')
    else:
        X[cat] = X[cat].astype('int')
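
For readers unfamiliar with the pandas category dtype, the short sketch below shows what the cast does to a column: the distinct labels are stored once, and each row holds an integer code pointing at one of them. Libraries that accept categorical input can then detect such columns by their dtype. The toy series is made up for illustration.

```python
import pandas as pd

# Toy nominal column, similar in spirit to 'node-caps' or 'irradiat'.
s = pd.Series(["no", "yes", "no", "no"]).astype("category")

# The distinct labels, in sorted order.
print(s.cat.categories.tolist())  # ['no', 'yes']

# The per-row integer codes that index into those labels.
print(s.cat.codes.tolist())       # [0, 1, 0, 0]
```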

Finally, convert the target into numerical values:

In [17]:
y.unique()

array(['no-recurrence-events', 'recurrence-events'], dtype=object)

In [18]:
y.mask(y=='no-recurrence-events', 0, inplace=True)
y.mask(y=='recurrence-events', 1, inplace=True)

In [19]:
y.unique()

array([0, 1], dtype=object)

Are the class labels balanced?

In [20]:
y.shape[0]

286

In [21]:
y[y==0].shape[0]

201

In [22]:
y[y==1].shape[0]

85

The classes are imbalanced (201 vs. 85), so the helper below performs the train-test split and then resamples the training set with replacement so that both classes contribute the same number of rows. The test set is left untouched:

In [23]:
def balanced_train_test_split(X, y, test_size):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    # balance the classes of the training data
    D_train = X_train.copy()
    D_train['y'] = y_train 
    D0 = D_train[D_train.y == 0]
    D1 = D_train[D_train.y == 1]
    n_samples = int(X_train.shape[0]/2)
    D0 = D0.sample(n=n_samples,replace=True,random_state=42)
    D1 = D1.sample(n=n_samples,replace=True,random_state=42)
    D_train = pd.concat([D0,D1])
    y_train = D_train['y']
    X_train = D_train.drop('y',axis=1)
    return (X_train,X_test,y_train,y_test)
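
The resampling step inside this helper can be checked in isolation. The sketch below applies the same idea to a made-up training frame with six rows of class 0 and two rows of class 1: each class is sampled with replacement up to half the training size, so minority rows are duplicated and some majority rows may be dropped.

```python
import pandas as pd

# Toy training frame: 6 rows of class 0, 2 rows of class 1.
train = pd.DataFrame({"x": range(8), "y": [0, 0, 0, 0, 0, 0, 1, 1]})

# Draw the same number of rows from each class, with replacement.
n_per_class = len(train) // 2
parts = [
    train[train["y"] == label].sample(n=n_per_class, replace=True, random_state=42)
    for label in (0, 1)
]
balanced = pd.concat(parts)

counts = balanced["y"].value_counts().to_dict()
print(counts)
```

Both classes end up with four rows each, and every class-1 row in `balanced` is a copy of one of the two original class-1 rows.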

## XGBoost

In [24]:
X_train, X_test, y_train, y_test = balanced_train_test_split(X, y, test_size=0.2)

Now let's try out the experimental *enable_categorical* feature with XGBoost:

In [25]:
start_time = time.time()
model = XGBClassifier(
    learning_rate=0.01, 
    max_depth=5, 
    n_estimators=500, 
    enable_categorical=True, 
    random_state=42
)
model.fit(X_train,y_train)
print(f"training time duration: {time.time() - start_time:.2f}")

training time duration: 0.30
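
Durations this short are sensitive to whatever else the machine is doing, so a single measurement can be noisy. One possible refinement, sketched below, repeats the call several times with `time.perf_counter` and keeps the shortest run. The `best_of` helper and the stand-in workload are hypothetical; in this notebook the callable would wrap the `model.fit` call.

```python
import time

def best_of(fn, repeats=5):
    """Return the shortest wall-clock duration (in seconds) of fn() over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations)

# Stand-in workload used only to make the sketch runnable on its own.
elapsed = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"training time duration: {elapsed:.4f} s")
```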


In [26]:
y_test = y_test.values.tolist()
y_pred = model.predict(X_test).tolist()
print(f'accuracy score: {accuracy_score(y_test,y_pred):.2f}')
print(f'precision score: {precision_score(y_test,y_pred):.2f}')
print(f'recall score: {recall_score(y_test,y_pred):.2f}')
print(f'f1 score: {f1_score(y_test,y_pred):.2f}')

accuracy score: 0.66
precision score: 0.52
recall score: 0.62
f1 score: 0.57
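
The four scores reported throughout this notebook all derive from the counts of true/false positives and negatives, with class 1 (recurrence) as the positive class. The sketch below computes them by hand on made-up labels, as a reminder of what each number measures; the labels are not taken from the notebook's test set.

```python
# Made-up ground truth and predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # correctly predicted positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives predicted as positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives predicted as negative
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly predicted negatives

accuracy = (tp + tn) / len(pairs)    # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are real
recall = tp / (tp + fn)              # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy score: {accuracy:.2f}")
print(f"precision score: {precision:.2f}")
print(f"recall score: {recall:.2f}")
print(f"f1 score: {f1:.2f}")
```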


## CatBoost

In [27]:
start_time = time.time()
model = CatBoostClassifier(
    learning_rate=0.01, 
    max_depth=5, 
    n_estimators=500, 
    cat_features=['node-caps', 'breast', 'breast-quad', 'irradiat'], 
    verbose=0,
    random_state=42
)
model.fit(X_train,y_train)
print(f"training time duration: {time.time() - start_time:.2f}")

training time duration: 0.30


In [28]:
y_pred = model.predict(X_test).tolist()
print(f'accuracy score: {accuracy_score(y_test,y_pred):.2f}')
print(f'precision score: {precision_score(y_test,y_pred):.2f}')
print(f'recall score: {recall_score(y_test,y_pred):.2f}')
print(f'f1 score: {f1_score(y_test,y_pred):.2f}')

accuracy score: 0.64
precision score: 0.50
recall score: 0.52
f1 score: 0.51


## LightGBM

In [29]:
# One-hot encode (OHE) the categorical features, then redo the train-test split
Xohe = pd.get_dummies(X,columns=['node-caps', 'breast', 'breast-quad', 'irradiat']).astype(int)
X_train, X_test, y_train, y_test = balanced_train_test_split(Xohe, y, test_size=0.2)
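
To illustrate what the one-hot encoding step produces, here is a small sketch on a made-up two-column frame. Each encoded column is replaced by one 0/1 indicator column per category, named `<column>_<category>`, while the other columns pass through unchanged.

```python
import pandas as pd

# Toy frame with one numeric and one nominal column.
toy = pd.DataFrame({"deg-malig": [3, 2, 2], "breast": ["left", "right", "left"]})

# Same call pattern as above: encode the listed columns, then cast to int.
encoded = pd.get_dummies(toy, columns=["breast"]).astype(int)

print(encoded.columns.tolist())  # ['deg-malig', 'breast_left', 'breast_right']
print(encoded.values.tolist())   # [[3, 1, 0], [2, 0, 1], [2, 1, 0]]
```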

In [30]:
start_time = time.time()
model = LGBMClassifier(learning_rate=0.01, max_depth=5, n_estimators=500, verbose=-1, random_state=42)
model.fit(X_train,y_train.astype(int))
print(f"training time duration: {time.time() - start_time:.2f}")

training time duration: 0.28


In [31]:
y_test = y_test.values.tolist()
y_pred = model.predict(X_test).tolist()
print(f'accuracy score: {accuracy_score(y_test,y_pred):.2f}')
print(f'precision score: {precision_score(y_test,y_pred):.2f}')
print(f'recall score: {recall_score(y_test,y_pred):.2f}')
print(f'f1 score: {f1_score(y_test,y_pred):.2f}')

accuracy score: 0.67
precision score: 0.54
recall score: 0.62
f1 score: 0.58


## AdaBoost

This model is trained on the same one-hot encoded split prepared in the LightGBM section.

In [32]:
start_time = time.time()
model = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=5),
    algorithm='SAMME',
    learning_rate=0.01, 
    n_estimators=500, 
    random_state=42
)
model.fit(X_train,y_train.astype(int))
print(f"training time duration: {time.time() - start_time:.2f}")

training time duration: 0.29


In [33]:
y_pred = model.predict(X_test).tolist()
print(f'accuracy score: {accuracy_score(y_test,y_pred):.2f}')
print(f'precision score: {precision_score(y_test,y_pred):.2f}')
print(f'recall score: {recall_score(y_test,y_pred):.2f}')
print(f'f1 score: {f1_score(y_test,y_pred):.2f}')

accuracy score: 0.71
precision score: 0.59
recall score: 0.62
f1 score: 0.60


## GBM

This model is also trained on the one-hot encoded split prepared in the LightGBM section.

In [34]:
start_time = time.time()
model = GradientBoostingClassifier(
    max_depth=5,
    learning_rate=0.01, 
    n_estimators=500, 
    random_state=42
)
model.fit(X_train,y_train.astype(int))
print(f"training time duration: {time.time() - start_time:.2f}")

training time duration: 0.31


In [35]:
y_pred = model.predict(X_test).tolist()
print(f'accuracy score: {accuracy_score(y_test,y_pred):.2f}')
print(f'precision score: {precision_score(y_test,y_pred):.2f}')
print(f'recall score: {recall_score(y_test,y_pred):.2f}')
print(f'f1 score: {f1_score(y_test,y_pred):.2f}')

accuracy score: 0.62
precision score: 0.48
recall score: 0.57
f1 score: 0.52


Model | Training Time (s) | Accuracy | Precision | Recall | F1
--- | --- | --- | --- | --- | ---
xgboost | 0.30 | 0.66 | 0.52 | 0.62 | 0.57
catboost | 0.30 | 0.64 | 0.50 | 0.52 | 0.51
lightgbm | 0.28 | 0.67 | 0.54 | 0.62 | 0.58
adaboost | 0.29 | 0.71 | 0.59 | 0.62 | 0.60
gbm | 0.31 | 0.62 | 0.48 | 0.57 | 0.52
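
As a sketch of how such a summary could be assembled and ranked programmatically, the snippet below loads the test-set scores printed in the cells above into a DataFrame and sorts the models by F1. The `results` and `ranked` names are illustrative.

```python
import pandas as pd

# Test-set scores as printed by the evaluation cells of each section.
results = pd.DataFrame(
    {
        "accuracy": [0.66, 0.64, 0.67, 0.71, 0.62],
        "precision": [0.52, 0.50, 0.54, 0.59, 0.48],
        "recall": [0.62, 0.52, 0.62, 0.62, 0.57],
        "f1": [0.57, 0.51, 0.58, 0.60, 0.52],
    },
    index=["xgboost", "catboost", "lightgbm", "adaboost", "gbm"],
)

# Rank the models from best to worst F1.
ranked = results.sort_values("f1", ascending=False)
print(ranked)
```

On this split AdaBoost ranks first by F1, followed by LightGBM and XGBoost. Since the test set is only 20% of 286 rows, gaps of a few points are probably best read as indicative rather than conclusive.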