# Mastering Gradient Boosting with CatBoost

In this tutorial we use the Amazon Employee Access Challenge dataset from a [Kaggle](https://www.kaggle.com) competition for our experiments. [Here](https://www.kaggle.com/c/amazon-employee-access-challenge/data) is the link to the challenge we will be exploring.

## Libraries installation

In [5]:
#!pip install --user --upgrade catboost
#!pip install --user --upgrade ipywidgets
#!pip install shap
#!pip install scikit-learn
#!jupyter nbextension enable --py widgetsnbextension

In [1]:
import os
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)

import catboost
print(catboost.__version__)

1.2


## Reading the data

In [2]:
from catboost.datasets import amazon

# If you have "URLError: SSL: CERTIFICATE_VERIFY_FAILED" uncomment next two lines:
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context

# If you have any other error:
# Download datasets from http://bit.ly/2ZUXTSv and uncomment next line:
# train_df = pd.read_csv('train.csv', sep=',', header='infer')

(train_df, test_df) = amazon()

In [4]:
train_df

| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 290919 | 117908 |
| 1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 308574 | 118539 |
| 2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 19721 | 117880 |
| 3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 290919 | 118322 |
| 4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 19793 | 119325 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32764 | 1 | 23497 | 16971 | 117961 | 118300 | 119993 | 118321 | 240983 | 290919 | 118322 |
| 32765 | 1 | 25139 | 311198 | 91261 | 118026 | 122392 | 121143 | 173805 | 249618 | 121145 |
| 32766 | 1 | 34924 | 28805 | 117961 | 118327 | 120299 | 124922 | 152038 | 118612 | 124924 |
| 32767 | 1 | 80574 | 55643 | 118256 | 118257 | 117945 | 280788 | 280788 | 292795 | 119082 |


## Exploring the data

Label values extraction

In [9]:
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1)

Categorical features declaration

In [14]:
cat_features = list(range(0, X.shape[1]))
print(cat_features)

[0, 1, 2, 3, 4, 5, 6, 7, 8]


Looking at the label balance in the dataset

In [11]:
print('Labels: {}'.format(set(y)))
print('Zero count = {}, One count = {}'.format(len(y) - sum(y), sum(y)))

Labels: {0, 1}
Zero count = 1897, One count = 30872
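
The counts above show a strong imbalance toward the positive class. As a minimal sketch of what that implies, the snippet below uses only the two printed counts to compute the share of positives and the accuracy of a trivial model that always predicts the majority class. The helper name `majority_baseline` is illustrative and not part of CatBoost.

```python
def majority_baseline(zero_count, one_count):
    """Return (positive share, accuracy of always predicting the majority class)."""
    total = zero_count + one_count
    positive_share = one_count / total
    baseline_accuracy = max(zero_count, one_count) / total
    return positive_share, baseline_accuracy

# Counts taken from the printed output above.
share, accuracy = majority_baseline(1897, 30872)
print('Positive share = {:.4f}, majority-class accuracy = {:.4f}'.format(share, accuracy))
```

Because a constant prediction already scores this high on accuracy, metrics such as Logloss and AUC are likely to be more informative on this dataset.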


# Training the first model

In [15]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=100)
model.fit(X, y, cat_features=cat_features, verbose=10)

Learning rate set to 0.377604
0:	learn: 0.4528598	total: 244ms	remaining: 24.2s
10:	learn: 0.1744186	total: 510ms	remaining: 4.13s
20:	learn: 0.1676119	total: 803ms	remaining: 3.02s
30:	learn: 0.1652446	total: 1.11s	remaining: 2.48s
40:	learn: 0.1633644	total: 1.39s	remaining: 2s
50:	learn: 0.1621892	total: 1.63s	remaining: 1.57s
60:	learn: 0.1609164	total: 1.93s	remaining: 1.23s
70:	learn: 0.1594572	total: 2.25s	remaining: 921ms
80:	learn: 0.1585876	total: 2.59s	remaining: 608ms
90:	learn: 0.1573593	total: 2.86s	remaining: 282ms
99:	learn: 0.1566977	total: 3.09s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x27e5482be10>

In [16]:
model.predict_proba(X)

array([[0.0098, 0.9902],
       [0.0101, 0.9899],
       [0.0579, 0.9421],
       ...,
       [0.0118, 0.9882],
       [0.1891, 0.8109],
       [0.0235, 0.9765]])

# Working with dataset

You can pass a dataset to training either as `X` and `y` (the initial feature matrix and the labels) or as a `Pool` object.
`Pool` is the CatBoost class for storing a dataset. The next few blocks show how to create a `Pool` object.

Use `Pool` when the dataset has more than just `X` and `y` (for example, sample weights or groups) or when the dataset is large and takes a long time to read into Python.

In [17]:
from catboost import Pool
pool = Pool(data=X, label=y, cat_features=cat_features)

## Split your data into train and validation

In [19]:
from sklearn.model_selection import train_test_split

data = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_validation, y_train, y_validation = data

train_pool = Pool(
    data=X_train, 
    label=y_train, 
    cat_features=cat_features
)

validation_pool = Pool(
    data=X_validation, 
    label=y_validation, 
    cat_features=cat_features
)

## Selecting the objective function

Possible options for binary classification:

`Logloss` for binary target.

`CrossEntropy` for probabilities in target.
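
Both objectives rest on the same per-object formula, `-(t*log(p) + (1-t)*log(1-p))`; what differs is whether the target `t` is a hard 0/1 label or a probability. The sketch below is a plain-Python illustration of that formula, ignoring object weights. The function name and the sample numbers are made up for the example, and this is not CatBoost's own implementation.

```python
import math

def binary_cross_entropy(targets, predictions, eps=1e-15):
    """Mean of -(t*log(p) + (1-t)*log(1-p)) over all objects."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(targets)

predicted = [0.9, 0.2, 0.7]

# Hard 0/1 labels: the setting Logloss is meant for.
print(binary_cross_entropy([1, 0, 1], predicted))

# Soft targets in [0, 1]: the setting CrossEntropy is meant for.
print(binary_cross_entropy([0.8, 0.1, 0.6], predicted))
```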

In [20]:
model = CatBoostClassifier(
    iterations=5,
    learning_rate=0.1,
    # loss_function='CrossEntropy'
)
model.fit(train_pool, eval_set=validation_pool, verbose=False)

print('Model is fitted: {}'.format(model.is_fitted()))
print('Model params:\n{}'.format(model.get_params()))

Model is fitted: True
Model params:
{'iterations': 5, 'learning_rate': 0.1}


## Stdout of the training

In [21]:
model = CatBoostClassifier(
    iterations=15,
#     verbose=5,
)
model.fit(train_pool, eval_set=validation_pool);

Learning rate set to 0.441257
0:	learn: 0.4226231	test: 0.4217069	best: 0.4217069 (0)	total: 19.7ms	remaining: 276ms
1:	learn: 0.3157972	test: 0.3136469	best: 0.3136469 (1)	total: 43ms	remaining: 279ms
2:	learn: 0.2631196	test: 0.2603395	best: 0.2603395 (2)	total: 63.9ms	remaining: 255ms
3:	learn: 0.2334650	test: 0.2294580	best: 0.2294580 (3)	total: 83.8ms	remaining: 230ms
4:	learn: 0.2077060	test: 0.2017327	best: 0.2017327 (4)	total: 104ms	remaining: 208ms
5:	learn: 0.1961364	test: 0.1883112	best: 0.1883112 (5)	total: 124ms	remaining: 187ms
6:	learn: 0.1879266	test: 0.1794018	best: 0.1794018 (6)	total: 144ms	remaining: 165ms
7:	learn: 0.1841218	test: 0.1743149	best: 0.1743149 (7)	total: 159ms	remaining: 139ms
8:	learn: 0.1814626	test: 0.1698731	best: 0.1698731 (8)	total: 180ms	remaining: 120ms
9:	learn: 0.1785403	test: 0.1650335	best: 0.1650335 (9)	total: 200ms	remaining: 99.9ms
10:	learn: 0.1771678	test: 0.1634002	best: 0.1634002 (10)	total: 222ms	remaining: 80.6ms
11:	learn: 0.17621

## Metrics calculation and graph plotting

In [22]:
model = CatBoostClassifier(
    iterations=50,
    learning_rate=0.5,
    custom_loss=['AUC', 'Accuracy']
)

model.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

## Model comparison

In [23]:
model1 = CatBoostClassifier(
    learning_rate=0.7,
    iterations=100,
    train_dir='learing_rate_0.7'
)

model2 = CatBoostClassifier(
    learning_rate=0.01,
    iterations=100,
    train_dir='learing_rate_0.01'
)

model1.fit(train_pool, eval_set=validation_pool, verbose=20)
model2.fit(train_pool, eval_set=validation_pool, verbose=20);

0:	learn: 0.3264513	test: 0.3248170	best: 0.3248170 (0)	total: 31.6ms	remaining: 3.13s
20:	learn: 0.1688825	test: 0.1574182	best: 0.1573949 (16)	total: 443ms	remaining: 1.67s
40:	learn: 0.1632884	test: 0.1582531	best: 0.1571533 (23)	total: 936ms	remaining: 1.35s
60:	learn: 0.1584388	test: 0.1573279	best: 0.1569712 (52)	total: 2.18s	remaining: 1.39s
80:	learn: 0.1544282	test: 0.1583794	best: 0.1569712 (52)	total: 3.18s	remaining: 745ms
99:	learn: 0.1510415	test: 0.1583995	best: 0.1569712 (52)	total: 3.71s	remaining: 0us

bestTest = 0.1569712214
bestIteration = 52

Shrink model to first 53 iterations.
0:	learn: 0.6853769	test: 0.6853610	best: 0.6853610 (0)	total: 22.2ms	remaining: 2.2s
20:	learn: 0.5575578	test: 0.5568257	best: 0.5568257 (20)	total: 357ms	remaining: 1.34s
40:	learn: 0.4678112	test: 0.4663769	best: 0.4663769 (40)	total: 760ms	remaining: 1.09s
60:	learn: 0.4029225	test: 0.4011544	best: 0.4011544 (60)	total: 1.08s	remaining: 693ms
80:	learn: 0.3551621	test: 0.3530433	best: 

In [26]:
from catboost import MetricVisualizer
MetricVisualizer(['learing_rate_0.7', 'learing_rate_0.01']).start()

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

## Best iteration

In [27]:
model = CatBoostClassifier(
    iterations=100,
#     use_best_model=False
)
model.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [28]:
print('Tree count: ' + str(model.tree_count_))

Tree count: 82
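
The tree count is lower than the requested 100 iterations because, when a validation set is given, the model is shrunk to the iteration with the best validation score (the training log earlier reports `bestIteration = 52` followed by "Shrink model to first 53 iterations"). A minimal sketch of that selection rule, using hypothetical loss values and an illustrative helper name:

```python
def best_iteration(validation_losses):
    """Return (index of lowest validation loss, number of trees kept)."""
    best_index = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return best_index, best_index + 1  # iterations are zero-based

# Hypothetical validation Logloss per iteration, for illustration only.
losses = [0.42, 0.31, 0.26, 0.23, 0.24, 0.25]
index, trees_kept = best_iteration(losses)
print('best iteration = {}, trees kept = {}'.format(index, trees_kept))
```

Uncommenting `use_best_model=False` in the cell above should keep all 100 trees instead.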


## Cross-validation

In [29]:
from catboost import cv

params = {
    'loss_function': 'Logloss',
    'iterations': 80,
    'custom_loss': 'AUC',
    'learning_rate': 0.5,
}

cv_data = cv(
    params = params,
    pool = train_pool,
    fold_count=5,
    shuffle=True,
    partition_random_seed=0,
    plot=True,
    verbose=False
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/5]

bestTest = 0.1628504126
bestIteration = 68
Training on fold [1/5]

bestTest = 0.1608272543
bestIteration = 77

Training on fold [2/5]

bestTest = 0.1694535356
bestIteration = 12

Training on fold [3/5]

bestTest = 0.1569498918
bestIteration = 29

Training on fold [4/5]

bestTest = 0.1644437541
bestIteration = 30


In [30]:
cv_data.head(10)

| | iterations | test-Logloss-mean | test-Logloss-std | train-Logloss-mean | train-Logloss-std | test-AUC-mean | test-AUC-std |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.305831 | 2e-05 | 0.30581 | 1.4e-05 | 0.496175 | 0.014042 |
| 1 | 1 | 0.234755 | 0.00102 | 0.235427 | 0.000274 | 0.569789 | 0.030202 |
| 2 | 2 | 0.196036 | 0.002694 | 0.202469 | 0.003171 | 0.761543 | 0.025878 |
| 3 | 3 | 0.182656 | 0.001401 | 0.1914 | 0.001531 | 0.801902 | 0.008434 |
| 4 | 4 | 0.175531 | 0.001515 | 0.185466 | 0.002237 | 0.817244 | 0.008916 |
| 5 | 5 | 0.172046 | 0.00215 | 0.182478 | 0.002121 | 0.823333 | 0.006377 |
| 6 | 6 | 0.170193 | 0.001849 | 0.18045 | 0.00246 | 0.828977 | 0.007246 |
| 7 | 7 | 0.168321 | 0.002187 | 0.17814 | 0.002177 | 0.834106 | 0.010527 |
| 8 | 8 | 0.167416 | 0.002498 | 0.176438 | 0.002126 | 0.835628 | 0.011008 |
| 9 | 9 | 0.166474 | 0.00301 | 0.175041 | 0.002259 | 0.837496 | 0.011698 |


In [32]:
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])

# `stratified` is not passed above; for Logloss, cv() stratifies by default.
print('Best validation Logloss score, stratified (default): {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter],
    best_iter)
)

Best validation Logloss score, stratified (default): 0.1636±0.0046 on step 35


In [33]:
from catboost import cv

params = {
    'loss_function': 'Logloss',
    'iterations': 80,
    'custom_loss': 'AUC',
    'learning_rate': 0.5,
}

cv_data = cv(
    params = params,
    pool = train_pool,
    fold_count=5,
    shuffle=True,
    partition_random_seed=0,
    plot=True,
    stratified=False,
    verbose=False
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/5]

bestTest = 0.1524159074
bestIteration = 40

Training on fold [1/5]

bestTest = 0.1684459607
bestIteration = 44

Training on fold [2/5]

bestTest = 0.1668276983
bestIteration = 34

Training on fold [3/5]

bestTest = 0.1508994738
bestIteration = 79

Training on fold [4/5]

bestTest = 0.1693370995
bestIteration = 41


In [34]:
best_value = cv_data['test-Logloss-mean'].min()
best_iter = cv_data['test-Logloss-mean'].values.argmin()

# This run was made with stratified=False, so label it accordingly.
print('Best validation Logloss score, not stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter],
    best_iter)
)

Best validation Logloss score, not stratified: 0.1620±0.0091 on step 40
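
Stratified sampling, which the `stratified` argument of `cv` controls, keeps the class proportions roughly equal across folds. With a rare negative class this matters, because a plain random split can leave some folds with very few zeros. The sketch below shows one simple way to build such folds in plain Python. It illustrates the idea only and is not CatBoost's partitioning code, and the function name and toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, fold_count, seed=0):
    """Assign each object index to a fold so that every class is spread evenly."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(fold_count)]
    for label in sorted(indices_by_class):
        indices = indices_by_class[label]
        rng.shuffle(indices)
        # Deal the objects of this class to the folds one by one.
        for position, index in enumerate(indices):
            folds[position % fold_count].append(index)
    return folds

# Toy labels with a similar kind of imbalance: 10 zeros, 90 ones.
labels = [0] * 10 + [1] * 90
for fold in stratified_folds(labels, fold_count=5):
    zeros = sum(1 for i in fold if labels[i] == 0)
    print('fold size = {}, zeros = {}'.format(len(fold), zeros))
```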


## Sklearn Grid Search

In [35]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.001, 0.01, 0.5],
}

clf = CatBoostClassifier(
    iterations=20, 
    cat_features=cat_features, 
    verbose=20
)
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=3)
results = grid_search.fit(X_train, y_train)
results.best_estimator_.get_params()

0:	learn: 0.6923673	total: 13.6ms	remaining: 258ms
19:	learn: 0.6778431	total: 274ms	remaining: 0us
0:	learn: 0.6923682	total: 12.4ms	remaining: 236ms
19:	learn: 0.6778558	total: 239ms	remaining: 0us
0:	learn: 0.6923682	total: 25.6ms	remaining: 487ms
19:	learn: 0.6778568	total: 305ms	remaining: 0us
0:	learn: 0.6853838	total: 14.4ms	remaining: 274ms
19:	learn: 0.5629769	total: 227ms	remaining: 0us
0:	learn: 0.6853928	total: 11.2ms	remaining: 212ms
19:	learn: 0.5630657	total: 210ms	remaining: 0us
0:	learn: 0.6853925	total: 11.1ms	remaining: 212ms
19:	learn: 0.5630556	total: 249ms	remaining: 0us
0:	learn: 0.3972934	total: 11.7ms	remaining: 221ms
19:	learn: 0.1766459	total: 322ms	remaining: 0us
0:	learn: 0.3977266	total: 13.2ms	remaining: 251ms
19:	learn: 0.1774892	total: 358ms	remaining: 0us
0:	learn: 0.3977128	total: 14.1ms	remaining: 267ms
19:	learn: 0.1733642	total: 421ms	remaining: 0us
0:	learn: 0.3971379	total: 16.2ms	remaining: 308ms
19:	learn: 0.1717590	total: 448ms	remaining: 0us


{'iterations': 20,
 'learning_rate': 0.5,
 'verbose': 20,
 'cat_features': [0, 1, 2, 3, 4, 5, 6, 7, 8]}

## Overfitting Detector

In [None]:
model_with_early_stop = CatBoostClassifier(
    iterations=200,
    learning_rate=0.5,
    early_stopping_rounds=20
)

model_with_early_stop.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
);

In [None]:
print(model_with_early_stop.tree_count_)
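
With `early_stopping_rounds=20`, training stops once the validation metric has failed to improve for 20 consecutive iterations, so the printed tree count can be far below the requested 200. The following sketch replays that rule on a list of hypothetical validation losses. The function name and the numbers are illustrative, and CatBoost's overfitting detector may differ in its details.

```python
def simulate_early_stopping(validation_losses, patience):
    """Return (stop iteration, best iteration) for a sequence of validation losses."""
    best_loss = float('inf')
    best_iteration = -1
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` iterations in a row: stop.
            return iteration, best_iteration
    return len(validation_losses) - 1, best_iteration

# Hypothetical losses: improvement stops after iteration 2.
losses = [0.40, 0.30, 0.25, 0.26, 0.27, 0.26, 0.28, 0.29]
stopped_at, best = simulate_early_stopping(losses, patience=3)
print('stopped at iteration {}, best iteration {}'.format(stopped_at, best))
```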

### Overfitting Detector with eval metric

In [None]:
model_with_early_stop = CatBoostClassifier(
    eval_metric='AUC',
    iterations=200,
    learning_rate=0.5,
    early_stopping_rounds=20
)
model_with_early_stop.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
);

In [None]:
print(model_with_early_stop.tree_count_)

## Model predictions

In [None]:
model = CatBoostClassifier(iterations=200, learning_rate=0.03)
model.fit(train_pool, verbose=50);

In [None]:
print(model.predict(X_validation))

In [None]:
print(model.predict_proba(X_validation))

In [None]:
raw_pred = model.predict(
    X_validation,
    prediction_type='RawFormulaVal'
)

print(raw_pred)

In [None]:
from numpy import exp

sigmoid = lambda x: 1 / (1 + exp(-x))

probabilities = sigmoid(raw_pred)

print(probabilities)

## Select decision boundary

![](https://habrastorage.org/webt/y4/1q/yq/y41qyqfm9mcerp2ziys48phpjia.png)

In [None]:
import matplotlib.pyplot as plt
from catboost.utils import get_roc_curve
from catboost.utils import get_fpr_curve
from catboost.utils import get_fnr_curve

curve = get_roc_curve(model, validation_pool)
(fpr, tpr, thresholds) = curve

(thresholds, fpr) = get_fpr_curve(curve=curve)
(thresholds, fnr) = get_fnr_curve(curve=curve)

In [None]:
plt.figure(figsize=(16, 8))
style = {'alpha':0.5, 'lw':2}

plt.plot(thresholds, fpr, color='blue', label='FPR', **style)
plt.plot(thresholds, fnr, color='green', label='FNR', **style)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.grid(True)
plt.xlabel('Threshold', fontsize=16)
plt.ylabel('Error Rate', fontsize=16)
plt.title('FPR-FNR curves', fontsize=20)
plt.legend(loc="lower left", fontsize=16);

In [None]:
from catboost.utils import select_threshold

print(select_threshold(model, validation_pool, FNR=0.01))
print(select_threshold(model, validation_pool, FPR=0.01))
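
To make the trade-off concrete, the sketch below computes the false positive rate and false negative rate at a given threshold directly from labels and predicted probabilities. It is a plain-Python illustration with hypothetical data. The helper name and the choice to predict class 1 when the probability is greater than or equal to the threshold are assumptions, not CatBoost's implementation.

```python
def error_rates(labels, probabilities, threshold):
    """Return (FPR, FNR) when predicting class 1 for probability >= threshold."""
    false_positives = false_negatives = positives = negatives = 0
    for label, probability in zip(labels, probabilities):
        predicted = 1 if probability >= threshold else 0
        if label == 1:
            positives += 1
            false_negatives += predicted == 0
        else:
            negatives += 1
            false_positives += predicted == 1
    return false_positives / negatives, false_negatives / positives

# Hypothetical labels and predicted probabilities, for illustration only.
labels = [0, 0, 1, 1, 1, 1]
probabilities = [0.2, 0.6, 0.4, 0.7, 0.8, 0.9]
for threshold in (0.3, 0.5, 0.75):
    fpr, fnr = error_rates(labels, probabilities, threshold)
    print('threshold = {:.2f}: FPR = {:.2f}, FNR = {:.2f}'.format(threshold, fpr, fnr))
```

Raising the threshold lowers FPR and raises FNR, which is the crossing pattern the FPR-FNR plot shows.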

## Metric evaluation on a new dataset

In [None]:
metrics = model.eval_metrics(
    data=validation_pool,
    metrics=['Logloss','AUC'],
    ntree_start=0,
    ntree_end=0,
    eval_period=1,
    plot=True
)

In [None]:
print('AUC values:\n{}'.format(np.array(metrics['AUC'])))

## Feature importances

### Prediction values change

The default feature importance for binary classification is `PredictionValueChange`: how much, on average, the prediction changes when the feature value changes.
These feature importances are non-negative.
They are normalized to sum to 100, so you can read each value as a percentage of importance.

In [None]:
np.array(model.get_feature_importance(prettified=True))

### Loss function change

The non-default `LossFunctionChange` importance approximates how much the optimized loss function will change if the value of the feature changes.
These importances can be negative if the feature has a bad influence on the loss function.
The importances are not normalized; the absolute value of an importance is on the same scale as the optimized loss value.
To calculate this importance you need to pass a dataset (here `train_pool`) as an argument.

In [None]:
np.array(model.get_feature_importance(
    train_pool, 
    'LossFunctionChange', 
    prettified=True
))

### Shap values

In [None]:
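# Use validation rows so the predictions match the SHAP values computed below.
print(model.predict_proba([X_validation.iloc[1,:]]))
print(model.predict_proba([X_validation.iloc[91,:]]))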

In [None]:
shap_values = model.get_feature_importance(
    validation_pool, 
    'ShapValues'
)
expected_value = shap_values[0,-1]
shap_values = shap_values[:,:-1]
print(shap_values.shape)
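
The last column returned for `ShapValues` holds the expected value, and the remaining columns hold one contribution per feature. For a given object, the expected value plus the sum of its contributions should reproduce the raw score (`RawFormulaVal`), which the sigmoid then maps to a probability. A minimal sketch of that relationship, with hypothetical numbers and an illustrative helper name:

```python
import math

def probability_from_shap(expected_value, object_shap_values):
    """Rebuild the raw score from SHAP contributions and map it to a probability."""
    raw_score = expected_value + sum(object_shap_values)
    return raw_score, 1 / (1 + math.exp(-raw_score))

# Hypothetical numbers for one object with three features.
raw, proba = probability_from_shap(2.0, [0.5, -0.3, 0.8])
print('raw score = {:.4f}, probability = {:.4f}'.format(raw, proba))
```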

In [None]:
# Row 1 of the validation set, matching shap_values[1,:] in the plot below.
proba = model.predict_proba([X_validation.iloc[1,:]])[0]
raw = model.predict([X_validation.iloc[1,:]], prediction_type='RawFormulaVal')[0]
print('Probabilities', proba)
print('Raw formula value %.4f' % raw)
print('Probability from raw value %.4f' % sigmoid(raw))

In [None]:
import shap

shap.initjs()
shap.force_plot(expected_value, shap_values[1,:], X_validation.iloc[1,:])

In [None]:
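# Row 91 of the validation set, matching shap_values[91,:] in the plot below.
proba = model.predict_proba([X_validation.iloc[91,:]])[0]
raw = model.predict([X_validation.iloc[91,:]], prediction_type='RawFormulaVal')[0]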
print('Probabilities', proba)
print('Raw formula value %.4f' % raw)
print('Probability from raw value %.4f' % sigmoid(raw))

In [None]:
import shap
shap.initjs()
shap.force_plot(expected_value, shap_values[91,:], X_validation.iloc[91,:])

In [None]:
shap.summary_plot(shap_values, X_validation)

## Snapshotting

In [None]:
#!rm 'catboost_info/snapshot.bkp'

model = CatBoostClassifier(
    iterations=100,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    snapshot_interval=1
)

model.fit(train_pool, eval_set=validation_pool, verbose=10);

## Saving the model

In [None]:
model = CatBoostClassifier(iterations=10)
model.fit(train_pool, eval_set=validation_pool, verbose=False)
model.save_model('catboost_model.bin')
model.save_model('catboost_model.json', format='json')

In [None]:
model.load_model('catboost_model.bin')
print(model.get_params())
print(model.learning_rate_)

## Hyperparameter tuning

In [None]:
tuned_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=3,
    random_strength=1,
    bagging_temperature=1
)

tuned_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    verbose=False,
    eval_set=(X_validation, y_validation),
    plot=True
);

# Speeding up the training

In [None]:
fast_model = CatBoostClassifier(
    boosting_type='Plain',
    rsm=0.5,
    one_hot_max_size=50,
    leaf_estimation_iterations=1,
    max_ctr_complexity=1,
    iterations=100,
    learning_rate=0.3,
    bootstrap_type='Bernoulli',
    subsample=0.5
)
fast_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    verbose=False,
    eval_set=(X_validation, y_validation),
    plot=True
);

# Reducing model size

In [None]:
small_model = CatBoostClassifier(
    learning_rate=0.03,
    iterations=500,
    model_size_reg=50,
    max_ctr_complexity=1,
    ctr_leaf_count_limit=100
)
small_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    verbose=False,
    eval_set=(X_validation, y_validation),
    plot=True
);