# Python CatBoost Tutorial - Binary Classification

Adapted from the CatBoost <a href="https://github.com/catboost" target="_blank" >repository</a>.

### CatBoost installation
If you have not already installed CatBoost: <br>
pip install --upgrade catboost


### Data Loading

In [63]:
from catboost import CatBoostClassifier, Pool, cv
from catboost.eval.catboost_evaluation import *

import numpy as np
import pandas as pd
from collections import Counter
from itertools import product

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix, classification_report

from imblearn.over_sampling import SMOTE, SMOTENC

In [3]:
#Import Data
df = pd.read_csv("titanic.csv")

#See the imported dataset
print("DF shape", df.shape)
df.head()


DF shape (891, 9)


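| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |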


### Feature Preparation
First, let's check how many missing values we have:

In [4]:
df.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

As we can see, **`Age`**, **`Cabin`** and **`Embarked`** do have missing values. Let's fill them with a number far outside their distributions, so that the model can easily tell the filled entries apart from real values and take that into account:

In [6]:
df.fillna(-999, inplace=True)
df.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Cabin       0
Embarked    0
dtype: int64
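A possible variation, shown here only as a sketch on a hypothetical toy frame (the `toy` data and the `"missing"` label are assumptions, not part of the original notebook): filling numeric columns with the numeric sentinel and string columns with a string label keeps each column to a single type.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same kind of gaps as the Titanic frame
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

# Split the columns by kind: numeric vs. everything else
num_cols = toy.select_dtypes(include="number").columns
other_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(-999)           # numeric sentinel
filled[other_cols] = filled[other_cols].fillna("missing")  # string label

print(filled)
print(filled.isnull().sum())
```

The notebook itself uses a single `-999` for every column, which is simpler and is what the rest of the tutorial relies on.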

Now let's separate the features from the label variable:

In [7]:
X = df.drop('Survived', axis=1)
y = df.Survived

Note that our features are of different types: some are numeric, some are categorical, and some are plain strings, which normally have to be handled in a specific way (for example, encoded with a bag-of-words representation).

In [8]:
print(X.dtypes)

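# Treat every non-float column as categorical
# (np.float was removed in NumPy 1.24; the builtin float matches float64 columns)
categorical_features_indices = np.where(X.dtypes != float)[0]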
categorical_features_indices

Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Cabin        object
Embarked     object
dtype: object


array([0, 1, 3, 4, 6, 7], dtype=int64)
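As a small illustration of the rule "every non-float column is categorical", the sketch below rebuilds the same selection on a hypothetical two-row frame (`X_toy` is an assumption made for the example) and also collects the column names. CatBoost accepts `cat_features` as either indices or column names when the input is a DataFrame.

```python
import pandas as pd

# Hypothetical frame with the same columns and dtypes as X
X_toy = pd.DataFrame({
    "Pclass": [3, 1], "Sex": ["male", "female"], "Age": [22.0, 38.0],
    "SibSp": [1, 1], "Parch": [0, 0], "Fare": [7.25, 71.2833],
    "Cabin": ["C123", "C85"], "Embarked": ["S", "C"],
})

# Names of all columns that are not floating point
cat_names = [c for c in X_toy.columns
             if not pd.api.types.is_float_dtype(X_toy[c])]

# Positional indices of those columns
cat_idx = [X_toy.columns.get_loc(c) for c in cat_names]

print(cat_names)
print(cat_idx)
```

Names tend to be more robust than positions if columns are later reordered or dropped.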

#### Encode Strings
Not strictly necessary in CatBoost, but useful, for example, for SMOTE.

In [32]:
for var in ['Sex', 'Cabin', 'Embarked']:
    X[var] = X[var].astype('category').cat.codes
X.head()

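| | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 | 0 | 3 |
| 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 82 | 1 |
| 2 | 3 | 0 | 26.0 | 0 | 0 | 7.925 | 0 | 3 |
| 3 | 1 | 0 | 35.0 | 1 | 0 | 53.1 | 56 | 3 |
| 4 | 3 | 1 | 35.0 | 0 | 0 | 8.05 | 0 | 3 |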
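To make the integer codes easier to read back, here is a minimal sketch on a hypothetical `sex` series (an assumption for illustration) showing how `cat.codes` assigns numbers and how to recover the code-to-label mapping.

```python
import pandas as pd

# Hypothetical string column
sex = pd.Series(["male", "female", "female", "male"])

as_cat = sex.astype("category")
codes = as_cat.cat.codes                           # integer code per row
mapping = dict(enumerate(as_cat.cat.categories))   # code -> original label

print(codes.tolist())
print(mapping)
```

For string categories the codes follow the sorted order of the labels, which is why `female` maps to 0 and `male` to 1 in the table above.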


### Data Splitting
Let's split the data into training and test sets.

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=14)
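A quick sanity check of the resulting set sizes, assuming the training share is rounded down and the remainder goes to the test set (the variable names are only illustrative):

```python
import math

n_samples = 891      # rows in the Titanic frame loaded above
train_size = 0.75

n_train = math.floor(train_size * n_samples)  # training rows
n_test = n_samples - n_train                  # everything else is test

print(n_train, n_test)
```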

### SMOTE
Check whether the target variable has balanced classes and apply SMOTE if needed (**only on the training set**).

In [35]:
Counter(np.array(y_train).ravel())

Counter({0: 410, 1: 258})

In [39]:
# Apply SMOTENC, since some of the features are categorical
# (n_jobs is omitted: it is deprecated in recent imbalanced-learn releases)
sm = SMOTENC(categorical_features=categorical_features_indices, random_state=14)

#Save column names
xcol = list(X_train.columns)
ycol = y_train.name

#Apply SMOTE and convert back to Pandas
X_train, y_train = sm.fit_resample(X_train, np.array(y_train).ravel())
X_train = pd.DataFrame(X_train, columns= xcol)
y_train = pd.DataFrame(y_train, columns= [ycol])

#Check new class balance
Counter(np.array(y_train).ravel())

Counter({0: 410, 1: 410})
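The core idea behind SMOTE for numeric features is to create a synthetic minority sample somewhere on the line segment between a real minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step; it is a simplification, not the imbalanced-learn implementation, and the function name and sample values are hypothetical.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a point on the segment between x and one of its neighbours."""
    lam = rng.random()  # interpolation weight, uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(14)
x = [22.0, 7.25]           # hypothetical (Age, Fare) of a minority sample
neighbor = [26.0, 7.925]   # hypothetical nearby minority sample

synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Categorical features cannot be interpolated this way, which is why the notebook uses the SMOTENC variant for them.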

### Parameters Tuning

In [43]:
#Define a grid of parameters to test
grid = {'learning_rate': [0.01, 0.03, 0.1, 0.2],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9],
        }

#Count all possible combinations
print("# Combinations:", len([dict(zip(grid.keys(),v)) for v in product(*grid.values())]))

# Combinations: 60
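The count is simply the product of the number of candidate values per parameter (4 x 3 x 5). A compact way to check it, and to look at one concrete combination, might be:

```python
from itertools import product
from math import prod

grid = {'learning_rate': [0.01, 0.03, 0.1, 0.2],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]}

# Number of combinations = product of the list lengths
n_combinations = prod(len(values) for values in grid.values())

# First combination in the order itertools.product generates them
first_combination = dict(zip(grid.keys(), next(product(*grid.values()))))

print(n_combinations)
print(first_combination)
```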


In [46]:
#Define Model (could also use custom loss here)
#custom_loss = ["Accuracy"]
model = CatBoostClassifier()

#Grid Search
#Default cross-validation is 3-fold
grid_search_result = model.grid_search(grid, X=X_train, y=y_train, cv=3)
bestparam = grid_search_result["params"]
bestparam

0:	loss: 0.3304858	best: 0.3304858 (0)	total: 1.02s	remaining: 1m
1:	loss: 0.3226414	best: 0.3226414 (1)	total: 1.87s	remaining: 54.2s
2:	loss: 0.3296104	best: 0.3226414 (1)	total: 2.66s	remaining: 50.6s
3:	loss: 0.3331585	best: 0.3226414 (1)	total: 3.48s	remaining: 48.7s
4:	loss: 0.3356785	best: 0.3226414 (1)	total: 4.24s	remaining: 46.6s
5:	loss: 0.3302601	best: 0.3226414 (1)	total: 5.03s	remaining: 45.2s
6:	loss: 0.3311892	best: 0.3226414 (1)	total: 5.81s	remaining: 44s
7:	loss: 0.3325525	best: 0.3226414 (1)	total: 6.58s	remaining: 42.8s
8:	loss: 0.3403906	best: 0.3226414 (1)	total: 7.35s	remaining: 41.6s
9:	loss: 0.3278574	best: 0.3226414 (1)	total: 8.14s	remaining: 40.7s
10:	loss: 0.3280204	best: 0.3226414 (1)	total: 8.9s	remaining: 39.6s
11:	loss: 0.3287230	best: 0.3226414 (1)	total: 9.69s	remaining: 38.7s
12:	loss: 0.3419757	best: 0.3226414 (1)	total: 10.5s	remaining: 37.8s
13:	loss: 0.3321732	best: 0.3226414 (1)	total: 11.3s	remaining: 37.2s
14:	loss: 0.3231395	best: 0.3226414 

{'depth': 10, 'l2_leaf_reg': 1, 'learning_rate': 0.1}

In [77]:
#Set best params
model = CatBoostClassifier()

#Can define Custom Loss
bestparam["custom_loss"] = "Kappa"

#Depending on your objective you can also customize the evaluation metric
bestparam["eval_metric"] = "Kappa"

model.set_params(**bestparam)
print(model.get_params())

{'depth': 10, 'l2_leaf_reg': 1, 'learning_rate': 0.1, 'eval_metric': 'Kappa', 'custom_loss': 'Kappa'}


### Model Training
We train with early stopping and retain the best model, to avoid overfitting.
**In real cases, we need an external test set that is not used for training or validation (early stopping). That dataset is the one to use to evaluate the final model.**

In [78]:
# Further split the train set into final_train and validation sets
X_train_final, X_validation, y_train_final, y_validation = train_test_split(
    X_train, y_train, train_size=0.75, random_state=14)

print(X_train.shape, X_train_final.shape, X_validation.shape)

(820, 8) (615, 8) (205, 8)


Use early stopping rounds and a validation set to stop training after K iterations with no improvement in the evaluation metric.

In [83]:
model.fit(X_train_final, y_train_final, cat_features=categorical_features_indices,
          eval_set=(X_validation, y_validation),
          early_stopping_rounds=80, use_best_model=True, logging_level="Verbose")


0:	learn: 0.4967700	test: 0.5181766	best: 0.5181766 (0)	total: 16.3ms	remaining: 16.3s
1:	learn: 0.4967700	test: 0.5181766	best: 0.5181766 (0)	total: 17.9ms	remaining: 8.93s
2:	learn: 0.5190213	test: 0.5181766	best: 0.5181766 (0)	total: 34.2ms	remaining: 11.4s
3:	learn: 0.5190213	test: 0.5181766	best: 0.5181766 (0)	total: 36ms	remaining: 8.96s
4:	learn: 0.5703099	test: 0.6179696	best: 0.6179696 (4)	total: 64.5ms	remaining: 12.8s
5:	learn: 0.5691612	test: 0.6179696	best: 0.6179696 (4)	total: 83.2ms	remaining: 13.8s
6:	learn: 0.5703099	test: 0.6203058	best: 0.6203058 (6)	total: 84.1ms	remaining: 11.9s
7:	learn: 0.5972594	test: 0.6203058	best: 0.6203058 (6)	total: 106ms	remaining: 13.1s
8:	learn: 0.6128868	test: 0.5718635	best: 0.6203058 (6)	total: 131ms	remaining: 14.4s
9:	learn: 0.6243533	test: 0.5718635	best: 0.6203058 (6)	total: 138ms	remaining: 13.6s
10:	learn: 0.5960022	test: 0.5393477	best: 0.6203058 (6)	total: 147ms	remaining: 13.2s
11:	learn: 0.6002128	test: 0.5122505	best: 0.620

100:	learn: 0.9693109	test: 0.6248935	best: 0.6785742 (44)	total: 2.27s	remaining: 20.2s
101:	learn: 0.9693109	test: 0.6248935	best: 0.6785742 (44)	total: 2.31s	remaining: 20.3s
102:	learn: 0.9693109	test: 0.6248935	best: 0.6785742 (44)	total: 2.33s	remaining: 20.3s
103:	learn: 0.9693109	test: 0.6248935	best: 0.6785742 (44)	total: 2.37s	remaining: 20.4s
104:	learn: 0.9693109	test: 0.6248935	best: 0.6785742 (44)	total: 2.4s	remaining: 20.5s
105:	learn: 0.9693109	test: 0.6248935	best: 0.6785742 (44)	total: 2.42s	remaining: 20.4s
106:	learn: 0.9737279	test: 0.6248935	best: 0.6785742 (44)	total: 2.46s	remaining: 20.6s
107:	learn: 0.9737279	test: 0.6248935	best: 0.6785742 (44)	total: 2.5s	remaining: 20.6s
108:	learn: 0.9737279	test: 0.6248935	best: 0.6785742 (44)	total: 2.53s	remaining: 20.7s
109:	learn: 0.9737279	test: 0.6248935	best: 0.6785742 (44)	total: 2.56s	remaining: 20.7s
110:	learn: 0.9737279	test: 0.6248935	best: 0.6785742 (44)	total: 2.59s	remaining: 20.7s
111:	learn: 0.9737279	t

<catboost.core.CatBoostClassifier object at 0x0000015AADDFD4C8>

From the log we can see that the best **Kappa** value of **0.6786** (on the validation set) was achieved at step **44**, with no further improvement over the following **80** iterations (so training stopped). We now retain this model as the **best model**.
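The stopping rule can be illustrated with a few lines of plain Python. This is a simplified sketch of patience-based early stopping on a made-up score sequence, not CatBoost's internal overfitting detector; the function name and the scores are hypothetical.

```python
def early_stop(scores, patience):
    """Return (best_iteration, best_score, stop_iteration) for a metric to maximize."""
    best_score = float("-inf")
    best_iter = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_iter = score, i      # new best: reset the wait
        elif i - best_iter >= patience:
            return best_iter, best_score, i       # waited long enough: stop
    return best_iter, best_score, len(scores) - 1

# Made-up validation scores: a peak at iteration 44, then no improvement
scores = [0.50] * 44 + [0.68] + [0.62] * 200

result = early_stop(scores, patience=80)
print(result)
```

In this toy sequence the best iteration is 44 and the loop stops 80 iterations later. Keeping only the trees up to the best iteration is what `use_best_model=True` asks for.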

### Model Predictions and Fit

In [84]:
#Predict on the original Test Set
predictions = model.predict(X_test)
truevalues = np.array(y_test)

#Confusion Matrix
print(confusion_matrix(truevalues,predictions))

# Classification Report
print(classification_report(truevalues,predictions))

# Accuracy and Cohen's Kappa
print("ACCURACY:", '%.4f' % accuracy_score(truevalues, predictions))
print("COHEN'S KAPPA:", '%.4f' % cohen_kappa_score(truevalues, predictions))

[[109  15]
 [ 29  70]]
              precision    recall  f1-score   support

           0       0.79      0.88      0.83       124
           1       0.82      0.71      0.76        99

    accuracy                           0.80       223
   macro avg       0.81      0.79      0.80       223
weighted avg       0.80      0.80      0.80       223

ACCURACY: 0.8027
COHEN'S KAPPA: 0.5946
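Both summary numbers can be recomputed by hand from the confusion matrix. Cohen's Kappa compares the observed agreement (the accuracy) with the agreement expected by chance from the row and column totals: kappa = (p_observed - p_expected) / (1 - p_expected). A minimal sketch using the matrix printed above:

```python
# Confusion matrix from the output above (rows: true class, columns: predicted)
cm = [[109, 15],
      [29, 70]]

n = sum(sum(row) for row in cm)

# Observed agreement = accuracy
p_observed = (cm[0][0] + cm[1][1]) / n

# Chance agreement from the marginal totals
row_totals = [sum(row) for row in cm]
col_totals = [sum(col) for col in zip(*cm)]
p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"accuracy={p_observed:.4f} kappa={kappa:.4f}")
```

Kappa is lower than accuracy here because a classifier guessing from the class frequencies alone would already agree with the labels about half of the time.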


### Monte Carlo Cross-Validation
Now repeat the process many times (for example 1,000) with different random splits and report the average fit statistics with their standard deviations.

In [100]:
#Save accuracy and kappa scores in a list
a,k = [], []

# For demonstration purposes we repeat it only 10 times
for i in range(0,10):
    # Split without a random seed into train, validation and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)
    X_train_final, X_validation, y_train_final, y_validation = train_test_split(X_train, y_train, train_size=0.75)
    
    model.fit(X_train_final, y_train_final, cat_features=categorical_features_indices,
              eval_set=(X_validation, y_validation), early_stopping_rounds=80, use_best_model=True,
              logging_level="Silent")
    
    predictions = model.predict(X_test)
    truevalues = np.array(y_test)
    
    a.append(accuracy_score(truevalues, predictions))
    k.append(cohen_kappa_score(truevalues, predictions))

In [104]:
print("Accuracy at each cross-validation step\n", a, "\n")
print("Kappa at each cross-validation step\n", k, "\n")
print("Accuracy M", '%.4f' % np.mean(a), "SD", '%.4f' % np.std(a), "\n")
print("Kappa M", '%.4f' % np.mean(k), "SD", '%.4f' % np.std(k))

Accuracy at each cross-validation step
 [0.8295964125560538, 0.7668161434977578, 0.8385650224215246, 0.820627802690583, 0.7982062780269058, 0.8026905829596412, 0.7533632286995515, 0.7892376681614349, 0.7757847533632287, 0.7937219730941704] 

Kappa at each cross-validation step
 [0.603165683244357, 0.514201927105153, 0.6309304891504229, 0.6216170357173157, 0.5795449784220891, 0.5622769450392576, 0.4923217020572044, 0.5526866117536597, 0.47613230595752676, 0.5587577426015141] 

Accuracy M 0.7969 SD 0.0259 

Kappa M 0.5592 SD 0.0498
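One detail worth keeping in mind when reporting these numbers: `np.std` divides by n by default (population standard deviation). With only 10 repetitions, the sample standard deviation (dividing by n - 1) is noticeably larger. The sketch below recomputes both from the accuracy values printed above, using only the standard library:

```python
from statistics import mean, pstdev, stdev

# Accuracy values from the 10 repetitions above
accuracies = [0.8295964125560538, 0.7668161434977578, 0.8385650224215246,
              0.820627802690583, 0.7982062780269058, 0.8026905829596412,
              0.7533632286995515, 0.7892376681614349, 0.7757847533632287,
              0.7937219730941704]

acc_mean = mean(accuracies)
acc_sd_population = pstdev(accuracies)  # divides by n, like np.std's default
acc_sd_sample = stdev(accuracies)       # divides by n - 1, like np.std(..., ddof=1)

print(f"M {acc_mean:.4f}  SD (n) {acc_sd_population:.4f}  SD (n-1) {acc_sd_sample:.4f}")
```

The difference shrinks as the number of repetitions grows, so it matters little for a full run of 1,000 splits.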
