# Python CatBoost Tutorial - Regression

Adapted from the CatBoost <a href="https://github.com/catboost" target="_blank" >repository</a>.

### CatBoost installation
If you have not already installed CatBoost: <br>
pip install --upgrade catboost


### Data Loading

In [21]:
from catboost import CatBoostRegressor, Pool, cv
from catboost.eval.catboost_evaluation import *

import numpy as np
import pandas as pd
from collections import Counter
from itertools import product

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

from imblearn.over_sampling import SMOTE, SMOTENC

In [22]:
#Define a function to calculate the MAPE (in percent)
#Note: the result is inf if y_true contains zeros
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [2]:
#Import Data
df = pd.read_csv("titanic.csv")

#See the imported dataset
print("DF shape", df.shape)
df.head()


DF shape (891, 9)


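| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |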


### Feature Preparation
First, let's check how many missing values we have:

In [3]:
df.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

As we can see, **`Age`**, **`Cabin`** and **`Embarked`** have missing values. Let's fill them with a number far outside their distributions, so that the model can easily distinguish the missing entries and take them into account:

In [4]:
df.fillna(-999, inplace=True)
df.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Cabin       0
Embarked    0
dtype: int64
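
Filling every column with the number `-999` also places a number inside the string columns (`Cabin`, `Embarked`). As a possible alternative, the sketch below fills numeric and non-numeric columns separately on a small toy frame. The toy data and the `"missing"` label are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Cabin": ["C85", None, None],
    "Embarked": ["S", "C", None],
})

# Split the columns by type
numeric_cols = toy.select_dtypes(include="number").columns
other_cols = toy.columns.difference(numeric_cols)

# Numeric columns get an out-of-range sentinel, string columns get a label
toy[numeric_cols] = toy[numeric_cols].fillna(-999)
toy[other_cols] = toy[other_cols].fillna("missing")

print(toy)
print(toy.isnull().sum())
```

With this variant the string columns contain only strings, which keeps later category encoding unambiguous.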

Now let's separate the features from the label variable. **To test regression, we try to predict the ticket price (`Fare`)**:

In [5]:
X = df.drop('Fare', axis=1)
y = df.Fare

Note that our features are of different types: some are numeric, some are categorical, and some are plain strings, which would normally need special handling (for example, a bag-of-words encoding). In our case we can simply treat these string features as categorical ones, because all the heavy lifting is done inside CatBoost. How cool is that? :)

In [6]:
print(X.dtypes)

#np.float was removed in NumPy 1.24; the builtin float is equivalent here
categorical_features_indices = np.where(X.dtypes != float)[0]
categorical_features_indices

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Cabin        object
Embarked     object
dtype: object


array([0, 1, 2, 4, 5, 6, 7], dtype=int64)
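
The rule used above is "every column that is not a float is treated as categorical". A minimal sketch of the same rule on a toy frame, written with `pandas.api.types.is_float_dtype`, is shown below. The toy columns are hypothetical and only mirror a few of the Titanic features.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

# Toy frame with integer, string and float columns (hypothetical values)
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Cabin": ["missing", "C85"],
})

# Positions of all columns whose dtype is not a float
cat_idx = [i for i, dtype in enumerate(toy.dtypes) if not is_float_dtype(dtype)]

# Column names are often easier to read than positions
cat_names = [toy.columns[i] for i in cat_idx]

print(cat_idx)
print(cat_names)
```

Note that this rule also marks integer columns such as `Pclass` as categorical, which is a modelling choice rather than a requirement.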

#### Encode Strings
This step is not strictly necessary for CatBoost, but it is useful for other tools, for example SMOTE.

In [10]:
for var in ['Sex', 'Cabin', 'Embarked']:
    X[var] = X[var].astype('category').cat.codes
X.head()

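| | Survived | Pclass | Sex | Age | SibSp | Parch | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 0 | 3 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 82 | 1 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 0 | 3 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 56 | 3 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 0 | 3 |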
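
To make the encoding step easier to follow, here is a small sketch of what `astype('category').cat.codes` does on a toy string column. The example values are assumptions for illustration; the integer codes follow the sorted order of the distinct strings.

```python
import pandas as pd

# Toy string column (hypothetical values)
sex = pd.Series(["male", "female", "male"])

# Convert to a categorical and read the integer codes
as_category = sex.astype("category")
codes = as_category.cat.codes

# Keep the code -> label mapping so the encoding can be reversed later
mapping = dict(enumerate(as_category.cat.categories))

print(list(codes))
print(mapping)
```

Storing the mapping is optional, but it helps when the encoded values need to be interpreted after modelling.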


### Data Splitting
Let's split the data into training and test sets.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=14)

### Parameters Tuning

In [12]:
#Define a grid of parameters to test
grid = {'learning_rate': [0.01, 0.03, 0.1, 0.2],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9],
        }

#Count all possible combinations
print("# Combinations:", len([dict(zip(grid.keys(),v)) for v in product(*grid.values())]))

# Combinations: 60
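
The number of combinations is simply the product of the list lengths in the grid (4 × 3 × 5 = 60). A short sketch that checks this in two ways, using only the standard library:

```python
from itertools import product
from math import prod

grid = {'learning_rate': [0.01, 0.03, 0.1, 0.2],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9],
        }

# Enumerate every combination explicitly
combinations = [dict(zip(grid.keys(), values)) for values in product(*grid.values())]

# Or multiply the number of candidate values per parameter
n_combinations = prod(len(values) for values in grid.values())

print(len(combinations), n_combinations)
print(combinations[0])
```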


In [16]:
#Define Model
model = CatBoostRegressor()

#Grid Search
#Default cross-validation is 3-fold
grid_search_result = model.grid_search(grid, X=X_train, y=y_train, cv=3)
bestparam = grid_search_result["params"]
bestparam

0:	loss: 56.0010108	best: 56.0010108 (0)	total: 599ms	remaining: 35.3s
1:	loss: 55.0748455	best: 55.0748455 (1)	total: 1.24s	remaining: 35.9s
2:	loss: 54.7943036	best: 54.7943036 (2)	total: 1.96s	remaining: 37.2s
3:	loss: 54.5304768	best: 54.5304768 (3)	total: 2.47s	remaining: 34.6s
4:	loss: 56.5582544	best: 54.5304768 (3)	total: 3s	remaining: 33s
5:	loss: 56.1316956	best: 54.5304768 (3)	total: 3.58s	remaining: 32.2s
6:	loss: 55.2493711	best: 54.5304768 (3)	total: 4.16s	remaining: 31.5s
7:	loss: 55.2764053	best: 54.5304768 (3)	total: 4.74s	remaining: 30.8s
8:	loss: 56.9629922	best: 54.5304768 (3)	total: 5.3s	remaining: 30.1s
9:	loss: 56.5150976	best: 54.5304768 (3)	total: 5.9s	remaining: 29.5s
10:	loss: 55.7352622	best: 54.5304768 (3)	total: 6.49s	remaining: 28.9s
11:	loss: 55.8818405	best: 54.5304768 (3)	total: 7.04s	remaining: 28.2s
12:	loss: 56.6578179	best: 54.5304768 (3)	total: 7.64s	remaining: 27.6s
13:	loss: 56.7010579	best: 54.5304768 (3)	total: 8.17s	remaining: 26.8s
14:	loss:

{'depth': 4, 'l2_leaf_reg': 1, 'learning_rate': 0.2}
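
Conceptually, an exhaustive grid search evaluates every parameter combination and keeps the one with the lowest loss. The sketch below illustrates that loop in plain Python. The function `toy_score` is a hypothetical stand-in for the cross-validated loss and does not reproduce CatBoost's internals or the results above.

```python
from itertools import product

def toy_score(params):
    # Hypothetical loss surface used only for illustration (lower is better)
    return (abs(params["depth"] - 6)
            + abs(params["learning_rate"] - 0.1) * 10
            + params["l2_leaf_reg"] * 0.1)

grid = {'learning_rate': [0.01, 0.03, 0.1, 0.2],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9],
        }

best_params, best_loss = None, float("inf")
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    loss = toy_score(params)
    # Keep the combination with the lowest loss seen so far
    if loss < best_loss:
        best_params, best_loss = params, loss

print(best_params, best_loss)
```

In the real search each evaluation is a full cross-validated training run, which is why the cost grows with the number of combinations.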

In [17]:
#Set best params
model = CatBoostRegressor()

#Depending on your objective you can also customize the evaluation metric
bestparam["eval_metric"] = "RMSE"

model.set_params(**bestparam)
print(model.get_params())

{'loss_function': 'RMSE', 'depth': 4, 'l2_leaf_reg': 1, 'learning_rate': 0.2, 'eval_metric': 'RMSE'}


### Model Training
We retain the best model and use early stopping to avoid overfitting.
**In real cases, we need an external test set that is not used for training or validation (early stopping). That test set is the one to use when evaluating the final model.**

In [18]:
#Further split the train set into final_train and validation sets
X_train_final, X_validation, y_train_final, y_validation = train_test_split(X_train, y_train,
                                                                            train_size=0.75, random_state=14)

print(X_train.shape, X_train_final.shape, X_validation.shape)

(668, 8) (501, 8) (167, 8)
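
The two consecutive 75/25 splits produce three disjoint sets: final training, validation and test. The sketch below reproduces only the resulting sizes with NumPy index arrays. The helper `split_indices` is hypothetical, and rounding the first part down is an assumption chosen to match the shapes printed above.

```python
import numpy as np

def split_indices(n_rows, train_size, rng):
    # Shuffle the row positions, then cut them into two parts
    shuffled = rng.permutation(n_rows)
    n_first = int(np.floor(train_size * n_rows))
    return shuffled[:n_first], shuffled[n_first:]

rng = np.random.default_rng(14)

# First split: train vs. test
train_idx, test_idx = split_indices(891, 0.75, rng)

# Second split: final train vs. validation, taken from the train part only
inner_first, inner_second = split_indices(len(train_idx), 0.75, rng)
final_train_idx = train_idx[inner_first]
validation_idx = train_idx[inner_second]

print(len(final_train_idx), len(validation_idx), len(test_idx))
```

Roughly 56% of the rows end up in the final training set, 19% in the validation set and 25% in the test set.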


Use early stopping rounds and a validation set to stop training after K iterations with no improvement in the evaluation metric.

In [33]:
model.fit(X_train_final, y_train_final, cat_features=categorical_features_indices, eval_set=(X_validation, y_validation), \
                   early_stopping_rounds = 80, use_best_model=True, logging_level = "Verbose")


0:	learn: 43.6493393	test: 52.1832622	best: 52.1832622 (0)	total: 12.7ms	remaining: 12.7s
1:	learn: 40.7367274	test: 49.1197856	best: 49.1197856 (1)	total: 23.7ms	remaining: 11.8s
2:	learn: 38.4952923	test: 47.2401130	best: 47.2401130 (2)	total: 25.5ms	remaining: 8.47s
3:	learn: 36.7473355	test: 46.0462151	best: 46.0462151 (3)	total: 32.1ms	remaining: 7.99s
4:	learn: 35.1955850	test: 44.8261668	best: 44.8261668 (4)	total: 33.3ms	remaining: 6.63s
5:	learn: 34.3082789	test: 44.1016612	best: 44.1016612 (5)	total: 37ms	remaining: 6.13s
6:	learn: 33.4253749	test: 43.3189825	best: 43.3189825 (6)	total: 40.9ms	remaining: 5.8s
7:	learn: 32.9105041	test: 43.1057861	best: 43.1057861 (7)	total: 42ms	remaining: 5.21s
8:	learn: 32.3624857	test: 42.3654827	best: 42.3654827 (8)	total: 52.8ms	remaining: 5.81s
9:	learn: 32.0007112	test: 42.2778349	best: 42.2778349 (9)	total: 53.9ms	remaining: 5.33s
10:	learn: 31.8417047	test: 42.3882976	best: 42.2778349 (9)	total: 57.5ms	remaining: 5.17s
11:	learn: 31.

99:	learn: 20.2205215	test: 41.3910971	best: 41.0182851 (64)	total: 559ms	remaining: 5.03s
100:	learn: 20.2182362	test: 41.3873315	best: 41.0182851 (64)	total: 566ms	remaining: 5.04s
101:	learn: 20.1450386	test: 41.3300656	best: 41.0182851 (64)	total: 576ms	remaining: 5.07s
102:	learn: 20.0326910	test: 41.3179782	best: 41.0182851 (64)	total: 586ms	remaining: 5.1s
103:	learn: 19.9723603	test: 41.2931705	best: 41.0182851 (64)	total: 592ms	remaining: 5.1s
104:	learn: 19.8699802	test: 41.3176682	best: 41.0182851 (64)	total: 599ms	remaining: 5.11s
105:	learn: 19.8439732	test: 41.3843892	best: 41.0182851 (64)	total: 603ms	remaining: 5.09s
106:	learn: 19.7888115	test: 41.3618913	best: 41.0182851 (64)	total: 609ms	remaining: 5.09s
107:	learn: 19.7705313	test: 41.3340152	best: 41.0182851 (64)	total: 618ms	remaining: 5.11s
108:	learn: 19.7661371	test: 41.3265322	best: 41.0182851 (64)	total: 622ms	remaining: 5.08s
109:	learn: 19.7612990	test: 41.3372485	best: 41.0182851 (64)	total: 631ms	remainin

<catboost.core.CatBoostRegressor at 0x172ff1ee948>

This shows that the best validation **RMSE** of **41.02** was achieved at iteration **64**; training stopped once **80** further iterations had brought no improvement. Because `use_best_model=True`, the model from that iteration is retained as the **best model**.
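
The stopping rule can be sketched in a few lines of plain Python: track the best validation loss and stop once a given number of iterations has passed without beating it. The function name, the toy losses and the patience of 3 are assumptions for illustration; this is not CatBoost's actual implementation.

```python
def early_stopping(losses, patience):
    """Return (best_iteration, best_loss, stop_iteration) for a loss sequence."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(losses):
        if loss < best_loss:
            # New best value: remember where it occurred
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop here
            return best_iter, best_loss, i
    # The sequence ended before the patience ran out
    return best_iter, best_loss, len(losses) - 1

# Toy validation losses (hypothetical values)
toy_losses = [52.2, 49.1, 47.2, 47.5, 47.3, 47.4, 47.9]
result = early_stopping(toy_losses, patience=3)
print(result)
```

Here the best value occurs at iteration 2, and the loop stops at iteration 5 after three iterations without improvement.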

### Model Predictions and Fit

In [40]:
#Predict on the original Test Set
predictions = model.predict(X_test)
truevalues = np.array(y_test)

#Calculate MAE, MAPE (if no 0 values) and RMSE
print("MAE:", '%.4f' % mean_absolute_error(truevalues, predictions))
print("MAPE:", '%.4f' % mean_absolute_percentage_error(truevalues, predictions))
print("RMSE:", '%.4f' %  np.sqrt(mean_squared_error(truevalues, predictions)))

#Replace zeros with 1 (for demonstration purposes only) and recalculate MAPE
truevalues[truevalues == 0] = 1
print("\nMAPE(replaced zeros):", '%.4f' % mean_absolute_percentage_error(truevalues, predictions), "%")

#Compare with variable, just to get an idea
print("\ny_test M and SD (for comparison)")
print('%.4f' % y_test.mean(),'%.4f' % y_test.std())

MAE: 12.7431
MAPE: inf
RMSE: 34.8058

MAPE(replaced zeros): 62.1592 %

y_test M and SD (for comparison)
31.2733 49.3787
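
The first MAPE is `inf` because some true fares are zero, so the formula divides by zero. Instead of replacing the zeros, another option is to exclude them from the calculation. The sketch below shows this variant on toy values. The function name `mape_nonzero` and the data are assumptions for illustration.

```python
import numpy as np

def mape_nonzero(y_true, y_pred):
    """MAPE in percent, computed only where the true value is non-zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

# Toy values with one zero target (hypothetical)
toy_true = [0, 10, 20]
toy_pred = [1, 8, 25]
toy_mape = mape_nonzero(toy_true, toy_pred)
print('%.4f' % toy_mape, "%")
```

Either approach changes what is measured, so the chosen handling of zeros should be reported together with the MAPE.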




### Monte Carlo Cross-Validation
Now repeat the process many times (for example, 1,000) and report the average fit statistics with their standard deviations. For demonstration purposes, the loop below runs only 10 repetitions.

In [27]:
#Save MAE and RMSE scores in lists
MAE, RMSE = [], []

#For demonstration purposes we repeat it only 10 times
for i in range(0,10):
    #Split with no random seed in train, validation and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)
    X_train_final, X_validation, y_train_final, y_validation = train_test_split(X_train, y_train, train_size=0.75)
    
    model.fit(X_train_final, y_train_final, cat_features=categorical_features_indices, \
              eval_set=(X_validation, y_validation), early_stopping_rounds = 80, use_best_model=True, \
              logging_level = "Silent")
    
    predictions = model.predict(X_test)
    truevalues = np.array(y_test)
    
    MAE.append(mean_absolute_error(truevalues, predictions))
    RMSE.append(np.sqrt(mean_squared_error(truevalues, predictions)))

In [37]:
print("MAE at each cross-validation step\n", MAE, "\n")
print("RMSE at each cross-validation step\n", RMSE, "\n")
print("MAE M", '%.4f' % np.mean(MAE), "SD", '%.4f' % np.std(MAE), "\n")
print("RMSE M", '%.4f' % np.mean(RMSE), "SD", '%.4f' % np.std(RMSE))

MAE at each cross-validation step
 [12.42592182702768, 17.092323839083598, 12.753634786967016, 12.912696719755921, 12.792237100196928, 15.784416622123544, 19.07116941742824, 15.78456335422633, 15.359102578951799, 12.743086698194608] 

RMSE at each cross-validation step
 [24.836901626612047, 47.73921682642773, 27.611795875810227, 25.349319997874577, 39.12348797075578, 39.47469426672077, 52.15669006483145, 34.239860097214596, 31.063438001199927, 34.805845754677655] 

MAE M 14.6719 SD 2.1733 

RMSE M 35.6401 SD 8.6988
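
Note that `np.std` computes the population standard deviation by default (`ddof=0`). With only a few repetitions, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below compares the two, using the MAE values printed above rounded to four decimals.

```python
import numpy as np

# MAE values from the 10 repetitions above, rounded to four decimals
mae_runs = np.array([12.4259, 17.0923, 12.7536, 12.9127, 12.7922,
                     15.7844, 19.0712, 15.7846, 15.3591, 12.7431])

mean_mae = mae_runs.mean()
sd_population = mae_runs.std()      # ddof=0, the np.std default
sd_sample = mae_runs.std(ddof=1)    # ddof=1, sample standard deviation

print("MAE M", '%.4f' % mean_mae)
print("SD (ddof=0)", '%.4f' % sd_population)
print("SD (ddof=1)", '%.4f' % sd_sample)
```

The difference between the two estimates shrinks as the number of repetitions grows.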
