### 🔹 CatBoost Regression  

CatBoost (Categorical Boosting) is a gradient boosting algorithm that is **efficient, accurate, and designed to handle categorical features automatically**.  
It is particularly useful when working with datasets that contain a mix of numerical and categorical data.  

The main idea:  
- CatBoost builds an ensemble of trees sequentially, where each new tree tries to correct the errors of previous trees.  
- Uses **ordered boosting** and **ordered target statistics** (a special encoding for categorical features) to reduce overfitting and improve accuracy.  
- Supports both **regression** and **classification** tasks.  
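
The categorical encoding mentioned in the list above rests on one idea: each row's category is replaced by a smoothed average of the target, computed only from rows that come earlier in a random permutation, so a row never sees its own label. The sketch below illustrates that idea in plain Python. It is a simplified illustration, not CatBoost's internal implementation; the function name `ordered_target_stats`, the smoothing weight `a`, the use of the global mean as prior, and the toy data are all assumptions.

```python
import random
from collections import defaultdict


def ordered_target_stats(categories, targets, a=1.0, seed=0):
    """Encode each category using only the targets of rows seen earlier.

    Simplified illustration: one random permutation, prior = global target mean.
    """
    n = len(categories)
    prior = sum(targets) / n
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random "time" order over the rows

    sums = defaultdict(float)  # running sum of targets per category
    counts = defaultdict(int)  # running number of rows per category
    encoded = [0.0] * n
    for idx in order:
        c = categories[idx]
        # Smoothed mean of the targets seen so far for this category
        encoded[idx] = (sums[c] + a * prior) / (counts[c] + a)
        # Only now register the current row, so it never encodes itself
        sums[c] += targets[idx]
        counts[c] += 1
    return encoded


# Hypothetical toy data
colors = ["red", "blue", "red", "green", "blue", "red"]
prices = [10.0, 20.0, 12.0, 30.0, 22.0, 11.0]
encoded = ordered_target_stats(colors, prices)
for c, e in zip(colors, encoded):
    print(f"{c:>5} -> {e:.2f}")
```

The first row visited for each category has no history, so it receives the prior; later rows of the same category move toward that category's observed mean.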

The prediction can be expressed as:  

$$
\hat{Y} = \sum_{k=1}^{K} f_k(X), \quad f_k \in \mathcal{F}
$$  

where:  
- $f_k$ is the $k$th regression tree,  
- $\mathcal{F}$ is the space of all possible trees,  
- $K$ is the total number of trees.  
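
To make the additive formula concrete, the sketch below builds a small ensemble of depth-1 trees (stumps) with NumPy, fitting each new stump to the residuals of the current prediction under squared loss. This is plain gradient boosting on hypothetical toy data, not CatBoost itself (it has no ordered boosting and no symmetric trees). The learning rate `lr` and the constant starting value `base` are common additions that the formula leaves out; they can be thought of as folded into the $f_k$.

```python
import numpy as np


def fit_stump(x, r):
    """Find the single-threshold split of 1-D x that best fits residuals r."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value


def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)


# Hypothetical toy data: a noisy sine curve
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

K, lr = 50, 0.3
base = y.mean()
pred = np.full_like(y, base)
stumps = []
mse_history = [np.mean((y - pred) ** 2)]
for k in range(K):
    stump = fit_stump(x, y - pred)        # fit f_k to the current residuals
    stumps.append(stump)
    pred += lr * predict_stump(stump, x)  # add f_k to the running sum
    mse_history.append(np.mean((y - pred) ** 2))

# The final prediction is the base value plus the (scaled) sum of all K trees
y_hat = base + lr * sum(predict_stump(s, x) for s in stumps)
print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after {K} trees:  {mse_history[-1]:.4f}")
```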

CatBoost Regression helps us to:  
- Handle datasets with categorical features **without manual encoding** (for example, no one-hot encoding step).  
- Capture **complex non-linear relationships** in data.  
- Achieve **strong performance with little hyperparameter tuning**, since the default settings are often competitive.  

In this notebook, we will implement **CatBoost Regression** and compare its performance with LightGBM and XGBoost 🚀.  


# --------------------------------------------------------------------------

# load dataset

In [1]:
# from google.colab import files, drive

# up = files.upload()
# drive.mount('/content/drive')

In [2]:
import pandas as pd

df = pd.read_csv('dataset.csv')
df.head(3)

Unnamed: 0,f1,f2,f3,T
0,2.0,4,8.5,196
1,2.4,4,9.6,221
2,1.5,4,5.9,136


In [3]:
# df.info()

# cleaning

In [4]:
# clean data

# encoding

In [5]:
# encode data

# define x, y

In [6]:
import numpy as np

x = df[['f1', 'f2', 'f3']].values
y = df['T'].values

In [7]:
x[:5]

array([[ 2. ,  4. ,  8.5],
       [ 2.4,  4. ,  9.6],
       [ 1.5,  4. ,  5.9],
       [ 3.5,  6. , 11.1],
       [ 3.5,  6. , 10.6]])

In [8]:
y[:5]

array([196, 221, 136, 255, 244])

# splitting

In [9]:
# ! pip install catboost

In [10]:
# # finding best random state 

# from sklearn.model_selection import train_test_split
# from catboost import CatBoostRegressor
# cbr = CatBoostRegressor(verbose=0, random_state=1)
# from sklearn.metrics import r2_score

# import time
# t1 = time.time()
# lst = []
# for i in range(1,10):
#     x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=i) 
#     cbr.fit(x_train, y_train)
#     yhat_test = cbr.predict(x_test)
#     r2 = r2_score(y_test, yhat_test)
#     lst.append(r2)

# t2 = time.time()
# print(f"run time: {round((t2 - t1) / 60 , 0)} min")
# print(f"R2_score = {round(max(lst),2)}")
# print(f"random_state = {np.argmax(lst) + 1}")

In [11]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=9)

# scaling

In [12]:
# No scaling needed: CatBoost splits on feature thresholds, so rescaling the features does not change the trees

# fit train data

In [13]:
## k-fold cross validation

# from catboost import CatBoostRegressor
# from sklearn.model_selection import GridSearchCV

# parameters = {
#     '': [],
#     '': []
# }

# cb = CatBoostRegressor(random_state=42)
# gs = GridSearchCV(estimator=cb, param_grid=parameters, cv=5)

# gs.fit(x_train, y_train)

# best_params = gs.best_params_
# print(best_params)

In [14]:
# https://catboost.ai/docs/en/concepts/python-reference_catboostregressor

In [15]:
from catboost import CatBoostRegressor

cbr = CatBoostRegressor(random_state=1)  # verbose=0
cbr.fit(x_train, y_train)

Learning rate set to 0.039927
0:	learn: 60.3297311	total: 158ms	remaining: 2m 37s
1:	learn: 58.3940411	total: 159ms	remaining: 1m 19s
2:	learn: 56.4549506	total: 160ms	remaining: 53.3s
3:	learn: 54.6930536	total: 162ms	remaining: 40.2s
4:	learn: 52.9306177	total: 163ms	remaining: 32.4s
5:	learn: 51.3046936	total: 164ms	remaining: 27.1s
6:	learn: 49.6999264	total: 165ms	remaining: 23.4s
7:	learn: 48.2209995	total: 166ms	remaining: 20.6s
8:	learn: 46.7149773	total: 167ms	remaining: 18.4s
9:	learn: 45.3128098	total: 168ms	remaining: 16.7s
10:	learn: 44.0993558	total: 169ms	remaining: 15.2s
11:	learn: 42.7951321	total: 170ms	remaining: 14s
12:	learn: 41.5197281	total: 172ms	remaining: 13s
13:	learn: 40.4305807	total: 173ms	remaining: 12.2s
14:	learn: 39.2614307	total: 174ms	remaining: 11.4s
15:	learn: 38.1146810	total: 175ms	remaining: 10.7s
16:	learn: 37.0352673	total: 176ms	remaining: 10.2s
17:	learn: 36.0750518	total: 177ms	remaining: 9.67s
18:	learn: 35.0887034	total: 178ms	remaining: 

<catboost.core.CatBoostRegressor at 0x22baebe5940>

## predict test data

In [16]:
yhat_test = cbr.predict(x_test)

## evaluate the model

In [17]:
from sklearn.metrics import r2_score

print("r2-score (train data): %0.4f" % r2_score(y_train, cbr.predict(x_train)))
print("r2-score (test data): %0.4f" % r2_score(y_test, yhat_test))

r2-score (train data): 0.9918
r2-score (test data): 0.9922


In [18]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

yhat_train = cbr.predict(x_train)  # compute the train predictions once

print(f"MSE (train data): {mean_squared_error(y_train, yhat_train)}")
print(f"RMSE (train data): {np.sqrt(mean_squared_error(y_train, yhat_train))}")
print(f"MAE (train data): {mean_absolute_error(y_train, yhat_train)}")
print('------------')
print(f"MSE (test data): {mean_squared_error(y_test, yhat_test)}")
print(f"RMSE (test data): {np.sqrt(mean_squared_error(y_test, yhat_test))}")
print(f"MAE (test data): {mean_absolute_error(y_test, yhat_test)}")

MSE (train data): 31.922693434384573
RMSE (train data): 5.650017118061199
MAE (train data): 2.515622374374697
------------
MSE (test data): 35.483003430049344
RMSE (test data): 5.956761152677632
MAE (test data): 2.8385144865832674
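
For reference, the metrics reported above follow directly from their definitions. The sketch below computes them with NumPy: `y_true` reuses the first five target values shown earlier, while `y_pred` is made up for illustration and is not the model's output.

```python
import numpy as np

y_true = np.array([196.0, 221.0, 136.0, 255.0, 244.0])
y_pred = np.array([198.0, 219.0, 140.0, 250.0, 247.0])  # hypothetical predictions

errors = y_true - y_pred
mse = np.mean(errors ** 2)                      # mean squared error
rmse = np.sqrt(mse)                             # same units as the target
mae = np.mean(np.abs(errors))                   # mean absolute error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.4f}")
```

Because RMSE is the square root of MSE, it is expressed in the units of the target `T`, which makes it easier to compare with MAE.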


# predict new data

In [21]:
cbr.predict([[2, 4, 8.5]])

array([196.89227736])

# save the model

In [19]:
# import joblib

# joblib.dump(cbr, 'cbr_model.pkl')

# load the model

In [20]:
# import joblib

# cbr = joblib.load('cbr_model.pkl')