## 3. CatBoost (Categorical Boosting)

CatBoost is another advanced boosting algorithm specifically designed to handle categorical data efficiently. It is based on Gradient Boosting but incorporates specialized techniques to process categorical features, making it particularly useful for datasets with mixed data types (numerical + categorical).

### How CatBoost Works:

#### Handle Categorical Data Automatically:

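CatBoost does not require manual one-hot encoding or other preprocessing for categorical features: the categorical columns are passed through the `cat_features` argument. (LightGBM and recent XGBoost releases can also accept categorical columns natively, but they encode them differently.)
Its main encoding method is Ordered Target Statistics: each category value is replaced by a smoothed mean of the target computed only from rows that come earlier in a random permutation of the training data, so a row's own label never leaks into its encoding.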
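
To make the encoding idea concrete, here is a minimal sketch of an ordered target statistic in plain Python. It is not CatBoost's internal implementation: the function name, the `prior_weight` smoothing parameter, and the use of a single random permutation are assumptions chosen for illustration, and the real library adds further refinements.

```python
import random

def ordered_target_stats(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row using only the targets of rows that precede it in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # one random permutation of the rows

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of rows seen so far, per category
    encoded = [0.0] * len(categories)

    for i in order:
        c = categories[i]
        s = sums.get(c, 0.0)
        n = counts.get(c, 0)
        # Smoothed mean of earlier targets; the row's own target is not used
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

# Hypothetical toy data
colors = ["red", "blue", "red", "red", "blue"]
labels = [1, 0, 1, 0, 1]
prior = sum(labels) / len(labels)  # overall positive rate
enc = ordered_target_stats(colors, labels, prior)
print(enc)
```

The first row seen for each category has no history, so it receives the prior. Later rows of the same category move toward the category's observed target mean.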

#### Start with Predictions:

Like other boosting methods, CatBoost starts from a constant initial prediction (e.g., the mean of the target for regression, or a score derived from the class frequencies for classification).

#### Train a Weak Learner:

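A weak learner (a shallow decision tree) is trained to reduce the loss function by fitting the negative gradient of the loss at the current predictions. For squared error, this is exactly the residuals (errors) of the current model.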

#### Use Symmetric Trees:

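CatBoost builds symmetric (oblivious) decision trees: every node at the same depth uses the same feature and threshold, so a tree of depth d has 2^d leaves that are indexed by d yes/no answers. This makes prediction very fast and acts as a form of regularization.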
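
A symmetric tree can be pictured as a lookup table: each level asks one question of the row, and the answers form the bits of the leaf index. The sketch below illustrates that idea only. The function name, the `(feature_index, threshold)` representation, and the numbers are assumptions, not CatBoost's actual data structures.

```python
def oblivious_tree_predict(row, splits, leaf_values):
    """splits: one (feature_index, threshold) pair per tree level.
    leaf_values: 2 ** len(splits) values, indexed by the bit pattern of split outcomes."""
    leaf = 0
    for feature, threshold in splits:
        # Every node on this level asks the same question, so one bit per level is enough
        leaf = (leaf << 1) | int(row[feature] > threshold)
    return leaf_values[leaf]

# Hypothetical depth-2 tree
splits = [(0, 50.0), (1, 1.5)]        # level 0: feature 0 > 50, level 1: feature 1 > 1.5
leaf_values = [-0.4, 0.1, 0.2, 0.7]   # leaves for outcomes 00, 01, 10, 11

print(oblivious_tree_predict([63.0, 1.0], splits, leaf_values))  # 0.2
```

Because there is no branching on which node was reached, evaluating a tree takes one comparison per level followed by a single array lookup.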

#### Sequentially Improve Predictions:

Each new tree is trained to correct the mistakes (residuals) of the previous trees. Predictions are updated iteratively.

#### Combine Weak Learners:

All the weak learners are combined into a single strong model, just like other gradient boosting algorithms.
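
The steps above (initial prediction, fitting a weak learner to the errors, updating predictions, combining the learners) can be sketched as a generic gradient boosting loop. This is a plain-Python illustration for squared-error regression on a single numeric feature, using one-split stumps as weak learners. It is not CatBoost's algorithm (no ordered boosting, no categorical handling), and `fit_stump`, `boost`, and the toy data are hypothetical names and values. It assumes the feature has at least two distinct values.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best splits the residuals (squared error)."""
    best = None
    for t in sorted(set(x))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees=20, learning_rate=0.5):
    f0 = sum(y) / len(y)                 # initial prediction: mean of the targets
    preds = [f0] * len(y)
    trees = []
    for _ in range(n_trees):
        # Residuals are the negative gradient of the squared-error loss
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        # Each new tree nudges the predictions toward the targets
        preds = [pi + learning_rate * tree(xi) for pi, xi in zip(preds, x)]
    # The final model is the initial prediction plus the scaled sum of all trees
    return lambda xi: f0 + learning_rate * sum(tree(xi) for tree in trees)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(x, y)

mean_y = sum(y) / len(y)
baseline = sum((mean_y - yi) ** 2 for yi in y) / len(y)
mse = sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)
print(baseline, mse)
```

The training error of the boosted model should be well below the error of the constant baseline, which is the same pattern the decreasing `learn:` values show in the CatBoost logs.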


In [87]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from catboost import CatBoostClassifier

df = pd.read_csv('heartdisease.csv')

# Separate the features from the target column 'num'
x = df.drop('num', axis = 1)
y = df['num']

# Treat the coded columns as categories so get_dummies one-hot encodes them
cat_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
x[cat_cols] = x[cat_cols].astype('object')
x['ca'] = x['ca'].astype('int64')

x = pd.get_dummies(x, drop_first = True)

# Stratified 80/20 split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1, stratify = y)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# AdaBoost baseline
ada = AdaBoostClassifier(algorithm = 'SAMME')
ada.fit(x_train, y_train)
ada_pred = ada.predict(x_test)
accuracy_score(y_test, ada_pred)

# CatBoost with only 19 trees, trained on the one-hot encoded, scaled features
grid = CatBoostClassifier(n_estimators = 19)
grid.fit(x_train, y_train)
accuracy_score(y_test, grid.predict(x_test))

Learning rate set to 0.5
0:	learn: 1.3504277	total: 6.02ms	remaining: 108ms
1:	learn: 1.0893232	total: 13.1ms	remaining: 111ms
2:	learn: 0.9754495	total: 19.3ms	remaining: 103ms
3:	learn: 0.8927807	total: 25.1ms	remaining: 94ms
4:	learn: 0.8368805	total: 30.4ms	remaining: 85.2ms
5:	learn: 0.7952596	total: 36.4ms	remaining: 78.8ms
6:	learn: 0.7325217	total: 42.4ms	remaining: 72.8ms
7:	learn: 0.6834123	total: 48.9ms	remaining: 67.2ms
8:	learn: 0.6529264	total: 55.9ms	remaining: 62.1ms
9:	learn: 0.6101206	total: 61.6ms	remaining: 55.4ms
10:	learn: 0.5791257	total: 68.4ms	remaining: 49.7ms
11:	learn: 0.5508969	total: 75.6ms	remaining: 44.1ms
12:	learn: 0.5248974	total: 81.1ms	remaining: 37.4ms
13:	learn: 0.4914839	total: 86.6ms	remaining: 30.9ms
14:	learn: 0.4603212	total: 92.8ms	remaining: 24.7ms
15:	learn: 0.4400414	total: 98.6ms	remaining: 18.5ms
16:	learn: 0.4183913	total: 106ms	remaining: 12.4ms
17:	learn: 0.4043614	total: 112ms	remaining: 6.24ms
18:	learn: 0.3885697	total: 120ms	rema

0.639344262295082

In [4]:
from catboost import CatBoostClassifier

In [88]:
cat = CatBoostClassifier()
cat.fit(x_train, y_train)



Learning rate set to 0.073603
0:	learn: 1.5672836	total: 5.95ms	remaining: 5.95s
1:	learn: 1.4935932	total: 11.2ms	remaining: 5.57s
2:	learn: 1.4497717	total: 16.9ms	remaining: 5.62s
3:	learn: 1.3940605	total: 19.2ms	remaining: 4.78s
4:	learn: 1.3522646	total: 24.3ms	remaining: 4.83s
5:	learn: 1.3098548	total: 29.7ms	remaining: 4.92s
6:	learn: 1.2788582	total: 34.3ms	remaining: 4.87s
7:	learn: 1.2430561	total: 39.5ms	remaining: 4.9s
8:	learn: 1.2165290	total: 44.7ms	remaining: 4.92s
9:	learn: 1.1860353	total: 49.7ms	remaining: 4.92s
10:	learn: 1.1612364	total: 55.8ms	remaining: 5.01s
11:	learn: 1.1353388	total: 61.4ms	remaining: 5.06s
12:	learn: 1.1127047	total: 66.5ms	remaining: 5.05s
13:	learn: 1.0915250	total: 72.5ms	remaining: 5.1s
14:	learn: 1.0706409	total: 77.8ms	remaining: 5.11s
15:	learn: 1.0501907	total: 83.2ms	remaining: 5.11s
16:	learn: 1.0347573	total: 88.3ms	remaining: 5.11s
17:	learn: 1.0194354	total: 90ms	remaining: 4.91s
18:	learn: 1.0023357	total: 94.6ms	remaining: 4.

178:	learn: 0.3017647	total: 972ms	remaining: 4.46s
179:	learn: 0.3007355	total: 978ms	remaining: 4.46s
180:	learn: 0.2995427	total: 983ms	remaining: 4.45s
181:	learn: 0.2973870	total: 990ms	remaining: 4.45s
182:	learn: 0.2967280	total: 996ms	remaining: 4.44s
183:	learn: 0.2947706	total: 1s	remaining: 4.44s
184:	learn: 0.2935018	total: 1.01s	remaining: 4.44s
185:	learn: 0.2920081	total: 1.01s	remaining: 4.43s
186:	learn: 0.2899239	total: 1.02s	remaining: 4.43s
187:	learn: 0.2879617	total: 1.02s	remaining: 4.43s
188:	learn: 0.2865453	total: 1.03s	remaining: 4.44s
189:	learn: 0.2855951	total: 1.04s	remaining: 4.44s
190:	learn: 0.2840455	total: 1.04s	remaining: 4.43s
191:	learn: 0.2829037	total: 1.05s	remaining: 4.42s
192:	learn: 0.2816238	total: 1.06s	remaining: 4.42s
193:	learn: 0.2799107	total: 1.06s	remaining: 4.42s
194:	learn: 0.2781560	total: 1.07s	remaining: 4.41s
195:	learn: 0.2766416	total: 1.07s	remaining: 4.4s
196:	learn: 0.2750093	total: 1.08s	remaining: 4.4s
197:	learn: 0.273

348:	learn: 0.1451124	total: 1.95s	remaining: 3.65s
349:	learn: 0.1446042	total: 1.96s	remaining: 3.64s
350:	learn: 0.1440936	total: 1.96s	remaining: 3.63s
351:	learn: 0.1435417	total: 1.97s	remaining: 3.63s
352:	learn: 0.1431773	total: 1.97s	remaining: 3.62s
353:	learn: 0.1425820	total: 1.98s	remaining: 3.61s
354:	learn: 0.1421983	total: 1.98s	remaining: 3.6s
355:	learn: 0.1415361	total: 1.99s	remaining: 3.6s
356:	learn: 0.1412602	total: 1.99s	remaining: 3.59s
357:	learn: 0.1407011	total: 2s	remaining: 3.58s
358:	learn: 0.1401565	total: 2s	remaining: 3.58s
359:	learn: 0.1397615	total: 2.01s	remaining: 3.57s
360:	learn: 0.1392399	total: 2.01s	remaining: 3.56s
361:	learn: 0.1386342	total: 2.02s	remaining: 3.55s
362:	learn: 0.1381602	total: 2.02s	remaining: 3.55s
363:	learn: 0.1376473	total: 2.03s	remaining: 3.54s
364:	learn: 0.1371861	total: 2.03s	remaining: 3.53s
365:	learn: 0.1367639	total: 2.04s	remaining: 3.53s
366:	learn: 0.1360371	total: 2.04s	remaining: 3.52s
367:	learn: 0.135574

509:	learn: 0.0903239	total: 2.76s	remaining: 2.65s
510:	learn: 0.0901378	total: 2.77s	remaining: 2.65s
511:	learn: 0.0899079	total: 2.77s	remaining: 2.64s
512:	learn: 0.0896630	total: 2.78s	remaining: 2.63s
513:	learn: 0.0894635	total: 2.78s	remaining: 2.63s
514:	learn: 0.0893349	total: 2.79s	remaining: 2.62s
515:	learn: 0.0890736	total: 2.79s	remaining: 2.62s
516:	learn: 0.0888372	total: 2.79s	remaining: 2.61s
517:	learn: 0.0885360	total: 2.8s	remaining: 2.6s
518:	learn: 0.0883870	total: 2.8s	remaining: 2.6s
519:	learn: 0.0882450	total: 2.81s	remaining: 2.59s
520:	learn: 0.0880562	total: 2.81s	remaining: 2.59s
521:	learn: 0.0878489	total: 2.82s	remaining: 2.58s
522:	learn: 0.0876652	total: 2.82s	remaining: 2.57s
523:	learn: 0.0874371	total: 2.83s	remaining: 2.57s
524:	learn: 0.0872886	total: 2.83s	remaining: 2.56s
525:	learn: 0.0871080	total: 2.84s	remaining: 2.56s
526:	learn: 0.0869040	total: 2.84s	remaining: 2.55s
527:	learn: 0.0866998	total: 2.85s	remaining: 2.54s
528:	learn: 0.08

673:	learn: 0.0643810	total: 3.56s	remaining: 1.72s
674:	learn: 0.0642785	total: 3.56s	remaining: 1.72s
675:	learn: 0.0641335	total: 3.57s	remaining: 1.71s
676:	learn: 0.0640461	total: 3.57s	remaining: 1.7s
677:	learn: 0.0639222	total: 3.58s	remaining: 1.7s
678:	learn: 0.0637995	total: 3.58s	remaining: 1.69s
679:	learn: 0.0636806	total: 3.59s	remaining: 1.69s
680:	learn: 0.0635757	total: 3.59s	remaining: 1.68s
681:	learn: 0.0634769	total: 3.6s	remaining: 1.68s
682:	learn: 0.0633545	total: 3.6s	remaining: 1.67s
683:	learn: 0.0632202	total: 3.61s	remaining: 1.67s
684:	learn: 0.0630847	total: 3.61s	remaining: 1.66s
685:	learn: 0.0630542	total: 3.62s	remaining: 1.65s
686:	learn: 0.0629427	total: 3.62s	remaining: 1.65s
687:	learn: 0.0628290	total: 3.62s	remaining: 1.64s
688:	learn: 0.0626924	total: 3.63s	remaining: 1.64s
689:	learn: 0.0626235	total: 3.63s	remaining: 1.63s
690:	learn: 0.0625047	total: 3.64s	remaining: 1.63s
691:	learn: 0.0624126	total: 3.64s	remaining: 1.62s
692:	learn: 0.06

838:	learn: 0.0484345	total: 4.35s	remaining: 835ms
839:	learn: 0.0483375	total: 4.36s	remaining: 830ms
840:	learn: 0.0482704	total: 4.36s	remaining: 824ms
841:	learn: 0.0482092	total: 4.36s	remaining: 819ms
842:	learn: 0.0481340	total: 4.37s	remaining: 814ms
843:	learn: 0.0480591	total: 4.37s	remaining: 808ms
844:	learn: 0.0479871	total: 4.38s	remaining: 803ms
845:	learn: 0.0479335	total: 4.38s	remaining: 798ms
846:	learn: 0.0478713	total: 4.39s	remaining: 793ms
847:	learn: 0.0478169	total: 4.39s	remaining: 787ms
848:	learn: 0.0477279	total: 4.4s	remaining: 782ms
849:	learn: 0.0476319	total: 4.4s	remaining: 777ms
850:	learn: 0.0476038	total: 4.41s	remaining: 772ms
851:	learn: 0.0475395	total: 4.41s	remaining: 766ms
852:	learn: 0.0474789	total: 4.42s	remaining: 761ms
853:	learn: 0.0474407	total: 4.42s	remaining: 756ms
854:	learn: 0.0473808	total: 4.42s	remaining: 750ms
855:	learn: 0.0473120	total: 4.43s	remaining: 745ms
856:	learn: 0.0472205	total: 4.43s	remaining: 740ms
857:	learn: 0.

<catboost.core.CatBoostClassifier at 0x145d2203690>

In [89]:
y_pred = cat.predict(x_test)
y_pred

array([[0],
       [0],
       [2],
       [0],
       [0],
       [3],
       [0],
       [1],
       [3],
       [0],
       [0],
       [0],
       [0],
       [1],
       [2],
       [0],
       [0],
       [0],
       [3],
       [0],
       [0],
       [0],
       [0],
       [2],
       [0],
       [2],
       [1],
       [2],
       [2],
       [2],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [2],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [2],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [4],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0]], dtype=int64)

In [90]:
accuracy_score(y_test, y_pred)

0.5901639344262295

In [91]:
from sklearn.model_selection import GridSearchCV

Grid search takes a lot of time with CatBoost, so `n_estimators` is set by hand below.

In [102]:
# A small CatBoost model: 19 trees instead of the default 1000
grid = CatBoostClassifier(n_estimators = 19)
grid.fit(x_train, y_train)
accuracy_score(y_test, grid.predict(x_test))

Learning rate set to 0.5
0:	learn: 1.2221402	total: 4.03ms	remaining: 72.6ms
1:	learn: 1.0595259	total: 4.82ms	remaining: 41ms
2:	learn: 0.9664289	total: 7.57ms	remaining: 40.4ms
3:	learn: 0.9015252	total: 10.6ms	remaining: 39.8ms
4:	learn: 0.8300366	total: 14.2ms	remaining: 39.7ms
5:	learn: 0.7721174	total: 18ms	remaining: 39ms
6:	learn: 0.7245954	total: 20.6ms	remaining: 35.4ms
7:	learn: 0.6717048	total: 23.1ms	remaining: 31.8ms
8:	learn: 0.6443238	total: 25.9ms	remaining: 28.8ms
9:	learn: 0.6135802	total: 28.8ms	remaining: 25.9ms
10:	learn: 0.5774641	total: 32.6ms	remaining: 23.7ms
11:	learn: 0.5464912	total: 35.2ms	remaining: 20.5ms
12:	learn: 0.5203790	total: 37.7ms	remaining: 17.4ms
13:	learn: 0.5005172	total: 40.2ms	remaining: 14.3ms
14:	learn: 0.4818818	total: 42.5ms	remaining: 11.3ms
15:	learn: 0.4600115	total: 45.2ms	remaining: 8.47ms
16:	learn: 0.4388399	total: 48.6ms	remaining: 5.72ms
17:	learn: 0.4199683	total: 51.2ms	remaining: 2.84ms
18:	learn: 0.3974243	total: 53.5ms	re

0.6065573770491803

In [109]:
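# Go back to the raw features: no one-hot encoding and no scaling this time
x = df.drop('num', axis = 1)
y = df['num']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1, stratify = y)

# cat_features is not passed, so CatBoost treats every column as numerical
cat = CatBoostClassifier(n_estimators = 50)
cat.fit(x_train, y_train)
y_pred = cat.predict(x_test)
accuracy_score(y_test, y_pred)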

Learning rate set to 0.5
0:	learn: 1.2221402	total: 3.63ms	remaining: 178ms
1:	learn: 1.0595259	total: 4.87ms	remaining: 117ms
2:	learn: 0.9664289	total: 8.36ms	remaining: 131ms
3:	learn: 0.9015252	total: 11.3ms	remaining: 130ms
4:	learn: 0.8300366	total: 14.9ms	remaining: 134ms
5:	learn: 0.7721174	total: 17.8ms	remaining: 131ms
6:	learn: 0.7245954	total: 20.8ms	remaining: 128ms
7:	learn: 0.6717048	total: 23.9ms	remaining: 126ms
8:	learn: 0.6443238	total: 26.9ms	remaining: 123ms
9:	learn: 0.6135802	total: 29.6ms	remaining: 118ms
10:	learn: 0.5774641	total: 32.3ms	remaining: 115ms
11:	learn: 0.5464912	total: 35.2ms	remaining: 112ms
12:	learn: 0.5203790	total: 38ms	remaining: 108ms
13:	learn: 0.5005172	total: 40.8ms	remaining: 105ms
14:	learn: 0.4818818	total: 43.7ms	remaining: 102ms
15:	learn: 0.4600115	total: 46.6ms	remaining: 99ms
16:	learn: 0.4388399	total: 50ms	remaining: 97ms
17:	learn: 0.4199683	total: 52.8ms	remaining: 93.8ms
18:	learn: 0.3974243	total: 55.5ms	remaining: 90.5ms
1

0.6721311475409836