# CatBoost Algorithm

![CatBoost](images/catboost.png)

[CatBoost](https://catboost.ai/) is an open-source gradient boosting library developed by Yandex, the company behind the popular search engine in Russia. It can be used for both classification and regression tasks. Its distinguishing feature is that it works with categorical features directly, without the need for one-hot encoding. Like LightGBM and XGBoost, it builds an ensemble of decision trees. CatBoost is a popular choice in Kaggle competitions and has been part of winning solutions.
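
To give a feel for how a categorical column can be used without one-hot encoding, here is a minimal sketch of an ordered target statistic, the general idea CatBoost builds on. It is a simplified illustration, not CatBoost's actual implementation (the library also uses random permutations of the rows, among other details). The function name `ordered_target_stats` and the parameters `prior` and `prior_weight` are assumptions made for this sketch. The values are the `sex` and `survived` entries of the first five Titanic rows.

```python
from collections import defaultdict


def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row's category using only the targets of EARLIER rows.

    Illustrative sketch only; the name and parameters are hypothetical.
    """
    sums = defaultdict(float)   # running sum of targets seen per category
    counts = defaultdict(int)   # running count of rows seen per category
    encoded = []
    for cat, target in zip(categories, targets):
        # Smoothed mean of the target over previous rows with the same category.
        value = (sums[cat] + prior_weight * prior) / (counts[cat] + prior_weight)
        encoded.append(value)
        # Update the running statistics only after encoding the current row,
        # so a row never sees its own target.
        sums[cat] += target
        counts[cat] += 1
    return encoded


sex = ["male", "female", "female", "female", "male"]
survived = [0, 1, 1, 1, 0]
encoded_sex = ordered_target_stats(sex, survived)
print(encoded_sex)
```

Because each row is encoded from earlier rows only, the encoding of a row does not depend on that row's own label.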





## Installation: `pip install catboost`

In [64]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score , confusion_matrix , classification_report

In [65]:
# load_dataset 
df = sns.load_dataset('titanic')
df.head()

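|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |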


## Pre-processing

In [66]:
df.isnull().sum().sort_values(ascending=False)


deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [67]:
# impute missing values in 'age' using KNNImputer
# (only one column is passed in, so there are no other features to measure
# distance on and the imputer falls back to the column mean)
from sklearn.impute import KNNImputer, SimpleImputer
imputer = KNNImputer(n_neighbors=5)
df[['age']] = imputer.fit_transform(df[['age']])

# impute missing values in 'embarked' and 'embark_town' with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
df[['embarked', 'embark_town']] = imputer.fit_transform(df[['embarked', 'embark_town']])
df.isnull().sum().sort_values(ascending=False)


deck           688
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64
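
The imputation cell above passes only the `age` column to `KNNImputer`. With a single column, a row whose `age` is missing has no other feature to compare on, so scikit-learn falls back to the column mean. The sketch below, on a small hand-made array (the ages and fares are taken from the table at the top, with one age blanked out for illustration), contrasts that with passing a second numeric column so that neighbours can actually be found. The variable names are chosen for this example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# One column only: the missing value has no defined distance to any other row,
# so it is filled with the mean of the observed ages.
age_only = np.array([[22.0], [38.0], [np.nan], [35.0]])
filled_single = KNNImputer(n_neighbors=2).fit_transform(age_only)

# Two columns (age, fare): the row with the missing age can now be compared
# with the other rows on fare, and takes the age of its nearest neighbour.
age_fare = np.array([
    [22.0, 7.25],
    [38.0, 71.2833],
    [np.nan, 8.05],
    [35.0, 53.1],
])
filled_multi = KNNImputer(n_neighbors=1).fit_transform(age_fare)

print(filled_single[2, 0])  # mean of 22, 38 and 35
print(filled_multi[2, 0])   # age of the row with the closest fare (7.25)
```

If neighbour-based imputation is the goal, passing several numeric columns such as `age`, `fare`, `sibsp`, and `parch` to the imputer is one option; whether that helps on this dataset is not something this sketch establishes.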

In [68]:
df.drop(['deck'],axis=1,inplace=True)


In [69]:
df.isnull().sum().sort_values(ascending=False)


survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [70]:
# select the text (object) and category columns and cast them all to the
# pandas 'category' dtype, so they can be passed to CatBoost as cat_features
categorical_col = df.select_dtypes(include=['object','category']).columns

# overwrite these columns in place with their categorical versions
df[categorical_col] = df[categorical_col].apply(lambda x: x.astype('category'))


In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    category
 3   age          891 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     891 non-null    category
 8   class        891 non-null    category
 9   who          891 non-null    category
 10  adult_male   891 non-null    bool    
 11  embark_town  891 non-null    category
 12  alive        891 non-null    category
 13  alone        891 non-null    bool    
dtypes: bool(2), category(6), float64(2), int64(4)
memory usage: 49.6 KB


In [72]:
# split data into X and y
# note: 'alive' is the yes/no form of 'survived', so keeping it in X leaks the target
X = df.drop('survived', axis=1)
y = df['survived']

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [75]:
# configure the CatBoost classifier
clf = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=3,
    loss_function='CrossEntropy',  # cross-entropy loss for the binary target
    eval_metric='Accuracy',
    random_seed=42,
    verbose=False)

# train the model
clf.fit(X_train, y_train, cat_features=categorical_col.tolist())

# make predictions
y_pred = clf.predict(X_test)

# evaluate the model

print(f'Accuracy Score: {accuracy_score(y_test,y_pred)}')
print(f'Confusion Matrix: {confusion_matrix(y_test,y_pred)}')
print(f'Classification Report: {classification_report(y_test,y_pred)}')


Accuracy Score: 1.0
Confusion Matrix: [[105   0]
 [  0  74]]
Classification Report:               precision    recall  f1-score   support

           0       1.00      1.00      1.00       105
           1       1.00      1.00      1.00        74

    accuracy                           1.00       179
   macro avg       1.00      1.00      1.00       179
weighted avg       1.00      1.00      1.00       179
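
A perfect score on Titanic is a warning sign rather than a result. In the table at the top, `alive` reads `no`/`yes` exactly where `survived` is 0/1, and `alive` is still among the features when the model is trained, so the model can simply read the answer from it. The sketch below shows how to check for and remove such a column. It uses a few hypothetical toy rows that mimic the relevant columns, not the real dataset, and the variable names are chosen for this example.

```python
import pandas as pd

# Hypothetical toy rows that mimic the columns used above; not the real data.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "pclass": [3, 1, 3, 3, 2, 1],
    "sex": ["male", "female", "female", "male", "male", "female"],
    "alive": ["no", "yes", "yes", "no", "yes", "no"],
})

# Check whether `alive` is just `survived` spelled out.
is_copy_of_target = bool(
    (toy["alive"] == toy["survived"].map({0: "no", 1: "yes"})).all()
)

# Drop the target AND its copy from the features.
X = toy.drop(columns=["survived", "alive"])
y = toy["survived"]

# Build the categorical column list from X itself, so it never names a
# column that has been dropped.
cat_features = X.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(is_copy_of_target, cat_features)
```

On the real DataFrame the same idea would be `X = df.drop(columns=['survived', 'alive'])`, followed by passing a `cat_features` list built from `X` to `clf.fit`. The accuracy should then reflect what the model learns from the passenger attributes, and it will most likely be below 1.0.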

