### **Key Points: CatBoost**

1. **Definition**:  
   CatBoost (Categorical Boosting) is a high-performance, open-source gradient boosting library designed to handle categorical features efficiently and provide high accuracy with minimal parameter tuning.

2. **Key Features**:  
   - **Handles Categorical Data**: Automatically processes categorical features without the need for extensive preprocessing (e.g., one-hot encoding or label encoding).  
   - **Supports GPU Acceleration**: Faster training with GPU support for large datasets.  
   - **Robust to Overfitting**: Incorporates techniques like ordered boosting to reduce overfitting.  
   - **Minimal Parameter Tuning**: Often delivers strong results with default hyperparameters.  
   - **Cross-Validation Support**: Built-in `cv` function for cross-validation, plus `grid_search` and `randomized_search` methods for parameter tuning.

3. **Advantages**:  
   - Efficient with datasets containing many categorical features.  
   - Often outperforms traditional models (e.g., linear models or single decision trees) on tabular data, though results depend on the dataset.  
   - Provides insights into **feature importance** for better explainability.  
   - Fast to train and predict, with a scikit-learn-style `fit`/`predict` API that is easy to use.  

4. **Disadvantages**:  
   - Higher memory consumption compared to some simpler models.  
   - May not perform as well on extremely small datasets.  
   - Deployment requires care: the serving environment needs the CatBoost library (ideally the same version used for training) or a model exported to another format.  

5. **Applications**:  
   - Fraud detection.  
   - Recommendation systems.  
   - Predictive modeling tasks in finance, healthcare, and retail.  
   - Any machine learning task with complex categorical data.  

6. **Best Practices**:  
   - Use **CatBoost’s native handling of categorical features** instead of manual encoding.  
   - Leverage **GPU support** for faster training on large datasets.  
   - Perform hyperparameter tuning for optimal performance (e.g., tuning `iterations`, `learning_rate`, `depth`).  
   - Use **early stopping** to avoid overfitting.  
   - Monitor CatBoost's built-in evaluation metrics during training to assess model performance.

7. **Key Hyperparameters**:  
   - `iterations`: The number of boosting iterations.  
   - `learning_rate`: Step size that scales each tree's contribution; smaller values usually need more iterations.  
   - `depth`: Maximum depth of the tree.  
   - `l2_leaf_reg`: L2 regularization term to prevent overfitting.  
   - `cat_features`: Names or indices of the categorical columns, so that CatBoost encodes them automatically.  
   - `loss_function`: Loss function to optimize (e.g., `Logloss` for classification, `RMSE` for regression).  

8. **Common Metrics for Evaluation**:  
   - **Classification**: Accuracy, F1-score, ROC-AUC.  
   - **Regression**: RMSE, MAE, R-squared.  

---

### **Conclusion**  
CatBoost is a powerful and efficient gradient boosting library that excels in handling datasets with categorical features. It is particularly well-suited for real-world applications requiring high accuracy and minimal preprocessing, making it a valuable tool for data scientists and machine learning practitioners.

In [26]:
# import libraries 
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [27]:
# import dataset of titanic
df = sns.load_dataset('titanic')
df.head()

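|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |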


# preprocessing

In [28]:
# impute missing values in age with a KNN imputer
# (fare has no missing values here; it is the feature used to find neighbours)
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
df[['fare', 'age']] = knn_imputer.fit_transform(df[['fare', 'age']])

# impute missing values in embarked and embark_town with the mode, using SimpleImputer
from sklearn.impute import SimpleImputer
mode_imputer = SimpleImputer(strategy='most_frequent')
df[['embarked', 'embark_town']] = mode_imputer.fit_transform(df[['embarked', 'embark_town']])

# drop the deck column, which is mostly missing
df = df.drop(['deck'], axis=1)

# count the remaining missing values per column
df.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
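
The imputers in this cell are fitted on the full DataFrame, so rows that later land in the test set influence the fill values. One common alternative is to split first and fit the imputer on the training rows only. The sketch below shows that pattern on a small made-up frame (`toy` and its values are assumptions for illustration, not the Titanic data).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Made-up data: fare is complete, age has 10 missing entries.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "fare": rng.uniform(5, 100, 50),
    "age": rng.uniform(1, 80, 50),
})
toy.loc[toy.sample(10, random_state=0).index, "age"] = np.nan

# Split before imputing so the test rows stay unseen.
train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

imputer = KNNImputer(n_neighbors=5)
cols = ["fare", "age"]
train[cols] = imputer.fit_transform(train[cols])  # learn neighbours from training rows
test[cols] = imputer.transform(test[cols])        # reuse them for the test rows

print(train.isnull().sum().sum(), test.isnull().sum().sum())
```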

In [29]:
# select every object or category column
categorical_columns = df.select_dtypes(include=['object', 'category']).columns

# convert those columns in place to the category dtype
df[categorical_columns] = df[categorical_columns].astype('category')

# lets check
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    category
 3   age          891 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     891 non-null    category
 8   class        891 non-null    category
 9   who          891 non-null    category
 10  adult_male   891 non-null    bool    
 11  embark_town  891 non-null    category
 12  alive        891 non-null    category
 13  alone        891 non-null    bool    
dtypes: bool(2), category(6), float64(2), int64(4)
memory usage: 49.6 KB


# CatBoost Classifier

In [31]:
# split the data into train and test
# caution: `alive` is a yes/no copy of `survived`; leaving it in X leaks the label into the features
X = df.drop(['survived'], axis=1)
y = df['survived']

X_train , X_test , y_train , y_test = train_test_split(X , y , test_size=0.2 , random_state=42)


In [None]:
# create the model
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=3,
    loss_function='Logloss',
    eval_metric='Accuracy',
    verbose=False,
)

# fit the model, passing the categorical column names for native handling
model.fit(X_train, y_train, cat_features=categorical_columns.tolist())

# predict on the test set
y_pred = model.predict(X_test)

# evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of the model is {accuracy}')

# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# classification report
cr = classification_report(y_test, y_pred)
print(cr)


Accuracy of the model is 1.0
[[105   0]
 [  0  74]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       105
           1       1.00      1.00      1.00        74

    accuracy                           1.00       179
   macro avg       1.00      1.00      1.00       179
weighted avg       1.00      1.00      1.00       179
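
A perfect score on this dataset is a strong hint of label leakage: in seaborn's `titanic` data the `alive` column is a yes/no copy of `survived`, and it is still part of `X` in the split above. The sketch below shows one way to spot such columns before training. The helper name `find_label_copies` and the small `toy` frame are hypothetical, introduced only for illustration.

```python
import pandas as pd

def find_label_copies(frame, target):
    """Return the columns whose values map one-to-one onto the target."""
    copies = []
    for col in frame.columns.drop(target):
        # forward: every value of the column goes with exactly one target value
        forward = frame.groupby(col, observed=True)[target].nunique().max() == 1
        # backward: every target value goes with exactly one value of the column
        backward = frame.groupby(target, observed=True)[col].nunique().max() == 1
        if forward and backward:
            copies.append(col)
    return copies

# Tiny made-up frame that mimics the structure of the Titanic columns.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "pclass": [3, 1, 3, 1, 3, 3],
    "sex": ["male", "female", "female", "male", "male", "female"],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
    "alive": ["no", "yes", "yes", "no", "yes", "no"],
})

leaky = find_label_copies(toy, "survived")
print(leaky)

# Drop the target and its copies before building the feature matrix.
X_clean = toy.drop(columns=["survived"] + leaky)
```

On the notebook's DataFrame, a leak-free split would drop both columns, for example `X = df.drop(['survived', 'alive'], axis=1)`, with `cat_features` taken from the categorical columns that remain in `X`. The resulting accuracy should then be a more honest estimate than 1.0.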



# CatBoost Regressor

In [23]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Load the tips dataset using seaborn
tips = sns.load_dataset('tips')

# Create a dictionary to store LabelEncoders for each categorical column
label_encoders = {}

# Encode categorical features and store encoders
categorical_cols = ['sex', 'smoker', 'day', 'time']
for col in categorical_cols:
    le = LabelEncoder()
    tips[col] = le.fit_transform(tips[col])  # Encode the column
    label_encoders[col] = le  # Store the encoder

# Split data into features (X) and target (y)
X = tips[['total_bill', 'sex', 'smoker', 'day', 'time', 'size']]
y = tips['tip']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the CatBoostRegressor model
model = CatBoostRegressor(iterations=100, learning_rate=0.1, depth=4, random_state=42, verbose=0)
model.fit(X_train, y_train, cat_features=[1, 2, 3, 4])  # Specify categorical feature indices

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
print(f"Mean Absolute Error: {mae}")

# Inverse transform the encoded features in the original DataFrame
for col in categorical_cols:
    tips[col] = label_encoders[col].inverse_transform(tips[col].astype(int))  # Inverse transform

Mean Squared Error: 0.7227460425659444
R-squared: 0.42178991118535847
Mean Absolute Error: 0.703186662750036
