# CatBoost Model

There are several models I could have chosen, and in a real-world scenario I would test many of them. I chose CatBoost for three reasons:

1. I have experience with it.
2. Its default parameters produce good results without much effort. In a real-world scenario I would spend more time tuning hyperparameters, but I want to keep this analysis simple.
3. It has some useful plotting functions that can show us which features are most important.

In [26]:
import catboost as cb
import pandas as pd

from zilch_interview.prepare import DataCleaner

# autoreload
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [27]:
# Load and prepare data
train_df_raw = pd.read_csv("../data/external/train.csv")

cleaner = DataCleaner()
X_train = cleaner.fit_transform(train_df_raw)
y_train = train_df_raw["credit_score_target"]

In [28]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   age                       97219 non-null   float64
 1   occupation                100000 non-null  object 
 2   annual_income             99001 non-null   float64
 3   monthly_inhand_salary     84998 non-null   float64
 4   num_bank_accounts         98685 non-null   float64
 5   num_credit_card           97729 non-null   float64
 6   interest_rate             97966 non-null   float64
 7   delay_from_due_date       100000 non-null  int64  
 8   num_of_delayed_payment    92262 non-null   float64
 9   changed_credit_limit      97909 non-null   float64
 10  num_credit_inquiries      96386 non-null   float64
 11  outstanding_debt          100000 non-null  float64
 12  credit_utilization_ratio  100000 non-null  float64
 13  credit_history_age        90970 non-null   fl

In [29]:
y_train.describe()

count    100000.000000
mean        701.491960
std          26.982129
min         543.000000
25%         682.000000
50%         701.000000
75%         721.000000
max         900.000000
Name: credit_score_target, dtype: float64

In [30]:
from catboost import CatBoostRegressor

cat_features = [
    "occupation",
    "payment_of_min_amount",
    "payment_behaviour"
]

# Initialize CatBoostRegressor
catboost_regressor = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='RMSE',
    verbose=100,
    one_hot_max_size=20
)

# Fit the model
catboost_regressor.fit(X_train, y_train, cat_features=cat_features)

TypeError: must be real number, not NoneType
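
The traceback above is truncated, so the exact cause cannot be confirmed from this output. One thing worth ruling out is missing values in the categorical columns: CatBoost expects categorical features to be strings or integers, so `None`/`NaN` entries there need converting before fitting. A minimal sketch of that check, using a small made-up frame in place of `X_train` and a hypothetical `"missing"` placeholder label:

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "occupation": ["Engineer", None, "Teacher"],
        "payment_of_min_amount": ["Yes", "No", np.nan],
        "payment_behaviour": ["Low_spent", "High_spent", "Low_spent"],
        "age": [31.0, np.nan, 45.0],
    }
)

cat_features = ["occupation", "payment_of_min_amount", "payment_behaviour"]

# Count missing entries per categorical column before fitting.
missing_counts = toy[cat_features].isna().sum()
print(missing_counts)

# Replace missing categories with an explicit string label and force str dtype.
# Numeric columns such as "age" are left alone: CatBoost accepts NaN there.
toy[cat_features] = toy[cat_features].fillna("missing").astype(str)
```

If the counts are all zero on the real data, the error most likely comes from somewhere else, and the full traceback is the place to look next.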

In [None]:
# Calculate feature statistics
catboost_regressor.calc_feature_statistics(X_train, y_train, max_cat_features_on_plot=20);

{'age': {'borders': array([-3.4028235e+38,  1.4500000e+01,  1.5500000e+01,  1.6500000e+01,
          1.7500000e+01,  1.8500000e+01,  1.9500000e+01,  2.0500000e+01,
          2.1500000e+01,  2.2500000e+01,  2.4500000e+01,  2.5500000e+01,
          2.6500000e+01,  2.7500000e+01,  2.8500000e+01,  2.9500000e+01,
          3.0500000e+01,  3.1500000e+01,  3.2500000e+01,  3.3500000e+01,
          3.4500000e+01,  3.5500000e+01,  3.6500000e+01,  3.7500000e+01,
          3.8500000e+01,  3.9500000e+01,  4.0500000e+01,  4.1500000e+01,
          4.2500000e+01,  4.3500000e+01,  4.4500000e+01,  4.5500000e+01,
          4.6500000e+01,  4.7500000e+01,  4.8500000e+01,  4.9500000e+01,
          5.0500000e+01,  5.1500000e+01,  5.2500000e+01,  5.3500000e+01,
          5.4500000e+01,  5.5500000e+01], dtype=float32),
  'binarized_feature': array([10, 10,  0, ..., 11, 11, 11]),
  'mean_target': array([702.18805, 684.1949 , 683.83167, 684.94434, 682.739  , 696.29266,
         699.94556, 701.0751 , 699.9466 , 7

In [None]:
# Load and prepare test data
test_df_raw = pd.read_csv("../data/external/test.csv")

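# Clean the test data with the cleaner already fitted on the training set.
# Calling fit_transform here would re-fit the cleaner on the test data.
# Assumes DataCleaner exposes a scikit-learn-style transform() method.
X_test = cleaner.transform(test_df_raw)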
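
When cleaning held-out data, any statistics the cleaner learns (fill values, category lists and so on) normally come from the training set only. The internals of `DataCleaner` are not shown here, so the sketch below uses a hypothetical `MedianImputer` class purely to illustrate the difference between reusing a fitted transformer and re-fitting it on the test set:

```python
import pandas as pd


class MedianImputer:
    """Hypothetical stand-in for DataCleaner: fills NaNs with training medians."""

    def fit(self, df):
        # Learn the fill values from the data passed to fit only.
        self.medians_ = df.median(numeric_only=True)
        return self

    def transform(self, df):
        # Apply the stored medians without re-learning them.
        return df.fillna(self.medians_)

    def fit_transform(self, df):
        return self.fit(df).transform(df)


train = pd.DataFrame({"age": [20.0, 30.0, None, 40.0]})
test = pd.DataFrame({"age": [70.0, None, 90.0]})

imputer = MedianImputer()
train_clean = imputer.fit_transform(train)  # learns the training median
test_clean = imputer.transform(test)        # reuses it on the test set

# Re-fitting on the test set instead would fill with the test median.
test_refit = MedianImputer().fit_transform(test)

print(test_clean)
print(test_refit)
```

The two printed frames fill the missing test value differently, which is why the choice between `transform` and `fit_transform` can change the reported metrics.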

In [None]:
# root_mean_squared_error requires scikit-learn 1.4 or later
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error

# Make predictions on the test set
y_test = test_df_raw["credit_score_target"]
y_pred = catboost_regressor.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)

print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")

MAE: 11.418499141584894
MSE: 225.02104095156864
RMSE: 15.000701348655957
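
To put these numbers in context, the reported RMSE can be compared with the spread of the target. A model that always predicts the mean has an RMSE roughly equal to the target's standard deviation, which `y_train.describe()` reported as about 26.98. The sketch below is plain arithmetic on the printed values; it assumes the test target is spread similarly to `y_train`, which is not verified here.

```python
import math

# Values copied from the outputs above.
mse = 225.02104095156864
rmse = 15.000701348655957
y_train_std = 26.982129

# RMSE is the square root of MSE, so the two reported values should agree.
consistent = math.isclose(math.sqrt(mse), rmse, rel_tol=1e-6)

# Mean-only baseline: its MSE is approximately the target variance.
# Assumption: the test target has a similar spread to y_train.
baseline_mse = y_train_std ** 2
approx_r2 = 1 - mse / baseline_mse

print(f"MSE and RMSE consistent: {consistent}")
print(f"Baseline RMSE (approx.): {y_train_std:.2f}")
print(f"Approximate R^2: {approx_r2:.3f}")
```

Under that assumption, the model removes a bit more than two thirds of the variance a mean-only prediction would leave.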


In [None]:
# Get feature importances, sorted from most to least important
feature_importances = catboost_regressor.get_feature_importance(prettified=True)
feature_importances

                Feature Id  Importances
0    payment_of_min_amount    20.116421
1         outstanding_debt    14.525036
2    monthly_inhand_salary     8.141759
3            annual_income     8.079188
4               occupation     6.963972
5      delay_from_due_date     5.362183
6        num_bank_accounts     4.430182
7      total_emi_per_month     4.304946
8       credit_history_age     4.016485
9  amount_invested_monthly     3.953824
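
Assuming the default importance type was used (`PredictionValuesChange`, which CatBoost normalizes so that all features sum to 100), the values above can be read as shares of the total. A small sketch that adds up the ten rows shown, with the numbers copied from the table:

```python
# Importances copied from the table above (only the ten rows displayed).
top_importances = {
    "payment_of_min_amount": 20.116421,
    "outstanding_debt": 14.525036,
    "monthly_inhand_salary": 8.141759,
    "annual_income": 8.079188,
    "occupation": 6.963972,
    "delay_from_due_date": 5.362183,
    "num_bank_accounts": 4.430182,
    "total_emi_per_month": 4.304946,
    "credit_history_age": 4.016485,
    "amount_invested_monthly": 3.953824,
}

# Share of total importance carried by the displayed features.
top_share = sum(top_importances.values())
remaining_share = 100 - top_share

print(f"Top {len(top_importances)} features: {top_share:.1f} of 100")
print(f"Remaining features: {remaining_share:.1f} of 100")
```

Under that assumption, the ten features shown carry most of the total importance, with `payment_of_min_amount` and `outstanding_debt` alone accounting for roughly a third.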
