## CatBoost

CatBoost is a machine learning algorithm that uses gradient boosting on decision trees.

- Unique feature: CatBoost grows symmetric (oblivious) trees, so at every tree level the same feature and split condition divide the learning samples into left and right partitions.
- It can be tuned into a fast and efficient model.
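
To make the symmetric-tree idea concrete, here is a minimal sketch in plain Python. It is not CatBoost's actual implementation: it assumes a hypothetical depth-2 tree in which each level holds a single (feature index, threshold) pair shared by every node on that level, and the thresholds and leaf values are made up for illustration.

```python
# Illustrative sketch of a symmetric (oblivious) tree; all values are hypothetical.

def oblivious_leaf_index(sample, level_splits):
    """Return the leaf index of `sample` in a symmetric tree.

    level_splits holds one (feature_index, threshold) pair per tree level.
    Because every node on a level shares the same split, the path to a leaf
    is just one bit per level.
    """
    index = 0
    for feature_index, threshold in level_splits:
        go_right = sample[feature_index] > threshold
        index = (index << 1) | int(go_right)
    return index


def predict(sample, level_splits, leaf_values):
    """Look up the leaf value for one sample."""
    return leaf_values[oblivious_leaf_index(sample, level_splits)]


# Depth-2 tree: level 0 splits on feature 0, level 1 splits on feature 1.
level_splits = [(0, 2.0), (1, 10.0)]
leaf_values = [0.1, 0.2, 0.3, 0.4]  # 2**depth leaves

print(predict([1.0, 5.0], level_splits, leaf_values))   # goes left on both levels
print(predict([3.0, 20.0], level_splits, leaf_values))  # goes right on both levels
```

Since the leaf is addressed by one bit per level, evaluating such a tree needs no branching on node-specific conditions, which is one reason this structure can be fast at prediction time.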


#### Prepare training data
Use all columns as features except the target-related columns listed below:
- ['target', 'ret', 'transactionRevenue_sum', 'fullVisitorId']
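
The column selection could be sketched as below. This is a plain-Python illustration rather than the notebook's own code: the helper name `feature_columns` and the DataFrame name `df` are hypothetical, and treating `'target'` as the label column is an assumption.

```python
# Hypothetical sketch: select feature columns by excluding the listed ones.
EXCLUDED = ['target', 'ret', 'transactionRevenue_sum', 'fullVisitorId']


def feature_columns(all_columns, excluded=EXCLUDED):
    """Keep the original column order, dropping the excluded names."""
    excluded = set(excluded)
    return [col for col in all_columns if col not in excluded]


# Example with a handful of column names taken from this notebook.
columns = ['fullVisitorId', 'transactions', 'pageviews_sum', 'target', 'ret']
print(feature_columns(columns))

# With a pandas DataFrame `df` (hypothetical name), the same idea would be:
#   train_x = df[feature_columns(df.columns)]
#   train_y = df['target']   # assumes 'target' is the label column
```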


### best CatBoost parameters found by grid search are fitted to the model

In [21]:
import numpy as np                      # needed by rmse() below
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return round(np.sqrt(mean_squared_error(y_true, y_pred)), 5)


X_train, X_validation, y_train, y_validation = train_test_split(train_x,
                                                                train_y,
                                                                test_size=0.15,     # 85% train, 15% validation
                                                                random_state=1)     # fixed seed makes the split deterministic
# use best parameters from Grid Search results
clf = CatBoostRegressor(iterations=1000,
                        learning_rate=0.03,
                        random_seed=2,
                        depth=8,
                        l2_leaf_reg=1,
                        eval_metric='RMSE',               # root mean squared error, reported on the eval set
                        od_wait=10)                       # overfitting detector: iterations to wait after the best metric value

clf.fit(X_train, y_train,
        eval_set = (X_validation, y_validation),
        use_best_model = True,
        verbose=True)

y_pred_train = clf.predict(X_train)
y_pred_validation = clf.predict(X_validation)
y_pred_test = clf.predict(test_x)



0:	learn: 0.3127171	test: 0.2964357	best: 0.2964357 (0)	total: 239ms	remaining: 3m 58s
1:	learn: 0.3122579	test: 0.2963198	best: 0.2963198 (1)	total: 427ms	remaining: 3m 33s
2:	learn: 0.3118144	test: 0.2961903	best: 0.2961903 (2)	total: 590ms	remaining: 3m 16s
3:	learn: 0.3113369	test: 0.2960126	best: 0.2960126 (3)	total: 760ms	remaining: 3m 9s
4:	learn: 0.3109190	test: 0.2959735	best: 0.2959735 (4)	total: 931ms	remaining: 3m 5s
5:	learn: 0.3104853	test: 0.2959617	best: 0.2959617 (5)	total: 1.11s	remaining: 3m 4s
6:	learn: 0.3100866	test: 0.2958700	best: 0.2958700 (6)	total: 1.26s	remaining: 2m 58s
7:	learn: 0.3097124	test: 0.2958259	best: 0.2958259 (7)	total: 1.44s	remaining: 2m 58s
8:	learn: 0.3093678	test: 0.2958229	best: 0.2958229 (8)	total: 1.59s	remaining: 2m 55s
9:	learn: 0.3089611	test: 0.2957122	best: 0.2957122 (9)	total: 1.77s	remaining: 2m 55s
10:	learn: 0.3086486	test: 0.2955362	best: 0.2955362 (10)	total: 1.92s	remaining: 2m 52s
11:	learn: 0.3083570	test: 0.2955229	best: 0

### from the results above, the best validation RMSE is 0.290167 at iteration 147

- because use_best_model is set to True, the model is shrunk to its first 148 iterations (iterations are counted from 0)
- the difference in RMSE between the validation and training sets is 0.00043.
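
The relationship between the best iteration and the number of trees kept can be sketched as follows. This is a plain-Python illustration of the idea, not CatBoost's internal logic; it uses only the first five validation values from the training log above, so the indices it prints are for that short excerpt only.

```python
# Sketch: locate the best iteration in a per-iteration validation metric.
# The values are the first five "test" RMSE entries from the training log above.
val_rmse = [0.2964357, 0.2963198, 0.2961903, 0.2960126, 0.2959735]


def best_iteration(metric_values):
    """Return the 0-based index of the lowest metric value."""
    return min(range(len(metric_values)), key=metric_values.__getitem__)


best_iter = best_iteration(val_rmse)
trees_kept = best_iter + 1  # iterations are 0-based, so keep one more tree than the index

print(best_iter, trees_kept)
```

The same off-by-one explains why a best iteration of 147 corresponds to a model with 148 trees.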

### overfitting evaluation

In [44]:
rmse_val = rmse(y_validation, y_pred_validation)
rmse_train = rmse(y_train, y_pred_train)
print(f"CatB: RMSE val: {rmse_val}  - RMSE train: {rmse_train}", f"= {rmse_val - rmse_train}")
print()
print("The RMSE difference between train and val data is small, thus no overfitting problem for this trained model.")

CatB: RMSE val: 0.28871  - RMSE train: 0.28622 = 0.0024900000000000477

The RMSE difference between train and val data is small, thus no overfitting problem for this trained model.
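
What counts as a "small" gap is a judgment call. One possible way to put the number in context, sketched here as an assumption rather than as the notebook's method, is to express the gap relative to the training RMSE. The helper `relative_gap` is hypothetical; the two RMSE values are copied from the output above.

```python
# Sketch: express the train/validation gap relative to the training RMSE.
rmse_val = 0.28871    # from the output above
rmse_train = 0.28622  # from the output above


def relative_gap(val, train):
    """Gap between validation and training error as a fraction of training error."""
    return (val - train) / train


gap = relative_gap(rmse_val, rmse_train)
print(f"absolute gap: {rmse_val - rmse_train:.5f}, relative gap: {gap:.2%}")
```

A relative gap below one percent is consistent with the conclusion printed above, although no fixed threshold applies to every dataset.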


### feature importance for CatBoost

In [41]:
cat_fi = clf.get_feature_importance()

names = train_x.columns
# pair each rounded importance with its feature name, most important first
sort_fi = sorted(zip((round(fi, 4) for fi in cat_fi), names),
                 reverse=True)
sort_fi


[(15.417, 'transactions'),
 (13.4689, 'interval_dates'),
 (9.5166, 'first_ses_from_the_period_start'),
 (9.0417, 'last_ses_from_the_period_end'),
 (6.1191, 'pageviews_sum'),
 (4.0997, 'hits_sum'),
 (3.7146, 'metro'),
 (3.6906, 'pageviews_mean'),
 (3.0761, 'pageviews_max'),
 (2.6829, 'channelGrouping'),
 (2.5039, 'operatingSystem'),
 (2.3517, 'sessionQualityDimSum'),
 (2.2093, 'bounces_mean'),
 (2.0742, 'hits_max'),
 (2.0273, 'visitStartTime_counts'),
 (1.9448, 'visitNumber_max'),
 (1.6068, 'timeOnSite_max'),
 (1.5769, 'sessionQualityDimMean'),
 (1.5203, 'pageviews_min'),
 (1.3856, 'networkDomain'),
 (1.3195, 'hits_min'),
 (1.2308, 'hits_mean'),
 (1.2243, 'region'),
 (1.1733, 'timeOnSite_min'),
 (1.0727, 'source'),
 (1.0592, 'timeOnSite_mean'),
 (0.6229, 'sessionQualityDimMax'),
 (0.6049, 'city'),
 (0.3522, 'medium'),
 (0.2315, 'timeOnSite_sum'),
 (0.1772, 'referralPath'),
 (0.1762, 'country'),
 (0.1689, 'sessionQualityDimMin'),
 (0.1333, 'continent'),
 (0.0877, 'customDimensions_value'

### the top 3 features contributing to customer revenue predictions with CatBoost are:

- transactions (15.417)
- interval_dates (13.4689)
- first_ses_from_the_period_start (9.5166)
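
To gauge how much of the model's total importance these three features carry, their scores can be summed. The sketch below assumes the scores are percentages that add up to 100 across all features, which is how CatBoost's default importance type is normalized as far as I know; the helper `top_k_share` is hypothetical and only the first five rows of the output above are copied in.

```python
# Sketch: share of total importance covered by the top-k features.
# Values copied from the feature-importance output above (first five rows only).
top_features = [
    (15.417, 'transactions'),
    (13.4689, 'interval_dates'),
    (9.5166, 'first_ses_from_the_period_start'),
    (9.0417, 'last_ses_from_the_period_end'),
    (6.1191, 'pageviews_sum'),
]


def top_k_share(importances, k):
    """Sum the importance scores of the first k (score, name) pairs."""
    return sum(score for score, _ in importances[:k])


# Assumption: scores are percentages that sum to 100 over all features.
print(round(top_k_share(top_features, 3), 4))
```

Under that assumption, the top three features together account for roughly 38% of the total importance.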