### XGBoost tuning using a custom two-parameter scorer


ChatGPT couldn't solve this problem, so I had to tune the model manually.
<br><br>My prompt was:
<br><font face="Courier" color="#3333cc">
<br>Use XGBoost regressor with sklearn interface.
<br>
<br>Generate 10,000 rows of synthetic data with 5 numeric features 
<br>and add correlation between those features and the target column.
<br>
<br>Write python code to demonstrate the use of make_scorer from sklearn.metrics.
<br>
<br>Create a custom estimator function to evaluate combination of two results:
<br>1. MSE (model error)
<br>2. The size of model file
<br>
<br>Use parameter grid like this:
<br>
<br>param_grid = {
<br>   'max_depth': [3, 5, 7],
<br>   'learning_rate': [0.1, 0.01, 0.001],
<br>   'subsample': [0.5, 0.7, 1],
<br>   'n_estimators': [50, 100, 200]
<br>}
<br>
<br>Make a 3D plot showing the score as function of MSE and size of the model
</font>

<br>Unfortunately, the code created by ChatGPT raised errors 
<br>saying that some parameters were missing.
<br>After several attempts I had to dive into the code and fix it myself.
<br>I couldn't figure out how best to use GridSearchCV with make_scorer,
<br>so I simply wrote a for-loop myself.
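
One possible reason the `make_scorer` route is awkward here: `make_scorer` wraps a metric of the form `metric(y_true, y_pred)`, so the metric never sees the fitted model and cannot measure its size. scikit-learn also accepts a plain callable with the signature `scorer(estimator, X, y)` as the `scoring` argument, and that callable does receive the model. The sketch below is one way this might look, not the approach used in this notebook. The name `mse_and_size_score`, the weight `KB_PENALTY`, and the small grid are assumptions made for illustration, and `GradientBoostingRegressor` stands in for `XGBRegressor` so that the sketch depends on scikit-learn only.

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# Hypothetical trade-off: how many units of MSE one extra KB of model costs.
KB_PENALTY = 0.05


def mse_and_size_score(estimator, X, y):
    """Callable scorer. GridSearchCV maximises the score, so costs are negated."""
    mse = mean_squared_error(y, estimator.predict(X))
    size_kb = len(pickle.dumps(estimator)) / 1024.0  # pickled size in memory
    return -(mse + KB_PENALTY * size_kb)


# Small data set and grid so the sketch runs quickly.
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=42)
small_grid = {"max_depth": [2, 3], "n_estimators": [20, 40]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),  # stand-in for XGBRegressor
    param_grid=small_grid,
    scoring=mse_and_size_score,  # called as scorer(estimator, X, y)
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The weight decides everything in a combined score like this, so it would need to be chosen to reflect how much accuracy a kilobyte is actually worth in the application.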

In [1]:
import os, pickle
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from joblib import dump
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [2]:
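# Generate synthetic data with 5 numeric features.
# make_regression already builds y from the features; the loop below
# rescales each feature and adds a further linear dependence on it.
X, y = make_regression(n_samples=10000, n_features=5, noise=0.1, random_state=42)
rng = np.random.default_rng(42)  # seeded so the scale factors are reproducible
for i in range(X.shape[1]):
    X[:, i] *= rng.uniform(0.5, 1.5)
    y += 0.2 * X[:, i]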
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
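
A quick way to confirm that the features really are correlated with the target is to compute the Pearson correlation of each column with `y`. The sketch below is an illustrative check, not part of the original notebook. It regenerates the base data with the same `make_regression` call; with 5 features and the default `n_informative`, every feature contributes to `y`.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10000, n_features=5, noise=0.1, random_state=42)

# Pearson correlation of each feature column with the target.
corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
for i, c in enumerate(corr):
    print(f"feature {i}: corr with y = {c:+.3f}")
```

Rescaling a feature by a positive constant, as the loop above does, leaves its Pearson correlation with a fixed target unchanged, so most of the correlation comes from `make_regression` itself.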

In [5]:
# define the grid of parameters we will be testing
# Note: in the original testing I had more values,
# but I removed those that gave poor accuracy
param_grid = {
    'max_depth'    : [1,2,3,4,5,6], # default 6, removed 7,10
    'learning_rate': [0.2, 0.1],    # removed 0.01
    'subsample'    : [0.7],         # default 1, removed 0.5, 1
    'n_estimators' : [200,300]      # default 100, removed 50, 100
}

In [6]:
from itertools import product

results = {}
# Same iteration order as four nested loops, without the deep indentation
grid = product(param_grid['max_depth'], param_grid['learning_rate'],
               param_grid['subsample'], param_grid['n_estimators'])
for ii, (max_depth, learning_rate, subsample, n_estimators) in enumerate(grid, start=1):
    xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
                                 random_state=42,
                                 max_depth=max_depth,
                                 learning_rate=learning_rate,
                                 subsample=subsample,
                                 n_estimators=n_estimators)
    # The key encodes the parameter combination as "name:value,name:value,..."
    ss = f"max_depth:{max_depth:2d},learn_rate:{learning_rate:5.3f}"
    ss += f",subsample:{subsample:3.1f},n_estim:{n_estimators:3d}"
    xgb_model.fit(X_train, y_train)
    y_pred = xgb_model.predict(X_test)
    mse = round(mean_squared_error(y_test, y_pred), 3)
    r2 = round(r2_score(y_test, y_pred), 3)
    # Pickle the fitted model to measure its size on disk
    fname = 'junk.pkl'
    with open(fname, 'wb') as f:
        pickle.dump(xgb_model, f)
    fsize_kb = round(os.path.getsize(fname) / 1024.0)
    results[ss] = {'mse': mse, 'r2': r2, 'kb': fsize_kb}
    print(f"{ii:3} : {ss:25}: {str(results[ss]):30}")

  1 : max_depth: 1,learn_rate:0.200,subsample:0.7,n_estim:200: {'mse': 89.556, 'r2': 0.995, 'kb': 153}
  2 : max_depth: 1,learn_rate:0.200,subsample:0.7,n_estim:300: {'mse': 73.679, 'r2': 0.996, 'kb': 227}
  3 : max_depth: 1,learn_rate:0.100,subsample:0.7,n_estim:200: {'mse': 192.834, 'r2': 0.99, 'kb': 153}
  4 : max_depth: 1,learn_rate:0.100,subsample:0.7,n_estim:300: {'mse': 73.994, 'r2': 0.996, 'kb': 227}
  5 : max_depth: 2,learn_rate:0.200,subsample:0.7,n_estim:200: {'mse': 88.533, 'r2': 0.996, 'kb': 179}
  6 : max_depth: 2,learn_rate:0.200,subsample:0.7,n_estim:300: {'mse': 70.998, 'r2': 0.996, 'kb': 266}
  7 : max_depth: 2,learn_rate:0.100,subsample:0.7,n_estim:200: {'mse': 60.448, 'r2': 0.997, 'kb': 179}
  8 : max_depth: 2,learn_rate:0.100,subsample:0.7,n_estim:300: {'mse': 48.322, 'r2': 0.998, 'kb': 265}
  9 : max_depth: 3,learn_rate:0.200,subsample:0.7,n_estim:200: {'mse': 68.575, 'r2': 0.997, 'kb': 229}
 10 : max_depth: 3,learn_rate:0.200,subsample:0.7,n_estim:300: {'mse': 55
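
The loop measures model size by writing `junk.pkl` and reading the file size back. The same number of bytes can be obtained in memory with `pickle.dumps`, which avoids the temporary file. The helper name `pickled_size_kb` below is an assumption for illustration, and a plain dictionary stands in for a fitted model.

```python
import pickle


def pickled_size_kb(obj):
    """Size of obj's pickle in KB, measured in memory (no file is written)."""
    return len(pickle.dumps(obj)) / 1024.0


# Any picklable object works; a fitted model would be passed the same way.
example = {"weights": list(range(1000))}
print(f"{pickled_size_kb(example):.1f} KB")
```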

In [7]:
# Originally I saw mse in the range 48 to 18,000
# Now the range is smaller: 44 .. 101
# because we removed some values from the grid
# Let us only consider entries where mse < 48

# --------------------------------------------
def parse_key(ss):
    """ convenient to filter by some grid parameters"""
    dd = {}
    for part in ss.split(","):
        kk,vv = part.split(":")
        dd[kk] = float(vv)
    return dd

# --------------------------------------------
res2 = {}
for k, v in results.items():
    if v['mse'] >= 48:
        continue
    res2[k] = v
    print(f"{k} => {v}")

max_depth: 3,learn_rate:0.100,subsample:0.7,n_estim:300 => {'mse': 41.587, 'r2': 0.998, 'kb': 340}
max_depth: 4,learn_rate:0.100,subsample:0.7,n_estim:300 => {'mse': 45.641, 'r2': 0.998, 'kb': 476}
max_depth: 5,learn_rate:0.100,subsample:0.7,n_estim:300 => {'mse': 47.297, 'r2': 0.998, 'kb': 719}
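
`parse_key` is defined above but not called in this cell. A minimal sketch of how it might be used to filter by a grid parameter follows. It re-declares an equivalent `parse_key` so the block is self-contained, and `res2` is rebuilt from the three rows printed above.

```python
def parse_key(ss):
    """Parse 'name:value,name:value' into a dict of floats."""
    dd = {}
    for part in ss.split(","):
        kk, vv = part.split(":")
        dd[kk] = float(vv)
    return dd


# The three rows printed above, copied verbatim.
res2 = {
    "max_depth: 3,learn_rate:0.100,subsample:0.7,n_estim:300": {"mse": 41.587, "r2": 0.998, "kb": 340},
    "max_depth: 4,learn_rate:0.100,subsample:0.7,n_estim:300": {"mse": 45.641, "r2": 0.998, "kb": 476},
    "max_depth: 5,learn_rate:0.100,subsample:0.7,n_estim:300": {"mse": 47.297, "r2": 0.998, "kb": 719},
}

# Keep only entries with max_depth <= 4, then pick the smallest model file.
shallow = {k: v for k, v in res2.items() if parse_key(k)["max_depth"] <= 4}
smallest = min(shallow, key=lambda k: shallow[k]["kb"])
print(smallest, "=>", shallow[smallest])
```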


In [6]:
# We have a clear winner:
#     max_depth: 3,learn_rate:0.100,subsample:0.7,n_estim:300 
#       => {'mse': 41.587, 'r2': 0.998, 'kb': 340}
# Interestingly, the best accuracy comes 
# from very shallow trees (depth = 3), 
# provided we use a lot of them (300).
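
The original prompt asked for a 3D plot of the score as a function of MSE and model size, which this notebook does not draw. The sketch below shows one way it might be done for the three filtered rows. The combined score (a weighted sum of min-max normalised MSE and size, flipped so that higher is better) and the weights `W_MSE` and `W_KB` are assumptions for illustration, not something defined earlier.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# MSE and size (KB) of the three filtered rows, copied from the output above.
mse = np.array([41.587, 45.641, 47.297])
kb = np.array([340.0, 476.0, 719.0])
labels = ["depth 3", "depth 4", "depth 5"]


def min_max(a):
    """Rescale an array to the range [0, 1]."""
    return (a - a.min()) / (a.max() - a.min())


W_MSE, W_KB = 0.7, 0.3  # hypothetical weights, must sum to 1
score = 1.0 - (W_MSE * min_max(mse) + W_KB * min_max(kb))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(mse, kb, score)
for x, s, z, lab in zip(mse, kb, score, labels):
    ax.text(x, s, z, lab)
ax.set_xlabel("MSE")
ax.set_ylabel("model size (KB)")
ax.set_zlabel("score")
fig.savefig("score_3d.png")

best = labels[int(np.argmax(score))]
print("best:", best)
```

With these three rows the depth-3 model has both the lowest MSE and the smallest file, so it comes out on top for any non-negative choice of weights.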