### ChatGPT - compare XGBoost, CatBoost and LightGBM - size of the model

```
ChatGPT prompt:
    
Please demonstrate methods of reducing the model size
for 3 models:

 - XGBoost regressor (with sklearn interface)
 - CatBoost
 - LightGBM

First generate synthetic data having 10 numeric
 feature columns and one numeric target. 

Generate 100,000 rows.

Add correlation between target and features 
so that the model makes sense.

Then train all three models.

Calculate and print models' quality metrics (MSE and R2).
Save the models to files, and print files' sizes in KBytes.

Now apply the following methods of decreasing the model size
sequentially - and demonstrate decreasing of the model size:

 - reduce number of trees
 - decrease max_depth parameter
 - increase learning_rate
 - use smaller data types (float32 instead of float64)
 - compress the model file using gzip

Do this for all three models.

Calculate and print the errors of the reduced model and final compressed model for all three models. 

Then in the summary table provide results for all 3 models 
and for all 3 types (original, reduced, compressed).

The results should include file size, Error (MSE), and R2
```

In [1]:
import numpy as np
import xgboost as xgb
import catboost as cb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import os
import pickle
import gzip
import pandas as pd

In [2]:
# -----------------------------------------------------
# 1. Generate synthetic data
np.random.seed(42)
X = np.random.rand(100000, 10)
y = np.sum(X, axis=1) + 0.1 * np.random.randn(100000)  # Add correlation between target and features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
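
The prompt asks for smaller data types (float32 instead of float64), but the cells in this notebook never apply that step. A minimal sketch of the cast is shown here on a small stand-in array; `X_demo` and `X_demo32` are hypothetical names, not variables from the notebook. It only shows that the feature matrix shrinks in memory. Whether the saved model file gets smaller depends on the library and is not measured here.

```python
import numpy as np

# Hypothetical stand-in for the notebook's feature matrix (same shape of columns, fewer rows)
rng = np.random.default_rng(42)
X_demo = rng.random((1000, 10))           # NumPy generates float64 by default

# Cast to float32: half the bytes per value
X_demo32 = X_demo.astype(np.float32)

print(X_demo.dtype, X_demo.nbytes / 1024, "KB")
print(X_demo32.dtype, X_demo32.nbytes / 1024, "KB")

# Precision lost by the cast, for values in [0, 1)
max_abs_diff = float(np.max(np.abs(X_demo - X_demo32)))
print("max abs difference:", max_abs_diff)
```

In the notebook itself the same cast would be applied to `X_train` and `X_test` before fitting, under the assumption that the small loss of precision is acceptable for the task.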

In [3]:
# -----------------------------------------------------
# 2. Train the models using XGBoost, CatBoost, and LightGBM regressors with default and reduced size versions
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)

In [4]:
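# Reduced XGBoost: fewer trees (50 vs. default 100) and shallower trees (3 vs. default 6).
# Note: learning_rate=0.3 is already the XGBoost default, so it is not actually increased here.
xgb_reduced_model = xgb.XGBRegressor(random_state=42, n_estimators=50, max_depth=3, learning_rate=0.3)
xgb_reduced_model.fit(X_train, y_train)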

In [5]:
cb_model = cb.CatBoostRegressor(random_state=42, verbose=0)
cb_model.fit(X_train, y_train)

<catboost.core.CatBoostRegressor at 0x7f98f3ea51f0>

In [6]:
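# Reduced CatBoost: fewer iterations (200 vs. default 1000) and a fixed learning_rate of 0.3.
# Note: depth=6 is the CatBoost default, so tree depth is unchanged here.
cb_reduced_model = cb.CatBoostRegressor(random_state=42, verbose=0, iterations=200, depth=6, learning_rate=0.3)
cb_reduced_model.fit(X_train, y_train)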

<catboost.core.CatBoostRegressor at 0x7f98f34effa0>

In [7]:
lgb_model = lgb.LGBMRegressor(random_state=42)
lgb_model.fit(X_train, y_train)

In [8]:
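# Reduced LightGBM: fewer trees (50 vs. default 100), max_depth capped at 3 (default is unlimited),
# and a higher learning_rate (0.3 vs. default 0.1).
lgb_reduced_model = lgb.LGBMRegressor(random_state=42, n_estimators=50, max_depth=3, learning_rate=0.3)
lgb_reduced_model.fit(X_train, y_train)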

In [9]:
# 3. Calculate and print the errors of the models
models = {'XGBoost': xgb_model, 'XGBoost Reduced': xgb_reduced_model, 
          'CatBoost': cb_model, 'CatBoost Reduced': cb_reduced_model, 
          'LightGBM': lgb_model, 'LightGBM Reduced': lgb_reduced_model}
errors = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    errors.append((name, round(mse,3), round(r2,3)))
errors

[('XGBoost', 0.036, 0.957),
 ('XGBoost Reduced', 0.035, 0.958),
 ('CatBoost', 0.011, 0.987),
 ('CatBoost Reduced', 0.011, 0.987),
 ('LightGBM', 0.021, 0.974),
 ('LightGBM Reduced', 0.036, 0.957)]
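
To put these scores in context, the best achievable MSE and R2 can be worked out from the data-generation cell. This is a rough sketch, assuming the features are independent Uniform(0, 1) draws (variance 1/12 each) and the noise is Gaussian with standard deviation 0.1, as in that cell. The variable names are illustrative.

```python
# Assumption: 10 i.i.d. Uniform(0, 1) features summed, plus N(0, 0.1^2) noise,
# matching y = np.sum(X, axis=1) + 0.1 * np.random.randn(...)
n_features = 10
noise_std = 0.1

signal_var = n_features * (1.0 / 12.0)   # variance of the sum of 10 uniforms
noise_var = noise_std ** 2               # irreducible error

best_mse = noise_var
best_r2 = 1.0 - noise_var / (signal_var + noise_var)

print(f"best achievable MSE ~ {best_mse:.3f}")   # 0.010
print(f"best achievable R2  ~ {best_r2:.3f}")    # 0.988
```

Against this ceiling, the CatBoost scores above (MSE 0.011, R2 0.987) are close to the noise floor, while the XGBoost and reduced LightGBM scores (MSE around 0.036) leave room for improvement.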

In [10]:
# 4. Save the models to files and get their sizes in kilobytes
model_files = {}
for name, model in models.items():
    file_name = f'{name.lower().replace(" ", "_")}_model.pkl'
    with open(file_name, 'wb') as f:
        pickle.dump(model, f)
    file_size = os.path.getsize(file_name) / 1024
    model_files[name] = (file_name, file_size)

In [11]:
# Compress the models
compressed_files = {}
for name, (file_name, file_size) in model_files.items():
    compressed_file_name = f'{file_name[:-4]}_compressed.pkl.gz'
    with open(file_name, 'rb') as f_in, gzip.open(compressed_file_name, 'wb') as f_out:
        f_out.write(f_in.read())  # copy the raw pickle bytes into the gzip stream
    compressed_file_size = os.path.getsize(compressed_file_name) / 1024
    compressed_files[name] = (compressed_file_name, compressed_file_size)
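
The summary reuses the same metrics for each compressed file, which relies on gzip being lossless. A small sketch of how that round trip could be checked is given below. It uses a plain Python object (`toy_model`, a hypothetical stand-in) instead of a fitted model so that it runs without the boosting libraries installed; with a real model, the comparison would be on predictions rather than on object equality.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: any picklable object works for this check
toy_model = {"trees": [[0.1 * i, 0.2 * i] for i in range(1000)]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl.gz")

    # Pickle straight into a gzip stream (no intermediate .pkl file)
    with gzip.open(path, "wb") as f_out:
        pickle.dump(toy_model, f_out)

    # Load it back through gzip
    with gzip.open(path, "rb") as f_in:
        restored = pickle.load(f_in)

    raw_size_kb = len(pickle.dumps(toy_model)) / 1024
    gz_size_kb = os.path.getsize(path) / 1024

round_trip_ok = restored == toy_model
print("round trip OK:", round_trip_ok)
print(f"raw pickle: {raw_size_kb:.1f} KB, gzip: {gz_size_kb:.1f} KB")
```

For the notebook's models, a comparable check would load each `*_compressed.pkl.gz` file and compare `model.predict(X_test)` with the predictions of the in-memory model, for example with `np.allclose`.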

In [12]:
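# 5. Summary table
# 'Original' is the uncompressed pickle file and 'Compressed' is its gzip copy;
# both rows describe the same model, so their metrics are identical.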
summary_data = []
for name, mse, r2 in errors:
    original_file_name, original_file_size = model_files[name]
    compressed_file_name, compressed_file_size = compressed_files[name]
    summary_data.append({'Model': name, 
                         'Type': 'Original', 
                         'File Size (KB)': round(original_file_size, 3), 
                         'Mean Squared Error': round(mse, 3), 
                         'R2 Score': round(r2, 3)})
    summary_data.append({'Model': name, 
                         'Type': 'Compressed', 
                         'File Size (KB)': round(compressed_file_size, 3), 
                         'Mean Squared Error': round(mse, 3), 
                         'R2 Score': round(r2, 3)})

summary = pd.DataFrame(summary_data)
print(summary)

               Model        Type  File Size (KB)  Mean Squared Error  R2 Score
0            XGBoost    Original         470.762               0.036     0.957
1            XGBoost  Compressed         162.501               0.036     0.957
2    XGBoost Reduced    Original          61.777               0.035     0.958
3    XGBoost Reduced  Compressed          12.514               0.035     0.958
4           CatBoost    Original        1068.744               0.011     0.987
5           CatBoost  Compressed         434.369               0.011     0.987
6   CatBoost Reduced    Original         220.871               0.011     0.987
7   CatBoost Reduced  Compressed          62.790               0.011     0.987
8           LightGBM    Original         297.872               0.021     0.974
9           LightGBM  Compressed         110.797               0.021     0.974
10  LightGBM Reduced    Original          46.987               0.036     0.957
11  LightGBM Reduced  Compressed          17.346    