For the following examples, we'll use the California housing dataset from `sklearn`.


In [1]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame
df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


In [2]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2)

## fold_train

Splits the data into folds and trains one model per fold using cross-validation. Uses `KFold` for regression tasks and `StratifiedKFold` for classification tasks. Returns a list of trained models.


In [3]:
from ml_qol import fold_train

lgb_models = fold_train(
    model_type="lightgbm",
    task="regression",
    params={
        "iterations": 800,
        "learning_rate": 1e-2,
        "loss_function": "RMSE",
        "depth": 2,
        "metric": "MAE",
    },
    data=train_df,
    target_col="MedHouseVal",
    n_splits=5,
)

Running Fold: 1
Training until validation scores don't improve for 500 rounds




[100]	training's l1: 0.674142	valid_1's l1: 0.668539
[200]	training's l1: 0.586044	valid_1's l1: 0.584105
[300]	training's l1: 0.52613	valid_1's l1: 0.52601
[400]	training's l1: 0.489205	valid_1's l1: 0.490072
[500]	training's l1: 0.45812	valid_1's l1: 0.459927
[600]	training's l1: 0.43817	valid_1's l1: 0.441042
[700]	training's l1: 0.424538	valid_1's l1: 0.428429
[800]	training's l1: 0.415846	valid_1's l1: 0.420873
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.415846	valid_1's l1: 0.420873

Validation mse score: 0.3566053227669459
Running Fold: 2
Training until validation scores don't improve for 500 rounds




[100]	training's l1: 0.672634	valid_1's l1: 0.674826
[200]	training's l1: 0.58565	valid_1's l1: 0.589433
[300]	training's l1: 0.52589	valid_1's l1: 0.532421
[400]	training's l1: 0.488832	valid_1's l1: 0.496605
[500]	training's l1: 0.45989	valid_1's l1: 0.468478
[600]	training's l1: 0.438953	valid_1's l1: 0.44872
[700]	training's l1: 0.426858	valid_1's l1: 0.437063
[800]	training's l1: 0.415194	valid_1's l1: 0.425563
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.415194	valid_1's l1: 0.425563

Validation mse score: 0.3622104013404486
Running Fold: 3
Training until validation scores don't improve for 500 rounds




[100]	training's l1: 0.67848	valid_1's l1: 0.653298
[200]	training's l1: 0.591192	valid_1's l1: 0.568254
[300]	training's l1: 0.530675	valid_1's l1: 0.509478
[400]	training's l1: 0.493889	valid_1's l1: 0.476
[500]	training's l1: 0.462657	valid_1's l1: 0.447665
[600]	training's l1: 0.441545	valid_1's l1: 0.429358
[700]	training's l1: 0.427501	valid_1's l1: 0.417823
[800]	training's l1: 0.418181	valid_1's l1: 0.410597
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.418181	valid_1's l1: 0.410597

Validation mse score: 0.32800415298987656
Running Fold: 4
Training until validation scores don't improve for 500 rounds




[100]	training's l1: 0.663942	valid_1's l1: 0.69604
[200]	training's l1: 0.577814	valid_1's l1: 0.610008
[300]	training's l1: 0.52018	valid_1's l1: 0.551574
[400]	training's l1: 0.483352	valid_1's l1: 0.513899
[500]	training's l1: 0.454441	valid_1's l1: 0.4837
[600]	training's l1: 0.434766	valid_1's l1: 0.463498
[700]	training's l1: 0.421228	valid_1's l1: 0.448891
[800]	training's l1: 0.412093	valid_1's l1: 0.439006
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.412093	valid_1's l1: 0.439006

Validation mse score: 0.3828028114136488
Running Fold: 5




Training until validation scores don't improve for 500 rounds
[100]	training's l1: 0.671984	valid_1's l1: 0.678291
[200]	training's l1: 0.585321	valid_1's l1: 0.589962
[300]	training's l1: 0.526805	valid_1's l1: 0.529612
[400]	training's l1: 0.490574	valid_1's l1: 0.492556
[500]	training's l1: 0.46278	valid_1's l1: 0.463812
[600]	training's l1: 0.442695	valid_1's l1: 0.442099
[700]	training's l1: 0.428902	valid_1's l1: 0.427197
[800]	training's l1: 0.418949	valid_1's l1: 0.416673
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.418949	valid_1's l1: 0.416673

Validation mse score: 0.3344522589861587

Mean validation score: 0.3528149894994157


In [4]:
cb_models = fold_train(
    model_type="catboost",
    task="regression",
    params={
        "iterations": 800,
        "learning_rate": 1e-2,
        "loss_function": "RMSE",
        "depth": 2,
        "metric": "MAE",
    },
    data=train_df,
    target_col="MedHouseVal",
    n_splits=5,
)

Running Fold: 1
0:	learn: 0.9044505	test: 0.9044505	test1: 0.8970182	best: 0.8970182 (0)	total: 53.1ms	remaining: 42.4s
100:	learn: 0.6893665	test: 0.6893665	test1: 0.6841205	best: 0.6841205 (100)	total: 334ms	remaining: 2.31s
200:	learn: 0.6060359	test: 0.6060359	test1: 0.6044232	best: 0.6044232 (200)	total: 603ms	remaining: 1.79s
300:	learn: 0.5547067	test: 0.5547067	test1: 0.5550548	best: 0.5550548 (300)	total: 807ms	remaining: 1.34s
400:	learn: 0.5188126	test: 0.5188126	test1: 0.5203847	best: 0.5203847 (400)	total: 1.08s	remaining: 1.07s
500:	learn: 0.4925459	test: 0.4925459	test1: 0.4944781	best: 0.4944781 (500)	total: 1.27s	remaining: 761ms
600:	learn: 0.4741671	test: 0.4741671	test1: 0.4766608	best: 0.4766608 (600)	total: 1.53s	remaining: 507ms
700:	learn: 0.4596507	test: 0.4596507	test1: 0.4628457	best: 0.4628457 (700)	total: 1.77s	remaining: 250ms
799:	learn: 0.4481242	test: 0.4481242	test1: 0.4516377	best: 0.4516377 (799)	total: 2.03s	remaining: 0us

bestTest = 0.4516377423
...

#### `fold_train` parameters

- **model_type**: Currently supports `catboost`, `lightgbm`, and `xgboost`. Defaults to `catboost`.
- **task**: 'regression' or 'classification'. Defaults to 'regression'.
- **params**: dict of hyperparameters for training. Default params: `{'iterations': 1000, 'learning_rate': 0.01, 'loss_function': 'RMSE', 'device': 'CPU'}`
- **data**: Training data as a pandas.DataFrame.
- **target_col**: Name of the target column in `data`.
- **n_splits**: Number of folds to create with `KFold` (or `StratifiedKFold` for classification tasks).
- **metric**: Indicates how to calculate final validation score. Supported values: "mse", "mae", "accuracy", "f1"
- **verbose**: Controls frequency of evaluation logs. Defaults to 100
- **early_stop**: Number of iterations after which training will stop if no improvement is found in validation score. Defaults to 500
- **random_state**: Seed used to control the randomness in model training and data splitting. Defaults to 42 for reproducible results. Pass `None` to achieve non-deterministic behavior.
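
To make the parameters above more concrete, here is a minimal sketch of the kind of cross-validation loop that `fold_train` describes, written with scikit-learn only. It is not the `ml_qol` implementation: the helper name `fold_train_sketch`, the `LinearRegression` and `LogisticRegression` stand-in models, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import KFold, StratifiedKFold


def fold_train_sketch(data, target_col, task="regression", n_splits=5, random_state=42):
    """Hypothetical helper: train one model per fold and return all of them."""
    X = data.drop(columns=target_col)
    y = data[target_col]

    if task == "classification":
        # StratifiedKFold keeps the class balance similar in every fold.
        splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        make_model = lambda: LogisticRegression(max_iter=1000)
        score = accuracy_score
    else:
        splitter = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        make_model = LinearRegression
        score = mean_squared_error

    models, scores = [], []
    for fold, (train_idx, valid_idx) in enumerate(splitter.split(X, y), start=1):
        model = make_model()
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        fold_score = score(y.iloc[valid_idx], model.predict(X.iloc[valid_idx]))
        print(f"Fold {fold} validation score: {fold_score:.4f}")
        models.append(model)
        scores.append(fold_score)

    print(f"Mean validation score: {np.mean(scores):.4f}")
    return models


# Synthetic data so the sketch runs on its own (not the housing dataset).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
demo_df["target"] = 2.0 * demo_df["a"] - demo_df["b"] + rng.normal(scale=0.1, size=200)

demo_models = fold_train_sketch(demo_df, target_col="target", n_splits=5)
```

The sketch returns one fitted model per fold, which mirrors the "list of trained models" that `fold_train` is documented to return. Early stopping, logging frequency, and model-specific parameter handling are left out.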


## get_fold_preds

Returns predictions averaged across a list of trained models.


In [5]:
from ml_qol import get_fold_preds
from sklearn.metrics import mean_absolute_error

X_test = test_df.drop(columns="MedHouseVal")

lgb_preds = get_fold_preds(models=lgb_models, test_df=X_test)
lgb_preds

array([3.20217756, 2.88144391, 2.56642513, ..., 1.74657   , 1.4189248 ,
       0.76742656])

In [6]:
mean_absolute_error(test_df["MedHouseVal"], lgb_preds)

0.4296036277382521

In [7]:
cb_preds = get_fold_preds(models=cb_models, test_df=X_test)
mean_absolute_error(test_df["MedHouseVal"], cb_preds)

0.46027083720407913

#### `get_fold_preds` parameters

- **models**: List of trained models
- **test_df**: Inference dataset for generating predictions
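
As a rough illustration of what averaging fold predictions means, the sketch below stacks the output of each model's `predict` call and takes the row-wise mean with NumPy. The `ConstantModel` class and the `average_fold_preds` helper are hypothetical names used only here; how `get_fold_preds` handles this internally (for example, for classification tasks) is not shown in this document.

```python
import numpy as np


class ConstantModel:
    """Hypothetical stand-in for a trained model; always predicts `value`."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value, dtype=float)


def average_fold_preds(models, X):
    """Average the predictions of several models, row by row."""
    # Shape (n_models, n_rows): one row of predictions per model.
    all_preds = np.stack([model.predict(X) for model in models])
    return all_preds.mean(axis=0)


demo_models = [ConstantModel(1.0), ConstantModel(2.0), ConstantModel(3.0)]
demo_X = np.zeros((4, 2))  # four rows of dummy features
avg_preds = average_fold_preds(demo_models, demo_X)
```

Any object with a `predict` method that returns one value per row would work in place of `ConstantModel`.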


## get_ensemble_preds

Takes a list of predictions from different models and computes their weighted average to produce a single ensembled prediction.


In [8]:
from ml_qol import get_ensemble_preds

ensemble_preds = get_ensemble_preds(
    preds_list=[lgb_preds, cb_preds], weights=[0.7, 0.3]
)
ensemble_preds

array([3.14995385, 2.9287071 , 2.54459153, ..., 1.76655099, 1.42649875,
       0.78752358])

#### `get_ensemble_preds` parameters

- **preds_list**: List of predictions from various models.
- **weights**: List of weights used to calculate average predictions. Default is None, which weighs all predictions equally.
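
The weighted average itself can be sketched in a few lines of NumPy. This is an illustration of the arithmetic, not the `ml_qol` source: `weighted_ensemble` is a hypothetical helper, and it relies on `np.average`, which divides by the sum of the weights. Whether `get_ensemble_preds` normalizes weights the same way is an assumption.

```python
import numpy as np


def weighted_ensemble(preds_list, weights=None):
    """Combine several prediction arrays into one weighted average."""
    # Shape (n_models, n_rows): one row of predictions per model.
    preds = np.stack([np.asarray(p, dtype=float) for p in preds_list])
    # With weights=None, np.average weighs every model equally.
    return np.average(preds, axis=0, weights=weights)


preds_a = np.array([1.0, 2.0, 3.0])
preds_b = np.array([3.0, 4.0, 5.0])

blended = weighted_ensemble([preds_a, preds_b], weights=[0.7, 0.3])
equal = weighted_ensemble([preds_a, preds_b])
```

For the first row, the blended value is 0.7 × 1.0 + 0.3 × 3.0 = 1.6, while the equally weighted value is (1.0 + 3.0) / 2 = 2.0.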
