For the following examples, we'll use the California housing dataset from `sklearn`.


In [1]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame
df

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


In [3]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2)

## fold_train

Splits the data into folds and trains one model per fold using cross-validation. It uses `KFold` for regression tasks and `StratifiedKFold` for classification tasks, and returns a list of the trained models.


In [8]:
from ml_qol import fold_train

lgb_models = fold_train(
    model_type="lightgbm",
    task="regression",
    params={
        "iterations": 800,
        "learning_rate": 1e-2,
        "loss_function": "RMSE",
        "depth": 2,
        "metric": "MAE",
    },
    data=train_df,
    target_col="MedHouseVal",
    n_splits=5,
)

Fold 
Training until validation scores don't improve for 500 rounds
[100]	training's l1: 0.672997	valid_1's l1: 0.678746
[200]	training's l1: 0.584289	valid_1's l1: 0.588444
[300]	training's l1: 0.52357	valid_1's l1: 0.528043
[400]	training's l1: 0.486015	valid_1's l1: 0.49134
[500]	training's l1: 0.456554	valid_1's l1: 0.462282
[600]	training's l1: 0.437047	valid_1's l1: 0.443281
[700]	training's l1: 0.424501	valid_1's l1: 0.431399
[800]	training's l1: 0.413642	valid_1's l1: 0.420925
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.413642	valid_1's l1: 0.420925

Validation mse score: 0.3511182547265784
Fold 
Training until validation scores don't improve for 500 rounds
[100]	training's l1: 0.676602	valid_1's l1: 0.666797
[200]	training's l1: 0.587326	valid_1's l1: 0.579561
[300]	training's l1: 0.527449	valid_1's l1: 0.518256
[400]	training's l1: 0.490098	valid_1's l1: 0.480813
[500]	training's l1: 0.458508	valid_1's l1: 0.448013
[600]	training's l1: 0.43889	valid_1's l1: 0.428491
[700]	training's l1: 0.425931	valid_1's l1: 0.415564
[800]	training's l1: 0.416632	valid_1's l1: 0.406679
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.416632	valid_1's l1: 0.406679

Validation mse score: 0.32598028776432264
Fold 
Training until validation scores don't improve for 500 rounds
[100]	training's l1: 0.670354	valid_1's l1: 0.683064
[200]	training's l1: 0.581053	valid_1's l1: 0.596209
[300]	training's l1: 0.520515	valid_1's l1: 0.535889
[400]	training's l1: 0.483566	valid_1's l1: 0.498868
[500]	training's l1: 0.453156	valid_1's l1: 0.468824
[600]	training's l1: 0.433682	valid_1's l1: 0.449423
[700]	training's l1: 0.421524	valid_1's l1: 0.436952
[800]	training's l1: 0.41179	valid_1's l1: 0.427125
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.41179	valid_1's l1: 0.427125

Validation mse score: 0.3628535062224218
Fold 
Training until validation scores don't improve for 500 rounds
[100]	training's l1: 0.6758	valid_1's l1: 0.661552
[200]	training's l1: 0.585931	valid_1's l1: 0.577723
[300]	training's l1: 0.525666	valid_1's l1: 0.520106
[400]	training's l1: 0.488014	valid_1's l1: 0.483969
[500]	training's l1: 0.455755	valid_1's l1: 0.45318
[600]	training's l1: 0.435689	valid_1's l1: 0.433795
[700]	training's l1: 0.423786	valid_1's l1: 0.42263
[800]	training's l1: 0.414811	valid_1's l1: 0.414494
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.414811	valid_1's l1: 0.414494

Validation mse score: 0.33939312604837846
Fold 
Training until validation scores don't improve for 500 rounds
[100]	training's l1: 0.670694	valid_1's l1: 0.686993
[200]	training's l1: 0.58226	valid_1's l1: 0.596817
[300]	training's l1: 0.52089	valid_1's l1: 0.537093
[400]	training's l1: 0.483897	valid_1's l1: 0.500279
[500]	training's l1: 0.455187	valid_1's l1: 0.472423
[600]	training's l1: 0.432417	valid_1's l1: 0.451166
[700]	training's l1: 0.419614	valid_1's l1: 0.438855
[800]	training's l1: 0.40998	valid_1's l1: 0.429666
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.40998	valid_1's l1: 0.429666

Validation mse score: 0.3638008527648179

Mean validation score: 0.34862920550530385


In [7]:
cb_models = fold_train(
    model_type="catboost",
    task="regression",
    params={
        "iterations": 800,
        "learning_rate": 1e-2,
        "loss_function": "RMSE",
        "depth": 2,
        "metric": "MAE",
    },
    data=train_df,
    target_col="MedHouseVal",
    n_splits=5,
)

Fold 
0:	learn: 0.9062746	test: 0.9062746	test1: 0.9210785	best: 0.9210785 (0)	total: 68.1ms	remaining: 54.4s
100:	learn: 0.6884942	test: 0.6884942	test1: 0.6934857	best: 0.6934857 (100)	total: 292ms	remaining: 2.02s
200:	learn: 0.6048796	test: 0.6048796	test1: 0.6064947	best: 0.6064947 (200)	total: 544ms	remaining: 1.62s
300:	learn: 0.5518335	test: 0.5518335	test1: 0.5525376	best: 0.5525376 (300)	total: 802ms	remaining: 1.33s
400:	learn: 0.5162364	test: 0.5162364	test1: 0.5171875	best: 0.5171875 (400)	total: 1.08s	remaining: 1.07s
500:	learn: 0.4901473	test: 0.4901473	test1: 0.4920071	best: 0.4920071 (500)	total: 1.27s	remaining: 758ms
600:	learn: 0.4717363	test: 0.4717363	test1: 0.4747867	best: 0.4747867 (600)	total: 1.54s	remaining: 511ms
700:	learn: 0.4574238	test: 0.4574238	test1: 0.4607660	best: 0.4607660 (700)	total: 1.79s	remaining: 253ms
799:	learn: 0.4454668	test: 0.4454668	test1: 0.4492614	best: 0.4492614 (799)	total: 2.03s	remaining: 0us

bestTest = 0.449261383
bestIteratio

#### `fold_train` parameters

- **model_type**: Currently supports `catboost`, `lightgbm`, and `xgboost`. Defaults to `catboost`
- **task**: 'regression' or 'classification'. Defaults to 'regression'
- **params**: dict of hyperparameters for training. Default params: `{'iterations': 1000, 'learning_rate': 0.01, 'loss_function': 'RMSE', 'device': 'CPU'}`
- **data**: Training data as a pandas.DataFrame.
- **target_col**: Name of target column in data
- **n_splits**: Number of splits to perform with KFold
- **metric**: Indicates how to calculate final validation score. Supported values: "mse", "mae", "accuracy", "f1"
- **verbose**: Controls frequency of evaluation logs. Defaults to 100
- **early_stop**: Number of iterations after which training will stop if no improvement is found in validation score. Defaults to 500
- **random_state**: Seed used to control the randomness in model training and data splitting. Defaults to 42 for reproducible results. Pass `None` to achieve non-deterministic behavior.
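
The fold-wise procedure described above can be sketched with plain scikit-learn. This is a minimal illustration, not the `ml_qol` implementation: the helper name `sketch_fold_train`, the use of linear models, and the choice to shuffle the folds are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import KFold, StratifiedKFold


def sketch_fold_train(X, y, task="regression", n_splits=5, random_state=42):
    """Hypothetical helper: train one model per fold and return them all."""
    if task == "regression":
        splitter = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        make_model, score = LinearRegression, mean_squared_error
    else:
        # Stratification keeps class proportions similar in every fold.
        splitter = StratifiedKFold(
            n_splits=n_splits, shuffle=True, random_state=random_state
        )
        make_model, score = LogisticRegression, accuracy_score

    models, scores = [], []
    for fold, (train_idx, valid_idx) in enumerate(splitter.split(X, y), start=1):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        # Score each model on the fold it did not see during training.
        fold_score = score(y[valid_idx], model.predict(X[valid_idx]))
        print(f"Fold {fold} validation score: {fold_score:.4f}")
        models.append(model)
        scores.append(fold_score)

    print(f"Mean validation score: {np.mean(scores):.4f}")
    return models


# Synthetic regression data, used only to exercise the sketch.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
demo_models = sketch_fold_train(X_demo, y_demo, task="regression", n_splits=5)
```

Each fold yields its own model, so the returned list has `n_splits` entries.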


## get_fold_preds

Returns predictions averaged across a list of trained models.


In [None]:
from ml_qol import get_fold_preds
from sklearn.metrics import mean_absolute_error

X_test = test_df.drop(columns="MedHouseVal")

lgb_preds = get_fold_preds(models=lgb_models, test_df=X_test)
lgb_preds

array([1.26517062, 1.4071246 , 1.88837253, ..., 1.7273395 , 1.55077599,
       3.2716706 ])

In [None]:
mean_absolute_error(test_df["MedHouseVal"], lgb_preds)

0.43427109845185424

In [14]:
cb_preds = get_fold_preds(models=cb_models, test_df=X_test)
mean_absolute_error(test_df["MedHouseVal"], cb_preds)

0.4622639587849313

#### `get_fold_preds` parameters

- **models**: List of trained models
- **test_df**: Inference dataset for generating predictions
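
Averaging over fold models can be pictured as follows. This is a sketch under assumptions, not the library's code: `ConstantModel` and `average_fold_preds` are hypothetical names, and the only requirement placed on a model is that it exposes a `predict` method.

```python
import numpy as np


class ConstantModel:
    """Hypothetical stand-in for a trained model exposing predict()."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        # Return the same prediction for every row of X.
        return np.full(len(X), self.value, dtype=float)


def average_fold_preds(models, X):
    # Collect per-model predictions as (n_models, n_rows), then average over models.
    return np.mean([model.predict(X) for model in models], axis=0)


X_rows = np.zeros((4, 2))
fold_models = [ConstantModel(1.0), ConstantModel(2.0), ConstantModel(3.0)]
avg = average_fold_preds(fold_models, X_rows)
print(avg)  # one averaged value per input row
```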


## get_ensemble_preds

Takes a list of predictions from different models and computes their weighted average to produce a single ensembled prediction.


In [20]:
from ml_qol import get_ensemble_preds

ensemble_preds = get_ensemble_preds(
    preds_list=[lgb_preds, cb_preds], weights=[0.7, 0.3]
)
ensemble_preds

array([1.28410673, 1.37965751, 1.9073016 , ..., 1.73696613, 1.57259953,
       3.24506535])

#### `get_ensemble_preds` parameters

- **preds_list**: List of predictions from various models.
- **weights**: List of weights used to calculate average predictions. Default is None, which weighs all predictions equally.
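
A weighted average of prediction arrays can be written in a few lines of NumPy. The helper name `weighted_ensemble` is hypothetical, and whether `ml_qol` normalizes weights the way `np.average` does is an assumption of this sketch.

```python
import numpy as np


def weighted_ensemble(preds_list, weights=None):
    # weights=None gives a plain mean; otherwise np.average divides by the weight sum.
    stacked = np.asarray(preds_list, dtype=float)  # shape: (n_models, n_rows)
    return np.average(stacked, axis=0, weights=weights)


preds_a = np.array([1.0, 2.0, 3.0])
preds_b = np.array([3.0, 2.0, 1.0])

blended = weighted_ensemble([preds_a, preds_b], weights=[0.7, 0.3])
equal = weighted_ensemble([preds_a, preds_b])
print(blended)
print(equal)
```

With weights `[0.7, 0.3]`, the first element is `0.7 * 1.0 + 0.3 * 3.0 = 1.6`; with no weights, every element here is the plain mean `2.0`.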
