For the following examples, we'll use the California housing dataset from `sklearn`.


In [1]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame
df

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [2]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2)

## fold_train

Splits the data into folds and trains one model per fold using cross-validation. It uses `KFold` for regression tasks and `StratifiedKFold` for classification tasks, and returns a list of trained models.
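To make the splitting rule concrete, here is a minimal sketch of a cross-validation training loop built directly on scikit-learn. It is not `ml_qol`'s implementation: the helper names `make_splitter` and `cv_train`, the use of `shuffle=True`, and the `LinearRegression` stand-in model are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, StratifiedKFold


def make_splitter(task, n_splits=5, random_state=42):
    """Pick the splitter by task (hypothetical helper, shuffle=True is an assumption)."""
    if task == "classification":
        # StratifiedKFold keeps class proportions similar across folds.
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return KFold(n_splits=n_splits, shuffle=True, random_state=random_state)


def cv_train(X, y, splitter):
    """Train one regression model per fold; return the models and their validation MSEs."""
    models, scores = [], []
    for train_idx, valid_idx in splitter.split(X, y):
        model = LinearRegression()  # stand-in for a gradient-boosting model
        model.fit(X[train_idx], y[train_idx])
        scores.append(mean_squared_error(y[valid_idx], model.predict(X[valid_idx])))
        models.append(model)
    return models, scores


# Synthetic regression data so the sketch runs without external files.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

models, scores = cv_train(X, y, make_splitter("regression"))
print(f"Trained {len(models)} models, mean validation MSE: {np.mean(scores):.4f}")
```

Each fold's model is validated on data it never saw during fitting, so the mean of the per-fold scores gives a less optimistic estimate than scoring on the training data.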


In [3]:
from ml_qol import fold_train

lgb_models = fold_train(
    model_type="lightgbm",
    task="regression",
    params={
        "iterations": 800,
        "learning_rate": 1e-2,
        "loss_function": "RMSE",
        "depth": 2,
        "metric": "MAE",
    },
    data=train_df,
    target_col="MedHouseVal",
    n_splits=5,
)

Running Fold: 1
Training until validation scores don't improve for 500 rounds




[100]	training's l1: 0.676191	valid_1's l1: 0.679831
[200]	training's l1: 0.588918	valid_1's l1: 0.596189
[300]	training's l1: 0.527552	valid_1's l1: 0.535111
[400]	training's l1: 0.491897	valid_1's l1: 0.500029
[500]	training's l1: 0.46558	valid_1's l1: 0.473279
[600]	training's l1: 0.444124	valid_1's l1: 0.45122
[700]	training's l1: 0.431321	valid_1's l1: 0.438139
[800]	training's l1: 0.421126	valid_1's l1: 0.427398
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.421126	valid_1's l1: 0.427398

Validation mse score: 0.37117123752533593
Running Fold: 2
Training until validation scores don't improve for 500 rounds




[100]	training's l1: 0.676068	valid_1's l1: 0.686162
[200]	training's l1: 0.58856	valid_1's l1: 0.60045
[300]	training's l1: 0.529739	valid_1's l1: 0.542797
[400]	training's l1: 0.493057	valid_1's l1: 0.5059
[500]	training's l1: 0.466293	valid_1's l1: 0.478194
[600]	training's l1: 0.446882	valid_1's l1: 0.458084
[700]	training's l1: 0.433696	valid_1's l1: 0.4444
[800]	training's l1: 0.423924	valid_1's l1: 0.434505
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.423924	valid_1's l1: 0.434505

Validation mse score: 0.35955646088361237
Running Fold: 3
Training until validation scores don't improve for 500 rounds




[100]	training's l1: 0.675547	valid_1's l1: 0.680255
[200]	training's l1: 0.588743	valid_1's l1: 0.594013
[300]	training's l1: 0.529173	valid_1's l1: 0.535332
[400]	training's l1: 0.492099	valid_1's l1: 0.498682
[500]	training's l1: 0.462691	valid_1's l1: 0.469926
[600]	training's l1: 0.445093	valid_1's l1: 0.452373
[700]	training's l1: 0.431992	valid_1's l1: 0.439354
[800]	training's l1: 0.420403	valid_1's l1: 0.427983
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.420403	valid_1's l1: 0.427983

Validation mse score: 0.35285353991926877
Running Fold: 4
Training until validation scores don't improve for 500 rounds




[100]	training's l1: 0.67739	valid_1's l1: 0.671166
[200]	training's l1: 0.59004	valid_1's l1: 0.58601
[300]	training's l1: 0.529734	valid_1's l1: 0.526129
[400]	training's l1: 0.493869	valid_1's l1: 0.490634
[500]	training's l1: 0.465485	valid_1's l1: 0.463826
[600]	training's l1: 0.446551	valid_1's l1: 0.44652
[700]	training's l1: 0.433291	valid_1's l1: 0.434154
[800]	training's l1: 0.423289	valid_1's l1: 0.424845
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.423289	valid_1's l1: 0.424845

Validation mse score: 0.3616534308496843
Running Fold: 5
Training until validation scores don't improve for 500 rounds




[100]	training's l1: 0.676747	valid_1's l1: 0.67582
[200]	training's l1: 0.589625	valid_1's l1: 0.587386
[300]	training's l1: 0.529094	valid_1's l1: 0.529615
[400]	training's l1: 0.492307	valid_1's l1: 0.494435
[500]	training's l1: 0.465192	valid_1's l1: 0.469404
[600]	training's l1: 0.446076	valid_1's l1: 0.451528
[700]	training's l1: 0.43255	valid_1's l1: 0.43909
[800]	training's l1: 0.420131	valid_1's l1: 0.427421
Did not meet early stopping. Best iteration is:
[800]	training's l1: 0.420131	valid_1's l1: 0.427421

Validation mse score: 0.35992226884701684

Mean validation score: 0.36103138760498366


In [4]:
cb_models = fold_train(
    model_type="catboost",
    task="regression",
    params={
        "iterations": 800,
        "learning_rate": 1e-2,
        "loss_function": "RMSE",
        "depth": 2,
        "metric": "MAE",
    },
    data=train_df,
    target_col="MedHouseVal",
    n_splits=5,
)

Running Fold: 1
0:	learn: 0.9109550	test: 0.9109550	test1: 0.9062780	best: 0.9062780 (0)	total: 52.6ms	remaining: 42.1s
100:	learn: 0.6922582	test: 0.6922582	test1: 0.6948952	best: 0.6948952 (100)	total: 369ms	remaining: 2.55s
200:	learn: 0.6082710	test: 0.6082710	test1: 0.6150335	best: 0.6150335 (200)	total: 620ms	remaining: 1.85s
300:	learn: 0.5562799	test: 0.5562799	test1: 0.5636271	best: 0.5636271 (300)	total: 860ms	remaining: 1.43s
400:	learn: 0.5211795	test: 0.5211795	test1: 0.5277817	best: 0.5277817 (400)	total: 1.14s	remaining: 1.13s
500:	learn: 0.4956273	test: 0.4956273	test1: 0.5013471	best: 0.5013471 (500)	total: 1.38s	remaining: 826ms
600:	learn: 0.4775318	test: 0.4775318	test1: 0.4824162	best: 0.4824162 (600)	total: 1.64s	remaining: 542ms
700:	learn: 0.4638496	test: 0.4638496	test1: 0.4685158	best: 0.4685158 (700)	total: 1.86s	remaining: 263ms
799:	learn: 0.4522098	test: 0.4522098	test1: 0.4568862	best: 0.4568862 (799)	total: 2.09s	remaining: 0us

bestTest = 0.4568861678
...

#### `fold_train` parameters

- **model_type**: Model library to train with. Currently supports `catboost`, `lightgbm`, and `xgboost`. Defaults to `catboost`.
- **task**: `'regression'` or `'classification'`. Defaults to `'regression'`.
- **params**: Dict of hyperparameters for training. Defaults to `{'iterations': 1000, 'learning_rate': 0.01, 'loss_function': 'RMSE', 'device': 'CPU'}`.
- **data**: Training data as a `pandas.DataFrame`.
- **target_col**: Name of the target column in `data`.
- **n_splits**: Number of folds to create with `KFold` (or `StratifiedKFold` for classification tasks).
- **metric**: Metric used to compute the final validation score. Supported values: `"mse"`, `"mae"`, `"accuracy"`, `"f1"`.
- **verbose**: Controls how often evaluation logs are printed. Defaults to 100.
- **early_stop**: Number of iterations without improvement in the validation score after which training stops. Defaults to 500.
- **random_state**: Seed that controls randomness in model training and data splitting. Defaults to 42 for reproducible results. Pass `None` for non-deterministic behavior.
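As a rough illustration of what the `metric` names could correspond to, the sketch below maps each supported string to a scikit-learn scoring function. The `METRICS` table and the `validation_score` helper are hypothetical; how `ml_qol` resolves these names internally is not shown in this document and may differ.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
)

# Hypothetical lookup table; ml_qol's internal mapping may differ.
METRICS = {
    "mse": mean_squared_error,
    "mae": mean_absolute_error,
    "accuracy": accuracy_score,
    "f1": f1_score,  # scikit-learn's default here is binary F1
}


def validation_score(metric, y_true, y_pred):
    """Look up the scoring function by name and apply it."""
    if metric not in METRICS:
        raise ValueError(f"Unsupported metric {metric!r}; expected one of {sorted(METRICS)}")
    return METRICS[metric](y_true, y_pred)


# Regression metrics on a tiny hand-made example.
reg_true, reg_pred = [3.0, 1.0, 2.0], [2.5, 1.0, 3.0]
print("mse:", validation_score("mse", reg_true, reg_pred))
print("mae:", validation_score("mae", reg_true, reg_pred))

# Classification metrics on binary labels.
clf_true, clf_pred = [1, 0, 1, 1], [1, 0, 0, 1]
print("accuracy:", validation_score("accuracy", clf_true, clf_pred))
print("f1:", validation_score("f1", clf_true, clf_pred))
```

Note that `"mse"` and `"mae"` are errors (lower is better), while `"accuracy"` and `"f1"` are scores (higher is better), so the mean validation score should be read accordingly.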


## get_fold_preds

Returns the averaged predictions from a list of trained models.


In [5]:
from ml_qol import get_fold_preds
from sklearn.metrics import mean_absolute_error

X_test = test_df.drop(columns="MedHouseVal")

lgb_preds = get_fold_preds(models=lgb_models, test_df=X_test)
lgb_preds

array([2.23816343, 1.92668548, 1.33629134, ..., 1.90389427, 3.14148662,
       2.89017136])

In [6]:
mean_absolute_error(test_df["MedHouseVal"], lgb_preds)

0.41537856171473453

In [7]:
cb_preds = get_fold_preds(models=cb_models, test_df=X_test)
mean_absolute_error(test_df["MedHouseVal"], cb_preds)

0.4434645996091336

#### `get_fold_preds` parameters

- **models**: List of trained models
- **test_df**: Inference dataset for generating predictions
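The averaging step can be sketched with NumPy alone. This is an illustration of the idea rather than `ml_qol`'s code: `ConstantModel` and `average_fold_preds` are hypothetical names, and the only assumption about real models is that they expose a `predict` method.

```python
import numpy as np


class ConstantModel:
    """Toy stand-in for a trained model: predicts the same value for every row."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value, dtype=float)


def average_fold_preds(models, X):
    """Stack each model's predictions and take the element-wise mean."""
    preds = np.stack([model.predict(X) for model in models])  # (n_models, n_rows)
    return preds.mean(axis=0)


X_demo = np.zeros((4, 2))  # four rows, feature values are irrelevant here
fold_models = [ConstantModel(v) for v in (1.0, 2.0, 3.0)]
avg = average_fold_preds(fold_models, X_demo)
print(avg)
```

Averaging over models trained on different folds tends to reduce the variance of the final prediction compared with relying on any single fold's model.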


## get_ensemble_preds

Takes a list of predictions from different models and computes their weighted average to produce a single ensembled prediction.


In [8]:
from ml_qol import get_ensemble_preds

ensemble_preds = get_ensemble_preds(
    preds_list=[lgb_preds, cb_preds], weights=[0.7, 0.3]
)
ensemble_preds

array([2.25105386, 1.94135131, 1.35334026, ..., 1.88861195, 3.10551075,
       2.88583463])

#### `get_ensemble_preds` parameters

- **preds_list**: List of predictions from various models.
- **weights**: List of weights used to calculate average predictions. Default is None, which weighs all predictions equally.
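For reference, a weighted average of prediction arrays can be written in a few lines of NumPy. The `weighted_ensemble` function below is a hypothetical sketch, not the library's source; in particular, `np.average` divides by the sum of the weights, and whether `get_ensemble_preds` normalizes weights the same way is an assumption.

```python
import numpy as np


def weighted_ensemble(preds_list, weights=None):
    """Combine per-model predictions into one array via a weighted average."""
    preds = np.asarray(preds_list, dtype=float)  # shape: (n_models, n_samples)
    if weights is not None and len(weights) != len(preds):
        raise ValueError("weights must have one entry per prediction array")
    # np.average normalizes by the sum of weights; weights=None gives a plain mean.
    return np.average(preds, axis=0, weights=weights)


a = np.array([2.0, 4.0])
b = np.array([4.0, 8.0])

weighted = weighted_ensemble([a, b], weights=[0.7, 0.3])
equal = weighted_ensemble([a, b])
print(weighted)
print(equal)
```

With weights of 0.7 and 0.3, the first element is 0.7 * 2.0 + 0.3 * 4.0 = 2.6, while the equal-weight version gives the plain mean of 3.0.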
