In [2]:
import pandas as pd

cycling = pd.read_csv("../datasets/bike_rides.csv", index_col=0, parse_dates=True)
cycling.index.name = ""
target_name = "power"
data, target = cycling.drop(columns=target_name), cycling[target_name]
data

Unnamed: 0,heart-rate,cadence,speed,acceleration,slope
,,,,,
2020-08-18 14:43:19,102.0,64.0,4.325,0.0880,-0.033870
2020-08-18 14:43:20,103.0,64.0,4.336,0.0842,-0.033571
2020-08-18 14:43:21,105.0,66.0,4.409,0.0234,-0.033223
2020-08-18 14:43:22,106.0,66.0,4.445,0.0016,-0.032908
2020-08-18 14:43:23,106.0,67.0,4.441,0.1144,0.000000
...,...,...,...,...,...
2020-09-13 14:55:57,130.0,0.0,1.054,0.0234,0.000000
2020-09-13 14:55:58,130.0,0.0,0.829,0.0258,0.000000
2020-09-13 14:55:59,129.0,0.0,0.616,-0.1686,0.000000


### Building model with ShuffleSplit cross validation


We will create a predictive model that uses all data available and use a non-linear regressor, a `sklearn.ensemble.HistGradientBoostingRegressor`.The number of maximum iterations will be set to 1000 (`max_iter=1_000`) and the early stopping activated (`early_stopping=True`).

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
from sklearn.ensemble import HistGradientBoostingRegressor

hgbr = make_pipeline(StandardScaler(),
                     HistGradientBoostingRegressor(max_iter=1000, early_stopping=True))

cv = ShuffleSplit(n_splits=4,random_state=0)
cv_results = cross_validate(hgbr, data, target, cv=cv, scoring='neg_mean_absolute_error',
                            return_estimator=True, return_train_score=True)

errors_SS_hgbr = -cv_results["train_score"]
print(f"Histogram GBDT - MAE on train sets:\t",
      f"{errors_SS_hgbr.mean():.3f} +/- {errors_SS_hgbr.std():.3f} Watts")
errors_SS_hgbr = -cv_results["test_score"]
print(f"Histogram GBDT - MAE on test sets:\t",
      f"{errors_SS_hgbr.mean():.3f} +/- {errors_SS_hgbr.std():.3f} Watts")

Histogram GBDT - MAE on train sets:	 40.104 +/- 1.097 Watts
Histogram GBDT - MAE on test sets:	 43.853 +/- 0.160 Watts



### Building model with LeaveOneOutGroup cross validation

We would like to have a cross-validation strategy that evaluates the capacity of our model to predict on a completely new bike ride: the samples in the validation set should only come from rides not present in the training set. Therefore, we can use a LeaveOneGroupOut strategy: at each iteration of the cross-validation, we will keep a bike ride for the evaluation and use all other bike rides to train our model.

Concretely, we need to:

* create a variable called ***group*** that is a 1D numpy array containing the index of each ride present in the dataframe. Therefore, the length of group will be equal to the number of samples in data. If we had 2 bike rides, we would expect the indices 0 and 1 in group to differentiate the bike ride.
* create a cross-validation object named cv using the `sklearn.model_selection.LeaveOneGroupOut` strategy.


In [10]:
from sklearn.model_selection import LeaveOneGroupOut

groups, _ = pd.factorize(data.index.date)
cv = LeaveOneGroupOut()

cv_results_hgbr = cross_validate(
    hgbr, data, target, groups=groups, cv=cv,
    scoring="neg_mean_absolute_error", return_estimator=True,
    return_train_score=True, n_jobs=2)

errors_LOGO_hgbr = -cv_results_hgbr["train_score"]
print(f"Histogram GBDT - MAE on train sets:\t",
      f"{errors_LOGO_hgbr.mean():.3f} +/- {errors_LOGO_hgbr.std():.3f} Watts")
errors_LOGO_hgbr = -cv_results_hgbr["test_score"]
print(f"Histogram GBDT - MAE on test sets:\t",
      f"{errors_LOGO_hgbr.mean():.3f} +/- {errors_LOGO_hgbr.std():.3f} Watts")

Histogram GBDT - MAE on train sets:	 38.246 +/- 1.597 Watts
Histogram GBDT - MAE on test sets:	 49.832 +/- 2.657 Watts


Regarding model under- and over-fitting, we observe that the histogram gradient boosting regressor is overfitting, in particular with the LeaveOneGroupOut cross-validation strategy as the gap between train and test scores is even wider than measured with the ShuffleSplit strategy.

In [12]:
print(
    "HGBDT with LeaveOneGroupOut has a bigger test error than HGBDT with ShuffleSplit by "
    f"{errors_LOGO_hgbr.mean() - errors.mean()}"
    " Watts."
)

HGBDT with LeaveOneGroupOut has a bigger test error than HGBDT with ShuffleSplit by 5.910040087520372 Watts.


We observe a higher standard deviation of the test MAE when it is computed by respecting the ride dependency structure using LeaveOneGroupOut.