# The cross-validation approach

The cross-validation process that is implemented in `eensight` is governed by four (4) parameters:

1. `group_by`: This parameter defines what constitutes an indivisible group of data. If its value is "day", the cross-validation process will consider the different days in the dataset as groups. If its value is "week", the different weeks in the dataset will be regarded as groups. In either case, the same group **will not** appear in two different folds. This is summarized schematically in the diagram below.

   <img src="images/group_by.png?modified=12345678" alt="grouped" width="600"/>


2. `stratify_by`: This parameter defines how the cross-validation process will stratify the folds. The default value is "month", which means that the folds will preserve the percentage of month occurrences across test sets. In other words, all the test sets will contain observations from all the months that can be found in the full dataset. Alternative values for `stratify_by` are "week" and `None` (no stratification).

   This is summarized schematically in the diagram below.

   <img src="images/stratify_by.png?modified=12345678" alt="grouped" width="600"/>

3. `n_splits`: The number of folds. It must be at least 2. 


4. `n_repeats`: This parameter defines the number of times the cross-validation process should be repeated. 

   If we choose the number of folds to be five (5), we actually choose to estimate the model’s performance on unseen data as the mean of five (5) values. This is generally a low number of values and leads to a high variance of the result. The solution, however, is not to increase the number of folds, since this would decrease the size of the test sets and the extent to which they adequately represent the data. Instead, we repeat the cross-validation process and merge the results. 
   
   If we stratify over months, group by weeks and set the number of folds to four (4), each test set roughly contains one (1) week from each month. Given that each month has about four (4) weeks, the number of possible week combinations that define a test set is very high. As a result, it is highly unlikely that repeating the cross-validation process a few times will lead to evaluating identical folds. Empirically, four (4) repetitions of a cross-validation process with four (4) folds, i.e. 16 folds in total, are enough to provide stable performance estimates.
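
To put a rough number on "very high", the arithmetic can be worked out under a simplifying assumption that is not part of `eensight`: twelve months with exactly four weeks each, and a test set built by drawing one week from every month.

```python
weeks_per_month = 4   # simplifying assumption: every month has exactly 4 weeks
n_months = 12
n_splits, n_repeats = 4, 4

# One week is drawn from each month, independently of the other months.
possible_test_sets = weeks_per_month ** n_months
total_folds = n_splits * n_repeats

print(f"possible test sets: {possible_test_sets:,}")  # 16,777,216
print(f"folds evaluated:    {total_folds}")            # 16
```

Sixteen folds drawn from roughly 16.8 million candidate test sets are very unlikely to coincide.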

The combination of `group_by` and `stratify_by` allows the model to see data that covers all seasonality and operating mode patterns, while still making it non-trivial for it to predict on unseen data.
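
The interplay of the two parameters can be sketched with plain `numpy` and `pandas`. This is only an illustration under stated assumptions, not the `eensight` implementation: the function name `assign_test_folds` is hypothetical, weeks are taken as consecutive 7-day blocks, each week is labelled with the month of its first observation, and the weeks of every month are dealt round-robin to the folds.

```python
import numpy as np
import pandas as pd


def assign_test_folds(index, n_splits=3, seed=0):
    """Return (week_id, test_fold) arrays, one entry per timestamp.

    Weeks are the indivisible groups; months are the strata.
    """
    rng = np.random.default_rng(seed)
    # Consecutive 7-day blocks counted from the first timestamp act as weeks.
    week_id = np.asarray((index - index[0]).days) // 7
    weeks, first_pos, inverse = np.unique(
        week_id, return_index=True, return_inverse=True
    )
    # Label every week with the month of its first observation.
    week_month = np.asarray(index.month)[first_pos]

    fold_of_week = np.empty(len(weeks), dtype=int)
    for month in np.unique(week_month):
        members = np.flatnonzero(week_month == month)
        rng.shuffle(members)
        # Deal the month's weeks round-robin, starting at a random fold.
        start = rng.integers(n_splits)
        fold_of_week[members] = (start + np.arange(len(members))) % n_splits
    return week_id, fold_of_week[inverse]


# Hypothetical data: daily timestamps for 51 full weeks, starting on a Monday.
index = pd.date_range("2021-01-04", "2021-12-26", freq="D")
week_id, test_fold = assign_test_folds(index, n_splits=3)

for fold in range(3):
    test_weeks = set(week_id[test_fold == fold])
    train_weeks = set(week_id[test_fold != fold])
    months_in_test = set(index.month[test_fold == fold])
    print(fold, len(test_weeks & train_weeks), len(months_in_test))
```

Because a fold is assigned per week, no week can be split between train and test, and because every month contributes weeks to every fold, each test set still spans the whole year.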

**Note**: We should not stratify the cross-validation folds if the model that we want to evaluate aims at forecasting. However, M&V models aim at interpolating the available data, and this makes stratification a suitable strategy.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
from eensight.utils.jupyter import load_catalog
from eensight.pipelines.model_selection import create_splits, CrossValidator
from eensight.pipelines.model_selection import cvrmse, nmbe

### Load some data

In [4]:
catalog = load_catalog('demo', partial_catalog=True)

In [None]:
data = catalog.load('train.model_input_data')

# Daily totals; grouping on `index.date` assumes that `data` has a DatetimeIndex.
consumption_daily = data['consumption'].groupby(data.index.date).sum()

When grouping by day or week, the same calendar day or week in different years must not be treated as the same group. Groups are therefore always keyed on the year as well:

In [7]:
def create_groups(X, group_block):
    if group_block == "day":
        grouped = X.groupby([lambda x: x.year, lambda x: x.dayofyear])
    elif group_block == "week":
        grouped = X.groupby([lambda x: x.year, lambda x: x.isocalendar()[1]])
    elif group_block == "month":
        grouped = X.groupby([lambda x: x.year, lambda x: x.month])
    else:
        raise ValueError("`group_block` must be one of 'day', 'week' or 'month'.")

    # Number the groups consecutively (0, 1, 2, ...) in sorted key order; the
    # result is aligned with the index of X.
    return grouped.ngroup()
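
A small, self-contained illustration of why the year is part of the grouping key. It uses made-up timestamps and pandas' `ngroup()` directly rather than the `create_groups` helper:

```python
import pandas as pd

# Two readings on 15 January 2020 and one on 15 January 2021.
index = pd.to_datetime(["2020-01-15 00:00", "2020-01-15 12:00", "2021-01-15 00:00"])
X = pd.DataFrame({"consumption": [1.0, 2.0, 3.0]}, index=index)

by_day_only = X.groupby(X.index.dayofyear).ngroup()
by_year_and_day = X.groupby([X.index.year, X.index.dayofyear]).ngroup()

print(by_day_only.tolist())      # [0, 0, 0]
print(by_year_and_day.tolist())  # [0, 0, 1]
```

Without the year, the reading from 2021 would be merged into the same group as the readings from 2020.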

### Case: `group_by` is "day", no stratification

In [8]:
splits = create_splits(data, group_by='day', stratify_by=None, 
                       n_splits=3, n_repeats=1)

In [None]:
data_as_days = create_groups(data, 'day')

for train_idx, test_idx in splits(data):
    days_in_train = np.unique(data_as_days[train_idx])
    days_in_test = np.unique(data_as_days[test_idx])
    common_days = np.intersect1d(days_in_train, days_in_test)
    print('Common days between train and test splits: '
          f'{len(common_days) / data_as_days.nunique()}')

We can visualize the data per split:

In [11]:
def visualize_splits(n_splits, splits, data, figsize=(12, 8)):
    colors = ['#d8b365', '#01665e']

    # Note: matplotlib >= 3.6 renames this style to 'seaborn-v0_8-whitegrid'.
    with plt.style.context('seaborn-whitegrid'):    
        fig = plt.figure(figsize=figsize, dpi=96)
        layout = (n_splits, 1)
        
        axes = []
        for i in range(n_splits):
            axes.append(plt.subplot2grid(layout, (i, 0)))
        
        for i, (train_idx, test_idx) in enumerate(splits(data)):
            subset = consumption_daily[
                        np.isin(consumption_daily.index, data.iloc[train_idx].index.date)
            ]

            subset.plot(ax=axes[i], color=colors[0], style='.', ms=8, alpha=0.8, 
                        label=f'split_{i}:train')

            subset = consumption_daily[
                        np.isin(consumption_daily.index, data.iloc[test_idx].index.date)
            ]

            subset.plot(ax=axes[i], color=colors[1], style='.', ms=8, alpha=0.8, 
                        label=f'split_{i}:test')

        for ax in axes:
            ax.legend(frameon=True, shadow=True, bbox_to_anchor=(1.01, 1.01))

In [None]:
visualize_splits(3, splits, data, figsize=(12, 8))

### Case: `group_by` is "week", no stratification

In [13]:
splits = create_splits(data, group_by='week', stratify_by=None, 
                       n_splits=3, n_repeats=1)

In [None]:
data_as_weeks = create_groups(data, 'week')

for train_idx, test_idx in splits(data):
    weeks_in_train = np.unique(data_as_weeks[train_idx])
    weeks_in_test = np.unique(data_as_weeks[test_idx])
    common_weeks = np.intersect1d(weeks_in_train, weeks_in_test)
    print('Common weeks between train and test splits: '
          f'{len(common_weeks) / data_as_weeks.nunique()}')

In [None]:
visualize_splits(3, splits, data, figsize=(12, 8))

### Case: `group_by` is "day", `stratify_by` is "week"

In [16]:
data_as_days = create_groups(data, 'day').to_frame('data_as_days')
data_as_weeks = create_groups(data, 'week').to_frame('data_as_weeks')
data_as_groups = pd.concat((data_as_days, data_as_weeks), axis=1)

Different values of `n_splits` lead to different coverage levels. In other words, the more splits we request, the more likely it becomes that some weeks will not be represented in both the train and the test split:

In [None]:
for n_splits in range(2, 10):
    splits = create_splits(data, group_by='day', stratify_by='week', 
                           n_splits=n_splits, n_repeats=1)
    
    all_values = []  # collect one coverage value per split
    for train_idx, test_idx in splits(data):
        weeks_in_train = np.unique(data_as_groups.iloc[train_idx]['data_as_weeks'])
        weeks_in_test = np.unique(data_as_groups.iloc[test_idx]['data_as_weeks'])
        common_weeks = np.intersect1d(weeks_in_train, weeks_in_test)
        all_values.append(len(common_weeks) / data_as_groups["data_as_weeks"].nunique())
    print(f'n_splits: {n_splits}, average coverage: {np.mean(all_values)}')
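
The drop in coverage follows from a counting argument, sketched here as an idealised upper bound rather than as what `create_splits` computes: a complete week has seven days, each day ends up in exactly one test fold, so a week can be present in the test set of at most seven folds.

```python
DAYS_PER_WEEK = 7  # a complete week offers at most seven day-groups

bounds = {}
for n_splits in range(2, 10):
    # Each day lands in exactly one test fold, so a week can appear in the
    # test set of at most min(7, n_splits) folds.
    bounds[n_splits] = min(DAYS_PER_WEEK, n_splits) / n_splits
    print(f"n_splits: {n_splits}, upper bound on coverage: {bounds[n_splits]:.3f}")
```

Up to seven splits, full coverage is possible; beyond that the bound falls below one, and incomplete weeks at the edges of the dataset lower the observed coverage further.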

The grouping, however, is always maintained:

In [None]:
for n_splits in range(2, 10):
    splits = create_splits(data, group_by='day', stratify_by='week', 
                           n_splits=n_splits, n_repeats=1)
    
    all_values = []  # collect one value per split
    for train_idx, test_idx in splits(data):
        days_in_train = np.unique(data_as_groups.iloc[train_idx]['data_as_days'])
        days_in_test = np.unique(data_as_groups.iloc[test_idx]['data_as_days'])
        common_days = np.intersect1d(days_in_train, days_in_test)
        all_values.append(len(common_days) / data_as_groups["data_as_days"].nunique())
    print(f'n_splits: {n_splits}, common days between train and test splits: '
          f'{np.mean(all_values)}')

We can visualize the data per split:

In [None]:
splits = create_splits(data, group_by='day', stratify_by='week', 
                        n_splits=3, n_repeats=1)
    
visualize_splits(3, splits, data, figsize=(12, 8))

### Case: `group_by` is "week", `stratify_by` is "month"

In [20]:
data_as_weeks = create_groups(data, 'week').to_frame('data_as_weeks')
data_as_months = create_groups(data, 'month').to_frame('data_as_months')
data_as_groups = pd.concat((data_as_weeks, data_as_months), axis=1)

Different values of `n_splits` lead to different coverage levels:

In [None]:
for n_splits in range(2, 10):
    splits = create_splits(data, group_by='week', stratify_by='month', 
                           n_splits=n_splits, n_repeats=1)

    all_values = []  # collect one coverage value per split
    for train_idx, test_idx in splits(data):
        months_in_train = np.unique(data_as_groups.iloc[train_idx]['data_as_months'])
        months_in_test = np.unique(data_as_groups.iloc[test_idx]['data_as_months'])
        common_months = np.intersect1d(months_in_train, months_in_test)
        all_values.append(len(common_months) / data_as_groups["data_as_months"].nunique())
    print(f'n_splits: {n_splits}, average coverage: {np.mean(all_values)}')

The grouping, however, is always maintained:

In [None]:
for n_splits in range(2, 10):
    splits = create_splits(data, group_by='week', stratify_by='month', 
                           n_splits=n_splits, n_repeats=1)
    all_values = []  # collect one value per split
    for train_idx, test_idx in splits(data):
        weeks_in_train = np.unique(data_as_groups.iloc[train_idx]['data_as_weeks'])
        weeks_in_test = np.unique(data_as_groups.iloc[test_idx]['data_as_weeks'])
        common_weeks = np.intersect1d(weeks_in_train, weeks_in_test)
        all_values.append(len(common_weeks) / data_as_groups["data_as_weeks"].nunique())
    print(f'n_splits: {n_splits}, common weeks between train and test splits: '
          f'{np.mean(all_values)}')

We can visualize the data per split:

In [None]:
splits = create_splits(data, group_by='week', stratify_by='month', 
                       n_splits=3, n_repeats=1)
visualize_splits(3, splits, data, figsize=(12, 8))

### Case: `group_by` is "day", `stratify_by` is "month"

In [24]:
data_as_days = create_groups(data, 'day').to_frame('data_as_days')
data_as_months = create_groups(data, 'month').to_frame('data_as_months')
data_as_groups = pd.concat((data_as_days, data_as_months), axis=1)

Since we have a sufficient number of unique days per month, different values of `n_splits` do not affect coverage levels (unless we ask for a very high number of splits):

In [None]:
for n_splits in range(2, 10):
    splits = create_splits(data, group_by='day', stratify_by='month', 
                           n_splits=n_splits, n_repeats=1)
    
    all_values = []  # collect one coverage value per split
    for train_idx, test_idx in splits(data):
        months_in_train = np.unique(data_as_groups.iloc[train_idx]['data_as_months'])
        months_in_test = np.unique(data_as_groups.iloc[test_idx]['data_as_months'])
        common_months = np.intersect1d(months_in_train, months_in_test)
        all_values.append(len(common_months) / data_as_groups["data_as_months"].nunique())
    print(f'n_splits: {n_splits}, average coverage: {np.mean(all_values)}')

The grouping is again maintained:

In [None]:
for n_splits in range(2, 10):
    splits = create_splits(data, group_by='day', stratify_by='month', 
                           n_splits=n_splits, n_repeats=1)
    all_values = []  # collect one value per split
    for train_idx, test_idx in splits(data):
        days_in_train = np.unique(data_as_groups.iloc[train_idx]['data_as_days'])
        days_in_test = np.unique(data_as_groups.iloc[test_idx]['data_as_days'])
        common_days = np.intersect1d(days_in_train, days_in_test)
        all_values.append(len(common_days) / data_as_groups["data_as_days"].nunique())
    print(f'n_splits: {n_splits}, common days between train and test splits: '
          f'{np.mean(all_values)}')

We can visualize the data per split:

In [None]:
splits = create_splits(data, group_by='day', stratify_by='month', 
                           n_splits=3, n_repeats=1)
visualize_splits(3, splits, data, figsize=(12, 8))

### Case: `group_by` is "week", `stratify_by` is "month", `n_repeats`>0

The combination `group_by='week'` and `stratify_by='month'` is the default in `eensight`.

The default value for `n_splits` is 3. If one wants to apply the train/test cycle to more splits, it is advisable to increase the `n_repeats` parameter:

In [28]:
data_as_weeks = create_groups(data, 'week').to_frame('data_as_weeks')
data_as_months = create_groups(data, 'month').to_frame('data_as_months')
data_as_groups = pd.concat((data_as_weeks, data_as_months), axis=1)

In [None]:
for n_repeats in range(1, 10):
    splits = create_splits(data, group_by='week', stratify_by='month', 
                           n_splits=3, n_repeats=n_repeats)
    all_values = []  # collect one coverage value per split
    for i, (train_idx, test_idx) in enumerate(splits(data)):
        months_in_train = np.unique(data_as_groups.iloc[train_idx]['data_as_months'])
        months_in_test = np.unique(data_as_groups.iloc[test_idx]['data_as_months'])
        common_months = np.intersect1d(months_in_train, months_in_test)
        all_values.append(len(common_months) / data_as_groups["data_as_months"].nunique())
    print(f'n_repeats: {n_repeats}, total splits: {i+1}, average coverage: '
          f'{np.mean(all_values)}')

The grouping requirement is still maintained:

In [None]:
for n_repeats in range(1, 10):
    splits = create_splits(data, group_by='week', stratify_by='month', 
                           n_splits=3, n_repeats=n_repeats)
    all_values = []  # collect one value per split
    for i, (train_idx, test_idx) in enumerate(splits(data)):
        weeks_in_train = np.unique(data_as_groups.iloc[train_idx]['data_as_weeks'])
        weeks_in_test = np.unique(data_as_groups.iloc[test_idx]['data_as_weeks'])
        common_weeks = np.intersect1d(weeks_in_train, weeks_in_test)
        all_values.append(len(common_weeks) / data_as_groups["data_as_weeks"].nunique())
    print(f'n_repeats: {n_repeats}, total splits: {i+1}, common weeks between train and '
          f'test splits: {np.mean(all_values)}')
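
What repetition adds can be sketched independently of `eensight`. The generator below is a hypothetical helper (`repeated_group_folds`) that reshuffles the groups before every repeat and ignores stratification for brevity; it only illustrates that `n_repeats` repetitions of `n_splits` folds give `n_splits * n_repeats` train/test pairs while groups stay intact.

```python
import numpy as np


def repeated_group_folds(group_ids, n_splits=3, n_repeats=2, seed=0):
    """Yield (train_idx, test_idx) pairs; groups are reshuffled on every repeat."""
    rng = np.random.default_rng(seed)
    groups = np.unique(group_ids)
    for _ in range(n_repeats):
        shuffled = rng.permutation(groups)
        # Each chunk of shuffled groups becomes the test set of one fold.
        for test_groups in np.array_split(shuffled, n_splits):
            in_test = np.isin(group_ids, test_groups)
            yield np.flatnonzero(~in_test), np.flatnonzero(in_test)


# 52 hypothetical weeks with 7 daily observations each.
group_ids = np.repeat(np.arange(52), 7)
folds = list(repeated_group_folds(group_ids, n_splits=3, n_repeats=4))
print(len(folds))  # 12 = n_splits * n_repeats
```

Within one repeat the test sets partition the data; across repeats the partitions differ because the groups are shuffled again.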

### Apply cross-validation

`eensight.pipelines.model_selection.CrossValidator`

    Parameters
    ----------
    estimator : Any regressor with scikit-learn API (i.e. with fit and predict methods)
        The object to use to fit the data and evaluate the metrics.
    group_by : str {None, 'day', 'week'}, default='week'
        Parameter that defines what constitutes an indivisible group of data. The same
        group will not appear in two different folds. If `group_by='week'`, the cross
        validation process will consider the different weeks of the year as groups. If
        `group_by='day'`, the different days of the year will be considered as groups.
        If None, no groups will be considered.
    stratify_by : str {None, 'week', 'month'}, default='month'
        Parameter that defines if the cross validation process will stratify the folds.
        If `stratify_by='month'`, the folds will preserve the percentage of month
        occurrences across test sets.
    n_splits : int, default=3
        Number of folds. Must be at least 2.
    n_repeats : int, default=None
        Number of times the cross-validation process needs to be repeated.
    target_name : str, default='consumption'
        It is expected that both y and the predictions of the `estimator` are
        dataframes with a single column, the name of which is the one provided
        for `target_name`.
    scorers : dict, default=None
        dict mapping scorer name to a callable. The callable object
        should have signature ``scorer(y_true, y_pred)``.
        The default value is:
        `OrderedDict(
            {
                "CVRMSE": lambda y_true, y_pred:
                    eensight.pipelines.model_selection.cvrmse(
                        y_true[target_name], y_pred[target_name]
                    ),
                "NMBE": lambda y_true, y_pred:
                    eensight.pipelines.model_selection.nmbe(
                        y_true[target_name], y_pred[target_name]
                    )
            }
        )`
    keep_estimators : bool, default=False
        Whether to keep the fitted estimators per fold.
    n_jobs : int, default=None
        Number of jobs to run in parallel. Training the estimator and computing
        the score are parallelized over the cross-validation splits.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors.
    verbose : int, default=0
        The verbosity level.
    fit_params : dict, default=None
        Parameters to pass to the fit method of the estimator.
    pre_dispatch : int or str, default='2*n_jobs'
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:
            - None, in which case all the jobs are immediately
            created and spawned. Use this for lightweight and
            fast-running jobs, to avoid delays due to on-demand
            spawning of the jobs
            - An int, giving the exact number of total jobs that are
            spawned
            - A str, giving an expression as a function of n_jobs,
            as in '2*n_jobs'
    random_state : int, RandomState instance or None, default=None
        Controls the randomness of each repeated cross-validation instance.
        Pass an int for reproducible output across multiple function calls.
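
The shape of a `scorers` dictionary can be sketched as follows. The two metric functions are stand-ins written for this example (`simple_cvrmse` and `simple_nmbe` are hypothetical names); they assume the plain definitions without a degrees-of-freedom correction and an observed-minus-predicted sign convention, which may differ from `eensight.pipelines.model_selection.cvrmse` and `nmbe`. The input values are made up.

```python
import numpy as np
import pandas as pd


def simple_cvrmse(y_true, y_pred):
    """Root-mean-squared error divided by the mean of the observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(y_true)


def simple_nmbe(y_true, y_pred):
    """Mean error (observed minus predicted) divided by the mean observation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(y_true - y_pred) / np.mean(y_true)


target_name = "consumption"
# Each scorer receives single-column dataframes and returns a float.
scorers = {
    "CVRMSE": lambda y_true, y_pred: simple_cvrmse(
        y_true[target_name], y_pred[target_name]
    ),
    "NMBE": lambda y_true, y_pred: simple_nmbe(
        y_true[target_name], y_pred[target_name]
    ),
}

y_true = pd.DataFrame({target_name: [10.0, 20.0, 30.0]})
y_pred = pd.DataFrame({target_name: [12.0, 18.0, 30.0]})
results = {name: scorer(y_true, y_pred) for name, scorer in scorers.items()}
print(results)
```

Any callable with the ``scorer(y_true, y_pred)`` signature can be added to such a dictionary in the same way.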

In [None]:
model = catalog.load('ensemble_model')

In [8]:
data_train = catalog.load('train.model_input_data')

X_train = data_train.loc[~data_train['consumption_outlier'], ['temperature']]
y_train = data_train.loc[X_train.index, ['consumption']]

In [9]:
cv = CrossValidator(model, group_by='day',
                    n_repeats=3, keep_estimators=True, 
                    n_jobs=-1, verbose=False)

In [None]:
%%time
cv = cv.fit(X_train, y_train)

The `CrossValidator` contains the cross-validation scores: 

In [None]:
cv.scores_

In [None]:
print(f'Mean out-of-sample CVRMSE (%): {np.mean(cv.scores_["CVRMSE"])*100}')
print(f'Mean out-of-sample NMBE (%): {np.mean(cv.scores_["NMBE"])*100}')
