# Constructing uncertainty intervals

Suppose we have a training dataset $(X_i,Y_i) \in \mathbb{R}^d \times \mathbb{R}$, $i=1,\dots,n$, and a new test point $(X_{n+1},Y_{n+1})$ drawn from the same distribution. If we have a regression model $\hat{\mu}$ that has been fitted on the training data, we can apply it to the new test point's features to get a prediction for the target: $\hat{Y}_{n+1}=\hat{\mu}(X_{n+1})$.

In addition, we want a prediction interval for the test point, i.e. an interval around $\hat{\mu}(X_{n+1})$ that is likely to contain the true value of $Y_{n+1}$. If the desired miscoverage rate is $a$ (or, equivalently, the desired confidence level for the interval is $1-a$), the goal is to find an interval $C_a(X_{n+1})$ such that:

$$P\{Y_{n+1} \in C_a(X_{n+1}) \} \geq 1-a$$


`eensight` uses [conformal prediction](https://arxiv.org/abs/2107.07511) for distribution-free uncertainty quantification. 

The method builds upon the cross-validation functionality. If `CrossValidator` is instantiated with `keep_estimators=True`, we get: (a) a set of estimators, each of which has been fitted on a subset of the data, and (b) the index of the data subset that each estimator did not see during fitting. 

More formally, during cross-validation, we split the training dataset into $K$ subsets $S_1,\dots,S_K$, and define $\hat{\mu}_{/S_k}$ as a regression model that has been fitted on the training data with the $k$-th subset removed. `CrossValidator` stores each model $\hat{\mu}_{/S_k}$ (in attribute `estimators`) alongside a mapping of the form $S_k(i) \rightarrow \hat{\mu}_{/S_k}$ (in attribute `oos_masks`; *oos* stands for out-of-sample), where $S_k(i)$ identifies the subset that contains the observation $X_i$.
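
The bookkeeping described above can be sketched with NumPy alone. This is only an illustration of the idea, not the `CrossValidator` implementation: the synthetic data, the least-squares helper `fit_linear`, and the names `folds` and `oos_residuals` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 5
X = rng.normal(size=(n, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

# Split a shuffled index into K disjoint subsets S_1, ..., S_K.
folds = np.array_split(rng.permutation(n), K)

def fit_linear(X, y):
    """Hypothetical helper: least-squares fit with intercept, returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

estimators, oos_masks = [], []
for S_k in folds:
    train = np.setdiff1d(np.arange(n), S_k)
    estimators.append(fit_linear(X[train], y[train]))  # model fitted without S_k
    oos_masks.append(S_k)                              # indices this model has not seen

# Each observation is scored by the single model that did not see it.
oos_residuals = np.empty(n)
for predict, mask in zip(estimators, oos_masks):
    oos_residuals[mask] = np.abs(y[mask] - predict(X[mask]))
```

Because every index appears in exactly one out-of-sample mask, each observation receives exactly one out-of-sample residual.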

Conformal estimation uses the non-conformity scores defined as:

<img src="images/conformal_01.png?modified=12345678" alt="grouped" width="500"/>

Then, the conformal prediction interval is defined as:

<img src="images/conformal_02.png?modified=1234567" alt="grouped" width="700"/>

where $q_{n,a}^+ (e_i)$ is the $\lceil(1-a)(n+1)\rceil$-th smallest value of the non-conformity scores $e_i$.
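
As a small worked example of this quantile rule, the sketch below picks the $\lceil(1-a)(n+1)\rceil$-th smallest score and uses it as the half-width of a symmetric interval. It assumes the scores are absolute residuals and that the interval is symmetric around the prediction; the function name `conformal_quantile` and the numbers are illustrative only.

```python
import numpy as np

def conformal_quantile(scores, a):
    """Return the ceil((1 - a) * (n + 1))-th smallest score (1-indexed rank)."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((1 - a) * (n + 1)))
    if k > n:
        # Too few scores to certify this level: the interval is unbounded.
        return np.inf
    return scores[k - 1]

scores = np.arange(1, 10)                   # nine scores: 1, 2, ..., 9
q = conformal_quantile(scores, a=0.5)       # rank ceil(0.5 * 10) = 5, so q = 5.0
y_hat = 20.0                                # hypothetical point prediction
interval = (y_hat - q, y_hat + q)           # (15.0, 25.0)
```

Note that for a small $a$ and few scores the required rank can exceed $n$, in which case no finite interval can guarantee the requested coverage.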

The conformal prediction functionality of `eensight` resides in:

- `eensight.pipelines.uncertainty.IcpEstimator`
- `eensight.pipelines.uncertainty.AggregatedCp`


### `eensight.pipelines.uncertainty.IcpEstimator`

Inductive conformal estimator.

    Parameters
    ----------
    estimator : Any regressor with scikit-learn predictor API (i.e. with fit and
        predict methods)
        The object to use to calculate the non-conformity scores. The estimator is
        expected to be the result of a cross-validation process and it must be already
        fitted.
    oos_mask : array-like
        The index of the training dataset's subset that the `estimator` has not seen
        during its fitting (i.e. the test sample of the relevant cross-validation fold).
    add_normalizer : bool, default=True
        If True, a normalization model will be added. Its predictions act as a
        multiplicative correction factor of the non-conformity scores. The
        normalization model is a random forest regressor
        (`sklearn.ensemble.RandomForestRegressor`).
    extra_regressors : str or list of str, default=None
        The names of the additional regressors to use for the normalization model. By
        default, the normalization model uses only the month of year, day of week, and
        hour of day features.
    n_estimators : int, default=100
        The number of trees in the normalization model.
    min_samples_leaf : int or float, default=0.05
        The minimum number of samples required to be at a leaf node of the
        normalization model. A split point at any depth will only be considered
        if it leaves at least ``min_samples_leaf`` training samples in each of
        the left and right branches.  This may have the effect of smoothing the
        model, especially in regression.
        - If int, then consider `min_samples_leaf` as the minimum number.
        - If float, then `min_samples_leaf` is a fraction and
        `ceil(min_samples_leaf * n_samples)` are the minimum
        number of samples for each node.
    max_samples : int or float, default=0.6
        The number of samples to draw from X to train each base estimator in the
        normalization model.
        - If None, then draw `X.shape[0]` samples.
        - If int, then draw `max_samples` samples.
        - If float, then draw `max_samples * X.shape[0]` samples. Thus,
        `max_samples` should be in the interval `(0, 1)`.
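
To illustrate what a multiplicative correction factor does, here is a minimal sketch of normalized conformal scores on synthetic data. It assumes absolute-residual scores divided by a difficulty estimate; `sigma_hat` stands in for the normalizer's predictions (in `eensight` a random forest, here simply a known function), and the exact form used by the library may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0, 1, size=n)
y = 3 * x + rng.normal(scale=0.2 + x)      # noise grows with x (heteroscedastic)
y_hat = 3 * x                              # stand-in for the fitted regressor
sigma_hat = 0.2 + x                        # stand-in for the normalizer's output

a = 0.1
k = int(np.ceil((1 - a) * (n + 1)))        # rank of the conformal quantile

# Without a normalizer: one half-width shared by every observation.
q_plain = np.sort(np.abs(y - y_hat))[k - 1]

# With a normalizer: scores are scaled, and the half-width becomes q * sigma_hat(x).
q_norm = np.sort(np.abs(y - y_hat) / sigma_hat)[k - 1]
half_width = q_norm * sigma_hat

coverage = np.mean(np.abs(y - y_hat) <= half_width)
```

Both variants target the same coverage, but the normalized intervals are narrower where the data is less noisy and wider where it is noisier.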



### `eensight.pipelines.uncertainty.AggregatedCp`

Aggregated conformal estimator. Combines multiple `IcpEstimator` estimators into an aggregated model.

    Parameters
    ----------
    estimators : List of estimators with scikit-learn predictor API (with fit
        and predict methods).
        Each estimator is expected to be the result of a cross-validation process
        and it must be already fitted.
    oos_masks : list of array-like
        List containing the index of the training dataset's subset that each `estimator`
        in `estimators` has not seen during its fitting (i.e. the test sample of the
        relevant cross-validation fold).
    add_normalizer : bool, default=True
        If True, a normalization model will be added. Its predictions act as a
        multiplicative correction factor of the non-conformity scores. The
        normalization model is a random forest regressor
        (`sklearn.ensemble.RandomForestRegressor`).
    extra_regressors : str or list of str, default=None
        The names of the additional regressors to use for the normalization model. By
        default, the normalization model uses only the month of year, day of week, and
        hour of day features.
    n_estimators : int, default=100
        The number of trees in the normalization model.
    min_samples_leaf : int or float, default=0.05
        The minimum number of samples required to be at a leaf node of the
        normalization model. A split point at any depth will only be considered
        if it leaves at least ``min_samples_leaf`` training samples in each of
        the left and right branches.  This may have the effect of smoothing the
        model, especially in regression.
        - If int, then consider `min_samples_leaf` as the minimum number.
        - If float, then `min_samples_leaf` is a fraction and
        `ceil(min_samples_leaf * n_samples)` are the minimum
        number of samples for each node.
    max_samples : int or float, default=0.8
        The number of samples to draw from X to train each base estimator in the
        normalization model.
        - If None, then draw `X.shape[0]` samples.
        - If int, then draw `max_samples` samples.
        - If float, then draw `max_samples * X.shape[0]` samples. Thus,
        `max_samples` should be in the interval `(0, 1)`.
    n_jobs : int, default=None
        Number of jobs to run in parallel. ``None`` means 1 unless in a
        `joblib.parallel_backend` context. ``-1`` means using all processors.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [58]:
from eensight.utils.jupyter import load_catalog
from eensight.pipelines.uncertainty import AggregatedCp, mpiw, picp, generate_samples


In [4]:
catalog = load_catalog('demo')

In [5]:
cv_model = catalog.load('train.cross_validator_model')

First, create a conformal predictor <ins>without a normalizer</ins>:

In [6]:
conformal = AggregatedCp(estimators=cv_model.estimators, 
                         oos_masks=cv_model.oos_masks,
                         add_normalizer=False,
                         n_jobs=-1
)

The conformal predictor must be fitted on the **same dataset** that was used for cross-validation:

In [7]:
data_train = catalog.load('train.preprocessed_data')
data_train = data_train.dropna()

X_train = data_train.loc[~data_train['consumption_outlier'], ['temperature']]
y_train = data_train.loc[X_train.index, ['consumption']]

In [None]:
%%time
conformal = conformal.fit(X_train, y_train)

The `predict` method of `AggregatedCp` can be applied to any dataset for which we want uncertainty intervals. It accepts a `significance` parameter:

    significance : float or list of floats between 0 and 1
        Significance level (maximum allowed error rate) of predictions. If ``None``,
        then intervals for all significance levels (0.01, 0.02, ..., 0.99) will be
        computed.

In [None]:
%%time
pred_conf = conformal.predict(data_train[['temperature']], significance=[0.9, 0.95, 0.99])

The result of the `predict` method is a `sklearn.utils.Bunch` (dict-like) object with fields:

 - *significance*: The significance levels used in the calculations, list of float.
 - *quantiles*: The quantiles of the non-conformity scores, array of shape (len(X), len(significance))

In [16]:
%%time
prediction = cv_model.predict(data_train[['temperature']])

In [38]:
s = 1

intervals = pd.concat((
(prediction['consumption'] + pd.Series(data=pred_conf.quantiles[:, s], 
                                       index=prediction.index)).to_frame('consumption_high'),
 (prediction['consumption'] - pd.Series(data=pred_conf.quantiles[:, s], 
                                        index=prediction.index)).to_frame('consumption_low')
), axis=1)



In [33]:
def plot_intervals(X, intervals, size=2000, title=''):
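    # Pick a random window of `size` rows; start at 0 if the data is shorter than `size`.
    start = np.random.randint(0, high=max(1, len(intervals) - size))

    # On matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'.
    with plt.style.context('seaborn-whitegrid'):    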
        fig = plt.figure(figsize=(12, 3.54), dpi=96)
        layout = (1, 1)
        ax = plt.subplot2grid(layout, (0, 0))
        
        ax.fill_between(intervals.index[start:start+size], 
                        intervals['consumption_low'][start:start+size],
                        intervals['consumption_high'][start:start+size],
                        color='#DDA0DD', alpha=0.5
        )
        X.loc[intervals.index][start:start+size].plot(
            ax=ax, alpha=0.8, rot=0)
        ax.set_title(title)

In [None]:
plot_intervals(data_train[['consumption']], intervals, 
               title=f'Intervals at {pred_conf.significance[s]}')

`eensight` includes two metrics for evaluating the uncertainty intervals:

- Prediction Interval Coverage Probability (PICP). Computes the fraction of samples for
    which the ground truth lies within the predicted interval.
- Mean Prediction Interval Width (MPIW). Computes the average width of the prediction
    intervals. Measures the sharpness of the intervals.
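
The two definitions above translate directly into NumPy. The functions below are a sketch of the definitions only, not the `eensight` implementations of `picp` and `mpiw`; the names `picp_sketch` and `mpiw_sketch` and the toy data are made up for this example.

```python
import numpy as np

def picp_sketch(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

def mpiw_sketch(lower, upper):
    """Average interval width."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
lower = np.array([0.5, 2.5, 2.0, 3.0])
upper = np.array([1.5, 3.5, 4.0, 5.0])

coverage = picp_sketch(y_true, lower, upper)   # 3 of 4 values inside: 0.75
width = mpiw_sketch(lower, upper)              # widths 1, 1, 2, 2: mean 1.5
```

The two metrics pull in opposite directions: wider intervals raise PICP but also raise MPIW, so they are best read together.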

In [None]:
picp(data_train['consumption'], intervals['consumption_low'], intervals['consumption_high'])

In [None]:
mpiw(intervals['consumption_low'], intervals['consumption_high'])

Next, create a conformal predictor <ins>with a normalizer</ins>:

In [51]:
conformal = AggregatedCp(estimators=cv_model.estimators, 
                         oos_masks=cv_model.oos_masks,
                         add_normalizer=True,
                         max_samples=0.8,
                         n_jobs=-1
)

In [None]:
%%time
conformal = conformal.fit(X_train, y_train)

In [None]:
%%time
pred_conf = conformal.predict(data_train[['temperature']], significance=[0.9, 0.95, 0.99])

In [54]:
s = 1

intervals = pd.concat((
(prediction['consumption'] + pd.Series(data=pred_conf.quantiles[:, s], 
                                       index=prediction.index)).to_frame('consumption_high'),
 (prediction['consumption'] - pd.Series(data=pred_conf.quantiles[:, s], 
                                        index=prediction.index)).to_frame('consumption_low')
), axis=1)

In [None]:
plot_intervals(data_train[['consumption']], intervals, 
               title=f'Intervals at {pred_conf.significance[s]}')

In [None]:
picp(data_train['consumption'], intervals['consumption_low'], intervals['consumption_high'])

In [None]:
mpiw(intervals['consumption_low'], intervals['consumption_high'])

## Sampling from the uncertainty intervals

Uncertainty intervals are useful for evaluating the confidence one should have about a model's predictions, but in M&V we need to quantify the uncertainty of cumulative sums of the predictions (more accurately, cumulative sums of the predicted minus the actual consumption).

For this purpose, `eensight` includes functionality for sampling from the distribution that is implied by the uncertainty intervals:

In [None]:
%%time
samples = generate_samples(500, prediction=prediction['consumption'], 
                                significance=pred_conf.significance, 
                                quantiles=pred_conf.quantiles
)

In [85]:
def plot_samples(X, samples, size=2000, title=''):
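    # Pick a random window of `size` rows; start at 0 if the data is shorter than `size`.
    start = np.random.randint(0, high=max(1, len(X) - size))

    # On matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'.
    with plt.style.context('seaborn-whitegrid'):    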
        fig = plt.figure(figsize=(12, 3.54), dpi=96)
        layout = (1, 1)
        ax = plt.subplot2grid(layout, (0, 0))
        
        samples.iloc[start:start+size].plot(ax=ax, alpha=0.005, color='#DDA0DD', legend=False)
            
        X.iloc[start:start+size].plot(
            ax=ax, alpha=0.8, rot=0)
        ax.set_title(title)


In [None]:
plot_samples(data_train[['consumption']], samples)

The larger the number of significance levels, the more accurately the sampling process will reflect the underlying distribution of the predictions.
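
One way to see why more levels help is to sketch the sampling as inverse-transform sampling over a piecewise-linear quantile function: each significance level contributes one knot, so more levels give a finer approximation. The code below is an assumption about how such sampling could work for a single observation, not a description of `generate_samples`; the function `sample_from_quantiles` and its inputs are hypothetical.

```python
import numpy as np

def sample_from_quantiles(prediction, significance, quantiles, n_samples, rng):
    """Draw samples around `prediction` via inverse-transform sampling.

    quantiles[j] is the interval half-width at significance[j], so the absolute
    error is at most quantiles[j] with probability 1 - significance[j].
    """
    coverage = 1.0 - np.asarray(significance, dtype=float)
    order = np.argsort(coverage)                       # np.interp needs increasing x
    cov = np.concatenate(([0.0], coverage[order]))     # anchor: zero width at zero coverage
    width = np.concatenate(([0.0], np.asarray(quantiles, dtype=float)[order]))
    # Simplification: sample only inside the range covered by the known levels.
    u = rng.uniform(0.0, cov[-1], size=n_samples)
    abs_err = np.interp(u, cov, width)                 # piecewise-linear inverse CDF
    sign = rng.choice([-1.0, 1.0], size=n_samples)     # symmetric intervals
    return prediction + sign * abs_err

rng = np.random.default_rng(0)
draws = sample_from_quantiles(10.0, significance=[0.5, 0.1], quantiles=[1.0, 2.0],
                              n_samples=5000, rng=rng)
```

With only two levels the interpolation has two segments; with the full grid of levels it follows the empirical distribution of the scores much more closely.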

In [None]:
%%time
pred_conf = conformal.predict(data_train[['temperature']])

In [None]:
pred_conf.significance

In [None]:
%%time
samples = generate_samples(500, prediction=prediction['consumption'], 
                                significance=pred_conf.significance, 
                                quantiles=pred_conf.quantiles
)

In [None]:
plot_samples(data_train[['consumption']], samples)