# Conformalized quantile regression (CQR): Real data experiment

In this tutorial we will load a real dataset and construct prediction intervals using CQR [1].

[1] Yaniv Romano, Evan Patterson, and Emmanuel J. Candes, “Conformalized quantile regression.” 2019.

## Prediction intervals

Suppose we are given $ n $ training samples $ \{(X_i, Y_i)\}_{i=1}^n$ and we must now predict the unknown value of $Y_{n+1}$ at a test point $X_{n+1}$. We assume that all the samples $ \{(X_i,Y_i)\}_{i=1}^{n+1} $ are drawn exchangeably (for instance, i.i.d.) from an arbitrary joint distribution $P_{XY}$ over the feature vectors $ X\in \mathbb{R}^p $ and response variables $ Y\in \mathbb{R} $. We aim to construct a distribution-free prediction interval $C(X_{n+1}) \subseteq \mathbb{R}$ with marginal coverage, that is, an interval that is likely to contain the unknown response $Y_{n+1} $. Given a desired miscoverage rate $ \alpha $, we ask that
$$ \mathbb{P}\{Y_{n+1} \in C(X_{n+1})\} \geq 1-\alpha $$
for any joint distribution $ P_{XY} $ and any sample size $n$. The probability in this statement is marginal, being taken over all the samples $ \{(X_i, Y_i)\}_{i=1}^{n+1} $.

To accomplish this, we build on the method of split conformal prediction. We first split the training data into two disjoint subsets, a proper training set and a calibration set. We fit two quantile regressors on the proper training set to obtain initial estimates of the lower and upper bounds of the prediction interval. Then, using the calibration set, we conformalize and, if necessary, correct this prediction interval. Unlike the original interval, the conformalized prediction interval is guaranteed to satisfy the coverage requirement regardless of the choice or accuracy of the quantile regression estimator.
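The conformalization step can be written out explicitly. The block below restates the construction of [1] in the notation used above; the symbols $\hat{q}_{\mathrm{lo}}$, $\hat{q}_{\mathrm{hi}}$ (the two fitted quantile regressors), $\mathcal{I}_2$ (the calibration index set) and $E_i$ (the conformity scores) are introduced here for illustration and do not appear elsewhere in this notebook.

```latex
% Conformity scores, computed on the calibration set only
E_i = \max\bigl\{\hat{q}_{\mathrm{lo}}(X_i) - Y_i,\; Y_i - \hat{q}_{\mathrm{hi}}(X_i)\bigr\},
\qquad i \in \mathcal{I}_2

% Finite-sample corrected empirical quantile of the scores
Q_{1-\alpha}(E, \mathcal{I}_2) =
\text{the } \bigl\lceil (1-\alpha)\,(|\mathcal{I}_2| + 1) \bigr\rceil
\text{-th smallest value in } \{E_i : i \in \mathcal{I}_2\}

% Conformalized prediction interval at the test point
C(X_{n+1}) = \bigl[\hat{q}_{\mathrm{lo}}(X_{n+1}) - Q_{1-\alpha}(E, \mathcal{I}_2),\;
                   \hat{q}_{\mathrm{hi}}(X_{n+1}) + Q_{1-\alpha}(E, \mathcal{I}_2)\bigr]
```

A score $E_i$ is positive when $Y_i$ falls outside the initial band and negative when it falls inside, so the correction can widen the band or shrink it.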



## A case study

We start by importing several libraries, loading the real dataset, and standardizing its features. The response is log-transformed rather than standardized. We set the target miscoverage rate $\alpha$ to 0.1.

# load the data

In [1]:
import pandas as pd

# df = pd.read_csv('D:\Conformalized_Quantile_Regression\LUCAS_2015_features_V3.csv')
df = pd.read_csv('C:\\Users\\nkakhani\\_CP_DSM\\Conformal_Prediction_DSM\\LUCAS_2015_features_V3.csv')

In [2]:
# Convert the OC column to numeric; non-numeric entries become NaN
df['OC'] = pd.to_numeric(df['OC'], errors='coerce')

# Drop rows whose OC value could not be parsed
df.dropna(subset=['OC'], inplace=True)

In [3]:
df

Unnamed: 0,SR_B3_1,SR_B3_2,SR_B3_3,SR_B3_4,SR_B4_1,SR_B4_2,SR_B4_3,SR_B4_4,SR_B5_1,SR_B5_2,...,average_7,average_8,average_9,average_10,average_11,average_12,average_13,average_14,OC,point_id
0,0.065401,0.065401,0.063259,0.063259,0.058280,0.058280,0.053137,0.053137,0.341738,0.341738,...,532.638051,498.618622,502.788853,704.755629,402.898849,469.73883,617.714286,396.468312,24.6,26581768
1,0.037099,0.036429,0.035900,0.035307,0.034250,0.033989,0.032527,0.032337,0.179388,0.175070,...,512.001688,489.344548,512.317155,678.640244,404.306692,481.17483,593.808163,402.009979,21.9,26581792
2,0.072997,0.075304,0.071491,0.072426,0.063204,0.065369,0.060863,0.061394,0.347341,0.352894,...,519.103506,509.807511,457.939797,668.159475,448.914535,505.68683,625.240816,395.747479,18.4,26581954
3,0.026582,0.026362,0.026356,0.026522,0.022784,0.022857,0.022424,0.022880,0.189531,0.186850,...,520.863506,495.344548,490.902061,686.594090,401.957672,480.77883,613.922449,398.876646,48.0,26601784
4,0.051178,0.051178,0.047617,0.047617,0.031081,0.031081,0.028169,0.028169,0.358772,0.358772,...,497.983506,490.677881,449.362438,634.390244,435.981202,489.49483,594.461224,388.101646,25.2,26601978
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21853,0.073742,0.073323,0.073456,0.073102,0.110663,0.109788,0.111120,0.110324,0.185107,0.184097,...,644.881688,600.148252,477.336023,746.617167,403.169437,416.83883,690.440816,344.026646,8.4,64881666
21854,0.085582,0.085886,0.086188,0.086588,0.071660,0.072601,0.071150,0.072110,0.402217,0.400485,...,636.685324,590.337140,476.524702,736.836398,395.087084,410.73883,680.738776,341.705812,10.8,64901668
21855,0.084536,0.087933,0.081859,0.085314,0.071451,0.077255,0.066329,0.072025,0.391538,0.383660,...,636.685324,590.337140,476.524702,736.836398,395.087084,410.73883,680.738776,341.705812,6.7,64901672
21856,0.087019,0.087398,0.086868,0.087840,0.099747,0.102079,0.099499,0.102728,0.286553,0.283027,...,648.761688,602.381585,490.090740,754.286398,404.941986,417.09883,695.187755,351.993312,5.7,64961676


In [4]:
X = df.iloc[1:10000,:-2]
y = df.iloc[1:10000,-2]

In [5]:
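import torch
import random
import warnings
import numpy as np
# np.warnings is unavailable in recent NumPy releases; use the standard library module
warnings.filterwarnings('ignore')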

from datasets import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

seed = 1

random_state_train_test = seed
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
    
# desired miscoverage error
alpha = 0.1

# desired quantile levels
quantiles = [0.05, 0.95]

# used to determine the size of test set
test_ratio = 0.2

# # name of dataset
# dataset_base_path = "./datasets/"
# dataset_name = "community"

# # load the dataset
# X, y = datasets.GetDataset(dataset_name, dataset_base_path)

# divide the dataset into test and train based on the test_ratio parameter
x_train, x_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_ratio,
                                                    random_state=random_state_train_test)
                                                    
# keep the indices of the test samples so they can be mapped back to point IDs later
keep_inds = x_test.index.tolist()
point_id = df.loc[keep_inds, 'point_id']

# reshape the data
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)

# compute input dimensions
n_train = x_train.shape[0]
in_shape = x_train.shape[1]

# display basic information
# print("Dataset: %s" % (dataset_name))
print("Dimensions: train set (n=%d, p=%d) ; test set (n=%d, p=%d)" % 
      (x_train.shape[0], x_train.shape[1], x_test.shape[0], x_test.shape[1]))

Dimensions: train set (n=7999, p=72) ; test set (n=2000, p=72)


## Data splitting

We begin by splitting the data into a proper training set and a calibration set. Recall that the main idea is to fit a regression model on the proper training samples, then use the residuals on the held-out calibration set to quantify the uncertainty in future predictions. Here 90% of the training samples form the proper training set and the remaining 10% form the calibration set.
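The idea in the paragraph above can be illustrated with plain split conformal prediction on absolute residuals. This is a minimal sketch on synthetic data, not the pipeline used in this notebook: the linear model, the sample sizes and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-1, 1, size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# three disjoint index sets: proper training, calibration, test
idx = rng.permutation(n)
idx_fit, idx_calib, idx_eval = idx[:1000], idx[1000:1500], idx[1500:]

# fit a straight line on the proper training set only
slope, intercept = np.polyfit(x[idx_fit], y[idx_fit], deg=1)

def predict(t):
    return slope * t + intercept

# absolute residuals on the calibration set
resid = np.abs(y[idx_calib] - predict(x[idx_calib]))

# finite-sample corrected quantile: ceil((n_cal + 1) * (1 - alpha))-th smallest residual
miscoverage = 0.1
k = int(np.ceil((len(resid) + 1) * (1 - miscoverage)))
q = np.sort(resid)[k - 1]

# constant-width interval around the point prediction
lower_demo = predict(x[idx_eval]) - q
upper_demo = predict(x[idx_eval]) + q
coverage_demo = np.mean((y[idx_eval] >= lower_demo) & (y[idx_eval] <= upper_demo))
print(f"empirical coverage: {coverage_demo:.3f}")
```

The interval has the same width everywhere, which is the limitation CQR addresses by conformalizing quantile estimates instead of a point prediction.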

In [6]:
# divide the data into proper training set and calibration set
idx = np.random.permutation(n_train)
split_point = int(np.floor(n_train * 0.9))
# idx_train, idx_cal = idx[:n_half], idx[n_half:2*n_half]

# Split the indices into training and calibration sets
idx_train, idx_cal = idx[:split_point], idx[split_point:]

# zero mean and unit variance scaling 
scalerX = StandardScaler()
scalerX = scalerX.fit(x_train[idx_train])

# scale
x_train = scalerX.transform(x_train)
x_test = scalerX.transform(x_test)

# mean absolute response on the proper training set
# (kept for reference; the rescaling below is disabled)
mean_y_train = np.mean(np.abs(y_train[idx_train]))
# y_train = np.squeeze(y_train)/mean_y_train

# log-transform the response instead, to see whether the results improve
# (this requires strictly positive OC values)
y_train = np.log(np.squeeze(y_train))
y_test = np.log(np.squeeze(y_test))

# Classical Random Forest

Cross Validation 

In [7]:
# from cqr import helper
# from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# from math import sqrt

# from sklearn.model_selection import KFold, GridSearchCV
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.metrics import mean_squared_error
# import numpy as np

# # Create your dataset (x_train, y_train, x_test, y_test)

# # Define the parameter grid for hyperparameter tuning
# param_grid = {
#     "min_samples_leaf": [1, 2, 5],
#     "n_estimators": [100, 500, 1000],
#     "max_features": [None, "sqrt", "log2"]
# }

# # Perform nested cross-validation (e.g., 5-fold outer, 3-fold inner)
# outer_cv = KFold(n_splits=5, shuffle=True, random_state=422)
# inner_cv = KFold(n_splits=3, shuffle=True, random_state=422)

# # Initialize lists to store results
# mean_mse_scores = []
# std_mse_scores = []

# for train_index, test_index in outer_cv.split(x_train):
#     x_train_outer, x_test_outer = x_train[train_index], x_train[test_index]
#     y_train_outer, y_test_outer = y_train[train_index], y_train[test_index]

#     # Initialize the inner loop results
#     inner_residual_matrix = []

#     for inner_train_index, inner_test_index in inner_cv.split(x_train_outer):
#         x_train_inner, x_val_inner = x_train_outer[inner_train_index], x_train_outer[inner_test_index]
#         y_train_inner, y_val_inner = y_train_outer[inner_train_index], y_train_outer[inner_test_index]

#         # Initialize the inner GridSearchCV for hyperparameter tuning
#         grid_search = GridSearchCV(
#             estimator=RandomForestRegressor(random_state=422),
#             param_grid=param_grid,
#             scoring='neg_mean_squared_error',
#             cv=inner_cv,
#             n_jobs=-1
#         )

#         # Fit the GridSearchCV to the inner training data
#         grid_search.fit(x_train_inner, y_train_inner)

#         # Get the best hyperparameters from the inner loop
#         best_params = grid_search.best_params_

#         # Create a RandomForestRegressor with the best hyperparameters
#         rf_model = RandomForestRegressor(
#             random_state=422,
#             min_samples_leaf=best_params['min_samples_leaf'],
#             n_estimators=best_params['n_estimators'],
#             max_features=best_params['max_features']
#         )

#         # Fit the model to the inner training data
#         rf_model.fit(x_train_inner, y_train_inner)

#         # Make predictions on the inner validation data
#         predictions = rf_model.predict(x_val_inner)

#         # Calculate residuals for the inner fold and store them in the inner_residual_matrix
#         residuals = y_val_inner - predictions
#         inner_residual_matrix.append(residuals)

#     # Calculate the mean and standard deviation of MSE scores for the inner loop results
#     inner_mse_scores = [mean_squared_error(y_val_inner, predictions) for predictions in rf_model.staged_predict(x_test_outer)]
#     mean_inner_mse = np.mean(inner_mse_scores)
#     std_inner_mse = np.std(inner_mse_scores)

#     # Append the mean and standard deviation of inner MSE to the lists of outer loop results
#     mean_mse_scores.append(mean_inner_mse)
#     std_mse_scores.append(std_inner_mse)

#     # Combine residuals from the inner loop into a single matrix
#     inner_residual_matrix = np.vstack(inner_residual_matrix)
    
#     # Now, you can fit a quantile regression model using the inner_residual_matrix
#     # For example, using the statsmodels library:
#     import statsmodels.api as sm

#     quantile_model = sm.QuantReg(y_train, x_train)
#     quantile_results = quantile_model.fit(q=0.5)  # Fit the model for the median (you can choose other quantiles)
#     print(quantile_results.summary())

# # Calculate the mean and standard deviation of MSE scores from the outer loop
# mean_mse = np.mean(mean_mse_scores)
# std_mse = np.mean(std_mse_scores)

# print(f"Mean MSE: {mean_mse}")
# print(f"Standard Deviation of MSE: {std_mse}")


Model Evaluation

In [8]:

# # Make predictions on the test data
# test_predictions = rf_model.predict(x_test)

# # Calculate R-squared (R²) for test samples
# r2_test = r2_score(y_test, test_predictions)

# # Calculate Root Mean Squared Error (RMSE) for test samples
# rmse_test = sqrt(mean_squared_error(y_test, test_predictions))

# print(f"R² for Test Samples: {r2_test}")
# print(f"RMSE for Test Samples: {rmse_test}")

Bootstrapping

In [9]:
# # Set the number of bootstrap iterations
# num_bootstraps = 10  # Adjust as needed

# # Initialize an empty array to store bootstrap results
# bootstrap_results = np.zeros((num_bootstraps, len(y_test)))

# for i in range(num_bootstraps):
#     # Generate random indices with replacement
#     indices = np.random.choice(len(y_train), len(y_train), replace=True)
    
#     # Select a bootstrap sample
#     bootstrap_x = x_train[indices]
#     bootstrap_y = y_train[indices]
    
#     # Train your model on the bootstrap sample (e.g., rf_model.fit(bootstrap_x, bootstrap_y))
#     rf_model.fit(bootstrap_x, bootstrap_y)
    
#     # Make predictions on the entire dataset
#     bootstrap_results[i] = rf_model.predict(x_test)

# # Calculate statistics from the bootstrap results (e.g., confidence intervals)
# mean_predictions = np.mean(bootstrap_results, axis=0)
# lower_bound_rf = np.percentile(bootstrap_results, 2.5, axis=0)
# upper_bound_rf = np.percentile(bootstrap_results, 97.5, axis=0)

# # Calculate metrics or uncertainty measures using the bootstrap results
# mse_bootstrap = mean_squared_error(y_test, mean_predictions)

In [10]:
# y_lower_rf = np.exp(lower_bound_rf) 
# y_upper_rf = np.exp(upper_bound_rf) 
# y_test_rf = np.exp(y_test) 

In [11]:
# from cqr import helper
# # compute and print average coverage and average length
# coverage_RF, length_RF = helper.compute_coverage(y_test_rf,
#                                                  y_lower_rf,
#                                                  y_upper_rf,
#                                                  alpha,
#                                                  "Random Forests")

In [12]:
# # Create a DataFrame with the desired columns
# df_oc_rf = pd.DataFrame({
#     'lower_oc': y_lower_rf,
#     'upper_oc': y_upper_rf,
#     'predicted_oc': (y_upper_rf + y_lower_rf)/2,
#     'standard_uncertainty': (y_upper_rf - y_lower_rf) / np.mean((y_upper_rf + y_lower_rf)),
#     'test_oc': y_test_rf,
#     'Point_ID': point_id
# })

In [13]:
# df_oc_rf.to_csv('D:\Conformalized_Quantile_Regression\LUCAS_2015_rf.csv', index = False)  # Set index=False to exclude row indices in the CSV

## CQR random forests

Given these two subsets, we now conformalize the initial prediction interval constructed by quantile random forests [2]. Below, we set the hyperparameters of the CQR random forests method.

[2] Nicolai Meinshausen, “Quantile regression forests.” Journal of Machine Learning Research 7 (2006): 983-999.

In [14]:
#########################################################
# Quantile random forests parameters
# (See QuantileForestRegressorAdapter class in helper.py)
#########################################################

# the number of trees in the forest
n_estimators = 1000

# the minimum number of samples required to be at a leaf node
# (default skgarden's parameter)
min_samples_leaf = 1

# the number of features to consider when looking for the best split
# (default skgarden's parameter)
max_features = x_train.shape[1]

# target quantile levels
quantiles_forest = [quantiles[0]*100, quantiles[1]*100]

# use cross-validation to tune the quantile levels?
cv_qforest = True

# when tuning the two QRF quantile levels one may
# ask for a prediction band with smaller average coverage
# to avoid too conservative estimation of the prediction band
# This would be equal to coverage_factor*(quantiles[1] - quantiles[0])
coverage_factor = 0.85

# ratio of held-out data, used in cross-validation
cv_test_ratio = 0.05

# seed for splitting the data in cross-validation.
# Also used as the seed in quantile random forests function
cv_random_state = 1

# determines the lowest and highest quantile level parameters.
# This is used when tuning the quantile levels by cross-validation.
# The smallest value is equal to quantiles[0] - range_vals.
# Similarly, the largest value is equal to quantiles[1] + range_vals.
cv_range_vals = 30

# sweep over a grid of length num_vals when tuning QRF's quantile parameters                   
cv_num_vals = 10

### Symmetric nonconformity score 

In the following cells we run the entire CQR procedure. The class `QuantileForestRegressorAdapter` defines the underlying estimator. The class `RegressorNc` defines the CQR object, which uses `QuantileRegErrFunc` as the nonconformity score. The function `run_icp` fits the regression function to the proper training set, corrects (if required) the initial estimate of the prediction interval using the calibration set, and returns the conformal band. Lastly, we compute the average coverage and length on future test data using `compute_coverage`.
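To make the symmetric score concrete, here is a minimal NumPy sketch of the conformalization step. It is an illustration under stated assumptions, not the code inside `run_icp` or `QuantileRegErrFunc`: the function name `cqr_symmetric` is hypothetical, and a deliberately too narrow constant band stands in for the fitted quantile regressors.

```python
import numpy as np

def cqr_symmetric(q_lo_cal, q_hi_cal, y_cal, q_lo_test, q_hi_test, alpha):
    """Conformalize quantile predictions with the symmetric CQR score (sketch)."""
    n = len(y_cal)
    # positive when y lies outside the band, negative when it lies inside
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    # finite-sample corrected rank: ceil((n + 1) * (1 - alpha))-th smallest score
    k = int(np.ceil((n + 1) * (1 - alpha)))
    correction = np.inf if k > n else np.sort(scores)[k - 1]
    # the same correction is applied to both ends of the band
    return q_lo_test - correction, q_hi_test + correction

# synthetic standard normal responses (assumption for the demo)
rng = np.random.default_rng(0)
n_cal, n_test, target_alpha = 1000, 5000, 0.1
y_cal_demo = rng.normal(size=n_cal)
y_test_demo = rng.normal(size=n_test)

# initial band [-0.5, 0.5] covers far less than 90% of a standard normal
lo_demo, hi_demo = cqr_symmetric(
    np.full(n_cal, -0.5), np.full(n_cal, 0.5), y_cal_demo,
    np.full(n_test, -0.5), np.full(n_test, 0.5), target_alpha,
)
cov_demo = np.mean((y_test_demo >= lo_demo) & (y_test_demo <= hi_demo))
print(f"empirical coverage after conformalization: {cov_demo:.3f}")
```

Because the initial band undercovers, the correction is positive and the band is widened; with an overly wide initial band the correction would be negative.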

In [15]:
from cqr import helper
from nonconformist.nc import RegressorNc
from nonconformist.nc import QuantileRegErrFunc

# define the QRF's parameters 
params_qforest = dict()
params_qforest["n_estimators"] = n_estimators
params_qforest["min_samples_leaf"] = min_samples_leaf
params_qforest["max_features"] = max_features
params_qforest["CV"] = cv_qforest
params_qforest["coverage_factor"] = coverage_factor
params_qforest["test_ratio"] = cv_test_ratio
params_qforest["random_state"] = cv_random_state
params_qforest["range_vals"] = cv_range_vals
params_qforest["num_vals"] = cv_num_vals



In [16]:
from sklearn.ensemble import RandomForestRegressor

# Number of bootstrap samples
n_bootstraps = 20

# n_resample = 1000

# model_rf = RandomForestRegressor(n_estimators = n_estimators, min_samples_leaf = min_samples_leaf, random_state=0)

# Initialize lists to store predictions for each bootstrap sample
bootstrap_predictions = []

# # Perform bootstrapping and calculate predictions as before
# for _ in range(n_bootstraps):
#     X_boot, y_boot = resample(x_train, y_train, n_samples = len(y_train) - n_resample , random_state=np.random.randint(0, 100))
#     model_rf.fit(X_boot, y_boot)
#     y_pred = model_rf.predict(x_test)
#     bootstrap_predictions.append(y_pred)



    
# Bootstrap with a reduced sample size: each iteration draws, with replacement,
# a sample whose size is 63.2% of the training set. (A classical bootstrap
# draws n samples with replacement, which leaves about 63.2% distinct samples.)
sample_size = int(0.632 * x_train.shape[0])

for _ in range(n_bootstraps):
    # Create a new bootstrap sample by randomly selecting samples with replacement
    bootstrap_indices = np.random.choice(x_train.shape[0], size=sample_size, replace=True)
    X_bootstrap = x_train[bootstrap_indices]
    y_bootstrap = y_train[bootstrap_indices]

    # Create and train a new model on the bootstrap sample.
    # random_state=0 fixes the forest's internal randomness, so the models
    # differ only through the bootstrap sample they are trained on.
    bootstrap_model = RandomForestRegressor(n_estimators=n_estimators, min_samples_leaf=min_samples_leaf, random_state=0)
    bootstrap_model.fit(X_bootstrap, y_bootstrap)

    # Make predictions on the test data using the bootstrap model
    y_bootstrap_pred = bootstrap_model.predict(x_test)

    # Append the predictions to the list
    bootstrap_predictions.append(y_bootstrap_pred)

# Envelope of the bootstrap predictions (pointwise maximum and minimum)
y_upper = np.max(bootstrap_predictions, axis=0)
y_lower = np.min(bootstrap_predictions, axis=0)

coverage_forest, length_forest = helper.compute_coverage(y_test,y_lower,y_upper,alpha,"RF")

RF: Percentage in the range (expecting 90.00): 33.100000
RF: Average length: 0.486211
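The cell above draws `int(0.632 * n)` indices with replacement. A classical bootstrap instead draws `n` indices with replacement, and it is that scheme which leaves roughly 63.2% of the original samples represented in each resample. The following sketch, with an arbitrary `n` chosen for the demo, compares the fraction of distinct samples under the two schemes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# classical bootstrap: n draws with replacement
full_draw = rng.integers(0, n, size=n)
# reduced-size scheme used in the cell above: 0.632 * n draws with replacement
reduced_draw = rng.integers(0, n, size=int(0.632 * n))

frac_full = np.unique(full_draw).size / n
frac_reduced = np.unique(reduced_draw).size / n

print(f"distinct fraction, n draws:       {frac_full:.3f}")
print(f"distinct fraction, 0.632*n draws: {frac_reduced:.3f}")
```

The expected fractions are $1 - e^{-1}$ and $1 - e^{-0.632}$ respectively, so the reduced scheme sees fewer than half of the training samples per refit. Note also that the min/max envelope of bootstrap mean predictions reflects variability of the fitted model rather than the noise in the response, which is one plausible reason for the coverage well below 90% reported above.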


In [17]:
# Create a DataFrame with the desired columns
df_oc = pd.DataFrame({
    'BootRF_lower_oc': np.exp(y_lower),
    'BootRF_upper_oc': np.exp(y_upper),
    'BootRF_predicted_oc': (np.exp(y_upper) + np.exp(y_lower))/2,
    'BootRF_standard_uncertainty': (np.exp(y_upper) - np.exp(y_lower)) / np.mean(np.exp(y_upper) + np.exp(y_lower)),
    'BootRF_test_oc': np.exp(y_test),
    'BootRF_Point_ID': point_id
})

In [18]:
df_oc.to_csv('C:\\Users\\nkakhani\\_CP_DSM\\Conformal_Prediction_DSM\\.csv', index = False)  # Set index=False to exclude row indices in the CSV

# Quantile Regression Forest

In [20]:
from skgarden import RandomForestQuantileRegressor
from sklearn.model_selection import KFold
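# recent scikit-learn versions raise an error if random_state is set without shuffle=True
kf = KFold(n_splits=5, shuffle=True, random_state=11)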
rfqr = RandomForestQuantileRegressor(n_estimators=n_estimators, min_samples_leaf=min_samples_leaf, random_state=0)

In [23]:
y

1       21.9
2       18.4
3       48.0
4       25.2
5       16.4
        ... 
9995    42.7
9996    40.3
9997    16.5
9998    39.3
9999    27.3
Name: OC, Length: 9999, dtype: float64

In [27]:
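y_true_all = []
lower = []
upper = []
X = np.asarray(X)
y = np.asarray(y)

# Use fold-specific names so the loop does not overwrite y_train and y_test,
# which the CQR cells below rely on (on the log scale).
for train_index, test_index in kf.split(X):
    X_fold_train, X_fold_test = X[train_index], X[test_index]
    y_fold_train, y_fold_test = y[train_index], y[test_index]

    rfqr.set_params(max_features=X_fold_train.shape[1] // 3)
    rfqr.fit(X_fold_train, y_fold_train)
    y_true_all = np.concatenate((y_true_all, y_fold_test))
    # band between the 2.5th and 98.5th percentiles (96% nominal coverage);
    # note that compute_coverage below reports against 1 - alpha = 90%
    upper = np.concatenate((upper, rfqr.predict(X_fold_test, quantile=98.5)))
    lower = np.concatenate((lower, rfqr.predict(X_fold_test, quantile=2.5)))

# sort the out-of-fold predictions by interval width
interval = upper - lower
sort_ind = np.argsort(interval)
y_true_all = y_true_all[sort_ind]
upper = upper[sort_ind]
lower = lower[sort_ind]
mean = (upper + lower) / 2

coverage_QRF, length_QRF = helper.compute_coverage(y_true_all,lower,upper,alpha,"QRF")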

KeyboardInterrupt: 

In [26]:
len(y_true_all)

9999

In [None]:
# define QRF model
quantile_estimator = helper.QuantileForestRegressorAdapter(model=None,
                                                           fit_params=None,
                                                           quantiles=quantiles_forest,
                                                           params=params_qforest)


In [None]:
# define the CQR object
nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())

In [None]:
# run CQR procedure
y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)

## Rescale your results back to their original scale

In [None]:
y_lower_rescaled = np.exp(y_lower) 
y_upper_rescaled = np.exp(y_upper) 
y_test_rescaled = np.exp(y_test) 
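Back-transforming the interval endpoints with `np.exp` is legitimate because the exponential is strictly increasing: a point lies inside the interval on the log scale exactly when it lies inside the back-transformed interval. The sketch below checks this on synthetic numbers (all names and values are assumptions for the demo) and shows that interval widths, unlike coverage, do change.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000

# synthetic log-scale responses and intervals, some of which miss the response
log_y = rng.normal(size=m)
log_lo = log_y - rng.uniform(-0.5, 1.0, size=m)
log_hi = log_y + rng.uniform(-0.5, 1.0, size=m)

covered_log = (log_y >= log_lo) & (log_y <= log_hi)
covered_orig = (np.exp(log_y) >= np.exp(log_lo)) & (np.exp(log_y) <= np.exp(log_hi))

# coverage indicators agree point by point
same_coverage = np.array_equal(covered_log, covered_orig)
print("coverage identical on both scales:", same_coverage)

# widths are not preserved by the transform
width_log = np.mean(log_hi - log_lo)
width_orig = np.mean(np.exp(log_hi) - np.exp(log_lo))
print(f"mean width (log scale): {width_log:.3f}, (original scale): {width_orig:.3f}")
```

One consequence is that average lengths computed on the log scale and on the original scale are not comparable, so it matters which scale is passed to the evaluation functions.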

### Evaluation metrics for UQ methods

In [None]:
# Calculate PICP
def calculate_picp(true_values, lower_bounds, upper_bounds):
    num_samples = len(true_values)
    within_interval = np.logical_and(true_values >= lower_bounds, true_values <= upper_bounds)
    picp = np.sum(within_interval) / num_samples
    return picp

picp = calculate_picp(y_test_rescaled, y_lower_rescaled, y_upper_rescaled)
print(f'PICP: {picp:.2f}')

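# Calculate the mean interval width.
# Note: PINAW is usually normalized by the range of the observed response;
# this function reports the unnormalized mean width.
def calculate_pinaw(lower_bounds, upper_bounds):
    pinaw = np.mean(upper_bounds - lower_bounds)
    return pinaw

pinaw = calculate_pinaw(y_lower_rescaled, y_upper_rescaled)
print(f'PINAW: {pinaw:.2f}')

# Calculate the mean width of the intervals that cover the true value.
# Note: this differs from the coverage width-based criterion (CWC) as it is
# usually defined, which penalizes coverage below the nominal level.
def calculate_cwc(true_values, lower_bounds, upper_bounds):
    within_interval = np.logical_and(true_values >= lower_bounds, true_values <= upper_bounds)
    cwc = np.mean(upper_bounds[within_interval] - lower_bounds[within_interval])
    return cwc

cwc = calculate_cwc(y_test_rescaled, y_lower_rescaled, y_upper_rescaled)
print(f'CWC: {cwc:.2f}')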
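The functions above report the mean interval width and the mean width of the covering intervals. In the prediction-interval literature, PINAW is commonly normalized by the range of the observed response, and CWC combines PINAW with an exponential penalty that is active only when PICP falls below the nominal level. The sketch below follows one such common formulation; the function names, the toy data and the penalty strength `eta` are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np

def picp_score(y_true, lo, hi):
    """Fraction of responses that fall inside their interval."""
    return np.mean((y_true >= lo) & (y_true <= hi))

def pinaw_score(y_true, lo, hi):
    """Mean interval width divided by the range of the observed response."""
    return np.mean(hi - lo) / (np.max(y_true) - np.min(y_true))

def cwc_score(y_true, lo, hi, alpha, eta=50.0):
    """PINAW inflated by an exponential penalty when coverage is below 1 - alpha."""
    nominal = 1 - alpha
    cov = picp_score(y_true, lo, hi)
    penalty = np.exp(-eta * (cov - nominal)) if cov < nominal else 0.0
    return pinaw_score(y_true, lo, hi) * (1 + penalty)

# toy example: five points, the first interval misses its response
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
lo_toy = y_toy - 0.5
hi_toy = y_toy + 0.5
hi_toy[0] = 0.9  # interval [0.5, 0.9] does not contain 1.0

picp_toy = picp_score(y_toy, lo_toy, hi_toy)
pinaw_toy = pinaw_score(y_toy, lo_toy, hi_toy)
cwc_toy = cwc_score(y_toy, lo_toy, hi_toy, alpha=0.1)
print(picp_toy, pinaw_toy, cwc_toy)
```

With coverage at or above the nominal level the penalty vanishes and this CWC equals PINAW, so lower values are better for both quantities.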

In [None]:
helper.plot_func_data(y_test,y_lower,y_upper,"RF")

In [None]:
# Create a DataFrame with the desired columns
df_oc = pd.DataFrame({
    'lower_oc': y_lower_rescaled,
    'upper_oc': y_upper_rescaled,
    'predicted_oc': (y_upper_rescaled + y_lower_rescaled)/2,
    'standard_uncertainty': (y_upper_rescaled - y_lower_rescaled) / np.mean((y_upper_rescaled + y_lower_rescaled)),
    'test_oc': y_test_rescaled,
    'Point_ID': point_id
})

In [None]:
# df_oc.to_csv('D:\Conformalized_Quantile_Regression\LUCAS_2015_cqr.csv', index = False)  # Set index=False to exclude row indices in the CSV

In [None]:
# compute and print average coverage and average length
coverage_cp_qforest, length_cp_qforest = helper.compute_coverage(y_test,
                                                                 y_lower,
                                                                 y_upper,
                                                                 alpha,
                                                                 "CQR Random Forests")

The coverage reported by `compute_coverage` should be close to the 90% target, which indicates valid coverage.

### Asymmetric nonconformity score 

The nonconformity score function `QuantileRegErrFunc` treats the left and right tails symmetrically, but if the error distribution is significantly skewed, one may choose to treat them asymmetrically. This can be done by replacing `QuantileRegErrFunc` with `QuantileRegAsymmetricErrFunc`, as implemented in the following cell.
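The asymmetric variant can be sketched in a few lines of NumPy: the lower and upper endpoints each receive their own correction, computed at level $1-\alpha/2$ from the one-sided scores. This is an illustration of the idea under stated assumptions, not the code of `QuantileRegAsymmetricErrFunc`; the function names, the skewed synthetic data and the constant initial band are all hypothetical.

```python
import numpy as np

def conformal_quantile(scores, level):
    """ceil((n + 1) * level)-th smallest score, or +inf if that rank exceeds n."""
    n = len(scores)
    k = int(np.ceil((n + 1) * level))
    return np.inf if k > n else np.sort(scores)[k - 1]

def cqr_asymmetric(q_lo_cal, q_hi_cal, y_cal, q_lo_test, q_hi_test, alpha):
    """Conformalize each tail separately, spending alpha / 2 on each side."""
    corr_lo = conformal_quantile(q_lo_cal - y_cal, 1 - alpha / 2)
    corr_hi = conformal_quantile(y_cal - q_hi_cal, 1 - alpha / 2)
    return q_lo_test - corr_lo, q_hi_test + corr_hi

# right-skewed synthetic responses (assumption for the demo)
rng = np.random.default_rng(0)
n_cal, n_test = 1000, 5000
y_cal_skew = rng.exponential(size=n_cal)
y_test_skew = rng.exponential(size=n_test)

# initial constant band [0.5, 1.0], far too narrow on the right
lo_skew, hi_skew = cqr_asymmetric(
    np.full(n_cal, 0.5), np.full(n_cal, 1.0), y_cal_skew,
    np.full(n_test, 0.5), np.full(n_test, 1.0), alpha=0.1,
)
cov_skew = np.mean((y_test_skew >= lo_skew) & (y_test_skew <= hi_skew))
print(f"coverage: {cov_skew:.3f}")
print(f"lower correction: {0.5 - lo_skew[0]:.3f}, upper correction: {hi_skew[0] - 1.0:.3f}")
```

On skewed data the two corrections differ markedly, which a single symmetric correction cannot express. The price is that each tail is calibrated at the stricter level $1-\alpha/2$, which tends to give somewhat wider intervals.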

In [None]:
# from nonconformist.nc import QuantileRegAsymmetricErrFunc

# # define QRF model
# quantile_estimator = helper.QuantileForestRegressorAdapter(model=None,
#                                                            fit_params=None,
#                                                            quantiles=quantiles_forest,
#                                                            params=params_qforest)
        
# # define the CQR object
# nc = RegressorNc(quantile_estimator, QuantileRegAsymmetricErrFunc())

# # run CQR procedure
# y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)

# # compute and print average coverage and average length
# coverage_cp_qforest, length_cp_qforest = helper.compute_coverage(y_test,
#                                                                  y_lower,
#                                                                  y_upper,
#                                                                  alpha,
#                                                                  "Asymmetric CQR Random Forests")

When this cell is run, the asymmetric variant should likewise attain valid coverage.


## CQR neural net

In what follows we use a neural network as the underlying quantile regression method. Below, we set the hyperparameters of the CQR neural network method.

In [None]:
#####################################################
# Neural network parameters
# (See AllQNet_RegressorAdapter class in helper.py)
#####################################################

# pytorch's optimizer object
nn_learn_func = torch.optim.Adam

# number of epochs
epochs = 1000

# learning rate
lr = 0.0005

# mini-batch size
batch_size = 64

# hidden dimension of the network
hidden_size = 64

# dropout regularization rate
dropout = 0.1

# weight decay regularization
wd = 1e-6

# Ask for a reduced coverage when tuning the network parameters by 
# cross-validation to avoid a too conservative initial estimate of the 
# prediction interval. This estimate will be conformalized by CQR.
quantiles_net = [0.1, 0.9]

We now invoke the CQR procedure. The class `AllQNet_RegressorAdapter` defines the underlying neural network estimator. Just as before, `RegressorNc` defines the CQR object, which uses `QuantileRegErrFunc` as the nonconformity score. The function `run_icp` returns the conformal band, computed on test data. Lastly, we compute the average coverage and length using `compute_coverage`.

In [None]:
# define quantile neural network model
quantile_estimator = helper.AllQNet_RegressorAdapter(model=None,
                                                     fit_params=None,
                                                     in_shape=in_shape,
                                                     hidden_size=hidden_size,
                                                     quantiles=quantiles_net,
                                                     learn_func=nn_learn_func,
                                                     epochs=epochs,
                                                     batch_size=batch_size,
                                                     dropout=dropout,
                                                     lr=lr,
                                                     wd=wd,
                                                     test_ratio=cv_test_ratio,
                                                     random_state=cv_random_state,
                                                     use_rearrangement=False)

# define a CQR object, computes the absolute residual error of points 
# located outside the estimated quantile neural network band 
nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())

# run CQR procedure
y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)

# compute and print average coverage and average length
coverage_cp_qnet, length_cp_qnet = helper.compute_coverage(y_test,
                                                           y_lower,
                                                           y_upper,
                                                           alpha,
                                                           "CQR Neural Net")

The prediction interval constructed by CQR Neural Net should also be valid. Compare the average length of the two methods (CQR Neural Net and CQR Random Forests).

## CQR neural net with rearrangement

Crossing quantiles is a longstanding problem in quantile regression. This issue does not affect the validity guarantee of CQR, which holds regardless of the accuracy or choice of the quantile regression method. However, it may affect the efficiency of the resulting conformal band.

Below we use the rearrangement method [3] to bypass the crossing quantile problem. Notice that we pass `use_rearrangement=True` as an argument to `AllQNet_RegressorAdapter`.

[3] Victor Chernozhukov, Iván Fernández‐Val, and Alfred Galichon, “Quantile and probability curves without crossing.” Econometrica 78, no. 3 (2010): 1093-1125.
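For a fixed input, rearrangement replaces the predicted quantiles by their sorted values, so that the estimates are non-decreasing in the quantile level. With only two levels this reduces to swapping the lower and upper predictions wherever they cross. The sketch below shows that two-level case on made-up numbers; the function name is hypothetical, and the implementation behind `use_rearrangement=True` in `helper.py` may differ, for instance by working on a finer grid of quantile levels.

```python
import numpy as np

def rearrange_quantiles(pred):
    """Sort each row so the predicted quantiles are non-decreasing in the level.

    pred has shape (n_samples, n_levels), with columns ordered by quantile level.
    """
    return np.sort(pred, axis=1)

# columns: predicted lower and upper quantiles; the second row crosses
pred_demo = np.array([
    [1.0, 2.0],
    [3.0, 2.5],
    [0.0, 0.0],
])

n_crossed = int(np.sum(pred_demo[:, 0] > pred_demo[:, 1]))
fixed_demo = rearrange_quantiles(pred_demo)

print("crossing rows:", n_crossed)
print(fixed_demo)
```

Rows that do not cross are left unchanged, so the operation only affects inputs where the initial estimates are inconsistent.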

In [None]:
# # define quantile neural network model, using the rearrangement algorithm
# quantile_estimator = helper.AllQNet_RegressorAdapter(model=None,
#                                                      fit_params=None,
#                                                      in_shape=in_shape,
#                                                      hidden_size=hidden_size,
#                                                      quantiles=quantiles_net,
#                                                      learn_func=nn_learn_func,
#                                                      epochs=epochs,
#                                                      batch_size=batch_size,
#                                                      dropout=dropout,
#                                                      lr=lr,
#                                                      wd=wd,
#                                                      test_ratio=cv_test_ratio,
#                                                      random_state=cv_random_state,
#                                                      use_rearrangement=True)

# # define the CQR object, computing the absolute residual error of points 
# # located outside the estimated quantile neural network band 
# nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())

# # run CQR procedure
# y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)

# # compute and print average coverage and average length
# coverage_cp_re_qnet, length_cp_re_qnet = helper.compute_coverage(y_test,
#                                                                  y_lower,
#                                                                  y_upper,
#                                                                  alpha,
#                                                                  "CQR Rearrangement Neural Net")