# **Linear Models (Part 2)**

Outlier Robust Regression

Noah Rubin

June 2021

# Huber Regressor 

Intro:
* The [Huber Regressor](https://towardsdatascience.com/regression-in-the-face-of-messy-outliers-try-huber-regressor-3a54ddc12516) model is designed to address the problem of outliers in a dataset, and so belongs to the family of models known as robust regression models
* It can be a good alternative to OLS: because OLS minimises squared errors, points with large residuals pull the fit strongly towards themselves, so outliers can distort the fit and lead to inaccurate predictions
* The approach was introduced by Peter Jost Huber in 1964, though subtle adjustments have been made over time
* The sklearn documentation mentions that its implementation is based on [this academic paper](https://artowen.su.domains/reports/hhu.pdf), published in 2006.

In the sklearn implementation, the Huber loss function applies a transformation to the error depending on its value, and the model is fit by minimising the quantity

$$J = \sum_{i=1}^n \bigg(\sigma + H_\epsilon\bigg(\frac{y_i - \hat{y}_i}{\sigma} \bigg)\sigma \bigg) + \alpha||\vec{\beta}||^2$$

where $y_i - \hat{y}_i$ is the residual for the $i$-th observation. The Huber regressor finds an optimal value for the scale $\sigma \in (0, \infty)$, as well as the components of the coefficient vector $\vec{\beta}$, by minimising this loss function. The regularisation term $\alpha||\vec{\beta}||^2$ acts as the $L_2$ shrinkage penalty, and the function $H_\epsilon$ is piecewise, taking a scalar input $z$ such that

$$H_\epsilon(z) = \begin{cases}
z^2 & \text{if } |z| < \epsilon,\\
2\epsilon|z| - \epsilon^2 & \text{if } |z| \geq \epsilon
\end{cases}$$
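
A minimal NumPy sketch of the piecewise function and the objective may make the definition more concrete. The names `huber_h` and `huber_objective` are illustrative and not part of scikit-learn. The objective follows the form given in the scikit-learn user guide, where $H_\epsilon$ is multiplied by $\sigma$ inside the sum.

```python
import numpy as np

def huber_h(z, epsilon=1.35):
    """Piecewise H_epsilon(z): quadratic for |z| < epsilon, linear beyond it."""
    z = np.asarray(z, dtype=float)
    quadratic = z ** 2
    linear = 2 * epsilon * np.abs(z) - epsilon ** 2
    return np.where(np.abs(z) < epsilon, quadratic, linear)

def huber_objective(y, y_hat, beta, sigma, epsilon=1.35, alpha=0.0001):
    """Objective J: sum of (sigma + H(residual / sigma) * sigma) plus the L2 penalty."""
    residuals = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    data_term = np.sum(sigma + huber_h(residuals / sigma, epsilon) * sigma)
    penalty = alpha * np.sum(np.asarray(beta, dtype=float) ** 2)
    return data_term + penalty

# Compare the Huber transformation with a plain squared loss
z_values = np.array([0.5, 1.35, 3.0, 10.0])
print(huber_h(z_values))
print(z_values ** 2)
```

The two branches agree at $|z| = \epsilon$ (both equal $\epsilon^2$), so the function is continuous. For large scaled residuals the growth is linear rather than quadratic: with $\epsilon = 1.35$, a scaled residual of 10 contributes 25.1775 instead of 100.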

---

According to the scikit-learn documentation, this minimisation process "makes sure that the loss function is not heavily influenced by the outliers while not completely ignoring their effect." The documentation also advises setting the threshold parameter $\epsilon$ to 1.35 "to achieve 95% statistical efficiency".
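
To illustrate the effect described above, here is a small sketch comparing `HuberRegressor` with ordinary least squares. The data is synthetic and purely an assumption for illustration (a line with slope 3 and five corrupted points); it is not the dataset used in this notebook.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic data (illustrative assumption): y = 3x + 1 plus small Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3 * x + 1 + rng.normal(scale=0.5, size=x.size)

# Corrupt the five right-most points with a large upward shift
y[-5:] += 40

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35).fit(X, y)

print(f"OLS slope:   {ols.coef_[0]:.3f}")
print(f"Huber slope: {huber.coef_[0]:.3f}")
print(f"Estimated scale (sigma): {huber.scale_:.3f}")
print(f"Samples flagged as outliers: {huber.outliers_.sum()}")
```

On data like this, the Huber slope should stay much closer to the true value of 3 than the OLS slope. The fitted `scale_` attribute corresponds to $\sigma$, and `outliers_` is a boolean mask of the samples whose scaled residual exceeds $\epsilon$.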

Resources

* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html
* https://scikit-learn.org/stable/modules/linear_model.html#huber-regression
* https://artowen.su.domains/reports/hhu.pdf
* https://cvxr.rbind.io/cvxr_examples/cvxr_huber-regression/
* https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3
* https://en.wikipedia.org/wiki/Robust_regression



In [2]:
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import HuberRegressor

# Personal display settings
#===========================

# Suppress scientific notation
np.set_printoptions(suppress=True)

# Get dataset values showing to 5dp
pd.options.display.float_format = '{:.5f}'.format
pd.set_option('display.max_colwidth', None)

# For clear plots with a nice background
# (on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid')
plt.style.use('seaborn-whitegrid')
%matplotlib inline

%load_ext autoreload
%autoreload 2

# python files
import data_prep
import helper_funcs



In [3]:
huber_pipeline = data_prep.create_pipeline(HuberRegressor())
huber_pipeline

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numeric',
                                                  Pipeline(steps=[('identity',
                                                                   FunctionTransformer())]),
                                                  ['GDP_cap']),
                                                 ('categorical',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['Status'])])),
                ('imputation', KNNImputer()), ('ss', StandardScaler()),
                ('model', HuberRegressor())])
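
The source of `data_prep.create_pipeline` is not shown here. As a rough sketch, a function producing the structure printed above might look like the following. The name `create_pipeline_sketch` is hypothetical, and the body is reconstructed only from the printed representation, so the real helper may differ.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import HuberRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

def create_pipeline_sketch(model):
    """Hypothetical stand-in for data_prep.create_pipeline, rebuilt from the printed repr."""
    # Numeric column passes through unchanged (FunctionTransformer with no func is the identity)
    numeric = Pipeline(steps=[("identity", FunctionTransformer())])
    # Categorical column is one-hot encoded, dropping the first level
    categorical = Pipeline(steps=[("ohe", OneHotEncoder(drop="first"))])
    preprocessor = ColumnTransformer(
        transformers=[
            ("numeric", numeric, ["GDP_cap"]),
            ("categorical", categorical, ["Status"]),
        ],
        remainder="passthrough",
    )
    return Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("imputation", KNNImputer()),
        ("ss", StandardScaler()),
        ("model", model),
    ])

sketch = create_pipeline_sketch(HuberRegressor())
print(list(sketch.named_steps))
```

The step names matter because hyperparameters are addressed as `<step name>__<parameter>`, for example `imputation__n_neighbors` or `model__alpha`.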

In [None]:
# Create parameter grid used in every model
param_grid = {
    'imputation__n_neighbors': np.arange(3, 16, 2), 
    'imputation__weights': ['uniform', 'distance'],
    'model__alpha': np.linspace(0.01, 3, 10)
}
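
As a quick sanity check on the size of this grid, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. This is only a sketch to show the search cost: the grid is copied from the cell above, and the variable name `n_candidates` is illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "imputation__n_neighbors": np.arange(3, 16, 2),  # 3, 5, ..., 15 (7 values)
    "imputation__weights": ["uniform", "distance"],  # 2 values
    "model__alpha": np.linspace(0.01, 3, 10),        # 10 values
}

# Every combination is one candidate configuration
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)  # 7 * 2 * 10 = 140
```

An exhaustive search over this grid would fit 140 candidates per cross-validation fold, so the total number of fits scales with the number of folds chosen.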