#### LICENSE
These notes are released under the 
"Creative Commons Attribution-ShareAlike 4.0 International" license. 
See the **human-readable version** [here](https://creativecommons.org/licenses/by-sa/4.0/)
and the **real thing** [here](https://creativecommons.org/licenses/by-sa/4.0/legalcode). 

# Introduction, goals, general setting

Traditionally (i.e. 60 or
70 years ago), statistical analyses (Data Science analyses? 
Learning methods?) started with a
question of interest (e.g. is this fertilizer better than that one?
does this drug reduce cholesterol levels?), 
and then data was collected, typically via 
an experiment (under somewhat controlled circumstances). More recently,
however, the analyses often start from an existing data set. 
When Statistics (Data Science? Learning?) is involved, 
it is generally believed that there exists an unknown 
stochastic phenomenon that generated the data. A model for this
process is postulated, which often (but not always) depends on
a set of parameters. Queries about the model (including
interpreting its components) are 
typically phrased in terms of questions about the parameters,
and these are answered based on estimated values for them
(think of confidence intervals, point estimators, etc.) 
When the interest is in constructing
a predictor, the parameters are of no interest in themselves, 
but they still need to be estimated in order to use the
model with future data. 

Estimation methods are generally not unique. Historically
they have been chosen based on their performance 
(e.g. statistical properties of the resulting estimators,
computational cost or feasibility, etc.) Common
criteria include accuracy (e.g. mean squared error: MSE) and prediction
power (e.g. mean squared prediction error: MSPE). 
These criteria themselves usually need to 
be estimated or computed, which requires making assumptions 
(e.g. that the data follow a certain family
of distributions, that the model is a "good enough"
approximation to the stochastic phenomenon generating
the data, etc.) 
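
To fix ideas, the two criteria mentioned above are usually written as follows. This is only a sketch, in notation that is not used elsewhere in these notes: $\hat\theta$ denotes an estimator of a scalar parameter $\theta$, and $\hat f$ a predictor built on the training set and evaluated on a new, independent pair $(X_{new}, Y_{new})$.

$$
\mathrm{MSE}(\hat\theta) = E\left[ (\hat\theta - \theta)^2 \right] = \mathrm{Var}(\hat\theta) + \left( E[\hat\theta] - \theta \right)^2,
\qquad
\mathrm{MSPE}(\hat f) = E\left[ \left( Y_{new} - \hat f(X_{new}) \right)^2 \right].
$$

Both expectations are taken with respect to the unknown distribution of the data, which is why they cannot be computed or estimated without making assumptions.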

Once sufficient assumptions have been made, "optimal"
estimation methods can be derived. The most famous 
example is the large class of Maximum Likelihood
Estimators. 

**Question**: to what extent do these optimal 
methods remain "good enough" if the assumptions
are relaxed? **Answer**: none; they usually fail badly 
very quickly.
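
A small simulation can make this concrete. The sketch below is my own illustration, not part of the analyses in these notes. It compares the sample mean (the MLE of the centre of a Gaussian distribution) with the sample median, first on clean Gaussian samples and then on samples where a small proportion of points comes from a different distribution. The sample size, the contamination proportion and the location of the atypical points are arbitrary choices, and `rcontam` is a hypothetical helper.

```r
# Monte Carlo comparison of the sample mean and the sample median
# as estimators of a location parameter whose true value is 0.
set.seed(123)
n    <- 50     # sample size (arbitrary choice)
nrep <- 2000   # number of Monte Carlo replications
eps  <- 0.10   # proportion of atypical observations (arbitrary choice)

# Hypothetical helper: draws n observations, each one coming from
# N(10, 1) with probability eps and from N(0, 1) otherwise.
rcontam <- function(n, eps) {
  bad <- rbinom(n, size = 1, prob = eps) == 1
  ifelse(bad, rnorm(n, mean = 10), rnorm(n))
}

# Empirical MSE of a vector of estimates when the true value is 0
mse <- function(est) mean(est^2)

# Each column of these 2 x nrep matrices holds (mean, median) for one sample
clean  <- replicate(nrep, { x <- rnorm(n);          c(mean(x), median(x)) })
contam <- replicate(nrep, { x <- rcontam(n, eps);   c(mean(x), median(x)) })

res <- rbind(clean        = apply(clean,  1, mse),
             contaminated = apply(contam, 1, mse))
colnames(res) <- c("mean", "median")
res
```

On clean data the mean has the smaller MSE, as the theory says it should. With 10% of atypical points the MSE of the mean is dominated by its bias, while the median is much less affected.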

**Question**: can we find estimation methods 
that remain "good enough" under weaker 
assumptions (a wider range of possibilities) for
the stochastic phenomenon generating
the data? **Answer**: Yes. This is what we do 
in Robust Statistics. 

In these notes, "weaker assumptions" will mean
relaxing the assumptions on the distribution 
of the data, but we will still rely on the
model being a "good enough"
approximation to the process that generated
the data. In this sense we talk about 
*Robust Statistics Given A Model*. 

One useful relaxation in practice 
is accepting that the data may not be
*homogeneous*, i.e. that there may be either
errors or observations that follow a different process. 
Another possible departure from the assumptions 
accompanying the model is that the 
distributions may belong to a similar but different family 
(e.g. that the data follow an elliptical
distribution instead of a Gaussian one). 
Here we will mostly consider the former kind
of departures, which have been called "gross error" model
violations.
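
One common way to formalize the "gross error" setting is the so-called Tukey-Huber contamination model. The symbols below are introduced only for this sketch and are not defined elsewhere in these notes. The observations are assumed to follow

$$
F_\epsilon = (1 - \epsilon) \, F_0 + \epsilon \, H , \qquad 0 \le \epsilon < 1/2 ,
$$

where $F_0$ is the distribution postulated by the model, $H$ is an arbitrary and unspecified distribution that generates the atypical observations, and $\epsilon$ is the (unknown) proportion of contamination. The goal is then to make inferences about $F_0$ using data that come from $F_\epsilon$.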

In summary, we are interested in the problem of 
performing statistical analysis when the data may contain
atypical observations. These atypical points may be 
"errors" (due to recording mistakes, equipment malfunction, etc.)
or may be due to the presence of observations that follow
a different stochastic phenomenon from the one generating
the majority of
the data. Generally our interest is the latter. Detecting
(accurately identifying) potential "outliers" in the training
set is also a common goal of Robust Statistics. Finally,
we will also consider the problem of obtaining
reliable predictions on future ("good") data when the 
training set may have atypical points. 

## A very personal "agenda" 

Rather than attempting a thorough, deep and exhaustive 
discussion of robust methods for a few 
simple models, in these notes I will try to illustrate 
the main current ideas and approaches as applied to 
models beyond univariate or multivariate location/scale 
and linear regression models.  

For a detailed treatment of the basic concepts and
techniques developed in Robust Statistics over the last 60 years, 
please refer to the following books:

> Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2005, 2011). Robust statistics: The approach based on influence functions. Hoboken, NJ: Wiley. [UBC Library link](http://tinyurl.com/y2bnnzss)

> Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). Hoboken, NJ: Wiley. [UBC Library link](http://tinyurl.com/yxpyrsqq)

> Maronna, R. A., Martin, R. D., Yohai, V. J., & Salibian-Barrera, M. (2019). Robust statistics: Theory and methods (with R) (2nd ed.). Hoboken, NJ: Wiley. doi:10.1002/9781119214656 [UBC Library link](http://tinyurl.com/yy4heaad)


# Simple examples of robust estimators for linear regression


### Simple linear regression
We will use the `phosphor` data in 
package `robustbase` (for `R`). 
Details can be found using `help(phosphor, package='robustbase')`. 
The response variable is `plant` and, 
to simplify the example, we will use only one explanatory variable,
`organic`. Furthermore, in order to 
highlight the potential impact of outliers, we will 
change the position of the single outlier in these data (from the 
right end of the plot to the left):

In [None]:
data(phosphor, package='robustbase')
library(RobStatTM)
# artificially change the location of the outlier 
# for illustration purposes
phosphor[17, 'organic'] <- 15
plot(plant ~ organic, data=phosphor, pch=19, col='gray50')

We now fit a robust estimator for linear regression models 
(an MM-estimator) and overlay it in red on the above plot. 
The usual least squares estimator is shown in blue:

In [None]:
MMfit <- lmrobdetMM(plant ~ organic, data=phosphor)
LSfit <- lm(plant ~ organic, data=phosphor)
plot(plant ~ organic, data=phosphor, pch=19, col='gray50')
abline(MMfit, lwd=4, col='tomato3')
abline(LSfit, lwd=4, col='steelblue3', lty=2)
legend('topright', lwd=3, lty=c(1, 2), col=c('tomato3', 'steelblue3'), 
       legend=c('MM', 'LS'))

Note that if outliers were not present in the data, then the 
robust and the least squares estimators would give very similar fits. 
The green line in the next plot corresponds to the LS estimator
computed without the outlier. The robust fit 
is indistinguishable from this last one:

In [None]:
plot(plant ~ organic, data=phosphor, pch=19, col='gray50')
abline(MMfit, lwd=3, col='tomato3')
abline(LSfit, lwd=3, col='steelblue3')
LSclean <- lm(plant ~ organic, data=phosphor, subset=-17)
abline(LSclean, lwd=3, col='green3')
legend('topright', lwd=3, lty=1, col=c('tomato3', 'steelblue3', 'green3'), 
       legend=c('MM', 'LS', 'LS(clean)'))

We can also see that the regression coefficients estimated by the 
MM-estimator and by LS on the clean data are in fact very close:

In [None]:
cbind(MM=coef(MMfit), LS=coef(LSfit), LSclean=coef(LSclean))

Note that the MM-estimator is indistinguishable from the 
LS estimator computed on the clean data. This is the desired
result of using an efficient and robust estimator. 
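
To see how a robust fit achieves this, it can help to look at the weights it assigns to each observation. The sketch below is self-contained and uses a different implementation from the one above: `rlm(..., method = "MM")` from package `MASS`, applied to made-up data (not the `phosphor` data) in which the last point is turned into a high-leverage outlier. The sample size, the true line and the position of the outlier are arbitrary choices for illustration.

```r
library(MASS)  # provides rlm(); MASS is a recommended package shipped with R

set.seed(1)
n <- 30
d <- data.frame(x = seq(1, 10, length.out = n))
d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.5)   # true intercept 1, true slope 2
# turn the last observation into a high-leverage outlier
d$x[n] <- 30
d$y[n] <- 0

ls_fit   <- lm(y ~ x, data = d)                  # least squares, all points
mm_fit   <- rlm(y ~ x, data = d, method = "MM")  # MM-estimator (MASS version)
ls_clean <- lm(y ~ x, data = d, subset = -n)     # least squares, outlier removed

cbind(MM = coef(mm_fit), LS = coef(ls_fit), LSclean = coef(ls_clean))

# Final weights of the MM fit: close to 1 for points that follow the
# bulk of the data, close to 0 for points that are being downweighted
round(mm_fit$w, 2)

# Observations with large residuals relative to the robust scale estimate
which(abs(residuals(mm_fit)) / mm_fit$s > 2.5)
```

The outlier receives a weight that is essentially zero, so the MM fit behaves like a least squares fit on the remaining points without the outlier having to be identified and removed by hand.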


#### A synthetic toy example (diagnostics and estimation)

This example will illustrate that:

- outliers can be severely damaging without being "obviously" deviating;
- quantile regression estimators (L1) offer limited protection against atypical observations; and
- classical diagnostic tools may not work as advertised.

Our example contains $n = 200$ observations with $p = 6$
explanatory variables. The regression model is $Y = 
V_1 + 2 V_2 + V_3 + V_4 + V_5 + \varepsilon$, where 
$\varepsilon$ follows a $N(0, 1.7^2)$ distribution 
(i.e. its standard deviation is 1.7) and $V_6$ does not enter the model. 
Hence, the true vector of regression 
coefficients is `(1, 2, 1, 1, 1, 0)` and the true intercept is 
zero. The explanatory variables are all independent standard
normal random variables. I used the following code 
to generate the data:

In [None]:
n <- 200
p <- 6
set.seed(123)
x0 <- as.data.frame(matrix(rnorm(n*p), n, p))
x0$y <- with(x0, V1 + 2*V2 + V3 + V4 + V5 + rnorm(n, sd=1.7))

We now replace the last 21 observations (rows `n1 = 180` through `n = 200`, both included)
with outliers, for a total proportion of atypical observations of 21/200 = 10.5%, slightly above the nominal `eps` = 10%.

In [None]:
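eps <- .1
n1 <- ceiling(n*(1-eps))
# rows n1:n (n - n1 + 1 of them) become outliers: shifted explanatory
# variables paired with responses centred far below the regression surface
x0[n1:n, 1:p] <- matrix(rnorm((n-n1+1)*p, mean=+1.85, sd=.8), n-n1+1, p)
x0$y[n1:n] <- rnorm(n-n1+1, mean=-7, sd=1.7)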

These atypical observations cannot be seen easily in 
a pairwise plot, especially if one does not know 
in advance that they are present:

In [None]:
pairs(x0)

Standard diagnostic plots do not flag anything of 
importance either:

In [None]:
m0 <- lm(y~., data=x0)
par(mfrow=c(2,2))
plot(m0, which=c(1, 2, 5))
par(mfrow=c(1,1))

Note, for example, that all the Cook's distances are below 0.15. 
However, the estimated regression coefficients are
very different from the true ones
(1, 2, 1, 1, 1, 0):

In [None]:
cbind(LS=coef(m0), Truth=c(0,1, 2, 1, 1, 1, 0))
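
As a reminder of what the Cook's distances mentioned above measure, the sketch below recomputes them by hand from the usual formula and checks the result against `cooks.distance()`. It uses a small made-up data set that is unrelated to `x0`; the names `d`, `fit`, `k` and `cook_manual` exist only for this illustration.

```r
# Small synthetic regression (hypothetical data, unrelated to x0)
set.seed(42)
n <- 40
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 - d$x2 + rnorm(n)

fit <- lm(y ~ ., data = d)
r  <- residuals(fit)          # least squares residuals
h  <- hatvalues(fit)          # leverages (diagonal of the hat matrix)
s2 <- summary(fit)$sigma^2    # least squares estimate of the error variance
k  <- length(coef(fit))       # number of estimated coefficients

# Cook's distance: squared residual, scaled by k * s2, times a leverage factor
cook_manual <- r^2 / (k * s2) * h / (1 - h)^2

# Should agree with the built-in function up to rounding error
all.equal(cook_manual, cooks.distance(fit))
```

Every ingredient of the formula (residuals, error variance estimate, leverages) comes from the least squares fit itself. A group of outliers that pulls the fit towards itself and inflates the variance estimate can therefore keep all the distances small, which is one way to understand why this diagnostic may fail to flag them.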

We now compare the estimated regression 
coefficients obtained with two other methods:
an MM-estimator,
and the L1-estimator (which is the quantile
regression estimator for the median). We will later see in
the course that these estimators 
have different robustness properties.

In [None]:
library(RobStatTM)
myc <- lmrobdet.control(family='bisquare', efficiency=.85)
m1 <- lmrobdetMM(y~., data=x0, control=myc)
m3 <- quantreg::rq(y~., data=x0)
cbind(Truth=c(0, 1, 2, 1, 1, 1, 0),
      LS=coef(m0), L1=coef(m3), MM=coef(m1))
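
For reference, the three estimators compared above can be described, roughly and in notation introduced only for this sketch, in terms of different functions of the residuals $r_i(\beta) = y_i - x_i^\top \beta$, where $x_i$ includes a 1 for the intercept:

$$
\hat\beta_{LS} = \arg\min_\beta \sum_{i=1}^n r_i(\beta)^2 ,
\qquad
\hat\beta_{L1} = \arg\min_\beta \sum_{i=1}^n \left| r_i(\beta) \right| ,
\qquad
\hat\beta_{MM} : \ \text{a local minimum of } \sum_{i=1}^n \rho\left( \frac{r_i(\beta)}{\hat\sigma} \right) .
$$

Here $\rho$ is a bounded loss function (the bisquare family selected in `lmrobdet.control` above) and $\hat\sigma$ is a robust estimate of the residual scale obtained from an initial robust fit. The squared and absolute-value losses are unbounded, so a single residual can contribute arbitrarily much to the first two criteria, whereas a bounded $\rho$ caps the contribution of any one observation.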

We can also compare the diagnostic plots obtained with
the robust estimator, where the outliers are
clearly visible.

In [None]:
par(mfrow=c(2,2))
plot(m1, which=c(1, 2, 4))
par(mfrow=c(1,1))

We now add the LS estimator computed on the "clean" data:

In [None]:
m0.cl <- lm(y~., data=x0, subset = -(n1:n) )
cbind(Truth=c(0, 1, 2, 1, 1, 1, 0),
      LS=coef(m0), L1=coef(m3),  
      MM=coef(m1), LSclean = coef(m0.cl))

In [None]:
summary(m0)$coef