In [None]:
import holoviews as hv
hv.extension('bokeh')
hv.opts.defaults(hv.opts.Curve(width=500),
                 hv.opts.Scatter(width=500, size=4),
                 hv.opts.Histogram(width=500),
                 hv.opts.Slope(color='k', alpha=0.5, line_dash='dashed'),
                 hv.opts.HLine(color='k', alpha=0.5, line_dash='dashed'))                 

In [None]:
import numpy as np
import pandas as pd
import scipy.stats
import statsmodels.api as sm

# Multivariate linear regression

In the previous lesson we introduced the topic of linear regression and studied the simplest linear model: the line. 

In this lesson we will generalize this model to the multivariate case, i.e. when we want to predict a unidimensional (and continuous) variable $Y$ from a multidimensional (and continuous) variable $X$. You can interpret $X$ as a table where each column represents a particular attribute.

:::{admonition} Example
:class: tip

We want to predict a car's $Y=[\text{fuel consumption}]$ using its $X=[\text{weight}; \text{number of cylinders}; \text{average speed}; \ldots]$

:::


In what follows we will learn the mathematical formalism of the Ordinary Least Squares (OLS) method and how to implement it to fit regression models using Python.

## Ordinary Least Squares (OLS)

### Mathematical derivation

Consider a dataset $\{x_i, y_i\}_{i=1,\ldots,N}$ of *i.i.d.* observations with $y_i \in \mathbb{R}$ and $x_i \in \mathbb{R}^D$, with $D>1$. We want to find $\theta$  such that 

$$
y_i \approx \theta_0 + \sum_{j=1}^D \theta_j x_{ij}, \quad \forall i
$$

As before we start by writing the sum of squared errors (residuals) 

$$
\min_\theta L = \sum_{i=1}^N (y_i - \theta_0 - \sum_{j=1}^D \theta_j x_{ij})^2
$$

but in this case we will express it in matrix form 

$$
\min_\theta  L = \| Y - X \theta \|^2 = (Y - X \theta)^T (Y - X \theta)
$$

where

$$
X = \begin{pmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1D} \\ 
1 & x_{21} & x_{22} & \ldots & x_{2D} \\
1 & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \ldots & x_{ND} \end{pmatrix},  Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \theta =  \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_D \end{pmatrix}
$$

From here we can set the gradient of $L$ with respect to $\theta$ equal to zero

$$
\nabla_\theta L = -2 X^T (Y - X \theta) = 0
$$

to obtain the **normal equations**

$$
X^T X \theta  = X^T Y
$$

whose solution is

$$
\hat \theta = (X^T X)^{-1} X^T Y
$$

which is known as the **least squares (LS) estimator** of $\theta$
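
As a minimal sketch of the derivation above, the normal equations can be solved directly with `numpy`. The synthetic data, the variable names (`features`, `theta_true`) and the noise level are assumptions made only for illustration.

```python
import numpy as np

# Synthetic data (assumed for illustration): N observations, D attributes
rng = np.random.default_rng(0)
N, D = 200, 3
features = rng.normal(size=(N, D))
theta_true = np.array([1.0, 2.0, -1.0, 0.5])  # (theta_0, theta_1, ..., theta_D)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(N), features])
Y = X @ theta_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X theta = X^T Y without forming the inverse
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_hat)

# At the solution the residuals are orthogonal to the columns of X
print(X.T @ (Y - X @ theta_hat))
```

Using `np.linalg.solve` instead of explicitly inverting $X^T X$ is a common design choice because it is cheaper and usually more accurate.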

:::{dropdown} Relation with the Moore-Penrose inverse

Matrix $X^{\dagger} = (X^T X)^{-1} X^T $ is known as the left [*Moore-Penrose*](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse) pseudo-inverse. There is also the right pseudo-inverse $X^T (X X^T)^{-1}$. Together they act as a generalization of the inverse for non-square matrices. Further note that if $X$ is square and invertible then $X^{\dagger} = (X^T X)^{-1} X^T  = X^{-1} (X^T)^{-1} X^T = X^{-1}$

:::
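
The relation with the pseudo-inverse can be checked numerically. The sketch below uses random matrices (an assumption for illustration) and compares the left pseudo-inverse formula against `np.linalg.pinv`.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))  # tall matrix with linearly independent columns

# Left pseudo-inverse from the formula (X^T X)^{-1} X^T
left_pinv = np.linalg.inv(X.T @ X) @ X.T

# It should agree with numpy's SVD-based pseudo-inverse ...
matches_pinv = np.allclose(left_pinv, np.linalg.pinv(X))
# ... and act as a left inverse: X^dagger X = I
is_left_inverse = np.allclose(left_pinv @ X, np.eye(4))

# For a square invertible matrix the pseudo-inverse is the ordinary inverse
S = rng.normal(size=(4, 4))
matches_inverse = np.allclose(np.linalg.pinv(S), np.linalg.inv(S))

print(matches_pinv, is_left_inverse, matches_inverse)
```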

:::{warning}

The OLS solution is only valid if $A=X^T X$ is invertible (non-singular). By construction $A \in \mathbb{R}^{(D+1)\times (D+1)}$ is a square symmetric matrix. For $A$ to be invertible we require that its determinant is not zero or equivalently

- The rank of $A$, i.e. the number of linearly independent rows or columns, is equal to $D+1$ 
- The eigenvalues/singular values of $A$ are strictly positive

:::
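
A quick way to see when this condition fails is to build a design matrix with a redundant column. In the sketch below (synthetic data, assumed for illustration) the rank is computed on $X$, which has the same rank as $A = X^T X$, and the condition number of $A$ is used as a rough indicator of near-singularity.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)

# Columns are linearly independent
X_ok = np.column_stack([np.ones(N), x1, x2])
# Third column is an exact multiple of the second: A = X^T X is singular
X_bad = np.column_stack([np.ones(N), x1, 2.0 * x1])

rank_ok = np.linalg.matrix_rank(X_ok)
rank_bad = np.linalg.matrix_rank(X_bad)
cond_ok = np.linalg.cond(X_ok.T @ X_ok)
cond_bad = np.linalg.cond(X_bad.T @ X_bad)

print("independent columns:", rank_ok, cond_ok)
print("collinear columns:  ", rank_bad, cond_bad)
```

A rank lower than the number of columns of $X$, or a very large condition number, suggests that the normal equations do not have a unique (or numerically stable) solution.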


:::{note}

The solution we found for the univariate case in the previous lesson is a particular case of the OLS solution

:::

:::{dropdown} Proof

The solution for the univariate case was

$$
\begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2\\\end{pmatrix} \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix}  = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} 
$$

which can be rewritten as

$$
\begin{align}
\begin{pmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_N \end{pmatrix} 
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix} 
\begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix}  &= 
\begin{pmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_N \end{pmatrix} 
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \nonumber \\
X^T X \theta &= X^T Y \nonumber
\end{align}
$$

:::

### Fitting a hyperplane using `numpy`

The [`linalg`](https://numpy.org/doc/stable/reference/routines.linalg.html) submodule of the `numpy` library provides

```python
np.linalg.lstsq(X, # a (N, D) shaped ndarray
                Y, # a (N, ) shaped ndarray 
                rcond=None # Cut-off for small singular values, see note below
               )
```

which returns 

- The OLS solution: $\hat \theta = (X^T X)^{-1} X^T Y$
- The sum of squared residuals
- The rank of matrix $X$
- The singular values of matrix $X$

:::{note}

For a near-singular $A=X^T X$ we might not be able to obtain the solution using numerical methods. Conditioning can help stabilize the solution. Singular values of $X$ smaller than $\epsilon$ times the largest singular value are treated as zero when `rcond=epsilon` is set in the call to `lstsq`

:::
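
The effect of `rcond` can be illustrated with two almost identical columns. This is a sketch on synthetic data; the noise scales and the threshold `1e-4` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 1e-6 * rng.normal(size=N)  # almost an exact copy of x1
X = np.column_stack([np.ones(N), x1, x2])
Y = 1.0 + 2.0 * x1 + 0.1 * rng.normal(size=N)

# Default cut-off (machine precision): the tiny singular value is kept
theta_full, _, rank_full, singvals = np.linalg.lstsq(X, Y, rcond=None)

# Larger cut-off: the tiny singular value is discarded and the
# minimum-norm solution spreads the slope between the two columns
theta_cut, _, rank_cut, _ = np.linalg.lstsq(X, Y, rcond=1e-4)

print("singular values:", singvals)
print("rcond=None :", rank_full, theta_full)
print("rcond=1e-4 :", rank_cut, theta_cut)
```

With the larger cut-off the effective rank drops by one and the two nearly collinear coefficients become approximately equal, instead of taking large values of opposite sign.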

Let's test `lstsq` on the following database of ice-cream consumption from 

In [None]:
df = pd.read_csv('data/ice_cream.csv', header=0, index_col=0)
df.columns = ['Consumption', 'Income', 'Price', 'Temperature']
display(df.head())

The `corr` method of the `pandas` dataframe returns the pairwise correlations between the variables

In [None]:
display(df.corr())

Observations:

- Temperature has a high positive correlation with consumption
- Price has a low negative correlation with consumption
- Income has an almost null correlation with consumption

Let's train a multivariate linear regressor for ice-cream consumption as a function of the other variables

In [None]:
Y = df["Consumption"].values
X = df[["Income", "Price", "Temperature"]].values

- We will standardize the independent variables so that their scale is the same
- We will incorporate a column with ones to model the intercept ($\theta_0$) of the hyperplane

In [None]:
X = (X - np.mean(X, axis=0, keepdims=True))/np.std(X, axis=0, keepdims=True)
X = np.concatenate((np.ones(shape=(X.shape[0], 1)), X), axis=1)

theta, sse, rank, singvals = np.linalg.lstsq(X, Y, rcond=None) # sse: sum of squared residuals
hatY = np.dot(X, theta) # Predicted Y

To assess the quality of the fitted model we can visualize the predicted consumption versus actual (real) consumption or the residuals as a function of the latter and/or the independent variables

In [None]:
p1 = hv.Scatter((Y, hatY), 'Real', 'Predicted').opts(width=330) * hv.Slope(slope=1, y_intercept=0)
p2 = hv.Scatter((Y, Y - hatY), 'Real', 'Residuals').opts(width=330) * hv.HLine(0)
hv.Layout([p1, p2]).cols(2)

In [None]:
p = []
for var_name in ["Income", "Price", "Temperature"]:
    p.append(hv.Scatter((df[var_name].values, Y - hatY), var_name, 'Residuals').opts(width=330) * hv.HLine(0))
hv.Layout(p).cols(3).opts(hv.opts.Scatter(width=280, height=250))

The predicted consumption follows the real consumption closely. There is also no apparent correlation in the residuals.

But some important questions remain

:::{important}

- How significant is the contribution of each of the independent variables to the prediction?
- How to measure in a quantitative way the quality of the fitted model?

:::

For this we need to view OLS from a statistical perspective

## Statistical perspective of OLS

Up to now we have viewed regression from a deterministic (optimization) perspective. To understand its properties and perform inference we seek a statistical interpretation. 

Let's say that we have $\{x_i, y_i\}_{i=1,\ldots,N}$ *i.i.d.* observations from a unidimensional target variable $Y$ and a **D-dimensional** independent variable $X$. We will assume that our measurements of $Y$ consist of the **true model** plus **white Gaussian noise**, *i.e.*

$$
\begin{align}
y_i &= f_\theta(x_i) + \varepsilon_i \nonumber \\
&= \theta_0 + \sum_{j=1}^D \theta_j x_{ij} + \varepsilon_i 
\end{align}
$$

where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Then the log likelihood of $\theta$ is

$$
\begin{align}
\log L(\theta) &= \log \prod_{i=1}^N \mathcal{N}(y_i | f_\theta(x_i), \sigma^2) \nonumber \\
&= \sum_{i=1}^N \log \mathcal{N}(y_i | f_\theta(x_i), \sigma^2) \nonumber \\
&= -\frac{N}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f_\theta(x_i))^2\nonumber \\
&= -\frac{N}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} (Y-X\theta)^T (Y - X\theta), \nonumber 
\end{align}
$$

and the maximum likelihood solution for $\theta$ can be obtained by dropping the term that does not depend on $\theta$ and solving

$$
\max_\theta - \frac{1}{2\sigma^2} (Y-X\theta)^T (Y - X\theta).
$$

Note that this is equivalent to 

$$
\min_\theta \frac{1}{2\sigma^2} (Y-X\theta)^T (Y - X\theta),
$$

which yields 

$$
\hat \theta = (X^T X)^{-1} X^T Y
$$

:::{important}

The least squares solution is equivalent to the maximum likelihood solution under *i.i.d.* samples and Gaussian noise

:::
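
This equivalence can be verified numerically by minimizing the negative log-likelihood with a generic optimizer and comparing against the least squares solution. The sketch below assumes synthetic data and a known noise level `sigma`; `scipy.optimize.minimize` is used only as a generic optimizer for the check.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, sigma = 300, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
theta_true = np.array([0.5, -1.0, 2.0])
Y = X @ theta_true + sigma * rng.normal(size=N)

def neg_log_likelihood(theta):
    """Negative Gaussian log-likelihood of theta for known sigma."""
    resid = Y - X @ theta
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + 0.5 * resid @ resid / sigma**2

def neg_log_likelihood_grad(theta):
    """Gradient of the negative log-likelihood with respect to theta."""
    return -X.T @ (Y - X @ theta) / sigma**2

theta_ml = minimize(neg_log_likelihood, x0=np.zeros(3),
                    jac=neg_log_likelihood_grad, method="BFGS").x
theta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]

print(theta_ml)
print(theta_ols)
```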

### Statistical properties of the OLS solution

Let $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)$, where $\varepsilon \sim \mathcal{N}(0, I \sigma^2)$ 

Is the OLS estimator unbiased?

$$
\begin{align}
\mathbb{E}[\hat \theta] &= \mathbb{E}[(X^T X)^{-1} X^T Y] \nonumber \\
&= \mathbb{E}[(X^T X)^{-1} X^T (X \theta + \varepsilon)] \nonumber \\
&= \theta + (X^T X)^{-1} X^T \mathbb{E}[\varepsilon] \\
& = \theta
\end{align}
$$

> YES! 

What is the variance of the estimator? 

$$
\begin{align}
\mathbb{E}[(\hat \theta - \mathbb{E}[\hat\theta])(\hat \theta - \mathbb{E}[\hat\theta])^T] &= \mathbb{E}[((X^T X)^{-1} X^T \varepsilon) ((X^T X)^{-1} X^T \varepsilon)^T] \nonumber \\
&= (X^T X)^{-1} X^T  \mathbb{E}[\varepsilon \varepsilon^T] X ((X^T X)^{-1})^T  \nonumber \\
&= (X^T X)^{-1} X^T  \mathbb{E}[(\varepsilon-0) (\varepsilon-0)^T] X (X^T X)^{-1}  \nonumber \\
& =\sigma^2 (X^T X)^{-1}
\end{align}
$$

and typically we estimate the variance of the noise using the unbiased estimator

$$
\begin{align}
\hat \sigma^2 &= \frac{1}{N-D-1} \sum_{i=1}^N (y_i - \hat\theta_0 - \sum_{j=1}^D \hat\theta_j x_{ij})^2 \nonumber \\
& = \frac{1}{N-D-1} (Y-X\hat\theta)^T (Y-X\hat\theta)
\end{align}
$$

**The Gauss-Markov Theorem:** The least squares estimate of $\theta$ has the smallest variance among all linear unbiased estimators (Hastie, 3.2.2) 
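
The expected value and covariance of the estimator can be checked with a small Monte Carlo experiment. The sketch below keeps $X$ fixed and redraws the noise many times; the sample sizes and parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, sigma, n_trials = 50, 2, 1.0, 5000
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])
theta_true = np.array([1.0, -2.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

# One noisy realization of Y per row, all sharing the same design matrix
Ys = X @ theta_true + sigma * rng.normal(size=(n_trials, N))

# Row-wise OLS: theta_hat^T = y^T X (X^T X)^{-1}, since (X^T X)^{-1} is symmetric
estimates = Ys @ X @ XtX_inv

# Noise variance estimates with the N - D - 1 denominator
resid = Ys - estimates @ X.T
sigma2_hats = np.sum(resid**2, axis=1) / (N - D - 1)

empirical_mean = estimates.mean(axis=0)
empirical_cov = np.cov(estimates, rowvar=False)
theoretical_cov = sigma**2 * XtX_inv

print("mean of estimates:", empirical_mean)
print("max covariance gap:", np.abs(empirical_cov - theoretical_cov).max())
print("mean of sigma^2 estimates:", sigma2_hats.mean())
```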

### Inference and hypothesis tests for OLS

We found the expected value and the variance of $\hat \theta$. Because $\hat \theta$ is a linear function of the Gaussian noise $\varepsilon$, it is also Gaussian

$$
\hat \theta \sim \mathcal{N}(\theta, \sigma^2 (X^T X)^{-1})
$$

and the estimator of the noise variance is distributed as a scaled chi-square

$$
\hat \sigma^2 \sim  \frac{\sigma^2}{N-D-1} \chi_{N-D-1}^2
$$

With this we have all the ingredients to find confidence intervals and do hypothesis tests on $\hat \theta$
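
As a hedged sketch of how these ingredients combine: dividing each coefficient by its estimated standard error gives a statistic that follows a Student's t distribution with $N-D-1$ degrees of freedom under the null hypothesis of a zero coefficient. The data below is synthetic and the variable names are illustrative.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(6)
N, D = 80, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])
theta_true = np.array([1.0, 0.0, 3.0])  # the second coefficient is truly zero
Y = X @ theta_true + rng.normal(size=N)

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ theta_hat
dof = N - D - 1
sigma2_hat = resid @ resid / dof

# Standard errors from the diagonal of sigma^2 (X^T X)^{-1}
std_err = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

# Two-sided t-test of H0: theta_i = 0
t_stat = theta_hat / std_err
p_values = 2 * scipy.stats.t.sf(np.abs(t_stat), df=dof)

# 95% confidence intervals
t_crit = scipy.stats.t.ppf(0.975, df=dof)
ci_low = theta_hat - t_crit * std_err
ci_high = theta_hat + t_crit * std_err

for i in range(D + 1):
    print(f"theta_{i}: {theta_hat[i]:.3f} [{ci_low[i]:.3f}, {ci_high[i]:.3f}], p={p_values[i]:.3g}")
```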

To assess the significance of our model we might try to reject the following *hypotheses*

- One of the parameters (slopes) is zero (t-test)

    $\mathcal{H}_0: \theta_i = 0$
    
    $\mathcal{H}_A: \theta_i \neq 0$
    
    
- All parameters are zero (F-test)

    $\mathcal{H}_0: \theta_1 = \theta_2 = \ldots = \theta_D = 0$

    $\mathcal{H}_A:$ At least one parameter is not zero


- A subset of the parameters are zero (ANOVA)

    $\mathcal{H}_0: \theta_i = \theta_j =0 $

    $\mathcal{H}_A:$ $\theta_i \neq 0 $ or $\theta_j \neq 0 $
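
The last two tests can both be written as a comparison between a full model and a reduced (nested) model in which the tested parameters are removed. The sketch below uses the standard F statistic for nested linear models on synthetic data; the helper `sum_squared_residuals` is a hypothetical name introduced for illustration.

```python
import numpy as np
import scipy.stats

def sum_squared_residuals(X, Y):
    """Sum of squared residuals of the OLS fit of Y on X."""
    theta = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ theta
    return resid @ resid

rng = np.random.default_rng(7)
N = 120
feats = rng.normal(size=(N, 3))
X_full = np.column_stack([np.ones(N), feats])
Y = 1.0 + 2.0 * feats[:, 0] + rng.normal(size=N)  # only the first attribute matters

dof = N - X_full.shape[1]  # N - D - 1
sse_full = sum_squared_residuals(X_full, Y)

# Subset test, H0: theta_2 = theta_3 = 0 (reduced model drops two columns)
q = 2
sse_reduced = sum_squared_residuals(X_full[:, :2], Y)
F_subset = ((sse_reduced - sse_full) / q) / (sse_full / dof)
p_subset = scipy.stats.f.sf(F_subset, q, dof)

# Overall test, H0: theta_1 = theta_2 = theta_3 = 0 (intercept-only model)
sse_null = sum_squared_residuals(X_full[:, :1], Y)
F_all = ((sse_null - sse_full) / 3) / (sse_full / dof)
p_all = scipy.stats.f.sf(F_all, 3, dof)

print("subset test :", F_subset, p_subset)
print("overall test:", F_all, p_all)
```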
    


We can use the [`OLS`](https://www.statsmodels.org/stable/regression.html) class of the `statsmodels` Python library to perform all these tests 

First we create the model by giving the target and independent variables. In `statsmodels` jargon these are called endogenous and exogenous, respectively. Then we call the `fit` method

The coefficients obtained are equivalent to those we found with `numpy`

In [None]:
mod = sm.OLS(Y, X, hasconst=True)
res = mod.fit()
display(theta, 
        res.params)

The `summary` method gives us 

- the `R-squared` statistic of the model
- the `F-statistic` and its p-value
- A table with the values of `theta`, their standard errors, `t-statistics`, p-values and confidence intervals

In [None]:
display(res.summary(yname="Consumption", 
                    xname=["Intercept", "Income", "Price", "Temperature"],
                    alpha=0.05))

Observations from the results table:

- The F-test tells us that we can reject the hypothesis that all coefficients are null
- The t-test tells us that we cannot reject the null hypothesis that the price coefficient is null

The $r^2$ statistic for the multivariate case is defined as 

$$
\begin{align}
r^2 &= 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} \nonumber \\
&= 1 - \frac{Y^T(I-X(X^TX)^{-1}X^T)Y}{Y^T (I - \frac{1}{N} \mathbb{1}^T \mathbb{1} ) Y} \nonumber \\
&= 1 - \frac{SS_{res}}{SS_{total}} \nonumber
\end{align}
$$

where $\mathbb{1} = (1, 1, \ldots, 1)$ is a row vector of $N$ ones and $\bar y$ is the mean of the observed targets. It has the same interpretation that was given in the previous lecture
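
Both expressions for $r^2$ can be compared numerically. The following sketch uses synthetic data (an assumption for illustration); the vector of ones is stored as a column, so the averaging matrix is written as `ones @ ones.T / N`.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 60
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
Y_hat = X @ theta_hat

# Sum-based definition
ss_res = np.sum((Y - Y_hat) ** 2)
ss_total = np.sum((Y - Y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_total

# Matrix-based definition
H = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the columns of X
ones = np.ones((N, 1))
C = np.eye(N) - ones @ ones.T / N      # centering matrix
r2_matrix = 1.0 - (Y @ (np.eye(N) - H) @ Y) / (Y @ C @ Y)

print(r2, r2_matrix)
```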


:::{important}

We can trust the test only if our assumptions are true. The assumptions in this case are

- Relation between X and Y is linear
- Errors/noise follows a multivariate normal with covariance $I\sigma^2$

:::


Verify these assumptions by

1. Checking the residuals for normality. Are there outliers that we should remove?
1. Checking for absence of correlation in the residuals
1. Checking whether the variance of the errors is constant
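
One possible way to run these checks is sketched below. The specific diagnostics (Shapiro-Wilk test, Durbin-Watson statistic and a rank correlation between absolute residuals and fitted values) are assumptions chosen for illustration, not the only options, and the data is synthetic.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(9)
N = 150
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=N)

theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
fitted = X @ theta_hat
resid = Y - fitted

# 1. Normality: Shapiro-Wilk test (null hypothesis: residuals are Gaussian)
shapiro_stat, shapiro_p = scipy.stats.shapiro(resid)

# 2. Serial correlation: Durbin-Watson statistic, values near 2 suggest none
durbin_watson = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# 3. Constant variance: rank correlation between |residual| and fitted value,
#    a clear monotonic trend would hint at heteroscedasticity
spearman_rho, spearman_p = scipy.stats.spearmanr(np.abs(resid), fitted)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Durbin-Watson:", durbin_watson)
print("Spearman rho, p-value:", spearman_rho, spearman_p)
```

These numerical checks complement, rather than replace, a visual inspection of the residual plots.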


If the variance of the error is not constant (heteroscedastic) we can use the  **Weighted Least Squares** estimator

## Extra: Weighted Least Squares (WLS)

Before we assumed that the noise was homoscedastic (constant variance). We will generalize to the heteroscedastic case.

We can write the multivariate linear regression model with observations subject to Gaussian noise with changing variance as

$$
y_i = \theta_0 + \sum_{j=1}^D \theta_j x_{ij} + \varepsilon_i, \forall i \quad \text{and} \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_i^2)
$$


With respect to OLS the only difference is that $\sigma_i \neq \sigma$



In this case the maximum likelihood solution is 

$$
\hat \theta = (X^T \Sigma^{-1}X)^{-1} X^T \Sigma^{-1} Y
$$

where

$$
\Sigma = \begin{pmatrix} 
\sigma_1^2 & 0 &\ldots & 0 \\
0 & \sigma_2^2 &\ldots & 0 \\
\vdots & \vdots &\ddots & \vdots \\
0 & 0 &\ldots & \sigma_N^2 \\
\end{pmatrix}
$$

And the distribution of $\hat \theta$ is

$$
\hat \theta \sim \mathcal{N}( \theta,  (X^T \Sigma^{-1} X)^{-1} )
$$
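
A minimal sketch of the WLS estimator, assuming the per-observation noise standard deviations `sigmas` are known (in practice they usually have to be estimated). The synthetic data and names are illustrative. Since $\Sigma$ is diagonal, the products with $\Sigma^{-1}$ reduce to element-wise weighting.

```python
import numpy as np

rng = np.random.default_rng(10)
N = 200
x = rng.uniform(0.0, 1.0, size=N)
X = np.column_stack([np.ones(N), x])
theta_true = np.array([1.0, 2.0])

# Heteroscedastic noise: the standard deviation grows with x (assumed known)
sigmas = 0.1 + 2.0 * x
Y = X @ theta_true + sigmas * rng.normal(size=N)

# Diagonal of Sigma^{-1}
w = 1.0 / sigmas**2

# theta_hat = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} Y
XtWX = X.T @ (w[:, None] * X)
theta_wls = np.linalg.solve(XtWX, X.T @ (w * Y))

# Covariance of the WLS estimator for known Sigma (standard result)
cov_wls = np.linalg.inv(XtWX)

# Equivalent formulation: rescale each row by 1/sigma_i and run OLS
theta_check = np.linalg.lstsq(X / sigmas[:, None], Y / sigmas, rcond=None)[0]

print(theta_wls, theta_check)
print(np.sqrt(np.diag(cov_wls)))
```

The row-rescaling formulation shows that WLS is OLS applied to data in which each observation has been divided by its noise standard deviation.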
