In [None]:
import holoviews as hv
hv.extension('bokeh')
hv.opts.defaults(hv.opts.Curve(width=500), 
                 hv.opts.Histogram(width=500))

In [None]:
import numpy as np
import scipy.stats

# Introduction


Many problems can be posed as "finding a relation" between factors/variables

This can be interpreted as predicting and/or explaining a variable given others

Some examples:

- Predicting sales given money spent in advertising
- Predicting chance to rain in Valdivia given temperature, pressure and humidity
- Predicting gasoline consumption of a car given acceleration, weight and number of cylinders
- Predicting chance to get lung cancer given number of smoked cigarettes per day, age and gender

We could ask

- Are these variables related?
- How strong and/or significant is the relationship?
- What is the nature of the relationship?

Answering these helps us **understand the underlying processes behind the data**

## Defining regression

**Regression** refers to a family of statistical methods to find **relationships** between **variables**

In general the relation is modeled as a function $g(\cdot)$ that maps one type of variable to another

- The input variable $X$ is called **independent variable** or feature
- The output variable $Y$ is called **dependent variable**, response or target

The mapping or function $g$ is called **predictor** or **regressor**

$$
g: X \to Y
$$

The objective is to learn $g$ such that we can predict $Y$ given $X$, *i.e.* $\mathbb{E}[Y|X]$ 

- **Regression** can be defined from a statistical perspective as a special case of model fitting (parameter estimation)
- In many books **Regression** is defined from a pure-optimization perspective (deterministic)
- **Regression** is considered part of the *supervised learning* paradigm. The difference between **Regression** and *classification* is the nature of the dependent variable (continuous vs categorical)

### Parametric vs non-parametric  regression

Regression methods can be broadly classified as either parametric or non-parametric 

In parametric regression 

- We know the model of the regressor
- The model has a finite number of parameters
- The parameters of the model are all we need to make predictions
- Simpler but with bigger assumptions (inductive bias)


In nonparametric regression

- There is no functional form for the regressor
- It can have an infinite number of parameters (and a finite number of hyperparameters)
- The regressor is defined from the training data
- More flexible but requires more data to fit it
- Examples: Splines, Support vector regression, Gaussian processes

In this lesson we will focus on parametric regression


### Parametric linear models for regression

Let 

- $X$ be a continuous D-dimensional variable (feature) and $Y$ be a continuous unidimensional variable (target) 
- $\{x_i, y_i\}$ with $i=1,\ldots,N$ be a set of $N$ *iid* observations of $X$ and $Y$
- $g_\theta$ be a model with an M-dimensional parameter $\theta$

Then we can define parametric regression as finding a value of $\theta$ such that 

$$
y_i \approx g_\theta(x_i),\quad i=1,\ldots, N
$$

The simplest parametric model is the **linear model**. A linear model gives rise to **linear regression**

:::{important}

The linear model is linear in $\theta$ but not necessarily in $X$

:::

For example a model with unidimensional input

$$
g_\theta \left(x_i \right) = \theta_0 + \theta_1 x_i  + \theta_2 x_i^2,
$$

is a linear model and

$$
g_\theta(x_i) = \theta_0 + \theta_1 \log(x_i),
$$

is also a linear model but

$$
g_\theta(x_i) = \theta_0 + \log(x_i + \theta_1),
$$

is not a linear model
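
Because the first two models are linear in $\theta$, they can be written as a matrix of transformed inputs multiplied by $\theta$. The following is a minimal sketch of this idea for the quadratic model, assuming synthetic data and using `np.linalg.lstsq` as the solver. The names `Phi`, `true_theta` and `theta_quad` are illustrative and do not appear elsewhere in this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, illustrative data only
x = rng.uniform(0.5, 5.0, size=200)
true_theta = np.array([1.0, -2.0, 0.5])  # hypothetical parameters
y = true_theta[0] + true_theta[1] * x + true_theta[2] * x**2
y = y + 0.1 * rng.standard_normal(x.size)  # additive Gaussian noise

# The model is non-linear in x but linear in theta:
# y is approximately Phi @ theta, with one column per basis function
Phi = np.column_stack([np.ones_like(x), x, x**2])
theta_quad, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(theta_quad)  # should be close to true_theta
```

Replacing the column `x**2` with `np.log(x)` would give the second model. The third model cannot be written in this form because $\theta_1$ sits inside the logarithm.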

## The simplest linear model: The line

If we consider a one-dimensional variable $x_i \in \mathbb{R}, i=1,\ldots,N$, then the simplest linear model is

$$
g_\theta(x_i) = \theta_0 + \theta_1 x_i
$$

which has $M=2$ parameters. 

This corresponds to a line in $\mathbb{R}^2$ and we recognize

- $\theta_0$ as the intercept
- $\theta_1$ as the slope

If we consider a two-dimensional variable $x_i = (x_{i1}, x_{i2}) \in \mathbb{R}^2, i=1,\ldots,N$ then we obtain

$$
g_\theta(x_i) = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2}
$$

which has $M=3$ parameters. This corresponds to a plane in $\mathbb{R}^3$

The most general form assumes a D-dimensional variable $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD}), i=1,\ldots,N$ 

$$
g_\theta(x_i) = \theta_0 + \sum_{j=1}^D \theta_j x_{ij}
$$

which has $M=D+1$ parameters and corresponds to a hyperplane in $\mathbb{R}^M$
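
As a hedged illustration, the general form can be evaluated for all observations at once with a matrix-vector product. The function name `linear_model` and the arrays `X_demo` and `theta_demo` are made up for this sketch.

```python
import numpy as np

def linear_model(X, theta):
    """Evaluate theta_0 + sum_j theta_j * x_ij for each row of X (shape N x D)."""
    X = np.atleast_2d(X)
    theta = np.asarray(theta)
    return theta[0] + X @ theta[1:]

X_demo = np.array([[1.0, 2.0],
                   [0.0, -1.0],
                   [3.0, 0.5]])           # N=3 observations, D=2 features
theta_demo = np.array([0.5, 2.0, -1.0])   # M = D + 1 = 3 parameters

print(linear_model(X_demo, theta_demo))   # values: 0.5, 1.5, 6.0
```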

### Fitting the simplest linear model: Mathematics

Assuming that we have $\{x_i, y_i\}_{i=1,\ldots,N}$ *iid* observations from unidimensional variables X and Y

> How do we find $\theta_0$ and $\theta_1$ such that $y_i \approx \theta_0 + \theta_1 x_i, \forall i$?

Let's start by writing the squared residual (error) as 

$$
E_i^2 = (y_i - \theta_0 - \theta_1 x_i)^2,
$$

We can fit (train) the model with

$$
\min_{\theta} L = \sum_{i=1}^N E_i^2 = \sum_{i=1}^N (y_i - \theta_0 - \theta_1 x_i)^2,
$$

where $L$, the sum of squared errors, is our loss/cost function

:::{note}

Later we will see that this cost function arises when a Gaussian likelihood for $Y$ is assumed

:::
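
To make the loss $L$ concrete, here is a small sketch that evaluates the sum of squared errors for two candidate parameter vectors on toy data. The function `sum_of_squares` and the arrays `x_toy` and `y_toy` are hypothetical names used only for this example.

```python
import numpy as np

def sum_of_squares(theta, x, y):
    """Loss L = sum_i (y_i - theta_0 - theta_1 * x_i)^2."""
    residuals = y - theta[0] - theta[1] * x
    return np.sum(residuals**2)

x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])  # lies exactly on y = 1 + 2x

loss_exact = sum_of_squares([1.0, 2.0], x_toy, y_toy)    # exact line: loss 0.0
loss_shifted = sum_of_squares([0.0, 2.0], x_toy, y_toy)  # wrong intercept: loss 4.0
print(loss_exact, loss_shifted)
```

The parameters that generated the data give the smallest possible loss. Any other choice increases it.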

Setting the derivatives of this expression with respect to $\theta_0$ and $\theta_1$ to zero we obtain

$$
\hat \theta_1 = \frac{\text{Cov}[x, y]}{\text{Var}[x]}, \hat \theta_0 = \bar y - \hat \theta_1 \bar x
$$

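where $\bar x = \frac{1}{N}\sum_{i=1}^N x_i$ and $\bar y = \frac{1}{N}\sum_{i=1}^N y_i$ are the sample means,

$$
\text{Cov}[x, y] = \frac{1}{N} \sum_{i=1}^N (x_i - \bar x)(y_i - \bar y)
$$

and 

$$
\text{Var}[x] = \frac{1}{N} \sum_{i=1}^N (x_i - \bar x)^2
$$

The normalization constant cancels in the ratio, so using $1/(N-1)$ instead of $1/N$ yields the same $\hat \theta_1$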
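
The closed-form estimators can be checked numerically. The sketch below assumes synthetic data and the $1/N$ normalization for the sample covariance and variance. It compares the result against `np.polyfit`, which is used here only as an independent reference. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 5, size=100)
y_demo = 0.5 - 1.0 * x_demo + 0.5 * rng.standard_normal(x_demo.size)

# Sample covariance and variance with the same 1/N normalization
cov_xy = np.mean((x_demo - x_demo.mean()) * (y_demo - y_demo.mean()))
var_x = np.mean((x_demo - x_demo.mean())**2)

theta1_hat = cov_xy / var_x                              # slope
theta0_hat = y_demo.mean() - theta1_hat * x_demo.mean()  # intercept

# Cross-check against a degree-1 polynomial fit (returns the slope first)
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, deg=1)
print(theta0_hat, theta1_hat)
print(intercept_ref, slope_ref)
```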



:::{dropdown} Proof

With

$$
\begin{align}
\frac{dL}{d\theta_0} &= -2 \sum_{i=1}^N (y_i - \theta_0 - \theta_1 x_i) \nonumber \\
&= -2 \sum_{i=1}^N y_i +  2 N\theta_0 + 2 \theta_1 \sum_{i=1}^N x_i = 0 \nonumber
\end{align}
$$

and 

$$
\begin{align}
\frac{dL}{d\theta_1} &= -2 \sum_{i=1}^N (y_i - \theta_0 - \theta_1 x_i) x_i \nonumber \\
&= -2 \sum_{i=1}^N y_i x_i +  2 \theta_0 \sum_{i=1}^N x_i + 2 \theta_1 \sum_{i=1}^N x_i^2 = 0 \nonumber
\end{align}
$$

a system of two equations and two unknowns is obtained

$$
\begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2\\\end{pmatrix} \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix}  = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} 
$$

whose solution is

$$
\begin{pmatrix} \hat \theta_0 \\ \hat \theta_1 \end{pmatrix}  = 
\frac{1}{N\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & N\\\end{pmatrix}  
\begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} 
$$

where we assume that the determinant of the matrix is not zero

:::
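
The system of equations in the proof can also be solved numerically. This is a minimal sketch with made-up data, using `np.linalg.solve` instead of writing out the matrix inverse. The names `A`, `b` and `theta_normal` are assumptions of this example.

```python
import numpy as np

x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.1, 2.9, 5.2, 6.8, 9.1])  # made-up values near y = 1 + 2x
N = len(x_demo)

# Matrix and right-hand side of the 2x2 system
A = np.array([[N, x_demo.sum()],
              [x_demo.sum(), np.sum(x_demo**2)]])
b = np.array([y_demo.sum(), np.sum(x_demo * y_demo)])

# The determinant is zero only when all x_i are identical
if np.isclose(np.linalg.det(A), 0.0):
    raise ValueError("Singular system: the x values must not all be equal")

theta_normal = np.linalg.solve(A, b)  # [theta_0, theta_1]
print(theta_normal)
```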

### Fitting the simplest linear model: Python

We can fit a line in Python using the `scipy.stats` library

```python
scipy.stats.linregress(x, # N vector or Mx2 matrix (if y is None)
                       y=None, # N vector
                      )
```

This function returns an object whose main attributes are

- `slope`: Equivalent to $\hat \theta_1$
- `intercept`: Equivalent to $\hat \theta_0$
- `rvalue`: The correlation coefficient (more on this later)
- `pvalue`: A p-value for the null hypothesis that $\theta_1 =0$ (more on this later)

Let's create synthetic data to test this function

In [None]:
np.random.seed(12345)
theta, sigma = [0.5 , -1], 0.5
x = np.random.rand(25)*5

def model(x, theta):
    return theta[0] + theta[1]*x

y = model(x, theta) + sigma*np.random.randn(len(x))

We fit the data using

In [None]:
res = scipy.stats.linregress(x, y)
theta_hat = np.array([res.intercept, res.slope])

print(f"hat theta0: {theta_hat[0]:0.5f}, hat theta1: {theta_hat[1]:0.5f}")

### Predicting with the model and inspecting the results

We can use the fitted model to interpolate/extrapolate on new values $\hat x$

The following plot shows the fitted model on the training samples

In [None]:
hat_x = np.linspace(-1, 6, num=50)
p_fitted = hv.Curve((hat_x, model(hat_x, theta_hat)), label='Fitted model')
p_data = hv.Scatter((x, y), label='Training data').opts(color='k', size=5)

hv.Overlay([p_data, p_fitted])

The fitted model (blue) follows the data closely. 

To visually assess the quality of the fit we can also plot the residuals, i.e. the signed difference between each target in the training set and the value predicted by the fitted line. We can also inspect the histogram of the residuals

In [None]:
residuals = y - model(x, theta_hat)
bins, edges = np.histogram(residuals, density=True)

p_residuals = hv.Scatter((y, residuals), 'Target variable', 'Residuals').opts(color='k', size=5, width=350)
p_zero = hv.HLine(0).opts(color='k', line_dash='dashed', alpha=0.5)
p_hist = hv.Histogram((edges, bins), kdims='Residuals', vdims='Density').opts(width=350)

hv.Layout([p_residuals * p_zero, p_hist]).cols(2)

Look for residuals that

- concentrate around zero 
- are not correlated (white noise like)

Correlation in the residuals is a sign that the choice of the model (line) was not adequate 
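
To illustrate what correlated residuals look like, the sketch below fits a line to data generated from a quadratic model. The lag-1 autocorrelation of the residuals (ordered by $x$) is used here as one simple, assumed diagnostic; it is not the only way to detect structure. All data and names are synthetic and hypothetical.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)
x_curved = np.linspace(0, 5, 100)  # already sorted
y_curved = 1.0 + 0.5 * x_curved**2 + 0.2 * rng.standard_normal(x_curved.size)

# Fit a line even though the data are quadratic (deliberately wrong model)
fit = scipy.stats.linregress(x_curved, y_curved)
res_curved = y_curved - (fit.intercept + fit.slope * x_curved)

# Correlation between consecutive residuals:
# near 0 suggests white noise, near 1 suggests leftover structure
lag1 = np.corrcoef(res_curved[:-1], res_curved[1:])[0, 1]
print(f"mean residual: {res_curved.mean():0.2e}")
print(f"lag-1 autocorrelation of residuals: {lag1:0.3f}")
```

The residuals still average to zero, but consecutive residuals are strongly correlated, which signals that the line is not an adequate model for these data.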

### Coefficient of determination

We can measure how strong the linear relation between $y$ and $\hat y = \hat \theta_0 + \hat \theta_1 x$ is using the **coefficient of determination** or $r^2$

This is defined as

$$
r^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} \in [0, 1]
$$

*i.e.* one minus the ratio between the sum of squared residuals and the total sum of squares of $y$ (which is $N$ times its variance). For the line model, $r$ coincides with Pearson's correlation coefficient between $x$ and $y$.

Interpreting $r^2$:

- If $r^2 = 1$, the data points are fitted perfectly by the model. The regressor accounts for all of the variation in y
- If $r^2 = 0$, the regression line is horizontal. The regressor accounts for none of the variation in y
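
As a sanity check, $r^2$ can be computed directly from its definition and compared with the squared `rvalue` returned by `scipy.stats.linregress`. This sketch assumes synthetic data; the names `ss_res`, `ss_tot` and `r2_manual` are illustrative.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(3)
x_demo = rng.uniform(0, 5, size=50)
y_demo = 0.5 - 1.0 * x_demo + 0.5 * rng.standard_normal(x_demo.size)

fit = scipy.stats.linregress(x_demo, y_demo)
y_hat = fit.intercept + fit.slope * x_demo

ss_res = np.sum((y_demo - y_hat)**2)          # sum of squared residuals
ss_tot = np.sum((y_demo - y_demo.mean())**2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, fit.rvalue**2)  # the two values should agree
```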


:::{warning}

If the relation is strong but non-linear it will not be detected by $r^2$

:::
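
A small sketch of this warning, assuming a noiseless quadratic relation on inputs that are symmetric around zero: $y$ is fully determined by $x$, yet the line fit explains essentially none of its variation. The names `x_sym`, `y_sym` and `fit_sym` are hypothetical.

```python
import numpy as np
import scipy.stats

x_sym = np.linspace(-1, 1, 101)  # symmetric around zero
y_sym = x_sym**2                 # deterministic, non-linear relation

fit_sym = scipy.stats.linregress(x_sym, y_sym)
print(f"slope: {fit_sym.slope:0.3e}")
print(f"r^2 of the line fit: {fit_sym.rvalue**2:0.3e}")
```

Both the slope and $r^2$ are zero up to floating point error, even though the relation between $x$ and $y$ is exact.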

Note that $r$ is available in the object returned by `scipy.stats.linregress`

For example in this case:

In [None]:
print(res.rvalue)