In [2]:
# Automatically reload all imported modules
%load_ext autoreload
%autoreload 2
# Enable panning and zooming of graphs & charts
%matplotlib widget



# Linear Regression

## Noise and Consequences

**Let's go back to our asteroid example**

<center><img src="assets/oumuamua.jpg" style="height: 380px;"></center>

* What if our observations are _noisy_...
* ...I.e. subject to measurement errors?

## Noise and Consequences

**Let us assume that, due to measurement errors:**

* We should have observed $(0.5, 0.1), (-0.55, -0.5)$
* ...But we instead observe $(0.53, 0.1), (-0.52, -0.5)$

In [3]:
import pandas as pd
import numpy as np

data = pd.DataFrame(columns=['x0', 'x1'], data=[[0.53, 0.1], [-0.52, -0.5]])
data

A = data**2 # Squared coordinates: one row per observation
b = np.ones((2, 1))

w = np.linalg.solve(A, b)
w

array([[3.55444973],
       [0.15550718]])

* Due to the noise, our coefficients will be off
* Even this small perturbation is enough for us to misclassify the curve: both coefficients are now positive, which corresponds to an ellipse rather than a hyperbola
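The effect of the perturbation can be checked by solving the system for both pairs of points. The following is a minimal sketch: `conic_coefficients` is a hypothetical helper introduced here for illustration, and the rule "same sign means ellipse, opposite signs mean hyperbola" is assumed for curves of the form $a x_0^2 + b x_1^2 = 1$.

```python
import numpy as np

# Points without and with measurement noise, as listed above.
# Only squared coordinates enter the system, so coordinate signs are irrelevant.
exact_pts = np.array([[0.50, 0.1], [-0.55, -0.5]])
noisy_pts = np.array([[0.53, 0.1], [-0.52, -0.5]])

def conic_coefficients(points):
    """Solve a*x0^2 + b*x1^2 = 1 for (a, b), given exactly two points."""
    return np.linalg.solve(points**2, np.ones(2))

for label, pts in [("exact", exact_pts), ("noisy", noisy_pts)]:
    a, b = conic_coefficients(pts)
    # Assumed classification rule: same sign -> ellipse, opposite signs -> hyperbola
    kind = "ellipse" if a * b > 0 else "hyperbola"
    print(f"{label}: a={a:.3f}, b={b:.3f} -> {kind}")
```

Under these assumptions, the exact points yield coefficients with opposite signs, while the noisy ones yield two positive coefficients.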

## More Observations

**One natural way to deal with this is using _more observations_**

<center><img src="assets/trajectory.png" style="height: 450px;"></center>

* Intuitively, the additional data compensates for measurement errors

## More Observations

**Say we have one additional data point**

* Namely our "revised" observations $(x_{0,0}, x_{0,1}) = (0.53, 0.1)$, and $(x_{1,0}, x_{1,1}) = (-0.52, -0.5)$
* ...Plus the point $(x_{2,0}, x_{2,1}) = (-0.7, -0.52)$

**What happens to our system of linear equations?**

$$\begin{align}
a x_{0,0}^2 + b x_{0,1}^2 &= 1 \\
a x_{1,0}^2 + b x_{1,1}^2 &= 1 \\
a x_{2,0}^2 + b x_{2,1}^2 &= 1 \\
\end{align}$$

* There are 3 equations (constraints) and 2 variables (degrees of freedom)
* The system is _overdetermined_ (and impossible to solve exactly)
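One way to verify that no exact solution exists is to compare the rank of the coefficient matrix with the rank of the augmented matrix: a linear system is consistent only if the two ranks coincide. The sketch below applies this check to the three observations; the variable names are chosen here for illustration.

```python
import numpy as np

# The three (noisy) observations from the text
pts = np.array([[0.53, 0.1], [-0.52, -0.5], [-0.7, -0.52]])
A = pts**2            # coefficient matrix: one row per equation
b = np.ones((3, 1))   # right-hand side

rank_A = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(np.hstack([A, b]))  # augmented matrix [A | b]

# Different ranks mean that the system has no exact solution
print(rank_A, rank_Ab)
```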

## Curve Fitting

**However, we can go for an _approximate solution_**

* Given a function $f(x; w)$ with input $x$ and parameters $w$
* ...And a dataset of observations $\{(x_i, y_i)\}_{i=1}^m$

...We can define its approximation error over the observations

$$
\mathit{MSE}(w) = \frac{1}{m} \sum_{i=1}^m \left(f(x_i; w) - y_i\right)^2
$$

* "MSE" stands for _Mean Squared Error_ and it's a common error metric

**Curve fitting**

Consider the problem:

$$
\text{argmin}_w \, \mathit{MSE}(w)
$$


* I.e. the problem of choosing weights/parameters $w$ to minimize approximation error
* ...This is sometimes referred to as _curve fitting_
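As a rough illustration, the error of two candidate parameter vectors can be compared on the asteroid data. In this sketch the model is assumed to be $f(x; w) = w_0 x_0^2 + w_1 x_1^2$ with all targets equal to 1; the second candidate `w_other` is an arbitrary guess, not an optimized value.

```python
import numpy as np

def f(x, w):
    """Model output w[0]*x0^2 + w[1]*x1^2 for each row of x."""
    return (x**2) @ w

def mse(w, x, y):
    """Mean of the squared differences between model output and targets."""
    return np.mean((f(x, w) - y)**2)

x = np.array([[0.53, 0.1], [-0.52, -0.5], [-0.7, -0.52]])
y = np.ones(3)

w_two_points = np.array([3.55444973, 0.15550718])  # fits the first two points exactly
w_other = np.array([3.0, -0.3])                    # arbitrary alternative candidate

print(mse(w_two_points, x, y))
print(mse(w_other, x, y))
```

The first candidate has zero error on two observations but a large error on the third one, so its overall error ends up higher than that of the second candidate. Curve fitting amounts to searching for the candidate with the lowest error.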

## Curve Fitting ...And Linear Regression

**Linear Regression**

When $f$ is defined as a _linear combination of basis functions_...

$$
f(x; w) = \sum_{j=1}^n w_j \phi_j(x)
$$

...The problem is called _linear regression_

* It's called "regression" because we are estimating a numeric quantity
* It's called "linear" because $f$ is linear w.r.t. its parameter vector $w$
* The dataset used for minimization is usually called a _training set_
* Note: the basis functions do not need to be linear!
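The distinction between linearity in $w$ and linearity in $x$ can be checked numerically. The sketch below uses a hypothetical set of basis functions for a scalar input ($1$, $x$, $x^2$), chosen only for illustration.

```python
import numpy as np

# Hypothetical basis functions for a scalar input: 1, x, x^2
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

def f(x, w):
    """Linear combination of the basis functions, weighted by w."""
    return sum(w_j * phi_j(x) for w_j, phi_j in zip(w, basis))

x = np.array([0.0, 1.0, 2.0])
w1 = np.array([1.0, 0.0, 2.0])
w2 = np.array([0.0, 3.0, -1.0])

# Linear in w: f(x; w1 + w2) equals f(x; w1) + f(x; w2)
print(np.allclose(f(x, w1 + w2), f(x, w1) + f(x, w2)))

# Not linear in x: f(2x; w1) differs from 2 f(x; w1)
print(np.allclose(f(2 * x, w1), 2 * f(x, w1)))
```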

## Linear Least Squares Method

**How do we solve it?**

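* It can be proved that the linear regression problem has _a unique minimizer_ (provided no basis function is redundant, i.e. the matrix $A$ defined below has linearly independent columns)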
* ...And we can find it by searching for a $w$ that results in a _null gradient_:
$$
\nabla{\mathit{MSE}(w)} = 0
$$

In the linear case, it can be proven this is true iff:

$$
(A^T A) w = A^T b
$$

where $A$ and $b$ are defined as in the exact interpolation case, i.e.:

$$
A =
\left(\begin{array}{ccc}
\phi_1(x_1) & \cdots & \phi_n(x_1) \\
\vdots & \ddots & \vdots \\
\phi_1(x_m) & \cdots & \phi_n(x_m) \\
\end{array}\right)
\quad
b = 
\left(\begin{array}{c}
y_1 \\
\vdots \\
y_m \\
\end{array}\right)
$$

**This is called the Linear Least Squares method**
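The link between the null gradient and the system $(A^T A) w = A^T b$ can be verified numerically. This is a minimal sketch on the three asteroid observations, assuming the error is the mean of the squared residuals, whose gradient is $\frac{2}{m} A^T (A w - b)$.

```python
import numpy as np

pts = np.array([[0.53, 0.1], [-0.52, -0.5], [-0.7, -0.52]])
A = pts**2       # basis functions phi_1(x) = x0^2 and phi_2(x) = x1^2
b = np.ones(3)   # targets

# Solve (A^T A) w = A^T b
w = np.linalg.solve(A.T @ A, A.T @ b)

# Gradient of (1/m) * ||A w - b||^2 with respect to w
grad = 2 / len(b) * A.T @ (A @ w - b)

print(w)
print(grad)  # numerically zero at the solution
```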

## Linear Least Squares Method

**Let's test it on the asteroid data**

In [6]:
import pandas as pd
import numpy as np

data = pd.DataFrame(columns=['x0', 'x1'], data=[[0.53, 0.1], [-0.52, -0.5], [-0.7, -0.52]])

* First we build $A$ and $b$:

In [7]:
A = data**2 # Squared coordinates: one row per observation
b = np.ones((3, 1))
display(A)
display(b)

|   | x0 | x1 |
|---|---|---|
| 0 | 0.2809 | 0.01 |
| 1 | 0.2704 | 0.25 |
| 2 | 0.49 | 0.2704 |


array([[1.],
       [1.],
       [1.]])

## Linear Least Squares Method

* Then we can build $A^\prime = A^TA$ and $b^\prime = A^T b$:

In [10]:
Aprime = np.matmul(A.values.T, A.values)
bprime = np.matmul(A.values.T, b)

* Finally, we solve the modified system of linear equations:

In [12]:
w = np.linalg.solve(Aprime, bprime)
w

array([[ 2.79747883],
       [-0.27426684]])

* The opposite signs tell us this is once again a hyperbola
* The approach works with any number of observations
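As a cross-check, NumPy provides `np.linalg.lstsq`, which solves the same minimization problem without forming $A^T A$ explicitly. The sketch below recomputes the coefficients on the same three observations and prints the residuals, which are not all zero since the system is overdetermined.

```python
import numpy as np

pts = np.array([[0.53, 0.1], [-0.52, -0.5], [-0.7, -0.52]])
A = pts**2
b = np.ones(3)

# Least squares solution of A w = b
w_lstsq, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)

print(w_lstsq)          # should match the solution of the modified system
print(A @ w_lstsq - b)  # per-observation errors
```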

## Linear Regression in Scikit-Learn

**We can simplify the process by relying on external libraries**

A nice implementation of Linear Regression is available from [scikit-learn](https://scikit-learn.org/)

* We start by building a `LinearRegression` object

In [13]:
from sklearn.linear_model import LinearRegression

m = LinearRegression(fit_intercept=False)

* Then we obtain the parameters of the linear combination by calling the `fit` method

In [18]:
m.fit(A, b)

LinearRegression(fit_intercept=False)

* If we don't pass `fit_intercept=False`, the library will also calibrate an offset
* We can then read the parameter values from the `coef_` attribute:

In [17]:
m.coef_

array([[ 2.79747883, -0.27426684]])
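Calibrating an offset can be thought of as adding a constant basis function $\phi(x) = 1$ to the model. The following NumPy sketch illustrates the idea on synthetic, noise-free data; it is an analogy and not a description of how scikit-learn computes the offset internally (there, the value is stored in the `intercept_` attribute).

```python
import numpy as np

x = np.linspace(0, 1, 5)
y = 2 * x + 1   # synthetic targets with slope 2 and offset 1

# Without offset: a single basis function phi(x) = x
A_no_offset = x.reshape(-1, 1)
w_no_offset, *_ = np.linalg.lstsq(A_no_offset, y, rcond=None)

# With offset: add a column of ones, i.e. a constant basis function
A_offset = np.column_stack([x, np.ones_like(x)])
w_offset, *_ = np.linalg.lstsq(A_offset, y, rcond=None)

print(w_no_offset)  # the slope is distorted to compensate for the missing offset
print(w_offset)     # slope and offset are both recovered
```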

# Linear Regression - An Example

## Linear Regression - An Example

**Our asteroid example was quite unusual**

* Typically, linear regression is employed to estimate an unknown quantity
* ...Based on the values of a number of available observations

**We will see an example on [estimating the price of houses in Taiwan](https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set)**

<center><img src="assets/taiwan-tea-house.jpg" style="height: 400px;"></center>


## Loading the Data

**The dataset is available (in csv format) from the `data` folder**

In [19]:
!ls data

lr_test.txt  lr_train.txt  real_estate.csv  weather.csv


* We will use pandas to load the csv dataset with the real estate data:

In [20]:
data = pd.read_csv('data/real_estate.csv', sep=';')
data.head() # head() returns the first 5 rows

|   | house age | dist to MRT | #stores | latitude | longitude | price per area |
|---|---|---|---|---|---|---|
| 0 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 |


* Our goal is to predict the value of the "price per area" column
* As a model, we will consider _a linear combination of the other columns (plus offset)_
* The basis functions are linear, too: this is the most common case

## Separating Input and Output

**First, we separate our input and output**

In [25]:
cols = data.columns
X = data[cols[:-1]] # all columns except the last one
display(X.head())

|   | house age | dist to MRT | #stores | latitude | longitude |
|---|---|---|---|---|---|
| 0 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 |
| 1 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 |
| 2 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 |
| 3 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 |
| 4 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 |


**We will focus on predicting the logarithm (order of magnitude) of the price per area**

In [26]:
y = np.log(data[cols[-1]]) # natural logarithm of the last column

* The $y$ values are often referred to as _targets_
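Since `np.log` is the natural logarithm, a value on the logarithmic scale can be mapped back to a price with `np.exp`. The short sketch below uses the first five prices shown earlier and also shows how a difference on the log scale translates into a multiplicative factor on the price; the value `0.1` is an arbitrary example.

```python
import numpy as np

# First five values of the "price per area" column, as displayed above
prices = np.array([37.9, 42.2, 47.3, 54.8, 43.1])

log_prices = np.log(prices)      # targets on the logarithmic scale
recovered = np.exp(log_prices)   # back to the original scale

print(np.allclose(recovered, prices))

# A difference d on the log scale corresponds to a factor exp(d) on the price
d = 0.1
print(np.exp(d))  # roughly a 10% difference
```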

## Training and Test Set

**Next, we remove part of the dataset: we will use it for testing**

* This is called the _test set_ (use only for testing)
* The remaining data will form the _training set_ (use for calibration)

In [27]:
from sklearn.model_selection import train_test_split

X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.34, random_state=0)

print(f'Size of the training set: {len(X_tr)}')
print(f'Size of the test set: {len(X_ts)}')

Size of the training set: 273
Size of the test set: 141


The function `train_test_split`

* Randomly shuffles the data (optionally with a fixed seed `random_state`)
* Puts a fraction `test_size` of the data in the test set
* ...And the remaining data in the training set
* Both the input and the output data are processed in this fashion
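The steps above can be sketched in plain NumPy. The function `simple_split` below is a hypothetical, simplified stand-in written for illustration: it is not scikit-learn's implementation, and since it relies on a different random generator it will not select the same rows. Rounding the test set size up is an assumption that matches the sizes printed above.

```python
import numpy as np

def simple_split(X, y, test_size, seed=None):
    """Shuffle the row indices, then split inputs and targets consistently."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))  # assumed rounding rule
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic data with as many rows as the real estate dataset
X = np.arange(414 * 2).reshape(414, 2)
y = np.arange(414)

X_tr, X_ts, y_tr, y_ts = simple_split(X, y, test_size=0.34, seed=0)
print(len(X_tr), len(X_ts))
```

Using the same index arrays for `X` and `y` is what keeps each input row paired with its target.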

## Fitting the Model

**We can now fit a linear model**

In [29]:
m = LinearRegression()
m.fit(X_tr, y_tr)

LinearRegression()

We obtain the estimated output via the `predict` method:

In [32]:
y_pred_tr = m.predict(X_tr)
y_pred_ts = m.predict(X_ts)

* The predictions are on the same (logarithmic) scale as the targets
* ...But they are still easy to interpret: applying `np.exp` maps them back to a price per area
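For a fitted `LinearRegression` model, the output of `predict` is the linear combination of the inputs with the weights in `coef_`, plus the offset in `intercept_`. The sketch below checks this on synthetic data; the inputs, weights and offset are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                # synthetic inputs
y = X @ np.array([0.5, -1.0, 2.0]) + 0.3    # synthetic, noise-free targets

m = LinearRegression().fit(X, y)

# Manual prediction: weighted sum of the columns plus the offset
manual = X @ m.coef_ + m.intercept_
print(np.allclose(manual, m.predict(X)))
```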

## Evaluation

**Finally, we need to evaluate the prediction quality**

A common approach is using metrics. Here are a few examples:

* The _Mean Absolute Error_ is given by:
$$
\mathit{MAE} = \frac{1}{m}\sum_{i=1}^m \left|f(x_i) - y_i\right|
$$
* The _Root Mean Squared Error_ is given by:
$$
\mathit{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2}
$$

Both the RMSE and the MAE are simple error measures

* They are expressed in the same unit as the original variable
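Both formulas translate directly into NumPy. In this sketch the helper names `mae` and `rmse` and the sample values are chosen for illustration; the example has a single large error, which affects the RMSE more than the MAE.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y_pred - y_true)**2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])  # one large error

print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # 2.0
```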

## Evaluation


* The coefficient of determination ($R^2$ coefficient) is given by:
$$
R^2 = 1 - \frac{\sum_{i=1}^m (f(x_i) - y_i)^2}{\sum_{i=1}^m (y_i - \tilde{y})^2}
$$
where $\tilde{y}$ is the average of the $y$ values

**The coefficient of determination is a useful, but more complex metric:**

* Its maximum is 1: an $R^2 = 1$ implies perfect predictions
* Having a known maximum makes the metric very readable
* It can be arbitrarily low (including negative)
* It can be subject to a lot of noise if the targets $y$ have low variance
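The properties listed above can be observed on a small example. The helper `r2` below is a direct transcription of the formula, written for illustration; the sample values are arbitrary.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_pred - y_true)**2)           # squared prediction errors
    ss_tot = np.sum((y_true - np.mean(y_true))**2)  # spread around the average
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

print(r2(y_true, y_true))                      # perfect predictions: 1.0
print(r2(y_true, np.full(4, y_true.mean())))   # always predicting the average: 0.0
print(r2(y_true, y_true[::-1]))                # worse than the average: -3.0
```

Since the denominator measures the spread of the targets, a low variance in $y$ makes the ratio very sensitive to small changes in the errors.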

**Using the MSE directly for evaluation is usually a bad idea**

...Since it is expressed in squared units, and is therefore not easy for a human to interpret

## Evaluation

**Let's see the values for our example**

In [33]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error


print(f'MAE on the training data: {mean_absolute_error(y_tr, y_pred_tr):.3}')
print(f'MAE on the test data: {mean_absolute_error(y_ts, y_pred_ts):.3}')
print(f'RMSE on the training data: {np.sqrt(mean_squared_error(y_tr, y_pred_tr)):.3}')
print(f'RMSE on the test data: {np.sqrt(mean_squared_error(y_ts, y_pred_ts)):.3}')
print(f'R2 on the training data: {r2_score(y_tr, y_pred_tr):.3}')
print(f'R2 on the test data: {r2_score(y_ts, y_pred_ts):.3}')

MAE on the training data: 0.154
MAE on the test data: 0.154
RMSE on the training data: 0.214
RMSE on the test data: 0.24
R2 on the training data: 0.704
R2 on the test data: 0.618


* In general, we have better predictions on the training set than on the test set
* This is symptomatic of some _overfitting_
* I.e. we are learning patterns in the training set that don't translate to unseen data

Later on, we will see some techniques to deal with this situation

## Evaluation

**As an (important!) alternative to metrics, we can use _scatter plots_:**

* On the x-axis, we show the target values
* On the y-axis, we show the predictions

In [34]:
from matplotlib import pyplot as plt
plt.figure(figsize=(9,3))
plt.scatter(y_ts, y_pred_ts, alpha=0.2)
plt.plot(plt.xlim(), plt.ylim(), linestyle=':', color='0.5');


This gives us a better idea of what kinds of mistakes the model is making