In [4]:
# ======================================================================================
# Notebook setup
# 
# Run this cell before all others to make sure that the Jupyter notebook works properly
# ======================================================================================

# Automatically reload all imported modules
%load_ext autoreload
%autoreload 2
# Allow to pan and zoom graphs & charts
%matplotlib widget



# Linear Regression

## Our Target Problem

**Let's assume we want to [estimate real-estate prices in Taiwan](https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set)**

<center><img src="assets/taiwan-tea-house.jpg" style="height: 400px;"></center>


## Loading the Data

**Data for this problem is available (in csv format) from the `data` folder**

In [5]:
!ls data

lr_test.txt  lr_train.txt  real_estate.csv  weather.csv


We will load the data via a Python library, called [pandas](https://pandas.pydata.org/)

In [6]:
import numpy as np
import pandas as pd
data = pd.read_csv('data/real_estate.csv', sep=';')
data.head() # head() returns the first 5 rows

Unnamed: 0,house age,dist to MRT,#stores,latitude,longitude,price per area
0,32.0,84.87882,10,24.98298,121.54024,37.9
1,19.5,306.5947,9,24.98034,121.53951,42.2
2,13.3,561.9845,5,24.98746,121.54391,47.3
3,13.3,561.9845,5,24.98746,121.54391,54.8
4,5.0,390.5684,5,24.97937,121.54245,43.1


* The content of the csv file is made accessible in a table-like object (`DataFrame`)
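
To get a feel for what a `DataFrame` looks like, here is a minimal sketch that parses a tiny in-memory CSV. The two rows are copied from the table above; the `io.StringIO` wrapper and the three-column layout are assumptions made for illustration, not the actual content of `real_estate.csv`.

```python
import io
import pandas as pd

# Two rows copied from the table above, using ';' as separator
# (the same separator passed to read_csv in the cell above)
csv_text = "house age;dist to MRT;#stores\n32.0;84.87882;10\n19.5;306.5947;9\n"

# StringIO lets us treat a string as if it were a file on disk
df = pd.read_csv(io.StringIO(csv_text), sep=';')

print(df.shape)             # (number of rows, number of columns)
print(list(df.columns))     # column names are taken from the first line
print(df['#stores'].sum())  # columns can be accessed by name
```

Each column keeps its own type: here `#stores` is parsed as integer, while the other two columns are floating point.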

## A Look at the Data

**Let's have a better look at the data**

In [7]:
data.head()

Unnamed: 0,house age,dist to MRT,#stores,latitude,longitude,price per area
0,32.0,84.87882,10,24.98298,121.54024,37.9
1,19.5,306.5947,9,24.98034,121.53951,42.2
2,13.3,561.9845,5,24.98746,121.54391,47.3
3,13.3,561.9845,5,24.98746,121.54391,54.8
4,5.0,390.5684,5,24.97937,121.54245,43.1


* All columns except the last one contain quantities that are easy to estimate
* ...But that's not true for the last one!

Obtaining prices required actual houses to be sold and bought

* Our goal is to use the data to _learn a model_
* ...That can estimate the price based on the easily available information

## Input, Output, Examples, Targets

**Formally, we say that**

* All columns except the price represent the _input $x$_ of our model
* The price represents the _output $y$_ of our model
* Each row in the table represents one data point, i.e. an _example $(\hat{x}_i, \hat{y}_i)$_
  - $\hat{x}_i$ is the input value for the $i$-th example
  - $\hat{y}_i$ is the true output value (or _target_) for the $i$-th example

**Our goal is to learn a model $f$ such that**

* When we feed the input $\hat{x}_i$ of each example to it
* ...The output value $y_i = f(\hat{x}_i)$ is as close as possible to $\hat{y}_i$

This kind of setup is known in ML as _supervised learning_
 

## Supervised Learning and Regression

**Supervised Learning is among the most common forms of ML**

Our _model_ is a function $f(x; w)$ with input $x$ and _parameters $w$_

* If the output is numeric, we speak of _regression_
* ...And we can define the approximation error over the examples using, e.g.:

$$
\mathit{MSE}(w) = \frac{1}{m} \sum_{i=1}^m \left(f(x_i; w) - y_i\right)^2
$$

* "MSE" stands for _Mean Squared Error_ and it's a common error metric

**Training in an (MSE) regression problem consists of solving**

$$
\text{argmin}_w \, \mathit{MSE}(w)
$$


* I.e. choosing the parameters $w$ to minimize approximation error
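
The idea of "training as minimization" can be sketched on a toy problem. The data, the one-parameter model `f`, and the brute-force grid search below are all made up for illustration; real libraries use far more efficient solvers.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def f(x, w):
    # A one-parameter model: f(x; w) = w * x
    return w * x

def mse(w):
    # Mean of the squared differences between predictions and targets
    return np.mean((f(x, w) - y) ** 2)

# Brute-force "training": evaluate the MSE on a grid and keep the best w
candidates = np.linspace(0, 4, 401)
errors = np.array([mse(w) for w in candidates])
w_best = candidates[np.argmin(errors)]

print(f'best w: {w_best:.2f}, MSE: {mse(w_best):.4f}')
```

The selected parameter ends up close to 2, i.e. the slope that was used to make up the data.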

## Supervised Learning...And Linear Regression

**We speak instead of _Linear Regression_**

...When $f$ is defined as a _linear combination of basis functions_

$$
f(x; w) = \sum_{j=1}^n w_j \phi_j(x)
$$

In our case:

* Each basis function will correspond to _a specific input column_
* I.e. "house age", "distance to MRT", "#stores", "latitude", "longitude"

This is a very common setup in Linear Regression
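
As a minimal sketch, here is the formula evaluated for a single example. The input values are copied from the first row of the table, while the weights `w` are hypothetical numbers chosen only for illustration (they are not the result of any training).

```python
import numpy as np

# One example, with values copied from the first row of the table:
# house age, dist to MRT, #stores, latitude, longitude
x = np.array([32.0, 84.87882, 10, 24.98298, 121.54024])

# Basis function j simply returns the j-th input column
basis = [lambda x, j=j: x[j] for j in range(len(x))]

# Hypothetical weights, chosen only for illustration
w = np.array([-0.01, -0.0001, 0.03, 0.0, 0.0])

# f(x; w) = sum_j w_j * phi_j(x)
f_sum = sum(w_j * phi_j(x) for w_j, phi_j in zip(w, basis))

# With these basis functions, the sum is just a dot product
f_dot = w @ x

print(f_sum, f_dot)
```

An intercept can be obtained in the same framework by adding a constant basis function $\phi(x) = 1$.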

**Linear regression is one of the simplest supervised learning approaches**

* ...But it is still a very good example!
* ...And will allow us to discuss some of the key challenges in ML

## Separating Input and Output

**First, we separate our input and output**

In [8]:
cols = data.columns
X = data[cols[:-1]] # all columns except the last one
display(X.head())

Unnamed: 0,house age,dist to MRT,#stores,latitude,longitude
0,32.0,84.87882,10,24.98298,121.54024
1,19.5,306.5947,9,24.98034,121.53951
2,13.3,561.9845,5,24.98746,121.54391
3,13.3,561.9845,5,24.98746,121.54391
4,5.0,390.5684,5,24.97937,121.54245


**We will focus on predicting the logarithm of the price per area**

In [10]:
y = np.log(data[cols[-1]]) # just the last column

* In practice, it's like predicting the order of magnitude
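
A short sketch of what the transformation does, using three prices copied from the table. Note that `np.log` is the natural logarithm, so "order of magnitude" is meant here in base $e$.

```python
import numpy as np

# Prices per area copied from the first rows of the table
price = np.array([37.9, 42.2, 47.3])

log_price = np.log(price)      # what the model is trained to predict
recovered = np.exp(log_price)  # the inverse transformation

# A difference in log space corresponds to a ratio in the original space
log_gap = log_price[1] - log_price[0]
ratio = price[1] / price[0]

print(np.exp(log_gap), ratio)
```

This means that an error of a given size on the logarithm corresponds to the same _relative_ error on the price, regardless of how expensive the house is.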

## Training and Test Set

**Next, we remove part of the dataset: we will use it for testing**

* This is called the _test set_ (use only for testing)
* The remaining data will form the _training set_ (use for calibration)

In [13]:
from sklearn.model_selection import train_test_split

X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.34, random_state=0)

print(f'Size of the training set: {len(X_tr)}')
print(f'Size of the test set: {len(X_ts)}')

Size of the training set: 273
Size of the test set: 141


The function `train_test_split`

* Randomly shuffles the data (optionally with a fixed seed `random_state`)
* Puts a fraction `test_size` of the data in the test set
* ...And the remaining data in the training set
* Both the input and the output data are processed in this fashion
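
The steps above can be sketched with plain NumPy. The helper `simple_split` is hypothetical and is not how scikit-learn implements `train_test_split` internally (in particular, the shuffling will differ); the test set size is rounded up, which matches the sizes printed above.

```python
import numpy as np

def simple_split(X, y, test_size, seed):
    # Shuffle the row indices with a fixed seed, for reproducibility
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    # Number of test examples (rounded up)
    n_ts = int(np.ceil(test_size * len(X)))
    ts_idx, tr_idx = idx[:n_ts], idx[n_ts:]
    # Apply the same indices to the input and to the output
    return X[tr_idx], X[ts_idx], y[tr_idx], y[ts_idx]

# Dummy data with the same number of rows as our dataset (273 + 141 = 414)
X_demo = np.arange(414 * 2).reshape(414, 2)
y_demo = np.arange(414)

X_a, X_b, y_a, y_b = simple_split(X_demo, y_demo, test_size=0.34, seed=0)
print(len(X_a), len(X_b))
```

Using the same shuffled indices for `X` and `y` is what keeps each input row paired with its own target.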

## Training and Test Set

**Using a separate test set is _extremely important_**

...Because we want our model to work on _new data_

* We have no use for a model that _learns the input data perfectly_
* ...But that _behaves poorly on unseen data_
* In these cases, we say that the model _does not generalize_

By keeping a separate test set we can simulate how the model behaves on unseen data

**For the best performance...**

...Training and test data should have similar probability distributions

* Informally, they should be relatively similar
* In this case, we say that the training set is _representative_

## Fitting the Model

**We can now train a linear model using the [scikit-learn library](https://scikit-learn.org/stable/)**

In [14]:
from sklearn.linear_model import LinearRegression

m = LinearRegression()
m.fit(X_tr, y_tr)

LinearRegression()

We obtain the estimated output via the `predict` method:

In [15]:
y_pred_tr = m.predict(X_tr)
y_pred_ts = m.predict(X_ts)

* The predictions are on the same logarithmic scale as the targets
* ...But that is still fine, since we can map them back to a price per area with `np.exp`
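
To see what fitting and predicting amount to, here is a minimal sketch based on ordinary least squares with an intercept, which is the problem `LinearRegression` solves by default (its internal implementation may differ). The synthetic data and the names `X_demo`, `y_demo` are made up for illustration.

```python
import numpy as np

# Synthetic data (made up): y = 1 + 2*x1 - 3*x2, plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 1 + 2 * X_demo[:, 0] - 3 * X_demo[:, 1] + 0.01 * rng.normal(size=100)

# Add a column of ones, so that the first weight acts as an intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: the w that minimizes the squared error of A @ w
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Predicting is just a matrix-vector product
y_hat = A @ w

print(np.round(w, 2))
```

The recovered weights should be close to the values (1, 2, -3) used to generate the data.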

## Evaluation

**Finally, we need to evaluate the prediction quality**

A common approach is using metrics. Here are a few examples:

* The _Mean Absolute Error_ is given by:
$$
\mathit{MAE} = \frac{1}{m}\sum_{i=1}^m \left|f(x_i) - y_i\right|
$$
* The _Root Mean Squared Error_ is given by:
$$
\mathit{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2}
$$

Both the RMSE and MAE are simple error measures

* They are expressed in the same unit as the original variable
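
Both formulas translate directly into NumPy. The targets and predictions below are made up for illustration.

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([3.0, 3.5, 4.0, 4.5])
y_hat = np.array([3.1, 3.3, 4.0, 5.0])

err = y_hat - y_true

# Mean Absolute Error: average magnitude of the errors
mae = np.mean(np.abs(err))

# Root Mean Squared Error: square, average, then take the square root
rmse = np.sqrt(np.mean(err ** 2))

print(f'MAE: {mae:.3f}, RMSE: {rmse:.3f}')
```

The RMSE is never smaller than the MAE, and the gap grows when a few errors are much larger than the others (here, the last example).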

## Evaluation


* The coefficient of determination ($R^2$ coefficient) is given by:
$$
R^2 = 1 - \frac{\sum_{i=1}^m (f(x_i) - y_i)^2}{\sum_{i=1}^m (y_i - \tilde{y})^2}
$$
where $\tilde{y}$ is the average of the $y$ values

**The coefficient of determination is a useful, but more complex metric:**

* Its maximum is 1: an $R^2 = 1$ implies perfect predictions
* Having a known maximum makes the metric very readable
* It can be arbitrarily low (including negative)
* It can be subject to a lot of noise if the targets $y$ have low variance
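
These properties can be checked on a tiny made-up example. The helper `r2` below is a direct transcription of the formula, written for illustration; in practice we will use `r2_score` from scikit-learn.

```python
import numpy as np

def r2(y_true, y_hat):
    # 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_hat - y_true) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Exact predictions
r2_perfect = r2(y_true, y_true)
# Always predicting the average of the targets
r2_mean = r2(y_true, np.full(4, y_true.mean()))
# Predictions that are worse than the average (targets in reverse order)
r2_reversed = r2(y_true, y_true[::-1])

print(r2_perfect, r2_mean, r2_reversed)
```

Always predicting the average gives $R^2 = 0$, so a negative value means the model is doing worse than this trivial baseline.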

**Using the MSE directly for evaluation is usually a bad idea**

...Since it is expressed in squared units, and therefore not easy to interpret for a human

## Evaluation

**Let's see the values for our example**

In [18]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

print(f'MAE on the training data: {mean_absolute_error(y_tr, y_pred_tr):.3}')
print(f'MAE on the test data: {mean_absolute_error(y_ts, y_pred_ts):.3}')
print(f'RMSE on the training data: {np.sqrt(mean_squared_error(y_tr, y_pred_tr)):.3}')
print(f'RMSE on the test data: {np.sqrt(mean_squared_error(y_ts, y_pred_ts)):.3}')
print(f'R2 on the training data: {r2_score(y_tr, y_pred_tr):.3}')
print(f'R2 on the test data: {r2_score(y_ts, y_pred_ts):.3}')

MAE on the training data: 0.154
MAE on the test data: 0.154
RMSE on the training data: 0.214
RMSE on the test data: 0.24
R2 on the training data: 0.704
R2 on the test data: 0.618


* In general, we have better predictions on the training set than on the test set
* This is symptomatic of some _overfitting_
* I.e. we are learning patterns that don't translate to unseen data

Later on, we will see some techniques to deal with this situation
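
Since our target is the logarithm of the price, the MAE and RMSE above are also on a log scale. As a rough reading aid (an interpretation, not an additional metric), an additive error in log space can be converted into a multiplicative factor on the price:

```python
import math

# Test MAE reported above, measured on the logarithm of the price
mae_log = 0.154

# An additive error in log space is a multiplicative factor on the price
factor = math.exp(mae_log)

print(f'factor: {factor:.3f} (about {100 * (factor - 1):.0f}% off)')
```

In other words, a log-scale MAE of 0.154 corresponds to price estimates that are off by a factor of roughly 1.17.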

## Evaluation

**As an (important!) alternative to metrics, we can use _scatter plots_:**

* On the x-axis, we show the target values
* On the y-axis, we show the predictions

In [19]:
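from matplotlib import pyplot as plt
plt.figure(figsize=(9,3))
plt.scatter(y_ts, y_pred_ts, alpha=0.2)
# Reference diagonal: points on this line are perfect predictions
lo, hi = min(y_ts.min(), y_pred_ts.min()), max(y_ts.max(), y_pred_ts.max())
plt.plot([lo, hi], [lo, hi], linestyle=':', color='0.5')
plt.xlabel('target'); plt.ylabel('prediction');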


This gives us a better idea of which kind of mistakes the model is making