**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, KBinsDiscretizer
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import (mean_squared_error,
                             mean_absolute_error)
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
from class_utils import error_histogram
import numpy as np
import pandas as pd

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Linear Regression

In a previous example, we showed how optimization can be used to carry out simple regression. The optimization was done iteratively, using a gradient-based approach. However, there is one special group of models for which the optimization does not have to be iterative: the optimal parameters can be computed directly. This is the case for linear models with a quadratic loss function, where the optimal parameters can be computed using a method known as **ordinary least squares** ("least squares" because we minimize the sum of the squared errors, "ordinary" to distinguish it from least squares for non-linear systems). The resulting approach allows us to perform **linear regression**.

Linear regression problems occur frequently in practice: we encounter them every time we need to fit a line through a set of points. Since the theory of linear regression is discussed separately, we will only provide a short guide on how to use it in practice.
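To make the idea of a direct, non-iterative solution concrete, here is a minimal sketch that fits a line using `np.linalg.lstsq`. The toy data and the variable names (`A`, `slope`, `intercept`) are illustrative assumptions and are not used elsewhere in this notebook.

```python
import numpy as np

# Noise-free toy data on the line y = 2x + 1 (values chosen for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix: one column for the input, one column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution computed in a single call, with no gradient steps.
params, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = params

print(slope, intercept)
```

Because the data lie exactly on a line, the recovered slope and intercept match the values used to generate them up to floating-point precision.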

### A Simple Example: A Synthetic Dataset

We will create a synthetic dataset. The points will lie on a line, with added Gaussian noise, i.e.:
\begin{equation}
y_i = a x_i + c + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(\mu, \sigma^2).
\end{equation}

In the cell below, $a = 1$, $c = 0$, $\mu = 0$ and $\sigma = 0.2$. We can sample from the Gaussian (normal) distribution using the function `np.random.normal`.



In [None]:
df = pd.DataFrame()
df['x'] = np.arange(0, 1, 0.0025)
df['y'] = df['x'] + np.random.normal(0, 0.2, df['x'].shape)

df.head()
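The cell above uses NumPy's global random state, so the noise differs on every run. If repeatable data is preferred, one option is a seeded generator, as in the sketch below. The name `df_seeded` and the seed value `0` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# The seed value 0 is arbitrary; any fixed integer makes the noise repeatable.
rng = np.random.default_rng(0)

df_seeded = pd.DataFrame()
df_seeded['x'] = np.arange(0, 1, 0.0025)
# Same model as in the cell above: slope 1, intercept 0, noise with std 0.2.
df_seeded['y'] = df_seeded['x'] + rng.normal(0, 0.2, df_seeded['x'].shape)

df_seeded.head()
```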

### Data Preprocessing

Splitting and preprocessing the data is completely analogous to what we have done before; the code in the following cell is therefore hidden.



In [None]:
#@title -- Data Loading and Preprocessing; X_train, Y_train, X_test, Y_test -- { display-mode: "form" }
kbins = KBinsDiscretizer(6, encode='ordinal')
y_stratify = kbins.fit_transform(df[['y']])

df_train, df_test = train_test_split(df, stratify=y_stratify,
                                 test_size=0.3, random_state=4)

categorical_inputs = []
numeric_inputs = ['x']
output = ['y']

input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

X_train = input_preproc.fit_transform(df_train[categorical_inputs+numeric_inputs])
Y_train = df_train[output]

X_test = input_preproc.transform(df_test[categorical_inputs+numeric_inputs])
Y_test = df_test[output]

plt.figure()
plt.scatter(X_train, Y_train, label="training data")
plt.scatter(X_test, Y_test, label="testing data")
plt.grid(ls='--')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.savefig("output/linreg_data.pdf", bbox_inches='tight', pad_inches=0)
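The hidden cell stratifies the split on a continuous target by first binning it with `KBinsDiscretizer`. The following sketch isolates that step on separate toy data and compares the share of each bin in the two splits. All names here (`y_demo`, `bins`, `idx_train`, `idx_test`) are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
y_demo = rng.normal(0, 1, size=(600, 1))

# Bin the continuous target into 6 quantile-based bins (the default strategy).
kbins_demo = KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='quantile')
bins = kbins_demo.fit_transform(y_demo).ravel().astype(int)

# Stratify on the bin labels so both splits cover the target range similarly.
idx_train, idx_test = train_test_split(
    np.arange(len(y_demo)), stratify=bins, test_size=0.3, random_state=4)

train_share = np.bincount(bins[idx_train], minlength=6) / len(idx_train)
test_share = np.bincount(bins[idx_test], minlength=6) / len(idx_test)
print(train_share)
print(test_share)
```

The two printed vectors should be nearly identical, which is the point of stratifying: neither split is dominated by unusually small or large target values.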

#### Parameter Fitting

We create a linear regression model and fit its parameters using the `fit` method.



In [None]:
model = LinearRegression()
model = model.fit(X_train, Y_train)
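After fitting, the learned parameters of a `LinearRegression` model are available in its `coef_` and `intercept_` attributes. The sketch below shows this on separate noise-free toy data; `X_demo`, `y_demo` and `demo_model` are illustrative names, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data on the line y = 3x - 1 (illustrative values).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 3.0 * X_demo.ravel() - 1.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# With a 1-D target, coef_ holds one slope per input feature.
slope = demo_model.coef_[0]
intercept = demo_model.intercept_
print(slope, intercept)
```

For the `model` fitted above, `Y_train` is a single-column DataFrame, so `model.coef_` is two-dimensional and `model.intercept_` is a one-element array. Note also that the slope refers to the standardized input produced by `StandardScaler`, not to the raw `x` column.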

#### Testing

To get predictions for some data (whether the data used for fitting or new data), we use the `predict` method. The `fit` and `predict` methods form part of a standard interface supported by most models in the `sklearn` package. The rest of the testing will look very similar to what we did in the previous example.



In [None]:
y_test = model.predict(X_test)
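As a small illustration of the shared `fit`/`predict` interface, the sketch below runs the same two calls on two different estimators. The choice of `DecisionTreeRegressor` as the second model and the toy data are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy data on the line y = 2x (illustrative values).
X_demo = np.linspace(0, 1, 50).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel()

predictions = {}
for estimator in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    # The same two calls work regardless of the model type.
    estimator.fit(X_demo, y_demo)
    predictions[type(estimator).__name__] = estimator.predict(X_demo)

for name, y_hat in predictions.items():
    print(name, y_hat[:3])
```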

In [None]:
#@title -- Testing -- { display-mode: "form" }

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_test, y_test)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_test, y_test)
print("MAE = {}".format(mae))

plt.figure(figsize=(8, 6))
error_histogram(Y_test, y_test, Y_fit_scaling=Y_train)
plt.savefig("output/error_output_histogram.pdf", bbox_inches='tight', pad_inches=0)

# we visualize the regression line
x_min = min(np.min(X_train), np.min(X_test))
x_max = max(np.max(X_train), np.max(X_test))

xx = np.linspace(x_min, x_max, 250).reshape((-1, 1))
yy = model.predict(xx)

plt.figure()
plt.scatter(X_train, Y_train, label="training data", s=15)
plt.scatter(X_test, Y_test, label="testing data", s=15)
plt.plot(xx, yy, 'k', linewidth=5, label="regression model")

plt.grid(ls='--')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.savefig("output/linreg_line.pdf", bbox_inches='tight', pad_inches=0)
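The MSE and MAE reported in the testing cell are simple averages over the prediction errors. The sketch below computes both by hand on a few made-up values and compares them with the `sklearn` functions; `y_true` and `y_pred` are illustrative and unrelated to the dataset above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions, used only to show the arithmetic.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

errors = y_true - y_pred

# MSE: mean of the squared errors; MAE: mean of the absolute errors.
mse_manual = np.mean(errors ** 2)
mae_manual = np.mean(np.abs(errors))

print(mse_manual, mean_squared_error(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
```

Squaring makes the MSE more sensitive to the single large error (the third point) than the MAE is.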