**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from scipy.optimize import curve_fit

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
from class_utils.download import download_file_maybe_extract
download_file_maybe_extract("https://www.dropbox.com/s/p5q7gzupa2ndw55/sigmoid_regression_data.csv?dl=1", directory="data")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

In [None]:
#@title -- Auxiliary Functions -- { display-mode: "form" }
def fit_poly(X, Y, d):
    """Fit a degree-d polynomial to (X, Y), plot the fit and save the figure."""
    x_min = np.min(X)
    x_max = np.max(X)
    
    # expand the inputs into powers up to degree d and fit a linear model
    polynomial_features = PolynomialFeatures(degree=d)
    X_poly = polynomial_features.fit_transform(X)
    linreg = LinearRegression()
    linreg = linreg.fit(X_poly, Y)
    
    # evaluate the fitted polynomial on a dense grid for a smooth curve
    xx = np.linspace(x_min, x_max, 250).reshape((-1, 1))
    xx_poly = polynomial_features.transform(xx)
    yy = linreg.predict(xx_poly)
    
    # plot the data together with the fitted curve
    plt.figure()
    plt.scatter(X, Y)
    plt.grid(ls='--')
    plt.xlabel('x')
    plt.ylabel('y')

    plt.plot(xx, yy, 'r')
    plt.title("degree {}".format(d))    
    plt.savefig("output/poly_{}_fit.pdf".format(d), bbox_inches="tight", pad_inches=0)

## Polynomial Regression

### Linear Regression Will Not Work Everywhere

Linear regression will not work well for every imaginable dataset. For instance:



In [None]:
#@title -- Linear Regression on Polynomial Data -- { display-mode: "form" }
X = np.arange(-5, 10, 0.2)
Y = (X - 2 * (X ** 2) + 0.5 * (X ** 3)
        + np.random.normal(0, 15, len(X)))

X = X.reshape([-1, 1])
Y = Y.reshape([-1, 1])

linreg = LinearRegression()
linreg = linreg.fit(X, Y)
y_lin = linreg.predict(X)

plt.figure()
plt.scatter(X, Y, s=10)
plt.plot(X, y_lin, 'r')
plt.grid(ls='--')
plt.xlabel('x')
plt.ylabel('y')

plt.savefig("output/poly_linfit.pdf", bbox_inches="tight", pad_inches=0)

It is clear that a line is not able to express the character of the data very well.

### Applying Polynomial Regression

Polynomial regression is one of many types of regression. It fits the data using a polynomial of a certain degree. The regression model looks as follows:
\begin{equation}
\hat y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n
\end{equation}

The good news is that polynomial regression can be done in exactly the same way as linear regression – all we need to do is to transform the regression problem so that the input of the linear regression model will not be $x$ directly, but rather a vector of its powers.

The input matrix of the linear regression model will then take the following form:
\begin{equation}
X = \left(
\begin{matrix}
1 & x_1 & x_1^2 & \cdots & x_1^n \\
1 & x_2 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_m & x_m^2 & \cdots & x_m^n
\end{matrix}
\right),
\end{equation}
where $n$ is the degree of the polynomial.

The `sklearn` package provides a class called `PolynomialFeatures`, which preprocesses the data into exactly this format.
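
To see what this preprocessing produces, here is a minimal sketch on a tiny hand-made input. The array `x_demo` is a made-up illustration and not part of the notebook's data. Each row of the result should contain the powers $x^0, x^1, x^2, x^3$ of the corresponding input value.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# hypothetical toy input: three samples with a single feature
x_demo = np.array([[1.0], [2.0], [3.0]])

# degree=3 expands each x into [1, x, x^2, x^3]
poly = PolynomialFeatures(degree=3)
x_demo_poly = poly.fit_transform(x_demo)

print(x_demo_poly)
```

The leading column of ones comes from `include_bias=True`, which is the default. `LinearRegression` also fits its own intercept, so in this setting the column is redundant but harmless.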

For a degree 3 polynomial:



In [None]:
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, Y)

In [None]:
#@title -- Regression Curve vs. Data -- { display-mode: "form" }
x_min = np.min(X)
x_max = np.max(X)
xx = np.linspace(x_min, x_max, 250).reshape((-1, 1))
yy = model.predict(xx)

plt.figure()
plt.scatter(X, Y, s=10)
plt.plot(xx, yy, 'r')
plt.grid(ls='--')
plt.xlabel('x')
plt.ylabel('y')

plt.savefig("output/polyfit.pdf", bbox_inches="tight", pad_inches=0)

As we can see, the fit is much better: this is because the data was generated from a degree 3 polynomial with added Gaussian noise.
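
The visual impression can also be checked numerically. The sketch below regenerates data of the same form as above and compares the coefficient of determination $R^2$ of a plain linear fit with that of a degree 3 fit. The names `x_demo`, `y_demo` and the fixed random seed are assumptions added here only to keep the sketch self-contained and reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# regenerate data of the same form as above; the seed is an assumption
# added only to make this sketch reproducible
rng = np.random.default_rng(0)
x_demo = np.arange(-5, 10, 0.2).reshape(-1, 1)
y_demo = (x_demo - 2 * x_demo**2 + 0.5 * x_demo**3
          + rng.normal(0, 15, x_demo.shape))

linear_model = LinearRegression().fit(x_demo, y_demo)
cubic_model = make_pipeline(PolynomialFeatures(degree=3),
                            LinearRegression()).fit(x_demo, y_demo)

# score() returns the coefficient of determination R^2 on the given data
r2_linear = linear_model.score(x_demo, y_demo)
r2_cubic = cubic_model.score(x_demo, y_demo)
print("R^2 linear: {:.3f}".format(r2_linear))
print("R^2 cubic:  {:.3f}".format(r2_cubic))
```

Both scores are computed on the training data, so they only show how well each model can describe the points it has seen, not how well it generalizes.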

### How to Choose the Degree of the Polynomial?

When selecting the degree of the polynomial, we face the same trade-off as everywhere else in machine learning. The model should be complex enough to express the regularities inherent in the data, but not so complex that it simply memorizes the data; otherwise it is unlikely to generalize correctly. If the model is not sufficiently complex, we speak of **underfitting**. If it is too complex, **overfitting** occurs.

The problem can be illustrated using the following simple example:



In [None]:
#@title -- Data Loading and Preprocessing; X, Y -- { display-mode: "form" }
df = pd.read_csv("data/sigmoid_regression_data.csv")

# we specify the inputs and the outputs
categorical_inputs = []
numeric_inputs = ['x']
output = ['y']

# we create the pipeline
input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

# we fit and apply the pipeline
X = input_preproc.fit_transform(df[categorical_inputs+numeric_inputs])
Y = df[output].values

# we plot the data for visual inspection
plt.scatter(X, Y, marker='x', label="training data")
plt.xlabel('x')
plt.ylabel('y')
plt.grid(ls='--')
plt.legend()
plt.savefig("output/regression_data.pdf", bbox_inches='tight', pad_inches=0)

Let us try to fit the data using polynomials of different degrees (the function `fit_poly` was defined earlier, among the auxiliary functions):



In [None]:
fit_poly(X, Y, 2)
fit_poly(X, Y, 5)
fit_poly(X, Y, 7)
fit_poly(X, Y, 11)

As we can see, low-degree polynomials cannot really express the shape of the curve from which the original data was sampled. If we instead opt for a polynomial of too high a degree, it will pass through the points very precisely, but it will behave unreasonably in between them.
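
One common way to make this comparison quantitative is to hold out part of the data and measure the error on it. The sketch below is only an illustration under stated assumptions: it uses a synthetic noisy sigmoid-like curve (`x_demo`, `y_demo`) as a stand-in for the CSV data, and the 50/50 split and the random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# hypothetical stand-in data: a noisy sigmoid-like curve (not the CSV above)
rng = np.random.default_rng(0)
x_demo = np.linspace(-2, 2, 40).reshape(-1, 1)
y_demo = 1 / (1 + np.exp(-4 * x_demo)) + rng.normal(0, 0.05, x_demo.shape)

# keep half of the points aside for validation
x_train, x_val, y_train, y_val = train_test_split(
    x_demo, y_demo, test_size=0.5, random_state=0)

results = {}
for d in [1, 2, 5, 7, 11]:
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    val_mse = mean_squared_error(y_val, model.predict(x_val))
    results[d] = (train_mse, val_mse)
    print("degree {:2d}: train MSE {:.4f}, validation MSE {:.4f}".format(
        d, train_mse, val_mse))
```

The training error cannot increase as the degree grows, because each higher-degree model contains the lower-degree ones. It is therefore the validation error that indicates where the model starts to overfit.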

Naturally, the main mistake in this case is to approximate a curve of this kind using polynomial regression in the first place. The original relationship is suspiciously similar to a logistic (sigmoid) curve. It would therefore very likely be a much better idea to fit a sigmoid curve to it:



In [None]:
#@title -- Fitting the Data using the Sigmoid Curve -- { display-mode: "form" }
def sigmoid(x, x0, k, a, c):
    y = a / (1 + np.exp(-k*(x-x0))) + c
    return y

x_min = np.min(X)
x_max = np.max(X)
xx = np.linspace(x_min, x_max, 250).reshape((-1, 1))

popt, pcov = curve_fit(sigmoid, X.reshape(-1), Y.reshape(-1))
yy = sigmoid(xx, *popt)

plt.figure()
plt.scatter(X, Y)
plt.grid(ls='--')
plt.xlabel('x')
plt.ylabel('y')
plt.plot(xx, yy, 'r')
plt.savefig("output/sigmoid_fit.pdf", bbox_inches="tight", pad_inches=0)

As we can see, the sigmoid follows the data far more closely than any of the polynomials, even though it has only four parameters.
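
Nonlinear fitting with `curve_fit` depends on its starting point, so it can help to pass an explicit initial guess `p0` and to inspect the uncertainty of the fitted parameters. The sketch below shows one possible way to do this on synthetic data with known parameters. The data, the values in `true_params` and the heuristic used for `p0` are assumptions made for illustration; they are not taken from the dataset above.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, x0, k, a, c):
    # same parameterisation as above: midpoint x0, steepness k,
    # amplitude a and vertical offset c
    return a / (1 + np.exp(-k * (x - x0))) + c

# hypothetical stand-in data with known parameters (not the CSV above)
rng = np.random.default_rng(0)
true_params = (0.0, 3.0, 2.0, -1.0)
x_demo = np.linspace(-2, 2, 100)
y_demo = sigmoid(x_demo, *true_params) + rng.normal(0, 0.05, x_demo.shape)

# rough initial guess derived from the data: midpoint at the median x,
# unit steepness, amplitude = range of y, offset = minimum of y
p0 = [np.median(x_demo), 1.0, np.ptp(y_demo), np.min(y_demo)]
popt, pcov = curve_fit(sigmoid, x_demo, y_demo, p0=p0)

# the square roots of the diagonal of pcov estimate the standard errors
perr = np.sqrt(np.diag(pcov))
for name, value, err in zip(["x0", "k", "a", "c"], popt, perr):
    print("{} = {:.3f} +/- {:.3f}".format(name, value, err))
```

If the fitted values land close to `true_params`, the procedure recovers the underlying curve. On real data the true parameters are unknown, so the standard errors are the main indication of how well each parameter is determined.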

