## Introduction to Regression Splines with Python

Date: 23.04.2020

### Introduction
Everything starts with **Linear Regression**, which assumes a linear relationship between the dependent and independent variables. An improvement for nonlinear relationships is **Polynomial Regression**, but on datasets with high variability it is prone to over-fitting. 
A more flexible alternative is the non-linear approach of **Regression Splines**. 


### Source
https://www.analyticsvidhya.com/blog/2018/03/introduction-regression-splines-python-codes/


### Data
The dataset contains information like the ID, year, age, sex, marital status, race, education, region, job class, health, health insurance, log of wage and wage of various employees. 


In [1]:
# import modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)  
from IPython.display import display, Math # for LaTeX output when printing to the console

print("Pandas: ", pd.__version__)
print("Numpy: ", np.__version__)
print("Plotly: ", plotly.__version__)

# read the data
data = pd.read_csv("Wage.csv")
data.head()

Pandas:  1.0.1
Numpy:  1.18.1
Plotly:  4.4.1


| | ID | year | age | sex | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 231655 | 2006 | 18 | 1. Male | 1. Never Married | 1. White | 1. < HS Grad | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 2. No | 4.318063 | 75.043154 |
| 1 | 86582 | 2004 | 24 | 1. Male | 1. Never Married | 1. White | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 2. No | 4.255273 | 70.47602 |
| 2 | 161300 | 2003 | 45 | 1. Male | 2. Married | 1. White | 3. Some College | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 1. Yes | 4.875061 | 130.982177 |
| 3 | 155159 | 2003 | 43 | 1. Male | 2. Married | 3. Asian | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 1. Yes | 5.041393 | 154.685293 |
| 4 | 11443 | 2005 | 50 | 1. Male | 4. Divorced | 1. White | 2. HS Grad | 2. Middle Atlantic | 2. Information | 1. <=Good | 1. Yes | 4.318063 | 75.043154 |


In [2]:
data_x = data["age"]
data_y = data["wage"]

# Dividing data into train and validation dataset
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(data_x, data_y, test_size=0.33, random_state=1)

# visualize relationship between age and wage
fig = px.scatter(x=train_x, y=train_y)
fig.show()

### Linear Regression
A supervised learning algorithm for solving regression tasks.
It establishes a linear relationship between the dependent and independent variables and models the data through a linear equation such as $y = \beta_0 + \beta_1 * x_1 + ... + \beta_p * x_p$.
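
As a minimal sketch of how the coefficients of such a linear equation can be estimated, the following uses ordinary least squares via `np.linalg.lstsq`. The synthetic data and the names `rng`, `X` and `beta` are assumptions for illustration and are not part of the Wage analysis.

```python
import numpy as np

# synthetic data (assumption): y = 2 + 3*x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.size)

# design matrix with a column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(x), x])

# ordinary least squares: minimise ||X @ beta - y||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_0, beta_1 = beta
print("beta_0 =", round(beta_0, 3), "beta_1 =", round(beta_1, 3))
```

The estimates should land close to the values used to generate the data; `LinearRegression` from scikit-learn solves the same least-squares problem.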

The following computations show that **Linear Regression** does not capture all of the signal in the data and is therefore not the best method for this problem.

In [19]:
train_x.values.reshape((-1,1))

array([[49],
       [40],
       [55],
       ...,
       [61],
       [34],
       [29]], dtype=int64)

In [3]:
from sklearn.linear_model import LinearRegression

# Fitting linear regression model
x = train_x.values.reshape((-1,1))

model = LinearRegression()
model.fit(x, train_y)
display(Math("\\beta_1 = {}".format(np.round(model.coef_[0],3))))
display(Math("\\beta_0 = {}".format(np.round(model.intercept_, 3))))

# calculate the MSE metric
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(valid_y, model.predict(valid_x.values.reshape((-1,1))))
print("MSE: ", np.round(mse, 3))

MSE:  1635.126


In [4]:
# plot the fit 
xp = np.linspace(valid_x.min(), valid_x.max(), 100)
yp = model.predict(xp.reshape((-1,1)))

fig = go.Figure()
fig.add_trace(go.Scatter(x=data_x.values, y=data_y.values, mode="markers", name="data"))
fig.add_trace(go.Scatter(x=xp, y=yp, mode="lines", name="fit"))
fig.update_layout(
    title="Linear Regression",
    xaxis_title="Age", yaxis_title="Wage")
fig.show()

### Improvement over Linear Regression: Polynomial Regression

**Polynomial Regression** extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. An example formula would be $y = \beta_0 + \beta_1 * x_i + \beta_2 * x_i^2 + ... + \beta_p * x_i^p$. 
As the degree increases, the fitted curve oscillates more strongly and becomes over-flexible. Such curves lead to **over-fitting**. 

Unfortunately, polynomial regression has a fair number of issues as well. As the complexity of the formula increases, the number of features also increases, which can be difficult to handle. Polynomial regression also has a tendency to over-fit drastically, even on this simple one-dimensional dataset. It is inherently non-local, i.e., changing the value of Y at one point in the training set can affect the fit of the polynomial for data points that are very far away.

Hence, instead of fitting one high-degree polynomial to the whole dataset, we can fit several low-degree polynomials, each on a different part of the data.
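
The non-local behaviour described above can be illustrated with a small sketch: a single training target at the left edge of the range is changed, and the prediction at the right edge moves as well. The synthetic data, the degree and the size of the perturbation are assumptions chosen only to make the effect visible.

```python
import numpy as np

# synthetic data (assumption): a smooth curve sampled on [0, 10]
x = np.linspace(0, 10, 30)
y = np.sin(x)

degree = 8  # hypothetical choice, high enough to show the effect
fit_before = np.poly1d(np.polyfit(x, y, degree))

# perturb one single training target at the left edge of the range
y_changed = y.copy()
y_changed[0] += 5.0
fit_after = np.poly1d(np.polyfit(x, y_changed, degree))

# compare the two fits at the far right end of the range
x_far = 10.0
shift = abs(fit_after(x_far) - fit_before(x_far))
print("change in prediction at x = 10:", shift)
```

A piecewise model would only react in the interval that contains the perturbed point, which is the motivation for the splines in the next section.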

In [5]:
def poly_fit(degree, train_x, train_y):
    # least-squares weights for a polynomial of the given degree
    weights = np.polyfit(train_x, train_y, degree)
    # build a callable polynomial model from the weights
    model = np.poly1d(weights)
    return model

def predict(model, valid_x):
    # prediction on the validation set
    pred = model(valid_x)
    return pred

model_1 = poly_fit(2, train_x, train_y)
model_2 = poly_fit(4, train_x, train_y)
model_3 = poly_fit(8, train_x, train_y)

# evaluate each fit on 70 evenly spaced points for plotting
xp = np.linspace(valid_x.min(), valid_x.max(), 70)
yp_2 = model_1(xp)
yp_4 = model_2(xp)
yp_8 = model_3(xp)


fig = go.Figure()
fig.add_trace(go.Scatter(x=data_x.values, y=data_y.values, mode="markers", name="data"))
fig.add_trace(go.Scatter(x=xp, y=yp_2, mode="lines", name="d = 2"))
fig.add_trace(go.Scatter(x=xp, y=yp_4, mode="lines", name="d = 4"))
fig.add_trace(go.Scatter(x=xp, y=yp_8, mode="lines", name="d = 8"))
fig.update_layout(
    title="Polynomial Regression",
    xaxis_title="Age", yaxis_title="Wage")
fig.show()

### Spline Regression: Walkthrough and Implementation

Instead of building one model for the whole dataset, divide the data into multiple bins and fit each bin with a separate model. This is the idea behind **Regression Splines**.

**Spline**: a piecewise polynomial function, e.g.:

- piecewise step function: constant within each interval 
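
As a minimal sketch of the piecewise step function, the least-squares constant within each interval is simply the mean of the targets that fall into it. The synthetic data and the choice of four bins are assumptions for illustration; the cells below do the same on the Wage data using dummy variables.

```python
import numpy as np
import pandas as pd

# synthetic data (assumption): a noisy step-like relationship
rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(0, 10, 500))
y = pd.Series(np.where(x < 5, 1.0, 3.0) + rng.normal(scale=0.1, size=500))

# split the x-range into equal-width bins; keep integer bin codes and the edges
codes, edges = pd.cut(x, bins=4, retbins=True, labels=False)

# the least-squares constant per bin is the mean of y in that bin
step_values = y.groupby(codes).mean()
print(step_values)

# predict new points by looking up the bin they fall into
x_new = np.array([1.0, 9.0])
new_codes = pd.cut(x_new, bins=edges, labels=False)
pred = step_values.loc[new_codes].to_numpy()
print(pred)
```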

In [6]:
def str_bins(bins):
    """ gets list of bin values and returns a list of strings of intervals """
    s = list()
    for i in range(len(bins)-1):
        s.append(str(np.round(bins[i], 4))+"-"+str(np.round(bins[i+1])))
    return s

In [7]:
# Piecewise step function

# divide data into n bins
df_cut, bins = pd.cut(train_x, 10, retbins=True, right=True)
df_cut.value_counts(sort=False)

df_steps = pd.concat([train_x, df_cut, train_y], keys=["age", "age_cuts", "wage"], axis=1)

# create dummy variables for the groups
df_steps_dummies = pd.get_dummies(df_cut)
df_steps_dummies.columns = str_bins(bins) 
df_steps_dummies.head()

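| | 17.938-24.0 | 24.2-30.0 | 30.4-37.0 | 36.6-43.0 | 42.8-49.0 | 49.0-55.0 | 55.2-61.0 | 61.4-68.0 | 67.6-74.0 | 73.8-80.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1382 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 23 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2140 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1117 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 933 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |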


In [8]:
# Fitting Generalized Linear Models
fit = sm.GLM(endog=df_steps.wage, exog=df_steps_dummies.values).fit()

# binning validation set into the same 10 bins
bin_mapping = np.digitize(valid_x, bins)
X_valid = pd.get_dummies(bin_mapping).drop([5], axis=1)

# prediction
pred = fit.predict(X_valid)
mse = mean_squared_error(valid_y, pred)
print("MSE: ", mse)

MSE:  4263.746355904708
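
Note that `np.digitize` numbers its bins from 1 and assigns values at or beyond the last edge to an additional bin, so the dummy columns of a validation set do not necessarily line up with those created by `pd.cut` on the training set. A minimal sketch of one way to keep them aligned is to reuse the training bin edges in `pd.cut`. The small age series below are made-up stand-ins for `train_x` and `valid_x`, and this is an alternative to the cell above, not a reproduction of it.

```python
import numpy as np
import pandas as pd

# synthetic stand-ins for train_x / valid_x (assumption)
train_age = pd.Series([18, 25, 33, 41, 52, 60, 71, 80])
valid_age = pd.Series([20, 45, 80])

# bin the training data and keep the edges
train_cut, edges = pd.cut(train_age, 4, retbins=True, right=True)
train_dummies = pd.get_dummies(train_cut)

# reuse the same edges so the validation data get the same intervals
valid_cut = pd.cut(valid_age, bins=edges, right=True)
valid_dummies = pd.get_dummies(valid_cut)

# one column per training bin, one active column per validation row
print(train_dummies.shape[1], valid_dummies.shape[1])
print(valid_dummies.to_numpy().astype(int))
```

Because the categories come from the same edges, `get_dummies` also creates columns for bins that contain no validation points, so the design matrix keeps the width the fitted model expects.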


In [9]:
# evaluate the fit at 70 evenly spaced points for plotting
xp = np.linspace(valid_x.min(),valid_x.max()-1,70) 
bin_mapping = np.digitize(xp, bins) 
X_valid_2 = pd.get_dummies(bin_mapping) 
pred2 = fit.predict(X_valid_2)

In [10]:
# plot piecewise constant fit
fig = go.Figure()
fig.add_trace(go.Scatter(x=data_x, y=data_y, name="Data", mode="markers"))
fig.add_trace(go.Scatter(x=xp, y=pred2, name="Piecewise Constant Spline Fit", mode="lines+markers"))
fig.update_layout(title="Spline Regression 1", xaxis_title="age", yaxis_title="wage")

#### Choosing the Number and Locations of the Knots

One strategy for knot placement is to put more knots in regions where the data vary strongly. This can work well, but it is more common to place knots uniformly. Here, one specifies the degrees of freedom and lets the algorithm automatically place the corresponding number of knots at uniform quantiles of the data. Another option is to try different numbers of knots by hand and compare them with cross-validation.
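
Placing knots at uniform quantiles can be sketched as follows, assuming synthetic data and a hypothetical choice of five interior knots. `LSQUnivariateSpline` from `scipy.interpolate` is used here as one possible way to fit a least-squares cubic spline with user-supplied interior knots; it expects `x` in increasing order.

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

# synthetic data (assumption): noisy sine curve, x sorted in increasing order
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# choose the number of interior knots and place them at uniform quantiles of x
n_knots = 5  # hypothetical choice
probs = np.linspace(0, 1, n_knots + 2)[1:-1]  # drop the 0% and 100% quantiles
knots = np.quantile(x, probs)

# least-squares cubic spline with the chosen interior knots
spl = LSQUnivariateSpline(x, y, t=knots, k=3)

rmse = np.sqrt(np.mean((spl(x) - y) ** 2))
print("interior knots:", np.round(knots, 2))
print("training RMSE:", round(rmse, 3))
```

To pick the number of knots by cross-validation, the same fit could be repeated for several values of `n_knots` and compared on held-out data.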

In [11]:
from scipy.interpolate import BSpline, UnivariateSpline

In [12]:
k = 2
t = np.arange(7)
c =[-1, 2, 0, -1]

data = pd.DataFrame(data={"x": np.arange(len(c)+1)[1:],
                          "y": c})
data.head()
# px.scatter(x=np.arange(len(c)), y=c)


| | x | y |
|---|---|---|
| 0 | 1 | -1 |
| 1 | 2 | 2 |
| 2 | 3 | 0 |
| 3 | 4 | -1 |


In [15]:
spl= BSpline(t, c, k)
spl.basis_element([1.])
xp = np.linspace(-1,8,100)

In [16]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(c)), y=c, mode="markers"))
fig.add_trace(go.Scatter(x=xp, y=spl(xp), mode="lines"))

In [44]:
# Univariate splines from the scipy.interpolate toolbox
from scipy.interpolate import BSpline, UnivariateSpline

x = np.linspace(-1,1,50)
y = np.exp(-x**2)+ 0.1*np.random.randn(50)
fig = px.scatter(x=x,y=y)
# spline with default parameters
spl = UnivariateSpline(x=x, y=y, k=3, ext=1)
xs = np.linspace(-1.2,1.2,1000)
fig.add_scatter(x=xs, y=spl(xs))
# spline with user defined smoothing
spl.set_smoothing_factor(0.4)
fig.add_scatter(x=xs, y=spl(xs), name="Smoothing Spline")

In [None]:
def B(x, k, i, t):
    # B-spline basis function of degree k on knots t, evaluated at x (recursive definition)
    if k == 0:
        return 1.0 if t[i] <= x < t[i+1] else 0.0
    if t[i+k] == t[i]:
        c1 = 0.0
    else:
        c1 = (x - t[i])/(t[i+k] - t[i]) * B(x, k-1, i, t)
    if t[i+k+1] == t[i+1]:
        c2 = 0.0
    else:
        c2 = (t[i+k+1] - x)/(t[i+k+1] - t[i+1]) * B(x, k-1, i+1, t)
    return c1 + c2

def bspline(x, t, c, k):
    n = len(t) - k - 1
    assert (n >= k+1) and (len(c) >= n)
    return sum(c[i] * B(x, k, i, t) for i in range(n))
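
As a quick check, the recursive definition can be compared with `scipy.interpolate.BSpline` using the knots and coefficients from the earlier `BSpline` cell. The two functions are repeated so that the block runs on its own; the evaluation grid `xs` is an assumption and is restricted to the base interval `[t[k], t[n])`, where both implementations are expected to agree.

```python
import numpy as np
from scipy.interpolate import BSpline

def B(x, k, i, t):
    # recursive B-spline basis function of degree k
    if k == 0:
        return 1.0 if t[i] <= x < t[i+1] else 0.0
    if t[i+k] == t[i]:
        c1 = 0.0
    else:
        c1 = (x - t[i])/(t[i+k] - t[i]) * B(x, k-1, i, t)
    if t[i+k+1] == t[i+1]:
        c2 = 0.0
    else:
        c2 = (t[i+k+1] - x)/(t[i+k+1] - t[i+1]) * B(x, k-1, i+1, t)
    return c1 + c2

def bspline(x, t, c, k):
    # weighted sum of the basis functions
    n = len(t) - k - 1
    assert (n >= k+1) and (len(c) >= n)
    return sum(c[i] * B(x, k, i, t) for i in range(n))

k = 2
t = np.arange(7)
c = [-1, 2, 0, -1]
spl = BSpline(t, c, k)

# compare on points inside the base interval [t[k], t[n]) = [2, 4)
xs = np.linspace(2.0, 3.9, 20)
naive = np.array([bspline(xi, t, c, k) for xi in xs])
print(np.allclose(naive, spl(xs)))
```

The recursive version is useful for understanding the definition, but it evaluates one point at a time and is much slower than the library implementation.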