## Introduction to Regression Splines with Python

Date: 23.04.2020

### Introduction
Everything starts with **Linear Regression**, which assumes a linear relationship between the dependent and independent variables. An improvement for nonlinear relationships is **Polynomial Regression**, but on datasets with high variability it is prone to over-fitting.
A more flexible alternative is the non-linear approach of **Regression Splines**.

### Data
The dataset contains information like the ID, year, age, sex, marital status, race, education, region, job class, health, health insurance, log of wage and wage of various employees. 


In [10]:
# import modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)  
from IPython.display import display, Math # for LaTeX output printed to the console

print("Pandas: ", pd.__version__)
print("Numpy: ", np.__version__)
print("Plotly: ", plotly.__version__)

# read the data
data = pd.read_csv("Wage.csv")
data.head()

Pandas:  1.0.2
Numpy:  1.18.1
Plotly:  4.5.4


Unnamed: 0,ID,year,age,sex,maritl,race,education,region,jobclass,health,health_ins,logwage,wage
0,231655,2006,18,1. Male,1. Never Married,1. White,1. < HS Grad,2. Middle Atlantic,1. Industrial,1. <=Good,2. No,4.318063,75.043154
1,86582,2004,24,1. Male,1. Never Married,1. White,4. College Grad,2. Middle Atlantic,2. Information,2. >=Very Good,2. No,4.255273,70.47602
2,161300,2003,45,1. Male,2. Married,1. White,3. Some College,2. Middle Atlantic,1. Industrial,1. <=Good,1. Yes,4.875061,130.982177
3,155159,2003,43,1. Male,2. Married,3. Asian,4. College Grad,2. Middle Atlantic,2. Information,2. >=Very Good,1. Yes,5.041393,154.685293
4,11443,2005,50,1. Male,4. Divorced,1. White,2. HS Grad,2. Middle Atlantic,2. Information,1. <=Good,1. Yes,4.318063,75.043154


In [12]:
data_x = data["age"]
data_y = data["wage"]

# Dividing data into train and validation dataset
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(data_x, data_y, test_size=0.33, random_state=1)

# visualize relationship between age and wage
fig = px.scatter(x=train_x, y=train_y)
fig.show()

### Linear Regression
**Linear Regression** is a supervised learning algorithm for regression tasks.
It establishes a linear relationship between the dependent and independent variables and models the data through a linear equation such as $y = \beta_0 + \beta_1 * x_1 + ... + \beta_p * x_p$.
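
The coefficients of such an equation are usually estimated by least squares. The sketch below illustrates this for a single predictor on synthetic data (the data, the seed and the variable names are assumptions for illustration, not the Wage dataset). It builds a design matrix with an intercept column and solves for $\beta_0$ and $\beta_1$ with `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Wage dataset)
rng = np.random.default_rng(0)
x = rng.uniform(18, 80, size=200)
y = 50.0 + 0.7 * x + rng.normal(0.0, 5.0, size=200)

# Design matrix: a column of ones for the intercept beta_0, then x for beta_1
X = np.column_stack([np.ones_like(x), x])

# Solve the least-squares problem min ||X @ beta - y||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_0, beta_1 = beta
print("beta_0 =", round(beta_0, 3), "beta_1 =", round(beta_1, 3))
```

On such data the estimates should land close to the values used to generate it (50 and 0.7). `LinearRegression` from scikit-learn solves the same least-squares problem.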

The following computations show that **Linear Regression** does not capture all of the signal available in the data and is therefore not the best method for this problem.

In [19]:
train_x.values.reshape((-1,1))

array([[49],
       [40],
       [55],
       ...,
       [61],
       [34],
       [29]], dtype=int64)

In [58]:
from sklearn.linear_model import LinearRegression

# Fitting linear regression model
x = train_x.values.reshape((-1,1))

model = LinearRegression()
model.fit(x, train_y)
display(Math("\\beta_1 = {}".format(np.round(model.coef_[0],3))))
display(Math("\\beta_0 = {}".format(np.round(model.intercept_, 3))))

# calculate the MSE metric
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(valid_y, model.predict(valid_x.values.reshape((-1,1))))
print("MSE: ", np.round(mse, 3))

<IPython.core.display.Math object>

<IPython.core.display.Math object>

MSE:  1635.126
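
The MSE reported above is the mean of the squared differences between observed and predicted values, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. Taking the square root, an MSE of 1635.126 corresponds to a typical error of roughly 40.4 in wage units. A minimal sketch of the computation on made-up numbers (the arrays are assumptions for illustration):

```python
import numpy as np

# Made-up observed and predicted values (assumptions for illustration)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Mean squared error: average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# Root mean squared error: same units as the target variable
rmse_manual = np.sqrt(mse_manual)
print("MSE:", mse_manual, "RMSE:", round(rmse_manual, 3))
```

`mean_squared_error` from `sklearn.metrics` computes the same quantity as `mse_manual`.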


In [56]:
# plot the fit 
xp = np.linspace(valid_x.min(), valid_x.max(), 100)
yp = model.predict(xp.reshape((-1,1)))

fig = go.Figure()
fig.add_trace(go.Scatter(x=data_x.values, y=data_y.values, mode="markers", name="data"))
fig.add_trace(go.Scatter(x=xp, y=yp, mode="lines", name="fit"))
fig.update_layout(
    title="Linear Regression",
    xaxis_title="Age", yaxis_title="Wage")
fig.show()

### Improvement over Linear Regression: Polynomial Regression

**Polynomial Regression** extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For a single predictor and degree $d$, an example formula would be $y_i = \beta_0 + \beta_1 * x_i + \beta_2 * x_i^2 + ... + \beta_d * x_i^d$.
As the degree increases, the fitted curve oscillates strongly and becomes overly flexible, which leads to **over-fitting**.
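
One way to make over-fitting visible is to compare the error on the training data with the error on held-out data as the degree grows. The sketch below does this on synthetic data (the sine-shaped data, the seed, the split and the chosen degrees are assumptions for illustration). It uses `numpy.polynomial.Polynomial.fit`, which is a different routine from the `np.polyfit` call used in this notebook.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic noisy data (an assumption for illustration, not the Wage dataset)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=60)

# Simple hold-out split: first 40 points for training, last 20 for validation
x_tr, y_tr = x[:40], y[:40]
x_va, y_va = x[40:], y[40:]

def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))

# Fit polynomials of increasing degree and record (train MSE, validation MSE)
results = {}
for degree in (1, 3, 9, 15):
    p = Polynomial.fit(x_tr, y_tr, degree)
    results[degree] = (mse(y_tr, p(x_tr)), mse(y_va, p(x_va)))

for degree, (tr, va) in results.items():
    print(f"degree={degree:2d}  train MSE={tr:.4f}  validation MSE={va:.4f}")
```

The training error cannot increase when the degree grows, because each lower-degree model is a special case of the higher-degree one. The validation error has no such guarantee, and a widening gap between the two is the usual sign of over-fitting.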


In [67]:
def poly_fit(degree, train_x, train_y):
    # generating weights for a polynomial function of the given degree
    weights = np.polyfit(train_x, train_y, degree)
    # generate model with given weights
    model = np.poly1d(weights)
    return model

def predict(model, valid_x):
    # prediction on the validation set
    pred = model(valid_x)
    return pred

model_1 = poly_fit(2, train_x, train_y)
model_2 = poly_fit(14, train_x, train_y)
model_3 = poly_fit(25, train_x, train_y)

# evaluate the fitted curves on 70 evenly spaced ages for plotting
xp = np.linspace(valid_x.min(), valid_x.max(), 70)
yp_2 = model_1(xp)
yp_14 = model_2(xp)
yp_25 = model_3(xp)


fig = go.Figure()
fig.add_trace(go.Scatter(x=data_x.values, y=data_y.values, mode="markers", name="data"))
fig.add_trace(go.Scatter(x=xp, y=yp_2, mode="lines", name="d = 2"))
fig.add_trace(go.Scatter(x=xp, y=yp_14, mode="lines", name="d = 14"))
fig.add_trace(go.Scatter(x=xp, y=yp_25, mode="lines", name="d = 25"))
fig.update_layout(
    title="Polynomial Regression",
    xaxis_title="Age", yaxis_title="Wage")
fig.show()


Polyfit may be poorly conditioned
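
`np.polyfit` emits this warning when the least-squares problem is numerically close to rank-deficient. That is common when raw ages are raised to high powers, because the columns of the design matrix then differ by many orders of magnitude. One mitigation is to rescale the predictor before fitting. `numpy.polynomial.Polynomial.fit` does this by mapping the data range onto $[-1, 1]$. The sketch below compares the conditioning of both variants on synthetic age and wage values (the data, the seed and the degree of 10 are assumptions for illustration). Very high degrees can remain ill-conditioned even after rescaling.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic ages and wages (assumptions for illustration, not the Wage dataset)
rng = np.random.default_rng(2)
age = rng.integers(18, 81, size=500).astype(float)
wage = 60.0 + 3.0 * age - 0.03 * age**2 + rng.normal(0.0, 10.0, size=500)

# Condition number of the design matrix built from raw powers of age (degree 10)
cond_raw = np.linalg.cond(np.vander(age, 11))

# Same powers after mapping age linearly onto [-1, 1]
scaled = 2.0 * (age - age.min()) / (age.max() - age.min()) - 1.0
cond_scaled = np.linalg.cond(np.vander(scaled, 11))
print(f"condition number raw: {cond_raw:.3e}, scaled: {cond_scaled:.3e}")

# Polynomial.fit applies this kind of mapping internally (domain -> window)
p = Polynomial.fit(age, wage, deg=10)
pred = p(np.array([25.0, 45.0, 65.0]))  # evaluate on the original age scale
print(pred)
```

The returned object is called with values on the original age scale. Its `domain` attribute records the range that was mapped onto $[-1, 1]$.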


