## Chapter 3 Introducting GAMs

### 3.1 Introduction
A generalized additive model (Hastie and Tibshirani, 1986, 1990) is a generalized linear model with a linear predictor involving a sum of smooth functions of covariates. In general the model has a structure  like 

$$ 
    g(\mu_i) = X_i^T \theta + f_1(x_{1i}) + f_2(x_{2i}) + f_3(x_{3i}, x_{4i}) + ... \quad (3.1) 
$$

where $ \mu_i \equiv \mathbb{E}(Y_i) \ and \ Y_i \sim $ some EXP-family-distribution.

### 3.2 Univariate smooth functions
The representation of smooth functions is best introduced by considering a model containing one smooth function of one covariate,
$$
y_i = f(x_i) + \epsilon_i, \quad (3.2)
$$
where $y_i$ is a response variable, $x_i$ a covariate (or variable or predictor), $f$ a smooth function and the $\epsilon_i$ are i.i.d. $\mathcal N(0, \epsilon^2)$ random variables. 

One possibility to represent smooth functions is using **regression splines**. This can be done by choosing a *basis*, defining the space of functions of which the smooth funtion (or a close approximation to it) is an element of. This *basis* can the be represented by some *basis functions*, which will be treated as completely known: if $b_i(x)$ is the $i^{th}$ such basis function, then $f$ is assumend to have a representation
$$
    f(x) = \sum_{i=1}^q b_i(x) \beta_i \quad (3.3)
$$
for some values of the unknown parameters, $\beta_i$, which needs to be estimated. 

There are many different such *bases* available, e.g.

- polynomial basis
- recursive defined basis by deBoor, 1978
- cubic basis defined by Wahba, 1990

#### 3.2.1 Cubic spline basis by Wahba
A cubic spline is a curve, made up of sections of cubic polynomial, joined together so that they are continuous in value as well as in first and second derivatives. The points at which the sections join are known as the knots of the spline. For a **conventional spline**, the knots occur wherever there is a datum, but for the **regression splines** of
interest here, the locations of the knots must be chosen. Typically the knots would either be evenly spaced through the range of observed $x$ values, or *placed at quantiles of the distribution of unique x* values. Whatever method is used, let the knot locations be denoted by $\{x_i: i = 1, · · · , q − 2\}$.
The **cubic spline basis by Wahba** is defined as
$$
    b_1(x) = 1, \\
    b_2(x) = x, \\
    b_{i+2} = R(x, x_i^*) \ for \ i=1, ..., q-2, \ where \\
    \quad R(x,z) = \frac{\big[(z - \frac{1}{2})^2 + \frac{1}{12}\big]\big[(x - \frac{1}{2})^2 - \frac{1}{12}\big]}{4} -  \frac{\big[(\vert x - z \vert - \frac{1}{2})^4 - \frac{1}{2}(\vert x - z \vert - \frac{1}{2})^2 + \frac{7}{240} \big] }{24} \quad (3.4)
$$

An illustration of the basis functions, as in Fig. 3.4 and Fig. 3.5 in GAM - An Introduction to R, is depicted here ![title](img\bspline-basis-wahba.png).
For **regression spline** models, the fit of the model tends to depend quite strongly on the locations chosen for the knots. 

#### 3.2.2 Controlling the degree of smoothing with penalized regression splines
The degree of smoothness can be constrained by backward selection of the number $q$ of splines used. An alternative to controlling smoothnes by altering the basis dimension is to keep the basis dimension fixed, at a size a little larger than necessary, and  to control the model's smoothness by adding a "wiggliness" penalty to the least squares fitting objective. One would minimize
$$
    \lVert y - X \beta \rVert + \lambda \int_0^1 \vert f''(x)\vert^2 dx. \quad (3.5.a)
$$
Because $f$ is linear in the parameters $\beta_i$, the penalty can always be written as a quadratic form in $\beta$
$$
    \int_0^1 \vert f(x)'' \vert^2dx = \beta^T S \beta \quad (3.5.b)
$$
where $S$ is a matrix of *known coefficients*. Here, the rather complicated form of the cubic spline basis by Wahba proves its worth, because it turns out that $S_{i+1,j+2} = R(x_i, x_j)$ for $i,j = 1, ..., q-2$ while the first two rows and columns of $S$ are $0$ (Gu, 2002, p.34). 

The **penalized least squares estimator of $\beta$** is then given by
$$
    \hat \beta = (X^T X + \lambda S)^{-1}X^T y \quad (3.6)
$$
Similar the **influence matrix** or **hat matrix** $A$ for the model can be written 
$$
    A = X(X^T X + \lambda S)^{-1} X^T. 
$$
Recall that $\hat \mu = Ay$. For practical computations, (3.5.b) can be reformulated as
$$
    \Big \lVert \begin{bmatrix} y \\ 0  \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda} B \end{bmatrix} \beta \Big \rVert^2 = \lVert y - X \beta \rVert^2 + \lambda \beta^T S \beta  
$$
where $B$ is any square root of the matrix $S$ such that $B^T B = S$, which can be computed by spectral decomposition or pivoted Choleski decomposition. 

#### 3.2.3 Choosing the smoothing parameter $\lambda$ via cross validation
If $\lambda$ is too high then the data will be over smoothed, and if it is too low then the data will be under smoothed: in both cases this will mean that the spline estimate $\hat f$ will not be close to the true function $f$. A suitable criterion might be to choose $\lambda$ to minimize
$$
    M = \frac{1}{n} \sum_{i=1}^n( \hat f(x_i) − f(x_i))^2.
$$
Since $f$ is unknown, $M$ cannot be used directly, but it is possible to derive an estimate of $\mathbb E(M) + \sigma^2$, which is the expected squared error in predicting a new variable. Let $\hat f^{[i]}$ be the model fitted to all data except $y_i$, and define the **ordinary cross validation score**
$$
    V_o = \frac{1}{n} \sum_{i=1}^n (\hat f^{[i]}(x_i) - y_i)^2.
$$
This score results from leaving out each datum once, fitting the model to the remaining data and calculating the squared difference between the missing datum and its predicted value. These squared differences are then averaged over all data. Substituting $y_i = f_i + \epsilon_i$  and separating the squares leads to
$$
    \mathbb E(V_o) = \frac{1}{n} \mathbb E\big(\sum_{i=1}^n (\hat f^{[-i]}(x_i) - f(x_i))^2\big) + \sigma^2
$$
Now, $\hat f^{[i]} \approx \hat f$ withe equality in the large sample limit, so $\mathbb E(V_o) \approx \mathbb E(M) + \sigma^2$ also with equality in the large sample limit. Hence choosing $\lambda$ in order to minimize $V_o$ is a reasonable approach if the ideal would be to minimize $M$. 

It is inefficient to calculate $V_o$ by leaving out one datum at a time and fitting the model to each of the n resulting datasets, but it can be shown (see Sec. 4.5.1) that
$$
    V_o = \frac{1}{n} \sum_{i=1}^n \frac{(y_i - \hat f(x_i))^2 }{1- A_{ii}^2},
$$
where $A$ is the corresponding **influence matrix**. In practice the weights $1 - A_{ii}$, are often replaced by the mean weight, $\frac{tr(I - A)}{n}$, in order to get the **generalized cross validation score**.

An example for generalized cross validation is given below. The code is in block #GCV. ![title](img/gcv-score.png) 


### 3.3 Additive Models
Suppose that two explanatory variables, x and z, are available for a response y, and that a simple additive model structure
$$
    y_i = f_1(x_i) + f_2(z_i) + \epsilon_i, \quad (3.7)
$$
is appropriate. The functions $f_j$ are smooth functions. There are two important points to notice:

- the assumption of an additive effect is a fairly strong one
- there is an identifiability problem: $f_1$ and $f_2$ are each only estimateable to within an additive constant, an identifiability constraints have to be imposed before fitting

Provided the identifiability issue is addressed, the additive model can be represented using penalized regression splines, estimated by penalized least squares and the degree of smoothing estimated by cross validation, in the same way as the simple univariate model. 

#### 3.3.1 Penalized regression spline representation of an additive model
Each smooth function in (3.7) can be represented using a penalized regression spline basis. Using the basis from (3.2.1)
$$
    f_1(x) = \delta_1 + x\delta_2 + \sum_{j=1}^1{q_1-2} R(x, x_j^*)\delta_{j+2}
$$
and
$$
    f_2(z) = \gamma_1 + z\gamma_2 + \sum_{j=1}^1{q_2-2} R(z, z_j^*)\gamma_{j+2}
$$
where $\delta_j$ and $\gamma_j$ are the unknown parameters for $f_1$ and $f_2$ respectively. $q_1$ and $q_2$ are the number of unknown parameters for the functions, while $x_j^*$ and $z_j^*$ are the knot locations for the two functions.

The identifiability problem with the additive model means that $\delta_1$ and $\gamma_1$ are confounded. The simplest way to deal with this is to constrain one of them to zero, say $\gamma_1$ = 0. Having done this, it is easy to see that the additive model can be written in the linear model form, $y = X \beta + \epsilon$, where the $i^{th}$ row of the m model matrix is now
$$
    X_i = [1, x_i, R(x_i, x_1^*), R(x_i, x_2^*), . . . ,R(x_i, x_{q1−2}^*), z_i, R(z_i, z_1^*), . . . , R(z_i, z_{q2−2}^*]
$$
and the parameter vector is 
$$
    \beta = [\delta_ 1, \delta_2, . . . , \delta_{q1}, \gamma_2, \gamma_3, . . . , \gamma_{q2}]^T.
$$
The wiggliness of the function can also be measure exactly as in section 3.2.2. Here, **take care with the matrix dimensions**. The penalty matrix $S \in \mathbb R^{2q-1, 2q-1}$, for the same number of knots for each term, has the structure $\lambda_1 S_1 + \lambda_2 S_2$. 

### 3.4 Generalized Additive Models
GAMs follow from additive models as GLMs follow from linear models. That is, the linear predictor now predicts some known smooth monotonic function of the expected value of the response, and the response may follow any exponential family distribution, or simply have a known mean-variance relationship, permitting the use of a quasi-likelihood approach. The resulting model has a general form like (3.1)

Now it is tempting to suppose that all that is needed, to fit this GAM, is to replace the call to lm with a call to glm (Generalized Linear Model toolbox in R), and perhaps tweak the definition of the GCV score a little. Unfortunately, further reflection reveals that this is not the case. 

Whereas the additive model was estimated by penalized least squares, the GAM will be fitted by **penalized likelihood maximization**: in practice this will be achieved by **penalized iterative least squares**, but there is no simple trick to produce an unpenalized GLM whose likelihood is equivalent to the penalized likelihood of the GAM that we wish to fit.

To fit the model we simply iterate the followin penalized iteratively re-weighted least squares (**P-IRLS**) scheme to convergence:

1. Given the current parameter estimates $\beta^{[k]}$, and corresponding estimated mean response vector $\mu^{[k]}$, calculate: 
$$
    w_i \propto \frac{1}{V(\mu_i^{[k]}) g'(\mu_i^{[k]})} \\
    z_i = g(\mu_i^{[k]})(y_i - \mu_i^{[k]}) + X_i \beta^{[k]}
$$ 
where $var(Y_i) = V(\mu^{[k]}) \phi$ is given by the exponential family distribution
2. Minimize 
$$
    \lVert \sqrt{W} (z - X \beta) \rVert^2 + \lambda_1 \beta^T S_1 \beta + \lambda_2 \beta^T S \beta
$$
w.r.t $\beta$ to obtain $\beta^{[k+1]}$. $W$ is a diagonal matrix such that $W_{ii} = w_i$

Further topics to implement in this chapter
* [ ] Additive models: model matrix
* [ ] GAMs: P-IRLS Algorithm


### 3.5 Summary
Everything in this chapter has been kept straightforward as possible, in order to emphasize the basic simplicity of this sort of modelling. The next chapter simply adds details to the general framework.
It should be clear, for example:

- wide range of alternative bases are possible
- representing smooth functions of more than one variable requires that one choose basis functions of more than one variable, but nothing else changes
- dealing with other link fucntions and distributions involves programming, but nothing conceptually new

In [145]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True) 
import numpy as np
from numpy.linalg import lstsq
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import scipy.linalg as slinalg
import pandas as pd

def cubicSplineFromWhaba(x, z): 
    """ compute R(x,z) for a cubic spline on [0,1] accord
    ing to Wahba [1990]"""
    R = ((z - 1/2)**2 - 1/12)*((x - 1/2)**2 - 1/12)/4 - \
        ((np.abs(x - z) - 1/2)**4 - 1/2*(np.abs(x - z) - 1/2)**2 + 7/240) /24
    return R

def modelMatrix(x, xk):
    """ set up model matrix for cubic penalized regression spline """
    q, n = len(xk) + 2, len(x) # q = number of parameters, n = number of data
    X = np.ones(shape=(n, q))
    X[:,1] = x
    for i in range(0,n):
        for j in range(2, q):
            X[i, j] = cubicSplineFromWhaba(x[i], xk[j-2])
    return X

def penaltyMatrix(xk, returnSqrt=False):
    """ 
    compute the penalty matrix for penalized B-Splines according to Wood (p.126) 
    return the square root of the matrix if returnSqrt is True 
    """
    q = len(xk) + 2 # number of parameters
    S = np.zeros(shape=(q,q))
    for i in range(0, q-2):
        for j in range(0, q-2):
            S[i+2, j+2] = cubicSplineFromWhaba(xk[i], xk[j])
    return S if returnSqrt else slinalg.sqrtm(S)

def modelMatrixPenalized(y, x, xk, lam=0):
    """
    Fit the penalized B-Spline with knots at xk to the data x with lam as penalty weight
    """
    q = len(xk) + 2 # number of parameters
    n = len(x) # number of data
    # create augmented model matrix
    Xa = np.concatenate((modelMatrix(x,xk),np.sqrt(lam)*penaltyMatrix(xk, returnSqrt=True)))
    ya = np.concatenate((y, np.zeros((q,))))
    return ya, Xa

def prs_fit(y, x, xk, lam):
    """" function to fit penaliezd regression spline to x,y data with knots at xk 
    and given smoothing parameter lamdba """
    q, n = len(xk) + 2, len(x) # dimension of basis, number of data
    # create augmented model matrix
    y_aug, X_aug = modelMatrixPenalized(y, x, xk, lam)
    fit = lstsq(a=X_aug, b=y_aug)
    return fit, y_aug, X_aug

def influence(X, S=None, lam=0):
    """ computes the incluence matrix for the model matrix X, if an penalty matrix S is given, include it 
        according to: 
            A = X ( X.T * X + lam * S)^{-1} * X.T
    """
    hat = X.dot(np.linalg.inv(X.T.dot(X) + lam*S).dot(X.T))
    return hat
    
def cross_validation_fit(y, x, xk, plot_=True, returnBest=True):
    """ searches for the optimal smoothing parameter using generalized cross validation 
    and penalized regression splines. Returns the best fit if returnBest == True
    """
    V, Lam = list(), list()
    lam = 1e-8
    # compute the augmented output and augmented model matrix and the least squares fit
    for i in range(60):
        fit, y_aug, X_aug = prs_fit(y, x, xk, lam)
        S = penaltyMatrix(xk)
        traceA = np.trace(influence(X_aug, S=S, lam=lam))
        fitValue = np.dot(X_aug, fit[0])
        n = len(y)
        rss = np.sum((y - fitValue[:n])**2)
        V.append(n * rss / (n - traceA)**2)
        Lam.append(lam)
        lam *= 1.5
        
    if returnBest:
        lam = Lam[np.argmin(V)]
        bestFit = prs_fit(y, x, xk, lam=lam)
        print("Best Lambda found is {}".format(lam))
    else:
        bestFit = None
    
    if plot_:
        fig = make_subplots(rows=1, cols=2, subplot_titles=("Data", "Generalized Cross Validation"))
        # Add traces
        fig.add_trace(go.Scatter(x=x, y=y, mode="markers"), row=1, col=1)
        fig.add_trace(go.Scatter(x=Lam, y=V, mode="markers"), row=1, col=2)
        # Update xaxis properties
        fig.update_xaxes(type="log", title_text="log(lambda)", col=2)
        fig.update_xaxes(title_text="Normalized Size", col=1)
        # Update yaxis properties
        fig.update_yaxes(title_text="GCV", col=2)
        fig.update_yaxes(title_text="Wear", col=1)
        fig.show()

    return bestFit

In [147]:
# GCV
# COMPUTE BEST FIT ACCORDING TO GENERALIZED CROSS VALIDATION 
Size = np.array([1.42,1.58,1.78,1.99,1.99,1.99,2.13,2.13,2.13,2.32,2.32,2.32,2.32,2.32,2.43,2.43,2.78,2.98,2.98])
Wear = np.array([4.0,4.2,2.5,2.6,2.8,2.4,3.2,2.4,2.6,4.8,2.9,3.8,3.0,2.7,3.1,3.3,3.0,2.8,1.7])
# normalized data
x = Size - np.min(Size)
x = x/np.max(x)
# get the knots
xk = np.arange(1,15)/15
fit = cross_validation_fit(Wear, x, xk, plot_=True, returnBest=True)


`rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.



Best Lambda found is 0.07371554880626674


In [148]:
# try to plot the cubic regression splines from Wahba as on p.123 in Wood
xk = np.arange(1,7, 2)/6
x = np.linspace(0,1,20)
X = modelMatrix(x, xk)
def f(x):
    return np.sin(x) + x**3 + np.random.rand(len(x))*0.1
fit = lstsq(a=X, b=f(x))


df_X = pd.DataFrame(data={"x":x, "B1":X[:,0], "B2":X[:,1], "B3":X[:,2], "B4":X[:,3], "B5":X[:,4]})
#df_X.plot(x="x", kind="line", subplots=True, sharex=True)

fig = make_subplots(rows=2, cols=3)


for idx, colN in enumerate(df_X.columns[1:]):
    if idx <= 2:
        fig.add_trace(go.Scatter(x=df_X["x"], y=df_X[colN], name=colN), row=1, col=idx+1)
    else:
        fig.add_trace(go.Scatter(x=df_X["x"], y=df_X[colN], name=colN), row=2, col=idx-2)
fig.add_trace(go.Scatter(x=df_X["x"], y=np.dot(fit[0], df_X[df_X.columns[1:]].T), name="f(x)"), row=2, col=3)
fig.add_trace(go.Scatter(x=df_X["x"], y=f(x), name="true f(x)"), row=2, col=3)
fig.update_layout(title="First 5 basis splines according to Wahba")
fig.show()



`rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.



In [149]:
# BASIC REGRESSION SPLINES

# get data out of the Book
Size = np.array([1.42,1.58,1.78,1.99,1.99,1.99,2.13,2.13,2.13,
2.32,2.32,2.32,2.32,2.32,2.43,2.43,2.78,2.98,2.98])
Wear = np.array([4.0,4.2,2.5,2.6,2.8,2.4,3.2,2.4,2.6,4.8,2.9,
3.8,3.0,2.7,3.1,3.3,3.0,2.8,1.7])

# standardize data
x = Size - np.min(Size)
x = x/np.max(x)
# fix the spline knots
xk = np.arange(0.2,1,0.2)
xp = np.linspace(0,1,25)
# calculate the model matrix
X = modelMatrix(x, xk)
Xplot = modelMatrix(xp, xk)
# compute least squares fit for regression splines
regSplines_LS = lstsq(a=X, b=Wear)
# plot
fig = px.scatter(x=x, y=Wear)
fig.add_trace(go.Scatter(x=xp, y=np.dot(regSplines_LS[0],Xplot.T), name="Regression Spline Fit"))
fig.update_layout(xaxis_title="Normalized Size", yaxis_title="Wear")


`rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.



In [150]:
# P-SPLINES

xk_p = np.arange(1,10) / 10
#lam = [10, 0.01, 0.00000001]
#mod = list()
#[mod.append(modelMatrixPenalized(Wear, x, xk_p, lam=l)) for l in lam]
Wear_aug, Xa = modelMatrixPenalized(Wear, x, xk_p, lam=0.1)
#pSplines_LS_fit = list()
#[pSplines_LS_fit.append(lstsq(a=m[1], b=m[0])) for m in mod]
pSplines_LS = lstsq(a=Xa, b=Wear_aug)
xp = np.linspace(0.,1.,100)
Xp = modelMatrix(xp, xk)
Xp_aug = modelMatrix(xp, xk_p)

# plot data, regression spline and P-Splines
fig = px.scatter(x=x, y=Wear)
fig.update_layout(
    xaxis_title="Scaled motor size",
    yaxis_title="Wear index")
#fig.add_scatter(x=xp, y=np.dot(Xp, regSplines_LS[0]), mode="lines", name="BSpline fit")
fig.add_scatter(x=xp, y=np.dot(Xp_aug, pSplines_LS[0]), mode="lines", name="PSpline fit, lam={}".format(0.1))
#for i,fit in enumerate(pSplines_LS_fit):
#    fig.add_scatter(x=xp, y=np.dot(Xp_aug,fit[0]), mode="lines", name="PSpline fit, lam={}".format(lam[i]))

fig.show()


`rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.

