\section{Linear regression}
Questions 

\begin{enumerate}
    \item Is there a relationship between $x$ and $y$?
    \item How strong is the relationship?
    \item What is the effect of each variable?
    \item How accurately can we estimate the effect?
    \item How well can we predict future events?
    \item Is the relationship linear?
    \item Is there synergy among predictors?
\end{enumerate}

\subsection{Simple linear regression}
\[
Y \approx \beta_0 + \beta_1 X
\]

\subsubsection{Estimating the coefficients}
\textbf{Least squares method}

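Residual: $e_i = y_i - \hat{y}_i$, the difference between the $i$th observed response and the value predicted by the model.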

Residual sum of squares
\[
\textrm{RSS} = e_1^2+e_2^2+ \dots + e_n^2
\]
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n(x_i - \overline{x})^2}, \qquad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}
\]
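
As a minimal sketch, the least squares estimates can be computed by hand and compared with \texttt{lm()}. The simulated data, the seed, and the variable names are assumptions made for illustration only.

\begin{minted}{R}
set.seed(1)
n = 100
x = rnorm(n)
y = 2 + 3 * x + rnorm(n)             # simulated data; true beta0 = 2, beta1 = 3

# Slope: sum of cross-deviations over the sum of squared deviations of x
beta1.hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: the fitted line passes through (mean(x), mean(y))
beta0.hat = mean(y) - beta1.hat * mean(x)

e = y - (beta0.hat + beta1.hat * x)  # residuals
RSS = sum(e^2)                       # residual sum of squares

c(beta0.hat, beta1.hat)
coef(lm(y ~ x))                      # should agree with the manual estimates
\end{minted}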

\subsubsection{Assessing the accuracy of the coefficient estimates}

\[
SE(\hat{\beta}_0)^2 = \sigma^2 \Bigg[\frac{1}{n} + \frac{\overline{x}^2}{\sum_{i=1}^n (x_i - \overline{x})^2} \Bigg]
\]

\[
SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \overline{x})^2}
\]

where $\sigma^2 = \textrm{Var}(\epsilon)$. In general, $\sigma^2$ is not known, but $\sigma$ can be estimated from the data by the residual standard error
\[
\textrm{RSE} = \sqrt{\textrm{RSS}/(n-2)}
\]

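Standard errors can be used to compute confidence intervals. For example, an approximate 95\% confidence interval for $\beta_1$ is $\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)$.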

Hypothesis tests.
\begin{enumerate}
    \item $H_0: \beta_1 = 0$: There is no relationship between $X$ and $Y$
    \item $H_a: \beta_1 \neq 0$: There is some relationship between $X$ and $Y$
\end{enumerate}

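The test is based on the t-statistic $t = \hat{\beta}_1 / SE(\hat{\beta}_1)$, which follows a $t$-distribution with $n-2$ degrees of freedom under $H_0$. The p-value is the probability of observing a value at least as large as $|t|$ in absolute value if $H_0$ is true.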
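
The following sketch computes the standard error of the slope, the t-statistic, the p-value, and a 95\% confidence interval by hand, and compares them with the output of \texttt{summary()} and \texttt{confint()}. The simulated data and the variable names are illustrative assumptions.

\begin{minted}{R}
set.seed(1)
n = 100
x = rnorm(n)
y = 2 + 3 * x + rnorm(n)              # simulated data (illustrative)
fit = lm(y ~ x)

RSS = sum(residuals(fit)^2)
RSE = sqrt(RSS / (n - 2))             # estimate of sigma
se.b1 = RSE / sqrt(sum((x - mean(x))^2))

b1 = unname(coef(fit)["x"])
t.stat = b1 / se.b1                   # tests H0: beta1 = 0
p.value = 2 * pt(-abs(t.stat), df = n - 2)
ci = b1 + c(-1, 1) * qt(0.975, df = n - 2) * se.b1   # 95% confidence interval

summary(fit)$coefficients             # compare Std. Error, t value, Pr(>|t|)
confint(fit)["x", ]                   # compare with ci
\end{minted}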

\subsubsection{Assessing the accuracy of the model}
Quantify the extent to which the model fits the data

\textbf{Residual standard error}: 
\[
\textrm{RSE} = \sqrt{\frac{1}{n-2}\textrm{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
\]

Measures the lack of fit of the model to the data

$R^2$ statistic

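RSE is measured in the units of $Y$, so it is not always clear what constitutes a good RSE. The $R^2$ statistic is a proportion, so it always lies between 0 and 1 and does not depend on the scale of $Y$.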

\[
R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS}
\]

where total sum of squares $TSS = \sum (y_i - \overline{y})^2$. $R^2$ measures the proportion of variability in Y that can be explained using X. 
In the simple linear regression setting, $R^2 = r^2$, i.e. the squared correlation and $R^2$ are identical. However, this is not the case in multiple linear regression. 
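
A short sketch, on simulated data chosen only for illustration, that computes RSE and $R^2$ from their definitions and checks that $R^2$ equals the squared correlation in the one-predictor case:

\begin{minted}{R}
set.seed(2)
x = runif(50)
y = 1 + 2 * x + rnorm(50, sd = 0.5)   # simulated data (illustrative)
fit = lm(y ~ x)

RSS = sum((y - fitted(fit))^2)
TSS = sum((y - mean(y))^2)
RSE = sqrt(RSS / (length(y) - 2))
R2 = 1 - RSS / TSS

c(RSE = RSE, R2 = R2, r2 = cor(x, y)^2)   # R2 and r2 coincide here
\end{minted}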

\subsubsection{Code}
\begin{minted}{R}
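library(MASS)                          # provides the Boston data set
lm.fit = lm(medv ~ lstat, data = Boston)
summary(lm.fit)                        # coefficients, standard errors, RSE, R^2
confint(lm.fit)                        # 95% confidence intervals for the coefficients
# Prediction intervals for medv at three values of lstat
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")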
\end{minted}

\subsection{Multiple linear regression}
Instead of fitting a separate simple linear regression model for each predictor, a better approach is to extend the simple linear regression model so that it can directly accommodate multiple predictors.

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon
\]

\subsubsection{Estimating the regression coefficients}
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_p x_p
\]

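The coefficients are again chosen to minimize the residual sum of squares, $\textrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ now depends on all $p$ predictors.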

\subsubsection{Some important questions}
\begin{enumerate}
    \item Is at least one of the predictors $X_1, X_2, \dots, X_p$ useful in predicting the response?
    \item Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
    \item How well does the model fit the data?
    \item Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
\end{enumerate}

$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$ versus the alternative

$H_{a}$: at least one $\beta_j$ is non-zero. 

F-statistic

\[
F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}
\]

Sometimes we want to test that a particular subset of $q$ of the coefficients are zero. 

\[
H_0: \beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_p = 0
\]

For convenience, assume the variables chosen for omission are placed at the end of the list.

Fit a second model that uses all the variables except those last $q$, and let $RSS_0$ denote the residual sum of squares of that model. Then

\[
F = \frac{(RSS_0 - RSS)/q}{RSS/(n - p - 1)}
\]
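
As a sketch of this partial F-test, the statistic can be computed from the two residual sums of squares and checked against \texttt{anova()}. The simulated data, in which the last two predictors are pure noise, are an assumption made for illustration.

\begin{minted}{R}
set.seed(3)
n = 200
X = matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, paste0("x", 1:4)))
dat = data.frame(X, y = 1 + 2 * X[, 1] - X[, 2] + rnorm(n))  # x3, x4 are noise

full = lm(y ~ x1 + x2 + x3 + x4, data = dat)   # p = 4
reduced = lm(y ~ x1 + x2, data = dat)          # drops the last q = 2 predictors

p = 4; q = 2
RSS = sum(residuals(full)^2)
RSS0 = sum(residuals(reduced)^2)
F.stat = ((RSS0 - RSS) / q) / (RSS / (n - p - 1))
p.value = pf(F.stat, q, n - p - 1, lower.tail = FALSE)

anova(reduced, full)                           # reports the same F and p-value
\end{minted}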

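We use the F-statistic rather than the individual p-values because, when the number of predictors $p$ is large, some individual p-values will be small purely by chance even if no predictor is truly associated with the response. The F-statistic adjusts for the number of predictors.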

Deciding important variables

Forward selection: start with a null model - a model with no predictors. Add the variable that results in the lowest RSS. Then we add the variable that results in the lowest RSS for the two-variable model. 

Backward selection: we start with all variables in the model, and remove the variable with the largest p-value

Mixed selection: combination of forward and backward selection

Model fit

$R^2$ will always increase when more variables are added to the model, even if those variables are only weakly associated with the response

Predictions

\subsubsection{Code}
\begin{minted}{R}
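library(MASS)                           # Boston data set
library(car)                            # vif()
lm.fit = lm(medv ~ lstat + age, data = Boston)
lm.fit = lm(medv ~ ., data = Boston)    # all predictors; "." needs the data argument
summary(lm.fit)
vif(lm.fit)                             # variance inflation factors
summary(lm(medv ~ lstat * age, data = Boston))       # main effects plus interaction
lm.fit5 = lm(medv ~ poly(lstat, 5), data = Boston)   # fifth-degree polynomial in lstat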
\end{minted}

\subsection{Other considerations in the regression model}
\subsubsection{Qualitative predictors}
Predictors with only two levels

Create an indicator or dummy variable that takes two possible numerical values
\[
x_i = 
\begin{cases}
1 & \textrm{if the ith person is female}\\
0 & \textrm{if the ith person is male}\\
\end{cases}
\]
this results in the model
\[
y_i = \beta_0 + \beta_1x_i + \epsilon_i = 
\begin{cases}
\beta_0 + \beta_1 + \epsilon_i & \textrm{if the ith person is female} \\
\beta_0 + \epsilon_i & \textrm{if the ith person is male}
\end{cases}
\]
Interpretation: $\beta_0$ is the average balance among males, $\beta_0 + \beta_1$ is the average balance among females, and $\beta_1$ is the average difference in balance between females and males.

Qualitative predictors with more than two levels

Create additional dummy variables!

\[
x_{i1} = 
\begin{cases}
1 & \textrm{if the ith person is Asian} \\
0 & \textrm{if the ith person is not Asian}
\end{cases}
\]

\[
x_{i2} = 
\begin{cases}
1 & \textrm{if the ith person is Caucasian} \\
0 & \textrm{if the ith person is not Caucasian} 
\end{cases}
\]

The regression equation is then

\[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = 
\begin{cases}
\beta_0 + \beta_1 + \epsilon_i & \textrm{if the ith person is Asian} \\
\beta_0 + \beta_2 + \epsilon_i & \textrm{if the ith person is Caucasian} \\
\beta_0 + \epsilon_i & \textrm{if the ith person is African American}
\end{cases}
\]

\begin{minted}{R}
library(ISLR)                     # provides the Carseats data set
# R generates dummy variables for factors automatically;
# contrasts() returns the coding that R uses for them
contrasts(Carseats$ShelveLoc)
\end{minted}

\subsubsection{Extensions of the linear model}
Assumptions: additive and linear

Removing the additive assumption

Interaction term

\[
Y = \beta_0 + \beta_1X_1 + \beta_2 X_2 + \epsilon
\]

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon
\]

The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant

Non-linear relationships

Polynomial regression

\subsubsection{Potential problems}
\begin{enumerate}
    \item Non-linearity of the response-predictor relationships
    \item Correlation of error terms
    \item Non-constant variance of error terms
    \item Outliers
    \item High-leverage points
    \item Collinearity 
\end{enumerate}

Non-linearity of the data: 

Residual plots

Correlation of error terms: 

Time series data

Non-constant variance of error terms:

Heteroscedasticity, variable transformation

Outliers: 

One solution is to simply remove the observation

High-leverage points:

High leverage observations tend to have a sizeable impact on the estimated regression line. 

Leverage statistic

\[
h_i = \frac{1}{n} + \frac{(x_i - \overline{x})^2}{\sum_{i'=1}^n (x_{i'} - \overline{x})^2}
\]
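
A minimal sketch of the leverage statistic for simple linear regression, checked against the built-in \texttt{hatvalues()}. The simulated data, with one deliberately extreme $x$ value, are an illustrative assumption.

\begin{minted}{R}
set.seed(4)
x = c(rnorm(30), 6)                   # the last point is far from the rest
y = 1 + x + rnorm(31)                 # simulated data (illustrative)
fit = lm(y ~ x)

h = 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
which.max(h)                          # index of the highest-leverage point
all.equal(h, unname(hatvalues(fit)))  # hatvalues() gives the same statistic
\end{minted}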

Collinearity:
It can be difficult to separate out the individual effects of collinear variables on the response

Compute the variance inflation factor (VIF)
\[
VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j | X_{-j}}}
\]
where $R^2_{X_j | X_{-j}}$ comes from a regression of $X_j$ onto all other predictors.
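
The VIF can be sketched directly from its definition by regressing each predictor on all the others. The simulated predictors below, two of which are strongly correlated, are an assumption for illustration; in practice \texttt{vif()} from the \texttt{car} package is applied to a fitted model.

\begin{minted}{R}
set.seed(5)
n = 200
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.3)          # strongly correlated with x1
x3 = rnorm(n)                         # unrelated to the others
dat = data.frame(x1, x2, x3)

# VIF for one predictor: regress it on all the others and use that R^2
vif.manual = sapply(names(dat), function(v) {
  f = reformulate(setdiff(names(dat), v), response = v)
  1 / (1 - summary(lm(f, data = dat))$r.squared)
})
vif.manual                            # x1 and x2 should be well above x3
\end{minted}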

Drop the predictor!

\subsection{Comparison of linear regression with K-nearest neighbors}
Linear regression is an example of a parametric approach because it assumes a linear functional form for $f(X)$. 

Non-parametric alternative: K-nearest neighbors regression (KNN regression)

Given a value for $K$ and a prediction point $x_0$, KNN regression first identifies the $K$ training observations that are closest to $x_0$, represented by $N_0$. It then estimates $f(x_0)$ using the average of all the training responses in $N_0$.
\[
\hat{f}(x_0) = \frac{1}{K}\sum_{x_i \in N_0}y_i
\]
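
A minimal base-R sketch of KNN regression with a single predictor follows. The helper \texttt{knn.predict} is hypothetical (it is not part of any package) and the data are simulated for illustration.

\begin{minted}{R}
# KNN regression for one predictor; knn.predict is a hypothetical helper
knn.predict = function(x.train, y.train, x0, K) {
  sapply(x0, function(pt) {
    nearest = order(abs(x.train - pt))[1:K]   # indices of the K closest points
    mean(y.train[nearest])                    # average their responses
  })
}

set.seed(6)
x = runif(100, 0, 10)
y = sin(x) + rnorm(100, sd = 0.2)             # simulated non-linear data
knn.predict(x, y, x0 = c(2, 5, 8), K = 5)
\end{minted}

Small $K$ gives a flexible, high-variance fit, while large $K$ gives a smoother fit with more bias.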

The optimal value for $K$ will depend on the bias-variance tradeoff.

The parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of f.

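Curse of dimensionality: when $p$ is large, a given observation may have no nearby neighbors, so the $K$ nearest observations can be far from $x_0$ and the KNN fit deteriorates.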

\subsubsection{Code}
\begin{minted}{R}
library(FNN)
# train, test: predictor matrices; y: training responses (placeholder names)
fit = knn.reg(train, test, y, k = 3, algorithm = "kd_tree")
fit$pred   # predictions for the rows of test
\end{minted}

\section{Linear model selection and regularization}
In the regression setting, the standard linear model
\[
Y = \beta_0 + \beta_1X_1 + \dots + \beta_p X_p + \epsilon
\]
is commonly used to describe the relationship between a response $Y$ and a set of variables $X_1, X_2, \dots, X_p$. Here we discuss some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.

Alternative fitting procedures can yield better prediction accuracy and model interpretability. 

Prediction accuracy: If $n$ is not much larger than $p$, then there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training. By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias. 

Model interpretability: we see some approaches for automatically performing feature selection or variable selection. 

There are many alternatives to using least squares to fit. Here we discuss three important classes of methods:
\begin{enumerate}
    \item Subset selection: identifying a subset of the $p$ predictors that we believe to be related to the response
    \item Shrinkage: the estimated coefficients are shrunken towards zero relative to the least squares estimates
    \item Dimension reduction: this approach involves projecting the $p$ predictors into an $M$-dimensional subspace, where $M < p$. This is achieved by computing $M$ different linear combinations of the variables. 
\end{enumerate}

\subsection{Subset selection}
\subsubsection{Best subset selection}
The method for best subset selection:
\begin{enumerate}
    \item Let $M_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation
    \item For $k = 1, 2, \dots, p$:
    \begin{enumerate}
        \item Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
        \item Pick the best among these $\binom{p}{k}$ models, and call it $M_k$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$.
    \end{enumerate}
    \item Select a single best model from among $M_0, \dots, M_p$ using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.
\end{enumerate}
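
The first two steps of the procedure can be sketched in base R with \texttt{combn()}. The simulated data, the helper \texttt{rss}, and the choice $p = 4$ are assumptions for illustration; the final step (choosing among $M_0, \dots, M_p$) is not shown.

\begin{minted}{R}
set.seed(7)
n = 100; p = 4
X = matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
dat = data.frame(X, y = 1 + 2 * X[, 1] - 1.5 * X[, 3] + rnorm(n))
preds = colnames(X)

# Hypothetical helper: training RSS of the model using the given predictors
rss = function(vars) sum(residuals(lm(reformulate(vars, "y"), data = dat))^2)

# For each size k, fit all choose(p, k) models and keep the one with smallest RSS
best = lapply(1:p, function(k) {
  cands = combn(preds, k, simplify = FALSE)
  cands[[which.min(sapply(cands, rss))]]
})
best   # best[[k]] holds the predictors of M_k
\end{minted}

In total $2^p$ models are fit, which is why the approach does not scale to large $p$.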

Best subset selection becomes computationally infeasible for values of $p$ greater than around 40, even with extremely fast modern computers. 

\subsubsection{Stepwise selection}
\textbf{Forward stepwise selection}
Forward stepwise selection is a computationally efficient alternative to best subset selection. 

\begin{enumerate}
    \item Let $M_0$ denote the null model, which contains no predictors.
    \item For $k=0,\dots, p-1$:
    \begin{enumerate}
        \item Consider all $p-k$ models that augment the predictors in $M_k$ with one additional predictor.
        \item Choose the best among these $p-k$ models, and call it $M_{k+1}$. Here best is defined as having the smallest RSS or highest $R^2$.
    \end{enumerate}
    \item Select a single best model from among $M_0, \dots, M_p$ using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.
\end{enumerate}
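
A minimal sketch of the forward stepwise loop in base R. The simulated data and the helper \texttt{rss} are illustrative assumptions, and the final model-choice step is again left out.

\begin{minted}{R}
set.seed(7)
n = 100; p = 5
X = matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
dat = data.frame(X, y = 1 + 2 * X[, 1] - 1.5 * X[, 3] + rnorm(n))

# Hypothetical helper: training RSS of the model using the given predictors
rss = function(vars) sum(residuals(lm(reformulate(vars, "y"), data = dat))^2)

selected = character(0)     # predictors in the current model M_k
remaining = colnames(X)
path = list()
for (k in 0:(p - 1)) {
  # Try each of the p - k remaining predictors and keep the one with lowest RSS
  scores = sapply(remaining, function(v) rss(c(selected, v)))
  best = names(which.min(scores))
  selected = c(selected, best)
  remaining = setdiff(remaining, best)
  path[[k + 1]] = selected  # predictors in M_{k+1}
}
path
\end{minted}

Only $1 + p(p+1)/2$ models are fit, compared with $2^p$ for best subset selection.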

Forward stepwise selection can be applied even in the high-dimensional setting where $n < p$; however, in this case, it is possible to construct submodels $M_0, \dots, M_{n-1}$ only, since each submodel is fit using least squares, which will not yield a unique solution if $p \geq n$.

\textbf{Backward stepwise selection}
\begin{enumerate}
    \item Let $M_p$ denote the full model, which contains all $p$ predictors.
    \item For $k=p, p-1, \dots, 1$:
    \begin{enumerate}
        \item Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k-1$ predictors.
        \item Choose the best among these $k$ models, and call it $M_{k-1}$. Here best is defined as having the smallest RSS or highest $R^2$.
    \end{enumerate}
    \item Select a single best model from among $M_0, \dots, M_p$ using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.
\end{enumerate}

Hybrid approaches: forward and backward selection

\subsubsection{Choosing the optimal model}
We need a way to determine which of the models is best. The model containing all of the predictors will always have the smallest RSS and the largest $R^2$, since these quantities are related to the training error. In order to select the best model with respect to test error, we need to estimate this test error.

We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.

We can directly estimate the test error, using either a validation set approach or a cross-validation approach.

\textbf{$C_p$, AIC, BIC, and adjusted $R^2$}
For a fitted least squares model containing $d$ predictors, the $C_p$ estimate of test MSE is computed using the equation
\[
C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)
\]
where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement. 

The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the model with Gaussian errors, maximum likelihood and least squares are the same thing. In this case AIC is given by
\[
AIC = \frac{1}{n \hat{\sigma}^2}(RSS + 2 d \hat{\sigma}^2)
\]

BIC is derived from a Bayesian point of view, but ends up looking similar to $C_p$ and AIC. 
\[
BIC = \frac{1}{n \hat{\sigma}^2}(RSS + \log(n)d\hat{\sigma}^2)
\]

The adjusted $R^2$ statistic is another popular approach for selecting among a set of models that contain different numbers of variables. The adjusted $R^2$ statistic is calculated as
\[
\textrm{Adjusted } R^2 = 1 - \frac{RSS/(n - d -1)}{TSS/(n-1)}
\]
A large value of adjusted $R^2$ indicates a model with a small test error. 
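
The four criteria can be sketched from the formulas in this section. The simulated data and the helper \texttt{criteria} are assumptions for illustration, and $\hat{\sigma}^2$ is estimated here from the full model, which is one common choice. Note that these values are on a different scale from R's built-in \texttt{AIC()} and \texttt{BIC()}, which are based on the log-likelihood.

\begin{minted}{R}
set.seed(8)
n = 100
dat = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y = 1 + 2 * dat$x1 + rnorm(n)           # only x1 matters

full = lm(y ~ x1 + x2 + x3, data = dat)
sigma2.hat = summary(full)$sigma^2          # error variance from the full model

criteria = function(fit) {                  # hypothetical helper
  d = length(coef(fit)) - 1                 # number of predictors
  RSS = sum(residuals(fit)^2)
  TSS = sum((dat$y - mean(dat$y))^2)
  c(Cp = (RSS + 2 * d * sigma2.hat) / n,
    AIC = (RSS + 2 * d * sigma2.hat) / (n * sigma2.hat),
    BIC = (RSS + log(n) * d * sigma2.hat) / (n * sigma2.hat),
    adjR2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1)))
}

rbind(x1.only = criteria(lm(y ~ x1, data = dat)), full = criteria(full))
\end{minted}

Lower is better for $C_p$, AIC, and BIC, while higher is better for adjusted $R^2$.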

\textbf{Validation and cross-validation}
In the past, performing cross-validation was computationally prohibitive for many problems with large $p$ and/or large $n$, and so AIC, BIC, $C_p$, and adjusted $R^2$ were more attractive approaches for choosing among a set of models. 

\subsection{Shrinkage methods}
As an alternative, we can fit a model containing all $p$ predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. 

Two best-known techniques: ridge regression and the lasso

\subsubsection{Ridge regression}
The least squares fitting procedure estimates $\beta_0, \beta_1, \dots, \beta_p$ using the values that minimize
\[
RSS = \sum_{i=1}^n \bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \bigg)^2
\]
Ridge regression minimizes a slightly different quantity
\[
\sum_{i=1}^n \bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \bigg)^2 + \lambda \sum_{j=1}^p \beta_j^2 = RSS + \lambda \sum_{j=1}^p \beta_j^2
\]
where $\lambda \geq 0$ is a tuning parameter. Selecting a good value for $\lambda$ is critical.

Before using ridge regression, the predictors should be standardized:
\[
\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij} - \overline{x}_j)^2}}
\]

As $\lambda$ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. 
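
A minimal sketch of ridge regression using the closed-form solution $(X^T X + \lambda I)^{-1} X^T y$ on centered, standardized predictors and a centered response (so the intercept is not penalized). The simulated data and the helper \texttt{ridge} are illustrative assumptions. The $\lambda$ scale is that of the criterion in this section; packages such as \texttt{glmnet} scale the penalty differently.

\begin{minted}{R}
set.seed(9)
n = 50; p = 5
X = matrix(rnorm(n * p), n, p)
y = drop(X %*% c(3, -2, 0, 0, 1) + rnorm(n))

# Standardize: center each column, then divide by its (1/n) standard deviation
Xc = sweep(X, 2, colMeans(X))
Xs = sweep(Xc, 2, sqrt(colMeans(Xc^2)), "/")
yc = y - mean(y)                      # centering y removes the intercept

# Hypothetical helper: ridge coefficients for a given lambda
ridge = function(lambda) {
  drop(solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc)))
}

cbind(ols = ridge(0), ridge10 = ridge(10), ridge1000 = ridge(1000))
\end{minted}

With $\lambda = 0$ the result is the least squares fit, and the coefficients shrink towards zero as $\lambda$ grows.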

\subsubsection{The lasso}
Ridge regression does have one obvious disadvantage: it will include all $p$ predictors in the final model. The lasso coefficients $\hat{\beta}_{\lambda}^L$ minimize the quantity
\[
\sum_{i=1}^n \bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \bigg)^2 + \lambda \sum_{j=1}^p |\beta_j| = RSS + \lambda \sum_{j=1}^p |\beta_j|
\]
The lasso forces some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large. 

Comparing the lasso and ridge: the lasso is expected to perform better in a setting where a relatively small number of predictors have substantial coefficients and the remaining coefficients are very small or zero. Ridge regression is expected to perform better when the response is a function of many predictors, all with coefficients of roughly equal size. 
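
To see why the two penalties behave differently, it can help to look at a simplified special case, assumed here only for illustration: $n = p$, the design matrix is the identity, and there is no intercept. Least squares then gives $\hat{\beta}_j = y_j$, and both penalized problems separate across coordinates:
\[
\hat{\beta}_j^R = \frac{y_j}{1 + \lambda}, \qquad
\hat{\beta}_j^L =
\begin{cases}
y_j - \lambda/2 & \textrm{if } y_j > \lambda/2 \\
y_j + \lambda/2 & \textrm{if } y_j < -\lambda/2 \\
0 & \textrm{if } |y_j| \leq \lambda/2
\end{cases}
\]
Ridge regression shrinks every coefficient by the same proportion, so none becomes exactly zero. The lasso moves each coefficient towards zero by the constant amount $\lambda/2$ and sets sufficiently small ones to exactly zero (soft-thresholding).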

\textbf{Bayesian interpretation for ridge regression and the lasso}
A Bayesian viewpoint for regression assumes that the coefficient vector $\beta$ has some prior distribution, say $p(\beta)$, where $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$. The likelihood of the data can be written as $f(Y|X, \beta)$, where $X = (X_1, \dots, X_p)$. Then the posterior distribution is
\[
p(\beta|X, Y) \propto f(Y|X, \beta)p(\beta|X) = f(Y|X, \beta)p(\beta)
\]

Assume that $p(\beta) = \prod_{j=1}^p g(\beta_j)$ for some density function $g$. It turns out that ridge regression and the lasso follow naturally from two special cases of $g$:
\begin{enumerate}
    \item If $g$ is a Gaussian distribution with mean zero and standard deviation a function of $\lambda$, then the posterior mode of $\beta$ is given by the ridge regression solution.
    \item If $g$ is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of $\lambda$, then it follows that the posterior mode for $\beta$ is the lasso solution. 
\end{enumerate}

\subsubsection{Selecting the tuning parameter}
We select the tuning parameter value for which the cross-validation error is smallest. 

\subsection{Dimension reduction methods}
All of the previous methods are defined using the original predictors, $X_1, X_2, \dots, X_p$. We explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.  

Let $Z_1, Z_2, \dots, Z_M$ represent $M < p$ linear combinations of our original $p$ predictors.
\[
Z_m = \sum_{j=1}^p \phi_{jm}X_j
\]
for some constants $\phi_{1m}, \phi_{2m}, \dots, \phi_{pm}$, $m = 1, \dots, M$. We can then fit the linear regression model
\[
y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i, i = 1, \dots, n
\]
using least squares. This can lead to better results than fitting the original equation.

Dimension reduction serves to constrain the estimated $\beta_j$ coefficients. Here we will consider two approaches for this task: principal components and partial least squares. 

\subsubsection{Principal components regression}
The first principal component direction of the data is that along which the observations vary the most. 

PCA: the first principal component vector defines the line that is as close as possible to the data. The first principal component has been chosen so that the projected observations are as close as possible to the original observations.

The second principal component $Z_2$ is a linear combination of the variables that is uncorrelated with $Z_1$, and has largest variance subject to this constraint. 

\textbf{The principal components regression approach}
The key idea is that often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response. 
By estimating only $M \ll p$ coefficients we can mitigate overfitting. 

PCR (principal component regression) will tend to do well in cases when the first few principal components are sufficient to capture most of the variation in the predictors as well as the relationship with the response. 

PCR is not a feature selection method, because each principal component is a linear combination of all $p$ original predictors.

In PCR, the number of principal components $M$ is typically chosen by cross-validation. 

When performing PCR, we generally recommend standardizing each predictor. 
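
A minimal sketch of PCR using \texttt{prcomp()} followed by \texttt{lm()}. The simulated data (six predictors driven by one latent factor) and the fixed choice $M = 2$ are assumptions for illustration; in practice $M$ would be chosen by cross-validation.

\begin{minted}{R}
set.seed(10)
n = 100; p = 6
Z = rnorm(n)                                   # one latent factor drives the predictors
X = sapply(1:p, function(j) Z + rnorm(n, sd = 0.5))
y = 2 * Z + rnorm(n)

pca = prcomp(X, center = TRUE, scale. = TRUE)  # standardize, then compute components
M = 2
scores = pca$x[, 1:M]                          # z_{i1}, ..., z_{iM}
pcr.fit = lm(y ~ scores)                       # least squares on M < p components

summary(pca)$importance["Cumulative Proportion", 1:M]
summary(pcr.fit)$r.squared
\end{minted}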

\subsubsection{Partial least squares}
PCR suffers from a drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response. 

Partial least squares (PLS) is a supervised alternative to PCR.

After standardizing the $p$ predictors, PLS computes the first direction $Z_1$ by setting each $\phi_{j1}$ equal to the coefficient from the simple linear regression of $Y$ onto $X_j$. To identify the second PLS direction we first adjust each of the variables for $Z_1$, by regressing each variable on $Z_1$ and taking residuals. We then compute $Z_2$ using this orthogonalized data in exactly the same fashion as $Z_1$ was computed based on the original data. 
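
The construction of the first two PLS directions can be sketched by following the description above literally. The simulated data and variable names are illustrative assumptions, and dedicated packages may differ in scaling details.

\begin{minted}{R}
set.seed(11)
n = 100; p = 4
X = matrix(rnorm(n * p), n, p)
y = drop(X %*% c(2, 0, -1, 0) + rnorm(n))

Xs = scale(X)                                  # standardize the predictors
# phi_{j1}: slope from the simple linear regression of y onto each X_j
phi1 = apply(Xs, 2, function(xj) unname(coef(lm(y ~ xj))[2]))
Z1 = drop(Xs %*% phi1)                         # first PLS direction

# Adjust each predictor for Z1 by taking residuals, then repeat the recipe
X2 = apply(Xs, 2, function(xj) residuals(lm(xj ~ Z1)))
phi2 = apply(X2, 2, function(xj) unname(coef(lm(y ~ xj))[2]))
Z2 = drop(X2 %*% phi2)                         # second PLS direction

coef(lm(y ~ Z1 + Z2))                          # least squares fit on the directions
\end{minted}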

\subsection{Considerations in high dimensions}
\subsubsection{High-dimensional data}
Most traditional statistical techniques for regression and classification are intended for the low-dimensional setting in which $n$, the number of observations, is much greater than $p$, the number of features. 

Data sets containing more features than observations are often referred to as high-dimensional.

\subsubsection{What goes wrong in high dimensions?}
When there is a single predictor and only two observations ($p = 1$, $n = 2$), the least squares regression line will fit the data exactly, regardless of the values of those observations. This is problematic because this perfect fit will almost certainly lead to overfitting of the data. When $p > n$ or $p \approx n$, a simple least squares regression line is too flexible!
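
A small simulated illustration (an assumption made for demonstration, with arbitrary sizes): when the number of coefficients equals the number of observations, least squares reproduces the training data exactly even though the predictors are unrelated to the response.

\begin{minted}{R}
set.seed(17)
n = 10
y = rnorm(n)                                   # response is pure noise
X = matrix(rnorm(n * (n - 1)), n, n - 1)       # p = n - 1 unrelated predictors
fit = lm(y ~ X)                                # n coefficients for n observations
max(abs(residuals(fit)))                       # essentially zero: a perfect training fit
\end{minted}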

\subsubsection{Regression in high dimensions}
Forward stepwise selection, ridge regression, the lasso, and principal components regression are particularly useful for performing regression in the high-dimensional setting. 

\subsubsection{Interpreting results in high dimensions}
In the high-dimensional setting, the multicollinearity problem is extreme: any variable in the model can be written as a linear combination of all of the other variables in the model. 

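Consequently, we must be careful not to overstate the results: a selected model is simply one of many possible models for predicting the response (blood pressure, in this example), and it must be further validated on independent data sets.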

\section{Moving beyond linearity}
Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference. 
\begin{enumerate}
    \item Polynomial regression: extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. 
    \item Step functions: cut the range of a variable into $K$ distinct regions in order to produce a qualitative variable. 
    \item Regression splines: more flexible than polynomials and step functions. They involve dividing the range of $X$ into $K$ distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. 
    \item Smoothing splines: similar to regression splines, but they result from minimizing a residual sum of squares criterion subject to a smoothness penalty.
    \item Local regression: the regions are allowed to overlap, and indeed they do so in a very smooth way.
    \item Generalized additive models: extend the methods above to deal with multiple predictors
\end{enumerate}

\subsection{Polynomial regression}
The polynomial function
\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \epsilon_i
\]
where $\epsilon_i$ is the error term.  

\subsection{Step functions}
Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of $X$. We can instead use step functions in order to avoid imposing such a global structure. 

We create cutpoints $c_1, c_2, \dots, c_K$ in the range of $X$, and then construct $K + 1$ new variables
\[
\begin{aligned}
C_0(X) &= I(X < c_1), \\
C_1(X) &= I(c_1 \leq X < c_2), \\
&\;\;\vdots \\
C_K(X) &= I(c_K \leq X),
\end{aligned}
\]
where $I(\cdot)$ is an indicator function that returns a $1$ if the condition is true, and returns a $0$ otherwise. We then use least squares to fit a linear model using $C_1(X), C_2(X), \dots, C_K(X)$ as predictors:
\[
y_i = \beta_0 + \beta_1C_1(x_i) + \beta_2C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i
\]
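
A minimal sketch of a step-function fit using \texttt{cut()}. The simulated data and the two cutpoints are assumptions for illustration.

\begin{minted}{R}
set.seed(12)
x = runif(200, 0, 10)
y = ifelse(x < 3, 1, ifelse(x < 7, 4, 2)) + rnorm(200, sd = 0.5)

cuts = c(3, 7)                                  # cutpoints c_1, c_2 (so K = 2)
# right = FALSE gives intervals of the form [c_k, c_{k+1})
bins = cut(x, breaks = c(-Inf, cuts, Inf), right = FALSE)
fit = lm(y ~ bins)                              # R builds the K dummy variables

coef(fit)   # intercept: mean of y for x < c_1; others: shifts relative to that bin
\end{minted}

$C_0(X)$ is left out of the model because it is redundant with the intercept, so $\beta_0$ is the mean response when $X < c_1$.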

\subsection{Basis functions}
Polynomial and piecewise-constant regression models are in fact special cases of a basis function approach. The idea is to have at hand a family of functions or transformations that can be applied to a variable $X$: $b_1(X), b_2(X), \dots, b_K(X)$. Instead of fitting a linear model in $X$, we fit the model
\[
y_i = \beta_0 + \beta_1b_1(x_i) + \dots + \beta_Kb_K(x_i) + \epsilon_i
\]

\subsection{Regression splines}
\subsubsection{Piecewise polynomials}
Instead of fitting a high-degree polynomial over the entire range of $X$, piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of $X$. For example, a piecewise cubic polynomial works by fitting a cubic regression model of the form
\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i
\]
where the coefficients $\beta_0, \beta_1, \beta_2$ and $\beta_3$ differ in different parts of the range of $X$. A piecewise cubic polynomial with a single knot at a point $c$ takes the form
\[
y_i = \begin{cases}
\beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i & \textrm{if } x_i < c \\
\beta_{02} + \beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i & \textrm{if } x_i \geq c
\end{cases}
\]
Using more knots leads to a more flexible piecewise polynomial. 

\subsubsection{Constraints and splines}
We can fit a piecewise polynomial under the constraint that the fitted curve must be continuous. To make it smooth as well, we additionally require that the first and second derivatives of the piecewise polynomials are continuous at the knots. 

\subsubsection{The spline basis representation}
It turns out that we can use the basis model to represent a regression spline. A cubic spline with $K$ knots can be modeled as
\[
y_i = \beta_0 + \beta_1b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_{K + 3}b_{K + 3}(x_i) + \epsilon_i
\]
The most direct way to represent a cubic spline using the equation above is to start off with a basis for a cubic polynomial and then add one truncated power basis function per knot. A truncated power basis function is defined as
\[
h(x, \chi) = (x - \chi)_+^3 = 
\begin{cases}
(x - \chi)^3 & x > \chi \\
0 & \textrm{otherwise}
\end{cases}
\]
where $\chi$ is the knot. 
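
As a sketch, a cubic spline can be fit by building the truncated power basis by hand and compared with the B-spline basis from \texttt{bs()} in the \texttt{splines} package, which spans the same space of functions. The simulated data, the knot locations, and the helper \texttt{tp} are assumptions for illustration.

\begin{minted}{R}
library(splines)

set.seed(13)
x = sort(runif(200, 0, 10))
y = sin(x) + rnorm(200, sd = 0.3)
knots = c(2.5, 5, 7.5)                           # K = 3 knots (illustrative choice)

tp = function(x, knot) pmax(x - knot, 0)^3       # truncated power function
H = sapply(knots, function(k) tp(x, k))          # one column per knot
fit.tp = lm(y ~ x + I(x^2) + I(x^3) + H)         # K + 3 basis functions plus intercept

fit.bs = lm(y ~ bs(x, knots = knots))            # same spline space, B-spline basis
max(abs(fitted(fit.tp) - fitted(fit.bs)))        # should be numerically negligible
\end{minted}

The B-spline basis is usually preferred in practice because it is numerically better conditioned.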

\subsubsection{Choosing the number and locations of the knots}
When we fit a spline, where should we place the knots? One option is to try out different numbers of knots and see which produces the best looking curve. A somewhat more objective approach is to use cross-validation. 

\subsubsection{Comparison to polynomial regression}
Regression splines often give superior results to polynomial regression. 

\subsection{Smoothing splines}
\subsubsection{An overview of smoothing splines}
What we really want is a function $g$ that makes RSS small, but that is also smooth. How might we ensure that $g$ is smooth? A natural approach is to find the function $g$ that minimizes
\[
\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 dt
\]
where $\lambda$ is a nonnegative tuning parameter. 

\subsubsection{Choosing the smoothing parameter $\lambda$}
Cross-validation!
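
A minimal sketch with the base-R function \texttt{smooth.spline()}, which can either choose $\lambda$ by leave-one-out cross-validation or match a requested number of effective degrees of freedom. The simulated data are an illustrative assumption.

\begin{minted}{R}
set.seed(14)
x = sort(runif(200, 0, 10))
y = sin(x) + rnorm(200, sd = 0.3)

fit.cv = smooth.spline(x, y, cv = TRUE)    # lambda chosen by leave-one-out CV
fit.df = smooth.spline(x, y, df = 4)       # or fix the effective degrees of freedom

c(lambda = fit.cv$lambda, df = fit.cv$df)
predict(fit.cv, x = c(2, 5, 8))$y          # fitted g at new points
\end{minted}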

\subsection{Local regression}
Local regression is a different approach for fitting flexible non-linear functions which involves computing the fit at a target point $x_0$ using only the nearby training observations. 

\begin{enumerate}
    \item Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$.
    \item Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood, so that the point furthest from $x_0$ has weight zero, and the closest has the highest weight. All but these $k$ nearest neighbors get weight zero.
    \item Fit a weighted least squares regression of the $y_i$ on the $x_i$ using the aforementioned weights, by finding $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize
    \[
    \sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1 x_i)^2
    \]
    \item The fitted value at $x_0$ is given by $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$.
\end{enumerate}
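
The steps above can be sketched for a single target point as follows. The helper \texttt{local.fit} is hypothetical, the tricube weight function is one common choice assumed here (the notes do not fix $K$), and the data are simulated.

\begin{minted}{R}
# Local linear fit at a single target point x0; local.fit is a hypothetical helper
local.fit = function(x, y, x0, span = 0.3) {
  n = length(x)
  k = ceiling(span * n)                       # number of neighbors, s = k / n
  d = abs(x - x0)
  h = sort(d)[k]                              # distance to the k-th nearest neighbor
  w = ifelse(d < h, (1 - (d / h)^3)^3, 0)     # tricube weights; zero at and beyond h
  fit = lm(y ~ x, weights = w)                # weighted least squares
  unname(coef(fit)[1] + coef(fit)[2] * x0)    # fitted value at x0
}

set.seed(15)
x = runif(200, 0, 10)
y = sin(x) + rnorm(200, sd = 0.3)
sapply(c(2, 5, 8), function(x0) local.fit(x, y, x0))
\end{minted}

In practice the built-in \texttt{loess()} function, for example \texttt{loess(y \textasciitilde{} x, span = 0.3, degree = 1)}, performs this kind of fit over the whole range of $x$.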

In order to perform local regression, there are a number of choices to be made, such as how to define the weighting function $K$, and whether to fit a linear, constant, or quadratic regression. The most important choice is the span $s$. 

In a setting with multiple features $X_1, X_2, \dots, X_p$, one very useful generalization involves fitting a multiple linear regression model that is global in some variables, but local in another. 

\subsection{Generalized additive models}
Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.  

\subsubsection{GAMs for regression problems}
A natural way to extend the multiple linear regression model is to replace each linear component $\beta_j x_{ij}$ with a (smooth) non-linear function $f_j(x_{ij})$. We would then write the model as
\[
y_i = \beta_0 + \sum_{j=1}^p f_j (x_{ij}) + \epsilon_i
\]
\[
= \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \dots + f_p(x_{ip}) + \epsilon_i
\]
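
When each $f_j$ is represented by a natural spline, the GAM is just a large linear model and can be fit with \texttt{lm()}. The sketch below assumes simulated data and four degrees of freedom per term, both chosen only for illustration.

\begin{minted}{R}
library(splines)

set.seed(16)
n = 300
dat = data.frame(x1 = runif(n, 0, 10), x2 = runif(n, -2, 2))
dat$y = sin(dat$x1) + dat$x2^2 + rnorm(n, sd = 0.3)   # additive, non-linear in both

# f_1 and f_2 are each represented by a natural spline basis
gam.fit = lm(y ~ ns(x1, df = 4) + ns(x2, df = 4), data = dat)
lin.fit = lm(y ~ x1 + x2, data = dat)                 # standard linear model

c(gam = summary(gam.fit)$r.squared, linear = summary(lin.fit)$r.squared)
\end{minted}

Smoothing-spline terms cannot be fit this way and need a dedicated GAM package.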

\textbf{Pros and cons of GAMs}
\begin{enumerate}
    \item +: GAMs allow us to fit a non-linear $f_j$ to each $X_j$, so that we can automatically model non-linear relationships that standard linear regression will miss. We do not need to manually try out many different transformations on each variable individually.
    \item +: The non-linear fits can potentially make more accurate predictions for the response $Y$.
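    \item +: Because the model is additive, we can examine the effect of each $X_j$ on $Y$ individually while holding all of the other variables fixed, which makes GAMs a useful representation for inference.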
    \item +: The smoothness of the function $f_j$ for the variable $X_j$ can be summarized via degrees of freedom
    \item -: The model is restricted to be additive. 
\end{enumerate}

\subsubsection{GAMs for classification problems}
A natural way to extend the logistic regression to allow for non-linear relationships is to use the model
\[
\log \bigg(\frac{p(X)}{1 - p(X)} \bigg) = \beta_0 + f_1(X_1) + f_2(X_2) + \dots + f_p(X_p)
\]
