# $\S$ 2.6. Statistical Models, Supervised Learning and Function Approximation

### Review

Our goal is to find a useful approximation $\hat{f}(x)$ to the function $f(x)$ that underlies the predictive relationship between the inputs and outputs.

In the theoretical setting of $\S$ 2.4, we saw that

1. for a quantitative response, squared error loss led us to the regression function

  \begin{equation}
  f(x) = \text{E}(Y|X=x).
  \end{equation}

2. kNN methods can be viewed as direct estimates of this conditional expectation,
3. but kNN methods can fail in at least two ways:
  * if the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors,
  * if special structure is known to exist, kNN does not exploit it, although it could be used to reduce both the bias and the variance of the estimates.
  
> We anticipate using other classes of models for $f(x)$, in many cases specifically designed to overcome the dimensionality problems, and here we discuss a framework for incorporating them into the prediction problem.
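
The first failure mode can be illustrated numerically. The sketch below is not from the text: it assumes points drawn uniformly from the cube $[-1,1]^p$ and measures how far the nearest of them lies from the origin. The function name `median_nn_distance` and all sample sizes are illustrative choices.

```python
import numpy as np

def median_nn_distance(p, n=500, trials=20, seed=0):
    """Median distance from the origin to its nearest neighbor among
    n points drawn uniformly from the cube [-1, 1]^p (an assumed setup)."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(trials):
        X = rng.uniform(-1.0, 1.0, size=(n, p))
        # Euclidean distance of every point to the origin; keep the smallest.
        dists.append(np.linalg.norm(X, axis=1).min())
    return float(np.median(dists))

for p in (1, 2, 5, 10, 20):
    print(p, round(median_nn_distance(p), 3))
```

With the sample size held fixed, the printed distance should grow with $p$, so the "nearest" neighbor is no longer local in high dimensions.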

__Remark__: the more information you have, the better your model will work. But what counts as information?

* __data__
* __model__

Knowing both is ideal. Otherwise, collecting more data and
_guessing a model_ may still help.

## $\S$ 2.6.1. A Statistical Model for the Joint Distribution $\text{Pr}(X,Y)$

### The additive error model
Suppose in fact that our data arose from a statistical model

\begin{equation}
Y = f(X) + \epsilon,
\end{equation}

where the random error $\epsilon$ has $\text{E}(\epsilon)=0$ and is independent of $X$.

Note that for this model,

\begin{equation}
f(x)=\text{E}(Y|X=x),
\end{equation}

and in fact the conditional distribution $\text{Pr}(Y|X)$ depends on $X$ only through the conditional mean $f(x)$.
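
A small simulation can make the identity $f(x)=\text{E}(Y|X=x)$ concrete. This is only a sketch under assumed choices: the regression function `f`, the Gaussian noise with standard deviation 0.3, and the window width are all hypothetical and not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical regression function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

N = 200_000
X = rng.uniform(0.0, 1.0, size=N)
eps = rng.normal(0.0, 0.3, size=N)   # E(eps) = 0, drawn independently of X
Y = f(X) + eps

# Approximate E(Y | X = x0) by averaging Y over a narrow window around x0.
x0 = 0.25
window = np.abs(X - x0) < 0.01
cond_mean = Y[window].mean()
print(cond_mean, f(x0))
```

The local average of $Y$ should land close to $f(x_0)$, since the zero-mean error averages out.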

The additive error model is a useful approximation to the truth. For most systems 
the input-output pairs $(X,Y)$ will not have a deterministic relationship $Y=f(X)$. 
Generally there will be other unmeasured variables that also contribute to $Y$, including measurement error.

The additive model assumes that we can capture all these departures from a 
deterministic relationship via the error $\epsilon$:

* __do what you can with your data__
* __absorb whatever cannot be measured into the assumed error term__

#### When a deterministic relationship holds

For some problems a deterministic relationship does hold. Many of the 
classification problems studied in machine learning are of this form, 
where the response surface can be thought of as a colored map defined in $\mathbb{R}^p$.

The training data consist of colored examples from the map $\{x_i,g_i\}$, and 
the goal is to be able to color any point. Here the function is 
deterministic, and the randomness enters through the $x$ location of the training points.

For the moment we will not pursue such problems, but will see that they can 
be handled by techniques appropriate for the error-based models.

### The i.i.d. assumption

The assumption in the above additive error model that the errors are independent and identically distributed (i.i.d.) is not strictly necessary, but seems to be at the back of our mind when we average squared errors uniformly in our EPE criterion.

With such a model it becomes natural to use least squares as a data criterion for model estimation as in the linear model

\begin{equation}
\hat{Y} = \hat\beta_0 + \sum_{j=1}^p X_j \hat\beta_j.
\end{equation}

Simple modifications can be made to avoid the independence assumption; e.g., we can have

\begin{equation}
\text{Var}(Y|X=x)=\sigma(x),
\end{equation}

and now both the mean and variance depend on $X$.

In general $\text{Pr}(Y|X)$ can depend on $X$ in complicated ways, but the additive error model precludes these.

### For qualitative outputs

So far we have concentrated on the quantitative response.

Additive error models are typically not used for qualitative outputs $G$; in this case the target function $p(X)$ _is_ the conditional density $\text{Pr}(G|X)$, and this is modeled directly.

For example, for two-class data, it is often reasonable to assume that the data arise from independent binary trials, with the probability of one particular outcome being $p(X)$, and the other $1 - p(X)$. Thus if $Y$ is the 0–1 coded version of $G$, then
\begin{equation}
\text{E}(Y |X = x) = p(x),
\end{equation}

but the variance depends on $x$ as well: $\text{Var}(Y |X = x) = p(x)\left(1 - p(x)\right)$.
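
The variance formula follows in one line. Because $Y$ takes only the values 0 and 1, we have $Y^2 = Y$, and therefore

\begin{equation}
\text{Var}(Y|X=x) = \text{E}(Y^2|X=x) - \left[\text{E}(Y|X=x)\right]^2 = p(x) - p(x)^2 = p(x)\left(1-p(x)\right).
\end{equation}

The variance is largest at $p(x)=1/2$ and shrinks to zero as $p(x)$ approaches 0 or 1, so it cannot be held constant independently of the mean.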

## $\S$ 2.6.3. Function Approximation

Here

* the data pairs $\{x_i, y_i\}$ are viewed as points in a $(p+1)$-dimensional Euclidean space,
* the function $f(x)$ has domain equal to the $p$-dimensional input subspace (or $\mathbb{R}^p$ for convenience), and
* $f$ is related to the data via a model such as $y_i=f(x_i)+\epsilon_i$.

The goal is to obtain a useful approximation to $f(x)$ for all $x$ in some region of $\mathbb{R}^p$, given the representations in the training set $\mathcal{T}$.

> Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

### Associated parameters and basis expansions

Many of the approximations we will encounter have an associated set of parameters $\theta$ that can be modified to suit the data at hand. For example, the linear model

\begin{equation}
f(x) = x^T\beta
\end{equation}

has $\theta=\beta$.

Another class of useful approximators can be expressed as _linear basis expansions_
\begin{equation}
f_\theta(x) = \sum_{k=1}^K h_k(x)\theta_k,
\end{equation}
where the $h_k$ are a suitable set of functions or transformations of the input vector $x$. Traditional examples are polynomial and trigonometric expansions, where for example $h_k$ might be $x_1^2$, $x_1x_2^2$, $\text{cos}(x_1)$ and so on.

We also encounter nonlinear expansions, such as the sigmoid transformation common to neural network models,
\begin{equation}
h_k(x) = \frac{1}{1+\text{exp}(-x^T\beta_k)}.
\end{equation}
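
As a minimal sketch of the sigmoid transformation, the function below evaluates $h_k(x_i)$ for several $\beta_k$ at once. The name `sigmoid_basis`, the matrix layout (the $\beta_k$ stored as columns of `B`), and the numeric values are assumptions made for illustration.

```python
import numpy as np

def sigmoid_basis(X, B):
    """Return the N x K matrix with entries h_k(x_i) = 1 / (1 + exp(-x_i^T beta_k)).
    X is N x p; B is p x K with the beta_k as columns."""
    return 1.0 / (1.0 + np.exp(-(X @ B)))

X = np.array([[0.0, 0.0], [1.0, -1.0]])
B = np.array([[1.0, 0.0], [0.0, 2.0]])   # hypothetical beta_1, beta_2
H = sigmoid_basis(X, B)
print(H)
```

Each $h_k$ depends on its own parameter vector $\beta_k$, which is why the expansion is nonlinear in the parameters once the $\beta_k$ must be estimated as well.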

### Least squares again
We can use least squares to estimate the parameters $\theta$ in $f_\theta$ as we did for the linear model, by minimizing the residual sum-of-squares
\begin{equation}
\text{RSS}(\theta) = \sum_{i=1}^N\left(y_i-f_\theta(x_i)\right)^2
\end{equation}
as a function of $\theta$. This seems a reasonable criterion for an additive error model.

In terms of function approximation, we imagine our parametrized function as a surface in $(p+1)$-dimensional space, and what we observe are noisy realizations from it. This is easy to visualize when $p=2$ and the vertical coordinate is the output $y$, as in FIGURE 2.10. The noise is in the output coordinate, so we find the set of parameters such that the fitted surface gets as close to the observed points as possible, where close is measured by the sum of squared vertical errors in $\text{RSS}(\theta)$.

For the linear model we get a simple closed form solution to the minimization problem. This is also true for the basis function methods, if the basis functions themselves do not have any hidden parameters. Otherwise the solution requires either iterative methods or numerical optimization.
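
The closed-form case can be sketched as follows, assuming a fixed polynomial basis $h_1(x)=1$, $h_2(x)=x$, $h_3(x)=x^2$ and synthetic data. The generating coefficients and noise level are hypothetical; `np.linalg.lstsq` solves the resulting linear least squares problem.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from y = f(x) + eps with a hypothetical quadratic f.
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, size=N)

# Fixed basis functions h_k with no hidden parameters: 1, x, x^2.
H = np.column_stack([np.ones(N), x, x**2])

# Minimizing RSS(theta) = ||y - H theta||^2 is a linear least squares problem.
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
rss = float(np.sum((y - H @ theta_hat) ** 2))
print(theta_hat, rss)
```

Because $f_\theta$ is linear in $\theta$ once the $h_k$ are fixed, no iteration is needed. A basis with hidden parameters, such as the sigmoid with unknown $\beta_k$, would not reduce to a single linear solve.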

While least squares is generally very convenient, it is not the only criterion used and in some cases would not make much sense. A more general principle for estimation is _maximum likelihood estimation_.

### Maximum likelihood estimation

Suppose we have a random sample $y_i$, $i=1,\ldots,N$ from a density $\text{Pr}_\theta(y)$ indexed by some parameters $\theta$. The log-probability of the observed sample is

\begin{equation}
L(\theta) = \sum_{i=1}^N\log\text{Pr}_\theta(y_i).
\end{equation}

> The principle of maximum likelihood assumes that the most reasonable values for $\theta$ are those for which the probability of the observed sample is largest.

#### Least squares = ML with Gaussian errors
Least squares for the additive error model $Y = f_\theta(X) + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$, is equivalent to maximum likelihood using the conditional likelihood

\begin{equation}
\text{Pr}(Y|X,\theta) = N(f_\theta(X), \sigma^2).
\end{equation}

So although the additional assumption of normality seems more restrictive, the results are the same. The log-likelihood of the data is

\begin{equation}
L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2} \sum_{i=1}^N \left( y_i - f_\theta(x_i) \right)^2,
\end{equation}

and the only term involving $\theta$ is the last, which is $\text{RSS}(\theta)$ up to a scalar negative multiplier.
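
The equivalence can be checked numerically. The sketch below assumes a one-parameter model $f_\theta(x)=\theta x$ with known $\sigma$, both hypothetical choices, and compares the minimizer of $\text{RSS}(\theta)$ with the maximizer of $L(\theta)$ over a grid.

```python
import numpy as np

rng = np.random.default_rng(2)

N, sigma = 100, 0.5
x = rng.uniform(0.0, 1.0, size=N)
y = 2.0 * x + rng.normal(0.0, sigma, size=N)   # hypothetical f_theta(x) = theta * x

def rss(theta):
    return np.sum((y - theta * x) ** 2)

def log_lik(theta):
    # Sum of log N(y_i; theta * x_i, sigma^2) over the sample.
    resid = y - theta * x
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma) - resid**2 / (2 * sigma**2))

thetas = np.linspace(0.0, 4.0, 401)
theta_ls = thetas[np.argmin([rss(t) for t in thetas])]
theta_ml = thetas[np.argmax([log_lik(t) for t in thetas])]
print(theta_ls, theta_ml)
```

Both searches should select the same grid point, since $L(\theta)$ equals a constant minus $\text{RSS}(\theta)/(2\sigma^2)$.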

#### Multinomial likelihood for a qualitative output $G$

A more interesting example is the multinomial likelihood for the regression function $\text{Pr}(G|X)$ for a qualitative output $G$.

Suppose we have a model

\begin{equation}
\text{Pr}(G=\mathcal{G}_k|X=x) = p_{k,\theta}(x), \quad k=1,\ldots,K
\end{equation}

for the conditional density of each class given $X$, indexed by the parameter
vector $\theta$. Then the log-likelihood (whose negative is also referred to as the cross-entropy) is

\begin{equation}
L(\theta) = \sum_{i=1}^N\log p_{g_i,\theta}(x_i),
\end{equation}

and when maximized it delivers values of $\theta$ that best conform 
with the data in the likelihood sense.
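
As a sketch of how this log-likelihood is evaluated, the code below assumes a softmax form for $p_{k,\theta}(x)$, which is one possible model and not one the text prescribes. Classes are coded $0,\ldots,K-1$ for indexing, and the names `softmax_probs` and `log_likelihood` are illustrative.

```python
import numpy as np

def softmax_probs(X, Theta):
    """Class probabilities p_{k,theta}(x_i) under an assumed softmax model.
    X is N x p, Theta is p x K; each row of the result sums to 1."""
    scores = X @ Theta
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

def log_likelihood(X, g, Theta):
    """L(theta) = sum_i log p_{g_i,theta}(x_i), with g_i in {0, ..., K-1}."""
    P = softmax_probs(X, Theta)
    return float(np.sum(np.log(P[np.arange(len(g)), g])))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
g = np.array([0, 1, 0])
L_uniform = log_likelihood(X, g, np.zeros((2, 3)))        # all classes equally likely
L_better = log_likelihood(X, g, np.array([[2.0, 0.0, 0.0],
                                          [0.0, 2.0, 0.0]]))
print(L_uniform, L_better)
```

Parameters that put more probability on the observed classes yield a larger $L(\theta)$, which is the quantity maximum likelihood drives upward.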

__Once you have the distribution of the error, you can estimate the parameters by
maximum likelihood.__
