# Chapter 2 Statistical Learning


## 2.1 What Is Statistical Learning?

Suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_1, X_2, \ldots, X_p$. We assume that there is some relationship between $Y$ and $X = (X_1, X_2, \ldots, X_p)$, which can be written in the very general form
$$
Y=f(X)+\epsilon
$$

Here $f$ is some fixed but unknown function of $X_1, \ldots, X_p$, and $\epsilon$ is a random error term, which is independent of $X$ and has mean zero. In this formulation, $f$ represents the systematic information that $X$ provides about $Y$.
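
To make the model concrete, here is a minimal simulation sketch. The particular `f`, the noise level, and the sample size are assumptions chosen for illustration; in practice $f$ is unknown and only $(X, Y)$ pairs are observed.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" f, chosen only for this illustration.
    return 2.0 + 3.0 * np.sin(x)

n = 10_000
X = rng.uniform(0.0, 5.0, size=n)             # one predictor
eps = rng.normal(loc=0.0, scale=1.0, size=n)  # mean-zero error, drawn independently of X
Y = f(X) + eps                                # Y = f(X) + eps

# Both quantities should be close to zero in a large sample.
print("sample mean of eps:", eps.mean())
print("sample corr(X, eps):", np.corrcoef(X, eps)[0, 1])
```

The two printed values check the assumptions on $\epsilon$ empirically: its sample mean and its sample correlation with $X$ should both be near zero.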

Statistical learning refers to a set of approaches for estimating $f$. In this chapter we outline some of the key theoretical concepts that arise in estimating $f$, as well as tools for evaluating the estimates obtained.

### 2.1.1 Why estimate $f$?

There are two main reasons to estimate $f$: *prediction* and *inference*.

#### Prediction

In many situations a set of inputs $X$ is readily available, but the output $Y$ cannot be easily obtained. Since the error term averages to zero, we can predict $Y$ using
$$
\hat{Y}=\hat{f}(X)
$$
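where $\hat{f}$ represents our estimate for $f$, and $\hat{Y}$ represents the resulting prediction for $Y$. In this setting $\hat{f}$ is usually treated as a *black box*: we do not care about its exact form, provided that it yields accurate predictions for $Y$.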

The accuracy of $\hat{Y}$ as a prediction for $Y$ depends on two quantities, which we will call the reducible error and the irreducible error.

In general, $\hat{f}$ will not be a perfect estimate for $f$, and this inaccuracy will introduce some error. This error is *reducible* because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate $f$.

Even if it were possible to form a perfect estimate for $f$, so that our estimated response took the form $\hat{Y} = f(X)$, our prediction would still have some error in it. This is because $Y$ is also a function of $\epsilon$, which, by definition, cannot be predicted using $X$.

Therefore, variability associated with $\epsilon$ also affects the accuracy of our predictions. This is known as the *irreducible error*.

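For a fixed estimate $\hat{f}$ and a fixed set of predictors $X$, the expected squared prediction error splits into these two parts:
$$
E(Y-\hat{Y})^2 = E[f(X)+\epsilon-\hat{f}(X)]^2 = \underbrace{[f(X)-\hat{f}(X)]^2}_{\text{reducible}} + \underbrace{Var(\epsilon)}_{\text{irreducible}}
$$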
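
The decomposition can be checked numerically. The sketch below holds $x$ and $\hat{f}$ fixed and averages over many draws of $\epsilon$. The values of `f_x`, `f_hat_x`, and `sigma` are hypothetical, picked only so that both error components are visible.

```python
import numpy as np

rng = np.random.default_rng(1)

x = 1.5
f_x = 2.0 + 3.0 * x      # hypothetical true value f(x)
f_hat_x = 2.5 + 2.8 * x  # hypothetical fixed estimate f_hat(x)
sigma = 0.7              # assumed standard deviation of eps

# Many realizations of Y at the same x: only eps varies.
eps = rng.normal(0.0, sigma, size=200_000)
y = f_x + eps

mse = np.mean((y - f_hat_x) ** 2)  # Monte Carlo estimate of E(Y - Y_hat)^2
reducible = (f_x - f_hat_x) ** 2   # [f(x) - f_hat(x)]^2
irreducible = sigma ** 2           # Var(eps)

print("simulated E(Y - Y_hat)^2:", mse)
print("reducible + irreducible: ", reducible + irreducible)
```

The two printed numbers should agree up to simulation noise. Improving `f_hat_x` can shrink the first component, but no choice of $\hat{f}$ removes the `sigma ** 2` term.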

#### Inference
We are often interested in understanding the association between $Y$ and $X_1, \ldots, X_p$. In this situation we wish to estimate $f$, but our goal is not necessarily to make predictions for $Y$. Now $\hat{f}$ cannot be treated as a black box, because we need to know its exact form.

In this setting we may want to answer the following questions:
- Which predictors are associated with the response? 
- What is the relationship between the response and each predictor?
- Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

### 2.1.2 How Do We Estimate $f$?

Assume that we have observed a set of $n$ different data points, called the *training data*. The training data consist of $\{(x_1,y_1),\ldots,(x_n,y_n)\}$, where $x_i=(x_{i1},\ldots,x_{ip})^T$.

Our goal is to apply a statistical method to the training data in order to estimate the unknown function $f$. 
Broadly speaking, most statistical learning methods for this task can be characterized as either *parametric* or *non-parametric*.

#### Parametric methods
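Parametric methods involve a two-step model-based approach:
1. First, we make an assumption about the functional form, or shape, of $f$. For example, we may assume that $f$ is linear in $X$: $f(X) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$.
2. After a model has been selected, we need a procedure that uses the training data to *fit* or *train* the model.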

$\Rightarrow$ The most common approach to fitting a linear model is referred to as (ordinary) least squares.

The model-based approach just described is referred to as *parametric*; it reduces the problem of estimating $f$ down to one of estimating a set of parameters.
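
The two steps can be sketched with NumPy. The simulated data and the coefficient values in `beta_true` are assumptions for illustration. `np.linalg.lstsq` is used here as one way to compute the ordinary least squares solution.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated training data (hypothetical coefficients: intercept, then two slopes).
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.3, size=n)

# Step 1: assume a linear form, f(X) = b0 + b1*X1 + b2*X2.
A = np.column_stack([np.ones(n), X])  # design matrix with an intercept column

# Step 2: fit the model by ordinary least squares.
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

def f_hat(X_new):
    """Predict with the fitted linear model; X_new has p columns."""
    X_new = np.atleast_2d(X_new)
    return beta_hat[0] + X_new @ beta_hat[1:]

print("estimated parameters:", beta_hat)
print("prediction at (0, 0):", f_hat(np.array([0.0, 0.0])))
```

Estimating $f$ has been reduced to estimating the $p + 1$ numbers in `beta_hat`, which is what makes the approach parametric.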

The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of $f$.

$\Rightarrow$ If the chosen model is too far from the true $f$, then our estimate will be poor.

$\Rightarrow$ We can try to address this problem by choosing *flexible* models that can fit many different possible functional forms for $f$.

However, fitting a more flexible model requires estimating a greater number of parameters. Such more complex models can lead to *overfitting* the data, meaning that they follow the errors, or noise, too closely.

#### Non-parametric methods

- Do not make explicit assumptions about the functional form of $f$.
- Seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly.
- Avoid the danger that the assumed functional form, and hence $\hat{f}$, is very different from the true $f$.
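
As one illustration, the sketch below uses k-nearest-neighbours averaging. This is a simple non-parametric method chosen for the example; the notes above do not name a specific one. The function name `knn_regress`, the simulated data, and the choice `k=15` are all assumptions.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k=5):
    """Predict at each query point by averaging the k nearest training responses."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Pairwise absolute distances, shape (n_query, n_train).
    dist = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k closest training points for each query point.
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 5, size=300)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=300)  # hypothetical true f is sin

x_query = np.array([1.0, 2.5, 4.0])
pred = knn_regress(x_train, y_train, x_query, k=15)
print("predictions:", pred)
print("true f:     ", np.sin(x_query))
```

No functional form for $f$ is assumed anywhere in `knn_regress`. The parameter `k` controls how rough or smooth the estimate is: a small `k` follows the data closely, a large `k` averages over wider neighbourhoods.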

### 2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability

One may ask: *why would we ever choose to use a more restrictive method instead of a very flexible approach?*

Here are some of the reasons that we might prefer a more restrictive model:
- If we are mainly interested in inference, then restrictive models are much more interpretable.

### 2.1.4 Supervised Versus Unsupervised Learning
- Supervised: the setting discussed so far, in which each observation of the predictors $x_i$ has an associated response $y_i$.
- Unsupervised: we observe a vector of measurements $x_i$, but no associated response $y_i$.
    - One statistical learning tool that we may use in this setting is *cluster analysis*, or clustering.
- Sometimes the question of whether an analysis should be considered supervised or unsupervised is less clear-cut.
    - For $m$ of the observations, where $m < n$, we have both predictor measurements and a response measurement. For the remaining $n - m$ observations, we have predictor measurements but no response measurement. Such a scenario can arise if the predictors can be measured relatively cheaply but the corresponding responses are much more expensive to collect. We refer to this setting as a *semi-supervised learning* problem.


### 2.1.5 Regression Versus Classification Problems
- Problems with a quantitative response: regression problems
- Problems with a qualitative response: classification problems

## 2.2 Assessing Model Accuracy
### 2.2.1 Measuring the Quality of Fit
In the regression setting, the most commonly used measure is the *mean squared error* (MSE), given by
$$
MSE = \frac{1}{n}\sum^n_{i=1}(y_i - \hat{f}(x_i))^2
$$

The MSE is computed using the training data that was used to fit the model, and so should more accurately be referred to as the training MSE.

But in general, *we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data*.

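$\Rightarrow$ We want to choose the method that gives the lowest *test MSE*, the average squared prediction error over test observations $(x_0, y_0)$ that were not used to train the method:
$$
\text{test MSE} = Ave(y_0-\hat{f}(x_0))^2
$$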
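
The difference between training and test MSE can be seen in a small simulation. The sketch below fits polynomials of increasing degree, used here as a stand-in for increasing flexibility. The true `f`, the noise level, and the sample sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    # Hypothetical true f, chosen only for this illustration.
    return np.sin(2 * x)

def simulate(n):
    """Draw n observations from Y = f(X) + eps."""
    x = rng.uniform(0, 3, size=n)
    return x, f(x) + rng.normal(scale=0.3, size=n)

def mse(y, y_hat):
    """Mean squared error between responses and predictions."""
    return np.mean((y - y_hat) ** 2)

x_train, y_train = simulate(30)    # data used to fit the model
x_test, y_test = simulate(2000)    # previously unseen data

results = {}
for degree in (1, 3, 10):
    # Least squares polynomial fit of the given degree.
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(y_train, poly(x_train)), mse(y_test, poly(x_test)))
    print(f"degree {degree:2d}: training MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

Because each polynomial model contains the lower-degree ones, the training MSE cannot increase with the degree. The test MSE has no such guarantee, which is why a small training MSE alone says little about a method.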

Note: the flexibility of a method is summarized by its *degrees of freedom*; a more flexible fit has more degrees of freedom.

### 2.2.2 The Bias-Variance Trade-Off
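For a given test point $x_0$, the expected test MSE can be decomposed into three terms:

$$
E(y_0-\hat{f}(x_0))^2 = Var(\hat{f}(x_0))+[Bias(\hat{f}(x_0))]^2 + Var(\epsilon)
$$
Here $E(y_0-\hat{f}(x_0))^2$ is the *expected test MSE* at $x_0$: the average test MSE that we would obtain if we repeatedly estimated $f$ using a large number of training sets and tested each estimate at $x_0$.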

- Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set
    - Ideally the estimate for $f$ should not vary too much between training sets.
    - If a method has high variance then small changes in the training data can result in large changes in $\hat{f}$
    - In general, more flexible statistical methods have higher variance

- Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model
    - Generally, more flexible methods result in less bias


- As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease.
    - The relative rate of change of these two quantities determines whether the test MSE increases or decreases.

$\Rightarrow$ The relationship between bias, variance, and test set MSE given above is referred to as the *bias-variance trade-off*.

$\Rightarrow$ The challenge lies in finding a method for which both the variance and the squared bias are low.
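
The decomposition of the expected test MSE can be checked by simulation: draw many training sets, refit the model on each, and look at the spread and the average of the predictions at a single point $x_0$. Everything specific in the sketch below (the true `f`, `sigma`, `x0`, the polynomial degrees, and the number of repetitions) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    # Hypothetical true f, chosen only for this illustration.
    return np.sin(2 * x)

sigma = 0.3    # standard deviation of the error term (assumed)
x0 = 1.0       # test point at which the decomposition is evaluated
n_train = 50   # size of each simulated training set
n_reps = 2000  # number of simulated training sets

def predict_at_x0(degree):
    """Draw a fresh training set, fit a polynomial, and predict at x0."""
    x = rng.uniform(0, 3, size=n_train)
    y = f(x) + rng.normal(scale=sigma, size=n_train)
    return np.polynomial.Polynomial.fit(x, y, deg=degree)(x0)

decomp = {}
for degree in (1, 5):
    preds = np.array([predict_at_x0(degree) for _ in range(n_reps)])
    y0 = f(x0) + rng.normal(scale=sigma, size=n_reps)  # fresh test responses at x0
    variance = preds.var()                             # Var(f_hat(x0))
    bias_sq = (preds.mean() - f(x0)) ** 2              # [Bias(f_hat(x0))]^2
    expected_mse = np.mean((y0 - preds) ** 2)          # E(y0 - f_hat(x0))^2
    decomp[degree] = (variance, bias_sq, expected_mse)
    print(f"degree {degree}: variance = {variance:.4f}, bias^2 = {bias_sq:.4f}, "
          f"sum with Var(eps) = {variance + bias_sq + sigma**2:.4f}, "
          f"simulated expected test MSE = {expected_mse:.4f}")
```

For each degree, the sum of the three terms should match the simulated expected test MSE up to simulation noise. In this setup the straight-line fit cannot follow the curved `f`, so its squared bias is much larger than that of the degree-5 fit.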


### 2.2.3 The Classification Setting
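The most common approach for quantifying the accuracy of our estimate $\hat{f}$ is the training *error rate*, the proportion of mistakes made when we apply $\hat{f}$ to the training observations:
$$
\textit{error rate} = \frac{1}{n}\sum_{i=1}^nI(y_i\neq \hat{y}_i)
$$
Here $\hat{y}_i$ is the predicted class label for the $i$th observation, and $I(y_i\neq \hat{y}_i)$ is an indicator variable that equals 1 if $y_i\neq \hat{y}_i$ and 0 otherwise. As in the regression setting, we are more interested in the *test error rate*, computed on test observations $(x_0, y_0)$ that were not used in training:
$$
\textit{test error rate} = Ave(I(y_0\neq \hat{y}_0))
$$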
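
The error rate is simply the fraction of mismatched labels. A minimal sketch follows; the helper name `error_rate` and the toy labels are made up for illustration.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # The comparison yields the indicator I(y_i != y_hat_i); its mean is the error rate.
    return np.mean(y_true != y_pred)

y = np.array(["a", "b", "a", "a", "b"])
y_hat = np.array(["a", "b", "b", "a", "a"])
print(error_rate(y, y_hat))  # 2 of the 5 labels differ
```

The same function gives the training error rate when applied to training labels and the test error rate when applied to held-out labels.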

#### The Bayes Classifier
It can be shown that the test error rate given above is minimized, on average, by a very simple classifier that assigns each observation to the most likely class given its predictor values. In other words, we assign a test observation with predictor vector $x_0$ to the class $j$ for which
$$
\Pr(Y=j \mid X=x_0)
$$
is largest. This classifier is called the *Bayes classifier*.

The Bayes classifier produces the lowest possible test error rate, called the *Bayes error rate*. The Bayes error rate is analogous to the irreducible error discussed earlier.
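
With real data the conditional distribution of $Y$ given $X$ is unknown, so the Bayes classifier cannot be computed. In a simulation it can, because we choose the distribution ourselves. The sketch below assumes two equally likely classes with $X \mid Y=0 \sim N(-1, 1)$ and $X \mid Y=1 \sim N(1, 1)$; this setup is hypothetical and not from the notes above. Under it, $\Pr(Y=1 \mid X=x) = 1/(1+e^{-2x})$.

```python
import math
import numpy as np

rng = np.random.default_rng(6)

def posterior_class1(x):
    """Pr(Y = 1 | X = x) under the assumed two-Gaussian setup."""
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def bayes_classify(x):
    """Assign each observation to the class with the larger posterior probability."""
    return (posterior_class1(x) > 0.5).astype(int)

# Simulate test data from the assumed distribution.
n = 200_000
y = rng.integers(0, 2, size=n)                           # equal class priors
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

simulated_error = np.mean(bayes_classify(x) != y)

# In this setup the Bayes classifier errs exactly when a standard normal
# deviate exceeds 1, so the Bayes error rate is P(Z > 1).
exact_error = 0.5 * math.erfc(1.0 / math.sqrt(2.0))

print("simulated test error of Bayes classifier:", simulated_error)
print("exact Bayes error rate:                  ", exact_error)
```

The classes overlap, so even the Bayes classifier makes mistakes. The Bayes error rate is greater than zero here, just as the irreducible error is in regression.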

## 2.3 Lab: Introduction to Python

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])  # a 2x2 array
print(x.ndim)   # number of dimensions
print(x.shape)  # (rows, columns)
```

Output:

```
2
(2, 2)
```

## 2.4 Exercise