# Chapter 2 - Statistical Learning

## What is Statistical Learning?

This chapter begins with an example dataset, which contains information for the sales of a product in 200 different markets, along with advertising budgets in each of those markets for three different media: TV, radio, and newspaper. 

The client has direct control over advertising levels but cannot directly increase sales of the product. If we can determine that there is an association between advertising and sales, then the client can adjust advertising budgets and thereby indirectly increase sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets. 

In this problem, the advertising budgets are the *input variables*, denoted $X = (X_1, \dots, X_p)$, while sales is the *output variable*, denoted Y. We assume that the relationship between X and Y can be written in the general form:

$$ Y = f(X) + \epsilon. $$ 

Here f is some fixed but unknown function of $X_1, \dots, X_p$, and $\epsilon$ is a random error term, which is independent of X and has mean zero. In this formula, f represents the systematic information that X provides about Y. We must estimate f based on the observed data points. 
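As a minimal sketch of this setup, the Python snippet below simulates data from a model of this form. The particular `f` (a straight line), the noise standard deviation, and the random seed are illustrative assumptions and have nothing to do with the advertising dataset.

```python
import random
import statistics

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return 2.0 + 3.0 * x

def simulate(n, sigma=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps, with eps ~ N(0, sigma^2) independent of X."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(5000)

# The error term has mean zero, so its sample average should be close to 0
# and, on average, Y tracks f(X).
mean_eps = statistics.fmean(eps)
print(round(mean_eps, 3))
```

In a real problem only `xs` and `ys` are observed; `f` and `eps` are hidden, which is why f has to be estimated.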

In essence, statistical learning refers to a set of approaches for estimating f. 

#### ***Why Estimate f?***

There are two main reasons we may wish to estimate f:
1) Prediction
2) Inference

***Prediction***

We can predict Y using: 

$$ \hat{Y} = \hat{f}(X) $$

Here $\hat{f}$ represents our estimate for f, and $\hat{Y}$ represents the resulting prediction for Y. In this setting, $\hat{f}$ is often treated as a black box, in the sense that one is not typically concerned with the exact form of $\hat{f}$, provided that it yields accurate predictions.

The accuracy of $\hat{Y}$ as a prediction for Y depends on two quantities, the *reducible error* and the *irreducible error*. In general, $\hat{f}$ will not be a perfect estimate for f, and this inaccuracy will introduce some error. This error is reducible, because we can potentially improve the accuracy of $\hat{f}$ by choosing a more appropriate model or improving the one we have. 

However, even if we could estimate f perfectly, our prediction would still have some error in it. This is because Y is also a function of $\epsilon$, which, by definition, cannot be predicted using X. Therefore, variability associated with $\epsilon$ also affects the accuracy of our predictions. 

The focus of this book is on techniques for estimating f with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for Y. This bound is almost always unknown in practice.
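The split between the two kinds of error can be written out explicitly. Treating both $\hat{f}$ and X as fixed, and using only the assumptions already stated (that $\epsilon$ has mean zero and is independent of X), a short derivation gives:

$$
\begin{aligned}
E(Y - \hat{Y})^2 &= E\left[f(X) + \epsilon - \hat{f}(X)\right]^2 \\
&= \left[f(X) - \hat{f}(X)\right]^2 + 2\left[f(X) - \hat{f}(X)\right]E(\epsilon) + E(\epsilon^2) \\
&= \underbrace{\left[f(X) - \hat{f}(X)\right]^2}_{\text{reducible}} + \underbrace{\text{Var}(\epsilon)}_{\text{irreducible}}
\end{aligned}
$$

The cross term vanishes because $E(\epsilon) = 0$, and $E(\epsilon^2) = \text{Var}(\epsilon)$ for the same reason. Even a perfect estimate $\hat{f} = f$ leaves the second term untouched.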

***Inference***

We are often interested in understanding the association between Y and $X_1, \dots, X_p$. In this situation, we wish to estimate f, but our goal is not necessarily to make predictions for Y. Now $\hat{f}$ cannot be treated as a black box, because we need to know its exact form. 

In this setting, one may be interested in the following questions:
* *Which predictors are associated with the response?* It is often the case that only a small fraction of the predictors are substantially associated with Y. Identifying the few important predictors among a large set can be useful.
* *What is the relationship between the response and each predictor?*
* *Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?*

In this book, we will see a number of examples that fall into the prediction setting, the inference setting, or a combination of the two. 

#### ***How do We Estimate f?***

Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function f. In other words, we want to find a function $\hat{f}$ such that, for any observation (X, Y),

$$ Y \approx \hat{f}(X) $$

Broadly speaking, most statistical learning methods for this task can be characterized as either *parametric* or *non-parametric*.

***Parametric Methods***

Parametric methods involve a two-step model-based approach. 

First, we make an assumption about the functional form, or shape, of f. For example, one very simple assumption is that f is linear in X:

$$ f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p $$

Once we have assumed that f is linear, the problem of estimating f is greatly simplified. Instead of having to estimate an entirely arbitrary p-dimensional function f(X), we only need to estimate the p + 1 coefficients $\beta_0, \beta_1, \dots, \beta_p$.

After a model has been selected, we need a procedure that uses the training data to fit (train) the model. In the case of the linear model, we use the data to estimate the parameters $\beta_0, \beta_1, \dots, \beta_p$. The most common approach is *ordinary least squares*, although there are many other ways to fit the linear model.

The model-based approach described above is referred to as *parametric*. It reduces the problem of estimating f down to one of estimating a set of parameters. 
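The two steps can be sketched for the simplest case of a single predictor (p = 1), where the least squares estimates have a well-known closed form. The function name `fit_simple_ols` and the training data are made up for illustration.

```python
import statistics

def fit_simple_ols(xs, ys):
    """Least squares estimates (b0, b1) for the one-predictor model f(X) = b0 + b1 * X."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    # Slope: sample covariance of X and Y divided by the sample variance of X.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up training data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

b0, b1 = fit_simple_ols(xs, ys)
print(b0, b1)  # approximately 1.0 and 2.0
```

Step one (assuming a linear form) is encoded in the shape of the model; step two (fitting) reduces to computing two numbers from the training data.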

The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor. We can try to address this problem by choosing more *flexible* models - which generally requires estimating a greater number of parameters. This can lead to ***overfitting***, which essentially means the model follows the errors/noise too closely. 


***Non-Parametric Methods***

These methods do not make explicit assumptions about the functional form of f. Instead, they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. 

This can have a major advantage over parametric approaches - by avoiding the assumption of a particular functional form, non-parametric methods can fit a wider range of possible shapes for f. Any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger. 

But these approaches do suffer from a major disadvantage - since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations is required in order to obtain an accurate estimate for f. 

Non-parametric methods are also more prone to overfitting.


#### ***The Trade-Off Between Prediction Accuracy and Model Interpretability***

A reasonable question to ask is: *"Why would we ever choose to use a more restrictive method instead of a very flexible approach?"* If we are mainly interested in inference, then restrictive models are preferable because they are much more interpretable. 

So it's important to keep in mind what our goal is. If we are only interested in generating the best predictions, then we will choose a more flexible model. If we need intuition into the relationship between variables and we need interpretability, then we will use a less flexible model.

![test picture](../../img/Introduction%20to%20Statistical%20Learning/02-Statistical%20Learning/Flexibility_Interpretability_Tradeoff.png)

## Assessing Model Accuracy

This book aims to introduce many methods because *there is no free lunch in statistics* - meaning no one method dominates all others over all possible datasets. It's important (and challenging) to decide for any given set of data which method produces the best results.

#### ***Measuring the Quality of Fit***

The most commonly used measure in the regression setting is the *mean squared error* (MSE):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{f}(x_i)\right)^2$$

The MSE will be small if the predicted responses are very close to the true responses. 

***As model flexibility increases, the training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because the method is working too hard to find patterns in the training data, and is not generalizing well. Note that we almost always expect the training MSE to be smaller than the test MSE, since most methods either directly or indirectly seek to minimize the training MSE.***
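The gap between training and test MSE can be seen in a small simulation. The sketch below is an illustration under several assumptions: the true function (a sine curve), the noise level, and the sample sizes are invented, and nearest-neighbor averaging is used only because its flexibility is controlled by a single number (a smaller K means a more flexible fit).

```python
import math
import random

def mse(ys, preds):
    """Mean squared error between observed responses and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def knn_predict(train_x, train_y, x0, k):
    # Average the responses of the k training points closest to x0.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k

def make_data(n, rng):
    # Hypothetical data-generating process: Y = sin(2*pi*X) + noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(1)
train_x, train_y = make_data(50, rng)
test_x, test_y = make_data(500, rng)

results = {}
for k in (1, 5, 25):
    train_mse = mse(train_y, [knn_predict(train_x, train_y, x, k) for x in train_x])
    test_mse = mse(test_y, [knn_predict(train_x, train_y, x, k) for x in test_x])
    results[k] = (train_mse, test_mse)
    print(k, round(train_mse, 3), round(test_mse, 3))
```

With K = 1 each training point is its own nearest neighbor, so the training MSE is exactly zero, yet the test MSE is not: the fit has memorized the noise in the training set.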

#### ***The Bias Variance Trade-Off***

It can be shown mathematically that the expected test MSE at a given value $x_0$ can always be decomposed into the sum of three fundamental quantities:
1) The variance of $\hat{f}(x_0)$
2) The squared bias of $\hat{f}(x_0)$
3) And the variance of the error terms $\epsilon$

Here is that formula:

$$E \left( y_0 - \hat{f}(x_0) \right)^2 = \text{Var}(\hat{f}(x_0)) + \left[\text{Bias}(\hat{f}(x_0))\right]^2 + \text{Var}(\epsilon)$$

Where $E \left( y_0 - \hat{f}(x_0) \right)^2$ is the expected test MSE.

This equation tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. 
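The terms of this decomposition can be estimated by simulation when the true f is known. The sketch below makes that assumption purely for illustration (in practice f is unobserved): it repeatedly draws training sets from a hypothetical sine-curve model, fits a nearest-neighbor average at a fixed point $x_0$, and measures how the estimates spread (variance) and how far their average lies from $f(x_0)$ (bias).

```python
import math
import random
import statistics

SIGMA = 0.3  # assumed standard deviation of the error term

def true_f(x):
    # Hypothetical true function; known here only because the data are simulated.
    return math.sin(2 * math.pi * x)

def knn_estimate(rng, x0, k, n=50):
    """Draw a fresh training set of size n and return the K-nearest-neighbor estimate at x0."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, SIGMA) for x in xs]
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def decompose(k, x0=0.25, reps=2000, seed=0):
    """Monte Carlo estimates of Var(f_hat(x0)) and Bias(f_hat(x0))^2 over repeated training sets."""
    rng = random.Random(seed)
    estimates = [knn_estimate(rng, x0, k) for _ in range(reps)]
    variance = statistics.pvariance(estimates)
    bias_sq = (statistics.fmean(estimates) - true_f(x0)) ** 2
    return variance, bias_sq, estimates

for k in (1, 25):
    variance, bias_sq, _ = decompose(k)
    # Expected test MSE at x0 = variance + squared bias + Var(eps).
    expected_test_mse = variance + bias_sq + SIGMA ** 2
    print(k, round(variance, 4), round(bias_sq, 4), round(expected_test_mse, 4))
```

The flexible fit (K = 1) should show the larger variance and the rigid fit (K = 25) the larger squared bias, while the `SIGMA ** 2` term is the same for both.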

***Variance*** refers to the amount by which $\hat{f}$ would change if we estimated it using a different training set. Since the training data are used to fit the method, different sets will result in a different $\hat{f}$. But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in $\hat{f}$. In general, more flexible statistical methods have high variance. 

See the figure below. The flexible green curve is following the observations very closely. It has high variance because changing any one of these data points may cause the estimate $\hat{f}$ to change considerably. In contrast, the orange least squares line is relatively inflexible and has low variance. 

![test picture](../../img/Introduction%20to%20Statistical%20Learning/02-Statistical%20Learning/Variance.png)

<br>

On the other hand, ***bias*** refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. 

For example, linear regression assumes that there is a linear relationship between Y and the X's. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f. When the true relationship is approximately linear, linear regression will result in low bias. However, as the relationship becomes less linear, the bias will increase. Generally, more flexible methods result in less bias.

***As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens, the test MSE increases.***

The figure below shows the bias-variance trade-off for three datasets. The true f for the middle panel is close to linear, so the bias decreases only slightly as flexibility increases. The true f is less linear in the first panel and even less linear in the third, which explains the dramatic initial drop in bias as flexibility increases in those panels. 

![test picture](../../img/Introduction%20to%20Statistical%20Learning/02-Statistical%20Learning/Bias_Variance_Tradeoff.png)

<br>

The relationship displayed above is referred to as the ***bias-variance trade-off***. Good test set performance of a statistical learning method requires low variance as well as low bias. This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance, or a method with very low variance but high bias. The challenge lies in finding a method for which both the variance and the bias are low. 

In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always keep this trade-off in mind. 


#### ***The Classification Setting***

The most common approach for quantifying the accuracy of our estimate $\hat{f}$ is the *error rate*. This is the proportion of mistakes that are made if we apply our estimate. 
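As a small illustration, the error rate is just the fraction of observations whose predicted class differs from the true class. The function name `error_rate` and the labels below are made up for the example.

```python
def error_rate(y_true, y_pred):
    """Proportion of observations whose predicted class differs from the true class."""
    mistakes = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    return mistakes / len(y_true)

# Made-up true and predicted class labels for five observations.
y_true = ["blue", "blue", "orange", "orange", "blue"]
y_pred = ["blue", "orange", "orange", "orange", "orange"]

print(error_rate(y_true, y_pred))  # 2 mistakes out of 5 -> 0.4
```

Computed on the training observations this gives the training error rate; computed on observations not used in fitting, it gives the test error rate.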

***The Bayes Classifier***

$$\Pr(Y = j \mid X = x_0)$$

This classifier will calculate the probability that Y = j, given the observed predictor values. It will predict the class with the largest probability. 

For example, say the response variable is binary. If the probability that Y = 1 is 0.55 and the probability that Y = 0 is 0.45, then the classifier will predict Y = 1. The set of points where the two probabilities are exactly equal (0.5 each) is called the *Bayes decision boundary*. The Bayes classifier produces the lowest possible test error rate, called the *Bayes error rate*. Since the Bayes classifier always chooses the class with the largest probability, the error rate at $X = x_0$ will be $1 - \max_j \Pr(Y = j \mid X = x_0)$. The Bayes error rate is analogous to the irreducible error discussed earlier.
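The binary example above can be written as a few lines of Python. This is only a sketch: it assumes the conditional probabilities are handed to us in a dictionary, which is exactly what is not available for real data, and the function name `bayes_classify` is hypothetical.

```python
def bayes_classify(cond_probs):
    """Given Pr(Y = j | X = x0) for each class j, return the Bayes prediction
    and the error rate at x0, which is 1 - max_j Pr(Y = j | X = x0)."""
    best_class = max(cond_probs, key=cond_probs.get)
    return best_class, 1.0 - cond_probs[best_class]

# Conditional probabilities from the example: Pr(Y = 1) = 0.55, Pr(Y = 0) = 0.45.
pred, err = bayes_classify({1: 0.55, 0: 0.45})
print(pred, round(err, 2))  # 1 0.45
```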

The figure below shows the Bayes decision boundary for two groups. Because we notice some orange circles in the blue range and blue circles in the orange range, we know that the Bayes error rate will be greater than 0.

![test picture](../../img/Introduction%20to%20Statistical%20Learning/02-Statistical%20Learning/Bayes.png)
<br>

***K-Nearest Neighbors***

In theory, we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods. 

Many approaches attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with the highest estimated probability. One such method is *K-Nearest Neighbors (KNN)*. 

The figure below plots a small training set in two panels. In the left-hand panel, the goal is to make a prediction for the point labeled by the black cross. Suppose we choose K = 3. Then KNN will first identify the three observations closest to that cross. This neighborhood is shown as a circle. It consists of two blue points and one orange point, and hence KNN will predict that the black cross belongs to the blue class.

In the right-hand panel, the KNN approach with K = 3 has been applied at all possible values of the predictors, and the corresponding KNN decision boundary has been drawn in. 

![test picture](../../img/Introduction%20to%20Statistical%20Learning/02-Statistical%20Learning/KNN.png)

<br>

Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier. 
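The KNN procedure described above (find the K closest training points, then take a majority vote) can be sketched in a few lines. The training points, labels, and query point below are made up and are not the ones in the figure; Euclidean distance is assumed as the notion of closeness.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, x0, k):
    """Predict the majority class among the k training points closest to x0 (Euclidean distance)."""
    order = sorted(range(len(train_points)), key=lambda i: math.dist(train_points[i], x0))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up two-dimensional training set with two classes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (3.0, 3.0), (3.2, 2.9), (1.9, 1.7)]
labels = ["blue", "blue", "blue", "orange", "orange", "orange"]

x0 = (1.5, 1.4)  # the point to classify
print(knn_classify(points, labels, x0, k=3))  # neighbors: one orange, two blue -> blue
print(knn_classify(points, labels, x0, k=1))  # the single nearest point is orange
```

The two calls give different answers for the same point, which previews how strongly the result can depend on the choice of K.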

The choice of K has a drastic effect on the KNN classifier obtained. The picture below shows that when K = 1, the decision boundary is overly flexible and overfits the data. This corresponds to a classifier that has low bias but high variance. As K grows, the method becomes less flexible and produces a decision boundary that is close to linear. This corresponds to low variance and high bias.

![test picture](../../img/Introduction%20to%20Statistical%20Learning/02-Statistical%20Learning/KNN2.png)

## Exercises

![test picture](../../img/Introduction%20to%20Statistical%20Learning/02-Statistical%20Learning/Exercises1.png)

![test picture](../../img/Introduction%20to%20Statistical%20Learning/02-Statistical%20Learning/Exercises2.png)

![test picture](../../img/Introduction%20to%20Statistical%20Learning/02-Statistical%20Learning/Exercises3.png)