<a href="https://colab.research.google.com/github/lustraka/Data_Analysis_Workouts/blob/main/Introduction_to_Statistical_Learning/ISL02_Statistical_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statistical Learning
## What is Statistical Learning?
![Key Concepts](http://www.plantuml.com/plantuml/png/RP5FImCn4CNl_HHpR5bAxxAKWWZU51My56IQp6u7ayawcUqVnEzk9Ar5x6M7zzvlzeLabGtL8ekFE4mQkCl64OsW3ULxMAwtQ9_TLxkeSj8qyBezbj6ymQ3asHadgPgb8oLnL2JSf_s9GiL8fkoWZ6tokVgIP7urWnT5J_E7hgjWJ9u2i1XfQJJSS63xTmH0vqP5TuHf5-Z0bPeL39x9ZAMl6taSI7USoKCWLFaDHhaUmQEcJQ2OA_P4lLBEFutJZn75sD1uHr3S8KccMULk0nQgOuTsPiCPnNU4ubEVjEJXihgMQSlB3UrHwGP2wZaREz1B9sT0S7__NQ-kNV1oDbcH-DDhVWC0)

Suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_1, X_2,...,X_p$. We assume that there is some relationship between $Y$ and $X = (X_1, X_2,...,X_p)$, which can be written in the very general form

$$Y = f(X) + \epsilon. \qquad (1)$$

Here $f$ is some fixed but unknown function of $X_1, X_2,...,X_p$, and $\epsilon$ is a random *error term*, which is independent of $X$ and has mean zero. In this formulation, $f$ represents the *systematic* information that $X$ provides about $Y$.
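
To make formulation $(1)$ concrete, the following minimal sketch simulates data from it. The particular `f`, the noise scale, and the sample size are assumptions chosen only for illustration; in practice $f$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" systematic relationship (unknown in real problems)
    return 2.0 + 3.0 * np.sin(x)

n = 10_000
x = rng.uniform(0.0, 5.0, size=n)
eps = rng.normal(loc=0.0, scale=1.0, size=n)  # mean zero, drawn independently of x
y = f(x) + eps                                # Y = f(X) + eps

# The error term should average out to roughly zero
print(f"mean of eps: {eps.mean():.3f}")
```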

### Why Estimate $f$ ?
There are two main reasons that we may wish to estimate $f$ : *prediction* and *inference*.

**Prediction**. In many situations a set of inputs $X$ is readily available, but the output $Y$ cannot be easily obtained. In this setting, since the error term averages to zero, we can predict $Y$ using

$$\hat{Y} = \hat{f}(X), \qquad (2)$$

where $\hat{f}$ represents our estimate of $f$, and $\hat{Y}$ represents the resulting prediction for $Y$. In this setting, $\hat{f}$ is often treated as a *black box*, in the sense that one is not typically concerned with the exact form of $\hat{f}$, provided it yields accurate predictions for $Y$.

Consider a given estimate $\hat{f}$ and a set of predictors $X$, which yields the prediction $\hat{Y} = \hat{f}(X)$. Assume for a moment that both $\hat{f}$ and $X$ are fixed, so that the only variability comes from $\epsilon$. Then, it is easy to show that

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_\text{Reducible} + \underbrace{\text{Var}(\epsilon)}_\text{Irreducible}, \quad (3)$$

where $E(Y - \hat{Y})^2$ represents the average, or *expected value*, of the squared difference between the predicted and actual value of $Y$, and $\text{Var}(\epsilon)$ represents the *variance* associated with the error term $\epsilon$.
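
The decomposition $(3)$ follows from expanding the square: the cross term $2[f(X) - \hat{f}(X)]E(\epsilon)$ vanishes because $\epsilon$ has mean zero. A small simulation can check this numerically. The values of `f_x`, `f_hat_x`, and `sigma` below are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

f_x = 5.0       # hypothetical true value f(X) at a fixed X
f_hat_x = 4.2   # hypothetical estimate \hat{f}(X) at the same X
sigma = 1.5     # hypothetical standard deviation of eps

# Only eps varies, as assumed in the text
eps = rng.normal(0.0, sigma, size=200_000)
y = f_x + eps

empirical = np.mean((y - f_hat_x) ** 2)   # Monte Carlo estimate of E(Y - Y_hat)^2
reducible = (f_x - f_hat_x) ** 2          # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                  # Var(eps)

print(f"empirical:               {empirical:.3f}")
print(f"reducible + irreducible: {reducible + irreducible:.3f}")
```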

> The focus of this repository is on techniques for estimating $f$ with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for $Y$. This bound is almost always unknown in practice.

**Inference**. We are often interested in understanding the association between $Y$ and $X_1, X_2,...,X_p$. In this situation we wish to estimate $f$, but our goal is not necessarily to make predictions for $Y$. Now $\hat{f}$ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

> In this repository, we will see a number of examples that fall into the prediction setting, the inference setting, or a combination of the two.
>
> Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating $f$ may be appropriate.

### How Do We Estimate $f$ ?
**Parametric Methods** involve a two-step, model-based approach.
1. First, we make an assumption about the functional form, or shape, of $f$. For example, one very simple assumption is that $f$ is linear in $X$:
$$f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p. \qquad (4)$$
This is a *linear model*. Once we have assumed that $f$ is linear, the problem of estimating $f$ is greatly simplified. Instead of having to estimate an entirely arbitrary $p$-dimensional function $f(X)$, one only needs to estimate the $p+1$ coefficients $\beta_0, \beta_1, ..., \beta_p$.

2. After a model has been selected, we need a procedure that uses the training data to *fit* or *train* the model. In the case of the linear model $(4)$, we need to estimate the parameters $\beta_0, \beta_1, ..., \beta_p$. That is, we want to find values of these parameters such that
$$Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p.$$
The most common approach to fitting the model $(4)$ is referred to as *(ordinary) least squares*, but there are also other approaches.
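
The two steps can be sketched with NumPy: assume the linear form $(4)$, then estimate the coefficients by least squares using `np.linalg.lstsq`. The simulated data and the values in `beta_true` are assumptions made only so that the estimates can be compared with known coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 500, 3
beta_true = np.array([1.0, 2.0, -0.5, 0.3])  # hypothetical beta_0, ..., beta_p

# Simulated training data that follows the linear model (4) plus small noise
X = rng.normal(size=(n, p))
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that beta_0 is estimated as the intercept
X_design = np.column_stack([np.ones(n), X])

# Least squares: minimize the sum of squared residuals ||y - X_design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("estimated coefficients:", np.round(beta_hat, 3))
```

Only $p + 1 = 4$ numbers are estimated here, which is the simplification that the parametric assumption buys.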

**Non-Parametric Methods** do not make explicit assumptions about the functional form of $f$. Instead they seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly. An example is a *thin-plate spline*. In order to fit a thin-plate spline, the data analyst must select a level of smoothness.
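
As an illustration of the non-parametric idea, the sketch below uses $k$-nearest-neighbors averaging rather than a thin-plate spline, because it fits in a few lines of NumPy. This is a swapped-in technique, not the spline mentioned above; the number of neighbors `k` plays a role analogous to the smoothness level. The data, the function `knn_regress`, and the helper `roughness` are hypothetical.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k):
    """Predict at each query point by averaging the k nearest training responses."""
    # Pairwise absolute distances, shape (n_query, n_train)
    dist = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k closest training points for every query point
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def roughness(values):
    # Sum of squared second differences: larger means a more wiggly curve
    return np.sum(np.diff(values, n=2) ** 2)

rng = np.random.default_rng(3)
x_train = rng.uniform(0.0, 6.0, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=200)  # hypothetical data

x_query = np.linspace(0.5, 5.5, 50)
wiggly = knn_regress(x_train, y_train, x_query, k=1)   # follows the noise closely
smooth = knn_regress(x_train, y_train, x_query, k=25)  # averages the noise away

print(f"roughness with k=1:  {roughness(wiggly):.3f}")
print(f"roughness with k=25: {roughness(smooth):.3f}")
```

No functional form for $f$ is assumed anywhere in this sketch; the analyst only chooses how much local averaging to apply.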

### The Trade-Off Between Prediction Accuracy and Model Interpretability
In general, as the flexibility of a method increases, its interpretability decreases.

### Supervised Versus Unsupervised Learning

**Supervised Learning**. For each observation of the predictor measurement(s) $x_i$, $i = 1, ..., n$, there is an associated response measurement $y_i$. We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).

**Unsupervised Learning** describes the somewhat more challenging situation in which, for every observation $i = 1, ..., n$, we observe a vector of measurements $x_i$ but no associated response $y_i$. One statistical tool that we may use in this setting is *cluster analysis*, or clustering. The goal of cluster analysis is to ascertain, on the basis of $x_1, ..., x_n$, whether the observations fall into relatively distinct groups.
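
A minimal clustering sketch follows. It uses $k$-means, which is one clustering algorithm among many and is an assumption here, since the text does not name a specific method. The helper `k_means`, its farthest-point initialization, and the two simulated groups are hypothetical and serve only to show that groups can be recovered from $x_1, ..., x_n$ without any response $y_i$.

```python
import numpy as np

def k_means(x, k, n_iter=50, seed=0):
    """Basic k-means (Lloyd's algorithm) with a farthest-point initialization."""
    rng = np.random.default_rng(seed)
    # Start from one random observation, then repeatedly add the observation
    # that is farthest from all centers chosen so far
    centers = [x[rng.integers(len(x))]]
    for _ in range(1, k):
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d2.argmax()])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assign every observation to its nearest center
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its assigned observations
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(7)
group_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2))  # hypothetical group
group_b = rng.normal(loc=(6.0, 6.0), scale=0.5, size=(100, 2))  # hypothetical group
x = np.vstack([group_a, group_b])

labels, centers = k_means(x, k=2)
print("cluster sizes:", np.bincount(labels))
```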

### Regression Versus Classification Problems

## Assessing Model Accuracy
Deciding which method produces the best results for a given data set is an important task. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.

### Measuring the Quality of Fit
**Training MSE**. In the regression setting, the most commonly-used measure is the *mean squared error* (MSE), given by

$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{f}(x_i))^2, \qquad (5)$$

where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the $i$th observation.
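
Formula $(5)$ translates directly into code. The function name `mse` and the four observations below are hypothetical and serve as a small worked example.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between observed responses and predictions."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

y_obs = np.array([3.0, 5.0, 7.0, 9.0])   # observed y_i
y_fit = np.array([2.5, 5.5, 7.0, 10.0])  # predictions f_hat(x_i)

# Squared errors are 0.25, 0.25, 0.0, 1.0, so the mean is 1.5 / 4
print(mse(y_obs, y_fit))  # 0.375
```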

**Test MSE**. We want to know whether $\hat{f}(x_0) \approx y_0$, where $(x_0,y_0)$ is a previously unseen test observation not used to train the statistical learning method. If we had a large number of test observations, we could compute

$$\text{Ave}(y_0-\hat{f}(x_0))^2, \qquad (6)$$

the average squared prediction error for these test observations $(x_0,y_0)$.

**Overfitting**. When a given method yields a small training MSE but a large test MSE, we are said to be *overfitting* the data.
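
The gap between training and test MSE can be explored with a small simulation. The sketch below fits polynomials of increasing degree with `numpy.polynomial.Polynomial.fit`; the true function, noise level, sample sizes, and degrees are all assumptions chosen for illustration. For least squares fits of nested polynomial models, the training MSE cannot increase with the degree, whereas the test MSE carries no such guarantee.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)

def f(x):
    # Hypothetical true function, chosen only for illustration
    return np.sin(x)

def sample(n):
    x = rng.uniform(0.0, 6.0, size=n)
    return x, f(x) + rng.normal(scale=0.3, size=n)

x_train, y_train = sample(30)     # small training set
x_test, y_test = sample(1000)     # large set of previously unseen observations

train_mse, test_mse = {}, {}
for degree in (1, 3, 15):         # increasing flexibility
    f_hat = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = np.mean((y_train - f_hat(x_train)) ** 2)
    test_mse[degree] = np.mean((y_test - f_hat(x_test)) ** 2)
    print(f"degree {degree:2d}: "
          f"train MSE {train_mse[degree]:.3f}, test MSE {test_mse[degree]:.3f}")
```

A flexible fit whose training MSE is far below its test MSE is the pattern that the text calls overfitting.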

**Cross-validation**. There is a variety of approaches that can be used to find the flexibility level corresponding to the model with the minimal test MSE. One important method is *cross-validation*, which uses the training data to estimate the test MSE.
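
One common variant is $k$-fold cross-validation, sketched below with NumPy only. The choice of $k = 5$, the polynomial model family, the helper `k_fold_mse`, and the simulated data are assumptions made for illustration; the text does not prescribe a particular variant.

```python
import numpy as np
from numpy.polynomial import Polynomial

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a degree-`degree` polynomial fit by k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_mse = []
    for i in range(k):
        val = folds[i]  # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        f_hat = Polynomial.fit(x[train], y[train], degree)
        fold_mse.append(np.mean((y[val] - f_hat(x[val])) ** 2))
    return float(np.mean(fold_mse))

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 6.0, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)  # hypothetical training data

degrees = range(1, 11)
cv = {d: k_fold_mse(x, y, d) for d in degrees}
best_degree = min(cv, key=cv.get)

print("CV estimate of test MSE by degree:", {d: round(v, 3) for d, v in cv.items()})
print("selected degree:", best_degree)
```

Every observation is used for validation exactly once, so the estimate relies on the training data alone.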

### The Bias-Variance Trade-Off
**Expected test MSE**. The *expected test MSE at $x_0$* equals:

$$E(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon), \qquad (7)$$

where $\text{Var}(\hat{f}(x_0))$ is the *variance* of $\hat{f}(x_0)$, $\text{Bias}(\hat{f}(x_0))$ is the *bias* of $\hat{f}(x_0)$, and $\text{Var}(\epsilon)$ is the variance of the error term $\epsilon$. The overall expected test MSE can be computed by averaging $E(y_0 - \hat{f}(x_0))^2$ over all possible values of $x_0$ in the test set.
- *Variance* $\text{Var}(\hat{f}(x_0))$ refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set.
- *Bias* $\text{Bias}(\hat{f}(x_0))$ refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

**Bias-variance trade-off**. As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. The challenge lies in finding a method for which both the variance and the squared bias are low.
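
Decomposition $(7)$ can be checked by simulation: draw many training sets, refit $\hat{f}$ each time, and compare the three terms with the average squared error at $x_0$. In this sketch the true function, the noise level, the point `x0`, and the polynomial degrees are hypothetical. As a simplifying assumption, the training inputs are held fixed on a grid, so training sets differ only through $\epsilon$.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(6)

def f(x):
    # Hypothetical true function
    return np.sin(x)

sigma, n, x0, n_sims = 0.3, 50, 2.0, 2000
x = np.linspace(0.0, 6.0, n)  # fixed training inputs (simplifying assumption)

def simulate(degree):
    preds = np.empty(n_sims)
    sq_err = np.empty(n_sims)
    for s in range(n_sims):
        y = f(x) + rng.normal(scale=sigma, size=n)   # a new training set
        preds[s] = Polynomial.fit(x, y, degree)(x0)  # f_hat(x0) for this set
        y0 = f(x0) + rng.normal(scale=sigma)         # unseen test response at x0
        sq_err[s] = (y0 - preds[s]) ** 2
    variance = preds.var()                  # Var(f_hat(x0))
    bias_sq = (preds.mean() - f(x0)) ** 2   # [Bias(f_hat(x0))]^2
    return variance, bias_sq, sq_err.mean()

results = {d: simulate(d) for d in (1, 5)}
for d, (variance, bias_sq, test_mse) in results.items():
    print(f"degree {d}: variance {variance:.4f}, bias^2 {bias_sq:.4f}, "
          f"sum with Var(eps) {variance + bias_sq + sigma**2:.4f}, "
          f"simulated test MSE {test_mse:.4f}")
```

Comparing the two degrees shows the trade-off at a single point: the more flexible fit has lower squared bias and higher variance.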


### The Classification Setting


---
![Expected Value](http://www.plantuml.com/plantuml/png/SoWkIImgoKqioU1orOXKq5M8oKWigOwirOmpKh1LS8rEquZGLD1MY4ajACxCoS-3AKYh1Oh7WjN4bEQbf1Ob5IKcfrQ3bQEhgOsFAKcjAAaEIaqfJSvCoacjLT16qGMH3aiigjM0sQC9q-HPL0JNfgCGKrYQcAAWOQp9vP2Qbm9oDG00)