# Introduction

## A Brief History of Statistical Learning
- At the beginning of the 19th century, Legendre and Gauss published papers on the *method of least squares*, the earliest form of *linear regression*.
- In 1936 Fisher proposed *Linear Discriminant Analysis*. In the 1940s various authors put forward an alternative approach, *Logistic Regression*.
- In the early 1970s Nelder and Wedderburn coined the term *generalised linear models* for an entire class of statistical learning methods that include both linear and logistic regression as special cases.
- By the end of the 1970s many more techniques had been developed, but almost all of them were linear because non-linear methods were computationally expensive.
- By the 1980s the necessary processing power was more readily available, and Breiman, Friedman, Olshen and Stone introduced *classification and regression trees* and also used *cross-validation* techniques for model selection.
- Hastie and Tibshirani coined the term *generalised additive models* in 1986 for a class of non-linear extensions to generalised linear models. 
- Since then statistical learning (aka machine learning) has blossomed into an active and productive subject area.

## What is Statistical Learning?

### Definition
We take a set of inputs (also known as *predictors*, *independent variables*, *features* and sometimes just *variables*) and use various methods to produce outputs (also known as *response* or *dependent variable*).

### Estimating $f$
More generally, suppose there are $p$ different predictors $X_1, X_2, ..., X_p$ and we observe a quantitative response $Y$. Our job is to find (based on the assumption, which can be challenged, that it exists) a relationship between $Y$ and $X = (X_1, X_2, ..., X_p)$ which can be written in a general form as

$$ Y = f(X) + \epsilon$$

Here $f$ is an unknown (to be discovered) function of the input variables $X$ and $\epsilon$ is a random error term. $f$ represents the systematic information that $X$ provides about $Y$. The error term is assumed to be independent of $X$ and to have mean zero.
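
To make the model concrete, here is a minimal simulation sketch. It assumes a made-up linear `true_f` and normally distributed errors; neither comes from these notes, and in a real problem $f$ is exactly the thing we do not know.

```python
import random
import statistics


def true_f(x):
    """Hypothetical systematic component f; chosen only for illustration."""
    return 2.0 + 3.0 * x


def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps, with eps ~ N(0, noise_sd^2)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    ys = [true_f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps


xs, ys, eps = simulate(n=5000)

# The error term averages out to roughly zero over many observations.
print("sample mean of eps:", statistics.fmean(eps))
```

With real data we only ever see `xs` and `ys`; the simulation exposes `eps` purely so that its sample mean can be inspected.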

### Why Estimate $f$?
Two key reasons: *prediction* and *inference*.

#### Prediction
We attempt to predict $Y$ using a function which we *learn* from the data:

$$ \hat{Y} = \hat{f}(X)$$

In this situation $\hat{f}$ is often treated as a black box.

In general $\hat{f}$ will not be a perfect estimate of $f$, so $\hat{Y}$ will not be a perfect prediction of $Y$. The accuracy of $\hat{Y}$ depends on two quantities:
- Reducible error: the error that arises because $\hat{f}$ is not a perfect estimate of $f$. We can reduce it, for example by using a better statistical learning method.
- Irreducible error: the error that comes from $\epsilon$, which cannot be predicted from $X$. It has many sources: variables that affect $Y$ but were not measured or are not available in the data, and unmeasurable variation (e.g. effects that depend on the time of day or other uncollected inputs, or just inherent randomness).

For a fixed $\hat{f}$ and $X$, the expected value of the squared difference between the actual and the predicted response splits into these two parts:
$$\begin{align*}
E(Y-\hat{Y})^2 &= E\left[ f(X) + \epsilon - \hat{f}(X) \right]^2\\
&= \underbrace{\left[ f(X) - \hat{f}(X)\right]^2}_{\text{Reducible}} + \underbrace{\text{Var}(\epsilon)}_{\text{Irreducible}}
\end{align*}$$

Irreducible error will always place an upper bound on the accuracy of our estimates.
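
The split into reducible and irreducible error can be checked numerically. The sketch below assumes a made-up `true_f`, a deliberately imperfect `f_hat` and normal errors (all hypothetical), and estimates $E(Y-\hat{Y})^2$ at one fixed point `x0` by Monte Carlo.

```python
import random
import statistics


def true_f(x):
    return 2.0 + 3.0 * x   # hypothetical true f


def f_hat(x):
    return 3.0 + 3.0 * x   # hypothetical, deliberately imperfect estimate


x0 = 4.0
noise_sd = 1.5
rng = random.Random(1)

# Monte Carlo estimate of E(Y - Y_hat)^2 at the fixed point x0.
squared_errors = []
for _ in range(200_000):
    y = true_f(x0) + rng.gauss(0.0, noise_sd)
    squared_errors.append((y - f_hat(x0)) ** 2)

simulated = statistics.fmean(squared_errors)
reducible = (true_f(x0) - f_hat(x0)) ** 2   # squared gap between f and f_hat
irreducible = noise_sd ** 2                 # Var(eps)

print("simulated E(Y - Y_hat)^2:", simulated)
print("reducible + irreducible: ", reducible + irreducible)
```

Improving `f_hat` can shrink `reducible` towards zero, but no choice of `f_hat` touches `irreducible`.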

#### Inference
Rather than a predictive black box, this is about understanding the relationship between the input variables and the response. 
- Which predictors are associated with the response, and which are not?
- What is the relationship between each predictor and the response (for example, is it positive or inverse)? Are some predictors more influential than others?
- Is the relationship linear or non-linear?

Real-world inferential questions include "What effect will changing the price of an item have on its sales?" and "How much extra will a house be worth if it has a view of the river?". Highly non-linear models may be great for prediction but are harder to interpret (see the later section on *The Trade-Off between Prediction Accuracy and Model Interpretability*).

### How we Estimate $f$
We have a set of *training data*, so called because the observations it contains are used to "train" our model to make accurate predictions. Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function $f$, finding an estimate $\hat{f}$ such that $Y \approx \hat{f}(X)$. Broadly, most methods for doing so can be categorised as either *parametric* or *non-parametric*.

#### Parametric Methods
Parametric methods involve a two-step approach:
1. Make an assumption about the *functional form*, or shape, of $f$. For example, we could assume it is linear:
$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p$$
2. After a model has been selected, we need a procedure that uses the training data to "fit" or "train" the model. We have **reduced our problem to estimating the parameters that come with our model** (here $\beta_0, \beta_1, ..., \beta_p$).

It is generally much easier to estimate the parameters of a particular model than to fit an entirely arbitrary $f$.
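
As an illustration of the two-step approach, the sketch below assumes a linear form with a single predictor and estimates its two parameters by ordinary least squares. The data are simulated from a made-up line, and `fit_simple_linear` is a hypothetical helper name rather than a library function.

```python
import random
import statistics


def fit_simple_linear(xs, ys):
    """Least-squares estimates of beta0 and beta1 in f(x) = beta0 + beta1 * x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Step 1: assume f is linear. Step 2: estimate the parameters from training data.
rng = random.Random(2)
xs = [rng.uniform(0.0, 10.0) for _ in range(500)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]   # hypothetical truth: 1 + 2x

beta0_hat, beta1_hat = fit_simple_linear(xs, ys)
print("beta0_hat:", beta0_hat, "beta1_hat:", beta1_hat)
```

The whole of $\hat{f}$ is now summarised by two numbers, which is what makes the parametric route comparatively easy.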

We have to be careful that we do not *overfit* the data, which means that our estimate of $f$ follows the noise in the training data set too closely.

#### Non-parametric Methods
No explicit assumptions about the functional form of $f$ are made. By avoiding assumptions about the shape of $f$, non-parametric methods have the potential to fit the data more accurately. However, the problem is more complicated because we have not reduced it to estimating a small number of parameters, so a large amount of training data is often required.

An example would be a *thin-plate spline*.
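
A thin-plate spline needs a numerical library, so as a simpler stand-in the sketch below uses a different non-parametric method, $K$-nearest-neighbour averaging. The sine-shaped truth and the noise level are assumptions made only for this example. Note that no functional form for $f$ is specified anywhere in the code.

```python
import math
import random


def knn_predict(x0, xs, ys, k=5):
    """Predict at x0 by averaging the responses of the k closest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


rng = random.Random(3)
xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(400)]
ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]   # hypothetical non-linear truth

pred = knn_predict(math.pi / 2, xs, ys, k=15)
print("prediction at pi/2:", pred, "(true value is 1.0)")
```

The prediction is driven entirely by nearby observations, which is why such methods tend to need many of them.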

### The Trade-Off between Prediction Accuracy and Model Interpretability
Linear regression is an inflexible approach meaning only a small range of shapes are available to fit $f$. Other methods are flexible in that they can fit a large range of shapes. 

Inference is helped by restrictive inflexible models as the relationship is more readily understood. Flexible models can be more helpful in "black-box" prediction scenarios (with the awareness that we can actually obtain less accurate predictions with this approach because of overfitting).

### Supervised vs. Unsupervised Learning
Supervised learning is when we have a set of inputs and associated outputs with which we can train our model.

Unsupervised learning is where we have only inputs and no supervising outputs. Our aim here is to discover relationships within the data using techniques like *clustering*. For example, Google News grouping similar news articles together is an instance of unsupervised learning (clustering).
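
As a small illustration of clustering, here is a minimal one-dimensional $k$-means sketch with two clusters. The data are simulated from two made-up groups, and `two_means_1d` is a hypothetical helper written for this example; the algorithm never sees which group an observation came from.

```python
import random
import statistics


def two_means_1d(values, iterations=20):
    """Minimal 1-D k-means with k=2; the centres start at the min and max."""
    centres = [min(values), max(values)]
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Assign each value to its closest centre.
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [statistics.fmean(g) if g else c for g, c in zip(groups, centres)]
    return centres


rng = random.Random(4)
values = [rng.gauss(0.0, 1.0) for _ in range(200)]
values += [rng.gauss(10.0, 1.0) for _ in range(200)]
rng.shuffle(values)   # no labels are kept: only the inputs are available

centres = two_means_1d(values)
print("estimated cluster centres:", centres)
```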

Sometimes the distinction between supervised and unsupervised learning is clear-cut; at other times it is less clear. For example, suppose that responses are expensive to obtain, so we have $n$ observations but only $m$ of them (with $m<n$) have an associated response. This would be a *semi-supervised* learning task. Such tasks are not discussed in this set of notes.

### Regression vs. Classification Problems
Variables can be categorised as *quantitative* or *qualitative* (aka *categorical*). Quantitative variables take on numerical values, e.g. height, weight or stock price. Qualitative variables take on values in one of $k$ different *classes*, or categories. Examples of categorical variables include gender, product brand and cancer type.

- Regression problems involve the prediction of a quantitative response. Techniques include linear regression.
- Classification problems involve predicting what category we should place an observation in. Techniques include logistic regression where we estimate class probabilities. 

Some techniques ($K$-nearest neighbours and boosting, for example) can be used for either regression or classification problems. We tend to select a technique based on whether the response is quantitative or qualitative; the type of the predictors is generally considered less important. With proper encoding of the data, most statistical learning methods can be applied regardless of whether the predictors are qualitative or quantitative.
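
One common way of encoding a qualitative predictor is with 0/1 indicator columns, one per class. The sketch below is a minimal version; `one_hot` is a hypothetical helper and the brand names are made up.

```python
def one_hot(values):
    """Encode a qualitative variable as 0/1 indicator columns, one per class."""
    classes = sorted(set(values))
    rows = [[1 if v == c else 0 for c in classes] for v in values]
    return classes, rows


brands = ["acme", "globex", "acme", "initech"]   # hypothetical product brands
classes, encoded = one_hot(brands)

print(classes)
for row in encoded:
    print(row)
```

When the encoded columns feed a linear model with an intercept, one of the indicator columns is usually dropped so that the columns are not perfectly collinear.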

## Assessing Model Accuracy
How do we select the best model? What metrics can we use to determine the "best fit"?

### Measuring Quality of Fit
Mean Squared Error (MSE) is used in regression problems to assess quality of fit:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} {(y_i - \hat{f}(x_i))^2}$$

We are most interested in the MSE of our model on new data that it was not trained on. We want the model for which the *average squared prediction error* on such data, the *test MSE*, is as small as possible, rather than the model with the lowest *training MSE*.
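
The MSE formula translates directly into code. In the sketch below the held-out responses and predictions are made-up numbers, and `mean_squared_error` is a hypothetical helper name.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted responses."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical held-out responses and the model's predictions for them.
y_test = [3.0, 5.0, 7.0]
y_test_pred = [2.0, 5.0, 9.0]

test_mse = mean_squared_error(y_test, y_test_pred)   # (1 + 0 + 4) / 3
print("test MSE:", test_mse)
```

Calling the same function on the training responses and their fitted values gives the training MSE.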

#### The Bias-Variance Trade-Off (Test MSE vs. Training MSE)
What we find is that as we increase model flexibility to drive the training MSE down, the test MSE typically follows a U shape: it first falls and then rises again.
<img src="files/biasvariance.png">
This comes from the fact that the expected test MSE can always be decomposed into the sum of three fundamental quantities
- *Variance* of $\hat{f}(x_0)$
- Squared *bias* of $\hat{f}(x_0)$
- The variance of the error terms $\epsilon$
$$E \left(y_0 - \hat{f}(x_0) \right)^2 = \text{Var}\left(\hat{f}(x_0)\right) + \left[\text{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \text{Var}(\epsilon)$$

This equation tells us that to minimise the expected test MSE we need a statistical learning method that simultaneously achieves *low variance* and *low bias*. Variance is inherently nonnegative, as is the squared bias. This means the expected test MSE can never fall below $\text{Var}(\epsilon)$, the irreducible error. It provides a lower bound.
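
The decomposition can be checked by simulation. The sketch below assumes a made-up quadratic `true_f`, normal errors and a straight-line fit (so the model is biased by construction). It refits the line on many simulated training sets and compares the simulated expected test MSE at a point `x0` with the sum of the three terms.

```python
import random
import statistics


def true_f(x):
    return x ** 2   # hypothetical non-linear truth


def fit_line(xs, ys):
    """Least-squares intercept and slope for a straight-line model."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(5)
x0, noise_sd, n_train, n_reps = 1.5, 1.0, 30, 5000

predictions, squared_errors = [], []
for _ in range(n_reps):
    # A fresh training set each time gives a different f_hat.
    xs = [rng.uniform(0.0, 3.0) for _ in range(n_train)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    intercept, slope = fit_line(xs, ys)
    pred = intercept + slope * x0
    y0 = true_f(x0) + rng.gauss(0.0, noise_sd)   # new test response at x0
    predictions.append(pred)
    squared_errors.append((y0 - pred) ** 2)

variance = statistics.pvariance(predictions)
bias_sq = (statistics.fmean(predictions) - true_f(x0)) ** 2
expected_test_mse = statistics.fmean(squared_errors)
decomposed = variance + bias_sq + noise_sd ** 2

print("variance:", variance, "squared bias:", bias_sq)
print("simulated expected test MSE:", expected_test_mse)
print("variance + bias^2 + Var(eps):", decomposed)
```

Because a straight line cannot follow a quadratic, most of the reducible error in this particular setup comes from bias rather than variance.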

What do we mean by *variance* and *bias*:
- Variance: the amount by which $\hat{f}$ would change if we estimated it using a different training data set. Different training data sets give rise to different $\hat{f}$s. Ideally this variance should be low. For a high-variance method, even small changes in the training data set can result in large changes in $\hat{f}$.
- Bias: the error introduced by approximating a real-world problem, which can often be extremely complicated, with a necessarily simpler model (say a linear one).

We can perceive some general rules here:
- Flexible models have high variance (they follow the variation in the training data closely) but low bias.
- Inflexible models have high bias (they adapt less to the data) but low variance (small changes in the training data do not affect $\hat{f}$ too much).

<img src="files/biasvariancedecomp.png">
Here we see three data sets, with the squared bias, variance and associated test MSE plotted as we increase the flexibility of our model. The dashed dotted line represents $\text{Var}(\epsilon)$. As we increase flexibility, variance increases and bias decreases. The U-shape arises because initially the bias decreases faster than the variance increases, so the test MSE falls; eventually the model becomes so sensitive to variations in the training data (high variance) that the increase in variance overwhelms the bias reduction and the test MSE rises again.

This describes the bias-variance trade-off. It is easy to find a learning method with low bias and high variance, or with low variance and high bias; the trick is to balance these two competing tendencies and fit the model with the lowest test MSE. This trade-off is a recurring theme in statistical learning, so it is important to understand it.
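
The trade-off can be seen in a small experiment. The sketch below uses $K$-nearest-neighbour averaging as a convenient method whose flexibility is controlled by one number (small $K$ is flexible, large $K$ is inflexible). The sine-shaped truth, noise level and sample sizes are assumptions made only for this example.

```python
import math
import random


def make_data(n, rng, noise_sd=0.3):
    """Simulate n observations from a hypothetical truth f(x) = sin(x) plus noise."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def knn_mse(xs_eval, ys_eval, xs_train, ys_train, k):
    """MSE of the k-NN fit (trained on the training set) on an evaluation set."""
    errors = [(y - knn_predict(x, xs_train, ys_train, k)) ** 2
              for x, y in zip(xs_eval, ys_eval)]
    return sum(errors) / len(errors)


rng = random.Random(6)
x_train, y_train = make_data(100, rng)
x_test, y_test = make_data(300, rng)

results = {}
for k in (1, 3, 10, 30, 100):   # from most flexible to least flexible
    train_mse = knn_mse(x_train, y_train, x_train, y_train, k)
    test_mse = knn_mse(x_test, y_test, x_train, y_train, k)
    results[k] = (train_mse, test_mse)
    print(f"k={k:3d}  training MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

best_k = min(results, key=lambda k: results[k][1])
print("k with the lowest test MSE:", best_k)
```

With $K=1$ every training point is its own nearest neighbour, so the training MSE is zero even though the test MSE is not; with $K$ equal to the training set size the fit is a constant. The lowest test MSE sits somewhere in between.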

#### The Classification Setting

##### The Bayes Classifier

##### K-Nearest Neighbours

## Python Basics

### Loading Data

### Matrices

### Graphics

### Statistical Summaries