import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

%matplotlib inline

Chapter 1 is mostly introductory, not very technical.

*Supervised* vs *Un-supervised* learning:
* Supervised learning involves building a statistical model for predicting an *output* given one or more *inputs*.
* With un-supervised learning, we still have *inputs* but no supervising output. See *clustering*.

*Regression* vs *Classification*
* Regression involves predicting a *continuous* or *quantitative* value.
* Classification involves predicting a *categorical* or *qualitative* value.

Notation:
* $n$ = number of observations
* $p$ = number of variables
* $x_{ij}$ = the value of the $j$th variable for the $i$th observation, where $i = 1,2,...,n$ and $j = 1,2,...,p$.
* $X$ = $n \times p$ matrix whose $(i,j)$th element is $x_{ij}$
* $y$ = target variable vector
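
As a small illustration of this notation, the sketch below builds a toy data set in NumPy. The numbers are made up for the example; the only point is how $n$, $p$ and a single element of $X$ map onto array shapes and (0-based) indices.

```python
import numpy as np

# Hypothetical toy data: n = 4 observations, p = 3 variables.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])
y = np.array([0.5, 1.5, 2.5, 3.5])  # one target value per observation

n, p = X.shape  # number of observations, number of variables

# The notation is 1-based, NumPy is 0-based: the value of the
# 3rd variable for the 2nd observation lives at X[1, 2].
i, j = 2, 3
x_ij = X[i - 1, j - 1]

print(n, p, x_ij)
```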





#### Chapter 2

* input variable synonyms: predictors, independent variables, features
* output variable synonyms: response, dependent variable, target variable

Given a quantitative response $Y$ and $p$ different predictors $X_1,X_2,...,X_p$, we write the relationship between these predictors and $Y$ as

$Y = f(X)+e$

Here $f$ represents the *systematic* information that $X$ provides about $Y$, and $e$ is the *error term*, which is independent of $X$ and has a mean of zero.
The true function $f$ is generally not known in practice, so statistical learning is ultimately a set of approaches for estimating $f$.

The goal is to find a function $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation $(X,Y)$.

**Prediction vs Inference**

* When a set of inputs $X$ is readily available but the true output $Y$ cannot be easily obtained, we can predict it with $\hat{Y} = \hat{f}(X)$, where $\hat{f}$ is our estimate of $f$. Here we mostly care that $\hat{Y}$ is as accurate as possible, so $\hat{f}$ can be treated as a black box.

*Example*
* Given patient features $X_1,...,X_p$ and a variable $Y$ representing the patient's risk of an adverse reaction to a particular drug, we would want to predict $Y$ using $X$, so that the drug is not given to patients at high risk of an adverse reaction.

**Reducible vs irreducible errors**

* Even if we could estimate the actual function $f$ perfectly, such that $\hat{Y} = f(X)$, we would still be left with an error, as the actual $Y$ is a function of both $f(X)$ and the error term $e$. This is the irreducible error. It is independent of $X$, so no choice of $\hat{f}$ can reduce it.
* The reducible error is the part caused by $\hat{f}$ being an imperfect estimate of $f$. It can be reduced by using the most appropriate statistical learning technique.

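Consider an estimate $\hat{f}$ and a set of predictors $X$, yielding the prediction $\hat{Y} = \hat{f}(X)$. Assuming for now that $\hat{f}$ and $X$ are both fixed, the expected squared difference between $Y$ and $\hat{Y}$ is

$E(Y-\hat{Y})^2 = E[f(X)+e - \hat{f}(X)]^2$

$= [f(X)-\hat{f}(X)]^2 + 2[f(X)-\hat{f}(X)]E(e) + E(e^2)$

$= [f(X)-\hat{f}(X)]^2 + Var(e)$

The cross term vanishes because $E(e) = 0$, which also gives $E(e^2) = Var(e)$. The first term, $[f(X)-\hat{f}(X)]^2$, is the *reducible* error. The second, $Var(e)$, the variance of the error term $e$, is the *irreducible* error.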
    
The irreducible error puts an upper bound on how accurate any model can be: the expected squared prediction error can never fall below $Var(e)$.
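
The split into a reducible and an irreducible part can be checked numerically. The sketch below is only an illustration: the "true" function `f`, the imperfect estimate `f_hat`, the fixed value of `x` and the noise level `sigma` are all assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed "true" function, chosen only for this example.
    return 2.0 + 3.0 * x

def f_hat(x):
    # A deliberately imperfect, fixed estimate of f.
    return 2.5 + 2.8 * x

x = 1.0        # fixed predictor value
sigma = 0.5    # standard deviation of the error term e
e = rng.normal(0.0, sigma, size=200_000)

y = f(x) + e       # simulated responses at x
y_hat = f_hat(x)   # the (fixed) prediction at x

# Monte Carlo estimate of E(Y - Y_hat)^2
expected_sq_error = np.mean((y - y_hat) ** 2)

reducible = (f(x) - f_hat(x)) ** 2   # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2             # Var(e)

print(expected_sq_error, reducible + irreducible)
```

The two printed numbers should agree up to simulation noise. Improving `f_hat` shrinks only the reducible part; the `sigma ** 2` term stays.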
       
**Inference**
* Understanding how $X$ influences $Y$. 
* $\hat{f}$ cannot be treated as a black box here; interpretability is important.
* Understanding which variables in $X$ have the most impact on $Y$, and whether their relationship is positive or negative.
* Can we model it as a linear relationship? Much easier for interpretation but often not realistic.

**Parametric vs non-parametric**

*Parametric methods*
Parametric methods can be broken into 2 steps:
1. We make an assumption about the functional form of $f$, for example that $f$ is linear in $X$:
$f(X) = \beta_0+\beta_1X_1+\beta_2X_2+...+\beta_pX_p$

2. We *fit* (or *train*) our model using the training data. Using the linear model from above as an example, we want to find values of the parameters $\beta_0,\beta_1,...,\beta_p$ such that
$Y \approx \beta_0+\beta_1X_1+\beta_2X_2+...+\beta_pX_p$
The most common approach to fitting this model is *ordinary least squares*.

Pros:
* Reduces the possibility space, greatly simplifying the problem.
* More interpretable.

Cons:
* The assumed form is unlikely to match the true function $f$ exactly, and if it is too far from the truth the estimate will be poor. If the true $f$ is highly non-linear, we might want to use a more flexible model.
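
The two parametric steps can be sketched with NumPy's least-squares solver. This is a minimal example on simulated data; the coefficients in `true_beta`, the noise level and the sample size are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data from an assumed linear f with p = 2 predictors.
n = 500
X = rng.normal(size=(n, 2))
true_beta = np.array([1.0, 2.0, -3.0])   # assumed beta_0, beta_1, beta_2
noise = rng.normal(0.0, 0.1, size=n)
y = true_beta[0] + X @ true_beta[1:] + noise

# Step 1: assume f is linear in X. A column of ones carries beta_0.
A = np.column_stack([np.ones(n), X])

# Step 2: fit by ordinary least squares, i.e. choose the coefficients
# that minimise the sum of squared residuals ||y - A @ beta||^2.
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

y_fit = A @ beta_hat   # fitted values on the training data
print(beta_hat)
```

Estimating $f$ has been reduced to estimating three numbers, which is what makes the problem simpler and the result easy to interpret.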

*Non-parametric methods*
* Makes no assumption about the functional form of $f$.
* Seeks an estimate $\hat{f}$ that gets as close to the data points as possible without being too rough or wiggly.
* As such, can fit a wider range of shapes for $f$.

Pros:
* Flexible, can fit a wide range of shapes for $f$.

Cons:
* As we don't reduce the size of the problem at all, we require many more observations to accurately estimate $f$ than a parametric method would.
* Possible to overfit - matching the training data perfectly but not generalizing well to test or new data.
* Less interpretable.
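
One simple non-parametric method is K-nearest-neighbours regression, which predicts by averaging the responses of the `k` closest training points. It is not named in the notes above and is used here only as an illustration; the function `knn_predict`, the sine-shaped true function and the noise level are assumptions made for this sketch.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=5):
    """Predict at each x_new as the mean response of its k nearest training points (1-D inputs)."""
    # Pairwise distances, shape (len(x_new), len(x_train)).
    dists = np.abs(x_new[:, None] - x_train[None, :])
    # Indices of the k closest training points for every new point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2 * np.pi, size=300)
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=300)  # assumed non-linear f plus noise

x_new = np.linspace(0.5, 5.5, 20)
y_pred = knn_predict(x_train, y_train, x_new, k=10)
max_abs_err = np.max(np.abs(y_pred - np.sin(x_new)))

# With k = 1 every training point is its own nearest neighbour, so the
# training data is reproduced exactly, noise included: a picture of overfitting.
train_fit_k1 = knn_predict(x_train, y_train, x_train, k=1)
```

No functional form for $f$ was assumed, yet the sine shape is recovered. The price is that the fit relies entirely on having enough nearby observations.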

**Supervised vs Unsupervised**

*Supervised*
* For every observation of our predictor measurements $x_i, i = 1,...,n$ we have an associated response measurement $y_i$.

*Unsupervised*
* No response variable, simply a set of predictor measurements $x_i, i = 1,...,n$.
* Example: clustering, such as with $K$*-means*.
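
To make the clustering example concrete, here is a minimal sketch of the standard K-means iteration (Lloyd's algorithm) on two simulated, well-separated blobs. The helper `kmeans`, the blob locations and the seeds are assumptions for this illustration, not something taken from the notes. Note that no response $y_i$ is used anywhere.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm: alternate assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    # Initialise the centers at k distinct, randomly chosen observations.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()

    def assign(current_centers):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - current_centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:          # keep the old center if a cluster is empty
                new_centers[c] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return assign(centers), centers

rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(50, 2))
X = np.vstack([blob_a, blob_b])           # predictors only, no response

labels, centers = kmeans(X, k=2)
cluster_sizes = np.bincount(labels, minlength=2)
print(cluster_sizes)
```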

**Classification vs Regression**
Variables can be either *quantitative* or *qualitative*. A quantitative variable is a numerical variable (discrete or continuous), while a qualitative variable is a categorical one.
* Ex: age is a quantitative variable, while gender would be a qualitative variable.

Generally speaking, a problem with a quantitative response variable is a regression problem, while one with a qualitative response is a classification problem. The distinction is not always crisp, however: logistic regression is typically used for classification, but because it estimates class probabilities it can also be viewed as a regression method, as the name implies.

The nature of the response variable is a key factor in picking which method to use, but the nature of the features (predictor variables) matters less, generally.

**Model Accuracy**

*Mean Squared Error*
The most common metric for measuring goodness of fit in the regression setting is the MSE, defined as
$MSE = \frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2$
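
As a quick check of the formula, a small helper (the name `mse` and the numbers are only for illustration) computes it directly:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_obs = [3.0, 5.0, 7.0]   # made-up observed responses
y_hat = [2.0, 5.0, 9.0]   # made-up predictions

# Residuals are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3.
example_mse = mse(y_obs, y_hat)
print(example_mse)
```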

* The goal is not to minimize the training-set MSE (doing so would just lead to overfitting).
Rather, we want to minimize the test-set MSE, which measures our model's predictive power on data not seen during training:

$Ave(y_o-\hat{f}(x_o))^2$, where $(x_o,y_o)$ is a test observation previously unseen by our model.

As flexibility of a method increases, training set MSE will decrease, but this does not guarantee that test set MSE will. A big discrepancy in the form of low training set MSE with a high test set MSE is a symptom of overfitting.
Note that we are only said to be overfit if a less flexible model would have yielded a better result than the more flexible one.
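
The gap between training and test MSE can be explored with a small simulation. The sketch below fits polynomials of increasing degree (a stand-in for increasing flexibility) to one simulated training set and evaluates each on a separate test set. The true function, noise level, sample sizes and degrees are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(3 * x)   # assumed true function

def sample(n):
    # Draw n observations (x, y) with y = f(x) + e.
    x = rng.uniform(-1, 1, size=n)
    return x, f(x) + rng.normal(0, 0.3, size=n)

x_train, y_train = sample(30)
x_test, y_test = sample(1000)   # data never used for fitting

degrees = [1, 3, 9]             # increasing flexibility
train_mse, test_mse = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, deg=d)   # least-squares polynomial fit
    train_mse.append(np.mean((y_train - np.polyval(coefs, x_train)) ** 2))
    test_mse.append(np.mean((y_test - np.polyval(coefs, x_test)) ** 2))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d}: train MSE = {tr:.3f}, test MSE = {te:.3f}")
```

Because each polynomial family contains the previous one, the training MSE cannot go up as the degree increases. The test MSE carries no such guarantee, which is exactly why it is the quantity to watch.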

**The Bias-Variance trade-off**

The expected test MSE for a given value $x_o$ can be decomposed into the sum of three terms:

* The variance of $\hat{f}(x_o)$
* The squared bias of $\hat{f}(x_o)$
* The variance of the error term $e$

or
$E(y_o-\hat{f}(x_o))^2 = Var(\hat{f}(x_o)) + [Bias(\hat{f}(x_o))]^2 + Var(e)$

Overall expected test MSE can be computed by averaging $E(y_o-\hat{f}(x_o))^2$ over all possible values $x_o$ in the test set.

Thus we want to select a method that simultaneously achieves low variance and low bias.
Neither the variance nor the squared bias can ever be negative, so the expected test MSE is bounded below by $Var(e)$, the *irreducible error*.

Variance here is a measure of how much $\hat{f}$ would change if we estimated it using a different training set. A very flexible method generally has higher variance than a less flexible one, and would vary more between training sets.

Bias, on the other hand, is the error introduced by approximating a possibly complicated real-life relationship with a much simpler model (for example, the assumption of a linear relationship in linear regression).

Generally, more flexible = lower bias and higher variance; less flexible = higher bias and lower variance.
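
The decomposition can be estimated by simulation: refit the same method on many training sets and look at how its prediction at a single point $x_o$ behaves. The sketch below compares a rigid fit (degree 1) with a more flexible one (degree 5). The true function, noise level, fixed training inputs, choice of $x_o$ and number of repetitions are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(3 * x)            # assumed true function

sigma = 0.3                         # standard deviation of the error term e
x_train = np.linspace(-1, 1, 30)    # training inputs, kept fixed across repetitions
x_o = 0.5                           # the point at which we study the prediction
n_reps = 2000                       # number of simulated training sets

def decompose(degree):
    """Estimate variance, squared bias and expected test MSE of a polynomial fit at x_o."""
    preds = np.empty(n_reps)
    for r in range(n_reps):
        y_train = f(x_train) + rng.normal(0, sigma, size=x_train.size)
        coefs = np.polyfit(x_train, y_train, deg=degree)
        preds[r] = np.polyval(coefs, x_o)
    y_o = f(x_o) + rng.normal(0, sigma, size=n_reps)   # fresh test responses at x_o
    variance = preds.var()                             # Var(f_hat(x_o))
    bias_sq = (preds.mean() - f(x_o)) ** 2             # [Bias(f_hat(x_o))]^2
    test_mse = np.mean((y_o - preds) ** 2)             # E(y_o - f_hat(x_o))^2
    return variance, bias_sq, test_mse

var_rigid, bias_sq_rigid, mse_rigid = decompose(degree=1)
var_flex, bias_sq_flex, mse_flex = decompose(degree=5)

for name, v, b, m in [("degree 1", var_rigid, bias_sq_rigid, mse_rigid),
                      ("degree 5", var_flex, bias_sq_flex, mse_flex)]:
    print(f"{name}: variance={v:.4f}, bias^2={b:.4f}, "
          f"sum+Var(e)={v + b + sigma**2:.4f}, test MSE={m:.4f}")
```

In this setup the straight line is stable across training sets but systematically wrong at $x_o$, while the degree-5 fit tracks the curve at the cost of more variability. In both cases the three terms should add up to the simulated test MSE, up to simulation noise.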