\section{Introduction}

\subsection{Concepts}

Statistical learning: tools for understanding data

Supervised learning: predicting or estimating an output based on input

Unsupervised learning: input but no output

Continuous or quantitative output value: regression problem

Categorical or qualitative output: classification problem

\subsection{History}
1800s: Legendre and Gauss on the method of least squares

1940s: logistic regression

1970s: generalized linear models

1986: generalized additive models, non-linear extensions to generalized linear models

\subsection{Notation}
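Matrix of $n$ observations on $p$ variables, where $x_{ij}$ is the value of the $j$-th variable for the $i$-th observation: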
\[
\textbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1p} \\
x_{21} & x_{22} & \dots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{np} \\
\end{bmatrix}
\]

\section{Statistical learning}
Input variables: predictors, independent variables, features, variables

Output variables: response, dependent variable

Statistical model:
\[
Y = f(X) + \epsilon
\]
Systematic term and random term

Statistical learning: set of approaches for estimating f

\subsection{Why estimate f?}
Prediction and inference

\subsubsection{Prediction}
\[
\hat{Y} = \hat{f}(X)
\]
$\hat{f}$ is the estimate for $f$, and $\hat{Y}$ is the prediction for $Y$. 

Reducible error: the error in $\hat{f}$ as an estimate of $f$ (the systematic part)

Irreducible error: the error term $\epsilon$

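\[
\begin{aligned}
E(Y - \hat{Y})^2 &= E[f(X)+\epsilon - \hat{f}(X)]^2 \\
&= \underbrace{[f(X) - \hat{f}(X)]^2}_{\textrm{reducible}} + \underbrace{\textrm{Var}(\epsilon)}_{\textrm{irreducible}}
\end{aligned}
\]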

\subsubsection{Inference}
Which predictors are associated with the response?

What is the relationship between the response and each predictor?

Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

\subsection{How do we estimate f?}
Parametric or non-parametric

\subsubsection{Parametric methods}
Parametric methods involve a two-step model-based approach

1. Functional form or shape of f.
Linear model
\[
f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p
\]

2. Fit the model to the training data
Ordinary least squares
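
The two steps above can be sketched with NumPy. This is only an illustration on simulated data: the coefficients in \texttt{beta\_true}, the noise level, the seed, and all variable names are assumptions made for the example.

\begin{minted}[breaklines]{python}
import numpy as np

rng = np.random.default_rng(0)           # fixed seed (assumed) for reproducibility
n = 200
X = rng.normal(size=(n, 2))              # two simulated predictors X_1, X_2
beta_true = np.array([1.0, 2.0, -3.0])   # assumed beta_0, beta_1, beta_2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Step 1: assume a linear form, so add a column of ones for the intercept.
X_design = np.column_stack([np.ones(n), X])

# Step 2: fit by ordinary least squares (minimize the residual sum of squares).
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)
\end{minted}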

Fitting more flexible models may lead to overfitting the data

\subsubsection{Non-parametric methods}
No assumption about the form of f is made

Thin-plate spline

\subsection{Trade-off between prediction accuracy and model interpretability}

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{Pictures/MLflexibility.PNG}
\end{figure}

Restrictive models are much more interpretable

\subsection{Measuring the quality of fit}
Mean squared error (MSE)
\[
MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{f}(x_i))^2
\]
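As a small sketch of the formula, the MSE can be computed directly with NumPy. The helper name \texttt{mse} and the numbers are made up for illustration.

\begin{minted}[breaklines]{python}
import numpy as np

def mse(y, y_hat):
    """Mean squared error between observed and predicted values."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

y_obs = [3.0, -0.5, 2.0, 7.0]   # made-up responses
y_fit = [2.5, 0.0, 2.0, 8.0]    # made-up predictions
print(mse(y_obs, y_fit))        # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
\end{minted}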

We are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data

Test data: calculate test MSE and evaluate the models

No test observations available? 

Test MSE: U-shape property

Cross-validation: method to estimate the test MSE using the training data
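
A minimal sketch of $k$-fold cross-validation, assuming polynomial fits of different flexibility on simulated data. The true function (a sine), the noise level, the degrees compared, and the helper name \texttt{kfold\_mse} are all assumptions for illustration.

\begin{minted}[breaklines]{python}
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a polynomial fit by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_mse = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        coefs = np.polyfit(x[train], y[train], degree)   # fit on the other k-1 folds
        pred = np.polyval(coefs, x[held_out])            # predict the held-out fold
        fold_mse.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(fold_mse))

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=100)
y = np.sin(x) + rng.normal(scale=0.3, size=100)   # assumed true f is sin
cv_mse = {d: kfold_mse(x, y, d) for d in (1, 3, 9)}
print(cv_mse)
\end{minted}

Comparing the estimates across degrees is one way to look for the U-shape in the test MSE without a separate test set.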

\subsection{The bias-variance trade-off}
The expected test MSE at a given $x_0$ can always be decomposed into the sum of three fundamental quantities: the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error terms $\epsilon$.
\[
E\bigg(y_0 - \hat{f}(x_0) \bigg)^2 = \textrm{Var}(\hat{f}(x_0)) + [\textrm{Bias}(\hat{f}(x_0))]^2 + \textrm{Var}(\epsilon)
\]

The overall expected test MSE is obtained by averaging over all possible values of $x_0$ in the test set.

Minimizing the expected test error requires simultaneously achieving low variance and low bias.

Variance: the amount by which $\hat{f}$ would change if we estimated it using a different training data set

Bias: error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. More flexible methods result in less bias. 

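Bias-variance trade-off: it is easy to obtain low variance with high bias, or low bias with high variance; the challenge is to find a method for which both are low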
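
The decomposition can be explored by simulation: refit the model on many fresh training sets and look at the spread and the average of the predictions at one point $x_0$. The sketch below assumes a sine as the true $f$, a fixed noise level, and polynomial fits of degree 1 and 5; none of these choices come from the notes.

\begin{minted}[breaklines]{python}
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                  # assumed true f
x0, sigma, n, reps = 1.0, 0.3, 50, 500      # assumed settings

def fit_and_predict(degree):
    """Fit a polynomial on a fresh training set and predict at x0."""
    x = rng.uniform(-3, 3, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    return np.polyval(np.polyfit(x, y, degree), x0)

results = {}
for degree in (1, 5):
    preds = np.array([fit_and_predict(degree) for _ in range(reps)])
    variance = float(preds.var())                   # Var(f_hat(x0))
    bias_sq = float((preds.mean() - f(x0)) ** 2)    # squared bias at x0
    results[degree] = (variance, bias_sq)
    print(degree, variance, bias_sq, variance + bias_sq + sigma ** 2)
\end{minted}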

\subsection{The classification setting}
Estimation, the training error rate: the proportion of mistakes that are made 
if we apply our estimate $\hat{f}$ to the training observations:
\[
\frac{1}{n}\sum_{i=1}^n I(y_i \neq \hat{y}_i)
\]
$\hat{y}_i$ is the predicted class label for the $i$-th observation using $\hat{f}$.
Indicator variable, $I(y_i \neq \hat{y}_i) = 1$ if true and $0$ otherwise. 

The test error rate is
\[
\textrm{Ave}(I(y_0 \neq \hat{y}_0))
\]

\subsubsection{The Bayes Classifier}
Simple classifier, assigns each observation to the most likely class, given 
its predictor values. We should simply assign a test observation with predictor 
vector $x_0$ to the class $j$ for which
\[
Pr(Y = j| X = x_0)
\]
is the largest. 

The Bayes classifier produces the lowest possible test error rate, the Bayes 
error rate. The overall Bayes error rate is
\[
1 - E\Bigg( \max_j \textrm{Pr}(Y = j|X) \Bigg)
\]
where the expectation averages the probability over all possible values of X.
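
As a toy illustration, the Bayes classifier and the Bayes error rate can be computed exactly when the conditional probabilities are known. The three-valued predictor and all probabilities below are hypothetical.

\begin{minted}[breaklines]{python}
# Hypothetical two-class problem: X takes three values with known probabilities,
# and Pr(Y = 1 | X = x) is assumed known for each value.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}
p_y1_given_x = {"a": 0.9, "b": 0.5, "c": 0.2}

# Bayes classifier: pick the class with the larger conditional probability
# (ties are broken in favour of class 1 here).
bayes_rule = {x: int(p >= 0.5) for x, p in p_y1_given_x.items()}

# Bayes error rate: 1 - E[max_j Pr(Y = j | X)], averaging over the values of X.
bayes_error = 1 - sum(p_x[x] * max(p, 1 - p) for x, p in p_y1_given_x.items())
print(bayes_rule, bayes_error)
\end{minted}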

\subsubsection{K-nearest neighbors}
For real data, we do not know the conditional distribution of Y given X, and 
so computing the Bayes classifier is impossible.

Given a positive integer $K$ and a test observation $x_0$, the KNN classifier identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$. It then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$:
\[
Pr(Y = j|X = x_0) = \frac{1}{K}\sum_{i \in N_0} I(y_i = j)
\]
KNN applies Bayes rule and classifies the test observation $x_0$ to the class with the largest probability.

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{Pictures/KNN.PNG}
\end{figure}

KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier.

The choice of $K$ has a drastic effect on the KNN classifier. As $K$ grows, the method becomes less flexible.
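
The KNN rule described above can be sketched in a few lines of NumPy. Euclidean distance, the helper name \texttt{knn\_predict}, and the toy training set are assumptions for illustration.

\begin{minted}[breaklines]{python}
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    """Classify x0 by the largest estimated class probability among k neighbors."""
    distances = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distance (assumed)
    neighbors = np.argsort(distances)[:k]              # indices of N_0
    classes = np.unique(y_train)
    # Estimated Pr(Y = j | X = x0): fraction of neighbors with label j.
    probs = np.array([np.mean(y_train[neighbors] == j) for j in classes])
    return classes[np.argmax(probs)], dict(zip(classes.tolist(), probs.tolist()))

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
label, probs = knn_predict(X_train, y_train, np.array([0.3, 0.3]), k=3)
print(label, probs)
\end{minted}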

\section{Evaluation metrics: classification}
True positive (TP): hits

True negative (TN): correct rejection

False positive (FP): Type I error

False negative (FN): Type II error

True positive rate (TPR), sensitivity, recall:
\[
TPR = \frac{TP}{P} = \frac{TP}{TP + FN} = 1 -FNR
\]
True negative rate (TNR), specificity, selectivity:
\[
TNR = \frac{TN}{N} = \frac{TN}{TN + FP} = 1 - FPR
\]
Precision:
\[
\frac{TP}{TP+FP}
\]
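The rates above follow directly from the four counts. A small sketch with made-up counts:

\begin{minted}[breaklines]{python}
# Made-up confusion counts for a binary classifier.
tp, fn, fp, tn = 40, 10, 5, 45

tpr = tp / (tp + fn)          # sensitivity / recall
tnr = tn / (tn + fp)          # specificity
fpr = fp / (fp + tn)          # equals 1 - TNR
fnr = fn / (fn + tp)          # equals 1 - TPR
precision = tp / (tp + fp)
print(tpr, tnr, fpr, fnr, precision)
\end{minted}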

Some metrics are essentially defined for binary classification tasks (\texttt{f1\_score}, \texttt{roc\_auc\_score}). In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class. There are then a number of ways to average binary metric calculations across the set of classes:
\begin{enumerate}
    \item ``macro'': simply calculates the mean of the binary metrics, giving equal weight to each class.
    \item ``weighted'': accounts for class imbalance by computing the average of binary metrics in which each class's score is weighted by its presence in the true data.
    \item ``micro'': gives each sample-class pair an equal contribution to the overall metric.
    \item ``samples'': applies only to multilabel problems.
\end{enumerate}
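
To make the first three averaging schemes concrete, the sketch below computes per-class precision by hand on a made-up three-class example and then averages it in each way. The labels and helper names are hypothetical.

\begin{minted}[breaklines]{python}
y_true = [0, 0, 0, 0, 1, 1, 2]   # made-up labels (class 0 is the majority)
y_pred = [0, 0, 0, 1, 1, 2, 2]
classes = sorted(set(y_true))

def counts(c):
    """One-vs-rest true positives and false positives for class c."""
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    return tp, fp

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

per_class = {c: precision(*counts(c)) for c in classes}
support = {c: y_true.count(c) for c in classes}

macro = sum(per_class.values()) / len(classes)
weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
# Micro: pool the counts over all classes before computing the metric.
tp_all = sum(counts(c)[0] for c in classes)
fp_all = sum(counts(c)[1] for c in classes)
micro = precision(tp_all, fp_all)
print(per_class, macro, weighted, micro)
\end{minted}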

\subsection{Confusion matrix}
\begin{minted}[breaklines]{python}
from sklearn.metrics import confusion_matrix
# logreg is assumed to be a classifier already fitted on the training data
y_pred = logreg.predict(X_test)
confusion_matrix(y_test, y_pred)
\end{minted}

\subsection{Accuracy score}
Accuracy score: the number of correct predictions divided by the total number of predictions

If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the fraction of correct predictions over $n$ samples is defined as
\[
\textrm{accuracy}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^n 1(\hat{y}_i = y_i)
\]
where $1(x)$ is the indicator function. 

\begin{minted}[breaklines]{python}
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
\end{minted}

\subsection{Balanced accuracy score}
Balanced accuracy avoids inflated performance estimates on imbalanced datasets. In the binary case, balanced accuracy is equal to the arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate), or the area under the ROC curve with binary predictions rather than scores.

If $y_i$ is the true value of the $i$-th sample, and $w_i$ is the corresponding sample weight, then we adjust the sample weight to:
\[
\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)w_j}
\]
where $1(x)$ is the indicator function. Given predicted $\hat{y}_i$ for sample $i$, balanced accuracy is defined as:
\[
\textrm{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i}\sum_i 1(\hat{y}_i = y_i)\hat{w}_i
\]
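With unit sample weights, each sample is weighted by one over the size of its true class, so the definition reduces to the mean of the per-class recalls. The sketch below checks this on made-up imbalanced labels; the data and variable names are assumptions.

\begin{minted}[breaklines]{python}
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # made-up imbalanced labels
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # made-up predictions

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Adjusted weights with w_i = 1: one over the size of the sample's true class.
w_hat = [1 / y_true.count(t) for t in y_true]
balanced = sum(w for t, p, w in zip(y_true, y_pred, w_hat) if t == p) / sum(w_hat)

# Equivalent view: the mean of per-class recall.
recalls = [
    sum(t == p for t, p in zip(y_true, y_pred) if t == c) / y_true.count(c)
    for c in sorted(set(y_true))
]
mean_recall = sum(recalls) / len(recalls)
print(accuracy, balanced, mean_recall)
\end{minted}

Here plain accuracy is 0.9 even though half of the minority class is missed, while balanced accuracy is 0.75.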

\subsection{Hamming loss}
If $\hat{y}_j$ is the predicted value for the $j$-th label of a given sample, $y_j$ is the corresponding true value, and $n_{labels}$ is the number of classes or labels, then the Hamming loss between two samples is 
\[
\frac{1}{n_{labels}} \sum_{j=1}^{n_{labels}}1(\hat{y}_j \neq y_j)
\]
where $1(x)$ is the indicator function.

\begin{minted}[breaklines]{python}
from sklearn.metrics import hamming_loss
y_pred = [1, 2, 3, 4]
y_true = [2, 2, 3, 4]
hamming_loss(y_true, y_pred)
\end{minted}


\subsection{ROC curve and AUC}
The Receiver Operating Characteristic (ROC) metric is a method to evaluate classifier output quality. ROC curves typically feature the true positive rate on the Y axis and the false positive rate on the X axis. The top-left corner of the plot is therefore the ideal point, with a false positive rate of zero and a true positive rate of one, which is why a larger area under the curve (AUC) is usually better. The steepness of ROC curves is also important, since it is ideal to maximize the true positive rate while minimizing the false positive rate.

ROC curves are typically used in binary classification to study the output of a classifier. In order to extend the ROC curve and ROC area to multilabel classification, it is necessary to binarize the output.

Binary outcome:
\begin{minted}[breaklines]{python}
from sklearn.metrics import roc_curve
# predicted probability of the positive class from a fitted classifier
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
\end{minted}
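
To show what the curve consists of, the sketch below builds the ROC points by hand: one (FPR, TPR) pair per score threshold, with the area obtained by the trapezoidal rule. The helper name \texttt{roc\_points}, the labels, and the scores are made up, and this is not how any particular library implements it.

\begin{minted}[breaklines]{python}
import numpy as np

def roc_points(y_true, scores):
    """FPR and TPR at each distinct threshold (predict 1 when score >= t)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    # Start above the largest score so the curve begins at (0, 0).
    thresholds = np.r_[np.inf, np.unique(scores)[::-1]]
    pos, neg = np.sum(y_true == 1), np.sum(y_true == 0)
    tpr = np.array([np.sum((scores >= t) & (y_true == 1)) / pos for t in thresholds])
    fpr = np.array([np.sum((scores >= t) & (y_true == 0)) / neg for t in thresholds])
    return fpr, tpr

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
# Area under the curve by the trapezoidal rule.
auc_value = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(fpr, tpr, auc_value)
\end{minted}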

Multiclass outcome:
\begin{minted}[breaklines]{python}
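import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# y_test and y_score are assumed to be (n_samples, n_classes) arrays:
# binarized true labels and per-class scores.
fpr = dict()
tpr = dict()
roc_auc = dict()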
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])


#Plot of ROC curve for a specific class    
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()  
\end{minted}

\subsection{Binary cross-entropy/log-loss}
For binary classification, the typical loss function is the binary cross-entropy/log loss:
\[
H_p(q) = -\frac{1}{N}\sum_{i=1}^N \Big[ y_i \cdot \log(p(y_i)) + (1-y_i)\cdot\log(1 - p(y_i)) \Big]
\]
where $y$ is the label and $p(y)$ is the predicted probability of the point being 1 or Yes for all $N$ points.

Since we're trying to compute a loss, we need to penalize bad predictions. If the probability associated with the true class is 1.0, we need its loss to be zero. Conversely, if that probability is low, say 0.01, we need its loss to be huge. It turns out that taking the (negative) log of the probability suits us well enough for this purpose.
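
A small sketch of this behaviour, computing the binary cross-entropy by hand for a confident correct prediction and a confident wrong one. The function name and the probabilities are illustrative, and probabilities are assumed to lie strictly between 0 and 1.

\begin{minted}[breaklines]{python}
import math

def binary_cross_entropy(y_true, p_pred):
    """Average negative log-likelihood of the true labels under p_pred.

    Assumes every probability is strictly between 0 and 1.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

good = binary_cross_entropy([1, 0], [0.99, 0.01])   # confident and correct
bad = binary_cross_entropy([1, 0], [0.01, 0.99])    # confident and wrong
print(good, bad)
\end{minted}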

We can compute the entropy of a distribution, like our $q(y)$, using the formula below, where $C$ is the number of classes:
\[
H(q) = - \sum_{c=1}^Cq(y_c)\cdot\log(q(y_c))
\]
We can try to approximate the true distribution with some other distribution, say $p(y)$. If we compute entropy like this, we are actually computing the cross-entropy between both distributions:
\[
H_p(q) = -\sum_{c=1}^Cq(y_c)\cdot\log(p(y_c))
\]
Cross-entropy is always at least as large as the entropy computed on the true distribution, with equality only when $p = q$. The difference between cross-entropy and entropy is the Kullback-Leibler divergence:
\[
D_{KL}(q\|p) = H_p(q) - H(q) = \sum_{c=1}^C q(y_c) \cdot [\log(q(y_c)) - \log(p(y_c))]
\]
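The three quantities can be checked numerically on a small example. The two distributions below are assumed values chosen for illustration.

\begin{minted}[breaklines]{python}
import math

q = [0.5, 0.5]   # assumed true distribution
p = [0.9, 0.1]   # assumed approximating distribution

entropy = -sum(qc * math.log(qc) for qc in q)
cross_entropy = -sum(qc * math.log(pc) for qc, pc in zip(q, p))
kl = sum(qc * (math.log(qc) - math.log(pc)) for qc, pc in zip(q, p))
print(entropy, cross_entropy, kl)
\end{minted}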
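The classifier's task is to find a $p(y)$ as close as possible to $q(y)$: since $H(q)$ is fixed, minimizing the cross-entropy also minimizes the divergence.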

\begin{minted}[breaklines]{python}
from sklearn.metrics import log_loss
# predicted probability of the positive class from a fitted classifier
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
log_loss(y_test, y_pred_prob)
\end{minted}

\subsection{Brier score}
The smaller the Brier score, the better. Across all items in a set of $N$ predictions, the Brier score measures the mean squared difference between the predicted probability assigned to the possible outcomes for item $i$, and the actual outcome (0 or 1).

\begin{minted}[breaklines]{python}
import numpy as np
from sklearn.metrics import brier_score_loss
y_true = np.array([0,1,1,0])
y_prob = np.array([0.1, 0.9, 0.8, 0.3])
brier_score_loss(y_true, y_prob)
\end{minted}
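
As a worked check, assuming the usual definition with $p_i$ the predicted probability and $o_i$ the observed outcome (symbols introduced here for illustration), the score for the four predictions in the snippet above is
\[
BS = \frac{1}{N}\sum_{i=1}^N (p_i - o_i)^2 = \frac{(0.1-0)^2 + (0.9-1)^2 + (0.8-1)^2 + (0.3-0)^2}{4} = \frac{0.15}{4} = 0.0375
\]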

\subsection{Discounted cumulative gain}
Discounted cumulative gain (DCG) sums the true scores, ranked in the order induced by the predicted scores, after applying a logarithmic discount.

\begin{minted}[breaklines]{python}
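import numpy as np
from sklearn.metrics import dcg_score
# dcg_score expects 2D arrays of shape (n_samples, n_labels): one row per ranking
y_true = np.array([[0, 1, 1, 0]])
y_prob = np.array([[0.1, 0.9, 0.8, 0.3]])
dcg_score(y_true, y_prob)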
\end{minted}
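
The computation can also be written out by hand. The sketch below assumes a base-2 logarithmic discount, so the item at rank $r$ (starting at 1) is divided by $\log_2(r + 1)$, and it assumes no tied scores. The variable names are illustrative.

\begin{minted}[breaklines]{python}
import math

true_scores = [0, 1, 1, 0]             # true relevance of each item
pred_scores = [0.1, 0.9, 0.8, 0.3]     # predicted scores used for ranking

# Rank items by decreasing predicted score.
ranking = sorted(range(len(pred_scores)), key=lambda i: -pred_scores[i])

# Sum the true scores in that order, discounting rank r by log2(r + 1).
dcg = sum(true_scores[i] / math.log2(rank + 1)
          for rank, i in enumerate(ranking, start=1))
print(dcg)
\end{minted}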

\subsection{F1 score}
The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is:
\[
F1 = \frac{2(P \times R)}{P+R}
\]
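As a quick numerical sketch, the F1 score can be computed from confusion counts as twice the product of precision and recall over their sum, which is equivalent to $2TP/(2TP + FP + FN)$. The counts and the helper name below are made up.

\begin{minted}[breaklines]{python}
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

f1_value = f1_from_counts(tp=40, fp=5, fn=10)   # made-up counts
print(f1_value)
\end{minted}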

\begin{minted}[breaklines]{python}
from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
f1_score(y_true, y_pred, average='macro')
\end{minted}

\subsection{Precision-Recall}
The precision-recall metric is a method to evaluate classifier output quality. Precision-recall is a useful measure of prediction success when the classes are very imbalanced. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall returns very few results, but most of its predicted labels are correct when compared to the training labels.

Precision $P$ is defined as the number of true positives $T_p$ over the number of true positives plus the number of false positives $F_p$. 
\[
P = \frac{T_p}{T_p + F_p}
\]
Recall $R$ is defined as the number of true positives $T_p$ over the number of true positives plus the number of false negatives $F_n$:
\[
R = \frac{T_p}{T_p + F_n}
\]

Average precision (AP) summarizes the precision-recall plot as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:
\[
AP = \sum_n (R_n - R_{n-1})P_n
\]
where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold.
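
The sum can be evaluated by hand by ranking the samples by decreasing score and treating each rank as a threshold. The sketch below uses made-up labels and scores and assumes there are no tied scores.

\begin{minted}[breaklines]{python}
import numpy as np

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

order = np.argsort(-scores)                    # rank samples by decreasing score
hits = y_true[order]
tp = np.cumsum(hits)                           # true positives among the top n
precision = tp / np.arange(1, len(hits) + 1)   # P_n
recall = tp / hits.sum()                       # R_n
# AP = sum over n of (R_n - R_{n-1}) * P_n, with R_0 = 0.
ap = float(np.sum(np.diff(np.r_[0.0, recall]) * precision))
print(ap)
\end{minted}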

\begin{minted}[breaklines]{python}
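from sklearn.metrics import PrecisionRecallDisplay, average_precision_score

# logreg is assumed to be a fitted binary classifier.
# plot_precision_recall_curve was removed in scikit-learn 1.2;
# PrecisionRecallDisplay.from_estimator is its replacement.
PrecisionRecallDisplay.from_estimator(logreg, X_test, y_test)

y_score = logreg.predict_proba(X_test)[:, 1]

average_precision = average_precision_score(y_test, y_score)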
\end{minted}


