In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import sys
sys.path.append('../../pyutils')
import metrics
import utils

# When $p$ is much bigger than $N$

High variance and overfitting are a major concern in this setting.  
Simple, highly regularized models are often used.  

Let's suppose we are trying to fit a linear model.  
With $p \ll N$, we can identify as many coefficients as we want without shrinkage.  
With $p=N$, we can identify some non-zero coefficients with moderate shrinkage.  
With $p \gg N$, even though there are many non-zero coefficients, we have no hope of finding them, so we need to shrink a lot.

# Diagonal LDA and Nearest Shrunken Centroids

The simplest form of regularization assumes that the features are independent within each class (the within-class covariance matrix is diagonal).  
This greatly reduces the number of parameters and often results in an effective and interpretable classifier.  

The discriminant score for class $k$ is:
$$\theta_k(x) = - \sum_{j=1}^p \frac{(x_j - \bar{x}_{kj})^2}{s_j^2} + 2 \log \pi_k$$
with $s_j$ the pooled within-class standard deviation for feature $j$, and:
$$\bar{x}_{kj} = \frac{1}{N_k} \sum_{i \in C_k} x_{ij}$$  

We call $\bar{x}_k$ the centroid of class $k$. Diagonal LDA can be seen as a nearest centroid classifier with appropriate standardization.  

To regularize in a way that drops features, we can shrink the classwise means toward the overall mean, for each feature separately. This method is called Nearest Shrunken Centroids (NSC).  

Let 
$$d_{kj} = \frac{\bar{x}_{kj} - \bar{x}_j}{m_k(s_j + s_0)}$$

with $m_k^2 = 1/N_k - 1/N$ and $s_0$ a small positive constant.  

We can shrink $d_{kj}$ toward zero using soft thresholding:
$$d'_{kj} = \text{sign}(d_{kj})(|d_{kj}| - \Delta)_+$$
with $\Delta$ a parameter to be determined.  

The shrunken centroids are obtained by:
$$\bar{x}'_{kj} = \bar{x}_j + m_k(s_j + s_0)d'_{kj}$$
We use the shrunken centroids $\bar{x}'_{kj}$ instead of the original $\bar{x}_{kj}$ in the discriminant score.
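
The shrinkage steps above can be sketched in NumPy as follows. This is a minimal sketch, not a reference implementation: the function names `shrunken_centroids` and `nsc_predict` are hypothetical, the pooled variance divisor $N - K$ and the default $s_0 = \text{median}(s_j)$ are assumptions, and the priors are estimated by the class proportions.

```python
import numpy as np

def shrunken_centroids(X, y, delta, s0=None):
    """Return shrunken class centroids (K x p), s_j, class labels and priors."""
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    overall = X.mean(axis=0)                                   # x_bar_j
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    counts = np.array([(y == k).sum() for k in classes])
    # Pooled within-class standard deviation s_j (divisor N - K is an assumption).
    resid = X - centroids[np.searchsorted(classes, y)]
    s = np.sqrt((resid ** 2).sum(axis=0) / (N - K))
    if s0 is None:
        s0 = np.median(s)                                      # assumed default
    m = np.sqrt(1.0 / counts - 1.0 / N)                        # m_k
    scale = m[:, None] * (s + s0)[None, :]
    d = (centroids - overall) / scale                          # d_kj
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0) # soft thresholding
    return overall + scale * d_shrunk, s, classes, counts / N

def nsc_predict(X_new, shrunk, s, classes, priors):
    """Classify with the diagonal-LDA score, using the shrunken centroids."""
    diff = X_new[:, None, :] - shrunk[None, :, :]
    scores = -((diff / s) ** 2).sum(axis=2) + 2 * np.log(priors)
    return classes[np.argmax(scores, axis=1)]
```

With $\Delta = 0$ the centroids are unchanged; with a very large $\Delta$ every class centroid collapses onto the overall mean, so the corresponding features no longer contribute to the classification.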

# Linear Classifiers with Quadratic Regularization

## Regularized Discriminant Analysis

LDA involves the inversion of a $p \times p$ within-class covariance matrix $\Sigma$. When $p > N$, the matrix is singular.  
RDA solves the issue by shrinking $\Sigma$ towards its diagonal:
$$\hat{\Sigma}(\gamma) = \gamma \hat{\Sigma} + (1 - \gamma) \text{diag}(\hat{\Sigma})$$
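
As a small illustration of why this helps, the sketch below applies the shrinkage formula to a singular covariance estimate. The helper name `rda_covariance` is hypothetical, and for brevity the demo uses an ordinary sample covariance rather than a pooled within-class one.

```python
import numpy as np

def rda_covariance(sigma_hat, gamma):
    """Shrink a covariance estimate toward its diagonal (gamma in [0, 1])."""
    return gamma * sigma_hat + (1.0 - gamma) * np.diag(np.diag(sigma_hat))

# Demo with p > N: the sample covariance is singular, the shrunken one is not.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 20))            # N = 5 samples, p = 20 features
sigma_demo = np.cov(X_demo, rowvar=False)    # 20 x 20, rank at most 4
sigma_shrunk = rda_covariance(sigma_demo, gamma=0.5)
```

For any $\gamma < 1$ the result is the sum of a positive semi-definite matrix and a positive diagonal matrix, hence invertible as long as every feature has non-zero variance.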

## Logistic Regression with Quadratic Regularization

The multiclass logistic model is expressed as:
$$P(G=k|X=x) = \frac{\exp (\beta_{0k} + x^T \beta_k)}{\sum_{l=1}^K \exp (\beta_{0l} + x^T \beta_l)}$$

This has $K$ coefficient vectors $\beta_k$. We regularize the fitting by maximizing the penalized log-likelihood:
$$\max_{ \{ \beta_{0k}, \beta_k \}_1^K} \left[ \sum_{i=1}^N \log P(g_i|x_i) - \frac{\lambda}{2} \sum_{k=1}^K ||\beta_k||_2^2 \right]$$

## The Support Vector Classifier

When $p > N$, the classes are perfectly separable, unless there are identical feature vectors in different classes.  
Surprisingly, the unregularized SVC often works about as well as the best regularized version.

## Feature Selection

RDA, regularized logistic regression and the SVC shrink weights toward zero, but they keep all features.  
Recursive feature elimination removes features with small weights and retrains the classifier.  

All 3 approaches can be modified using kernels to increase model complexity. With $p > N$ overfitting is always a danger, and yet using kernels may sometimes give better results.

## Computational shortcuts when $p \gg N$

Instead of working with the matrix $X \in \mathbb{R}^{N \times p}$, we can work with a matrix of size $N \times N$, using the SVD:
$$
\begin{aligned}
X & = UDV^T \\
& = RV^T
\end{aligned}
$$

with $R \in \mathbb{R}^{N \times N}$:
$$R = UD$$

We can usually work with $R$ instead of $X$.  
For example, let's consider the estimates from a ridge regression:
$$\hat{\beta} = (X^TX + \delta I)^{-1}X^Ty$$

We can instead compute $\hat{\theta}$, the ridge regression estimate obtained from the $N$ pairs $(r_i, y_i)$, where $r_i$ is the $i$-th row of $R$, and then recover $\hat{\beta} = V \hat{\theta}$.  
This idea can be generalized to any linear model with a quadratic penalty on the weights.
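
The ridge shortcut can be checked numerically. The sketch below is a minimal illustration with hypothetical function names (`ridge_direct`, `ridge_via_svd`); it assumes no intercept and uses the thin SVD, so that $V$ is $p \times N$ when $p > N$.

```python
import numpy as np

def ridge_direct(X, y, delta):
    """Ridge estimate from the p x p system (expensive when p >> N)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + delta * np.eye(p), X.T @ y)

def ridge_via_svd(X, y, delta):
    """Same estimate, computed from the N x N matrix R = U D."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    R = U * d                                   # equals U @ diag(d)
    n = R.shape[1]
    theta = np.linalg.solve(R.T @ R + delta * np.eye(n), R.T @ y)
    return Vt.T @ theta                         # beta_hat = V theta_hat
```

Both functions return the same coefficients up to floating-point error, but the second only solves an $N \times N$ system after the SVD.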

# Linear classifiers with $L_1$ Regularization

The lasso for linear regression is:
$$\min_{\beta} \frac{1}{2} \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right) ^2 + \lambda \sum_{j=1}^p |\beta_j|$$

The $L_1$ penalty causes a subset of the $\hat{\beta}_j$ to be exactly zero for a sufficiently large value of $\lambda$, and hence performs feature selection.  

When $p > N$, as $\lambda \to 0$ the model fits the training data perfectly.  
When $p > N$ the number of non-zero coefficients is at most $N$ for any value of $\lambda$.  

Linear regression can be applied to two-class classification by using $\pm 1$ as labels and the sign of the fitted value as the prediction.  

A more natural approach is to use the lasso penalty on logistic regression. We can use a symmetric multinomial logistic regression model, and maximize the penalized log-likelihood:
$$\max_{ \{ \beta_{0k}, \beta_k \}_1^K} \left[ \sum_{i=1}^N \log P(g_i|x_i) - \lambda \sum_{k=1}^K \sum_{j=1}^p |\beta_{kj}|  \right]$$

The lasso tends to encourage a sparse solution, and ridge tends to shrink the coefficients of correlated variables toward each other.  
The elastic net penalty is a compromise:
$$\sum_{j=1}^p (\alpha |\beta_j| + (1 - \alpha) \beta_j^2)$$
with $\alpha \in [0,1]$ a parameter that determines the mix of the penalties.
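
To make the role of $\alpha$ concrete, here is a small sketch that evaluates the penalty for a given coefficient vector. The function name `elastic_net_penalty` is hypothetical.

```python
import numpy as np

def elastic_net_penalty(beta, alpha):
    """Elastic net penalty: alpha * |beta_j| + (1 - alpha) * beta_j^2, summed over j."""
    beta = np.asarray(beta, dtype=float)
    return np.sum(alpha * np.abs(beta) + (1.0 - alpha) * beta ** 2)
```

Setting `alpha=1` gives the lasso penalty (the $L_1$ norm) and `alpha=0` gives the ridge penalty (the squared $L_2$ norm).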

The logistic regression problem above with the elastic net penalty becomes:
$$\max_{ \{ \beta_{0k}, \beta_k \}_1^K} \left[ \sum_{i=1}^N \log P(g_i|x_i) - \lambda \sum_{k=1}^K \sum_{j=1}^p (\alpha|\beta_{kj}| + (1 - \alpha) \beta_{kj}^2)  \right]$$

## The Fused Lasso

The fused lasso is a method that tends to smooth the coefficients when the features have a natural ordering. We add a penalty on the differences between adjacent coefficients:

$$\min_{\beta} \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right) ^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j|$$
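
The objective can be evaluated directly, which is handy for checking a solver or comparing candidate coefficient vectors. This is a sketch of the criterion only, not of an optimization algorithm; the function name `fused_lasso_objective` is hypothetical.

```python
import numpy as np

def fused_lasso_objective(X, y, beta0, beta, lam1, lam2):
    """Residual sum of squares + L1 penalty + penalty on adjacent differences."""
    resid = y - beta0 - X @ beta
    sparsity = np.sum(np.abs(beta))             # sum_j |beta_j|
    fusion = np.sum(np.abs(np.diff(beta)))      # sum_j |beta_{j+1} - beta_j|
    return np.sum(resid ** 2) + lam1 * sparsity + lam2 * fusion
```

A piecewise-constant coefficient vector pays the fusion penalty only at its jumps, which is why large $\lambda_2$ favours flat stretches of coefficients.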

# Classification When Features are Unavailable

Instead of working with features, we can work with an $N \times N$ proximity matrix, and we can interpret the proximities as inner products.

For example, it can be considered as the kernel matrix $K$, and can be used with kernel methods such as the SVM.

## Classification and other methods using Kernels

There are a number of other classifiers, besides the SVM, that can be implemented using only inner-product matrices. This also implies they can be kernelized like the SVM.  

For nearest-neighbor classification, we can transform inner products to distances:
$$||x_i - x_{i'}||^2 = \langle x_i, x_i \rangle + \langle x_{i'}, x_{i'} \rangle - 2 \langle x_i, x_{i'} \rangle$$
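
A short sketch of this conversion, using the expansion $||a - b||^2 = \langle a, a \rangle + \langle b, b \rangle - 2 \langle a, b \rangle$. The function name `gram_to_sq_distances` is hypothetical; the input is assumed to be a symmetric inner-product (Gram) matrix.

```python
import numpy as np

def gram_to_sq_distances(K):
    """Squared Euclidean distances between all pairs, from the Gram matrix only."""
    diag = np.diag(K)                            # <x_i, x_i>
    return diag[:, None] + diag[None, :] - 2.0 * K
```

The resulting distance matrix can be passed to any nearest-neighbor rule that accepts precomputed distances.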

For nearest-centroid classification, with training pairs $(x_i, g_i)$ and class centroids $\bar{x}_k$, we can compute the distance of the test point $x_0$ to each centroid:
$$||x_0 - \bar{x}_k||^2 = \langle x_0, x_0 \rangle - \frac{2}{N_k} \sum_{g_i=k} \langle x_0, x_i \rangle + \frac{1}{N_k^2} \sum_{g_i=k} \sum_{g_{i'}=k} \langle x_i, x_{i'} \rangle$$
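
The centroid distances can likewise be computed without ever forming the centroids. In this sketch the function name and argument names (`k00`, `k0`) are hypothetical: `k00` is $\langle x_0, x_0 \rangle$, `k0` holds the $N$ inner products $\langle x_0, x_i \rangle$, and `K` is the training Gram matrix.

```python
import numpy as np

def centroid_sq_distances(k00, k0, K, g):
    """Squared distance from a test point to each class centroid, from inner products."""
    classes = np.unique(g)
    out = np.empty(len(classes))
    for idx, k in enumerate(classes):
        mask = g == k
        n_k = mask.sum()
        cross = 2.0 / n_k * k0[mask].sum()                 # test point vs class members
        within = K[np.ix_(mask, mask)].sum() / n_k ** 2    # class members vs each other
        out[idx] = k00 - cross + within
    return classes, out
```

The predicted class is then `classes[np.argmin(out)]`.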

We can also perform kernel PCA. Let $X$ be the centered data matrix, with SVD:
$$X=UDV^T$$
We get the matrix of principal components $Z$:
$$Z = UD$$
When $K=XX^T$, it follows that $K=UD^2U^T$, and hence we can compute $Z$ from the eigendecomposition of $K$.  
If $X$ is not centered, we need to use the double-centered kernel instead:
$$\tilde{K} = (I-M)K(I-M)$$
with $M = \frac{1}{N} 1 1^T$.
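
A minimal sketch of kernel PCA along these lines: double-center the kernel, take its eigendecomposition, and scale the leading eigenvectors by the square roots of the eigenvalues to obtain $Z$. The function name `kernel_pca_scores` is hypothetical, and tiny negative eigenvalues caused by round-off are clipped to zero.

```python
import numpy as np

def kernel_pca_scores(K, n_components):
    """Principal component scores Z computed from an N x N kernel matrix."""
    N = K.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N          # I - M
    K_tilde = C @ K @ C                          # double-centered kernel
    eigval, eigvec = np.linalg.eigh(K_tilde)     # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_components]
    lam = np.clip(eigval[order], 0.0, None)      # these are the squared singular values
    return eigvec[:, order] * np.sqrt(lam)       # Z = U D
```

With a linear kernel $K = XX^T$ this reproduces ordinary PCA scores, up to the sign of each component.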

But there are some things that we cannot do with kernels:
- We cannot standardize the variables.
- We cannot assess directly the contribution of individual variables (i.e. we cannot use the lasso penalty)
- We cannot separate the good variables from the noise: they all get an equal say.

# High-Dimensional Regression: Supervised Principal Components

PCA is an effective method to find linear combinations of features that exhibit large variation in the data.  
Supervised PCA finds linear combinations with both high variance and significant correlation with the outcome.

Supervised PCA can be related to latent-variable modeling.  
Suppose we have a response variable $Y$ related to an underlying latent variable $U$ by a linear model:
$$Y = \beta_0 + \beta_1 U + \varepsilon$$

We have measurements on a set of features $X_j$, $j \in \mathcal{P}$:
$$X_j = \alpha_{0j} + \alpha_{1j}U + \varepsilon_j, \quad j \in \mathcal{P}$$
Here $\varepsilon$ and $\varepsilon_j$ are zero-mean error terms. We also have many additional features $X_k$, $k \notin \mathcal{P}$, which are independent of $U$.

Fitting this model is a three-step process, similar to supervised PCA:
- Estimate the set $\mathcal{P}$.
- Given $\hat{\mathcal{P}}$, estimate $U$.
- Perform a regression fit to estimate $\beta_0$ and $\beta_1$.
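
The three steps can be sketched as follows. This is only one plausible instantiation, under stated assumptions: features are screened by the absolute value of their univariate correlation with the outcome against a user-chosen `threshold`, the latent variable is estimated by the first principal component of the selected features, and the function name `supervised_pc` is hypothetical. The sketch also assumes at least one feature passes the threshold.

```python
import numpy as np

def supervised_pc(X, y, threshold):
    """Screen features, take the first principal component, regress y on it."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Step 1: estimate the feature set by univariate correlation with y.
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    selected = np.flatnonzero(np.abs(corr) > threshold)
    # Step 2: estimate the latent variable by the first principal component.
    U, d, Vt = np.linalg.svd(Xc[:, selected], full_matrices=False)
    u_hat = U[:, 0] * d[0]
    # Step 3: simple least-squares regression of y on u_hat (u_hat has mean zero).
    beta1 = (u_hat @ yc) / (u_hat @ u_hat)
    beta0 = y.mean()
    return selected, u_hat, beta0, beta1
```

In practice the threshold would be chosen by cross-validation rather than fixed in advance.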

# Feature Assessment and the Multiple-Testing Problem

Feature assessment evaluates the significance of each feature; it is closely related to multiple hypothesis testing.  
Let's suppose we have a dataset of $N$ observations, each with $M$ features, separated into $K=2$ groups.  

To identify which features are informative to guess the group, we construct a two-sample t-statistic for each feature:
$$t_j = \frac{\bar{x}_{2j} - \bar{x}_{1j}}{\text{se}_j}$$
where:
$$\bar{x}_{kj} = \frac{1}{N_k} \sum_{i \in C_k} x_{ij}$$

$\text{se}_j$ is the pooled within-group standard error for feature $j$:
$$\text{se}_j = \hat{\sigma}_j \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}$$
$$\hat{\sigma}_j^2 = \frac{1}{N_1 + N_2 - 2} \left( \sum_{i \in C_1} (x_{ij} - \bar{x}_{1j})^2 + \sum_{i \in C_2} (x_{ij} - \bar{x}_{2j})^2 \right)$$

We could consider any value $> 2$ in absolute value to be significantly large. However, with $M$ large, we would expect many large values to occur by chance.  
We can assess the results for all $M$ features within the multiple-testing framework.   
We can compute the p-value for each feature $p_j$:
$$p_j = \frac{1}{K} \sum_{k=1}^K I(|t_j^k| > |t_j|)$$

where $t_j^k$ is the t-statistic for feature $j$ computed under the $k$-th of $K$ random permutations of the sample labels (here $K$ counts permutations, not groups).
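
The permutation p-values can be sketched in NumPy as below. The function names and the 0/1 label encoding are assumptions made for the example; `n_perm` plays the role of the number of permutations in the formula above.

```python
import numpy as np

def t_statistics(X, labels):
    """Two-sample t-statistic with pooled variance, one value per feature."""
    x1, x2 = X[labels == 0], X[labels == 1]
    n1, n2 = len(x1), len(x2)
    ss1 = ((x1 - x1.mean(axis=0)) ** 2).sum(axis=0)
    ss2 = ((x2 - x2.mean(axis=0)) ** 2).sum(axis=0)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (x2.mean(axis=0) - x1.mean(axis=0)) / se

def permutation_p_values(X, labels, n_perm=200, seed=0):
    """Fraction of label permutations whose |t| exceeds the observed |t|."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(t_statistics(X, labels))
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        t_perm = np.abs(t_statistics(X, rng.permutation(labels)))
        exceed += t_perm > t_obs
    return exceed / n_perm
```

Permuting the labels breaks any association between label and features, so the permuted statistics approximate the null distribution of each $t_j$.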

Using p-values, we can test the hypotheses:
$H_{0j}$: the label has no effect on feature $j$.  
$H_{1j}$: the label has an effect on feature $j$.  
We reject $H_{0j}$ at level $\alpha$ if $p_j < \alpha$.  

Let $A_j$ be the event that $H_{0j}$ is falsely rejected; by construction $P(A_j) = \alpha$.  
The family-wise error rate (FWER) is the probability of at least one false rejection.
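
To see how quickly the FWER grows, consider the special case where the tests are independent and each true null is rejected with probability $\alpha$. Independence is an assumption made only for this illustration, and the function name `fwer_independent` is hypothetical.

```python
def fwer_independent(alpha, n_true_nulls):
    """P(at least one false rejection) for independent level-alpha tests."""
    return 1.0 - (1.0 - alpha) ** n_true_nulls
```

With a single test the FWER equals $\alpha$, but it approaches 1 as the number of true nulls grows, which is why per-test thresholds need adjusting when $M$ is large.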

## The False Discovery Rate

Possible outcomes from $M$ hypothesis tests:

|&nbsp;|Called not significant|Called significant|Total|
|---|---|---|---|
|$H_0$ True| $U$ | $V$ | $M_0$|
|$H_0$ False|$T$|$S$|$M_1$|
|Total|$M-R$|$R$|$M$|

The false discovery rate is:
$$\text{FDR} = E(V/R)$$

The expectation is taken over the sampled data.
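
For a single dataset where the truth is known (for example in a simulation), the realized ratio $V/R$ can be computed as below; the FDR is the average of this quantity over repeated datasets. The function name is hypothetical, and returning 0 when nothing is called significant ($R = 0$) is a convention assumed here.

```python
import numpy as np

def false_discovery_proportion(null_is_true, called_significant):
    """Realized V / R for one set of test decisions."""
    null_is_true = np.asarray(null_is_true, dtype=bool)
    called_significant = np.asarray(called_significant, dtype=bool)
    R = called_significant.sum()                     # total rejections
    V = (null_is_true & called_significant).sum()    # false rejections
    return V / R if R > 0 else 0.0
```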

## Asymmetric Cutpoints and the SAM Procedure

In previous approaches, we used the absolute value of $t_j$, hence applying the same cutpoint to both positive and negative values.  
Significance analysis of microarrays (SAM) derives separate cutpoints for the two classes.