In [1]:
import sys
sys.path.append('..')

import numpy as np

import metrics

# Introduction

The predictor $G(X)$ takes values in a discrete set $\mathbb{G}$. The input space is divided into a collection of regions labeled according to their classification.  
The boundaries of these regions are called decision boundaries, and they are linear for linear methods.

## Example 1

Example of $K$ classes with discriminant function:

$$\delta_k(x) = \beta_{k0} + \beta_k^Tx$$
The assigned class is the one with the biggest value for $\delta_k$.  

The decision boundary between class $k$ and $l$ is the set of points for which $\delta_k(x) = \delta_l(x)$.  
That is the hyperplane defined by $\{x: (\beta_{k0} - \beta_{l0}) + (\beta_k - \beta_l)^Tx = 0 \}$.

Methods that model the posterior probabilities $P(G = k | X = x)$ are also in this class.  
If $\delta_k(x)$ or $P(G = k | X = x)$, or some monotone transformation of them, is linear in $x$, then the decision boundaries will be linear.
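
As a minimal sketch of this rule, the snippet below evaluates $K = 3$ linear discriminant functions on a 2-D input and checks that a point on the hyperplane between classes 0 and 1 gets equal scores. The coefficients `beta0` and `beta` are made-up values chosen for illustration only.

```python
import numpy as np

# Hypothetical coefficients for K = 3 classes in p = 2 dimensions
beta0 = np.array([0.0, 1.0, -1.0])          # intercepts beta_{k0}
beta = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [-1.0, -1.0]])             # row k holds beta_k

def discriminants(x):
    """Return the K values delta_k(x) = beta_{k0} + beta_k^T x."""
    return beta0 + beta @ x

def classify(x):
    """Assign the class with the largest discriminant value."""
    return int(np.argmax(discriminants(x)))

# A point on the boundary between classes 0 and 1 satisfies
# (beta_00 - beta_10) + (beta_0 - beta_1)^T x = 0, e.g. x = (2, 1)
x_boundary = np.array([2.0, 1.0])
d = discriminants(x_boundary)
print(d[0], d[1])                       # the two scores coincide
print(classify(np.array([3.0, 0.0])))   # a point on the class-0 side
```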

## Example 2

Example of 2 classes with posterior probabilities:
    
$$P(G = 1 | X = x) = \frac{\exp(\beta_0 + \beta^Tx)}{1 + \exp(\beta_0 + \beta^Tx)}$$
$$P(G = 2 | X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^Tx)}$$  

The decision boundary is the set of points for which the log-odds are zero:

$$\log(\frac{p}{1-p}) = 0$$
$$\log \frac{P(G = 1 | X = x)}{P(G = 2 | X = x)} = \beta_0 + \beta^Tx$$

Thus the decision boundary is a hyperplane defined by $\{x | \beta_0 + \beta^Tx = 0\}$.  
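
The identity between the log-odds and the linear score can be checked numerically. This is only a sketch: `beta0`, `beta` and the test points are arbitrary values, not taken from any dataset.

```python
import numpy as np

# Hypothetical parameters of a two-class logistic model
beta0 = -0.5
beta = np.array([2.0, -1.0])

def posteriors(x):
    """Return (P(G=1|x), P(G=2|x)) for the model above."""
    e = np.exp(beta0 + beta @ x)
    return e / (1 + e), 1 / (1 + e)

x = np.array([0.3, 1.2])
p1, p2 = posteriors(x)
log_odds = np.log(p1 / p2)
linear_score = beta0 + beta @ x
print(log_odds, linear_score)           # the two values agree

# On the hyperplane beta0 + beta^T x = 0 both posteriors equal 1/2
x_b = np.array([0.25, 0.0])
print(posteriors(x_b))
```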

Linear logistic regression and linear discriminant analysis have linear log-odds.

Another solution is to explicitly model linear boundaries between the classes.

# Linear Regression of an Indicator Matrix

The output is a vector $y \in \mathbb{R}^{K}$, with $K$ the number of classes.
If the example belongs to class $k$, then $y_j = 1_{j=k}$.  
For a training set of size $N$, the output matrix is $Y \in \mathbb{R}^{N \times K}$.  

The parameters are fitted using any multiple-output linear regression method for $X$ and $Y$, e.g. the normal equations:
$$\hat{B} = (X^TX)^{-1} X^TY$$

Classify an example:
$$\hat{y} = x^T \hat{B}$$
$$\hat{G}(x) = \arg \max_{k \in \mathbb{G}} \hat{y}_k$$

Regression gives an estimate of the conditional expectation, which for an indicator response is the posterior probability:
$$\hat{y}_k \approx \mathbb{E}(y_k | X = x) = P(G = k | X = x)$$

In [2]:
def gen_toy_class(N, noise=0.001):
    X = 2.8 * np.random.randn(N, 4)**1 + 4.67
    v1 = 1.5*X[:, 0] + 2.3*X[:, 1] - 0.3*X[:, 2] + 4.5 + noise*np.random.randn(len(X)) 
    v2 = 1.7*X[:, 0] + 0.4*X[:, 1] + 2.3*X[:, 2] - 3.7 + noise*np.random.randn(len(X))
    v3 = -0.6*X[:, 0] + 5.8*X[:, 1] - 1.3*X[:, 2] + 0.1 + noise*np.random.randn(len(X))
    V = np.vstack((v1, v2, v3)).T
    g = np.argmax(V, axis=1)
    return X, g

def label2onehot(g, nclasses):
    Y = np.zeros((len(g), nclasses))
    Y[np.arange(len(g)), g] = 1
    return Y

def onehot2labels(Y):
    return np.argmax(Y, axis=1)

def add_col1(X):
    return np.append(np.ones((len(X),1)), X, axis=1)

In [3]:
#Example with 3 classes

X, g = gen_toy_class(117, noise=1e-3)
Y = label2onehot(g, 3)

X2 = add_col1(X)
B = np.linalg.inv(X2.T @ X2) @ X2.T @ Y

Y_preds = X2 @ B
preds = onehot2labels(Y_preds)
print('error:', np.mean((Y - Y_preds)**2))
print('acc:', np.mean(g == preds))

error: 0.11236653073061927
acc: 0.8888888888888888


In [4]:
def gen_toy_bin(N, noise=0.001):
    X = 2.8 * np.random.randn(100000, 4)**2 + 4.67
    v = 1.5*X[:, 0] + 2.3*X[:, 1] - 4.7*X[:, 2] + 4.5 + noise*np.random.randn(len(X))
    m = v.mean()
    X = X[:N]
    g = (v[:N] > m).astype(int)  # np.int was removed from recent NumPy versions
    return X, g

In [5]:
#Binary example

X, g = gen_toy_bin(117000, noise=0)
Y = label2onehot(g, 2)

X2 = add_col1(X)
B = np.linalg.inv(X2.T @ X2) @ X2.T @ Y

Y_preds = X2 @ B
preds = onehot2labels(Y_preds)
print('error:', np.mean((Y - Y_preds)**2))
print('acc:', np.mean(g == preds))

error: 0.12269558980684189
acc: 0.85423


$y_k$ doesn't behave like a probability. Even though $\sum_{k \in \mathbb{G}} y_k = 1$, $y_k$ might be negative or greater than $1$.
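
A tiny deterministic example makes this visible. The sketch below regresses a one-hot indicator matrix on ten equally spaced 1-D points (a toy dataset invented for this illustration) and inspects the fitted values.

```python
import numpy as np

# Ten 1-D points: the first five in class 0, the last five in class 1
x = np.arange(10, dtype=float)
g = (x >= 5).astype(int)

X1 = np.column_stack([np.ones_like(x), x])   # add the intercept column
Y = np.eye(2)[g]                             # one-hot indicator matrix

B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
Y_hat = X1 @ B_hat

print(Y_hat.sum(axis=1))          # every row sums to 1
print(Y_hat.min(), Y_hat.max())   # but the fitted values leave [0, 1]
```

The rows sum to one because the model contains an intercept, yet the fitted line overshoots at both ends of the range.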

Another approach is to construct a target $t_k$ for each class, with $t_k$ the $k$-th column of $I_K$.
Observations are $y_i = t_k$ if $g_i = k$.
We fit the least-squares criterion:
$$\hat{B} = \arg \min_{B} \sum_{i=1}^N ||y_i - x_i^TB||^2$$

Classify an example:
$$\hat{G}(x) = \arg \min_{k} ||x^T\hat{B} - t_k||^2$$  

Actually, this model yields exactly the same results as the previous one.
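
The equivalence follows from $||\hat{y} - t_k||^2 = ||\hat{y}||^2 - 2\hat{y}_k + 1$: the closest target is the one with the largest fitted component. A quick numerical check, using random vectors as a stand-in for the fitted outputs $x^T\hat{B}$:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
Y_hat = rng.normal(size=(50, K))   # stand-in for fitted vectors x^T B_hat
T = np.eye(K)                      # row k is the target t_k

# Squared distance from every fitted vector to every target
dists = ((Y_hat[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)

closest_target = np.argmin(dists, axis=1)
largest_component = np.argmax(Y_hat, axis=1)
print(np.array_equal(closest_target, largest_component))
```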

This model doesn't work well when $K \geq 3$. Because of the rigid nature of regression, classes can be masked by others.  
A general rule is that with $K \geq 3$ classes, polynomial terms up to degree $K - 1$ might be needed to resolve them.  
Masking usually occurs for large $K$ and small $p$.  
Other methods like logistic regression and linear discriminant analysis don't suffer from masking.
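
Masking is easy to reproduce on a toy problem. The sketch below places three well-separated classes on the real line (a dataset invented for this illustration) and compares indicator regression on linear features with indicator regression on features augmented by $x^2$.

```python
import numpy as np

# Three well-separated classes on the real line, four points each
x = np.array([-6, -5, -4, -3, -1.5, -0.5, 0.5, 1.5, 3, 4, 5, 6])
g = np.repeat([0, 1, 2], 4)
Y = np.eye(3)[g]

def fit_predict(features):
    """Least-squares fit on the indicator matrix, then argmax."""
    B_hat, *_ = np.linalg.lstsq(features, Y, rcond=None)
    return np.argmax(features @ B_hat, axis=1)

ones = np.ones_like(x)
preds_linear = fit_predict(np.column_stack([ones, x]))
preds_quad = fit_predict(np.column_stack([ones, x, x**2]))

print('linear   :', preds_linear, 'acc:', np.mean(preds_linear == g))
print('quadratic:', preds_quad, 'acc:', np.mean(preds_quad == g))
```

With linear features the fitted value for the middle class is flat, so it is never the largest and class 1 is never predicted. Adding the quadratic term (degree $K - 1 = 2$) lets the middle class dominate in its own region.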

# Linear Discriminant Analysis

According to Bayes' theorem:

$$P(G = k | X = x) = \frac{P(X = x | G = k) P(G = k)}{P(X = x)}$$
$$P(G = k | X = x) = \frac{P(X = x | G = k) P(G = k)}{\sum_{l=1}^K P(X = x | G = l) P(G = l)}$$

Let $\pi_k$ be the prior probability of class $k$: $\pi_k = P(G = k)$.  
Let $f_k(x)$ be the class-conditional density of $X$ in class $G = k$: $P(X \in T | G = k) = \int_T f_k(x)dx$.

Thus, the posterior probability is:

$$P(G = k | X = x) = \frac{f_k(x) \pi_k}{\sum_{l=1}^K f_l(x) \pi_l}$$

Each class density is modeled as a multivariate Gaussian:

$$f_k(x) = \frac{\exp(-\frac{1}{2} (x-\mu_k)^T \Sigma^{-1} (x - \mu_k) )}{\sqrt{(2\pi)^p |\Sigma|}}$$

with:
- $\Sigma \in \mathbb{R}^{p \times p}$ the covariance matrix shared by all class densities.
- $\mu_k \in \mathbb{R}^p$ the mean vector for class density $k$.
- $|\Sigma| = \det(\Sigma)$  

Comparing two classes $k$ and $l$, the log-ratio of the posteriors is:

$$\log \frac{P(G = k | X = x)}{P(G = l | X = x)} = \log \frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T \Sigma^{-1}(\mu_k - \mu_l)$$

The log-odds function is linear in $x$, so the decision boundaries are linear.

The decision rule can be described with linear discriminant functions:

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k$$
$$G(x) = \arg \max_k \delta_k(x)$$  

The parameters are estimated from the training data:
$$\hat{\pi}_k = \frac{N_k}{N}$$
$$\hat{\mu}_k = \frac{\sum_{g_i = k} x_i}{N_k}$$
$$\hat{\Sigma} = \frac{\sum_{k=1}^K \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T}{N - K}$$

In [6]:
#Example with 3 classes

def lda(X, g, K):
    N = X.shape[0]
    p = X.shape[1]
    
    pis = []
    mus = []
    cov = np.zeros((p, p))
    
    for k in range(K):
        nk = np.sum(g == k)        
        pi = nk / N
        mu = np.zeros((p,))
        for i in range(N):
            if g[i] == k:
                mu += X[i]
        mu /= nk
        
        pis.append(pi)
        mus.append(mu)
        
    
    for i in range(N):
        cov += np.outer(X[i] - mus[g[i]], X[i] - mus[g[i]])
    cov /= (N - K)
    
    
    icov = np.linalg.inv(cov)
    B = np.empty((p, K))
    intercept = np.empty((K,))
    for k in range(K):
        B[:, k] = icov @ mus[k]
        intercept[k] = - 1/2 * (mus[k] @ icov @ mus[k]) + np.log(pis[k])
    
    return B, intercept


X, g = gen_toy_class(11700, noise=1e-5)
B, intercept = lda(X, g, 3)

Y_preds = X @ B + intercept
preds = np.argmax(Y_preds, axis=1)

print('acc:', np.mean(g == preds))

acc: 0.9889743589743589


## Quadratic Discriminant Analysis

If each $f_k(x)$ has its own covariance matrix $\Sigma_k$, the log-odds function and the discriminant functions become quadratic:

$$\delta_k(x) = - \frac{1}{2} \log | \Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$  

When $p$ is large, QDA causes a dramatic increase in the number of parameters.  
There is little difference between LDA applied to a dataset augmented with polynomials of degree $2$, and QDA.  

For QDA, the estimates of $\pi_k$ and $\mu_k$ stay the same, and the estimate of $\Sigma_k$ is:

$$\hat{\Sigma}_k = \frac{\sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T}{N_k - 1}$$

In [7]:
#Example with 3 classes

def qda(X, g, K):
    N = X.shape[0]
    p = X.shape[1]
    
    pis = []
    mus = []
    dcovs = []
    icovs = []
    
    for k in range(K):
        nk = np.sum(g == k)        
        pi = nk / N
        
        mu = np.zeros((p,))
        for i in range(N):
            if g[i] == k:
                mu += X[i]
        mu /= nk
        
        cov = np.zeros((p, p))
        for i in range(N):
            if g[i] == k:
                cov += np.outer(X[i] - mu, X[i] - mu)
        cov /= (nk - 1)
        
        pis.append(pi)
        mus.append(mu)
        dcovs.append(-1/2 * np.log(np.linalg.det(cov)))
        icovs.append(np.linalg.inv(cov))

    
    return pis, mus, dcovs, icovs

def qda_pred(x, pis, mus, dcovs, icovs):
    K = len(pis)
    y = np.empty((K,))
    
    for k in range(K):
        qt = -1/2 * (x - mus[k]) @ icovs[k] @ (x - mus[k])
        y[k] = dcovs[k] + qt + np.log(pis[k])
    
    return np.argmax(y)


X, g = gen_toy_class(11700, noise=1e-5)
pis, mus, dcovs, icovs = qda(X, g, 3)
preds = np.empty((len(X),))
for i in range(len(X)):
    preds[i] = qda_pred(X[i], pis, mus, dcovs, icovs)

print('acc:', np.mean(g == preds))

acc: 0.9684615384615385


## Regularized Discriminant Analysis

RDA is a compromise between LDA and QDA: it shrinks the separate covariances of QDA toward a common covariance as in LDA.

$$\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha) \hat{\Sigma}$$

with $\alpha \in [0, 1]$ a hyperparameter that allows a continuum of models between LDA and QDA.  

Another modification allows $\hat{\Sigma}$ to be shrunk toward a scalar covariance:

$$\hat{\Sigma}(\gamma) = \gamma \hat{\Sigma} + (1 - \gamma) \hat{\sigma}^2I$$
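
Both shrinkage steps can be combined in a few lines. This is a sketch only: the function name `rda_covariance` is hypothetical, and $\hat{\sigma}^2$ is assumed to be the average variance $\text{tr}(\hat{\Sigma}) / p$, which the text above does not specify.

```python
import numpy as np

def rda_covariance(cov_k, cov_pooled, alpha, gamma=1.0):
    """Blend a class covariance with the pooled one (alpha), after
    optionally shrinking the pooled one toward sigma^2 * I (gamma).

    sigma^2 is taken as trace(cov_pooled) / p, an assumption made
    for this sketch.
    """
    p = cov_pooled.shape[0]
    sigma2 = np.trace(cov_pooled) / p
    pooled = gamma * cov_pooled + (1 - gamma) * sigma2 * np.eye(p)
    return alpha * cov_k + (1 - alpha) * pooled

cov_k = np.array([[2.0, 0.5], [0.5, 1.0]])
cov_pooled = np.array([[1.0, 0.2], [0.2, 3.0]])

print(rda_covariance(cov_k, cov_pooled, alpha=1.0))   # QDA end of the range
print(rda_covariance(cov_k, cov_pooled, alpha=0.0))   # LDA end of the range
print(rda_covariance(cov_k, cov_pooled, alpha=0.0, gamma=0.0))
```

In practice $\alpha$ and $\gamma$ would be chosen on validation data or by cross-validation.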

## Computations for LDA

Computations can be simplified by diagonalization of $\Sigma$.  
For QDA:
$$\hat{\Sigma}_k = U_k D_k U^T_k$$
$$(x - \hat{\mu}_k)^T\Sigma^{-1}_k (x - \hat{\mu}_k) = [U_k^T(x - \hat{\mu}_k)]^T D^{-1}_k [U_k^T(x - \hat{\mu}_k)]$$
$$\log |\hat{\Sigma}_k| = \sum_l \log d_{kl}$$  

For LDA, we can project the data into a space where the common covariance estimate is $I$ (with the observations stored as rows of $X$):
$$\hat{\Sigma} = UDU^T$$
$$X^* \leftarrow X U D^{-\frac{1}{2}}$$
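
A minimal sketch of the sphering step, assuming the observations are stored as rows of `X`, so that each row is mapped by $x^* = D^{-1/2} U^T x$. The data and the mixing matrix `A` are synthetic, and the sample covariance stands in for the common covariance estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data; the mixing matrix is arbitrary
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, -1.0, 0.5]])
X = rng.normal(size=(500, 3)) @ A.T

cov = np.cov(X, rowvar=False)     # stand-in for the common covariance
d, U = np.linalg.eigh(cov)        # cov = U diag(d) U^T

# Sphere the rows of X: dividing column j by sqrt(d_j) applies D^{-1/2}
X_star = X @ U / np.sqrt(d)

cov_star = np.cov(X_star, rowvar=False)
print(np.round(cov_star, 6))      # identity up to rounding

# The log-determinant comes for free from the eigenvalues
logdet = np.sum(np.log(d))
print(logdet)
```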

## Reduced-Rank LDA

The $K$ centroids in $p$-dimensional input space lie in an affine subspace of dimension $\leq K - 1$. We might just as well project $X^*$ onto this centroid-spanning subspace $H_{K-1}$.  

We can also project $X^*$ onto a subspace $H_L$ for $L \leq K - 1$, where the projected centroids are spread out as much as possible in terms of variance.

### Algorithm

- Compute matrix of class centroids $M \in \mathbb{R}^{K \times p}$
$$M_k = \frac{1}{N_k} \sum_{g_i = k} x_i$$

- Compute within-class covariance matrix $W \in \mathbb{R}^{p \times p}$

$$W = \sum_{k=1}^K \sum_{g_i = k} (x_i - M_k) (x_i - M_k)^T$$

- Compute $M^* = MW^{-\frac{1}{2}}$, $M^* \in \mathbb{R}^{K \times p}$, using the eigendecomposition of $W$

$$P_W^T W P_W = D_W$$
$$W^{-\frac{1}{2}} = P_W D_W^{-\frac{1}{2}}P_W^T$$

- Compute between-class covariance matrix $B^*$ of $M^*$, $B^* \in \mathbb{R}^{p \times p}$
$$\mu = \frac{1}{K} \sum_{k=1}^K M_k^*$$

$$B^* = \sum_{k=1}^K N_k (M^*_k - \mu) (M^*_k - \mu)^T$$

- Diagonalize $B^*$: $B^* = V^* D_B V^{*T}$

- Project the data, with $v_l^*$ the $l$-th column of $V^*$:

$$v_l = W^{-\frac{1}{2}}v_l^*, \quad v_l^* \in \mathbb{R}^p$$
$$z_l = Xv_l, \quad z_l \in \mathbb{R}^N$$
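
The steps above can be sketched with NumPy as follows. The data is synthetic (three Gaussian classes with made-up means), and all variable names are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, p, n_per_class = 3, 4, 100
means = rng.normal(scale=3.0, size=(K, p))           # made-up class means
X = np.vstack([m + rng.normal(size=(n_per_class, p)) for m in means])
g = np.repeat(np.arange(K), n_per_class)

# Class centroids M and within-class scatter W
M = np.array([X[g == k].mean(axis=0) for k in range(K)])
R = X - M[g]
W = R.T @ R

# W^{-1/2} from the eigendecomposition of W
dw, Pw = np.linalg.eigh(W)
W_inv_sqrt = Pw @ np.diag(dw ** -0.5) @ Pw.T

# Sphered centroids and their between-class scatter B*
M_star = M @ W_inv_sqrt
Nk = np.bincount(g, minlength=K)
C = M_star - M_star.mean(axis=0)
B_star = (C * Nk[:, None]).T @ C

# Eigenvectors of B*, sorted by decreasing eigenvalue
db, V_star = np.linalg.eigh(B_star)
order = np.argsort(db)[::-1]
db, V_star = db[order], V_star[:, order]

# Back to the original coordinates, then project on L = 2 directions
V = W_inv_sqrt @ V_star
Z = X @ V[:, :2]
print(Z.shape)
```

Because `np.linalg.eigh` returns eigenvalues in ascending order, the explicit sort is what puts the most discriminating directions first.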

### Fisher Method

This is a different method that gives the same results.

Fisher LDA looks for a projection $Z = a^TX$ such that the between-class variance is maximized relative to the within-class variance.  

Let $B$ and $W$ be respectively the between-class and the within-class covariance of $X$.  Note that $T = B + W$ with $T$ the covariance matrix of $X$, ignoring class information.  
The between-class and within-class variance of $Z$ are respectively $a^TBa$ and $a^TWa$.  
The objective is:
$$\max_a \frac{a^TBa}{a^TWa}$$

The solution $a$ is the eigenvector corresponding to the largest eigenvalue of $W^{-1}B$.
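
As a sanity check of this claim, the sketch below compares the ratio reached by the top eigenvector of $W^{-1}B$ with the best ratio found among random directions. The matrices `B` and `W` are arbitrary symmetric stand-ins, not estimated from data.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
# Arbitrary symmetric matrices standing in for B (PSD) and W (PD)
Gb = rng.normal(size=(2, p))
B = Gb.T @ Gb
Gw = rng.normal(size=(20, p))
W = Gw.T @ Gw

def fisher_ratio(a):
    """Between-class over within-class variance of the projection a^T X."""
    return (a @ B @ a) / (a @ W @ a)

# Eigenvalues of W^{-1}B are real here; .real drops any zero imaginary part
evals, evecs = np.linalg.eig(np.linalg.inv(W) @ B)
top = np.argmax(evals.real)
a_best = evecs[:, top].real

random_dirs = rng.normal(size=(1000, p))
best_random = max(fisher_ratio(a) for a in random_dirs)
print(fisher_ratio(a_best), best_random)
```

The ratio at `a_best` equals the largest eigenvalue, and no random direction exceeds it.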

### Algorithm

- Compute matrix of class centroids $M \in \mathbb{R}^{K \times p}$
$$M_k = \frac{1}{N_k} \sum_{g_i = k} x_i$$

- Compute within-class covariance matrix $W \in \mathbb{R}^{p \times p}$

$$W = \sum_{k=1}^K \sum_{g_i = k} (x_i - M_k) (x_i - M_k)^T$$

- Compute between-class covariance matrix $B \in \mathbb{R}^{p \times p}$
$$\mu = \frac{1}{K} \sum_{k=1}^K M_k$$

$$B = \sum_{k=1}^K N_k (M_k - \mu) (M_k - \mu)^T$$

- Diagonalize $W^{-1}B$ (which is not symmetric in general, so $V$ need not be orthogonal): 
$$W^{-1}B = V D V^{-1}$$

- Project the data:
$$z_l = Xv_l, \quad z_l \in \mathbb{R}^N$$

$$Z = XV_L, \quad Z \in \mathbb{R}^{N \times L}$$

with the columns of $V_L$ being the $L$ eigenvectors corresponding to the largest eigenvalues of $W^{-1}B$.

In [12]:
N = 11700
K = 3
X, g = gen_toy_class(N, noise=1e-5)
p = X.shape[1]

#1) Compute class centroids M
M = np.zeros((K, p))
for k in range(K):
    nk = np.sum(g == k)        
    for i in range(N):
        if g[i] == k:
            M[k] += X[i]
    M[k] /= nk
    
#2) Compute within-class covariance W
W = np.zeros((p, p))
for i in range(N):
    W += np.outer(X[i] - M[g[i]], X[i] - M[g[i]])

#3) Compute between class covariance B
mu = np.mean(M, axis=0)
B = np.zeros((p, p))
for k in range(K):
    nk = np.sum(g == k) 
    B += nk * np.outer(M[k] - mu, M[k] - mu)
    
#4) Diagonalize W^-1B
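d, V = np.linalg.eig(np.linalg.inv(W) @ B)
# np.linalg.eig does not sort its output: order by decreasing eigenvalue
order = np.argsort(d.real)[::-1]
V = V[:, order].real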

#5) Project the data
Vr = V[:, :2]
Z = X @ Vr

#6) Make predictions
MZ = M @ Vr
preds = np.empty((N,), dtype=int)  # np.int was removed from recent NumPy versions
for i in range(N):
    
    min_k = None
    min_dist = float('inf')
    for k in range(K):
        d = (Z[i] - MZ[k]) @ (Z[i] - MZ[k])
        if d < min_dist:
            min_k = k
            min_dist = d
    
    preds[i] = min_k
    
print('acc:', np.mean(g == preds))

acc: 0.8701709401709402


# Logistic Regression

The model is defined by the log-odds of the posterior probabilities.

$$\log \frac{P(G = k | X = x)}{P(G = K | X = x)} = \beta_{k0} + \beta_{k}^T x, \quad k=1,\dots,K-1$$

It can be deduced that:

$$P(G = k | X = x) = \frac{\exp(\beta_{k0} + \beta_{k}^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_{l}^T x)}, \quad k=1,\dots,K-1$$
$$P(G = K | X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_{l}^T x)}$$
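
A short sketch of these posteriors, where each linear score is exponentiated and class $K$ serves as the reference. The function name and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def posteriors(x, beta0, beta):
    """Posterior probabilities for K classes, with class K as reference.

    beta0 has shape (K-1,) and beta has shape (K-1, p).
    """
    scores = np.exp(beta0 + beta @ x)       # exp of the K-1 linear scores
    denom = 1 + scores.sum()
    return np.append(scores / denom, 1 / denom)

# Hypothetical parameters for K = 3 classes and p = 2 features
beta0 = np.array([0.5, -1.0])
beta = np.array([[1.0, -2.0],
                 [0.3, 0.7]])
x = np.array([0.4, -0.1])

probs = posteriors(x, beta0, beta)
print(probs, probs.sum())                              # sums to 1
print(np.log(probs[:-1] / probs[-1]), beta0 + beta @ x)  # log-odds match
```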

The log-likelihood for $N$ observations is:

$$l(\theta) = \sum_{i=1}^N \log P(G=g_i | X = x_i; \theta)$$

Let's focus on the case $K = 2$, with a response $y_i$ such that $y_i = 1$ when $g_i = 1$ and $y_i = 0$ when $g_i = 2$:

$$l(\beta) = \sum_{i=1}^N \left[ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \right]$$

$$\text{with } p(x_i) = P(G=1|X=x_i) = \frac{\exp(\beta^Tx_i)}{1 + \exp(\beta^Tx_i)}$$

$$l(\beta) = \sum_{i=1}^N \left[ y_i \beta^Tx_i - \log(1 + \exp(\beta^Tx_i)) \right]$$

In order to maximize the log-likelihood, we solve:

$$\frac{\partial l(\beta)}{\partial \beta} = 0$$
$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^N x_i(y_i - p(x_i))$$

This can be solved using the Newton-Raphson algorithm, which requires the second derivatives:

$$\frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T} = - \sum_{i=1}^N x_ix_i^T p(x_i)(1-p(x_i))$$
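
The gradient formula can be verified against central finite differences of the log-likelihood. This is a sketch on random synthetic data; the function names are choices made for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 40, 3
X = rng.normal(size=(N, p))
y = rng.integers(0, 2, size=N).astype(float)
beta = rng.normal(size=p)

def log_likelihood(b):
    s = X @ b
    return np.sum(y * s - np.log1p(np.exp(s)))

def gradient(b):
    prob = 1 / (1 + np.exp(-X @ b))
    return X.T @ (y - prob)

def hessian(b):
    prob = 1 / (1 + np.exp(-X @ b))
    return -(X * (prob * (1 - prob))[:, None]).T @ X

# Central finite differences of the log-likelihood
eps = 1e-6
num_grad = np.array([
    (log_likelihood(beta + eps * e) - log_likelihood(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
print(np.max(np.abs(num_grad - gradient(beta))))   # small discrepancy
print(np.linalg.eigvalsh(hessian(beta)))           # all negative: l is concave
```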

The update is:

$$\beta \leftarrow \beta - \left(\frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T}\right)^{-1} \frac{\partial l(\beta)}{\partial \beta}$$  

It can be rewritten in matrix form as:

$$\frac{\partial l(\beta)}{\partial \beta} = X^T(y-p)$$
$$\frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T} = - X^TWX$$

with:
- $X \in \mathbb{R}^{N \times p}$ the matrix of features
- $p \in \mathbb{R}^N$ the vector of predictions
- $y \in \mathbb{R}^N$ the vector of labels
- $W \in \mathbb{R}^{N \times N}$ diagonal matrix: $W_{ii} = p_i(1-p_i)$ 

The update becomes:

$$\beta \leftarrow \beta + (X^TWX)^{-1}X^T (y - p)$$
$$\beta \leftarrow (X^TWX)^{-1}X^T Wz$$
$$\text{with } z = X \beta + W^{-1} (y-p)$$  

So the update is equivalent to solving a weighted least squares problem with output $z$:

$$\beta \leftarrow \arg \min_\beta (z - X\beta)^TW(z - X\beta)$$
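
The equivalence between the Newton step and the weighted least squares step can be checked on one iteration. This is a sketch on random synthetic data, starting from $\beta = 0$; only the diagonal of $W$ is stored, as a vector `w`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 30, 2
X = rng.normal(size=(N, p))
y = rng.integers(0, 2, size=N).astype(float)
beta = np.zeros(p)

prob = 1 / (1 + np.exp(-X @ beta))
w = prob * (1 - prob)                 # diagonal of W, kept as a vector

# Newton-Raphson step
beta_newton = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - prob))

# Same step written as weighted least squares on the adjusted response z
z = X @ beta + (y - prob) / w
beta_irls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

print(beta_newton, beta_irls)
```

Keeping `w` as a vector avoids building and inverting an $N \times N$ diagonal matrix, and `np.linalg.solve` avoids forming the explicit inverse of $X^TWX$.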

In [204]:
X, y = gen_toy_bin(117, noise=1e-5)


def logreg(X, y):
    n = X.shape[0]
    p = X.shape[1]
    
    #works a lot better when init at 0
    beta = np.zeros((p,))
    #beta = np.random.randn(p)
    
    for i in range(5):
        p = np.exp(X @ beta) / (1 + np.exp(X @ beta))
        l = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        W = np.diag(p * (1-p))
        
        #IRLS update
        z = X @ beta + np.linalg.inv(W) @ (y - p)
        beta = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ z

        '''
        #newton update
        g = X.T @ (y - p)
        H = - X.T @ W @ X
        beta = beta - np.linalg.inv(H) @ g
        '''
        
        print('loss:', l)
    
    
    return beta

Xc = np.mean(X, axis=0, keepdims=True)
Xs = np.std(X, axis=0, keepdims=True)
X2 = (X - Xc) / Xs
beta = logreg(X2, y)

y_hat = np.exp(X2 @ beta) / (1 + np.exp(X2 @ beta))
preds = np.round(y_hat).astype(int)  # np.int was removed from recent NumPy versions
print('beta:', beta)
print('acc:', np.mean(y == preds))

loss: -81.0982201255136
loss: -45.425338172882384
loss: -30.34245722892736
loss: -20.693186675835356
loss: -15.169604917644264
beta: [  3.06500292   5.19078785 -10.35015818  -1.07264767]
acc: 0.9658119658119658
