# LDA & QDA From Scratch

# LDA

LDA (Linear Discriminant Analysis) is a **supervised** learning technique that can be used for classification and/or dimensionality reduction.

There seem to be two main ways to introduce LDA, which correspond to Fisher's (1936) original approach and Welch's (1939) later derivation. I'll start with Fisher's approach, which seems to avoid (or at least hide) many of the assumptions.

A lot of people introduce LDA by likening it to PCA, an unsupervised dimensionality reduction algorithm for finding the directions of greatest variance in a dataset. The main difference is that LDA uses the class labels (thus making it supervised learning) to find the directions (or combinations of attributes, if you prefer) that give the greatest separation *between classes*.

Let's consider the simple case of two classes, and an $n$-dimensional input vector $\mathbf{x}$. We hope to project that vector down to one dimension using $y=\mathbf{w}^T \mathbf{x}$ with the appropriate weights $\mathbf{w}$, which we hope to find. But how do we find this projection/direction that maximizes separability?

One way to achieve this is by maximizing the following quantity:

$$
\frac{\text{Variance Between Classes}}{\text{Variance Within Classes}}
$$

So let's find expressions for the numerator and the denominator. To write these variances, we will make use of the covariance matrix.

To set things up, let's say I have a class $C_i$. To find the covariance matrix, I will require the sample mean over all features for the samples in $C_i$, which is an $n\times 1$ matrix:

$$
\mathbf{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \mathbf{x}
$$

The sample covariance matrix for the class $C_i$ can then be defined succinctly as an outer product:

$$
S_i = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{\mu_i})(\mathbf{x} - \mathbf{\mu_i})^T
$$

which is a symmetric, positive semi-definite, $n\times n$ matrix. The $j,j$ entry of $S_i$ is the variance of the $j^{th}$ feature of the samples in class $i$. The $j,k$ entry, meanwhile, is the covariance between the $j^{th}$ and $k^{th}$ features of the class $i$ samples.
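As a rough sketch of the two definitions above, here is how the class mean and the covariance matrix might be computed with NumPy. The function name `class_mean_and_cov` and the random data are my own placeholders, and I'm assuming the samples of a class are stored as the rows of an array.

```python
import numpy as np

def class_mean_and_cov(X):
    """Mean vector and (1/N)-normalized covariance of one class.

    X is assumed to have shape (N_i, n): one row per sample.
    """
    N = X.shape[0]
    mu = X.mean(axis=0)            # shape (n,), the sample mean of each feature
    centered = X - mu              # subtract the mean from every row
    S = centered.T @ centered / N  # sum of outer products divided by N, shape (n, n)
    return mu, S

# Hypothetical data: 50 samples with 3 features
rng = np.random.default_rng(0)
X_i = rng.normal(size=(50, 3))
mu_i, S_i = class_mean_and_cov(X_i)
```

The diagonal of `S_i` should match `X_i.var(axis=0)`, and the whole matrix should match `np.cov(X_i, rowvar=False, bias=True)`, which uses the same $1/N_i$ normalization.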

But remember, what we really care about is the variance between and within classes *once the data has been projected*, and we seek the projection that makes the ratio of the two as large as possible. Once projected, the mean of the samples in class $C_i$ is $\mathbf{w}^T \mathbf{\mu_i}$. The variance of the projected samples, meanwhile, can be written as $\mathbf{w}^T S_i \mathbf{w}$. Let's take a look at why this is.

$$
\mathbf{w}^T S_i \mathbf{w} = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \mathbf{w}^T(\mathbf{x} - \mathbf{\mu_i})(\mathbf{x} - \mathbf{\mu_i})^T \mathbf{w} = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \left[\mathbf{w}^T(\mathbf{x} - \mathbf{\mu_i})\right]^2 \geq 0
$$

where the last step can be confirmed by noting that $\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T \mathbf{y} = (\mathbf{x}^T \mathbf{y})^T = \mathbf{y}^T \mathbf{x}$. From this, we confirm that $S_i$ is positive semi-definite. It is positive definite, and therefore non-singular (a useful fact later), whenever the centered samples of class $i$ span all $n$ dimensions, which requires at least $n+1$ samples. But why does this expression equal the variance of the projected samples? Let's do a little more rearranging...

$$
\frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \left[\mathbf{w}^T(\mathbf{x} - \mathbf{\mu_i})\right]^2 = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \left[\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{\mu_i}\right]^2
$$

where $\mathbf{w}^T\mathbf{x}$ is our projected sample and $\mathbf{w}^T\mathbf{\mu_i}$ is our projected mean. Thus, our expression is the average of the squared deviations from the mean, which is just the variance!
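If you'd like to convince yourself numerically, the identity can be checked on made-up data. This is only a sanity check under my own assumptions (random samples, an arbitrary $\mathbf{w}$, rows as samples), not part of the derivation.

```python
import numpy as np

# Hypothetical data: 40 samples with 3 features, and an arbitrary direction w
rng = np.random.default_rng(1)
X_i = rng.normal(size=(40, 3))
w = np.array([0.5, -1.0, 2.0])

mu_i = X_i.mean(axis=0)
centered = X_i - mu_i
S_i = centered.T @ centered / len(X_i)  # covariance with 1/N normalization

quad_form = w @ S_i @ w                 # w^T S_i w

projected = X_i @ w                     # w^T x for every sample
projected_mean = projected.mean()       # should equal w^T mu_i
projected_var = projected.var()         # NumPy's default also divides by N

print(np.isclose(quad_form, projected_var))
print(np.isclose(projected_mean, w @ mu_i))
```

Both checks should print `True`, up to floating-point tolerance.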

Now let's get an expression for the variance within classes, or "within-class scatter", of the projected samples. This value is simply the sum of the within-class variances for all classes $i$: $\sum_{i} \mathbf{w}^T S_i \mathbf{w} = \mathbf{w}^T (\sum_{i}S_i) \mathbf{w} = \mathbf{w}^T S_W \mathbf{w}$, where we pulled $\mathbf{w}$ through the sum using the distributive property of matrix multiplication. For our example of just two classes, we write this as $\mathbf{w}^T S_0 \mathbf{w} + \mathbf{w}^T S_1 \mathbf{w}$, or $\mathbf{w}^T S_W \mathbf{w}$ where $S_W = S_0 + S_1$.
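Here is a small sketch of the two-class within-class scatter, with the classes labeled 0 and 1. The helper name `scatter` and the synthetic data are assumptions of mine, chosen only for illustration.

```python
import numpy as np

def scatter(X):
    """Covariance of one class with 1/N normalization; rows of X are samples."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered / X.shape[0]

# Hypothetical two-class data in 2 dimensions
rng = np.random.default_rng(2)
X0 = rng.normal(loc=[0.0, 0.0], size=(30, 2))
X1 = rng.normal(loc=[3.0, 1.0], size=(45, 2))

S_W = scatter(X0) + scatter(X1)  # within-class scatter matrix

w = np.array([1.0, -0.5])        # an arbitrary projection direction
within_quadratic = w @ S_W @ w                   # w^T S_W w
within_direct = (X0 @ w).var() + (X1 @ w).var()  # sum of projected variances
```

The two quantities `within_quadratic` and `within_direct` should agree, which is the "pull $\mathbf{w}$ through the sum" step in numbers.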

For the variance between classes, or "between-class scatter", we take a slightly different approach. We want to choose $\mathbf{w}$ such that the difference between the projected means is large. We can write the squared distance between the projected means of the two classes as follows:

$$
\left\lVert \mathbf{w}^T\mathbf{\mu_0} - \mathbf{w}^T\mathbf{\mu_1}\right\rVert^2 = \left\lVert \mathbf{w}^T(\mathbf{\mu_0} - \mathbf{\mu_1})\right\rVert^2 = \left\lVert (\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}\right\rVert^2 = ((\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w})^T((\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}) = \mathbf{w}^T (\mathbf{\mu_0} - \mathbf{\mu_1}) (\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}
$$

where we can view $(\mathbf{\mu_0} - \mathbf{\mu_1})(\mathbf{\mu_0} - \mathbf{\mu_1})^T$ as a between-class covariance matrix $S_B$. Therefore, the between-class scatter may be written more simply as $\mathbf{w}^T S_B \mathbf{w}$.
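The chain of equalities above can also be checked numerically. In this sketch the data and the direction `w` are made up, and $S_B$ is built with `np.outer` from the difference of the class means.

```python
import numpy as np

# Hypothetical two-class data in 2 dimensions
rng = np.random.default_rng(3)
X0 = rng.normal(loc=[0.0, 0.0], size=(30, 2))
X1 = rng.normal(loc=[3.0, 1.0], size=(45, 2))

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
diff = mu0 - mu1
S_B = np.outer(diff, diff)  # (mu0 - mu1)(mu0 - mu1)^T, an n x n matrix

w = np.array([1.0, -0.5])   # an arbitrary projection direction
between_quadratic = w @ S_B @ w             # w^T S_B w
between_direct = (w @ mu0 - w @ mu1) ** 2   # squared gap between projected means
```

Since $S_B$ is the outer product of a single vector with itself, it has rank one here, which `np.linalg.matrix_rank(S_B)` should confirm.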

Thus, we finally arrive at an expression for the separation $s(\mathbf{w})$:

$$
s(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} = \frac{\mathbf{w}^T (\mathbf{\mu_0} - \mathbf{\mu_1}) (\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}}{\mathbf{w}^T S_0 \mathbf{w} + \mathbf{w}^T S_1 \mathbf{w}}
$$

where the last equality is for our simple two-class example. To find the optimum $\mathbf{w}$ we need to take the gradient $\nabla s(\mathbf{w})$ and see where it vanishes.
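Before doing any calculus, it can help to simply evaluate $s(\mathbf{w})$ for a few candidate directions. The following is a minimal sketch for the two-class case; the function name `separation` and the synthetic data (two blobs offset along the first axis) are my own assumptions.

```python
import numpy as np

def separation(w, X0, X1):
    """Ratio s(w) = (w^T S_B w) / (w^T S_W w) for two classes.

    X0 and X1 hold the samples of each class as rows.
    """
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    c0, c1 = X0 - mu0, X1 - mu1
    S_W = c0.T @ c0 / len(X0) + c1.T @ c1 / len(X1)  # S_0 + S_1
    diff = mu0 - mu1
    S_B = np.outer(diff, diff)
    return (w @ S_B @ w) / (w @ S_W @ w)

# Hypothetical data: the classes differ mainly along the first axis
rng = np.random.default_rng(4)
X0 = rng.normal(loc=[0.0, 0.0], size=(200, 2))
X1 = rng.normal(loc=[3.0, 0.0], size=(200, 2))

s_x = separation(np.array([1.0, 0.0]), X0, X1)       # project onto the first axis
s_y = separation(np.array([0.0, 1.0]), X0, X1)       # project onto the second axis
s_scaled = separation(np.array([10.0, 0.0]), X0, X1) # same direction, rescaled
```

With this data, projecting onto the first axis should give a much larger separation than projecting onto the second. Note also that `s_scaled` equals `s_x`: rescaling $\mathbf{w}$ multiplies the numerator and denominator by the same factor, so only the direction of $\mathbf{w}$ matters.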


