# LDA & QDA From Scratch

# LDA

LDA (Linear Discriminant Analysis) is a **supervised** learning technique that can be used for classification and/or dimensionality reduction.

There seem to be two main ways to introduce LDA, which correspond to Fisher's (1936) original approach and Welch's (1939) later derivation. I'll start by considering Fisher's approach, which seems to avoid (or at least hides?) many of the assumptions. 

A lot of people introduce LDA by likening it to PCA, an unsupervised dimensionality reduction algorithm for finding the directions of greatest variance in a dataset. The main difference is that LDA uses the class labels (thus making it supervised learning) to find the directions (or combinations of attributes, if you prefer) that give the greatest separation *between classes*.

Let's consider the simple case of two classes, and an $n$-dimensional input vector $\mathbf{x}$. We hope to project that vector down to one dimension using $y=\mathbf{w}^T \mathbf{x}$ with the appropriate weights $\mathbf{w}$, which we hope to find. But how do we find this projection/direction that maximizes separability?

One way to achieve this is by maximizing the following quantity:

$$
\frac{\text{Variance Between Classes}}{\text{Variance Within Classes}}
$$

So let's find expressions for the numerator and the denominator. To write these variances, we will make use of the covariance matrix.

To set things up, let's say I have a class $C_i$. To find the covariance matrix, I will require the sample mean over all features for the samples in $C_i$, which is an $n\times 1$ matrix:

$$
\mathbf{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \mathbf{x}
$$

The sample covariance matrix for the class $C_i$ can then be defined succinctly as an outer product:

$$
S_i = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{\mu_i})(\mathbf{x} - \mathbf{\mu_i})^T
$$
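The two definitions above translate almost directly into NumPy. The following is a minimal sketch, assuming the samples of one class are stored as the rows of an array; the function name `class_mean_and_cov` and the toy data are made up for illustration.

```python
import numpy as np

def class_mean_and_cov(X_i):
    """Return (mu_i, S_i) for one class; X_i has shape (N_i, n), one sample per row."""
    N_i = X_i.shape[0]
    mu_i = X_i.sum(axis=0) / N_i        # class mean, shape (n,)
    centered = X_i - mu_i               # each row is (x - mu_i)^T
    S_i = centered.T @ centered / N_i   # sum of outer products, divided by N_i
    return mu_i, S_i

rng = np.random.default_rng(0)          # toy data, purely illustrative
X_i = rng.normal(size=(50, 3))
mu_i, S_i = class_mean_and_cov(X_i)
print(mu_i.shape, S_i.shape)
```

Note that `centered.T @ centered` computes the whole sum of outer products in one matrix product, and that dividing by $N_i$ (rather than $N_i - 1$) matches the definition used here.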

which is a symmetric, positive semi-definite $n\times n$ matrix (and positive definite, hence invertible, as long as the centered samples span all $n$ dimensions). The $j,j$ entry of $S_i$ is the variance of the $j^{th}$ feature of the samples in class $i$. The $j,k$ entry meanwhile is the covariance between the $j^{th}$ and $k^{th}$ features of the class $i$ samples.

But remember, what we really care about is the variance between and within classes *once the data has been projected*, and we seek the projection that makes the ratio of the two as large as possible. Once projected, the mean of the samples in class $C_i$ is $\mathbf{w}^T \mathbf{\mu_i}$. The variance of the projected samples, meanwhile, can be written as $\mathbf{w}^T S_i \mathbf{w}$. Let's take a look at why this is.

$$
\mathbf{w}^T S_i \mathbf{w} = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \mathbf{w}^T(\mathbf{x} - \mathbf{\mu_i})(\mathbf{x} - \mathbf{\mu_i})^T \mathbf{w} = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \left[\mathbf{w}^T(\mathbf{x} - \mathbf{\mu_i})\right]^2 \geq 0
$$

where the last step can be confirmed by noting that $\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T \mathbf{y} = (\mathbf{x}^T \mathbf{y})^T = \mathbf{y}^T \mathbf{x}$, so that $(\mathbf{x} - \mathbf{\mu_i})^T \mathbf{w} = \mathbf{w}^T(\mathbf{x} - \mathbf{\mu_i})$. From this, we confirm that $S_i$ is positive semi-definite. It is positive definite, and therefore non-singular (a useful fact later), whenever the centered samples $\mathbf{x} - \mathbf{\mu_i}$ span all $n$ dimensions, which we will assume from here on. But why does this expression equal the variance of the projected samples? Let's do a little more rearranging...

$$
\frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \left[\mathbf{w}^T(\mathbf{x} - \mathbf{\mu_i})\right]^2 = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \left[\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{\mu_i}\right]^2
$$

where $\mathbf{w}^T\mathbf{x}$ is our projected sample and $\mathbf{w}^T\mathbf{\mu_i}$ is our projected mean. Thus, our expression is the mean of the squared deviations from the projected mean, which is just the variance!
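This identity is easy to check numerically. The snippet below is a small sanity check on made-up random data (the seed, sizes, and variable names are arbitrary choices, not part of the derivation): it compares the quadratic form $\mathbf{w}^T S_i \mathbf{w}$ with the variance of the projected samples computed directly.

```python
import numpy as np

rng = np.random.default_rng(1)                # toy data, purely illustrative
X_i = rng.normal(size=(200, 4))               # 200 samples of one class, n = 4
w = rng.normal(size=4)                        # an arbitrary projection direction

mu_i = X_i.mean(axis=0)
S_i = (X_i - mu_i).T @ (X_i - mu_i) / len(X_i)

y = X_i @ w                                   # projected samples w^T x
quadratic_form = w @ S_i @ w                  # w^T S_i w
projected_variance = y.var()                  # population variance (divides by N_i)

print(np.isclose(quadratic_form, projected_variance))
print(np.isclose(y.mean(), w @ mu_i))         # projected mean equals w^T mu_i
```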

Now let's get an expression for the Variance Within Classes or "within-class scatter" of the projected samples. This value is simply the sum of the within-class variances for all classes $i$: $\sum_{i} \mathbf{w}^T S_i \mathbf{w} = \mathbf{w}^T (\sum_{i}S_i) \mathbf{w} = \mathbf{w}^T S_W \mathbf{w}$, where we pulled $\mathbf{w}$ through the sum using the distributive property of matrix multiplication. For our example of just two classes, we write this as $\mathbf{w}^T S_0 \mathbf{w} + \mathbf{w}^T S_1 \mathbf{w}$, or $\mathbf{w}^T S_W \mathbf{w}$ where $S_W = S_0 + S_1$.

For the Variance Between Classes or "between-class scatter", we take a slightly different approach. We want $\mathbf{w}$ to be such that the difference between the projected means is large. We can write the squared distance between the projected means of the two classes as follows:

$$
\left\lVert \mathbf{w}^T\mathbf{\mu_0} - \mathbf{w}^T\mathbf{\mu_1}\right\rVert^2 = \left\lVert \mathbf{w}^T(\mathbf{\mu_0} - \mathbf{\mu_1})\right\rVert^2 = \left\lVert (\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}\right\rVert^2 = ((\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w})^T((\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}) = \mathbf{w}^T (\mathbf{\mu_0} - \mathbf{\mu_1}) (\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}
$$

where we can view $(\mathbf{\mu_0} - \mathbf{\mu_1})(\mathbf{\mu_0} - \mathbf{\mu_1})^T$ as a between-class covariance matrix $S_B$. Therefore, the between-class scatter may be written more simply as $\mathbf{w}^T S_B \mathbf{w}$.

Thus, we finally arrive at an expression for the separation $s(\mathbf{w})$:

$$
s(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} = \frac{\mathbf{w}^T (\mathbf{\mu_0} - \mathbf{\mu_1}) (\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}}{\mathbf{w}^T S_0 \mathbf{w} + \mathbf{w}^T S_1 \mathbf{w}}
$$
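To get a feel for $s(\mathbf{w})$, here is a minimal sketch that evaluates it for two candidate directions on toy data. The helper names `two_class_scatter` and `separation` and the synthetic classes (separated only along the first axis) are assumptions made for illustration.

```python
import numpy as np

def two_class_scatter(X0, X1):
    """Return (S_B, S_W) for two classes stored as row-sample arrays."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - mu0).T @ (X0 - mu0) / len(X0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    diff = mu0 - mu1
    return np.outer(diff, diff), S0 + S1

def separation(w, S_B, S_W):
    """Fisher's criterion s(w): a ratio of two quadratic forms."""
    return (w @ S_B @ w) / (w @ S_W @ w)

rng = np.random.default_rng(3)                           # toy data
X0 = rng.normal(size=(200, 2))
X1 = rng.normal(size=(200, 2)) + np.array([3.0, 0.0])    # shifted along axis 0 only
S_B, S_W = two_class_scatter(X0, X1)

s_along = separation(np.array([1.0, 0.0]), S_B, S_W)     # direction of the shift
s_across = separation(np.array([0.0, 1.0]), S_B, S_W)    # perpendicular direction
print(s_along > s_across)
```

Because $\mathbf{w}$ appears quadratically in both numerator and denominator, rescaling $\mathbf{w}$ leaves $s(\mathbf{w})$ unchanged, which is why only the direction matters.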

where the last equality is for our simple two-class example. To find the optimum $\mathbf{w}$ we need to take the gradient $\nabla s(\mathbf{w})$ and see where it vanishes. This matrix calculus can get pretty bad, so feel free to skip it, but I will try to provide a short intro in an aside.

#### Aside: Taking the gradient of $s(\mathbf{w})$ ####

To start, let's look at the gradient of the general expression $\mathbf{x}^T \mathbf{A} \mathbf{x}$, where $\mathbf{x}$ is an $n \times 1$ vector, and $\mathbf{A}$ is an $n \times n$ matrix.

We know that the quadratic form $\mathbf{x}^T \mathbf{A} \mathbf{x}$ is a scalar; let's call it $\alpha$. Breaking apart the expression, we can get the following relation:

$$
\alpha = \mathbf{x}^T \mathbf{A} \mathbf{x} = \mathbf{x}^T (\mathbf{A} \mathbf{x}) = \sum_{i=1}^n x_i (\mathbf{A} \mathbf{x})_i = \sum_{i=1}^n x_i \sum_{j=1}^n a_{ij}x_j = \sum_{i=1}^n \sum_{j=1}^n a_{ij}x_ix_j
$$

We now differentiate $\alpha$ with respect to the vector $\mathbf{x}$, which will give us a $1 \times n$ row vector. If this size confuses you, I highly suggest [this reference](https://atmos.washington.edu/~dennis/MatrixCalculus.pdf), which is an excellent resource on matrix calculus. The main point is that $\frac{d\mathbf{x}}{d\alpha}$ is a column vector, while $\frac{d\alpha}{d\mathbf{x}}$ is a row vector by convention. Let's carry out the differentiation below by taking the derivative with respect to one element of $\mathbf{x}$, $x_k$:

$$
\frac{d\alpha}{d x_k} = \frac{d}{dx_k}\sum_{i=1}^n \sum_{j=1}^n a_{ij}x_ix_j = \sum_{i=1}^n a_{ik}x_i + \sum_{j=1}^n a_{kj}x_j
$$

The summation term on the left comes about by considering when $j=k$ and the summation term on the right comes about when $i=k$ is considered. Let's first focus on the left term $\sum_{i=1}^n a_{ik}x_i$. This term is the dot product of the $k^{th}$ column of $\mathbf{A}$ with $\mathbf{x}$. And since we want our results to be row vectors, we can write it as the $k^{th}$ element of the row vector $\mathbf{x}^T \mathbf{A}$, that is, $(\mathbf{x}^T \mathbf{A})_k$. The second term $\sum_{j=1}^n a_{kj}x_j$ is just the dot product of the $k^{th}$ row of $\mathbf{A}$ with $\mathbf{x}$. This is simply $(\mathbf{A} \mathbf{x})_k$, but since we want it as a row vector, we take the transpose to get $(\mathbf{x}^T \mathbf{A}^T)_k$.

Thus we arrive at the following expression:

$$
\frac{d\alpha}{d x_k} = \sum_{i=1}^n a_{ik}x_i + \sum_{j=1}^n a_{kj}x_j = (\mathbf{x}^T \mathbf{A})_k + (\mathbf{x}^T \mathbf{A}^T)_k = (\mathbf{x}^T \mathbf{A} + \mathbf{x}^T \mathbf{A}^T)_k = (\mathbf{x}^T (\mathbf{A} + \mathbf{A}^T))_k
$$

Or writing it for not just one element:

$$
\frac{d \mathbf{x}^T \mathbf{A} \mathbf{x}}{d \mathbf{x}} = \mathbf{x}^T (\mathbf{A} + \mathbf{A}^T)
$$

which just simplifies to $2\mathbf{x}^T\mathbf{A}$, if the matrix is symmetric (as our covariance matrices are).
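If the index juggling feels shaky, a finite-difference check is a quick way to gain confidence in the result. The sketch below uses a random, deliberately non-symmetric $\mathbf{A}$; the helper `numerical_gradient` is a generic central-difference routine written for this illustration.

```python
import numpy as np

def quad_form(x, A):
    return x @ A @ x

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))                   # deliberately NOT symmetric
x = rng.normal(size=4)

analytic = x @ (A + A.T)                      # x^T (A + A^T)
numeric = numerical_gradient(lambda v: quad_form(v, A), x)
print(np.allclose(analytic, numeric, atol=1e-5))

A_sym = A + A.T                               # a symmetric matrix
print(np.allclose(x @ (A_sym + A_sym.T), 2 * x @ A_sym))   # reduces to 2 x^T A
```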

So now we are left to differentiate our expression as follows:

$$
\frac{d s(\mathbf{w})}{d\mathbf{w}} = \frac{d}{d\mathbf{w}}\frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}
$$

Using the quotient rule:

$$
\frac{d}{d\mathbf{w}}\frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} = \frac{(\mathbf{w}^T S_W \mathbf{w})(2\mathbf{w}^T S_B) - (\mathbf{w}^T S_B \mathbf{w})(2\mathbf{w}^T S_W)}{(\mathbf{w}^T S_W \mathbf{w})^2} \stackrel{\text{set}}{=}0 \\
\implies \mathbf{w}^T S_W = \left(\frac{\mathbf{w}^T S_W \mathbf{w}}{\mathbf{w}^T S_B \mathbf{w}}\right)\mathbf{w}^T S_B
$$

To clean this expression up, I know that the ratio $\frac{\mathbf{w}^T S_W \mathbf{w}}{\mathbf{w}^T S_B \mathbf{w}}$ is just some scalar I will call $K$. We can do this because we really only care about the direction of $\mathbf{w}$ and not its magnitude. I will also take the transpose of the expression (using the symmetry of $S_W$ and $S_B$) to get an equation for $\mathbf{w}$ and not $\mathbf{w}^T$.

$$
\mathbf{w}^T S_W = K \mathbf{w}^T S_B \\
\implies  S_W \mathbf{w} = K  S_B \mathbf{w}\\
\implies  \mathbf{w} = K  S^{-1}_W S_B \mathbf{w}\\
\text{For two classes:} \quad \mathbf{w} = K  (S_0 + S_1)^{-1} (\mathbf{\mu_0} - \mathbf{\mu_1})(\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}
$$

We recognize that $(\mathbf{\mu_0} - \mathbf{\mu_1})^T \mathbf{w}$ is also a scalar that can be absorbed into the constant, and finally discover the direction of $\mathbf{w}$:

$$
\mathbf{w} \propto (S_0 + S_1)^{-1} (\mathbf{\mu_0} - \mathbf{\mu_1})
$$

#### End Aside ####

Using the $\mathbf{w}$ above, we can indeed get a direction of maximum separation. If we further assume that $S_0 = S_1 = S$, that is, the covariance matrices are the same for all classes, then we get "Linear Discriminant Analysis." In this case, we set $\mathbf{w} = S^{-1} (\mathbf{\mu_0} - \mathbf{\mu_1})$ (since we don't care about scaling). A threshold for discrimination is then usually chosen based on the distribution of the projected samples. A popular choice for the threshold $b$ is the projection of the midpoint between the class means, $b = \mathbf{w}^T \left(\frac{\mathbf{\mu_0} + \mathbf{\mu_1}}{2}\right) =\frac{1}{2} (\mathbf{\mu_0} - \mathbf{\mu_1})^T S^{-1} (\mathbf{\mu_0} + \mathbf{\mu_1})$.
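Putting the pieces together, a two-class discriminant can be sketched in a few lines. This is only an illustrative sketch under some assumptions: it uses the sum of the two class covariance matrices for the direction (which differs from the shared-covariance choice only by scaling when $S_0 = S_1$), it takes the midpoint threshold described above, and the names `fit_fisher_lda` and `predict` as well as the toy data are made up.

```python
import numpy as np

def fit_fisher_lda(X0, X1):
    """Return (w, b): the Fisher direction and the midpoint threshold."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - mu0).T @ (X0 - mu0) / len(X0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    # Solve (S0 + S1) w = mu0 - mu1 rather than forming the inverse explicitly.
    w = np.linalg.solve(S0 + S1, mu0 - mu1)
    b = w @ (mu0 + mu1) / 2                   # projection of the midpoint of the means
    return w, b

def predict(X, w, b):
    # w^T mu0 > w^T mu1 by construction, so projections above b go to class 0.
    return np.where(X @ w > b, 0, 1)

rng = np.random.default_rng(5)                # toy data, purely illustrative
X0 = rng.normal(size=(100, 2))
X1 = rng.normal(size=(100, 2)) + np.array([4.0, 4.0])
w, b = fit_fisher_lda(X0, X1)

X = np.vstack([X0, X1])
y_true = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
accuracy = (predict(X, w, b) == y_true).mean()
print(accuracy)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a common numerical choice; it gives the same direction when the within-class matrix is invertible.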

However, often maximum likelihood methods will be used to model the samples from each class as coming from a Gaussian distribution. From these arguments, one can then find optimum thresholds using Bayesian arguments.

#### One other way of viewing the equation ####
(From Pattern Classification, Duda)

When we get to the equation below: 

$$
S_W \mathbf{w} = K  S_B \mathbf{w}
$$

We can replace the constant value $\frac{1}{K}$ with $\lambda$ and recognize that this is a generalized eigenvalue problem:

$$
S_B \mathbf{w} = \lambda  S_W \mathbf{w}
$$

In the event that $S_W$ is invertible, we arrive at the more typical eigenvalue problem:

$$
S_W^{-1}S_B \mathbf{w} = \lambda  \mathbf{w}
$$

Recognizing that $S_B\mathbf{w} = (\mathbf{\mu_0} - \mathbf{\mu_1})(\mathbf{\mu_0} - \mathbf{\mu_1})^T\mathbf{w}$ is always in the direction of $(\mathbf{\mu_0} - \mathbf{\mu_1})$, we get to our desired answer from before: $\mathbf{w} \propto S_W^{-1}(\mathbf{\mu_0} - \mathbf{\mu_1})$.

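The equivalence of the two views can be checked numerically. The sketch below, on made-up data, computes the leading eigenvector of $S_W^{-1}S_B$ with `np.linalg.eig` and compares its direction with $S_W^{-1}(\mathbf{\mu_0} - \mathbf{\mu_1})$. Taking `.real` is a precaution I am assuming is harmless here: the matrix is not symmetric, so the numerically zero eigenvalues may come back with tiny imaginary parts.

```python
import numpy as np

rng = np.random.default_rng(6)                # toy data, purely illustrative
X0 = rng.normal(size=(150, 3))
X1 = rng.normal(size=(150, 3)) + np.array([2.0, 1.0, 0.0])

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = (X0 - mu0).T @ (X0 - mu0) / len(X0) + (X1 - mu1).T @ (X1 - mu1) / len(X1)
diff = mu0 - mu1
S_B = np.outer(diff, diff)

# Eigen-decomposition of S_W^{-1} S_B (solve avoids forming the inverse).
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
top = np.argmax(eigvals.real)
w_eig = eigvecs[:, top].real

w_direct = np.linalg.solve(S_W, diff)         # closed-form direction from before

# Eigenvectors are defined only up to sign and scale, so compare directions.
cosine = abs(w_eig @ w_direct) / (np.linalg.norm(w_eig) * np.linalg.norm(w_direct))
print(np.isclose(cosine, 1.0))
```

The leading eigenvalue also equals $(\mathbf{\mu_0} - \mathbf{\mu_1})^T S_W^{-1} (\mathbf{\mu_0} - \mathbf{\mu_1})$, which is the maximum value of $s(\mathbf{w})$.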
### Multiple Classes ###

Often we care about classifying more than just two classes. If we say there are $M$ classes $C_1,...,C_M$, there are a couple of ways to handle the problem. The first is simpler and involves using methods that work with just two classes, as before.

These methods are mainly *One vs Rest* (OvR) and *One vs One* (OvO). In OvR, $M$ different binary classifiers are constructed. Each classifier works by attempting to discriminate between the samples in $C_i$ and all of the other samples in $C_{\neq i}$. The scores from each classifier are then aggregated in an appropriate way so as to predict the actual class of the sample. However, this method can be problematic because properly aggregating the scores can be difficult (calibration between different classifiers is needed) and the number of samples in $C_i$ is often quite small compared to the rest of the samples in $C_{\neq i}$.

In OvO, $\binom{M}{2}$ classifiers are trained, one for each pair of classes. Each classifier then votes on the identity of the sample, and the sample is labeled with whichever class gets the most votes (ties, where two or more classes receive the same number of votes, must be handled separately).
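The voting step can be sketched independently of how the pairwise classifiers are trained. In the toy example below the binary classifiers are hypothetical stand-ins (a nearest-center rule on 1-D inputs, not LDA), and breaking ties by class order is just one arbitrary choice among many.

```python
from collections import Counter
from itertools import combinations

def ovo_predict(x, pairwise_classifiers, classes):
    """Majority vote over all pairwise classifiers; ties go to the earliest class."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_classifiers[(a, b)](x)] += 1
    best = max(votes.values())
    tied = [c for c in classes if votes[c] == best]
    return tied[0]                            # tie-breaking rule: an assumption

# Hypothetical stand-in classifiers: pick whichever class center is closer.
centers = {0: 0.0, 1: 5.0, 2: 10.0}

def make_pair_classifier(a, b):
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b

classes = sorted(centers)
pairwise = {(a, b): make_pair_classifier(a, b) for a, b in combinations(classes, 2)}

print(len(pairwise))                          # M choose 2 classifiers for M = 3
print(ovo_predict(4.0, pairwise, classes))
```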

#### Generalizing the linear discriminant to multiple classes and higher dimensions ####

Alternatively, we can generalize our earlier analysis to multiple classes. Instead of seeking to project the two classes down to a one-dimensional space, we now have $M$ classes and will use $M-1$ discriminant functions to project our $n$-dimensional input space down to an $(M-1)$-dimensional space. Throughout this analysis we will assume that $n \geq M$.

The projection of the $n$-dimensional data to the $(M-1)$-dimensional space requires $M-1$ discriminants:

$$
y_i = \mathbf{w}_i^T \mathbf{x},  \hspace{1cm} i=1,...,M-1
$$

Conveniently, we can then put this in terms of matrices

$$
\mathbf{y} = \mathbf{W}^T \mathbf{x}
$$

where $\mathbf{W}$ is an $n \times (M-1)$ matrix, with the $i^{th}$ column of $\mathbf{W}$ being equal to $\mathbf{w}_i$.

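We must now generalize the within- and between-class scatter matrices to multiple classes. We start with the within-class scatter matrix because it is very similar in form to our earlier $S_W$. One change of convention: from here on we drop the $\frac{1}{N_i}$ normalization, so that each $S_i$ is a scatter matrix (a plain sum of outer products) rather than a covariance matrix. This is what makes the decomposition of the total scatter further down work out exactly:

$$
S_W = \sum_{i=1}^M S_i = \sum_{i=1}^M \sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{\mu_i})(\mathbf{x} - \mathbf{\mu_i})^T
$$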

where $\mathbf{\mu_i} = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \mathbf{x}$ is again the mean of the samples in class $i$.

A bit more work must be done to extend the "between-class scatter" matrix to multiple classes. Following Bishop (2006) who follows Duda and Hart's convention (1973), we will start by defining the total covariance matrix $S_T$:

$$
S_T = \sum_{\mathbf{x}} (\mathbf{x}-\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T
$$

where $\mathbf{\mu}$ is now the mean of all $N$ data points:

$$
\mathbf{\mu} = \frac{1}{N} \sum_{\mathbf{x}} \mathbf{x} = \frac{1}{N} \sum_{i=1}^{M} N_i\mathbf{\mu}_i
$$

Note that $N_i$ is just the number of samples in class $C_i$. Now being clever, we can write out $S_T$ in terms of quantities that we will recognize. One is simply $S_W$ and we will call the other quantity $S_B$:

$$
\begin{aligned}
S_T &= \sum_{i=1}^{M} \sum_{\mathbf{x} \in C_i} \left( \mathbf{x} - \mathbf{\mu}_i + \mathbf{\mu}_i - \mathbf{\mu} \right) \left( \mathbf{x} - \mathbf{\mu}_i + \mathbf{\mu}_i - \mathbf{\mu} \right)^T \\
&= \sum_{i=1}^{M} \sum_{\mathbf{x} \in C_i} \left( \mathbf{x} - \mathbf{\mu}_i \right) \left( \mathbf{x} - \mathbf{\mu}_i \right)^T + \sum_{i=1}^{M} \sum_{\mathbf{x} \in C_i} \left( \mathbf{\mu}_i - \mathbf{\mu} \right) \left( \mathbf{\mu}_i - \mathbf{\mu} \right)^T \\
&= S_W + \sum_{i=1}^{M} N_i \left( \mathbf{\mu}_i - \mathbf{\mu} \right) \left( \mathbf{\mu}_i - \mathbf{\mu} \right)^T = S_W + S_B
\end{aligned}
$$

The cross terms vanish in going from the first line to the second because $\sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{\mu}_i) = 0$ for every class.
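The decomposition $S_T = S_W + S_B$ is easy to verify on data. The following is a minimal sketch with made-up classes of unequal sizes; it takes $S_W$ as the unnormalized sum of outer products, exactly as it appears in the decomposition above, and the function name `scatter_decomposition` is an arbitrary choice.

```python
import numpy as np

def scatter_decomposition(X, labels):
    """Return (S_T, S_W, S_B) for row-sample data X with integer class labels."""
    mu = X.mean(axis=0)                            # overall mean
    n = X.shape[1]
    S_W, S_B = np.zeros((n, n)), np.zeros((n, n))
    for c in np.unique(labels):
        X_c = X[labels == c]
        mu_c = X_c.mean(axis=0)
        S_W += (X_c - mu_c).T @ (X_c - mu_c)       # within-class scatter of class c
        d = mu_c - mu
        S_B += len(X_c) * np.outer(d, d)           # N_i (mu_i - mu)(mu_i - mu)^T
    S_T = (X - mu).T @ (X - mu)                    # total scatter
    return S_T, S_W, S_B

rng = np.random.default_rng(8)                     # toy data, unequal class sizes
X = np.vstack([
    rng.normal(size=(30, 3)),
    rng.normal(size=(50, 3)) + 2.0,
    rng.normal(size=(20, 3)) - 1.0,
])
labels = np.repeat([0, 1, 2], [30, 50, 20])

S_T, S_W, S_B = scatter_decomposition(X, labels)
print(np.allclose(S_T, S_W + S_B))
```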

Now that we again have expressions for $S_W$ and $S_B$, we once again hope to maximize the variance between classes in our projected space over the variance within classes in the projected space. Using $\tilde{S}_W$ and $\tilde{S}_B$ to denote the scatter matrices *of the projected samples*, we can see the following:

$$
\tilde{S}_W = \mathbf{W}^T S_W \mathbf{W} \\ 
\tilde{S}_B = \mathbf{W}^T S_B \mathbf{W}
$$

where $\tilde{S}_W$ and $\tilde{S}_B$ are $(M-1) \times (M-1)$ matrices. (See Duda and Hart for more information on these transformed scatter matrices).

It turns out we commonly attempt to maximize the quotient of the determinants as shown below:

$$
J(\mathbf{W}) = \frac{|\mathbf{W}^T S_B \mathbf{W}|}{|\mathbf{W}^T S_W \mathbf{W}|}
$$

This is a sensible thing to do since the determinant is equal to the product of the eigenvalues, which give a measure of the variances. Therefore, a large determinant indicates large variances in the directions of focus. One can also use the trace, since it is equal to the sum of the eigenvalues.

Finally, it can be shown that weights are determined by the eigenvectors of $S_W^{-1}S_B$ that correspond to the $M-1$ largest eigenvalues!
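As a rough sketch of how this looks in practice, the code below builds $S_W$ and $S_B$ for three toy classes in four dimensions, takes the eigenvectors of $S_W^{-1}S_B$ with the $M-1$ largest eigenvalues as the columns of $\mathbf{W}$, and projects the data. The function names and data are assumptions made for illustration, and `.real` is used because `np.linalg.eig` works on a non-symmetric matrix and may return a complex array when the numerically zero eigenvalues pick up tiny imaginary parts.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Unnormalized within-class and between-class scatter matrices."""
    mu = X.mean(axis=0)
    n = X.shape[1]
    S_W, S_B = np.zeros((n, n)), np.zeros((n, n))
    for c in np.unique(labels):
        X_c = X[labels == c]
        mu_c = X_c.mean(axis=0)
        S_W += (X_c - mu_c).T @ (X_c - mu_c)
        S_B += len(X_c) * np.outer(mu_c - mu, mu_c - mu)
    return S_W, S_B

def lda_projection_matrix(X, labels):
    """Return (W, eigenvalues): the M-1 leading eigenvectors of S_W^{-1} S_B."""
    S_W, S_B = scatter_matrices(X, labels)
    M = len(np.unique(labels))
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1][: M - 1]    # indices of the largest M-1
    return eigvecs[:, order].real, eigvals.real[order]

rng = np.random.default_rng(9)                         # toy data: M = 3, n = 4
offsets = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
X = np.vstack([rng.normal(size=(60, 4)) + o for o in offsets])
labels = np.repeat([0, 1, 2], 60)

W, lam = lda_projection_matrix(X, labels)
Y = X @ W                                              # rows are y^T = x^T W
print(W.shape, Y.shape)
```

Each row of `Y` is a sample expressed in the $(M-1)$-dimensional discriminant space, matching $\mathbf{y} = \mathbf{W}^T \mathbf{x}$.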

One last note: the reason we cannot find more than $M-1$ weights or linear discriminants is that $S_B$ is the sum of $M$ outer-product matrices, each of which has rank at most 1. Because the class means satisfy the constraint $\sum_{i} N_i(\mathbf{\mu}_i - \mathbf{\mu}) = 0$, at most $M-1$ of the summed matrices are independent, so $S_B$ has a maximum rank of $M-1$. This also means it (and therefore $S_W^{-1}S_B$) has a maximum of $M-1$ nonzero eigenvalues.
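The rank argument can also be checked numerically. In this small sketch the class means and class sizes are made up, and the rank is counted from the singular values using a relative tolerance of my own choosing.

```python
import numpy as np

rng = np.random.default_rng(10)
M, n = 3, 5                                        # 3 classes in 5 dimensions
class_sizes = np.array([40, 25, 35])               # hypothetical N_i
class_means = rng.normal(size=(M, n))              # hypothetical mu_i, one per row

mu = class_sizes @ class_means / class_sizes.sum() # overall mean
S_B = sum(N_i * np.outer(m - mu, m - mu) for N_i, m in zip(class_sizes, class_means))

# Count singular values that are not negligible relative to the largest one.
singular_values = np.linalg.svd(S_B, compute_uv=False)
rank = int(np.sum(singular_values > 1e-10 * singular_values[0]))

constraint = class_sizes @ (class_means - mu)      # sum_i N_i (mu_i - mu)
print(rank, np.allclose(constraint, 0.0))
```

Even though $S_B$ is a $5 \times 5$ matrix built from three outer products, the linear constraint on the centered means removes one degree of freedom.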





