# $\S8.$ Linear Discriminant Analysis, Quadratic Discriminant Analysis, and Naive Bayes Classifier

**Author**: [Gilyoung Cheong](https://www.linkedin.com/in/gycheong/)

**References**
* ["The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman](https://hastie.su.domains/ElemStatLearn/) $\S4.3$
* The [Erdős Institute](https://www.erdosinstitute.org/) Data Science Boot Camp Spring 2024 Week 9 lecture notes (only available to members)
* [Wikipedia page on conditional probability distribution](https://en.wikipedia.org/wiki/Conditional_probability_distribution)
* [User guide page at scikit-learn](https://scikit-learn.org/stable/modules/lda_qda.html#lda-qda)

### General principle from Bayes rule

Given any events $A, B$ with $\mathbb{P}(A), \mathbb{P}(B) > 0$, we have
$$\mathbb{P}(A | B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B | A)\mathbb{P}(A)}{\mathbb{P}(B)}.$$

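Given any disjoint events $A_1, \dots, A_k$ of positive probability whose union is the whole sample space, and any event $B$ with $\mathbb{P}(B) > 0$, we have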

$$\mathbb{P}(A_i | B) = \frac{\mathbb{P}(A_i)\mathbb{P}(B | A_i)}{\mathbb{P}(B)} = \frac{\mathbb{P}(A_i)\mathbb{P}(B | A_i)}{\sum_{j=1}^k\mathbb{P}(A_j \cap B)} = \frac{\mathbb{P}(A_i)\mathbb{P}(B | A_i)}{\sum_{j=1}^k \mathbb{P}(A_j) \mathbb{P}(B | A_j)}.$$
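As a quick numerical illustration of the last formula, here is a minimal sketch in plain Python. The priors and likelihoods are made-up numbers chosen only for illustration; they do not come from any dataset.

```python
# Hypothetical numbers: three disjoint events A_1, A_2, A_3 covering the sample space.
priors = [0.5, 0.3, 0.2]          # P(A_j); these must sum to 1
likelihoods = [0.10, 0.40, 0.80]  # P(B | A_j)

# Denominator of Bayes rule: P(B) = sum_j P(A_j) P(B | A_j)
p_b = sum(p * l for p, l in zip(priors, likelihoods))

# Posterior probabilities: P(A_i | B) = P(A_i) P(B | A_i) / P(B)
posteriors = [p * l / p_b for p, l in zip(priors, likelihoods)]

print(p_b)
print(posteriors)
```

Note that the posteriors sum to $1$, and that the event with the smallest prior ends up with the largest posterior because $B$ is much more likely under it.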

### Probability density functions

Given a probability space $(\Omega, \Sigma, \mu)$, recall that a random variable $X : \Omega \rightarrow \mathbb{R}^m$ is a measurable map with respect to the Lebesgue measure on $\mathbb{R}^m$. Given any measurable subset $A \subset \mathbb{R}^m$, by definition, we have

$$\mathbb{P}(X \in A) = \mu(\{\omega \in \Omega : X(\omega) \in A\}) = \mu(X^{-1}(A)).$$

Consider the measure $\nu$ on $\mathbb{R}^m$ (the distribution of $X$) defined by

$$\nu(A) := \mathbb{P}(X \in A) = \mu(X^{-1}(A)).$$

Assume that $\nu$ is absolutely continuous with respect to the Lebesgue measure, that is, $\nu(A) = 0$ whenever $A$ has Lebesgue measure $0$. Then by the [Radon-Nikodym theorem](https://en.wikipedia.org/wiki/Radon%E2%80%93Nikodym_theorem), there exists an (almost everywhere) unique measurable function $f_X : \mathbb{R}^m \rightarrow \mathbb{R}_{\geq 0}$ such that

$$\mathbb{P}(X \in A) = \int_A f_X(\boldsymbol{x}) d \boldsymbol{x}.$$

We call this $f_X$ the **probability density** of $X$.
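To make the defining identity concrete, the following sketch compares both sides numerically in the simplest case $m = 1$. It assumes, purely for illustration, that $X$ is a standard normal random variable and that $A = [-1, 1]$. The integral is approximated with the midpoint rule and the probability with a sample frequency, so the two numbers agree only approximately.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_X(x):
    """Density of a standard normal random variable (illustrative choice, m = 1)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

a, b = -1.0, 1.0  # the set A = [a, b]

# Right-hand side: integral of f_X over A, approximated by the midpoint rule.
edges = np.linspace(a, b, 10_001)
mids = 0.5 * (edges[:-1] + edges[1:])
integral = np.sum(f_X(mids)) * (edges[1] - edges[0])

# Left-hand side: P(X in A), estimated by the fraction of samples landing in A.
samples = rng.standard_normal(200_000)
frequency = np.mean((samples >= a) & (samples <= b))

print(integral, frequency)
```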

### Conditional probability density functions

Fix $r \in \mathbb{Z}_{\geq 1}$. Given a probability space $(\Omega, \Sigma, \mu)$, let 
* $X : \Omega \rightarrow \mathbb{R}^m$ and
* $Y : \Omega \rightarrow \{1, 2, \dots, r\}$

be random variables, where we use the Lebesgue measure on $\mathbb{R}^m$ and the counting measure on $\{1, 2, \dots, r\}$ with the discrete $\sigma$-algebra. Given any measurable subsets $A \subset \mathbb{R}^m$ and $B \subset \{1, 2, \dots, r\}$, we have

* $\mathbb{P}(X \in A) = \mu(X^{-1}(A)) = \mu(\{\omega \in \Omega : X(\omega) \in A\})$ and
* $\mathbb{P}(Y \in B) = \mu(Y^{-1}(B)) = \mu(\{\omega \in \Omega : Y(\omega) \in B\})$.

Moreover, we have

$$\mathbb{P}(X \in A, Y \in B) = \mu(X^{-1}(A) \cap Y^{-1}(B)) = \mu(\{\omega \in \Omega : X(\omega) \in A \text{ and } Y(\omega) \in B\}),$$

so by an application of the Radon-Nikodym theorem to $(X, Y) : \Omega \rightarrow \mathbb{R}^m \times \{1, 2, \dots, r\}$, we may find an (almost everywhere) unique measurable function $f_{X,Y} : \mathbb{R}^m \times \{1, 2, \dots, r\} \rightarrow \mathbb{R}_{\geq 0}$ such that

$$\mathbb{P}(X \in A, Y \in B) = \mu(X^{-1}(A) \cap Y^{-1}(B)) = \int_{A \times B} f_{X,Y}(\boldsymbol{x}, y) d(\boldsymbol{x} \times y).$$

Recalling that $\{1,2, \dots, r\}$ has the discrete $\sigma$-algebra, the above can be written as

$$\mathbb{P}(X \in A, Y \in B) = \sum_{y \in B} \int_A f_{X,Y}(\boldsymbol{x},y) d\boldsymbol{x}.$$

We call this $f_{X,Y}$ the **joint probability density function** of $X$ and $Y$.

**Remark**. It follows that

$$\sum_{y=1}^r f_{X,Y}(\boldsymbol{x}, y) = f_{X}(\boldsymbol{x})$$

for almost all $\boldsymbol{x} \in \mathbb{R}^m$. From now on, we may assume that we made choices so that the above identity is true for all $\boldsymbol{x} \in \mathbb{R}^m$.

**Definition**. The **conditional probability density function** of $Y$ given $X$ is defined as
$$f_{Y|X}(\boldsymbol{x}, y) := \frac{f_{X,Y}(\boldsymbol{x},y)}{f_X(\boldsymbol{x})},$$

which is only defined for $\boldsymbol{x} \in \mathbb{R}^m$ such that $f_X(\boldsymbol{x}) > 0$. We may similarly define

$$f_{X|Y}(\boldsymbol{x}, y) := \frac{f_{X,Y}(\boldsymbol{x},y)}{\mathbb{P}(Y = y)},$$

for any $y \in \{1, 2, \dots, r\}$ such that $\mathbb{P}(Y = y) > 0$.

**Remark**. Write $f_Y(y) := \mathbb{P}(Y = y)$. Note that $X$ and $Y$ are independent if and only if
$$f_{X,Y}(\boldsymbol{x},y) = f_X(\boldsymbol{x})f_Y(y)$$

for almost all $\boldsymbol{x} \in \mathbb{R}^m$ and all $y \in \{1, 2, \dots, r\}$. This is equivalent to saying that
$$f_{Y|X}(\boldsymbol{x}, y) = f_{Y}(y)$$

for almost all $\boldsymbol{x} \in \mathbb{R}^m$ such that $f_X(\boldsymbol{x}) > 0$ and all $y \in \{1, 2, \dots, r\}$, which can be taken as the motivation for the definition.

**Notation**. We write 
* $\mathbb{P}(Y = y | X = \boldsymbol{x}) := f_{Y|X}(\boldsymbol{x}, y)$ and
* $\mathbb{P}(X = \boldsymbol{x} | Y = y) := f_{X|Y}(\boldsymbol{x}, y)$

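whenever defined. Strictly speaking, the second expression is the value of a density rather than a probability, since $\mathbb{P}(X = \boldsymbol{x}) = 0$ for every $\boldsymbol{x}$ when $X$ has a density.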

The following theorem is the analogue of the principle about the Bayes rule introduced above:

**Theorem**. Using the preceding notation, for any $\boldsymbol{x} \in \mathbb{R}^m$ and $y \in \{1,2, \dots, r\}$, we have

$$\mathbb{P}(Y = y | X = \boldsymbol{x}) = \frac{\mathbb{P}(Y = y) \mathbb{P}(X = \boldsymbol{x} | Y = y)}{\sum_{j=1}^r \mathbb{P}(Y = j) \mathbb{P}(X = \boldsymbol{x} | Y = j)},$$

given that all the expressions are defined.

*Proof*. By definition, we have $\mathbb{P}(Y = j) \mathbb{P}(X = \boldsymbol{x} | Y = j) = f_{X,Y}(\boldsymbol{x}, j)$ for each $j$, so the right-hand side is

$$\frac{f_{X,Y}(\boldsymbol{x}, y)}{\sum_{j=1}^r f_{X,Y}(\boldsymbol{x}, j)} = \frac{f_{X,Y}(\boldsymbol{x}, y)}{f_{X}(\boldsymbol{x})} = f_{Y|X}(\boldsymbol{x}, y),$$

which is identical to the left-hand side. $\Box$
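The theorem can be sketched numerically as follows. The example assumes, only for illustration, two classes whose conditional densities are one-dimensional Gaussians with made-up priors, means, and standard deviations; the helper names `gaussian_pdf` and `posterior` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Illustrative (made-up) parameters for two classes.
priors = np.array([0.7, 0.3])  # P(Y = 1), P(Y = 2)
means = np.array([0.0, 2.0])   # class-conditional means
stds = np.array([1.0, 1.0])    # class-conditional standard deviations

def posterior(x):
    """Return the vector of P(Y = j | X = x) over all classes j."""
    joint = priors * gaussian_pdf(x, means, stds)  # values f_{X,Y}(x, j)
    return joint / joint.sum()                     # divide by f_X(x)

print(posterior(1.0))
print(posterior(3.0))
```

At $x = 1$, the midpoint of the two means, both conditional densities take the same value, so the posterior reduces to the prior. Further to the right, the second class takes over.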

### Application in classification problems

Now, the idea is to apply the above theorem to a pair of distinct classes $j, k \in \{1, 2, \dots, r\}$. Since the denominator in the theorem does not depend on the class, assuming that all the expressions with the symbol $\mathbb{P}$ are positive, having

$$\mathbb{P}(Y = j | X = \boldsymbol{x}) = \mathbb{P}(Y = k | X = \boldsymbol{x})$$

is equivalent to

$$\mathbb{P}(Y = j) \mathbb{P}(X = \boldsymbol{x} | Y = j) = \mathbb{P}(Y = k) \mathbb{P}(X = \boldsymbol{x} | Y = k).$$

When we make further assumptions on the conditional density function

$$\mathbb{P}(X = \boldsymbol{x} | Y = y) = f_{X|Y}(\boldsymbol{x}, y),$$

the last equation cuts out a hypersurface in $\mathbb{R}^m$, the decision boundary between the classes $j$ and $k$, which we can use to classify data.

### Quadratic Discriminant Analysis

The following is the hypothesis we make for the Quadratic Discriminant Analysis (QDA):

**Hypothesis**. Given 
* the input dataset $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n \in \mathbb{R}^m$ and
* the output dataset $y_1, \dots, y_n \in \{1,2, \dots, r\}$

for training, we assume that the value for each conditional density function

$$\mathbb{P}(X = \boldsymbol{x} | Y = k) = f_{X|Y}(\boldsymbol{x}, k)$$

is approximated as follows:

$$\mathbb{P}(X = \boldsymbol{x} | Y = k) \approx \frac{1}{(2\pi)^{m/2} (\det(\Sigma_k))^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_k})^T \Sigma_k^{-1} (\boldsymbol{x} - \boldsymbol{\mu_k})\right),$$

where

* $\boldsymbol{\mu_k} := \frac{1}{N_k}\sum_{\substack{1\leq i \leq n \\ y_i = k}} \boldsymbol{x}_i$,
* $N_k := \#\{i \in \{1, 2, \dots, n\} : y_i = k\}$, and
* $\Sigma_k := \frac{1}{N_k - 1}\sum_{\substack{1\leq i \leq n \\ y_i = k}}(\boldsymbol{x}_i - \boldsymbol{\mu}_k)(\boldsymbol{x}_i - \boldsymbol{\mu}_k)^T$, the $m \times m$ matrix each of whose entries is the corresponding sample covariance conditional on $Y = k$.

We also have the approximation

$$\mathbb{P}(Y = k) \approx N_k / n,$$

the proportion of the training labels $y_i$ that are equal to $k$.

**Remark**. The above approximation models $X$ conditional on $Y = k$ by a [multivariate normal distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution) whose mean and covariance are estimated from the training data. When $\Sigma_k$ is not invertible, we may use its [pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse), which can be constructed by taking the transpose of the singular value decomposition and then replacing the nonzero singular values with their reciprocals. The determinant can be replaced by the product of the nonzero eigenvalues.
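The remark about singular covariance matrices can be sketched with NumPy as follows. The toy data are made up so that the second feature is exactly twice the first, which forces the sample covariance to have rank $1$. The relative tolerance used to decide which eigenvalues count as nonzero is an arbitrary choice for this illustration.

```python
import numpy as np

# Toy data (made up): the second feature is exactly twice the first,
# so the sample covariance matrix is singular.
X = np.array([[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]])
Sigma = np.cov(X, rowvar=False)  # 2 x 2 matrix, normalized by 1 / (N - 1)

# Moore-Penrose pseudoinverse, computed internally from the SVD.
Sigma_pinv = np.linalg.pinv(Sigma)

# Pseudo-determinant: product of the eigenvalues that are not numerically zero.
eigvals = np.linalg.eigvalsh(Sigma)  # Sigma is symmetric, so eigvalsh applies
tol = 1e-10 * eigvals.max()          # illustrative tolerance
pseudo_det = np.prod(eigvals[eigvals > tol])

print(np.linalg.matrix_rank(Sigma))
print(Sigma_pinv)
print(pseudo_det)
```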

Hence, after canceling the common factors $1/n$ and $(2\pi)^{-m/2}$, the classification hypersurfaces may be approximately given by

$$\frac{N_j}{(\det(\Sigma_j))^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_j})^T \Sigma_j^{-1} (\boldsymbol{x} - \boldsymbol{\mu_j})\right) = \frac{N_k}{(\det(\Sigma_k))^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_k})^T \Sigma_k^{-1} (\boldsymbol{x} - \boldsymbol{\mu_k})\right)$$

for distinct $j, k \in \{1, 2, \dots, r\}$. Assuming each $N_j \neq 0$, we may take the logarithm to have

$$\log(N_j) - \frac{1}{2}\log(\det(\Sigma_j)) - \frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_j})^T \Sigma_j^{-1} (\boldsymbol{x} - \boldsymbol{\mu_j}) = \log(N_k) - \frac{1}{2}\log(\det(\Sigma_k)) - \frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_k})^T \Sigma_k^{-1} (\boldsymbol{x} - \boldsymbol{\mu_k}),$$

each of which is a quadratic equation in $\boldsymbol{x}$.
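A minimal from-scratch sketch of the resulting classifier is given below. It assigns $\boldsymbol{x}$ to the class $k$ maximizing the logarithm of $N_k$ times the Gaussian density from the hypothesis, dropping the constant $(2\pi)^{m/2}$ shared by all classes, so the term $-\frac{1}{2}\log(\det(\Sigma_k))$ is kept. The training data are synthetic, and the helper names `fit_qda`, `qda_score`, and `predict_qda` are hypothetical. The sketch assumes every $\Sigma_k$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (made up): two classes in R^2, labeled 1 and 2.
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=200)
X2 = rng.multivariate_normal([3.0, 3.0], [[2.0, 0.5], [0.5, 0.5]], size=100)
X = np.vstack([X1, X2])
y = np.array([1] * 200 + [2] * 100)

def fit_qda(X, y):
    """Estimate (N_k, mu_k, Sigma_k) for every class k."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        # np.cov with rowvar=False uses the 1 / (N_k - 1) normalization.
        params[k] = (len(Xk), Xk.mean(axis=0), np.cov(Xk, rowvar=False))
    return params

def qda_score(x, N_k, mu_k, Sigma_k):
    """log(N_k) - log(det(Sigma_k)) / 2 - (x - mu_k)^T Sigma_k^{-1} (x - mu_k) / 2."""
    d = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return np.log(N_k) - 0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma_k, d)

def predict_qda(params, x):
    """Return the class with the largest score at x."""
    return max(params, key=lambda k: qda_score(x, *params[k]))

params = fit_qda(X, y)
print(predict_qda(params, np.array([0.0, 0.0])))
print(predict_qda(params, np.array([3.0, 3.0])))
```

The decision boundary between two classes is exactly the set where their scores agree. In practice one would likely reach for `QuadraticDiscriminantAnalysis` from scikit-learn instead of hand-rolled code.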

### Linear Discriminant Analysis

The following is the hypothesis we make for the Linear Discriminant Analysis (LDA):

**Hypothesis**. Given 
* the input dataset $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n \in \mathbb{R}^m$ and
* the output dataset $y_1, \dots, y_n \in \{1,2, \dots, r\}$

for training, we assume that the value for each conditional density function

$$\mathbb{P}(X = \boldsymbol{x} | Y = k) = f_{X|Y}(\boldsymbol{x}, k)$$

is approximated as follows:

$$\mathbb{P}(X = \boldsymbol{x} | Y = k) \approx \frac{1}{(2\pi)^{m/2} (\det(\Sigma))^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_k})^T \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu_k})\right),$$

where

* $\boldsymbol{\mu_k} := \frac{1}{N_k}\sum_{\substack{1\leq i \leq n \\ y_i = k}} \boldsymbol{x}_i$,
* $N_k := \#\{i \in \{1, 2, \dots, n\} : y_i = k\}$, and
* $\Sigma := \frac{1}{n - r}\sum_{k=1}^r\sum_{\substack{1\leq i \leq n \\ y_i = k}}(\boldsymbol{x}_i - \boldsymbol{\mu}_k)(\boldsymbol{x}_i - \boldsymbol{\mu}_k)^T$, the $m \times m$ matrix each of whose entries is the corresponding pooled within-class sample covariance.

For non-invertible $\Sigma$, we may apply the same strategy as in QDA.

Assuming each $N_j \neq 0$, for distinct $j, k \in \{1, 2, \dots, r\}$, the classification hypersurface is given by

$$\log(N_j) - \frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_j})^T \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu_j}) = \log(N_k) - \frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_k})^T \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu_k}).$$

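Since $\Sigma$ is symmetric, so is $\Sigma^{-1}$, and hence

$$(\boldsymbol{x} - \boldsymbol{\mu_j})^T \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu_j}) = \boldsymbol{x}^T \Sigma^{-1} \boldsymbol{x} - 2\boldsymbol{x}^T \Sigma^{-1} \boldsymbol{\mu_j} + \boldsymbol{\mu_j}^T \Sigma^{-1} \boldsymbol{\mu_j}.$$

The quadratic term $\boldsymbol{x}^T \Sigma^{-1} \boldsymbol{x}$ appears on both sides of the above equation, so we may cancel it to rewrite the equation as

$$\log(N_j) + \boldsymbol{x}^T\Sigma^{-1}\boldsymbol{\mu}_j - \frac{1}{2}\boldsymbol{\mu_j}^T \Sigma^{-1} \boldsymbol{\mu_j} = \log(N_k) + \boldsymbol{x}^T\Sigma^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu_k}^T \Sigma^{-1} \boldsymbol{\mu_k},$$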

which is linear in $\boldsymbol{x}$. Hence in this case, we get classifying hyperplanes.
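The cancellation can be checked numerically with the sketch below. It fits the class means and a pooled covariance on synthetic data, then compares the linear score $\log(N_k) + \boldsymbol{x}^T\Sigma^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^T\Sigma^{-1}\boldsymbol{\mu}_k$ with the full quadratic expression. The two should differ by the same amount for every class, so they select the same class. The normalization $1/(n - r)$ used here is an assumption (a common convention for the pooled estimate); adjust it to match the convention you follow. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (made up): three classes in R^2 sharing one covariance matrix.
true_means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
shared_cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X = np.vstack([rng.multivariate_normal(m, shared_cov, size=100) for m in true_means])
y = np.repeat([1, 2, 3], 100)

classes = np.unique(y)
n, r = len(y), len(classes)
N = np.array([np.sum(y == k) for k in classes])            # class counts N_k
mu = np.array([X[y == k].mean(axis=0) for k in classes])   # class means, shape (r, m)

# Pooled covariance: subtract from each sample the mean of its own class.
centered = X - mu[np.searchsorted(classes, y)]
Sigma = centered.T @ centered / (n - r)  # assumed normalization, see the lead-in
Sigma_inv = np.linalg.inv(Sigma)

# Linear scores: x^T Sigma^{-1} mu_k - mu_k^T Sigma^{-1} mu_k / 2 + log(N_k).
W = mu @ Sigma_inv  # row k is (Sigma^{-1} mu_k)^T because Sigma^{-1} is symmetric
b = np.log(N) - 0.5 * np.sum(W * mu, axis=1)

def linear_scores(x):
    return W @ x + b

def quadratic_scores(x):
    d = x - mu  # row k is x - mu_k
    return np.log(N) - 0.5 * np.sum((d @ Sigma_inv) * d, axis=1)

x = np.array([1.0, 2.0])
diff = quadratic_scores(x) - linear_scores(x)
print(diff)  # every entry equals -x^T Sigma^{-1} x / 2 up to rounding
print(classes[np.argmax(linear_scores(x))])
```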

### Naive Bayes Classifier

We start with the theorem above:

$$\mathbb{P}(Y = y | X = \boldsymbol{x}) = \frac{\mathbb{P}(Y = y) \mathbb{P}(X = \boldsymbol{x} | Y = y)}{\sum_{j=1}^r \mathbb{P}(Y = j) \mathbb{P}(X = \boldsymbol{x} | Y = j)}.$$

We also assume that the components of the random variable $X = (X_1, X_2, \dots, X_m) : \Omega \rightarrow \mathbb{R}^m$ are conditionally independent given $Y$. That is, writing $\boldsymbol{x} = (u_1, \dots, u_m)$, for any $k \in \{1, 2, \dots, r\}$, we assume that

$$\mathbb{P}(X = \boldsymbol{x} | Y = k) = \mathbb{P}(X_1 = u_1 | Y = k) \cdots \mathbb{P}(X_m = u_m | Y = k).$$

Then we assume that

$$\mathbb{P}(X_j = u_j | Y = k) \approx \frac{1}{(2\pi \sigma_{jk}^2)^{1/2}}\exp\left(-\frac{1}{2\sigma_{jk}^2}(u_j - \mu_{jk})^2\right),$$

where writing $\boldsymbol{x}_i = (x_{1i}, \dots, x_{mi})$, we have

* $\mu_{jk} := \frac{1}{N_k}\sum_{\substack{1\leq i \leq n \\ y_i = k}} x_{ji}$, the sample mean of the $j$-th feature under $Y = k$,
* $N_k := \#\{i \in \{1, 2, \dots, n\} : y_i = k\}$, and
* $\sigma_{jk}^2 := \frac{1}{N_k - 1}\sum_{\substack{1\leq i \leq n \\ y_i = k}}(x_{ji} - \mu_{jk})^2$, the sample variance of the $j$-th feature under $Y = k$.

Then we get classifying quadratic hypersurfaces. (If we instead force the variances to be the same for every class $k$, then the quadratic terms cancel and we get classifying hyperplanes.) Since we approximate each $\mathbb{P}(X_j = u_j | Y = k)$ by a Gaussian distribution (a.k.a. normal distribution), this classification model is called the **Gaussian Naive Bayes** classifier. Using different distributions yields different types of classification models, which can be found in this [scikit-learn user guide page](https://scikit-learn.org/stable/modules/naive_bayes.html).
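The Gaussian Naive Bayes classifier can be sketched from scratch as follows. The data are synthetic, and the mean and variance are estimated separately for each feature within each class. The function names `log_scores`, `predict`, and `posterior` are hypothetical, and the sketch assumes every estimated variance is positive.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (made up): two classes in R^3 with independent features.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0, 0.0], scale=[1.0, 2.0, 0.5], size=(150, 3)),
    rng.normal(loc=[2.0, -1.0, 1.0], scale=[1.0, 1.0, 1.0], size=(50, 3)),
])
y = np.array([1] * 150 + [2] * 50)

classes = np.unique(y)
N = np.array([np.sum(y == k) for k in classes])                   # class counts N_k
mu = np.array([X[y == k].mean(axis=0) for k in classes])          # per-class, per-feature means
var = np.array([X[y == k].var(axis=0, ddof=1) for k in classes])  # per-class, per-feature variances

def log_scores(x):
    """log(N_k) plus the sum over features of the log of the Gaussian density."""
    log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var, axis=1)
    return np.log(N) + log_lik

def predict(x):
    """Return the class with the largest log score at x."""
    return classes[np.argmax(log_scores(x))]

def posterior(x):
    """Normalize the scores into P(Y = k | X = x), shifting by the maximum for stability."""
    s = log_scores(x)
    w = np.exp(s - s.max())
    return w / w.sum()

print(predict(np.array([0.0, 0.0, 0.0])))
print(predict(np.array([2.0, -1.0, 1.0])))
print(posterior(np.array([1.0, 0.0, 0.5])))
```

Working with logarithms avoids underflow when the number of features $m$ is large, since a product of many small densities becomes a sum. scikit-learn provides a ready-made implementation as `GaussianNB`.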