**LDA vs. Logistic Regression**:
    
1. When the classes are well-separated, the parameter estimates for the
logistic regression model are surprisingly unstable. Linear discriminant
analysis does not suffer from this problem.

2. If n is small and the distribution of the predictors X is approximately
normal in each of the classes, the linear discriminant model is again
more stable than the logistic regression model.

3. Linear discriminant analysis is popular
when we have more than two response classes.

# Using Bayes’ Theorem for Classification
Suppose that we wish to classify an observation into one of K classes, where
K ≥ 2.

**Prior**: Let $\pi_k=Pr(Y=k)$ represent the overall or ***prior*** probability that a randomly chosen observation comes from the kth class, that is, the probability that a given observation is associated with the kth category of the response variable Y.

Let $f_k(x) \equiv Pr(X = x|Y = k)$ denote the ***density function*** of X for an observation that comes from the kth class. In other words, $f_k(x)$ is relatively large if there is a high probability that an observation in the kth class has X ≈ x.

**Bayes’ theorem** states that

\begin{align}
Pr(Y=k|X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^K\pi_lf_l(x)} 
\end{align}

**Posterior**: $p_k(x) = Pr(Y = k|X = x)$ is the ***posterior*** probability that an observation X = x belongs to the kth class, given the predictor value for that observation.
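As a small numerical illustration of Bayes' theorem, the sketch below turns priors $\pi_k$ and density values $f_k(x)$ into posteriors. The function name `posterior` and the numbers are made up for illustration; they are not from the text.

```python
import numpy as np

def posterior(priors, densities):
    """Return Pr(Y = k | X = x) for every class k.

    priors    : the pi_k values, one per class
    densities : the f_k(x) values, all evaluated at the same x
    """
    priors = np.asarray(priors, dtype=float)
    densities = np.asarray(densities, dtype=float)
    joint = priors * densities      # numerator pi_k * f_k(x) for each class
    return joint / joint.sum()      # divide by the sum over all classes

# Hypothetical three-class example (values chosen only for illustration).
post = posterior([0.5, 0.3, 0.2], [0.10, 0.40, 0.25])
print(post, post.argmax())
```

The posteriors sum to one, and the class with the largest $\pi_k f_k(x)$ also has the largest posterior, because the denominator is shared by all classes.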

**Estimating $π_k$:** simply compute the fraction of the training
observations that belong to the kth class.

**Estimating $f_k(x)$:** more challenging, unless we assume a simple form for these densities.

# Linear Discriminant Analysis for p = 1

Assume p = 1, that is, we have only one predictor. We would like to obtain an estimate for $f_k(x)$ that we can plug into Bayes’ theorem in order to estimate $p_k(x)$. We will then classify an observation to the class for which $p_k(x)$ is greatest.

## Assumptions
In order to estimate $f_k(x)$, we will first make
some assumptions about its form:

1. Assume that $f_k(x)$ is normal or Gaussian.
\begin{align}
f_k(x)=\frac{1}{\sqrt{2\pi}\sigma_k}\exp{\left( -\frac{1}{2\sigma_k^2}(x-\mu_k)^2 \right)}
\end{align}

where $μ_k$ and $σ_k^2$ are the mean and variance parameters for the kth class.

2. Assume that $\sigma_1^2=\ldots=\sigma_K^2$: that is, there is a shared variance term across all K classes, which for simplicity we can denote by $\sigma^2$.

Plugging this density into Bayes’ theorem gives
\begin{align}
p_k(x)=\frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp{\left( -\frac{1}{2\sigma^2}(x-\mu_k)^2 \right)}}{\sum_{l=1}^K\pi_l\frac{1}{\sqrt{2\pi}\sigma}\exp{\left( -\frac{1}{2\sigma^2}(x-\mu_l)^2 \right)}}
\end{align}

The Bayes classifier involves assigning an observation X = x to the class for which $p_k(x)$ is largest. Taking the log of $p_k(x)$
and rearranging the terms, it is not hard to show that this is equivalent to
assigning the observation to the class for which

\begin{align}
\delta_k(x)=x\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k) \quad\quad (4.13)
\end{align}

is largest.
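The rearrangement can be sketched as follows. The denominator of $p_k(x)$ is the same for every class, so maximizing $p_k(x)$ over k is equivalent to maximizing $\log(\pi_k f_k(x))$:

\begin{align}
\log(\pi_k f_k(x)) &= \log(\pi_k)-\log(\sqrt{2\pi}\sigma)-\frac{1}{2\sigma^2}(x-\mu_k)^2 \\
&= \log(\pi_k)-\log(\sqrt{2\pi}\sigma)-\frac{x^2}{2\sigma^2}+x\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}
\end{align}

The terms $-\log(\sqrt{2\pi}\sigma)$ and $-\frac{x^2}{2\sigma^2}$ do not depend on k, so dropping them does not change which class attains the maximum. What remains is exactly $\delta_k(x)$ in (4.13).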

For instance, if K = 2 and $\pi_1 = \pi_2$, then the Bayes classifier assigns an observation to class 1 if $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$, and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where

\begin{align}
x=\frac{\mu_1^2-\mu_2^2}{2(\mu_1-\mu_2)}=\frac{\mu_1+\mu_2}{2}
\end{align}
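A quick numerical check of this boundary, as a sketch only. The parameter values below are assumptions chosen for illustration, and the helper names `delta` and `classify` are hypothetical.

```python
import numpy as np

def delta(x, mu, sigma2, prior):
    """Discriminant score delta_k(x) from (4.13) for a single class."""
    return x * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(prior)

# Illustrative values (assumptions, not taken from the text).
mu1, mu2, sigma2 = 1.0, 3.0, 0.5
prior1 = prior2 = 0.5

boundary = (mu1 + mu2) / 2  # midpoint of the two class means

def classify(x):
    """Return 1 or 2, whichever class has the larger discriminant score."""
    d1 = delta(x, mu1, sigma2, prior1)
    d2 = delta(x, mu2, sigma2, prior2)
    return 1 if d1 > d2 else 2

print(boundary, classify(1.9), classify(2.1))
```

With equal priors the two scores tie at the midpoint, and points on either side are assigned to the class whose mean is closer.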




## Parameter Estimation

In practice, even if we are quite certain of our assumption that X is drawn
from a Gaussian distribution within each class, we still have to estimate
the parameters $μ_1, . . . , μ_K, π_1, . . . , π_K$, and $σ^2$.


The **linear discriminant analysis (LDA)** method approximates the Bayes classifier by plugging estimates for $\mu_1, \ldots, \mu_K$, $\pi_1, \ldots, \pi_K$, and $\sigma^2$ into (4.13). In particular, the following estimates are used:

\begin{align}
\hat{\mu}_k=\frac{1}{n_k}\sum_{i:y_i=k}x_i  \quad (4.15) \\
\hat{\sigma}^2=\frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2 \quad (4.16)\\
\hat{\pi}_k=\frac{n_k}{n}
\end{align}


where n is the total number of training observations, and $n_k$ is the number
of training observations in the kth class. 

$\hat{\mu}_k$: average of all the training observations from the kth class;

$\hat{\sigma}^2$: a weighted average of the sample variances for each of the K classes.

$\hat{\pi_k}$: the proportion of the training observations
that belong to the kth class
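These three estimates can be sketched in a few lines of NumPy. The function name `fit_lda_1d` and the toy data are hypothetical and only meant to make (4.15) and (4.16) concrete.

```python
import numpy as np

def fit_lda_1d(x, y):
    """Estimate class means, pooled variance, and priors for one predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, K = x.size, classes.size

    # (4.15): average of the training observations in each class
    mu = np.array([x[y == k].mean() for k in classes])
    # Prior estimate: fraction of training observations in each class
    pi = np.array([(y == k).sum() / n for k in classes])
    # (4.16): pooled within-class sum of squares divided by n - K
    ss = sum(((x[y == k] - m) ** 2).sum() for k, m in zip(classes, mu))
    sigma2 = ss / (n - K)
    return classes, mu, sigma2, pi

# Toy data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 6.0, 8.0]
y = [0, 0, 0, 1, 1]
classes, mu, sigma2, pi = fit_lda_1d(x, y)
print(mu, sigma2, pi)
```

Dividing by n - K rather than n accounts for the K class means that were estimated from the same data.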


## LDA Classifier
The LDA classifier assigns an observation X = x to the class for which

\begin{align}
\hat{\delta}_k(x)=x\frac{\hat{\mu}_k}{\hat{\sigma}^2}-\frac{\hat{\mu}_k^2}{2\hat{\sigma}^2}+\log(\hat{\pi}_k)
\end{align} 

is largest.

The word ***linear*** in the classifier’s name stems from the fact
that the ***discriminant functions*** $\hat{\delta}_k(x)$ are linear functions of x.

<img src="./images/6.png" width=600>

The right-hand panel of Figure 4.4 displays a histogram of a random
sample of 20 observations from each class. 

To implement LDA:

1. Estimate $\pi_k$, $\mu_k$, and $\sigma^2$ using (4.15) and (4.16).
2. Compute the decision boundary, shown as a black solid line, that results from assigning an observation to the class for which $\hat{\delta}_k(x)$ is largest.

In this case, since $n_1 = n_2 = 20$, we have $\hat{\pi}_1 = \hat{\pi}_2$. As a result, the decision boundary corresponds to the midpoint between the sample means for the two classes, $\frac{\hat{\mu}_1+\hat{\mu}_2}{2}$.
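The two steps can be sketched on simulated data with 20 observations per class. The true means (-1.25 and 1.25), the shared variance of 1, and the random seed are assumptions made for this example only, and `lda_predict` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed simulation settings: two Gaussian classes with a shared variance.
x = np.concatenate([rng.normal(-1.25, 1.0, 20), rng.normal(1.25, 1.0, 20)])
y = np.repeat([0, 1], 20)

# Step 1: estimate the parameters.
n, K = x.size, 2
mu_hat = np.array([x[y == k].mean() for k in range(K)])
pi_hat = np.array([(y == k).mean() for k in range(K)])
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)

# Step 2: assign each point to the class with the largest estimated score.
def lda_predict(x_new):
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    scores = (x_new[:, None] * mu_hat / sigma2_hat
              - mu_hat ** 2 / (2 * sigma2_hat)
              + np.log(pi_hat))
    return scores.argmax(axis=1)

# With equal estimated priors, the boundary is the midpoint of the sample means.
boundary = mu_hat.mean()
print(boundary, lda_predict([boundary - 0.1, boundary + 0.1]))
```

Because the boundary is built from sample means rather than the true means, it will generally differ a little from the Bayes decision boundary.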

# Linear Discriminant Analysis for p > 1

Assume that $X = (X_1, X_2, \ldots, X_p)$ is drawn from a **multivariate Gaussian** (or multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix.


## Multivariate Gaussian Distribution

The multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.

<img src="./images/7.png" width=600>

To indicate that a p-dimensional random variable X has a multivariate
Gaussian distribution, we write X ∼ N(μ,Σ). Here E(X) = μ is
the mean of X (a vector with p components), and Cov(X) = Σ is the
p × p **covariance matrix** of X. Formally, the **multivariate Gaussian density**
is defined as

\begin{align}
f(x)=\frac{1}{\sqrt{(2\pi)^{p}|\Sigma|}}\exp{\left( -\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) \right)}
\end{align}
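A minimal NumPy sketch of this density, assuming Σ is symmetric and positive definite. The function name `mvn_density` is hypothetical. The exponent is the negative quadratic form, so the density is largest at x = μ and decays away from it, just as in the p = 1 case.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(mu, Sigma) at a point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    p = mu.size
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed via a linear solve
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Standard bivariate Gaussian evaluated at its mean.
peak = mvn_density([0.0, 0.0], [0.0, 0.0], np.eye(2))
print(peak)
```

Solving the linear system instead of forming $\Sigma^{-1}$ explicitly is a common numerical choice; either approach implements the same formula.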

In the case of p > 1 predictors, the **LDA classifier** assumes that the
observations in the kth class are drawn from a multivariate Gaussian distribution
$N(μ_k,Σ)$, where $μ_k$ is a class-specific mean vector, and Σ is a
covariance matrix that is common to all K classes.

Plugging the density function for the kth class, $f_k(X = x)$, into $Pr(Y = k|X = x)$, the Bayes classifier assigns an observation X = x
to the class for which

\begin{align}
\delta_k(x)=x^TΣ^{-1}\mu_k-\frac{1}{2}\mu_k^TΣ^{-1}\mu_k+\log{\pi_k}
\end{align} 

is largest.
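The multivariate discriminant can be evaluated for all classes at once. This is a sketch with known (not estimated) parameters; the function name `lda_scores` and the example values are assumptions for illustration.

```python
import numpy as np

def lda_scores(x, mus, Sigma, priors):
    """Return delta_k(x) for every class k.

    x      : point to classify, shape (p,)
    mus    : class mean vectors, shape (K, p)
    Sigma  : common covariance matrix, shape (p, p)
    priors : class prior probabilities, length K
    """
    x = np.asarray(x, dtype=float)
    mus = np.asarray(mus, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    # Column k holds Sigma^{-1} mu_k
    Sigma_inv_mu = np.linalg.solve(Sigma, mus.T)
    linear = x @ Sigma_inv_mu                          # x^T Sigma^{-1} mu_k
    quad = 0.5 * np.sum(mus.T * Sigma_inv_mu, axis=0)  # (1/2) mu_k^T Sigma^{-1} mu_k
    return linear - quad + np.log(priors)

# Two classes in two dimensions with equal priors (illustrative values).
mus = [[0.0, 0.0], [2.0, 2.0]]
Sigma = np.eye(2)
priors = [0.5, 0.5]
print(lda_scores([0.5, 0.5], mus, Sigma, priors).argmax())
print(lda_scores([1.5, 1.5], mus, Sigma, priors).argmax())
```

Each score is linear in x, so the boundary between any two classes is a hyperplane; in this example it passes through the midpoint (1, 1) of the two means.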