# Chap 4: Generative models: Bayes Classifier

## General theory 

Logistic regression involves directly modeling $P(Y = k|X = x)$ using the logistic function. We now consider an alternative and less direct approach to estimating these probabilities. In this new approach, we model the distribution of the predictors $X$ separately in each of the response classes (i.e. for each value of $Y$). We then use Bayes’ theorem to flip these around into estimates for $P(Y = k|X = x)$. When the distribution of $X$ within each class is assumed to be normal, it turns out that the model is very similar in form to logistic regression.

Recall that the generic form of Bayes' theorem is:

- $\text{Posterior} = \dfrac{\text{Likelihood}\times \text{Prior}}{\text{Evidence}}: \,\,P(Y|X) = \dfrac{P(X|Y) \times P(Y)}{P(X)}$



Suppose that we wish to classify an observation into one of $K$ classes, where $K \geq 2$. In other words, the qualitative response variable $Y$ can take on $K$ possible distinct and unordered values. 
- Let $\pi_k$ represent the overall or prior probability that a randomly chosen observation comes from the $k$th class. 

- Let $f_k(X) \equiv P(X|Y = k)$ denote the density function of $X$ for an observation that comes from the $k$th class. In other words, $f_k(x)$ is relatively large if there is a high probability that an observation in the $k$th class has $X \approx x$, and $f_k(x)$ is small if it is very unlikely that an observation in the $k$th class has $X \approx x$.

- Then Bayes’ theorem states that
    - $p_k(x) \equiv P(Y = k|X=x) = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)} = \dfrac{\pi_k P(X=x|Y = k)}{\sum_{l=1}^K \pi_l P(X=x|Y = l)}$
    
    
The above equation means that instead of directly computing the posterior probability $p_k(x)$, we can simply plug in estimates of the prior $\pi_k$ and likelihood $f_k(x)$ to get $p_k(x)$. In general, estimating $\pi_k$ is easy if we have a random sample from the population: we simply compute the fraction of the training observations that belong to the $k$th class. However, estimating the density function $f_k(x)$ is much more challenging. As we will see, to estimate $f_k(x)$, we will typically have to make some simplifying assumptions. In the following sections, we discuss three classifiers that use different estimates of $f_k(x)$ to approximate the Bayes classifier: *linear discriminant analysis, quadratic discriminant analysis*, and *naive Bayes*.
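To make the plug-in idea concrete, here is a minimal sketch in base R. The priors, class means, and shared standard deviation are made-up values chosen for illustration (they are not estimated from Smarket), and the class densities are assumed to be Gaussian.

```r
# Hypothetical two-class problem with one predictor.
# All numbers below are illustrative assumptions, not estimates from data.
prior <- c(Down = 0.4, Up = 0.6)   # pi_k
mu    <- c(Down = -1,  Up = 1)     # class means (assumed)
sigma <- 1                         # shared standard deviation (assumed)

# Posterior p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x), with Gaussian f_k
posterior <- function(x) {
  lik    <- dnorm(x, mean = mu, sd = sigma)  # f_k(x), one value per class
  unnorm <- prior * lik                      # numerator of Bayes' theorem
  unnorm / sum(unnorm)                       # normalise over the classes
}

p <- posterior(0.2)
p                    # posterior probabilities for both classes
names(which.max(p))  # class with the largest posterior
```

The classifiers discussed next differ only in how `lik` would be estimated from training data.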

In [1]:
library(ISLR2)

In [2]:
#on macOS 10.14 Mojave, where the available R version is too old for ISLR2, use the ISLR package instead
#library(ISLR)

In [3]:
attach(Smarket)

In [4]:
head(Smarket)

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 | 0.959 | Up |
| 2 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up |
| 3 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down |
| 4 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.276 | 0.614 | Up |
| 5 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up |
| 6 | 2001 | 0.213 | 0.614 | -0.623 | 1.032 | 0.959 | 1.3491 | 1.392 | Up |


# Linear Discriminant Analysis (LDA)

## LDA Theory

**One predictor**

In LDA, we assume that $f_k(x)$ is normal or Gaussian. Let us focus on the case $p=1$ (there is only 1 predictor), such that the Gaussian distribution is

$f_k(x) = \dfrac{1}{\sqrt{2\pi}\sigma_k}\exp\left(-\dfrac{1}{2\sigma_k^2}(x-\mu_k)^2\right)$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters for the $k$th class. For now, let us further assume that $\sigma_1^2 = \ldots = \sigma_K^2$ : that is, there is a shared variance term across all $K$ classes, which for simplicity we can denote by $\sigma^2$. Substituting this form of $f_k(x)$ into the Bayes' equation gives

$p_k(x) = \dfrac{ \dfrac{\pi_k}{\sqrt{2\pi}\sigma}\exp\left(-\dfrac{1}{2\sigma^2}(x-\mu_k)^2\right)}{\sum_{l=1}^K  \dfrac{\pi_l}{\sqrt{2\pi}\sigma}\exp\left(-\dfrac{1}{2\sigma^2}(x-\mu_l)^2\right)}$

The Bayes classifier involves assigning an observation $X = x$ to the class for which $p_k(x)$ is largest. This is equivalent to assigning the observation to the class for which the following quantity is the largest:

$\delta_k(x) = x\cdot\dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$
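The step from $p_k(x)$ to $\delta_k(x)$ is worth spelling out. The denominator of $p_k(x)$ is the same for every class, so only the numerator $\pi_k f_k(x)$ matters. Taking its logarithm (which preserves the ordering of the classes) and expanding the square gives

$\log\left(\pi_k f_k(x)\right) = \log(\pi_k) - \log\left(\sqrt{2\pi}\sigma\right) - \dfrac{x^2}{2\sigma^2} + x\cdot\dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2}$

The terms $-\log\left(\sqrt{2\pi}\sigma\right)$ and $-\dfrac{x^2}{2\sigma^2}$ do not depend on $k$. Dropping them leaves exactly $\delta_k(x)$ and does not change which class attains the maximum.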

The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates for $\pi_k$, $\mu_k$, and $\sigma^2$ into the expression for $\delta_k(x)$.
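As a rough illustration of the plug-in step, the sketch below simulates one predictor for two classes, estimates $\pi_k$, $\mu_k$, and a pooled $\sigma^2$, and evaluates $\delta_k(x)$ by hand. The simulated data, the helper name `delta_1d`, and the choice of a pooled variance estimate with denominator $n - K$ are assumptions made for this example. The result is compared with `lda()` from MASS as a sanity check.

```r
library(MASS)

set.seed(1)
# Simulated data (an assumption for illustration): two classes, shared sd = 1
n_a <- 60; n_b <- 40
x <- c(rnorm(n_a, mean = -1), rnorm(n_b, mean = 1.5))
y <- factor(rep(c("A", "B"), times = c(n_a, n_b)))
d <- data.frame(x = x, y = y)

# Plug-in estimates
pi_hat <- as.vector(table(y)) / length(y)   # class proportions
mu_hat <- as.vector(tapply(x, y, mean))     # class means
K      <- nlevels(y)
# Pooled variance: within-class sum of squares divided by n - K
sigma2_hat <- sum((x - mu_hat[as.integer(y)])^2) / (length(x) - K)

# delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
delta_1d <- function(x0) {
  x0 * mu_hat / sigma2_hat - mu_hat^2 / (2 * sigma2_hat) + log(pi_hat)
}

# Classify every point by the largest discriminant score
manual_class <- levels(y)[apply(sapply(x, delta_1d), 2, which.max)]

# Compare with MASS::lda fitted on the same data
fit       <- lda(y ~ x, data = d)
lda_class <- as.character(predict(fit, newdata = d)$class)
mean(manual_class == lda_class)   # share of points where both rules agree
```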

**Multiple predictors**

We now extend the LDA classifier to the case of $p$ multiple predictors. To do this, we will assume that $X = (X_1,X_2, \ldots, X_p)$ is drawn from a multivariate Gaussian (or multivariate normal) $N(\mu_k, \Sigma)$ distribution, with a class-specific $p$-dimensional mean vector $\mu_k$ and a common covariance $p\times p$ matrix $\Sigma$. 

$f_k(x) = \dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\exp\left(-\dfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\right)$

For this case, the Bayes classifier assigns an observation $X = x$ to the class for which

$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \dfrac{1}{2}\mu^T_k \Sigma^{-1}\mu_k + \log\pi_k$

is largest.
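The multivariate discriminant can be evaluated directly with matrix operations. The following sketch uses hypothetical values for the priors, the class means, and the shared covariance matrix; `lda_delta` is an illustrative helper written for this example, not a function from any package.

```r
# Hypothetical parameters for K = 2 classes and p = 2 predictors
prior <- c(Down = 0.5, Up = 0.5)
mu <- rbind(Down = c(1, 1),              # one row per class mean
            Up   = c(-1, -1))
Sigma <- matrix(c(1.0, 0.3,
                  0.3, 1.0), nrow = 2)   # shared covariance matrix

# delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)
lda_delta <- function(x, mu, Sigma, prior) {
  Sigma_inv_mu <- solve(Sigma, t(mu))            # p x K, column k is Sigma^{-1} mu_k
  linear <- drop(x %*% Sigma_inv_mu)             # x^T Sigma^{-1} mu_k
  quad   <- 0.5 * colSums(t(mu) * Sigma_inv_mu)  # 0.5 mu_k^T Sigma^{-1} mu_k
  linear - quad + log(prior)
}

x_new <- c(0.5, -0.2)
delta <- lda_delta(x_new, mu, Sigma, prior)
delta
names(which.max(delta))   # predicted class
```

Because $\delta_k(x)$ is linear in $x$, the boundary between any two classes is a straight line (a hyperplane for $p > 2$).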

## LDA applied to the dataset 'Smarket'

In R, we fit an LDA model using the lda() function, which is part of the MASS library. 

In [5]:
library(MASS)


Attaching package: ‘MASS’


The following object is masked from ‘package:ISLR2’:

    Boston




In [6]:
#logical index: TRUE for training observations (years before 2005)
train <- (Year < 2005)
#hold out the 2005 observations as the test set
Smarket_2005 <- Smarket[!train, ]
Direction_2005 <- Direction[!train]

In [7]:
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit

Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293

The LDA output indicates that the priors are $\hat{\pi}_1 = 0.492$ and $\hat{\pi}_2 = 0.508$; in other words, 49.2% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of $\mu_k$.

The coefficients of linear discriminants output provides the linear combination of *Lag1* and *Lag2* that is used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X = x$. If $-0.642 \times \text{Lag1} - 0.514 \times \text{Lag2}$ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.
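The discriminant score is just a weighted sum of the two lags. The short sketch below evaluates it for two made-up days using the rounded coefficients printed above; the lag values are hypothetical. As far as I know, `predict()` also centres the predictors before forming its scores, so the values it reports should differ from these raw scores by a constant shift.

```r
# Rounded LD1 coefficients taken from the lda.fit output above
coef_ld1 <- c(Lag1 = -0.642, Lag2 = -0.514)

# Two hypothetical days (percentage returns on the two previous days)
days <- rbind(after_losses = c(Lag1 = -1.0, Lag2 = -0.5),
              after_gains  = c(Lag1 =  1.0, Lag2 =  0.5))

# Raw discriminant score: -0.642 * Lag1 - 0.514 * Lag2
score <- drop(days %*% coef_ld1)
score
```

The day that follows two losses gets the larger score, which is the direction in which the classifier leans towards predicting an increase.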

In [8]:
#prediction
lda_pred <- predict(lda.fit, Smarket_2005)
names(lda_pred)

In [9]:
lda_class <- lda_pred$class
#confusion matrix
table(lda_class, Direction_2005)

         Direction_2005
lda_class Down  Up
     Down   35  35
     Up     76 106

In [10]:
#posterior
lda_pos <- lda_pred$posterior
head(lda_pos)

| | Down | Up |
|---|---|---|
| 999 | 0.4901792 | 0.5098208 |
| 1000 | 0.4792185 | 0.5207815 |
| 1001 | 0.4668185 | 0.5331815 |
| 1002 | 0.4740011 | 0.5259989 |
| 1003 | 0.4927877 | 0.5072123 |
| 1004 | 0.4938562 | 0.5061438 |


Applying a 50% threshold to the posterior probabilities allows us to recreate the predictions contained in *lda_pred$class*.

In [11]:
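#number of test days with P(Down) >= 0.5 (predicted Down)
sum(lda_pos[, 1] >= .5)
#number of test days with P(Down) < 0.5 (predicted Up)
sum(lda_pos[, 1] < .5)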
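
The same thresholding can be tried by hand on the six posterior rows printed above. This sketch copies the `Down` column from that output; the 0.9 cut-off in the last line is an arbitrary alternative chosen here only to show how a stricter rule would be expressed.

```r
# Posterior probabilities of "Down" for the first six 2005 days (from the output above)
p_down <- c(0.4901792, 0.4792185, 0.4668185, 0.4740011, 0.4927877, 0.4938562)

# 50% threshold: predict "Down" when P(Down | x) is at least 0.5
pred_50 <- ifelse(p_down >= 0.5, "Down", "Up")
pred_50

# A stricter, arbitrary rule: count days whose posterior for "Down" exceeds 0.9
sum(p_down > 0.9)
```

All six posteriors sit just below 0.5, so these days are classified as `Up` by a narrow margin.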


In [12]:
#accuracy level
mean(lda_pred$class == Direction_2005)

LDA classifies about 56% of the 2005 days correctly (141 of 252).

# Quadratic Discriminant Analysis (QDA)

## QDA theory

Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes’ theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the $k$th class is of the form $X \sim N(\mu_k, \Sigma_k)$, where $\Sigma_k$ is a covariance matrix for the $k$th class. Under this assumption, the Bayes classifier assigns an observation $X = x$ to the class for which

$\delta_k(x) = -\dfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}_k (x-\mu_k) - \dfrac{1}{2}\log|\Sigma_k| + \log\pi_k$

is largest. So the QDA classifier involves plugging estimates for $\Sigma_k, \mu_k$, and $\pi_k$ into the above equation, and then assigning an observation $X = x$ to the class for which this quantity is largest. 
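A minimal sketch of this discriminant, using hypothetical class-specific parameters, might look as follows. The helper `qda_delta` and all numeric values are assumptions introduced for illustration; in practice the parameters would be replaced by estimates from training data.

```r
# Hypothetical parameters: each class has its own mean and covariance matrix
prior <- c(Down = 0.5, Up = 0.5)
mu    <- list(Down = c(1, 1), Up = c(-1, -1))
Sigma <- list(Down = diag(c(1, 4)),
              Up   = matrix(c(2.0, 0.5,
                              0.5, 1.0), nrow = 2))

# delta_k(x) = -0.5 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) - 0.5 log|Sigma_k| + log(pi_k)
qda_delta <- function(x, mu_k, Sigma_k, prior_k) {
  d <- x - mu_k
  log_det <- as.numeric(determinant(Sigma_k, logarithm = TRUE)$modulus)
  -0.5 * drop(t(d) %*% solve(Sigma_k, d)) - 0.5 * log_det + log(prior_k)
}

x_new <- c(0.5, -0.2)
delta <- sapply(names(prior), function(k) {
  qda_delta(x_new, mu[[k]], Sigma[[k]], prior[[k]])
})
delta
names(which.max(delta))   # predicted class
```

Because $\Sigma_k$ now varies with $k$, the term that is quadratic in $x$ no longer cancels between classes, which is where the name of the method comes from.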

## QDA applied to 'Smarket'

In [13]:
qda_fit <- qda(Direction ~ Lag1 + Lag2, subset = train)
qda_fit

Call:
qda(Direction ~ Lag1 + Lag2, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. 

In [14]:
qda_class <- predict(qda_fit, Smarket_2005)$class
table(qda_class, Direction_2005)

         Direction_2005
qda_class Down  Up
     Down   30  20
     Up     81 121

In [15]:
#accuracy level
mean(qda_class == Direction_2005)

QDA is correct on about 60% of the 2005 days (151 of 252), better than the LDA accuracy of 56% above.

# Naive Bayes (NB)

## NB theory

The naive Bayes classifier takes a different tack for estimating $f_1(x),\ldots, f_K(x)$. Instead of assuming that these functions belong to a particular family of distributions (e.g. multivariate normal), it makes a single assumption: within the $k$th class, the $p$ predictors are independent. Mathematically,

$f_k(x) = \prod_{j=1}^p f_{kj}(x_j), \qquad k = 1, \ldots, K$

where $f_{kj}$ is the density function of the $j$th predictor among observations in the $k$th class.

This leads to:

$p_k(x) = \dfrac{\pi_k  \prod_{j=1}^p f_{kj}(x_j)}{\sum_{l=1}^K \pi_l \prod_{j=1}^p f_{lj}(x_j)}$

## NB applied to 'Smarket'

Next, we fit a naive Bayes model to the Smarket data. Naive Bayes is implemented in R using the naiveBayes() function, which is part of the *e1071* library.

In [16]:
#if e1071 is not installed: start R in the terminal and run install.packages("e1071")
#when prompted for a CRAN mirror, choose one, e.g. 16 (China/Beijing2)
library(e1071)

In [17]:
nb_fit <- naiveBayes(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
nb_fit


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
    Down       Up 
0.491984 0.508016 

Conditional probabilities:
      Lag1
Y             [,1]     [,2]
  Down  0.04279022 1.227446
  Up   -0.03954635 1.231668

      Lag2
Y             [,1]     [,2]
  Down  0.03389409 1.239191
  Up   -0.03132544 1.220765
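
For numeric predictors, `naiveBayes()` models each $f_{kj}$ as a univariate Gaussian, and the two columns in the tables above are the class-wise mean and standard deviation of each predictor. Under that reading, the posterior formula can be evaluated by hand from the printed numbers. The day being classified (`x_new`) is a hypothetical value chosen for illustration.

```r
# Values copied from the nb_fit output above
prior     <- c(Down = 0.491984,   Up = 0.508016)
mean_lag1 <- c(Down = 0.04279022, Up = -0.03954635)
sd_lag1   <- c(Down = 1.227446,   Up = 1.231668)
mean_lag2 <- c(Down = 0.03389409, Up = -0.03132544)
sd_lag2   <- c(Down = 1.239191,   Up = 1.220765)

# Hypothetical day to classify
x_new <- c(Lag1 = -0.5, Lag2 = 0.3)

# f_k(x) = f_k1(x_1) * f_k2(x_2): product of one-dimensional Gaussian densities
f_k <- dnorm(x_new[["Lag1"]], mean_lag1, sd_lag1) *
       dnorm(x_new[["Lag2"]], mean_lag2, sd_lag2)

# p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x)
post <- prior * f_k / sum(prior * f_k)
post
```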


In [18]:
#prediction
nb_class <- predict(nb_fit, Smarket_2005)
table(nb_class, Direction_2005)

        Direction_2005
nb_class Down  Up
    Down   28  20
    Up     83 121

In [19]:
#accuracy
mean(nb_class == Direction_2005)

Naive Bayes performs well on this data, with accurate predictions over 59% of the time (149 of 252 days). This is slightly worse than QDA (60%), but better than LDA (56%).
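
The accuracies of the three classifiers follow directly from their confusion matrices. As a quick check, this sketch re-enters the printed counts and computes the share of correct predictions for each method; the helper names `make_cm` and `accuracy` are introduced here for illustration.

```r
# Confusion matrices copied from the outputs above
# (rows = predicted class, columns = true 2005 direction)
make_cm <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(predicted = c("Down", "Up"),
                         actual    = c("Down", "Up")))
}
cm <- list(LDA = make_cm(c(35, 35, 76, 106)),
           QDA = make_cm(c(30, 20, 81, 121)),
           NB  = make_cm(c(28, 20, 83, 121)))

# Accuracy = correctly classified days (diagonal) / all days
accuracy <- function(m) sum(diag(m)) / sum(m)
acc <- sapply(cm, accuracy)
round(acc, 3)
```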