# Chapter.05 Classification
---
### 5.3. Probabilistic Approaches for Classification
5.3.1. Statistical vs Bayesian Classification

- Statistical classification
    - Based on the Neyman-Pearson criterion
    - Typically used in sonar and radar systems (with unknown priors)
    - ML estimator
- Bayesian Classification
    - Based on minimization of the Bayes risk (defined by a cost function)
    - Typically used in communications and pattern recognition systems
    - MAP estimator (suppose we know the prior, e.g., Gaussian)

<br>
Both are based on the Likelihood Ratio Test (LRT): each compares the ratio of likelihoods, but against a different threshold.

$$
L(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} \overset{C_2}{\underset{C_1}{\gtrless}} \xi
$$

$ L(\mathbf{x}) $ is a likelihood ratio and $ \xi $ is a decision threshold.

5.3.2. Probabilities in classification<br>

<img src="./res/ch05/fig_3_1.png" width="800" height="600"><br>
<div align="center">
  Figure.5.3.1
</div>

$$
\begin{align*}
p(\text{mistake}) &= p(\mathbf{x} \in \mathcal{R}_1, C_2) + p(\mathbf{x} \in \mathcal{R}_2, C_1) \\
                  &= \int_{\mathcal{R}_1} p(\mathbf{x}, C_2) d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, C_1) d\mathbf{x} \\
\end{align*}
$$

$$
\begin{align*}
p(\text{correct}) &= \int_{\mathcal{R}_1} p(\mathbf{x}, C_1) d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, C_2) d\mathbf{x} \\
\end{align*}
$$

5.3.3. A simple binary classification

<img src="./res/ch05/fig_3_2.png" width="500" height="400"><br>
<div align="center">
  Figure.5.3.2
</div>

$$
\text{Type-1 error : } \int_{\mathcal{R}_2} p(x | C_1) dx = \int_{\frac{1}{2}}^{\infty} \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{1}{2} x^2\right) dx = Q(0.5)
$$

$$
\text{Type-2 error : } \int_{\mathcal{R}_1} p(x | C_2) dx = \int_{-\infty}^{\frac{1}{2}} \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{1}{2} (x - 1)^2\right) dx = Q(0.5)
$$
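
The two error probabilities can be checked numerically. The sketch below assumes the class-conditional densities $ p(x | C_1) = \mathcal{N}(0, 1) $ and $ p(x | C_2) = \mathcal{N}(1, 1) $ that the worked example in 5.3.4 uses, with the decision boundary at $ x = 1/2 $. The helper name `q_function` is illustrative only.

```python
import math

def q_function(z):
    """Gaussian tail probability Q(z) = Pr{N(0, 1) > z}."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

threshold = 0.5  # boundary between R_1 (x < 0.5) and R_2 (x > 0.5)

# Type-1 error: x falls in R_2 although C_1 ~ N(0, 1) is true.
type1 = q_function(threshold - 0.0)
# Type-2 error: x falls in R_1 although C_2 ~ N(1, 1) is true.
type2 = 1.0 - q_function(threshold - 1.0)

print(f"Type-1 error = {type1:.4f}, Type-2 error = {type2:.4f}")
```

Both values come out equal because the threshold sits midway between the two means.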

What if the threshold changes?

<img src="./res/ch05/fig_3_3.png" width="550" height="430"><br>
<div align="center">
  Figure.5.3.3
</div>

Both error probabilities cannot be reduced at the same time: moving the threshold lowers one and raises the other. The Neyman-Pearson criterion handles this trade-off by fixing one error and optimizing the other.

5.3.4. Statistical classification : Neyman-Pearson criterion<br>
To design the optimal (binary) classifier, one possible choice is to minimize the __Type-Ⅱ error__ (false negative), or equivalently, to maximize the __power of the test__, while constraining the __Type-Ⅰ error__ (false positive) below a threshold $ \alpha $:

$$
\max_{\mathcal{R}_2} \int_{\mathcal{R}_2} p(\mathbf{x} | C_2) d\mathbf{x} \quad s.t. \quad \int_{\mathcal{R}_2} p(\mathbf{x} | C_1) d\mathbf{x} \le \alpha
$$

<strong>Theorem.5.3.4.1. Neyman-Pearson theorem</strong><br>
The solution to 

$$
\max_{\mathcal{R}_2} \int_{\mathcal{R}_2} p(\mathbf{x} | C_2) d\mathbf{x} \quad s.t. \quad \int_{\mathcal{R}_2} p(\mathbf{x} | C_1) d\mathbf{x} \le \alpha
$$

is given by 

$$
\mathcal{R}_2^* = \left\{ \mathbf{x} : L(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} > \gamma \right\}
$$

$$
\text{where the threshold} \,\ \gamma \,\ \text{is found such that} \,\ \int_{ \mathcal{R}_2 } p(\mathbf{x} | C_1) d\mathbf{x} = \alpha 
$$

Neyman-Pearson (binary) classification rule:
- Determine the decision threshold $ \gamma $ from 

$$
\int_{\mathcal{R}_2} p(\mathbf{x} | C_1) d\mathbf{x} = \alpha
$$

- Perform classification according to the likelihood ratio test : 

$$
L(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} \overset{C_2}{\underset{C_1}{\gtrless}} \gamma
$$

<strong>Proof.</strong><br>
It can be proved by formulating a constrained optimization problem. $\blacksquare$
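
A sketch of that argument, assuming the constraint is active and introducing a Lagrange multiplier $ \lambda \ge 0 $ and an objective $ F $ (both symbols are added here for illustration):

$$
\begin{align*}
F(\mathcal{R}_2, \lambda) &= \int_{\mathcal{R}_2} p(\mathbf{x} | C_2) d\mathbf{x} - \lambda \left( \int_{\mathcal{R}_2} p(\mathbf{x} | C_1) d\mathbf{x} - \alpha \right) \\
                          &= \int_{\mathcal{R}_2} \left[ p(\mathbf{x} | C_2) - \lambda p(\mathbf{x} | C_1) \right] d\mathbf{x} + \lambda \alpha \\
\end{align*}
$$

For a fixed $ \lambda $, $ F $ is largest when $ \mathcal{R}_2 $ contains exactly the points where the integrand is positive, that is, where $ p(\mathbf{x} | C_2) / p(\mathbf{x} | C_1) > \lambda $. The multiplier therefore plays the role of the threshold $ \gamma $, and its value is fixed by the constraint on the Type-Ⅰ error.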

Let's solve the problem in 5.3.3, with the false-positive constraint set to $ \alpha = 0.5 $:

$$
Pr\{\text{False positive(Type 1 error)}\} = 0.5
$$

$$
\begin{align*}
\frac{p(x|C_2)}{p(x|C_1)} &= \frac{\frac{1}{\sqrt{2 \pi}} \exp(-\frac{1}{2}(x - 1)^2)}{\frac{1}{\sqrt{2 \pi}} \exp(- \frac{1}{2} x^2)} \overset{C_2}{\underset{C_1}{\gtrless}} \gamma \\
                          &= \exp\left(- \frac{1}{2} (x^2 - 2x + 1 - x^2) \right) = \exp \left( x - \frac{1}{2} \right) \overset{C_2}{\underset{C_1}{\gtrless}} \gamma \\
\end{align*}
$$

$$
\begin{align*}
Pr\{\text{False positive(Type 1 error)}\} &= Pr\left\{ \exp \left( x - \frac{1}{2} \right) > \gamma | C_1 \right\} \\
                                          &= Pr\left\{ x - \frac{1}{2} > \ln \gamma | C_1 \right\} \\
                                          &= \int_{\ln \gamma + \frac{1}{2}}^{\infty} \frac{1}{\sqrt{2 \pi}} \exp(-\frac{1}{2} x^2) dx = 0.5 \quad \Rightarrow \quad \ln \gamma + \frac{1}{2} = 0 \\
\end{align*}
$$

$$
\begin{align*}
Pr\{\text{True positive}\} &= Pr\left\{ x > 0 | C_2 \right\} \\
                           &= \int_{0}^{\infty} \frac{1}{\sqrt{2 \pi}} \exp[- \frac{1}{2}(x - 1)^2] dx = 0.84 \\
\end{align*}
$$
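
The same steps can be reproduced numerically for any $ \alpha $. This is a minimal sketch that assumes unit-variance Gaussian classes as in the example above. The function name `neyman_pearson_gaussian` and its arguments are hypothetical.

```python
from statistics import NormalDist

def neyman_pearson_gaussian(alpha, mu1=0.0, mu2=1.0):
    """Return (x0, gamma, power) for unit-variance Gaussians with mu2 > mu1.

    x0    : threshold on x (decide C_2 when x > x0)
    gamma : equivalent threshold on the likelihood ratio
    power : Pr{true positive} = Pr{x > x0 | C_2}
    """
    c1 = NormalDist(mu1, 1.0)
    c2 = NormalDist(mu2, 1.0)
    # False-positive constraint: Pr{x > x0 | C_1} = alpha.
    x0 = c1.inv_cdf(1.0 - alpha)
    # Likelihood ratio evaluated at the threshold point.
    gamma = c2.pdf(x0) / c1.pdf(x0)
    power = 1.0 - c2.cdf(x0)
    return x0, gamma, power

x0, gamma, power = neyman_pearson_gaussian(alpha=0.5)
print(f"x0 = {x0:.3f}, gamma = {gamma:.4f}, power = {power:.4f}")
```

Lowering `alpha` moves `x0` to the right, which reduces the false-positive rate at the price of a lower power.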

Are there limitations? Although an NP criterion can be formulated for multiple classes, it is seldom used in practice. More commonly, the minimum $P_e$ criterion or its generalization, the Bayes risk criterion, is employed.

5.3.4. Receiver Operating Characteristics (ROC)<br> 
ROC curve : $Pr\{\text{True positive}\} \,\ \text{vs} \,\ Pr\{\text{False positive}\} $

<img src="./res/ch05/fig_3_4.png" width="550" height="430"><br>
<div align="center">
  Figure.5.3.4
</div>

This is an alternative way of summarizing the performance of a classifier. It is very useful for comparing different classifiers and deciding which one performs best.

<img src="./res/ch05/fig_3_5.png" width="650" height="530"><br>
<div align="center">
  Figure.5.3.5
</div>
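
As a rough illustration of how such a curve is traced, the sketch below sweeps the decision threshold for the two unit-variance Gaussian classes of the 5.3.4 example and records one (false positive, true positive) pair per threshold. The function name `roc_points` and the threshold grid are assumptions made for this example.

```python
from statistics import NormalDist

def roc_points(thresholds, mu1=0.0, mu2=1.0):
    """Return a list of (false-positive rate, true-positive rate) pairs."""
    c1 = NormalDist(mu1, 1.0)
    c2 = NormalDist(mu2, 1.0)
    points = []
    for t in thresholds:
        fpr = 1.0 - c1.cdf(t)  # Pr{x > t | C_1}
        tpr = 1.0 - c2.cdf(t)  # Pr{x > t | C_2}
        points.append((fpr, tpr))
    return points

# Sweep the threshold from high (few detections) to low (many detections).
thresholds = [3.0 - 0.5 * i for i in range(13)]  # 3.0, 2.5, ..., -3.0
roc = roc_points(thresholds)
for t, (fpr, tpr) in zip(thresholds, roc):
    print(f"t = {t:+.1f}  FPR = {fpr:.3f}  TPR = {tpr:.3f}")
```

Each threshold gives one operating point. The Neyman-Pearson solution for a given $ \alpha $ is the point on this curve whose false-positive rate equals $ \alpha $.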


5.3.5. Bayesian classification : Minimum Bayes Risk Classifier for two classes<br> 
Bayes risk $ \mathcal{R} $ is given as follows:
$$
\begin{align*}
\mathcal{R} &= c_{11}Pr(\text{True Positive}) + c_{22}Pr(\text{True Negative}) + c_{21}Pr(\text{False Negative}) + c_{12}Pr(\text{False Positive}) \\
            &= c_{11} \pi_1 \int_{R_1} p(\mathbf{x} | C_1) d\mathbf{x} + c_{22} \pi_2 \int_{R_2} p(\mathbf{x} | C_2) d\mathbf{x} + c_{21} \pi_1 \int_{R_2} p(\mathbf{x} | C_1) d\mathbf{x} + c_{12} \pi_2 \int_{R_1} p(\mathbf{x} | C_2) d\mathbf{x} \\
            &\quad \text{where} \,\ \pi_i = P(C_i), \,\ i = 1, 2, \,\ R_i \,\ : \,\ \text{decision region in which we decide} \,\ C_i , \,\ c_{ij} \,\ : \,\ \text{cost if we choose} \,\ C_i \,\ \text{but} \,\ C_j \,\ \text{is true} \\
\end{align*}
$$

$$
\text{Since} \,\ \int_{R_1} p(\mathbf{x} | C_i) d\mathbf{x} + \int_{R_2} p(\mathbf{x} | C_i) d\mathbf{x} = 1, \,\ i = 1, 2, 
$$

$$
\begin{align*}
\text{We have} \,\ \mathcal{R} = &c_{11} \pi_1 \int_{R_1} p(\mathbf{x} | C_1) d\mathbf{x} + c_{22} \pi_2 \left( 1 - \int_{R_1} p(\mathbf{x} | C_2) d\mathbf{x} \right) \\
                                 &+ c_{21} \pi_1 \left( 1 - \int_{R_1} p(\mathbf{x} | C_1) d\mathbf{x} \right) + c_{12} \pi_2 \int_{R_1} p(\mathbf{x} | C_2) d\mathbf{x} \\                             
\end{align*}
$$

$$
\begin{align*}
\mathcal{R} = &c_{21} \pi_1 + c_{22} \pi_2 \\
              &+ \int_{R_1} \left[ \pi_2 (c_{12} - c_{22}) p(\mathbf{x} | C_2) - \pi_1 (c_{21} - c_{11}) p(\mathbf{x} | C_1) \right] d\mathbf{x} \quad \text{where} \,\ c_{12} - c_{22} > 0 \,\ \text{and} \,\ c_{21} - c_{11} > 0 \\
\end{align*}
$$

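$ c_{11} $ and $ c_{22} $ are the costs of correct decisions; they are zero or negative so that TP and TN are rewarded.<br>
$ c_{12} $ and $ c_{21} $ are the costs of wrong decisions; they are positive so that FP and FN are penalized.<br>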

$$
\begin{align*}
R_1^* &= \underset{R_1}{\arg\min} \,\ \mathcal{R} \\
    &= \underset{R_1}{\arg\min} \int_{R_1} \left[ \pi_2(c_{12} - c_{22})p(\mathbf{x}|C_2) - \pi_1 (c_{21} - c_{11})p(\mathbf{x}|C_1)\right] d\mathbf{x} \\
\end{align*}
$$

$$
\begin{align*}
\text{Choose} &\,\ R_1 \,\ s.t. \\
              & \pi_2 (c_{12} - c_{22}) p(\mathbf{x} | C_2) < \pi_1(c_{21} - c_{11})p(\mathbf{x} | C_1) \\
              & \therefore \,\ \pi_2(c_{12} - c_{22}) p(\mathbf{x} | C_2) \overset{C_2}{\underset{C_1}{\gtrless}} \pi_1 (c_{21} - c_{11}) p(\mathbf{x} | C_1)
\end{align*}
$$

The likelihood ratio is compared to a threshold:
$$
\Lambda(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} \overset{C_2}{\underset{C_1}{\gtrless}} \frac{\pi_1(c_{21} - c_{11})}{\pi_2(c_{12} - c_{22})} = \xi 
$$

<img src="./res/ch05/fig_3_6.png" width="600" height="300"><br>
<div align="center">
  Figure.5.3.6
</div>

Equivalent form: Log-Likelihood ratio test
$$
\log \Lambda(\mathbf{x}) \overset{C_2}{\underset{C_1}{\gtrless}} \log \xi
$$

<img src="./res/ch05/fig_3_7.png" width="600" height="300"><br>
<div align="center">
  Figure.5.3.7
</div>

In the minimum Bayes risk classifier, the threshold is determined by the class-prior probabilities and the costs.<br>
In the Neyman-Pearson classifier, it is determined by the allowed Type-Ⅰ error $ \alpha $ (i.e., the false-alarm probability).
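
The Bayes-risk threshold and the log-likelihood ratio test can be sketched as follows. The example assumes unit-variance Gaussian classes $ \mathcal{N}(0, 1) $ and $ \mathcal{N}(1, 1) $. The priors and costs are hypothetical values chosen only to show the mechanics, and the function names are illustrative.

```python
import math

def bayes_threshold(pi1, pi2, c11, c12, c21, c22):
    """Threshold xi = pi1 (c21 - c11) / (pi2 (c12 - c22))."""
    return (pi1 * (c21 - c11)) / (pi2 * (c12 - c22))

def decide(x, xi, mu1=0.0, mu2=1.0):
    """Log-likelihood ratio test for unit-variance Gaussians; returns 1 or 2."""
    # log p(x|C_2) - log p(x|C_1) for N(mu2, 1) versus N(mu1, 1)
    log_lr = -0.5 * (x - mu2) ** 2 + 0.5 * (x - mu1) ** 2
    return 2 if log_lr > math.log(xi) else 1

# Hypothetical setting: C_1 is three times as likely as C_2, 0/1 costs.
xi = bayes_threshold(pi1=0.75, pi2=0.25, c11=0.0, c12=1.0, c21=1.0, c22=0.0)
print(xi)               # threshold on the likelihood ratio
print(decide(1.0, xi))  # log-ratio 0.5 is below log(xi)
print(decide(2.0, xi))  # log-ratio 1.5 is above log(xi)
```

Raising the prior of $ C_1 $ or the cost $ c_{21} $ raises $ \xi $, so more evidence is needed before deciding $ C_2 $.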

5.3.6. Minimum Error Probability Classifier for two classes<br>
Probability of error $ P_e $(or misclassification probability)
$$
\begin{align*}
P_e &= Pr\{\text{decide} \,\ C_1, \,\ C_2 \,\ \text{true}\} + Pr\{\text{decide} \,\ C_2, \,\ C_1 \,\ \text{true}\} \\
    &= \int_{R_1} p(\mathbf{x} | C_2) p(C_2) d \mathbf{x} + \int_{R_2} p(\mathbf{x} | C_1) p(C_1) d \mathbf{x} \\
\end{align*}
$$

In this context, minimum $ P_e $ classifier:
$$
L(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} \overset{C_2}{\underset{C_1}{\gtrless}} \frac{p(C_1)}{p(C_2)} = \xi \qquad : \,\ \text{Maximum A Posteriori classifier}
$$

- A special case of the more general minimum Bayes risk classifier
- If $ c_{11} = c_{22} = 0 $ and $ c_{12} = c_{21} = 1 $, then $ \mathcal{R} = P_e $

> The MAP classifier minimizes the error probability.

If the class-prior probabilities are equal, $p(C_1) = p(C_2)$, 
$$
L(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} \overset{C_2}{\underset{C_1}{\gtrless}} 1 \qquad : \,\ \text{Maximum Likelihood classifier}
$$

> The ML classifier minimizes the error probability for equally likely classes
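
A quick numerical check of the MAP claim, under assumed conditions: unit-variance Gaussian classes $ \mathcal{N}(0, 1) $ and $ \mathcal{N}(1, 1) $ and hypothetical unequal priors. The sketch compares the MAP threshold with a brute-force search over candidate thresholds. The name `error_probability` is illustrative.

```python
import math
from statistics import NormalDist

def error_probability(t, pi1, pi2, mu1=0.0, mu2=1.0):
    """P_e when deciding C_2 for x > t and C_1 otherwise."""
    c1 = NormalDist(mu1, 1.0)
    c2 = NormalDist(mu2, 1.0)
    miss = pi2 * c2.cdf(t)                 # decide C_1, C_2 true
    false_alarm = pi1 * (1.0 - c1.cdf(t))  # decide C_2, C_1 true
    return miss + false_alarm

pi1, pi2 = 0.7, 0.3  # hypothetical priors
# MAP rule: exp(x - 1/2) > pi1 / pi2  <=>  x > 1/2 + ln(pi1 / pi2)
t_map = 0.5 + math.log(pi1 / pi2)
t_ml = 0.5  # the ML rule ignores the priors

grid = [-3.0 + 0.01 * i for i in range(801)]  # candidate thresholds in [-3, 5]
t_best = min(grid, key=lambda t: error_probability(t, pi1, pi2))

print(f"MAP threshold {t_map:.3f}, grid minimizer {t_best:.2f}")
print(f"P_e(MAP) = {error_probability(t_map, pi1, pi2):.4f}, "
      f"P_e(ML) = {error_probability(t_ml, pi1, pi2):.4f}")
```

With unequal priors the ML threshold is no longer the minimizer, which is why ML is optimal only for equally likely classes.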

5.3.7. Bayesian classification : Minimum Bayes Risk Classifier for multiple classes<br>
Bayes risk $ \mathcal{R} $ (or expected cost) for multiple classes $ \{C_1, \cdots, C_K\} $ 

$$
\begin{align*}
\mathcal{R} &= \sum_{i = 1}^{K} \sum_{j = 1}^{K} c_{ij} \int_{R_i} p(\mathbf{x} | C_j) p(C_j) d\mathbf{x} \\
            &= \sum_{i = 1}^{K} \int_{R_i} \sum_{j = 1}^{K} c_{ij} p(\mathbf{x} | C_j) p(C_j) d\mathbf{x} \\
            &= \sum_{i = 1}^{K} \int_{R_i} J_i(\mathbf{x}) p(\mathbf{x}) d\mathbf{x} \quad \text{where} \,\ J_i(\mathbf{x}) = \sum_{j = 1}^{K} c_{ij} p(C_j | \mathbf{x}) 
\end{align*}
$$

To minimize $ \mathcal{R} $, we should choose the class that minimizes 
$$
J_i(\mathbf{x}) = \sum_{j = 1}^{K} c_{ij} p(C_j | \mathbf{x})
$$

Therefore, the minimum Bayes risk classifier for multiple classes $\{C_1, \cdots, C_K\}$ is
$$
\mathbf{x} \in C_k \,\ \text{if} \,\ k = \underset{i \in \{1, \cdots, K\}}{\arg\min} \left\{ J_i(\mathbf{x}) = \sum_{j = 1}^{K} c_{ij} p(C_j | \mathbf{x}) \right\}
$$
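
The rule above can be sketched in a few lines. The posteriors and the cost matrix below are hypothetical numbers, and the function name `min_risk_class` is illustrative. Indices are 0-based in the code.

```python
def min_risk_class(posteriors, cost):
    """Return (best index, risks) with risks[i] = sum_j cost[i][j] * posteriors[j]."""
    risks = [sum(c_ij * p_j for c_ij, p_j in zip(row, posteriors)) for row in cost]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

# Hypothetical posteriors p(C_j | x) for K = 3 classes.
posteriors = [0.5, 0.3, 0.2]
# Hypothetical costs: cost[i][j] = cost of choosing class i when class j is true.
# Choosing class 0 while class 2 is true is made very expensive.
cost = [
    [0.0, 1.0, 10.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
best, risks = min_risk_class(posteriors, cost)
print(best, risks)

# With 0/1 costs the same function reduces to picking the largest posterior.
zero_one = [[0.0 if i == j else 1.0 for j in range(3)] for i in range(3)]
best_01, _ = min_risk_class(posteriors, zero_one)
print(best_01)
```

The asymmetric cost moves the decision away from the most probable class, because a mistake in that direction is expensive.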

5.3.8. Minimum Error Probability Classifier for multiple classes<br>
Probability of error $ P_e $ for multiple classes $ \{C_1, \cdots, C_K\} $
$$
P_e = \sum_{i = 1}^{K} \sum_{j = 1, j \neq i}^{K} \int_{R_i} p(\mathbf{x} | C_j) p(C_j) d\mathbf{x}
$$

- $ P_e $ is a special case of $ \mathcal{R}\text{ (i.e., } P_e = \mathcal{R}\text{)} $ for a particular assignment:

$$
c_{ij} = 
\begin{cases}
0 & i = j \\
1 & i \neq j \\
\end{cases}
$$

To determine the classifier that minimizes $ P_e $, we use $ J_i(\mathbf{x}) $
$$
\begin{align*}
J_i(\mathbf{x}) &= \sum_{j = 1, j \neq i}^{K} p(C_j | \mathbf{x}) \\
                &= \sum_{j = 1}^{K} p(C_j | \mathbf{x}) - p(C_i | \mathbf{x}) \\
                &= 1 - p(C_i | \mathbf{x}) \\
\end{align*}
$$

Therefore, the minimum $ P_e $ classifier for multiple classes $\{C_1, \cdots, C_K\}$ is
$$
\begin{align*}
C_k &= \underset{C \in \{C_1, \cdots, C_K\}}{\arg\max} p(C | \mathbf{x}) \\
    &= \underset{C \in \{C_1, \cdots, C_K\}}{\arg\max} p(\mathbf{x} | C) p(C) \qquad : \,\ \text{Multiple-class Maximum A Posteriori classifier} \\
\end{align*}
$$

> The MAP classifier minimizes the chance of misclassification.

If the class-prior probabilities are equal, $ p(C_1) = \cdots = p(C_K) $, 
$$
C_k = \underset{C \in \{C_1, \cdots, C_K\}}{\arg\max} p(\mathbf{x} | C) \qquad : \,\ \text{Multiple-class Maximum Likelihood classifier} \\
$$

> The ML classifier minimizes the chance of misclassification for equally likely classes.

In summary, the __MAP classifier__
- Requires the knowledge of __class-prior__  distributions and __class-conditional__  distributions.
- Optimal in the sense of minimizing the chance of misclassification 
<br>

In summary, the __ML classifier__
- Requires the knowledge of only the __class-conditional__  densities
- Optimal in terms of error probability only for equally likely classes

<br>

For example, the following is the classification of Gaussian features with different means and a common variance.

<img src="./res/ch05/fig_3_8.png" width="500" height="300"><br>
<div align="center">
  Figure.5.3.8
</div>

$$
\underset{k = 1, 2, 3}{\arg\max} \,\ p(\mathbf{x} | C_k) = \underset{k = 1, 2, 3}{\arg\min} \,\ (x - A_k)^2 \quad \text{where} \,\ A_1 = -A, \,\ A_2 = 0, \,\ A_3 = A
$$

$$
\begin{cases}
C_1 & \text{if} \,\ x < -\frac{A}{2} \\
C_2 & \text{if} \,\ -\frac{A}{2} < x < \frac{A}{2} \\
C_3 & \text{if} \,\ x > \frac{A}{2} \\
\end{cases}
$$

- ML is optimal in the sense of the minimum error probability (for equally likely classes)
- In the ML, thresholds are just intersections between any two PDFs
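
The nearest-mean rule and its thresholds at $ \pm A/2 $ can be checked with a short sketch. The amplitude value and the function name `ml_class` are assumptions for illustration, and a common variance across the three classes is assumed.

```python
def ml_class(x, means):
    """Return the 1-based index k minimizing (x - A_k)^2."""
    distances = [(x - a) ** 2 for a in means]
    return distances.index(min(distances)) + 1

A = 2.0  # hypothetical amplitude
means = [-A, 0.0, A]  # A_1, A_2, A_3

# Thresholds are expected at -A/2 = -1 and +A/2 = +1.
for x in (-1.5, -0.5, 0.5, 1.5):
    print(x, ml_class(x, means))
```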

5.3.9. Naive Bayes classifier<br>
Naive Bayes classification assuming independent features
- Posterior Probability

$$
\begin{align*}
p(C_k | \mathbf{x}) &\propto p(\mathbf{x} | C_k) p(C_k) \\
                    &= p(x_1, \cdots, x_m, C_k) \\
                    &= p(x_1 | x_2, \cdots, x_m, C_k)p(x_2 | x_3, \cdots, x_m, C_k) \\
                    & \,\ \cdots p(x_{m-1} | x_m, C_k) p(x_m | C_k) p(C_k) \\
\end{align*}
$$

$$
p(C_k | \mathbf{x}) = \frac{1}{Z} p(C_k) \prod_{i} p(x_i | C_k) \quad (\because \,\ p(x_i | x_{i+1}, \cdots, x_m, C_k) = p(x_i | C_k) \,\ \text{Conditional independence assumption} )
$$

- Naive MAP classifier
$$
C = \underset{k \in \{1, \cdots, K\}}{\arg\max} p(C_k) \prod_{i} p(x_i | C_k)
$$

- Naive ML classifier
$$
C = \underset{k \in \{1, \cdots, K\}}{\arg\max} \prod_{i} p(x_i | C_k)
$$

For example, consider a Naive Bayes classifier for binary classification (spam e-mail classifier).

1. Posterior probability
$$
p(C_k | \mathbf{x}) = \frac{p(C_k)p(\mathbf{x} | C_k)}{p(\mathbf{x})} = \frac{p(C_k)}{p(\mathbf{x})} \prod_{i} p(x_i | C_k), \,\ k = 1, 2 
$$

2. MAP classification rule
$$
\frac{p(C_1|\mathbf{x})}{p(C_2 | \mathbf{x})} = \frac{p(C_1)}{p(C_2)} \prod_{i} \frac{p(x_i | C_1)}{p(x_i | C_2)} \overset{C_1}{\underset{C_2}{\gtrless}} 1
$$
    - A special kind of likelihood ratio test (independent features)

3. ML classification rule
$$
\frac{p(\mathbf{x}|C_1)}{p(\mathbf{x} | C_2)} = \prod_i \frac{p(x_i | C_1)}{p(x_i | C_2)} \overset{C_1}{\underset{C_2}{\gtrless}} 1
$$
    - A special kind of likelihood ratio test (independent features and equal class priors)

Let $ C_1 $ be spam and $ C_2 $ be legitimate, with known class priors $ p(C_k), \,\ k = 1, 2 $, and let $ \{w_1, \cdots, w_n\} $ be the collection of words on which we base the decision whether an e-mail is spam or not.

$$
x_i = 
\begin{cases}
1, & \text{if } w_i \,\ \text{appears in the e-mail} \\
0, & \text{if it does not} \\
\end{cases} \Rightarrow \,\ \text{Bernoulli R.V.}
$$

$$
\Rightarrow \,\ p(C_k | x_1, \cdots, x_n) = \frac{p(C_k) \prod_{i = 1}^{n} p(x_i | C_k)}{\sum_{j = 1}^{2} p(C_j) \prod_{i = 1}^{n} p(x_i | C_j)}, \,\ k = 1, 2
$$

$$
p(C_1 | x_1, \cdots, x_n) \overset{\text{spam}}{\underset{\text{legitimate}}{\gtrless}} p(C_2 | x_1, \cdots, x_n)
$$

Therefore, we can perform a likelihood ratio test with independent class-conditional densities:
$$
\frac{\prod_{i = 1}^{n} p(x_i | C_1)}{\prod_{i = 1}^{n} p(x_i | C_2)}  \overset{\text{spam}}{\underset{\text{legitimate}}{\gtrless}} \frac{p(C_2)}{p(C_1)} = \xi
$$
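
A minimal sketch of this spam test with Bernoulli features, written in log form to avoid multiplying many small probabilities. The vocabulary, word probabilities, and priors are hypothetical, and the function names are illustrative.

```python
import math

def log_likelihood(x, p):
    """log prod_i p_i^x_i (1 - p_i)^(1 - x_i) for binary features x."""
    return sum(math.log(p_i) if x_i else math.log(1.0 - p_i)
               for x_i, p_i in zip(x, p))

def classify(x, p_spam, p_legit, prior_spam, prior_legit):
    """Log form of the likelihood ratio test against log(p(C_2) / p(C_1))."""
    log_ratio = log_likelihood(x, p_spam) - log_likelihood(x, p_legit)
    log_xi = math.log(prior_legit / prior_spam)
    return "spam" if log_ratio > log_xi else "legitimate"

# Hypothetical vocabulary and per-class word probabilities p(x_i = 1 | C_k).
vocabulary = ["free", "winner", "meeting"]
p_spam = [0.8, 0.6, 0.1]
p_legit = [0.1, 0.05, 0.5]

print(classify([1, 1, 0], p_spam, p_legit, prior_spam=0.3, prior_legit=0.7))
print(classify([0, 0, 1], p_spam, p_legit, prior_spam=0.3, prior_legit=0.7))
```

In practice the word probabilities would be estimated from labeled e-mails, usually with smoothing so that no probability is exactly 0 or 1.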

5.3.10. Assumptions of Naive Bayes classifier<br>
1. Assumptions on class-prior distribution<br>
- Equiprobable classes
$$
p(C_k) = \frac{1}{\text{Number of classes}}
$$

- Estimation of class probabilities from training samples
$$
p(C_k) = \frac{\text{Number of samples in} \,\ C_k}{\text{Total number of samples}}
$$

2. Assumptions on class-conditional distribution

- Gaussian Naive Bayes
$$
p(x | C_k) = \frac{1}{\sqrt{2 \pi \sigma_k^2}} \exp \left(- \frac{(x - \mu_k)^2}{2 \sigma_k^2} \right)
$$

- Multinomial Naive Bayes(e.g., document classifications)
$$
p(\mathbf{x} | C_k) = \frac{\left(\sum_i x_i\right)!}{\prod_i x_i!} \prod_i p_{ki}^{x_i}
$$

$$
\quad \text{where} \,\ p_{ki} : \text{probability that event } i \,\ \text{occurs given } C_k , \,\ x_i : \text{the number of times event } i \,\ \text{was observed} \\
$$

$$
\begin{align*}
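\text{In Log-Likelihood form, } \,\ \log p(C_k | \mathbf{x}) &= \log \left(p(C_k) \prod_i p_{ki}^{x_i} \right) + \text{const} \\
                         &= \log p(C_k) + \sum_i x_i \log p_{ki} + \text{const} = b_k + \mathbf{w}_k^T \mathbf{x} + \text{const} \\
                         &\quad \text{where} \,\ b_k = \log p(C_k), \,\ w_{ki} = \log p_{ki}, \,\ \text{and the constant does not depend on} \,\ k \\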
\end{align*}
$$

- Binomial (or Bernoulli) Naive Bayes (e.g., document classification with binary features)

$$
    p(\mathbf{x} | C_k) = \prod_i p_{ki}^{x_i} (1 - p_{ki})^{(1 - x_i)}
$$

$$
\quad \text{where} \,\ p_{ki} : \text{probability of the term } i \,\ \text{given } C_k ,
$$

$$
x_i : \text{boolean expressing the occurrence or absence of the } i\text{th term from the vocabulary}
$$
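
The log-linear form of the multinomial model above, $ b_k + \mathbf{w}_k^T \mathbf{x} $, can be sketched as follows. The multinomial coefficient is the same for every class and is dropped. The priors, word probabilities, and counts are hypothetical, and `multinomial_scores` is an illustrative name.

```python
import math

def multinomial_scores(x, priors, p):
    """Return b_k + w_k . x per class, with b_k = log p(C_k), w_ki = log p_ki."""
    scores = []
    for prior_k, p_k in zip(priors, p):
        b_k = math.log(prior_k)
        w_k = [math.log(p_ki) for p_ki in p_k]
        scores.append(b_k + sum(w * x_i for w, x_i in zip(w_k, x)))
    return scores

# Hypothetical two-class model over a three-word vocabulary.
priors = [0.5, 0.5]
p = [
    [0.7, 0.2, 0.1],  # p_1i: word probabilities under C_1
    [0.1, 0.3, 0.6],  # p_2i: word probabilities under C_2
]
x = [3, 1, 0]  # word counts in one document

scores = multinomial_scores(x, priors, p)
predicted = scores.index(max(scores)) + 1  # 1-based class index
print(scores, predicted)
```

Because each score is linear in the counts, multinomial Naive Bayes behaves as a linear classifier in the count vector.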

5.3.11. Bayes Gaussian Classifier<br>
- Assumptions 

$$
\text{Class} \,\ C_1 \,\ : \,\ \mathbb{E}[\mathbf{x}] = \mathbf{\mu}_1, \,\ \mathbb{E}[(\mathbf{x} - \mathbf{\mu}_1)(\mathbf{x} - \mathbf{\mu}_1)^T] = \Sigma 
$$

$$
\text{Class} \,\ C_2 \,\ : \,\ \mathbb{E}[\mathbf{x}] = \mathbf{\mu}_2, \,\ \mathbb{E}[(\mathbf{x} - \mathbf{\mu}_2)(\mathbf{x} - \mathbf{\mu}_2)^T] = \Sigma 
$$

$$
p(\mathbf{x} | C_i) = \frac{1}{\sqrt{|2 \pi \Sigma|}} \exp \left( - \frac{1}{2} (\mathbf{x} - \mathbf{\mu}_i)^T \Sigma^{-1} (\mathbf{x} - \mathbf{\mu}_i)\right), \,\ i = 1,2 
$$

- Also suppose that

$$
p(C_1) = p(C_2) = \frac{1}{2}, \,\ c_{21} = c_{12}, \,\ c_{11} = c_{22} = 0
$$

<img src="./res/ch05/fig_3_9.png" width="500" height="300"><br>
<div align="center">
  Figure.5.3.9
</div>

Bayes Gaussian classifier is just a linear classifier

$$
\begin{align*}
\log \xi &= 0 \\
\log L(\mathbf{x}) &= \log \frac{p_\mathbf{x} (\mathbf{x} | C_1)}{p_\mathbf{x} (\mathbf{x} | C_2)} \\
                   &= - \frac{1}{2} (\mathbf{x} - \mathbf{\mu}_1)^T \Sigma^{-1} (\mathbf{x} - \mathbf{\mu}_1) + \frac{1}{2} (\mathbf{x} - \mathbf{\mu}_2)^T \Sigma^{-1} (\mathbf{x} - \mathbf{\mu}_2) \\
                   &= (\mathbf{\mu}_1 - \mathbf{\mu}_2)^T \Sigma^{-1} \mathbf{x} + \frac{1}{2} (\mathbf{\mu}_2^T \Sigma^{-1} \mathbf{\mu}_2 - \mathbf{\mu}_1^T \Sigma^{-1} \mathbf{\mu}_1) \\
                   &= \mathbf{w}^T \mathbf{x} + b \\
\end{align*}
$$

$$
\log L(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b \overset{C_1}{\underset{C_2}{\gtrless}} 0
$$
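
The linear form can be verified on a small two-dimensional case. The sketch uses plain Python lists and a hand-written 2x2 inverse to stay dependency-free. The means, the shared covariance, and the test point are hypothetical, and all helper names are illustrative.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def linear_classifier(mu1, mu2, sigma):
    """Return (w, b) with w = Sigma^{-1} (mu1 - mu2) and
    b = (mu2^T Sigma^{-1} mu2 - mu1^T Sigma^{-1} mu1) / 2."""
    s_inv = inverse_2x2(sigma)
    w = mat_vec(s_inv, [m1 - m2 for m1, m2 in zip(mu1, mu2)])
    b = 0.5 * (dot(mu2, mat_vec(s_inv, mu2)) - dot(mu1, mat_vec(s_inv, mu1)))
    return w, b

# Hypothetical parameters: symmetric means and a shared covariance.
mu1, mu2 = [1.0, 1.0], [-1.0, -1.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
w, b = linear_classifier(mu1, mu2, sigma)

x = [0.5, -0.2]
score = dot(w, x) + b  # log-likelihood ratio at x
print(w, b, "C_1" if score > 0 else "C_2")
```

The quadratic terms in $ \mathbf{x} $ cancel only because both classes share $ \Sigma $; with different covariances the boundary would be quadratic.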

5.3.12. Generative and discriminative approach<br>
The generative approach models the distribution of each class (likelihood and prior) and obtains the posterior probability through Bayes' theorem; the __decision boundary__ then follows from the posterior. The discriminative approach, in contrast, focuses only on separating the classes, so it concentrates on the differences between classes and models the posterior directly.

- Generative approach 

Generative model $ p(t, x) = p(x | t) p(t) $<br>
$ p(t|x) = \frac{p(x | t)p(t)}{p(x)} \quad \text{where} \,\ p(x) = \int p(x | t) p(t) dt \quad (\because \,\ \text{Bayes' theorem}) $ <br>

<img src="./res/ch05/fig_3_10.png" width="400" height="400"><br>
<div align="center">
  Figure.5.3.10
</div>
<br><br>

- Discriminative approach

Discriminative model: $ p(t | x) $ is modeled directly <br>

<img src="./res/ch05/fig_3_11.png" width="400" height="400"><br>
<div align="center">
  Figure.5.3.11
</div>
<br><br>

5.3.13. Probabilistic Generative Models for two classes classification<br>
- Posterior probability in binary classification (Bayes rule)

$$
p(C_1 | \mathbf{x}) = \frac{p(\mathbf{x} | C_1) p(C_1)}{\sum_{i = 1}^{2} p(\mathbf{x} | C_i) p(C_i)} = \frac{1}{1 + \exp(-a)} = \sigma(a)
$$

$$
\text{where} \,\ a = \ln \frac{p(\mathbf{x}|C_1)p(C_1)}{p(\mathbf{x}|C_2)p(C_2)} = \ln \frac{p(\mathbf{x}, C_1)}{p(\mathbf{x}, C_2)} \quad (\text{for the shared-covariance Gaussian model of 5.3.11, } a \,\ \text{takes simply a linear form of } \mathbf{x}, \,\ a = \mathbf{w}^T \mathbf{x} + b)
$$

<strong>Proof.</strong><br>
Let $ \alpha = p(\mathbf{x} | C_1) p(C_1) $ and $ \beta = p(\mathbf{x} | C_2) p(C_2) $, <br>

$$
\frac{\alpha}{\alpha + \beta} = \frac{1}{\frac{\alpha + \beta}{\alpha}} = \frac{1}{1 + \frac{\beta}{\alpha}} = \frac{1}{1 + \exp(- \ln \frac{\alpha}{\beta})} \qquad \blacksquare
$$
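
The identity can also be checked numerically. The sketch below assumes a one-dimensional model with $ p(x | C_1) = \mathcal{N}(0, 1) $, $ p(x | C_2) = \mathcal{N}(1, 1) $, and hypothetical priors, and compares the posterior from Bayes' rule with $ \sigma(a) $.

```python
import math
from statistics import NormalDist

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical 1-D model with unit-variance Gaussian classes.
c1, c2 = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)
prior1, prior2 = 0.4, 0.6

x = 0.8
joint1 = c1.pdf(x) * prior1  # p(x, C_1)
joint2 = c2.pdf(x) * prior2  # p(x, C_2)

posterior_bayes = joint1 / (joint1 + joint2)
a = math.log(joint1 / joint2)
posterior_sigmoid = sigmoid(a)

print(posterior_bayes, posterior_sigmoid)
```

For this model $ a = -x + \frac{1}{2} + \ln \frac{p(C_1)}{p(C_2)} $, which is linear in $ x $ as expected for equal variances.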

5.3.11. Probabilistic generative models in multiple classes<br>
5.3.12. Probabilistic discriminative models<br>