# Chapter.05 Classification
---
### 5.3. Probabilistic Approaches for Classification
5.3.1. Statistical vs Bayesian Classification

- Statistical classification
    - Based on the Neyman-Pearson criterion
    - Typically used in sonar and radar systems (with unknown priors)
    - ML estimator
- Bayesian Classification
    - Based on minimization of the Bayes risk (via a cost function)
    - Typically used in communications and pattern recognition systems
    - MAP estimator (assumes the prior is known, e.g., Gaussian)

<br>
Both are based on the Likelihood Ratio Test (LRT): each compares the ratio of likelihoods against a threshold, but the two approaches use different thresholds.

$$
L(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} \overset{C_2}{\underset{C_1}{\gtrless}} \xi
$$

$ L(\mathbf{x}) $ is a likelihood ratio and $ \xi $ is a decision threshold.
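As a minimal sketch of the test above, the following code evaluates $ L(x) $ for two univariate Gaussian class-conditional densities and compares it to a threshold. The densities $ N(0, 1) $ for $ C_1 $ and $ N(1, 1) $ for $ C_2 $, and the function names, are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, std=1.0):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood_ratio(x, mean1=0.0, mean2=1.0):
    """L(x) = p(x | C2) / p(x | C1) for two unit-variance Gaussians (assumed)."""
    return gaussian_pdf(x, mean2) / gaussian_pdf(x, mean1)


def lrt_decide(x, threshold):
    """Return 2 (choose C2) if L(x) exceeds the threshold, else 1 (choose C1)."""
    return 2 if likelihood_ratio(x) > threshold else 1
```

The statistical and Bayesian approaches would call `lrt_decide` with the same likelihood ratio and differ only in how `threshold` is chosen.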

5.3.2. Probabilities in classification<br>

<img src="./res/ch05/fig_3_1.png" width="800" height="600"><br>
<div align="center">
  Figure.5.3.1
</div>

$$
\begin{align*}
p(\text{mistake}) &= p(\mathbf{x} \in \mathcal{R}_1, C_2) + p(\mathbf{x} \in \mathcal{R}_2, C_1) \\
                  &= \int_{\mathcal{R}_1} p(\mathbf{x}, C_2) d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, C_1) d\mathbf{x} \\
\end{align*}
$$

$$
\begin{align*}
p(\text{correct}) &= \int_{\mathcal{R}_1} p(\mathbf{x}, C_1) d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, C_2) d\mathbf{x} \\
\end{align*}
$$
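The two probabilities can be checked numerically in one dimension. The sketch below assumes $ p(x | C_1) = N(0, 1) $, $ p(x | C_2) = N(1, 1) $, equal priors by default, and decision regions $ \mathcal{R}_1 = (-\infty, t] $ and $ \mathcal{R}_2 = (t, \infty) $; none of these values are fixed by the text at this point.

```python
from statistics import NormalDist


def mistake_and_correct(t, prior1=0.5, prior2=0.5):
    """Return (p(mistake), p(correct)) for the rule 'choose C2 if x > t'."""
    c1 = NormalDist(0.0, 1.0)  # assumed p(x | C1)
    c2 = NormalDist(1.0, 1.0)  # assumed p(x | C2)
    # Mistake: x falls in R1 while C2 is true, or in R2 while C1 is true.
    p_mistake = prior2 * c2.cdf(t) + prior1 * (1.0 - c1.cdf(t))
    # Correct: x falls in R1 while C1 is true, or in R2 while C2 is true.
    p_correct = prior1 * c1.cdf(t) + prior2 * (1.0 - c2.cdf(t))
    return p_mistake, p_correct


p_mistake, p_correct = mistake_and_correct(0.5)
print(f"p(mistake)={p_mistake:.4f}  p(correct)={p_correct:.4f}")
```

Because the joint density $ p(x, C_i) $ equals $ \pi_i \, p(x | C_i) $, the two values always sum to one.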

5.3.3. A simple binary classification

<img src="./res/ch05/fig_3_2.png" width="500" height="400"><br>
<div align="center">
  Figure.5.3.2
</div>

$$
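\text{Type-1 error : } \int_{\mathcal{R}_2} p(x | C_1) dx = \int_{\frac{1}{2}}^{\infty} \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{1}{2} x^2\right) dx = Q(0.5)
$$

$$
\text{Type-2 error : } \int_{\mathcal{R}_1} p(x | C_2) dx = \int_{-\infty}^{\frac{1}{2}} \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{1}{2} (x - 1)^2\right) dx = Q(0.5)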
$$
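Both error probabilities can be evaluated with the Gaussian tail function $ Q(z) = \frac{1}{2} \operatorname{erfc}(z / \sqrt{2}) $. The sketch below assumes the densities $ N(0, 1) $ for $ C_1 $ and $ N(1, 1) $ for $ C_2 $ with the threshold at $ x = \frac{1}{2} $; the helper name `q_function` is illustrative.

```python
import math


def q_function(z):
    """Upper tail probability of a standard normal: Q(z) = P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


threshold = 0.5

# Type-1 error: x > 1/2 although C1 ~ N(0, 1) is true.
type1_error = q_function(threshold)

# Type-2 error: x <= 1/2 although C2 ~ N(1, 1) is true, i.e. P(Z <= -1/2).
type2_error = 1.0 - q_function(threshold - 1.0)

print(f"type-1 = {type1_error:.4f}, type-2 = {type2_error:.4f}")
```

Both values come out to about 0.3085, consistent with $ Q(0.5) $ for each error type.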

What if the threshold changes?

<img src="./res/ch05/fig_3_3.png" width="550" height="430"><br>
<div align="center">
  Figure.5.3.3
</div>

It isn't possible to reduce both error probabilities at the same time: moving the threshold lowers one and raises the other. The Neyman-Pearson criterion addresses this trade-off.
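The trade-off can be seen by sweeping the threshold. This sketch assumes the densities $ N(0, 1) $ for $ C_1 $ and $ N(1, 1) $ for $ C_2 $ and a rule of the form "choose $ C_2 $ if $ x > t $"; the threshold values in the loop are arbitrary.

```python
import math


def q_function(z):
    """Upper tail probability of a standard normal: Q(z) = P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def error_rates(t):
    """Return (type-1, type-2) error probabilities for threshold t."""
    type1 = q_function(t)              # P(x > t | C1), C1 ~ N(0, 1)
    type2 = 1.0 - q_function(t - 1.0)  # P(x <= t | C2), C2 ~ N(1, 1)
    return type1, type2


for t in (-1.0, 0.0, 0.5, 1.0, 2.0):
    type1, type2 = error_rates(t)
    print(f"t={t:+.1f}  type-1={type1:.4f}  type-2={type2:.4f}")
```

As `t` increases, the type-1 error shrinks while the type-2 error grows, so no single threshold minimizes both.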

5.3.4. Statistical classification : Neyman-Pearson criterion<br>
To design the optimal (binary) classifier, one possible choice is to minimize the __Type-Ⅱ error__ (false negative) or, equivalently, to maximize the __power of the test__, while constraining the __Type-Ⅰ error__ (false positive) to be at most $ \alpha $:

$$
\max_{\mathcal{R}_2} \int_{\mathcal{R}_2} p(\mathbf{x} | C_2) d\mathbf{x} \quad s.t. \quad \int_{\mathcal{R}_2} p(\mathbf{x} | C_1) d\mathbf{x} \le \alpha
$$

<strong>Theorem.5.3.4.1. Neyman-Pearson theorem</strong><br>
The solution to 

$$
\max_{\mathcal{R}_2} \int_{\mathcal{R}_2} p(\mathbf{x} | C_2) d\mathbf{x} \quad s.t. \quad \int_{\mathcal{R}_2} p(\mathbf{x} | C_1) d\mathbf{x} \le \alpha
$$

is given by 

$$
\mathcal{R}_2^* = \left\{ \mathbf{x} : L(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} > \gamma \right\} \quad \text{where the threshold} \,\ \gamma \,\ \text{is found such that} \,\ \int_{\mathcal{R}_2^*} p(\mathbf{x} | C_1) d\mathbf{x} = \alpha
$$

Neyman-Pearson (Binary) Classification Rule : 
- Choose the decision threshold $ \gamma $ from 

$$
\int_{\mathcal{R}_2} p(\mathbf{x} | C_1) d\mathbf{x} = \alpha
$$

- Perform classification according to the likelihood ratio test : 

$$
L(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} \overset{C_2}{\underset{C_1}{\gtrless}} \gamma
$$

<strong>Proof.</strong><br>
It can be proved by formulating the design as a constrained optimization problem. $\blacksquare$
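The constrained-optimization argument can be sketched as follows. This is the standard Lagrangian outline rather than a rigorous proof, and the multiplier $ \lambda \ge 0 $ is notation introduced here for illustration.

$$
\begin{align*}
J(\mathcal{R}_2) &= \int_{\mathcal{R}_2} p(\mathbf{x} | C_2) d\mathbf{x} - \lambda \left( \int_{\mathcal{R}_2} p(\mathbf{x} | C_1) d\mathbf{x} - \alpha \right) \\
                 &= \int_{\mathcal{R}_2} \left[ p(\mathbf{x} | C_2) - \lambda p(\mathbf{x} | C_1) \right] d\mathbf{x} + \lambda \alpha \\
\end{align*}
$$

For a fixed $ \lambda $, $ J $ is largest when $ \mathcal{R}_2 $ contains exactly the points where the integrand is positive, that is, where $ p(\mathbf{x} | C_2) / p(\mathbf{x} | C_1) > \lambda $. Setting $ \gamma = \lambda $ and choosing it so that the constraint holds with equality gives the likelihood ratio test stated in the theorem.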

Let's solve the problem in 5.3.3, constraining the false positive probability to $ \alpha = 0.5 $:

$$
Pr\{\text{False positive(Type 1 error)}\} = 0.5
$$

$$
\begin{align*}
\frac{p(x|C_2)}{p(x|C_1)} &= \frac{\frac{1}{\sqrt{2 \pi}} \exp(-\frac{1}{2}(x - 1)^2)}{\frac{1}{\sqrt{2 \pi}} \exp(- \frac{1}{2} x^2)} \overset{C_2}{\underset{C_1}{\gtrless}} \gamma \\
                          &= \exp\left(- \frac{1}{2} (x^2 - 2x + 1 - x^2) \right) = \exp \left( x - \frac{1}{2} \right) \overset{C_2}{\underset{C_1}{\gtrless}} \gamma \\
\end{align*}
$$

$$
\begin{align*}
Pr\{\text{False positive(Type 1 error)}\} &= Pr\left\{ \exp \left( x - \frac{1}{2} \right) > \gamma | C_1 \right\} \\
                                          &= Pr\left\{ x - \frac{1}{2} > \ln \gamma | C_1 \right\} \\
                                          &= \int_{\ln \gamma + \frac{1}{2}}^{\infty} \frac{1}{\sqrt{2 \pi}} \exp(-\frac{1}{2} x^2) dx = 0.5 \quad \Rightarrow \quad \ln \gamma + \frac{1}{2} = 0 \\
\end{align*}
$$

$$
\begin{align*}
Pr\{\text{True positive}\} &= Pr\left\{ x > 0 | C_2 \right\} \\
                           &= \int_{0}^{\infty} \frac{1}{\sqrt{2 \pi}} \exp[- \frac{1}{2}(x - 1)^2] dx \approx 0.84 \\
\end{align*}
$$
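The worked example can be reproduced numerically. The sketch below relies on the fact derived above that $ L(x) = \exp(x - \frac{1}{2}) $ is increasing in $ x $, so the test on $ L(x) $ is equivalent to a test on $ x $. The function names and the use of `statistics.NormalDist` are illustrative choices, not part of the original notes.

```python
import math
from statistics import NormalDist

C1 = NormalDist(0.0, 1.0)  # p(x | C1) from the example
C2 = NormalDist(1.0, 1.0)  # p(x | C2) from the example


def np_threshold(alpha):
    """Return (t, gamma) such that P(x > t | C1) = alpha and gamma = L(t)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    t = C1.inv_cdf(1.0 - alpha)  # threshold on x
    gamma = math.exp(t - 0.5)    # same threshold expressed on L(x)
    return t, gamma


def true_positive_rate(t):
    """P(x > t | C2): the power of the test for threshold t."""
    return 1.0 - C2.cdf(t)


t, gamma = np_threshold(0.5)
print(f"t={t:.4f}  gamma={gamma:.4f}  TP={true_positive_rate(t):.4f}")
```

For $ \alpha = 0.5 $ this gives $ t = 0 $, $ \gamma = e^{-1/2} \approx 0.607 $ and a true positive probability of about 0.841, matching the derivation. A smaller $ \alpha $ pushes `t` to the right and lowers the power.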

Are there limitations? Although the NP criterion can be formulated for multiple classes, it is seldom used in practice. More commonly, the minimum probability of error ($P_e$) criterion or its generalization, the Bayes risk criterion, is employed.

5.3.4. Receiver Operating Characteristics (ROC)<br> 
ROC curve : $Pr\{\text{True positive}\} \,\ \text{vs} \,\ Pr\{\text{False positive}\} $

<img src="./res/ch05/fig_3_4.png" width="550" height="430"><br>
<div align="center">
  Figure.5.3.4
</div>

This is an alternative way of summarizing the performance of a classifier. It is very useful for comparing different classifiers and deciding which one performs best.

<img src="./res/ch05/fig_3_5.png" width="650" height="530"><br>
<div align="center">
  Figure.5.3.5
</div>
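A ROC curve can be traced by sweeping the decision threshold and recording the pair (false positive probability, true positive probability) at each value. The sketch below assumes the densities $ N(0, 1) $ for $ C_1 $ and $ N(1, 1) $ for $ C_2 $; the threshold grid and the trapezoidal area estimate are illustrative choices.

```python
from statistics import NormalDist

C1 = NormalDist(0.0, 1.0)  # assumed p(x | C1)
C2 = NormalDist(1.0, 1.0)  # assumed p(x | C2)


def roc_points(thresholds):
    """Return a list of (false positive, true positive) pairs, one per threshold."""
    return [(1.0 - C1.cdf(t), 1.0 - C2.cdf(t)) for t in thresholds]


def area_under_curve(points):
    """Trapezoidal estimate of the area under the ROC curve."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


thresholds = [-6.0 + 0.01 * k for k in range(1401)]  # grid from -6 to 8
points = roc_points(thresholds)
auc = area_under_curve(points)
print(f"AUC = {auc:.4f}")
```

The area under the curve is a common single-number summary when comparing classifiers: 0.5 corresponds to guessing and 1.0 to perfect separation. For this assumed pair of Gaussians the exact area is $ \Phi(1 / \sqrt{2}) \approx 0.76 $, where $ \Phi $ is the standard normal CDF.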


5.3.5. Bayesian classification : Minimum Bayes Risk Classifier in two classes<br> 
The Bayes risk $ \mathcal{R} $ is given as follows:

$$
\begin{align*}
\mathcal{R} &= c_{11}Pr(\text{True Negative}) + c_{22}Pr(\text{True Positive}) + c_{21}Pr(\text{False Positive}) + c_{12}Pr(\text{False Negative}) \\
            &= c_{11} \pi_1 \int_{R_1} p(\mathbf{x} | C_1) d\mathbf{x} + c_{22} \pi_2 \int_{R_2} p(\mathbf{x} | C_2) d\mathbf{x} + c_{21} \pi_1 \int_{R_2} p(\mathbf{x} | C_1) d\mathbf{x} + c_{12} \pi_2 \int_{R_1} p(\mathbf{x} | C_2) d\mathbf{x} \\
            &\quad \text{where} \,\ \pi_i = P(C_i), \,\ i = 1, 2, \,\ R_i \,\ : \,\ \text{decision region in which} \,\ \mathbf{x} \,\ \text{is assigned to} \,\ C_i , \,\ c_{ij} \,\ : \,\ \text{cost if we choose} \,\ C_i \,\ \text{but} \,\ C_j \,\ \text{is true} \\
\end{align*}
$$

$$
\text{Since} \,\ \int_{R_1} p(\mathbf{x} | C_i) d\mathbf{x} + \int_{R_2} p(\mathbf{x} | C_i) d\mathbf{x} = 1, \,\ i = 1, 2, 
$$

$$
\begin{align*}
\text{We have} \,\ \mathcal{R} = &c_{11} \pi_1 \int_{R_1} p(\mathbf{x} | C_1) d\mathbf{x} + c_{22} \pi_2 \left( 1 - \int_{R_1} p(\mathbf{x} | C_2) d\mathbf{x} \right) \\
                                 &+ c_{21} \pi_1 \left( 1 - \int_{R_1} p(\mathbf{x} | C_1) d\mathbf{x} \right) + c_{12} \pi_2 \int_{R_1} p(\mathbf{x} | C_2) d\mathbf{x} \\                             
\end{align*}
$$

$$
\begin{align*}
\mathcal{R} = &c_{21} \pi_1 + c_{22} \pi_2 \\
              &+ \int_{R_1} \left[ \pi_2 (c_{12} - c_{22}) p(\mathbf{x} | C_2) - \pi_1 (c_{21} - c_{11}) p(\mathbf{x} | C_1) \right] d\mathbf{x} \quad \text{where} \,\ c_{12} - c_{22} > 0 \,\ \text{and} \,\ c_{21} - c_{11} > 0 \\
\end{align*}
$$
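The last expression leads directly to a decision rule. As a sketch of the standard argument: the constant $ c_{21} \pi_1 + c_{22} \pi_2 $ does not depend on the decision regions, so $ \mathcal{R} $ is minimized by placing in $ R_1 $ exactly those $ \mathbf{x} $ for which the integrand is negative. The symbol $ \xi $ is reused from 5.3.1 for the resulting threshold.

$$
\pi_2 (c_{12} - c_{22}) p(\mathbf{x} | C_2) < \pi_1 (c_{21} - c_{11}) p(\mathbf{x} | C_1) \quad \Rightarrow \quad \text{choose} \,\ C_1
$$

$$
L(\mathbf{x}) = \frac{p(\mathbf{x} | C_2)}{p(\mathbf{x} | C_1)} \overset{C_2}{\underset{C_1}{\gtrless}} \frac{\pi_1 (c_{21} - c_{11})}{\pi_2 (c_{12} - c_{22})} = \xi
$$

With $ c_{11} = c_{22} = 0 $ and $ c_{12} = c_{21} = 1 $, the threshold reduces to $ \xi = \pi_1 / \pi_2 $, which is the MAP rule.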

$ c_{11} $ and $ c_{22} $ are negative (rewards), so minimizing the risk maximizes TP and TN.<br>
$ c_{12} $ and $ c_{21} $ are positive (penalties), so minimizing the risk minimizes FP and FN.<br>
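The effect of priors and costs can be explored numerically. The sketch below assumes the standard minimum-risk threshold $ \xi = \pi_1 (c_{21} - c_{11}) / (\pi_2 (c_{12} - c_{22})) $ on the likelihood ratio, the densities $ N(0, 1) $ for $ C_1 $ and $ N(1, 1) $ for $ C_2 $ (so that $ L(x) = \exp(x - \frac{1}{2}) $), and 0-1 costs. All numeric values and function names are illustrative.

```python
import math
from statistics import NormalDist

P1 = NormalDist(0.0, 1.0)  # assumed p(x | C1)
P2 = NormalDist(1.0, 1.0)  # assumed p(x | C2)


def bayes_threshold(prior1, prior2, c11, c12, c21, c22):
    """Threshold xi on L(x); c_ij is the cost of choosing C_i when C_j is true."""
    return prior1 * (c21 - c11) / (prior2 * (c12 - c22))


def bayes_risk(t, prior1, prior2, c11, c12, c21, c22):
    """Bayes risk of the rule 'choose C2 if x > t' (R1 = x <= t, R2 = x > t)."""
    r1_c1 = P1.cdf(t)      # integral of p(x | C1) over R1
    r2_c1 = 1.0 - r1_c1    # integral of p(x | C1) over R2
    r1_c2 = P2.cdf(t)      # integral of p(x | C2) over R1
    r2_c2 = 1.0 - r1_c2    # integral of p(x | C2) over R2
    return (c11 * prior1 * r1_c1 + c22 * prior2 * r2_c2
            + c21 * prior1 * r2_c1 + c12 * prior2 * r1_c2)


costs = dict(c11=0.0, c12=1.0, c21=1.0, c22=0.0)  # 0-1 costs (assumed)
xi = bayes_threshold(0.5, 0.5, **costs)
t_star = math.log(xi) + 0.5  # solve exp(t - 1/2) = xi for t
print(f"xi={xi:.3f}  t*={t_star:.3f}  risk={bayes_risk(t_star, 0.5, 0.5, **costs):.4f}")
```

With equal priors and 0-1 costs the threshold is $ \xi = 1 $, i.e. $ x = \frac{1}{2} $, and the risk equals the probability of a mistake. Raising $ \pi_1 $ or $ c_{21} $ increases $ \xi $ and makes the classifier more reluctant to choose $ C_2 $.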

5.3.6. Bayesian classification : Minimum Bayes Risk Classifier in multiple classes<br>
5.3.7. Naive Bayes Classifier<br>
5.3.8. Bayes Gaussian Classifier<br>
5.3.9. Generative and discriminative approach<br>
5.3.10. Probabilistic generative models in two classes<br>
5.3.11. Probabilistic generative models in multiple classes<br>
5.3.12. Probabilistic discriminative models<br>

<strong>Reference.</strong><br>
http://sas.uwaterloo.ca/~aghodsib/courses/f07stat841/notes/lecture6.pdf<br>
https://web.stanford.edu/~boyd/papers/pdf/robust_FDA.pdf<br>