# **Minimizing the classification error probability**

Let $R_1$ be the region of the feature space in which we decide in favor of
$\omega_1$ and $R_2$ be the corresponding region for $\omega_2$.

An **error** occurs if $x \in R_1$ although it belongs to $\omega_2$, or if $x \in R_2$ although it belongs to $\omega_1$. That is,
$$P_e = P(x\in R_2, \omega_1) + P(x\in R_1, \omega_2)$$
where $P(\cdot,\cdot)$ is the **joint probability** of two events.


By definition, the **conditional probability** of $B$ assuming $A$ is
$$P(B|A) = \frac{P(B,A)}{P(A)}$$

Therefore, the total error becomes
\begin{eqnarray}
P_e &=& P(x\in R_2 | \omega_1) P(\omega_1)+P(x\in R_1 | \omega_2) P(\omega_2)\\
&=& P(\omega_1) \int_{R_2} p(x|\omega_1)dx + P(\omega_2) \int_{R_1} p(x|\omega_2)dx\tag{eq1}
\end{eqnarray}
where $p(x|\omega_i)$ is the **conditional probability density function** of $x$ given $\omega_i$, which describes how the values of $x$ are distributed within the class $\omega_i$.
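
As a rough numerical illustration of **(eq1)**, the sketch below evaluates $P_e$ for a one-dimensional problem. The Gaussian class-conditional densities, the parameter values, and the helper names `normal_cdf` and `error_probability` are assumptions made for this example only. The regions are taken to be $R_1 = (-\infty, t)$ and $R_2 = [t, +\infty)$ for a threshold $t$, so both integrals reduce to Gaussian tail masses.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of a univariate Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def error_probability(threshold, prior1, mean1, mean2, std=1.0):
    """P_e from (eq1) when R_1 = (-inf, threshold) and R_2 = [threshold, +inf)."""
    prior2 = 1.0 - prior1
    # Integral of p(x|w1) over R_2: mass of class 1 to the right of the threshold
    mass1_in_r2 = 1.0 - normal_cdf(threshold, mean1, std)
    # Integral of p(x|w2) over R_1: mass of class 2 to the left of the threshold
    mass2_in_r1 = normal_cdf(threshold, mean2, std)
    return prior1 * mass1_in_r2 + prior2 * mass2_in_r1

# Assumed example: equal priors, unit variances, means 0 and 2, threshold at 1
p_e = error_probability(threshold=1.0, prior1=0.5, mean1=0.0, mean2=2.0)
print(round(p_e, 4))  # prints 0.1587
```

With equal priors and the threshold halfway between the means, both classes contribute the same tail mass, which is why the result equals $1 - \Phi(1)$.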


**Bayes' rule** states that
$$P(\omega_1 | x) = \frac{p(x|\omega_1)P(\omega_1)}{p(x)}$$

Rearranging, we obtain
$$\frac{P(\omega_1|x)p(x)}{P(\omega_1)}= p(x|\omega_1)$$

Substituting this, together with the analogous expression for $\omega_2$, into **(eq1)**, we have that
$$P_e = \int_{R_2} P(\omega_1|x)p(x)dx+\int_{R_1} P(\omega_2|x)p(x)dx \tag{eq2}$$

By the **law of total probability**
$$P(\omega_1) = \int_{\Omega} P(\omega_1 | x) p(x) dx$$

Since the regions $R_1$ and $R_2$ partition the feature space, that is,
$\Omega = R_1 \cup R_2$ and $R_1 \cap R_2=\emptyset$, we have that
\begin{eqnarray} P(\omega_1) &=& \int_{R_1 \cup R_2} P(\omega_1|x) p(x)dx \\
 &=& \int_{R_1}  P(\omega_1|x) p(x)dx + \int_{R_2} P(\omega_1|x) p(x)dx\end{eqnarray}
Then,
$$\int_{R_2} P(\omega_1|x)p(x)dx= P(\omega_1)-\int_{R_1}P(\omega_1|x)p(x)dx$$

Substituting in **(eq2)**, we obtain
\begin{eqnarray} P_e &=& \left( P(\omega_1)-\int_{R_1}P(\omega_1|x)p(x)dx \right)+\int_{R_1} P(\omega_2|x)p(x)dx\\
&=& P(\omega_1)+\int_{R_1} \left(-P(\omega_1|x) +P(\omega_2 | x) \right)p(x) dx\\
&=&P(\omega_1)-\int_{R_1} \left(P(\omega_1|x) -P(\omega_2 | x) \right)p(x) dx
\end{eqnarray}

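Because the first term, $P(\omega_1)$, is fixed, the **error is minimized** when the integral is as large as possible. Since $p(x) \geq 0$, this occurs when $R_1$ contains exactly the points at which the integrand is positive, that is, when $R_1$ is the region of space in which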
$$P(\omega_1 |x) > P(\omega_2 |x)$$

$R_2$ is then the region where the reverse holds, that is, where
$$P(\omega_2 |x ) > P(\omega_1|x)$$

Therefore, the **error is minimized** if the partitioning regions $R_1$ and $R_2$ of the feature space are chosen so that
$$R_1: \qquad P(\omega_1|x) > P(\omega_2|x)$$
$$R_2: \qquad P(\omega_2 |x) > P(\omega_1|x)$$
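
The result can be checked numerically on a toy problem. The sketch below assumes two one-dimensional Gaussian classes with equal variance and unequal priors (all values are illustrative, not taken from the text) and regions of the form $R_1 = (-\infty, t)$. For this assumed model, setting $P(\omega_1|x) = P(\omega_2|x)$ gives the threshold $t^* = \frac{\mu_1+\mu_2}{2} + \frac{\sigma^2}{\mu_2-\mu_1}\ln\frac{P(\omega_1)}{P(\omega_2)}$, and a brute-force search over thresholds should land on the same point.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of a univariate Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def error_probability(threshold, prior1, mean1, mean2, std):
    """P_e when x < threshold is assigned to w1 and x >= threshold to w2."""
    return (prior1 * (1.0 - normal_cdf(threshold, mean1, std))
            + (1.0 - prior1) * normal_cdf(threshold, mean2, std))

# Assumed toy problem (hypothetical values)
prior1, mean1, mean2, std = 0.7, 0.0, 2.0, 1.0

# Threshold at which the two posteriors are equal (equal-variance Gaussians)
bayes_threshold = (0.5 * (mean1 + mean2)
                   + std ** 2 * math.log(prior1 / (1.0 - prior1)) / (mean2 - mean1))

# Brute-force search over candidate thresholds in [-2, 4] with step 0.001
candidates = [-2.0 + 0.001 * k for k in range(6001)]
best_threshold = min(
    candidates,
    key=lambda t: error_probability(t, prior1, mean1, mean2, std),
)

print(round(bayes_threshold, 3), round(best_threshold, 3))
```

The larger prior of $\omega_1$ pushes the boundary toward the mean of $\omega_2$, so the midpoint between the means is no longer the best threshold.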

## **$M$-classes**

So far, we have dealt with the simple case of two classes. Generalization to
the multiclass case is straightforward. In a classification task with $M$ classes, $\omega_1, \omega_2,\ldots, \omega_M$, an unknown pattern, represented by the feature vector $x$, is assigned to class $\omega_i$ if
$$P(\omega_i | x) > P(\omega_j |x), \qquad \forall j\neq i$$

It can be shown that such a choice also **minimizes the probability of classification error**.
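
A minimal sketch of the $M$-class rule follows, assuming one-dimensional Gaussian class-conditional densities with known parameters. The priors, means, and the helper names `posteriors` and `classify` are hypothetical, and classes are indexed from 0 in the code. The posteriors are obtained from Bayes' rule and the pattern is assigned to the class with the largest one.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, priors, means, stds):
    """P(w_i | x) for every class, via Bayes' rule."""
    joint = [P * gaussian_pdf(x, m, s) for P, m, s in zip(priors, means, stds)]
    evidence = sum(joint)  # p(x), by the law of total probability
    return [j / evidence for j in joint]

def classify(x, priors, means, stds):
    """Index of the class with the largest posterior probability."""
    post = posteriors(x, priors, means, stds)
    return max(range(len(post)), key=post.__getitem__)

# Assumed three-class example (M = 3)
priors = [0.5, 0.3, 0.2]
means = [0.0, 2.0, 4.0]
stds = [1.0, 1.0, 1.0]

labels = [classify(x, priors, means, stds) for x in (-1.0, 2.0, 5.0)]
print(labels)  # prints [0, 1, 2]
```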

# **Discriminant functions**

It is by now clear that minimizing the error probability is equivalent to **partitioning** the feature space into $M$ regions, for a task with $M$ classes.

If regions $R_i$, $R_j$ happen to be **contiguous**, then they are separated by a **decision surface** in the multidimensional feature space. For the minimum error probability case, this is described by the equation
$$P(\omega_i | x) - P(\omega_j|x) = 0$$

On one side of the surface this difference is positive, and on the other
it is negative.

Sometimes, instead of working directly with probabilities, it may be more convenient, from a mathematical point of view, to work with an equivalent function of them, for example,
$$g_i(x) \equiv f (P(\omega_i |x))$$
where $f(\cdot)$ is a **monotonically increasing function**. $g_i(x)$ is known as a **discriminant function**.

The **decision test**
$$P(\omega_i|x) > P(\omega_j|x) \qquad \forall j\neq i$$
is now stated as
$$\text{classify} \ x \ \text{in} \ \omega_i \quad \text{if} \quad g_i(x) > g_j(x) \quad \forall j\neq i$$

The **decision surfaces**, separating contiguous regions, are described by
$$g_{ij}(x) \equiv g_i(x) - g_j(x) = 0, \qquad i,j=1,2,\ldots,M, \ i\neq j$$
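
To illustrate why a monotonic $f$ leaves the decisions unchanged, the sketch below takes $f = \ln$ for two assumed equal-variance Gaussian classes (parameter values are hypothetical). Dropping the terms that are common to all classes gives $g_i(x) = \ln P(\omega_i) - \frac{(x-\mu_i)^2}{2\sigma^2}$. The code checks that these discriminants select the same class as the posteriors on a grid, and that $g_{12}(x)$ vanishes at the decision boundary.

```python
import math

# Assumed two-class example with a shared standard deviation
priors = [0.7, 0.3]
means = [0.0, 2.0]
std = 1.0

def posterior_scores(x):
    """P(w_i) p(x|w_i) up to a class-independent factor; ranking equals that of P(w_i|x)."""
    return [P * math.exp(-0.5 * ((x - m) / std) ** 2) for P, m in zip(priors, means)]

def discriminants(x):
    """g_i(x) = ln P(w_i) - (x - m_i)^2 / (2 std^2), with class-independent terms dropped."""
    return [math.log(P) - 0.5 * ((x - m) / std) ** 2 for P, m in zip(priors, means)]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

# Decisions from posteriors and from discriminants agree over a grid on [-3, 5]
grid = [-3.0 + 0.01 * k for k in range(801)]
same_decisions = all(
    argmax(posterior_scores(x)) == argmax(discriminants(x)) for x in grid
)

# Decision surface g_12(x) = g_1(x) - g_2(x) = 0; here it is a single point
boundary = (0.5 * (means[0] + means[1])
            + std ** 2 * math.log(priors[0] / priors[1]) / (means[1] - means[0]))
g1, g2 = discriminants(boundary)
print(same_decisions, abs(g1 - g2) < 1e-9)
```

Working with logarithms also avoids multiplying very small density values, which is one practical reason for preferring $g_i(x)$ over the raw posteriors.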

So far, we have approached the classification problem via a **Bayesian probabilistic approach**, and the goal was to **minimize the classification error probability** or the **risk**.

However, as we will soon see, not all problems are well suited to such approaches.
> For example, in many cases the involved pdfs are complicated and their estimation is not an easy task.

> In such cases, it may be preferable to compute decision surfaces directly by means of alternative costs.

Such approaches give rise to discriminant functions and decision surfaces, which
are entities with no (necessary) relation to Bayesian classification, and they are, in general, **suboptimal** with respect to **Bayesian classifiers**.