# **BAYESIAN CLASSIFICATION FOR NORMAL DISTRIBUTIONS**

## **The Gaussian Probability Density Function**

One of the most commonly encountered probability density functions in practice
is the **Gaussian** or **normal probability density function**. The major reasons for its popularity are its computational tractability and the fact that it models adequately a large number of cases.

The one-dimensional or the univariate Gaussian, as it is sometimes called, is
defined by
$$p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2} \right)$$
with parameters $\mu$ (mean) and $\sigma^2$ (variance).

The multivariate generalization of a Gaussian pdf in the $l$-dimensional space is
given by
$$p(x) = \frac{1}{(2\pi)^{\frac{l}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) \right)$$
where $\mu = E[x]$ is the mean value and $\Sigma$ is the $l\times l$ **covariance matrix** defined as
$$\Sigma = E\left[ (x-\mu)(x-\mu)^T\right]$$
and $|\Sigma|$ denotes the determinant of $\Sigma$.

It is readily seen that for $l=1$ the multivariate Gaussian coincides with the univariate one.
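The reduction to the univariate case can also be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `gaussian_pdf` and `univariate_pdf` and the test values are illustrative choices, not part of the original text.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density N(mu, sigma) evaluated at one point x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_2d(np.asarray(sigma, dtype=float))
    l = mu.shape[0]
    diff = x - mu
    # Solve Sigma z = diff rather than forming the explicit inverse
    maha = diff @ np.linalg.solve(sigma, diff)
    norm = (2.0 * np.pi) ** (l / 2.0) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * maha) / norm)

def univariate_pdf(x, mu, var):
    """One-dimensional Gaussian density with mean mu and variance var."""
    return float(np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var))

# For l = 1 the covariance matrix is the 1x1 matrix [[sigma^2]]
p_multi = gaussian_pdf(0.7, 0.2, [[1.5]])
p_uni = univariate_pdf(0.7, 0.2, 1.5)
print(p_multi, p_uni)
```

Both calls should print the same value up to floating-point rounding.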

The symbol $N(\mu,\Sigma)$ is used to denote a Gaussian pdf with mean value $\mu$ and covariance $\Sigma$.

In two dimensions, the entries of $\Sigma$ shape the distribution as follows:  
$\sigma_{XX}$ makes it wide.  
$\sigma_{YY}$ makes it tall.  
$\sigma_{XY}$ tilts it: upward when it is positive and downward when it is negative.
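The effect of the off-diagonal term can be illustrated by looking at the longest axis of the constant-density ellipse, which points along the eigenvector of $\Sigma$ with the largest eigenvalue. This is a small sketch assuming NumPy; the helper `major_axis_slope` and the two covariance matrices are hypothetical examples.

```python
import numpy as np

def major_axis_slope(sigma):
    """Slope (dy/dx) of the longest axis of the constant-density ellipse."""
    # eigh returns eigenvalues in ascending order, so the last column
    # is the eigenvector associated with the largest eigenvalue
    _, vecs = np.linalg.eigh(sigma)
    v = vecs[:, -1]
    return v[1] / v[0]

# Same variances, opposite sign of sigma_XY
sigma_pos = np.array([[2.0, 0.8], [0.8, 1.0]])
sigma_neg = np.array([[2.0, -0.8], [-0.8, 1.0]])

slope_pos = major_axis_slope(sigma_pos)  # tilted upward
slope_neg = major_axis_slope(sigma_neg)  # tilted downward
print(slope_pos, slope_neg)
```

The helper divides by the $x$ component of the eigenvector, so it is only meant for cases like these where the major axis is not vertical.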

## **The Bayesian Classifier for Normally Distributed Classes**

We now consider the optimal **Bayesian classifier** when the involved pdfs, $p(x|\omega_i), \ i=1,2,\ldots,M$ (**likelihood functions** of $\omega_i$ with respect to $x$),
describing the data distribution in each one of the classes, are **multivariate normal distributions**, that is,
$$N(\mu_i, \Sigma_i), \ i=1,2,\ldots,M$$


Because of the exponential form of the involved densities, it is preferable to work with the following **discriminant functions**, which involve the (monotonic) logarithmic function $\ln(\cdot)$:

\begin{align}
 g_i(x) &= \ln\left(p(x|\omega_i)P(\omega_i) \right)\\
  &= \ln p(x|\omega_i) + \ln P(\omega_i)\\
  &= \ln\left(\frac{1}{(2\pi)^{\frac{l}{2}}|\Sigma_i|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) \right) \right)+ \ln P(\omega_i)\\
  &= \ln\left(\frac{1}{(2\pi)^{\frac{l}{2}}|\Sigma_i|^{\frac{1}{2}}}\right) + \ln\left(\exp\left(-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) \right) \right)+ \ln P(\omega_i)\\
  &= c_i-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) + \ln P(\omega_i)
\end{align}
where $c_i$ is a constant that does not depend on $x$:
$c_i = -\frac{l}{2}\ln(2\pi)-\frac{1}{2}\ln |\Sigma_i|$.
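The discriminant function derived above translates almost directly into code. The sketch below assumes NumPy; the function names `discriminant` and `classify`, as well as the two-class means, covariances, and priors, are hypothetical values chosen only for illustration.

```python
import numpy as np

def discriminant(x, mu, sigma, prior):
    """g_i(x) = c_i - 0.5 (x - mu)^T Sigma^{-1} (x - mu) + ln P(omega_i)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    l = mu.shape[0]
    diff = x - mu
    # slogdet returns (sign, ln|det|), avoiding overflow in det()
    _, logdet = np.linalg.slogdet(sigma)
    c = -0.5 * l * np.log(2.0 * np.pi) - 0.5 * logdet
    maha = diff @ np.linalg.solve(sigma, diff)
    return c - 0.5 * maha + np.log(prior)

def classify(x, mus, sigmas, priors):
    """Index of the class with the largest discriminant value."""
    scores = [discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))

# Hypothetical two-class problem in l = 2 dimensions
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.array([[1.0, 0.2], [0.2, 1.0]]),
          np.array([[2.0, -0.5], [-0.5, 1.0]])]
priors = [0.5, 0.5]

label_a = classify([0.1, -0.2], mus, sigmas, priors)  # point near the first mean
label_b = classify([3.2, 2.9], mus, sigmas, priors)   # point near the second mean
print(label_a, label_b)
```

Working with $\ln|\Sigma_i|$ through `np.linalg.slogdet` and solving a linear system instead of inverting $\Sigma_i$ are numerical conveniences; the result is the same $g_i(x)$ as in the derivation.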

Expanding, we obtain
$$g_i(x)=-\frac{1}{2}x^T \Sigma_i^{-1} x + \frac{1}{2} x^T \Sigma_i^{-1}\mu_i +\frac{1}{2}\mu_i^T \Sigma_i^{-1}x-\frac{1}{2} \mu_i^T \Sigma_i^{-1}\mu_i +\ln P(\omega_i) + c_i \tag{eq1}$$

In general, this is a **nonlinear quadratic form**.

Take, for example, the case of $l=2$ and assume that
$$\Sigma_i = \left[\begin{matrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2\end{matrix}\right]$$
Then, **eq1** becomes
$$g_i(x) = -\frac{1}{2\sigma_i^2}(x_1^2+x_2^2) + \frac{1}{\sigma_i^2}(\mu_{i1}x_1+\mu_{i2}x_2)-\frac{1}{2\sigma_i^2}(\mu_{i1}^2+\mu_{i2}^2) + \ln P(\omega_i) +c_i$$

and obviously the associated decision curves
$$g_i(x)-g_j(x)=0$$
 are **quadrics**: ellipsoids, parabolas, hyperbolas, pairs of lines.


That is, in such cases, the Bayesian classifier is a quadratic classifier, in the sense that the partition of the feature space is performed via quadric decision surfaces. For $l>2$ the decision surfaces are **hyperquadrics**.
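As a sanity check on the two-dimensional isotropic case $\Sigma_i = \sigma_i^2 I$, the expanded expression can be compared numerically with the compact form $c_i-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) + \ln P(\omega_i)$. This is a sketch assuming NumPy; `g_compact`, `g_expanded`, and the numbers used are illustrative.

```python
import numpy as np

def g_compact(x, mu, var, prior):
    """General discriminant evaluated with Sigma = var * I and l = 2."""
    sigma = var * np.eye(2)
    diff = x - mu
    c = -np.log(2.0 * np.pi) - 0.5 * np.log(np.linalg.det(sigma))
    maha = diff @ np.linalg.solve(sigma, diff)
    return c - 0.5 * maha + np.log(prior)

def g_expanded(x, mu, var, prior):
    """Term-by-term expansion for Sigma = var * I, where |Sigma| = var**2."""
    c = -np.log(2.0 * np.pi) - 0.5 * np.log(var ** 2)
    quad = -(x[0] ** 2 + x[1] ** 2) / (2.0 * var)
    lin = (mu[0] * x[0] + mu[1] * x[1]) / var
    const = -(mu[0] ** 2 + mu[1] ** 2) / (2.0 * var)
    return quad + lin + const + np.log(prior) + c

x = np.array([0.4, -1.3])
mu = np.array([1.0, 2.0])
gap = abs(g_compact(x, mu, 0.8, 0.3) - g_expanded(x, mu, 0.8, 0.3))
print(gap)
```

When two classes have different variances, the coefficient of $x_1^2+x_2^2$ in $g_i(x)-g_j(x)$ is $-\frac{1}{2\sigma_i^2}+\frac{1}{2\sigma_j^2} \neq 0$, which is why the boundary is curved.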

### **Decision Hyperplanes**


The only quadratic contribution in **eq1** comes from the term $x^T \Sigma_i^{-1} x$. If we now assume that the covariance matrix is the **same in all classes**, that is, $\Sigma_i = \Sigma$, the quadratic term will be the same in all discriminant functions.

Hence, it does not enter into the comparison for computing the maximum, and it cancels out in the **decision surface equations**.

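The same is true for the constant $c_i$, since $|\Sigma_i| = |\Sigma|$ for every class. Thus, both can be omitted. Because $\Sigma^{-1}$ is symmetric, the two cross terms in **eq1** are equal and combine into $\mu_i^T \Sigma^{-1} x$, so we may redefine $g_i(x)$ as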
$$g_i(x) = w_i^T x + w_{i0}$$
where
\begin{align}
w_i &= \Sigma^{-1}\mu_i\\
w_{i0} &= \ln P(\omega_i) -\frac{1}{2} \mu_i^T \Sigma^{-1}\mu_i
\end{align}
Hence $g_i(x)$ is a **linear function** of $x$ and the respective decision surfaces are **hyperplanes**.
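To see that dropping the shared terms does not change the decisions, the linear discriminants can be compared with the full quadratic ones on random points. The sketch below assumes NumPy; the shared covariance, the three class means, the priors, and the names `linear_params` and `quadratic_g` are hypothetical.

```python
import numpy as np

def linear_params(mu, sigma, prior):
    """Return (w_i, w_i0) for a covariance matrix shared by all classes."""
    w = np.linalg.solve(sigma, mu)        # w_i = Sigma^{-1} mu_i
    w0 = np.log(prior) - 0.5 * (mu @ w)   # w_i0 = ln P - 0.5 mu^T Sigma^{-1} mu
    return w, w0

def quadratic_g(x, mu, sigma, prior):
    """Full discriminant, including the quadratic term and c_i."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    c = -0.5 * len(mu) * np.log(2.0 * np.pi) - 0.5 * logdet
    return c - 0.5 * (diff @ np.linalg.solve(sigma, diff)) + np.log(prior)

# Hypothetical three-class problem with one shared covariance matrix
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
priors = [0.5, 0.3, 0.2]

rng = np.random.default_rng(0)
points = rng.normal(scale=3.0, size=(200, 2))

params = [linear_params(m, sigma, p) for m, p in zip(mus, priors)]
lin_scores = np.array([[w @ x + w0 for w, w0 in params] for x in points])
quad_scores = np.array([[quadratic_g(x, m, sigma, p)
                         for m, p in zip(mus, priors)] for x in points])

lin_labels = lin_scores.argmax(axis=1)
quad_labels = quad_scores.argmax(axis=1)
print(np.array_equal(lin_labels, quad_labels))
```

For each point, the two score vectors differ only by an offset that is the same for every class, so the class with the maximum score is unchanged.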

When the classes have the same $\sigma_{XX}$ and $\sigma_{YY}$ (more generally, the same covariance matrix), the decision boundary becomes a straight line.