# Assigning Probabilities by Feature

<br>

Suppose we have $\mathbf{x}\in\mathbb{R}^m,\ \boldsymbol{\mu}\in\mathbb{R}^m,\text{ and }\boldsymbol{\Sigma}\in\mathbb{R}^{m \times m}$ such that $\boldsymbol{\Sigma}^\text{T}=\boldsymbol{\Sigma}$ and the eigenvalues of $\boldsymbol{\Sigma}$ are all positive, i.e. $\boldsymbol{\Sigma}$ is symmetric positive definite (SPD).


Recall that the log pdf of a multivariate normal is

$$\begin{align*}
    \log \text{pdf}(\mathbf{x}\ |\ \boldsymbol{\mu},\ \boldsymbol{\Sigma}) 
    &= -\frac{m}{2} \log 2 \pi - \frac{1}{2}\log |\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\text{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}),
    & \big(\text{use the eigenvalue decomposition } \mathbf{U}\mathbf{S}\mathbf{U}^\text{T} = \boldsymbol{\Sigma}\big)\\
    &= -\frac{m}{2} \log 2 \pi - \frac{1}{2} \log |\mathbf{U}\mathbf{S}\mathbf{U}^\text{T}| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\text{T}\mathbf{U}\mathbf{S}^{-1}\mathbf{U}^\text{T}(\mathbf{x} - \boldsymbol{\mu}),
    & \big(|\mathbf{AB}| = |\mathbf{A}| \cdot |\mathbf{B}| \text{ and } |\mathbf{U}| \cdot |\mathbf{U}^\text{T}| = |\mathbf{U}\mathbf{U}^\text{T}| = 1 \text{ as } \mathbf{U} \text{ is orthogonal}\big)\\
    &= -\frac{m}{2} \log 2 \pi - \frac{1}{2} \log |\mathbf{S}| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\text{T}\mathbf{U}\mathbf{S}^{-1}\mathbf{U}^\text{T}(\mathbf{x} - \boldsymbol{\mu}),
    & \big(\log |\mathbf{S}| = \sum_{j=1}^m \log s_{jj} \text{ as } \mathbf{S} \text{ is a diagonal matrix}\big)\\
    &= -\frac{1}{2}\sum_{j=1}^m \bigg(\log (2 \pi s_{jj}) + s_{jj}^{-1}\big\langle\mathbf{x} - \boldsymbol{\mu},\ \mathbf{u}_{\cdot,j}\big\rangle^2\bigg).
\end{align*}
$$

where $\langle\cdot,\ \cdot\rangle$ is the inner product and $\mathbf{u}_{\cdot,j}$ is the $j$ -th column of $\mathbf{U}$.<br>
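
The step to the final line is the least obvious one, so here it is written out. The shorthand $\mathbf{z} = \mathbf{U}^\text{T}(\mathbf{x} - \boldsymbol{\mu})$ is introduced only for this illustration; its $j$-th entry is $z_j = \langle\mathbf{x} - \boldsymbol{\mu},\ \mathbf{u}_{\cdot,j}\rangle$.

$$\begin{align*}
    (\mathbf{x} - \boldsymbol{\mu})^\text{T}\mathbf{U}\mathbf{S}^{-1}\mathbf{U}^\text{T}(\mathbf{x} - \boldsymbol{\mu})
    &= \mathbf{z}^\text{T}\mathbf{S}^{-1}\mathbf{z}
     = \sum_{j=1}^m s_{jj}^{-1} z_j^2
     = \sum_{j=1}^m s_{jj}^{-1}\big\langle\mathbf{x} - \boldsymbol{\mu},\ \mathbf{u}_{\cdot,j}\big\rangle^2,\\
    -\frac{m}{2} \log 2 \pi - \frac{1}{2}\sum_{j=1}^m \log s_{jj}
    &= -\frac{1}{2}\sum_{j=1}^m \big(\log 2 \pi + \log s_{jj}\big)
     = -\frac{1}{2}\sum_{j=1}^m \log (2 \pi s_{jj}).
\end{align*}
$$

Adding the two right-hand sides (the first multiplied by $-\frac{1}{2}$) gives the single sum over $j$.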

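$\therefore -\frac{1}{2} \bigg(\log (2 \pi s_{jj}) + s_{jj}^{-1}\big\langle\mathbf{x} - \boldsymbol{\mu},\ \mathbf{u}_{\cdot,j}\big\rangle^2\bigg)$ is the $\log$ probability contribution of the $j$-th term of the sum, i.e. of the projection of $\mathbf{x} - \boldsymbol{\mu}$ onto the $j$-th eigenvector of $\boldsymbol{\Sigma}$. Strictly, $j$ indexes eigen-directions, which coincide with the original features only when $\boldsymbol{\Sigma}$ is diagonal.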
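
As a quick sanity check of the decomposition, here is a minimal sketch that computes the $m$ terms with `np.linalg.eigh` and compares their sum with SciPy's `logpdf`. The helper name `log_pdf_terms` and the `demo_*` variables are hypothetical, chosen for illustration only, and the covariance is assumed to be SPD.

```python
import numpy as np
from scipy.stats import multivariate_normal


def log_pdf_terms(x, mean, cov):
    """Return the m per-term log-density contributions for N(mean, cov).

    Hypothetical helper; assumes cov is symmetric positive definite.
    """
    # eigh is meant for symmetric matrices; eigenvalues are returned in ascending order
    s, U = np.linalg.eigh(cov)
    # inner products <x - mean, u_j> for every column u_j of U
    proj = (x - mean) @ U
    return -0.5 * (np.log(2 * np.pi * s) + proj**2 / s)


rng = np.random.default_rng(0)
m = 3
A = rng.normal(size=(m, m))
demo_cov = A.T @ A + np.eye(m)       # SPD by construction
demo_mean = rng.normal(size=m)
demo_x = rng.normal(size=m)

terms = log_pdf_terms(demo_x, demo_mean, demo_cov)
reference = multivariate_normal(demo_mean, demo_cov).logpdf(demo_x)

# the two numbers should agree up to floating-point error
print(terms.sum(), reference)
```

Note that `eigh` orders the eigenvalues from smallest to largest, whereas `np.linalg.svd` orders singular values from largest to smallest, so the individual terms may come out in a different order even though the sum is the same.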

<br>

If $\boldsymbol{\Sigma}$ is **not symmetric positive definite** (and hence not a valid covariance matrix for a non-degenerate Gaussian), the computation can still be pushed through formally with the Singular Value Decomposition, giving the more general expression:

$\quad -\frac{1}{2} \bigg(\log (2 \pi s_{jj}) + s_{jj}^{-1}\big\langle\mathbf{x} - \boldsymbol{\mu},\ \mathbf{u}_{\cdot,j}\big\rangle\big\langle\mathbf{x} - \boldsymbol{\mu},\ \mathbf{v}_{\cdot,j}\big\rangle\bigg)$

where $\mathbf{USV}^\text{T} = \boldsymbol{\Sigma}$. 

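**Note that $\mathbf{U} = \mathbf{V}$, with all $s_{jj} > 0$, iff $\boldsymbol{\Sigma}$ is a symmetric positive definite matrix**; in that case the SVD coincides with the eigenvalue decomposition and the two expressions agree.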
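
The relationship between $\mathbf{U}$ and $\mathbf{V}$ can be illustrated numerically. The sketch below, using hypothetical variable names and an arbitrary seed, compares the two factors for an SPD matrix and for a generic non-symmetric matrix; one would expect them to match only in the first case.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))           # generic non-symmetric matrix

# SPD matrix built from B; adding the identity keeps it well conditioned
spd = B.T @ B + np.eye(3)
U_spd, s_spd, Vt_spd = np.linalg.svd(spd)

# SVD of the non-symmetric matrix itself
U_gen, s_gen, Vt_gen = np.linalg.svd(B)

# compare left and right singular vectors up to floating-point error
spd_match = np.allclose(U_spd, Vt_spd.T, atol=1e-8)
gen_match = np.allclose(U_gen, Vt_gen.T, atol=1e-8)
print(spd_match, gen_match)
```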

In [1]:
from   scipy.stats import multivariate_normal
import numpy as np

# set seed for reproducibility
np.random.seed(0)

# no. of features
m              = 3

# mean and cov
mean           = np.random.normal(size = m)
temp           = np.random.normal(size = (m ,m))
cov            = temp.T @ temp                                     # ensures SPD

# define the Gaussian distribution with this mean and covariance
dist           = multivariate_normal(mean, cov)

# random new sample
x              = np.random.normal(size = m)

# singular values and vectors; for an SPD matrix these are the eigenvalues and (orthonormal) eigenvectors
U, s, Vt       = np.linalg.svd(cov)
V              = Vt.T

# negative log likelihood
nll            = np.log(2 * np.pi * s).sum() + ((((x - mean) @ U)) * ((x - mean) @ V) / s).sum()
nll           /= 2

# should match scipy's computation up to floating-point error
print(nll, -dist.logpdf(x))

4.210372454378047 4.210372454378047


In [2]:
U - V

array([[-2.22044605e-16,  1.11022302e-16,  3.33066907e-16],
       [ 2.22044605e-16, -1.66533454e-16,  1.11022302e-16],
       [-5.55111512e-17, -2.22044605e-16,  1.38777878e-16]])

In [3]:
# computation for final line in above equation
nll_by_feature  = np.log(2 * np.pi * s) + np.square((x - mean) @ U) / s
nll_by_feature /= 2

print(nll_by_feature)

nll_by_feature.sum()

[2.09292433 1.50395575 0.61349238]


4.210372454378046

The 1st value has the largest negative log-likelihood, so it is the least likely term; let's inspect the difference between the sample and the mean.

In [4]:
x - mean

array([-1.00301462, -0.27848219, -0.53487475])

We can see that every component of $\mathbf{x}$ is less than the corresponding component of $\boldsymbol{\mu}$, and that $x_1$ is the furthest from its associated $\mu_1$. Let's inspect $\boldsymbol{\Sigma}$.

In [5]:
cov

array([[ 6.09286146,  4.10033934, -1.69091987],
       [ 4.10033934,  3.5314304 , -1.60002145],
       [-1.69091987, -1.60002145,  3.08063762]])

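The off-diagonal entries of $\boldsymbol{\Sigma}$ show a positive covariance between features 1 and 2 and a negative covariance between feature 3 and each of the other two. So $(x_1 - \mu_1)$ and $(x_2 - \mu_2)$ should tend to have the same sign, whereas $(x_3 - \mu_3)$ should tend to have the opposite sign. Here all three differences are negative, so the third one goes against the correlation structure.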
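
This sign argument can be checked mechanically. The sketch below hard-codes the `cov` and `x - mean` values printed above under the hypothetical names `cov_doc` and `dev_doc`, compares the sign of each covariance entry with the sign of the corresponding product of deviations, and also scales each deviation by its marginal standard deviation as a rough per-feature measure of distance from the mean.

```python
import numpy as np

# values copied from the outputs of cells [5] and [4] above
cov_doc = np.array([[ 6.09286146,  4.10033934, -1.69091987],
                    [ 4.10033934,  3.5314304 , -1.60002145],
                    [-1.69091987, -1.60002145,  3.08063762]])
dev_doc = np.array([-1.00301462, -0.27848219, -0.53487475])

# sign each pair of deviations is expected to have, according to the covariance
expected_sign = np.sign(cov_doc)

# sign each pair of deviations actually has in this sample
observed_sign = np.sign(np.outer(dev_doc, dev_doc))

# True where the sample agrees with the covariance structure
agrees = expected_sign == observed_sign
print(agrees)

# deviations in units of each feature's marginal standard deviation
z = dev_doc / np.sqrt(np.diag(cov_doc))
print(z)
```

The entries of `agrees` that pair feature 3 with features 1 and 2 are the ones that disagree, which is the mismatch described above. This is only a heuristic reading of the covariance matrix, not a replacement for the full log-density.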