## Normalizing tables

First, let's create a test table:

In [2]:
import numpy as np

table = np.arange(16).reshape((4, 4))
table

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

For the table to be a proper probability distribution, its entries must sum to 1. The NumPy idiom for this is:

In [3]:
norm_table = table/table.sum()
norm_table

array([[ 0.        ,  0.00833333,  0.01666667,  0.025     ],
       [ 0.03333333,  0.04166667,  0.05      ,  0.05833333],
       [ 0.06666667,  0.075     ,  0.08333333,  0.09166667],
       [ 0.1       ,  0.10833333,  0.11666667,  0.125     ]])

But if this is a conditional probability table, say $p_{A|B}(a|b)$, with the values of $b$ indexing the columns of the table, then every column should sum to 1:

In [4]:
p_a_given_b = table/table.sum(axis=0)
p_a_given_b

array([[ 0.        ,  0.03571429,  0.0625    ,  0.08333333],
       [ 0.16666667,  0.17857143,  0.1875    ,  0.19444444],
       [ 0.33333333,  0.32142857,  0.3125    ,  0.30555556],
       [ 0.5       ,  0.46428571,  0.4375    ,  0.41666667]])

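Here axis=0 sums down each column and axis=1 sums along each row. To normalize the table by rows, we also pass keepdims=True so that the row sums keep the shape (4, 1) and broadcast correctly against the rows of the table: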

In [5]:
table/table.sum(axis=1, keepdims=True)

array([[ 0.        ,  0.16666667,  0.33333333,  0.5       ],
       [ 0.18181818,  0.22727273,  0.27272727,  0.31818182],
       [ 0.21052632,  0.23684211,  0.26315789,  0.28947368],
       [ 0.22222222,  0.24074074,  0.25925926,  0.27777778]])
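
As a quick sanity check, the sketch below rebuilds the same test table and confirms that the column-normalized table sums to 1 down each column and the row-normalized table sums to 1 along each row. It also shows what happens if keepdims=True is left out when normalizing by rows. The names by_cols, by_rows and wrong are illustrative only, and Python 3 division is assumed.

```python
import numpy as np

table = np.arange(16).reshape((4, 4))

col_sums = table.sum(axis=0)                 # shape (4,): one sum per column
row_sums = table.sum(axis=1, keepdims=True)  # shape (4, 1): one sum per row

by_cols = table / col_sums  # (4,) broadcasts across rows, so column j is divided by col_sums[j]
by_rows = table / row_sums  # (4, 1) broadcasts across columns, so row i is divided by row_sums[i]

print(by_cols.sum(axis=0))  # every column sums to 1
print(by_rows.sum(axis=1))  # every row sums to 1

# Without keepdims the row sums have shape (4,), so element [i, j] is divided
# by the sum of row j instead of row i, and the rows no longer sum to 1.
wrong = table / table.sum(axis=1)
print(wrong.sum(axis=1))
```

Dividing by table.sum(axis=0) needs no keepdims because a shape (4,) array already lines up with the last axis, which is the column index.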

What if we need the product of the table elements in each column? It can be computed this way:

In [7]:
prod_by_col = table.prod(axis=0)
prod_by_col

array([   0,  585, 1680, 3465])

But there is a problem. If the table is large and its elements are small, the intermediate result of the computation can fall below the smallest positive double-precision number (about $10^{-323}$ in Python). The intermediate result is then rounded to zero, and the final result is zero as well. This is called underflow.
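
The integer test table is too small to show the effect, so here is a minimal sketch with made-up data: 400 values of 0.1, whose true product is $10^{-400}$. The array small_probs and its size are assumptions chosen only to trigger underflow; they are not part of the example above.

```python
import numpy as np

# Hypothetical data: 400 small probabilities; the exact product is 1e-400.
small_probs = np.full(400, 0.1)

# The running product drops below the smallest positive double and becomes 0.
direct = small_probs.prod()
print(direct)  # 0.0

# The sum of the logarithms of the same values is an ordinary finite number.
log_total = np.log(small_probs).sum()
print(log_total)  # roughly -921.03

# Smallest positive double-precision number, for reference.
print(np.nextafter(0.0, 1.0))
```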

If we compute this product in the log domain, underflow can be avoided. The idea is $\prod_n a_n = \exp\left(\sum_n \log a_n\right)$:

In [8]:
prod_table = np.exp(np.log(table).sum(axis=0))
prod_table

array([    0.,   585.,  1680.,  3465.])

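This is the same result as prod_by_col. The first column contains a 0, so np.log returns -inf there (NumPy reports a divide-by-zero RuntimeWarning), and exp(-inf) gives back 0. The normalized product is: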

In [9]:
norm_prod_table = prod_table/prod_table.sum()
norm_prod_table

array([ 0.        ,  0.10209424,  0.29319372,  0.60471204])

What should we do if we only have access to log_prod_table?

In [10]:
log_prod_table = np.log(table).sum(axis=0)
log_prod_table

array([       -inf,  6.37161185,  7.42654907,  8.15046791])