# Big Entropy and the Generalized Linear Model

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

import collections

import matplotlib.pyplot as plt
import pymc3 as pm

import scipy.stats as stats

Why choose the distributions with the biggest entropy? 

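1. The distribution with the biggest entropy is the widest and least informative distribution. In the context of priors, this means choosing the least informative distribution consistent with what we know about a parameter. In the context of likelihoods, it means selecting the distribution we'd get by counting up all the ways outcomes could arise, consistent with the constraints on the outcome variable. In both cases, the resulting distribution embodies the least information while remaining true to the information we've provided.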
2. Nature tends to produce empirical distributions that have high entropy. 
3. Regardless of why it works, it works. There is no guarantee that logic in the small world will work in the big world, even if the mathematics works.

Information entropy: 
$$ H(p) = - \sum_{i} p_i \log p_i $$

The principle of maximum entropy applies this measure of uncertainty to the problem of choosing among probability distributions.

> The distribution that can happen the most ways is also the distribution with the biggest information entropy. The distribution with the biggest entropy is the most conservative distribution that obeys its constraints.

In [11]:
p = np.array(
    [[0, 0, 10, 0, 0], # pA
    [0, 1, 8, 1, 0], # pB
    [0, 2, 6, 2, 0], # pC
    [1, 2, 4, 2, 1], # pD
    [2, 2, 2, 2, 2]] # pE
)

In [17]:
# Divide each row by its total so that every row sums to 1.
p_norm = p / p.sum(axis=1, keepdims=True)

Since each row of `p_norm` is now a probability distribution, we can compute the information entropy of each.

In [28]:
def entropy(arr):
    # Treat 0 * log(0) as 0, so zero-probability outcomes contribute nothing.
    return -sum(val * np.log(val) if val > 0 else 0 for val in arr)

In [27]:
np.apply_along_axis(entropy, arr=p_norm, axis=1)

array([-0.        ,  0.63903186,  0.95027054,  1.47080848,  1.60943791])
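As a hand check of the second value in the output above, the entropy of row B (probabilities 0.1, 0.8, 0.1, with the zero entries contributing nothing) can be worked out directly. This assumes natural logarithms, which is what `np.log` computes:

$$ H(p_B) = -(0.1 \log 0.1 + 0.8 \log 0.8 + 0.1 \log 0.1) \approx -(-0.2303 - 0.1785 - 0.2303) \approx 0.639 $$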

So distribution E, which can be realized in by far the greatest number of ways, also has the biggest entropy. This is not a coincidence.

**Information entropy is a way of counting how many unique arrangements correspond to a distribution.**
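The notebook never computes the "number of ways" explicitly, so here is a minimal sketch of that count. It assumes each row of `p` holds counts of 10 distinguishable items placed in 5 bins, so that the number of arrangements is the multinomial coefficient $N! / (n_1! \cdots n_5!)$. The names `counts`, `ways` and `entropy_from_counts` are illustrative, not from the original notebook.

```python
from math import factorial, log

# Rows copied from the `p` array above: counts of 10 items in 5 bins.
counts = {
    "A": [0, 0, 10, 0, 0],
    "B": [0, 1, 8, 1, 0],
    "C": [0, 2, 6, 2, 0],
    "D": [1, 2, 4, 2, 1],
    "E": [2, 2, 2, 2, 2],
}

def ways(c):
    """Multinomial coefficient N! / (n_1! * ... * n_k!)."""
    w = factorial(sum(c))
    for n in c:
        w //= factorial(n)  # each division is exact
    return w

def entropy_from_counts(c):
    """Information entropy of the normalized counts, with 0 * log(0) = 0."""
    total = sum(c)
    return -sum((n / total) * log(n / total) for n in c if n > 0)

for name, c in counts.items():
    w = ways(c)
    # log(w) / N is the per-item log multiplicity, shown next to the entropy.
    print(name, w, round(log(w) / sum(c), 3), round(entropy_from_counts(c), 3))
```

The counts of ways come out as 1, 90, 1260, 37800 and 113400, in the same order as the entropies. With only 10 items, `log(w) / N` is not equal to the entropy; the two agree only in the limit of large `N`, but they already rank the five distributions identically.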

This is useful because the distribution that can happen in the greatest number of ways is the most plausible distribution. Call this the maximum entropy distribution.
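To illustrate what "maximum entropy distribution" means here, the sketch below searches every way of placing 10 items in 5 bins and picks the allocation with the largest entropy. It assumes the only constraint is that the counts sum to 10; with additional constraints the winner would generally differ. The names `N_ITEMS`, `N_BINS` and `entropy_from_counts` are illustrative.

```python
from itertools import product
from math import log

N_ITEMS, N_BINS = 10, 5  # assumed to match the rows of `p` above

def entropy_from_counts(c):
    """Information entropy of the normalized counts, with 0 * log(0) = 0."""
    total = sum(c)
    return -sum((n / total) * log(n / total) for n in c if n > 0)

# Brute force: every tuple of bin counts whose total is N_ITEMS.
candidates = [
    c for c in product(range(N_ITEMS + 1), repeat=N_BINS)
    if sum(c) == N_ITEMS
]

best = max(candidates, key=entropy_from_counts)
print(len(candidates), best)
```

The search covers 1001 allocations and selects `(2, 2, 2, 2, 2)`, the flat distribution E from the table above.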