# Information Theory Measures using the ITE Toolbox

* Author: J. Emmanuel Johnson
* Email: jemanjohnson34@gmail.com
* Date: $4^{\text{th}}$ September, $2019$

This notebook will walk-through how one can calculate a few key Information theory (IT) measures using the ITE toolbox. We have done previous experiments with the MATLAB package but there is a python version that can be useful for Python users. It's a lot cleaner but some of the functionality may be difficult to follow. 

## Literature Review (what we previous did)


### Entropy

In our experiments, we were only looking at Shannon entropy. It is the general case of Renyi's entropy as $\alpha \rightarrow 1$. We chose not to look at Renyi's entropy because we did not want to go down a rabbit hole of measures that we cannont understand nor justify. So we stuck to the basics. It's also important to keep in mind that we were looking at measures that could calculate the joint entropy; i.e. for multivariate, multi-dimensional datasets.


#### Algorithms

##### KnnK

This uses the KNN method to estimate the entropy. From what I understand, it's the simplest method that may have some issues at higher dimensions and large number of samples (normal with KNN estimators). 


* A new class of random vector entropy estimators and its applications in testing statistical hypotheses - Goria et. al. (2005) - [Paper](https://www.tandfonline.com/doi/full/10.1080/104852504200026815)
* Nearest neighbor estimates of entropy - Singh et. al. (2003) - [paper]()
* A statistical estimate for the entropy of a random vector - Kozachenko et. al. (1987) - [paper]()

##### KDP

This is the logical progression from KnnK. It uses KD partitioning trees (KDTree) algorithm to speed up the calculations I presume.

* Fast multidimensional entropy estimation by k-d partitioning - Stowell & Plumbley (2009) - [Paper]()

##### expF 

This is the close-form expression for the Sharma-Mittal entropy calculation for expontial families. This estimates Y using the maximum likelihood estimation and then uses the analytical formula for the exponential family.

* A closed-form expression for the Sharma-Mittal entropy of exponential families - Nielsen & Nock (2012) - [Paper]()

##### vME

This estimates the Shannon differential entropy (H) using the von Mises expansion. 

* Nonparametric von Mises estimators for entropies, divergences and mutual informations - Kandasamy et. al. (2015) - [Paper]()

##### Ensemble

Estimates the entropy from the average entropy estimations on groups of samples


This is a simple implementation with the freedom to choose the estimator `estimate_H`.

```python
# split into groups
for igroup in batches:
    H += estimate_H(igroup)
    
H /= len(batches)
```

* High-dimensional mutual information estimation for image registration - Kybic (2004) - [Paper]()


#### Potential New Experiments

#### Voronoi

Estimates Shannon entropy using Voronoi regions. Apparently it is good for multi-dimensional densities.

* A new class of entropy estimators for multi-dimensional densities - Miller (2003) - [Paper]()

### Mutual Information

### Total Correlation

## Code

In [1]:
import sys, os
cwd = os.getcwd()
sys.path.insert(0, f'{cwd}/../../src')
sys.path.insert(0, f'{cwd}/../../src/itetoolbox')

import numpy as np
import ite
from data.load_TishbyData import load_TishbyData

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Data

We will simulate some data X that is normally distributed and Y which is X that has been rotated by some random matrix A. 

In [2]:
10**(-2)

0.01

In [4]:
np.random.seed(123)    # reproducibility
n_samples    = 1000
d_dimensions = 3

# create dataset X
X = np.random.randn(n_samples, d_dimensions)

# do some random rotation
A = np.random.rand(d_dimensions, d_dimensions)

# create dataset Y
Y = X @ A

### Entropy

#### Shannon Entropy (KNN/KDP)

In [5]:
# parameters (default)
mult        = True
knn_method  = 'cKDTree'      # fast version (slower version KNN)
k_neighbors = 20             # free parameter
eps         = 0.1            # free parameter

# initialize it estimator
clf_knnK = ite.cost.BHShannon_KnnK(
    mult=mult, 
    knn_method=knn_method,
    k=k_neighbors,
    eps=eps
)

# estimate entropy
H_x = clf_knnK.estimation(X)
H_y = clf_knnK.estimation(Y)

print(f"H(X): {H_x:.3f} bits")
print(f"H(Y): {H_y:.3f} bits")

H(X): 4.132 bits
H(Y): 2.208 bits


### Mutual Information

The estimation was carried out using the following relationship. Let $XY = [X, Y] \in \mathcal{R}^{N \times D}$, where $D=D_1+D_2$.

$$I(XY) = \sum_{d=1}^D H(XY) - H(XY)$$

The pseudo-code is fairly simple (in the MATLAB version).


1. Organize the components

```python
XY = [X, Y]
```

2. Estimate the joint entropy, $H(XY)$

```python
H_xy = - estimate_H(
    np.hstack(XY)     # stack the vectors dimension-wise
)
```

3. Estimate the marginals of XY; i.e. estimate X and Y individually, then sum them.
```python
H_x_y = np.sum(
    # estimate the entropy for each marginal
    [estimate_H(imarginal) for imarginal in XY]
)
```

4. Summation of the two quantities

```python
MI_XY = H_x_y + H_xy
```

In [6]:
# parameters (default)
mult       = True          # ??
kl_co_name = 'BDKL_KnnK'   # KLD calculator
kl_co_pars = None          # parameters for the KLD calculator

# initialize it estimator
clf_mi = ite.cost.MIShannon_DKL(
    mult=mult,
    kl_co_name=kl_co_name,
    kl_co_pars=kl_co_pars,
)

# concat data
XY = np.concatenate((X, Y), axis=1)

# individual dimensions per
sub_dimensions = np.array([X.shape[1], Y.shape[1]])

# estimate mutual information
mi_XY = clf_mi.estimation(XY, sub_dimensions)

print(f"MI(X,Y): {mi_XY:.3f} bits")

MI(X,Y): 3.584 bits


I expect there to be some MI between X and Y since it is a rotation of the original distribution.

---
### Total Correlation (Multi-Information, Co-Information)

The estimation was carried out using the following relationship:

$$I(x_1, x_2, \ldots, x_D) = \sum_{d=1}^D H(x_d) - H(X)$$

In [7]:
# parameters (default)
mult       = True
kl_co_name = 'BDKL_KnnK'
kl_co_pars = None

# initialize it estimator
clf_mi = ite.cost.MIShannon_DKL(
    mult=mult,
    kl_co_name=kl_co_name,
    kl_co_pars=kl_co_pars,
)

# concat data
sub_dimensions = np.array(range(X.shape[1]))

# estimate mutual information
tc_X = clf_mi.estimation(X, sub_dimensions)
tc_Y = clf_mi.estimation(Y, sub_dimensions)

print(f"Shannon Total Correlation, TC(X): {tc_X:.3f} bits")
print(f"Shannon Total Correlation, TC(Y): {tc_Y:.3f} bits")

Shannon Total Correlation, TC(X): -0.002 bits
Shannon Total Correlation, TC(Y): 2.365 bits


This makes since given that the original distribution $X$ should have no correlations between dimensions because it is Gaussian. The rotation of $X$ by some random matrix $A$, $Y=AX^{\top}$, means that we have added some correlations between dimensions. We see that as the TC is higher.