## On information theory, relative entropy, mutual information and KL divergence

## What is information theory

Information theory is the branch of computer science that deals with the quantification, storage and communication of digital information.


A key measure used to quantify information in information theory is entropy.

This builds on a basic premise of information theory: the less probable a communicated message (an event, in the language of statistics), the higher its information content.

If the event is very probable, it carries hardly any new information, whereas if the event is improbable, it carries a lot of information.

In concrete terms, the information I(E) of an event E decreases as the probability of the event p(E) increases.

The specific relation used is the logarithm of 1/p:

$I(E) = \log\frac{1}{p(E)}$, which is equivalent to $-\log p(E)$

With a base-2 logarithm, I(E) is measured in bits.

Why is this functional form used? [See reasons here](https://stats.stackexchange.com/questions/87182/what-is-the-role-of-the-logarithm-in-shannons-entropy#:~:text=We%20can%20call%20log(1,events%20we%20can%20tell%20apart).)


In short, there are two reasons:
1) A mathematical reason: the logarithm makes information grow linearly with system size, so that it "behaves like information", and gives it convenient mathematical properties. For example, we would expect the entropy of tossing a coin n times to be n times the entropy of tossing a coin once.

Having a logarithm in the definition makes this mathematically possible (see the link above).

2) An intuitive reason: if all events happen with the same probability p (such as a fair coin or a fair die, i.e. a uniform distribution), the number of events is 1/p. To label 1/p events in binary notation, we need about $\log_{2}(1/p)$ bits (exactly that many when 1/p is a power of 2).
This extends to non-uniform distributions for theoretical reasons (see the link above).
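Both reasons can be checked numerically. The sketch below is only an illustration: `information_bits` is a hypothetical helper name, and the choice of 5 coin tosses and 8 equally likely events is arbitrary.

```python
from math import log2

def information_bits(p):
    """Information content, in bits, of an event with probability p (0 < p <= 1)."""
    return log2(1 / p)

# Reason 1: information adds up over independent events.
n = 5
p_single = 1 / 2              # one fair coin toss
p_sequence = p_single ** n    # one specific sequence of n independent tosses
info_single = information_bits(p_single)
info_sequence = information_bits(p_sequence)
print(info_sequence, n * info_single)

# Reason 2: 8 equally likely events (p = 1/8) need 3 binary digits to label.
bits_for_eight = information_bits(1 / 8)
print(bits_for_eight)
```

Because probabilities of independent events multiply, the logarithm turns that product into a sum, which is what makes the information of the sequence equal to n times the information of one toss.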


## Code

Given a fair coin, where each outcome has probability 1/2, what is the information of an outcome?

In [9]:
from math import log2
information = log2(1/(1/2))
print("information of event E in a fair coin toss is {} bits".format(information))

information of event E in a fair coin toss is 1.0 bits


Given a fair die, where each outcome has probability 1/6, what is the information of an outcome?

In [7]:
from math import log2
information = log2(1/(1/6))
print("information of event E in a fair die roll is {} bits".format(information))

information of event E in a fair die roll is 2.584962500721156 bits


Already, we see that since p(E) for a die is 1/6 (lower than p(E) for a coin),
the information of an event in a die roll is higher than in a coin toss.

## What is entropy

Entropy in information theory is analogous to the concept of entropy in thermodynamic systems.

Specifically, the entropy of a random variable X is the [average level of "information", "surprise", or "uncertainty" inherent in the variable's outcomes](https://en.wikipedia.org/wiki/Entropy_(information_theory)).

Note that entropy is defined for a random variable, whereas information is defined for an event.

Entropy is nothing but the expectation of information,

i.e. $H(X) = E[I(X)] = \sum_{i} p(E_i)\log\frac{1}{p(E_i)}$

This is the equation of Shannon entropy, summed over the events/states of the RV. It is analogous to the equation of Gibbs entropy in thermodynamics, summed over the states of the thermodynamic system.
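As a worked instance of this formula (added here for illustration, using the fair die from the earlier code section), all six outcomes have probability 1/6, so with a base-2 logarithm:

$$H(X) = \sum_{i=1}^{6} \frac{1}{6}\log_2\frac{1}{1/6} = 6 \cdot \frac{1}{6}\log_2 6 = \log_2 6 \approx 2.585 \text{ bits}$$

For a uniform distribution the entropy equals the information of any single outcome, because every outcome is equally surprising.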


## Code

Can you compare the entropy of tossing a fair coin vs a biased coin where p(H) = 0.7?

In [19]:
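import numpy as np
def entropy(prob_vec):
    """Shannon entropy, in bits, of a discrete probability vector."""
    prob_vec = np.asarray(prob_vec, dtype=float)
    # Drop zero-probability outcomes: by convention 0*log(0) = 0, but np.log2(0) gives -inf
    prob_vec = prob_vec[prob_vec > 0]
    return np.sum(-prob_vec*np.log2(prob_vec))
    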

In [21]:
print("Entropy of a fair coin is {} bits".format(entropy([0.5, 0.5])))

Entropy of a fair coin is 1.0 bits


In [22]:
print("Entropy of a biased coin with p(H) = 0.7  is {} bits".format(entropy([0.7, 0.3])))

Entropy of a biased coin with p(H) = 0.7  is 0.8812908992306927 bits


Note that the entropy of the fair coin is larger. In fact, among all distributions over the same finite set of outcomes, the uniform distribution has the highest entropy, because it leaves the most uncertainty about the outcome.
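The claim about the uniform distribution can be probed with a small sweep. This is a sketch rather than a proof: it uses a pure-Python `entropy` helper (an assumption, written to avoid depending on NumPy), and the loaded die probabilities are made up for illustration.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Sweep p(H) over 0.00, 0.01, ..., 1.00 and record the entropy of each coin.
grid = [i / 100 for i in range(101)]
entropies = [entropy([p, 1 - p]) for p in grid]
best_p = grid[entropies.index(max(entropies))]
print(best_p, max(entropies))

# Same comparison for a die: fair vs a hypothetical loaded die.
fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
print(entropy(fair_die), entropy(loaded_die))
```

On this grid the entropy peaks at p(H) = 0.5, and the fair die has higher entropy than the loaded one.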

## Conditional entropy

The conditional entropy of RV Y given RV X, written H(Y|X), is the amount of information (think of it as bits) needed to describe Y once the value of X is known.

We know that H(Y) = E(I(Y)) = $\sum_{i} p(E_i)\log\frac{1}{p(E_i)}$ over all events $E_i$ of Y,

which, in the discrete case, can equivalently be written as $\sum_{y} p(Y=y)\log\frac{1}{p(Y=y)}$
over all outcomes y of RV Y.


Assume that RV X takes a specific value x.

Then, by extension,
H(Y|X=x) = $\sum_{y} p(Y=y|X=x)\log\frac{1}{p(Y=y|X=x)}$

Averaging over all possible values of X,
we can define

H(Y|X) = $\sum_{x} p(X=x) H(Y|X=x)$ 
 = $\sum_{x} p(X=x) \left(\sum_{y} p(Y=y|X=x)\log\frac{1}{p(Y=y|X=x)}\right)$ 
  = $\sum_{x}\sum_{y} p(X=x,Y=y)\log\frac{1}{p(Y=y|X=x)}$ 
  
  = $\sum_{x,y} p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}$
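The derivation can be sketched in code by computing H(Y|X) two ways and comparing: once as the weighted average of H(Y|X=x), and once from the final closed form. The joint distribution below is hypothetical, chosen only for illustration, and the variable names are not from the original notebook.

```python
from math import log2, isclose

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(X=x, Y=y), keyed by (x, y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(X=x), obtained by summing the joint over y.
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Route 1: average H(Y|X=x) over x, weighted by p(X=x).
h_avg = 0.0
for x, px in p_x.items():
    cond = [p / px for (xi, y), p in joint.items() if xi == x]
    h_avg += px * entropy(cond)

# Route 2: the closed form, sum of p(x,y) * log2(p(x) / p(x,y)).
h_closed = sum(p * log2(p_x[x] / p) for (x, y), p in joint.items() if p > 0)

print(h_avg, h_closed, isclose(h_avg, h_closed))
```

Storing the joint as a dictionary keyed by (x, y) keeps the code close to the summation notation; a 2-D array would work equally well.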


Properties of conditional entropy

Property 1 - If X and Y are independent 

If X and Y are independent RVs, then intuitively H(Y|X) = H(Y).

This can also be seen from the equation above. If X and Y are independent,
p(X=x, Y=y) = p(X=x)p(Y=y), so

H(Y|X) = $\sum_{x,y} p(X=x)p(Y=y)\log\frac{p(X=x)}{p(X=x)p(Y=y)}$

= $\sum_{x,y} p(X=x)p(Y=y)\log\frac{1}{p(Y=y)}$

= $\sum_{y} p(Y=y)\log\frac{1}{p(Y=y)}$ = H(Y), since $\sum_{x} p(X=x) = 1$


Together with the chain rule (Property 3), this also implies that H(X,Y) = H(X) + H(Y) for independent X and Y.
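A quick numerical check of Property 1, assuming two hypothetical marginals and building the joint as their product (which is exactly what independence means):

```python
from math import log2, isclose

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Hypothetical marginals, chosen only for illustration.
p_x = {"a": 0.7, "b": 0.3}
p_y = {0: 0.2, 1: 0.5, 2: 0.3}

# Under independence the joint is the product of the marginals.
joint = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}

h_x = entropy(p_x.values())
h_y = entropy(p_y.values())
h_xy = entropy(joint.values())
# H(Y|X) from the closed form: sum of p(x,y) * log2(p(x) / p(x,y)).
h_y_given_x = sum(p * log2(p_x[x] / p) for (x, y), p in joint.items())

print(isclose(h_y_given_x, h_y), isclose(h_xy, h_x + h_y))
```

Both comparisons should hold up to floating-point rounding: knowing X does not reduce the uncertainty about Y, and the joint entropy splits into the two marginal entropies.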

Property 2 - when is conditional entropy 0

H(Y|X) = 0 means that Y is completely determined by X, i.e. p(X=x, Y=y) = p(X=x) for every pair (x, y) with non-zero probability. In that case every logarithm in the equation above is log(1) = 0.
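To illustrate Property 2, the sketch below assumes a hypothetical setup in which X is uniform on {0, 1, 2, 3} and Y = X mod 2, so Y is a deterministic function of X:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Only pairs (x, x mod 2) have non-zero probability.
joint = {(x, x % 2): 0.25 for x in range(4)}

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

h_y = entropy(p_y.values())
# Every ratio p(x) / p(x,y) equals 1 here, so every log term is 0.
h_y_given_x = sum(p * log2(p_x[x] / p) for (x, y), p in joint.items())

print(h_y, h_y_given_x)
```

On its own Y carries one bit of uncertainty (it is a fair parity bit), but once X is known nothing about Y remains to be described.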

Property 3 (Chain Rule)

H(Y|X) = H(Y,X) - H(X)

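This follows directly from the last line of the derivation above, since $\log\frac{p(X=x)}{p(X=x,Y=y)} = \log\frac{1}{p(X=x,Y=y)} - \log\frac{1}{p(X=x)}$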
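The chain rule can be sanity-checked on a small example. The joint distribution here is hypothetical (weather as X, mode of travel as Y), and H(Y|X) is computed directly from its definition so that the check is not circular.

```python
from math import log2, isclose

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(X=x, Y=y).
joint = {
    ("sun", "walk"): 0.45, ("sun", "bus"): 0.15,
    ("rain", "walk"): 0.05, ("rain", "bus"): 0.35,
}

p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_x = entropy(p_x.values())
h_xy = entropy(joint.values())
# Definition of H(Y|X): sum of p(x,y) * log2(p(x) / p(x,y)).
h_y_given_x = sum(p * log2(p_x[x] / p) for (x, y), p in joint.items() if p > 0)

print(h_y_given_x, h_xy - h_x, isclose(h_y_given_x, h_xy - h_x))
```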


Property 4 - generalization of chain rule

In general, $H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1,X_2) + \ldots + H(X_n|X_1,\ldots,X_{n-1})$ (generalizing the equation above).

This is analogous to the chain rule of probability for random variables, except that the probability version is multiplicative whereas this one is additive.
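For three variables the generalized chain rule reads H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X1,X2). The sketch below checks this on a hypothetical joint distribution over three binary variables; `marginal` is an illustrative helper, and each conditional entropy is computed from its own definition.

```python
from math import log2, isclose

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return sum(-p * log2(p) for p in probs if p > 0)

def marginal(joint, keep):
    """Sum a joint dict {(x1, x2, x3): p} down to its first `keep` coordinates."""
    out = {}
    for outcome, p in joint.items():
        key = outcome[:keep]
        out[key] = out.get(key, 0.0) + p
    return out

# Hypothetical joint distribution over three binary variables (sums to 1).
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.25,
    (1, 1, 0): 0.10, (1, 1, 1): 0.10,
}

p1 = marginal(joint, 1)    # p(x1), keyed by 1-tuples
p12 = marginal(joint, 2)   # p(x1, x2), keyed by 2-tuples

h1 = entropy(p1.values())
h2_given_1 = sum(p * log2(p1[k[:1]] / p) for k, p in p12.items() if p > 0)
h3_given_12 = sum(p * log2(p12[k[:2]] / p) for k, p in joint.items() if p > 0)
h_joint = entropy(joint.values())

print(isclose(h1 + h2_given_1 + h3_given_12, h_joint))
```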


Property 5 - Bayes rule of entropy

H(Y|X) = H(X|Y) + H(Y) - H(X)

Proof:
By the chain rule, H(Y|X) = H(X,Y) - H(X) and H(X|Y) = H(X,Y) - H(Y)
Hence H(Y|X) + H(X) = H(X|Y) + H(Y) = H(X,Y)
=> H(Y|X) = H(X|Y) + H(Y) - H(X)

Again, this is analogous to Bayes' theorem of probability, except that Bayes' theorem is multiplicative.
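A numerical check of this identity, under the assumption of a hypothetical, deliberately asymmetric joint distribution (so that H(Y|X) and H(X|Y) differ). The helper names are illustrative.

```python
from math import log2, isclose

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return sum(-p * log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Marginal of coordinate `axis` (0 for X, 1 for Y) of a joint dict {(x, y): p}."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out

def conditional_entropy(joint, given_axis):
    """Entropy, in bits, of the other variable given the one at `given_axis`."""
    p_given = marginal(joint, given_axis)
    return sum(p * log2(p_given[outcome[given_axis]] / p)
               for outcome, p in joint.items() if p > 0)

# Hypothetical joint distribution: X has 2 outcomes, Y has 3.
joint = {(0, 0): 0.30, (0, 1): 0.20, (0, 2): 0.10,
         (1, 0): 0.05, (1, 1): 0.05, (1, 2): 0.30}

h_x = entropy(marginal(joint, 0).values())
h_y = entropy(marginal(joint, 1).values())
h_y_given_x = conditional_entropy(joint, 0)
h_x_given_y = conditional_entropy(joint, 1)

print(isclose(h_y_given_x, h_x_given_y + h_y - h_x))
```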

Property 6 - H(Y|X) <= H(Y)

Proof : 

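Define the mutual information I(X;Y) = H(Y) - H(Y|X), written I below. (It equals the relative entropy, or KL divergence, between the joint distribution of X and Y and the product of their marginals.)

It suffices to prove that I >= 0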


H(Y) = $-\sum_{y} p(Y=y)\log p(Y=y)$   
H(Y|X) = $\sum_{x,y} p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}$



I = $-\sum_{y} p(Y=y)\log p(Y=y) - \sum_{x,y} p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}$



Since $p(Y=y) = \sum_{x} p(X=x,Y=y)$, we have $\sum_{y} p(Y=y)\log p(Y=y) = \sum_{x,y} p(X=x,Y=y)\log p(Y=y)$


Therefore, I = $-\sum_{x,y} p(X=x,Y=y)\log p(Y=y) - \sum_{x,y} p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}$


= $-\sum_{x,y}\left(p(X=x,Y=y)\log p(Y=y) + p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}\right)$

= $\sum_{x,y} p(X=x,Y=y)\log\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}$


Multiplying and dividing each term by p(X=x)p(Y=y), we get


I = $\sum_{x,y} p(X=x)p(Y=y)\,\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}\log\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}$



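By Jensen's inequality, for a convex function f, E(f(Z)) >= f(E(Z))

f(z) = z log z is a convex function for z > 0

Therefore, I = $\sum_{x,y} p(X=x)p(Y=y)\, f\left(\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}\right)$

where f(z) = z log z. The weights p(X=x)p(Y=y) sum to 1, so this sum is an expectation and Jensen's inequality applies:


$I \geq f\left(\sum_{x,y} p(X=x)p(Y=y)\,\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}\right) = f\left(\sum_{x,y} p(X=x,Y=y)\right)$

= f(1) = 0

Hence I >= 0, i.e. H(Y|X) <= H(Y), which proves the property.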
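The quantity I from the proof can be computed directly from a joint distribution. The sketch below uses a hypothetical helper `mutual_information` and two made-up joint distributions, one dependent and one independent, to show that I is positive in the first case and zero (up to rounding) in the second.

```python
from math import log2

def mutual_information(joint):
    """I = sum of p(x,y) * log2(p(x,y) / (p(x) p(y))), in bits, for a joint dict {(x, y): p}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Hypothetical dependent joint: X and Y agree 80% of the time.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Hypothetical independent joint: product of two marginals.
independent = {(x, y): px * py
               for x, px in [(0, 0.7), (1, 0.3)]
               for y, py in [(0, 0.2), (1, 0.8)]}

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
print(mi_dependent, mi_independent)
```

For the dependent case, I also equals H(Y) - H(Y|X), which here is 1 bit minus the entropy of a (0.8, 0.2) coin.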

## References

1) https://machinelearningmastery.com/what-is-information-entropy/  
2) https://en.wikipedia.org/wiki/Entropy_(information_theory)
3) https://stats.stackexchange.com/questions/87182/what-is-the-role-of-the-logarithm-in-shannons-entropy#:~:text=We%20can%20call%20log(1,events%20we%20can%20tell%20apart).
4) https://machinelearningmastery.com/information-gain-and-mutual-information/
    