## On information theory, relative entropy, mutual information and KL divergence

## What is information theory

Information theory is the mathematical study of the quantification, storage and communication of information. It sits at the intersection of mathematics, computer science and electrical engineering.

A key measure used to quantify information in information theory is entropy

It builds on a basic premise of information theory: the less probable a communicated message (an event, in the language of statistics), the more information it carries

If the event is very probable, it carries hardly any new information, whereas if the event is less probable, it carries a lot of information



## Hartley's measure of information

$I(E) = \log_b L$

where L is the number of distinct values that the RV X (from which E is sampled) can take, and b is the base of the logarithm

Disadvantage: it does not take the probability distribution of the RV into account; it implicitly assumes all outcomes are equally probable (uniform distribution)
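As a quick illustration of Hartley's measure, the sketch below evaluates $\log_b L$ for a coin (L = 2) and a die (L = 6). The helper name `hartley_information` is hypothetical, introduced here only for illustration.

```python
from math import log

def hartley_information(num_values, base=2):
    """Hartley information log_b(L) of one sample from L equally likely values.

    `hartley_information` is an illustrative helper, not a library function.
    """
    if num_values < 1:
        raise ValueError("num_values must be at least 1")
    return log(num_values) / log(base)

coin_bits = hartley_information(2)  # a coin has 2 possible outcomes
die_bits = hartley_information(6)   # a die has 6 possible outcomes
print("coin: {} bits, die: {} bits".format(coin_bits, die_bits))
```

Note that the result depends only on the count L, which is exactly the limitation described above.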

## Shannon's measure of information

In concrete terms, the information I(E) of an event E decreases as the probability p(E) of the event increases

The relation used is the logarithm of the inverse probability

ie $I(E) = \log\frac{1}{p(E)}$ which is equivalent to $-\log(p(E))$

Why is this functional form used ? [See reasons here](https://stats.stackexchange.com/questions/87182/what-is-the-role-of-the-logarithm-in-shannons-entropy#:~:text=We%20can%20call%20log(1,events%20we%20can%20tell%20apart).)


(In short, two reasons:

1) A mathematical reason: the logarithm makes information additive over independent events, so it grows linearly with system size and "behaves like information". For example, we would expect the entropy of tossing a coin n times to be n times the entropy of tossing the coin once. Having a logarithm in the definition makes this mathematically possible (check link above)

2) An intuitive reason: if all events happen with the same probability p (such as a fair coin or a fair die, i.e. a uniform distribution), the number of events is 1/p. To label 1/p events in binary notation, we need $\log_{2}(1/p)$ bits. This extends to non-uniform distributions for theoretical reasons (see link above))


Which base of the logarithm should we use in the definition above?

It depends on the unit we want: base 2 gives information in bits, base e gives nats
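To make the role of the base concrete, here is a minimal sketch that computes the information of the same event in bits and in nats, and checks that the two differ only by the constant factor ln(2). The function name `information` and its `base` parameter are assumptions made for this example.

```python
from math import e, log

def information(p, base=2):
    """Shannon information -log_base(p) of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -log(p) / log(base)

p_event = 0.25
bits = information(p_event, base=2)  # unit: bits
nats = information(p_event, base=e)  # unit: nats

# Changing the base only rescales the value: nats = bits * ln(2)
print("{} bits = {} nats".format(bits, nats))
```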


## Code

Given a fair coin, with the probability of each outcome = 1/2, what is the information of an outcome?

In [9]:
from math import log2
information = log2(1/(1/2))
print("information of event E in a fair coin toss is {} bits".format(information))

information of event E in a fair coin toss is 1.0 bits


Given a fair die, with the probability of each outcome = 1/6, what is the information of an outcome?

In [11]:
from math import log2
information = log2(1/(1/6))
print("information of event E in a fair die roll is {} bits".format(information))

information of event E in a fair die roll is 2.584962500721156 bits


Already, we see that since p(E) for a die is 1/6 (lower than p(E) = 1/2 for a coin),
the information of an outcome of a die roll is higher than that of a coin toss

## What is entropy

Entropy in information theory is analogous to the concept of entropy in thermodynamic systems

Specifically, the entropy of a random variable X is the [average level of "information", "surprise", or "uncertainty" inherent in the variable's outcomes](https://en.wikipedia.org/wiki/Entropy_(information_theory))

Note that entropy is defined for a random variable, whereas information is defined for an event

Entropy is nothing but the expectation of information

ie $H(X) = E[I(X)] = \sum_{i}p(E_i)\log\frac{1}{p(E_i)}$

This is the equation of Shannon entropy, a sum over the events/states of the RV. It is analogous to the equation of Gibbs entropy in thermodynamics, which is a sum over the states of the thermodynamic system

Note that by the definition above, entropy is always non-negative. The lowest value it can take is 0, which happens when the outcome is certain. For example, a RV X for a coin toss where p(X=H) = 1, p(X=T) = 0 (using the convention 0 log 0 = 0)

## Code

Can you compare entropy of tossing a fair coin vs a biased coin where p(H) = 0.7 ?

In [8]:
import numpy as np
def entropy(prob_vec):
    prob_vec = np.array(prob_vec)
    prob_vec = prob_vec[prob_vec>0]
    return np.sum(-prob_vec*np.log2(prob_vec))
    

In [21]:
print("Entropy of a fair coin is {} bits".format(entropy([0.5, 0.5])))

Entropy of a fair coin is 1.0 bits


In [22]:
print("Entropy of a biased coin with p(H) = 0.7  is {} bits".format(entropy([0.7, 0.3])))

Entropy of a biased coin with p(H) = 0.7  is 0.8812908992306927 bits


Note that the entropy of the fair coin is larger. In fact, among all distributions over the same finite set of outcomes, the uniform distribution has the highest entropy, because it carries the most uncertainty

To see this for a coin, assume a Bernoulli distribution X ~ Bern(p)
where p(X=1) = p, and p(X=0) = 1-p

Therefore, $H(X) = -p\log p - (1-p)\log(1-p)$

Viewed as a function of p, we can write this as $H(p) = -p\log p - (1-p)\log(1-p)$

This function H(p) is called the binary entropy function

If we plot it, we see a concave function whose maximum occurs at p = 0.5



![equation2](bernoulli_plot.png "Figure 2.1 Elements of Information theory") 

Image credit - Figure 2.1 Elements of Information theory
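The shape of the binary entropy function can also be checked numerically. The sketch below evaluates H(p) in bits on a coarse grid and locates the maximum; `binary_entropy` is an illustrative helper name, and the grid size is an arbitrary choice.

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), taking 0*log(0) as 0."""
    p = np.atleast_1d(np.asarray(p, dtype=float))
    h = np.zeros_like(p)
    inside = (p > 0) & (p < 1)  # the endpoints p=0 and p=1 have zero entropy
    pi = p[inside]
    h[inside] = -pi * np.log2(pi) - (1 - pi) * np.log2(1 - pi)
    return h

grid = np.linspace(0.0, 1.0, 11)
values = binary_entropy(grid)
best_p = grid[np.argmax(values)]
print("H(p) is largest at p = {} with value {} bits".format(best_p, values.max()))
```

The values are symmetric around p = 0.5 and fall to 0 at both ends, matching the figure.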

## Proof that H(p) is a concave function of p

On the same domain D, assume there are two RVs: X with PMF p, and Y with PMF q

Define another RV B, independent of X and Y, which can take two values, 0 and 1, with the following PMF

$P(B=0) = \lambda$, $P(B=1) = 1 - \lambda$

Define a new RV Z such that Z=X if B=0, Z=Y if B=1

Therefore, the distribution of Z is the mixture $\lambda p + (1-\lambda) q$

We know that conditioning never increases entropy (Property 6 of conditional entropy below)

Therefore, $H(Z) \geq H(Z|B)$

$H(Z|B) = P(B=0) H(Z|B=0) + P(B=1) H(Z|B=1)$ by definition

$= \lambda H(X) + (1-\lambda) H(Y) = \lambda H(p) + (1-\lambda) H(q)$

$H(Z) = H(\lambda p + (1-\lambda) q)$

Since $H(Z) \geq H(Z|B)$,

$H(\lambda p + (1-\lambda) q) \geq \lambda H(p) + (1-\lambda) H(q)$

This is exactly the definition of concavity, so entropy is a concave function of the distribution. The binary entropy function is the special case where p and q are Bernoulli distributions
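The concavity inequality can be spot-checked numerically. This is only a sanity check on two hand-picked PMFs `p` and `q` (chosen arbitrarily for this example), not a substitute for the proof.

```python
import numpy as np

def entropy_bits(pmf):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-np.sum(pmf * np.log2(pmf)))

p = np.array([0.7, 0.2, 0.1])  # arbitrary example PMF
q = np.array([0.1, 0.3, 0.6])  # arbitrary example PMF on the same domain

gaps = []
for lam in np.linspace(0.0, 1.0, 11):
    mixture = lam * p + (1 - lam) * q
    # Entropy of the mixture minus the mixture of the entropies
    gap = entropy_bits(mixture) - (lam * entropy_bits(p) + (1 - lam) * entropy_bits(q))
    gaps.append(gap)

print("smallest gap over the grid: {}".format(min(gaps)))
```

Every gap is non-negative (up to floating point error), and it is zero at the endpoints where the mixture coincides with p or q.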

## Properties of entropy

$0 \leq H(X) \leq \log(L)$ where L is the number of possible values X can take

H(X) = 0 if and only if P(X=x) = 1 for some x (completely predictable)

H(X) = log(L) if and only if P(X=x) = 1/L for all x (uniform distribution)

Proof of $H(X) \geq 0$

$H(X) = -\sum_{x}P(X=x)\log(P(X=x))$

Each term of the sum is non-negative:

when P(X=x) = 1, $-P(X=x)\log(P(X=x)) = 0$ (and likewise when P(X=x) = 0, by the convention 0 log 0 = 0)

when 0 < P(X=x) < 1, $\log(P(X=x)) < 0$ => $-P(X=x)\log(P(X=x)) > 0$

Therefore, $H(X) \geq 0$, and H(X) = 0 iff P(X=x) = 1 for some x

Proof of $H(X) \leq \log(L)$

Use the IT inequality $\log(r) \leq (r-1)\log(e)$ for r > 0, with equality iff r = 1

Take H(X) - log(L)

$= -\sum_{x}P(X=x)\log(P(X=x)) - \log(L)$

$= -\sum_{x}P(X=x)\log(P(X=x)) - \sum_{x}P(X=x)\log(L)$

$= \sum_{x}P(X=x)\log\left(\frac{1}{L \cdot P(X=x)}\right)$

$\leq \sum_{x}P(X=x)\left(\frac{1}{L \cdot P(X=x)} - 1\right)\log(e)$

$= \log(e)\left(\sum_{x}\frac{1}{L} - \sum_{x}P(X=x)\right)$

$= \log(e)(1-1) = 0$

Therefore, $H(X) \leq \log(L)$
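A small numerical experiment can illustrate both bounds. The sketch below draws random PMFs over L = 5 outcomes from a Dirichlet distribution (an arbitrary way to sample PMFs, chosen for this example) and compares their entropies against 0 and log2(L).

```python
import numpy as np

def entropy_bits(pmf):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-np.sum(pmf * np.log2(pmf)))

num_outcomes = 5
rng = np.random.default_rng(0)
# Each row is a random PMF over num_outcomes values
random_pmfs = rng.dirichlet(np.ones(num_outcomes), size=1000)
entropies = np.array([entropy_bits(row) for row in random_pmfs])

uniform_entropy = entropy_bits(np.full(num_outcomes, 1 / num_outcomes))
point_mass_entropy = entropy_bits([1, 0, 0, 0, 0])

print("min {:.4f}, max {:.4f}, log2(L) {:.4f}".format(
    entropies.min(), entropies.max(), np.log2(num_outcomes)))
```

The uniform PMF attains the upper bound and the point mass attains the lower bound, while every random PMF falls in between.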

## Extending to continuous RV

The classical definition of entropy according to Shannon, as given above, was postulated only for discrete RVs

Shannon's extension of the concept to continuous RVs is called differential entropy. It has some noticeably different properties from classical entropy, so it is not theoretically the best measure of uncertainty for a continuous RV: it can even be negative, and it is not invariant to a change of variables. [See here for more details](https://en.wikipedia.org/wiki/Differential_entropy)

First the definition

Given a continuous RV X with density f(x),

the differential entropy is $h(X) = E[-\ln(f(X))] = -\int_{X}f(x)\ln(f(x))dx$

An example where differential entropy is negative

Assume X ~ U(0, 0.5) =>

f(x) = 2 for 0 <= x <= 0.5

Therefore, $h(X) = -\int_{0}^{0.5}2\ln(2)dx$

$= -0.5 \cdot 2 \cdot \ln(2) = -\ln(2) \approx -0.693$ nats, which is negative
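The uniform example can be reproduced with a simple numerical integration. This sketch uses a midpoint Riemann sum, which is an assumption made for simplicity (any quadrature rule would do), and the helper name `uniform_differential_entropy` is hypothetical.

```python
import numpy as np

def uniform_differential_entropy(low, high, num_steps=100_000):
    """Approximate h(X) = -integral of f*ln(f) for X ~ U(low, high), in nats."""
    width = (high - low) / num_steps
    midpoints = low + (np.arange(num_steps) + 0.5) * width
    density = np.full_like(midpoints, 1.0 / (high - low))  # f(x) is constant
    return float(-np.sum(density * np.log(density)) * width)

h_narrow = uniform_differential_entropy(0.0, 0.5)  # density 2, expect -ln(2)
h_unit = uniform_differential_entropy(0.0, 1.0)    # density 1, expect 0
print("h(U(0,0.5)) = {:.4f} nats, h(U(0,1)) = {:.4f} nats".format(h_narrow, h_unit))
```

Shrinking the support below length 1 pushes the density above 1, which is what makes the differential entropy negative.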

## Conditional entropy

Conditional entropy of Y given X (written H(Y|X)) is the amount of information (think of it as bits) needed to describe Y when the value of another RV X is known

We know that $H(Y) = E[I(Y)] = \sum_{i}p(E_i)\log\frac{1}{p(E_i)}$ over all events $E_i$ in Y

which can be equivalently written as $\sum_{y}p(Y=y)\log\frac{1}{p(Y=y)}$
over all outcomes y of RV Y
in the discrete case

Assume that RV X takes a specific value x

Then, we can say as an extension that
$H(Y|X=x) = \sum_{y}p(Y=y|X=x)\log\frac{1}{p(Y=y|X=x)}$

By definition, conditional entropy H(Y|X) is the weighted sum of H(Y|X=x) over all values of X, with weights p(X=x). Thus, taking a weighted average over all possible values of X,
we can define

$H(Y|X) = \sum_{x}p(X=x)H(Y|X=x)$

$= \sum_{x}p(X=x)\sum_{y}p(Y=y|X=x)\log\frac{1}{p(Y=y|X=x)}$

$= \sum_{x}\sum_{y}p(X=x,Y=y)\log\frac{1}{p(Y=y|X=x)}$

$= \sum_{x,y}p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}$
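The last form of the formula translates almost directly into code. Below is a minimal sketch on a made-up 2x2 joint table where rows index X and columns index Y (this layout is an assumption of the example); `conditional_entropy` is an illustrative helper name.

```python
import numpy as np

def conditional_entropy(joint):
    """H(Y|X) in bits from a joint table with rows = values of X, columns = values of Y."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)      # marginal p(X=x), one entry per row
    p_x_grid = np.broadcast_to(p_x, joint.shape)
    mask = joint > 0                            # zero cells contribute nothing
    return float(np.sum(joint[mask] * np.log2(p_x_grid[mask] / joint[mask])))

# Made-up example: if X=0, Y is a fair coin; if X=1, Y is always 0
joint_xy = np.array([[0.25, 0.25],
                     [0.50, 0.00]])
h_y_given_x = conditional_entropy(joint_xy)
print("H(Y|X) = {} bits".format(h_y_given_x))
```

Here H(Y|X=0) = 1 bit and H(Y|X=1) = 0 bits, each weighted by p(X=x) = 0.5, so the result is 0.5 bits.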


## Properties of conditional entropy

### Property 1 - If X and Y are independent 

If X and Y are independent RV, intuitively, H(Y|X) = H(Y)

This can also be seen from the equation above. If X and Y are independent,
p(X=x, Y=y) = p(X=x)p(Y=y)

$H(Y|X) = \sum_{x,y}p(X=x)p(Y=y)\log\frac{p(X=x)}{p(X=x)p(Y=y)}$

$= \sum_{x,y}p(X=x)p(Y=y)\log\frac{1}{p(Y=y)}$

$= \sum_{y}p(Y=y)\log\frac{1}{p(Y=y)}$ = H(Y), since $\sum_{x}p(X=x) = 1$

Together with the chain rule (Property 3 below), this also implies that H(X,Y) = H(X) + H(Y) for independent X and Y

### Property 2 - when is conditional entropy 0

If H(Y|X)=0, it means Y is completely determined by X, i.e. Y is a function of X: for every x with p(X=x) > 0 there is a single y with p(X=x, Y=y) = p(X=x). This can also be seen from the equation above, since every log term is then log(1) = 0

Note, just like entropy for discrete variables, conditional entropy is always non-negative, with a minimum value of 0 when Y is completely determined by X

### Property 3 (Chain Rule)

H(Y|X) = H(Y,X) - H(X)

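This follows from the last form of the equation above by splitting the logarithm: $\sum_{x,y}p(X=x,Y=y)\log p(X=x) = -H(X)$ and $-\sum_{x,y}p(X=x,Y=y)\log p(X=x,Y=y) = H(X,Y)$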


### Property 4 - generalization of chain rule

In general, $H(X_1, X_2, \dots, X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1,X_2) + \dots + H(X_n|X_1,\dots,X_{n-1})$ (generalizing equation above)

Kind of analogous to the chain rule of probability for random variables, except that the entropy version is additive whereas the probability version is multiplicative


### Property 5 - Bayes rule of entropy

H(Y|X) = H(X|Y) + H(Y) - H(X)

Proof :
By the chain rule, H(Y|X) = H(X,Y) - H(X) and H(X|Y) = H(X,Y) - H(Y)
Therefore, H(Y|X) + H(X) = H(X|Y) + H(Y) = H(X,Y)
=> H(Y|X) = H(X|Y) + H(Y) - H(X)

Again, kind of analogous to Bayes theorem of probability, except that the entropy version is additive whereas the probability version is multiplicative
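Both the chain rule and the Bayes rule of entropy can be verified on any joint table. The sketch below uses a randomly generated 3x4 joint distribution (rows index X, columns index Y; both the table and the helper names are assumptions of this example) and computes the conditional entropies directly from their definition as weighted averages.

```python
import numpy as np

def entropy_bits(pmf):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-np.sum(pmf * np.log2(pmf)))

def entropy_of_columns_given_rows(joint):
    """Weighted average of the entropy of each row's conditional PMF."""
    row_marginal = joint.sum(axis=1)
    return float(sum(w * entropy_bits(row / w)
                     for row, w in zip(joint, row_marginal) if w > 0))

rng = np.random.default_rng(1)
joint = rng.random((3, 4))
joint /= joint.sum()  # normalize so the table sums to 1

h_x = entropy_bits(joint.sum(axis=1))
h_y = entropy_bits(joint.sum(axis=0))
h_xy = entropy_bits(joint)
h_y_given_x = entropy_of_columns_given_rows(joint)
h_x_given_y = entropy_of_columns_given_rows(joint.T)

print("chain rule gap: {:.2e}".format(h_y_given_x - (h_xy - h_x)))
print("Bayes rule gap: {:.2e}".format(h_y_given_x - (h_x_given_y + h_y - h_x)))
```

Both gaps should be zero up to floating point error.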

### Property 6 - H(Y|X) <= H(Y)

Proof : 

Define the mutual information I(X;Y) = H(Y) - H(Y|X) (covered in the section on mutual information below)

We prove that I(X;Y) >= 0

$H(Y) = -\sum_{y}p(Y=y)\log p(Y=y)$

$H(Y|X) = \sum_{x,y}p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}$

$I = -\sum_{y}p(Y=y)\log p(Y=y) - \sum_{x,y}p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}$

Since $p(Y=y) = \sum_{x}p(X=x,Y=y)$,

$\sum_{y}p(Y=y)\log p(Y=y) = \sum_{x,y}p(X=x,Y=y)\log p(Y=y)$

Therefore, $I = -\sum_{x,y}p(X=x,Y=y)\log p(Y=y) - \sum_{x,y}p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}$

$= -\sum_{x,y}\left(p(X=x,Y=y)\log p(Y=y) + p(X=x,Y=y)\log\frac{p(X=x)}{p(X=x,Y=y)}\right)$

$= \sum_{x,y}p(X=x,Y=y)\log\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}$

Multiplying and dividing each term by p(X=x)p(Y=y), we get

$I = \sum_{x,y}p(X=x)p(Y=y)\cdot\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}\log\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}$

By Jensen's inequality, for a convex function f, $E[f(Z)] \geq f(E[Z])$

f(z) = z log z is a convex function for z > 0

Therefore, $I = \sum_{x,y}p(X=x)p(Y=y)\,f\left(\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}\right)$

where f(z) = z log z, and the sum is an expectation of f taken over the product distribution p(X=x)p(Y=y)

$\geq f\left(\sum_{x,y}p(X=x)p(Y=y)\cdot\frac{p(X=x,Y=y)}{p(X=x)p(Y=y)}\right)$

$= f\left(\sum_{x,y}p(X=x,Y=y)\right) = f(1) = 0$

Hence proved

## Extending to continuous RV

h(Y|X) is called conditional differential entropy in the continuous case, defined as

$h(Y|X) = -\int_{X,Y}f(x,y)\ln(f(y|x))dxdy$

Just as in the discrete case, h(Y|X) = h(X,Y) - h(X)

However, if any of the entropies do not exist (are not finite), this equation does not hold

Also, h(Y|X) <= h(Y), with equality if and only if X and Y are independent

## Joint entropy

Joint entropy is a measure of the entropy associated with a set of RVs taken jointly

For example, H(X,Y) is used to represent the joint entropy of X and Y

Defined as $H(X,Y) = -\sum_{x,y}p(X=x, Y=y)\log(p(X=x, Y=y))$, written
in more convenient notation as $H(X,Y) = -\sum_{x \in X, y \in Y}p(x,y)\log(p(x,y))$

Can be extended to n variables $X_1, \dots, X_n$

as $H(X_1, \dots, X_n) = -\sum_{x_1 \in X_1, \dots, x_n \in X_n}p(x_1, \dots, x_n)\log(p(x_1, \dots, x_n))$

We've already seen this above in the chain rule for conditional entropy,

where H(Y|X) = H(X,Y) - H(X)

## Properties of joint entropy

### Property 1 

Just like other entropy measurements, joint entropy is non negative. 


### Property 2 -

The joint entropy of a set of RVs is greater than or equal to the maximum of the entropies of the individual RVs

$H(X_1, X_2, \dots, X_n) \geq \max(H(X_1), H(X_2), \dots, H(X_n))$

Start with two variables

H(X,Y) = H(X) + H(Y|X) - from definitions earlier

We know H(Y|X) >= 0

=> H(X,Y) >= H(X)

Similarly, H(X,Y) = H(Y) + H(X|Y) >= H(Y)

This implies that H(X,Y) >= max(H(X), H(Y))

The same argument extends to more than two RVs

### Property 3 - 

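Joint entropy is less than or equal to the sum of the individual entropies. Equality occurs if and only if all the
RVs are independent

$H(X_1, X_2, \dots, X_n) \leq \sum_{i}H(X_i)$

Starting with two RVs again: H(X,Y) = H(X) + H(Y|X) <= H(X) + H(Y), since H(Y|X) <= H(Y) by Property 6 of conditional entropy

(Also proven in the section below on properties of mutual information, using Jensen's inequality)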
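Properties 2 and 3 together sandwich the joint entropy between max(H(X), H(Y)) and H(X) + H(Y). The sketch below checks this on one dependent and one independent joint table; both tables are made up for illustration, and `joint_entropy_bounds` is a hypothetical helper.

```python
import numpy as np

def entropy_bits(pmf):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-np.sum(pmf * np.log2(pmf)))

def joint_entropy_bounds(joint):
    """Return (max(H(X), H(Y)), H(X,Y), H(X) + H(Y)) for a 2-D joint table."""
    joint = np.asarray(joint, dtype=float)
    h_rows = entropy_bits(joint.sum(axis=1))
    h_cols = entropy_bits(joint.sum(axis=0))
    return max(h_rows, h_cols), entropy_bits(joint), h_rows + h_cols

dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])                   # X and Y tend to agree
independent = np.outer([0.3, 0.7], [0.6, 0.4])       # p(x, y) = p(x) * p(y)

lower_dep, joint_dep, upper_dep = joint_entropy_bounds(dependent)
lower_ind, joint_ind, upper_ind = joint_entropy_bounds(independent)
print("dependent:   {:.4f} <= {:.4f} <= {:.4f}".format(lower_dep, joint_dep, upper_dep))
print("independent: {:.4f} <= {:.4f} <= {:.4f}".format(lower_ind, joint_ind, upper_ind))
```

For the independent table the joint entropy meets the upper bound, as Property 3 predicts.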



## Code

![equation3](simple_problem.png "Example 2.2.1 Elements of Information theory") 

Image credit - Example 2.2.1 Elements of Information theory

Given Joint distribution above, compute joint and conditional entropy

In [7]:
import numpy as np
a = np.zeros(shape = (4,4))
a[0,0] = 1/8
a[0,1] = 1/16
a[0,2] = 1/32
a[0,3] = 1/32

a[1,0] = 1/16
a[1,1] = 1/8
a[1,2] = 1/32
a[1,3] = 1/32

a[2,0] = 1/16
a[2,1] = 1/16
a[2,2] = 1/16
a[2,3] = 1/16

a[3,0] = 1/4
a[3,1] = 0
a[3,2] = 0
a[3,3] = 0
print(a)

[[0.125   0.0625  0.03125 0.03125]
 [0.0625  0.125   0.03125 0.03125]
 [0.0625  0.0625  0.0625  0.0625 ]
 [0.25    0.      0.      0.     ]]


In [40]:
hxy = (entropy(a.flatten().tolist()))
print("joint entropy H(X,Y) is {}".format(hxy))

joint entropy H(X,Y) is 3.375


In [41]:
px = a.sum(axis=0)  # columns index X, so summing over rows gives the marginal p(X)
hx = entropy(px)

print("entropy H(X) is {}".format(hx))

entropy H(X) is 1.75


In [42]:
py = a.sum(axis=1)  # rows index Y, so summing over columns gives the marginal p(Y)
hy = entropy(py)

print("entropy H(Y) is {}".format(hy))

entropy H(Y) is 2.0


H(X|Y) = $\sum_{Y=i}P(Y=i)H(X|Y=i)$

In [43]:
hxgiveny = 0
for i in range(len(py)):
    pxgivenyequalsi = a[i,]/py[i]
    hxgiveny = hxgiveny + py[i]*entropy(pxgivenyequalsi.tolist())
print("conditional entropy H(X|Y) is {}".format(hxgiveny))

conditional entropy H(X|Y) is 1.375


In [44]:
hygivenx = 0
for i in range(len(px)):
    pygivenxequalsi = a[:,i]/px[i]
    hygivenx = hygivenx + px[i]*entropy(pygivenxequalsi.tolist())
print("conditional entropy H(Y|X) is {}".format(hygivenx))

conditional entropy H(Y|X) is 1.625


In [45]:
print("As derived earlier, we note that H(X) + H(Y|X) {0} + {1} = {2} is same as H(X,Y) {3}".format(hx, hygivenx, hx + hygivenx,  hxy))

As derived earlier, we note that H(X) + H(Y|X) 1.75 + 1.625 = 3.375 is same as H(X,Y) 3.375


In [46]:
print("Similarly, we note that H(Y) + H(X|Y) {0} + {1} = {2} is same as H(X,Y) {3}".format(hy, hxgiveny, hy + hxgiveny, entropy(a.flatten().tolist()) ))

Similarly, we note that H(Y) + H(X|Y) 2.0 + 1.375 = 3.375 is same as H(X,Y) 3.375


In [48]:
print("Mutual Information I = H(X) + H(Y) - H(X,Y) = {0} + {1} - {2} = {3} bits".format(hx, hy, hxy, hx+hy-hxy ))

Mutual Information I = H(X) + H(Y) - H(X,Y) = 1.75 + 2.0 - 3.375 = 0.375 bits


## Relative entropy or KL divergence

Relative entropy, or KL (Kullback-Leibler) divergence, is a way to quantify how different two distributions are

If X and Xhat are two RVs, having PMFs p and q respectively,

D(p||q) is the divergence of the distribution with PMF q from the distribution with PMF p

The formula -

$D(p||q) = \sum_{x \in X}p(X=x)\log\frac{p(X=x)}{q(X=x)} = -\sum_{x \in X}p(X=x)\log\frac{q(X=x)}{p(X=x)}$

In words, it is the expectation of the log ratio of the PMFs p and q (equivalently, the difference of their logs), with the expectation taken over p,
where p and q have to be defined over the same probability space X

Problem : assume Xhat has a uniform distribution, with q(Xhat=x) = 1/L

and X has the real distribution p

Then, $D(p||q) = \sum_{x}p(X=x)\log(p(X=x)) + \sum_{x}p(X=x)\log(L)$ = -H(X) + log(L) = log(L) - H(X)

Note that relative entropy is finite only if q(X=x) = 0 implies p(X=x) = 0 for every x. If p(X=x) > 0 for some x where q(X=x) = 0, the corresponding term involves log(p/0) and the divergence is infinite. Terms with p(X=x) = 0 contribute 0, by the convention 0 log 0 = 0
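The uniform-reference problem above can be checked numerically. This is a sketch with a made-up PMF `p`; `kl_bits` is an illustrative helper that works in bits and returns infinity when p puts mass where q has none.

```python
import numpy as np

def entropy_bits(pmf):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-np.sum(pmf * np.log2(pmf)))

def kl_bits(p, q):
    """Relative entropy D(p||q) in bits for PMFs on the same domain."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    if np.any((q == 0) & (p > 0)):
        return float("inf")  # p has mass where q has none
    mask = p > 0             # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.5, 0.25, 0.125, 0.125])       # made-up "real" distribution
num_outcomes = len(p)
q_uniform = np.full(num_outcomes, 1 / num_outcomes)

divergence = kl_bits(p, q_uniform)
log_l_minus_entropy = np.log2(num_outcomes) - entropy_bits(p)
print("D(p||q) = {} bits, log(L) - H(X) = {} bits".format(divergence, log_l_minus_entropy))
```

Here H(X) = 1.75 bits and log2(4) = 2 bits, so both expressions give 0.25 bits.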

## Extension to continuous RV

$D(p||q) = \int_{-\infty}^{+\infty}p(x)\log\frac{p(x)}{q(x)}dx = -\int_{-\infty}^{+\infty}p(x)\log\frac{q(x)}{p(x)}dx$

## Code

Assume P(X) is a binomial distribution with N=2, p=0.4. Therefore
P(X=0) = 9/25, P(X=1) = 12/25, P(X=2) = 4/25

Let Q(Y) be a uniform distribution on the same domain
Q(Y=0) = 1/3, Q(Y=1) = 1/3, Q(Y=2) = 1/3

We want to evaluate D(P||Q). The code below uses the natural logarithm, so the results are in nats

In [23]:
import numpy as np
def relative_entropy(p, q):
    rel_env = None
    if len(p) != len(q):
        print("p and q have different lengths, returning")
    else:
        try:
            rel_env = np.sum([p[i]*np.log(p[i]/q[i]) for i in range(len(p))])
        except Exception as e:
            print(str(e))
    return rel_env

In [15]:
p = [9/25, 12/25, 4/25]
q = [1/3, 1/3, 1/3]

In [24]:
print(relative_entropy(p,q))

0.0852996013183706


In [25]:
print(relative_entropy(q,p))

0.09745500678538754


Note from above that D(p||q) != D(q||p), i.e. relative entropy is not symmetric
Also, relative entropy does not satisfy the triangle inequality, so it is not a true [metric](https://en.wikipedia.org/wiki/Metric_(mathematics))

## Interpretations of relative entropy

1) In ML, such as in decision trees, D(p||q) is called [information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees): the information gained if p is used instead of q  
2) In a Bayesian interpretation, it is the amount of information lost when Q (usually the prediction) is used to approximate P (the true distribution)

## Conditional relative entropy

Given two joint mass functions p(x,y) and q(x,y), the conditional relative entropy D(p(y|x) || q(y|x)) is defined as the relative entropy between the conditional mass functions p(y|x) and q(y|x), averaged over p(x)

As an equation,

$D(p(y|x) || q(y|x)) = \sum_{x}p(x)\sum_{y}p(y|x)\log\frac{p(y|x)}{q(y|x)}$
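A minimal sketch of this definition follows, using two made-up 2x2 joint tables with rows indexing x and columns indexing y. The helper name `conditional_relative_entropy` is hypothetical, and the code assumes every row of both tables has positive total mass and that q is positive wherever p is.

```python
import numpy as np

def conditional_relative_entropy(p_joint, q_joint):
    """D(p(y|x) || q(y|x)) in bits; rows index x, columns index y."""
    p_joint = np.asarray(p_joint, dtype=float)
    q_joint = np.asarray(q_joint, dtype=float)
    # Conditional PMFs: divide each row by its marginal (assumed non-zero)
    p_cond = p_joint / p_joint.sum(axis=1, keepdims=True)
    q_cond = q_joint / q_joint.sum(axis=1, keepdims=True)
    mask = p_joint > 0
    # p(x) * p(y|x) = p(x, y), so the double sum is weighted by the joint p
    return float(np.sum(p_joint[mask] * np.log2(p_cond[mask] / q_cond[mask])))

p_joint = np.array([[0.4, 0.1],
                    [0.2, 0.3]])
q_joint = np.array([[0.25, 0.25],
                    [0.25, 0.25]])

cond_divergence = conditional_relative_entropy(p_joint, q_joint)
print("D(p(y|x) || q(y|x)) = {:.4f} bits".format(cond_divergence))
```

Comparing a table with itself gives 0, since every conditional ratio is then 1.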

## Properties of relative entropy/information gain

1) D(p||q) is always >= 0

Proof 1 :

$D(p||q) = -\sum_{x \in X}p(X=x)\log\frac{q(X=x)}{p(X=x)}$ =>

$-D(p||q) = \sum_{x \in X}p(X=x)\log\frac{q(X=x)}{p(X=x)}$

log is a concave function, so by Jensen's inequality E(log Z) <= log(E(Z))

Therefore, $\sum_{x \in X}p(X=x)\log\frac{q(X=x)}{p(X=x)} \leq \log\left(\sum_{x \in X}p(X=x)\frac{q(X=x)}{p(X=x)}\right) = \log\left(\sum_{x \in X}q(X=x)\right) \leq \log(1) = 0$

-D(p||q) <= 0 => D(p||q) >= 0

Proof 2 : Using the IT inequality

$-D(p||q) = \sum_{x \in X}p(X=x)\log\frac{q(X=x)}{p(X=x)}$

By the IT inequality, log(r) <= (r-1)*log(e)

Therefore,

$\sum_{x \in X}p(X=x)\log\frac{q(X=x)}{p(X=x)}$

$\leq \sum_{x \in X}p(X=x)\left(\frac{q(X=x)}{p(X=x)} - 1\right)\log(e)$

$= \log(e)\left(\sum_{x \in X}q(X=x) - \sum_{x \in X}p(X=x)\right) \leq \log(e)(1 - 1) = 0$

Hence proved

2) Relative entropy is additive for independent distributions

ie if $P(x,y) = P_1(x)P_2(y)$

and $Q(x,y) = Q_1(x)Q_2(y)$, then

$D(P||Q) = D(P_1||Q_1) + D(P_2||Q_2)$

3) D(p||q) is convex in the pair (p,q)

ie if (p1,q1) and (p2,q2) are two pairs of probability mass functions,

$D(\lambda p_1 + (1-\lambda)p_2 \,||\, \lambda q_1 + (1-\lambda)q_2) \leq \lambda D(p_1||q_1) + (1-\lambda)D(p_2||q_2)$

for all $0 \leq \lambda \leq 1$

Proof :

LHS :

$D(\lambda p_1 + (1-\lambda)p_2 \,||\, \lambda q_1 + (1-\lambda)q_2)$

$= \sum_{x}(\lambda p_1(x) + (1-\lambda)p_2(x))\log\frac{\lambda p_1(x) + (1-\lambda)p_2(x)}{\lambda q_1(x) + (1-\lambda)q_2(x)}$

by definition of relative entropy

The log sum inequality says that

for non-negative numbers $a_1, \dots, a_n$ and $b_1, \dots, b_n$,

$\sum_{i=1}^{n}a_i\log\frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n}a_i\right)\log\frac{\sum_{i=1}^{n}a_i}{\sum_{i=1}^{n}b_i}$

For each x, the summand of the LHS has the form of the right-hand side of the log sum inequality with n = 2,

where $a_1 = \lambda p_1(x)$, $a_2 = (1-\lambda)p_2(x)$,
$b_1 = \lambda q_1(x)$, $b_2 = (1-\lambda)q_2(x)$

Therefore, LHS <=

$\sum_{x}\left(\lambda p_1(x)\log\frac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda)p_2(x)\log\frac{(1-\lambda)p_2(x)}{(1-\lambda)q_2(x)}\right)$

$= \lambda D(p_1||q_1) + (1-\lambda)D(p_2||q_2)$

Hence proved
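The additivity and convexity properties above can be spot-checked numerically. The sketch below builds product distributions with `np.outer` for additivity, and draws random strictly positive PMFs from a Dirichlet distribution for convexity; all the example distributions and the helper name `kl_bits` are assumptions of this example.

```python
import numpy as np

def kl_bits(p, q):
    """Relative entropy D(p||q) in bits; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Additivity: P(x,y) = P1(x)P2(y) and Q(x,y) = Q1(x)Q2(y)
p1, p2 = np.array([0.2, 0.8]), np.array([0.5, 0.3, 0.2])
q1, q2 = np.array([0.5, 0.5]), np.array([0.2, 0.2, 0.6])
joint_divergence = kl_bits(np.outer(p1, p2), np.outer(q1, q2))
sum_of_divergences = kl_bits(p1, q1) + kl_bits(p2, q2)

# Convexity in the pair (p, q), checked on a grid of lambda values
rng = np.random.default_rng(3)
a1, b1, a2, b2 = rng.dirichlet(np.ones(4), size=4)  # four random PMFs
convexity_gaps = []
for lam in np.linspace(0.0, 1.0, 11):
    lhs = kl_bits(lam * a1 + (1 - lam) * a2, lam * b1 + (1 - lam) * b2)
    rhs = lam * kl_bits(a1, b1) + (1 - lam) * kl_bits(a2, b2)
    convexity_gaps.append(rhs - lhs)

print("additivity gap: {:.2e}".format(joint_divergence - sum_of_divergences))
print("smallest convexity gap: {:.2e}".format(min(convexity_gaps)))
```

The additivity gap should vanish up to rounding, and no convexity gap should be negative.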
 


## Mutual information

Mutual information of two RVs is a measure of the mutual dependence between them. It quantifies the amount of information (in units such as bits or nats) obtained about one RV by observing the other RV

It is defined as I(X;Y) =  D(P(X,Y) || P(X)P(Y))

(the relative entropy between the joint probability and product of marginal probabilities)

I(X;Y) =  D(P(X,Y) || P(X)P(Y)) = $\sum_{X,Y}P(X,Y)\log\frac{P(X,Y)}{P(X)P(Y)}$ in the discrete case

In the continuous case,

I(X;Y) = $\int_{X,Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}dxdy$

Intuitively, it measures the information that X and Y share

I(X;Y) = 0 if and only if X and Y are independent

On the other hand, if X and Y completely determine each other (each is a deterministic function of the other),

I(X;Y) = H(X) = H(Y)

## Relation between mutual information and entropy

Taking the discrete case, I(X;Y) =  D(P(X,Y) || P(X)P(Y)) = $\sum_{X,Y}P(X,Y)\log\frac{P(X,Y)}{P(X)P(Y)}$

$= \sum_{X,Y}P(X,Y)\log\frac{P(X,Y)}{P(X)} - \sum_{X,Y}P(X,Y)\log(P(Y))$ = -H(Y|X) + H(Y) = H(Y) - H(Y|X)

By symmetry, splitting off P(X) instead of P(Y),

$= \sum_{X,Y}P(X,Y)\log\frac{P(X,Y)}{P(Y)} - \sum_{X,Y}P(X,Y)\log(P(X))$ = H(X) - H(X|Y)

We also know that H(Y|X) = H(X,Y) - H(X)

Therefore, I(X;Y) = H(X) + H(Y) - H(X,Y)

To summarize, the three equivalent forms are

I(X;Y) = H(X) - H(X|Y)

I(X;Y) = H(Y) - H(Y|X)

I(X;Y) = H(X) + H(Y) - H(X,Y)

In case X and Y are independent, I(X;Y) = 0

See diagram below for more clarity

![equation1](mi_entropy_relation.png "Figure 2.2 Elements of Information theory") 

Image credit - Figure 2.2 Elements of Information theory
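The equivalence between the relative-entropy definition and the entropy form can be checked on the joint distribution used in the code section earlier (rows index Y, columns index X). The variable names below are chosen for this sketch only.

```python
import numpy as np

def entropy_bits(pmf):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-np.sum(pmf * np.log2(pmf)))

joint = np.array([[1/8,  1/16, 1/32, 1/32],
                  [1/16, 1/8,  1/32, 1/32],
                  [1/16, 1/16, 1/16, 1/16],
                  [1/4,  0,    0,    0]])
p_x = joint.sum(axis=0)  # marginal over columns
p_y = joint.sum(axis=1)  # marginal over rows

# Definition: relative entropy between the joint and the product of marginals
product = np.outer(p_y, p_x)  # product[i, j] = p_y[i] * p_x[j], same layout as joint
mask = joint > 0
mi_as_relative_entropy = float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

# Entropy form: I(X;Y) = H(X) + H(Y) - H(X,Y)
mi_from_entropies = entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(joint)

print("I(X;Y) = {} bits (definition), {} bits (entropies)".format(
    mi_as_relative_entropy, mi_from_entropies))
```

Both routes give 0.375 bits, the value obtained in the earlier code section.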

## Conditional Mutual Information


$I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = H(Y|Z) - H(Y|X,Z)$

This is defined as $E_{p(x,y,z)}\left[\log\frac{p(X,Y|Z)}{p(X|Z)p(Y|Z)}\right]$

## Properties of Mutual Information

1. I(X;Y) >= 0. Since I(X;Y) is a relative entropy, and we have already proven that relative entropy is always >= 0, I(X;Y) >= 0

This also implies that H(X) + H(Y) >= H(X,Y)

2. I(X;X) = H(X) + H(X) - H(X,X) = H(X)

So the mutual information of a RV with itself is just the entropy of the RV

3. Chain rule for mutual information

$I(X_1, X_2, \dots, X_n; Y) = \sum_{i=1}^{n}I(X_i; Y | X_1, \dots, X_{i-1})$

Proof :

$I(X_1, X_2, \dots, X_n; Y) = H(X_1, X_2, \dots, X_n) - H(X_1, \dots, X_n | Y)$

$= \sum_{i=1}^{n}H(X_i | X_1, \dots, X_{i-1}) - \sum_{i=1}^{n}H(X_i | X_1, \dots, X_{i-1}, Y)$

$= \sum_{i=1}^{n}I(X_i; Y | X_1, \dots, X_{i-1})$

4. Mutual information I(X;Y) is a concave function of p(x) for fixed p(y|x), and a convex function of p(y|x) for a fixed p(x)

Proof of the first part (using explicit notation to make it less confusing)

a) $p_{Y}(Y=y) = \sum_{x}p_{X}(X=x)p_{Y|X}(Y=y|X=x)$

If $p_{Y|X}(Y=y|X=x)$ is fixed for every x, it means $p_{Y}(Y=y)$ is a linear function of $p_{X}(X=x)$

We know that I(X;Y) = H(Y) - H(Y|X) = H(Y) - $\sum_{x}P(X=x)H(Y|X=x)$

H(Y) is a concave function of p(Y=y), and since p(Y=y) is a linear function of p(X=x), H(Y) is a concave function of p(X)

The second term is a linear function of p(X) (each H(Y|X=x) is fixed once p(y|x) is fixed), and subtracting a linear function preserves concavity

Therefore I(X;Y) is a concave function of p(x)
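The chain rule for mutual information (property 3 above) can be illustrated for n = 2 by writing every term as a combination of joint entropies. The sketch below uses a random joint table over (X1, X2, Y); the table, the axis order, and the helper `marginal_entropy` are all assumptions of this example.

```python
import numpy as np

def entropy_bits(pmf):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-np.sum(pmf * np.log2(pmf)))

def marginal_entropy(joint, keep):
    """Entropy of the marginal over the axes listed in `keep`."""
    drop = tuple(ax for ax in range(joint.ndim) if ax not in keep)
    marginal = joint.sum(axis=drop) if drop else joint
    return entropy_bits(marginal)

rng = np.random.default_rng(2)
joint = rng.random((2, 3, 2))  # axes: 0 -> X1, 1 -> X2, 2 -> Y
joint /= joint.sum()

h = lambda *axes: marginal_entropy(joint, axes)

# I(X1,X2;Y) = H(X1,X2) + H(Y) - H(X1,X2,Y)
mi_total = h(0, 1) + h(2) - h(0, 1, 2)
# I(X1;Y) = H(X1) + H(Y) - H(X1,Y)
mi_first = h(0) + h(2) - h(0, 2)
# I(X2;Y|X1) = H(X2|X1) - H(X2|X1,Y)
mi_second_given_first = (h(0, 1) - h(0)) - (h(0, 1, 2) - h(0, 2))

print("I(X1,X2;Y) = {:.6f}".format(mi_total))
print("I(X1;Y) + I(X2;Y|X1) = {:.6f}".format(mi_first + mi_second_given_first))
```

The two printed values should agree, and each individual term should be non-negative.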

Problem on mutual information

Assume you have a fair die and a fair coin
You roll the die first. If the die shows an outcome in (1,2,3,4), you toss the coin once

If the die shows an outcome in (5,6), you toss the coin twice

Find the mutual information between the face of the die and the number of heads obtained

Let X be the RV describing the die, and Y the RV counting the number of heads

P(X in (1,2,3,4)) = 2/3
P(X in (5,6)) = 1/3

P(Y=0) = P(Y=0|X in (1,2,3,4))*P(X in (1,2,3,4)) + 
P(Y=0|X in (5,6))*P(X in (5,6))

= (1/2)*(2/3)  + (1/4)*(1/3) = (5/12)


P(Y=1) = P(Y=1|X in (1,2,3,4))*P(X in (1,2,3,4)) + 
P(Y=1|X in (5,6))*P(X in (5,6))

= (1/2)*(2/3)  + (1/2)*(1/3) = (1/2)


P(Y=2) = P(Y=2|X in (1,2,3,4))*P(X in (1,2,3,4)) + 
P(Y=2|X in (5,6))*P(X in (5,6))

= 0*(2/3)  + (1/4)*(1/3) = (1/12)

H(Y) = H(5/12, 1/2, 1/12) = 1.325 bits

H(Y|X) = P(X in (1,2,3,4))*H(Y|X in (1,2,3,4))

+ P(X in (5,6))*H(Y|X in (5,6))

= (2/3)*H(1/2, 1/2, 0) + (1/3)*H(1/4, 1/2, 1/4) = (2/3)*1 + (1/3)*1.5 = 1.167 bits

I(X;Y) = H(Y)-H(Y|X) = 1.325-1.167 = 0.158 bits
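The hand calculation can be reproduced in a few lines. This sketch builds the conditional table p(Y|X) for all six die faces (rows) and the three possible head counts 0, 1, 2 (columns); the variable names are chosen for this example.

```python
import numpy as np

def entropy_bits(pmf):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    pmf = np.asarray(pmf, dtype=float).ravel()
    pmf = pmf[pmf > 0]
    return float(-np.sum(pmf * np.log2(pmf)))

one_toss = [0.5, 0.5, 0.0]      # P(0 heads), P(1 head), P(2 heads) for one toss
two_tosses = [0.25, 0.5, 0.25]  # same for two tosses
# Faces 1-4 lead to one toss, faces 5-6 lead to two tosses
p_y_given_x = np.array([one_toss] * 4 + [two_tosses] * 2)
p_x = np.full(6, 1 / 6)         # fair die

joint = p_y_given_x * p_x[:, None]  # p(x, y) = p(x) * p(y|x)
p_y = joint.sum(axis=0)

h_y = entropy_bits(p_y)
h_y_given_x = float(sum(px * entropy_bits(row) for px, row in zip(p_x, p_y_given_x)))
mutual_information = h_y - h_y_given_x

print("P(Y) = {}".format(p_y))
print("H(Y) = {:.3f}, H(Y|X) = {:.3f}, I(X;Y) = {:.3f} bits".format(
    h_y, h_y_given_x, mutual_information))
```

Conditioning on the individual face gives the same H(Y|X) as conditioning on the two groups, because the distribution of Y depends on X only through the group.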

## References

1) https://machinelearningmastery.com/what-is-information-entropy/  
2) https://en.wikipedia.org/wiki/Entropy_(information_theory)
3) https://stats.stackexchange.com/questions/87182/what-is-the-role-of-the-logarithm-in-shannons-entropy#:~:text=We%20can%20call%20log(1,events%20we%20can%20tell%20apart).    
4) https://machinelearningmastery.com/information-gain-and-mutual-information/
    
5) Elements of Information Theory (Cover and Thomas) Chapter 2
6) https://www.youtube.com/watch?v=JxNirQxjvs4