### Entropy (Physics)
Entropy is often interpreted as the degree of disorder or randomness in a system.

How many ways can I get 4 heads ("order") if I toss a fair coin 4 times?
 HHHH --> only one way. Probability = 1/16

How many ways can I get 2 heads and 2 tails ("disorder", "messy") if I toss a fair coin 4 times?
 HHTT, HTHT, TTHH, THTH, HTTH, THHT --> 6 ways. Probability = 6/16

 There are more ways to have "disorder" than "order". The increase of entropy in the universe is statistical, i.e., disorder is more probable than order. This explains the arrow of time (past, present, future). Why do we grow old but not get younger? Why do we remember the past but not the future?
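The counts above can be checked by brute force. The following is a minimal sketch that enumerates all 16 sequences of four tosses and groups them by number of heads; the words "microstate" (one exact sequence) and "macrostate" (a head count) are borrowed from statistical physics and are not used in the text above.

```python
from itertools import product
from collections import Counter

n_tosses = 4
# Enumerate all 2**4 = 16 equally likely sequences of H and T (microstates).
sequences = ["".join(s) for s in product("HT", repeat=n_tosses)]

# Group the sequences by their number of heads (macrostate).
ways = Counter(seq.count("H") for seq in sequences)

for heads in range(n_tosses, -1, -1):
    print(f"{heads} heads: {ways[heads]} ways, probability {ways[heads]}/{len(sequences)}")
```

The balanced outcome (2 heads, 2 tails) is the most probable macrostate simply because more sequences lead to it.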

 Life struggles against entropy to maintain existence but increases the overall entropy of the universe by consuming and wasting energy.

 I was featured in [The Big Question](https://terp.umd.edu/bigq13/#.YhLhHu7MLMC) for the Terp Mag. My response to the prompt, “How Could Climate Change Affect Your Field?” was, “The energy cost of information needs more public awareness. The carbon footprint of machine learning training for an AI entity can be multiple times greater than the lifetime footprint of a car. The use of information should be managed as a resource with environmental costs, similar to the use of fossil fuels.”
 


<img src='https://image.slidesharecdn.com/time-pop-11-120918154634-phpapp02/95/the-origin-of-the-universe-and-the-arrow-of-time-20-728.jpg'>

Is the meaning of life to speed up the increase of entropy ("disorder") in the universe?

<img src='https://image.slidesharecdn.com/time-pop-11-120918154634-phpapp02/95/the-origin-of-the-universe-and-the-arrow-of-time-16-728.jpg?cb=1347984480'>

### Entropy (Information Theory)
The information content of an event E is defined as ${\displaystyle I(E)=-\log _{2}(p(E))}$ or ${\displaystyle I(E)=\log _{2}(1/p(E))}$ 
where $p(E)$ is the probability of the event. The more surprising an event is (low probability), the more info content.

Claude Shannon (1916 – 2001), mathematician, electrical engineer, cryptographer

<img src='https://i.ytimg.com/vi/z2Whj_nL-x8/hqdefault.jpg'>

Edwin Thompson Jaynes (July 5, 1922 – April 30, 1998) was the Wayman Crow Distinguished Professor of Physics at Washington University in St. Louis.

1. Jaynes, E.T. (1957). ["Information theory and statistical mechanics"](https://bayes.wustl.edu/etj/articles/theory.1.pdf) (PDF). Physical Review. 106 (4): 620–630.

2. Jaynes, E.T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.

![image](https://www.azquotes.com/picture-quotes/quote-it-was-our-use-of-probability-theory-as-logic-that-has-enabled-us-to-do-so-easily-what-edwin-thompson-jaynes-70-25-90.jpg)

John Wheeler (1911 – 2008), theoretical physicist

<img src='https://i.ytimg.com/vi/6qMi0K6hlYA/hqdefault.jpg'>

In [None]:
# Fair coin
from math import log2
p = 0.5
# info content of heads is one bit
I = -log2(p)
I

1.0

Let Heads = 1 and Tails = 0. Then each side can be represented by one bit.

` 1 ` 

` 0 `


In [None]:
# Fair die
p = 1/6
# info content of rolling a 5 is about 2.585 bits
I = -log2(p)
I

2.584962500721156

Each side of the die can be represented by 3 bits:

`0 0 1` --> 1  
`0 1 0` --> 2  
`0 1 1` --> 3  
`1 0 0` --> 4  
`1 0 1` --> 5  
`1 1 0` --> 6

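Shannon's formula says we can do better, in principle: on average, a roll carries only $\log_2 6 \approx 2.58$ bits. The 3-bit representation is wasteful because the codewords `000` and `111` are never used. A fractional number of bits per roll can only be approached by encoding several rolls together.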
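One way to see how a fractional bit count can be approached is to pack several rolls into a single fixed-length binary word. This is only an illustrative sketch: the function name `bits_per_roll` and the choice of block sizes are assumptions made for the example, not part of Shannon's formula.

```python
from math import log2

def bits_per_roll(n_rolls):
    """Bits per roll when n_rolls die rolls are packed into one fixed-length binary word."""
    outcomes = 6 ** n_rolls
    # Smallest word length that can label every outcome: ceil(log2(outcomes)),
    # computed with integer arithmetic to avoid floating-point rounding.
    word_length = (outcomes - 1).bit_length()
    return word_length / n_rolls

for n in (1, 3, 5, 17):
    print(n, round(bits_per_roll(n), 4))
print("limit:", round(log2(6), 4))
```

A single roll needs 3 bits, but three rolls (216 outcomes) fit in 8 bits, about 2.67 bits per roll. The cost per roll never drops below $\log_2 6$, and it does not decrease steadily with the block size, which is why the block sizes above were picked by hand.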

Information theory defines the entropy $H$ of a random variable as the average info content of the variable's possible outcomes. 
$H(X)=\operatorname {E} (I(X))=\operatorname {E} (-\log(P(X)))$, 
where $\operatorname {E}$ is the expected value or average of a variable.
For a discrete random variable:
${\displaystyle \mathrm {H} (X)=-\sum _{i=1}^{n}{\mathrm {P} (x_{i})\log _{2}\mathrm {P} (x_{i})}}$
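The sum above translates almost directly into code. Below is a small sketch for an arbitrary discrete distribution; the function name `shannon_entropy` is chosen here for illustration, and the term is written as $p \log_2(1/p)$, which is the same quantity as $-p \log_2 p$.

```python
from math import log2

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped, by the convention 0 * log(0) = 0.
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin
print(shannon_entropy([0.9, 0.1]))   # biased coin: less uncertain
print(shannon_entropy([1.0, 0.0]))   # certain outcome: no uncertainty
```

The fair coin gives the largest value of the three, and the entropy shrinks as the outcome becomes more predictable.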


In [None]:
# For an English alphabet in which all 26 letters are equally likely
p=1/26
H = -sum([p * log2(p) for _ in range(26)])
H

4.70043971814109

### Intuition: 
The quantity of information is the number of bits required to transmit an event drawn from a probability distribution. It can also be read as a measure of "surprise". An event with high probability carries little surprise and little information, e.g., the Sun rose in the east this morning.
An event with low probability surprises us and carries more information, e.g., Breaking News: The College decided to ban all homework. Entropy is the average surprise over all possible outcomes.

Entropy can also be read as the average number of yes/no questions needed to identify the outcome, which is the idea behind binary search.
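The link to binary search can be made concrete with a small guessing game. This is a sketch under the assumption of 8 equally likely outcomes; the helper `questions_needed` is hypothetical and simply counts the halving steps.

```python
from math import log2

def questions_needed(secret, candidates):
    """Count the yes/no questions used to find `secret` in a sorted list by halving."""
    low, high = 0, len(candidates) - 1
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1                  # ask: "is it at most candidates[mid]?"
        if secret <= candidates[mid]:
            high = mid
        else:
            low = mid + 1
    return questions

candidates = list(range(8))             # 8 equally likely outcomes
counts = [questions_needed(s, candidates) for s in candidates]
print(counts, log2(len(candidates)))
```

Every outcome is found in exactly 3 questions, which matches the entropy $\log_2 8 = 3$ bits of a uniform distribution over 8 outcomes.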

## Huffman Coding Algorithm

In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.

In doing so, Huffman outdid Fano, who had worked with information theory inventor Claude Shannon to develop a similar code. Building the tree from the bottom up guaranteed optimality, unlike the top-down approach of Shannon–Fano coding.

[Reference](https://en.wikipedia.org/wiki/Huffman_coding)


![image](https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Huffman_coding_example.svg/2560px-Huffman_coding_example.svg.png)

A source generates 4 different symbols ${\displaystyle \{a_{1},a_{2},a_{3},a_{4}\}}$ with probability ${\displaystyle \{0.4;0.35;0.2;0.05\}}$. A binary tree is generated from left to right taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two symbols. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:

| Symbol | Code |
|--------|------|
| a1 | 0 |
| a2 | 10 |
| a3 | 110 |
| a4 | 111 |

The standard way to represent a signal made of 4 symbols is by using 2 bits/symbol, but the entropy of the source is 1.74 bits/symbol. If this Huffman code is used to represent the signal, then the average length is lowered to 1.85 bits/symbol; it is still far from the theoretical limit because the probabilities of the symbols are different from negative powers of two.
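The merging procedure described above can be sketched with a priority queue. This is one possible implementation, not Huffman's original formulation: the function name `huffman_code`, the use of `heapq`, and the rule for assigning `0` and `1` at each merge are all choices made for this example. Huffman codes are not unique, so the bit patterns may differ from the table above while the code lengths stay the same.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code for a dict {symbol: probability}; returns {symbol: bitstring}."""
    tie_breaker = count()   # unique index so the heap never has to compare dicts
    heap = [(p, next(tie_breaker), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable nodes into one equivalent node.
        p_low, _, codes_low = heapq.heappop(heap)
        p_high, _, codes_high = heapq.heappop(heap)
        merged = {s: "1" + c for s, c in codes_low.items()}
        merged.update({s: "0" + c for s, c in codes_high.items()})
        heapq.heappush(heap, (p_low + p_high, next(tie_breaker), merged))
    return heap[0][2]

probabilities = {"a1": 0.4, "a2": 0.35, "a3": 0.2, "a4": 0.05}
code = huffman_code(probabilities)
average_length = sum(p * len(code[s]) for s, p in probabilities.items())
print(code)
print(average_length)
```

For the four-symbol source this yields codeword lengths 1, 2, 3, 3 and an average of about 1.85 bits/symbol, in agreement with the figures quoted above.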

## Example

P(A) = 1/2, P(B) = 1/4, P(C) = 1/4

Ask yes/no questions in order of decreasing probability: Is it A? If no, is it B?

Entropy H = -(1/2)*log2(1/2) - (1/4)*log2(1/4) - (1/4)*log2(1/4) = 1.5 bits
 
        Node
        / \
       A.  Node
           / \
          B.  C

A-->1 

B-->01 

C-->00

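Average length of the Huffman code: (1/2)*1 + (1/4)*2 + (1/4)*2 = 1.5 bits per symbol, which equals the entropy because every probability is a negative power of two.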

Another coding (not optimal), with A and B swapped:

        Node
        / \
       B.  Node
           / \
          A.  C

B-->1 

A-->01 

C-->00

Average length of this coding = (1/4)*1 + (1/2)*2 + (1/4)*2 = 1.75 bits per symbol, longer than the Huffman code.
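Both codings in this example are prefix-free: no codeword is the beginning of another, so a bit stream can be decoded without separators. The sketch below encodes and decodes a short message with the code A-->1, B-->01, C-->00; the helper names `encode` and `decode` and the sample message are made up for illustration.

```python
code = {"A": "1", "B": "01", "C": "00"}   # the Huffman code from the example

def encode(message, code):
    """Concatenate the codeword of each symbol."""
    return "".join(code[symbol] for symbol in message)

def decode(bits, code):
    """Read bits left to right, emitting a symbol whenever a codeword is completed."""
    lookup = {word: symbol for symbol, word in code.items()}
    message, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:
            message.append(lookup[buffer])
            buffer = ""
    return "".join(message)

message = "AABACABC"                      # A, B, C appear with frequencies 1/2, 1/4, 1/4
bits = encode(message, code)
print(bits, len(bits) / len(message))
print(decode(bits, code))
```

The 8 symbols are sent in 12 bits, i.e. 1.5 bits per symbol, and the decoder recovers the original message.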

Huffman's Paper: [A Method for the Construction of Minimum-Redundancy Codes](http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf)


### Using the library scipy.stats

In [1]:
# Fair die

from scipy.stats import entropy

p = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
# Entropy
H = entropy(p, base=2)
H

2.584962500721156

### Cross-Entropy 
The average number of bits needed to represent an event drawn from distribution p when the code is optimized for distribution q instead of p.

It is used as a loss function in machine learning, e.g., in logistic regression.

The cross-entropy of the distribution ${\displaystyle q}$ relative to a distribution ${\displaystyle p}$ over a given set is defined as follows:

${\displaystyle H(p,q)=-\operatorname {E} _{p}[\log q]}$

For discrete probability distributions:

${\displaystyle H(p,q)=-\sum _{x\in {\mathcal {X}}}p(x)\,\log q(x)}$

In [None]:
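from math import log2

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits: events follow p, the code is built for q."""
    return -sum(p_i * log2(q_i) for p_i, q_i in zip(p, q))

p = [1/2, 1/2]    # true distribution
q = [9/10, 1/10]  # assumed distribution

cross_entropy(p, q)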

1.736965594166206

## Relative entropy or Kullback-Leibler divergence
The average number of extra bits needed to represent an event drawn from distribution p when the code is optimized for distribution q instead of p.

In the context of machine learning, ${\displaystyle D_{\text{KL}}(P\parallel Q)}$ is often called the information gain achieved if ${\displaystyle P}$ is used instead of ${\displaystyle Q}$, which is currently used.

In Bayesian inference, ${\displaystyle D_{\text{KL}}(P\parallel Q)}$ is a measure of the information gained by revising one's beliefs from the prior probability distribution ${\displaystyle Q}$ to the posterior probability distribution ${\displaystyle P}$.

In [None]:
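from scipy.stats import entropy
# With qk given, entropy() returns the KL divergence D_KL(pk || qk), here in bits.
print(entropy([1/2, 1/2], qk=[9/10, 1/10], base=2))
entropy([3/4, 1/4], qk=[1/2, 1/2], base=2)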

0.7369655941662062


0.18872187554086714
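The first value above can be reproduced from the standard discrete definition $D_{\text{KL}}(p\parallel q)=\sum_x p(x)\log_2 \frac{p(x)}{q(x)}$, which is not spelled out in the text. The sketch below uses a hypothetical helper `kl_divergence` and also shows that the divergence is not symmetric in its two arguments.

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions on the same outcomes."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/2]
q = [9/10, 1/10]
print(kl_divergence(p, q))   # extra bits when data from p is coded as if it came from q
print(kl_divergence(q, p))   # swapping the roles gives a different value
```

Because swapping p and q changes the result, the KL divergence is not a distance in the usual sense.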

## Cross-Entropy = Entropy + KL-divergence


${\displaystyle H(p,q)=H(p)+D_{\mathrm {KL} }(p\|q)}$
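The identity can be checked numerically with the same p and q used in the cells above. This is a plain sanity check written from the definitions, with variable names chosen for the example.

```python
from math import log2

p = [1/2, 1/2]
q = [9/10, 1/10]

entropy_p = -sum(pi * log2(pi) for pi in p)
kl_pq = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
cross_entropy_pq = -sum(pi * log2(qi) for pi, qi in zip(p, q))

print(entropy_p, kl_pq, cross_entropy_pq)
# The two sides agree up to floating-point rounding.
print(abs(cross_entropy_pq - (entropy_p + kl_pq)) < 1e-12)
```

Here 1 bit of entropy plus about 0.737 bits of divergence gives the cross-entropy of about 1.737 bits computed earlier.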

## Mutual Information

For jointly discrete or jointly continuous pairs ${\displaystyle (X,Y)}$, mutual information is the Kullback–Leibler divergence of the joint distribution ${\displaystyle p_{(X,Y)}}$ from the product of the marginal distributions, ${\displaystyle p_{X}\cdot p_{Y}}$, that is,

${\displaystyle \operatorname {I} (X;Y)=D_{\text{KL}}\left(p_{(X,Y)}\parallel p_{X}p_{Y}\right)}$

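Independence implies zero correlation, but the converse is not true: uncorrelated variables can still be dependent.

Mutual information captures both linear and non-linear dependencies; it is zero if and only if $X$ and $Y$ are independent.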
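A small made-up example illustrates the difference between correlation and dependence. Assume X is uniform on {-1, 0, 1} and Y = X²: Y is completely determined by X, yet the two are uncorrelated. The joint table and variable names below are assumptions for this sketch.

```python
from math import log2

# Hypothetical example: X is uniform on {-1, 0, 1} and Y = X**2.
joint = {(-1, 1): 1/3, (0, 0): 1/3, (1, 1): 1/3}   # p(x, y)

# Marginal distributions p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
mutual_information = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())

# Covariance E[XY] - E[X]E[Y], the numerator of the correlation coefficient.
e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
covariance = e_xy - e_x * e_y

print(mutual_information, covariance)
```

The covariance is zero while the mutual information is about 0.918 bits, so the non-linear dependence is invisible to correlation but not to mutual information.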



## Algorithmic Entropy

The Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produces the object as output.

Consider the following two strings of 32 lowercase letters and digits:

abababababababababababababababab , and

4c1j5b2p0cv4w1x8rx2y39umgw5q85s7

The first string has a short English-language description, namely "write ab 16 times", which consists of 17 characters. The second one has no obvious simple description (using the same character set) other than writing down the string itself, i.e., "write 4c1j5b2p0cv4w1x8rx2y39umgw5q85s7" which has 38 characters. Hence the operation of writing the first string can be said to have "less complexity" than writing the second.

[Reference](https://en.wikipedia.org/wiki/Kolmogorov_complexity)

## Algorithmic Entropy as a Measure of Compression

Data compression (e.g., a Zip file), a scientific law (e.g., Newton's laws), a machine learning model, and an encoder/decoder (e.g., GANs) are all algorithms that capture the essence of the data and reduce redundancy.
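Kolmogorov complexity itself is not computable, but the size of a compressed file gives a rough upper bound on it. As a hedged illustration, the sketch below compresses the two 32-character strings from the previous section with Python's standard `zlib` module; the compressor and its level are arbitrary choices for this example.

```python
import zlib

regular = "ab" * 16                                # abab...ab, 32 characters
irregular = "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"     # 32 characters, no visible pattern

for text in (regular, irregular):
    compressed = zlib.compress(text.encode("ascii"), 9)
    print(len(text), "->", len(compressed), "bytes")
```

The repetitive string shrinks to a much smaller size than the irregular one. On inputs this short the byte counts include fixed format overhead, so only the comparison between the two is meaningful.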

In [2]:
# Exercise:
from math import log2
# (1) What is the entropy of the probability distribution for the above example with a1, a2, a3, a4?
prob = [0.4, 0.35, 0.2, 0.05]
H = -sum([p * log2(p) for p in prob])
print(H)

# (2) What is the average number of code bits per symbol for the Huffman coding of a1, a2, a3, a4?
print(0.4*(1)+0.35*(2)+0.2*(3)+0.05*(3))

# (3) Does the result of (2) match the result of (1)?
# No: 1.85 bits/symbol is larger than the entropy of about 1.74 bits/symbol.

# (4) Can we design an algorithm so that the result of (2) is smaller than the result of (1)?


1.7393538721672006
1.85

In [7]:
from scipy.stats import entropy
H= entropy(prob, base=2)
print(H)

1.739353872167201
