# Entropy - information content

This small notebook introduces entropy and provides some examples of how to calculate it. The examples include calculating entropy from scratch, without libraries, and also using the entropy function from the SciPy package.

First, we import some basic packages:

In [31]:
import random
import math

The entropy of a dataset in Information Theory (not to be confused with entropy in physics) is a measure of randomness in the dataset. It was introduced by Claude Shannon in 1948, and the entropy of a dataset is defined as shown below.

![title](entropy.svg)

Or in plain English, the entropy of a dataset is the negative sum, over all possible values, of the probability of each value multiplied by the logarithm (of arbitrary base) of that probability.
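Written out, and assuming the image above shows the standard Shannon formula, the definition for a dataset with $n$ possible values that occur with probabilities $p_1, \dots, p_n$ (and a logarithm of base $b$) would read:

$$
H = -\sum_{i=1}^{n} p_i \log_b p_i
$$

As a small worked example, a fair coin with $p_1 = p_2 = \tfrac{1}{2}$ and $b = 2$ gives:

$$
H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = -\left(-\tfrac{1}{2} - \tfrac{1}{2}\right) = 1
$$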

Since the entropy is our (or rather Shannon's) definition, we can choose the base of the logarithm arbitrarily. However, by convention base 2 is often chosen, and the "unit of measurement" for entropy is then "bits".
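To illustrate that the choice of base only changes the unit, here is a minimal sketch. The helper `entropy_in_base` is a hypothetical name introduced for this illustration and is not part of the notebook; it computes the same sum with a caller-chosen base.

```python
import math

def entropy_in_base(probabilities, base):
    # Hypothetical helper (not part of the notebook): Shannon entropy
    # with a caller-chosen logarithm base. Zero probabilities are skipped.
    return -math.fsum(p * math.log(p, base) for p in probabilities if p > 0)

fair_coin = [1/2, 1/2]
bits = entropy_in_base(fair_coin, 2)       # base 2 -> bits
nats = entropy_in_base(fair_coin, math.e)  # base e -> nats
print(bits, nats)

# Changing the base only rescales the value: nats = bits * ln(2)
print(math.isclose(nats, bits * math.log(2)))
```

With base e the unit is usually called "nats" rather than "bits"; the ranking of two datasets by entropy does not depend on the base.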


To try something out with our new-found knowledge, we first need a dataset to work with. Let's set up some Python lists representing various data series.

Tossing a coin gives two possible outcomes, and if the coin is fair, each outcome has a probability of 1/2.

In [32]:
coin_toss = [1/2,1/2]

Rolling a six-sided die similarly gives 6 possible outcomes, each with a 1/6 chance.

In [33]:
die_roll = [1/6,1/6,1/6,1/6,1/6,1/6]

If the coin we are tossing is unfair, providing a 70% chance of e.g. heads and 30% of tails, the data series becomes:

In [34]:
loaded_coin_toss = [7/10, 3/10]

Similarly, for a loaded die, where the chance of rolling a 6 is doubled (from 1/6 to 1/3) and the remaining probability is shared equally among the other faces:

In [35]:
loaded_die_roll = [2/15,2/15,2/15,2/15,2/15,1/3]
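Since each of these lists is meant to be a probability distribution, a quick sanity check is that the values are non-negative and sum to 1. The sketch below uses a hypothetical helper, `is_distribution`, which is an assumption made for illustration and not something the notebook defines.

```python
import math

def is_distribution(probabilities):
    # Hypothetical helper: True if all values are non-negative and they
    # sum to 1 within floating-point tolerance.
    non_negative = all(p >= 0 for p in probabilities)
    return non_negative and math.isclose(math.fsum(probabilities), 1.0)

print(is_distribution([1/2, 1/2]))                           # fair coin
print(is_distribution([2/15, 2/15, 2/15, 2/15, 2/15, 1/3]))  # loaded die
print(is_distribution([0.5, 0.6]))                           # sums to 1.1
```

`math.isclose` is used instead of `==` because fractions such as 2/15 are not exactly representable as floating-point numbers.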

Finally, we generate a series of 100 random numbers and normalise them so that they sum to 1:

In [36]:
total = 0  # avoid shadowing the built-in sum()
random_data = []
# Create a series of 100 random integers between 1 and 10 (inclusive)
for n in range(100):
    random_data.append(random.randint(1, 10))
    total += random_data[n]
# Normalise the values so that they sum to 1
for m in range(100):
    random_data[m] /= total

Now that we have some data series to work with, let's define the entropy function.

In [37]:
def entropy(probabilities):
    # The input is a list of probabilities that sum to 1, and the output is
    # the entropy of that list, using base 2 for the logarithm (bits)
    entr = 0
    for p in probabilities:
        if p > 0:  # by convention, a zero probability contributes nothing
            entr += p * math.log(p, 2)
    return -entr
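One way to sanity-check an entropy function is to use the fact that a uniform distribution over n outcomes has entropy log2(n) bits. The sketch below repeats the formula under a different, hypothetical name (`entropy_bits`) so that it runs on its own.

```python
import math

def entropy_bits(probabilities):
    # Same formula as the entropy function above, repeated under a
    # hypothetical name so the snippet is self-contained.
    return -math.fsum(p * math.log(p, 2) for p in probabilities if p > 0)

for n in (2, 6, 100):
    uniform = [1 / n] * n
    # The two printed values should agree up to floating-point rounding
    print(n, entropy_bits(uniform), math.log2(n))
```

This also gives an upper bound: no distribution over n outcomes can have more than log2(n) bits of entropy.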


What is the entropy of the coin toss - how random is it?

In [38]:
print(entropy(coin_toss))

1.0


And tossing a loaded coin, how random is that?

In [39]:
print(entropy(loaded_coin_toss))

0.8812908992306927


It appears that tossing a loaded coin is less random than tossing a fair coin, which seems intuitively right.
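To see how the entropy of a coin changes with how loaded it is, the following sketch evaluates the two-outcome case for several probabilities of heads. The function name `binary_entropy` is an assumption introduced here for illustration.

```python
import math

def binary_entropy(p):
    # Entropy in bits of a coin that lands heads with probability p.
    # Hypothetical helper; p = 0 or p = 1 gives 0 by convention.
    if p in (0, 1):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.5, 0.6, 0.7, 0.9, 0.99, 1.0):
    print(p, round(binary_entropy(p), 4))
```

The value is largest for the fair coin (p = 0.5) and shrinks towards 0 as the outcome becomes more certain.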

What about the die vs the loaded die?

In [40]:
print(entropy(die_roll))
print(entropy(loaded_die_roll))

2.584962500721156
2.4662478973127313


We can interpret the above numbers as there being more randomness in the die roll than in the coin toss. That again makes sense intuitively, since a die roll has more possible outcomes than a coin toss. Similarly, it seems reasonable that the randomness of the loaded die should be less than that of the fair die.

Just as a test, what would the entropy of a very unfair die be, one that almost always rolls a "6"?

In [41]:
totally_unfair_die = [0.00001,0.00001,0.00001,0.00001,0.00001,0.99995]
print(entropy(totally_unfair_die))

0.000902614972367423


As we can see, the entropy tends towards zero for a die with an almost certain outcome.

Finally, what about the random dataset we generated? What can we expect its entropy to be?

In [42]:
print(entropy(random_data))

6.392518513490273


The entropy of this data series will vary from run to run, depending on which random values were generated. In any case, the entropy is higher than for the data series with fewer possible outcomes that we studied at the start.
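Note that the calculation above treats every entry of the normalised list as the probability of its own outcome. A different reading of "entropy of a dataset", shown here only as a sketch for comparison, estimates the probability of each distinct value from how often it occurs in the samples. The helper `empirical_entropy` is a hypothetical name and is not part of the notebook.

```python
import math
import random
from collections import Counter

def empirical_entropy(samples):
    # Hypothetical helper: estimate each value's probability by its relative
    # frequency in the samples, then apply the entropy formula (base 2).
    counts = Counter(samples)
    n = len(samples)
    return -math.fsum((c / n) * math.log2(c / n) for c in counts.values())

rolls = [random.randint(1, 10) for _ in range(100)]
# With only 10 distinct values, the result cannot exceed log2(10) bits
print(empirical_entropy(rolls), math.log2(10))
```

The two approaches answer different questions, so their results are not directly comparable.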

Finally, let's create a much longer random data series:

In [50]:
total = 0  # avoid shadowing the built-in sum()
very_long_random_data = []
# Create a series of 10000 random integers between 1 and 10 (inclusive)
for n in range(10000):
    very_long_random_data.append(random.randint(1, 10))
    total += very_long_random_data[n]
# Normalise the values so that they sum to 1
for m in range(10000):
    very_long_random_data[m] /= total

In [51]:
print(entropy(very_long_random_data))

9.055548917741078


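Of course, we do not have to write this ourselves: the SciPy package provides a ready-made entropy function. It uses the natural logarithm by default, so we pass base=2 to get the result in bits. Note that importing it replaces our own entropy function of the same name.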

In [54]:
from scipy.stats import entropy
print(entropy(die_roll, base=2))
print(entropy(loaded_die_roll, base=2))

2.584962500721156
2.4662478973127313


To sum up, the entropy of a data series is a measure of the randomness in that data series. Comparing entropy across data series of different sizes (like die_roll and coin_toss) gives only a qualitative measure of how random the data series are in relation to each other. For data series of the same size, the entropy measure provides more insight: by looking at the entropy values, we can say with certainty which of the two is more random.
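One common way to put data series of different sizes on the same scale, offered here as a sketch rather than as part of the notebook's method, is to divide the entropy by its maximum possible value log2(n). The helper `normalised_entropy` is a hypothetical name introduced for this illustration.

```python
import math

def normalised_entropy(probabilities):
    # Hypothetical helper: entropy in bits divided by its maximum log2(n),
    # giving a value between 0 (certain outcome) and 1 (uniform).
    n = len(probabilities)
    if n < 2:
        return 0.0
    h = -math.fsum(p * math.log2(p) for p in probabilities if p > 0)
    return h / math.log2(n)

print(normalised_entropy([1/2, 1/2]))             # fair coin
print(normalised_entropy([7/10, 3/10]))           # loaded coin
print(normalised_entropy([1/6] * 6))              # fair die
print(normalised_entropy([2/15] * 5 + [1/3]))     # loaded die
```

A fair coin and a fair die both come out at (approximately) 1 on this scale, which makes it easier to see how far each loaded variant is from being uniform.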