## 18.11. Information Theory

### 18.11.1. Information

Information theory was introduced by Claude Shannon in 1948.


*Information* can be encoded in anything with a particular sequence of one or more encoding formats. 

#### 18.11.1.1. Self-information

Self-information is a formula that connects the probability of an event to its information content, measured in bits.

Since information embodies the abstract possibility of an event, how do we map that possibility to a number of bits? Shannon introduced the term *bit* as the unit of information; the word itself was originally coined by John Tukey.

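A series of binary digits of length $n$ contains $n$ bits of information.

If each bit is 0 or 1 with equal probability, any particular sequence of $n$ bits is one of $2^n$ equally likely sequences, so the probability of observing it is $\frac{1}{2^n}$.

Can we generalize this to a function that maps any probability $p$ to a number of bits? Shannon answered by defining the *self-information* of an event $X$ that occurs with probability $p$:
$I(X) = - \log_2 (p)$

For the $n$-bit sequence above, $p = \frac{1}{2^n}$ gives $I(X) = -\log_2 \frac{1}{2^n} = n$ bits, as expected.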

```python
import torch
from torch import nn


def nansum(x):
    # Sum only the non-NaN entries of x.
    # Recent PyTorch versions also provide torch.nansum.
    return x[~torch.isnan(x)].sum()


def self_information(p):
    # Self-information, in bits, of an event with probability p.
    return -torch.log2(torch.tensor(p)).item()


self_information(1 / 64)  # 6.0
```

### 18.11.2. Entropy

#### 18.11.2.1. Motivating Entropy

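A reasonable measure of the information carried by a random variable should satisfy three requirements:

1. The information we gain by observing a random variable does not depend on what we call the elements, or on the presence of additional elements which have probability zero.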

2. The information we gain by observing two random variables is no more than the sum of the information we gain by observing them separately. If they are independent, then it is exactly the sum.

3. The information gained when observing (nearly) certain events is (nearly) zero.
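The three requirements above can be checked numerically for the Shannon entropy defined in the next subsection. The following is a minimal sketch in plain Python (standard library only, rather than PyTorch); the helper name `entropy_bits` and the example distributions are illustrative assumptions, not part of the original notebook.

```python
import math
from itertools import product


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# 1. Relabeling outcomes or adding zero-probability outcomes changes nothing.
base = [0.5, 0.25, 0.25]
relabeled_and_padded = [0.25, 0.0, 0.5, 0.25, 0.0]
h_base = entropy_bits(base)
h_padded = entropy_bits(relabeled_and_padded)

# 2. For independent X and Y the joint probabilities are products of the
#    marginals, and the joint entropy equals H(X) + H(Y).
px = [0.5, 0.5]
py = [0.1, 0.9]
joint_independent = [a * b for a, b in product(px, py)]
h_joint = entropy_bits(joint_independent)
h_sum = entropy_bits(px) + entropy_bits(py)

#    For dependent variables the joint entropy is smaller than the sum.
#    Here Y is an exact copy of a fair coin X, so only two outcomes occur.
joint_copy = [0.5, 0.0, 0.0, 0.5]
h_copy = entropy_bits(joint_copy)  # compare with H(X) + H(Y) = 2 bits

# 3. A nearly certain event carries almost no information.
h_nearly_certain = entropy_bits([0.999999, 0.000001])

print(h_base, h_padded)
print(h_joint, h_sum)
print(h_copy)
print(h_nearly_certain)
```

The copied-coin case shows why the second requirement is an inequality: observing `Y` after `X` reveals nothing new, so the pair carries one bit rather than two.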

#### 18.11.2.2. Definition

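The entropy of a random variable $X$ with distribution $P$ is defined as
$H(X) = E_{x \sim P} [I(x)] = - E_{x \sim P} [\log p(x)]$

That is, entropy is the expectation of the self-information, and the negative sign comes from $I(x) = -\log p(x)$.

Here $P$ is the probability density function (p.d.f.) of $X$ if $X$ is continuous, or its probability mass function (p.m.f.) if $X$ is discrete. With the base-2 logarithm used in the code in this section, entropy is measured in bits.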

If $X$ is discrete,
$H(X) = - \sum_i p_i \log p_i \text{, where } p_i = P(X_i)$

If $X$ is continuous, entropy is also referred to as *differential entropy*:
$H(X) = - \int_x p(x) \log p(x) \; dx$
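As a worked example of the integral above, consider the uniform density $p(x) = 1/a$ on $[0, a]$, for which $H(X) = \log a$. With $a = 1/2$ and base-2 logarithms this gives $-1$ bit, so differential entropy can be negative. The sketch below approximates the integral with a midpoint Riemann sum in plain Python; the function names and the choice of density are illustrative assumptions.

```python
import math


def differential_entropy_bits(pdf, lower, upper, steps=10_000):
    """Approximate the integral of -p(x) * log2 p(x) over [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width  # midpoint of the i-th sub-interval
        p = pdf(x)
        if p > 0:  # treat 0 * log(0) as 0
            total -= p * math.log2(p) * width
    return total


def uniform_pdf(a):
    """Density of the uniform distribution on [0, a]."""
    return lambda x: 1.0 / a if 0.0 <= x <= a else 0.0


h_narrow = differential_entropy_bits(uniform_pdf(0.5), 0.0, 0.5)
h_wide = differential_entropy_bits(uniform_pdf(4.0), 0.0, 4.0)
print(h_narrow)  # close to log2(0.5) = -1
print(h_wide)    # close to log2(4) = 2
```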

```python
def entropy(p):
    # Per-outcome terms -p * log2(p); entries with p = 0 evaluate to NaN.
    terms = -p * torch.log2(p)
    # `nansum` skips those NaN entries, i.e. it treats 0 * log(0) as 0.
    return nansum(terms)


entropy(torch.tensor([0.1, 0.5, 0.1, 0.3]))  # tensor(1.6855)
```

#### 18.11.2.3. Interpretations

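- Why $\log$: the logarithm turns products of probabilities into sums, so the information of independent events adds up, as required in the motivation above.
- Why the negative sign: since $0 \leq p \leq 1$, $\log p \leq 0$. The negative sign makes information non-negative and assigns less information to more probable events.
- Why an expectation: entropy is the expected value of the self-information.
  - We can interpret $H(X)$ as the average amount of information gained by observing $X$.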
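The reading of entropy as an average can be illustrated by simulation: draw many outcomes from a distribution, record the self-information of each, and compare the sample mean with the entropy. A minimal sketch in plain Python, assuming a made-up three-outcome distribution and a fixed random seed (both are illustrative choices, not part of the original notebook):

```python
import math
import random

probs = {"a": 0.2, "b": 0.5, "c": 0.3}  # hypothetical distribution

# Exact entropy in bits: the probability-weighted self-information.
exact_entropy = sum(p * -math.log2(p) for p in probs.values())

# Monte Carlo estimate: average self-information over sampled outcomes.
rng = random.Random(0)
outcomes = list(probs)
weights = [probs[o] for o in outcomes]
samples = rng.choices(outcomes, weights=weights, k=100_000)
average_information = sum(-math.log2(probs[s]) for s in samples) / len(samples)

print(f"exact entropy:       {exact_entropy:.4f} bits")
print(f"sample mean of I(x): {average_information:.4f} bits")
```

With this many samples the two numbers should agree to roughly two decimal places; the remaining gap is sampling noise.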

#### 18.11.2.4. Properties of Entropy

- $H(X) \geq 0$ for all discrete $X$ (entropy can be negative for continuous $X$).
- 