# `ent` implemented in Python

This notebook shows how to reproduce the calculations of the `ent` tool in Python. It is written only to replicate those calculations: `ent` handles huge files well, whereas this script reads the whole file into memory and will not.

The code on this page is released into the public domain, just like the `ent` tool. Copying the license from [the `ent` website](http://www.fourmilab.ch/random/):

```
This software is in the public domain. Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, without any conditions or restrictions. This software is provided “as is” without express or implied warranty. 
```

## Input and expected output

Let's start by creating a file containing random data called `random.dat`:
```bash
dd if=/dev/urandom of=random.dat bs=8k count=1
```
You can also test with a file called `zeros.dat` that contains only zeros:
```bash
dd if=/dev/zero of=zeros.dat bs=8k count=1
```
And a file called `inc8kb.dat` that repeats the byte values 0 to 255 in increasing order:
```bash
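# write each byte value 0x00..0xff once (256 bytes)
seq 0 255 | while read l; do h=$(printf "\\\x%02x" $l); echo -ne "$h"; done > inc.dat
# 32 copies of 256 bytes = 8192 bytes; '>' overwrites, so reruns do not grow the file
seq 32 | while read l; do cat inc.dat; done > inc8kb.dat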
```
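
If a bash shell is not at hand, the three files could also be produced from Python. This is a minimal sketch using only the standard library. The function name `write_test_files` and its `size` parameter are made up for this example, and it assumes 8192 bytes per file, matching the `dd` commands above.

```python
import os
from pathlib import Path

def write_test_files(directory=".", size=8192):
    """Write random.dat, zeros.dat and inc8kb.dat, each `size` bytes long.

    Hypothetical helper; `size` is assumed to be a multiple of 256.
    """
    target = Path(directory)
    # random bytes supplied by the operating system
    (target / "random.dat").write_bytes(os.urandom(size))
    # bytes(n) is a string of n zero bytes
    (target / "zeros.dat").write_bytes(bytes(size))
    # 0x00..0xff repeated until the file is `size` bytes long
    (target / "inc8kb.dat").write_bytes(bytes(range(256)) * (size // 256))
```

Calling `write_test_files()` without arguments creates the files in the current directory.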

Next we analyze `random.dat` with `ent` and keep the output for later comparison. For my file, it looked like this:

```bash
$ ent random.dat 
Entropy = 7.979943 bits per byte.

Optimum compression would reduce the size
of this 8192 byte file by 0 percent.

Chi square distribution for 8192 samples is 226.62, and randomly
would exceed this value 89.91 percent of the times.

Arithmetic mean value of data bytes is 127.7104 (127.5 = random).
Monte Carlo value for Pi is 3.079853480 (error 1.97 percent).
Serial correlation coefficient is -0.016501 (totally uncorrelated = 0.0).
```

## Setup and code

We want to recreate these values in Python 3. Install NumPy and SciPy to continue (the command below is for Debian or Ubuntu):
```
apt install python3-numpy python3-scipy
```

First we'll read our data file, and convert the data to a numeric array:

In [1]:
import numpy as np
import scipy.stats

# try also zeros.dat and inc8kb.dat
with open("random.dat", "rb") as f:  # the file is closed automatically
    data = f.read()
data_num = list(data)  # iterating over bytes yields integers 0..255

print(type(data))
print(type(data_num))

<class 'bytes'>
<class 'list'>


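At this point, `data` is a byte string, while `data_num` is a list of integers. We need the list of integers for the calculations that follow. Note that the cell outputs in the rest of this notebook (for example an entropy of exactly 8 bits and a chi-square value of 0.00) are those of a perfectly even byte distribution such as `inc8kb.dat`, not of the `random.dat` run shown above. First, we'll calculate the entropy, which should match the entropy reported by `ent`: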

In [2]:
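# range=(0, 256) makes bin i count byte value i, whichever values occur in the data
h = np.histogram(data_num, bins=256, range=(0, 256))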
e = scipy.stats.entropy(h[0], base=2)
# print(e)
print("Entropy = {:.6f} bits per byte.".format(e))

Entropy = 8.000000 bits per byte.
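
For reference, the quantity computed here is the Shannon entropy of the byte frequencies: the sum over all byte values of p * log2(1 / p), where p is the fraction of bytes with that value. The following is a minimal standard-library sketch of that formula, not the SciPy implementation; the function name `shannon_entropy_bits` is made up for this example.

```python
import math
from collections import Counter

def shannon_entropy_bits(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (hypothetical helper)."""
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value that is present
    # byte values that never occur contribute nothing to the sum
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# every byte value exactly once: the maximum of 8 bits per byte
print(shannon_entropy_bits(bytes(range(256))))
# 100 identical bytes: no information at all
print(shannon_entropy_bits(bytes(100)))
```

This also shows why a file that cycles evenly through all 256 byte values reaches 8 bits per byte even though it is entirely predictable: the measure only looks at how often each value occurs, not at the order.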


Next we'll output the same value as `ent` for the optimum compression. This is the percentage by which the entropy falls short of the maximum of 8 bits per byte:

In [3]:
optimum_compression = (100 * (8 - e)) / 8
# ent casts this percentage to an integer, which truncates instead of rounding
print("Optimum compression would reduce the size")
print("of this {} byte file by {} percent.".format(len(data), int(optimum_compression)))


Optimum compression would reduce the size
of this 8192 byte file by 0 percent.


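The next step to replicate `ent`'s functionality is to compute the chi-square statistic of the byte counts against a uniform distribution, together with the percentage of cases in which a random sequence would exceed that value.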

In [4]:
chi = scipy.stats.chisquare(h[0])
# print(chi)
print("Chi square distribution for {} samples is {:.2f}, and randomly".format(len(data), chi.statistic))
print("would exceed this value {:.2f} percent of the time.".format(chi.pvalue*100))

Chi square distribution for 8192 samples is 0.00, and randomly
would exceed this value 100.00 percent of the time.
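
When no expected frequencies are passed, `scipy.stats.chisquare` assumes that all categories are equally likely, so the statistic is the sum over all 256 bins of (observed - expected)² / expected with expected = len(data) / 256. A minimal standard-library sketch of just the statistic (not the p-value) follows; the function name `chi_square_uniform` is made up for this example, and non-empty input is assumed.

```python
def chi_square_uniform(data: bytes) -> float:
    """Pearson chi-square statistic of byte counts against a uniform distribution.

    Hypothetical helper; assumes `data` is not empty.
    """
    counts = [0] * 256
    for byte in data:
        counts[byte] += 1
    expected = len(data) / 256  # each byte value is expected equally often
    return sum((observed - expected) ** 2 / expected for observed in counts)

# every byte value appears exactly 32 times, so no bin deviates from the expectation
print(chi_square_uniform(bytes(range(256)) * 32))
# a single repeated value deviates as much as possible
print(chi_square_uniform(bytes(8192)))
```

A perfectly even file therefore scores 0.00, which `ent` would treat as suspicious in its own right: truly random data almost never matches the expected counts exactly.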


The next output is the arithmetic mean of the byte values. For uniformly random bytes the expected mean is 127.5, halfway between 0 and 255.

In [5]:
totalc = len(data_num)
datasum = np.sum(data_num)

print("Arithmetic mean value of data bytes is {:.4f} ({:.1f} = random).".format(datasum/totalc, 127.5))

Arithmetic mean value of data bytes is 127.5000 (127.5 = random).


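The Monte Carlo value for pi requires a bit more manual calculation. Each group of six bytes is read as a point (x, y) with two 24-bit coordinates, and the fraction of points that fall inside the quarter circle of radius 256³ - 1 estimates pi / 4.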

In [6]:
import math

bytes_per_point = 6 # randtest.c's MONTEN value: 3 bytes for x plus 3 bytes for y
# squared radius of the quarter circle: (256^3 - 1)^2
incirc = math.pow(math.pow(256, bytes_per_point / 2) - 1, 2.0)
inmont = 0
mcount = 0
for i in range(0, len(data_num) - (bytes_per_point - 1), bytes_per_point):
    mcount = mcount + 1
    # use 3 bytes for the x coordinate, and the next 3 bytes for the y coordinate
    x_coord = data_num[i+0]*256*256 + data_num[i+1]*256 + data_num[i+2]
    y_coord = data_num[i+3]*256*256 + data_num[i+4]*256 + data_num[i+5]
    if (x_coord * x_coord + y_coord * y_coord <= incirc):
        inmont = inmont + 1
montepi = (4.0 * inmont / mcount)
print("Monte Carlo value for Pi is {:.9f} (error {:.2f} percent).".format(montepi, 100*abs(math.pi-montepi)/math.pi))

Monte Carlo value for Pi is 2.842490842 (error 9.52 percent).
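
To sanity-check the estimate, the same procedure can be wrapped in a function and run on inputs whose result is known in advance. This is a sketch under the same assumptions as the cell above (six bytes per point, big-endian 24-bit coordinates); the function name `monte_carlo_pi` and the parameter `bytes_per_point` are made up for this example.

```python
def monte_carlo_pi(data: bytes, bytes_per_point: int = 6) -> float:
    """Estimate pi from the fraction of points inside a quarter circle.

    Hypothetical helper; raises ValueError if `data` holds no complete point.
    """
    half = bytes_per_point // 2
    radius_squared = (256 ** half - 1) ** 2
    inside = total = 0
    # trailing bytes that do not form a complete point are ignored
    for i in range(0, len(data) - bytes_per_point + 1, bytes_per_point):
        x = int.from_bytes(data[i:i + half], "big")
        y = int.from_bytes(data[i + half:i + bytes_per_point], "big")
        total += 1
        if x * x + y * y <= radius_squared:
            inside += 1
    if total == 0:
        raise ValueError("need at least {} bytes".format(bytes_per_point))
    return 4.0 * inside / total

# all points at the origin lie inside the circle
print(monte_carlo_pi(bytes(12)))
# all points at the far corner of the square lie outside it
print(monte_carlo_pi(b"\xff" * 12))
```

With only 1365 points in an 8192 byte file, the estimate is coarse even for good random data, which is why `ent` reports an error of about 2 percent for `random.dat` above.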


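Finally, `ent` outputs the serial correlation coefficient, which measures how strongly each byte depends on the previous one. We have to calculate this ourselves as well, using running sums over the data in which the last byte is paired with the first: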

In [7]:
scct1 = scct2 = scct3 = 0.0
scclast = None
for sccun in data_num:
    if scclast is None:
        scclast = 0
        sccu0 = sccun
    else:
        scct1 = scct1 + (scclast * sccun)
    scct2 = scct2 + sccun
    scct3 = scct3 + (sccun * sccun)
    scclast = sccun

scct1 = scct1 + (scclast * sccu0)
scct2 = scct2 * scct2
scc = (totalc * scct3) - scct2
if scc == 0.0:
    scc = -100000
else:
    scc = (totalc * scct1 - scct2) / scc

if scc >= -99999:
    print("Serial correlation coefficient is {:.6f} (totally uncorrelated = 0.0)".format(scc))
else:
    print("Serial correlation coefficient is undefined (all values equal!).")

Serial correlation coefficient is 0.976654 (totally uncorrelated = 0.0)
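
The running sums above can be written more compactly by pairing every byte with its successor, wrapping from the last byte back to the first. This is a sketch of the same calculation and not code from `ent`; the function name `serial_correlation` is made up for this example, and it returns `None` where the cell above reports the coefficient as undefined.

```python
def serial_correlation(data: bytes):
    """Serial correlation coefficient of a byte string, or None if undefined.

    Hypothetical helper mirroring the sums scct1, scct2 and scct3 above.
    """
    n = len(data)
    if n == 0:
        return None
    successors = data[1:] + data[:1]  # the last byte is followed by the first
    t1 = sum(a * b for a, b in zip(data, successors))  # sum of neighbour products
    t2 = sum(data) ** 2                                # squared sum of all bytes
    t3 = sum(a * a for a in data)                      # sum of squared bytes
    denominator = n * t3 - t2
    if denominator == 0:  # all values equal
        return None
    return (n * t1 - t2) / denominator

# strictly alternating extremes are perfectly anti-correlated
print(serial_correlation(bytes([0, 255] * 4)))
# 32 repetitions of 0..255: almost every byte is its predecessor plus one
print(serial_correlation(bytes(range(256)) * 32))
```

The second call evaluates to roughly 0.9767, in line with the output shown above.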
