
Using a 100 GB dictionary based on over 4.4 trillion (10^12; Dutch: biljoen) input strings, we have estimated the Shannon entropy of language as used on the internet at 1.8 bits per character. This estimate assumes knowledge of the previous seven characters, which is a fair assumption when compressing data.
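In formula form, this is the standard definition of conditional entropy (written here in LaTeX notation for clarity; it is not taken from our code):

    H(X \mid C) = -\sum_{c} p(c) \sum_{x} p(x \mid c) \log_2 p(x \mid c)

where c ranges over the observed seven-character contexts and x over the possible next characters.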

The table below illustrates the range of our results; the "estimate length 7" row is our most accurate result.

| Encoding type | Bits/char | Calculated by |
|---------------|-----------|---------------|
| dumb encoding | 4.8 | log(27)/log(2) |
| estimate length 3 | 2.9 | shannonIM.RunFastShan on shorter dictionary |
| estimate length 5 | 2.2 | shannonIM.RunFastShan on slightly shorter dictionary |
| estimate length 7 | 1.8 | shannonIM.RunFastShan on full dictionary |
| Shannon | 1 - 1.5 | modern estimate |

Raw output

The table above shows approximations of the entropy, calculated in various ways.

- The dumb encoding encodes all characters as equally likely, without considering their chance of occurring: log2(27) ≈ 4.8 bits for the 27-symbol alphabet.
- "Estimate length 3" and "estimate length 5" show approximations of the Shannon limit when using shortened versions of the large dictionary. To keep everything in a single run on the big dataset, we created one large dictionary for the full-length estimate and "ignored" the last characters for the shorter estimates. This is not perfect, as the last two characters of each document are then missing, but it was our best attempt without doing multiple runs.
- "Estimate length 7" is our best estimate. It gives the weighted average of the entropy of a character given the previous seven characters, computed on the "raw" dictionary (a minimal sketch of this calculation follows the list).
- The last row, "Shannon", contains the currently accepted estimate for English-only text.
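To make the "estimate length 7" row concrete, here is a minimal sketch (plain Java, not the actual shannonIM.RunFastShan implementation; the class and method names are made up for illustration). It assumes the dictionary has already been reduced to counts of (context, next character) pairs, where the context is the previous seven characters.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the weighted-average conditional entropy estimate:
 * each context contributes its own entropy H(next | context),
 * weighted by how often that context occurs.
 */
public class CondEntropySketch {

    /** counts.get(context).get(nextChar) = number of occurrences */
    public static double conditionalEntropy(Map<String, Map<Character, Long>> counts) {
        long total = 0;
        for (Map<Character, Long> next : counts.values())
            for (long c : next.values())
                total += c;

        double entropy = 0.0;
        for (Map<Character, Long> next : counts.values()) {
            long contextTotal = 0;
            for (long c : next.values())
                contextTotal += c;

            // H(next | this context), in bits
            double h = 0.0;
            for (long c : next.values()) {
                double p = (double) c / contextTotal;
                h -= p * Math.log(p) / Math.log(2);
            }
            // weight the context by how often it occurs
            entropy += ((double) contextTotal / total) * h;
        }
        return entropy; // bits per character, given the context
    }

    public static void main(String[] args) {
        // toy example: the context "shanno" is almost always followed by 'n'
        Map<String, Map<Character, Long>> counts = new HashMap<>();
        Map<Character, Long> next = new HashMap<>();
        next.put('n', 99L);
        next.put(' ', 1L);
        counts.put("shanno", next);
        System.out.println(conditionalEntropy(counts)); // ~0.08 bits
    }
}
```

Each context contributes its own entropy, and contexts are weighted by their frequency, which is exactly the weighted average described above; the real run does this over the full 100 GB dictionary rather than an in-memory map.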

Next (interpretation & discussion)
Home
