
# Interpretation & discussion


## Interpretation

A data entropy of 1.8 bits per character: what does that mean? Essentially, it means that an infinite input of characters can be encoded into a binary stream that takes, on average, 1.8 bits per character.

In practice, inputs are finite rather than infinite, but for sufficiently large inputs the measured entropy is a very good approximation.
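To illustrate what "bits per character" means, Shannon entropy can be computed directly from a character distribution; by the source coding theorem, no lossless code can use fewer bits per character on average than this entropy. A minimal Python sketch (the toy strings are ours for illustration, and this computes the plain single-character entropy, a simpler quantity than the contextual estimate discussed on this page):

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Entropy in bits per character of the empirical character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A highly repetitive string can be encoded compactly, so its entropy is low.
print(shannon_entropy("abababababababab"))   # 1.0 bit per character
print(shannon_entropy("abcdefghijklmnop"))   # 4.0 bits per character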

## Discussion

### Implementation

Our approach is highly scalable, meaning that the input can be very large. Unfortunately, long combinations of characters have proven to be our main point of concern, as we did not manage to process them in a scalable manner. At a length of 10, the number of combinations is on the order of 100,000,000,000,000 (10^14). That is far too much to be computationally feasible.

We cannot solve this unless the combination length is lowered. A length of eight characters leaves us with a mere 27^8 ≈ 280 billion possibilities. We dare not go higher, since a relatively large share of these combinations would have to be stored and moved across nodes. Ideally, tuples should be as long as possible. If we could use very long combinations (say, 100 characters) and had a truly massive input dataset, the statistics should reveal the existence of grammar and context.
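To make these numbers concrete, a rough back-of-the-envelope sketch in Python (the 27-symbol alphabet of 26 letters plus the space follows from the figures above; the 16 bytes per stored tuple/count pair is purely our assumption, not a measurement of the actual implementation):

```python
# Number of possible combinations for a 27-symbol alphabet at different
# tuple lengths, and a rough worst-case storage estimate assuming
# ~16 bytes per (tuple, count) pair (assumption, ignores overhead).
ALPHABET = 27
BYTES_PER_ENTRY = 16

for length in (8, 10):
    combos = ALPHABET ** length
    print(f"length {length}: {combos:.2e} combinations, "
          f"worst case ~{combos * BYTES_PER_ENTRY / 1e12:.0f} TB")
```

Length 8 stays in the terabyte range in the worst case, while length 10 is hundreds of times larger, which is why the combination length had to be capped.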

Summarized: larger inputs reduce noise, and longer combinations give better approximations because they offer a view of language at a higher level (grammar, for example). The first can be scaled easily, as we did; the latter not so much.

### Results
Now that we have discussed the implementation, let us consider the meaning of all of this.

We knew beforehand that the number we would find would not be the ideal entropy. The deviation occurs in the phase between the first and second run, where we go from counting combinations to assuming that the results of the count are a perfect representation of reality. They are quite good, but they are by no means perfect.
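The bias enters exactly where the observed counts are treated as if they were the true distribution. As an illustration only (this is not the project's distributed code, and the estimator used here, the conditional entropy of the next character given the preceding ones, is our assumption about the kind of quantity being computed), a single-machine sketch in Python:

```python
import math
from collections import Counter

def conditional_entropy(text, n):
    """Estimate H(next character | previous n-1 characters) in bits,
    treating the observed n-gram counts as the true distribution."""
    ngrams = Counter(text[i:i+n] for i in range(len(text) - n + 1))
    contexts = Counter(text[i:i+n-1] for i in range(len(text) - n + 1))
    total = sum(ngrams.values())
    h = 0.0
    for gram, count in ngrams.items():
        p_gram = count / total                    # P(context, next char)
        p_next = count / contexts[gram[:-1]]      # P(next char | context)
        h -= p_gram * math.log2(p_next)
    return h

# On a small, repetitive sample the estimate drops quickly as n grows,
# precisely the finite-sample effect described above: the counts only
# approximate the real distribution of the language.
sample = "the quick brown fox jumps over the lazy dog " * 100
for n in (1, 2, 3, 4):
    print(n, round(conditional_entropy(sample, n), 3))
```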

Reasons why the true entropy may be higher than our approximation:

- We did not encounter all possible combinations, but we might with an even larger input.
- We ignored capitalization and punctuation.

Reasons why the true entropy may be lower:

- Capitalization is quite predictable: the first character after almost every full stop is a capital; if the first two characters of a word are capitals, the word is likely to be fully capitalized; and so on.
- We replaced full stops and commas with spaces: a ' ' is likely to be either '. ' or ', '. Thus, the lack of punctuation should not cause a very distorted view of reality.
- We probably included some text that was not intended as part of language (e.g. ASCII art).
- We calculated the entropy over multiple languages. The English-only entropy should be lower.
- The results show that the entropy is still converging to some value below 1.8.

Considering the above, we feel confident in claiming that we have proven that the entropy of language as used on the internet is lower than 1.8 bits per character.
