
Back in 1948, Shannon established his famous theory of (data) entropy. In this theory, he gives a formula for the unpredictability of data and, implicitly, the amount of information it contains. Put very briefly, it comes down to this: the entropy of data equals the unpredictability of that data. Highly unpredictable data (e.g. random numbers) has a high entropy.
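
To make that concrete, here is a minimal sketch in Python (the function name `shannon_entropy` is ours, purely for illustration): the entropy of a symbol distribution is H = -Σ p·log2(p), so repetitive data scores near zero bits per character while random-looking data scores high.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Entropy in bits per character: H = -sum(p * log2(p)) over symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

print(shannon_entropy("aaaaaaaa"))  # 0 bits/char: fully predictable
print(shannon_entropy("abcdefgh"))  # 3.0 bits/char: eight equally likely symbols
```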

Due to lack of space, we refer to [wikipedia](http://en.wikipedia.org/wiki/Entropy_(information_theory\)) for a better explanation, or to Shannon's beautiful paper for a full treatment of data entropy.

If probabilities depend on previous occurrences (so-called conditional probabilities), the calculations become far more complex. To illustrate the idea of conditional probabilities, Shannon gave the sentence "To be or not to ?" as an example. Most people immediately understand that the missing word is "be", thus lowering the entropy, even though there are thousands of words to choose from. Our grammar rules, context and literary knowledge help us narrow this down to the single possibility "be".
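
The toy sketch below illustrates the same idea at the character level with a simple bigram (previous character, current character) model; the names and the example text are our own and this is not the calculation used anywhere in the project itself.

```python
import math
from collections import Counter

def entropy(symbols):
    """Plain entropy in bits per symbol: H = -sum(p * log2(p))."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(text):
    """H(current | previous): unpredictability of a character given the one before it."""
    pair_counts = Counter(zip(text, text[1:]))   # counts of (previous, current) pairs
    prev_counts = Counter(text[:-1])             # counts of the conditioning character
    total_pairs = len(text) - 1
    h = 0.0
    for (prev, cur), n in pair_counts.items():
        p_pair = n / total_pairs                 # P(previous, current)
        p_cond = n / prev_counts[prev]           # P(current | previous)
        h -= p_pair * math.log2(p_cond)
    return h

text = "to be or not to be, that is the question"
# Knowing the previous character narrows down the next one, so the
# conditional entropy comes out lower than the plain per-character entropy.
print(entropy(text), conditional_entropy(text))
```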

Unfortunately, not all our conversations are Shakespeare quotations. In fact, our daily use of language is often complexly interwoven or even seems near-random. Thus, if we wish to know the entropy of a language, we are in for a challenge...

## The challenge

Rather than calculating the exact entropy, we intend to give an approximation of it. The complexity of the grammar, the vast number of words, combinations of words, incorrect use of language, different people using language differently... These influences make calculating the entropy so difficult that scientists have never been able to give a conclusive value for the entropy of the English language. Shannon estimated it to be between 0.8 and 1.3 bits per character; modern estimates lie between 1 and 1.5 bits per character. That is very low, considering that "dumb" encoding would use 5 bits or more per character.
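
As a quick illustration of where the "5 bits or more" figure comes from, assuming an alphabet of just 26 letters plus a space (a real character set would be larger):

```python
import math

# A fixed-width ("dumb") code spends ceil(log2(alphabet size)) bits on every
# character, no matter how predictable the text is.
alphabet_size = 27                          # 26 letters plus a space
print(math.log2(alphabet_size))             # ~4.75 bits
print(math.ceil(math.log2(alphabet_size)))  # 5 bits per character
```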

A major improvement in the approximation of the entropy can come from encoding chunks of text, rather than independent characters. These chunks could be combinations of N characters, syllables, words or even combinations of words (ngrams). Google actually published databases containing ngrams of various sizes and their number of occurrences in a large collection of books (more than a lifetime's supply of literature). These could be used to approximate the entropy of language as used in that input.
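
As an illustration (using a hypothetical toy table of counts, not the Google data itself), word-level ngram counts can be turned into an entropy per word, and from there into a rough bits-per-character figure:

```python
import math

# Hypothetical 1-gram (single word) counts; in practice these would come from
# a real dataset such as the Google ngram collection.
word_counts = {"the": 500, "of": 300, "to": 280, "be": 120, "or": 90, "not": 80}

total = sum(word_counts.values())
bits_per_word = -sum(c / total * math.log2(c / total) for c in word_counts.values())

# Dividing by the average word length (plus one for the separating space)
# gives a rough bits-per-character figure, comparable to the estimates above.
avg_chars_per_word = sum((len(w) + 1) * c for w, c in word_counts.items()) / total
print(bits_per_word, bits_per_word / avg_chars_per_word)
```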

Given our dataset, we chose to calculate the entropy of language (not just English) as used on the Internet. The effect on our calculation is that we get the entropy of language as used by the average internet user, misspellings included.

Next (our solution)
Home
