Copyright 2021 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Descriptive statistics: distribution-based metrics

We can also consider descriptive statistics based on the distribution of words.
Distribution in this sense means an assignment of numbers to each word, like the frequency of each word or the probability of each word in the corpus.
The methods that we will discuss consider the *identity* of each word, which we ignored when we looked at length-based metrics.
The key idea in this notebook is that we can **transform text into a distribution.**
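
As a minimal sketch of this idea, the snippet below turns a tiny made-up token list into a frequency distribution and a probability distribution. It uses Python's `collections.Counter` rather than NLTK, and the example tokens are an assumption chosen only for illustration.

```python
from collections import Counter

# A tiny, made-up "corpus" that is already split into tokens
tokens = ["the", "cat", "saw", "the", "dog"]

# Frequency distribution: each word is assigned its count
frequencies = Counter(tokens)

# Probability distribution: each count divided by the total number of tokens
total = sum(frequencies.values())
probabilities = {word: count / total for word, count in frequencies.items()}

print(frequencies["the"])    # 2
print(probabilities["the"])  # 0.4
```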

## What you will learn

You will learn about text-oriented descriptive statistics derived from the distribution of words in a text.
  
We will cover:

- Distribution-based metrics
    - Lexical diversity
    - Frequency distributions
    - Conditional distributions
    - Vectorization
    - tf-idf

## When to use distribution-based metrics

Descriptive statistics are helpful for exploring the data and considering other potential analyses.
The transformations on text that we discuss may also be useful as features in later modeling.

## Distribution-based metrics

We'll use the built-in `brown` corpus from NLTK.
Let's import the `brown` corpus and `nltk`:

- from `nltk.corpus` import `brown`
- import `nltk` as `nltk`

In [2]:
from nltk.corpus import brown
import nltk as nltk


The [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus) is a diverse corpus of 500 texts collected in 1961 from 15 genres, e.g. news, fiction, religion, and biographies.

### Lexical diversity

Before we continue, it is important to make the distinction between a word *type* and a word *token*.

A token is a single instance of a word. If I say "love love love," then there are three tokens, or three instances of "love."

A type is a category. So if I say "love love love," there is only one type. If I say "love like love," there are two types.

To get the number of tokens in a list, we can use the length of the list (`length of` in LISTS).

To get the number of types in a list, we first remove duplicates and then take the length.

We can remove duplicates with something called `set` (also in LISTS), which keeps only one copy of each distinct item.
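
The "love like love" example above can be checked directly in plain Python. This is only a sketch: splitting on whitespace stands in for real tokenization.

```python
# Whitespace splitting is a stand-in for real tokenization
tokens = "love like love".split()

num_tokens = len(tokens)      # every instance counts
num_types = len(set(tokens))  # duplicates are removed first

print(num_tokens)  # 3
print(num_types)   # 2
```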

Let's start by calculating the number of tokens in `brown`:

- length of with `brown` do `words`

In [7]:
len(brown.words())


1161192

Now count the types using `set`:

- length of set with `brown` do `words`

In [8]:
len(set(brown.words()))


56057

That's a pretty huge difference!
Even in a very diverse corpus of over a million tokens, there are only just over 56 thousand distinct word forms. This count treats inflections, and capitalization variants such as `the` and `The`, as separate types, so the number of truly distinct words is even smaller!

**An important fact about language is that most words are rare.**
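
One common way to summarize lexical diversity in a single number is the type/token ratio: the number of types divided by the number of tokens. The sketch below simply reuses the two counts printed above rather than reloading the corpus; the variable names are illustrative.

```python
# Counts copied from the two cells above
num_tokens = 1161192  # len(brown.words())
num_types = 56057     # len(set(brown.words()))

# Type/token ratio: closer to 0 means more repetition, closer to 1 means less
type_token_ratio = num_types / num_tokens

print(round(type_token_ratio, 3))  # 0.048
```

The ratio tends to fall as texts get longer, so it is most meaningful when comparing texts of similar size.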

### Frequency distributions

We can see most words are rare by looking at the "frequency distribution" of words in a text. 
A small group of words, starting with articles and prepositions, will make up most of the tokens.

NLTK has a built-in function for calculating frequency distributions from words:

- Set `freqDist` to with `nltk` create `FreqDist` using with `brown` do `words` 

In [10]:
freqDist = nltk.FreqDist(brown.words())


Rather than showing a frequency distribution for all words in `brown`, which would fill up our notebook, we'll ask for only the top 50:

- with `freqDist` do `most_common` using `50`

In [12]:
freqDist.most_common(50)


[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466), ('on', 6395), ('be', 6344), (';', 5566), ('I', 5161), ('by', 5103), ('had', 5102), ('at', 4963), ('?', 4693), ('not', 4423), ('are', 4333), ('from', 4207), ('or', 4118), ('this', 3966), ('have', 3892), ('an', 3542), ('which', 3540), ('--', 3432), ('were', 3279), ('but', 3007), ('He', 2982), ('her', 2885), ('one', 2873), ('they', 2773), ('you', 2766), ('all', 2726), ('would', 2677), ('him', 2576), ('their', 2562), ('been', 2470), (')', 2466)]

As expected, most of the top 50 are [function words](https://en.wikipedia.org/wiki/Function_word) and punctuation.

**This illustrates that just because a word is frequent, that doesn't mean the word is important.**
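
To put a number on how much of the corpus a handful of frequent words covers, the sketch below adds up the ten largest counts from the output above and divides by the total token count reported earlier. The counts are copied by hand from this notebook's output, so nothing is reloaded.

```python
# Ten largest counts from freqDist.most_common(50) above:
# 'the', ',', '.', 'of', 'and', 'to', 'a', 'in', 'that', 'is'
top_ten_counts = [62713, 58334, 49346, 36080, 27915,
                  25732, 21881, 19536, 10237, 10011]

total_tokens = 1161192  # len(brown.words())

# Share of all tokens accounted for by just these ten types
coverage = sum(top_ten_counts) / total_tokens

print(round(coverage, 2))  # 0.28
```

With a live `FreqDist`, the same quantity could be computed from `freqDist.most_common(10)` and `freqDist.N()`.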

We can also check how many words occur only once; in linguistics these are known as [hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon) (singular: hapax legomenon), and NLTK calls them hapaxes:

- Set `hapaxes` to with `freqDist` do `hapaxes`
- length of `hapaxes`

*Note: We don't need to use `set` with `length` because each hapax is unique by definition (tokens=types)*

In [14]:
hapaxes = freqDist.hapaxes()

len(hapaxes)


25559

Note that almost half of the word types in `brown` (25,559 of 56,057) occur only once!
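
A hapax is simply a word whose count is exactly one. The sketch below shows that idea with `collections.Counter` on a made-up token list (an illustration, not necessarily how NLTK implements `hapaxes`), and then computes the share of hapaxes in `brown` from the counts already printed above.

```python
from collections import Counter

# Made-up tokens: 'the' repeats, the other words occur once
counts = Counter(["the", "cat", "saw", "the", "dog"])

# Keep the words whose frequency is exactly 1
toy_hapaxes = [word for word, count in counts.items() if count == 1]
print(toy_hapaxes)  # ['cat', 'saw', 'dog']

# Proportion of brown's types that are hapaxes, using counts printed above
hapax_share = 25559 / 56057
print(round(hapax_share, 2))  # 0.46
```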

Let's look at 50 of them:

- in list `hapaxes` get sub-list from `first` to `50`

In [15]:
hapaxes[ : 50]


['term-end', 'presentments', 'September-October', 'Durwood', 'Pye', 'Mayor-nominate', 'Merger', 're-set', 'disable', "ordinary's", 'appraisers', 'Wards', 'juries', 'unmeritorious', 'Regarding', 'extern', "Commissioner's", 'Bellwood', 'Alpharetta', 'Cheshire', 'amicable', '637', 'expires', 'Dorsey', 'Tower', 'Ledford', 'Gainesville', 'Schley', '87-31', '29-5', 'Mac', '1,119', '402', 'calmest', 'Policeman', 'Callan', 'Tabb', "Daniel's", 'Legislatures', 'erase', 'depositors', 'Gaynor', 'Brady', 'Harlingen', 'Deaf', 'Bexar', 'Tarrant', '$451,500', '$157,460', '$88,000']

Some of these rare words look like they might be important!

### Conditional distributions



### Vectorization
### tf-idf

### Text length in characters

When we consider text length as a metric, we can clearly consider it on multiple scales.
At the coarsest level we can consider the length of the entire text in characters.
We've previously seen how to use NLTK to get a list of texts in a corpus and the raw form of the text (i.e. string, or sequence of characters).
Let's combine those operations and also get the length of the raw text:

- Set rawLengths to a list with one element containing
    - for each item `i` in list with `gutenberg` do `fileids` (see LOOPS)
        - yield length of (see TEXT) with `gutenberg` do `raw` using `i`
- Display rawLengths

In [50]:
from nltk.corpus import gutenberg  # not imported earlier in this notebook
rawLengths = [(len(gutenberg.raw(i))) for i in (gutenberg.fileids())]

rawLengths


[887071, 466292, 673022, 4332554, 38153, 249439, 84663, 144395, 457450, 406629, 320525, 935158, 1242990, 468220, 112310, 162881, 100351, 711215]

Each one of these is the length (in characters) of a book in the `gutenberg` corpus.

### Text length in words

Let's repeat this operation, but count words instead of characters.
Since we've already covered how to do word tokenization manually, we'll use the built-in `gutenberg` tokenization to focus on the new concept:

- Set wordLengths to a list with one element containing
    - for each item `i` in list with `gutenberg` do `fileids` (see LOOPS)
        - yield length of (see LISTS) with `gutenberg` do `words` using `i`
- Display wordLengths

Note the only changes are `words` instead of `raw` and `length of` from LISTS instead of TEXT.
This is because while `raw` gives us one big string, `words` gives us a list of words.
However, the logic of the loop is the same (we sometimes call this a traversal, because we are traversing the data to calculate something).

In [51]:
wordLengths = [(len(gutenberg.words(i))) for i in (gutenberg.fileids())]

wordLengths


[192427, 98171, 141576, 1010654, 8354, 55563, 18963, 34110, 96996, 86063, 69213, 210663, 260819, 96825, 25833, 37360, 23140, 154883]

As expected, the number of words is much smaller than the number of characters.
We will return to this shortly.
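
The traversal pattern used in the last two cells does not depend on NLTK. As a rough sketch, the snippet below applies it to a made-up two-document corpus stored in a dictionary, with whitespace splitting standing in for the corpus tokenizer (an assumption; NLTK's tokenization also separates punctuation).

```python
# Hypothetical mini-corpus: file id -> raw text
toy_corpus = {
    "a.txt": "The cat sat.",
    "b.txt": "Dogs bark loudly at night.",
}

# Length of each raw string, in characters
toy_raw_lengths = [len(toy_corpus[i]) for i in toy_corpus]

# Length of each word list, in words (whitespace split as a stand-in)
toy_word_lengths = [len(toy_corpus[i].split()) for i in toy_corpus]

print(toy_raw_lengths)   # [12, 26]
print(toy_word_lengths)  # [3, 5]
```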

### Text length in sentences

Let's look at the same text, but this time in sentences:

- Set sentenceLengths to a list with one element containing
    - for each item `i` in list with `gutenberg` do `fileids` (see LOOPS)
        - yield length of (see LISTS) with `gutenberg` do `sents` using `i`
- Display sentenceLengths

In [52]:
sentenceLengths = [(len(gutenberg.sents(i))) for i in (gutenberg.fileids())]

sentenceLengths


[7752, 3747, 4999, 30103, 438, 2863, 1054, 1703, 4779, 3806, 3742, 10230, 10059, 1851, 2163, 3106, 1907, 4250]

And again, the number of sentences is quite a bit lower than the number of words, as expected.

### Average word/sentence length

There are at least two ways we could calculate average word length using what we've covered so far.
We could do a traversal of words and calculate the average word length for each text.
Alternatively, we could use the values we've already computed and divide.
The same applies to sentence length.
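
The two approaches do not have to give the same number. In the sketch below, on a made-up sentence with whitespace tokenization as an assumption, the division shortcut counts spaces as characters while the traversal does not.

```python
text = "the quick brown fox"  # made-up example text
words = text.split()          # whitespace split as a stand-in tokenizer

# Approach 1: traverse the words and average their lengths
avg_by_traversal = sum(len(w) for w in words) / len(words)

# Approach 2: divide total characters by total words
# (the character count here includes the three spaces)
avg_by_division = len(text) / len(words)

print(avg_by_traversal)  # 4.0
print(avg_by_division)   # 4.75
```

The division shortcut is therefore best read as an approximation of average word length.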

Conceptually what we want to do is perform operations with the first element of `wordLengths`, `sentenceLengths`, and `rawLengths`, the second element of these lists, and so on.
Rather than create our own data structures for facilitating these operations, let's put these lists into a dataframe.
Start by importing `pandas`:

- import `pandas` as `pd`

In [53]:
import pandas as pd


We're going to create a dataframe with these lists using the `zip` operator in LISTS:

- Set `dataframe` to with `pd` create `DataFrame` using a list containing
    - `zip` a list containing
        - with `gutenberg` do `fileids`,`wordLengths`, `sentenceLengths`, `rawLengths`
    - freestyle `columns=['corpus','words','sentences','characters']`
- Display `dataframe`

In [54]:
dataframe = pd.DataFrame(zip(gutenberg.fileids(), wordLengths, sentenceLengths, rawLengths), columns=['corpus','words','sentences','characters'])

dataframe


| | corpus | words | sentences | characters |
|---|---|---|---|---|
| 0 | austen-emma.txt | 192427 | 7752 | 887071 |
| 1 | austen-persuasion.txt | 98171 | 3747 | 466292 |
| 2 | austen-sense.txt | 141576 | 4999 | 673022 |
| 3 | bible-kjv.txt | 1010654 | 30103 | 4332554 |
| 4 | blake-poems.txt | 8354 | 438 | 38153 |
| 5 | bryant-stories.txt | 55563 | 2863 | 249439 |
| 6 | burgess-busterbrown.txt | 18963 | 1054 | 84663 |
| 7 | carroll-alice.txt | 34110 | 1703 | 144395 |
| 8 | chesterton-ball.txt | 96996 | 4779 | 457450 |
| 9 | chesterton-brown.txt | 86063 | 3806 | 406629 |


This nicely brings together and displays everything we've done so far.
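
What `zip` does here can be seen in plain Python: it pairs up the first elements of each list, then the second elements, and so on, so each tuple becomes one row. The sketch below uses the first two entries of each list from the outputs above; the variable names are illustrative.

```python
# First two entries of each list, copied from the outputs above
file_ids = ["austen-emma.txt", "austen-persuasion.txt"]
word_counts = [192427, 98171]
sentence_counts = [7752, 3747]
char_counts = [887071, 466292]

# zip yields one tuple per position; list() makes the rows visible
rows = list(zip(file_ids, word_counts, sentence_counts, char_counts))

print(rows[0])  # ('austen-emma.txt', 192427, 7752, 887071)
```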

To calculate average word length and sentence length, just add columns:

- Set `dataframe` to with `dataframe` do `assign` using
    - freestyle `avg_wl =` `dataframe["characters"]` / `dataframe["words"]`
- Set `dataframe` to with `dataframe` do `assign` using
    - freestyle `avg_sl =` `dataframe["words"]` / `dataframe["sentences"]`
- Display `dataframe`

Note that the standard unit for word length is characters but that the standard unit for sentence length is words.

In [55]:
dataframe = dataframe.assign(avg_wl= (dataframe['characters'] / dataframe['words']))
dataframe = dataframe.assign(avg_sl= (dataframe['words'] / dataframe['sentences']))

dataframe


| | corpus | words | sentences | characters | avg_wl | avg_sl |
|---|---|---|---|---|---|---|
| 0 | austen-emma.txt | 192427 | 7752 | 887071 | 4.609909 | 24.822884 |
| 1 | austen-persuasion.txt | 98171 | 3747 | 466292 | 4.749794 | 26.199893 |
| 2 | austen-sense.txt | 141576 | 4999 | 673022 | 4.753786 | 28.320864 |
| 3 | bible-kjv.txt | 1010654 | 30103 | 4332554 | 4.286882 | 33.573199 |
| 4 | blake-poems.txt | 8354 | 438 | 38153 | 4.567034 | 19.073059 |
| 5 | bryant-stories.txt | 55563 | 2863 | 249439 | 4.4893 | 19.407265 |
| 6 | burgess-busterbrown.txt | 18963 | 1054 | 84663 | 4.464642 | 17.991461 |
| 7 | carroll-alice.txt | 34110 | 1703 | 144395 | 4.233216 | 20.02936 |
| 8 | chesterton-ball.txt | 96996 | 4779 | 457450 | 4.716174 | 20.296296 |
| 9 | chesterton-brown.txt | 86063 | 3806 | 406629 | 4.724783 | 22.612454 |


### Readability

What we've calculated so far may seem simplistic and perhaps not that useful.
However, several of these metrics are components of perhaps the best-known readability formula, the [Flesch-Kincaid Grade Level (FKGL)](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests):

\begin{equation*}
0.39 \left( \frac{\mbox{total words}}{\mbox{total sentences}} \right) +11.8 \left( \frac{\mbox{total syllables}}{\mbox{total words}} \right) - 15.59
\end{equation*}

FKGL gives us a sense of how difficult text is to read, which could be an important/useful predictor as well as an interesting descriptive statistic.
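
Written as code, the formula is a one-line calculation. The sketch below uses hypothetical counts (100 words, 5 sentences, 150 syllables) purely to show the arithmetic; the function and parameter names are illustrative.

```python
def fkgl(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Hypothetical text: 100 words, 5 sentences, 150 syllables
grade = fkgl(100, 5, 150)
print(round(grade, 2))  # 9.91
```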

We don't have syllable counts, however.
Syllables are a bit of a pain to count because English has a deep orthography (spelling does not map reliably onto pronunciation), so the best way is to use a pronunciation dictionary like [this](https://github.com/steveash/jg2p).
For now, we will just assume that English averages 1.5 syllables per word and use that to estimate this component:

- Set `dataframe` to with `dataframe` do `assign` using
    - freestyle `fkgl =` 0.39 * `dataframe["words"]` / `dataframe["sentences"]` + 11.8 * 1.5 - 15.59
- Display `dataframe`

*Note: assuming 1.5 syllables per word, total syllables / total words = (1.5 × total words) / total words = 1.5.*

In [57]:
dataframe = dataframe.assign(fkgl= ((0.39 * (dataframe['words'] / dataframe['sentences']) + 11.8 * 1.5) - 15.59))

dataframe


| | corpus | words | sentences | characters | avg_wl | avg_sl | fkgl |
|---|---|---|---|---|---|---|---|
| 0 | austen-emma.txt | 192427 | 7752 | 887071 | 4.609909 | 24.822884 | 11.790925 |
| 1 | austen-persuasion.txt | 98171 | 3747 | 466292 | 4.749794 | 26.199893 | 12.327958 |
| 2 | austen-sense.txt | 141576 | 4999 | 673022 | 4.753786 | 28.320864 | 13.155137 |
| 3 | bible-kjv.txt | 1010654 | 30103 | 4332554 | 4.286882 | 33.573199 | 15.203547 |
| 4 | blake-poems.txt | 8354 | 438 | 38153 | 4.567034 | 19.073059 | 9.548493 |
| 5 | bryant-stories.txt | 55563 | 2863 | 249439 | 4.4893 | 19.407265 | 9.678833 |
| 6 | burgess-busterbrown.txt | 18963 | 1054 | 84663 | 4.464642 | 17.991461 | 9.12667 |
| 7 | carroll-alice.txt | 34110 | 1703 | 144395 | 4.233216 | 20.02936 | 9.92145 |
| 8 | chesterton-ball.txt | 96996 | 4779 | 457450 | 4.716174 | 20.296296 | 10.025556 |
| 9 | chesterton-brown.txt | 86063 | 3806 | 406629 | 4.724783 | 22.612454 | 10.928857 |


The readability differences generally make sense: `Alice in Wonderland` is lower than the `King James Bible`, which is lower than `Paradise Lost` (the Milton and Shakespeare rows fall outside the ten rows displayed above). However, we also see that Shakespeare is the least difficult.
This is perhaps expected given the formula, since Shakespeare has the lowest `avg_sl`, but it seems intuitively incorrect to anyone who has experienced Shakespeare.
Note, however, that FKGL does not take into account the frequency of the words themselves (how rare they are), which is a question of *distribution*.

## Distribution-based metrics 

### Type/token ratio

### Frequency distributions

### Conditional distributions

### Vectorization

### tf-idf
