Readability measures are used to score the reading difficulty of a text, for the purpose of selecting texts of appropriate difficulty for language learners. Let us define $\mu_w$ to be the average number of letters per word, and $\mu_s$ to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: $4.71 \mu_w + 0.5 \mu_s - 21.43$. Compute the ARI score for various sections of the Brown Corpus, including section `f` (lore) and `j` (learned). Make use of the fact that `nltk.corpus.brown.words()` produces a sequence of words, while `nltk.corpus.brown.sents()` produces a sequence of sentences.
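
As a quick sanity check of the formula, here is a worked example with made-up values (not taken from any corpus): assume a text with $\mu_w = 4.5$ letters per word and $\mu_s = 20$ words per sentence.

$$
\mathrm{ARI} = 4.71 \times 4.5 + 0.5 \times 20 - 21.43 = 21.195 + 10 - 21.43 = 9.765
$$

Longer words and longer sentences both push the score up, and the constant $-21.43$ shifts the result down by a fixed amount.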

In [1]:
from statistics import mean

from nltk.corpus import brown
from tabulate import tabulate

In [2]:
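def ari(brown_category):
    """Return the Automated Readability Index of one Brown Corpus category."""
    # mu_w: average number of characters per token (punctuation tokens included)
    words = brown.words(categories=brown_category)
    average_letters_per_word = mean(len(w) for w in words)
    # mu_s: average number of tokens per sentence
    sents = brown.sents(categories=brown_category)
    average_words_per_sentence = mean(len(s) for s in sents)
    return 4.71 * average_letters_per_word + 0.5 * average_words_per_sentence - 21.43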

In [3]:
table = [[c, ari(c)] for c in brown.categories()]
print(tabulate(table, headers=["Genre", "ARI"]))

Genre                 ARI
---------------  --------
adventure         4.08417
belles_lettres   10.9877
editorial         9.47103
fiction           4.91047
government       12.0843
hobbies           8.92236
humor             7.88781
learned          11.926
lore             10.2548
mystery           3.83355
news             10.1767
religion         10.2031
reviews          10.7697
romance           4.34922
science_fiction   4.97806
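
The scores above count every token as a word. The Brown Corpus token sequences include punctuation marks as separate tokens, which tends to lower $\mu_w$ and raise $\mu_s$. The following is a minimal sketch, not part of the exercise, of one way to compare the two readings of "word". The function name `ari_from_sents`, the `skip_punctuation` flag, and the toy sentences are all hypothetical. It takes any list of tokenized sentences, so it runs without the corpus.

```python
from statistics import mean


def ari_from_sents(sents, skip_punctuation=False):
    """Compute ARI from a list of tokenized sentences (lists of strings)."""
    if skip_punctuation:
        # Assumption: a "word" is any token with at least one letter or digit.
        sents = [[t for t in s if any(ch.isalnum() for ch in t)] for s in sents]
        # Drop sentences that contained nothing but punctuation.
        sents = [s for s in sents if s]
    words = [t for s in sents for t in s]
    mu_w = mean(len(t) for t in words)  # average letters per word
    mu_s = mean(len(s) for s in sents)  # average words per sentence
    return 4.71 * mu_w + 0.5 * mu_s - 21.43


# Toy data standing in for the output of a sentence tokenizer.
toy = [["The", "cat", "sat", "."], ["It", "purred", "loudly", "."]]
print(ari_from_sents(toy))
print(ari_from_sents(toy, skip_punctuation=True))
```

To apply this to the corpus, one could pass `brown.sents(categories="lore")` as `sents`. With filtering enabled, the resulting scores would differ from the table above.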
