In [1]:
import pandas as pd
import textstat as ts

In [3]:
books = pd.read_json("top-10-projbg-books.ndjson", lines=True)

In [6]:
%%time
books["fl_ease"] = books["text"].apply(ts.flesch_reading_ease)

CPU times: total: 0 ns
Wall time: 1 ms


In [9]:
%%time
books["fl_kn_grd"] = books["text"].apply(ts.flesch_kincaid_grade)

CPU times: total: 0 ns
Wall time: 969 µs


In [11]:
%%time
books["smog"] = books["text"].apply(ts.smog_index)

CPU times: total: 4.86 s
Wall time: 4.86 s


In [12]:
%%time
books["gunning"] = books["text"].apply(ts.gunning_fog)

CPU times: total: 1.38 s
Wall time: 1.38 s


In [12]:
%%time
books["ari"] = books["text"].apply(ts.automated_readability_index)

CPU times: total: 1.38 s
Wall time: 1.38 s


In [13]:
%%time
books["dale_chall"] = books["text"].apply(ts.dale_chall_readability_score)

CPU times: total: 1.38 s
Wall time: 1.37 s


In [16]:
%%time
books["ensemble"] = books["text"].apply(ts.text_standard, float_output=True)

CPU times: total: 2.02 s
Wall time: 2.01 s


In [17]:
%%time
books["eflaw"] = books["text"].apply(ts.mcalpine_eflaw)

CPU times: total: 656 ms
Wall time: 643 ms


In [18]:
%%time
# mean time per character per Trauzettel-Klosinski et al., 2012
books["reading_time"] = books["text"].apply(ts.reading_time, ms_per_char=16.45)

CPU times: total: 516 ms
Wall time: 516 ms


In [19]:
books

Unnamed: 0,filename,text,fl_ease,fl_kn_grd,smog,gunning,dale_chall,ensemble,eflaw,reading_time
0,A Room with a View by E.M. Forster.txt,\n\n\n\n\n\n\n\n\n\nA Room With A View\n\nBy E...,84.78,4.4,8.2,5.8,5.49,6.0,17.2,5037.43
1,Cranford by Elizabeth Cleghorn Gaskell.txt,\n\n\n\n\n \n\n\n\n\n ...,69.25,10.4,10.5,11.79,6.14,11.0,39.6,5171.62
2,Little Women by Louisa May Alcott.txt,"\n\n\n\n\nProduced by David Edwards, Ernest Sc...",76.35,7.6,8.6,8.86,6.01,9.0,29.7,2940.75
3,Middlemarch by George Eliot.txt,\n\ncover\n\n\n\n\nMiddlemarch\n\nGeorge Eliot...,68.5,8.6,10.3,8.48,1.65,9.0,28.4,23985.61
4,"Moby Dick; Or, The Whale by Herman Melville.txt","\n\n\n\n\nMOBY-DICK;\n\nor, THE WHALE.\n\nBy H...",73.31,8.8,10.6,10.22,5.95,9.0,33.0,16484.55
5,The Adventures of Ferdinand Count Fathom — Com...,\n\n\n\n\nProduced by Tapio Riikonen and David...,37.31,18.5,16.4,17.74,6.65,18.0,60.1,13010.06
6,The Blue Castle by L.M. Montgomery.txt,\n\n\n_The_\n\nBLUE CASTLE\n\n\n\n\n_A NOVEL_\...,84.78,4.4,8.0,5.64,5.47,6.0,17.0,5258.59
7,The Complete Works of William Shakespeare by W...,\n\n\n\n\nThe Complete Works of William Shakes...,82.04,5.4,7.6,6.1,1.2,6.0,20.7,69933.52
8,The Enchanted April by Elizabeth Von Arnim.txt,\n\n\n\n\n\n\n\n\n\nThe Enchanted April\n\nby ...,80.92,5.9,9.3,7.17,5.37,6.0,22.8,5900.71
9,The Expedition of Humphry Clinker by T. Smolle...,\n\n\n\n\nProduced by Martin Adamson and Andr...,46.68,17.0,14.5,17.39,6.85,17.0,59.4,11525.61


I'm not sure how well these metrics are performing here. I doubt that Shakespeare is actually at a 6th-grade level in general. I'm especially unsure about the Dale-Chall score of 1.2, which would indicate that an average 4th grader can easily understand it. Of course, these scores shouldn't be relied on as a gold standard, but rather as a hint of where a text might fall.
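
To make the Dale-Chall reading concrete, here is a minimal sketch that maps a raw score to the grade bands commonly published for the formula. The helper name `dale_chall_band` and the label strings are my own, the fallback label for scores of 10 and above is an assumption, and the scores are copied from the `dale_chall` column above.

```python
# Commonly published Dale-Chall bands: (exclusive upper bound, label).
DALE_CHALL_BANDS = [
    (5.0, "4th grade or below"),
    (6.0, "5th-6th grade"),
    (7.0, "7th-8th grade"),
    (8.0, "9th-10th grade"),
    (9.0, "11th-12th grade"),
    (10.0, "13th-15th grade (college)"),
]


def dale_chall_band(score: float) -> str:
    """Return the grade band for a raw Dale-Chall score (hypothetical helper)."""
    for upper, label in DALE_CHALL_BANDS:
        if score < upper:
            return label
    # Assumption: the usual table stops at 9.9, so anything higher gets a catch-all label.
    return "above 15th grade"


# Scores copied from the dale_chall column of the table above.
for title, score in [
    ("Shakespeare", 1.2),
    ("Middlemarch", 1.65),
    ("Moby Dick", 5.95),
    ("Humphry Clinker", 6.85),
]:
    print(f"{title}: {score} -> {dale_chall_band(score)}")
```

Both of the very low scores (Shakespeare and Middlemarch) land in the lowest band, which is part of why I don't trust them.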

Different formulas lead to different results, which is unsurprising when you think about it. In this case, the text standard (the `ensemble` column) is probably our best bet, because it is rooted in finding the consensus of the scores. Of course, "best" is ambiguous in a lot of respects. For one thing, the metric needs to be one that is understood and accepted. For another, it needs to be one that is considered reliable. It's important to understand the context that exists for each of these metrics.
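
As a rough illustration of what "consensus" can mean, the sketch below takes the median of the three columns that are already on a grade-level scale (`fl_kn_grd`, `smog`, `gunning`). This is a stand-in I picked for simplicity, not how `text_standard` works internally, so it should not be expected to reproduce the `ensemble` column. The helper name `consensus_grade` is hypothetical, and the numbers are copied from the table above.

```python
from statistics import median

# Grade-level scores copied from the table above, in the order
# fl_kn_grd, smog, gunning. Other columns (fl_ease, dale_chall, eflaw)
# use different scales and are left out on purpose.
grade_scores = {
    "Cranford": [10.4, 10.5, 11.79],
    "Ferdinand Count Fathom": [18.5, 16.4, 17.74],
    "The Blue Castle": [4.4, 8.0, 5.64],
}


def consensus_grade(scores):
    """Median of several grade-level estimates (a simple stand-in for a consensus)."""
    return median(scores)


for title, scores in grade_scores.items():
    print(f"{title}: consensus grade ~ {consensus_grade(scores)}")
```

The median is less sensitive than the mean to a single formula that disagrees with the rest, which seems like the right property when the formulas are known to diverge.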

The usual Flesch-Kincaid formula can overestimate readability in certain contexts. The SMOG index is better suited to technical documents in a lot of respects and is popular for assessing the readability of health documents. However, syllable-based readability tests have their own issues: short words may still confuse people if they are rare. The Dale-Chall formula takes a different approach: it has a list of words that 4th-grade students can understand, and anything off the list is considered difficult. Overall, though, any metric here is probably "good enough". Specific use cases require specific testing to ensure the reliability and validity of a measure.
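
To see why syllable-based tests can't tell rare short words from common ones, here is a sketch of the standard Flesch Reading Ease and Flesch-Kincaid grade formulas written in terms of raw counts. The function names are my own, and the counts are tallied by hand for two made-up sentences; textstat does its own tokenizing and syllable counting, so its results on real text may differ.

```python
def reading_ease_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts (higher means easier)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def kincaid_grade_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (higher means harder)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Two hypothetical sentences with identical hand-tallied counts:
#   "The dog sat on the mat."  (common words)
#   "The auk sat on the tor."  (rarer words, same lengths)
# Both have 6 words, 1 sentence, and 6 syllables.
counts = {"words": 6, "sentences": 1, "syllables": 6}

# The formulas only see the counts, so both sentences get the same scores.
print("reading ease:", reading_ease_from_counts(**counts))
print("grade level:", kincaid_grade_from_counts(**counts))
```

Because only word, sentence, and syllable counts enter the formulas, swapping "dog" for "auk" changes nothing, which is the gap that a word-list approach like Dale-Chall tries to cover.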