# Determine the “Readability” of a text

Readability metrics have numerous uses. A writer might use them to objectively assess the complexity of their work and check that it is written at a level appropriate for the intended audience. An educational software firm might use them to recommend level-appropriate content to its students.

Currently, I work on the latter. As a result, I’ve written a Python package, py-readability-metrics, that assesses the readability of a given text using a variety of today’s most popular readability metrics. These include:

- Flesch Kincaid Grade Level

- Flesch Reading Ease

- Dale Chall Readability

- Automated Readability Index (ARI)

- Coleman Liau Index

- Gunning Fog

- SMOG

- Spache

- Linsear Write

Given a text, each of the above metrics calculates a score indicating the difficulty of the text.

In [28]:
from readability import Readability
import spacy
import newspaper

import nltk
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/pierluigi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [29]:
url = "https://www.foxnews.com/politics/republicans-respond-after-irs-whistleblower-says-hunter-biden-investigation-being-mishandled"

In [30]:
def get_article_info(url):
    # Create a newspaper Article object
    article = newspaper.Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Extract the title, subtitle, description, and main text
    title = article.title.strip()
    subtitle = article.meta_data.get("description", "").strip()
    description = article.meta_description.strip()
    text = article.text.strip()

    # Set the subtitle to the description if it is empty
    if not subtitle:
        subtitle = description.strip()

    # Concatenate the extracted strings
    article_text = f"{title}\n\n{subtitle}\n\n{text}"

    # Return the concatenated string
    return article_text

In [31]:
article = get_article_info(url)

r = Readability(article)

# Flesch Kincaid Grade Level

The U.S. Army uses Flesch-Kincaid Grade Level for assessing the difficulty of technical manuals. The Commonwealth of Pennsylvania uses Flesch-Kincaid Grade Level for scoring automobile insurance policies to ensure their texts are no higher than a ninth-grade level of reading difficulty. Many other U.S. states also use Flesch-Kincaid Grade Level to score other legal documents such as business policies and financial forms.

In [32]:
fk = r.flesch_kincaid()

print(fk.score)
print(fk.grade_level)

14.15103125495796
14
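
To make the score above less of a black box, here is a minimal sketch of the standard published Flesch-Kincaid Grade Level formula, computed from raw counts. The function name and the counts are made up for illustration, and py-readability-metrics may tokenize and count syllables differently, so this is not expected to reproduce the library's number exactly.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts (standard published formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # Longer sentences and more syllables per word both raise the grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```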


# Flesch Reading Ease

The U.S. Department of Defense uses the Reading Ease test as the standard test of readability for its documents and forms. Florida requires that life insurance policies have a Flesch Reading Ease score of 45 or greater.

In [33]:
f = r.flesch()
print(f.score)
print(f.ease)
print(f.grade_levels)

33.18228541964146
difficult
['college']
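
Reading Ease uses the same two ingredients as the grade-level formula (words per sentence and syllables per word) but on an inverted scale, so a higher score means easier text. The sketch below uses the standard published formula with made-up counts; the function name is mine, and the library's own counting rules may differ.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts (standard published formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # Both terms are subtracted: long sentences and long words lower the score.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 1))  # 59.6
```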


# Dale Chall Readability

The Dale-Chall formula scores a text by how many of its words fall outside a list of familiar words, rather than by syllable or letter counts. Reading tests show that readers usually find it easier to read, process and recall a passage if they find the words familiar.

In [34]:
dc = r.dale_chall()
print(dc.score)
print(dc.grade_levels)

11.398427716960178
['college_graduate']
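
As a rough illustration of the familiar-word idea, the sketch below applies the published (new) Dale-Chall formula to raw counts. The real formula relies on a list of roughly 3,000 familiar words; here the number of "difficult" words (those not on the list) is simply passed in, and the function name and counts are assumptions for the example.

```python
def dale_chall_score(words: int, sentences: int, difficult_words: int) -> float:
    """New Dale-Chall score from raw counts.

    `difficult_words` is the number of words NOT on the familiar-word list.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    pct_difficult = 100.0 * difficult_words / words
    score = 0.1579 * pct_difficult + 0.0496 * (words / sentences)
    # The formula adds a fixed adjustment once more than 5% of words are difficult.
    if pct_difficult > 5.0:
        score += 3.6365
    return score


# Hypothetical counts: 100 words, 5 sentences, 10 unfamiliar words.
print(round(dale_chall_score(100, 5, 10), 4))  # 6.2075
```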


# Automated Readability Index (ARI)

Unlike most of the other indices, ARI (like Coleman-Liau) relies on characters per word instead of the usual syllables per word, which makes it easy to compute automatically. ARI is widely used on all types of texts.

In [35]:
ari = r.ari()
print(ari.score)
print(ari.grade_levels)
print(ari.ages)

13.76357171188323
['college_graduate']
[24, 100]
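
Because ARI only needs character, word and sentence counts, it can be sketched with nothing but a regular expression. The tokenization below is deliberately naive (words are runs of letters or digits, sentences end at `.`, `!` or `?`) and is my own assumption, not how py-readability-metrics splits text.

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI computed with naive regex tokenization (standard published formula)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(w) for w in words)
    return (4.71 * (characters / len(words))
            + 0.5 * (len(words) / len(sentences))
            - 21.43)


# Six 3-letter words in two sentences give a very low (here negative) score.
print(round(automated_readability_index("The cat sat. The dog ran."), 2))  # -5.8
```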


# Coleman Liau Index

The Coleman-Liau Formula usually gives a lower grade value than any of the Kincaid, ARI and Flesch values when applied to technical documents.

In [36]:
cl = r.coleman_liau()
print(cl.score)
print(cl.grade_level)

12.40612565445026
12
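
Coleman-Liau works with letters and sentences per 100 words. The following sketch applies the standard published formula to raw counts; the function name and the counts are hypothetical, and the library may count letters and sentences differently.

```python
def coleman_liau_index(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau index from raw counts (standard published formula)."""
    if words <= 0:
        raise ValueError("words must be positive")
    letters_per_100 = 100.0 * letters / words      # L in the formula
    sentences_per_100 = 100.0 * sentences / words  # S in the formula
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


# Hypothetical counts: 500 letters, 100 words, 5 sentences.
print(round(coleman_liau_index(500, 100, 5), 2))  # 12.12
```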


# Gunning Fog

The Gunning fog index measures the readability of English writing. The index estimates the years of formal education needed to understand the text on a first reading. A fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old).

In [37]:
gf = r.gunning_fog()
print(gf.score)
print(gf.grade_level)

15.229192448040616
college
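
To see where a fog index of 12 can come from, here is a small sketch of the standard Gunning fog formula on raw counts. "Complex" words are usually taken to be words of three or more syllables; the function name and counts are made up, and the library's definition of a complex word may differ in its details.

```python
def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning fog index from raw counts (standard published formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    avg_sentence_length = words / sentences
    pct_complex = 100.0 * complex_words / words
    return 0.4 * (avg_sentence_length + pct_complex)


# 20 words per sentence with 10% complex words lands on a fog index of 12,
# the high-school-senior level mentioned above.
print(round(gunning_fog(100, 5, 10), 2))  # 12.0
```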


# SMOG

The SMOG Readability Formula (Simple Measure of Gobbledygook) is a popular method to use on health literacy materials.

In [38]:
s = r.smog()
print(s.score)
print(s.grade_level)

15.774802946060372
16
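
SMOG is driven entirely by polysyllabic words (three or more syllables), normalized to a 30-sentence sample. The sketch below uses the standard published formula with hypothetical counts; the function name is mine, and the library's syllable counting may differ.

```python
import math


def smog_grade(polysyllables: int, sentences: int) -> float:
    """SMOG grade from raw counts (standard published formula)."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    # Scale the polysyllable count to what a 30-sentence sample would contain.
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


# Hypothetical counts: 30 polysyllabic words in 30 sentences.
print(round(smog_grade(30, 30), 2))  # 8.84
```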


# Spache

The Spache Readability Formula was introduced in "A New Readability Formula for Primary-Grade Reading Materials", published in 1953 in The Elementary School Journal. It is best used to calculate the difficulty of text that falls at the 3rd grade level or below, so the score for a news article like this one should be read with caution.

In [39]:
s = r.spache()
print(s.score)
print(s.grade_level)

8.425876725368871
8


# Linsear Write

Linsear Write is a readability metric for English text, purportedly developed for the United States Air Force to help them calculate the readability of their technical manuals.

In [40]:
lw = r.linsear_write()
print(lw.score)
print(lw.grade_level)

16.636363636363637
17


# Dependency Tree Height

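Here I split the news article into sentences, compute the height of each sentence's dependency tree (the number of edges on the longest path from the root to a leaf), and report the average, maximum and minimum height over all sentences in the article. The code below calls this height the tree "length".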

# Depth of each node

The cell below fills a depths dictionary with the depth of each node in the dependency tree. The keys of the dictionary are the text of the nodes, and the values are the depths.

In [41]:
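doc = nlp(article)
depths = {}

def walk_tree(node, depth):
    # Record this token's depth, keyed by its text. A word that occurs more
    # than once keeps only the depth recorded on its last visit.
    depths[node.orth_] = depth
    for child in node.children:
        walk_tree(child, depth + 1)


# Walk every sentence's tree, starting from its root at depth 0
for sent in doc.sents:
    walk_tree(sent.root, 0)
print(depths)
print(max(depths.values()))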

{'respond': 0, 'Republicans': 1, 'says': 1, 'after': 5, 'whistleblower': 6, 'IRS': 5, 'mishandled': 2, 'investigation': 2, 'Biden': 1, 'Hunter': 2, 'is': 0, 'being': 1, '\n\n': 2, 'calling': 0, 'Members': 4, 'of': 3, 'Congress': 7, 'are': 1, 'for': 2, 'transparency': 2, 'more': 7, 'from': 3, 'administration': 2, 'the': 6, 'said': 0, 'an': 3, 'into': 3, '.': 1, 'Lawmakers': 1, 'on': 3, 'Hill': 3, 'Capitol': 4, 'held': 1, 'to': 1, 'be': 0, 'accountable': 4, 'blocking': 4, '"': 1, 'and': 2, 'public': 1, 'learning': 3, 'about': 7, 'deals': 8, 'members': 1, 'family': 4, '’': 2, 'business': 5, 'with': 1, 'China': 12, 'come': 0, 'outcries': 1, 'The': 2, 'congressional': 2, 'as': 8, 'a': 3, 'within': 3, 'Service': 4, 'Revenue': 5, 'Internal': 6, 'alleges': 0, 'by': 6, 'also': 1, 'conflicts': 1, 'clear': 2, 'interest': 3, 'in': 1, 'told': 0, 'concerning': 1, 'It': 1, '’s': 0, 'deeply': 2, 'obstructing': 2, 'that': 5, 'Administration': 5, 'may': 5, 'justice': 3, 'efforts': 5, 'charge': 6, 'viola

# Building the tree

The function walk_tree recursively traverses the syntactic dependency tree of each sentence in the article and stores the depth of each node in the depths dictionary. The depth of a node is defined as the number of edges on the path from the node to the root of the dependency tree. The walk_tree function is called on the root of each sentence (sent.root) with an initial depth of 0.

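The depths dictionary maps each token (i.e., word or punctuation symbol) in the article to its depth in the dependency tree. The keys of the dictionary are the orthographic forms (i.e., the string representations) of the tokens, and the values are the depths. Because the keys are strings, a word that appears several times in the article has a single entry, holding the depth from the last time it was visited.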

The final two lines of the code print the depths dictionary and the maximum depth of any node in the tree. The maximum depth is determined by calling the max function on the values of the depths dictionary.
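
One way around the overwriting of repeated words is to key the dictionary by token position instead of token text (in spaCy that would be `token.i`). The sketch below shows the idea on a hand-written toy tree so that it runs without a language model; the `children` mapping, the sentence and the function name are all made up for illustration and are not spaCy output.

```python
from collections import deque


def node_depths(children, root):
    """Map each node id to its number of edges from the root (breadth-first)."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            depths[child] = depths[node] + 1
            queue.append(child)
    return depths


# Hypothetical toy tree for "the dog saw the cat", keyed by token position
# so that the two occurrences of "the" stay distinct.
tokens = ["the", "dog", "saw", "the", "cat"]
children = {2: [1, 4], 1: [0], 4: [3]}  # head position -> child positions
depths = node_depths(children, root=2)
print({f"{i}:{tokens[i]}": d for i, d in sorted(depths.items())})
print(max(depths.values()))
```

Keyed this way, both occurrences of "the" keep their own depth, and the maximum is still taken over every token.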

# Example

Input text: "The quick brown fox jumps over the lazy dog. The dog, however, doesn't seem to care."

There are two sentences:

1. "The quick brown fox jumps over the lazy dog."

2. "The dog, however, doesn't seem to care."

For the first sentence, the root of the dependency tree is the word "jumps", and the maximum depth of the tree is 3. This means that the longest path from the root to a leaf node in the tree has a length of 3.

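With a typical parse (punctuation omitted; the exact attachments can vary with the spaCy model version), the dependency tree for the first sentence looks like this:


                      jumps
                 /      |
          fox          over
       /   |    \       |
     The quick brown   dog
                      /   \
                   the     lazy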

For the second sentence, the root of the dependency tree is the word "seem", and the maximum depth of the tree is 2. This means that the longest path from the root to a leaf node in the tree has a length of 2.

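With a typical parse (punctuation omitted; the exact attachments can vary with the spaCy model version), the dependency tree for the second sentence looks like this:

                  seem
        /     /    |    \     \
     dog  however does  n't   care
      |                        |
     The                       to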
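
The two maximum depths quoted above can be checked by hand. The sketch below encodes each sentence as a list of head positions, which is my own assumption of a typical parse (with "jumps" and "seem" as roots) and not actual spaCy output, and then measures the longest chain from any token up to the root.

```python
def tree_height(heads):
    """Height of a dependency tree, where heads[i] is the position of
    token i's head and the root has head -1."""
    def depth(i):
        d = 0
        while heads[i] != -1:
            i = heads[i]
            d += 1
        return d
    return max(depth(i) for i in range(len(heads)))


# Assumed parse of: The quick brown fox jumps over the lazy dog .
tokens_1 = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
heads_1 = [3, 3, 3, 4, -1, 4, 8, 8, 5, 4]

# Assumed parse of: The dog , however , does n't seem to care .
tokens_2 = ["The", "dog", ",", "however", ",", "does", "n't", "seem", "to", "care", "."]
heads_2 = [1, 7, 7, 7, 7, 7, 7, -1, 9, 7, 7]

print(tree_height(heads_1))  # 3  (jumps -> over -> dog -> the)
print(tree_height(heads_2))  # 2  (seem -> care -> to)
```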


In [42]:
def walk_tree(node, depth):
    # Height of the subtree under `node`: the deepest level reached by any
    # descendant, or `depth` itself when the node is a leaf.
    return max((walk_tree(child, depth + 1) for child in node.children), default=depth)

def analyze_article(article):
    doc = nlp(article)
    depths = {}        # root word of each sentence -> height of its tree
    tree_lengths = {}  # sentence text -> height of its tree
    for sent in doc.sents:
        root = sent.root
        depth = walk_tree(root, 0)
        depths[root.orth_] = depth
        tree_lengths[sent.text.strip()] = depth

    # Summary statistics over all sentences (assumes at least one sentence)
    lengths = list(tree_lengths.values())
    avg_length = sum(lengths) / len(lengths)
    max_length = max(lengths)
    min_length = min(lengths)
    max_depth = max(depths.values())
    # Root words of the sentence(s) whose tree is the tallest
    max_depth_words = [word for word, depth in depths.items() if depth == max_depth]
    return tree_lengths, max_depth, max_depth_words, avg_length, max_length, min_length

In [43]:
tree_lengths, max_depth, max_depth_words, avg_length, max_length, min_length = analyze_article(article)

print("Dependency tree lengths:")
for i, (sent, length) in enumerate(tree_lengths.items(), start=1):
    print(f"Sentence {i}: length {length}")
    print(f"\"{sent}\"\n")
print(f"Max tree depth: {max_depth}")
print(f"Root words of deepest sentences: {', '.join(max_depth_words)}")
print(f"Average tree length: {avg_length:.2f}")
print(f"Maximum tree length: {max_length}")
print(f"Minimum tree length: {min_length}")

Dependency tree lengths:
Sentence 1: length 9
"Republicans respond after IRS whistleblower says Hunter Biden investigation is being mishandled

Members of Congress are calling for more transparency from the Biden administration after an IRS whistleblower said an investigation into Hunter Biden is being mishandled."

Sentence 2: length 10
"Lawmakers on Capitol Hill are calling for the Biden administration to be held accountable for "blocking" Congress and the public from learning more about Biden family members’ business deals with China."

Sentence 3: length 6
"The congressional outcries come as a whistleblower within the Internal Revenue Service alleges an investigation into Hunter Biden is being mishandled by the Biden administration."

Sentence 4: length 4
"The whistleblower also alleges "clear conflicts of interest" in the investigation."

Sentence 5: length 9
""It’s deeply concerning that the Biden Administration may be obstructing justice by blocking efforts to charge Hunter Bide

# Printing the tree

In [44]:
def print_dependency_tree(sentence):
    # Print the tree of one sentence, starting from its root
    print_tree(sentence.root)
    print()

def print_tree(node, prefix=""):
    # Pre-order traversal: print the token, then each child one level deeper
    print(prefix + node.orth_)
    for child in node.children:
        print_tree(child, prefix + "|  +--")

for sentence in doc.sents:
    print_dependency_tree(sentence)


respond
|  +--Republicans
|  +--says
|  +--|  +--after
|  +--|  +--whistleblower
|  +--|  +--|  +--IRS
|  +--|  +--mishandled
|  +--|  +--|  +--investigation
|  +--|  +--|  +--|  +--Biden
|  +--|  +--|  +--|  +--|  +--Hunter
|  +--|  +--|  +--is
|  +--|  +--|  +--being
|  +--|  +--|  +--


|  +--|  +--|  +--calling
|  +--|  +--|  +--|  +--Members
|  +--|  +--|  +--|  +--|  +--of
|  +--|  +--|  +--|  +--|  +--|  +--Congress
|  +--|  +--|  +--|  +--are
|  +--|  +--|  +--|  +--for
|  +--|  +--|  +--|  +--|  +--transparency
|  +--|  +--|  +--|  +--|  +--|  +--more
|  +--|  +--|  +--|  +--|  +--|  +--from
|  +--|  +--|  +--|  +--|  +--|  +--|  +--administration
|  +--|  +--|  +--|  +--|  +--|  +--|  +--|  +--the
|  +--|  +--|  +--|  +--|  +--|  +--|  +--|  +--Biden
|  +--|  +--|  +--|  +--said
|  +--|  +--|  +--|  +--|  +--after
|  +--|  +--|  +--|  +--|  +--whistleblower
|  +--|  +--|  +--|  +--|  +--|  +--an
|  +--|  +--|  +--|  +--|  +--|  +--IRS
|  +--|  +--|  +--|  +--|  +--mishandled
