# Determine the “Readability” of a text

Readability metrics have numerous uses. A writer might use the metrics to objectively assess the complexity of his work to determine whether it’s written at a level appropriate for his intended audience. An educational software firm might use readability metrics to recommend level-appropriate content for its students.

Currently, I work on the latter. As a result, I’ve written a Python package, py-readability-metrics, that assesses the readability of a given text using a variety of today’s most popular readability metrics. These include:

- Flesch Kincaid Grade Level
- Flesch Reading Ease
- Dale Chall Readability
- Automated Readability Index (ARI)
- Coleman Liau Index
- Gunning Fog
- SMOG
- SPACHE
- Linsear Write

Given a text, each of the above metrics calculates a score indicating the difficulty of the text.

In [28]:
from readability import Readability
import spacy
import newspaper

import nltk
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

nlp = spacy.load("en_core_web_sm")


In [29]:
url = "https://www.foxnews.com/politics/republicans-respond-after-irs-whistleblower-says-hunter-biden-investigation-being-mishandled"

In [30]:
def get_article_info(url):
    # Create a newspaper Article object
    article = newspaper.Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Extract the title, subtitle, description, and main text
    title = article.title.strip()
    subtitle = article.meta_data.get("description", "").strip()
    description = article.meta_description.strip()
    text = article.text.strip()

    # Set the subtitle to the description if it is empty
    if not subtitle:
        subtitle = description.strip()

    # Concatenate the extracted strings
    article_text = f"{title}\n\n{subtitle}\n\n{text}"

    # Return the concatenated string
    return article_text

In [31]:
article = get_article_info(url)

r = Readability(article)

# Flesch Kincaid Grade Level

Output values:

1. **'fk.score'** : This is a numerical value that represents the Flesch-Kincaid Grade Level. It is calculated from the average number of words per sentence and the average number of syllables per word. Unlike the Flesch Reading Ease score, it is not on a 0 to 100 scale: the value approximates a U.S. school grade, so higher scores indicate text that is harder to read.

2. **'fk.grade_level'** : This is the estimated reading level required to understand the text. It is expressed as a grade level (e.g. 4th grade, 8th grade, etc.) and is derived from the Flesch-Kincaid score. A lower grade level indicates that the text is easier to read, while a higher grade level indicates that the text is more difficult to read.

It can be useful in determining the target audience or level of comprehension required for the text.
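The calculation described above can be sketched from raw counts. This is a minimal illustration of the commonly published Flesch-Kincaid Grade Level formula, not the package's own code; the function name and the example counts are hypothetical, and the package may differ in details such as tokenization or rounding.

```python
def flesch_kincaid_grade(num_words: int, num_sentences: int, num_syllables: int) -> float:
    """Commonly published Flesch-Kincaid Grade Level formula, from raw counts."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    # Longer sentences and longer words both push the grade level up.
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(100, 5, 150)
print(round(grade, 2))
```

Because the result is a grade level, a higher value means a harder text.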

In [60]:
fk = r.flesch_kincaid()

# Map Flesch-Kincaid grade levels to U.S. school grades
# (grade level 0 is kindergarten, 1 is first grade, and so on)
reading_levels = {
    0: "Kindergarten",
    1: "First Grade",
    2: "Second Grade",
    3: "Third Grade",
    4: "Fourth Grade",
    5: "Fifth Grade",
    6: "Sixth Grade",
    7: "Seventh Grade",
    8: "Eighth Grade",
    9: "Ninth Grade",
    10: "Tenth Grade",
    11: "Eleventh Grade",
    12: "Twelfth Grade",
    13: "College"
}

# Short description of the kind of text typically found at each level
reading_level_descriptions = {
    0: "Simple sentences with basic vocabulary and familiar concepts. Picture books and easy readers are typically at this level.",
    1: "Short sentences with simple vocabulary and straightforward ideas. Beginning readers and early chapter books are typically at this level.",
    2: "Longer sentences with more complex vocabulary and more detailed ideas. Early chapter books and some middle grade books are typically at this level.",
    3: "Longer paragraphs with more sophisticated vocabulary and more complex ideas. Middle grade books and some young adult books are typically at this level.",
    4: "More challenging vocabulary and ideas, with longer paragraphs and more complex sentence structures. Middle grade and young adult books are typically at this level.",
    5: "Even more complex vocabulary and ideas, with longer and more complex sentence structures. Middle grade and young adult books are typically at this level.",
    6: "Similar to fifth grade, but with more complex vocabulary and sentence structures. Middle grade and young adult books are typically at this level.",
    7: "More sophisticated vocabulary and ideas, with longer and more complex sentence structures. Young adult books and some adult literature are typically at this level.",
    8: "Similar to seventh grade, but with even more complex vocabulary and sentence structures. Young adult books and some adult literature are typically at this level.",
    9: "High school level reading, with sophisticated vocabulary and complex sentence structures. Some young adult and adult literature is typically at this level.",
    10: "Similar to ninth grade, but with even more challenging vocabulary and sentence structures. Some young adult and adult literature is typically at this level.",
    11: "College preparatory level reading, with advanced vocabulary and complex sentence structures. Some young adult and adult literature is typically at this level.",
    12: "College level reading, with very advanced vocabulary and complex sentence structures. Some adult literature and academic texts are typically at this level.",
    13: "Post-secondary reading, with highly specialized vocabulary and complex sentence structures. Academic and technical texts are typically at this level."
}

# Print a table of estimated reading levels for different grades
print("Flesch-Kincaid Reading Levels:\n")
print("Grade Level\tEstimated Reading Level\tDescription")
for grade_level in sorted(reading_levels):
    estimated_level = reading_levels[grade_level]
    description = reading_level_descriptions[grade_level]
    print(f"{grade_level}\t\t{estimated_level}\t\t{description}")
print()
# Print the readability score and estimated reading level
print("The Flesch-Kincaid score of the article is:", fk.score)
print("The estimated reading level of the article is:", fk.grade_level)

Flesch-Kincaid Reading Levels:

Grade Level	Estimated Reading Level	Description
0		Kindergarten		Simple sentences with basic vocabulary and familiar concepts. Picture books and easy readers are typically at this level.
1		First Grade		Short sentences with simple vocabulary and straightforward ideas. Beginning readers and early chapter books are typically at this level.
2		Second Grade		Longer sentences with more complex vocabulary and more detailed ideas. Early chapter books and some middle grade books are typically at this level.
3		Third Grade		Longer paragraphs with more sophisticated vocabulary and more complex ideas. Middle grade books and some young adult books are typically at this level.
4		Fourth Grade		More challenging vocabulary and ideas, with longer paragraphs and more complex sentence structures. Middle grade and young adult books are typically at this level.
5		Fifth Grade		Even more complex vocabulary and ideas, with longer and more complex sentence structures. Middle grade and young adult books are typically at this level.

# Flesch Reading Ease

Output values:

1. **'f.score'**: This is a numerical value that represents the Flesch Reading Ease score. The score is typically between 0 and 100, with higher scores indicating easier to read text. The formula for calculating the Flesch Reading Ease score takes into account the average sentence length and the average number of syllables per word.

2. **'f.ease'**: This is a qualitative measure of how easy the text is to read. The value is a text string on a scale that runs from very easy to very confusing; for the article used here it is "difficult".

3. **'f.grade_levels'**: This is a list of the grade-level labels that correspond to the score. It is a list rather than a single value because one score band can span several school grades.

In [46]:
f = r.flesch()

# Print the readability score, ease value, and estimated grade levels
print("The Flesch Reading Ease score of the article is:", f.score)
print("The article is classified as:", f.ease)
# grade_levels is a list, so join its entries instead of calling .items()
print("The estimated grade levels are:", ", ".join(str(level) for level in f.grade_levels))

The Flesch Reading Ease score of the article is: 33.18228541964146
The article is classified as: difficult
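The Flesch Reading Ease score from this section can be sketched as follows. This is an illustration of the commonly published formula and the conventional Flesch score bands, not the package's implementation; the function names, the label strings, and the example counts are assumptions made for the example.

```python
def flesch_reading_ease(num_words: int, num_sentences: int, num_syllables: int) -> float:
    """Commonly published Flesch Reading Ease formula, from raw counts."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    # Longer sentences and longer words both lower the score.
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def ease_band(score: float) -> str:
    """Map a score to the conventional Flesch bands (label strings are hypothetical)."""
    bands = [
        (90, "very easy"),
        (80, "easy"),
        (70, "fairly easy"),
        (60, "standard"),
        (50, "fairly difficult"),
        (30, "difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "very confusing"


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3), ease_band(score))
```

Under these bands, the article's score of about 33.18 falls in the "difficult" range, which agrees with the classification printed in this section.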

# Dale Chall Readability

The Dale-Chall formula differs from most other readability formulas in that it is based on the use of familiar words, rather than syllable or letter counts. Reading tests show that readers usually find it easier to read, process and recall a passage if they find the words familiar.
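The familiar-word idea can be sketched as below, using the commonly published (new) Dale-Chall formula. The real formula relies on a list of roughly 3,000 familiar words; the tiny `familiar_words` set, the function names, and the counts here are hypothetical stand-ins, and the package's own word list and tokenization are not shown.

```python
def count_difficult(words, familiar_words):
    """Count words that do not appear in the familiar-word list."""
    return sum(1 for word in words if word.lower() not in familiar_words)


def dale_chall_score(num_words: int, num_sentences: int, num_difficult_words: int) -> float:
    """Commonly published Dale-Chall formula, from raw counts."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    pct_difficult = 100 * num_difficult_words / num_words
    score = 0.1579 * pct_difficult + 0.0496 * (num_words / num_sentences)
    # The raw score is adjusted upward when more than 5% of words are unfamiliar.
    if pct_difficult > 5:
        score += 3.6365
    return score


# Hypothetical familiar-word list and sample, for illustration only.
familiar_words = {"the", "cat", "sat", "on", "a", "mat"}
sample = ["The", "cat", "pontificated", "on", "a", "mat"]
print(count_difficult(sample, familiar_words))

# Hypothetical counts: 100 words, 5 sentences, 10 unfamiliar words.
print(round(dale_chall_score(100, 5, 10), 4))
```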

In [34]:
dc = r.dale_chall()
print(dc.score)
print(dc.grade_levels)

11.398427716960178
['college_graduate']


# Automated Readability Index (ARI)

Unlike most of the other indices, the ARI, along with the Coleman-Liau, relies on characters per word instead of the usual syllables per word, so it can be computed without a syllable counter. ARI is used on many types of texts.
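Because it only needs character, word, and sentence counts, ARI can be sketched directly from a string. This is a rough illustration of the commonly published formula with a deliberately crude regex tokenizer (an assumption of this example); the package uses its own tokenization, so its scores will differ.

```python
import re


def automated_readability_index(text: str) -> float:
    """Commonly published ARI formula with a crude regex tokenizer."""
    # Crude assumptions: a word is a run of letters or digits, and a
    # sentence ends at one or more of ., ! or ?.
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    num_chars = sum(len(word) for word in words)
    chars_per_word = num_chars / len(words)
    words_per_sentence = len(words) / len(sentences)
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


print(round(automated_readability_index("The cat sat. The dog ran."), 2))
```

Very short, simple text can produce a negative value, as it does here; the formula is meant for longer passages.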

In [35]:
ari = r.ari()
print(ari.score)
print(ari.grade_levels)
print(ari.ages)

13.76357171188323
['college_graduate']
[24, 100]


# Coleman Liau Index

The Coleman-Liau Formula usually gives a lower grade value than any of the Kincaid, ARI and Flesch values when applied to technical documents.
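For reference, the commonly published Coleman-Liau formula works on letters and sentences per 100 words. The sketch below takes raw counts; the function name and the example counts are hypothetical, and it is not the package's implementation.

```python
def coleman_liau_index(num_letters: int, num_words: int, num_sentences: int) -> float:
    """Commonly published Coleman-Liau formula, from raw counts."""
    if num_words <= 0:
        raise ValueError("need at least one word")
    letters_per_100_words = 100 * num_letters / num_words
    sentences_per_100_words = 100 * num_sentences / num_words
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


# Hypothetical counts: 450 letters, 100 words, 5 sentences.
print(round(coleman_liau_index(450, 100, 5), 2))
```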

In [36]:
cl = r.coleman_liau()
print(cl.score)
print(cl.grade_level)

12.40612565445026
12


# Gunning Fog

The Gunning fog index measures the readability of English writing. The index estimates the years of formal education needed to understand the text on a first reading. A fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old).
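The index can be sketched from raw counts using the commonly published formula, in which "complex" words are usually taken to be those with three or more syllables. The function name and the counts below are hypothetical, and the package's rules for what counts as a complex word are not reproduced here.

```python
def gunning_fog_index(num_words: int, num_sentences: int, num_complex_words: int) -> float:
    """Commonly published Gunning fog formula, from raw counts."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = num_words / num_sentences
    pct_complex = 100 * num_complex_words / num_words
    return 0.4 * (words_per_sentence + pct_complex)


# Hypothetical counts: 100 words, 5 sentences, 12 complex words.
print(round(gunning_fog_index(100, 5, 12), 1))
```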

In [37]:
gf = r.gunning_fog()
print(gf.score)
print(gf.grade_level)

15.229192448040616
college


# SMOG

The SMOG Readability Formula (Simple Measure of Gobbledygook) is a popular method to use on health literacy materials.
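As a rough illustration, the commonly published SMOG formula depends only on the number of polysyllabic words (three or more syllables), normalized to a 30-sentence sample. The function name and the counts are hypothetical, and this is not the package's code.

```python
import math


def smog_grade(num_polysyllables: int, num_sentences: int) -> float:
    """Commonly published SMOG formula, from raw counts."""
    if num_sentences <= 0:
        raise ValueError("need at least one sentence")
    # Scale the polysyllable count to what a 30-sentence sample would contain.
    scaled_count = num_polysyllables * (30 / num_sentences)
    return 1.0430 * math.sqrt(scaled_count) + 3.1291


# Hypothetical counts: 30 polysyllabic words in 30 sentences.
print(round(smog_grade(30, 30), 2))
```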

In [38]:
s = r.smog()
print(s.score)
print(s.grade_level)

15.774802946060372
16


# SPACHE

The Spache Readability Formula was introduced in "A New Readability Formula for Primary-Grade Reading Materials", published in 1953 in The Elementary School Journal. The Spache Formula is best used to calculate the difficulty of text that falls at the 3rd grade level or below.

In [39]:
s = r.spache()
print(s.score)
print(s.grade_level)

8.425876725368871
8


# Linsear Write

Linsear Write is a readability metric for English text, purportedly developed for the United States Air Force to help them calculate the readability of their technical manuals.

In [40]:
lw = r.linsear_write()
print(lw.score)
print(lw.grade_level)

16.636363636363637
17
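To compare the metrics side by side, the individual calls used in this notebook can be collected in one loop. This is a small sketch: `collect_scores` and `METRICS` are hypothetical names, the method names are the ones called above, and catching every exception is a simplification so that one failing metric does not stop the rest.

```python
METRICS = [
    "flesch_kincaid",
    "flesch",
    "dale_chall",
    "ari",
    "coleman_liau",
    "gunning_fog",
    "smog",
    "spache",
    "linsear_write",
]


def collect_scores(readability, metric_names=METRICS):
    """Call each metric method by name and gather its .score value."""
    scores = {}
    for name in metric_names:
        try:
            scores[name] = getattr(readability, name)().score
        except Exception:
            # Simplification: record None if a metric cannot be computed.
            scores[name] = None
    return scores


# Usage with the Readability object created earlier in this notebook:
# for name, score in collect_scores(r).items():
#     print(f"{name}: {score}")
```

Keep in mind that the scores are not all on the same scale: Flesch Reading Ease rises as text gets easier, while the grade-level metrics rise as text gets harder.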
