# Determine the “Readability” of a text

Readability metrics have numerous uses. A writer might use the metrics to objectively assess the complexity of his work to determine whether it’s written at a level appropriate for his intended audience. An educational software firm might use readability metrics to recommend level-appropriate content for its students.

Currently, I work on the latter. As a result, I’ve written a Python package, py-readability-metrics, that assesses the readability of a given text using a variety of today’s most popular readability metrics. These include:

- Flesch Kincaid Grade Level

- Flesch Reading Ease

- Dale Chall Readability

- Automated Readability Index (ARI)

- Coleman Liau Index

- Gunning Fog

- SMOG

- Linsear Write

Given a text, each of the above metrics calculates a score indicating the difficulty of the text.
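
Most of these formulas are built from the same few raw counts: sentences, words, and syllables (or characters). As a rough illustration only, and not the way py-readability-metrics tokenizes text, a naive counter might look like the sketch below. `text_stats` is a hypothetical helper, and the vowel-group syllable rule is a crude heuristic.

```python
import re

def text_stats(text):
    """Return naive sentence, word, and syllable counts for English text."""
    # Treat runs of '.', '!' or '?' as sentence boundaries (abbreviations get miscounted).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    # Approximate syllables as groups of consecutive vowels, at least one per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return {"sentences": len(sentences), "words": len(words), "syllables": syllables}

print(text_stats("The cat sat. The dog ran!"))
```

A real tokenizer and syllable counter will give different counts on harder text, so scores from this sketch will not match the package exactly.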

In [2]:
from readability import Readability
import spacy
import newspaper

import nltk
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/pierluigi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/pierluigi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
url = "https://www.foxnews.com/politics/republicans-respond-after-irs-whistleblower-says-hunter-biden-investigation-being-mishandled"

In [4]:
def get_article_info(url):
    # Create a newspaper Article object
    article = newspaper.Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Extract the title, subtitle, description, and main text
    title = article.title.strip()
    subtitle = article.meta_data.get("description", "").strip()
    description = article.meta_description.strip()
    text = article.text.strip()

    # Set the subtitle to the description if it is empty
    if not subtitle:
        subtitle = description.strip()

    # Concatenate the extracted strings
    article_text = f"{title}\n\n{subtitle}\n\n{text}"

    # Return the concatenated string
    return article_text

In [5]:
article = get_article_info(url)

r = Readability(article)

# Flesch Kincaid Grade Level

Output values:

1. **'fk.score'** : This is a numerical value that represents the Flesch-Kincaid Grade Level score. It is calculated from the average number of words per sentence and the average number of syllables per word. Unlike the Flesch Reading Ease score, it is not on a 0 to 100 scale: the value approximates a U.S. school grade, so higher scores indicate harder-to-read text.

2. **'fk.grade_level'** : This is an estimate of the reading level required to understand the text. It is expressed as a grade level (e.g. 4th grade, 8th grade, etc.) and is based on the Flesch-Kincaid score. A lower grade level indicates that the text is easier to read, while a higher grade level indicates that the text is more difficult to read.

The grade level can be useful in determining the target audience or the level of comprehension required for the text.
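
To make the calculation concrete, here is a minimal sketch of the commonly published Flesch-Kincaid Grade Level formula, computed from raw counts. The function name and the sample counts are assumptions for illustration; the package's own constants are not shown in this notebook and may differ slightly.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Return the Flesch-Kincaid grade level from raw text counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    # Commonly published constants for the grade-level formula.
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(100, 5, 150)
print(f"{grade:.2f}")  # roughly a tenth-grade level
```

Longer sentences and longer words both push the grade up, which is why the result reads as a school grade rather than an ease score.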

In [6]:
fk = r.flesch_kincaid()

# Define a dictionary of Flesch-Kincaid reading levels
reading_levels = {
    1: "Kindergarten",
    2: "First Grade",
    3: "Second Grade",
    4: "Third Grade",
    5: "Fourth Grade",
    6: "Fifth Grade",
    7: "Sixth Grade",
    8: "Seventh Grade",
    9: "Eighth Grade",
    10: "Ninth Grade",
    11: "Tenth Grade",
    12: "Eleventh Grade",
    13: "Twelfth Grade",
    14: "College"
}

reading_levels_informations = {
    1: "Simple sentences with basic vocabulary and familiar concepts. Picture books and easy readers are typically at this level.",
    2: "Short sentences with simple vocabulary and straightforward ideas. Beginning readers and early chapter books are typically at this level.",
    3: "Longer sentences with more complex vocabulary and more detailed ideas. Early chapter books and some middle grade books are typically at this level.",
    4: "Longer paragraphs with more sophisticated vocabulary and more complex ideas. Middle grade books and some young adult books are typically at this level.",
    5: "More challenging vocabulary and ideas, with longer paragraphs and more complex sentence structures. Middle grade and young adult books are typically at this level.",
    6: "Even more complex vocabulary and ideas, with longer and more complex sentence structures. Middle grade and young adult books are typically at this level.",
    7: "Similar to fifth grade, but with more complex vocabulary and sentence structures. Middle grade and young adult books are typically at this level.",
    8: "More sophisticated vocabulary and ideas, with longer and more complex sentence structures. Young adult books and some adult literature are typically at this level.",
    9: "Similar to seventh grade, but with even more complex vocabulary and sentence structures. Young adult books and some adult literature are typically at this level.",
    10: "High school level reading, with sophisticated vocabulary and complex sentence structures. Some young adult and adult literature is typically at this level.",
    11: "Similar to ninth grade, but with even more challenging vocabulary and sentence structures. Some young adult and adult literature is typically at this level.",
    12: "College preparatory level reading, with advanced vocabulary and complex sentence structures. Some young adult and adult literature is typically at this level.",
    13: "College level reading, with very advanced vocabulary and complex sentence structures. Some adult literature and academic texts are typically at this level.",
    14: "Post-secondary reading, with highly specialized vocabulary and complex sentence structures. Academic and technical texts are typically at this level."
}

# Print a table of estimated reading levels for different grades
print("Flesch-Kincaid Reading Levels:\n")
print("Grade Level\tEstimated Reading Level\tGrade Level Explaining")
for grade_level in range(1, 15):
    estimated_level = reading_levels.get(grade_level, "College")
    estimated_level_information = reading_levels_informations.get(grade_level, "Post-secondary reading, with highly specialized vocabulary and complex sentence structures. Academic and technical texts are typically at this level.")
    print(f"{grade_level}\t\t{estimated_level}\t\t{estimated_level_information}")
print()
# Print the readability score and estimated reading level
print("The Flesch-Kincaid score of the article is:", fk.score)
print("The estimated reading level of the article is:", fk.grade_level)

Flesch-Kincaid Reading Levels:

Grade Level	Estimated Reading Level	Grade Level Explaining
1		Kindergarten		Simple sentences with basic vocabulary and familiar concepts. Picture books and easy readers are typically at this level.
2		First Grade		Short sentences with simple vocabulary and straightforward ideas. Beginning readers and early chapter books are typically at this level.
3		Second Grade		Longer sentences with more complex vocabulary and more detailed ideas. Early chapter books and some middle grade books are typically at this level.
4		Third Grade		Longer paragraphs with more sophisticated vocabulary and more complex ideas. Middle grade books and some young adult books are typically at this level.
5		Fourth Grade		More challenging vocabulary and ideas, with longer paragraphs and more complex sentence structures. Middle grade and young adult books are typically at this level.
6		Fifth Grade		Even more complex vocabulary and ideas, with longer and more complex sentence structure

# Flesch Reading Ease

The Flesch Reading Ease score is a measure of how easy a text is to read. It is calculated from the average number of syllables per word and the average number of words per sentence. The score typically falls between 0 and 100, and the higher the score, the easier the text is to read.

Output values:

1. **'f.score':** This is a numerical value that represents the Flesch Reading Ease score. The score is typically between 0 and 100, with higher scores indicating easier-to-read text.

2. **'f.ease':** This is a qualitative label for how easy the text is to read. The value is a text string, with possible values ranging from very confusing to very easy; for this article it is "difficult".

3. **'f.grade_levels'**: This is a list of the grade levels at which a reader is expected to understand the text, derived from the score. A lower Flesch Reading Ease score maps to a higher grade level.
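
As a sketch of how the score and its classification relate, the snippet below applies the standard Flesch Reading Ease formula to raw counts and maps the result onto the bands used in the table in the next cell. The function names and the band labels are assumptions for illustration; the package returns its own label strings.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Return the Flesch Reading Ease score from raw text counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Lower bound of each band, checked from the easiest band downwards.
EASE_BANDS = [
    (90, "Very Easy"),
    (80, "Easy"),
    (70, "Fairly Easy"),
    (60, "Standard"),
    (50, "Fairly Difficult"),
    (30, "Difficult"),
]

def classify_ease(score):
    """Map a Flesch Reading Ease score to a descriptive band."""
    for lower_bound, label in EASE_BANDS:
        if score >= lower_bound:
            return label
    return "Very Difficult"

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
print(classify_ease(score))
```

With these bands, a score of about 33 falls in the "Difficult" range.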

In [7]:
f = r.flesch()

print("Flesch Reading Ease Score |     Classification     ")
print("-------------------------|------------------------")
print("         0-29             | Very Difficult          ")
print("         30-49            | Difficult               ")
print("         50-59            | Fairly Difficult        ")
print("         60-69            | Standard                ")
print("         70-79            | Fairly Easy             ")
print("         80-89            | Easy                    ")
print("         90-100           | Very Easy               ")

print()
print()
# Print the readability score and ease value
print("The Flesch Reading Ease score of the article is:", f.score)
print("The article is classified as:", f.ease)

Flesch Reading Ease Score |     Classification     
-------------------------|------------------------
         0-29             | Very Difficult          
         30-49            | Difficult               
         50-59            | Fairly Difficult        
         60-69            | Standard                
         70-79            | Fairly Easy             
         80-89            | Easy                    
         90-100           | Very Easy               


The Flesch Reading Ease score of the article is: 33.18228541964146
The article is classified as: difficult


# Dale Chall Readability

The Dale-Chall formula uses a combination of two factors to determine the readability score: the average sentence length and the percentage of difficult words in a text. Difficult words are words that are not among a list of 3,000 familiar words that are likely to be known by a fourth-grade student.

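The resulting score usually falls roughly between 0 and 10, although long sentences and many unfamiliar words can push it higher. A higher score indicates a more difficult text: a score of 4.9 or lower is considered easy enough for a fourth-grade student, while a score of 9.0 or higher corresponds to college-level reading.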

Output:

1. **'dc.score'**: is a numerical score that represents the readability of the text according to the Dale-Chall formula. It combines the percentage of difficult words with the average sentence length, so it is not itself a percentage. A higher score means the text is harder to read.

2. **'dc.grade_levels'**: is a list of grade-level labels that correspond to the score, such as `['college_graduate']` for this article. It does not contain percentages; the share of difficult words is already folded into 'dc.score'.
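
The sketch below shows the commonly cited form of the Dale-Chall formula, including the adjustment that is added when more than 5% of the words are difficult. It assumes the number of difficult words has already been counted against the familiar-word list, which is not reproduced here; the function name and sample counts are hypothetical.

```python
def dale_chall_score(total_words, total_sentences, difficult_words):
    """Return the Dale-Chall score from raw text counts."""
    pct_difficult = 100 * difficult_words / total_words
    avg_sentence_length = total_words / total_sentences
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_length
    # The raw score is adjusted upwards when difficult words exceed 5%.
    if pct_difficult > 5:
        score += 3.6365
    return score

# Hypothetical counts: 100 words, 5 sentences, 10 words not on the familiar list.
print(f"{dale_chall_score(100, 5, 10):.2f}")
```

Because both terms grow with difficulty, a larger result always means a harder text.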

In [15]:
dc = r.dale_chall()

print("Dale-Chall Readability Score |     Grade Levels      ")
print("------------------------------|------------------------")
print("          0-4.9              |  4th grade and below    ")
print("          5.0-5.9            |  5th or 6th grade       ")
print("          6.0-6.9            |  7th or 8th grade       ")
print("          7.0-7.9            |  High school            ")
print("          8.0-8.9            |  Some college           ")
print("          9.0-9.9            |  College graduate       ")
print("          10.0-10.9          |  Very difficult         ")
print()
# Print the Dale-Chall Readability score
print("The Dale-Chall Readability score of the article is:", dc.score)
# Print the estimated grade levels for comprehension
print("The estimated comprehension level for different grade levels is:", dc.grade_levels)


Dale-Chall Readability Score |     Grade Levels      
------------------------------|------------------------
          0-4.9              |  4th grade and below    
          5.0-5.9            |  5th or 6th grade       
          6.0-6.9            |  7th or 8th grade       
          7.0-7.9            |  High school            
          8.0-8.9            |  Some college           
          9.0-9.9            |  College graduate       
          10.0-10.9          |  Very difficult         

The Dale-Chall Readability score of the article is: 11.398427716960178
The estimated comprehension level for different grade levels is: ['college_graduate']


# Automated Readability Index (ARI)

The Automated Readability Index (ARI) is a readability formula that uses sentence length and word length to estimate the grade level required to read a particular text. The formula is:

ARI = 4.71(characters/words) + 0.5(words/sentences) - 21.43

The ARI score corresponds to a U.S. grade level, indicating the minimum education level needed to read and comprehend a text. A higher score means that the text is more difficult to read.

Output:

1. **'ari.score'**: This is the Automated Readability Index score of the text. The score indicates the approximate grade level required to understand the text. The higher the score, the more difficult the text is to read.

2. **'ari.grade_levels'**: This is a list of the grade levels that correspond to the score, that is, the schooling a reader is expected to need in order to understand the text.

3. **'ari.ages'**: This is a list giving the age range of readers at those grade levels.
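
The ARI formula given above translates directly into code. This is a minimal sketch that works from raw counts; the function name and the sample counts are assumptions for illustration.

```python
def automated_readability_index(total_characters, total_words, total_sentences):
    """Return the ARI score from raw text counts."""
    characters_per_word = total_characters / total_words
    words_per_sentence = total_words / total_sentences
    return 4.71 * characters_per_word + 0.5 * words_per_sentence - 21.43

# Hypothetical counts: 470 characters, 100 words, 5 sentences.
print(f"{automated_readability_index(470, 100, 5):.2f}")
```

ARI counts characters instead of syllables, which makes it easy to compute without a syllable dictionary.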

In [17]:
ari = r.ari()

print("Automated Readability Index Score |     Grade Levels     |     Ages      ")
print("----------------------------------|----------------------|---------------")
print("         0-1.9                   |  Kindergarten        |  5-6 years old ")
print("         2.0-2.9                 |  First/Second Grade  |  7-8 years old ")
print("         3.0-3.9                 |  Third Grade         |  9-10 years old")
print("         4.0-4.9                 |  Fourth Grade        |  11-12 years old")
print("         5.0-5.9                 |  Fifth Grade         |  13-14 years old")
print("         6.0-6.9                 |  Sixth Grade         |  15-16 years old")
print("         7.0-7.9                 |  Seventh Grade       |  17-18 years old")
print("         8.0-8.9                 |  Eighth Grade        |  18-19 years old")
print("         9.0-9.9                 |  High School         |  19-22 years old")
print("        10.0-10.9                |  College             |  22-23 years old")
print("        11.0-11.9                |  College Graduate    |  23-24 years old")
print("        12.0-12.9                |  Professional        |  24-25 years old")
print("        13.0 and above           |  Advanced            |  25+ years old  ")
print()

score = ari.score
grade_levels = ari.grade_levels
ages = ari.ages

# Print the output in a readable format
print(f"The Automated Readability Index (ARI) score is {score}, which corresponds to a grade level of {grade_levels}.")
print(f"This means that the text can be read by someone who is around {ages} years old.")

Automated Readability Index Score |     Grade Levels     |     Ages      
----------------------------------|----------------------|---------------
         0-1.9                   |  Kindergarten        |  5-6 years old 
         2.0-2.9                 |  First/Second Grade  |  7-8 years old 
         3.0-3.9                 |  Third Grade         |  9-10 years old
         4.0-4.9                 |  Fourth Grade        |  11-12 years old
         5.0-5.9                 |  Fifth Grade         |  13-14 years old
         6.0-6.9                 |  Sixth Grade         |  15-16 years old
         7.0-7.9                 |  Seventh Grade       |  17-18 years old
         8.0-8.9                 |  Eighth Grade        |  18-19 years old
         9.0-9.9                 |  High School         |  19-22 years old
        10.0-10.9                |  College             |  22-23 years old
        11.0-11.9                |  College Graduate    |  23-24 years old
        12.0-12.9             

# Coleman Liau Index

The Coleman-Liau Index is a readability test that uses a mathematical formula to calculate the approximate reading level needed to understand a given text. The formula takes into account the average number of words per sentence and the average number of characters per word to determine the text's reading level.

The Coleman-Liau Index produces a score that corresponds to a U.S. grade level, which indicates the minimum education level needed to understand the text. The scale has no fixed upper bound: a score of 12 corresponds to a 12th-grade reading level, and scores above 12 indicate college-level text. The score is calculated using the following formula:

CLI = 0.0588 * L - 0.296 * S - 15.8

where L is the average number of letters per 100 words in the text, and S is the average number of sentences per 100 words in the text.

Like other readability tests, the Coleman-Liau Index is not perfect and may not accurately reflect the actual difficulty of a text. However, it can be useful as a general guide for determining the approximate reading level needed to understand a text.

Output:

1. **'cl.score'**: The Coleman-Liau Index score, which is a numerical representation of the readability of the text. It is calculated from the average number of letters per 100 words and the average number of sentences per 100 words. The score is a decimal number, usually between 0 and 20. The higher the score, the more difficult the text is to read.

2. **'cl.grade_level'**: The grade level of education required to understand the text according to the Coleman-Liau Index. The grade level is based on the average sentence length and the average number of letters per 100 words. For example, if the score is 9.0, the text is considered to be at a 9th-grade reading level.
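
The CLI formula above can be sketched as follows, with L and S derived from raw counts. The function name and the sample counts are assumptions for illustration.

```python
def coleman_liau_index(total_letters, total_words, total_sentences):
    """Return the Coleman-Liau Index from raw text counts."""
    letters_per_100_words = 100 * total_letters / total_words      # L
    sentences_per_100_words = 100 * total_sentences / total_words  # S
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8

# Hypothetical counts: 470 letters, 100 words, 5 sentences.
print(f"{coleman_liau_index(470, 100, 5):.2f}")
```

More sentences per 100 words means shorter sentences, which is why S enters the formula with a negative sign.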

In [21]:
cl = r.coleman_liau()
print("Coleman-Liau Index Score |     Grade Levels     ")
print("-------------------------|----------------------")
print("         0-4.9           |  Below 4th Grade      ")
print("         5.0-5.9         |  4th or 5th Grade     ")
print("         6.0-6.9         |  6th or 7th Grade     ")
print("         7.0-7.9         |  8th or 9th Grade     ")
print("         8.0-8.9         |  10th or 11th Grade   ")
print("         9.0-Above       |  12th Grade or Above  ")
print()
print("Coleman-Liau Index Score: ", cl.score)
print("Estimated Grade Level: ", cl.grade_level)

Coleman-Liau Index Score |     Grade Levels     
-------------------------|----------------------
         0-4.9           |  Below 4th Grade      
         5.0-5.9         |  4th or 5th Grade     
         6.0-6.9         |  6th or 7th Grade     
         7.0-7.9         |  8th or 9th Grade     
         8.0-8.9         |  10th or 11th Grade   
         9.0-Above       |  12th Grade or Above  

Coleman-Liau Index Score:  12.40612565445026
Estimated Grade Level:  12


# Gunning Fog

The Gunning fog index is a readability formula used to measure the readability of a text. Like other readability formulas, it considers factors such as sentence length and word difficulty to estimate the years of education required to understand a text.

The Gunning fog index assigns a grade level to a text based on its average sentence length and the percentage of complex words it contains. Complex words are those with three or more syllables, excluding proper nouns, compound words, and familiar jargon. The longer the sentences and the higher the share of complex words in a text, the more difficult it is to read, and the higher the grade level assigned.

The Gunning fog index formula is:

Gunning fog index = 0.4 * ((total words / total sentences) + 100 * (complex words / total words))

The Gunning fog index typically produces a grade level between 0 and 20, with 20 indicating the most difficult text.

Output:

1. **'gf.score'**: The score is a decimal number representing the estimated number of years of education required to understand the text.

2. **'gf.grade_level'**: It describes the corresponding grade level of education that is typically required to understand the text based on the score.
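
The Gunning fog formula above can be sketched from raw counts as follows. The function name and the sample counts are assumptions for illustration, and the sketch assumes complex words have already been identified.

```python
def gunning_fog_index(total_words, total_sentences, complex_words):
    """Return the Gunning fog index from raw text counts."""
    words_per_sentence = total_words / total_sentences
    pct_complex_words = 100 * complex_words / total_words
    return 0.4 * (words_per_sentence + pct_complex_words)

# Hypothetical counts: 100 words, 5 sentences, 10 complex words.
print(f"{gunning_fog_index(100, 5, 10):.1f}")
```

With these counts the average sentence length contributes 20 and the complex-word percentage contributes 10, giving an index of about 12.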

In [23]:
gf = r.gunning_fog()
print("Gunning Fog Index Score |     Grade Levels      ")
print("-------------------------|------------------------")
print("         0-5.9           |  6th grade and below    ")
print("         6.0-6.9         |  7th or 8th grade       ")
print("         7.0-7.9         |  9th or 10th grade      ")
print("         8.0-8.9         |  11th or 12th grade     ")
print("        9.0 and above    |  College or higher      ")
print()
print("The Gunning Fog score is:", gf.score)
print("The estimated grade level for comprehension is:", gf.grade_level)

Gunning Fog Index Score |     Grade Levels      
-------------------------|------------------------
         0-5.9           |  6th grade and below    
         6.0-6.9         |  7th or 8th grade       
         7.0-7.9         |  9th or 10th grade      
         8.0-8.9         |  11th or 12th grade     
        9.0 and above    |  College or higher      

The Gunning Fog score is: 15.229192448040616
The estimated grade level for comprehension is: college


# SMOG

SMOG is a readability formula designed to measure the readability of written English based on the number of polysyllabic words per sentence. The acronym SMOG stands for Simple Measure of Gobbledygook. The SMOG formula estimates the years of education a person needs to understand a particular piece of text. It is particularly useful for technical writing, where the use of complex words can make text difficult to understand.

The SMOG formula is calculated by counting the number of polysyllabic (three or more syllables) words in a sample of 30 sentences, and then applying a formula to estimate the years of education needed to understand the text. The formula has been shown to correlate well with other measures of readability, and is widely used in academic and professional writing.

Output:

1. **'s.score'**: SMOG score, which is a measure of the readability of the text. It represents the estimated number of years of education needed to understand the text. For example, if the score is 9.5, it means that a person who has completed 9.5 years of education (typically halfway through their 10th year of education) should be able to understand the text.

2. **'s.grade_level'**: grade level equivalent of the SMOG score. It represents the highest level of education that the reader is assumed to have completed in order to understand the text. For example, if the grade level equivalent is 8, it means that the text should be understandable to someone who has completed the 8th grade (typically around 13-14 years old in the US education system).
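
The formula described above is usually written as 1.0430 times the square root of the polysyllable count, normalised to 30 sentences, plus 3.1291. A minimal sketch from raw counts follows; the function name and the sample counts are assumptions for illustration.

```python
import math

def smog_index(polysyllabic_words, total_sentences):
    """Return the SMOG grade from the polysyllable count and sentence count."""
    # Scale the polysyllable count to a 30-sentence sample.
    normalised_count = polysyllabic_words * 30 / total_sentences
    return 1.0430 * math.sqrt(normalised_count) + 3.1291

# Hypothetical counts: 36 polysyllabic words in a 30-sentence sample.
print(f"{smog_index(36, 30):.2f}")
```

The square root means the grade rises quickly for the first few polysyllabic words and more slowly after that.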

In [25]:
s = r.smog()
print("SMOG Index Score |     Grade Levels     ")
print("------------------|----------------------")
print("     0-6.9        |  6th grade and below  ")
print("     7.0-8.9      |  7th - 8th grade      ")
print("     9.0-10.9     |  9th - 10th grade     ")
print("    11.0-12.9     |  11th - 12th grade    ")
print("    13.0 and above|  College or higher    ")
print()

smog_score = s.score
smog_grade_level = s.grade_level

print(f"The SMOG score is {smog_score:.2f}. This corresponds to a grade level of {smog_grade_level}.")


SMOG Index Score |     Grade Levels     
------------------|----------------------
     0-6.9        |  6th grade and below  
     7.0-8.9      |  7th - 8th grade      
     9.0-10.9     |  9th - 10th grade     
    11.0-12.9     |  11th - 12th grade    
    13.0 and above|  College or higher    

The SMOG score is 15.77. This corresponds to a grade level of 16.


# SPACHE

SPACHE is a readability formula designed to measure the readability level of texts for young children. It was introduced by George Spache in 1953 and is named after him; it is not an acronym.

The formula estimates the grade level of a text. It takes into account the average sentence length and the percentage of difficult words, where a difficult word is one that does not appear on the Spache list of familiar everyday words (rather than a word with many syllables). The revised version of the formula is commonly given as:

SPACHE score = 0.121 * (total words / total sentences) + 0.082 * (percentage of unique unfamiliar words) + 0.659

The SPACHE formula is primarily used to assess the readability of texts for children in grades K-3. It is widely used by publishers of children's books, as well as by educators and researchers in the field of literacy.

Output:

1. **'s.score'**: SPACHE readability score, which represents the grade level at which the text can be easily understood by children. The lower the score, the easier the text is to read.

2. **'s.grade_level'**: is the grade level that corresponds to the score, reported as a whole number. For example, a score of 8.43 corresponds to a grade level of 8, as in the output for this article.
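
As an illustration, the snippet below implements the commonly cited revised Spache formula, which combines the average sentence length with the percentage of unique words that are not on the Spache familiar-word list. The function name and sample counts are hypothetical, the word list itself is not reproduced here, and the package may use different constants.

```python
def spache_score(total_words, total_sentences, unfamiliar_unique_words):
    """Return the revised Spache score from raw text counts."""
    avg_sentence_length = total_words / total_sentences
    pct_unfamiliar = 100 * unfamiliar_unique_words / total_words
    return 0.121 * avg_sentence_length + 0.082 * pct_unfamiliar + 0.659

# Hypothetical counts: 100 words, 10 sentences, 5 unique unfamiliar words.
print(f"{spache_score(100, 10, 5):.2f}")
```

Short sentences and few unfamiliar words keep the result in the early primary grades the formula was designed for.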

In [27]:
s = r.spache()
print("SPACHE Index Score |     Grade Levels     ")
print("--------------------|----------------------")
print("     0-4.9          |  4th grade and below  ")
print("     5.0-5.9        |  5th grade            ")
print("     6.0-6.9        |  6th grade            ")
print("     7.0-7.9        |  7th grade            ")
print("     8.0-8.9        |  8th grade            ")
print("     9.0-9.9        |  9th grade            ")
print("    10.0-10.9       |  10th grade           ")
print("    11.0-11.9       |  11th grade           ")
print("    12.0-12.9       |  12th grade           ")
print("    13.0 and above  |  College or higher    ")
print()

spache_score = s.score
spache_grade = s.grade_level
print(f"The SPACHE readability score is {spache_score:.2f}.")
print(f"This corresponds to a grade level of {spache_grade}.")


SPACHE Index Score |     Grade Levels     
--------------------|----------------------
     0-4.9          |  4th grade and below  
     5.0-5.9        |  5th grade            
     6.0-6.9        |  6th grade            
     7.0-7.9        |  7th grade            
     8.0-8.9        |  8th grade            
     9.0-9.9        |  9th grade            
    10.0-10.9       |  10th grade           
    11.0-11.9       |  11th grade           
    12.0-12.9       |  12th grade           
    13.0 and above  |  College or higher    

The SPACHE readability score is 8.43.
This corresponds to a grade level of 8.


# Linsear Write

Linsear Write is a readability metric for English text, purportedly developed for the United States Air Force to help them calculate the readability of their technical manuals.
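
The usual description of the method works on a 100-word sample: words of one or two syllables count 1 point, words of three or more syllables count 3 points, and the total is divided by the number of sentences. A provisional result above 20 is halved; otherwise it is halved and reduced by 1. The sketch below follows that description; the function name and sample counts are assumptions for illustration.

```python
def linsear_write(easy_words, hard_words, total_sentences):
    """Return the Linsear Write grade from word and sentence counts."""
    # Easy words (1-2 syllables) score 1 point, hard words (3+ syllables) score 3.
    provisional = (easy_words + 3 * hard_words) / total_sentences
    if provisional > 20:
        return provisional / 2
    return provisional / 2 - 1

# Hypothetical 100-word samples.
print(linsear_write(80, 20, 5))   # provisional result 28, above 20
print(linsear_write(90, 10, 10))  # provisional result 12, not above 20
```

The two branches keep very easy texts from being overrated.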

In [None]:
lw = r.linsear_write()
print(lw.score)
print(lw.grade_level)

16.636363636363637
17
