# Fog Index

In this module, we will learn how to compute the Gunning Fog index.

Before we begin, let's import the modules we'll be using in this tutorial.

In [None]:
import requests
import re
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from MyFunctions import html_to_text

#### Fog Index Background and Computation

The Gunning Fog index is a measure of the readability of a disclosure. The index estimates the years of formal education a person needs to understand a text on the first reading. For example, a fog index of 12 corresponds to 12 years of education (i.e., a high school graduate), and a fog index of 16 corresponds to 16 years of education (i.e., a college graduate).

Feng Li was the first to introduce the fog index to the accounting literature in his 2008 *Journal of Accounting and Economics* paper. He found that firms whose annual reports have higher fog indices have lower and less persistent earnings. Other papers have extended the literature by measuring readability in other ways (e.g., 10-K file size and the BOG index).

The Fog index is calculated as follows:

$$
FOG = 0.4 \left[ \left( \frac{\#\ WORDS}{\#\ SENTENCES} \right) + 100 \left( \frac{\#\ COMPLEX\ WORDS}{\#\ WORDS} \right) \right]
$$
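
As a quick check of the arithmetic, the formula can be applied to made-up counts. The sketch below uses a hypothetical helper, `fog_from_counts`, and invented numbers (100 words, 5 sentences, 10 complex words) that are not taken from any filing.

```python
def fog_from_counts(num_words, num_sentences, num_complex_words):
    """Apply the Fog formula to raw counts (hypothetical helper for illustration)."""
    words_per_sentence = num_words / num_sentences
    percent_complex_words = 100 * num_complex_words / num_words
    return 0.4 * (words_per_sentence + percent_complex_words)

# Invented counts: 100 words, 5 sentences, 10 complex words
# 0.4 * (100/5 + 100 * 10/100) = 0.4 * (20 + 10) = 12
example_fog = fog_from_counts(100, 5, 10)
print(example_fog)
```

With these counts, the text averages 20 words per sentence and 10 percent of its words are complex, so the index comes out at about 12, the high school graduate level.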

#### Example Disclosure

To illustrate the computation of the fog index, we'll be using Apple's 10-K for its 2011 fiscal year (https://www.sec.gov/Archives/edgar/data/320193/000119312511282113/d220209d10k.htm). 

Let's download this filing using the **requests.get** function and then convert the HTML source to text using the **html_to_text** function that we imported from the local **MyFunctions** module.

In [None]:
# AUG 2021 UPDATE -- YOU HAVE TO DECLARE A HEADER TO ACCESS THE EDGAR WEBSITE

headers = {'User-Agent': 'ORGANIZATION youremail@yourinstitution.edu'}
html = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000119312511282113/d220209d10k.htm',headers=headers).text
text = html_to_text(html)

#### Number of Words

We can use the **word_tokenize** function to obtain a list of all words in the disclosure. We'll strip out non-alphabetic tokens (e.g., punctuation, numbers), and then we can obtain the total word count using the **len** function.

In [None]:
words = word_tokenize(text)

# Remove non-alphabetic tokens (e.g., punctuation, numbers)

words = [w for w in words if w.isalpha()]

# Compute the number of words

num_words = len(words)

num_words
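
To see what the `isalpha` filter keeps and drops, here is a minimal sketch on a hand-written token list. The tokens are invented stand-ins for tokenizer output, not actual output from the filing. Because `str.isalpha` is true only when every character is a letter, hyphenated tokens are dropped along with numbers and punctuation.

```python
# Hand-written tokens standing in for word_tokenize output (illustrative only)
sample_tokens = ['Net', 'sales', 'increased', '14', '%', 'year-over-year',
                 'to', '$', '108.2', 'billion', '.']

# Tokens made up entirely of letters are kept as words
kept = [t for t in sample_tokens if t.isalpha()]

# Everything else (numbers, punctuation, hyphenated tokens) is dropped
dropped = [t for t in sample_tokens if not t.isalpha()]

print(kept)     # ['Net', 'sales', 'increased', 'to', 'billion']
print(dropped)  # ['14', '%', 'year-over-year', '$', '108.2', '.']
```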

#### Number of Sentences

We can use the **sent_tokenize** function to obtain a list of all sentences in the disclosure. We'll remove sentences that are three words or fewer, since these are likely not true sentences but rather data found in the disclosure. We can then obtain the total sentence count using the **len** function.

In [None]:
sentences = sent_tokenize(text)

# Remove sentences that are 3 words or fewer
# (split() with no argument splits on any run of whitespace, including newlines)

sentences = [s for s in sentences if len(s.split()) > 3]

# Compute the number of sentences

num_sentences = len(sentences)

num_sentences
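
How a candidate sentence is split affects the word count used by the length filter. A small sketch with invented strings (not taken from the filing) compares splitting on a single space with splitting on any run of whitespace:

```python
# Invented strings: a row of table data with double spaces, and an ordinary sentence
samples = ['2011  2010  2009', 'Net sales increased during the year.']

# For each string: (count from split(' '), count from split())
counts = [(len(s.split(' ')), len(s.split())) for s in samples]

for s, (by_space, by_whitespace) in zip(samples, counts):
    print(repr(s), by_space, by_whitespace)
```

With `split(' ')`, the row of years counts as five "words" because the empty strings between double spaces are counted, so it would pass a more-than-three-words filter. With `split()`, it counts as three and would be removed. The ordinary sentence counts as six either way.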

#### Number of Complex Words

Complex words are words consisting of three or more syllables.

We can calculate the number of syllables in a word using the Carnegie Mellon University Pronouncing Dictionary (CMUDICT). The CMUDICT is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations (see http://www.speech.cs.cmu.edu/cgi-bin/cmudict).

The **nltk** module contains this pronunciation dictionary. Let's import it and use it to count the number of syllables in a word.

In [None]:
# Uncomment the next two lines on first use to download the dictionary
#import nltk
#nltk.download('cmudict')

from nltk.corpus import cmudict
cmu = cmudict.dict()

word = 'disclosure'
cmu[word.lower()]

In [None]:
word_pronunciation = ''.join(cmu[word.lower()][0])
word_pronunciation

In [None]:
num_syl = len(re.findall(r'\d', word_pronunciation))
num_syl
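
Counting digits works because CMUdict marks each vowel phoneme with a stress digit (0, 1, or 2), and each syllable contains exactly one vowel phoneme. The sketch below hard-codes a pronunciation in that format instead of looking it up, so it runs without the dictionary download. Treat the phoneme list as an assumed example entry.

```python
import re

# Assumed example entry in CMUdict's format (hard-coded, not looked up)
sample_pronunciation = ['D', 'IH0', 'S', 'K', 'L', 'OW1', 'ZH', 'ER0']

# Vowel phonemes are the ones that end in a stress digit
vowels = [p for p in sample_pronunciation if p[-1].isdigit()]
print(vowels)      # ['IH0', 'OW1', 'ER0']

# Joining the phonemes and counting digits gives the same number
num_digits = len(re.findall(r'\d', ''.join(sample_pronunciation)))
print(num_digits)  # 3
```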

In [None]:
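def num_syllables(word):
    """Count the syllables in a word using CMUdict, with a length-based fallback."""
    try:
        # Use the first listed pronunciation and count its stress digits
        word_pronunciation = ''.join(cmu[word.lower()][0])
        num_syl = len(re.findall(r'\d', word_pronunciation))
    except KeyError:
        # Word is not in the dictionary: treat long words (10+ letters) as complex
        if len(word) >= 10:
            num_syl = 3
        else:
            num_syl = 1
    return num_syl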

print(num_syllables('disclosure'))
print(num_syllables('strategically'))
print(num_syllables('interestingly'))
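
The function has two paths: a dictionary lookup, and a length-based guess for words the dictionary does not contain. The sketch below mirrors that logic against a toy one-entry dictionary so that both paths are easy to see. The names `toy_cmu` and `toy_num_syllables` are hypothetical, and the single entry is an assumed example rather than a real lookup.

```python
import re

# Toy stand-in for cmudict.dict(): one assumed entry, so most lookups miss
toy_cmu = {'disclosure': [['D', 'IH0', 'S', 'K', 'L', 'OW1', 'ZH', 'ER0']]}

def toy_num_syllables(word):
    """Mirror the logic of num_syllables, but against the toy dictionary."""
    try:
        # Dictionary path: count stress digits in the first pronunciation
        pronunciation = ''.join(toy_cmu[word.lower()][0])
        return len(re.findall(r'\d', pronunciation))
    except KeyError:
        # Fallback path: guess from the word's length
        return 3 if len(word) >= 10 else 1

for w in ['Disclosure', 'iPhone', 'nonexistentword']:
    print(w, toy_num_syllables(w))
```

The fallback is a rough guess. For example, "iPhone" has two syllables but is counted as one. That does not change the complex-word count, since both values are below three, but a short three-syllable word missing from the dictionary would be misclassified.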

Now, let's compute the number of complex words using this function.

In [None]:
complex_words = [w for w in words if num_syllables(w) >= 3]

num_complex_words = len(complex_words)

num_complex_words

#### Calculate Fog

We now have everything we need to calculate Fog.

In [None]:
# Calculate words per sentence

words_per_sentence = num_words/num_sentences

# Calculate the percentage of complex words

percent_complex_words = (num_complex_words/num_words) * 100

# Calculate Fog

fog = 0.4 * (words_per_sentence + percent_complex_words)

fog

#### Exercise

1. Create a function called **get_fog** that takes the EDGAR URL as input and returns the **fog** of the disclosure.
2. Calculate **fog** for Target's 10-K for the fiscal period ending February 1, 2020 (https://www.sec.gov/Archives/edgar/data/27419/000002741920000008/tgt-20200201.htm).
3. Calculate **fog** for Walmart's 10-K for the fiscal period ending January 31, 2020 (https://www.sec.gov/Archives/edgar/data/104169/000010416920000011/wmtform10-kx1312020.htm). Compare Walmart's fog index to Target's fog index. What does the difference suggest about the readability of Walmart's 10-K relative to Target's 10-K?

#### Solution for # 1

In [None]:
def get_fog(url):
    
    # Obtain the text of the disclosure
    
    headers = {'User-Agent': 'ORGANIZATION youremail@yourinstitution.edu'}
    disclosure = requests.get(url,headers=headers).text
    text = html_to_text(disclosure)
    
    # Compute the number of words
    
    words = word_tokenize(text) # Obtain word tokens
    words = [w for w in words if w.isalpha()] # Remove non-alphabetic tokens (e.g., punctuation, numbers)
    num_words = len(words)

    # Compute the number of sentences
    
    sentences = sent_tokenize(text) # Obtain sentence tokens
    sentences = [s for s in sentences if len(s.split()) > 3] # Remove sentences that are 3 words or fewer
    num_sentences = len(sentences)
    
    # Compute the number of complex words
    
    complex_words = [w for w in words if num_syllables(w) >= 3]
    num_complex_words = len(complex_words)

    # Calculate Fog
    
    words_per_sentence = num_words/num_sentences
    percent_complex_words = (num_complex_words/num_words) * 100
    fog = 0.4 * (words_per_sentence + percent_complex_words)

    return fog

#### Solution for # 2

In [None]:
# Target

fog = get_fog('https://www.sec.gov/Archives/edgar/data/27419/000002741920000008/tgt-20200201.htm')

print('Fog is equal to '+'{:.3f}'.format(fog))

#### Solution for # 3

In [None]:
# Walmart

fog = get_fog('https://www.sec.gov/Archives/edgar/data/104169/000010416920000011/wmtform10-kx1312020.htm')

print('Fog is equal to '+'{:.3f}'.format(fog))

# The fog index for Walmart is higher than the fog index for Target
# suggesting that Walmart's 10-K is less readable than Target's 10-K