This notebook explains and corrects a bug in flesch_score.ipynb.
When I first implemented the fernandez_huerta_score function, I copied and pasted the wrong lines of code. As a result, the score distribution graph presented to the partners was correct, but the code submitted to the repository was not.

The sections below show where the earlier code went wrong and give the corrected version.

In [1]:
#!pip install pylabeador
from utils import read_corpus
from features import feature_pipeline
import pylabeador
import os
import json
import re
import numpy as np
import pandas as pd
from collections import defaultdict

### Load texts into a list of dictionaries

In [2]:
# function for loading text

def file_to_dict_list(filename):
    '''Load a JSON file and return its contents as a list of dictionaries.
    ------------------------------------------
    Argument:
        filename: (str) path to a JSON file
    Returns:
        a list of dictionaries, each holding one paragraph / chapter of a Spanish text
    '''
    
    with open(filename, encoding = 'utf-8') as json_file:
        dict_list = json.load(json_file)
    
    return dict_list
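
The loader can be checked without the corpus by writing a small JSON file and reading it back. This is only a sketch: the sample record is made up, and it assumes the corpus files hold a JSON list of records with the keys source, author, title, level, and content.

```python
import json
import os
import tempfile

def file_to_dict_list(filename):
    # Same loader as in this notebook, repeated so the sketch runs on its own.
    with open(filename, encoding='utf-8') as json_file:
        return json.load(json_file)

# Hypothetical sample record (not taken from the corpus).
sample = [{'source': 'example', 'author': 'example', 'title': 'example',
           'level': 'A1', 'content': 'Voy a la escuela.'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.json')
    with open(sample_path, 'w', encoding='utf-8') as f:
        json.dump(sample, f, ensure_ascii=False)
    loaded = file_to_dict_list(sample_path)

print(loaded[0]['level'])  # A1
```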

### For testing, set up directory and files

In [3]:
text_dir = '/Users/eun-youngchristinapark/MDS-CAPSTONE/capstone_FHIS/corpus/'
file_list = os.listdir(text_dir)

In [4]:
corpus = read_corpus(text_dir)

In [5]:
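# Note: os.listdir returns files in arbitrary order, so the index below may select a different file on another machine.
first_spanish_reader_corpus = file_list[-3]
first_spanish = file_to_dict_list(os.path.join(text_dir, first_spanish_reader_corpus))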

print(f'dictionary list type: {type(first_spanish)}', '\n')
print(f'length of the dictionary list: {len(first_spanish)}', '\n')
print(f'type of dictionary list element: {type(first_spanish[0])}', '\n')
print(f'keys in the dictionary list element: {first_spanish[0].keys()}', '\n')
print(f"source of the first element in the list: {first_spanish[0]['source']}", '\n')
print(f"author: {first_spanish[0]['author']}", '\n')
print(f"title: {first_spanish[0]['title']}", '\n')
print(f"level: {first_spanish[0]['level']}", '\n')
print(f"content: {first_spanish[0]['content']}", '\n')

dictionary list type: <class 'list'> 

length of the dictionary list: 56 

type of dictionary list element: <class 'dict'> 

keys in the dictionary list element: dict_keys(['source', 'author', 'title', 'level', 'content']) 

source of the first element in the list: https://www.gutenberg.org/files/15353/15353-h/15353-h.htm 

author: ERWIN W. ROESSLER, PH.D. 

title: A First Spanish Reader 

level: A1 

content: 1. LA ESCUELA
Voy a la escuela. Voy a la escuela el lunes,
el martes, el miércoles, el jueves y el viernes.
El sábado y el domingo no voy a la escuela.
El sábado y el domingo estoy en casa. Soy un
discípulo y estoy en la escuela. El discípulo
aprende. Aprendo la aritmética, a leer y a
escribir. Vd. aprende el español. Todos nosotros
aprendemos diligentemente. Algunos discípulos
no son diligentes. Algunos son perezosos.
El maestro elogia a los discípulos diligentes y a
los discípulos obedientes. Él no elogia a los
alumnos perezosos.
El maestro enseña. Mi maestro enseña el
español.

### Fernandez-Huerta Score calculation: Correction made

The Fernandez-Huerta score is the Spanish counterpart of the Flesch reading-ease score.
See the original paper (in Spanish), *Medidas sencillas de lecturabilidad. Consigna, 214, 29–32*, and
the description of this metric [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5831059/#:~:text=The%20Fernandez%2DHuerta%20Formula%20(Fern%C3%A1ndez,formulae%20(Flesch%2C%201948).&text=The%20output%20is%20an%20index,representing%20a%20more%20difficult%20text).
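
As a minimal sketch of the formula used in this notebook, the score can be computed directly from three counts. The helper name fh_from_counts and its parameters are illustrative and not part of the repository; the constants are the ones used in fernandez_huerta_score in this notebook.

```python
def fh_from_counts(num_sents, num_words, num_syl):
    """Fernandez-Huerta score from raw counts (illustrative helper, not in the repository).

    Uses the same constants as this notebook:
    206.835 - 102 * (sentences per word) - 60 * (syllables per word).
    """
    if num_words == 0:
        raise ValueError("num_words must be positive")
    return 206.835 - 102 * (num_sents / num_words) - 60 * (num_syl / num_words)


# Worked example: 1 sentence, 15 words, 23 syllables
example_score = fh_from_counts(1, 15, 23)
print(round(example_score, 3))  # 108.035
```

More sentences per word and more syllables per word both lower the score, so a lower score indicates a harder text.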
    

In [29]:
# Remove numbered titles such as "1. LA ESCUELA"
regex = r'[0-9]+\.[^\n]+\n'
def remove_titles(text):
    # re.sub removes every match in one pass. Slicing the text inside a
    # re.finditer loop would use stale offsets after the first removal.
    return re.sub(regex, '', text)

def fernandez_huerta_score(text):
    '''Calculate the Fernandez-Huerta readability score of the given text.
    
       Note: the previous version of this function (in flesch_score.ipynb) had a bug that we found late in the project.
       It is corrected in this notebook for future students.
       
       The comments below mark where the bugs were in the previous code and how this version corrects them.
    ---------------------------------------
    Argument: 
        text (str): a piece of Spanish text
    Returns:
        fh_score (float): the Fernandez-Huerta score
        num_alpha_tokens (int): the number of tokens used in the calculation (tokens with no letters, such as numbers and punctuation marks, are excluded)
    '''
    text = remove_titles(text)
    tp = feature_pipeline(text, full_spacy=True)
    tp.get_sentences(text)
    tp.get_tokens(text)
    
    num_sents = len(tp.sentences)
    num_tokens = len(tp.tokens)    # tp.tokens is a flat list of token strings, so count the tokens rather than summing their lengths
    
    
    ############################ The code below (num_alpha_tokens = ...) in flesch_score.ipynb is incorrect. ############################
    # It iterates over the characters of each token, so it counts letters instead of tokens.
    #num_alpha_tokens = len([tk for tkl in tp.tokens for tk in tkl if any(t.isalpha() for t in tk)])
    
    ###### The correct code is shown below ######
    # Count a token only if it contains at least one letter, e.g. 'Vd.' is a token.
    num_alpha_tokens = sum(1 for tk in tp.tokens if any(ch.isalpha() for ch in tk))
    
    if text == '' or num_alpha_tokens == 0 or num_sents == 0:           ### if there is nothing to score (no letters or no sentences),
        return 206, num_tokens                                               ###    treat the text as very easy to read 
    
    tokens = tp.tokens
    num_syl = 0
    
    ########################### The for loop below is incorrect: This is the wrong code in flesch_score.ipynb ###################################
    #for tl in tokens:
    #    for token in tl:
    #        if any(t.isalpha() for t in token):                          ### if the token contains at least one letter
    #            try: 
    #                token_ = ''.join([t for t in token if t.isalpha()])      ###     get rid of non-alphabets in the token
    #                num_syl += len(pylabeador.syllabify(token_))             ###     and get syllables 
    #            except:
    #                num_alpha_tokens -= 1                                ### There are alphabets such as ª which cannot be processed
                    
    ########################### The for loop below is correct ###########################
    for tl in tokens:                                                 ### each tl is a single token (a string)
        if any(t.isalpha() for t in tl):                              ### if the token contains at least one letter
            try: 
                token_ = ''.join([t for t in tl if t.isalpha()])      ###     drop the non-letter characters from the token
                num_syl += len(pylabeador.syllabify(token_))          ###     and count its syllables 
            except Exception:
                num_alpha_tokens -= 1                                 ### skip tokens with characters such as ª that pylabeador cannot process
    
    
    # see https://support.rankmath.com/ticket/flesch-readability-works-for-other-languages/ and 
    #     https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5831059/#:~:text=The%20Fernandez%2DHuerta%20Formula%20(Fern%C3%A1ndez,formulae%20(Flesch%2C%201948).&text=The%20output%20is%20an%20index,representing%20a%20more%20difficult%20text.
    # for Spanish flesch score. 
    
    fh_score = 206.835 - 102 * (num_sents/num_alpha_tokens) - 60 * (num_syl / num_alpha_tokens)    # use num_alpha_tokens instead of num_tokens 
    #fh_score = 206.835 - 102 * (num_sents/num_tokens) - 60 * (num_syl / num_tokens)
    
    return fh_score, num_alpha_tokens#, num_sents, num_syl
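
To see why the earlier token count was wrong, here is a small sketch that does not depend on feature_pipeline or pylabeador. It assumes, as the corrected code implies, that tp.tokens is a flat list of token strings; the sample token list is made up for illustration.

```python
# Hypothetical stand-in for tp.tokens: a flat list of token strings (assumption).
tokens = ['Voy', 'a', 'la', 'escuela', '.']

# Old (buggy) count: the nested loop walks over the characters of each token,
# so every letter is counted as if it were a token.
buggy_count = len([tk for tkl in tokens for tk in tkl if any(t.isalpha() for t in tk)])

# Corrected count: one per token that contains at least one letter.
correct_count = sum(1 for tk in tokens if any(ch.isalpha() for ch in tk))

print(buggy_count, correct_count)  # 13 4
```

An inflated token count changes both ratio terms in the formula, which is why scores produced by the old code should not be reused.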

### Tests

#### 1. Edge cases

In [9]:
assert fernandez_huerta_score('')[0] == 206

In [10]:
assert fernandez_huerta_score('?')[0] == 206

In [11]:
assert fernandez_huerta_score('1.')[0] == 206

#### 2. Brute-force calculations vs. fernandez_huerta_score implementation

In [12]:
text = 'Voy a la escuela el lunes, el martes, el miércoles, el jueves y el viernes.'
text = [c for c in text if c not in {'?',',','.','0','1','2','3','4','5','6','7','8','9'}]
text = ''.join(text)
tokens = text.split()
num_syl = 0
for token in tokens:
    syl_list = pylabeador.syllabify(token)
    print(syl_list)
    num_syl += len(syl_list)
num_sents = 1
num_tokens = len(tokens)
manual_score = 206.835 - 102 * (num_sents/num_tokens) - 60 * (num_syl / num_tokens)
print(manual_score)
assert fernandez_huerta_score(text)[0] == manual_score

['Voy']
['a']
['la']
['es', 'cue', 'la']
['el']
['lu', 'nes']
['el']
['mar', 'tes']
['el']
['miér', 'co', 'les']
['el']
['jue', 'ves']
['y']
['el']
['vier', 'nes']
108.035
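
The assertion above compares two floats with ==, which works here because both sides evaluate the same expression in the same order. If the two calculations ever differ in operation order, a tolerance-based check is more robust. A minimal sketch using math.isclose from the standard library, with the counts from this example (1 sentence, 15 tokens, 23 syllables):

```python
import math

num_sents, num_tokens, num_syl = 1, 15, 23

# Same formula written in two algebraically equivalent ways.
score_a = 206.835 - 102 * (num_sents / num_tokens) - 60 * (num_syl / num_tokens)
score_b = 206.835 - (102 * num_sents + 60 * num_syl) / num_tokens

# The two results may differ in the last bits, so compare with a tolerance.
scores_match = math.isclose(score_a, score_b, rel_tol=1e-9)
print(scores_match)  # True
```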


In [13]:
text = 'Este maestro enseña las matemáticas y aquel maestro el inglés.'
text = [c for c in text if c not in {'?',',','.','0','1','2','3','4','5','6','7','8','9'}]
text = ''.join(text)
tokens = text.split()
num_syl = 0
for token in tokens:
    syl_list = pylabeador.syllabify(token)
    print(syl_list)
    num_syl += len(syl_list)
num_sents = 1
num_tokens = len(tokens)
manual_score = 206.835 - 102 * (num_sents/num_tokens) - 60 * (num_syl / num_tokens)
print(manual_score)
assert fernandez_huerta_score(text)[0] == manual_score

['Es', 'te']
['ma', 'es', 'tro']
['en', 'se', 'ña']
['las']
['ma', 'te', 'má', 'ti', 'cas']
['y']
['a', 'quel']
['ma', 'es', 'tro']
['el']
['in', 'glés']
58.63500000000002


In [14]:
text_orig = 'Vd. aprende el español. Todos nosotros aprendemos diligentemente. Algunos discípulos no son diligentes. Algunos son perezosos.'
text = [c for c in text_orig if c not in {'?',',','.','0','1','2','3','4','5','6','7','8','9'}]
text = ''.join(text)
tokens = text.split()
num_syl = 0
for token in tokens:
    syl_list = pylabeador.syllabify(token)
    print(syl_list)
    num_syl += len(syl_list)
num_sents = 5   # the text has 4 sentences, but the preprocessing does not recognize 'Vd.' as an abbreviation and reports 5
num_tokens = len(tokens)
manual_score = 206.835 - 102 * (num_sents/num_tokens) - 60 * (num_syl / num_tokens)
print(manual_score)
assert fernandez_huerta_score(text_orig)[0] == manual_score

['Vd']
['a', 'pren', 'de']
['el']
['es', 'pa', 'ñol']
['To', 'dos']
['no', 'so', 'tros']
['a', 'pren', 'de', 'mos']
['di', 'li', 'gen', 'te', 'men', 'te']
['Al', 'gu', 'nos']
['dis', 'cí', 'pu', 'los']
['no']
['son']
['di', 'li', 'gen', 'tes']
['Al', 'gu', 'nos']
['son']
['pe', 're', 'zo', 'sos']
9.960000000000008
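
The extra sentence boundary after 'Vd.' has a measurable effect on the score. As a rough sketch using the counts from this test (16 tokens, 44 syllables), here is the score with the 5 sentences the preprocessing reports and with the 4 sentences actually in the text. The helper score_for is illustrative only.

```python
num_tokens, num_syl = 16, 44

def score_for(num_sents):
    # Same formula and constants as fernandez_huerta_score in this notebook.
    return 206.835 - 102 * (num_sents / num_tokens) - 60 * (num_syl / num_tokens)

score_reported = score_for(5)   # sentence count produced by the preprocessing
score_expected = score_for(4)   # sentence count a human reader would give

print(round(score_reported, 3), round(score_expected, 3))  # 9.96 16.335
print(round(score_expected - score_reported, 3))           # 6.375
```

Each mis-split abbreviation lowers the score by 102 / num_tokens, so short texts are affected the most.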
