## Additional Features

In [19]:
# Use es_core_news_md pipeline for POS tagging
!python -m spacy download es_core_news_md

Collecting es-core-news-md==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.0.0/es_core_news_md-3.0.0-py3-none-any.whl (44.0 MB)
[+] Download and installation successful
You can now load the package via spacy.load('es_core_news_md')


In [20]:
import spacy
import nltk
from collections import defaultdict

#### Density of Pronouns

The Density of Pronouns, as defined by the Coh-Metrix paper, is the ratio or proportion of noun phrases that are capstured by pronouns in the text compared to everything else. The paper explains that the density of pronouns is an important metric of comprehension difficulty because as the density of pronouns increases, the demands on working memory increases since the reader needs to keep the noun the pronoun refers to in memory.

In [21]:
# Density of pronouns - the proportion of pronouns in the text

# Example text
text = """ Me llamo Darya y estoy una estudiante en la Universidad de Columbia Britanica.
           Estudio ligüística computacional y ciencia de datos. """

# Extract POS tags
# tokenize
tokens = nltk.word_tokenize(text)

# types
types = nltk.Counter(tokens)

In [22]:
nlp = spacy.load("es_core_news_md")

In [33]:
# Pronoun proportion
def pron_prop(text):
    '''
    Returns the density of pronouns of the text, which is the number of pronouns divided 
    by the number of tokens in the text. 
    --------------------------------------------
    Argument: text (str) - a string of text
    Returns: density of pronouns in the text 
    '''
    doc = nlp(text)
    total_pron = 0

    for token in doc:
        pos = token.pos_
        if pos == 'PRON':
            total_pron += 1
    
    #total_non_pron = len(tokens) - total_pron

    prop_pron = total_pron/len(tokens)
    
    return prop_pron

print(pron_prop(text))

0.043478260869565216


#### Percentage of logical operators

Logical operators, as defined by the Coh-Metrix paper, include variants of 'or', 'and', 'not' and 'if-then'. If a text has a high density of logical operators, the text is analytically dense and places a high demand on working memory, making it harder to read.

In [32]:
def prop_log_ops(text):
    '''
    Returns the percentage of logical operators from LOG_OPS in the text. 
    -----------------------------------------------------------
    Argument: text (str) - a string of text
    Returns: percentage of logical operators in the text (number of word in LOG_OPS in text / number of tokens in text)
    '''

    LOG_OPS = {'si', 'y', 'o', 'u', 'no'} # if, and, or, not

    doc = nlp(text)
    total_log_ops = 0

    for token in doc:
        if token in LOG_OPS:
            total_log_ops += 1
    
    total_non_log = len(tokens) - total_log_ops

    prop_logs = total_log_ops/total_non_log
    
    return prop_logs

print(prop_log_ops(text))

0.0


#### Percentage of connectives

Connectives, as defined by the Coh-Metrix paper, are words that add clarifying, temporal or causal information to text. They increase the cohesion of the text, so their number can be predictive of readability.

In [35]:
def connectives(text):
    ''' 
    Returns the percentage of connectives from CONNECTIVES in the text. Connectives are phrases that add clarifying, temporal or causal information. 
    ----------------------------------------------------
    Argument: text(str) - a string of text
    Returns: number of connectives from CONNECTIVES in text
    '''

    CONNECTIVES = {'por eso', 'a pesar de', 'además', 'y', 'también', 'incluso', 
              'pero', 'aunque', 'sin embargo', 'no obstante', 'porque', 'ya que',
              'puesto que', 'debido a que', 'a causa de que', 'como', 'así',
              'entonces', 'por lo tanto', 'en consecuencia', 'después', 'antes',
              'al mismo tiempo', 'finalmente', 'al principio', 'por último', 
              'dado que', 'pese a', 'es decir', 'o sea', 'y luego', 'primero',
              'todavía', 'aún', 'cuando', 'aunque', 'por consiguiente', 'consecuentemente',
              'por otra parte', 'es decir', 'por lo visto', 'que yo sepa', 'de todas formas',
              'de todas maneras', 'aparte de', 'tal como', 'a vez de', 'en concreto',
              'en pocas palabras', 'tan pronto como', 'mientras tanto', 'hasta', 'por último',
              'pues', 'en cuanto', 'por fin', 'al mismo tiempo', 'a la misma vez', 'inmediatamente',
              'durante', 'eventualmente', 'frecuentemente', 'al rato', 'en primer lugar', 
              'anoche', 'luego', 'nunca', 'ahora', 'muchas veces', 'al otro día', 'desde entonces',
              'raramente', 'algunas veces', 'pronto'}

    conn_count = 0

    for conn in CONNECTIVES:
        if conn in text.lower():
            conn_count += 1
    
    return conn_count
        
print(connectives(text))

1
