<a href="https://colab.research.google.com/github/nicolashernandez/READI-LREC22/blob/main/readi_reproduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook comes from the [READI-LREC22 git repository](https://github.com/nicolashernandez/READI-LREC22/).  
It first shows a few examples of how to use the library, then shows how to reproduce the contents of the READI paper, available [here](https://cental.uclouvain.be/readi2022/accepted.html).
 


# Setup : Import dependencies then library

## Setup : The runtime must be restarted after downloading the spaCy model

In [1]:
%%capture
!python -m spacy download fr_core_news_sm
#This cell only needs to be run once.

Restart the runtime once (Ctrl+M . or Runtime > Restart runtime), then execute the following cells.

In [2]:
import spacy

In [3]:
spacy.load("fr_core_news_sm")._path

PosixPath('/usr/local/lib/python3.7/dist-packages/fr_core_news_sm/fr_core_news_sm-2.2.5')

In [4]:
spacy.util.set_data_path('/usr/local/lib/python3.7/dist-packages')
# Quick hack so that the library can locate the model; the path is derived from the ._path value above.

## Setup : Importing library and assorted data

In [5]:
%%capture
# 1. Download project and set current directory
!git clone https://github.com/nicolashernandez/READI-LREC22/
%cd READI-LREC22/

In [7]:
%%capture
# 2. Install module, should take around 30 secs to install every dependency
!pip install .

In [8]:
# 3. Add project directory to the path
import sys,os
sys.path.append(os.getcwd())
sys.path.append(os.path.join(os.getcwd(),"readability"))

In [9]:
import readability

# Importing data

In [10]:
import pickle
#FIXME : replace this with a wget to the git directory
with open(os.path.join(os.getcwd(),"data","tokens_split.pkl"), "rb") as file:
    corpus = pickle.load(file)

In [None]:
#This can also be done by doing a wget :
#!wget -nc https://github.com/nicolashernandez/READI-LREC22/blob/main/data/tokens_split.pkl?raw=true -P data
#with open(os.path.join(os.getcwd(),"data","tokens_split.pkl?raw=true"), "rb") as file:
#    corpus = pickle.load(file)

# Recreating experiments

# Examples of use

## Example one : Using the library for a text

Texts can be passed as strings, but preparing them beforehand as tokenized sentences (a list of lists of tokens, list(list())) is preferred.  
With spaCy, this can be done as follows, where nlp is a loaded pipeline such as spacy.load("fr_core_news_sm") :  
new_text = [[token.text for token in sent] for sent in nlp(text).sents]  
To also remove punctuation marks, use this instead :  
new_text = [[token.text for token in sent if not token.is_punct] for sent in nlp(text).sents]
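
To make the expected list(list()) shape concrete without loading a spaCy model, here is a minimal sketch that uses a naive regex tokenizer as a stand-in. The function name to_tokenized_sentences and its keep_punct parameter are hypothetical and not part of the library, and the splitting rules are far cruder than what spaCy does.

```python
import re

def to_tokenized_sentences(text, keep_punct=True):
    """Split a raw string into a list of sentences, each a list of tokens."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # A token is either a run of word characters or a single punctuation mark.
    tokenized = [re.findall(r"\w+|[^\w\s]", sent) for sent in sentences if sent]
    if not keep_punct:
        # Keep only tokens that start with a word character.
        tokenized = [[tok for tok in sent if re.match(r"\w", tok)] for sent in tokenized]
    return tokenized

sample = "Le chat dort. Il voit une souris !"
print(to_tokenized_sentences(sample))
print(to_tokenized_sentences(sample, keep_punct=False))
```

Each inner list is one sentence, so the outer list has the same shape as the texts stored in the corpus.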

A readability instance is created by calling readability.Readability(text).  
The following arguments are optional : lang, nlp_name, perplexity_processor.  
By default, the instance targets the French language, uses a spacy_sm NLP processor, and uses GPT-2 to compute perplexity.

In [11]:
import pandas as pd
import spacy
#Types of available formats for a text:
r = readability.Readability(corpus['level1'][0]) # A text in the list(list()) format used internally
#r = readability.Readability(' '.join(corpus['level1'][0][0])) # A string, it will be converted into a list(list()), of size 1, with 12 tokens, including punctuation

Acquiring Natural Language Processor...
DEBUG: Spacy model location (already installed) :  /usr/local/lib/python3.7/dist-packages/fr_core_news_sm/fr_core_news_sm-2.2.5


Common scores can be accessed by using the corresponding function.

In [12]:
gfi = r.gfi()
gfi #is 61.52380952380953

61.52380952380953

More conveniently, a dictionary of all these scores can be obtained by calling .scores()

In [13]:
r.scores()

{'ari': 21.503161490683233,
 'fkgl': 8.382298136645964,
 'fre': 54.47311594202901,
 'gfi': 61.52380952380953,
 'rel': 73.00333333333334,
 'smog': 13.023866798666859}

To speed up the calculations needed by these functions, the .compile() function can be called first.  
It calculates most of the statistics needed for a text and stores them in the .statistics attribute of the Readability object.  
These can be viewed by calling .stats(), or by accessing the .statistics attribute directly.  
For example : .statistics.totalWords

In [14]:
r.compile()
r.stats()
r.statistics.totalWords

totalWords = 230
totalLongWords = 30
totalSentences = 21
totalCharacters = 837
totalSyllables = 384
nbPolysyllables = 63


230
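
As a cross-check, the textbook formulas for four of these scores can be applied by hand to the statistics printed above. This is a minimal sketch that assumes the standard coefficients for Flesch-Kincaid, Flesch reading ease, SMOG, and the French adaptation of Flesch reading ease usually attributed to Kandel and Moles (assumed here to be what rel denotes). The library's own implementation is not shown in this notebook and may differ in details; the variable names below are illustrative.

```python
import math

# Statistics printed by r.stats() for the first level1 text.
total_words = 230
total_sentences = 21
total_syllables = 384
nb_polysyllables = 63

words_per_sentence = total_words / total_sentences
syllables_per_word = total_syllables / total_words

# Textbook formulas (assumed, not taken from the library's source).
fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
rel = 207 - 1.015 * words_per_sentence - 73.6 * syllables_per_word
smog = 1.043 * math.sqrt(nb_polysyllables * 30 / total_sentences) + 3.1291

print({"fkgl": fkgl, "fre": fre, "rel": rel, "smog": smog})
```

With these inputs the four values agree with the fkgl, fre, rel and smog entries returned by r.scores() above. The gfi and ari entries are left out because the usual textbook formulas do not reproduce the printed values.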

## Example two : Using the library for a corpus

Currently, a corpus will be recognized by the library only if provided with the following structure :  
type(corpus) = dict[class][text][sentence][token]  
For instance, corpA['class1'][0][0][0] should return the first token of the first sentence of the first text of class 'class1', for the corpus 'corpA'.
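
The nesting can be illustrated with a toy corpus. This is a minimal sketch: the contents are made up, and is_corpus is a hypothetical helper that only checks the shape described above; it is not the detection logic used by the library.

```python
# A toy corpus following dict[class][text][sentence][token]; contents are made up.
toy_corpus = {
    "class1": [
        [["Le", "chat", "dort", "."], ["Il", "rêve", "."]],  # text 0, two sentences
        [["Bonjour", "!"]],                                   # text 1, one sentence
    ],
    "class2": [
        [["La", "photosynthèse", "convertit", "la", "lumière", "."]],
    ],
}

def is_corpus(obj):
    """Return True if obj looks like dict[class][text][sentence][token]."""
    return isinstance(obj, dict) and all(
        isinstance(texts, list)
        and all(
            isinstance(text, list)
            and all(
                isinstance(sent, list) and all(isinstance(tok, str) for tok in sent)
                for sent in text
            )
            for text in texts
        )
        for texts in obj.values()
    )

print(toy_corpus["class1"][0][0][0])  # first token, first sentence, first text
print(is_corpus(toy_corpus))
```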

In [15]:
r = readability.Readability(corpus)

Acquiring Natural Language Processor...
DEBUG: Spacy model location (already installed) :  /usr/local/lib/python3.7/dist-packages/fr_core_news_sm/fr_core_news_sm-2.2.5


A useful function summarizing the contents of the corpus is available, called .corpus_info()

In [16]:
r.corpus_info()

Unnamed: 0,level1,level2,level3,level4,total
Nombre de fichiers,240.0,628.0,670.0,522.0,2060.0
Nombre de phrases total,4880.0,13049.0,10354.0,7743.0,36026.0
Nombre de phrases moyen,20.0,21.0,15.0,15.0,17.0
Longueur moyenne de phrase,8.0,10.0,12.0,13.0,11.0
Nombre de tokens,38976.0,128019.0,124901.0,101165.0,393061.0
Nombre de token moyen,162.0,204.0,186.0,194.0,191.0
Taille du vocabulaire,4836.0,10903.0,11953.0,11410.0,23100.0
Taille moyenne du vocabulaire,99.0,130.0,127.0,149.0,2257.0


When using a corpus, the Readability object's methods can return different types of results, but the behavior is similar:  
Instead of returning a value, or a list, the methods may return them in a dict[class][text_index] format.  
Additionally, .compile() will create the .corpus_statistics attribute instead of .statistics.  
.stats() will print the statistics of the first text in each class, in addition to showing the mean values.

In [17]:
r.compile()
r.stats()

Class level1
totalWords = 230
totalLongWords = 30
totalSentences = 21
totalCharacters = 837
totalSyllables = 384
nbPolysyllables = 63
Class level2
totalWords = 138
totalLongWords = 26
totalSentences = 8
totalCharacters = 555
totalSyllables = 240
nbPolysyllables = 43
Class level3
totalWords = 104
totalLongWords = 16
totalSentences = 11
totalCharacters = 405
totalSyllables = 184
nbPolysyllables = 21
Class level4
totalWords = 567
totalLongWords = 112
totalSentences = 35
totalCharacters = 2307
totalSyllables = 972
nbPolysyllables = 151


In [18]:
gfi_corp = r.gfi()
gfi_corp['level1'][0] #Is also 61.52380952380953

class level1 text 0 score 61.52380952380953
class level2 text 0 score 136.9
class level3 text 0 score 61.963636363636375
class level4 text 0 score 134.48


61.52380952380953
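
Results in the dict[class][text_index] layout can also be aggregated by hand. The sketch below computes a mean and a sample standard deviation per class with the standard library. The scores are made up, summarize is a hypothetical helper, and accepting either a list or a dict per class is an assumption, since the exact container type is not shown here.

```python
from statistics import mean, stdev

# Hypothetical scores in the dict[class][text_index] layout; values are made up.
scores = {
    "level1": [40.0, 50.0, 60.0],
    "level2": [60.0, 70.0, 80.0],
}

def summarize(scores_by_class):
    """Return {class: {"mean": ..., "std": ...}} for per-text scores."""
    summary = {}
    for label, per_text in scores_by_class.items():
        # Accept both a list indexed by position and a dict keyed by text index.
        values = list(per_text.values()) if isinstance(per_text, dict) else list(per_text)
        summary[label] = {"mean": mean(values), "std": stdev(values)}
    return summary

summary = summarize(scores)
print(summary)
```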

r.scores() behaves differently: instead of giving the scores for each text, it returns a dataframe showing the mean values for each class (and prints the standard deviations).

In [19]:
r.scores()

Standard Deviation values                   level1     level2     level3  \
The Gunning fog index GFI                22.638448  28.598931  38.724814   
The Automated readability index ARI       6.265479   6.977522   8.614690   
The Flesch reading ease FRE              26.013539  24.790444  29.308209   
The Flesch-Kincaid grade level FKGL       3.386159   2.447647   2.501191   
The Simple Measure of Gobbledygook SMOG   1.728092   1.647957   1.839104   
Reading Ease Level                       19.108738  12.993784  12.457079   

Standard Deviation values                   level4  
The Gunning fog index GFI                45.192761  
The Automated readability index ARI       9.156945  
The Flesch reading ease FRE              32.117630  
The Flesch-Kincaid grade level FKGL       2.833996  
The Simple Measure of Gobbledygook SMOG   1.978900  
Reading Ease Level                       14.318104  
DEBUG: time elapsed perf counter: 0.026413504999936777


Mean values,level1,level2,level3,level4,Pearson Score
The Gunning fog index GFI,45.132518,67.697721,91.866336,105.669951,0.475915
The Automated readability index ARI,14.238996,19.932585,25.719148,27.7577,0.472037
The Flesch reading ease FRE,90.625507,84.87584,82.799719,81.03216,-0.402143
The Flesch-Kincaid grade level FKGL,4.5768,6.681592,8.323094,9.019,0.451786
The Simple Measure of Gobbledygook SMOG,10.110286,11.521299,12.760749,13.210278,0.471106
Reading Ease Level,92.465376,82.239781,75.110005,71.711107,-0.408414
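
The notebook does not say how the Pearson Score column is obtained. One plausible reading, stated here purely as an assumption, is a Pearson correlation between each text's score and the ordinal rank of its class. The sketch below implements the standard Pearson coefficient on made-up data; the pearson function and the sample values are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Made-up example: one score per text, paired with the rank of its class (1 to 4).
level_ranks = [1, 1, 2, 2, 3, 3, 4, 4]
text_scores = [4.1, 5.0, 6.2, 7.1, 8.0, 8.9, 8.7, 9.4]
print(pearson(level_ranks, text_scores))
```

Under this reading, a positive coefficient means the score tends to increase with the level, and a negative one (as for the two reading ease scores) means it tends to decrease.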


In addition, machine learning and deep learning applications can be used with the corpus data to help develop NLP solutions.

In [None]:
#r.importmodel(camembert)
#r.train()