[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nicolashernandez/READI-LREC22/blob/main/readi_reproduction.ipynb)

# Introduction

This notebook comes from the [READI-LREC22 git repository](https://github.com/nicolashernandez/READI-LREC22/).  
The following examples show how the library can be used and configured.  
It will later be ported to Colab; this version runs on your computer, provided the library has been installed.

# Setup (skip this if the library is already installed on your computer)

In [None]:
# Once this notebook is ported to Colab (or another hosted environment), something like the following should work:
#!git clone https://github.com/nicolashernandez/READI-LREC22/
#sys.path.insert(0,'/content/READI-LREC22')
#from ..... import readability

In [11]:
import os
import sys
sys.path.append(os.path.join(os.getcwd(),"readability"))
#sys.path.pop()
#sys.path

In [15]:
import readability

# Examples

In [16]:
import pickle
#FIXME : replace this with a wget to the git directory
with open(os.path.join(os.getcwd(),"data","tokens_split.pkl"), "rb") as file:
    corpus = pickle.load(file)

## Example one : Using the library for a text

Texts can be given as strings, but it is preferable to prepare them beforehand as tokenized sentences, i.e. a list of lists of tokens ( list(list()) ).  
With spaCy, assuming nlp is a loaded pipeline (for example the result of spacy.load), something like this can be used:  
new_text = [[token.text for token in sent] for sent in nlp(text).sents]  
To remove punctuation marks, this can be done instead:  
new_text = [[token.text for token in sent if not token.is_punct] for sent in nlp(text).sents]
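To make the expected list(list()) shape concrete, here is a minimal sketch that builds it without spaCy. The helper name `to_tokenized_sentences` and its regular expressions are illustrative assumptions, not part of the library, and they are much cruder than a real tokenizer (French elisions such as "m'" are not split the way spaCy splits them). The sketch only shows the target structure.

```python
import re

def to_tokenized_sentences(text, keep_punct=True):
    """Hypothetical helper: split raw text into a list of token lists."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        # A token is a run of word characters/apostrophes, or one punctuation mark.
        tokens = re.findall(r"[\w']+|[^\w\s]", sentence)
        if not keep_punct:
            # Keep only tokens that contain at least one word character.
            tokens = [t for t in tokens if re.search(r"\w", t)]
        if tokens:
            result.append(tokens)
    return result

text = "Papa a acheté des lunettes. Maman a acheté une casquette bleue."
with_punct = to_tokenized_sentences(text)
without_punct = to_tokenized_sentences(text, keep_punct=False)
```

Either result can then be passed wherever a tokenized text is expected; for real data, a proper tokenizer such as spaCy remains the better choice.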

In [98]:
import importlib
importlib.reload(readability.readability)
importlib.reload(readability)

<module 'readability' from '/home/stagiaire-taln/Stage_repo_git/READI-LREC22/library/readability/__init__.py'>

A readability instance is created by calling readability.Readability(text).  
The following arguments are optional: lang, nlp_name, perplexity_processor.  
By default, the instance assumes French (fr), uses the spacy_sm NLP processor, and uses gpt2 to compute perplexity.

In [91]:
#Types of available formats for a text:
r = readability.Readability(corpus['level1'][0]) # A text in the list(list()) format used internally
#r = readability.Readability(' '.join(corpus['level1'][0][0])) # A string, it will be converted into a list(list()), of size 1, with 12 tokens, including punctuation

DEBUG : Current parameters of Readability class : lang= fr nlp= spacy_sm ppl_processor= gpt2
Acquiring Natural Language Processor...
DEBUG: Spacy model location (already installed) :  /home/stagiaire-taln/anaconda3/lib/python3.9/site-packages/fr_core_news_sm/fr_core_news_sm-3.3.0
DEBUG : Text recognized as list of sentences, not converting
DEBUG: recognized content is : [["Aujourd'hui", ',', 'toute', 'la', 'famille', 'est', 'allée', 'à', 'la', 'fête', 'foraine', '.'], ['Papa', 'a', 'acheté', 'à', 'mon', 'frère', 'des', 'lunettes', 'amusantes', '.'], ['Maman', "m'", 'a', 'acheté', 'une', 'très', 'belle', 'casquette', 'bleue', '.'], ['Ma', 'petite', 'sœur', ',', 'elle', ',', 'a', 'eu', 'un', 'suçon', '.'], ['En', 'revenant', 'à', 'la', 'maison', ',', 'un', 'très', 'gros', 'coup', 'de', 'vent', 'a', 'emporté', 'ma', 'casquette', '.'], ['Ma', 'casquette', 'est', 'allée', 'se', 'poser', 'sur', 'la', 'branche', "d'", 'un', 'vieil', 'arbre', 'énorme', '.'], ["J'", 'ai', 'beaucoup', 'pleuré', 

Each common score can be computed by calling the corresponding method, such as .gfi().

In [92]:
gfi = r.gfi()
gfi #is 61.52380952380953

DEBUG : pre-existing information was not found, so the .gfi_score() function should determine it by itself


61.52380952380953

More conveniently, a dictionary containing all of these scores can be obtained by calling .scores().

In [93]:
r.scores()

DEBUG : pre-existing information was not found, so the .gfi_score() function should determine it by itself
DEBUG : pre-existing information was not found, so the .ari_score() function should determine it by itself
DEBUG : pre-existing information was not found, so the .fre_score() function should determine it by itself
DEBUG : pre-existing information was not found, so the .fkgl_score() function should determine it by itself
DEBUG : pre-existing information was not found, so the .smog_score() function should determine it by itself
DEBUG : pre-existing information was not found, so the .rel_score() function should determine it by itself


{'gfi': 61.52380952380953,
 'ari': 21.503161490683233,
 'fre': 54.47311594202901,
 'fkgl': 8.382298136645964,
 'smog': 13.023866798666859,
 'rel': 73.00333333333334}

To speed up the calculations these methods need, the .compile() method can be called first.  
It computes most of the statistics needed for a text and stores them in the .statistics attribute of the Readability object.  
They can be displayed with .stats(), or read directly from the .statistics attribute.  
For example: .statistics.totalWords

In [94]:
r.compile()
r.stats()
r.statistics.totalWords

totalWords  =  230
totalLongWords  =  30
totalSentences  =  21
totalCharacters  =  837
totalSyllables  =  384
nbPolysyllables  =  63


230
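As a worked check, the statistics printed above can be related to the Gunning fog value reported earlier. The first expression below reproduces 61.5238 from these numbers. This is an inference from the printed output, not a statement about the library's source code. It differs from the textbook Gunning fog index, which divides the number of complex words by the number of words rather than by the number of sentences. The comparison assumes totalLongWords is the count of complex words, and the variable names are illustrative.

```python
# Statistics printed by r.stats() for the first level1 text.
total_words = 230
total_long_words = 30
total_sentences = 21

# Expression that reproduces the reported .gfi() value (inferred from the
# numbers above, not taken from the library's source code).
gfi_observed_form = 0.4 * (total_words / total_sentences
                           + 100 * total_long_words / total_sentences)

# Textbook Gunning fog index, for comparison: the share of long words is
# taken over the number of words, not over the number of sentences.
gfi_textbook = 0.4 * (total_words / total_sentences
                      + 100 * total_long_words / total_words)

print(round(gfi_observed_form, 4))  # 61.5238
print(round(gfi_textbook, 4))       # 9.5983
```

This may help when interpreting the magnitude of the GFI values shown in this notebook, which are well above the usual range of the index.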

## Example two : Using the library for a corpus

Currently, the library recognizes a corpus only if it has the following structure:  
type(corpus) = dict[class][text][sentence][token]  
For instance, corpA['class1'][0][0][0] should return the first token of the first sentence of the first text of class 'class1' in the corpus 'corpA'.
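The nesting can be illustrated with a toy corpus. The data below is made up, and `looks_like_corpus` is a hypothetical helper written for this sketch; it is not the check the library performs internally.

```python
# Toy corpus: dict[class] -> list of texts -> list of sentences -> list of tokens.
corpA = {
    "class1": [
        [["Le", "chat", "dort", "."], ["Il", "rêve", "."]],  # text 0
        [["Bonjour", "!"]],                                  # text 1
    ],
    "class2": [
        [["La", "pluie", "tombe", "sur", "la", "ville", "."]],
    ],
}

# First token of the first sentence of the first text of class 'class1'.
first_token = corpA["class1"][0][0][0]

def looks_like_corpus(obj):
    """Hypothetical check for the dict[class][text][sentence][token] shape."""
    return isinstance(obj, dict) and all(
        isinstance(texts, list)
        and all(
            isinstance(text, list)
            and all(
                isinstance(sentence, list)
                and all(isinstance(token, str) for token in sentence)
                for sentence in text
            )
            for text in texts
        )
        for texts in obj.values()
    )

# Number of texts per class, a quick sanity check before building the object.
texts_per_class = {label: len(texts) for label, texts in corpA.items()}
```

A check of this kind is useful before constructing the Readability object, since a dictionary with a different nesting depth may not be recognized as a corpus.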

In [95]:
r = readability.Readability(corpus)

DEBUG : Current parameters of Readability class : lang= fr nlp= spacy_sm ppl_processor= gpt2
Acquiring Natural Language Processor...
DEBUG: Spacy model location (already installed) :  /home/stagiaire-taln/anaconda3/lib/python3.9/site-packages/fr_core_news_sm/fr_core_news_sm-3.3.0
DEBUG : recognized as corpus


A useful method that summarizes the contents of the corpus is available: .corpus_info()

In [96]:
r.corpus_info()

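| | level1 | level2 | level3 | level4 | total |
|---|---|---|---|---|---|
| Nombre de fichiers (number of files) | 240.0 | 628.0 | 670.0 | 522.0 | 2060.0 |
| Nombre de phrases total (total number of sentences) | 4880.0 | 13049.0 | 10354.0 | 7743.0 | 36026.0 |
| Nombre de phrases moyen (mean number of sentences) | 20.0 | 21.0 | 15.0 | 15.0 | 17.0 |
| Longueur moyenne de phrase (mean sentence length) | 8.0 | 10.0 | 12.0 | 13.0 | 11.0 |
| Nombre de tokens (number of tokens) | 38976.0 | 128019.0 | 124901.0 | 101165.0 | 393061.0 |
| Nombre de token moyen (mean number of tokens) | 162.0 | 204.0 | 186.0 | 194.0 | 191.0 |
| Taille du vocabulaire (vocabulary size) | 4836.0 | 10903.0 | 11953.0 | 11410.0 | 23100.0 |
| Taille moyenne du vocabulaire (mean vocabulary size) | 99.0 | 130.0 | 127.0 | 149.0 | 2257.0 |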


When given a corpus, the Readability object's methods behave similarly but return results in a different shape:  
Instead of returning a single value or a list, they may return a dict[class][text_index] structure.  
Additionally, .compile() creates the .corpus_statistics attribute instead of .statistics.  
.stats() prints the statistics of the first text in each class, in addition to showing the mean values.
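A result in the dict[class][text_index] layout can be summarized per class with a few lines of standard Python. The scores below are made up, and `summarize` is a hypothetical helper, not a library method. Since the text does not say whether the inner container is a list or a dict keyed by index, the sketch accepts both.

```python
from statistics import mean, pstdev

# Made-up scores in the dict[class][text_index] layout (not real library output).
scores = {
    "level1": [10.0, 12.0, 14.0],
    "level2": {0: 20.0, 1: 24.0},
}

def summarize(per_class):
    """Hypothetical helper: mean and population standard deviation per class."""
    summary = {}
    for label, per_text in per_class.items():
        # Accept either a list or a dict indexed by text position.
        values = list(per_text.values()) if isinstance(per_text, dict) else list(per_text)
        summary[label] = {"mean": mean(values), "std": pstdev(values)}
    return summary

summary = summarize(scores)

# Individual texts are still reachable by class and index.
first_level1_score = scores["level1"][0]
```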

In [99]:
r.compile()
r.stats()

finish r.compile() first
Class level1
totalWords  =  230
Class level1
totalLongWords  =  30
Class level1
totalSentences  =  21
Class level1
totalCharacters  =  837
Class level1
totalSyllables  =  384
Class level1
nbPolysyllables  =  63
Class level2
totalWords  =  138
Class level2
totalLongWords  =  26
Class level2
totalSentences  =  8
Class level2
totalCharacters  =  555
Class level2
totalSyllables  =  240
Class level2
nbPolysyllables  =  43
Class level3
totalWords  =  104
Class level3
totalLongWords  =  16
Class level3
totalSentences  =  11
Class level3
totalCharacters  =  405
Class level3
totalSyllables  =  184
Class level3
nbPolysyllables  =  21
Class level4
totalWords  =  567
Class level4
totalLongWords  =  112
Class level4
totalSentences  =  35
Class level4
totalCharacters  =  2307
Class level4
totalSyllables  =  972
Class level4
nbPolysyllables  =  151


In [100]:
gfi_corp = r.gfi()
gfi_corp['level1'][0] #Is also 61.52380952380953

DEBUG : pre-existing information was found for a corpus
DEBUG : pre-existing information was found for a corpus
DEBUG : pre-existing information was found for a corpus
DEBUG : pre-existing information was found for a corpus
class level1 text 0 score 61.52380952380953
class level2 text 0 score 136.9
class level3 text 0 score 61.963636363636375
class level4 text 0 score 134.48


61.52380952380953

.scores() behaves differently: instead of returning the scores for each text, it returns a dataframe of the mean values per class (and prints the standard deviation).

In [101]:
r.scores()

DEBUG : pre-existing information was found for a corpus
| Mean values | level1 | level2 | level3 | level4 | Pearson Score |
|---|---|---|---|---|---|
| The Gunning fog index GFI | 45.132518 | 67.697721 | 91.866336 | 105.669951 | 0.475915 |
| The Automated readability index ARI | 14.238996 | 19.932585 | 25.719148 | 27.7577 | 0.472037 |
| The Flesch reading ease FRE | 90.625507 | 84.87584 | 82.799719 | 81.03216 | -0.402143 |
| The Flesch-Kincaid grade level FKGL | 4.5768 | 6.681592 | 8.323094 | 9.019 | 0.451786 |
| The Simple Measure of Gobbledygook SMOG | 10.110286 | 11.521299 | 12.760749 | 13.210278 | 0.471106 |
| Reading Ease Level | 92.465376 | 82.239781 | 75.110005 | 71.711107 | -0.408414 |
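The output does not explain the Pearson Score column. One plausible reading, stated here as an assumption, is that it is the Pearson correlation between each text's score and the numeric rank of its class (level1 = 1, level2 = 2, and so on). The sketch below computes such a coefficient on made-up scores; `pearson` and `class_ranks` are hypothetical names and the library may compute the column differently.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up per-text scores in the dict[class][text_index] layout.
scores = {"level1": [40.0, 50.0], "level2": [60.0, 75.0], "level3": [90.0, 85.0]}
# Assumed numeric rank of each class.
class_ranks = {"level1": 1, "level2": 2, "level3": 3}

# Flatten to two parallel lists: one (rank, score) pair per text.
ranks, values = [], []
for label, per_text in scores.items():
    for value in per_text:
        ranks.append(class_ranks[label])
        values.append(value)

r_value = pearson(ranks, values)
```

Under this reading, a negative coefficient such as the one shown for the Flesch reading ease is expected, because higher values of that score indicate easier texts.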


In addition, machine learning and deep learning methods can be applied to the corpus data to help develop NLP solutions.

In [None]:
#r.importmodel(camembert)
#r.train()